Linear Models

Your boss hands you a data set, and you determine that the task is to predict column B from column A.


In [1]:
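import os
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
font = {'size': 20}
matplotlib.rc('font', **font)

# Directory the notebook runs in; data.csv is expected to sit alongside it
rootdir = os.path.abspath(os.getcwd())

def get_columns(columns):
    """Yield the requested 1-based columns of data.csv, one row at a time."""
    with open(os.path.join(rootdir, 'data.csv')) as f:
        for line in f:
            fields = line.strip().split(',')
            yield [fields[i - 1] for i in columns]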

Visually inspect the first few rows of data


In [2]:
data = np.array(list(get_columns([1, 2])), dtype=float)
print("Top 10 rows\n")
print(data[:10,:])


Top 10 rows

[[  25.74    29.916]
 [   3.992   51.298]
 [  37.85    21.533]
 [  20.63    32.997]
 [ 102.529  -10.174]
 [ 114.806  -17.414]
 [  80.43    -0.501]
 [  63.249    8.557]
 [  96.956   -8.662]
 [   0.131   61.098]]

Inspect the distribution of values for each column


In [3]:
def histogram(col):
    data = np.array(list(get_columns([col])), dtype=float)
    fig = plt.figure(figsize=(15,8))
    ax = plt.subplot(111)
    plt.grid(lw=2)
    plt.hist(data, bins=20)
    plt.xlabel("Column %d" % col)
    plt.show()

In [4]:
histogram(1)



In [5]:
histogram(2)


How do the data points vary through the data set?

Is there a time/sequence dependency?


In [6]:
def timeseries(col):
    data = np.array(list(get_columns([col])), dtype=float)
    fig = plt.figure(figsize=(15,8))
    ax = plt.subplot(111)
    plt.grid(lw=2)
    plt.plot(data, 'bo')
    plt.xlabel("Column %d" % col)
    plt.show()

In [7]:
timeseries(1)



In [8]:
timeseries(2)
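
The plots above answer the sequence question by eye. As a rough numerical complement, one could compute the lag-1 autocorrelation of a column: values near 0 suggest that row order carries little information, while values near 1 suggest that each row depends on the one before it. This is only a sketch; the helper name `lag1_autocorrelation` and the synthetic inputs are assumptions made for illustration and do not come from data.csv.

```python
import numpy as np

def lag1_autocorrelation(values):
    """Pearson correlation between each value and its successor."""
    values = np.asarray(values, dtype=float)
    return np.corrcoef(values[:-1], values[1:])[0, 1]

# Illustrative inputs only (not taken from data.csv)
rng = np.random.default_rng(0)
independent = rng.normal(size=500)     # no sequence dependency
random_walk = np.cumsum(independent)   # each value builds on the previous one

print("independent values:", lag1_autocorrelation(independent))
print("random walk:", lag1_autocorrelation(random_walk))
```

To apply it to the notebook's data, the same function could be called on a single column loaded with `get_columns`, flattened to one dimension.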


Inspect the relationship between the two columns


In [9]:
def scatter(x, y):
    data = np.array(list(get_columns([x, y])), dtype=float)
    fig = plt.figure(figsize=(15,8))
    ax = plt.subplot(111)
    plt.grid(lw=2)
    plt.plot(data[:,0], data[:,1], 'bo')
    plt.xlabel("Column %d" % x)
    plt.ylabel("Column %d" % y)
    plt.show()

In [10]:
scatter(1, 2)


  • Is there a simple model that can reasonably predict column B given column A?
  • For a random point A, how close does your model get to the point B?
  • Is there a limitation to the input or your model?
  • Could the model be improved?
  • What could be added to the model to improve its accuracy?
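
One possible starting point for the questions above is an ordinary least-squares straight line. The sketch below fits one with `np.polyfit`, using only the ten rows printed earlier in this notebook, and reports how far the predictions fall from the observed values. The variable names and the choice of root-mean-square error as the closeness measure are assumptions made for illustration. Ten rows are not the full data set, so the numbers are indicative at best.

```python
import numpy as np

# First ten rows as printed earlier in this notebook (column A, column B)
sample = np.array([
    [25.74, 29.916],
    [3.992, 51.298],
    [37.85, 21.533],
    [20.63, 32.997],
    [102.529, -10.174],
    [114.806, -17.414],
    [80.43, -0.501],
    [63.249, 8.557],
    [96.956, -8.662],
    [0.131, 61.098],
])
a, b = sample[:, 0], sample[:, 1]

# Least-squares straight line: b is approximated by slope * a + intercept
slope, intercept = np.polyfit(a, b, deg=1)

predicted = slope * a + intercept
residuals = b - predicted                  # observed minus predicted
rmse = np.sqrt(np.mean(residuals ** 2))    # typical prediction error

print("slope = %.3f, intercept = %.3f" % (slope, intercept))
print("RMSE on these ten rows = %.3f" % rmse)
```

Plotting `residuals` against `a` is one way to check whether a straight line misses systematic curvature. Raising `deg` in `np.polyfit` is one possible way to explore the last two questions.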
