Linear Models

Your boss hands you a data set. After a first look, you determine that the task is to predict column B from column A.



In [1]:

    
import os
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
font = {'size': 20}
matplotlib.rc('font', **font)

# Directory the notebook runs in; data.csv is expected to live here.
rootdir = os.getcwd()

def get_columns(columns):
    """Yield the requested 1-indexed columns of data.csv, one row at a time."""
    with open(os.path.join(rootdir, 'data.csv')) as f:
        for line in f:
            fields = line.strip().split(',')
            yield [fields[i - 1] for i in columns]

Visually inspect the first few rows of data



In [2]:

    
data = np.array(list(get_columns([1, 2])), dtype=float)
print("Top 10 rows\n")
print(data[:10,:])

Top 10 rows

[[  25.74    29.916]
 [   3.992   51.298]
 [  37.85    21.533]
 [  20.63    32.997]
 [ 102.529  -10.174]
 [ 114.806  -17.414]
 [  80.43    -0.501]
 [  63.249    8.557]
 [  96.956   -8.662]
 [   0.131   61.098]]

Inspect the distribution of values for each column



In [3]:

    
def histogram(col):
    data = np.array(list(get_columns([col])), dtype=float)
    fig = plt.figure(figsize=(15,8))
    ax = plt.subplot(111)
    plt.grid(lw=2)
    plt.hist(data, bins=20)
    plt.xlabel("Column %d" % col)
    plt.show()



In [4]:

    
histogram(1)



In [5]:

    
histogram(2)

How do the data points vary through the data set?

Is there a time/sequence dependency?



In [6]:

    
def timeseries(col):
    data = np.array(list(get_columns([col])), dtype=float)
    fig = plt.figure(figsize=(15,8))
    ax = plt.subplot(111)
    plt.grid(lw=2)
    plt.plot(data, 'bo')
    plt.xlabel("Column %d" % col)
    plt.show()



In [7]:

    
timeseries(1)



In [8]:

    
timeseries(2)
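
The plots above answer the sequence question by eye. As a rough numeric cross-check, one option is the lag-1 autocorrelation: the correlation between each value and the value in the next row. The sketch below is an illustration only. The helper name `lag1_autocorrelation` is hypothetical, and the input is just the ten values of column 1 printed earlier, so the result says little about the full file.

```python
import numpy as np

# First ten rows of column 1, copied from the "Top 10 rows" output.
# Assumption: in the notebook you would pass the full column instead.
col_a = np.array([25.74, 3.992, 37.85, 20.63, 102.529,
                  114.806, 80.43, 63.249, 96.956, 0.131])

def lag1_autocorrelation(values):
    """Pearson correlation between the series and itself shifted by one row."""
    values = np.asarray(values, dtype=float)
    return np.corrcoef(values[:-1], values[1:])[0, 1]

lag1 = lag1_autocorrelation(col_a)
print("lag-1 autocorrelation: %.3f" % lag1)
```

A value near zero suggests that row order carries little information; a value near +1 or -1 suggests the rows are ordered or trending and should not be treated as independent samples.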

Inspect the relation between the two columns



In [9]:

    
def scatter(x, y):
    data = np.array(list(get_columns([x, y])), dtype=float)
    fig = plt.figure(figsize=(15,8))
    ax = plt.subplot(111)
    plt.grid(lw=2)
    plt.plot(data[:,0], data[:,1], 'bo')
    plt.xlabel("Column %d" % x)
    plt.ylabel("Column %d" % y)
    plt.show()



In [10]:

    
scatter(1, 2)
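
To put a number on what the scatter plot shows, one possible summary is the Pearson correlation coefficient. This is a minimal sketch that uses only the ten rows printed earlier (an assumption made to keep it self-contained); with the notebook's `data` array you would pass `data[:,0]` and `data[:,1]` instead.

```python
import numpy as np

# The ten (A, B) rows copied from the "Top 10 rows" output.
sample = np.array([
    [25.74, 29.916], [3.992, 51.298], [37.85, 21.533], [20.63, 32.997],
    [102.529, -10.174], [114.806, -17.414], [80.43, -0.501],
    [63.249, 8.557], [96.956, -8.662], [0.131, 61.098],
])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the correlation between column A and column B.
r = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
print("Pearson r on the ten sample rows: %.3f" % r)
```

Pearson's r only measures linear association, so a strong value does not rule out curvature in the relation.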

Is there a simple model that can reasonably predict column B given column A?
For a random point A, how close does your model get to the point B?
Is there a limitation to the input or your model?
Could the model be improved?
What could be added to the model to improve its accuracy?
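
One way to start on these questions is an ordinary least-squares straight line, B ≈ slope · A + intercept. The sketch below is a starting point, not the intended answer: it fits only the ten rows printed earlier (an assumption, to stay self-contained), and the names `predict`, `residuals` and `rmse` are illustrative.

```python
import numpy as np

# The ten (A, B) rows copied from the "Top 10 rows" output.
# Assumption: in the notebook you would fit the full `data` array instead.
sample = np.array([
    [25.74, 29.916], [3.992, 51.298], [37.85, 21.533], [20.63, 32.997],
    [102.529, -10.174], [114.806, -17.414], [80.43, -0.501],
    [63.249, 8.557], [96.956, -8.662], [0.131, 61.098],
])
a, b = sample[:, 0], sample[:, 1]

# Degree-1 polynomial fit: coefficients come back highest power first.
slope, intercept = np.polyfit(a, b, 1)

def predict(a_values):
    """Predict column B from column A with the fitted line."""
    return slope * np.asarray(a_values, dtype=float) + intercept

# Residuals show how far the line is from each observed B.
residuals = b - predict(a)
rmse = np.sqrt(np.mean(residuals ** 2))

print("B ~ %.3f * A + %.3f" % (slope, intercept))
print("RMSE on the sample rows: %.3f" % rmse)
```

Plotting `residuals` against `a` is a quick check on whether a straight line is enough: a visible pattern, rather than random scatter around zero, hints that an extra term would help. The fit also says nothing about values of A outside the observed range.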


