Always start by importing everything in a separate code block. That way, if you forget something, it's easy to add it and re-run the cell without triggering any other computation.


In [1]:
from __future__ import division

import numpy as np
import matplotlib.pyplot as plt
import mltools as ml

np.random.seed(0)
%matplotlib inline

Linear Regression

In class you've learned the concept of linear regression. In this discussion we will see how to use the mltools package to fit, evaluate, and visualize the model.

The code for the linear regression model sits in ml.linear.linearRegress.

Example 1: Simple Slope

We start with a simple example where the linear regression model has to fit a simple sloped line.


In [2]:
# First we create the "fake data" with xs from 0 to 10 and Y = X + 2
X = np.linspace(0, 10, 50)
Y = np.copy(X) + 2

In [3]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(X, Y, s=80, color='blue', alpha=0.75)

ax.set_xlim(-0.2, 10.2)
ax.set_ylim(1.8, 12.2)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

plt.show()


Now let's split it into training and test data.


In [4]:
X, Y = ml.shuffleData(X, Y)


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-2de4beac7679> in <module>()
----> 1 X, Y = ml.shuffleData(X, Y)

/home/mlichman/repositories/ml-notebooks/discussions/week2/mltools/utils.pyc in shuffleData(X, Y)
    161 
    162     pi = np.random.permutation(nx)
--> 163     X = X[pi,:]
    164 
    165     if ny > 0:

IndexError: too many indices for array

This error is a result of an assumption in the code: everything in the mltools package assumes that X is a 2d array, not a 1d one. This is a common convention, used in more popular packages as well.

There are many ways to convert from 1d to 2d. The most popular is np.atleast_2d.


In [5]:
_ = np.atleast_2d(X).T

Another option is to use reshape. Look at the documentation to see what the -1 is all about.


In [6]:
X = X.reshape(-1, 1)

Notice that I transposed the result of the atleast_2d call. That's because it is conventional to think of X as having the points in the rows and the dimensions (features) in the columns. Please play around with these methods to make sure you understand what they do.
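As a quick sanity check, here is what each conversion does to the shapes (a throwaway snippet, not part of the pipeline):


In [ ]:
x_demo = np.linspace(0, 10, 50)      # shape (50,)  -- 1d
print np.atleast_2d(x_demo).shape    # (1, 50) -- 2d, but the points sit along the columns
print np.atleast_2d(x_demo).T.shape  # (50, 1) -- rows are points, as mltools expects
print x_demo.reshape(-1, 1).shape    # (50, 1) -- same result; -1 means "infer this dimension"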

Now let's continue from where we stopped.


In [8]:
X, Y = ml.shuffleData(X, Y)
Xtr, Xte, Ytr, Yte = ml.splitData(X, Y, 0.75)

In [9]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')

ax.set_xlim(-0.2, 10.2)
ax.set_ylim(1.8, 12.2)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=4)

plt.show()


Now let's see how we can call the linear regression.


In [10]:
lr = ml.linear.linearRegress(Xtr, Ytr)

Boom, that's it. But you should go into the code and make sure you understand how it works. You may be asked on exams to derive linear regression.
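If you want to check your understanding against the code, the fit should match the ordinary least-squares solution on the data with a constant (all-ones) column prepended. Here is a minimal sketch of that computation; this is my assumption about what linearRegress does internally, so read the actual code to confirm:


In [ ]:
# Least-squares sketch (assumed equivalent to what linearRegress computes).
# Xb is Xtr with a constant column prepended; theta solves min ||Xb * theta - Ytr||^2.
Xb = np.hstack([np.ones((Xtr.shape[0], 1)), Xtr])
theta = np.linalg.lstsq(Xb, Ytr)[0]  # more stable than inv(Xb.T.dot(Xb)).dot(Xb.T).dot(Ytr)
print theta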

Plotting the regression line


In [11]:
# We start by creating a set of xs spanning the region we want predictions for.
xs = np.linspace(0, 10, 200)

# Converting to the right shape
xs = np.atleast_2d(xs).T

# And now the prediction
ys = lr.predict(xs)

In [12]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')

# Also plotting the regression line
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')

ax.set_xlim(-0.2, 10.2)
ax.set_ylim(1.8, 12.2)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=4)

plt.show()


We can also print the learned regression object. This shows the coefficient learned for each feature. Notice that the regression model added a constant for us.


In [13]:
print lr


linearRegress model, 1 features
[[ 2.  1.]]

The print above means that the linear regression learned the function Y = 2 + 1 * X (the constant comes first, then the slope).
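A quick sanity check: predicting at x = 3 should give 2 + 1 * 3 = 5 (remember to pass a 2d array):


In [ ]:
print lr.predict(np.array([[3.0]]))  # should be (close to) 5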

Example 2: Real Data

That was a toy example; let's look at how this is done on real data. This is what you'll have to do in the HW assignment using the 'curve80.txt' data. We're not going to spoil that here for you, so we're going to use a different data set.


In [14]:
path_to_file = 'poly_data.txt' 
data = np.genfromtxt(path_to_file, delimiter='\t') # Read data from file

In [15]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(data[:, 0], data[:, 1], s=80, color='blue', alpha=0.75)

ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)

ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

plt.show()


Now let's repeat everything on the real data.


In [16]:
X, Y = np.atleast_2d(data[:, 0]).T, data[:, 1]
X, Y = ml.shuffleData(X, Y)
Xtr, Xte, Ytr, Yte = ml.splitData(X, Y, 0.75)

lr = ml.linear.linearRegress(Xtr, Ytr)

# Make sure you use the correct space.
xs = np.linspace(0, 4.2, 200)
xs = np.atleast_2d(xs).T

ys = lr.predict(xs)

In [17]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')

# Also plotting the regression line
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')

ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)

plt.show()


Meh, the predictions don't look that great. Why is that?

(Because we're fitting a straight line Y = aX + c, while it's clear that this data comes from a more complex model.)

So let's fit a more complex model. For that we can use the ml.transforms.fpoly method, which will transform the features for us.
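Roughly, a degree-d polynomial expansion turns each scalar x into the features x, x^2, ..., x^d. (My reading of the False argument below is that it suppresses the constant column, since linearRegress adds its own constant; check the fpoly docstring to confirm.) A hand-rolled numpy sketch of the same idea:


In [ ]:
# Hand-rolled polynomial expansion, for illustration only -- use ml.transforms.fpoly in practice.
# Expands an (n, 1) array into powers 1..degree, giving an (n, degree) array.
def fpoly_sketch(X, degree):
    return np.hstack([X ** k for k in range(1, degree + 1)])

print fpoly_sketch(np.array([[2.0]]), 3)  # [[ 2.  4.  8.]]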


In [19]:
degree = 12
XtrP = ml.transforms.fpoly(Xtr, degree, False)

lr = ml.linear.linearRegress(XtrP, Ytr)

# Make sure you use the correct space.
xs = np.linspace(0, 4.2, 200)
xs = np.atleast_2d(xs).T

# Notice that we have to transform the predicting xs too.
xsP = ml.transforms.fpoly(xs, degree, False)
ys = lr.predict(xsP)

In [20]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')

# Also plotting the regression line. Note that we plot the xs, not the xsP.
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')

ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)

plt.show()


Feel free to play around with different degrees and see the differences. You should!

Measuring Prediction Accuracy

In the HW assignment you are required to measure the prediction error using MSE and plot it for different degrees.


In [21]:
def MSE(y_true, y_hat):
    """Mock MSE method.
    
    You'll have to fill this in yourself with the real MSE computation.
    """
    return np.random.rand() * 1000

In [22]:
# Predicting on the test data - DO NOT FORGET TO TRANSFORM Xte TOO!!!
XteP = ml.transforms.fpoly(Xte, degree, False)
YteHat = lr.predict(XteP)

Adding the predicted Yhat to the plot. Notice that it sits on the regression line (as expected).


In [23]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
ax.scatter(Xte, YteHat, s=80, marker='D', color='forestgreen', alpha=0.75, label='Yhat')

# Also plotting the regression line. Note that we plot the xs, not the xsP.
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')

ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

# Controlling the size of the legend and the location.
ax.legend(fontsize=20, loc=0)

plt.show()


Computing the MSE for the different degrees.


In [24]:
degrees = np.array([2, 4, 6, 8, 10, 20])
mse_error = np.zeros(degrees.shape[0])

for i, degree in enumerate(degrees):
    XtrP = ml.transforms.fpoly(Xtr, degree, False)

    lr = ml.linear.linearRegress(XtrP, Ytr)
    XteP = ml.transforms.fpoly(Xte, degree, False)
    YteHat = lr.predict(XteP)

    mse_error[i] = MSE(Yte, YteHat)

In [25]:
f, ax = plt.subplots(1, 1, figsize=(10, 8))

# Plotting a line with markers where there's an actual x value.
ax.semilogy(degrees, mse_error, lw=4, marker='d', markersize=20, alpha=0.75, label='MSE ERROR')

ax.set_xlim(1.2, 20.5)
ax.set_ylim(30, 1100)

# Setting the X-ticks manually.
ax.set_xticks(np.arange(2, 21, 2))

ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

ax.legend(fontsize=20, loc=0)

plt.show()


Cross Validation

Let’s now imagine that we do not have access to the target values of the test data we held out in the previous problem, and we wanted to decide on the best polynomial degree.

Cross-validation works by creating many training/validation splits, called folds, and using all of these splits to assess the “out-of-sample” (validation) performance by averaging them.


In [26]:
nFolds = 4

In [27]:
f, ax = plt.subplots(2, 2, figsize=(20, 20))
ax = ax.flatten()
for iFold in range(nFolds):
    Xti, Xvi, Yti, Yvi = ml.crossValidate(Xtr, Ytr, nFolds, iFold)
    ax[iFold].scatter(Xti, Yti, s=80, color='blue', alpha=0.75, label='Train')
    ax[iFold].scatter(Xvi, Yvi, s=240, marker='*', color='red', alpha=0.75, label='Test')

    ax[iFold].set_xlim(-0.2, 4.3)
    ax[iFold].set_ylim(-13, 18)
    ax[iFold].set_xticklabels(ax[iFold].get_xticks(), fontsize=25)
    ax[iFold].set_yticklabels(ax[iFold].get_yticks(), fontsize=25)   
    
plt.show()


This part of the HW is pretty straightforward: you basically just have to repeat what we did so far in a loop, as sketched below.
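For reference, here is a minimal sketch of that loop. It reuses the mock MSE from above, so the numbers will be meaningless until you implement the real thing:


In [ ]:
# Cross-validation estimate of the MSE for each degree.
cv_error = np.zeros(degrees.shape[0])

for i, degree in enumerate(degrees):
    fold_errors = np.zeros(nFolds)
    for iFold in range(nFolds):
        Xti, Xvi, Yti, Yvi = ml.crossValidate(Xtr, Ytr, nFolds, iFold)

        # Train on this fold's training part, in the transformed (polynomial) space.
        lr = ml.linear.linearRegress(ml.transforms.fpoly(Xti, degree, False), Yti)

        # Evaluate on the fold's validation part -- transformed the same way.
        YviHat = lr.predict(ml.transforms.fpoly(Xvi, degree, False))
        fold_errors[iFold] = MSE(Yvi, YviHat)

    # The cross-validation estimate is the average over the folds.
    cv_error[i] = np.mean(fold_errors)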