Lab 11: Regularization and Cross-Validation

In [ ]:
!pip install -U sklearn

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import sklearn.linear_model as lm
import as sio

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab11.ok')

Today's lab covers:

  • How to use regularization to avoid overfitting
  • How to use cross-validation to find the amount of regularization that produces a model with the least error for new data

Dammed Data

We've put our data into two files: train.csv and test.csv which contain the training and test data, respectively. You are not allowed to train on the test data.

The y values in the training data correspond to the amount of water that flows out of the dam on a particular day. There is only 1 feature: the increase in water level for the dam's reservoir on that day, which we'll call x.

In [ ]:
!head train.csv

Let's load in the data:

In [ ]:
data = pd.read_csv('train.csv')
X = data[['X']].as_matrix()
y = data['y'].as_matrix()
X.shape, y.shape

In [ ]:
_ = plt.plot(X[:, 0], y, '.')
plt.xlabel('Change in water level (X)')
plt.ylabel('Water flow out of dam (y)')

In [ ]:
def plot_data_and_curve(curve_x, curve_y):
    plt.plot(X[:, 0], y, '.')
    plt.plot(curve_x, curve_y, '-')
    plt.ylim(-20, 60)
    plt.xlabel('Change in water level (X)')
    plt.ylabel('Water flow out of dam (y)')

Question 1: As a warmup, let's fit a line to this data using sklearn. We've imported sklearn.linear_model as lm, so you can use that instead of typing out the whole module name. Running the cell should show the data with your best fit line.

In [ ]:
linear_clf = lm.LinearRegression() #SOLUTION

# Fit your classifier, y)

# Predict a bunch of points to draw best fit line
all_x = np.linspace(-55, 55, 200).reshape(-1, 1)
line = linear_clf.predict(all_x)

plot_data_and_curve(all_x, line)

Question 2: If you had to guess, which has a larger effect on error for this dataset: bias or variance? Explain briefly.

SOLUTION: Bias. Our data are curved up, but our line cannot model that curve, so bias is high. Variance is low because the complexity of our model is low relative to the size of the training set. That is, if we drew a new dataset from the population of days, we would probably get a similar line.

Question 3: Let's now add some complexity to our model by adding polynomial features.

Reference the sklearn docs on the PolynomialFeatures class. You should use this class to add polynomial features to X up to degree 8 using the fit_transform method.

The final X_poly data matrix should have shape (33, 9). Think about and discuss why.

In [ ]:
from sklearn.preprocessing import PolynomialFeatures

X_poly = PolynomialFeatures(degree=8).fit_transform(X) #SOLUTION

Question 4: Now, fit a linear regression to the data, using polynomial features.

Then, follow the code in Question 1 to make predictions for the values in all_x. You'll have to add polynomial features to all_x in order to make predictions.

Then, running this cell should plot the best fit curve using a degree 8 polynomial.

In [ ]:
poly_clf = lm.LinearRegression() #SOLUTION

# Fit your classifier, y) #SOLUTION

# Set curve to your model's predictions on all_x
curve = poly_clf.predict(PolynomialFeatures(degree=8).fit_transform(all_x)) #SOLUTION

plot_data_and_curve(all_x, curve)

Question 5: Think about and discuss what you notice in the model's predictions.

Now, compute the mean squared training error for both the best fit line and polynomial. Again, you'll have to transform the training data for the polynomial regression before you can make predictions.

You should get training errors of around 52.8 and 5.23 for line and polynomial models, respectively. Why does the polynomial model get a lower training error than the linear model?

In [ ]:
def mse(predicted_y, actual_y):
    return np.mean((predicted_y - actual_y) ** 2)

line_training_error = mse(linear_clf.predict(X), y) #SOLUTION
poly_training_error = mse(poly_clf.predict(PolynomialFeatures(degree=8).fit_transform(X)), y) #SOLUTION

line_training_error, poly_training_error

Question 6: It's annoying to have to transform the data every time we want to use polynomial features. We can use a Pipeline to let us do both transformation and regression in one step.

Read the docs for make_pipeline and create a pipeline for polynomial regression called poly_pipeline. Then, fit it on X and y and compute the training error as in Question 5. The training errors should match.

In [ ]:
from sklearn.pipeline import make_pipeline

poly_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.LinearRegression()) #SOLUTION

# Fit the pipeline on X and y, y) #SOLUTION

# Compute the training error
pipeline_training_error = mse(poly_pipeline.predict(X), y) #SOLUTION

Nice! With pipelines, we can combine any number of transformations and treat the whole thing as a single classifier.

Question 7: Now, we know that a low training error doesn't necessarily mean your model is good. So, we'll hold out some points from the training data for a validation set. We'll use these held-out points to choose the best model.

Use the train_test_split function to split out one third of the training data for validation. Call the resulting datasets X_train, X_valid, y_train, y_valid.

X_train should have shape (22, 1). X_valid should have shape (11, 1)

In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33) #SOLUTION
X_train.shape, X_valid.shape

Question 8: Now, set X_train_line_error, X_valid_line_error, X_train_poly_error, X_valid_poly_error to the training and validation errors for both linear and polynomial regression.

You'll have to call .fit on your classifiers/pipelines again since we're using X_train and y_train instead of X and y.

You should see that the validation error for the polynomial fit is significantly higher than the linear fit (152.6 vs 115.2).

In [ ]:
# Fit the linear classifier, y_train) #SOLUTION

# Fit the polynomial pipeline, y_train) #SOLUTION

X_train_line_error = mse(linear_clf.predict(X_train), y_train) #SOLUTION
X_valid_line_error = mse(linear_clf.predict(X_valid), y_valid) #SOLUTION

X_train_poly_error = mse(poly_pipeline.predict(X_train), y_train) #SOLUTION
X_valid_poly_error = mse(poly_pipeline.predict(X_valid), y_valid) #SOLUTION

X_train_line_error, X_valid_line_error, X_train_poly_error, X_valid_poly_error

Question 9: Our 8 degree polynomial is overfitting our data. To reduce overfitting, we can use regularization.

The usual cost function for linear regression is:

$$J(\theta) = (Y - X \theta)^T (Y - X \theta)$$

Edit the cell below to show the cost function of linear regressions with L2 regularization. Use $\lambda $ as your regularization parameter.

$$J(\theta) = (Y - X \theta)^T (Y - X \theta)$$

Now, explain why this cost function helps reduce overfitting.

SOLUTION: Adding regularization effectively restricts the set of possible polynomials we're allowed to use to fit the data. In particular, polynomials with large coefficients, which can be used to fit data very precisely, are discouraged.

Question 10: L2 regularization for linear regression is also known as Ridge regression. Create a pipeline called ridge_pipeline that again creates polynomial features with degree 8 and then uses the Ridge sklearn classifier.

The alpha argument is the same as our $\lambda$. Leave it as the default (1.0). You should set normalize=True to normalize your data before fitting. Why do we have to do this?

Then, fit your pipeline on the data. The cell will then plot the curve of your regularized classifier. You should notice the curve is significantly smoother.

Then, fiddle around with the alpha value. What do you notice as you increase alpha? Decrease alpha?

In [ ]:
ridge_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.Ridge(normalize=True, alpha=1.)) #SOLUTION

# Fit your classifier, y_train) #SOLUTION

# Set curve to your model's predictions on all_x
ridge_curve = ridge_pipeline.predict(all_x) #SOLUTION

plot_data_and_curve(all_x, ridge_curve)

Question 11: Compute the training and validation error for the ridge_pipeline.

How do the errors compare to the errors for the unregularized model? Why did each one go up/down?

In [ ]:
ridge_train_error = mse(ridge_pipeline.predict(X_train), y_train) #SOLUTION
ridge_valid_error = mse(ridge_pipeline.predict(X_valid), y_valid) #SOLUTION

ridge_train_error, ridge_valid_error

Question 12: Now we want to know: how do we choose the best alpha value?

This is where we use our validation set. We can try out a bunch of alphas and pick the one that gives us the least error on the validation set. Why can't we use the one that gives us the least error on the training set? The test set?

For each alpha in the given alphas list, fit a Ridge regression model to the training set and check its accuracy on the validation set.

Finally, set best_alpha to the best value. You should get a best alpha of 0.01 with a validation error of 15.7.

In [ ]:
alphas = [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 10.0]

# Your code to find the best alpha
def compute_error(alpha):
    pline = make_pipeline(PolynomialFeatures(degree=8),
                          lm.Ridge(normalize=True, alpha=alpha)), y_train)
    return mse(pline.predict(X_valid), y_valid)

errors = [compute_error(alpha) for alpha in alphas]
best_alpha_idx = np.argmin(errors)
best_alpha, best_error = alphas[best_alpha_idx], errors[best_alpha_idx]

best_alpha, best_error

Question 13: Now, set best_pipeline to the pipeline with the degree 8 polynomial transform and the ridge regression model with the best value of alpha.

In [ ]:
best_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.Ridge(normalize=True, alpha=best_alpha)) #SOLUTION, y_train)

best_curve = best_pipeline.predict(all_x)

plot_data_and_curve(all_x, best_curve)

Now, run the cell below to find the test error of your simple linear model, your polynomial model, and your regularized polynomial model.

In [ ]:
test_data = pd.read_csv('test.csv')
X_test = data[['X']].as_matrix()
y_test = data['y'].as_matrix()

line_test_error = mse(linear_clf.predict(X_test), y_test)
poly_test_error = mse(poly_pipeline.predict(X_test), y_test)
best_test_error = mse(best_pipeline.predict(X_test), y_test)

line_test_error, poly_test_error, best_test_error

Nice! You've use regularization and cross-validation to fit an accurate polynomial model to the dataset.

In the future, you'd probably want to use something like RidgeCV to automatically perform cross-validation, but it's instructive to do it yourself at least once.

Submitting your assignment

If you made a good-faith effort to complete the lab, change i_finished_the_lab to True in the cell below. In any case, run the cells below to submit the lab.

In [ ]:
i_finished_the_lab = False

In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()

In [ ]:
_ = ok.submit()

Now, run this code in your terminal to make a git commit that saves a snapshot of your changes in git. The last line of the cell runs git push, which will send your work to your personal Github repo.

# Tell git to commit your changes to this notebook
git add -A

# Tell git to make the commit
git commit -m "lab11 finished"

# Send your updates to your personal private repo
git push origin master