In [ ]:
!pip install -U scikit-learn
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import sklearn.linear_model as lm
import scipy.io as sio
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab11.ok')
Today's lab covers polynomial regression, regularization, and cross-validation.
We've put our data into two files: train.csv and test.csv, which contain the training and test data, respectively. You are not allowed to train on the test data.
The y values in the training data correspond to the amount of water that flows out of the dam on a particular day. There is only 1 feature: the increase in water level for the dam's reservoir on that day, which we'll call x.
In [ ]:
!head train.csv
Let's load in the data:
In [ ]:
data = pd.read_csv('train.csv')
X = data[['X']].to_numpy()
y = data['y'].to_numpy()
X.shape, y.shape
In [ ]:
_ = plt.plot(X[:, 0], y, '.')
plt.xlabel('Change in water level (X)')
plt.ylabel('Water flow out of dam (y)')
In [ ]:
def plot_data_and_curve(curve_x, curve_y):
    plt.plot(X[:, 0], y, '.')
    plt.plot(curve_x, curve_y, '-')
    plt.ylim(-20, 60)
    plt.xlabel('Change in water level (X)')
    plt.ylabel('Water flow out of dam (y)')
Question 1: As a warmup, let's fit a line to this data using sklearn. We've imported sklearn.linear_model as lm, so you can use that instead of typing out the whole module name. Running the cell should show the data with your best fit line.
In [ ]:
linear_clf = lm.LinearRegression() #SOLUTION
# Fit your classifier
linear_clf.fit(X, y)
# Predict a bunch of points to draw best fit line
all_x = np.linspace(-55, 55, 200).reshape(-1, 1)
line = linear_clf.predict(all_x)
plot_data_and_curve(all_x, line)
Question 2: If you had to guess, which has a larger effect on error for this dataset: bias or variance? Explain briefly.
SOLUTION: Bias. Our data are curved up, but our line cannot model that curve, so bias is high. Variance is low because the complexity of our model is low relative to the size of the training set. That is, if we drew a new dataset from the population of days, we would probably get a similar line.
Question 3: Let's now add some complexity to our model by adding polynomial features.
Reference the sklearn docs on the PolynomialFeatures class. You should use this class to add polynomial features to X up to degree 8 using the fit_transform method.
The final X_poly data matrix should have shape (33, 9). Think about and discuss why.
In [ ]:
from sklearn.preprocessing import PolynomialFeatures
X_poly = PolynomialFeatures(degree=8).fit_transform(X) #SOLUTION
X_poly.shape
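To see where the 9 columns come from, here's a small optional illustration (not part of the lab's required code): for a single feature x, a degree-8 transform produces the columns [1, x, x^2, ..., x^8].
In [ ]:
# Optional illustration: degree-8 polynomial features of a single value.
# The 9 columns are [1, x, x^2, ..., x^8], which is why X_poly has shape (33, 9).
example = np.array([[2.0]])
PolynomialFeatures(degree=8).fit_transform(example)
# expected: array([[  1.,   2.,   4.,   8.,  16.,  32.,  64., 128., 256.]])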
Question 4: Now, fit a linear regression to the data, using polynomial features. Then, follow the code in Question 1 to make predictions for the values in all_x. You'll have to add polynomial features to all_x in order to make predictions.
Then, running this cell should plot the best fit curve using a degree 8 polynomial.
In [ ]:
poly_clf = lm.LinearRegression() #SOLUTION
# Fit your classifier
poly_clf.fit(X_poly, y) #SOLUTION
# Set curve to your model's predictions on all_x
curve = poly_clf.predict(PolynomialFeatures(degree=8).fit_transform(all_x)) #SOLUTION
plot_data_and_curve(all_x, curve)
Question 5: Think about and discuss what you notice in the model's predictions.
Now, compute the mean squared training error for both the best fit line and polynomial. Again, you'll have to transform the training data for the polynomial regression before you can make predictions.
You should get training errors of around 52.8 and 5.23 for the linear and polynomial models, respectively. Why does the polynomial model get a lower training error than the linear model?
In [ ]:
def mse(predicted_y, actual_y):
    return np.mean((predicted_y - actual_y) ** 2)
line_training_error = mse(linear_clf.predict(X), y) #SOLUTION
poly_training_error = mse(poly_clf.predict(PolynomialFeatures(degree=8).fit_transform(X)), y) #SOLUTION
line_training_error, poly_training_error
Question 6: It's annoying to have to transform the data every time we want to use polynomial features. We can use a Pipeline to let us do both transformation and regression in one step.
Read the docs for make_pipeline and create a pipeline for polynomial regression called poly_pipeline. Then, fit it on X and y and compute the training error as in Question 5. The training errors should match.
In [ ]:
from sklearn.pipeline import make_pipeline
poly_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.LinearRegression()) #SOLUTION
# Fit the pipeline on X and y
poly_pipeline.fit(X, y) #SOLUTION
# Compute the training error
pipeline_training_error = mse(poly_pipeline.predict(X), y) #SOLUTION
pipeline_training_error
Nice! With pipelines, we can combine any number of transformations and treat the whole thing as a single classifier.
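As an optional sanity check (assuming poly_clf and poly_pipeline were both fit on the full X and y as above), the pipeline's predictions should match the manual transform-then-predict approach from Question 4:
In [ ]:
# Optional sanity check: the pipeline applies PolynomialFeatures and then
# LinearRegression, so its predictions should match the manual two-step version.
manual_curve = poly_clf.predict(PolynomialFeatures(degree=8).fit_transform(all_x))
pipeline_curve = poly_pipeline.predict(all_x)
np.allclose(manual_curve, pipeline_curve)  # expected: True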
Question 7: Now, we know that a low training error doesn't necessarily mean your model is good. So, we'll hold out some points from the training data for a validation set. We'll use these held-out points to choose the best model.
Use the train_test_split function to split out one third of the training data for validation. Call the resulting datasets X_train, X_valid, y_train, y_valid.
X_train should have shape (22, 1). X_valid should have shape (11, 1).
In [ ]:
from sklearn.model_selection import train_test_split
np.random.seed(42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33) #SOLUTION
X_train.shape, X_valid.shape
Question 8: Now, set X_train_line_error, X_valid_line_error, X_train_poly_error, X_valid_poly_error to the training and validation errors for both linear and polynomial regression.
You'll have to call .fit on your classifiers/pipelines again since we're using X_train and y_train instead of X and y.
You should see that the validation error for the polynomial fit is significantly higher than the linear fit (152.6 vs 115.2).
In [ ]:
# Fit the linear classifier
linear_clf.fit(X_train, y_train) #SOLUTION
# Fit the polynomial pipeline
poly_pipeline.fit(X_train, y_train) #SOLUTION
X_train_line_error = mse(linear_clf.predict(X_train), y_train) #SOLUTION
X_valid_line_error = mse(linear_clf.predict(X_valid), y_valid) #SOLUTION
X_train_poly_error = mse(poly_pipeline.predict(X_train), y_train) #SOLUTION
X_valid_poly_error = mse(poly_pipeline.predict(X_valid), y_valid) #SOLUTION
X_train_line_error, X_valid_line_error, X_train_poly_error, X_valid_poly_error
Question 9: Our 8 degree polynomial is overfitting our data. To reduce overfitting, we can use regularization.
The usual cost function for linear regression is:
$$J(\theta) = (Y - X \theta)^T (Y - X \theta)$$
Edit the cell below to show the cost function of linear regression with L2 regularization. Use $\lambda$ as your regularization parameter.
Now, explain why this cost function helps reduce overfitting.
SOLUTION: Adding regularization effectively restricts the set of possible polynomials we're allowed to use to fit the data. In particular, polynomials with large coefficients, which can be used to fit data very precisely, are discouraged.
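For reference, one common way to write the L2-regularized (ridge) cost is shown below; conventions vary on whether the intercept term is included in the penalty.
$$J(\theta) = (Y - X \theta)^T (Y - X \theta) + \lambda \theta^T \theta$$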
Question 10: L2 regularization for linear regression is also known as Ridge regression. Create a pipeline called ridge_pipeline that again creates polynomial features with degree 8 and then uses the Ridge sklearn classifier.
The alpha argument is the same as our $\lambda$. Leave it as the default (1.0). You should set normalize=True to normalize your data before fitting. Why do we have to do this?
Then, fit your pipeline on the data. The cell will then plot the curve of your regularized classifier. You should notice the curve is significantly smoother.
Then, fiddle around with the alpha value. What do you notice as you increase alpha? Decrease alpha?
In [ ]:
ridge_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.Ridge(normalize=True, alpha=1.)) #SOLUTION
# Fit your classifier
ridge_pipeline.fit(X_train, y_train) #SOLUTION
# Set curve to your model's predictions on all_x
ridge_curve = ridge_pipeline.predict(all_x) #SOLUTION
plot_data_and_curve(all_x, ridge_curve)
Question 11: Compute the training and validation error for the ridge_pipeline.
How do the errors compare to the errors for the unregularized model? Why did each one go up/down?
In [ ]:
ridge_train_error = mse(ridge_pipeline.predict(X_train), y_train) #SOLUTION
ridge_valid_error = mse(ridge_pipeline.predict(X_valid), y_valid) #SOLUTION
ridge_train_error, ridge_valid_error
Question 12: Now we want to know: how do we choose the best alpha value?
This is where we use our validation set. We can try out a bunch of alphas and pick the one that gives us the least error on the validation set. Why can't we use the one that gives us the least error on the training set? The test set?
For each alpha in the given alphas list, fit a ridge regression model to the training set and compute its error on the validation set.
Finally, set best_alpha to the best value. You should get a best alpha of 0.01 with a validation error of 15.7.
In [ ]:
alphas = [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 10.0]
# Your code to find the best alpha
def compute_error(alpha):
    pline = make_pipeline(PolynomialFeatures(degree=8),
                          lm.Ridge(normalize=True, alpha=alpha))
    pline.fit(X_train, y_train)
    return mse(pline.predict(X_valid), y_valid)
errors = [compute_error(alpha) for alpha in alphas]
best_alpha_idx = np.argmin(errors)
best_alpha, best_error = alphas[best_alpha_idx], errors[best_alpha_idx]
best_alpha, best_error
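If you'd like to see the trade-off visually, here's an optional sketch that plots validation error against alpha on a log scale (it reuses the alphas and errors computed above):
In [ ]:
# Optional: validation error as a function of alpha, on a log-scaled x-axis.
plt.semilogx(alphas, errors, 'o-')
plt.xlabel('alpha')
plt.ylabel('Validation MSE')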
Question 13: Now, set best_pipeline to the pipeline with the degree 8 polynomial transform and the ridge regression model with the best value of alpha.
In [ ]:
best_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.Ridge(normalize=True, alpha=best_alpha)) #SOLUTION
best_pipeline.fit(X_train, y_train)
best_curve = best_pipeline.predict(all_x)
plot_data_and_curve(all_x, best_curve)
Now, run the cell below to find the test error of your simple linear model, your polynomial model, and your regularized polynomial model.
In [ ]:
test_data = pd.read_csv('test.csv')
X_test = test_data[['X']].to_numpy()
y_test = test_data['y'].to_numpy()
line_test_error = mse(linear_clf.predict(X_test), y_test)
poly_test_error = mse(poly_pipeline.predict(X_test), y_test)
best_test_error = mse(best_pipeline.predict(X_test), y_test)
line_test_error, poly_test_error, best_test_error
Nice! You've used regularization and cross-validation to fit an accurate polynomial model to the dataset.
In the future, you'd probably want to use something like RidgeCV to automatically perform cross-validation, but it's instructive to do it yourself at least once.
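Here's a rough sketch of what that might look like (not required for the lab; unlike the pipeline above, this sketch doesn't normalize the features, and the named_steps key assumes make_pipeline's convention of lowercasing the class name):
In [ ]:
# A minimal sketch: RidgeCV tries each candidate alpha with built-in
# cross-validation and keeps the best one.
from sklearn.linear_model import RidgeCV
ridgecv_pipeline = make_pipeline(
    PolynomialFeatures(degree=8),
    RidgeCV(alphas=[0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 10.0])
)
ridgecv_pipeline.fit(X_train, y_train)
ridgecv_pipeline.named_steps['ridgecv'].alpha_  # the chosen regularization strength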
In [ ]:
i_finished_the_lab = False
In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()
In [ ]:
_ = ok.submit()
Now, run this code in your terminal to make a git commit that saves a snapshot of your changes in git. The last line of the cell runs git push, which will send your work to your personal Github repo.
# Tell git to commit your changes to this notebook
git add -A
# Tell git to make the commit
git commit -m "lab11 finished"
# Send your updates to your personal private repo
git push origin master