```
In [ ]:
```!pip install -U sklearn
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import sklearn.linear_model as lm
import scipy.io as sio
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab11.ok')

Today's lab covers:

- How to use regularization to avoid overfitting
- How to use cross-validation to find the amount of regularization that produces a model with the least error for new data

We've put our data into two files: `train.csv`

and `test.csv`

which contain
the training and test data, respectively. You are not allowed to train
on the test data.

The y values in the training data correspond to the amount of water that flows out of the dam on a particular day. There is only 1 feature: the increase in water level for the dam's reservoir on that day, which we'll call x.

```
In [ ]:
```!head train.csv

Let's load in the data:

```
In [ ]:
```data = pd.read_csv('train.csv')
X = data[['X']].as_matrix()
y = data['y'].as_matrix()
X.shape, y.shape

```
In [ ]:
```_ = plt.plot(X[:, 0], y, '.')
plt.xlabel('Change in water level (X)')
plt.ylabel('Water flow out of dam (y)')

```
In [ ]:
```def plot_data_and_curve(curve_x, curve_y):
plt.plot(X[:, 0], y, '.')
plt.plot(curve_x, curve_y, '-')
plt.ylim(-20, 60)
plt.xlabel('Change in water level (X)')
plt.ylabel('Water flow out of dam (y)')

**Question 1:** As a warmup, let's fit a line to this data using `sklearn`

.
We've imported `sklearn.linear_model`

as `lm`

, so you can use that instead of
typing out the whole module name. Running the cell should show the data
with your best fit line.

```
In [ ]:
```linear_clf = lm.LinearRegression() #SOLUTION
# Fit your classifier
linear_clf.fit(X, y)
# Predict a bunch of points to draw best fit line
all_x = np.linspace(-55, 55, 200).reshape(-1, 1)
line = linear_clf.predict(all_x)
plot_data_and_curve(all_x, line)

**Question 2:** If you had to guess, which has a larger effect on error for this dataset: bias or variance?
Explain briefly.

**SOLUTION:** Bias. Our data are curved up, but our line cannot model that curve, so bias is high. Variance is low because the complexity of our model is low relative to the size of the training set. That is, if we drew a new dataset from the population of days, we would probably get a similar line.

**Question 3:** Let's now add some complexity to our model by adding polynomial features.

Reference the `sklearn`

docs on the `PolynomialFeatures`

class. You should use this class to add polynomial features to `X`

up to degree 8 using the `fit_transform`

method.

The final `X_poly`

data matrix should have shape `(33, 9)`

. Think about and discuss why.

```
In [ ]:
```from sklearn.preprocessing import PolynomialFeatures
X_poly = PolynomialFeatures(degree=8).fit_transform(X) #SOLUTION
X_poly.shape

**Question 4:** Now, fit a linear regression to the data, using polynomial features.

Then, follow the code in Question 1 to make predictions for the values in `all_x`

. You'll have to add polynomial features to `all_x`

in order to make predictions.

Then, running this cell should plot the best fit curve using a degree 8 polynomial.

```
In [ ]:
```poly_clf = lm.LinearRegression() #SOLUTION
# Fit your classifier
poly_clf.fit(X_poly, y) #SOLUTION
# Set curve to your model's predictions on all_x
curve = poly_clf.predict(PolynomialFeatures(degree=8).fit_transform(all_x)) #SOLUTION
plot_data_and_curve(all_x, curve)

**Question 5:** Think about and discuss what you notice in the model's predictions.

Now, compute the mean squared training error for both the best fit line and polynomial. Again, you'll have to transform the training data for the polynomial regression before you can make predictions.

You should get training errors of around 52.8 and 5.23 for line and polynomial models, respectively. Why does the polynomial model get a lower training error than the linear model?

```
In [ ]:
```def mse(predicted_y, actual_y):
return np.mean((predicted_y - actual_y) ** 2)
line_training_error = mse(linear_clf.predict(X), y) #SOLUTION
poly_training_error = mse(poly_clf.predict(PolynomialFeatures(degree=8).fit_transform(X)), y) #SOLUTION
line_training_error, poly_training_error

**Question 6:** It's annoying to have to transform the data every time we want to use polynomial features. We can use a `Pipeline`

to let us do both transformation and regression in one step.

Read the docs for `make_pipeline`

and create a pipeline for polynomial regression called `poly_pipeline`

. Then, fit it on `X`

and `y`

and compute the training error as in Question 5. The training errors should match.

```
In [ ]:
```from sklearn.pipeline import make_pipeline
poly_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.LinearRegression()) #SOLUTION
# Fit the pipeline on X and y
poly_pipeline.fit(X, y) #SOLUTION
# Compute the training error
pipeline_training_error = mse(poly_pipeline.predict(X), y) #SOLUTION
pipeline_training_error

**Question 7:** Now, we know that a low training error doesn't necessarily mean your model is good. So, we'll hold out some points from the training data for a *validation set*. We'll use these held-out points to choose the best model.

Use the `train_test_split`

function to split out one third of the training data for validation. Call the resulting datasets `X_train`

, `X_valid`

, `y_train`

, `y_valid`

.

`X_train`

should have shape `(22, 1)`

. `X_valid`

should have shape `(11, 1)`

```
In [ ]:
```from sklearn.model_selection import train_test_split
np.random.seed(42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33) #SOLUTION
X_train.shape, X_valid.shape

**Question 8:** Now, set `X_train_line_error`

, `X_valid_line_error`

,
`X_train_poly_error`

, `X_valid_poly_error`

to the training and validation
errors for both linear and polynomial regression.

You'll have to call `.fit`

on your classifiers/pipelines again since we're using
`X_train`

and `y_train`

instead of `X`

and `y`

.

You should see that the validation error for the polynomial fit is significantly higher than the linear fit (152.6 vs 115.2).

```
In [ ]:
```# Fit the linear classifier
linear_clf.fit(X_train, y_train) #SOLUTION
# Fit the polynomial pipeline
poly_pipeline.fit(X_train, y_train) #SOLUTION
X_train_line_error = mse(linear_clf.predict(X_train), y_train) #SOLUTION
X_valid_line_error = mse(linear_clf.predict(X_valid), y_valid) #SOLUTION
X_train_poly_error = mse(poly_pipeline.predict(X_train), y_train) #SOLUTION
X_valid_poly_error = mse(poly_pipeline.predict(X_valid), y_valid) #SOLUTION
X_train_line_error, X_valid_line_error, X_train_poly_error, X_valid_poly_error

**Question 9:** Our 8 degree polynomial is overfitting our data.
To reduce overfitting, we can use **regularization**.

The usual cost function for linear regression is:

$$J(\theta) = (Y - X \theta)^T (Y - X \theta)$$Edit the cell below to show the cost function of linear regressions with L2 regularization. Use $\lambda $ as your regularization parameter.

$$J(\theta) = (Y - X \theta)^T (Y - X \theta)$$

**Now,** explain why this cost function helps reduce overfitting.

**SOLUTION:** Adding regularization effectively restricts the set of possible polynomials we're allowed to use to fit the data. In particular, polynomials with large coefficients, which can be used to fit data very precisely, are discouraged.

**Question 10:** L2 regularization for linear regression is also known as
*Ridge* regression. Create a pipeline called `ridge_pipeline`

that again
creates polynomial features with degree 8 and then uses the `Ridge`

sklearn
classifier.

The `alpha`

argument is the same as our $\lambda$. Leave it as the default (1.0). You should set `normalize=True`

to normalize your data before fitting. Why do we have to do this?

Then, fit your pipeline on the data. The cell will then plot the curve of your regularized classifier. You should notice the curve is significantly smoother.

Then, fiddle around with the alpha value. What do you notice as you increase alpha? Decrease alpha?

```
In [ ]:
```ridge_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.Ridge(normalize=True, alpha=1.)) #SOLUTION
# Fit your classifier
ridge_pipeline.fit(X_train, y_train) #SOLUTION
# Set curve to your model's predictions on all_x
ridge_curve = ridge_pipeline.predict(all_x) #SOLUTION
plot_data_and_curve(all_x, ridge_curve)

**Question 11:** Compute the training and validation error for the `ridge_pipeline`

.

How do the errors compare to the errors for the unregularized model? Why did each one go up/down?

```
In [ ]:
```ridge_train_error = mse(ridge_pipeline.predict(X_train), y_train) #SOLUTION
ridge_valid_error = mse(ridge_pipeline.predict(X_valid), y_valid) #SOLUTION
ridge_train_error, ridge_valid_error

**Question 12:** Now we want to know: how do we choose the best alpha value?

This is where we use our validation set. We can try out a bunch of alphas and pick the one that gives us the least error on the validation set. Why can't we use the one that gives us the least error on the training set? The test set?

For each alpha in the given `alphas`

list, fit a Ridge regression model to the training set and check its accuracy on the validation set.

Finally, set `best_alpha`

to the best value. You should get a best alpha of 0.01 with a validation error of 15.7.

```
In [ ]:
```alphas = [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 10.0]
# Your code to find the best alpha
def compute_error(alpha):
pline = make_pipeline(PolynomialFeatures(degree=8),
lm.Ridge(normalize=True, alpha=alpha))
pline.fit(X_train, y_train)
return mse(pline.predict(X_valid), y_valid)
errors = [compute_error(alpha) for alpha in alphas]
best_alpha_idx = np.argmin(errors)
best_alpha, best_error = alphas[best_alpha_idx], errors[best_alpha_idx]
best_alpha, best_error

**Question 13:** Now, set `best_pipeline`

to the pipeline with the degree 8 polynomial transform and the ridge regression model with the best value of alpha.

```
In [ ]:
```best_pipeline = make_pipeline(PolynomialFeatures(degree=8), lm.Ridge(normalize=True, alpha=best_alpha)) #SOLUTION
best_pipeline.fit(X_train, y_train)
best_curve = best_pipeline.predict(all_x)
plot_data_and_curve(all_x, best_curve)

```
In [ ]:
```test_data = pd.read_csv('test.csv')
X_test = data[['X']].as_matrix()
y_test = data['y'].as_matrix()
line_test_error = mse(linear_clf.predict(X_test), y_test)
poly_test_error = mse(poly_pipeline.predict(X_test), y_test)
best_test_error = mse(best_pipeline.predict(X_test), y_test)
line_test_error, poly_test_error, best_test_error

Nice! You've use regularization and cross-validation to fit an accurate polynomial model to the dataset.

In the future, you'd probably want to use something like `RidgeCV`

to automatically perform cross-validation, but it's instructive to do it yourself at least once.

```
In [ ]:
```i_finished_the_lab = False

```
In [ ]:
```_ = ok.grade('qcompleted')
_ = ok.backup()

```
In [ ]:
```_ = ok.submit()

Now, run this code in your terminal to make a
git commit
that saves a snapshot of your changes in `git`

. The last line of the cell
runs git push, which will send your work to your personal Github repo.

```
# Tell git to commit your changes to this notebook
git add -A
# Tell git to make the commit
git commit -m "lab11 finished"
# Send your updates to your personal private repo
git push origin master
```