In [115]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 10
In this discussion, we're going to numerically check some of the results from yesterday's lecture. Specifically, we're going to see how bias and variance change as we vary model complexity.
We'll start by defining some toy data and using it to build our models. In this case, our underlying function $h(x)$ is a sum of sine and cosine waves, and we add Gaussian noise on top of it. Run the cell below to see an example set of this data.
In [116]:
np.random.seed(100)
x = np.array([i*np.pi/180 for i in range(0,360,18)])
h = lambda x: np.sin(x) + np.cos(x) + np.sin(2*x) + np.cos(2*x)
y = h(x) + np.random.normal(0,0.7,len(x))
plt.plot(x, y, '.')
We're going to compare a number of models we might use to fit this data, using polynomial regression. Specifically, we're going to compare models with polynomial degree varying from 1 to 20. The cell below implements a polynomial regression of degree $p$, given data $x$ and $y$.
Recall that a polynomial regression of degree $p$ will find weights $\theta$ to fit the following equation:
$$f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_{p-1} x^{p-1} + \theta_p x^p$$
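To get a feel for this basis expansion, here is a minimal sketch (the sample points are made up purely for illustration) of what scikit-learn's PolynomialFeatures produces for $p = 2$: each input $x$ becomes the row $[1, x, x^2]$, which is the same transformation poly_regression applies before fitting the linear model.

from sklearn.preprocessing import PolynomialFeatures
demo_x = np.array([[1.0], [2.0], [3.0]])      # illustrative sample points only
PolynomialFeatures(degree=2).fit_transform(demo_x)
# array([[1., 1., 1.],
#        [1., 2., 4.],
#        [1., 3., 9.]])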
In [126]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
def poly_regression(x, y, p):
    reg = LinearRegression(normalize=True)
    poly = PolynomialFeatures(degree=p)
    poly_basis = poly.fit_transform(x.reshape(-1,1))
    reg.fit(poly_basis, y)
    f_theta = reg.predict(poly_basis)
    return f_theta
def smooth_curve(x, y, p):
    # Computes predictions on a denser grid of x-values, to get a smoother curve.
    # You will not need to use this function.
    reg = LinearRegression(normalize=True)
    poly = PolynomialFeatures(degree=p)
    poly_basis = poly.fit_transform(x.reshape(-1,1))
    reg.fit(poly_basis, y)
    x_extended = np.array([i*np.pi/180 for i in range(0,360,6)])
    poly_extended = PolynomialFeatures(degree=p)
    poly_basis_extended = poly_extended.fit_transform(x_extended.reshape(-1,1))
    f_theta = reg.predict(poly_basis_extended)
    return f_theta
We plot some of these prediction functions below, with the prediction curve $f_\theta(x)$ for each in green. What causes the change in shape between various prediction curves? Which models seem to be overfitting? Which are underfitting? Which models have lower bias? Which have lower variance?
In [127]:
curve = np.array([i*np.pi/180 for i in range(0,360,6)])
plots = [(1,231),(3,232),(5,233),(10,234),(15,235),(30,236)]
for p, subplot in plots:
    plt.subplot(subplot)
    plt.plot(x, y, ".")
    plt.plot(curve, smooth_curve(x, y, p))
    plt.title('Plot for p={}'.format(p))
    plt.ylim((-3,4))
Now let's consider the three components that make up our squared error:
$$\mathbf{E}\left[\left(y - f_\theta (x)\right)^2\right] = \mathbf{E}\left[\left(y - h (x)\right)^2\right] + \mathbf{E}\left[\left(h(x) - \mathbf{E}_\theta[f_\theta (x)]\right)^2\right] + \mathbf{E}\left[\left(\mathbf{E}_\theta[f_\theta (x)] - f_\theta (x)\right)^2\right]$$

Recall from lecture that these three components represent:
Noise: $\mathbf{E}\left[\left(y - h (x)\right)^2\right]$
Bias$^2$: $\mathbf{E}\left[\left(h(x) - \mathbf{E}_\theta[f_\theta (x)]\right)^2\right]$
Variance: $\mathbf{E}\left[\left(\mathbf{E}_\theta[f_\theta (x)] - f_\theta (x)\right)^2\right]$
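To see roughly why this decomposition holds (a sketch of the lecture argument, not something you need to reproduce): write $\bar{f}(x) = \mathbf{E}_\theta[f_\theta(x)]$ and split the error as

$$y - f_\theta(x) = \big(y - h(x)\big) + \big(h(x) - \bar{f}(x)\big) + \big(\bar{f}(x) - f_\theta(x)\big)$$

Squaring and taking expectations leaves only the three squared terms above, because the noise $\epsilon = y - h(x)$ has mean zero and is independent of the fitted model, and $\mathbf{E}_\theta\big[\bar{f}(x) - f_\theta(x)\big] = 0$, so all of the cross terms vanish.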
We want to calculate the necessary terms numerically.
Note that the noise term does not depend in any way on our model. Since $y = h(x) + \epsilon$, we see that $\mathbf{E}\left[\left(y - h (x)\right)^2\right] = \mathbf{E}[\epsilon^2]$. And since we are using $\epsilon \sim \mathcal{N}(0, 0.7^2)$ (Gaussian noise with standard deviation $0.7$), this is simply $0.49$, regardless of our model.
You don't need to know how to derive this, but if you're interested, note that $\mathbf{E}[X^2] = 1$ when $X \sim \mathcal{N}(0, 1)$. From this we see that, if $\epsilon = 0.7X$, then
$$\mathbf{E}[\epsilon^2] = \mathbf{E}[(0.7X)^2] = \mathbf{E}[0.49 X^2] = 0.49\mathbf{E}[X^2] = 0.49$$
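If you'd like to check this numerically yourself (an optional sketch; the sample size here is arbitrary), you can average the squares of a large number of simulated noise draws:

eps = np.random.normal(0, 0.7, 1000000)  # many draws of the noise term
np.mean(eps**2)                          # should come out very close to 0.49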
Now, let's calculate the other two terms - Bias$^2$ and Variance.
We can calculate $\mathbf{E}_\theta[f_\theta (x)]$ by simulating a series of datasets, computing a model on each dataset, and then averaging the predictions at each point $x_i$ across all of the models. Every dataset should be created in the same manner: by generating random noise $\epsilon \sim \mathcal{N}(0, 0.7^2)$ and calculating $y = h(x) + \epsilon$.
Fill in the function below to calculate $\mathbf{E}_\theta[f_\theta (x)]$ in this manner. The function E_f_theta_x takes in an argument $p$, which represents the degree of the polynomial basis that will be used to create the function $f_\theta(x)$.
In [128]:
def E_f_theta_x(p, n_sims=1000, sigma=0.7):
    f_thetas = np.zeros(len(x))
    for sim in range(n_sims):
        y = h(x) + np.random.normal(0, sigma, len(x)) #SOLUTION
        f_thetas = f_thetas + poly_regression(x, y, p)
    E_f_thetas = f_thetas/n_sims #SOLUTION
    return E_f_thetas
E_f_theta_x(3)
Now let's compute the bias term for a variety of different models. Fill in the bias_term function below to calculate the squared bias of a polynomial regression of degree $p$ when fitting data generated as $y = h(x) + \epsilon$ for the provided values $x$. You should find that bias_term(3) $\approx 0.547$.
In [129]:
def bias_term(p): # really squared bias
    bias = np.mean((E_f_theta_x(p, sigma=0.7) - h(x))**2) #SOLUTION
    return bias
bias_term(3)
In [130]:
def variance_term(p):
    y = h(x) + np.random.normal(0, 0.7, len(x)) #SOLUTION
    f_theta_x = poly_regression(x, y, p)
    variance = np.mean((E_f_theta_x(p, sigma=0.7) - f_theta_x)**2) #SOLUTION
    return variance
Since your variance term will depend heavily on the specific $\theta$-values of each individual regression (which are in turn dependent on the specific values of $\epsilon$ for each simulated dataset), we will average together the variance terms of a number of regressions. The function avg_var_term is provided below. You should find that avg_var_term(10) $\approx 0.27$, though this estimate will be less precise than the one for the bias term.
In [131]:
def avg_var_term(p, n_sims=100):
    variance = 0
    E_f_theta = E_f_theta_x(p, sigma=0.7)
    for sim in range(n_sims):
        y = h(x) + np.random.normal(0, 0.7, len(x))
        f_theta_x = poly_regression(x, y, p)
        variance += np.mean((E_f_theta - f_theta_x)**2)
    return variance/n_sims
avg_var_term(10)
Now we can visualize how our bias and variance terms change as $p$ increases.
BEFORE RUNNING THE NEXT CELL: What do you expect to happen as $p$ increases? How will the model's bias change? How will its variance change?
After considering this, run the plotting cell below and see how the results match up with your expectations. It may take a minute or two for the cell to run.
In [132]:
varis = np.array([avg_var_term(i) for i in range(1, 20)])
bias = np.array([bias_term(i) for i in range(1, 20)])
plt.plot(np.arange(1, 20), varis, color='b')
plt.plot(np.arange(1, 20), bias, color='r')
plt.title('Bias^2 and Variance vs. p')
To make sure we've done our calculations correctly, let's go back to our original formula:
$$\mathbf{E}\left[\left(y - f_\theta (x)\right)^2\right] = \mathbf{E}\left[\left(y - h (x)\right)^2\right] + \mathbf{E}\left[\left(h(x) - \mathbf{E}_\theta[f_\theta (x)]\right)^2\right] + \mathbf{E}\left[\left(\mathbf{E}_\theta[f_\theta (x)] - f_\theta (x)\right)^2\right]$$

We can plot our MSE for models with various $p$ against the sum of our individual components for noise, bias$^2$, and variance. Fill in the test_error function below, which calculates the test MSE for a regression of degree $p$ on some simulated data. Then, run the cells below to plot your test error against the sum of your computed noise, bias$^2$, and variance. You should find that the two curves are nearly identical.
In [133]:
def test_error(p):
    y_train = h(x) + np.random.normal(0,0.7,len(x))
    y_test = h(x) + np.random.normal(0,0.7,len(x))
    pred_values = poly_regression(x, y_train, p)
    mse = np.mean((pred_values - y_test)**2) #SOLUTION
    return mse

def avg_test_error(p, n_sims=100):
    return np.mean([test_error(p) for sim in range(n_sims)])
In [134]:
noise = 0.49
plt.plot(range(1, 20), [avg_test_error(i) for i in range(1, 20)], color='b')
plt.plot(range(1, 20), noise + bias + varis, color='g')
plt.title("Test error and calculated (noise + bias^2 + variance) vs. p")
Finally, let's take a look at how this test error compares to our training error. Fill in the train_error function below, and then run the plotting cell to see how your training error compares to your test error. Does this make sense, given what you know about the relationship between training and test error?
In [ ]:
def train_error(p):
    y_train = h(x) + np.random.normal(0,0.7,len(x))
    pred_values = poly_regression(x, y_train, p)
    mse = np.mean((pred_values - y_train)**2) #SOLUTION
    return mse

def avg_train_error(p, n_sims=100):
    return np.mean([train_error(p) for sim in range(n_sims)])
In [ ]:
plt.plot(range(1, 20), [avg_test_error(i) for i in range(1, 20)], color='b')
plt.plot(range(1, 20), [avg_train_error(i) for i in range(1, 20)], color='g')
plt.title("Training vs. Test Error")