The Bias-Variance Trade-Off


In [115]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 10

In this discussion, we're going to check numerically some of the results from yesterday's lecture. Specifically, we're going to see how bias and variance change as we modify model complexity.

We'll start by defining some toy data and using it to build our models. In this case, we'll use a sum of sine and cosine waves as our underlying function $h(x)$, plus some Gaussian noise. Run the cell below to see an example set of this data.


In [116]:
np.random.seed(100)
x = np.array([i*np.pi/180 for i in range(0,360,18)])
h = lambda x: np.sin(x) + np.cos(x) + np.sin(2*x) + np.cos(2*x)
y = h(x) + np.random.normal(0,0.7,len(x))
plt.plot(x, y, '.')

We're going to compare a number of models we might use to fit this data, using polynomial regression. Specifically, we're going to compare models with polynomial degree varying from 1 to 20. The cell below implements a polynomial regression of degree $p$, given data $x$ and $y$.

Recall that a polynomial regression of degree $p$ will find weights $\theta$ to fit the following equation:

$$f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_{p-1} x^{p-1} + \theta_p x^p$$
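
To see how this maps onto code: PolynomialFeatures builds a matrix whose columns are $1, x, x^2, \ldots, x^p$, and LinearRegression then fits one weight $\theta_j$ per column. The cell below is a small illustrative sketch (not needed for the exercises):


In [ ]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# for degree p=2, each value x_i expands to the row [1, x_i, x_i^2]
demo_x = np.array([1.0, 2.0, 3.0])
print(PolynomialFeatures(degree=2).fit_transform(demo_x.reshape(-1, 1)))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]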

In [126]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def poly_regression(x, y, p):
    
    # fits a degree-p polynomial regression and returns its predictions at x
    # (LinearRegression's normalize option was removed in scikit-learn 1.2,
    # so we standardize the polynomial features explicitly instead)
    reg = make_pipeline(StandardScaler(), LinearRegression())
    poly = PolynomialFeatures(degree=p)
    poly_basis = poly.fit_transform(x.reshape(-1,1))
    reg.fit(poly_basis, y)
    f_theta = reg.predict(poly_basis)
    
    return f_theta

def smooth_curve(x, y, p):
    
    # computes predictions on a denser grid of x-values, to get a smoother curve
    # you will not need to use this function
    
    reg = make_pipeline(StandardScaler(), LinearRegression())
    poly = PolynomialFeatures(degree=p)
    poly_basis = poly.fit_transform(x.reshape(-1,1))
    reg.fit(poly_basis, y)
    
    x_extended = np.array([i*np.pi/180 for i in range(0,360,6)])
    poly_extended = PolynomialFeatures(degree=p)
    poly_basis_extended = poly_extended.fit_transform(x_extended.reshape(-1,1))
    f_theta = reg.predict(poly_basis_extended)
    
    return f_theta

Question 1

We plot some of these prediction functions below, with the prediction curve $f_\theta(x)$ for each in green. What causes the change in shape between various prediction curves? Which models seem to be overfitting? Which are underfitting? Which models have lower bias? Which have lower variance?


In [127]:
curve = np.array([i*np.pi/180 for i in range(0,360,6)])
plots = [(1,231),(3,232),(5,233),(10,234),(15,235),(30,236)]

for p, subplot in plots:
    plt.subplot(subplot)
    plt.plot(x, y, ".")
    plt.plot(curve, smooth_curve(x, y, p), color='g')  # prediction curve in green
    plt.title('Plot for p={}'.format(p))
    plt.ylim((-3,4))

Now let's consider the three components that make up our squared error:

$$\mathbf{E}\left[\left(y - f_\theta (x)\right)^2\right] = \mathbf{E}\left[\left(y - h (x)\right)^2\right] + \mathbf{E}\left[\left(h(x) - \mathbf{E}_\theta[f_\theta (x)]\right)^2\right] + \mathbf{E}\left[\left(\mathbf{E}_\theta[f_\theta (x)] - f_\theta (x)\right)^2\right]$$

Recall from lecture that these three components represent:

Noise: $\mathbf{E}\left[\left(y - h (x)\right)^2\right]$

Bias$^2$: $\mathbf{E}\left[\left(h(x) - \mathbf{E}_\theta[f_\theta (x)]\right)^2\right]$

Variance: $\mathbf{E}\left[\left(\mathbf{E}_\theta[f_\theta (x)] - f_\theta (x)\right)^2\right]$
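
As a side note, the cross terms that appear when expanding the square all vanish: the noise $\epsilon = y - h(x)$ has mean zero and is independent of the data used to fit $\theta$, and $h(x) - \mathbf{E}_\theta[f_\theta(x)]$ is a fixed quantity for a given $x$. For example,

$$\mathbf{E}\left[\left(y - h(x)\right)\left(h(x) - \mathbf{E}_\theta[f_\theta (x)]\right)\right] = \mathbf{E}\left[\epsilon\right]\left(h(x) - \mathbf{E}_\theta[f_\theta (x)]\right) = 0$$

The remaining cross terms vanish by similar arguments.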

We want to calculate the necessary terms numerically.

Note that the noise term does not depend in any way on our model. Since $y = h(x) + \epsilon$, we see that $\mathbf{E}\left[\left(y - h (x)\right)^2\right] = \mathbf{E}[\epsilon^2]$. And since we are using $\epsilon \sim \mathcal{N}(0, 0.7^2)$ (Gaussian noise with standard deviation $0.7$), this is simply $0.49$, regardless of our model.

You don't need to know how to derive this, but if you're interested, note that $\mathbf{E}[X^2] = 1$ when $X \sim \mathcal{N}(0, 1)$. From this we see that, if $\epsilon = 0.7X$, then $$\mathbf{E}[\epsilon^2] = \mathbf{E}[(0.7X)^2] = \mathbf{E}[0.49 X^2] = 0.49\mathbf{E}[X^2] = 0.49$$
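
As a quick numerical sanity check (a sketch, not part of the exercises), we can estimate $\mathbf{E}[\epsilon^2]$ by simulation and confirm it comes out near $0.49$:


In [ ]:
# Monte Carlo estimate of E[eps^2] for eps ~ N(0, 0.7^2)
eps = np.random.normal(0, 0.7, 10**6)
print(np.mean(eps**2))  # should be very close to 0.49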

Now, let's calculate the other two terms - Bias$^2$ and Variance.

Question 2

First, let's consider the term $\mathbf{E}_\theta[f_\theta (x)]$. What does this term represent? Is it a constant or a variable? If it's a variable, is its value dependent on the specific data $x$ that we have? What about on our model $\theta$?

Question 3

We can estimate $\mathbf{E}_\theta[f_\theta (x)]$ by simulating a series of datasets, fitting a model on each dataset, and then averaging the predictions at each point $x_i$ across all models. Every dataset should be created in the same manner as the original: by generating random values $\epsilon \sim \mathcal{N}(0, 0.7^2)$ and calculating $y = h(x) + \epsilon$.

Fill in the function below to calculate $\mathbf{E}_\theta[f_\theta (x)]$ in this manner. The function E_f_theta_x takes one required argument $p$, the degree of the polynomial basis that will be used to create the function $f_\theta(x)$.


In [128]:
def E_f_theta_x(p, n_sims=1000, sigma=0.7):
    f_thetas = np.zeros(len(x))  # accumulator for the predictions at each x_i
    for sim in range(n_sims):
        y = h(x) + np.random.normal(0, sigma, len(x)) #SOLUTION
        f_thetas = f_thetas + poly_regression(x, y, p)
    E_f_thetas = f_thetas/n_sims #SOLUTION
    return E_f_thetas
E_f_theta_x(3)

Question 4

Now let's compute the bias term for a variety of different models. Fill in the bias_term function below to calculate the squared bias of a polynomial regression of degree $p$ when fitting data generated as $y = h(x) + \epsilon$ for the provided values $x$. You should find that bias_term(3) $\approx 0.547$.


In [129]:
def bias_term(p): # really squared bias
    bias = np.mean((E_f_theta_x(p, sigma=0.7) - h(x))**2) #SOLUTION
    return bias
bias_term(3)

Question 5

Now let's do the same for the variance. Fill in the variance_term function below to calculate the variance of a polynomial regression of degree $p$ when fitting data generated as $y = h(x) + \epsilon$ for the provided values $x$. You will need to calculate $y$ just as in question 3.


In [130]:
def variance_term(p):
    y = h(x) + np.random.normal(0, 0.7, len(x)) #SOLUTION
    f_theta_x = poly_regression(x, y, p)
    variance = np.mean((E_f_theta_x(p, sigma=0.7) - f_theta_x)**2) #SOLUTION
    return variance

Since your variance term will depend heavily on the specific $\theta$-values of each individual regression (which are in turn dependent on the specific values of $\epsilon$ for each simulated dataset), we will average together the variance terms of a number of regressions. The function avg_var_term is provided below. You should find that avg_var_term(10) $\approx 0.27$, though this estimate will be less precise than the one for the bias term.


In [131]:
def avg_var_term(p, n_sims=100):
    variance = 0
    E_f_theta = E_f_theta_x(p, sigma=0.7)
    for sim in range(n_sims):
        y = h(x) + np.random.normal(0, 0.7, len(x))
        f_theta_x = poly_regression(x, y, p)
        variance += np.mean((E_f_theta - f_theta_x)**2)
    return variance/n_sims

avg_var_term(10)

Question 6

Now we can visualize how our bias and variance terms change as $p$ increases.

BEFORE RUNNING THE NEXT CELL: What do you expect to happen as $p$ increases? How will the model's bias change? How will its variance change?

After considering this, run the plotting cell below and see how the results match up with your expectations. It may take a minute or two for the cell to run.


In [132]:
varis = np.array([avg_var_term(i) for i in range(1, 20)])
bias = np.array([bias_term(i) for i in range(1, 20)])
plt.plot(np.arange(1, 20), varis, color='b', label='Variance')
plt.plot(np.arange(1, 20), bias, color='r', label='Bias^2')
plt.legend()
plt.title('Bias^2 and Variance vs. p')

Question 7

To make sure we've done our calculations correctly, let's go back to our original formula:

$$\mathbf{E}\left[\left(y - f_\theta (x)\right)^2\right] = \mathbf{E}\left[\left(y - h (x)\right)^2\right] + \mathbf{E}\left[\left(h(x) - \mathbf{E}_\theta[f_\theta (x)]\right)^2\right] + \mathbf{E}\left[\left(\mathbf{E}_\theta[f_\theta (x)] - f_\theta (x)\right)^2\right]$$

We can plot our MSE for models with various $p$ against the sum of our individual components for noise, bias$^2$, and variance. Fill in the test_error function below, which calculates the test MSE for a regression of degree $p$ on some simulated data. Then, run the cells below to plot the test error against the sum of your computed noise, bias$^2$, and variance terms. You should find that the two curves are nearly identical.


In [133]:
def test_error(p):
    y_train = h(x) + np.random.normal(0,0.7,len(x))
    y_test = h(x) + np.random.normal(0,0.7,len(x))
    pred_values = poly_regression(x, y_train, p)
    mse = np.mean((pred_values - y_test)**2) #SOLUTION
    return mse

def avg_test_error(p, n_sims=100):
    return np.mean([test_error(p) for sim in range(n_sims)])

In [134]:
noise = 0.49

plt.plot(range(1, 20), [avg_test_error(i) for i in range(1, 20)], color='b', label='Test MSE')
plt.plot(range(1, 20), noise + bias + varis, color='g', label='Noise + Bias^2 + Variance')
plt.legend()
plt.title("Test error and calculated (noise + bias^2 + variance) vs. p")

Question 8

What seems to be the optimal degree $p$ for your regression? What is the average MSE across different simulated datasets when using that value of $p$ for your model?

Do you think that any model could achieve an average MSE much below 0.49? Why or why not?

Question 9

Finally, let's take a look at how this test error compares to our training error. Fill in the train_error function below, and then run the plotting cell to see how your training error compares to your test error. Does this make sense, given what you know about the relationship between training and test error?


In [ ]:
def train_error(p):
    y_train = h(x) + np.random.normal(0,0.7,len(x))
    pred_values = poly_regression(x, y_train, p)
    mse = np.mean((pred_values - y_train)**2) #SOLUTION
    return mse

def avg_train_error(p, n_sims=100):
    return np.mean([train_error(p) for sim in range(n_sims)])

In [ ]:
plt.plot(range(1, 20), [avg_test_error(i) for i in range(1, 20)], color='b', label='Test error')
plt.plot(range(1, 20), [avg_train_error(i) for i in range(1, 20)], color='g', label='Training error')
plt.legend()
plt.title("Training vs. Test Error")
