Lab 3: Generalization

Neil D. Lawrence COM4509/6509 Machine Learning and Adaptive Intelligence

First, let's set up the notebook so that plots appear inline.$$\newcommand{\inputScalar}{x}\newcommand{\lengthScale}{\ell}\newcommand{\mappingVector}{\mathbf{w}}\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}\newcommand{\zerosVector}{\mathbf{0}}\newcommand{\eye}{\mathbf{I}}\newcommand{\dataStd}{\sigma}\newcommand{\dataMatrix}{\mathbf{Y}}\newcommand{\dataScalar}{y}\newcommand{\dataVector}{\mathbf{y}}\newcommand{\inputVector}{\mathbf{x}}\newcommand{\kernelMatrix}{\mathbf{K}}\newcommand{\basisMatrix}{\mathbf{\Phi}}\newcommand{\expSamp}[1]{\left<#1\right>}\newcommand{\covarianceMatrix}{\mathbf{C}}\newcommand{\numData}{N}\newcommand{\mappingScalar}{w}\newcommand{\mappingFunctionScalar}{f}$$


In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

The aim of this notebook is to review the different methods of model selection: hold out validation, leave one out cross validation and $k$-fold cross validation. The next thing to do is to download the olympics data just to make sure it is available for us to study.


In [ ]:
import urllib.request
url = ("http://staffwww.dcs.shef.ac.uk/"
    + "people/N.Lawrence/dataset_mirror/"
    + "olympic_marathon_men/olympicMarathonTimes.csv")
urllib.request.urlretrieve(url, 'olympicMarathonTimes.csv')
olympics = np.loadtxt('olympicMarathonTimes.csv', delimiter=',')

As before, we extract the Olympic years and the pace of the winning runner into 2-dimensional arrays, with the data points in the rows of the array (the first dimension).


In [ ]:
x = olympics[:, 0:1]
y = olympics[:, 1:2]

We can plot them to check that they've loaded in correctly.


In [ ]:
plt.plot(x, y, 'rx')

Training Error

The first thing we'll do is plot the training error for the polynomial fit. To do this let's set up some parameters.


In [ ]:
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = np.linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions
order = 4 # The polynomial order to use.

Now let's build the basis matrices.


In [ ]:
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in range(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i

Now we can solve for the regression weights and make predictions both for the training data points and the test data points. That involves solving the linear system given by

$$\basisMatrix^\top \basisMatrix \mappingVector^* = \basisMatrix^\top \dataVector$$

In [ ]:
# solve the linear system
w_star = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))

We then use the resulting vector to make predictions at the training points and test points,

$$\mathbf{f} = \basisMatrix \mappingVector^*.$$

To implement this in practice we need to use basis matrices for both the predictions and the training points.


In [ ]:
# predict at training and test points
f = np.dot(Phi, w_star)
f_pred = np.dot(Phi_pred, w_star)

These can be used to compute the error:

$$E(\mappingVector) = \frac{\numData}{2} \log \dataStd^2 + \frac{1}{2\dataStd^2} \sum_{i=1}^\numData \left(\dataScalar_i - \mappingVector^\top \phi(\inputVector_i)\right)^2,$$

which, writing $\mappingFunctionScalar_i = \mappingVector^\top \phi(\inputVector_i)$ for the prediction at the $i$th training point, is

$$E(\mappingVector) = \frac{\numData}{2} \log \dataStd^2 + \frac{1}{2\dataStd^2} \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunctionScalar_i\right)^2.$$

In [ ]:
# compute the sum of squares term
sum_squares = ((y-f)**2).sum()
# fit the noise variance
sigma2 = sum_squares/num_data
error = 0.5*(num_data*np.log(sigma2) + sum_squares/sigma2)

Now that we have the fit and the error, let's plot them.


In [ ]:
# print the error and plot the predictions
print("The error is: %2.4f"%error)
plt.plot(x_pred, f_pred)
plt.plot(x, y, 'rx')
ax = plt.gca()
ax.set_title('Predictions for Order ' + str(order))
ax.set_xlabel('year')
ax.set_ylabel('pace (min/km)')

Now use the loop structure below to compute the error for different model orders.


In [ ]:
# import the time module to allow python to pause.
import time
# import display and clear_output to animate the plot in the notebook.
from IPython.display import display, clear_output

error_list = []
max_order = 6
fig, axes = plt.subplots(nrows=1, ncols=2)
for order in range(0, max_order+1):
    # 1. build the basis set
    Phi = np.zeros((num_data, order+1))
    Phi_pred = np.zeros((num_pred_data, order+1))
    for i in range(0, order+1):
        Phi[:, i:i+1] = x**i
        Phi_pred[:, i:i+1] = x_pred**i
    # 2. solve the linear system
    w_star = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))
    # 3. make predictions at training and test points
    f = np.dot(Phi, w_star)
    f_pred = np.dot(Phi_pred, w_star)
    # 4. compute the error and append it to a list.
    sum_squares = ((y-f)**2).sum()
    sigma2 = sum_squares/num_data
    error_list.append(0.5*(num_data*np.log(sigma2) + sum_squares/sigma2))
    # 5. plot the predictions
    axes[0].clear()
    axes[1].clear()    
    axes[0].plot(x_pred, f_pred)
    axes[0].plot(x, y, 'rx')
    axes[0].set_ylim((2.5, 5.5))
    axes[0].set_title('Predictions for Order ' + str(order) + ' model.')
    axes[1].plot(np.arange(0, order+1), np.asarray(error_list))
    axes[1].set_xlim((0, max_order))
    axes[1].set_ylim((-30, 0))
    axes[1].set_title('Training Error')
    display(fig)
    time.sleep(1)
    clear_output()

Hold Out Validation

The error we computed above is the training error. It doesn't assess the model's generalization ability, it only assesses how well the model performs on the data it was trained on. In hold out validation, we keep back some of the training data for assessing generalization performance. In the case of time series prediction, it often makes sense to hold out the last few data points, in particular when we are interested in extrapolation, i.e. predicting into the future given the past. To perform hold out validation, we first remove the hold out set. If we were interested in interpolation, we would hold out some random points; here, because we are interested in extrapolation, we will hold out all points after 1980.


In [ ]:
# Work out which points to hold out (all points after 1980)
indices_hold_out = np.nonzero(x[:, 0] > 1980)[0]

# Create a training set
x_train = np.delete(x, indices_hold_out, axis=0)
y_train = np.delete(y, indices_hold_out, axis=0)

# Create a hold out set
x_hold_out = x[indices_hold_out, :]
y_hold_out = y[indices_hold_out, :]

# Now use the training set and hold out set to compute the errors like above.

Now that you have the training and hold out data, you should be able to use the code above to evaluate the model on the hold out data. Do this in the code block below.
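
If you get stuck, here is a minimal sketch of one possible approach. It fits each model order on the training data and evaluates the error on the hold out data; the compute_error helper is our own wrapper around the error calculation used earlier in the lab, not something defined elsewhere.


In [ ]:
# compute_error is a hypothetical helper wrapping the error used above.
def compute_error(y, f, num):
    sum_squares = ((y - f)**2).sum()
    sigma2 = sum_squares/num
    return 0.5*(num*np.log(sigma2) + sum_squares/sigma2)

num_train = x_train.shape[0]
num_hold_out = x_hold_out.shape[0]
max_order = 6
for order in range(0, max_order+1):
    # build basis matrices for the training and hold out inputs
    Phi_train = np.zeros((num_train, order+1))
    Phi_hold_out = np.zeros((num_hold_out, order+1))
    for i in range(0, order+1):
        Phi_train[:, i:i+1] = x_train**i
        Phi_hold_out[:, i:i+1] = x_hold_out**i
    # fit the weights on the training data only
    w_star = np.linalg.solve(np.dot(Phi_train.T, Phi_train),
                             np.dot(Phi_train.T, y_train))
    # evaluate the error on the hold out data
    f_hold_out = np.dot(Phi_hold_out, w_star)
    print("Order %d hold out error: %2.4f"
          % (order, compute_error(y_hold_out, f_hold_out, num_hold_out)))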

Leave One Out Validation

In leave one out validation, you leave out one data point at a time, fit to the remaining data, and compute the error on the point you left out. Perform leave one out validation on this data set, testing polynomial bases from 0th order to 6th order.
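
A minimal sketch of one way to structure this is given below. Note that with a single held out point the noise variance fit used above is degenerate, so this sketch reports the mean squared error over the left out points instead; that substitution is our own choice.


In [ ]:
# a sketch of leave one out validation, reporting mean squared error.
max_order = 6
for order in range(0, max_order+1):
    squared_error = 0.0
    for k in range(num_data):
        # leave out point k and train on the rest
        x_loo = np.delete(x, k, axis=0)
        y_loo = np.delete(y, k, axis=0)
        # build basis matrices for the training points and the left out point
        Phi_loo = np.zeros((num_data-1, order+1))
        Phi_k = np.zeros((1, order+1))
        for i in range(0, order+1):
            Phi_loo[:, i:i+1] = x_loo**i
            Phi_k[:, i:i+1] = x[k:k+1]**i
        # fit to the remaining points and test on the left out point
        w_star = np.linalg.solve(np.dot(Phi_loo.T, Phi_loo),
                                 np.dot(Phi_loo.T, y_loo))
        squared_error += ((y[k:k+1] - np.dot(Phi_k, w_star))**2).sum()
    print("Order %d mean leave one out squared error: %2.4f"
          % (order, squared_error/num_data))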

$k$-fold Cross Validation

Now repeat the test with 5-fold cross validation. Here you need to split the data into 5 parts, then use each part in turn as a hold out validation set whilst training on the remaining four parts.
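
Again, a sketch of one possible implementation is below; the random assignment of points to folds (and the fixed seed) is our own choice, not prescribed by the lab.


In [ ]:
# a sketch of 5-fold cross validation, reporting mean squared error.
num_folds = 5
np.random.seed(1) # arbitrary seed, just for repeatability
perm = np.random.permutation(num_data)
folds = [perm[fold::num_folds] for fold in range(num_folds)]

for order in range(0, max_order+1):
    squared_error = 0.0
    for fold in folds:
        # hold out one fold, train on the remaining four
        x_tr = np.delete(x, fold, axis=0)
        y_tr = np.delete(y, fold, axis=0)
        x_val = x[fold, :]
        y_val = y[fold, :]
        # build basis matrices for the training and validation inputs
        Phi_tr = np.zeros((x_tr.shape[0], order+1))
        Phi_val = np.zeros((x_val.shape[0], order+1))
        for i in range(0, order+1):
            Phi_tr[:, i:i+1] = x_tr**i
            Phi_val[:, i:i+1] = x_val**i
        # fit on four folds, evaluate squared error on the held out fold
        w_star = np.linalg.solve(np.dot(Phi_tr.T, Phi_tr),
                                 np.dot(Phi_tr.T, y_tr))
        squared_error += ((y_val - np.dot(Phi_val, w_star))**2).sum()
    print("Order %d mean 5-fold CV squared error: %2.4f"
          % (order, squared_error/num_data))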

Do the different forms of validation select different models?