Model checking: Training, Testing and Validation

4th November 2014 Neil D. Lawrence


In [ ]:
# download the software
import urllib.request

urllib.request.urlretrieve('https://github.com/sods/ods/archive/master.zip', 'master.zip')

# unzip the software
import zipfile
zip = zipfile.ZipFile('./master.zip', 'r')
for name in zip.namelist():
    zip.extract(name, '.')

# add the module location to the python path.    
import sys
sys.path.append("./ods-master/")

We will review the different methods of model selection on the Olympic marathon data. First we import the data.


In [1]:
import pods
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']

We can plot the data to check that it has loaded correctly.


In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(x, y, 'rx')


Out[3]:
[<matplotlib.lines.Line2D at 0x10db65ed0>]

Linear Model Fit

The first thing we'll do is fit a standard linear model to the data. We recall from previous lectures and the lab classes that to do this we need to solve the system $$ \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \mathbf{w} = \boldsymbol{\Phi}^\top \mathbf{y}. $$

Assignment Question 1

Construct a basis set $\boldsymbol{\Phi}$ of the form $$ \boldsymbol{\Phi} = \begin{bmatrix} \mathbf{1} & \mathbf{x}\end{bmatrix} $$ from the olympics data and solve the system $$ \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \mathbf{w} = \boldsymbol{\Phi}^\top \mathbf{y} $$ to recover the parameters of the linear model. Plot the linear fit to the data between 1880 and 2020.

15 marks


In [5]:
import numpy as np

def linear(x):
    # build the linear basis matrix [1, x]
    return np.hstack((np.ones_like(x), x))

Phi = linear(x)
PhiTPhi = np.dot(Phi.T, Phi)
PhiTy = np.dot(Phi.T, y)
# solve the system PhiTPhi w = PhiTy
w = np.linalg.solve(PhiTPhi, PhiTy)

# predict across 1880 to 2020 and plot the fit alongside the data
x_test = np.linspace(1880, 2020, 140)[:, None]
Phi_test = linear(x_test)
f_test = np.dot(Phi_test, w)
fig, ax = plt.subplots()
ax.plot(x_test, f_test, 'r-')
ax.plot(x, y, 'rx')


Training Error

Next we'll compute and plot the training error for a polynomial fit. To do this, let's set up some parameters.


In [ ]:
import numpy as np
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = np.linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions
order = 4 # The polynomial order to use.

Now let's build the basis matrices.


In [ ]:
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in range(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i

Now we can solve for the regression weights and make predictions at both the training data points and the test data points. That involves solving the linear system given by

$$\boldsymbol{\Phi}^\top \boldsymbol{\Phi} \mathbf{w} = \boldsymbol{\Phi}^\top \mathbf{y}$$

In [ ]:
# solve the linear system
w_star = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))

and using the resulting vector to make predictions at the training points and test points,

$$\mathbf{f} = \boldsymbol{\Phi} \mathbf{w}.$$

To implement this in practice we need to use basis matrices for both the predictions and the training points.


In [ ]:
# predict at training and test points
f = np.dot(Phi, w_star)
f_pred = np.dot(Phi_pred, w_star)

These can be used to compute the objective function

$$E(\mathbf{w}) = \frac{n}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i - \mathbf{w}^\top \phi(\mathbf{x}_i)\right)^2 = \frac{n}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i - f_i\right)^2,$$

where $n$ is the number of data points, $\sigma^2$ is the noise variance and $f_i = \mathbf{w}^\top \phi(\mathbf{x}_i)$ is the model's prediction at the $i$th input.

In [ ]:
# compute the sum of squares term
sum_squares = ((y-f)**2).sum()
# fit the noise variance
sigma2 = sum_squares/num_data
objective = 0.5*(num_data*np.log(sigma2) + sum_squares/sigma2)

Now we have the fit and the objective function, let's plot the fit and the objective.


In [ ]:
# print the error and plot the predictions
print("The objective is: %2.4f"%objective)
plt.plot(x_pred, f_pred)
plt.plot(x, y, 'rx')
ax = plt.gca()
ax.set_title('Predictions for Order ' + str(order))
ax.set_xlabel('year')
ax.set_ylabel('pace (min/km)')

Now use the loop structure below to compute the error for different model orders.


In [ ]:
# import the time module to allow python to pause.
import time
# import the IPython display module to display the figure and clear the output.
from IPython.display import display, clear_output

error_list = []
max_order = 6
fig, axes = plt.subplots(nrows=1, ncols=2)
for order in range(0, max_order+1):
    # 1. build the basis set
    Phi = np.zeros((num_data, order+1))
    Phi_pred = np.zeros((num_pred_data, order+1))
    for i in range(0, order+1):
        Phi[:, i:i+1] = x**i
        Phi_pred[:, i:i+1] = x_pred**i
    # 2. solve the linear system
    # 3. make predictions at training and test points
    # 4. compute the error and append it to a list.
    # 5. plot the predictions
    axes[0].clear()
    axes[1].clear()    
    axes[0].plot(x_pred, f_pred)
    axes[0].plot(x, y, 'rx')
    axes[0].set_ylim((2.5, 5.5))
    axes[0].set_title('Predictions for Order ' + str(order) + ' model.')
    axes[1].plot(np.arange(0, order+1), np.asarray(error_list))
    axes[1].set_xlim((0, max_order))
    axes[1].set_ylim((-30, 0))
    axes[1].set_title('Training Error')
    display(fig)
    time.sleep(1)
    clear_output()
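
For reference, one possible completion of steps 2 to 4 is sketched below (without the animation; your own solution may differ). It reuses the objective computation from the earlier cells, so the variables num_data, x and y are assumed to be defined above. Note that with raw year inputs the higher-order systems become poorly conditioned, so you may wish to rescale the inputs.


In [ ]:
# a sketch of the completed loop, assuming x, y and num_data
# from the cells above
error_list = []
max_order = 6
for order in range(0, max_order+1):
    # 1. build the basis set
    Phi = np.zeros((num_data, order+1))
    for i in range(0, order+1):
        Phi[:, i:i+1] = x**i
    # 2. solve the linear system
    w_star = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))
    # 3. make predictions at the training points
    f = np.dot(Phi, w_star)
    # 4. compute the error and append it to a list
    sum_squares = ((y - f)**2).sum()
    sigma2 = sum_squares/num_data
    error_list.append(0.5*(num_data*np.log(sigma2) + sum_squares/sigma2))
print(error_list)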

Hold Out Validation

The error we computed above is the training error. It doesn't assess the model's generalization ability; it only assesses how well the model performs on the data it was trained on. In hold out validation, we keep back some of the training data to assess generalization performance. To perform hold out validation, we first remove the hold out set. If we were interested in interpolation, we would hold out some points at random; here, because we are interested in extrapolation, i.e. predicting into the future given the past, it makes sense to hold out the last few data points, so we will hold out all points after 1980.


In [ ]:
# create a boolean mask that selects the points to hold out (after 1980)
hold_out = (x > 1980).flatten()

# Create a training set
x_train = x[~hold_out, :]
y_train = y[~hold_out, :]

# Create a hold out set
x_hold_out = x[hold_out, :]
y_hold_out = y[hold_out, :]

# Now use the training set and hold out set to compute the errors like above.

Now that you have the training and hold out data, you should be able to use the code above to evaluate the model on the hold out data. Do this in the code block below.
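
As a starting point, here is a minimal sketch of that computation, assuming the x_train, y_train, x_hold_out and y_hold_out arrays created above (the names Phi_train, Phi_valid and valid_error_list are illustrative choices, not from the lecture code). It fits each polynomial order on the training set and records the sum of squared errors on the hold out set.


In [ ]:
# a sketch of hold out validation, assuming x_train, y_train,
# x_hold_out and y_hold_out from the previous cell
valid_error_list = []
max_order = 6
for order in range(0, max_order+1):
    # build the basis sets for the training and hold out inputs
    Phi_train = np.hstack([x_train**i for i in range(order+1)])
    Phi_valid = np.hstack([x_hold_out**i for i in range(order+1)])
    # fit the weights on the training set only
    w_star = np.linalg.solve(np.dot(Phi_train.T, Phi_train),
                             np.dot(Phi_train.T, y_train))
    # sum of squared errors on the hold out set
    f_valid = np.dot(Phi_valid, w_star)
    valid_error_list.append(((y_hold_out - f_valid)**2).sum())
print(valid_error_list)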

Leave One Out Validation

In leave one out validation, you take out one data point at a time, train on the remaining points, and compute the error on the point you left out; the final score is the average of these errors. Perform leave one out validation on this data set, testing polynomial bases from 0th order to 6th order.
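
A minimal sketch of this procedure is below, assuming the x, y and num_data variables from the cells above (x_rest, y_rest and loo_error are illustrative names). For each order it deletes one point at a time, refits the model, and averages the squared error over the left out points.


In [ ]:
# a sketch of leave one out validation, assuming x, y and num_data
# from the cells above
max_order = 6
loo_error = np.zeros(max_order+1)
for order in range(0, max_order+1):
    for i in range(num_data):
        # leave out point i and fit on the rest
        x_rest = np.delete(x, i, axis=0)
        y_rest = np.delete(y, i, axis=0)
        Phi_rest = np.hstack([x_rest**j for j in range(order+1)])
        w_star = np.linalg.solve(np.dot(Phi_rest.T, Phi_rest),
                                 np.dot(Phi_rest.T, y_rest))
        # squared error on the left out point
        phi_i = np.hstack([x[i:i+1, :]**j for j in range(order+1)])
        f_i = np.dot(phi_i, w_star)
        loo_error[order] += (y[i, 0] - f_i[0, 0])**2
    loo_error[order] /= num_data
print(loo_error)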

$k$-fold Cross Validation

Repeat this test for 5-fold cross validation. Here you need to split the data into five parts, then use each part in turn as a hold out validation set whilst training on the remaining four parts. This is known as five-fold cross validation.
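
Below is a minimal sketch of five-fold cross validation, again assuming x, y and num_data from the cells above (the fold and cv_error names are illustrative). The assignment of points to folds here is a simple round robin; shuffling the data into random folds is another common choice.


In [ ]:
# a sketch of five-fold cross validation, assuming x, y and num_data
# from the cells above
num_folds = 5
max_order = 6
# assign each data point to a fold in turn (round robin)
fold = np.arange(num_data) % num_folds
cv_error = np.zeros(max_order+1)
for order in range(0, max_order+1):
    for k in range(num_folds):
        train = fold != k
        valid = fold == k
        Phi_train = np.hstack([x[train, :]**j for j in range(order+1)])
        Phi_valid = np.hstack([x[valid, :]**j for j in range(order+1)])
        # train on four parts, validate on the fifth
        w_star = np.linalg.solve(np.dot(Phi_train.T, Phi_train),
                                 np.dot(Phi_train.T, y[train, :]))
        f_valid = np.dot(Phi_valid, w_star)
        cv_error[order] += ((y[valid, :] - f_valid)**2).sum()
    cv_error[order] /= num_data
print(cv_error)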

Do the different forms of validation select different models?