Lab 1: Regression

11th February 2014 Gaussian Process Road Show, Pereira, Colombia

written by Neil Lawrence

$$\newcommand{\inputScalar}{x} \newcommand{\inputVector}{\mathbf{x}} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\dataScalar}{y} \newcommand{\dataVector}{\mathbf{y}} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\lengthScale}{\ell} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{w}} \newcommand{\mappingFunctionScalar}{f} \newcommand{\mappingFunctionVector}{\mathbf{f}} \newcommand{\dataStd}{\sigma} \newcommand{\numData}{n} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\zerosVector}{\mathbf{0}} \newcommand{\eye}{\mathbf{I}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\mathbf{\epsilon}} \newcommand{\noiseMatrix}{\mathbf{\Epsilon}} \newcommand{\basisMatrix}{\mathbf{\Phi}} \newcommand{\basisVector}{\mathbf{\phi}} \newcommand{\basisScalar}{\phi} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\meanVector}{\boldsymbol{\mu}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{\kernelScalar}} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\meanScalar}{\mu} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}$$

Welcome to the IPython notebook! We will be using the IPython notebook for all our lab classes and assignments. It is a really convenient way to interact with data using python. In this first lab session we are going to familiarise ourselves with the notebook and start getting used to python whilst we review some of the material from the first lecture.

Python is a generic programming language with 'numerical' and scientific capabilities added on through the numpy and scipy libraries. There are excellent 2-D plotting facilities available through matplotlib.

Importing Libraries

The numpy library provides most of the manipulations we need for arrays in python. numpy is short for numerical python, but as well as providing the numerics, numpy provides contiguous array objects. These objects weren't available in the original python. The first step is to import numpy. We'll then use it to draw samples from a "standard normal". A standard normal is a Gaussian density with mean of zero and variance of one. We'll draw 10 samples from the standard normal.


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import numpy as np
import pylab as pb
import GPy
import os



To get help about any command in the notebook simply type that command followed by a question mark.


In [53]:
X = np.random.normal(0, 1, size=(10))

Now let's look at the samples. We can show them using the print command.


In [54]:
print X


[-0.53532993 -0.43795483 -1.51425463  2.05165361  0.44440496 -1.44108218
  0.06695473 -0.29276403  0.51029745 -1.98340514]

Estimating Moments

We can compute the sample mean by adding all the samples together and dividing by the number of samples.


In [51]:
X.mean()

As an aside, the following cell checks two equivalent ways of computing a matrix-vector product: scaling each column of A by the corresponding element of x and summing the scaled columns (implemented here as (A.T*x).sum(axis=0)), and calling np.dot(A, x) directly.


In [70]:
A = np.random.normal(loc=0, scale=1, size=(10, 10))
x = np.random.normal(loc=0, scale=1, size=(10, 1))
print (A.T*x).sum(axis=0)
print np.dot(A, x)


[ -1.22221889e+00  -2.03231200e+00  -3.36988193e+00  -2.39802734e+00
  -1.72211513e+00   1.23046956e-03   5.04874577e-01  -3.22701871e+00
   5.51391558e-01  -5.43034318e-01]
[[ -1.22221889e+00]
 [ -2.03231200e+00]
 [ -3.36988193e+00]
 [ -2.39802734e+00]
 [ -1.72211513e+00]
 [  1.23046956e-03]
 [  5.04874577e-01]
 [ -3.22701871e+00]
 [  5.51391558e-01]
 [ -5.43034318e-01]]

The sample variance is the mean of the squared samples minus the square of the sample mean, which is easy to compute in code as follows


In [71]:
X.var()


Out[71]:
0.980973748980555

We know in this case, because we sampled from a standard normal, that the mean and variance of the distribution should be 0 and 1. Why don't you get a mean of exactly 0 and a variance of exactly 1? Let's explore what happens as we increase the number of samples. To do this we are going to use for loops and python lists. We start by creating empty lists for the means and variances. Then we create a list of sample sizes to iterate through. In Python, a for loop always iterates through a list (in some languages this is called a foreach loop; its counterpart, the counter for loop, only exists by creating a list of integers, for example with the range command, see http://en.wikipedia.org/wiki/Foreach_loop#Python). Here we write the list of sample sizes down directly.


In [75]:
means = []
variances = []
samples = [10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000] 
for n in samples:
    x = np.random.normal(loc=0, scale=1, size=(n))
    mean = x.mean()
    variance = (x**2).mean() - mean**2
    means.append(mean)
    variances.append(variance)

Plotting in Python

We'll now plot the variance and the mean against the number of samples. To do this, we first need to convert the samples, variances and means from Python lists to numpy arrays.


In [76]:
means = np.asarray(means)
variances = np.asarray(variances)
samples = np.asarray(samples)

The plotting functionality comes from matplotlib. We already instructed the IPython notebook to include the plots inline with the notebook, rather than in a different window, when we ran %pylab inline at the start.

Here we plot the estimated mean against the number of samples. However, since the samples go up logarithmically it's better to use a logarithmic axis for the x axis, as follows.


In [81]:
pb.semilogx(samples, means)
xlabel('$\log_{10}n$')
ylabel('mean')


Out[81]:
<matplotlib.text.Text at 0x6348990>

We can do the same for the variances, again using a logarithmic axis for the samples and labelling the x axis with a LaTeX formula.


In [82]:
plt.semilogx(samples, variances)
xlabel('$\log_{10}n$')
ylabel('variance')


Out[82]:
<matplotlib.text.Text at 0x6f85e50>

Linear Regression: Iterative Solution

Next we will code linear regression in python. We will do it in two ways: once using iterative updates (coordinate ascent) and once using linear algebra. For this part we are going to load in some real data; we will use an example from the Olympics: the pace of the marathon winners. The data is in comma separated values (csv) format and is available through GPy's datasets module, so we can load it as follows:


In [135]:
data = GPy.util.datasets.olympic_marathon_men()
x = data['X']
x = x - x.mean()
x = x/np.sqrt(x.var())
y = data['Y']

You can see what these values are by typing:


In [136]:
print(x)
print(y)


[[-1.69721878]
 [-1.58462687]
 [-1.47203496]
 [-1.35944305]
 [-1.24685114]
 [-1.02166733]
 [-0.90907542]
 [-0.79648351]
 [-0.6838916 ]
 [-0.57129969]
 [-0.23352396]
 [-0.12093205]
 [-0.00834014]
 [ 0.10425177]
 [ 0.21684368]
 [ 0.32943559]
 [ 0.4420275 ]
 [ 0.55461941]
 [ 0.66721131]
 [ 0.77980322]
 [ 0.89239513]
 [ 1.00498704]
 [ 1.11757895]
 [ 1.23017086]
 [ 1.34276277]
 [ 1.45535468]
 [ 1.56794659]]
[[ 4.47083333]
 [ 4.46472926]
 [ 5.22208333]
 [ 4.15467867]
 [ 3.90331675]
 [ 3.56951267]
 [ 3.82454477]
 [ 3.62483707]
 [ 3.59284275]
 [ 3.53880792]
 [ 3.67010309]
 [ 3.39029111]
 [ 3.43642612]
 [ 3.20583007]
 [ 3.13275665]
 [ 3.32819844]
 [ 3.13583758]
 [ 3.0789588 ]
 [ 3.10581822]
 [ 3.06552909]
 [ 3.09357349]
 [ 3.16111704]
 [ 3.14255244]
 [ 3.08527867]
 [ 3.10265829]
 [ 2.99877553]
 [ 3.03392977]]

Plotting the Data

And you can make a plot of $y$ vs $x$ with the following command:


In [138]:
plt.plot(x, y, 'rx')


Out[138]:
[<matplotlib.lines.Line2D at 0x8aedad0>]

Maximum Likelihood: Iterative Solution

Now we are going to fit a line, $y_i=mx_i + c$, to the data you've plotted. We are trying to minimize the error function:

$$E(m, c, \sigma^2) = \frac{N}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \sum_{i=1}^N(y_i-mx_i-c)^2$$

with respect to $m$, $c$ and $\sigma^2$. We can start with an initial guess for $m$,


In [139]:
m = -0.4
c = 80

Then we use the maximum likelihood update, derived in the lecture, to find an estimate for the offset, $c$,

$$c^* = \frac{\sum_{i=1}^N(y_i-m^*x_i)}{N}$$

In [140]:
c = (y - m*x).mean()
print c


3.50125262684

And now we can make an estimate for the gradient of the line,

$$m^* = \frac{\sum_{i=1}^N (y_i - c)x_i}{\sum_{i=1}^N x_i^2}$$
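
These two updates come from setting the partial derivatives of $E(m, c, \sigma^2)$ to zero with the other parameter held fixed:

$$\frac{\partial E}{\partial c} = -\frac{1}{\sigma^2}\sum_{i=1}^N(y_i - mx_i - c) = 0 \quad\Longrightarrow\quad c^* = \frac{\sum_{i=1}^N(y_i - mx_i)}{N},$$

$$\frac{\partial E}{\partial m} = -\frac{1}{\sigma^2}\sum_{i=1}^N(y_i - mx_i - c)x_i = 0 \quad\Longrightarrow\quad m^* = \frac{\sum_{i=1}^N(y_i - c)x_i}{\sum_{i=1}^N x_i^2}.$$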

In [141]:
m = ((y - c)*x).sum()/(x**2).sum()
print m


-0.461157389283

We can have a look at how good our fit is by computing the prediction across the input space. First create a vector of 'test points',


In [146]:
x_test = np.linspace(-1., 1., 130)[:, None]

Now use this vector to compute some test predictions,


In [147]:
f_test = m*x_test + c

Now plot those test predictions with a blue line on the same plot as the data,


In [149]:
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')


Out[149]:
[<matplotlib.lines.Line2D at 0x9044f50>]

The fit isn't very good: a straight line struggles to capture this data. In general the two parameter updates also depend on each other, so we should iterate between them in a loop, repeating the updates several times to be sure they have converged:


In [150]:
for i in np.arange(100000):
    m = ((y - c)*x).sum()/(x*x).sum()
    c = (y-m*x).sum()/y.shape[0]
print(m)
print(c)


-0.461157389283
3.50125262684

And let's try plotting the result again


In [151]:
f_test = m*x_test + c
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')


Out[151]:
[<matplotlib.lines.Line2D at 0x930b590>]

Because the inputs were normalised to have zero mean, the two updates decouple and the iteration converges almost immediately: the values above are unchanged from the single pass of updates we did before. Try repeating the fit without normalising x to see how many iterations the coordinate updates need when the parameters are strongly coupled.
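
If you do experiment with the number of iterations, a simple alternative to guessing it (a sketch, not part of the original lab code) is to loop until the parameters stop changing by more than a small tolerance:


In [ ]:
# iterate the coordinate updates until m and c stop changing (sketch)
tol = 1e-8        # convergence tolerance, an arbitrary choice for this sketch
max_iters = 100000
for i in range(max_iters):
    m_old, c_old = m, c
    m = ((y - c)*x).sum()/(x*x).sum()
    c = (y - m*x).sum()/y.shape[0]
    if abs(m - m_old) < tol and abs(c - c_old) < tol:
        print('converged after %d iterations' % (i + 1))
        break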

Direct Solution with Linear Algebra

Hopefully, you are now persuaded of the merits of solving the entire system simultaneously using linear algebra. To do that we need to make a design matrix of the data, which includes the $x_0=1$ column to represent the bias. Remember (from the lecture notes) that we are now moving to a system where our prediction is given by an inner product:

$$f(\mathbf{x}_i) = \mathbf{x}_i^\top\mathbf{w}$$

where each vector $\mathbf{x}_i$ is given by appending a 1 onto the original vector

$$\mathbf{x}_i = \begin{bmatrix} 1 \\\ x_i \end{bmatrix}$$

We can do this for the entire data set to form a design matrix $\mathbf{X}$,

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top \\\ \mathbf{x}_2^\top \\\ \vdots \\\ \mathbf{x}_N^\top \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\\ 1 & x_2 \\\ \vdots & \vdots \\\ 1 & x_N \end{bmatrix},$$

which in numpy is done with the following commands:


In [152]:
X = np.hstack((np.ones_like(x), x))
print(X)


[[ 1.         -1.69721878]
 [ 1.         -1.58462687]
 [ 1.         -1.47203496]
 [ 1.         -1.35944305]
 [ 1.         -1.24685114]
 [ 1.         -1.02166733]
 [ 1.         -0.90907542]
 [ 1.         -0.79648351]
 [ 1.         -0.6838916 ]
 [ 1.         -0.57129969]
 [ 1.         -0.23352396]
 [ 1.         -0.12093205]
 [ 1.         -0.00834014]
 [ 1.          0.10425177]
 [ 1.          0.21684368]
 [ 1.          0.32943559]
 [ 1.          0.4420275 ]
 [ 1.          0.55461941]
 [ 1.          0.66721131]
 [ 1.          0.77980322]
 [ 1.          0.89239513]
 [ 1.          1.00498704]
 [ 1.          1.11757895]
 [ 1.          1.23017086]
 [ 1.          1.34276277]
 [ 1.          1.45535468]
 [ 1.          1.56794659]]

From the multivariate regression solution we derived in the lecture, the maximum likelihood solution for $\mathbf{w}^*$ is given by

$$\mathbf{w}^* = \left[\mathbf{X}^\top \mathbf{X}\right]^{-1} \mathbf{X}^\top \mathbf{y}$$

First let's persuade ourselves of a few things. We suggested in the lecture that

$$\sum_{i=1}^N \mathbf{x}_i\mathbf{x}_i^\top = \mathbf{X}^\top\mathbf{X}$$

We can show that this is, indeed, the case for our data. First we need to know how to do matrix multiplication and the transpose using numpy; this is done as


In [153]:
np.dot(X.T, X)


Out[153]:
array([[  2.70000000e+01,  -6.43929354e-15],
       [ -6.43929354e-15,   2.70000000e+01]])

Now we will compute the same thing with a for loop


In [154]:
store = np.zeros((2, 2))
for i in range(X.shape[0]):
    store += np.outer(X[i, :], X[i, :])
print store


[[  2.70000000e+01  -6.43929354e-15]
 [ -6.43929354e-15   2.70000000e+01]]

I hope that you agree that the first version is a little more compact.

Solving the System

The solution for $\mathbf{w}$ is given in terms of a matrix inverse, but numerically this isn't the best way to compute it. What we actually want python to do is to solve the system of linear equations given by

$$\mathbf{X}^\top\mathbf{X} \mathbf{w} = \mathbf{X}^\top\mathbf{y}$$

for $\mathbf{w}$. This can be done in numpy using the command


In [155]:
np.linalg.solve?

so we can obtain the solution using


In [156]:
w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, np.log(y)))
print w
print m
print c


[[ 1.24278789]
 [-0.12483821]]
-0.461157389283
3.50125262684

Note that the cell above solved the system for the logarithm of the pace, np.log(y). Exponentiating the predictions returns them to the original units, allowing us to plot the fit as follows


In [157]:
m = w[1]; c=w[0]
f_test = m*x_test + c
print(m)
print(c)
plt.plot(x_test, np.exp(f_test), 'b-')
plt.plot(x, y, 'rx')


[-0.12483821]
[ 1.24278789]
Out[157]:
[<matplotlib.lines.Line2D at 0x9339650>]
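
Before moving on, it is worth a quick check of the earlier claim that solving the linear system is preferable to forming the matrix inverse explicitly. The sketch below (not part of the original lab) compares the two routes on the same normal equations; for this small, well-conditioned system they agree to round-off, but the solve route is cheaper and more numerically stable for larger or nearly singular systems.


In [ ]:
# compare solving the normal equations with using an explicit matrix inverse (sketch)
A = np.dot(X.T, X)
b = np.dot(X.T, np.log(y))
w_solve = np.linalg.solve(A, b)       # solve the linear system directly
w_inv = np.dot(np.linalg.inv(A), b)   # form the inverse explicitly (not recommended in general)
print(np.abs(w_solve - w_inv).max())  # difference should be around machine precision here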

Quadratic Fit

Now we will fit a quadratic model using basis functions. Given everything we've learnt above, this is now quite easy to do. Firstly, we need to create a design matrix that contains the quadratic basis,

$$\mathbf{\Phi} = \left[ \mathbf{1} \quad \mathbf{x} \quad \mathbf{x}^2\right]$$

where this notation means that each column of $\mathbf{\Phi}$ is derived from the entire set of input years.


In [113]:
Phi = np.hstack([np.ones(x.shape), x, x**2])

Now we can solve this system for $\mathbf{w}$ just as we did for the linear case, so we have,


In [114]:
w = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))
print(w)


[[  6.43641954e+02]
 [ -6.42502988e-01]
 [  1.61109704e-04]]

We can plot the solution in two different ways. Either we compute the prediction directly,


In [115]:
f_test = w[2]*x_test**2 + w[1]*x_test + w[0]
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')


Out[115]:
[<matplotlib.lines.Line2D at 0x7a09910>]

Or we can use the matrix form of the equation, which first involves creating a design matrix for the test points,


In [117]:
Phi_test = np.hstack((np.ones_like(x_test), x_test, x_test**2))

and then computing the value of the function using a matrix multiply


In [119]:
f_test = np.dot(Phi_test,w)
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
w


Out[119]:
array([[  6.43641954e+02],
       [ -6.42502988e-01],
       [  1.61109704e-04]])

Note the value of the coefficient $w_2$ in particular. It is relatively small because it multiplies a large number: with the raw years as inputs, the square of 2000 is four million. This need for small coefficients becomes worse as we increase the order of the fit. As an exercise for later, try fitting higher order polynomials to the data and see what happens as you increase the polynomial order.
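
To see concretely why such a small coefficient still matters, here is a quick check (a sketch, using the coefficients printed above and assuming the raw years are the inputs): each term of the quadratic is individually large, and the prediction comes from the near-cancellation between them.


In [ ]:
# contribution of each term of the quadratic at a raw input year of 2000
year = 2000.
w0, w1, w2 = 6.43641954e+02, -6.42502988e-01, 1.61109704e-04  # coefficients printed above
print(w0)                          # constant term, roughly 644
print(w1*year)                     # linear term, roughly -1285
print(w2*year**2)                  # quadratic term, roughly 644 despite the tiny coefficient
print(w0 + w1*year + w2*year**2)   # the large terms nearly cancel, leaving a pace of about 3 min/km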

Generalization

The aim of this section is to review different methods of model selection: hold out validation, leave one out cross validation and cross validation.

Training Error

The first thing we'll do is plot the training error for the polynomial fit. To do this let's set up some parameters.


In [120]:
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions
order = 4 # The polynomial order to use.

now let's build the basis matrices.


In [121]:
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in range(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i

now we can solve for the regression weights and make predictions both for the training data points, and the test data points. That involves solving the linear system given by

$$\basisMatrix^\top \basisMatrix \mappingVector^* = \basisMatrix^\top \dataVector$$

In [122]:
# solve the linear system
w_star = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))

and using the resulting vector to make predictions at the training points and test points,

$$\mathbf{f} = \basisMatrix \mappingVector.$$

To implement this in practice we need to use basis matrices for both the predictions and the training points.


In [123]:
# predict at training and test points
f = np.dot(Phi, w_star)
f_pred = np.dot(Phi_pred, w_star)

These can be used to compute the error

$$E(\mappingVector) = \frac{\numData}{2} \log \dataStd^2 + \frac{1}{2\dataStd^2} \sum_{i=1}^\numData \left(\dataScalar_i - \mappingVector^\top \phi(\inputVector_i)\right)^2 \\\ E(\mappingVector) = \frac{\numData}{2} \log \dataStd^2 + \frac{1}{2\dataStd^2} \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunctionScalar_i\right)^2$$

In [125]:
# compute the sum of squares term
sum_squares = ((y-f)**2).sum()
# fit the noise variance
sigma2 = sum_squares/num_data
error = 0.5*(num_data*np.log(sigma2) + sum_squares/sigma2)

Now we have the fit and the error, let's plot the fit and the error.


In [126]:
# print the error and plot the predictions
print("The error is: %2.4f"%error)
plt.plot(x_pred, f_pred)
plt.plot(x, y, 'rx')
ax = plt.gca()
ax.set_title('Predictions for Order ' + str(order))
ax.set_xlabel('year')
ax.set_ylabel('pace (min/km)')


The error is: -29.9427
Out[126]:
<matplotlib.text.Text at 0x7a42d90>

Now use the loop structure below to compute the error for different model orders.


In [128]:
# import the time model to allow python to pause.
import time
# import the IPython display module to clear the output.
from IPython.display import display, clear_output

num_data = len(x)
error_list = []
max_order = 6
sigma2 = 1
fig, axes = plt.subplots(nrows=1, ncols=2)
for order in range(0, max_order+1):
    # 1. build the basis set
    Phi = np.zeros((num_data, order+1))
    Phi_pred = np.zeros((num_pred_data, order+1))
    for i in range(0, order+1):
        Phi[:, i] = x.flatten()**i
        Phi_pred[:, i] = x_pred.flatten()**i
    
    # 2. solve the linear system
    w = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))
    # 3. make predictions at training and test points
    f_pred = np.dot(Phi_pred, w)
    f = np.dot(Phi, w)
    # 4. compute the error and append it to a list.
    error_list.append(num_data/2.*np.log(sigma2) + ((y-f)**2).sum()/(2*sigma2))
    # 5. plot the predictions
    axes[0].clear()
    axes[1].clear()    
    axes[0].plot(x_pred, f_pred)
    axes[0].plot(x, y, 'rx')
    axes[0].set_ylim((2.5, 5.5))
    axes[0].set_title('Predictions for Order ' + str(order) + ' model.')
    axes[1].plot(np.arange(0, order+1), np.asarray(error_list))
    axes[1].set_xlim((0, max_order))
    axes[1].set_ylim((0, 10))
    axes[1].set_title('Training Error')
    display(fig)
    time.sleep(2)
    clear_output()
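
The training error always decreases (or stays the same) as the order grows, so it cannot select the model on its own. As a sketch of the hold out validation mentioned at the start of this section (not part of the original lab code, and using an arbitrary train/validation split), we can keep a few points aside, fit each order on the remainder and score it on the held out points:


In [ ]:
# hold out validation sketch: fit on a subset, score each order on the held out points
np.random.seed(0)                            # fix the split so the result is repeatable
indices = np.random.permutation(num_data)
num_val = 5                                  # arbitrary size of the hold out set for this sketch
val_ind, train_ind = indices[:num_val], indices[num_val:]
x_train, y_train = x[train_ind], y[train_ind]
x_val, y_val = x[val_ind], y[val_ind]

val_error_list = []
for order in range(0, max_order+1):
    # build the basis for the training and validation inputs
    Phi_train = np.hstack([x_train**i for i in range(order+1)])
    Phi_val = np.hstack([x_val**i for i in range(order+1)])
    # fit the weights on the training part only
    w = np.linalg.solve(np.dot(Phi_train.T, Phi_train), np.dot(Phi_train.T, y_train))
    # sum of squared errors on the held out part
    val_error_list.append(((y_val - np.dot(Phi_val, w))**2).sum())
print(val_error_list)

The order with the lowest held out error is the one we would select; leave one out and $k$-fold cross validation repeat this idea with different splits and average the results.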



In [ ]:

Lab 5: Bayesian Regression

Neil D. Lawrence COM4509/6509 Machine Learning and Adaptive Intelligence

As normal, let's first get the plots running in line on the notebook. $$\newcommand{\inputScalar}{x} \newcommand{\lengthScale}{\ell} \newcommand{\mappingVector}{\mathbf{w}} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\zerosVector}{\mathbf{0}} \newcommand{\eye}{\mathbf{I}} \newcommand{\dataStd}{\sigma} \newcommand{\dataScalar}{y} \newcommand{\dataVector}{\mathbf{y}} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\mathbf{\epsilon}} \newcommand{\noiseMatrix}{\mathbf{\Epsilon}} \newcommand{\inputVector}{\mathbf{x}} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\basisMatrix}{\mathbf{\Phi}} \newcommand{\basisVector}{\mathbf{\phi}} \newcommand{\basisScalar}{\phi} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\numData}{N} \newcommand{\mappingScalar}{w} \newcommand{\mappingFunctionScalar}{f} \newcommand{\mappingFunctionVector}{\mathbf{f}} \newcommand{\meanVector}{\boldsymbol{\mu}} \newcommand{\meanScalar}{\mu}$$


In [ ]:
%pylab inline

The next thing to do is to download the olympics data just to make sure it is available for us to study.


In [ ]:
import urllib
url = ("http://staffwww.dcs.shef.ac.uk/"
    + "people/N.Lawrence/dataset_mirror/"
    + "olympic_marathon_men/olympicMarathonTimes.csv")
urllib.urlretrieve(url, 'olympicMarathonTimes.csv')
olympics = np.loadtxt('olympicMarathonTimes.csv', delimiter=',')

The aim of this notebook is to study Bayesian approaches to regression. As in previous sessions, first extract both the olympic years and the pace of the winning runner into 2-dimensional arrays with the data points in the rows of the array (the first dimension).


In [ ]:
x = olympics[:, 0:1]
y = olympics[:, 1:2]

We can plot them to check that they've loaded in correctly.


In [ ]:
plt.plot(x, y, 'rx')

The Prior Distribution

In the Bayesian approach, the first thing we do is assume a prior distribution for the parameters, $\mappingVector$. In the lectures we took this prior to be

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$$

In other words, we assumed for the prior that each element of the parameters vector, $\mappingScalar_i$, was drawn from a Gaussian density as follows

$$\mappingScalar_i \sim \gaussianSamp{0}{\alpha}$$

Let's start by assigning the parameter of the prior distribution, which is the variance of the prior distribution, $\alpha$.


In [ ]:
# set prior variance on w
alpha = 4.
# set the order of the polynomial basis set
order = 5
# set the noise variance
sigma2 = 0.01

Now we have the prior variance, we can sample from the prior distribution to see what form we are imposing on the functions a priori. To do this, we first sample a weight vector, then we multiply that weight vector by our basis to compute the functions. Firstly we compute the basis function matrix. We will do it both for our training data, and for a range of prediction locations (x_pred).


In [ ]:
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions

now let's build the basis matrices.


In [ ]:
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in xrange(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i

Sampling from the Prior

Now we will sample from the prior to produce a vector $\mappingVector$ and use it to plot a function which is representative of our belief before we fit the data. To do this we are going to use the properties of the Gaussian density and a sample from a standard normal using the function np.random.normal.

Scaling Gaussian-distributed Variables

First, let's consider the case where we have one data point and one feature in our basis set. In other words, $\mappingFunctionVector$ would be a scalar, $\mappingVector$ would be a scalar and $\basisMatrix$ would be a scalar. In this case we have

$$\mappingFunctionScalar = \basisScalar \mappingScalar$$

If $\mappingScalar$ is drawn from a normal density,

$$\mappingScalar \sim \gaussianSamp{\meanScalar_\mappingScalar}{c_\mappingScalar}$$

and $\basisScalar$ is a scalar value which we are given, then properties of the Gaussian density tell us that

$$\basisScalar \mappingScalar \sim \gaussianSamp{\basisScalar\meanScalar_\mappingScalar}{\basisScalar^2c_\mappingScalar}$$

Let's test this out numerically. First we will draw 200 samples from a standard normal,


In [ ]:
w_vec = np.random.normal(size=200)

We can compute the mean of these samples and their variance


In [ ]:
print 'w sample mean is ', w_vec.mean()
print 'w sample variance is ', w_vec.var()

These are close to zero (the mean) and one (the variance) as you'd expect. Now compute the mean and variance of the scaled version,


In [ ]:
phi = 7
f_vec = phi*w_vec
print 'True mean should be phi*0 = 0.'
print 'True variance should be phi*phi*1 = ', phi*phi
print 'f sample mean is ', f_vec.mean()
print 'f sample variance is ', f_vec.var()

If you increase the number of samples then you will see that the sample mean and the sample variance begin to converge towards the true mean and the true variance. Obviously adding an offset to a sample from np.random.normal will change the mean. So if you want to sample from a Gaussian with mean mu and standard deviation sigma, one way of doing it is to sample from the standard normal and then scale and shift the result. So, to sample a set of $\mappingScalar$ from a Gaussian with mean $\meanScalar$ and variance $\alpha$,

$$w \sim \gaussianSamp{\meanScalar}{\alpha}$$

We can simply scale and offset samples from the standard normal.


In [ ]:
mu = 4 # mean of the distribution
alpha = 2 # variance of the distribution
w_vec = np.random.normal(size=200)*np.sqrt(alpha) + mu
print 'w sample mean is ', w_vec.mean()
print 'w sample variance is ', w_vec.var()

Here the np.sqrt is necessary because we need to multiply by the standard deviation and we specified the variance as alpha. So scaling and offsetting a Gaussian distributed variable keeps the variable Gaussian, but it affects the mean and variance of the resulting variable.

To get an idea of the overall shape of the resulting distribution, let's do the same thing with a histogram of the results.


In [ ]:
# First the standard normal
z_vec = np.random.normal(size=1000) # by convention, in statistics, z is often used to denote samples from the standard normal
w_vec = z_vec*np.sqrt(alpha) + mu
# plot normalized histogram of w, and then normalized histogram of z on top
plt.hist(w_vec, bins=30, normed=True)
plt.hist(z_vec, bins=30, normed=True)
plt.legend(('$w$', '$z$'))

Now re-run this histogram with 100,000 samples and check that the both histograms look qualitatively Gaussian.

Sampling from the Prior

Let's use this way of constructing samples from a Gaussian to check what functions look like a priori. The process will be as follows. First, we sample a $K$-dimensional random vector from np.random.normal. Then we scale it by $\sqrt{\alpha}$ to obtain a prior sample of $\mappingVector$.


In [ ]:
K = order + 1
z_vec = np.random.normal(size=K)
w_sample = z_vec*np.sqrt(alpha)
print w_sample

Now we can combine our sample from the prior with the basis functions to create a function,


In [ ]:
f_sample = np.dot(Phi_pred,w_sample)
plt.plot(x_pred.flatten(), f_sample.flatten(), 'r-')

This shows the recurring problem with the polynomial basis. Our prior allows relatively large coefficients for the basis associated with high polynomial degrees. Because we are operating with input values of around 2000, this leads to output functions of very high values. One fix for this is to rescale our inputs to be between -1 and 1 before applying the model. This is a disadvantage of the polynomial basis. Let's rescale x and x_pred now.


In [ ]:
span = np.max(x) - np.min(x) 
offset = np.min(x)
x -= offset
x_pred -= offset
x /= span # x is now between zero and 1
x_pred /= span 
x = x*2-1 # x is now between -1 and 1
x_pred = x_pred*2 - 1

Now we need to recompute the basis functions from above,


In [ ]:
# rebuild the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in xrange(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i

In [ ]:
f_sample = np.dot(Phi_pred,w_sample)
plt.plot(x_pred.flatten(), f_sample.flatten(), 'r-')

Now let's loop through some samples and plot various functions as samples from this system,


In [ ]:
num_samples = 10
K = order+1
for i in xrange(num_samples):
    z_vec = np.random.normal(size=K)
    w_sample = z_vec*np.sqrt(alpha)
    f_sample = np.dot(Phi_pred,w_sample)
    plt.plot(x_pred.flatten(), f_sample.flatten())

The predictions for the mean output can now be computed. We want the expected value of the predictions under the posterior distribution. In matrix form, the predictions can be computed as

$$\mathbf{f} = \basisMatrix \mappingVector.$$

This involves a matrix multiplication between a fixed matrix $\basisMatrix$ and a vector that is drawn from a distribution, $\mappingVector$. Because $\mappingVector$ is drawn from a distribution, this implies that $\mappingFunctionVector$ should also be drawn from a distribution. Let's work out what that distribution should be.

Computing the Posterior

In the lecture we went through how to compute the posterior distribution for $\mappingVector$. This distribution is also Gaussian,

$$p(\mappingVector | \dataVector, \inputVector, \dataStd^2) = \mathcal{N}\left(\mappingVector|\meanVector_\mappingScalar, \covarianceMatrix_\mappingScalar\right)$$

with covariance, $\covarianceMatrix_\mappingScalar$, given by

$$\covarianceMatrix_\mappingScalar = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1} \eye\right)^{-1}$$

whilst the mean is given by

$$\meanVector_\mappingScalar = \covarianceMatrix_\mappingScalar \dataStd^{-2}\basisMatrix^\top \dataVector$$

Let's compute the posterior covariance and mean, then we'll sample from these densities to have a look at the posterior belief about $\mappingVector$ once the data has been accounted for. Remember, the process of Bayesian inference involves combining the prior, $p(\mappingVector)$ with the likelihood, $p(\dataVector|\inputVector, \mappingVector)$ to form the posterior, $p(\mappingVector | \dataVector, \inputVector)$ through Bayes' rule,

$$p(\mappingVector|\dataVector, \inputVector) = \frac{p(\dataVector|\inputVector, \mappingVector)p(\mappingVector)}{p(\dataVector)}$$

We've looked at the samples for our function $\mappingFunctionVector = \basisMatrix\mappingVector$, which forms the mean of the Gaussian likelihood, under the prior distribution. I.e. we've sampled from $p(\mappingVector)$ and multiplied the result by the basis matrix. Now we will sample from the posterior density, $p(\mappingVector|\dataVector, \inputVector)$, and check that the new samples do correspond to the data, i.e. we want to check that the updated distribution includes information from the data set. First we need to compute the posterior mean and covariance.


In [ ]:
# compute the posterior covariance and mean
w_cov = np.linalg.inv(1/sigma2*np.dot(Phi.T, Phi) + 1/alpha*np.eye(order+1))
w_mean = np.dot(w_cov, 1/sigma2*np.dot(Phi.T, y))

Before, we were able to sample the prior values of the parameters independently from a Gaussian using np.random.normal and scaling the result. However, observing the data correlates the parameters. Recall from the first lab that there is a correlation between the offset, $c$, and the slope, $m$, which is what causes problems for the coordinate ascent algorithm when the inputs are not normalised. We need to sample from a correlated Gaussian, and for this we can use np.random.multivariate_normal.


In [ ]:
w_sample = np.random.multivariate_normal(w_mean.flatten(), w_cov)
f_sample = np.dot(Phi_pred,w_sample)
plt.plot(x_pred.flatten(), f_sample.flatten(), 'r-')
plt.plot(x, y, 'rx') # plot data to show fit.

Now let's sample several functions and plot them all to see how the predictions fluctuate.


In [ ]:
for i in xrange(num_samples):
    w_sample = np.random.multivariate_normal(w_mean.flatten(), w_cov)
    f_sample = np.dot(Phi_pred,w_sample)
    plt.plot(x_pred.flatten(), f_sample.flatten())
plt.plot(x, y, 'rx') # plot data to show fit.

Sum of Gaussian-distributed Variables

The sum of Gaussian random variables is also Gaussian, so if we have a random variable $y_i$ drawn from a Gaussian density with mean $\meanScalar_i$ and variance $\dataStd^2_i$,

$$y_i \sim \gaussianSamp{\meanScalar_i}{\dataStd^2_i}$$

Then the sum of $K$ independently sampled values of $y_i$ will be drawn from a Gaussian with mean $\sum_{i=1}^K \mu_i$ and variance $\sum_{i=1}^K \dataStd_i^2$,

$$\sum_{i=1}^K y_i \sim \gaussianSamp{\sum_{i=1}^K \meanScalar_i}{\sum_{i=1}^K \dataStd_i^2}.$$

Let's try that experimentally. First let's generate a vector of samples from a standard normal distribution, $z \sim \gaussianSamp{0}{1}$, then we will scale and offset them, then keep adding them into a vector y_vec.


In [ ]:
K = 10 # how many Gaussians to add.
num_samples = 1000 # how many samples to have in y_vec
mus = np.linspace(0, 5, K) # mean values generated linearly spaced between 0 and 5
sigmas = np.linspace(0.5, 2, K) # sigmas generated linearly spaced between 0.5 and 2
y_vec = np.zeros(num_samples)
for mu, sigma in zip(mus, sigmas):
    z_vec = np.random.normal(size=num_samples) # z is from standard normal
    y_vec += z_vec*sigma + mu # add to y z*sigma + mu

# now y_vec is the sum of each scaled and offset z.
print 'Sample mean is ', y_vec.mean(), ' and sample variance is ', y_vec.var()
print 'True mean should be ', mus.sum()
print 'True variance should be ', (sigmas**2).sum(), ' standard deviation ', np.sqrt((sigmas**2).sum())

Of course, we can histogram y_vec as well.


In [ ]:
plt.hist(y_vec, bins=30, normed=True)
plt.legend(('$y$',))

Matrix Multiplication of Gaussian Variables

Matrix multiplication is just adding and scaling together. In the formula $\mappingFunctionVector = \basisMatrix \mappingVector$ we can extract the $i$th element of $\mappingFunctionVector$ as

$$\mappingFunctionScalar_i = \basisVector_i^\top \mappingVector$$

where $\basisVector_i$ is a column vector formed from the $i$th row of $\basisMatrix$ and $\mappingFunctionScalar_i$ is the $i$th element of $\mappingFunctionVector$. This vector inner product merely implies that

$$\mappingFunctionScalar_i = \sum_{j=1}^K \mappingScalar_j \basisScalar_{i, j}$$

and if we now say that each $\mappingScalar_j$ is Gaussian distributed, then because a scaled Gaussian is also Gaussian, and because a sum of Gaussians is also Gaussian, we know that $\mappingFunctionScalar_i$ is also Gaussian distributed. It merely remains to work out its mean and covariance. We can do this by looking at the expectation under the Gaussian distribution for $\mappingVector$. The expectation of $\mappingFunctionVector$ is given by

$$\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \int \mappingFunctionVector \gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix} \text{d}\mappingVector = \int \basisMatrix\mappingVector \gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix} \text{d}\mappingVector = \basisMatrix \int \mappingVector \gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix} \text{d}\mappingVector = \basisMatrix \meanVector$$

Which is straightforward. The expectation of $\mappingFunctionVector=\basisMatrix\mappingVector$ under the Gaussian distribution for $\mappingVector$ is simply $\basisMatrix\meanVector$, where $\meanVector$ is the mean of the Gaussian density for $\mappingVector$. Because our prior distribution was Gaussian with zero mean, the expectation under the prior is given by

$$\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\zerosVector}{\alpha\eye}} = \zerosVector$$

The covariance is a little more complicated. A covariance matrix is defined as

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \expDist{\mappingFunctionVector\mappingFunctionVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} - \expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}^\top$$

we've already computed $\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}=\basisMatrix \meanVector$ so we can substitute that in to recover

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \expDist{\mappingFunctionVector\mappingFunctionVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} - \basisMatrix \meanVector \meanVector^\top \basisMatrix^\top$$

So we need the expectation of $\mappingFunctionVector\mappingFunctionVector^\top$. Substituting in $\mappingFunctionVector = \basisMatrix \mappingVector$ we have

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \expDist{\basisMatrix\mappingVector\mappingVector^\top \basisMatrix^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} - \basisMatrix \meanVector \meanVector^\top \basisMatrix^\top$$
$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \basisMatrix\expDist{\mappingVector\mappingVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} \basisMatrix^\top - \basisMatrix \meanVector \meanVector^\top \basisMatrix^\top$$

Which is dependent on the second moment of the Gaussian,

$$\expDist{\mappingVector\mappingVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \covarianceMatrix + \meanVector\meanVector^\top$$

that can be substituted in to recover,

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \basisMatrix\covarianceMatrix \basisMatrix^\top$$

so in the case of the prior distribution, where we have $\covarianceMatrix = \alpha \eye$ we can write

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\zerosVector}{\alpha \eye}} = \alpha \basisMatrix \basisMatrix^\top$$

This implies that the prior we have suggested for $\mappingVector$, which is Gaussian with a mean of zero and covariance of $\alpha \eye$, gives a distribution for $\mappingFunctionVector$ that is also Gaussian, with a mean of zero and covariance of $\alpha \basisMatrix\basisMatrix^\top$. Since our observed output, $\dataVector$, is given by a noise corrupted variation of $\mappingFunctionVector$, the final distribution for $\dataVector$ is given as

$$\dataVector = \mappingFunctionVector + \noiseVector$$

where the noise, $\noiseVector$, is sampled from a Gaussian density: $\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$. So, in other words, we are taking a Gaussian distributed random value $\mappingFunctionVector$,

$$\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha\basisMatrix\basisMatrix^\top}$$

and adding to it another Gaussian distributed value, $\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$, to form our data observations, $\dataVector$. Once again the sum of two (multivariate) Gaussian distributed variables is also Gaussian, with a mean given by the sum of the means (both zero in this case) and the covariance given by the sum of the covariances. So we now have that the marginal likelihood for the data, $p(\dataVector)$ is given by

$$p(\dataVector) = \gaussianDist{\dataVector}{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top + \dataStd^2\eye}$$

This is our implicit assumption for $\dataVector$ given our prior assumption for $\mappingVector$.
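
We can verify this result numerically with a quick Monte Carlo check (a sketch, not part of the original lab): draw many weight vectors from the prior, push them through the basis, add noise, and compare the empirical covariance of the resulting outputs with $\alpha \basisMatrix \basisMatrix^\top + \dataStd^2\eye$.


In [ ]:
# Monte Carlo check of the marginal covariance alpha*Phi*Phi^T + sigma2*I (sketch)
num_mc = 10000                                                    # number of prior samples
W = np.random.normal(size=(Phi.shape[1], num_mc))*np.sqrt(alpha)  # prior weight samples in columns
F = np.dot(Phi, W)                                                # function values at the training inputs
Y = F + np.random.normal(size=F.shape)*np.sqrt(sigma2)            # add observation noise
emp_cov = np.cov(Y)                                               # empirical covariance across the samples
theory_cov = alpha*np.dot(Phi, Phi.T) + sigma2*np.eye(Phi.shape[0])
print(np.abs(emp_cov - theory_cov).max())                         # small compared to the entries of theory_cov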

Computing the Mean and Error Bars of the Functions

You should now know enough to compute the mean of the predictions under the posterior density.


In [ ]:
# compute mean under posterior density
f_pred_mean = np.dot(Phi_pred, w_mean)

We can plot these predictions alongside the real data,


In [ ]:
# plot the predictions alongside the data
plt.plot(x_pred, f_pred_mean)
plt.plot(x, y, 'rx')
ax = plt.gca()
ax.set_title('Predictions for Order ' + str(order))
ax.set_xlabel('year')
ax.set_ylabel('pace (min/km)')

Computing the Error

We can also compute what the training error was. First compute the expected output under the posterior density,


In [ ]:
f_mean = np.dot(Phi, w_mean)

These can be used to compute the error

$$E(\mappingVector) = \frac{\numData}{2} \log \dataStd^2 + \frac{1}{2\dataStd^2} \sum_{i=1}^\numData \left(\dataScalar_i - \mappingVector^\top \phi(\inputVector_i)\right)^2 \\\ E(\mappingVector) = \frac{\numData}{2} \log \dataStd^2 + \frac{1}{2\dataStd^2} \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunctionScalar_i\right)^2$$

In [ ]:
# compute the sum of squares term
sum_squares = ((y-f_mean)**2).sum()
# compute the error with the fixed noise variance
error = num_data/2.*np.log(sigma2) + sum_squares/(2*sigma2)
print 'The error is: ',error

Now we have the fit and the error; next let's compute error bars for the predictions.

Computing Error Bars

Finally, we can compute error bars for the predictions. The error bars are the standard deviations of the predictions for $\mappingFunctionVector=\basisMatrix\mappingVector$ under the posterior density for $\mappingVector$. The standard deviations of these predictions can be found from the variance of the prediction at each point. Those variances are the diagonal entries of the covariance matrix. We've already computed the form of the covariance under Gaussian expectations,

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \basisMatrix\covarianceMatrix \basisMatrix^\top$$

which under the posterior density is given by

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector_w}{\covarianceMatrix_w}} = \basisMatrix\covarianceMatrix_w \basisMatrix^\top$$

In [ ]:
# compute the error bars
f_pred_cov = np.dot(Phi_pred, np.dot(w_cov, Phi_pred.T))
f_pred_var = np.diag(f_pred_cov)[:, None]
f_pred_std = np.sqrt(f_pred_var)

# plot mean, and error bars at 2 standard deviations
plt.plot(x_pred.flatten(), f_pred_mean.flatten(), 'b-')
plt.plot(x_pred.flatten(), (f_pred_mean+2*f_pred_std).flatten(), 'b--')
plt.plot(x_pred.flatten(), (f_pred_mean-2*f_pred_std).flatten(), 'b--')
plt.plot(x, y, 'rx')

Gaussian Processes

Now we're going to build on our understanding of the marginal likelihood we derived in the lectures, and also across the last part of the lab, to try and see the relationship with Gaussian processes. In the last lab section we sampled directly from the weight vector $\mappingVector$ and applied it to the basis matrix $\basisMatrix$ to obtain a sample from the prior and a sample from the posterior. Now we'll start by constructing the prior directly, rather than sampling the weights and then combining them with the basis.

Sampling from the Prior

The first thing we'll do is to set up the parameters of the model, these include the parameters of the prior, the parameters of the basis functions and the noise level.

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$$

In [ ]:
# set prior variance on w
alpha = 4.
# set the order of the polynomial basis set
order = 5
# set the noise variance
sigma2 = 0.01

Now we have the variance, we can sample from the prior distribution to see what form we are imposing on the functions a priori.

Before we sample from our prior, recall the problems with the basis set from our last lab: the basis doesn't work well when predictions are made outside the $[-1, 1]$ region. Let's rescale the data to be within that region.


In [ ]:
span = np.max(x) - np.min(x) 
offset = np.min(x)
x -= offset
x /= span # x is now between zero and 1
x = x*2-1 # x is now between -1 and 1

Let's now compute a range of values to make predictions at, spanning the new space of inputs,


In [ ]:
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = linspace(-1.2, 1.2, num_pred_data)[:, None] # input locations for predictions

now let's build the basis matrices.


In [ ]:
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in range(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i

Weight Space View

Now we will sample from the prior density to obtain a vector $\mappingVector$ using the function np.random.normal and combine with our basis to create 10 samples from the model over functions,


In [ ]:
num_samples = 10
K = order+1
for i in xrange(num_samples):
    z_vec = np.random.normal(size=K)
    w_sample = z_vec*np.sqrt(alpha)
    f_sample = np.dot(Phi_pred,w_sample)
    plt.plot(x_pred.flatten(), f_sample.flatten())

Function Space View

The process we have used to generate the samples is a two stage process. To obtain each function, we first generated a sample from the prior,

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$$

We next applied the likelihood. The mean of the function in the likelihood is given by

$$\mathbf{f} = \basisMatrix \mappingVector.$$

so we plotted the result. In the lecture we talked about computing the marginal likelihood directly. We used properties of Gaussian densities to show that,

$$\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top},$$

so we should be able to sample directly from this density. To do that we use the np.random.multivariate_normal command introduced in the last session. We need to sample from a multivariate normal with covariance given by $\alpha\basisMatrix\basisMatrix^\top$ and a zero mean,


In [ ]:
K = alpha*np.dot(Phi_pred, Phi_pred.T)
for i in xrange(10):
    f_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), f_sample.flatten())

We can plot the covariance as an image in python, with a colorbar to show the scale.


In [ ]:
plt.imshow(K)
plt.colorbar()

Because the full model involves corrupting the latent function with Gaussian noise,

$$\dataVector = \mappingFunctionVector + \noiseVector$$

and the noise is sampled from an independent Gaussian distribution with variance $\dataStd^2$,

$$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2 \eye},$$

we can use properties of Gaussian variables, i.e. the fact that the sum of two Gaussian variables is also Gaussian and that its covariance is given by the sum of the two covariances, whilst the mean is given by the sum of the means, to write down the marginal likelihood,

$$\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top + \dataStd^2 \eye}.$$

Sampling directly from this distribution gives us the noise corrupted functions,


In [ ]:
K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)
for i in xrange(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())

Here we can see the small effect of our noise value, $\dataStd^2$. We can increase the noise value to see a different effect,


In [ ]:
sigma2 = 1.
K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)
for i in xrange(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())

Function Space Reflection

How would you include the noise term when sampling in the weight space point of view?

Gaussian Process

Rather than sampling from the prior over parameters, we sampled from the marginal likelihood. Specifying this marginal likelihood directly, and avoiding the intermediate weight-space representation, is what Gaussian processes are all about. In a Gaussian process you specify the covariance function directly, rather than implicitly through a basis matrix and a prior over parameters. Gaussian processes have the advantage that they can be nonparametric, which in simple terms means that they can have an infinite number of basis functions. In the lectures we introduced the exponentiated quadratic covariance, also known as the RBF, the Gaussian or the squared exponential covariance function. This covariance function is specified by

$$\kernelScalar(\inputVector_i, \inputVector_j) = \alpha \exp\left( -\frac{\ltwoNorm{\inputVector_i-\inputVector_j}^2}{2\lengthScale^2}\right).$$

where $\ltwoNorm{\inputVector_i-\inputVector_j}^2$ is the squared distance between the two input vectors.

Let's build a covariance matrix based on this function. We will compute the covariance at the points given by x_pred.


In [91]:
def kern(x, xprime, variance=1.0, lengthscale=1.0):
    return variance*np.exp((-0.5*(x - xprime)**2)/lengthscale**2)

alpha = 1.0
lengthscale = 10
K = np.zeros((x_pred.size, x_pred.size))
for i in xrange(x_pred.size):
    for j in xrange(x_pred.size):
        K[i, j] = kern(x_pred[i], x_pred[j], variance=alpha, lengthscale=lengthscale)
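
The double loop above is easy to read but slow in python. Since the inputs here are one dimensional and stored as column vectors, the same matrix can be built in one go with broadcasting (a sketch, equivalent to the loop up to round-off):


In [ ]:
def kern_matrix(X, Xprime, variance=1.0, lengthscale=1.0):
    # squared distance between every pair of 1-D inputs via broadcasting
    sqdist = (X - Xprime.T)**2
    return variance*np.exp(-0.5*sqdist/lengthscale**2)

K_vec = kern_matrix(x_pred, x_pred, variance=alpha, lengthscale=lengthscale)
print(np.abs(K - K_vec).max())   # should be zero, or at round-off level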

Now we can image the resulting covariance,


In [92]:
plt.imshow(K,interpolation='none')
plt.colorbar()


Out[92]:
<matplotlib.colorbar.Colorbar instance at 0x87b44d0>

and sample functions from the marginal likelihood.


In [93]:
for i in xrange(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())


The new covariance function doesn't have the problems with scaling exhibited by the polynomial basis. We can reset our data from the olympics matrix and show samples computed across the actual years.


In [95]:
x = data['X'][:, 0:1]
x_pred = linspace(1892, 2016, num_pred_data)[:, None]

alpha = 1.0
lengthscale = 4.
K = np.zeros((x_pred.size, x_pred.size))
for i in xrange(x_pred.size):
    for j in xrange(x_pred.size):
        K[i, j] = kern(x_pred[i], x_pred[j], variance=alpha, lengthscale=lengthscale)
        
for i in xrange(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())


Moving Parameters

Have a play with the parameters for this covariance function and see what effects the parameters have on the types of functions you observe.
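
As a starting point for that exploration (a sketch only, with arbitrarily chosen values), the loop below draws a few samples for several lengthscales; shorter lengthscales give more rapidly varying functions:


In [ ]:
# draw samples from the exponentiated quadratic covariance for a few lengthscales (sketch)
for lengthscale in [2., 10., 50.]:     # arbitrary values, in years, chosen for illustration
    K = np.zeros((x_pred.size, x_pred.size))
    for i in range(x_pred.size):
        for j in range(x_pred.size):
            K[i, j] = kern(x_pred[i], x_pred[j], variance=alpha, lengthscale=lengthscale)
    plt.figure()
    for s in range(3):
        y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
        plt.plot(x_pred.flatten(), y_sample.flatten())
    plt.title('lengthscale = ' + str(lengthscale))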

Making Predictions

We now have a probability density that represents functions. How do we make predictions with this density? The density is known as a process because it is consistent. By consistency, here, we mean that the model makes predictions for $\mappingFunctionVector$ that are unaffected by future values of $\mappingFunctionVector^*$ that are currently unobserved (such as test points). If we think of $\mappingFunctionVector^\ast$ as test points, we can still write down a joint probability density over the training observations, $\mappingFunctionVector$, and the test observations, $\mappingFunctionVector^\ast$. This joint probability density will be Gaussian, with a covariance matrix given by our covariance function, $\kernelScalar(\inputVector_i, \inputVector_j)$.

$$\begin{bmatrix}\mappingFunctionVector \\\ \mappingFunctionVector^\ast\end{bmatrix} \sim \gaussianSamp{\zerosVector}{\begin{bmatrix} \kernelMatrix & \kernelMatrix_\ast \\\ \kernelMatrix_\ast^\top & \kernelMatrix_{\ast,\ast}\end{bmatrix}}$$

where here $\kernelMatrix$ is the covariance computed between all the training points, $\kernelMatrix_\ast$ is the covariance matrix computed between the training points and the test points, and $\kernelMatrix_{\ast,\ast}$ is the covariance matrix computed between all the test points. To be clear, let's compute these now for our example, using x and y for the training data (although y doesn't enter the covariance) and x_pred as the test locations.


In [97]:
# set covariance function parameters
alpha = 16.0
lengthscale = 8
# set noise variance
sigma2 = 0.05

# compute covariance for training data, x
K = np.zeros((x.size, x.size))
for i in xrange(x.size):
    for j in xrange(x.size):
        K[i, j] = kern(x[i], x[j], variance=alpha, lengthscale=lengthscale)

# compute covariance between training data, x, and test data, x_pred
K_star = np.zeros((x.size, x_pred.size))
for i in xrange(x.size):
    for j in xrange(x_pred.size):
        K_star[i, j] = kern(x[i], x_pred[j], variance=alpha, lengthscale=lengthscale)
        
# compute covariance for test data, x_pred
K_starstar = np.zeros((x_pred.size, x_pred.size))
for i in xrange(x_pred.size):
    for j in xrange(x_pred.size):
        K_starstar[i, j] = kern(x_pred[i], x_pred[j], variance=alpha, lengthscale=lengthscale)

The overall covariance between our training and test data can now be plotted as


In [98]:
full_K = np.vstack([np.hstack([K, K_star]), np.hstack([K_star.T, K_starstar])])
plt.imshow(full_K)
plt.colorbar()

The banded structure we now observe is because some of the training points are near to some of the test points. This is how we obtain 'communication' between our training data and our test data. If there is no structure in $\kernelMatrix_\ast$ then our belief about the test data simply matches our prior.

To make predictions about the test data we need the conditional distribution: $p(\mappingFunctionVector^\ast|\mappingFunctionVector)$, or when we include noise, $p(\mappingFunctionVector^\ast | \dataVector)$. This conditional distribution is also Gaussian,

$$\mappingFunctionVector^\ast \sim \gaussianSamp{\meanVector_\mappingFunctionScalar}{\covarianceMatrix_\mappingFunctionScalar}$$

with a mean given by

$$\meanVector_\mappingFunctionScalar = \kernelMatrix^\top_\ast \left[\kernelMatrix + \dataStd^2 \eye\right]^{-1} \dataVector$$

and a covariance given by

$$\covarianceMatrix_\mappingFunctionScalar = \kernelMatrix_{\ast,\ast} - \kernelMatrix^\top_\ast \left[\kernelMatrix + \dataStd^2 \eye\right]^{-1} \kernelMatrix_\ast.$$

These results can be proved using block matrix inverse rules, but they are beyond the scope of this course, so you don't need to worry about remembering them or rederiving them. We are simply writing them here because it is this conditional density that is necessary for making predictions. Let's compute what those predictions are for the olympic marathon data.


In [99]:
a = np.dot(np.linalg.inv(K + sigma2*eye(x.size)), K_star)
mu_f = np.dot(a.T, y)
C_f = K_starstar - np.dot(a.T, K_star)

where for convenience we've defined

$$\mathbf{a} = \left[\kernelMatrix + \dataStd^2\eye\right]^{-1}\kernelMatrix_\ast.$$
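
Just as in the first lab, the explicit inverse isn't actually needed: $\mathbf{a}$ solves the linear system $\left[\kernelMatrix + \dataStd^2\eye\right]\mathbf{a} = \kernelMatrix_\ast$, so the same predictions can be computed with np.linalg.solve (a sketch, numerically equivalent to the cell above):


In [ ]:
# the same conditional mean and covariance without forming the matrix inverse explicitly (sketch)
a = np.linalg.solve(K + sigma2*np.eye(x.size), K_star)
mu_f_solve = np.dot(a.T, y)
C_f_solve = K_starstar - np.dot(a.T, K_star)
print(np.abs(mu_f - mu_f_solve).max())   # agrees with the inverse-based computation to round-off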

We can visualize the covariance of the conditional,


In [100]:
plt.imshow(C_f)
plt.colorbar()

and we can plot the mean of the conditional


In [101]:
plt.plot(x, y, 'rx')
plt.plot(x_pred, mu_f, 'b-')


Out[101]:
[<matplotlib.lines.Line2D at 0x88100d0>]

as well as the associated error bars. These are given (similarly to the Bayesian parametric model from the last lab) by the standard deviations of the marginal posterior densities. The marginal posterior variances are given by the diagonal elements of the posterior covariance,


In [102]:
var_f = np.diag(C_f)[:, None]
std_f = np.sqrt(var_f)

They can be added to the underlying mean function to give the error bars,


In [103]:
plt.plot(x, y, 'rx')
plt.plot(x_pred, mu_f, 'b-')
plt.plot(x_pred, mu_f+2*std_f, 'b--')
plt.plot(x_pred, mu_f-2*std_f, 'b--')


Out[103]:
[<matplotlib.lines.Line2D at 0x890fb90>]

This gives us a prediction from the Gaussian process. Remember, machine learning is data + model = prediction. Here our data is from the olympics, and our model is a Gaussian process with two parameters. The main thing the model expresses is smoothness. But it also sustains the uncertainty about the function; this means we see an increase in the size of the error bars during periods like the First and Second World Wars, when no Olympic marathon was held.

Exercises

Now try changing the parameters of the covariance function (and the noise) to see how the predictions change.

Now try sampling from this conditional density to see what your predictions look like. What happens if you sample from the conditional density in regions a long way into the future or the past? How does this compare with the results from the polynomial model?
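
As a starting point for the sampling exercise (a sketch only), conditional samples can be drawn with np.random.multivariate_normal using the mean and covariance computed above; to look further into the future or past, recompute x_pred, K_star and K_starstar over a wider range of years first.


In [ ]:
# draw samples from the conditional density over functions (sketch)
plt.plot(x, y, 'rx')
for i in range(10):
    f_sample = np.random.multivariate_normal(mean=mu_f.flatten(), cov=C_f)
    plt.plot(x_pred.flatten(), f_sample.flatten())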