First we download the latest version of the `pods` software.
$$\newcommand{\inputScalar}{x}
\newcommand{\lengthScale}{\ell}
\newcommand{\mappingVector}{\mathbf{w}}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\zerosVector}{\mathbf{0}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataScalar}{y}
\newcommand{\dataVector}{\mathbf{y}}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{\epsilon}}
\newcommand{\noiseMatrix}{\mathbf{E}}
\newcommand{\inputVector}{\mathbf{x}}
\newcommand{\basisMatrix}{\mathbf{\Phi}}
\newcommand{\basisVector}{\mathbf{\phi}}
\newcommand{\basisScalar}{\phi}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\numData}{N}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingFunctionScalar}{f}
\newcommand{\mappingFunctionVector}{\mathbf{f}}
\newcommand{\meanVector}{\boldsymbol{\mu}}
\newcommand{\meanScalar}{\mu}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{\kernelScalar}}
\newcommand{\kernelMatrix}{\mathbf{K}}
$$

In [ ]:

```python
# download the software
import urllib.request
urllib.request.urlretrieve('https://github.com/sods/ods/archive/master.zip', 'master.zip')
# unzip the software
import zipfile
zip_file = zipfile.ZipFile('./master.zip', 'r')
for name in zip_file.namelist():
    zip_file.extract(name, '.')
# add the module location to the python path.
import sys
sys.path.append("./ods-master/")
```

And next we get the plots running in line. We also load in the `pylab` library as `plt`.

In [ ]:

```python
%matplotlib inline
import pods
import pylab as plt
```

Once again, we will be working with the olympics data this week.

In [ ]:

```python
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
plt.plot(x, y, 'rx')
```

This week we're going to build on our understanding of the marginal likelihood, derived in the lectures and in the last lab session, to see its relationship with Gaussian processes. In the last lab class we sampled directly from the weight vector $\mappingVector$ and applied it to the basis matrix $\basisMatrix$ to obtain a sample from the prior and a sample from the posterior. This week we'll start by constructing the prior directly, rather than sampling the weights and then combining them with the basis.

The first thing we'll do is to set up the parameters of the model; these include the parameters of the prior, the parameters of the basis functions and the noise level.

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$$

In [ ]:

```python
# set prior variance on w
alpha = 4.
# set the order of the polynomial basis set
order = 5
# set the noise variance
sigma2 = 0.01
```

Now we have the variance, we can sample from the prior distribution to see what form we are imposing on the functions *a priori*.

Before we sample from our prior, recall the problem with the basis set from our last lab: the basis doesn't work well when predictions are made outside the $[-1, 1]$ region. Let's rescale the data to lie within that region.

In [ ]:

```python
import numpy as np
span = np.max(x) - np.min(x)
offset = np.min(x)
x -= offset
x /= span   # x is now between zero and one
x = x*2 - 1 # x is now between -1 and 1
```

Let's now compute a range of values to make predictions at, spanning the *new* space of inputs,

In [ ]:

```python
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = np.linspace(-1.2, 1.2, num_pred_data)[:, None] # input locations for predictions
```

Now let's build the basis matrices.

In [ ]:

```python
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in range(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i
```

In [ ]:

```python
# sample from the prior over w and push each sample through the basis
num_samples = 10
K = order + 1
for i in range(num_samples):
    z_vec = np.random.normal(size=K)
    w_sample = z_vec*np.sqrt(alpha)
    f_sample = np.dot(Phi_pred, w_sample)
    plt.plot(x_pred.flatten(), f_sample.flatten())
```

The process we have used to generate the samples is a two stage process. To obtain each function, we first generated a sample from the prior,

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}.$$

We next applied the likelihood. The mean of the function in the likelihood is given by

$$\mappingFunctionVector = \basisMatrix \mappingVector,$$

so we plotted the result. In the lecture this week we talked about computing the marginal likelihood directly. We used properties of Gaussian densities to show that

$$\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top},$$

so we should be able to sample directly from this density. To do that we use the `np.random.multivariate_normal` command introduced in the last session. We need to sample from a multivariate normal with covariance given by $\alpha\basisMatrix\basisMatrix^\top$ and a zero mean,

In [ ]:

```python
K = alpha*np.dot(Phi_pred, Phi_pred.T)
for i in range(10):
    f_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), f_sample.flatten())
```

We can plot the covariance as an image in Python, with a colorbar to show the scale.

In [ ]:

```python
plt.imshow(K)
plt.colorbar()
```

Because the full model involves corrupting the latent function with Gaussian noise,

$$\dataVector = \mappingFunctionVector + \noiseVector$$

and the noise is sampled from an independent Gaussian distribution with variance $\dataStd^2$,

$$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2 \eye},$$

the marginal distribution of the data is

$$\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top + \dataStd^2 \eye}.$$

Sampling directly from this distribution gives us the noise corrupted functions,

In [ ]:

```python
K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)
for i in range(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

In [ ]:

```python
# increase the noise variance and look at the effect on the samples
sigma2 = 1.
K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)
for i in range(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

How would you include the noise term when sampling in the weight space point of view?
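One way to sketch an answer (with `alpha`, `sigma2` and the polynomial basis rebuilt here so the snippet stands alone): sample the weights from the prior as before, form the function through the basis, and then add an independent Gaussian noise draw to every output.

```python
import numpy as np

# rebuild the setup from the cells above so this snippet is self-contained
alpha = 4.    # prior variance on w
sigma2 = 0.01 # noise variance
order = 5
num_pred_data = 100
x_pred = np.linspace(-1.2, 1.2, num_pred_data)[:, None]
Phi_pred = np.hstack([x_pred**i for i in range(order + 1)])

# weight-space sample: w ~ N(0, alpha I)
w_sample = np.random.normal(size=order + 1)*np.sqrt(alpha)
# latent function: f = Phi w
f_sample = np.dot(Phi_pred, w_sample)
# include the noise term: y = f + eps, with eps ~ N(0, sigma2 I)
y_sample = f_sample + np.random.normal(size=f_sample.shape)*np.sqrt(sigma2)
```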

Rather than sampling from the prior over parameters, we sampled from the marginal likelihood. Specifying this marginal likelihood directly, and avoiding the intermediate weight-space representation, is what Gaussian processes are all about. In a Gaussian process you specify the *covariance function* directly, rather than *implicitly* through a basis matrix and a prior over parameters. Gaussian processes have the advantage that they can be *nonparametric*, which in simple terms means that they can have *infinite* basis functions. In the lectures we introduced the *exponentiated quadratic* covariance, also known as the RBF, the Gaussian, or the squared exponential covariance function. This covariance function is specified by

$$\kernelScalar(\inputVector_i, \inputVector_j) = \alpha \exp\left(-\frac{\ltwoNorm{\inputVector_i - \inputVector_j}^2}{2\lengthScale^2}\right),$$

where $\ltwoNorm{\inputVector_i-\inputVector_j}^2$ is the squared distance between the two input vectors.

Let's build a covariance matrix based on this function. We will compute the covariance at the points given by `x_pred`.

In [ ]:

```python
alpha = 1.0
lengthscale = 0.1
K = np.zeros((x_pred.size, x_pred.size))
for i in range(x_pred.size):
    for j in range(x_pred.size):
        K[i, j] = alpha*np.exp((-0.5*(x_pred[i]-x_pred[j])**2)/lengthscale**2)
```
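As an aside on the implementation: the same covariance matrix can be computed without the double loop by using NumPy broadcasting for the pairwise squared distances. A sketch, equivalent to the loop version up to floating point:

```python
import numpy as np

alpha = 1.0
lengthscale = 0.1
x_pred = np.linspace(-1.2, 1.2, 100)[:, None]

# (n, 1) minus (1, n) broadcasts to an (n, n) matrix of pairwise differences
sq_dist = (x_pred - x_pred.T)**2
K = alpha*np.exp(-0.5*sq_dist/lengthscale**2)
```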

Now we can display the resulting covariance as an image,

In [ ]:

```python
plt.imshow(K, interpolation='none')
plt.colorbar()
```

and sample functions from the marginal likelihood.

In [ ]:

```python
for i in range(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

In [ ]:

```python
# rebuild the covariance on the original input scale (years) and sample again
x_pred = np.linspace(1892, 2016, num_pred_data)[:, None]
alpha = 1.0
lengthscale = 4
K = np.zeros((x_pred.size, x_pred.size))
for i in range(x_pred.size):
    for j in range(x_pred.size):
        K[i, j] = alpha*np.exp((-0.5*(x_pred[i]-x_pred[j])**2)/lengthscale**2)
for i in range(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

Have a play with the parameters of this covariance function and see what effect they have on the types of functions you observe.

We now have a probability density that represents functions. How do we make predictions with this density? The density is known as a process because it is *consistent*. By consistency, here, we mean that the model makes predictions for $\mappingFunctionVector$ that are unaffected by future values of $\mappingFunctionVector^\ast$ that are currently unobserved (such as test points). If we think of $\mappingFunctionVector^\ast$ as test points, we can still write down a joint probability density over the training observations, $\mappingFunctionVector$, and the test observations, $\mappingFunctionVector^\ast$. This joint probability density will be Gaussian, with a covariance matrix given by our covariance function, $\kernelScalar(\inputVector_i, \inputVector_j)$.
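We can check this consistency numerically (a sketch using the exponentiated quadratic covariance with some hypothetical inputs): the covariance over the training inputs alone is exactly the top-left block of the covariance over training and test inputs together, so introducing test points does not change the marginal over the training points.

```python
import numpy as np

def kern(x, xprime, alpha=1.0, lengthscale=4.0):
    return alpha*np.exp(-0.5*(x - xprime)**2/lengthscale**2)

x_train = np.linspace(0., 10., 5)[:, None] # hypothetical training inputs
x_test = np.linspace(11., 20., 3)[:, None] # hypothetical test inputs

K_train = kern(x_train, x_train.T)         # covariance over training inputs alone
x_all = np.vstack([x_train, x_test])
K_all = kern(x_all, x_all.T)               # covariance over training and test jointly

# the top-left block of the joint covariance is the training covariance
print(np.allclose(K_all[:5, :5], K_train)) # True
```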

The joint density takes the form

$$\begin{bmatrix}\mappingFunctionVector \\ \mappingFunctionVector^\ast\end{bmatrix} \sim \gaussianSamp{\zerosVector}{\begin{bmatrix}\kernelMatrix & \kernelMatrix_\ast \\ \kernelMatrix_\ast^\top & \kernelMatrix_{\ast,\ast}\end{bmatrix}},$$

where $\kernelMatrix$ is the covariance computed between all the training points, $\kernelMatrix_\ast$ is the covariance matrix computed between the training points and the test points, and $\kernelMatrix_{\ast,\ast}$ is the covariance matrix computed between all the test points. To be clear, let's compute these now for our example, using `x` and `y` for the training data (although `y` doesn't enter the covariance) and `x_pred` as the test locations.

In [ ]:

```python
# set covariance function parameters
alpha = 16.0
lengthscale = 8
# set noise variance
sigma2 = 0.05
# define a function to compute the covariance.
def kern(x, xprime):
    return alpha*np.exp((-0.5*(x-xprime)**2)/lengthscale**2)
# compute covariance for training data, x
K = np.zeros((x.size, x.size))
for i in range(x.size):
    for j in range(x.size):
        K[i, j] = kern(x[i], x[j])
# compute covariance between training data, x, and test data, x_pred
K_star = np.zeros((x.size, x_pred.size))
for i in range(x.size):
    for j in range(x_pred.size):
        K_star[i, j] = kern(x[i], x_pred[j])
# compute covariance for test data, x_pred
K_starstar = np.zeros((x_pred.size, x_pred.size))
for i in range(x_pred.size):
    for j in range(x_pred.size):
        K_starstar[i, j] = kern(x_pred[i], x_pred[j])
```

The overall covariance between our training and test data can now be plotted as

In [ ]:

```python
full_K = np.vstack([np.hstack([K, K_star]), np.hstack([K_star.T, K_starstar])])
plt.imshow(full_K)
plt.colorbar()
```

The banded structure we now observe is because some of the training points are near to some of the test points. This is how we obtain 'communication' between our training data and our test data. If there is no structure in $\kernelMatrix_\ast$ then our belief about the test data simply matches our prior.

To make predictions about the test data we need the conditional distribution: $p(\mappingFunctionVector^\ast|\mappingFunctionVector)$, or when we include noise, $p(\mappingFunctionVector^\ast | \dataVector)$. This conditional distribution is also Gaussian,

$$\mappingFunctionVector^\ast \sim \gaussianSamp{\meanVector_\mappingFunctionScalar}{\covarianceMatrix_\mappingFunctionScalar},$$

with a mean given by

$$\meanVector_\mappingFunctionScalar = \kernelMatrix^\top_\ast \left[\kernelMatrix + \dataStd^2 \eye\right]^{-1} \dataVector$$

and a covariance given by

$$\covarianceMatrix_\mappingFunctionScalar = \kernelMatrix_{\ast,\ast} - \kernelMatrix^\top_\ast \left[\kernelMatrix + \dataStd^2 \eye\right]^{-1} \kernelMatrix_\ast.$$

These results can be proved using *block matrix inverse* rules, but they are beyond the scope of this course, so you don't need to worry about remembering or rederiving them. We are simply writing them here because it is this *conditional* density that is necessary for making predictions. Let's compute those predictions for the olympic marathon data.

In [ ]:

```python
# solve the linear system rather than forming the inverse explicitly
a = np.linalg.solve(K + sigma2*np.eye(x.size), K_star)
mu_f = np.dot(a.T, y)
C_f = K_starstar - np.dot(a.T, K_star)
```

where for convenience we've defined

$$\mathbf{a} = \left[\kernelMatrix + \dataStd^2\eye\right]^{-1}\kernelMatrix_\ast.$$

We can visualize the covariance of the *conditional*,

In [ ]:

```python
plt.imshow(C_f)
plt.colorbar()
```

and we can plot the mean of the conditional

In [ ]:

```python
plt.plot(x, y, 'rx')
plt.plot(x_pred, mu_f, 'b-')
```

To visualize the uncertainty, we extract the variance of the prediction at each point from the diagonal of the conditional covariance, and take its square root to obtain the standard deviations.

In [ ]:

```python
# predictive variance is the diagonal of the conditional covariance
var_f = np.diag(C_f)[:, None]
std_f = np.sqrt(var_f)
```

They can be added to the underlying mean function to give the error bars,

In [ ]:

```python
plt.plot(x, y, 'rx')
plt.plot(x_pred, mu_f, 'b-')
plt.plot(x_pred, mu_f+2*std_f, 'b--')
plt.plot(x_pred, mu_f-2*std_f, 'b--')
```

This gives us a prediction from the Gaussian process. Remember, machine learning is data + model = prediction. Here our data is from the olympics, and our model is a Gaussian process with two parameters. The main thing the model expresses is smoothness. But it also maintains uncertainty about the function; this means we see an increase in the size of the error bars during periods like the First and Second World Wars, when no olympic marathon was held.

Now try changing the parameters of the covariance function (and the noise) to see how the predictions change.

Now try sampling from this conditional density to see what your predictions look like. What happens if you sample from the conditional density in regions a long way into the future or the past? How does this compare with the results from the polynomial model?
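As a starting point for that exercise, here is a sketch of the sampling step (with toy stand-ins for `x`, `y` and `x_pred` so the snippet runs on its own; a small jitter term is added to the conditional covariance for numerical stability):

```python
import numpy as np

alpha, lengthscale, sigma2 = 16.0, 8.0, 0.05

def kern(x, xprime):
    return alpha*np.exp(-0.5*(x - xprime)**2/lengthscale**2)

# toy stand-ins for the olympic data and the prediction locations
x = np.linspace(1896, 1980, 10)[:, None]
y = np.sin(x/20.)
x_pred = np.linspace(1870, 2030, 50)[:, None]

# training, cross and test covariances
K = kern(x, x.T)
K_star = kern(x, x_pred.T)
K_starstar = kern(x_pred, x_pred.T)

# conditional mean and covariance
a = np.linalg.solve(K + sigma2*np.eye(x.size), K_star)
mu_f = np.dot(a.T, y)
C_f = K_starstar - np.dot(a.T, K_star)

# sample from the conditional density; the jitter keeps the covariance
# numerically positive semi-definite
C_f += 1e-6*np.eye(x_pred.size)
samples = np.random.multivariate_normal(mu_f.flatten(), C_f, size=10)
```

Far from the training inputs the conditional mean reverts to zero and the variance grows back towards $\alpha$, so samples in those regions look like draws from the prior rather than extrapolating a trend, unlike the polynomial model.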