In [3]:
```
%pylab inline
```
In [ ]:
```
import urllib
url = ("http://staffwww.dcs.shef.ac.uk/"
       + "people/N.Lawrence/dataset_mirror/"
       + "olympic_marathon_men/olympicMarathonTimes.csv")
urllib.urlretrieve(url, 'olympicMarathonTimes.csv')
olympics = np.loadtxt('olympicMarathonTimes.csv', delimiter=',')
```
In [42]:
```
import GPy
data = GPy.util.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
```

We can plot them to check that they've loaded in correctly.
In [4]:
```
plt.plot(x, y, 'rx')
```
$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$$

$$\mappingScalar_i \sim \gaussianSamp{0}{\alpha}$$

In [5]:
```
# set prior variance on w
alpha = 4.
# set the order of the polynomial basis set
order = 5
# set the noise variance
sigma2 = 0.01
```

*a priori*. To do this, we first sample a weight vector, then we multiply that weight vector by our basis to compute the functions. First we compute the basis function matrix, both for our training data and for a range of prediction locations (`x_pred`).
In [6]:
```
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions
```

Now let's build the basis matrices.
In [7]:
```
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in xrange(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i
```

*before* we fit the data. To do this we are going to use the properties of the Gaussian density and a sample from a *standard normal* using the function `np.random.normal`.
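As a side note, NumPy can construct this polynomial basis in a single call; a small self-contained sketch (with toy `x` and `order`, not the lab's variables) comparing the loop against `np.vander` with `increasing=True`:

```python
import numpy as np

# toy inputs and polynomial order (illustrative only, not the lab's data)
order = 5
x = np.linspace(-1, 1, 7)[:, None]

# loop construction, mirroring the cell above
Phi = np.zeros((x.shape[0], order+1))
for i in range(order+1):
    Phi[:, i:i+1] = x**i

# np.vander with increasing=True puts x**i in column i
Phi_vander = np.vander(x.flatten(), order+1, increasing=True)
print(np.allclose(Phi, Phi_vander))  # → True
```

Either way the monomial basis is poorly conditioned for large orders; the rescaling of the inputs later in this lab helps with that.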

$$\mappingFunctionScalar = \basisScalar \mappingScalar$$

If $\mappingScalar$ is drawn from a normal density,

$$\mappingScalar \sim \gaussianSamp{\meanScalar_\mappingScalar}{c_\mappingScalar}$$

$$\basisScalar \mappingScalar \sim \gaussianSamp{\basisScalar\meanScalar_\mappingScalar}{\basisScalar^2c_\mappingScalar}$$

Let's test this out numerically. First we will draw 200 samples from a standard normal,

In [ ]:
```
w_vec = np.random.normal(size=200)
```

We can compute the mean of these samples and their variance.
In [ ]:
```
print 'w sample mean is ', w_vec.mean()
print 'w sample variance is ', w_vec.var()
```
In [ ]:
```
phi = 7
f_vec = phi*w_vec
print 'True mean should be phi*0 = 0.'
print 'True variance should be phi*phi*1 = ', phi*phi
print 'f sample mean is ', f_vec.mean()
print 'f sample variance is ', f_vec.var()
```

Adding an offset to samples from `np.random.normal` will change the mean. So if you want to sample from a Gaussian with mean `mu` and standard deviation `sigma`, one way of doing it is to sample from the standard normal and then scale and shift the result. To sample a set of $\mappingScalar$ from a Gaussian with mean $\meanScalar$ and variance $\alpha$,

$$\mappingScalar \sim \gaussianSamp{\meanScalar}{\alpha}$$

we can simply scale and offset samples from the *standard normal*.
In [ ]:
```
mu = 4 # mean of the distribution
alpha = 2 # variance of the distribution
w_vec = np.random.normal(size=200)*np.sqrt(alpha) + mu
print 'w sample mean is ', w_vec.mean()
print 'w sample variance is ', w_vec.var()
```

The `np.sqrt` is necessary because we need to multiply by the standard deviation, and we specified the variance as `alpha`. So scaling and offsetting a Gaussian distributed variable keeps the variable Gaussian, but it affects the mean and variance of the resulting variable.
In [ ]:
```
# First the standard normal
z_vec = np.random.normal(size=1000) # by convention, in statistics, z is often used to denote samples from the standard normal
w_vec = z_vec*np.sqrt(alpha) + mu
# plot normalized histogram of w, and then normalized histogram of z on top
plt.hist(w_vec, bins=30, normed=True)
plt.hist(z_vec, bins=30, normed=True)
plt.legend(('$w$', '$z$'))
```

Now we will sample from the prior to see what functions look like *a priori*. The process will be as follows. First, we sample a random $K$ dimensional vector from `np.random.normal`. Then we scale it by $\sqrt{\alpha}$ to obtain a prior sample of $\mappingVector$.
In [8]:
```
K = order + 1
z_vec = np.random.normal(size=K)
w_sample = z_vec*np.sqrt(alpha)
print w_sample
```
Now we can combine our sample from the prior with the basis functions to create a function,

In [9]:
```
f_sample = np.dot(Phi_pred, w_sample)
plt.plot(x_pred.flatten(), f_sample.flatten(), 'r-')
```
In [10]:
```
span = np.max(x) - np.min(x)
offset = np.min(x)
x -= offset
x_pred -= offset
x /= span # x is now between zero and 1
x_pred /= span
x = x*2-1 # x is now between -1 and 1
x_pred = x_pred*2 - 1
```

Now we need to recompute the basis functions from above,
In [11]:
```
# rebuild the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in xrange(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i
```
In [12]:
```
f_sample = np.dot(Phi_pred, w_sample)
plt.plot(x_pred.flatten(), f_sample.flatten(), 'r-')
```
Now let's loop through some samples and plot various functions as samples from this system,

In [13]:
```
num_samples = 10
K = order+1
for i in xrange(num_samples):
    z_vec = np.random.normal(size=K)
    w_sample = z_vec*np.sqrt(alpha)
    f_sample = np.dot(Phi_pred, w_sample)
    plt.plot(x_pred.flatten(), f_sample.flatten())
```
$$\mathbf{f} = \basisMatrix \mappingVector.$$

$$p(\mappingVector | \dataVector, \inputVector, \dataStd^2) = \mathcal{N}\left(\mappingVector|\meanVector_\mappingScalar, \covarianceMatrix_\mappingScalar\right)$$

with covariance, $\covarianceMatrix_\mappingScalar$, given by

$$\covarianceMatrix_\mappingScalar = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1} \eye\right)^{-1}$$

whilst the mean is given by

$$\meanVector_\mappingScalar = \covarianceMatrix_\mappingScalar \dataStd^{-2}\basisMatrix^\top \dataVector$$

$$p(\mappingVector|\dataVector, \inputVector) = \frac{p(\dataVector|\inputVector, \mappingVector)p(\mappingVector)}{p(\dataVector)}$$

Let's compute the posterior mean and *covariance*.

In [14]:
```
# compute the posterior covariance and mean
w_cov = np.linalg.inv(1/sigma2*np.dot(Phi.T, Phi) + 1/alpha*np.eye(order+1))
w_mean = np.dot(w_cov, 1/sigma2*np.dot(Phi.T, y))
```

Before observing the data we could sample each parameter *independently* from a Gaussian using `np.random.normal` and scaling the result. However, observing the data *correlates* the parameters. Recall this from the first lab, where we had a correlation between the offset, $c$, and the slope, $m$, which caused such problems with the coordinate ascent algorithm. We need to sample from a *correlated* Gaussian. For this we can use `np.random.multivariate_normal`.
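Explicitly inverting the matrix is fine at this scale, but solving the linear system is generally cheaper and more stable. A minimal sketch with a toy `Phi` and `y` (hypothetical stand-ins, not the olympic data) showing that `np.linalg.solve` recovers the same posterior mean:

```python
import numpy as np

np.random.seed(0)
Phi = np.random.randn(20, 3)   # toy basis matrix
y = np.random.randn(20, 1)     # toy targets
sigma2, alpha = 0.01, 4.0

# posterior covariance and mean via the explicit inverse, as above
A = np.dot(Phi.T, Phi)/sigma2 + np.eye(3)/alpha
w_cov = np.linalg.inv(A)
w_mean = np.dot(w_cov, np.dot(Phi.T, y)/sigma2)

# the same mean from a linear solve, with no explicit inverse
w_mean_solve = np.linalg.solve(A, np.dot(Phi.T, y)/sigma2)
print(np.allclose(w_mean, w_mean_solve))  # → True
```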
In [17]:
```
w_sample = np.random.multivariate_normal(w_mean.flatten(), w_cov)
f_sample = np.dot(Phi_pred, w_sample)
plt.plot(x_pred.flatten(), f_sample.flatten(), 'r-')
plt.plot(x, y, 'rx') # plot data to show fit.
```

Now let's sample several functions and plot them all to see how the predictions fluctuate.
In [18]:
```
for i in xrange(num_samples):
    w_sample = np.random.multivariate_normal(w_mean.flatten(), w_cov)
    f_sample = np.dot(Phi_pred, w_sample)
    plt.plot(x_pred.flatten(), f_sample.flatten())
plt.plot(x, y, 'rx') # plot data to show fit.
```
$$y_i \sim \gaussianSamp{\meanScalar_i}{\dataStd^2_i}$$

$$\sum_{i=1}^K y_i \sim \gaussianSamp{\sum_{i=1}^K \meanScalar_i}{\sum_{i=1}^K \dataStd_i^2}.$$

We will add these Gaussians together and store the result in `y_vec`.

In [ ]:
```
K = 10 # how many Gaussians to add.
num_samples = 1000 # how many samples to have in y_vec
mus = np.linspace(0, 5, K) # mean values generated linearly spaced between 0 and 5
sigmas = np.linspace(0.5, 2, K) # sigmas generated linearly spaced between 0.5 and 2
y_vec = np.zeros(num_samples)
for mu, sigma in zip(mus, sigmas):
    z_vec = np.random.normal(size=num_samples) # z is from standard normal
    y_vec += z_vec*sigma + mu # add z*sigma + mu to y
# now y_vec is the sum of each scaled and offset z.
print 'Sample mean is ', y_vec.mean(), ' and sample variance is ', y_vec.var()
print 'True mean should be ', mus.sum()
print 'True variance should be ', (sigmas**2).sum(), ' standard deviation ', np.sqrt((sigmas**2).sum())
```

Of course, we can histogram `y_vec` as well.
In [ ]:
```
plt.hist(y_vec, bins=30, normed=True)
plt.legend(('$y$',))
```
$$\mappingFunctionScalar_i = \basisVector_i^\top \mappingVector$$

$$\mappingFunctionScalar_i = \sum_{j=1}^K \mappingScalar_j \basisScalar_{i, j}$$

$$\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \int \mappingFunctionVector \gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix} \text{d}\mappingVector = \int \basisMatrix\mappingVector \gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix} \text{d}\mappingVector = \basisMatrix \int \mappingVector \gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix} \text{d}\mappingVector = \basisMatrix \meanVector$$

so the expectation is given by the *mean* of the Gaussian density for $\mappingVector$. Because our prior distribution was Gaussian with zero mean, the expectation under the prior is given by

$$\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\zerosVector}{\alpha\eye}} = \zerosVector$$

The covariance is a little more complicated. A covariance matrix is defined as

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \expDist{\mappingFunctionVector\mappingFunctionVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} - \expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}^\top$$

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \expDist{\mappingFunctionVector\mappingFunctionVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} - \basisMatrix \meanVector \meanVector^\top \basisMatrix^\top$$

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \expDist{\basisMatrix\mappingVector\mappingVector^\top \basisMatrix^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} - \basisMatrix \meanVector \meanVector^\top \basisMatrix^\top$$

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \basisMatrix\expDist{\mappingVector\mappingVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} \basisMatrix^\top - \basisMatrix \meanVector \meanVector^\top \basisMatrix^\top$$

This depends on the second moment of the Gaussian,

$$\expDist{\mappingVector\mappingVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \covarianceMatrix + \meanVector\meanVector^\top$$

that can be substituted in to recover,

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \basisMatrix\covarianceMatrix \basisMatrix^\top$$

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\zerosVector}{\alpha \eye}} = \alpha \basisMatrix \basisMatrix^\top$$
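We can sanity-check the identity $\text{cov}(\basisMatrix\mappingVector) = \basisMatrix\covarianceMatrix\basisMatrix^\top$ with a quick Monte Carlo sketch (toy sizes and a made-up covariance, purely for illustration):

```python
import numpy as np

np.random.seed(1)
Phi = np.random.randn(4, 2)                 # toy basis matrix
C = np.array([[2.0, 0.3], [0.3, 1.0]])      # toy covariance for w

# draw many w ~ N(0, C) and map them through the basis
w = np.random.multivariate_normal(np.zeros(2), C, size=200000)
f = w.dot(Phi.T)                            # each row is one sample of f = Phi w

emp_cov = np.cov(f, rowvar=False)           # empirical covariance of the samples
print(np.allclose(emp_cov, Phi.dot(C).dot(Phi.T), atol=0.2, rtol=0.1))  # → True
```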

$$\dataVector = \mappingFunctionVector + \noiseVector$$

$$\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha\basisMatrix\basisMatrix^\top}$$

$$p(\dataVector) = \gaussianDist{\dataVector}{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top + \dataStd^2\eye}$$

This is our *implicit* assumption for $\dataVector$ given our prior assumption for $\mappingVector$.

You should now know enough to compute the mean of the predictions under the posterior density.

In [19]:
```
# compute mean under posterior density
f_pred_mean = np.dot(Phi_pred, w_mean)
```

We can plot these predictions alongside the real data,
In [20]:
```
# plot the predictions
plt.plot(x_pred, f_pred_mean)
plt.plot(x, y, 'rx')
ax = plt.gca()
ax.set_title('Predictions for Order ' + str(order))
ax.set_xlabel('year')
ax.set_ylabel('pace (min/km)')
```

In [ ]:
```
f_mean = np.dot(Phi, w_mean)
```

These can be used to compute the error

$$E(\mappingVector) = \frac{\numData}{2} \log \dataStd^2 + \frac{1}{2\dataStd^2} \sum_{i=1}^\numData \left(\dataScalar_i - \mappingVector^\top \phi(\inputVector_i)\right)^2$$

$$E(\mappingVector) = \frac{\numData}{2} \log \dataStd^2 + \frac{1}{2\dataStd^2} \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunctionScalar_i\right)^2$$
In [ ]:
```
# compute the sum of squares term
sum_squares = ((y-f_mean)**2).sum()
# fit the noise variance
error = (num_data/2.*np.log(sigma2) + sum_squares/(2*sigma2))
print 'The error is: ', error
```

Now we have the fit and the error. To plot the fit with error bars we also need the *covariance* of the predicted function values,

which under the posterior density is given by

$$\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector_w}{\covarianceMatrix_w}} = \basisMatrix\covarianceMatrix_w \basisMatrix^\top$$
In [22]:
```
# compute the error bars
f_pred_cov = np.dot(Phi_pred, np.dot(w_cov, Phi_pred.T))
f_pred_var = np.diag(f_pred_cov)[:, None]
f_pred_std = np.sqrt(f_pred_var)
# plot mean, and error bars at 2 standard deviations
plt.plot(x_pred.flatten(), f_pred_mean.flatten(), 'b-')
plt.plot(x_pred.flatten(), (f_pred_mean+2*f_pred_std).flatten(), 'b--')
plt.plot(x_pred.flatten(), (f_pred_mean-2*f_pred_std).flatten(), 'b--')
plt.plot(x, y, 'rx')
```

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$$

In [ ]:
```
# set prior variance on w
alpha = 4.
# set the order of the polynomial basis set
order = 5
# set the noise variance
sigma2 = 0.01
```

Now we will again sample from the prior to see what the functions look like *a priori*.
In [ ]:
```
span = np.max(x) - np.min(x)
offset = np.min(x)
x -= offset
x /= span # x is now between zero and 1
x = x*2-1 # x is now between -1 and 1
```

Let's now compute a range of values to make predictions at, spanning the *new* space of inputs,
In [ ]:
```
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = linspace(-1.2, 1.2, num_pred_data)[:, None] # input locations for predictions
```

Now let's build the basis matrices.
In [ ]:
```
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in range(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i
```

Now we sample weights from the prior using `np.random.normal` and combine with our basis to create 10 samples from the model over functions,
In [ ]:
```
num_samples = 10
K = order+1
for i in xrange(num_samples):
    z_vec = np.random.normal(size=K)
    w_sample = z_vec*np.sqrt(alpha)
    f_sample = np.dot(Phi_pred, w_sample)
    plt.plot(x_pred.flatten(), f_sample.flatten())
```

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$$

We next applied the likelihood. The mean of the function in the likelihood is given by

$$\mathbf{f} = \basisMatrix \mappingVector.$$

so that the function values are distributed as

$$\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top},$$

which we can sample from with the `np.random.multivariate_normal` command introduced in the last session. We need to sample from a multivariate normal with covariance given by $\alpha\basisMatrix\basisMatrix^\top$ and a zero mean,
In [23]:
```
K = alpha*np.dot(Phi_pred, Phi_pred.T)
for i in xrange(10):
    f_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), f_sample.flatten())
```
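The weight space and function space views can be checked against each other numerically: the empirical covariance of samples of $\basisMatrix \mappingVector$ should match $\alpha \basisMatrix \basisMatrix^\top$. A self-contained sketch with a tiny made-up basis matrix (not the lab's `Phi_pred`):

```python
import numpy as np

np.random.seed(3)
alpha = 4.0
# tiny polynomial basis matrix at 5 inputs (illustrative only)
Phi_pred = np.vander(np.linspace(-1, 1, 5), 3, increasing=True)

# weight space view: sample w ~ N(0, alpha I), then form f = Phi w
w = np.sqrt(alpha)*np.random.randn(3, 200000)
f = Phi_pred.dot(w)
emp_cov = np.cov(f)                 # empirical covariance across samples

# function space view: the covariance of f directly
K = alpha*Phi_pred.dot(Phi_pred.T)
print(np.allclose(emp_cov, K, atol=0.2, rtol=0.1))  # → True
```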

We can plot the covariance as an image, with a colorbar to show the scale.

In [24]:
```
plt.imshow(K)
plt.colorbar()
```
Because the full model involves corrupting the latent function with Gaussian noise,

$$\dataVector = \mappingFunctionVector + \noiseVector$$

and the noise is sampled from an independent Gaussian distribution with variance $\dataStd^2$,

$$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2 \eye},$$

$$\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top + \dataStd^2 \eye}.$$

Sampling directly from this distribution gives us the noise corrupted functions,

In [25]:
```
K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)
for i in xrange(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```
In [26]:
```
sigma2 = 1.
K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)
for i in xrange(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

How would you include the noise term when sampling in the weight space point of view?

In a Gaussian process model we specify the *covariance function* directly, rather than *implicitly* through a basis matrix and a prior over parameters. Gaussian processes have the advantage that they can be *nonparametric*, which in simple terms means that they can have *infinitely many* basis functions. In the lectures we introduced the *exponentiated quadratic* covariance, also known as the RBF, the Gaussian, or the squared exponential covariance function. This covariance function is specified by

$$\kernelScalar(\inputVector_i, \inputVector_j) = \alpha \exp\left( -\frac{\ltwoNorm{\inputVector_i-\inputVector_j}^2}{2\lengthScale^2}\right).$$

First, let's reload the olympic marathon data and reset `x_pred` to the original input scale.

In [33]:
```
import urllib
url = ("http://staffwww.dcs.shef.ac.uk/"
       + "people/N.Lawrence/dataset_mirror/"
       + "olympic_marathon_men/olympicMarathonTimes.csv")
urllib.urlretrieve(url, 'olympicMarathonTimes.csv')
olympics = np.loadtxt('olympicMarathonTimes.csv', delimiter=',')
x_pred = linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions
```
In [34]:
```
def kern(x, xprime, variance=1.0, lengthscale=1.0):
    return variance*np.exp((-0.5*(x - xprime)**2)/lengthscale**2)

alpha = 1.0
lengthscale = 10.
K = np.zeros((x_pred.size, x_pred.size))
for i in xrange(x_pred.size):
    for j in xrange(x_pred.size):
        K[i, j] = kern(x_pred[i], x_pred[j], variance=alpha, lengthscale=lengthscale)
```

Now we can visualize the resulting covariance,
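As an aside, the double loop above is easy to read but slow for larger grids. The same exponentiated quadratic covariance can be computed at once with broadcasting; a sketch (re-defining `kern` locally so the block is self-contained):

```python
import numpy as np

def kern(x, xprime, variance=1.0, lengthscale=1.0):
    # exponentiated quadratic covariance for scalar inputs, as in the lab
    return variance*np.exp((-0.5*(x - xprime)**2)/lengthscale**2)

def kern_matrix(x, xprime, variance=1.0, lengthscale=1.0):
    # broadcasting (n,1) against (1,m) builds the full covariance at once
    sqdist = (x[:, None] - xprime[None, :])**2
    return variance*np.exp(-0.5*sqdist/lengthscale**2)

x = np.linspace(0., 1., 5)
K_loop = np.array([[kern(a, b, 2.0, 0.3) for b in x] for a in x])
K_vec = kern_matrix(x, x, variance=2.0, lengthscale=0.3)
print(np.allclose(K_loop, K_vec))  # → True
```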
In [35]:
```
plt.imshow(K, interpolation='none')
plt.colorbar()
```

and sample functions from the marginal likelihood.

In [36]:
```
for i in xrange(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

In [43]:
```
x = data['X'][:, 0:1]
x_pred = linspace(1892, 2016, num_pred_data)[:, None]
alpha = 1.0
lengthscale = 4.
K = np.zeros((x_pred.size, x_pred.size))
for i in xrange(x_pred.size):
    for j in xrange(x_pred.size):
        K[i, j] = kern(x_pred[i], x_pred[j], variance=alpha, lengthscale=lengthscale)
for i in xrange(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

Gaussian process models are *consistent*. By consistency, here, we mean that the model makes predictions for $\mappingFunctionVector$ that are unaffected by future values of $\mappingFunctionVector^*$ that are currently unobserved (such as test points). If we think of $\mappingFunctionVector^\ast$ as test points, we can still write down a joint probability density over the training observations, $\mappingFunctionVector$, and the test observations, $\mappingFunctionVector^\ast$. This joint probability density will be Gaussian, with a covariance matrix given by our covariance function, $\kernelScalar(\inputVector_i, \inputVector_j)$.

$$\begin{bmatrix}\mappingFunctionVector \\\ \mappingFunctionVector^\ast\end{bmatrix} \sim \gaussianSamp{\zerosVector}{\begin{bmatrix} \kernelMatrix & \kernelMatrix_\ast \\\ \kernelMatrix_\ast^\top & \kernelMatrix_{\ast,\ast}\end{bmatrix}}$$

To make predictions we use `x` and `y` for the training data (although `y` doesn't enter the covariance) and `x_pred` as the test locations.

In [60]:
```
# set covariance function parameters
alpha = 16.0
lengthscale = 40.
# set noise variance
sigma2 = 0.05
# compute covariance for training data, x
K = np.zeros((x.size, x.size))
for i in xrange(x.size):
    for j in xrange(x.size):
        K[i, j] = kern(x[i], x[j], variance=alpha, lengthscale=lengthscale)
# compute covariance between training data, x, and test data, x_pred
K_star = np.zeros((x.size, x_pred.size))
for i in xrange(x.size):
    for j in xrange(x_pred.size):
        K_star[i, j] = kern(x[i], x_pred[j], variance=alpha, lengthscale=lengthscale)
# compute covariance for test data, x_pred
K_starstar = np.zeros((x_pred.size, x_pred.size))
for i in xrange(x_pred.size):
    for j in xrange(x_pred.size):
        K_starstar[i, j] = kern(x_pred[i], x_pred[j], variance=alpha, lengthscale=lengthscale)
```

The overall covariance between our training and test data can now be plotted as
In [61]:
```
full_K = np.vstack([np.hstack([K, K_star]), np.hstack([K_star.T, K_starstar])])
plt.imshow(full_K)
plt.colorbar()
```

$$\mappingFunctionVector^\ast \sim \gaussianSamp{\meanVector_\mappingFunctionScalar}{\covarianceMatrix_\mappingFunctionScalar}$$

with a mean given by

$$\meanVector_\mappingFunctionScalar = \kernelMatrix^\top_\ast \left[\kernelMatrix + \dataStd^2 \eye\right]^{-1} \dataVector$$

and a covariance given by

$$\covarianceMatrix_\mappingFunctionScalar = \kernelMatrix_{\ast,\ast} - \kernelMatrix^\top_\ast \left[\kernelMatrix + \dataStd^2 \eye\right]^{-1} \kernelMatrix_\ast.$$

These results can be derived using *block matrix inverse* rules, but they are beyond the scope of this course, so you don't need to worry about remembering or rederiving them. We write them here because it is this *conditional* density that is necessary for making predictions. Let's compute those predictions for the olympic marathon data.

In [62]:
```
a = np.dot(np.linalg.inv(K + sigma2*np.eye(x.size)), K_star)
mu_f = np.dot(a.T, y)
C_f = K_starstar - np.dot(a.T, K_star)
```

where for convenience we've defined

$$\mathbf{a} = \left[\kernelMatrix + \dataStd^2\eye\right]^{-1}\kernelMatrix_\ast.$$

We can visualize the covariance of the *conditional*,
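As with the weight space posterior, in practice one would avoid the explicit matrix inverse. A hedged sketch of the same conditional on toy data (the `kern`, `x`, and `y` here are local stand-ins, not the lab's variables), using `np.linalg.solve`:

```python
import numpy as np

def kern(x, xprime, variance=1.0, lengthscale=1.0):
    # exponentiated quadratic covariance (broadcasts over arrays)
    return variance*np.exp(-0.5*(x - xprime)**2/lengthscale**2)

np.random.seed(2)
x = np.linspace(-1, 1, 10)                 # toy training inputs
y = np.sin(3*x) + 0.1*np.random.randn(10)  # toy training targets
x_pred = np.linspace(-1, 1, 25)            # toy test inputs
sigma2 = 0.05

K = kern(x[:, None], x[None, :])
K_star = kern(x[:, None], x_pred[None, :])
K_starstar = kern(x_pred[:, None], x_pred[None, :])

# a = [K + sigma2 I]^{-1} K_* computed without forming the inverse
a = np.linalg.solve(K + sigma2*np.eye(x.size), K_star)
mu_f = a.T.dot(y)
C_f = K_starstar - a.T.dot(K_star)
print(mu_f.shape, C_f.shape)  # → (25,) (25, 25)
```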
In [63]:
```
plt.imshow(C_f)
plt.colorbar()
```

and we can plot the mean of the conditional

In [64]:
```
plt.plot(x, y, 'rx')
plt.plot(x_pred, mu_f, 'b-')
```

In [65]:
```
var_f = np.diag(C_f)[:, None]
std_f = np.sqrt(var_f)
```

These can be added to the underlying mean function to give the error bars,
In [66]:
```
plt.plot(x, y, 'rx')
plt.plot(x_pred, mu_f, 'b-')
plt.plot(x_pred, mu_f+2*std_f, 'b--')
plt.plot(x_pred, mu_f-2*std_f, 'b--')
```

**In this session you should try and complete at least as far as Exercise 5**.

We assume that GPy is already installed on your machine. You can get instructions on how to install GPy from the SheffieldML github page. They are written as markdown in the README.md file, which is automatically parsed for you just under the file listing there.

We first tell the ipython notebook that we want the plots to appear inline, then we import the libraries we will need:

In [ ]:
```
%pylab inline
import numpy as np
import pylab as pb
import GPy
```

The current draft of the online documentation of GPy is available from this page.
In [ ]:
```
d = 1 # input dimension
var = 1. # variance
theta = 0.2 # lengthscale
k = GPy.kern.rbf(d, variance=var, lengthscale=theta)
```

A summary of the kernel can be obtained using the command `print k`.
In [ ]:
```
print k
```

It is also possible to plot the kernel as a function of one of its inputs (whilst fixing the other) with `k.plot()`.

*Note*: if you need help with a command in the ipython notebook, you can get it at any time by typing a question mark after the command, e.g. `k.plot?`
In [ ]:
```
k.plot()
```

The values of the covariance function parameters can be accessed and modified using `k['.*var']`, where the string in brackets is a regular expression matching the parameter name as it appears in `print k`. Let's use this to get an insight into the effect of the parameters on the shape of the covariance function.

We'll now set the lengthscale of the covariance to different values, and then plot the resulting covariance using the `k.plot()` method.
In [ ]:
```
k = GPy.kern.rbf(d) # By default, the parameters are set to 1.
theta = np.asarray([0.2, 0.5, 1., 2., 4.])
for t in theta:
    k['.*lengthscale'] = t
    k.plot()
pb.legend(theta)
```

a) What is the effect of the lengthscale parameter on the covariance function?

# Exercise 1 a) answer

In [ ]:
```
# Exercise 1 b) answer
```

GPy provides many other covariance functions, such as `exponential`, `Matern32`, `Matern52`, `Brownian`, `linear`, `bias`, `rbfcos`, `periodic_Matern32`, etc. Some of these covariance functions, such as `rbfcos`, are not parametrized by a variance and a lengthscale. Furthermore, not all kernels are stationary (i.e., they can't all be written as $k(x, y) = f(x - y)$; see for example the Brownian covariance function), so for plotting it may be interesting to change the value of the fixed input:

In [ ]:
```
kb = GPy.kern.Brownian(input_dim=1)
inputs = np.array([2., 4.])
for x in inputs:
    kb.plot(x, plot_limits=[0, 5])
pb.legend(inputs)
pb.ylim(-0.1, 5.1)
```

Given a set of points $\mathbf{X}$, the covariance matrix can be computed with `C = k.K(X,X)`. The positive semi-definiteness of $k$ ensures that `C` is a positive semi-definite (psd) matrix regardless of the initial points $\mathbf{X}$. This can be checked numerically by looking at the eigenvalues:
In [ ]:
```
k = GPy.kern.Matern52(input_dim=2)
X = np.random.rand(50, 2) # 50*2 matrix of iid uniform samples
C = k.K(X, X)
eigvals = np.linalg.eigvals(C) # computes the eigenvalues of a matrix
pb.bar(np.arange(len(eigvals)), eigvals)
pb.title('Eigenvalues of the Matern 5/2 Covariance')
```

a) Is the sum of two covariance functions also a valid covariance function?

# Exercise 2 a) answer

*Hint*: there are actually two related interpretations for this. Think about the properties of a Gaussian distribution, and where the sum of Gaussian variances arises.

What about the element-wise product of two covariance functions? In other words if we define

\begin{align*}
k(\mathbf{x}, \mathbf{x}^\prime) = k_1(\mathbf{x}, \mathbf{x}^\prime) k_2(\mathbf{x}, \mathbf{x}^\prime)
\end{align*}

then is $k(\mathbf{x}, \mathbf{x}^\prime)$ a valid covariance function?

# Exercise 2 b) answer

Covariance functions can be combined in GPy using the operators `+` and `*`. So, for example, if we wish to combine an exponentiated quadratic covariance with a Matern 5/2 then we can write

In [ ]:
```
kern1 = GPy.kern.rbf(1, variance=1., lengthscale=2.)
kern2 = GPy.kern.Matern52(1, variance=2., lengthscale=4.)
kern = kern1 + kern2
print kern
kern.plot()
```

Or if we wanted to multiply them we can write
In [ ]:
```
kern = kern1*kern2
print kern
kern.plot()
```

A Gaussian process is defined by a covariance *function* and a mean *function*. When we compute the covariance matrix using `kern.K(X, X)` we are computing a covariance *matrix* between the values of the function that correspond to the input locations in the matrix `X`. If we want to have a look at the type of functions that arise from a particular Gaussian process we can never generate all values of the function, because there are infinitely many. However, we can generate samples from a Gaussian *distribution* based on a covariance matrix associated with a particular matrix of input locations `X`. If these locations are chosen appropriately then they give us a good idea of the underlying function. For example, for a one dimensional function, if we choose `X` to be uniformly spaced across part of the real line, and the spacing is small enough, we'll get an idea of the underlying function. We will now use this trick to draw sample paths from a Gaussian process.
In [ ]:
```
k = GPy.kern.rbf(input_dim=1, lengthscale=0.2)
X = np.linspace(0., 1., 500) # define X to be 500 points evenly spaced over [0,1]
X = X[:, None] # reshape X to make it n*p --- we try to use 'design matrices' in GPy
mu = np.zeros((500)) # vector of the means --- we could use a mean function here, but here it is just zero.
C = k.K(X, X) # compute the covariance matrix associated with inputs X
# Generate 20 separate sample paths from a Gaussian with mean mu and covariance C
Z = np.random.multivariate_normal(mu, C, 20)
pb.figure() # open a new plotting window
for i in range(20):
    pb.plot(X[:], Z[i, :])
```

The fine spacing of `X` means that the points are close enough together to look like functions. We can see the structure of the covariance matrix we are plotting from if we visualize `C`.
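A practical note: for densely spaced inputs the covariance matrix is numerically very close to singular, and sampling from it or factorizing it can fail. A common remedy is to add a small "jitter" to the diagonal. A numpy-only sketch (the `rbf` helper here is a local stand-in for the GPy kernel, not its API):

```python
import numpy as np

def rbf(X, lengthscale=0.2):
    # exponentiated quadratic covariance on a 1-d grid
    sqdist = (X[:, None] - X[None, :])**2
    return np.exp(-0.5*sqdist/lengthscale**2)

X = np.linspace(0., 1., 200)
C = rbf(X)
# a tiny jitter on the diagonal makes the matrix safely positive definite
C_jitter = C + 1e-6*np.eye(X.size)
L = np.linalg.cholesky(C_jitter)        # may fail without the jitter
Z = L.dot(np.random.randn(X.size, 3))   # three sample paths
print(Z.shape)  # → (200, 3)
```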
In [ ]:
```
pb.matshow(C)
```
In [ ]:
```
# Try plotting sample paths here
```

# Exercise 3 answer

In [ ]:
```
np.random.normal?
```
In [ ]:
```
X = np.linspace(0.05, 0.95, 10)[:, None]
Y = -np.cos(np.pi*X) + np.sin(4*np.pi*X) + np.random.normal(loc=0.0, scale=0.1, size=(10,1))
pb.figure()
pb.plot(X, Y, 'kx', mew=1.5)
```
In [ ]:
```
k = GPy.kern.rbf(input_dim=1, variance=1., lengthscale=1.)
```

We can then combine it with the data to form a Gaussian process model,
In [ ]:
```
m = GPy.models.GPRegression(X, Y, k)
```

A summary of the model can be obtained with `print m`.
In [ ]:
```
print m
```

The model can be plotted with `m.plot()`.
In [ ]:
```
m.plot()
```

The predictions of the GP for new inputs `Xstar` (an $m \times p$ array) can be computed using `Ystar, Vstar, up95, lo95 = m.predict(Xstar)`.

a) What do you think about this first fit? Does the prior given by the GP seem to be adapted?

# Exercise 4 a) answer

b) The parameters of the model can be modified using a regular expression matching the parameter name (e.g. `m['noise'] = 0.001`). Change the values of the parameters to obtain a better fit.
In [ ]:
```
# Exercise 4 b) answer
```

c) Random sample paths from the posterior can be obtained with `np.random.multivariate_normal(mu[:,0],C)`, where the mean vector and covariance matrix `mu`, `C` are obtained through the predict function `mu, C, up95, lo95 = m.predict(Xp,full_cov=True)`. Obtain 10 samples from the posterior and plot them alongside the data below.

In [ ]:
```
# Exercise 4 c) answer
```

In [ ]:
```
m.constrain_positive('.*') # Constrains all parameters matching .* to be positive, i.e. everything
```

Now we can optimize the model using the `m.optimize()` method.
In [ ]:
```
m.optimize()
m.plot()
print m
```
In [ ]:
```
GPy.util.datasets.authorize_download = True # skip the authorization prompt for the download.
data = GPy.util.datasets.olympic_marathon_men()
print data['details']
```
In [ ]:
```
X = data['X']
Y = data['Y']
pb.plot(X, Y, 'bx')
pb.xlabel('year')
pb.ylabel('marathon pace min/km')
```
In [ ]:
```
# Exercise 5 a) answer
kern = GPy.kern.rbf(1) + GPy.kern.bias(1)
model = GPy.models.GPRegression(X, Y, kern)
model.optimize()
model.plot()
model.log_likelihood()
```

b) Fit the same model, but this time initialize the lengthscale of the exponentiated quadratic to 0.5. What has happened? Which model has the higher log likelihood, this one or the one from (a)?

*Hint:* use `model.log_likelihood()` to compute the log likelihood.

In [ ]:
```
# Exercise 5 b) answer
```

In [ ]:
```
# Exercise 5 c) answer
```

In [ ]:
```
# Exercise 5 d) answer
```

In [ ]:
```
# Exercise 5 e) answer
```

In [ ]:
```
# Definition of the Branin test function
def branin(X):
    y = (X[:,1]-5.1/(4*np.pi**2)*X[:,0]**2+5*X[:,0]/np.pi-6)**2
    y += 10*(1-1/(8*np.pi))*np.cos(X[:,0])+10
    return(y)

# Training set defined as a 5*5 grid:
xg1 = np.linspace(-5, 10, 5)
xg2 = np.linspace(0, 15, 5)
X = np.zeros((xg1.size * xg2.size, 2))
for i, x1 in enumerate(xg1):
    for j, x2 in enumerate(xg2):
        X[i+xg1.size*j, :] = [x1, x2]
Y = branin(X)[:, None]
```
A rough estimate of the mean of the Branin function can be obtained by computing `np.mean(Y)`. Since the points are distributed on a grid, this can be seen as the approximation of the integral by a rough Riemann sum. The result can be compared with the actual mean of the Branin function, which is 54.31.
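As a cross-check of the quoted value, we can estimate the true mean by Monte Carlo with many more samples than the 5*5 grid (a sketch, re-defining the Branin function locally so it runs standalone):

```python
import numpy as np

def branin(X):
    # Branin test function on [-5, 10] x [0, 15]
    y = (X[:,1] - 5.1/(4*np.pi**2)*X[:,0]**2 + 5*X[:,0]/np.pi - 6)**2
    y += 10*(1 - 1/(8*np.pi))*np.cos(X[:,0]) + 10
    return y

np.random.seed(0)
U = np.random.uniform(size=(200000, 2))
U[:, 0] = U[:, 0]*15 - 5   # first input on [-5, 10]
U[:, 1] = U[:, 1]*15       # second input on [0, 15]
mc_mean = np.mean(branin(U))
print(round(mc_mean, 1))   # close to the quoted mean of 54.31
```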

In [ ]:
```
# Fit a GP
# Create an exponentiated quadratic plus bias covariance function
kg = GPy.kern.rbf(input_dim=2, ARD=True)
kb = GPy.kern.bias(input_dim=2)
k = kg + kb
# Build a GP model
m = GPy.models.GPRegression(X, Y, k, normalize_Y=True)
# constrain parameters to be bounded
m.constrain_bounded('.*rbf_var', 1e-3, 1e5)
m.constrain_bounded('.*bias_var', 1e-3, 1e5)
m.constrain_bounded('.*rbf_len', .1, 200.)
# fix the noise variance
m.constrain_fixed('.*noise', 1e-5)
# Randomize the model and optimize
m.randomize()
m.optimize()
# Plot the resulting approximation to Branin
# Here you get a two-d plot because the function is two dimensional.
m.plot()
# Compute the mean of the model prediction on 1e5 Monte Carlo samples
Xp = np.random.uniform(size=(100000, 2))
Xp[:, 0] = Xp[:, 0]*15 - 5
Xp[:, 1] = Xp[:, 1]*15
Yp = m.predict(Xp)[0]
np.mean(Yp)
```

a) Has the approximation of the mean been improved by using the GP model?

# Exercise 6 a) answer
In [ ]:
```
# Exercise 6 b) answer
```

In [ ]:
```
# Exercise 7 a) answer
```

b) Compute the probability that the best predictor is greater than the threshold.

In [ ]:
```
# Exercise 7 b) answer
```

c) Compute some confidence intervals for the previous result.

In [ ]:
```
# Exercise 7 c) answer
```

These two values can be compared with the actual value, $P(f(U) > 200) = 1.23\times 10^{-2}$.

# Exercise 8 a) answer

In [ ]:
```
# Exercise 8 b) answer
```