First we download the latest version of the `pods` software.
$$\newcommand{\inputScalar}{x}
\newcommand{\lengthScale}{\ell}
\newcommand{\mappingVector}{\mathbf{w}}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\zerosVector}{\mathbf{0}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataScalar}{y}
\newcommand{\dataVector}{\mathbf{y}}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{\epsilon}}
\newcommand{\noiseMatrix}{\mathbf{E}}
\newcommand{\inputVector}{\mathbf{x}}
\newcommand{\basisMatrix}{\mathbf{\Phi}}
\newcommand{\basisVector}{\mathbf{\phi}}
\newcommand{\basisScalar}{\phi}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\numData}{N}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingFunctionScalar}{f}
\newcommand{\mappingFunctionVector}{\mathbf{f}}
\newcommand{\meanVector}{\boldsymbol{\mu}}
\newcommand{\meanScalar}{\mu}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{\kernelScalar}}
\newcommand{\kernelMatrix}{\mathbf{K}}
$$

In [ ]:

```python
# download the software
import urllib.request
urllib.request.urlretrieve('https://github.com/sods/ods/archive/master.zip', 'master.zip')
# unzip the software
import zipfile
zip_file = zipfile.ZipFile('./master.zip', 'r')
for name in zip_file.namelist():
    zip_file.extract(name, '.')
# add the module location to the python path.
import sys
sys.path.append("./ods-master/")
```

And next we get the plots running in line. We also load in the `pylab` library as `plt`.

In [ ]:

```python
%matplotlib inline
import pods
import pylab as plt
```

Once again, we will be working with the olympics data this week.

In [ ]:

```python
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
plt.plot(x, y, 'rx')
```

This week we're going to build on our understanding of the marginal likelihood, derived in the lectures and in the last lab session, to see its relationship with Gaussian processes. In the last lab class we sampled directly from the weight vector $\mappingVector$ and applied it to the basis matrix $\basisMatrix$ to obtain a sample from the prior and a sample from the posterior. This week we'll start by constructing the prior directly, rather than sampling the weights and then combining them with the basis.

The first thing we'll do is to set up the parameters of the model; these include the parameters of the prior, the parameters of the basis functions and the noise level.

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$$

In [ ]:

```python
# set prior variance on w
alpha = 4.
# set the order of the polynomial basis set
order = 5
# set the noise variance
sigma2 = 0.01
```

Now we have the variance, we can sample from the prior distribution to see what form we are imposing on the functions *a priori*.

Before we sample from our prior, recall the problem with the basis set from our last lab: the basis doesn't work well when predictions are made outside the $[-1, 1]$ region. Let's rescale the data to lie within that region.

In [ ]:

```python
import numpy as np
span = np.max(x) - np.min(x)
offset = np.min(x)
x -= offset
x /= span   # x is now between zero and one
x = x*2 - 1 # x is now between -1 and 1
```

Let's now compute a range of values to make predictions at, spanning the *new* space of inputs,

In [ ]:

```python
num_data = x.shape[0]
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = np.linspace(-1.2, 1.2, num_pred_data)[:, None] # input locations for predictions
```

Now let's build the basis matrices.

In [ ]:

```python
# build the basis set
Phi = np.zeros((num_data, order+1))
Phi_pred = np.zeros((num_pred_data, order+1))
for i in range(0, order+1):
    Phi[:, i:i+1] = x**i
    Phi_pred[:, i:i+1] = x_pred**i
```

In [ ]:

```python
# sample from the prior over w and push each sample through the basis
num_samples = 10
K = order + 1
for i in range(num_samples):
    z_vec = np.random.normal(size=K)
    w_sample = z_vec*np.sqrt(alpha)
    f_sample = np.dot(Phi_pred, w_sample)
    plt.plot(x_pred.flatten(), f_sample.flatten())
```

The process we have used to generate the samples is a two stage process. To obtain each function, we first generated a sample from the prior,

$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}.$$

We next applied the likelihood. The mean of the function in the likelihood is given by

$$\mappingFunctionVector = \basisMatrix \mappingVector,$$

so we plotted the result. In the lecture this week we talked about computing the marginal likelihood directly. We used properties of Gaussian densities to show that

$$\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top},$$

so we should be able to sample directly from this density. To do that we use the `np.random.multivariate_normal` command introduced in the last session. We need to sample from a multivariate normal with covariance given by $\alpha\basisMatrix\basisMatrix^\top$ and a zero mean,

In [ ]:

```python
K = alpha*np.dot(Phi_pred, Phi_pred.T)
for i in range(10):
    f_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), f_sample.flatten())
```

We can plot the covariance as an image in Python, with a colorbar to show the scale.

In [ ]:

```python
plt.imshow(K)
plt.colorbar()
```

Because the full model involves corrupting the latent function with Gaussian noise,

$$\dataVector = \mappingFunctionVector + \noiseVector$$

and the noise is sampled from an independent Gaussian distribution with variance $\dataStd^2$,

$$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2 \eye},$$

the marginal distribution of the data is

$$\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top + \dataStd^2 \eye}.$$

Sampling directly from this distribution gives us the noise corrupted functions,

In [ ]:

```python
K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)
for i in range(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

In [ ]:

```python
# increase the noise variance and look at the effect on the samples
sigma2 = 1.
K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)
for i in range(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

How would you include the noise term when sampling in the weight space point of view?
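One way to sketch an answer (with `alpha`, `sigma2` and the polynomial basis rebuilt here so the snippet stands alone): sample the weights from the prior as before, form the function through the basis, and then add an independent Gaussian noise draw to every output.

```python
import numpy as np

# rebuild the setup from the cells above so this snippet is self-contained
alpha = 4.    # prior variance on w
sigma2 = 0.01 # noise variance
order = 5
num_pred_data = 100
x_pred = np.linspace(-1.2, 1.2, num_pred_data)[:, None]
Phi_pred = np.hstack([x_pred**i for i in range(order + 1)])

# weight-space sample: w ~ N(0, alpha I)
w_sample = np.random.normal(size=order + 1)*np.sqrt(alpha)
# latent function: f = Phi w
f_sample = np.dot(Phi_pred, w_sample)
# include the noise term: y = f + eps, with eps ~ N(0, sigma2 I)
y_sample = f_sample + np.random.normal(size=f_sample.shape)*np.sqrt(sigma2)
```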

Rather than sampling from the prior over parameters, we sampled from the marginal likelihood. Specifying this marginal likelihood directly, and avoiding the intermediate weight-space representation, is what Gaussian processes are all about. In a Gaussian process you specify the *covariance function* directly, rather than *implicitly* through a basis matrix and a prior over parameters. Gaussian processes have the advantage that they can be *nonparametric*, which in simple terms means that they can have *infinite* basis functions. In the lectures we introduced the *exponentiated quadratic* covariance, also known as the RBF, the Gaussian, or the squared exponential covariance function. This covariance function is specified by

$$\kernelScalar(\inputVector_i, \inputVector_j) = \alpha \exp\left(-\frac{\ltwoNorm{\inputVector_i - \inputVector_j}^2}{2\lengthScale^2}\right),$$

where $\ltwoNorm{\inputVector_i-\inputVector_j}^2$ is the squared distance between the two input vectors.

Let's build a covariance matrix based on this function. We will compute the covariance at the points given by `x_pred`.

In [ ]:

```python
alpha = 1.0
lengthscale = 0.1
K = np.zeros((x_pred.size, x_pred.size))
for i in range(x_pred.size):
    for j in range(x_pred.size):
        K[i, j] = alpha*np.exp((-0.5*(x_pred[i]-x_pred[j])**2)/lengthscale**2)
```
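As an aside on the implementation: the same covariance matrix can be computed without the double loop by using NumPy broadcasting for the pairwise squared distances. A sketch, equivalent to the loop version up to floating point:

```python
import numpy as np

alpha = 1.0
lengthscale = 0.1
x_pred = np.linspace(-1.2, 1.2, 100)[:, None]

# (n, 1) minus (1, n) broadcasts to an (n, n) matrix of pairwise differences
sq_dist = (x_pred - x_pred.T)**2
K = alpha*np.exp(-0.5*sq_dist/lengthscale**2)
```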

Now we can display the resulting covariance as an image,

In [ ]:

```python
plt.imshow(K, interpolation='none')
plt.colorbar()
```

and sample functions from the marginal likelihood.

In [ ]:

```python
for i in range(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

In [ ]:

```python
# rebuild the covariance on the original input scale (years) and sample again
x_pred = np.linspace(1892, 2016, num_pred_data)[:, None]
alpha = 1.0
lengthscale = 4
K = np.zeros((x_pred.size, x_pred.size))
for i in range(x_pred.size):
    for j in range(x_pred.size):
        K[i, j] = alpha*np.exp((-0.5*(x_pred[i]-x_pred[j])**2)/lengthscale**2)
for i in range(10):
    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
    plt.plot(x_pred.flatten(), y_sample.flatten())
```

Have a play with the parameters of this covariance function and see what effect they have on the types of functions you observe.

We now have a probability density that represents functions. How do we make predictions with this density? The density is known as a process because it is *consistent*. By consistency, here, we mean that the model makes predictions for $\mappingFunctionVector$ that are unaffected by future values of $\mappingFunctionVector^\ast$ that are currently unobserved (such as test points). If we think of $\mappingFunctionVector^\ast$ as test points, we can still write down a joint probability density over the training observations, $\mappingFunctionVector$, and the test observations, $\mappingFunctionVector^\ast$. This joint probability density will be Gaussian, with a covariance matrix given by our covariance function, $\kernelScalar(\inputVector_i, \inputVector_j)$.
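We can check this consistency numerically (a sketch using the exponentiated quadratic covariance with some hypothetical inputs): the covariance over the training inputs alone is exactly the top-left block of the covariance over training and test inputs together, so introducing test points does not change the marginal over the training points.

```python
import numpy as np

def kern(x, xprime, alpha=1.0, lengthscale=4.0):
    return alpha*np.exp(-0.5*(x - xprime)**2/lengthscale**2)

x_train = np.linspace(0., 10., 5)[:, None] # hypothetical training inputs
x_test = np.linspace(11., 20., 3)[:, None] # hypothetical test inputs

K_train = kern(x_train, x_train.T)         # covariance over training inputs alone
x_all = np.vstack([x_train, x_test])
K_all = kern(x_all, x_all.T)               # covariance over training and test jointly

# the top-left block of the joint covariance is the training covariance
print(np.allclose(K_all[:5, :5], K_train)) # True
```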

The joint density takes the form

$$\begin{bmatrix}\mappingFunctionVector \\ \mappingFunctionVector^\ast\end{bmatrix} \sim \gaussianSamp{\zerosVector}{\begin{bmatrix}\kernelMatrix & \kernelMatrix_\ast \\ \kernelMatrix_\ast^\top & \kernelMatrix_{\ast,\ast}\end{bmatrix}},$$

where $\kernelMatrix$ is the covariance computed between all the training points, $\kernelMatrix_\ast$ is the covariance matrix computed between the training points and the test points, and $\kernelMatrix_{\ast,\ast}$ is the covariance matrix computed between all the test points. To be clear, let's compute these now for our example, using `x` and `y` for the training data (although `y` doesn't enter the covariance) and `x_pred` as the test locations.

In [ ]:

```python
# set covariance function parameters
alpha = 16.0
lengthscale = 8
# set noise variance
sigma2 = 0.05
# define a function to compute the covariance.
def kern(x, xprime):
    return alpha*np.exp((-0.5*(x-xprime)**2)/lengthscale**2)
# compute covariance for training data, x
K = np.zeros((x.size, x.size))
for i in range(x.size):
    for j in range(x.size):
        K[i, j] = kern(x[i], x[j])
# compute covariance between training data, x, and test data, x_pred
K_star = np.zeros((x.size, x_pred.size))
for i in range(x.size):
    for j in range(x_pred.size):
        K_star[i, j] = kern(x[i], x_pred[j])
# compute covariance for test data, x_pred
K_starstar = np.zeros((x_pred.size, x_pred.size))
for i in range(x_pred.size):
    for j in range(x_pred.size):
        K_starstar[i, j] = kern(x_pred[i], x_pred[j])
```

The overall covariance between our training and test data can now be plotted as

In [ ]:

```python
full_K = np.vstack([np.hstack([K, K_star]), np.hstack([K_star.T, K_starstar])])
plt.imshow(full_K)
plt.colorbar()
```

The banded structure we now observe is because some of the training points are near to some of the test points. This is how we obtain 'communication' between our training data and our test data. If there is no structure in $\kernelMatrix_\ast$ then our belief about the test data simply matches our prior.

To make predictions about the test data we need the conditional distribution: $p(\mappingFunctionVector^\ast|\mappingFunctionVector)$, or when we include noise, $p(\mappingFunctionVector^\ast | \dataVector)$. This conditional distribution is also Gaussian,

$$\mappingFunctionVector^\ast \sim \gaussianSamp{\meanVector_\mappingFunctionScalar}{\covarianceMatrix_\mappingFunctionScalar},$$

with a mean given by

$$\meanVector_\mappingFunctionScalar = \kernelMatrix^\top_\ast \left[\kernelMatrix + \dataStd^2 \eye\right]^{-1} \dataVector$$

and a covariance given by

$$\covarianceMatrix_\mappingFunctionScalar = \kernelMatrix_{\ast,\ast} - \kernelMatrix^\top_\ast \left[\kernelMatrix + \dataStd^2 \eye\right]^{-1} \kernelMatrix_\ast.$$

These results can be proved using *block matrix inverse* rules, but they are beyond the scope of this course, so you don't need to worry about remembering or rederiving them. We are simply writing them here because it is this *conditional* density that is necessary for making predictions. Let's compute those predictions for the olympic marathon data.

In [ ]:

```python
# solve the linear system rather than forming the inverse explicitly
a = np.linalg.solve(K + sigma2*np.eye(x.size), K_star)
mu_f = np.dot(a.T, y)
C_f = K_starstar - np.dot(a.T, K_star)
```

where for convenience we've defined

$$\mathbf{a} = \left[\kernelMatrix + \dataStd^2\eye\right]^{-1}\kernelMatrix_\ast.$$

We can visualize the covariance of the *conditional*,

In [ ]:

```python
plt.imshow(C_f)
plt.colorbar()
```

and we can plot the mean of the conditional

In [ ]:

```python
plt.plot(x, y, 'rx')
plt.plot(x_pred, mu_f, 'b-')
```

To visualize the uncertainty, we extract the variance of the prediction at each point from the diagonal of the conditional covariance, and take its square root to obtain the standard deviations.

In [ ]:

```python
# predictive variance is the diagonal of the conditional covariance
var_f = np.diag(C_f)[:, None]
std_f = np.sqrt(var_f)
```

They can be added to the underlying mean function to give the error bars,

In [ ]:

```python
plt.plot(x, y, 'rx')
plt.plot(x_pred, mu_f, 'b-')
plt.plot(x_pred, mu_f+2*std_f, 'b--')
plt.plot(x_pred, mu_f-2*std_f, 'b--')
```

This gives us a prediction from the Gaussian process. Remember, machine learning is data + model = prediction. Here our data is from the olympics, and our model is a Gaussian process with two parameters. The main thing the model expresses is smoothness. But it also maintains uncertainty about the function; this means we see an increase in the size of the error bars during periods like the First and Second World Wars, when no olympic marathon was held.

Now try changing the parameters of the covariance function (and the noise) to see how the predictions change.

Now try sampling from this conditional density to see what your predictions look like. What happens if you sample from the conditional density in regions a long way into the future or the past? How does this compare with the results from the polynomial model?
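As a starting point for that exercise, here is a sketch of the sampling step (with toy stand-ins for `x`, `y` and `x_pred` so the snippet runs on its own; a small jitter term is added to the conditional covariance for numerical stability):

```python
import numpy as np

alpha, lengthscale, sigma2 = 16.0, 8.0, 0.05

def kern(x, xprime):
    return alpha*np.exp(-0.5*(x - xprime)**2/lengthscale**2)

# toy stand-ins for the olympic data and the prediction locations
x = np.linspace(1896, 1980, 10)[:, None]
y = np.sin(x/20.)
x_pred = np.linspace(1870, 2030, 50)[:, None]

# training, cross and test covariances
K = kern(x, x.T)
K_star = kern(x, x_pred.T)
K_starstar = kern(x_pred, x_pred.T)

# conditional mean and covariance
a = np.linalg.solve(K + sigma2*np.eye(x.size), K_star)
mu_f = np.dot(a.T, y)
C_f = K_starstar - np.dot(a.T, K_star)

# sample from the conditional density; the jitter keeps the covariance
# numerically positive semi-definite
C_f += 1e-6*np.eye(x_pred.size)
samples = np.random.multivariate_normal(mu_f.flatten(), C_f, size=10)
```

Far from the training inputs the conditional mean reverts to zero and the variance grows back towards $\alpha$, so samples in those regions look like draws from the prior rather than extrapolating a trend, unlike the polynomial model.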