Gaussian Process Summer School, 14th September 2015
written by Nicolas Durrande, Neil Lawrence and James Hensman
The aim of this lab session is to illustrate the concepts seen during the lectures. We will focus on three aspects of GPs: the kernel, the random sample paths and the GP regression model.
Since the background of the attendees is very diverse, this session will attempt to cover a large spectrum, from the very basics of GPs to more technical questions. With this in mind, the difficulty of the questions varies. We do not mark these labs in any way; you should focus on the parts that you find most useful. In this session you should complete at least as far as Exercise 5.
We assume that GPy is already installed on your machine. You can get instructions on how to install GPy from the SheffieldML github page. They are written as markdown in the README.md file, which is automatically parsed for you just under the file listing there.
We first tell the ipython notebook that we want the plots to appear inline, then we import the libraries we will need:
In [1]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
import GPy
The current draft of the online documentation of GPy is available from this page.
Let's start with defining an exponentiated quadratic covariance function (also known as squared exponential or rbf or Gaussian) in one dimension:
In [2]:
d = 1 # input dimension
var = 1. # variance
theta = 0.2 # lengthscale
k = GPy.kern.RBF(d, variance=var, lengthscale=theta)
A summary of the kernel can be obtained using the command print k.
In [3]:
print k
It is also possible to plot the kernel as a function of one of its inputs (whilst fixing the other) with k.plot().
Note: if you need help with a command in ipython notebook, then you can get it at any time by typing a question mark after the command, e.g. k.plot?
In [7]:
k.plot()
Out[7]:
The value of the covariance function parameters can be accessed and modified using k['.*var'], where the string in brackets is a regular expression matching the parameter name as it appears in print k. Let's use this to get an insight into the effect of the parameters on the shape of the covariance function.
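For example, here is a minimal sketch of reading and setting parameters through this regular-expression indexing (the patterns below are assumed to match the names shown by print k):
In [ ]:
print k['.*lengthscale']   # read the current lengthscale
k['.*lengthscale'] = 0.5   # set it to a new value via the regular expression
k['.*var'] = 2.            # the variance can be changed in the same way
print k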
We'll now use this to set the lengthscale of the covariance to different values, and then plot the resulting covariance using the k.plot() method.
In [8]:
k = GPy.kern.RBF(d) # By default, the parameters are set to 1.
theta = np.asarray([0.2,0.5,1.,2.,4.])
for t in theta:
    k.lengthscale = t
    k.plot()
plt.legend(theta)
Out[8]:
a) What is the effect of the lengthscale parameter on the covariance function?
b) Now change the code used above for plotting the covariances associated with the lengthscale to see the influence of the variance parameter. What is the effect of the variance parameter on the covariance function?
In [15]:
k = GPy.kern.RBF(d) # By default, the parameters are set to 1.
theta = np.asarray([1.0, 2.0, 4.0, 8.0])
for t in theta:
    k.variance = t
    k.plot()
plt.legend(theta)
Out[15]:
Many covariance functions are already implemented in GPy. Instead of rbf, try constructing and plotting the following covariance functions: exponential, Matern32, Matern52, Brownian, linear, bias, rbfcos, periodic_Matern32, etc. Some of these covariance functions, such as rbfcos, are not parametrized by a variance and a lengthscale. Furthermore, not all kernels are stationary (i.e., they can't all be written as $k(x, y) = f(x - y)$, see for example the Brownian covariance function), so for plotting it may be interesting to change the value of the fixed input:
In [17]:
kb = GPy.kern.Brownian(input_dim=1)
inputs = np.array([2., 4.])
for x in inputs:
    kb.plot(x, plot_limits=[0, 5])
plt.legend(inputs)
plt.ylim(-0.1, 5.1)
Out[17]:
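As a further example, here is a minimal sketch comparing a few of the stationary kernels listed above (the class names below are assumptions based on this GPy release and may differ in other versions):
In [ ]:
k_exp = GPy.kern.Exponential(input_dim=1)
k_m32 = GPy.kern.Matern32(input_dim=1)
k_m52 = GPy.kern.Matern52(input_dim=1)
for k_i in [k_exp, k_m32, k_m52]:
    k_i.plot()   # plot each kernel on the same axes
plt.legend(['Exponential', 'Matern 3/2', 'Matern 5/2'])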
Let $\mathbf{X}$ be a $n$ × $d$ numpy array. Given a kernel $k$, the covariance matrix associated to
$\mathbf{X}$ is obtained with C = k.K(X,X)
. The positive semi-definiteness of $k$ ensures that C
is a positive semi-definite (psd) matrix regardless of the initial points $\mathbf{X}$. This can be
checked numerically by looking at the eigenvalues:
In [18]:
k = GPy.kern.Matern52(input_dim=2)
X = np.random.rand(50,2) # 50*2 matrix of iid samples, uniform on [0,1)
C = k.K(X,X)
eigvals = np.linalg.eigvals(C) # Computes the eigenvalues of a matrix
plt.bar(np.arange(len(eigvals)), eigvals)
plt.title('Eigenvalues of the Matern 5/2 Covariance')
Out[18]:
a) A matrix, $\mathbf{K}$, is positive semi-definite if the matrix inner product, $\mathbf{x}^\top \mathbf{K}\mathbf{x}$ is greater than or equal to zero regardless of the values in $\mathbf{x}$. Given this it should be easy to see that the sum of two positive semi-definite matrices is also positive semi-definite. In the context of Gaussian processes, this is the sum of two covariance functions. What does this mean from a modelling perspective?
Hint: there are actually two related interpretations for this. Think about the properties of a Gaussian distribution, and where the sum of Gaussian variances arises.
What about the element-wise product of two covariance functions? In other words, if we define
$$k(\mathbf{x}, \mathbf{x}^\prime) = k_1(\mathbf{x}, \mathbf{x}^\prime) \, k_2(\mathbf{x}, \mathbf{x}^\prime),$$
then is $k(\mathbf{x}, \mathbf{x}^\prime)$ a valid covariance function?
In GPy you can easily combine covariance functions you have created using the sum and product operators, +
and *
. So, for example, if we wish to combine an exponentiated quadratic covariance with a Matern 5/2 then we can write
In [13]:
kern1 = GPy.kern.RBF(1, variance=1., lengthscale=2.)
kern2 = GPy.kern.Matern52(1, variance=2., lengthscale=4.)
kern = kern1 + kern2
print kern
kern.plot()
Or if we wanted to multiply them we can write
In [14]:
kern = kern1*kern2
print kern
kern.plot()
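To relate this back to the questions above, here is a small numerical check (a sketch rather than a proof) that the covariance matrices built from the sum and product kernels are still positive semi-definite:
In [ ]:
X_check = np.random.rand(30, 1)
for k_c in [kern1 + kern2, kern1 * kern2]:
    eigvals = np.linalg.eigvalsh(k_c.K(X_check, X_check))  # eigenvalues of the symmetric covariance matrix
    print eigvals.min()   # should be non-negative, up to numerical error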
The Gaussian process provides a prior over functions: an infinite dimensional object. It is defined by a covariance function and a mean function. When we compute the covariance matrix using kern.K(X, X) we are computing a covariance matrix between the values of the function that correspond to the input locations in the matrix X. If we want to have a look at the type of functions that arise from a particular Gaussian process we can never generate all values of the function, because there are infinitely many of them. However, we can generate samples from a Gaussian distribution based on a covariance matrix associated with a particular matrix of input locations X. If these locations are chosen appropriately then they give us a good idea of the underlying function. For example, for a one dimensional function, if we choose X to be uniformly spaced across part of the real line, and the spacing is small enough, we'll get an idea of the underlying function. We will now use this trick to draw sample paths from a Gaussian process.
In [ ]:
k = GPy.kern.RBF(input_dim=1,lengthscale=0.2)
X = np.linspace(0.,1.,500) # define X to be 500 points evenly spaced over [0,1]
X = X[:,None] # reshape X to make it n*p --- we try to use 'design matrices' in GPy
mu = np.zeros((500)) # vector of the means --- we could use a mean function here, but here it is just zero.
C = k.K(X,X) # compute the covariance matrix associated with inputs X
In [22]:
_ = plt.imshow(C)
In [23]:
help(np.random.multivariate_normal)
In [19]:
# Generate 20 separate sample paths from a Gaussian with mean mu and covariance C
Z = np.random.multivariate_normal(mu,C,20)
plt.figure() # open a new plotting window
for i in range(20):
    plt.plot(X[:],Z[i,:])
Our choice of X means that the points are close enough together to look like functions. We can see the structure of the covariance matrix we are sampling from if we visualize C.
In [24]:
plt.matshow(C)
Out[24]:
Now try a range of different covariance functions and parameter values, and plot the corresponding sample paths for each using the same approach given above.
In [22]:
# Try plotting sample paths here
Can you identify the covariance structures that were used to generate the sample paths shown in the figure below?
We will now combine the Gaussian process prior with some data to form a GP regression model with GPy. We will generate data from the function $f ( x ) = − \cos(\pi x ) + \sin(4\pi x )$ over $[0, 1]$, adding some noise to give $y(x) = f(x) + \epsilon$, with the noise being Gaussian distributed, $\epsilon \sim \mathcal{N}(0, 0.01)$.
In [25]:
np.random.normal?
In [24]:
X = np.linspace(0.05,0.95,10)[:,None]
Y = -np.cos(np.pi*X) + np.sin(4*np.pi*X) + np.random.normal(loc=0.0, scale=0.1, size=(10,1))
plt.figure()
plt.plot(X,Y,'kx',mew=1.5)
Out[24]:
A GP regression model based on an exponentiated quadratic covariance function can be defined by first defining a covariance function,
In [25]:
k = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
And then combining it with the data to form a Gaussian process model,
In [26]:
m = GPy.models.GPRegression(X,Y,k)
Just as for the covariance function object, we can find out about the model using the command print m.
In [27]:
print m
Note that by default the model includes some observation noise with variance 1. We can see the posterior mean prediction and visualize the marginal posterior variances using m.plot().
In [28]:
m.plot()
Out[28]:
The actual predictions of the model for a set of points Xstar (an $m \times p$ array) can be computed using Ystar, Vstar, up95, lo95 = m.predict(Xstar).
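For instance, a minimal sketch of predicting on a grid of test points and plotting the result by hand (assuming the four-output predict signature quoted above):
In [ ]:
Xstar = np.linspace(0, 1, 100)[:, None]        # 100 test points in [0, 1]
Ystar, Vstar, up95, lo95 = m.predict(Xstar)    # mean, variance and 95% bounds
plt.plot(X, Y, 'kx', mew=1.5)                  # training data
plt.plot(Xstar, Ystar, 'b')                    # posterior mean
plt.plot(Xstar, up95, 'b--')                   # upper 95% bound
plt.plot(Xstar, lo95, 'b--')                   # lower 95% bound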
a) What do you think about this first fit? Does the prior given by the GP seem to be appropriate?
b) The parameters of the model can be modified using a regular expression matching the parameter names (for example m['noise'] = 0.001). Change the values of the parameters to obtain a better fit.
In [29]:
# Exercise 4 b) answer
c) As in Section 2, random sample paths from the conditional GP can be obtained using np.random.multivariate_normal(mu[:,0],C), where the mean vector and covariance matrix mu, C are obtained through the predict function mu, C, up95, lo95 = m.predict(Xp,full_cov=True). Obtain 10 samples from the posterior and plot them alongside the data below.
In [30]:
# Exercise 4 c) answer
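One possible sketch of the sampling step described above (just one way of answering the exercise, reusing the model m and the data X, Y from this section):
In [ ]:
Xp = np.linspace(0, 1, 200)[:, None]
mu, C, up95, lo95 = m.predict(Xp, full_cov=True)      # posterior mean and full covariance
Zp = np.random.multivariate_normal(mu[:, 0], C, 10)   # 10 posterior sample paths
plt.figure()
for i in range(10):
    plt.plot(Xp[:, 0], Zp[i, :], 'b', alpha=0.5)
plt.plot(X, Y, 'kx', mew=1.5)                         # overlay the observations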
As we have seen during the lectures, the parameter values can be estimated by maximizing the likelihood of the observations. Since we don't want one of the variances to become negative during the optimisation, we can constrain all parameters to be positive before running the optimisation.
In [34]:
m.constrain_positive()
The warnings occur because the parameters are already constrained by default; the software is warning us that they are being re-constrained.
Now we can optimize the model using the m.optimize()
method.
In [43]:
m.optimize()
m.plot()
print m
The parameters obtained after optimisation can be compared with the values selected by hand above. As previously, you can modify the kernel used for building the model to investigate its influence on the model.
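As a quick sketch of such an investigation (the choice of a Matern 3/2 kernel here is arbitrary):
In [ ]:
k2 = GPy.kern.Matern32(input_dim=1)
m2 = GPy.models.GPRegression(X, Y, k2)
m2.optimize()
m2.plot()
print m2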
Now we'll consider a small example with real world data, giving the pace of marathons run at the Olympics. To load the data use
In [44]:
GPy.util.datasets.authorize_download = lambda x: True # prevents requesting authorization for download.
data = GPy.util.datasets.olympic_marathon_men()
print data['details']
In [45]:
X = data['X']
Y = data['Y']
plt.plot(X, Y, 'bx')
plt.xlabel('year')
plt.ylabel('marathon pace min/km')
Out[45]:
a) Build a Gaussian process model for the olympic data set using a combination of an exponentiated quadratic and a bias covariance function. Fit the covariance function parameters and the noise to the data. Plot the fit and error bars from 1870 to 2030. Do you think the predictions are reasonable? If not why not?
In [46]:
# Exercise 5 a) answer
kern = GPy.kern.RBF(1) + GPy.kern.Bias(1)
model = GPy.models.GPRegression(X, Y, kern)
model.optimize()
model.plot()
model.log_likelihood()
Out[46]:
b) Fit the same model, but this time initialize the lengthscale of the exponentiated quadratic to 0.5. What has happened? Which model has the higher log likelihood, this one or the one from (a)?
Hint: use model.log_likelihood() for computing the log likelihood.
In [47]:
# Exercise 5 b) answer
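One possible sketch (it simply repeats the construction from part (a) with a different initial lengthscale; the variable names are arbitrary):
In [ ]:
kern_b = GPy.kern.RBF(1, lengthscale=0.5) + GPy.kern.Bias(1)
model_b = GPy.models.GPRegression(X, Y, kern_b)
model_b.optimize()
model_b.plot()
print model.log_likelihood(), model_b.log_likelihood()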
c) Modify your model to use a sum of three covariance functions: an exponentiated quadratic part, a Matern 3/2 part and a bias covariance. Set the initial lengthscale of the exponentiated quadratic to 80 years and the initial lengthscale of the Matern 3/2 to 10 years. Optimize the new model and plot the fit again. How does it compare with the previous model?
In [43]:
# Exercise 5 c) answer
d) Repeat part c), but now initialize both of the covariance functions' lengthscales to 20 years. Check the model parameters; what happens now?
In [44]:
# Exercise 5 d) answer
e) Now model the data with a product of an exponentiated quadratic covariance function and a linear covariance function. Fit the covariance function parameters. Why are the variance parameters of the linear part so small? How could this be fixed?
In [48]:
# Exercise 5 e) answer
Let $x$ be a random variable defined over the real numbers, $\Re$, and $f(\cdot)$ be a function mapping between the real numbers $\Re \rightarrow \Re$. Uncertainty propagation is the study of the distribution of the random variable $f ( x )$.
We will see in this section the advantage of using a model when only a few observations of $f$ are available. We consider here the 2-dimensional Branin test function defined over $[-5, 10] \times [0, 15]$ and a set of 25 observations, computed below.
In [56]:
# Definition of the Branin test function
def branin(X):
    y = (X[:,1]-5.1/(4*np.pi**2)*X[:,0]**2+5*X[:,0]/np.pi-6)**2
    y += 10*(1-1/(8*np.pi))*np.cos(X[:,0])+10
    return(y)

# Training set defined as a 5*5 grid:
xg1 = np.linspace(-5,10,5)
xg2 = np.linspace(0,15,5)
X = np.zeros((xg1.size * xg2.size,2))
for i,x1 in enumerate(xg1):
    for j,x2 in enumerate(xg2):
        X[i+xg1.size*j,:] = [x1,x2]
Y = branin(X)[:,None]
We assume here that we are interested in the distribution of $f (U )$ where $U$ is a random variable with uniform distribution over the input space of $f$. We will focus on the computation of two quantities: $E[ f (U )]$ and $P( f (U ) > 200)$.
The expectation of $f(U)$ is given by $\int f(x)\,p(x)\,\text{d}x$, where $p$ is the uniform density over the input domain. A basic approach to approximate this integral is to compute the mean of the 25 observations: np.mean(Y). Since the points are distributed on a grid, this can be seen as the approximation of the integral by a rough Riemann sum. The result can be compared with the actual mean of the Branin function, which is 54.31.
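As a sketch, the rough estimate is a single line:
In [ ]:
print np.mean(Y)   # crude Riemann-sum estimate of E[f(U)]; compare with the true value 54.31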
Alternatively, we can fit a GP model and compute the integral of the best predictor by Monte Carlo sampling:
In [67]:
# Fit a GP
# Create an exponentiated quadratic plus bias covariance function
kg = GPy.kern.RBF(input_dim=2, ARD = True)
kb = GPy.kern.Bias(input_dim=2)
k = kg + kb
# Build a GP model
m = GPy.models.GPRegression(X,Y,k)
# fix the noise variance
m.likelihood.variance.fix(1e-5)
# Randomize the model and optimize
m.randomize()
m.optimize()
# Plot the resulting approximation to Branin
# Here you get a two-d plot because the function is two dimensional.
m.plot()
# Compute the mean of model prediction on 1e5 Monte Carlo samples
Xp = np.random.uniform(size=(100000, 2))
Xp[:,0] = Xp[:,0]*15-5
Xp[:,1] = Xp[:,1]*15
mu, var = m.predict(Xp)
np.mean(mu)
Out[67]:
a) Has the approximation of the mean been improved by using the GP model?
b) One particular feature of GPs that we have not used so far is their prediction variance. Can you use it to define some confidence intervals around the previous result?
In [62]:
# Exercise 6 b) answer
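One possible sketch: draw joint posterior samples of the function at a moderate number of Monte Carlo points and look at the spread of the resulting mean estimates (the numbers of points and samples below are arbitrary choices, and the two-output predict call follows the cell above):
In [ ]:
Xs = np.random.uniform(size=(2000, 2))
Xs[:, 0] = Xs[:, 0]*15 - 5
Xs[:, 1] = Xs[:, 1]*15
mu_s, C_s = m.predict(Xs, full_cov=True)                   # joint posterior over the sample points
Fs = np.random.multivariate_normal(mu_s[:, 0], C_s, 100)   # 100 posterior draws of f at Xs
means = Fs.mean(axis=1)                                    # one estimate of E[f(U)] per draw
print np.percentile(means, 2.5), np.percentile(means, 97.5)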
In various cases it is interesting to look at the probability that $f$ exceeds a given threshold. For example, assume that $f$ is the response of a physical model representing the maximum stress in a structure, depending on some parameters of the system such as the Young's modulus of the material (say $Y$) and the force applied to the structure (say $F$). If the latter are uncertain, the probability of failure of the structure is given by $P(f(Y, F) > f_\text{max})$ where $f_\text{max}$ is the maximum acceptable stress.
a) As previously, use the 25 observations to compute a rough estimate of the probability that $f (U ) > 200$.
In [63]:
# Exercise 7 a) answer
b) Compute the probability that the best predictor is greater than the threshold.
In [64]:
# Exercise 7 b) answer
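A minimal sketch of both rough estimates, the empirical frequency over the 25 grid observations and the frequency with which the model's best predictor exceeds the threshold at the Monte Carlo points (reusing mu from the cell above):
In [ ]:
print np.mean(Y > 200)    # crude estimate from the 25 observations
print np.mean(mu > 200)   # estimate based on the GP mean prediction at the Monte Carlo points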
c) Compute some confidence intervals for the previous result
In [65]:
# Exercise 7 c) answer
These two values can be compared with the actual value $P(f(U) > 200) = 1.23 \times 10^{-2}$.
We now assume that we have an extra budget of 10 evaluations of $f$ and we want to use these new evaluations to improve the accuracy of the previous result.
a) Given the previous GP model, where is it interesting to add the new observations if we want to improve the accuracy of the estimator and reduce its variance?
b) Can you think about (and implement!) a procedure that sequentially updates the model with new points in order to improve the estimation of $P( f (U ) > 200)$?
In [66]:
# Exercise 8 b) answer
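One possible sketch of such a procedure (many criteria are possible; here, at each step, we simply add the candidate point where the posterior mean is closest to the threshold relative to the posterior standard deviation, i.e. where the model is least certain whether f exceeds 200):
In [ ]:
Xnew, Ynew = X.copy(), Y.copy()
for it in range(10):
    m_seq = GPy.models.GPRegression(Xnew, Ynew, GPy.kern.RBF(input_dim=2, ARD=True))
    m_seq.likelihood.variance.fix(1e-5)
    m_seq.optimize()
    # score candidate points by how uncertain the model is about the event f > 200
    Xc = np.random.uniform(size=(2000, 2))
    Xc[:, 0] = Xc[:, 0]*15 - 5
    Xc[:, 1] = Xc[:, 1]*15
    mu_c, var_c = m_seq.predict(Xc)
    score = np.abs(mu_c[:, 0] - 200) / np.sqrt(var_c[:, 0])
    x_next = Xc[np.argmin(score)][None, :]
    Xnew = np.vstack([Xnew, x_next])
    Ynew = np.vstack([Ynew, branin(x_next)[:, None]])
# final estimate of the exceedance probability with the updated model
mu_f, var_f = m_seq.predict(Xp)
print np.mean(mu_f > 200)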