Lab session 1: Gaussian Process models with GPy

Gaussian Process Summer School, 14th September 2015

written by Nicolas Durrande, Neil Lawrence and James Hensman

The aim of this lab session is to illustrate the concepts seen during the lectures. We will focus on three aspects of GPs: the kernel, the random sample paths and the GP regression model.

Since the background of the attendees is very diverse, this session will attempt to cover a large spectrum from the very basics of GPs to more technical questions. With this in mind, the difficulty of the questions in the lab session varies. We do not mark these labs in any way; you should focus on the parts that you find most useful. In this session you should complete at least as far as Exercise 5.

1 Getting started: The Covariance Function

We assume that GPy is already installed on your machine. You can get instructions on how to install GPy from the SheffieldML github page. They are written as markdown in the README.md file, which is automatically parsed for you just under the file listing there.
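If GPy is not installed, a minimal way to install it from within the notebook is sketched below (this assumes pip is available; the README on the SheffieldML page may recommend a different or more up-to-date procedure):


In [ ]:
# Install GPy from PyPI (skip this if GPy is already installed).
!pip install GPy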

We first tell the ipython notebook that we want the plots to appear inline, then we import the libraries we will need:


In [1]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
import GPy

The current draft of the online documentation of GPy is available from this page.

Let's start with defining an exponentiated quadratic covariance function (also known as squared exponential or rbf or Gaussian) in one dimension:


In [2]:
d = 1          # input dimension
var = 1.       # variance
theta = 0.2    # lengthscale
k = GPy.kern.RBF(d, variance=var, lengthscale=theta)

A summary of the kernel can be obtained using the command print k.


In [3]:
print k


  rbf.         |  value  |  constraints  |  priors
  variance     |    1.0  |      +ve      |        
  lengthscale  |    0.2  |      +ve      |        

It is also possible to plot the kernel as a function of one of its inputs (whilst fixing the other) with k.plot().

Note: if you need help with a command in ipython notebook, then you can get it at any time by typing a question mark after the command, e.g. k.plot?


In [7]:
k.plot()


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe6e1e26990>

Setting Covariance Function Parameters

The value of the covariance function parameters can be accessed and modified using k['.*var'] where the string in brackets is a regular expression matching the parameter name as it appears in print k. Let's use this to get an insight into the effect of the parameters on the shape of the covariance function.
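For example, here is a minimal illustration of this regular-expression indexing (the expression just needs to match the parameter name uniquely):


In [ ]:
k = GPy.kern.RBF(d, variance=var, lengthscale=theta)
print k['.*var']            # read the variance parameter via a regular expression
k['.*lengthscale'] = 0.5    # set the lengthscale in the same way
print k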

We'll now use this to set the lengthscale of the covariance to different values, and then plot the resulting covariance using the k.plot() method.


In [8]:
k = GPy.kern.RBF(d)     # By default, the parameters are set to 1.
theta = np.asarray([0.2,0.5,1.,2.,4.])
for t in theta:
    k.lengthscale=t
    k.plot()
plt.legend(theta)


Out[8]:
<matplotlib.legend.Legend at 0x7fe6e1d6b250>

Exercise 1

a) What is the effect of the lengthscale parameter on the covariance function?

# Exercise 1 a) answer The lengthscale parameter determines how quickly the influence of a point decays with distance. As the lengthscale decreases, the influence of far-away points decreases.

b) Now change the code used above for plotting the covariances associated with the lengthscale to see the influence of the variance parameter. What is the effect of the variance parameter on the covariance function?


In [15]:
k = GPy.kern.RBF(d)     # By default, the parameters are set to 1.
theta = np.asarray([1.0, 2.0, 4.0, 8.0])
for t in theta:
    k.variance=t
    k.plot()
plt.legend(theta)


Out[15]:
<matplotlib.legend.Legend at 0x7fe6e16dfd50>
# Exercise 1 b) answer The variance parameter scales the overall magnitude of the covariance.

Covariance Functions in GPy

Many covariance functions are already implemented in GPy. Instead of rbf, try constructing and plotting the following covariance functions: exponential, Matern32, Matern52, Brownian, linear, bias, rbfcos, periodic_Matern32, etc. Some of these covariance functions, such as rbfcos, are not parametrized by a variance and a lengthscale. Furthermore, not all kernels are stationary (i.e., they can't all be written as $k(x, y) = f(x - y)$; see for example the Brownian covariance function), so for plotting it may be interesting to change the value of the fixed input:


In [17]:
kb = GPy.kern.Brownian(input_dim=1)
inputs = np.array([2., 4.])
for x in inputs:
    kb.plot(x,plot_limits=[0,5])
plt.legend(inputs)
plt.ylim(-0.1,5.1)


Out[17]:
(-0.1, 5.1)
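Here is a further sketch that constructs and plots two more of the kernels listed above, the Exponential and the Matern 3/2 (the others follow the same pattern, although some have different parameter names):


In [ ]:
k_exp = GPy.kern.Exponential(input_dim=1)
k_mat32 = GPy.kern.Matern32(input_dim=1)
plt.figure()
for kernel in [k_exp, k_mat32]:
    kernel.plot()
plt.legend(['Exponential', 'Matern32'])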

Computing the Covariance Function given the Input Data, $\mathbf{X}$

Let $\mathbf{X}$ be an $n \times d$ numpy array. Given a kernel $k$, the covariance matrix associated with $\mathbf{X}$ is obtained with C = k.K(X,X). The positive semi-definiteness of $k$ ensures that C is a positive semi-definite (psd) matrix regardless of the initial points $\mathbf{X}$. This can be checked numerically by looking at the eigenvalues:


In [18]:
k = GPy.kern.Matern52(input_dim=2)
X = np.random.rand(50,2)       # 50*2 matrix of iid uniform samples on [0,1)
C = k.K(X,X)
eigvals = np.linalg.eigvals(C)           # Computes the eigenvalues of a matrix
plt.bar(np.arange(len(eigvals)), eigvals)
plt.title('Eigenvalues of the Matern 5/2 Covariance')


Out[18]:
Text(0.5,1,u'Eigenvalues of the Matern 5/2 Covariance')

Combining Covariance Functions

Exercise 2

a) A matrix, $\mathbf{K}$, is positive semi-definite if the matrix inner product, $\mathbf{x}^\top \mathbf{K}\mathbf{x}$ is greater than or equal to zero regardless of the values in $\mathbf{x}$. Given this it should be easy to see that the sum of two positive semi-definite matrices is also positive semi-definite. In the context of Gaussian processes, this is the sum of two covariance functions. What does this mean from a modelling perspective?

# Exercise 2 a) answer

Hint: there are actually two related interpretations for this. Think about the properties of a Gaussian distribution, and where the sum of Gaussian variances arises.

b) What about the element-wise product of two covariance functions? In other words, if we define

\begin{align*} k(\mathbf{x}, \mathbf{x}^\prime) = k_1(\mathbf{x}, \mathbf{x}^\prime) k_2(\mathbf{x}, \mathbf{x}^\prime) \end{align*}

then is $k(\mathbf{x}, \mathbf{x}^\prime)$ a valid covariance function?

# Exercise 2 b) answer

Combining Covariance Functions in GPy

In GPy you can easily combine covariance functions you have created using the sum and product operators, + and *. So, for example, if we wish to combine an exponentiated quadratic covariance with a Matern 5/2 then we can write


In [13]:
kern1 = GPy.kern.RBF(1, variance=1., lengthscale=2.)
kern2 = GPy.kern.Matern52(1, variance=2., lengthscale=4.)
kern = kern1 + kern2
print kern
kern.plot()


  add.               |  Value  |  Constraint  |  Prior  |  Tied to
  rbf.variance       |    1.0  |     +ve      |         |         
  rbf.lengthscale    |    2.0  |     +ve      |         |         
  Mat52.variance     |    2.0  |     +ve      |         |         
  Mat52.lengthscale  |    4.0  |     +ve      |         |         

Or if we wanted to multiply them we can write


In [14]:
kern = kern1*kern2
print kern
kern.plot()


  mul.               |  Value  |  Constraint  |  Prior  |  Tied to
  rbf.variance       |    1.0  |     +ve      |         |         
  rbf.lengthscale    |    2.0  |     +ve      |         |         
  Mat52.variance     |    2.0  |     +ve      |         |         
  Mat52.lengthscale  |    4.0  |     +ve      |         |         

2 Sampling from a Gaussian Process

The Gaussian process provides a prior over an infinite-dimensional object: a function. It is defined by a covariance function and a mean function. When we compute the covariance matrix using kern.K(X, X) we are computing a covariance matrix between the values of the function that correspond to the input locations in the matrix X. If we want to have a look at the type of functions that arise from a particular Gaussian process, we can never generate all values of the function, because there are infinitely many of them. However, we can generate samples from a Gaussian distribution based on a covariance matrix associated with a particular matrix of input locations X. If these locations are chosen appropriately then they give us a good idea of the underlying function. For example, for a one dimensional function, if we choose X to be uniformly spaced across part of the real line, and the spacing is small enough, we'll get an idea of the underlying function. We will now use this trick to draw sample paths from a Gaussian process.


In [ ]:
k = GPy.kern.RBF(input_dim=1,lengthscale=0.2)

X = np.linspace(0.,1.,500) # define X to be 500 points evenly spaced over [0,1]
X = X[:,None] # reshape X to make it n*p --- we try to use 'design matrices' in GPy 

mu = np.zeros((500)) # vector of the means --- we could use a mean function here, but here it is just zero.
C = k.K(X,X) # compute the covariance matrix associated with inputs X

In [22]:
_ = plt.imshow(C)



In [23]:
help(np.random.multivariate_normal)


Help on built-in function multivariate_normal:

multivariate_normal(...)
    multivariate_normal(mean, cov[, size, check_valid, tol])
    
    Draw random samples from a multivariate normal distribution.
    
    The multivariate normal, multinormal or Gaussian distribution is a
    generalization of the one-dimensional normal distribution to higher
    dimensions.  Such a distribution is specified by its mean and
    covariance matrix.  These parameters are analogous to the mean
    (average or "center") and variance (standard deviation, or "width,"
    squared) of the one-dimensional normal distribution.
    
    Parameters
    ----------
    mean : 1-D array_like, of length N
        Mean of the N-dimensional distribution.
    cov : 2-D array_like, of shape (N, N)
        Covariance matrix of the distribution. It must be symmetric and
        positive-semidefinite for proper sampling.
    size : int or tuple of ints, optional
        Given a shape of, for example, ``(m,n,k)``, ``m*n*k`` samples are
        generated, and packed in an `m`-by-`n`-by-`k` arrangement.  Because
        each sample is `N`-dimensional, the output shape is ``(m,n,k,N)``.
        If no shape is specified, a single (`N`-D) sample is returned.
    check_valid : { 'warn', 'raise', 'ignore' }, optional
        Behavior when the covariance matrix is not positive semidefinite.
    tol : float, optional
        Tolerance when checking the singular values in covariance matrix.
    
    Returns
    -------
    out : ndarray
        The drawn samples, of shape *size*, if that was provided.  If not,
        the shape is ``(N,)``.
    
        In other words, each entry ``out[i,j,...,:]`` is an N-dimensional
        value drawn from the distribution.
    
    Notes
    -----
    The mean is a coordinate in N-dimensional space, which represents the
    location where samples are most likely to be generated.  This is
    analogous to the peak of the bell curve for the one-dimensional or
    univariate normal distribution.
    
    Covariance indicates the level to which two variables vary together.
    From the multivariate normal distribution, we draw N-dimensional
    samples, :math:`X = [x_1, x_2, ... x_N]`.  The covariance matrix
    element :math:`C_{ij}` is the covariance of :math:`x_i` and :math:`x_j`.
    The element :math:`C_{ii}` is the variance of :math:`x_i` (i.e. its
    "spread").
    
    Instead of specifying the full covariance matrix, popular
    approximations include:
    
      - Spherical covariance (`cov` is a multiple of the identity matrix)
      - Diagonal covariance (`cov` has non-negative elements, and only on
        the diagonal)
    
    This geometrical property can be seen in two dimensions by plotting
    generated data-points:
    
    >>> mean = [0, 0]
    >>> cov = [[1, 0], [0, 100]]  # diagonal covariance
    
    Diagonal covariance means that points are oriented along x or y-axis:
    
    >>> import matplotlib.pyplot as plt
    >>> x, y = np.random.multivariate_normal(mean, cov, 5000).T
    >>> plt.plot(x, y, 'x')
    >>> plt.axis('equal')
    >>> plt.show()
    
    Note that the covariance matrix must be positive semidefinite (a.k.a.
    nonnegative-definite). Otherwise, the behavior of this method is
    undefined and backwards compatibility is not guaranteed.
    
    References
    ----------
    .. [1] Papoulis, A., "Probability, Random Variables, and Stochastic
           Processes," 3rd ed., New York: McGraw-Hill, 1991.
    .. [2] Duda, R. O., Hart, P. E., and Stork, D. G., "Pattern
           Classification," 2nd ed., New York: Wiley, 2001.
    
    Examples
    --------
    >>> mean = (1, 2)
    >>> cov = [[1, 0], [0, 1]]
    >>> x = np.random.multivariate_normal(mean, cov, (3, 3))
    >>> x.shape
    (3, 3, 2)
    
    The following is probably true, given that 0.6 is roughly twice the
    standard deviation:
    
    >>> list((x[0,0,:] - mean) < 0.6)
    [True, True]


In [19]:
# Generate 20 separate sample paths from a Gaussian with mean mu and covariance C
Z = np.random.multivariate_normal(mu,C,20)

plt.figure()     # open a new plotting window
for i in range(20):
    plt.plot(X[:],Z[i,:])


Our choice of X means that the points are close enough together to look like functions. We can see the structure of the covariance matrix we are sampling from by visualizing C.


In [24]:
plt.matshow(C)


Out[24]:
<matplotlib.image.AxesImage at 0x7fe6e05d3110>

Now try a range of different covariance functions and values and plot the corresponding sample paths for each using the same approach given above.


In [22]:
# Try plotting sample paths here
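For instance, here is a minimal sketch drawing sample paths from a Matern 3/2 kernel, reusing the X and mu defined above (any of the other kernels can be substituted):


In [ ]:
k_m32 = GPy.kern.Matern32(input_dim=1, lengthscale=0.2)
C_m32 = k_m32.K(X, X)                                    # covariance matrix at the same 500 input locations
Z_m32 = np.random.multivariate_normal(mu, C_m32, 20)     # 20 sample paths
plt.figure()
for i in range(20):
    plt.plot(X[:], Z_m32[i, :])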

Exercise 3

Can you identify the covariance structures that were used to generate the sample paths shown in the figure below?

# Exercise 3 answer

3 A Gaussian Process Regression Model

We will now combine the Gaussian process prior with some data to form a GP regression model with GPy. We will generate data from the function $f ( x ) = − \cos(\pi x ) + \sin(4\pi x )$ over $[0, 1]$, adding some noise to give $y(x) = f(x) + \epsilon$, with the noise being Gaussian distributed, $\epsilon \sim \mathcal{N}(0, 0.01)$.


In [25]:
np.random.normal?

In [24]:
X = np.linspace(0.05,0.95,10)[:,None]
Y = -np.cos(np.pi*X) + np.sin(4*np.pi*X) + np.random.normal(loc=0.0, scale=0.1, size=(10,1)) 
plt.figure()
plt.plot(X,Y,'kx',mew=1.5)


Out[24]:
[<matplotlib.lines.Line2D at 0x7f3a108e0810>]

A GP regression model based on an exponentiated quadratic covariance function can be defined by first defining a covariance function,


In [25]:
k = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)

And then combining it with the data to form a Gaussian process model,


In [26]:
m = GPy.models.GPRegression(X,Y,k)

Just as for the covariance function object, we can find out about the model using the command print m.


In [27]:
print m


Name                 : GP regression
Log-likelihood       : -13.5847457469
Number of Parameters : 3
Parameters:
  GP_regression.           |  Value  |  Constraint  |  Prior  |  Tied to
  rbf.variance             |    1.0  |     +ve      |         |         
  rbf.lengthscale          |    1.0  |     +ve      |         |         
  Gaussian_noise.variance  |    1.0  |     +ve      |         |         

Note that by default the model includes some observation noise with variance 1. We can see the posterior mean prediction and visualize the marginal posterior variances using m.plot().


In [28]:
m.plot()


Out[28]:
{'dataplot': [<matplotlib.lines.Line2D at 0x7f3a0eca5a90>],
 'gpplot': [[<matplotlib.lines.Line2D at 0x7f3a0ed17650>],
  [<matplotlib.patches.Polygon at 0x7f3a0ed17890>],
  [<matplotlib.lines.Line2D at 0x7f3a0ed17ed0>],
  [<matplotlib.lines.Line2D at 0x7f3a0eca5450>]]}

The actual predictions of the model for a set of points Xstar (an $m \times p$ array) can be computed using Ystar, Vstar, up95, lo95 = m.predict(Xstar)
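For example, here is a minimal sketch of prediction on a grid of test points. Depending on the GPy version, m.predict returns either the four values listed above or only the mean and variance, so the sketch uses just the first two outputs:


In [ ]:
Xstar = np.linspace(0, 1, 100)[:, None]    # 100 test points in [0, 1]
out = m.predict(Xstar)                     # the first two outputs are the posterior mean and variance
Ystar, Vstar = out[0], out[1]
plt.figure()
plt.plot(Xstar, Ystar, 'b-')                       # posterior mean
plt.plot(Xstar, Ystar + 2*np.sqrt(Vstar), 'b--')   # approximate upper 95% bound
plt.plot(Xstar, Ystar - 2*np.sqrt(Vstar), 'b--')   # approximate lower 95% bound
plt.plot(X, Y, 'kx', mew=1.5)                      # observations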

Exercise 4

a) What do you think about this first fit? Does the prior given by the GP seem to be appropriate?

# Exercise 4 a) answer

b) The parameters of the models can be modified using a regular expression matching the parameters names (for example m['noise'] = 0.001 ). Change the values of the parameters to obtain a better fit.


In [29]:
# Exercise 4 b) answer

c) As in Section 2, random sample paths from the conditional GP can be obtained using np.random.multivariate_normal(mu[:,0],C), where the mean vector and covariance matrix mu, C are obtained through the predict function mu, C, up95, lo95 = m.predict(Xp,full_cov=True). Obtain 10 samples from the posterior and plot them alongside the data below.


In [30]:
# Exercise 4 c) answer

Covariance Function Parameter Estimation

As we have seen during the lectures, the parameter values can be estimated by maximizing the likelihood of the observations. Since we don't want any of the variances to become negative during the optimization, we can constrain all parameters to be positive before running the optimisation.


In [34]:
m.constrain_positive()


WARNING: reconstraining parameters GP_regression

The warning appears because the parameters are already constrained by default; the software is telling us that they are being reconstrained.

Now we can optimize the model using the m.optimize() method.


In [43]:
m.optimize()
m.plot()
print m


Name                 : GP regression
Log-likelihood       : -7.55526684196
Number of Parameters : 3
Parameters:
  GP_regression.           |        Value        |  Constraint  |  Prior  |  Tied to
  rbf.variance             |     0.939146973519  |     +ve      |         |         
  rbf.lengthscale          |     0.130465554097  |     +ve      |         |         
  Gaussian_noise.variance  |  1.12184151651e-10  |     +ve      |         |         

The parameters obtained after optimisation can be compared with the values selected by hand above. As previously, you can modify the kernel used for building the model to investigate its influence on the model.
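For instance, here is a minimal sketch that rebuilds the model with a Matern 5/2 kernel in place of the exponentiated quadratic (a sketch only; any of the kernels from Section 1 can be substituted in the same way):


In [ ]:
# Same data, different covariance function: a Matern 5/2 kernel.
k_m52 = GPy.kern.Matern52(input_dim=1)
m_m52 = GPy.models.GPRegression(X, Y, k_m52)
m_m52.optimize()
m_m52.plot()
print m_m52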

4 A Running Example

Now we'll consider a small example with real world data: the pace of the men's Olympic marathon gold medal winners over the years. To load the data use


In [44]:
GPy.util.datasets.authorize_download = lambda x: True # prevents requesting authorization for download.
data = GPy.util.datasets.olympic_marathon_men()
print data['details']


Olympic mens' marathon gold medal winning times from 1896 to 2012. Time given in pace (minutes per kilometer). Data is originally downloaded and collated from Wikipedia, we are not responsible for errors in the data

In [45]:
X = data['X']
Y = data['Y']
plt.plot(X, Y, 'bx')
plt.xlabel('year')
plt.ylabel('marathon pace min/km')


Out[45]:
<matplotlib.text.Text at 0x7f3a0ec4ff50>

Exercise 5

a) Build a Gaussian process model for the olympic data set using a combination of an exponentiated quadratic and a bias covariance function. Fit the covariance function parameters and the noise to the data. Plot the fit and error bars from 1870 to 2030. Do you think the predictions are reasonable? If not why not?


In [46]:
# Exercise 5 a) answer 
kern = GPy.kern.RBF(1) + GPy.kern.Bias(1)
model = GPy.models.GPRegression(X, Y, kern)
model.optimize()
model.plot(plot_limits=[1870, 2030])   # plot the fit and error bars from 1870 to 2030
model.log_likelihood()


Out[46]:
-4.8248984763138765

b) Fit the same model, but this time initialize the lengthscale of the exponentiated quadratic to 0.5. What has happened? Which model has the higher log likelihood, this one or the one from (a)?

Hint: use model.log_likelihood() for computing the log likelihood.


In [47]:
# Exercise 5 b) answer

c) Modify your model so that the covariance function combines an exponentiated quadratic part, a Matern 3/2 part and a bias covariance. Set the initial lengthscale of the exponentiated quadratic to 80 years and the initial lengthscale of the Matern 3/2 to 10 years. Optimize the new model and plot the fit again. How does it compare with the previous model?


In [43]:
# Exercise 5 c) answer

d) Repeat part c) but now initialize both of the covariance functions' lengthscales to 20 years. Check the model parameters, what happens now?


In [44]:
# Exercise 5 d) answer

e) Now model the data with a product of an exponentiated quadratic covariance function and a linear covariance function. Fit the covariance function parameters. Why are the variance parameters of the linear part so small? How could this be fixed?


In [48]:
# Exercise 5 e) answer

5 More Advanced: Uncertainty propagation

Let $x$ be a random variable defined over the real numbers, $\Re$, and $f(\cdot)$ be a function mapping between the real numbers $\Re \rightarrow \Re$. Uncertainty propagation is the study of the distribution of the random variable $f ( x )$.

We will see in this section the advantage of using a model when only a few observations of $f$ are available. We consider here the 2-dimensional Branin test function defined over [−5, 10] × [0, 15] and a set of 25 observations as seen in Figure 3.


In [56]:
# Definition of the Branin test function
def branin(X):
    y = (X[:,1]-5.1/(4*np.pi**2)*X[:,0]**2+5*X[:,0]/np.pi-6)**2
    y += 10*(1-1/(8*np.pi))*np.cos(X[:,0])+10
    return(y)

# Training set defined as a 5*5 grid:
xg1 = np.linspace(-5,10,5)
xg2 = np.linspace(0,15,5)
X = np.zeros((xg1.size * xg2.size,2))
for i,x1 in enumerate(xg1):
    for j,x2 in enumerate(xg2):
        X[i+xg1.size*j,:] = [x1,x2]

Y = branin(X)[:,None]

We assume here that we are interested in the distribution of $f(U)$ where $U$ is a random variable with uniform distribution over the input space of $f$. We will focus on the computation of two quantities: $E[f(U)]$ and $P(f(U) > 200)$.

5.1 Computation of $E[f(U)]$

The expectation of $f(U)$ is given by $\frac{1}{|D|}\int_D f(x)\,\text{d}x$, where $D = [-5, 10] \times [0, 15]$ is the input domain. A basic approach to approximating this integral is to compute the mean of the 25 observations: np.mean(Y). Since the points are distributed on a grid, this can be seen as an approximation of the integral by a rough Riemann sum. The result can be compared with the actual mean of the Branin function, which is 54.31.
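As a quick check, here is a minimal snippet computing this rough estimate (Y is the vector of 25 Branin observations defined above):


In [ ]:
# Crude Riemann-sum style estimate of E[f(U)] from the 25 grid observations;
# compare with the true mean of the Branin function, roughly 54.31 (quoted above).
print np.mean(Y)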

Alternatively, we can fit a GP model and compute the integral of the best predictor by Monte Carlo sampling:


In [67]:
# Fit a GP
# Create an exponentiated quadratic plus bias covariance function
kg = GPy.kern.RBF(input_dim=2, ARD = True)
kb = GPy.kern.Bias(input_dim=2)
k = kg + kb

# Build a GP model
m = GPy.models.GPRegression(X,Y,k)

# fix the noise variance
m.likelihood.variance.fix(1e-5)

# Randomize the model and optimize
m.randomize()
m.optimize()

# Plot the resulting approximation to Branin
# Here you get a two-dimensional plot because the function is two dimensional.
m.plot()

# Compute the mean of model prediction on 1e5 Monte Carlo samples
Xp = np.random.uniform(size=(int(1e5),2))
Xp[:,0] = Xp[:,0]*15-5
Xp[:,1] = Xp[:,1]*15
mu, var = m.predict(Xp)
np.mean(mu)


WARNING: reconstraining parameters GP_regression.Gaussian_noise.variance
Out[67]:
69.404138290697162

Exercise 6

a) Has the approximation of the mean been improved by using the GP model?

# Exercise 6 a) answer

b) One particular feature of GPs that we have not used so far is their prediction variance. Can you use it to define some confidence intervals around the previous result?


In [62]:
# Exercise 6 b) answer

5.2 Computation of $P(f(U) > 200)$

In various cases it is interesting to look at the probability that $f$ is greater than a given threshold. For example, assume that $f$ is the response of a physical model representing the maximum stress in a structure, which depends on some parameters of the system such as the Young's modulus of the material (say $Y$) and the force applied to the structure (say $F$). If the latter are uncertain, the probability of failure of the structure is given by $P(f(Y, F) > f_\text{max})$, where $f_\text{max}$ is the maximum acceptable stress.

Exercise 7

a) As previously, use the 25 observations to compute a rough estimate of the probability that $f (U ) > 200$.


In [63]:
# Exercise 7 a) answer

b) Compute the probability that the best predictor is greater than the threshold.


In [64]:
# Exercise 7 b) answer

c) Compute some confidence intervals for the previous result


In [65]:
# Exercise 7 c) answer

These two values can be compared with the actual value $P(f(U) > 200) = 1.23 \times 10^{-2}$.

We now assume that we have an extra budget of 10 evaluations of f and we want to use these new evaluations to improve the accuracy of the previous result.

Exercise 8

a) Given the previous GP model, where is it interesting to add the new observations if we want to improve the accuracy of the estimator and reduce its variance?

# Exercise 8 a) answer

b) Can you think of (and implement!) a procedure that sequentially updates the model with new points in order to improve the estimation of $P(f(U) > 200)$?


In [66]:
# Exercise 8 b) answer