```
In [3]:
```%matplotlib inline
import numpy as np
import pods
import pylab as plt
import GPy
from IPython.display import display

In the last session we introduced Gaussian process models through constructing covariance functions in `numpy`

. The `GPy`

software is a BSD licensed software package for modeling with Gaussian processes in python. It is designed to make it easy for the user to construct models and interact with data. The software is BSD licensed to ensure that there are as few as possible constraints on its use. The `GPy`

documentation is produced with Sphinx and is available here.

In the introduction to Gaussian processes we defined covariance functions (or kernels) as functions to which we passed objects in `GPy`

covariance functions are objects.

In `GPy`

the covariance object is stored in `GPy.kern`

. There are several covariance functions available. The exponentiated quadratic covariance is stored as `RBF`

and can be created as follows.

```
In [6]:
```k = GPy.kern.RBF(input_dim=1)
display(k)

```
```

```
In [8]:
```k = GPy.kern.RBF(input_dim=1, name='signal', variance=4.0, lengthscale=2.0)
display(k)

```
```

The covariance function expresses the covariation between two points. the `.plot()`

method can be applied to the kernel to discover how the covariation changes as a function as one of the inputs with the other fixed (i.e. it plots `k(x, z)`

as a function of a one dimensional `x`

whilst keeping `z`

fixed. By default `z`

is set to `0.0`

.

```
In [9]:
```k.plot()

```
```

Here the title of the kernel is taken from the name we gave it (`signal`

).

When we constructed the covariance function we gave it particular parameters. These parameters can be changed later in a couple of different ways firstly, we can simple set the field value of the parameters. If we want to change the lengthscale of the covariance to 3.5 we do so as follows.

```
In [11]:
```k.lengthscale = 3.5
display(k)

```
```

`timescale`

for the `lengthscale`

parameter.

```
In [12]:
```k.lengthscale.name = 'timescale'
display(k)

```
```

Now we can set the time scale appropriately.

```
In [13]:
```k.timescale = 10.
display(k)

```
```

There are other types of basic covariance function in `GPy`

. For example the `Linear`

kernel,
$$
k(\mathbf{x}, \mathbf{z}) = \alpha \mathbf{x}^\top \mathbf{z}
$$
and the `Bias`

kernel,
$$
k(\mathbf{x}, \mathbf{z}) = \alpha
$$
where everything is equally correlated to each other. `Brownian`

implements Brownian motion which has a covariance function of the form,
$$
k(t, t^\prime) = \alpha \text{min}(t, t^\prime).
$$

Broadly speaking covariances fall into two classes, *stationary* covariance functions, for which the kernel can always be written in the form
$$
k(\mathbf{x}, \mathbf{z}) = f(r)
$$
where $f(\cdot)$ is a function and $r$ is the distance between the vectors $\mathbf{x}$ and $\mathbf{z}$, i.e.,
$$
r = ||\mathbf{x} - \mathbf{z}||_2 = \sqrt{\left(\mathbf{x} - \mathbf{z}\right)^\top \left(\mathbf{x} - \mathbf{z}\right)}.
$$
This partitioning is reflected in the object hierarchy in GPy. There is a base object `Kern`

and this is inherited by `Stationary`

to form the stationary covariance functions (like `RBF`

and `Matern32`

).

When using `numpy`

to construct covariances we defined a function, `kern_compute`

which returned a covariance matrix given the name of the covariance function. In `GPy`

, the base object `Kern`

implements a method `K`

. That allows us to compute the associated covariance. Visualizing the structure of this covariance matrix is often informative in understanding whether we have appropriately captured the nature of the relationship between our variables in our covariance function. In `GPy`

the input data is assumed to be provided in a matrix with $n$ rows and $p$ columns where $n$ is the number of data points and $p$ is the number of features we are dealing with. We can compute the entries to the covariance matrix for a given set of inputs `X`

as follows.

```
In [ ]:
```data = pods.datasets.olympic_marathon_men()
# Load in the times of the olympics
X = data['X']
K=k.K(X)

Now to visualize this covariation between the time points we plot it as an image using the `imshow`

command from matplotlib.

```
imshow(K, interpolation='None')
```

Setting the interpolation to `'None'`

prevents the pixels being smoothed in the image. To better visualize the structure of the covariance we've also drawn white lines at the points where the First World War and the Second World War begin. At other points the Olympics are held every four years and the covariation is given by the computing the covariance function for two points which are four years apart, appropriately scaled by the time scale.

```
In [47]:
```def visualize_olympics(K):
"""Helper function for visualizing a covariance computed on the Olympics training data."""
fig, ax = plt.subplots(figsize=(8,8))
im = ax.imshow(K, interpolation='None')
WWI_index = np.argwhere(X==1912)[0][0]
WWII_index = np.argwhere(X==1936)[0][0]
ax.axhline(WWI_index+0.5,color='w')
ax.axvline(WWI_index+0.5,color='w')
ax.axhline(WWII_index+0.5,color='w')
ax.axvline(WWII_index+0.5,color='w')
plt.colorbar(im)
visualize_olympics(k.K(X))

```
```

```
In [23]:
```k.timescale

```
Out[23]:
```

```
In [48]:
```k.timescale = 20
visualize_olympics(k.K(X))

```
```

A Gaussian process specifieds that any finite set of realizations of the function will jointly have a Gaussian density. In other words, if we are interested in the joint distribution of function values at a set of particular points, realizations of those functions can be sampled according to a Gaussian density. The Gaussian process provides a prior over an infinite dimensional object: the function. For our purposes we can think of a function as being like an infinite dimensional vector. The mean of our Gaussian process is normally specified by a vector, but is now itself specified by a *mean function*. The covariance of the process, instead of being a matrix is also a *covariance function*. But to construct the infinite dimensional matrix we need to make it a function of two arguments, $k(\mathbf{x}, \mathbf{z})$. When we compute the covariance matrix using `k.K(X)`

we are computing the values of that matrix for the different entries in $\mathbf{X}$. `GPy`

also allows us to compute the cross covariance between two input matrices, $\mathbf{X}$ and $\mathbf{Z}$.

We cannot sample a full function from a process because it consists of infinite values, however, we can obtain a finite sample from a function as described by a Gaussian process and visualize these samples as functions. This is a useful exercise as it allows us to visualize the type of functions that fall within the support of our Gaussian process prior. Careful selection of the right covariance function can improve the performance of a Gaussian process based model in practice because the covariance function allows us to bring domain knowledge to bear on the problem. If the input domain is low dimensional, then we can at least ensure that we are encoding something reasonable in the covariance through visualizing samples from the Gaussian process.

For a one dimensional function, if we select a vector of input values, represented by `X`

, to be equally spaced, and ensure that the spacing is small relative to the lengthscale of our process, then we can visualize a sample from the function by sampling from the Gaussian (or a multivariate normal) with the `numpy`

command

```
F = np.random.multivariate_normal(mu, K, num_samps).T
```

where `mu`

is the mean (which we will set to be the zero vector) and `K`

is the covariance matrix computed at the points where we wish to visualize the function. The transpose at the end ensures that the the matrix `F`

has `num_samps`

columns and $n$ rows, where $n$ is the dimensionality of the *square* covariance matrix `K`

.

Below we build a simple helper function for sampling from a Gaussian process and visualizing the result.

```
In [75]:
```def sample_covariance(kern, X, num_samps=10):
"""Sample a one dimensional function as if its from a Gaussian process with the given covariance function."""
from IPython.display import HTML
display(HTML('<h2>Samples from a Gaussian Process with ' + kern.name + ' Covariance</h2>'))
display(k)
K = kern.K(X)
# Generate samples paths from a Gaussian with zero mean and covariance K
F = np.random.multivariate_normal(np.zeros(X.shape[0]), K, num_samps).T
fig, ax = plt.subplots(figsize=(8,8))
ax.plot(X,F)

```
In [76]:
```# create an input vector
X = np.linspace(-2, 2, 200)[:, None]
# create a covariance to visualize
k = GPy.kern.RBF(input_dim=1)
# perform the samples.
sample_covariance(k, X)

```
```

```
In [82]:
```k = GPy.kern.Poly(input_dim=1, order=6)
X = np.linspace(-.8, .8, 500)[:, None]
sample_covariance(k, X)

```
```

Covariance functions can be combined in various ways to make new covariance functions.

Perhaps simplest thing you can do to combine covariance functions is to add them. The sum of two Gaussian random variables is also Gaussian distributed, with a covariance equal to the sum of the covariances of the original variables (similarly for the mean). This implies that if we add two functions drawn from Gaussian processes together, then the result is another function drawn from a Gaussian process, and the covariance function is the sum of the two covariance functions of the original process. $$ k(\mathbf{x}, \mathbf{z}) = k_1(\mathbf{x}, \mathbf{z}) + k_2(\mathbf{x}, \mathbf{z}). $$ Here the domains of the two processes don't even need to be the same, so one of the covariance functions could perhaps depend on time and the other on space, $$ k\left(\left[\mathbf{x}\quad t\right], \left[\mathbf{z}\quad t^\prime\right]\right) = k_1(\mathbf{x}, \mathbf{z}) + k_2(t, t^\prime). $$

In `GPy`

the addition operator is overloaded so that it is easy to construct new covariance functions by adding other covariance functions together.

```
In [63]:
```k1 = GPy.kern.RBF(1, variance=4.0, lengthscale=10., name='long term trend')
k2 = GPy.kern.RBF(1, variance=1.0, lengthscale=2., name='short term trend')
k = k1 + k2
k.name = 'signal'
k.long_term_trend.lengthscale.name = 'timescale'
k.short_term_trend.lengthscale.name = 'timescale'
display(k)

```
```

```
In [62]:
```k1 = GPy.kern.Linear(1)
k2 = GPy.kern.RBF(1)
k = k1*k2
display(k)
visualize_olympics(k.K(X))

```
```

`X`

means that the points are close enough together to look like functions. We can see the structure of the covariance matrix we are plotting from if we visualize C.

```
In [253]:
```plt.matshow(C)

```
Out[253]:
```

```
In [254]:
``````
# Try plotting sample paths here
```

# Exercise 3 answer

We will now combine the Gaussian process prior with some data to form a GP regression model with GPy. We will generate data from the function $f ( x ) = − \cos(\pi x ) + \sin(4\pi x )$ over $[0, 1]$, adding some noise to give $y(x) = f(x) + \epsilon$, with the noise being Gaussian distributed, $\epsilon \sim \mathcal{N}(0, 0.01)$.

```
In [255]:
```np.random.normal?

```
In [256]:
```X = np.linspace(0.05,0.95,10)[:,None]
Y = -np.cos(np.pi*X) + np.sin(4*np.pi*X) + np.random.normal(loc=0.0, scale=0.1, size=(10,1))
pb.figure()
pb.plot(X,Y,'kx',mew=1.5)

```
Out[256]:
```

```
In [265]:
```k = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=0.2)

And then combining it with the data to form a Gaussian process model,

```
In [266]:
```m = GPy.models.GPRegression(X,Y,k)

`print m`

.

```
In [25]:
```display(m)

```
```

`m.plot()`

.

```
In [268]:
```m.plot()

```
Out[268]:
```

# Exercise 4 a) answer

`m['noise'] = 0.001`

). Change the values of the parameters to obtain a better fit.

```
In [269]:
``````
# Exercise 4 b) answer
```

`np.random.multivariate_normal(mu[:,0],C)`

where the mean vector and covariance
matrix `mu`

, `C`

are obtained through the predict function `mu, C, up95, lo95 = m.predict(Xp,full_cov=True)`

. Obtain 10 samples from the posterior sample and plot them alongside the data below.

```
In [270]:
``````
# Exercise 4 c) answer
```

As we have seen during the lectures, the parameters values can be estimated by maximizing the likelihood of the observations. Since we don’t want one of the variance to become negative during the optimization, we can constrain all parameters to be positive before running the optimisation.

```
In [271]:
```m.constrain_positive('.*') # Constrains all parameters matching .* to be positive, i.e. everything

```
```

The warnings are because the parameters are already constrained by default, the software is warning us that they are being reconstrained.

Now we can optimize the model using the `m.optimize()`

method.

```
In [272]:
```m.optimize()
m.plot()
print m

```
```

```
In [17]:
```data = pods.datasets.olympic_marathon_men()
print data['details']

```
```

```
In [18]:
```X = data['X']
Y = data['Y']
plt.plot(X, Y, 'bx')
plt.xlabel('year')
plt.ylabel('marathon pace min/km')

```
Out[18]:
```

a) Build a Gaussian process model for the olympic data set using a combination of an exponentiated quadratic and a bias covariance function. Fit the covariance function parameters and the noise to the data. Plot the fit and error bars from 1870 to 2030. Do you think the predictions are reasonable? If not why not?

```
In [19]:
```# Exercise 5 a) answer
kern = GPy.kern.RBF(1, lengthscale=80) + GPy.kern.Matern52(1, lengthscale=10) + GPy.kern.Bias(1)
model = GPy.models.GPRegression(X, Y, kern)
model.optimize()
model.plot()# Exercise 5 d) answer
model.log_likelihood()
display(model)

```
```

b) Fit the same model, but this time intialize the length scale of the exponentiated quadratic to 0.5. What has happened? Which of model has the higher log likelihood, this one or the one from (a)?

*Hint:* use `model.log_likelihood()`

for computing the log likelihood.

```
In [276]:
```# Exercise 5 b) answer
kern=GPy.kern.RBF(1, lengthscale=0.5)+GPy.kern.Matern52(1,lengthscale=10)+GPy.kern.Bias(1)
m=GPy.models.GPRegression(X,Y,kern)
m.optimize()
m.plot()
model.log_likelihood()

```
Out[276]:
```

```
In [277]:
``````
# Exercise 5 c) answer
```

```
In [278]:
``````
# Exercise 5 d) answer
```

```
In [279]:
``````
# Exercise 5 e) answer
```

Let $x$ be a random variable defined over the real numbers, $\Re$, and $f(\cdot)$ be a function mapping between the real numbers $\Re \rightarrow \Re$. Uncertainty propagation is the study of the distribution of the random variable $f ( x )$.

We will see in this section the advantage of using a model when only a few observations of $f$ are available. We consider here the 2-dimensional Branin test function defined over [−5, 10] × [0, 15] and a set of 25 observations as seen in Figure 3.

```
In [23]:
```# Definition of the Branin test function
def branin(X):
y = (X[:,1]-5.1/(4*np.pi**2)*X[:,0]**2+5*X[:,0]/np.pi-6)**2
y += 10*(1-1/(8*np.pi))*np.cos(X[:,0])+10
return(y)
# Training set defined as a 5*5 grid:
xg1 = np.linspace(-5,10,5)
xg2 = np.linspace(0,15,5)
X = np.zeros((xg1.size * xg2.size,2))
for i,x1 in enumerate(xg1):
for j,x2 in enumerate(xg2):
X[i+xg1.size*j,:] = [x1,x2]
Y = branin(X)[:,None]

We assume here that we are interested in the distribution of $f (U )$ where $U$ is a random variable with uniform distribution over the input space of $f$. We will focus on the computation of two quantities: $E[ f (U )]$ and $P( f (U ) > 200)$.

The expectation of $f (U )$ is given by $\int_x f ( x )\text{d}x$. A basic approach to approximate this
integral is to compute the mean of the 25 observations: `np.mean(Y)`

. Since the points
are distributed on a grid, this can be seen as the approximation of the integral by a
rough Riemann sum. The result can be compared with the actual mean of the Branin
function which is 54.31.

Alternatively, we can fit a GP model and compute the integral of the best predictor by Monte Carlo sampling:

```
In [24]:
```# Fit a GP
# Create an exponentiated quadratic plus bias covariance function
kg = GPy.kern.RBF(input_dim=2, ARD = True)
kb = GPy.kern.Bias(input_dim=2)
k = kg + kb
# Build a GP model
m = GPy.models.GPRegression(X,Y,k)
m.Gaussian_noise.variance = 1e-5
display(m)
m.Gaussian_noise.variance.fix()
display(m)
# constrain parameters to be bounded
#m.constrain_bounded('.*rbf_var',1e-3,1e5)
#m.constrain_bounded('.*bias_var',1e-3,1e5)
#m.constrain_bounded('.*rbf_len',.1,200.)
# fix the noise variance
#m.constrain_fixed('.*noise',1e-5)
# Randomize the model and optimize
m.randomize()
m.optimize()
# Plot the resulting approximation to Brainin
# Here you get a two-d plot becaue the function is two dimensional.
m.plot()
display(m.add.rbf.lengthscale)
# Compute the mean of model prediction on 1e5 Monte Carlo samples
Xp = np.random.uniform(size=(1e5,2))
Xp[:,0] = Xp[:,0]*15-5
Xp[:,1] = Xp[:,1]*15
Yp = m.predict(Xp)[0]
np.mean(Yp)

```
Out[24]:
```

# Exercise 6 a) answer

```
In [ ]:
``````
# Exercise 6 b) answer
```

In various cases it is interesting to look at the probability that $f$ is greater than a given threshold. For example, assume that $f$ is the response of a physical model representing the maximum constraint in a structure depending on some parameters of the system such as Young’s modulus of the material (say $Y$) and the force applied on the structure (say $F$). If the later are uncertain, the probability of failure of the structure is given by $P( f (Y, F ) > \text{f_max} )$ where $f_\text{max}$ is the maximum acceptable constraint.

a) As previously, use the 25 observations to compute a rough estimate of the probability that $f (U ) > 200$.

```
In [ ]:
``````
# Exercise 7 a) answer
```

b) Compute the probability that the best predictor is greater than the threshold.

```
In [ ]:
``````
# Exercise 7 b) answer
```

c) Compute some confidence intervals for the previous result

```
In [ ]:
``````
# Exercise 7 c) answer
```

These two values can be compared with the actual value {$P( f (U ) > 200) = 1.23\times 10^{−2}$ .

We now assume that we have an extra budget of 10 evaluations of f and we want to use these new evaluations to improve the accuracy of the previous result.

a) Given the previous GP model, where is it interesting to add the new observations if we want to improve the accuracy of the estimator and reduce its variance?

# Exercise 8 a) answer

```
In [ ]:
``````
# Exercise 8 b) answer
```