The scikit-learn
package has a nice Gaussian Process example - but what is it doing? In this notebook, we review the mathematics of Gaussian Processes, and then 1) run the scikit-learn
example, 2) do the same thing by-hand with numpy/scipy, and finally 3) use the GPy
package to compare a few different kernels, on the same test dataset.
Let us look at the basics of Gaussian Processes in one dimension. See Rasmussen and Williams for a great, pedagogically smooth introduction to Gaussian Processes that will teach you everything you will need to get started.
We denote $\vec{x}=(x_0, \dots, x_N)$ a vector of 1D input values. A 1D Gaussian process $f$ is such that $f \sim \mathcal{GP} \ \Longleftrightarrow \ p(f(\vec{x}), f(\vec{x}'))\ \mathrm{is\ Gaussian} \ \forall \vec{x}, \vec{x}'$.
It is fully characterized by a mean function and a kernel, $$\begin{eqnarray*}m(\vec{x}) &=& \mathbb{E}[ f(\vec{x}) ]\\ k(\vec{x}, \vec{x}') &=& \mathbb{E}[ (f(\vec{x})-m(\vec{x}))(f(\vec{x}')-m(\vec{x}')) ]\end{eqnarray*}$$
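To make the kernel concrete, here is a minimal numpy sketch of the most commonly used kernel, the squared exponential (the function name and the `amplitude`/`length_scale` arguments are illustrative, not tied to any particular library):

import numpy as np

def squared_exponential(x, xp, amplitude=1.0, length_scale=1.0):
    """k(x, x') = amplitude * exp(-(x - x')^2 / (2 length_scale^2)) for 1D inputs."""
    return amplitude * np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / length_scale**2)

x = np.linspace(0, 10, 5)
K = squared_exponential(x, x)          # 5x5 covariance matrix of f at these inputs
print(K.shape, np.allclose(K, K.T))    # (5, 5) True -- symmetric, as a covariance must be

Evaluated on a set of inputs, the kernel gives the covariance matrix of the function values at those inputs.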
Let us consider a noisy dataset $(\vec{x},\vec{y})$ with homoskedastic errors $\vec{\epsilon}$ that are Gaussian distributed with standard deviation $\sigma$. Fitting a Gaussian Process to this data is equivalent to considering a set of basis functions $\{\phi_i(x)\}$ and finding the optimal weights $\{\omega_i\}$, which we assume to be Gaussian distributed with some covariance $\Sigma$. It can also be thought of as fitting for an unknown correlated noise term in the data. $$\begin{eqnarray*} \vec{y} &=& f(\vec{x}) + \vec{\epsilon}\\ \vec{\epsilon} &\sim & \mathcal{N}(0,\sigma^2 I)\\ f(\vec{x}) &=& \sum_i \omega_i \phi_i(\vec{x}) = \vec{\omega}^T \vec{\phi}(\vec{x}) \\ \vec{\omega} &\sim & \mathcal{N}(0,\Sigma)\\ \end{eqnarray*}$$
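As a quick illustration (a minimal sketch: the Gaussian-bump basis and the unit weight covariance $\Sigma = I$ are arbitrary choices, not anything used later in this notebook), we can draw weight vectors from their prior and check that the empirical covariance of the resulting functions matches $\vec{\phi}(\vec{x})^T \Sigma\, \vec{\phi}(\vec{x}')$, the kernel identity given below:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical basis: Gaussian bumps on a grid, with unit weight covariance Sigma = I
centers = np.linspace(0, 10, 8)

def phi(x):
    """Feature matrix with phi_i(x_j) in row i, column j."""
    return np.exp(-0.5 * (x[None, :] - centers[:, None])**2)

x = np.linspace(0, 10, 50)
Sigma = np.eye(len(centers))

# Kernel implied by this basis: k(x, x') = phi(x)^T Sigma phi(x')
K_analytic = phi(x).T @ Sigma @ phi(x)

# Monte Carlo check: draw many weight vectors, form f(x) = w^T phi(x),
# and measure the covariance of the resulting sample functions
W = rng.multivariate_normal(np.zeros(len(centers)), Sigma, size=20000)
F = W @ phi(x)                                   # each row is one sample function
K_empirical = np.cov(F, rowvar=False)

print(np.max(np.abs(K_analytic - K_empirical)))  # small; shrinks with more samples

As Rasmussen and Williams discuss, taking an appropriate limit of infinitely many such bumps recovers the squared-exponential kernel used in the rest of this notebook.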
In this case, the mean function is assumed to be zero, $m(\vec{x}) = 0$. (This is not actually very constraining, as Rasmussen and Williams explain, and it is not equivalent to assuming that the mean of $f$ is zero.)
There are as many weights as there are data points, which makes the function $f$ very flexible. The weights are constrained by their Gaussian distribution, though. Importantly, the kernel is fully characterized by the choice of basis functions, via
$$\quad k(\vec{x},\vec{x}') = \vec{\phi}(\vec{x})^T \Sigma\ \vec{\phi}(\vec{x}')$$Picking a set of basis functions is equivalent to picking a kernel, and vice versa. In the correlated-noise interpretation, it is the kernel function that is the more natural object. Typically a kernel has a handful of hyper-parameters $\vec{\theta}$ that govern the shape of the basis functions and the correlation structure of the covariance matrix of the predictions. These hyper-parameters can in turn be inferred from the data, via their log likelihood:
$$ \log p(\vec{y} | \vec{x},\vec{\theta}) = -\frac{1}{2} \vec{y}^T K^{-1} \vec{y} - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi $$(Here, the matrix $K$ has elements $K_{ij} = k(x_i,x_j) + \sigma^2 \delta_{ij}$. Note that evaluating the likelihood for a given $\vec{\theta}$ involves computing the determinant of the matrix $K$.) Fitting the hyper-parameters is often done by maximizing this likelihood - but that only gets you the "best-fit" hyper-parameters. Posterior samples of the hyper-parameters can be drawn by MCMC in the usual way.
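As a concrete, hedged sketch, here is this log marginal likelihood in numpy/scipy for a squared-exponential kernel; the `log_marginal_likelihood` function and its `theta = (amplitude, length_scale)` parametrization are illustrative, and the Cholesky factorization provides both $K^{-1}\vec{y}$ and $\log|K|$ without forming the inverse or the determinant explicitly:

import numpy as np
from scipy.linalg import cholesky, cho_solve

def log_marginal_likelihood(theta, x, y, sigma):
    """log p(y | x, theta) for a squared-exponential kernel;
    theta = (amplitude, length_scale) is an illustrative parametrization."""
    amplitude, length_scale = theta
    K = amplitude * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / length_scale**2)
    K += sigma**2 * np.eye(len(x))        # K_ij = k(x_i, x_j) + sigma^2 delta_ij
    L = cholesky(K, lower=True)           # K = L L^T
    alpha = cho_solve((L, True), y)       # alpha = K^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))  # = -0.5 log|K|
            - 0.5 * len(x) * np.log(2 * np.pi))

# Toy usage on made-up data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 15))
y = np.sin(x) + 0.3 * rng.normal(size=x.size)
print(log_marginal_likelihood((1.0, 1.0), x, y, sigma=0.3))

One could hand the negative of this function to `scipy.optimize.minimize` to get the best-fit hyper-parameters, or use it as the log-likelihood term in an MCMC sampler.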
For any given set of hyper-parameters, we can use the Gaussian Process to predict new outputs $\vec{y}^*$ at inputs $\vec{x}^*$. Thanks to the magic of Gaussian distributions and linear algebra, one can show that the posterior distribution for the process evaluated at new inputs $\vec{x}^*$ given a fit to the existing values $(\vec{x},\vec{y})$ is also Gaussian:
$$p( f(\vec{x}^*) | \vec{y}, \vec{x}, \vec{x}^* ) \ = \ \mathcal{N}(\bar{f}, \bar{k})$$The mean of this PDF for $f(\vec{x}^*)$ is
$$\bar{f} \ =\ k(\vec{x}^*,\vec{x})[k(\vec{x},\vec{x}) + \sigma^2 I]^{-1} \vec{y}$$and its covariance is
$$\bar{k} = k(\vec{x}^*,\vec{x}^*) - k(\vec{x}^*,\vec{x}) [k(\vec{x},\vec{x}) + \sigma^2 I]^{-1}k(\vec{x}^*,\vec{x})^T $$Once the kernel is chosen, one can fit the data and make predictions for new data in a single sequence of linear algebra operations. Note that this requires inverting (or factorizing) an $N \times N$ matrix, which scales as $\mathcal{O}(N^3)$ in the number of data points, so Gaussian Processes can be computationally very expensive - and note that the weights are effectively being fit (marginalized over) inside the linear algebra that produces $\bar{f}$.
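Here is a self-contained numpy sketch of these two formulas on made-up data (the `sqexp` kernel, the noise level, and the toy dataset are all illustrative); the same computation is repeated below, by hand, with the kernel actually fitted by scikit-learn:

import numpy as np

def sqexp(a, b, length_scale=1.0):
    """Squared-exponential kernel matrix k(a_i, b_j) for 1D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / length_scale**2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 10))            # toy training inputs
y = np.sin(x) + 0.1 * rng.normal(size=x.size)  # toy noisy observations
sigma = 0.1
xs = np.linspace(0, 10, 200)                   # prediction inputs x*

Kinv = np.linalg.inv(sqexp(x, x) + sigma**2 * np.eye(x.size))
fbar = sqexp(xs, x) @ Kinv @ y                               # predictive mean
kbar = sqexp(xs, xs) - sqexp(xs, x) @ Kinv @ sqexp(xs, x).T  # predictive covariance
std = np.sqrt(np.clip(np.diag(kbar), 0, None))  # clip tiny negative round-off values

The explicit `np.linalg.inv` is used here only for readability; the by-hand section below uses a Cholesky factorization instead, which is cheaper and numerically safer.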
Inferring the hyper-parameters of the kernel makes GPs even more expensive, thanks to the determinant calculation involved.
To generate large numbers of predictions, one just makes a long vector $\vec{x}^*$. We'll see this in the code below, when generating smooth functions to plot through sparse and noisy data. The mean prediction $\bar{f}$ is linear in the input data $y$, which is quite remarkable.
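A quick, purely illustrative check of that linearity: once the kernel, the training inputs, and the noise level are fixed, the operator $k(\vec{x}^*,\vec{x})[k(\vec{x},\vec{x})+\sigma^2 I]^{-1}$ is a fixed matrix acting on $\vec{y}$, so predictions obey superposition (everything below is made up for the demonstration):

import numpy as np

def sqexp(a, b, length_scale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / length_scale**2)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 15))   # training inputs
xs = np.linspace(0, 10, 1000)         # a long vector of prediction inputs
sigma = 0.3

# The prediction operator is a fixed matrix once kernel, inputs, and noise are fixed
M = sqexp(xs, x) @ np.linalg.inv(sqexp(x, x) + sigma**2 * np.eye(x.size))

y1 = rng.normal(size=x.size)
y2 = rng.normal(size=x.size)
print(np.allclose(M @ (y1 + y2), M @ y1 + M @ y2))  # True: fbar is linear in y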
The above is all the math you need to run Gaussian Processes in simple situations. Here is a list of more advanced topics that you should think about when applying Gaussian Processes to real data:
- Generalizing to multiple input dimensions (keeping one output dimension) is trivial, but the case of multiple outputs is not (partly because it is less natural).
- Choosing a physically motivated kernel or a kernel that simplifies the computation, for example by yielding sparse matrices.
- Parametrizing the kernel and/or the mean function and inferring these hyperparameters from the data.
- Using a small fraction of the data to make predictions. This is referred to as Sparse Gaussian Processes. Finding an optimal "summary" subset of the data is key.
- Gaussian Processes natively work with Gaussian noise / likelihood functions. With non-Gaussian cases, some analytical results are no longer valid (e.g. the marginal likelihood) but approximations exist.
- What if the inputs $\vec{x}$ have uncertainties? There are various ways to deal with this, but they are all much more computationally intensive than standard Gaussian Processes.
In [1]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('seaborn-whitegrid')
In [2]:
def f(x):
    """The function to predict."""
    return x * np.sin(x)

def make_data(N, rseed=1):
    np.random.seed(rseed)
    # Create some observations with noise
    X = np.random.uniform(low=0.1, high=9.9, size=N)
    X = np.atleast_2d(X).T
    y = f(X).ravel()
    dy = 0.5 + 1.0 * np.random.random(y.shape)
    noise = np.random.normal(0, dy)
    y += noise
    return X, y, dy
X, y, dy = make_data(20)
Example adapted from Scikit-learn's Examples
In [3]:
# Get the master version of scikit-learn; new GP code isn't in release
# This needs to compile things, so it will take a while...
# Uncomment the following:
# !pip install git+git://github.com/scikit-learn/scikit-learn.git
In [4]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF as SquaredExponential
from sklearn.gaussian_process.kernels import ConstantKernel as Amplitude
# Instantiate a Gaussian Process model
kernel = Amplitude(1.0, (1E-3, 1E3)) * SquaredExponential(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel,
                              alpha=(dy / y)**2,  # fractional errors in data
                              n_restarts_optimizer=10)
# Fit to data using Maximum Likelihood Estimation of the hyper-parameters
gp.fit(X, y)
Out[4]:
In [5]:
gp.kernel_
Out[5]:
In [6]:
# note: gp.kernel is the initial kernel
# gp.kernel_ (with an underscore) is the fitted kernel
gp.kernel_.get_params()
Out[6]:
In [7]:
# Mesh the input space for evaluations of the real function, the prediction and
# its uncertainty
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T

# Make the prediction on the meshed x-axis (ask for the standard deviation as well)
y_pred, sigma = gp.predict(x_pred, return_std=True)
In [8]:
def plot_results(X, y, dy, x_pred, y_pred, sigma):
    fig = plt.figure(figsize=(8, 6))
    plt.plot(x_pred, f(x_pred), 'k:', label=r'$f(x) = x\,\sin(x)$')
    plt.errorbar(X.ravel(), y, dy, fmt='k.', markersize=10, label='Observations',
                 ecolor='gray')
    plt.plot(x_pred, y_pred, 'b-', label='Prediction')
    plt.fill(np.concatenate([x_pred, x_pred[::-1]]),
             np.concatenate([y_pred - 1.9600 * sigma,
                             (y_pred + 1.9600 * sigma)[::-1]]),
             alpha=.3, fc='b', ec='None', label='95% confidence interval')
    plt.xlabel('$x$')
    plt.ylabel('$f(x)$')
    plt.ylim(-10, 20)
    plt.legend(loc='upper left');
plot_results(X, y, dy, x_pred, y_pred, sigma)
Now let's reproduce the prediction by hand with numpy/scipy, implementing the equations for $\bar{f}$ and $\bar{k}$ above while re-using the kernel hyper-parameters fitted by scikit-learn.
In [9]:
import scipy.linalg

# Training covariance, with the same per-point noise variances added to the diagonal
# as in the scikit-learn fit
KXX = gp.kernel_(X)
A = KXX + np.diag((dy/y)**2.)
L = scipy.linalg.cholesky(A, lower=True)

# Cross-covariance (test vs. training) and test covariance matrices
KXXp = gp.kernel_(x_pred, X)
KXpXp = gp.kernel_(x_pred)

# Predictive mean: k(x*, x) [K + sigma^2 I]^{-1} y
alpha = scipy.linalg.cho_solve((L, True), y)
y_pred = np.dot(KXXp, alpha) + np.mean(y, axis=0)

# Predictive covariance: k(x*, x*) - k(x*, x) [K + sigma^2 I]^{-1} k(x*, x)^T
v = scipy.linalg.cho_solve((L, True), KXXp.T)
y_pred_fullcov = KXpXp - KXXp.dot(v)
sigma = np.sqrt(np.diag(y_pred_fullcov))
In [10]:
plot_results(X, y, dy, x_pred, y_pred, sigma)
Let's now use the GPy package to compare a handful of kernels on our example, optimizing the hyper-parameters in each case. We plot not only the mean and standard deviation of the process but also a few posterior samples. As you can see, the results look very different: the choice of kernel is critical!
In [11]:
import GPy

kernels = [GPy.kern.RBF(input_dim=1),
           GPy.kern.Brownian(input_dim=1),
           GPy.kern.Matern32(input_dim=1),
           GPy.kern.Matern52(input_dim=1),
           GPy.kern.ExpQuad(input_dim=1),
           GPy.kern.Cosine(input_dim=1)]
names = ['Gaussian', 'Brownian', 'Matern32', 'Matern52', 'ExpQuad', 'Cosine']

fig, axs = plt.subplots(3, 2, figsize=(12, 12), sharex=True, sharey=True)
axs = axs.ravel()
for i, k in enumerate(kernels):
    m = GPy.models.GPRegression(X, y[:, None], kernel=k)
    m.optimize()
    # plotting four samples of the GP posterior too
    m.plot_f(ax=axs[i], plot_data=True, samples=4, legend=False, plot_limits=[0, 10])
    axs[i].errorbar(X.ravel(), y, yerr=dy, fmt="o", c='k')
    axs[i].set_title(names[i])
    axs[i].plot(x_pred, f(x_pred), 'k:', label=r'$f(x) = x\,\sin(x)$')
fig.tight_layout()