Gaussian Processes in Shogun

This notebook is about Bayesian regression and classification models with Gaussian Process (GP) priors in Shogun. After providing a semi-formal introduction, we illustrate how to efficiently train them, use them for predictions, and automatically learn parameters.


In [ ]:
%matplotlib inline
# import all shogun classes
from shogun import *
import random
import numpy as np
import matplotlib.pyplot as plt
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from math import exp

Some Formal Background (Skip if you just want code examples)

This notebook is about Bayesian regression models with Gaussian Process priors. A Gaussian Process (GP) over real valued functions on some domain $\mathcal{X}$, $f(\mathbf{x}):\mathcal{X} \rightarrow \mathbb{R}$, written as

$\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}')),$

defines a distribution over real valued functions with mean value $m(\mathbf{x})=\mathbb{E}[f(\mathbf{x})]$ and inter-function covariance $k(\mathbf{x},\mathbf{x}')=\mathbb{E}[(f(\mathbf{x})-m(\mathbf{x}))(f(\mathbf{x}')-m(\mathbf{x}'))]$. Intuitively, this means that the function value at any point $\mathbf{x}$, i.e., $f(\mathbf{x})$, is a random variable with mean $m(\mathbf{x})$; if you take the average of infinitely many functions from the Gaussian Process and evaluate them at $\mathbf{x}$, you will get this value. Similarly, the function values at two different points $\mathbf{x}, \mathbf{x}'$ have covariance $k(\mathbf{x}, \mathbf{x}')$. The formal definition is that a Gaussian Process is a (possibly infinite) collection of random variables of which any finite subset has a joint Gaussian distribution.
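
To make the finite-subset property concrete: for any finite set of inputs $\mathbf{x}_1,\dots,\mathbf{x}_n$, the corresponding function values are jointly Gaussian,

$(f(\mathbf{x}_1),\dots,f(\mathbf{x}_n))^\top\sim\mathcal{N}(\mathbf{m}, \mathbf{C}), \quad \mathbf{m}_i=m(\mathbf{x}_i), \quad \mathbf{C}_{ij}=k(\mathbf{x}_i,\mathbf{x}_j).$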

One can model data with Gaussian Processes via defining a joint distribution over

  • $n$ data (labels in Shogun) $\mathbf{y}\in \mathcal{Y}^n$, from an $n$-dimensional continuous (regression) or discrete (classification) space. These data correspond to $n$ covariates $\mathbf{x}_i\in\mathcal{X}$ (features in Shogun) from the input space $\mathcal{X}$.
  • Hyper-parameters $\boldsymbol{\theta}$ which depend on the used model (details follow).
  • Latent Gaussian variables $\mathbf{f}\in\mathbb{R}^n$, coming from a GP, i.e., they have a joint Gaussian distribution. Every entry $f_i$ corresponds to the GP function $f(\mathbf{x_i})$ evaluated at covariate $\mathbf{x}_i$ for $1\leq i \leq n$.

The joint distribution takes the form

$p(\mathbf{f},\mathbf{y},\boldsymbol{\theta})=p(\boldsymbol{\theta})p(\mathbf{f}|\boldsymbol{\theta})p(\mathbf{y}|\mathbf{f}),$

where $\mathbf{f}|\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{m}_{\boldsymbol{\theta}}, \mathbf{C}_{\boldsymbol{\theta}})$ is the joint Gaussian distribution for the GP variables, with mean $\mathbf{m}_{\boldsymbol{\theta}}$ and covariance $\mathbf{C}_{\boldsymbol{\theta}}$. The $(i,j)$-th entry of $\mathbf{C}_{\boldsymbol{\theta}}$ is given by the covariance or kernel between the $i$-th and $j$-th covariates, $k(\mathbf{x}_i, \mathbf{x}_j)$. Examples for kernel and mean functions are given later in the notebook.

Mean and covariance both depend on hyper-parameters coming from a prior distribution $\boldsymbol{\theta}\sim p(\boldsymbol{\theta})$. The data itself, $\mathbf{y}\in \mathcal{Y}^n$ (no assumptions on $\mathcal{Y}$ for now), is modelled by a likelihood function $p(\mathbf{y}|\mathbf{f})$, which gives the probability of the data $\mathbf{y}$ given a state of the latent Gaussian variables $\mathbf{f}$, i.e. $p(\mathbf{y}|\mathbf{f}):\mathcal{Y}^n\rightarrow [0,1]$.

In order to do inference for a new, unseen covariate $\mathbf{x}^*\in\mathcal{X}$, i.e., predicting its label $y^*\in\mathcal{Y}$ or in particular computing the predictive distribution for that label, we have to integrate over the posterior over the latent Gaussian variables (assume a fixed $\boldsymbol{\theta}$ for now, which means you can just ignore the symbol in the following if you want),

$p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(y^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}.$

This posterior, $p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})$, can be obtained using standard Bayes' rule as

$p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})=\frac{p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})}{p(\mathbf{y}|\boldsymbol{\theta})},$

with the so called evidence or marginal likelihood $p(\mathbf{y}|\boldsymbol{\theta})$ given as another integral over the prior over the latent Gaussian variables

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$.

In order to solve the above integrals, Shogun offers a variety of approximations. Don't worry, you will not have to deal with these nasty integrals on your own; everything is hidden within Shogun. However, if you like to play with these objects, you can also compute individual parts of them yourself.

Note that in the above description, we did not make any assumptions on the input space $\mathcal{X}$. As long as you define mean and covariance functions, and a likelihood, your data can have any form you like. Shogun can in fact deal with standard dense numerical data, sparse data, strings of *any* type, and much more out of the box. We will provide some examples below.

To gain some intuition how these latent Gaussian variables behave, and how to model data with them, see the regression part of this notebook.

Non-Linear Bayesian Regression

Bayesian regression with Gaussian Processes is among the most fundamental applications of latent Gaussian models. As usual, the observed data come from a continuous space, i.e. $\mathbf{y}\in\mathbb{R}^n$, which is represented in the Shogun class CRegressionLabels. We assume that these observations come from some distribution $p(\mathbf{y}|\mathbf{f})$ that is based on a fixed state of latent Gaussian response variables $\mathbf{f}\in\mathbb{R}^n$. In fact, we assume that the true model is the latent Gaussian response variable (which defines a distribution over functions), plus some Gaussian observation noise, which is modelled by the likelihood as

$p(\mathbf{y}|\mathbf{f})=\mathcal{N}(\mathbf{f},\sigma^2\mathbf{I})$

This simple likelihood is implemented in the Shogun class CGaussianLikelihood. It is the well known bell curve. Below, we plot the likelihood as a function of $\mathbf{y}$, for $n=1$.


In [ ]:
# plot likelihood for three different noise levels $\sigma$ (which is not yet squared)
sigmas=np.array([0.5,1,2])

# likelihood instance
lik=GaussianLikelihood()

# A set of labels to consider
lab=RegressionLabels(np.linspace(-4.0,4.0, 200))

# A single 1D Gaussian response function, repeated once for each label
# this avoids doing a loop in python which would be slow
F=np.zeros(lab.get_num_labels())

# plot likelihood for all observations noise levels
plt.figure(figsize=(12, 4))
for sigma in sigmas:
    # set observation noise, this is squared internally
    lik.set_sigma(sigma)
    
    # compute log-likelihood for all labels
    log_liks=lik.get_log_probability_f(lab, F)
    
    # plot likelihood functions, exponentiate since they were computed in log-domain
    plt.plot(lab.get_labels(), np.exp(log_liks))
    
plt.ylabel("$p(y_i|f_i)$")
plt.xlabel("$y_i$")
plt.title("Regression Likelihoods for different observation noise levels")
_=plt.legend(["sigma=$%.1f$" % sigma for sigma in sigmas])

Apart from its appealing form, this curve has the nice property of giving rise to analytical solutions to the required integrals. Recall that these are given by

$p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(y^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta},$

and

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$.

Since all involved elements, the likelihood $p(\mathbf{y}|\mathbf{f})$ and the GP prior $p(\mathbf{f}|\boldsymbol{\theta})$, are Gaussian, the same follows for the GP posterior $p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})$ and the marginal likelihood $p(\mathbf{y}|\boldsymbol{\theta})$. Therefore, we just need to sit down with pen and paper to derive the resulting forms of these Gaussian distributions (see references). Luckily, everything is already implemented in Shogun.
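
For reference, and assuming a zero mean function, the standard textbook results (see references) take the following form. The predictive distribution of the latent function value at a test covariate $\mathbf{x}^*$ is

$p(f^*|\mathbf{y},\boldsymbol{\theta})=\mathcal{N}\left(\mathbf{k}_*^\top(\mathbf{C}_{\boldsymbol{\theta}}+\sigma^2\mathbf{I})^{-1}\mathbf{y},\; k(\mathbf{x}^*,\mathbf{x}^*)-\mathbf{k}_*^\top(\mathbf{C}_{\boldsymbol{\theta}}+\sigma^2\mathbf{I})^{-1}\mathbf{k}_*\right),$

where $\mathbf{k}_*$ contains the kernel evaluated between $\mathbf{x}^*$ and all training covariates (the predictive distribution for the noisy label $y^*$ adds $\sigma^2$ to the variance), and the log marginal likelihood is

$\log p(\mathbf{y}|\boldsymbol{\theta})=-\frac{1}{2}\mathbf{y}^\top(\mathbf{C}_{\boldsymbol{\theta}}+\sigma^2\mathbf{I})^{-1}\mathbf{y}-\frac{1}{2}\log|\mathbf{C}_{\boldsymbol{\theta}}+\sigma^2\mathbf{I}|-\frac{n}{2}\log 2\pi.$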

In order to get some intuition about Gaussian Processes in general, let us first have a look at these latent Gaussian variables, which define a probability distribution over real valued functions $f(\mathbf{x}):\mathcal{X} \rightarrow \mathbb{R}$, where in the regression case, $\mathcal{X}=\mathbb{R}$.

As mentioned above, the joint distribution of a finite number (say $n$) of variables $\mathbf{f}\in\mathbb{R}^n$ from a Gaussian Process $\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}'))$, takes the form

$\mathbf{f}|\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{m}_\theta, \mathbf{C}_\theta),$

where $\mathbf{m}_\theta$ contains the mean function evaluated at the covariates and $\mathbf{C}_\theta$ is the pairwise covariance or kernel matrix of the input covariates $\mathbf{x}_i$. This means we can easily sample function realisations $\mathbf{f}^{(j)}$ from the Gaussian Process and, more importantly, visualise them.

To this end, let us consider the well-known and often used Gaussian kernel or squared exponential covariance, which is implemented in the Shogun class CGaussianKernel in the parametric form (note that there are other forms in the literature)

$ k(\mathbf{x}, \mathbf{x}')=\exp\left( -\frac{||\mathbf{x}-\mathbf{x}'||_2^2}{\tau}\right),$

where $\tau$ is a hyper-parameter of the kernel. We will also use the constant CZeroMean mean function, which is suitable if the data's mean is zero (which can be achieved by removing it).
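
As a small sanity check, the kernel matrix from the parametric form above can also be computed directly in NumPy. The sketch below assumes that the second constructor argument of GaussianKernel used in the cells that follow plays the role of $\tau$; it is not required for the rest of the notebook.


In [ ]:
# minimal NumPy sketch of the squared exponential kernel above, for 1D inputs
def gaussian_kernel_matrix(X, tau):
    # pairwise squared distances between all covariates
    sq_dists = (X[:, np.newaxis] - X[np.newaxis, :]) ** 2
    return np.exp(-sq_dists / tau)

# example: a tiny kernel matrix for three covariates and tau=4
print(gaussian_kernel_matrix(np.array([0.0, 1.0, 2.5]), 4.0))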

Let us consider some toy regression data in the form of a sine wave, which is observed at random points with some observation noise.


In [ ]:
def generate_regression_toy_data(n=50, n_test=100, x_range=15, x_range_test=20, noise_var=0.4):
    # training and test sine wave, test one has more points
    X_train = np.random.rand(n)*x_range
    X_test = np.linspace(0, x_range_test, n_test)
    
    # add noise to training observations
    y_test = np.sin(X_test)
    y_train = np.sin(X_train)+np.random.randn(n)*noise_var
    
    return X_train, y_train, X_test, y_test

In [ ]:
X_train, y_train, X_test, y_test = generate_regression_toy_data()
plt.figure(figsize=(16,4))
plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test)
plt.legend(["Noisy observations", "True model"])
plt.title("One-Dimensional Toy Regression Data")
plt.xlabel("$\mathbf{x}$")
_=plt.ylabel("$\mathbf{y}$")

First, we compute the kernel matrix $\mathbf{C}_{\boldsymbol{\theta}}$ using the CGaussianKernel with hyper-parameter $\boldsymbol{\theta}=\{\tau\}$ for a few different values. Note that in Gaussian Processes, kernels usually have a scaling parameter. We skip this one for now and cover it later.


In [ ]:
# bring data into shogun representation (features are 2d-arrays, organised as column vectors)
feats_train=RealFeatures(X_train.reshape(1,len(X_train)))
feats_test=RealFeatures(X_test.reshape(1,len(X_test)))
labels_train=RegressionLabels(y_train)

# compute covariances for different kernel parameters
taus=np.asarray([.1,4.,32.])
Cs=np.zeros((len(X_train), len(X_train), len(taus)))
for i in range(len(taus)):
    # compute unscaled kernel matrix (first parameter is maximum size in memory and not very important)
    kernel=GaussianKernel(10, taus[i])
    kernel.init(feats_train, feats_train)
    Cs[:,:,i]=kernel.get_kernel_matrix()


# plot
plt.figure(figsize=(16,5))
for i in range(len(taus)):
    plt.subplot(1,len(taus),i+1)
    plt.imshow(Cs[:,:,i], interpolation="nearest")
    plt.xlabel("Covariate index")
    plt.ylabel("Covariate index")
    _=plt.title("tau=%.1f" % taus[i])

This matrix, like any kernel or covariance matrix, is positive semi-definite and symmetric. It can be viewed as a similarity matrix. Here, elements on the diagonal (corresponding to $\mathbf{x}=\mathbf{x}'$) have the largest similarity. For increasing kernel bandwidth $\tau$, more and more elements are similar. This matrix fully specifies the distribution of the finite set of latent Gaussian variables $\mathbf{f}$, i.e. of the function $f(\mathbf{x}):\mathcal{X}\rightarrow\mathbb{R}$ evaluated at the training covariates, which we can sample from and plot. To this end, we use the Shogun class CStatistics, which offers a method to sample from multivariate Gaussians.
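
As a side note, if you prefer plain NumPy over the Shogun helper, equivalent prior samples can be drawn with np.random.multivariate_normal. The sketch below is not used in the following cells; the small diagonal jitter is only added for numerical stability.


In [ ]:
# NumPy-only sketch for sampling from N(0, C); samples are stored row-wise
def sample_gp_prior_numpy(C, n_samples=3, jitter=1e-8):
    n = C.shape[0]
    # add a tiny diagonal jitter in case C is numerically close to singular
    return np.random.multivariate_normal(np.zeros(n), C + jitter * np.eye(n), size=n_samples)

# example: three latent functions from the prior, using the first kernel matrix from above
F_numpy = sample_gp_prior_numpy(Cs[:, :, 0])
print(F_numpy.shape)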


In [ ]:
plt.figure(figsize=(16,5))
plt.suptitle("Random Samples from GP prior")
for i in range(len(taus)):
    plt.subplot(1,len(taus),i+1)
    
    # sample a bunch of latent functions from the Gaussian Process
    # note these vectors are stored row-wise
    F=Statistics.sample_from_gaussian(np.zeros(len(X_train)), Cs[:,:,i], 3)
    
    # sort points to connect the dots with lines
    sorted_idx=X_train.argsort()
    for j in range(len(F)):
        plt.plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)

    plt.xlabel(r"$\mathbf{x}_i$")
    plt.ylabel(r"$f(\mathbf{x}_i)$")
    _=plt.title("tau=%.1f" % taus[i])

Note how the functions are evaluated exactly at the training covariates $\mathbf{x}_i$, which are randomly distributed on the x-axis. Even though these points do not visualise the full functions (we can only evaluate them at a finite number of points, but we connected the points with lines to make this clearer), this reveals that larger values of the kernel bandwidth $\tau$ lead to smoother latent Gaussian functions.

In the above plots, all functions are equally possible. That is, the prior of the latent Gaussian variables $\mathbf{f}|\boldsymbol{\theta}$ does not favour any particular functions. Computing the posterior given our training data, the distribution over $\mathbf{f}|\mathbf{y},\boldsymbol{\theta}$, then corresponds to restricting the above distribution over functions to those that explain the training data (up to observation noise). We will now use the Shogun class CExactInferenceMethod to do exactly this. The class is the general basis of exact GP regression in Shogun. We have to define all parts of the Gaussian Process for the inference method.


In [ ]:
plt.figure(figsize=(16,5))
plt.suptitle("Random Samples from GP posterior")
for i in range(len(taus)):
    plt.subplot(1,len(taus),i+1)

    # create exact inference method instance; the Gaussian likelihood models the observation noise
    inf=ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())
    
    C_post=inf.get_posterior_covariance()
    m_post=inf.get_posterior_mean()

    # sample a bunch of latent functions from the Gaussian Process
    # note these vectors are stored row-wise
    F=Statistics.sample_from_gaussian(m_post, C_post, 5)
    
    # sort points to connect the dots with lines
    sorted_idx=X_train.argsort()
    for j in range(len(F)):
        plt.plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)

    # plot training data on top of the posterior samples
    plt.plot(X_train, y_train, 'r*')

    plt.xlabel(r"$\mathbf{x}_i$")
    plt.ylabel(r"$f(\mathbf{x}_i)$")
    _=plt.title("tau=%.1f" % taus[i])

Note how the above function samples are constrained to go through our training data labels (up to observation noise), as much as their smoothness allows. In fact, these are already samples from the predictive distribution, which gives a probability for a label $y^*$ for any covariate $\mathbf{x}^*$. These distributions are Gaussian (!), nice to look at and extremely useful for understanding the GP's underlying model. Let's plot them. We finally use the Shogun class CGaussianProcessRegression to represent the whole GP under a single interface to perform inference with. In addition, we use the helper class CGaussianDistribution to evaluate the log-likelihood of every test point's label $y_j^*$ at covariate $\mathbf{x}^*_j$.


In [ ]:
# helper function that plots predictive distribution and data
def plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances):
    # evaluate predictive distribution in this range of y-values and preallocate predictive distribution
    y_values=np.linspace(-3,3)
    D=np.zeros((len(y_values), len(X_test)))
    
    # evaluate normal distribution at every prediction point (column)
    for j in range(np.shape(D)[1]):
        # create Gaussian distribution instance; expects mean vector and covariance matrix, hence the reshapes
        gauss=GaussianDistribution(np.array(means[j]).reshape(1,), np.array(variances[j]).reshape(1,1))
    
        # evaluate predictive distribution for test point, method expects matrix
        D[:,j]=np.exp(gauss.log_pdf_multiple(y_values.reshape(1,len(y_values))))
    
    plt.pcolor(X_test,y_values,D)
    plt.colorbar()
    plt.contour(X_test,y_values,D)
    plt.plot(X_test,y_test, 'b', linewidth=3)
    plt.plot(X_test,means, 'm--', linewidth=3)
    plt.plot(X_train, y_train, 'ro')
    plt.legend(["Truth", "Prediction", "Data"])

In [ ]:
plt.figure(figsize=(18,10))
plt.suptitle("GP inference for different kernel widths")
for i in range(len(taus)):
    plt.subplot(len(taus),1,i+1)
    
    # create GP instance using inference method and train
    # use Shogun objects from above
    inf.set_kernel(GaussianKernel(10,taus[i]))
    gp=GaussianProcessRegression(inf)
    gp.train()
    
    # predict labels for all test data (note that this produces the same values as the mean vector below)
    y_predicted = gp.apply(feats_test)
    
    # extract means and variance of predictive distribution for all test points
    means = gp.get_mean_vector(feats_test)
    variances = gp.get_variance_vector(feats_test)
    
    # note: y_predicted == means
    
    # plot predictive distribution and training data
    plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)
    _=plt.title("tau=%.1f" % taus[i])

The question now is: which set of hyper-parameters $\boldsymbol{\theta}=\{\tau, \gamma, \sigma\}$ to choose, where $\gamma$ is the kernel scaling (which we omitted so far), and $\sigma$ is the observation noise (which we left at its default value of one)? The question of model-selection will be handled in a bit more depth in the binary classification case. For now we just show code how to do it as a black box. See below for explanations.


In [ ]:
# re-create inference method and GP instance to start from scratch, use other Shogun structures from above
inf = ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())
gp = GaussianProcessRegression(inf)

# evaluate our inference method for its derivatives
grad = GradientEvaluation(gp, feats_train, labels_train, GradientCriterion(), False)
grad.set_function(inf)

# handles all of the above structures in memory
grad_search = GradientModelSelection(grad)

# search for best parameters and store them
best_combination = grad_search.select_model()

# apply best parameters to GP, train
best_combination.apply_to_machine(gp)

# we have to "cast" objects to the specific kernel interface we used (soon to be easier)
best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()
best_scale=inf.get_scale()
best_sigma=GaussianLikelihood.obtain_from_generic(inf.get_model()).get_sigma()
print "Selected tau (kernel bandwidth):", best_width
print "Selected gamma (kernel scaling):", best_scale
print "Selected sigma (observation noise):", best_sigma

Now we can output the best parameters and plot the predictive distribution for those.


In [ ]:
# train gp
gp.train()

# extract means and variance of predictive distribution for all test points
means = gp.get_mean_vector(feats_test)
variances = gp.get_variance_vector(feats_test)

# plot predictive distribution
plt.figure(figsize=(18,5))
plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)
_=plt.title("Maximum Likelihood II based inference")

Now the predictive distribution is very close to the true data generating process.

Non-Linear, Binary Bayesian Classification

In binary classification, the observed data comes from a space of discrete, binary labels, i.e. $\mathbf{y}\in\mathcal{Y}^n=\{-1,+1\}^n$, which are represented via the Shogun class CBinaryLabels. To model these observations with a GP, we need a likelihood function $p(\mathbf{y}|\mathbf{f})$ that maps a set of such discrete observations to a probability, given a fixed response $\mathbf{f}$ of the Gaussian Process.

In regression, this was straightforward, as we could simply use the response variable $\mathbf{f}$ itself, plus some Gaussian noise, which gave rise to a probability distribution. However, now that the $\mathbf{y}$ are discrete, we cannot do the same thing. We rather need a function that squashes the Gaussian response variable into a probability, given some data. This is a common problem in Machine Learning and Statistics and is usually done with some sort of sigmoid function of the form $\sigma:\mathbb{R}\rightarrow[0,1]$. One popular choice for such a function is the logit likelihood, given by

$p(\mathbf{y}|\mathbf{f})=\prod_{i=1}^n p(y_i|f_i)=\prod_{i=1}^n \frac{1}{1+\exp(-y_i f_i)}.$

This likelihood is implemented in Shogun under CLogitLikelihood and using it is sometimes referred to as logistic regression. Used with GPs, it results in non-linear Bayesian logistic regression. We can easily use the class to illustrate the sigmoid function for a 1D example and a fixed data point with label $+1$.


In [ ]:
# two classification likelihoods in Shogun
logit=LogitLikelihood()
probit=ProbitLikelihood()

# A couple of Gaussian response functions, 1-dimensional here
F=np.linspace(-5.0,5.0)

# Single observation label with +1
lab=BinaryLabels(np.array([1.0]))

# compute log-likelihood for all values in F
log_liks_logit=np.zeros(len(F))
log_liks_probit=np.zeros(len(F))
for i in range(len(F)):
    # Shogun expects a 1D array for f, not a single number
    f=np.array(F[i]).reshape(1,)
    log_liks_logit[i]=logit.get_log_probability_f(lab, f)
    log_liks_probit[i]=probit.get_log_probability_f(lab, f)
    
    
# in fact, loops are slow and Shogun offers a method to compute the likelihood for many f. Much faster!
log_liks_logit=logit.get_log_probability_fmatrix(lab, F.reshape(1,len(F)))
log_liks_probit=probit.get_log_probability_fmatrix(lab, F.reshape(1,len(F)))

# plot the sigmoid functions, note that Shogun computes it in log-domain, so we have to exponentiate
plt.figure(figsize=(12, 4))
plt.plot(F, np.exp(log_liks_logit))
plt.plot(F, np.exp(log_liks_probit))
plt.ylabel("$p(y_i|f_i)$")
plt.xlabel("$f_i$")
plt.title("Classification Likelihoods")
_=plt.legend(["Logit", "Probit"])

Note how the logit function maps any input value to $[0,1]$ in a continuous way. The other curve above shows another classification likelihood implemented in Shogun, the Gaussian CDF function

$p(\mathbf{y}|\mathbf{f})=\prod_{i=1}^n p(y_i|f_i)=\prod_{i=1}^n \Phi(y_i f_i),$

where $\Phi:\mathbb{R}\rightarrow [0,1]$ is the cumulative distribution function (CDF) of the standard Gaussian distribution $\mathcal{N}(0,1)$. It is implemented in the Shogun class CProbitLikelihood and using it is referred to as probit regression. While the Gaussian CDF has some convenient properties for integrating over it (and thus allows some different modelling decisions), in most cases it does not really matter which of the two you use in Shogun. For the sake of completeness, it is also plotted above, being very similar to the logit likelihood.
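
As a quick cross-check without Shogun, both likelihood formulas can be evaluated directly; the sketch below assumes SciPy is available for the Gaussian CDF and should reproduce the two curves above.


In [ ]:
from scipy.stats import norm

# directly evaluate the two classification likelihoods for a single label y=+1
y = 1.0
F_check = np.linspace(-5.0, 5.0)

# logit: p(y|f) = 1/(1+exp(-y*f))
logit_check = 1.0/(1.0+np.exp(-y*F_check))

# probit: p(y|f) = Phi(y*f), the standard Gaussian CDF
probit_check = norm.cdf(y*F_check)

plt.figure(figsize=(12, 4))
plt.plot(F_check, logit_check)
plt.plot(F_check, probit_check)
_=plt.legend(["Logit (NumPy)", "Probit (SciPy)"])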

TODO: Show a function squashed through the logit likelihood

Recall that in order to do inference, we need to solve two integrals (in addition to the Bayes rule, see above)

$p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(y^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta},$

and

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$.

In classification, the second integral is not available in closed form since it is the convolution of a Gaussian, $p(\mathbf{f}|\boldsymbol{\theta})$, and a non-Gaussian, $p(\mathbf{y}|\mathbf{f})$, distribution. Therefore, we have to rely on approximations in order to compute and integrate over the posterior $p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})$. Shogun offers various standard methods from the literature to deal with this problem, including the Laplace approximation (CLaplacianInferenceMethod) and Expectation Propagation (CEPInferenceMethod), both for inference and for evaluating the marginal likelihood. These two approximations give rise to a Gaussian posterior $p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})$, which can then be easily computed and integrated over (all this is done by Shogun for you).
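
As a rough sketch of the first idea: the Laplace approximation replaces the true posterior by a Gaussian centred at its mode $\hat{\mathbf{f}}=\arg\max_{\mathbf{f}}p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})$,

$p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})\approx\mathcal{N}\left(\hat{\mathbf{f}},\;(\mathbf{C}_{\boldsymbol{\theta}}^{-1}+\mathbf{W})^{-1}\right),$

where $\mathbf{W}$ is the negative Hessian of $\log p(\mathbf{y}|\mathbf{f})$ at $\hat{\mathbf{f}}$ (written here for a zero mean function). EP instead approximates each likelihood factor by an unnormalised Gaussian via moment matching.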

While the Laplace approximation is quite fast, EP usually has better accuracy, in particular if one is not just interested in binary decisions but also in certainty values for these predictions. Go for Laplace if you only need binary decisions, and for EP otherwise.

TODO, add references to inference methods.

We will now give an example of how to do GP inference for binary classification in Shogun on some toy data. For that, we will first define a function to generate a classical non-linear classification problem.


In [ ]:
def generate_classification_toy_data(n_train=100, mean_a=np.asarray([0, 0]), std_dev_a=1.0, mean_b=3, std_dev_b=0.5):

    # positive examples are distributed normally
    X1 = (np.random.randn(n_train, 2)*std_dev_a+mean_a).T

    # negative examples have a "ring"-like form
    r = np.random.randn(n_train)*std_dev_b+mean_b
    angle = np.random.randn(n_train)*2*np.pi
    X2 = np.array([r*np.cos(angle)+mean_a[0], r*np.sin(angle)+mean_a[1]])

    # stack positive and negative examples in a single array
    X_train = np.hstack((X1,X2))

    # label positive examples with +1, negative with -1
    y_train = np.zeros(n_train*2)
    y_train[:n_train] = 1
    y_train[n_train:] = -1

    return X_train, y_train

def plot_binary_data(X_train, y_train):
    plt.plot(X_train[0, np.argwhere(y_train == 1)], X_train[1, np.argwhere(y_train == 1)], 'ro')
    plt.plot(X_train[0, np.argwhere(y_train == -1)], X_train[1, np.argwhere(y_train == -1)], 'bo')

In [ ]:
X_train, y_train=generate_classification_toy_data()
plot_binary_data(X_train, y_train)
_=plt.title("2D Toy classification problem")

We will now pass this data into Shogun representation, and use the standard Gaussian kernel (or squared exponential covariance function (CGaussianKernel)) and the Laplace approximation to obtain a decision boundary for the two classes. You can easily exchange different likelihood models and inference methods.


In [ ]:
# for building combinations of arrays
from itertools import product

# convert training data into Shogun representation
train_features = RealFeatures(X_train)
train_labels = BinaryLabels(y_train)

# generate all pairs in 2d range of testing data (full space), discretisation resolution is n_test
n_test=50
x1 = np.linspace(X_train[0,:].min()-1, X_train[0,:].max()+1, n_test)
x2 = np.linspace(X_train[1,:].min()-1, X_train[1,:].max()+1, n_test)
X_test = np.asarray(list(product(x1, x2))).T

# convert testing features into Shogun representation
test_features = RealFeatures(X_test)

# create Gaussian kernel with width = 2.0
kernel = GaussianKernel(10, 2)

# create zero mean function
zero_mean = ZeroMean()

# you can easily switch between probit and logit likelihood models
# by uncommenting/commenting the following lines:

# create probit likelihood model
# lik = ProbitLikelihood()

# create logit likelihood model
lik = LogitLikelihood()

# you can easily switch between Laplace and EP approximation by
# uncommenting/commenting the following lines:

# specify Laplace approximation inference method
#inf = LaplacianInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)

# specify EP approximation inference method
inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)

# EP might not converge, we here allow that without errors
inf.set_fail_on_non_convergence(False)

# create and train GP classifier using the chosen approximate inference method
gp = GaussianProcessClassification(inf)
gp.train()

test_labels=gp.apply(test_features)

# plot data and decision boundary
plot_binary_data(X_train, y_train)
plt.pcolor(x1, x2, test_labels.get_labels().reshape(n_test, n_test))
_=plt.title('Decision boundary')

This is already quite nice. The nice thing about Gaussian Processes is that they are Bayesian, which means that we have a full predictive distribution, i.e., we can plot the probability of a point belonging to a class. These can be obtained via the interface of CGaussianProcessClassification.


In [ ]:
# obtain class probabilities for the test features
p_test = gp.get_probabilities(test_features)

# create figure
plt.title('Training data, predictive probability and decision boundary')

# plot training data
plot_binary_data(X_train, y_train)

# plot decision boundary
plt.contour(x1, x2, np.reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))

# plot probabilities
plt.pcolor(x1, x2, p_test.reshape(n_test, n_test))
_=plt.colorbar()

If you are interested in the marginal likelihood $p(\mathbf{y}|\boldsymbol{\theta})$, for example for the sake of comparing different model parameters $\boldsymbol{\theta}$ (more in model-selection later), it is very easy to compute it via the interface of CInferenceMethod, i.e., every inference method in Shogun can do that. It is even possible to obtain the mean and covariance of the Gaussian approximation to the posterior $p(\mathbf{f}|\mathbf{y})$ using Shogun. In the following, we plot the marginal likelihood under the EP inference method (more accurate approximation) as a one dimensional function of the kernel width.


In [ ]:
# generate some non-negative kernel widths
widths=2**np.linspace(-5,6,20)

# compute marginal likelihood under the EP approximation for every width
# use Shogun objects from above
marginal_likelihoods=np.zeros(len(widths))
for i in range(len(widths)):
    # note that GP training is automatically done/updated if a parameter is changed. No need to call train again
    kernel.set_width(widths[i])
    marginal_likelihoods[i]=-inf.get_negative_log_marginal_likelihood()
    
# plot marginal likelihoods as a function of kernel width
plt.plot(np.log2(widths), marginal_likelihoods)
plt.title("Log Marginal likelihood for different kernels")
plt.xlabel("Kernel Width in log-scale")
_=plt.ylabel("Log-Marginal Likelihood")

print "Width with largest marginal likelihood:", widths[marginal_likelihoods.argmax()]

This plot clearly shows that there is one kernel width (aka hyper-parameter $\theta$) for which the marginal likelihood is maximised. If one is interested in the single best parameter, this concept can be used to learn the best hyper-parameters of the GP. In fact, this is possible in a very efficient way since we have a lot of information about the geometry of the marginal likelihood function, for example its gradient: it turns out that the above function is smooth, so we can use standard optimisation techniques to find its extrema. This is called maximum likelihood II. Let's have a closer look.

Excursus: Model-Selection with Gaussian Processes

First, let us have a look at the predictive distributions for some of the above kernel widths.


In [ ]:
# again, use Shogun objects from above, but a few extremal widths
widths_subset=np.array([widths[0], widths[marginal_likelihoods.argmax()], widths[len(widths)-1]])

plt.figure(figsize=(18, 5))
for i in range(len(widths_subset)):
    plt.subplot(1,len(widths_subset),i+1)
    kernel.set_width(widths_subset[i])
    
    # obtain and plot predictive distribution
    p_test = gp.get_probabilities(test_features)
    title_str="Width=%.2f, " % widths_subset[i]
    if i == 0:
        title_str+="too complex, overfitting"
    elif i == 1:
        title_str+="just right"
    else:
        title_str+="too smooth, underfitting"
        
    plt.title(title_str)
    plot_binary_data(X_train, y_train)
    plt.contour(x1, x2, np.reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))
    plt.pcolor(x1, x2, p_test.reshape(n_test, n_test))
    _=plt.colorbar()

In the above plots, it is quite clear that the maximum of the marginal likelihood corresponds to the best single setting of the parameters. To give some more intuition: The interpretation of the marginal likelihood

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$

is the probability of the data given the model parameters $\boldsymbol{\theta}$. Note that this is averaged over all possible configurations of the latent Gaussian variables $\mathbf{f}|\boldsymbol{\theta}$ given a fixed configuration of parameters. However, since this is a probability distribution, it has to integrate to $1$. This means that models that are too complex (and thus able to explain too many different data configurations) and models that are too simple (and thus not able to explain the current data) both give rise to a small marginal likelihood. Only when the model is just complex enough to explain the data well (but not more complex) is the marginal likelihood maximised. This is an instance of a concept called Occam's razor, and is a nice motivation why you should be Bayesian if you can: overfitting doesn't happen that quickly.

As mentioned before, Shogun is able to automagically learn all of the hyper-parameters $\boldsymbol{\theta}$ using gradient based optimisation on the marginal likelihood (whose derivatives are computed internally). To do this, we use the class CGradientModelSelection. Note that we could also use CGridSearchModelSelection to do a standard grid-search, such as is done for Support Vector Machines. However, this is highly inefficient, in particular when the number of parameters grows. In addition, in order to evaluate parameter states, we use the classes CGradientEvaluation and CGradientCriterion, which is also much cheaper than the usual CCrossValidation, since it just evaluates the gradient of the marginal likelihood rather than performing many training and testing runs. This is another very nice motivation for using Gaussian Processes: optimising parameters is much easier. In the following, we demonstrate how to select all parameters of the used model. In Shogun, parameter configurations (corresponding to $\boldsymbol{\theta}$) are stored in instances of CParameterCombination, which can be applied to machines.

This approach is known as maximum likelihood II (the 2 is for the second level, averaging over all possible $\mathbf{f}|\boldsymbol{\theta}$), or evidence maximisation.
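
To give a flavour of what is being optimised: for exact GP regression, for instance, the gradient of the log marginal likelihood with respect to a kernel parameter $\theta_j$ has the well-known closed form

$\frac{\partial}{\partial\theta_j}\log p(\mathbf{y}|\boldsymbol{\theta})=\frac{1}{2}\operatorname{tr}\left(\left(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top-\mathbf{K}^{-1}\right)\frac{\partial \mathbf{K}}{\partial\theta_j}\right),\quad \boldsymbol{\alpha}=\mathbf{K}^{-1}\mathbf{y},\quad \mathbf{K}=\mathbf{C}_{\boldsymbol{\theta}}+\sigma^2\mathbf{I}.$

Approximate inference methods such as EP provide analogous expressions; these are the derivatives that Shogun computes internally for the gradient search below.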


In [ ]:
# re-create inference method and GP instance to start from scratch, use other Shogun structures from above
inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)

# EP might not converge, we here allow that without errors
inf.set_fail_on_non_convergence(False)

gp = GaussianProcessClassification(inf)

# evaluate our inference method for its derivatives
grad = GradientEvaluation(gp, train_features, train_labels, GradientCriterion(), False)
grad.set_function(inf)

# handles all of the above structures in memory
grad_search = GradientModelSelection(grad)

# search for best parameters and store them
best_combination = grad_search.select_model()

# apply best parameters to GP
best_combination.apply_to_machine(gp)

# we have to "cast" objects to the specific kernel interface we used (soon to be easier)
best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()
best_scale=inf.get_scale()
print "Selected kernel bandwidth:", best_width
print "Selected kernel scale:", best_scale

This now gives us a trained Gaussian Process with the best hyper-parameters. In the above setting, these are the CGaussianKernel bandwidth and its scale (which is stored in the GP itself since Shogun kernels do not support scaling). We can now again visualise the predictive distribution, and also output the best parameters.


In [ ]:
# train gp
gp.train()

# visualise predictive distribution
p_test = gp.get_probabilities(test_features)
plot_binary_data(X_train, y_train)
plt.contour(x1, x2, np.reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))
plt.pcolor(x1, x2, p_test.reshape(n_test, n_test))
_=plt.colorbar()

Note how nicely this predictive distribution matches the data generating distribution. Also note that the best kernel bandwidth is different from the one we saw in the above plot. This is caused by the different kernel scaling that was also learned automatically. The kernel scaling, roughly speaking, corresponds to the sharpness of the changes in the surface of the predictive likelihood. Since we have two hyper-parameters, we can plot the surface of the marginal likelihood as a function of both of them. This is sometimes interesting, for example when this surface has multiple maxima (corresponding to multiple "best" parameter settings), and thus might be useful for analysis. It is expensive, however.


In [ ]:
# parameter space, increase resolution if you want finer plots, takes long though
resolution=5
widths=2**np.linspace(-4,10,resolution)
scales=2**np.linspace(-5,10,resolution)

# re-create inference method and GP instance to start from scratch, use other Shogun structures from above
inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)

# EP might not converge, we here allow that without errors
inf.set_fail_on_non_convergence(False)

gp = GaussianProcessClassification(inf)
inf.set_tolerance(1e-3)
# compute marginal likelihood for every parameter combination
# use Shogun objects from above
marginal_likelihoods=np.zeros((len(widths), len(scales)))
for i in range(len(widths)):
    for j in range(len(scales)):
        kernel.set_width(widths[i])
        inf.set_scale(scales[j])
        marginal_likelihoods[i,j]=-inf.get_negative_log_marginal_likelihood()
    
# contour plot of marginal likelihood as a function of kernel width and scale
plt.contour(np.log2(widths), np.log2(scales), marginal_likelihoods)
plt.colorbar()
plt.xlabel("Kernel width (log-scale)")
plt.ylabel("Kernel scale (log-scale)")
_=plt.title("Log Marginal Likelihood")

# plot our found best parameters
_=plt.plot([np.log2(best_width)], [np.log2(best_scale)], 'r*', markersize=20)

Our found maximum nicely matches the result of the "grid-search". The take-home message is: with Gaussian Processes, you neither need to do expensive brute force approaches to find the best parameters (you can use gradient descent), nor do you need to do expensive cross-validation to evaluate your model (you can use the Bayesian concept of maximum likelihood II).

Excursus: Large-Scale Regression

One "problem" with the classical method of Gaussian Process based inference is the computational complexity of $\mathcal{O}(n^3)$, where $n$ is the number of training examples. This is caused by matrix inversion, Cholesky factorization, etc. Up to a few thousand points, this is feasible. You will quickly run into memory and runtime problems for very large problems.

One way of approaching very large problems is the Fully Independent Training Conditional (FITC) approximation, which is a low-rank plus diagonal approximation to the exact covariance. The rough idea is to specify a set of $m\ll n$ inducing points and to base all computations on the covariance between training/test and inducing points only, which intuitively corresponds to combining various training points around an inducing point. This reduces the computational complexity to $\mathcal{O}(nm^2)$, where again $n$ is the number of training points and $m$ is the number of inducing points. This is quite a significant decrease, in particular if the number of inducing points is much smaller than the number of examples.
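
In formulas, the low-rank plus diagonal structure can be sketched as

$\mathbf{C}_{\boldsymbol{\theta}}\approx\mathbf{Q}+\operatorname{diag}(\mathbf{C}_{\boldsymbol{\theta}}-\mathbf{Q}),\qquad \mathbf{Q}=\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn},$

where $\mathbf{K}_{mm}$ is the kernel matrix of the $m$ inducing points and $\mathbf{K}_{nm}$ the kernel matrix between training and inducing points; all expensive operations then involve only $m\times m$ inverses and $n\times m$ products, which is where the $\mathcal{O}(nm^2)$ cost comes from.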

The optimal way to specify inducing points is to densely and uniformly place them in the input space. However, this might quickly become infeasible in high dimensions. In this case, a random subset of the training data might be a good idea.

In Shogun, the class CFITCInferenceMethod handles inference for regression with the CGaussianLikelihood. Below, we demonstrate its usage on a toy example and compare it to exact regression. Note that CGradientModelSelection still works as before. We compare the runtime for inference with both GPs.

First, note that changing the inference method only requires changing a single line of code.


In [ ]:
# for measuring runtime
import time

# simple regression data
X_train, y_train, X_test, y_test = generate_regression_toy_data(n=1000)

# bring data into shogun representation (features are 2d-arrays, organised as column vectors)
feats_train=RealFeatures(X_train.reshape(1,len(X_train)))
feats_test=RealFeatures(X_test.reshape(1,len(X_test)))
labels_train=RegressionLabels(y_train)

# inducing features (here: a random grid over the input space, try out others)
n_inducing=10
#X_inducing=linspace(X_train.min(), X_train.max(), n_inducing)
X_inducing=np.random.rand(n_inducing)*X_train.max()
feats_inducing=RealFeatures(X_inducing.reshape(1,len(X_inducing)))

# create FITC inference method and GP instance
inf = FITCInferenceMethod(GaussianKernel(10, best_width), feats_train, ZeroMean(), labels_train, \
                          GaussianLikelihood(best_sigma), feats_inducing)
gp = GaussianProcessRegression(inf)

start=time.time()
gp.train()
means = gp.get_mean_vector(feats_test)
variances = gp.get_variance_vector(feats_test)
print "FITC inference took %.2f seconds" % (time.time()-start)

# exact GP
start=time.time()
inf_exact = ExactInferenceMethod(GaussianKernel(10, best_width), feats_train, ZeroMean(), labels_train, \
                                 GaussianLikelihood(best_sigma))
inf_exact.set_scale(best_scale)
gp_exact = GaussianProcessRegression(inf_exact)
gp_exact.train()
means_exact = gp_exact.get_mean_vector(feats_test)
variances_exact = gp_exact.get_variance_vector(feats_test)
print "Exact inference took %.2f seconds" % (time.time()-start)

# comparison plot FITC and exact inference, plot 95% confidence of both predictive distributions
plt.figure(figsize=(18,5))
plt.plot(X_test, y_test, color="black", linewidth=3)
plt.plot(X_test, means, 'r--', linewidth=3)
plt.plot(X_test, means_exact, 'b--', linewidth=3)
plt.plot(X_train, y_train, 'ro')
plt.plot(X_inducing, np.zeros(len(X_inducing)), 'g*', markersize=15)

# tube plot of 95% confidence
error=1.96*np.sqrt(variances)
plt.plot(X_test,means-error, color='red', alpha=0.3, linewidth=3)
plt.fill_between(X_test,means-error,means+error,color='red', alpha=0.3)
error_exact=1.96*np.sqrt(variances_exact)
plt.plot(X_test,means_exact-error_exact, color='blue', alpha=0.3, linewidth=3)
plt.fill_between(X_test,means_exact-error_exact,means_exact+error_exact,color='blue', alpha=0.3)

# plot upper confidence lines later due to legend
plt.plot(X_test,means+error, color='red', alpha=0.3, linewidth=3)
plt.plot(X_test,means_exact+error_exact, color='blue', alpha=0.3, linewidth=3)

plt.legend(["True", "FITC prediction", "Exact prediction", "Data", "Inducing points", "95% FITC", "95% Exact"])
_=plt.title("Comparison FITC and Exact Regression")

Note how the FITC inference method is much faster than exact inference for this number of data points. However, depending on the placement of the inducing points, the results might be inaccurate. As mentioned, a uniform spacing over the input space would be optimal.

Soon to Come

  • Overview of all base classes for GPs
  • Excurs: Automatic relevance determination
  • Real life examples for regression and classification, including classification of sequences
  • Fully Bayesian treatment of $\boldsymbol{\theta}$
