Variational Inference for Gaussian Process Classification Models

By Wu Lin - yorker.lin@gmail.com - https://github.com/yorkerlin Suggested by Heiko Strathmann and Mohammad Emtiyaz Khan Based on the notebook of Gaussian Processes by Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de, and the GP framework of the Google summer of code 2014 project of Wu Lin, - Google summer of code 2013 project of Roman Votyakov - votjakovr@gmail.com - github.com/votjakovr, and the Google summer of code 2012 project of Jacob Walker - walke434@gmail.com - github.com/puffin444

This notebook describes how to do variational inference for Gaussian Process classification (GPC) models in Shogun. In a GPC model, a set of GP functions is used to predict the label of a data point given its features. We assume that readers have some background in Bayesian statistics; for that background, please see the notebook about Gaussian Processes. After providing a semi-formal introduction, we illustrate how to train models, make predictions, and automatically learn hyper-parameters for GPC models in Shogun.


In [ ]:
%matplotlib inline
# import all shogun classes
from modshogun import *

# import all required libraries
import scipy
import scipy.io
import numpy as np
from math import exp,sqrt,log
import random
import time
import matplotlib.pyplot as plt

Brief Introduction

In binary classification, we are given binary labels in the form of a column vector, $\mathbf{y}\in\mathcal{Y}^n=\{-1,+1\}^n$, and features as a matrix, $\mathbf{X}\in\mathcal{R}^{n\times d}$, where $n$ is the number of data points and $d$ is the number of features. A likelihood function $p(\mathbf{y}|\text{f},\mathbf{X})=p(\mathbf{y}|\mathbf{f})$ is used to model the labels, where a function $\text{f}:\mathcal{R}^{d}\to \mathcal{R}$ is drawn from a Gaussian Process prior and $\text{f}(\mathbf{X})=\mathbf{f} \in \mathcal{R}^{n}$ is the column vector obtained by applying f to each row of $\mathbf{X}$. Note that the difference between f and $\mathbf{f}$ is that f is a function while $\mathbf{f}$ is an n-dimensional vector. $p(\text{f})$ and $p(\mathbf{f})$ are also different: $p(\text{f})$ is a distribution in an infinite-dimensional (function) space while $p(\mathbf{f})$ is a distribution in a finite-dimensional space.

Given data with labels, our goal is to train a Gaussian Process classifier that fits the data well. In other words, we want to learn the posterior, $p(\text{f}|\mathbf{y},\mathbf{X}) \propto p(\mathbf{y}| \text{f},\mathbf{X}) \cdot p(\text{f})$, given the training data points. Note that $p(\text{f})$ is the prior over the GP functions. In fact, we will see in the next section that all we need is to learn $p(\mathbf{f}|\mathbf{y})$ (note that we write $\mathbf{f}$ for $\text{f}(\mathbf{X})$ for short).

The key intuition of variational inference is to approximate the distribution of interest (here $p(\mathbf{f}|\mathbf{y})$, which is a non-Gaussian distribution) with a more tractable distribution (here a Gaussian $q(\mathbf{f}|\mathbf{y})$), by minimizing the Kullback–Leibler divergence (KL divergence)

$${\mathrm{KL}}(Q\|P) = \int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{p(\mathbf{f}|\mathbf{y})}\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f}$$

between the true distribution and the approximation.

Please see the next section if you want to know why a non-Gaussian $p(\mathbf{f}|\mathbf{y})$ cannot be used directly in the inference process of GPC models.

Throughout this notebook, we deal with binary classification using the inverse-logit (also known as Bernoulli-logistic) likelihood to model labels, given by

$p(\mathbf{y}|\mathbf{f})=\prod_{i=1}^n p(y_\text{i}|\text{f}(\mathbf{X}_\text{i}))=\prod_{i=1}^n \frac{1}{1+\exp(-y_\text{i} \mathbf{f}_\text{i})}$, where $\mathbf{X}_\text{i}$ is the i-th row of $\mathbf{X}$ and $\text{f}(\mathbf{X}_\text{i})=\mathbf{f}_\text{i}$ is the i-th element of $\mathbf{f}$.
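
To make this likelihood concrete, here is a minimal NumPy sketch (independent of Shogun; the function name and the toy vectors are only for illustration) that evaluates $\ln(p(\mathbf{y}|\mathbf{f}))$ for the inverse-logit model.


In [ ]:
import numpy as np

def log_logit_likelihood(y, f):
    """Evaluate ln p(y|f) = sum_i -ln(1 + exp(-y_i * f_i)) for labels y in {-1, +1}."""
    # np.logaddexp(0, -y*f) computes ln(1 + exp(-y*f)) in a numerically stable way
    return -np.logaddexp(0.0, -y * f).sum()

# toy check: confident, correct predictions give a log-likelihood close to 0
y = np.array([1.0, -1.0, 1.0])
f = np.array([5.0, -4.0, 0.1])
print(log_logit_likelihood(y, f))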

More Detailed Background about GPC Models (Skip if you just want code examples)

In GPC models, our goal is to predict the label ($y_\text{new}$) of a new data point, a column vector $\mathbf{x}_\text{new}\in\mathcal{R}^{d}$, based on $p(y_\text{new}|\mathbf{y},\mathbf{X},\mathbf{x}_\text{new})$. According to Bayes' theorem, the posterior predictive distribution is $$p(y_\text{new}|\mathbf{y},\mathbf{X},\mathbf{x}_\text{new})= \int {p(y_\text{new}|\text{f},\mathbf{x}_\text{new})p(\text{f}|\mathbf{y},\mathbf{X}) d\text{f}}$$ Informally, according to the marginalization property of GPs, we have:

$$\int {p(y_\text{new}|\text{f},\mathbf{x}_\text{new})p(\text{f}|\mathbf{y},\mathbf{X}) d\text{f}}= \int {p(y_\text{new}|\mathbf{f}_\text{new})p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X}) d\mathbf{f}_\text{new}},$$

where $\text{f}(\mathbf{x}_\text{new})=\mathbf{f}_\text{new}$, $p(y_\text{new}|\mathbf{f}_\text{new})=p(y_\text{new}|\text{f},\mathbf{x}_\text{new})$, and $p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X})=p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X}, \mathbf{x}_\text{new})$.

The key difference here is that $p(\text{f}|\mathbf{y}, \mathbf{X})$ is a distribution in an infinite-dimensional (function) space while $p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X})$ is a distribution in a one-dimensional space.

Note that $p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X})=\int {p(\mathbf{f}_\text{new}|\text{f}) p(\text{f}|\mathbf{y},\mathbf{X}) d\text{f}}$. Similarly, $\int {p(\mathbf{f}_\text{new}|\text{f}) p(\text{f}|\mathbf{y},\mathbf{X}) d\text{f}}=\int {p(\mathbf{f}_\text{new}|\mathbf{f}) p(\mathbf{f}|\mathbf{y}) d\mathbf{f}}$. Again, $p(\mathbf{f}|\mathbf{y})=p(\mathbf{f}|\mathbf{y}, \mathbf{X})$ is a distribution in an n-dimensional space while $p(\text{f}|\mathbf{y},\mathbf{X})$ is a distribution in an infinite-dimensional space. Informally, according to the definition of a GP, the following holds:

$p(\mathbf{f}_\text{new}|\mathbf{f})=\frac{p(\text{f}([\mathbf{X};\mathbf{x}_\text{new}^T]))}{p(\text{f}(\mathbf{X}))}$, which follows from the conditional properties of jointly Gaussian variables, where $[\mathbf{X};\mathbf{x}_\text{new}^T]\in \mathcal{R}^{(n+1)\times d}$ and $\text{f}([\mathbf{X};\mathbf{x}_\text{new}^T]) \in \mathcal{R}^{n+1}$.

Note that if $p(\mathbf{f}|\mathbf{y})$ is a Gaussian distribution, $p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X})$ has a closed form. However, in classification $p(\mathbf{f}|\mathbf{y})$ is usually NOT a Gaussian distribution, since $p(\mathbf{f}|\mathbf{y}) \propto p(\mathbf{y}|\mathbf{f})p(\mathbf{f})$, where $p(\mathbf{f})$ is a Gaussian distribution but $p(\mathbf{y}|\mathbf{f})$ (the likelihood) is not.
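
As a worked example of this closed form, the sketch below (plain NumPy, assuming a zero-mean GP prior; all names are illustrative) computes the mean and variance of the latent predictive distribution $p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X})$ once a Gaussian approximation $q(\mathbf{f}|\mathbf{y})=N(\mu,\Sigma)$ is available.


In [ ]:
import numpy as np

def latent_predictive(K, k_star, k_starstar, mu, Sigma):
    """Mean and variance of f_new under a Gaussian approximation q(f|y) = N(mu, Sigma).

    K          - n x n prior covariance of the training latents (zero prior mean assumed)
    k_star     - n-vector of prior covariances between the training latents and f_new
    k_starstar - prior variance of f_new
    """
    alpha = np.linalg.solve(K, k_star)                      # K^{-1} k_*
    mean = alpha.dot(mu)                                    # k_*^T K^{-1} mu
    var = k_starstar - alpha.dot(k_star) + alpha.dot(Sigma).dot(alpha)
    return mean, var

The predictive class probability $p(y_\text{new}=1|\mathbf{y},\mathbf{X},\mathbf{x}_\text{new})$ then only requires a one-dimensional integral of the inverse-logit against this univariate Gaussian.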

Variational inference in GPC approximates $p(\mathbf{f}|\mathbf{y})$ with a Gaussian distribution, $q(\mathbf{f}|\mathbf{y})$, by minimizing the KL divergence, ${\mathrm{KL}}(Q\|P)$. Readers may note that the KL divergence is asymmetric: if we minimize ${\mathrm{KL}}(P\|Q)$ instead of ${\mathrm{KL}}(Q\|P)$, the method is called expectation propagation (EP).
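
To see the asymmetry concretely, the following sketch compares ${\mathrm{KL}}(Q\|P)$ and ${\mathrm{KL}}(P\|Q)$ for two univariate Gaussians using the standard closed-form expression; the numbers are only illustrative.


In [ ]:
import numpy as np

def kl_gauss_1d(mu_q, var_q, mu_p, var_p):
    """KL(Q||P) for univariate Gaussians Q = N(mu_q, var_q) and P = N(mu_p, var_p)."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

print(kl_gauss_1d(0.0, 1.0, 1.0, 4.0))   # KL(Q||P), roughly 0.44
print(kl_gauss_1d(1.0, 4.0, 0.0, 1.0))   # KL(P||Q), roughly 1.31: a different value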

Variational Inference in GPC Models (Skip if you just want code examples)

As mentioned in the first section, variational inference in GPC models is about minimizing the KL divergence given below:

${\mathrm{KL}}(Q\|P) = \int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{p(\mathbf{f}|\mathbf{y})}\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f} = \int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{ \frac{p(\mathbf{y}|\mathbf{f})p(\mathbf{f})}{p(\mathbf{y})} }\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f}=\int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f}) }\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f} + \ln(p(\mathbf{y}))$, where the last term, $\ln(p(\mathbf{y}))$, does not depend on $q(\mathbf{f}|\mathbf{y})$ and is therefore a constant during optimization.

Another way to view variational inference in GPC is as maximizing a lower bound on the log marginal likelihood, $\ln (p(\mathbf{y}|\mathbf{X}))$.

$\ln (p(\mathbf{y}|\mathbf{X})) = \ln (\int_{-\infty}^\infty {p(\mathbf{y}|\text{f})p(\text{f}|\mathbf{X})} d\text{f}) = \ln (\int_{-\infty}^\infty {p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} d\mathbf{f})= \ln (\int_{-\infty}^\infty { q(\mathbf{f}|\mathbf{y}) \frac{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} {q(\mathbf{f}|\mathbf{y})}} d\mathbf{f}) \geq \int_{-\infty}^\infty { q(\mathbf{f}|\mathbf{y}) \ln \left( \frac{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} {q(\mathbf{f}|\mathbf{y})} \right)} d\mathbf{f}$

where the inequality is based on Jensen’s inequality.

$\int_{-\infty}^\infty { q(\mathbf{f}|\mathbf{y}) \ln ( \frac{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} {q(\mathbf{f}|\mathbf{y})}} ) d\mathbf{f}=- \int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f}) }\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f} = -E_q[\ln(q(\mathbf{f}|\mathbf{y}))] + E_q[\ln(p(\mathbf{f}))] + E_q[\ln(p(\mathbf{y}|\mathbf{f}))]$,

where $E_q[\cdot]$ is the expectation with respect to $q(\mathbf{f}|\mathbf{y})$.
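
For a Gaussian approximation $q(\mathbf{f}|\mathbf{y})=N(\mu,\Sigma)$ and a Gaussian prior $p(\mathbf{f})=N(m_0,K)$, the first two terms of this bound have closed forms (their sum equals $-{\mathrm{KL}}(q(\mathbf{f}|\mathbf{y})\|p(\mathbf{f}))$). A minimal NumPy sketch of just these two terms, with illustrative names only:


In [ ]:
import numpy as np

def gaussian_elbo_terms(mu, Sigma, m0, K):
    """Closed-form part of the bound: -E_q[ln q(f|y)] + E_q[ln p(f)],
    where q = N(mu, Sigma) and the prior is p(f) = N(m0, K)."""
    n = len(mu)
    # entropy of q: -E_q[ln q(f|y)]
    entropy = 0.5 * (n * np.log(2.0 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])
    # cross term: E_q[ln p(f)]
    diff = mu - m0
    cross = -0.5 * (n * np.log(2.0 * np.pi) + np.linalg.slogdet(K)[1]
                    + np.trace(np.linalg.solve(K, Sigma))
                    + diff.dot(np.linalg.solve(K, diff)))
    # the remaining term E_q[ln p(y|f)] is discussed below
    return entropy + cross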

Note that the last term, $E_q[\ln(p(\mathbf{y}|\mathbf{f}))]$, usually does NOT have a closed form in GPC.

Dealing with the Non-closed Form Issue in GPC Models

We can show that $E_q[\ln(p(\mathbf{y}|\mathbf{f}))]$ can be expressed as a sum of one-dimensional integrals via the marginalization property of the multivariate Gaussian distribution. Note that, according to De Finetti's theorem, the labels ($\mathbf{y}$) are exchangeable since we assume they are generated independently given the latent GP function values.

$$E_q[\ln(p(\mathbf{y}|\mathbf{f}))]=E_q[\sum_{i=1}^n {\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))}]=\sum_{i=1}^n {E_q[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]}=\sum_{i=1}^n {E_{q_\text{i}}[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]}$$

where $q$ denotes the multivariate Gaussian $q(\mathbf{f}|\mathbf{y})=N(\mu, \Sigma)$, $q_\text{i}$ denotes the univariate Gaussian $q_\text{i}(\mathbf{f}_\text{i}|\mathbf{y}_\text{i})=N(\mu_\text{i}, \Sigma_\text{i,i})$, and $\mu_\text{i}$ and $\Sigma_\text{i,i}$ are the i-th element of the mean $\mu$ and the i-th diagonal element of the covariance matrix $\Sigma$ of $q(\mathbf{f}|\mathbf{y})$, respectively.

Of course, even $E_{q_\text{i}}[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]$ usually does not have a closed form in GPC. However, this one-dimensional problem is relatively easy to deal with: one way is one-dimensional numerical integration, and another is to use a variational bound. For now, we assume this expectation can be computed; we return to it later in the notebook.
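
As a concrete illustration of the numerical-integration route, the sketch below approximates each one-dimensional expectation with Gauss-Hermite quadrature using NumPy only; the helper name and the quadrature degree are illustrative.


In [ ]:
import numpy as np

def expected_log_logit(y_i, mu_i, var_i, degree=20):
    """Approximate E_{q_i}[ln p(y_i|f_i)] with q_i = N(mu_i, var_i) and
    ln p(y_i|f_i) = -ln(1 + exp(-y_i * f_i)), using Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(degree)
    f = mu_i + np.sqrt(2.0 * var_i) * nodes     # change of variables for N(mu_i, var_i)
    log_lik = -np.logaddexp(0.0, -y_i * f)
    return (weights * log_lik).sum() / np.sqrt(np.pi)

# the full term is the sum over data points:
# sum(expected_log_logit(y[i], mu[i], Sigma[i, i]) for i in range(len(y)))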

Summary of Inference Methods in GPC Models

Each inference method and the idea behind its approximation:

  • Laplace Method: uses a second-order Taylor expansion at the mode of the true posterior
  • Covariance Variational Method: minimizes KL(approximated distribution || true posterior), where the approximated distribution has a complete covariance matrix
  • Mean-field Variational Method: minimizes KL(approximated distribution || true posterior), where the approximated distribution has a diagonal covariance matrix
  • Cholesky Variational Method: minimizes KL(approximated distribution || true posterior), where the approximated distribution has a complete covariance matrix in Cholesky representation
  • Dual Variational Method: minimizes KL(approximated distribution || true posterior) in a dual form
  • Expectation Propagation Method: minimizes KL(true posterior || approximated distribution)

Summary of Optimizers used in Inference Methods in GPC Models

Summary of Supported Line Search Methods used in Inference Methods for GPC models

A Toy Example of Variational Inference

Now, we present a 2-D example of GP classification using variational inference. The Gaussian prior and the likelihood are shown in Fig. (a) and (b). For this simple example, we can numerically compute the unnormalized true posterior, shown in Fig. (c), using Bayes' rule: $p(\mathbf{f}|\mathbf{y}) \propto p(\mathbf{y}|\mathbf{f})p(\mathbf{f})$, where $p(\mathbf{f})$ is the prior and $p(\mathbf{y}|\mathbf{f})$ is the likelihood. The approximated posteriors obtained using various methods are shown in Fig. (d)-(i). These figures can be obtained using the following code. Note that this example is closely related to the example on pages 7-8 of this paper.


In [ ]:
def Gaussian_points(Sigma,mu,xmin,xmax,ymin,ymax,delta=0.01):
    """
    This function is used to evaluate the likelihood of a Gaussian distribution on mesh.
    
    Parameters:
         Sigma - covariance of Gaussian
         mu - mean of Gaussian
         xmin, xmax, ymin, ymax, delta - defines mesh
         
    Returns:
    X, Y, Z, where Z = log(p(X,Y)) and p is a bivariate gaussian (mu, Sigma) evaluated at a mesh X,Y
    """

    xlist = np.arange(xmin, xmax, delta)
    ylist = np.arange(ymin, ymax, delta)
    X, Y = np.meshgrid(xlist, ylist)
    model = GaussianDistribution(mu, Sigma)
    Z = np.zeros(len(X)*len(Y))
    idx = 0
    for items in zip(X,Y):
        for sample in zip(items[0],items[1]):
            sample = np.asarray(sample)
            Z[idx] = model.log_pdf(sample)
            idx += 1
    Z = np.asarray(Z).reshape(len(X),len(Y))
    return (X,Y,Z)

In [ ]:
def likelihood_points(X,Y,labels,likelihood):
    """
    This function is used to evaluate a given likelihood function on a given mesh of points
    
    Parameters:
        X, Y - coordinates of the mesh
        labels - labels for the likelihood
        likelihood - likelihood function
        
    Returns:
        Z - log(p(X,Y,labels)), where p comes from likelihood
    """
    
    Z = np.zeros(len(X)*len(Y))    
    idx = 0
    for items in zip(X,Y):
        for sample in zip(items[0],items[1]):
            sample = np.asarray(sample)
            lpdf = likelihood.get_log_probability_f(labels, sample).sum()
            Z[idx] = lpdf
            idx += 1
    Z = np.asarray(Z).reshape(len(X),len(Y))
    return Z

In [ ]:
def approx_posterior_plot(methods, kernel_func, features, mean_func, 
                          labels, likelihoods, kernel_log_scale, 
                          xmin, xmax, ymin, ymax, delta, plots):
    """
    This function is used to generate points drawn from approximated posterior and plot them
        
    Parameters:
        methods - a list of methods used to approximate the posterior
        kernel_func - a covariance function for GP
        features - X
        mean_func -  mean function for GP
        labels - Y
        likelihoods - a list of data likelihoods to model labels
        kernel_log_scale - a hyper-parameter of the covariance function
        xmin, xmax, ymin, ymax, delta - define the mesh on which each approximated posterior is evaluated
        plots - subplot objects the approximated posteriors are drawn on
    
    Returns:
        Nothing
    """
    
    (rows, cols) = plots.shape
    methods = np.asarray(methods).reshape(rows, cols)
    likelihoods = np.asarray(likelihoods).reshape(rows, cols)
    for r in xrange(rows):
        for c in xrange(cols):
            inference = methods[r][c]
            likelihood = likelihoods[r][c]
            inf = inference(kernel_func, features, mean_func, labels, likelihood())
            inf.set_scale(exp(kernel_log_scale))
            #get the approximated Gaussian distribution
            mu = inf.get_posterior_mean()
            Sigma = inf.get_posterior_covariance()
            #normalized approximated posterior
            (X,Y,Z) = Gaussian_points(Sigma, mu, xmin, xmax, ymin, ymax, delta)
            plots[r][c].contour(X, Y, np.exp(Z))
            plots[r][c].set_title('posterior via %s'%inf.get_name())
            plots[r][c].axis('equal')

In [ ]:
#a toy 2D example (data)
x=np.asarray([sqrt(2),-sqrt(2)]).reshape(1,2)
y=np.asarray([1,-1])

features = RealFeatures(x)
labels = BinaryLabels(y)
kernel_log_sigma = 1.0
kernel_log_scale = 1.5

#a mean function and a covariance function for GP
mean_func = ConstMean()
#using log_sigma as a hyper-parameter of GP instead of sigma
kernel_sigma = 2*exp(2*kernel_log_sigma)
kernel_func = GaussianKernel(10, kernel_sigma)
kernel_func.init(features, features)
#a prior distribution derived from GP via applying the mean function and the covariance function to data
Sigma = kernel_func.get_kernel_matrix()
Sigma = Sigma * exp(2.0*kernel_log_scale)
mu = mean_func.get_mean_vector(features)

delta = 0.1
xmin = -4
xmax = 6
ymin = -6
ymax = 4

#a prior (Gaussian) derived from GP
(X,Y,Z1) = Gaussian_points(Sigma, mu, xmin, xmax, ymin, ymax, delta)

col_size = 6
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(col_size*3,col_size))

ax1.contour(X, Y, np.exp(Z1))
ax1.set_title('Prior')
ax1.axis('equal')

#likelihood (inverse logit, A.K.A. Bernoulli-logistic)
#likelihoods classes for inference methods  (see the table above)
likelihoods=[
LogitLikelihood,
LogitLikelihood,
LogitVGLikelihood,
LogitVGLikelihood,
LogitVGLikelihood,
LogitDVGLikelihood
]


Z2 = likelihood_points(X,Y,labels,LogitLikelihood())
ax2.contour(X, Y, np.exp(Z2))
ax2.set_title('Likelihood')
ax2.axis('equal')

#an unnormalized true posterior (non-Gaussian)
Z3 = Z1+Z2
ax3.contour(X, Y, np.exp(Z3))
ax3.set_title('True posterior')
ax3.axis('equal')

f, plots =plt.subplots(2, 3, figsize=(col_size*3,col_size*2))

#Inference methods used to approximate a posterior (see the table below)
methods=[
SingleLaplacianInferenceMethod,
SingleLaplacianInferenceMethodWithLBFGS,
KLApproxDiagonalInferenceMethod,
KLCovarianceInferenceMethod,
KLCholeskyInferenceMethod,
KLDualInferenceMethod
]

approx_posterior_plot(methods, kernel_func, features, mean_func, labels, likelihoods, kernel_log_scale, 
                      xmin, xmax, ymin, ymax, delta, plots)

plt.show()

Remark: The true posterior is non-Gaussian and its normalizing constant is usually difficult to compute. All approximated posteriors are Gaussian; a Gaussian posterior distribution is required to perform inference and make predictions. The Laplace approximation is fast, but it is not ideal in terms of approximating the posterior and making predictions. At the price of speed, variational methods such as KLCovariance and KLCholesky offer more accurate approximations than the Laplace method. The KLDual method can also give a good approximation and in many cases is better than the Laplace method. On the other hand, the KLApproxDiagonal (mean-field) method is as fast as the Laplace method, and for certain data sets it is more accurate than the Laplace method in terms of making predictions.

A Real-World Example

We apply variational methods to the sonar data set, which can be found here. This is a binary classification problem. The goal of this example is to compare all inference methods implemented in Shogun in terms of making predictions. The following is the description of the data set. This example reproduces the example on page 29 of this paper. Please refer to the paper for the definition of the information bit measure.

The class "sonar.mines" contains 111 data points obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. The class "sonar.rocks" contains 97 data points obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Each data point is a set of 60 features in the range 0.0 to 1.0. Each feature represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occur later in time, since these frequencies are transmitted later during the chirp. The label associated with each record contains the letter "R" if the object is a rock and "M" if it is a mine (metal cylinder). The features in the labels are in increasing order of aspect angle, but they do not encode the angle directly.


In [ ]:
def learning_example1(inference, linesearch, likelihood, x_train, y_train, x_test, y_test, kernel_log_sigma, kernel_log_scale):
    
    """
    This function is used to train a GP classifier given an inference method
        
    Parameters:
        inference - an inference method used to train the classifier
        linesearch - a linesearch method used in the inference method 
        likelihood - likelihood function to model data
        x_train, x_test - X for training and testing
        y_train, y_test - Y for training and testing
        kernel_log_sigma, kernel_log_scale - hyper-parameters of the covariance function
    Returns:
        predictive result of the testing data set
    """

    mean_func = ZeroMean()
    kernel_sigma = 2*exp(2*kernel_log_sigma);
    kernel_func = GaussianKernel(10, kernel_sigma)

    #Y is a sample-by-1 vector
    labels_train = BinaryLabels(y_train)
    labels_test = BinaryLabels(y_test)
    #X is a feature-by-sample matrix
    features_train=RealFeatures(x_train)
    features_test=RealFeatures(x_test)

    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood)
    inf.set_scale(exp(kernel_log_scale))
    try:
        #setting lbfgs parameters
        inf.set_lbfgs_parameters(100,2000,int(linesearch),2000)
        #used to make sure the kernel matrix is positive definite
        inf.set_noise_factor(1e-6)
        inf.set_min_coeff_kernel(1e-5)
        inf.set_max_attempt(10)
    except:
        pass

    gp = GaussianProcessClassification(inf)
    gp.train()
    prob=gp.get_probabilities(features_test)

    return prob, inf.get_name()

In [ ]:
labels={
    "m":1,
    "r":-1,
       }

def extract(inf, train_size):
    """
    This function is used to process raw data
        
    Parameters:
        inf - the path of the raw data
        train_size - number of data points used for training
        
    Returns:
        x_train, x_test - X for training and testing
        y_train, y_test - Y for training and testing
        B - for computing the information bit
    """

    random.seed(1)
    x=[]
    y=[]
    for line in open(inf):
        line=line.strip()
        info=line.split(',')
        label=labels[info[-1].lower()]
        x.append(map(float, info[:-1]))
        y.append(label)
        
    #train_size should be less than the size of all available data points
    assert train_size < len(x) 
    idx=range(len(y))
    random.shuffle(idx)
    train_idx=set(idx[:train_size])
    test_idx=set(idx[train_size:])
    x_train = np.asarray([value for (idx, value) in enumerate(x) if idx in train_idx]).T
    y_train = np.asarray([label for (idx, label) in enumerate(y) if idx in train_idx])
    x_test = np.asarray([value for (idx, value) in enumerate(x) if idx in test_idx]).T
    y_test = np.asarray([label for (idx, label) in enumerate(y) if idx in test_idx])

    y_train_positive=(y_train==1).sum()
    y_train_negative=(y_train==-1).sum()
    y_test_positive=(y_test==1).sum()
    y_test_negative=(y_test==-1).sum()

    prb_positive=float(y_train_positive)/len(y_train)
    B=float(y_test_positive)*log(prb_positive,2)+float(y_test_negative)*log(1.0-prb_positive,2)
    B=-B/len(y_test)

    return x_train, y_train, x_test, y_test, B

In [ ]:
def approx_bit_plot(methods, linesearchs, likelihoods, x_train, y_train, x_test , y_test, plots, lScale, lSigma, B):
    """
    This function is used to plot information bit figures
        
    Parameters:
        methods - a list of inference methods used to train the classifier
        linesearchs - a list of linesearch methods used in the inference method
        likelihood - a list of likelihood functions to model data
        x_train, x_test - X for training and testing
        y_train, y_test - Y for training and testing
        plots  - plot objects
        lScale, lSigma - meshgrids of hyper-parameter values for the covariance function
        B - for computing information bits
        
    Returns:
        Nothing (the information-bit contours are drawn on the given plot objects)
    """
    #Note that information bit is used to measure effectiveness of an inference method
    if len(plots.shape)==1:
        rows=1
        cols=plots.shape[0]
    else:
        (rows, cols) = plots.shape

    methods=np.asarray(methods).reshape(rows, cols)
    linesearchs=np.asarray(linesearchs).reshape(rows, cols)
    likelihoods=np.asarray(likelihoods).reshape(rows, cols) 

    for r in xrange(rows):
        for c in xrange(cols):
            inference = methods[r][c]
            linesearch = linesearchs[r][c]
            likelihood = likelihoods[r][c]
            try:
                likelihood.set_noise_factor(1e-15)
                likelihood.set_strict_scale(0.01)
            except:
                pass
            scores=[]
            for items in zip(lScale, lSigma):
                for parameters in zip(items[0],items[1]):
                    lscale=parameters[0]
                    lsigma=parameters[1]
                    (prob, inf_name)=learning_example1(inference, linesearch, likelihood, x_train, y_train, x_test,  y_test, 
                                              lscale, lsigma)
                    #compute the information bit
                    score=0.0
                    for (label_idx, prb_idx) in zip(y_test, prob):
                        if label_idx==1:
                            score+=(1.0+label_idx)*log(prb_idx,2)
                        else: #label_idx==-1:
                            score+=(1.0-label_idx)*log(1.0-prb_idx,2)
                    score=score/(2.0*len(y_test))+B
                    scores.append(score)
            scores=np.asarray(scores).reshape(len(lScale),len(scores)/len(lScale))

            if len(plots.shape)==1:
                sub_plot=plots
            else:
                sub_plot=plots[r]
            CS =sub_plot[c].contour(lScale, lSigma, scores)
            sub_plot[c].clabel(CS, inline=1, fontsize=10)
            sub_plot[c].set_title('information bit for %s'%inf_name)
            sub_plot[c].set_xlabel('log_scale')
            sub_plot[c].set_ylabel('log_sigma')
            sub_plot[c].axis('equal')

In [ ]:
#an example for the sonar data set
train_size=108
(x_train, y_train, x_test, y_test, B)=extract('../../../data/uci/sonar/sonar.all-data', train_size)
inference_methods =[
                  EPInferenceMethod,
                  KLDualInferenceMethod,
                  SingleLaplacianInferenceMethod,
                  KLApproxDiagonalInferenceMethod,
                  KLCholeskyInferenceMethod,
                  KLCovarianceInferenceMethod,
                  ]

likelihoods =[
            LogitLikelihood(),
            LogitDVGLikelihood(), #KLDual method uses a likelihood class that supports dual variational inference
            LogitLikelihood(),
            LogitVGLikelihood(), #KL method uses a likelihood class that supports variational inference
            LogitVGLikelihood(), #KL method uses a likelihood class that supports variational inference
            LogitVGLikelihood(), #KL method uses a likelihood class that supports variational inference
            ]

linesearchs =[
            3,
            1, #KLDual method using the type 3 line_search method will cause an error during computing the information bit
            3,
            3,
            3,
            3,
            ]

col_size=8
lscale_min=0.0
lscale_max=5.0
lsigma_min=0.0
lsigma_max=5.0
delta=0.5
scale=5.0

lscale_list = np.arange(lscale_min, lscale_max, delta)
lsigma_list = np.arange(lsigma_min, lsigma_max, delta*scale)
lScale, lSigma = np.meshgrid(lscale_list, lsigma_list)
f, plots =plt.subplots(2, 3, figsize=(col_size*3,col_size*2))

approx_bit_plot(inference_methods, linesearchs, likelihoods, x_train, y_train,  x_test, y_test, plots, lScale, lSigma, B)

plt.show()

Note that the information bit is used to measure the accuracy of classification; for a perfect classifier the information bit should be 1. From these figures, we can observe that the Laplace method performs worst among these methods in terms of the information bit. The EP method, the KLCholesky method and the KLCovariance method are equally good. Surprisingly, for this data set the KLApproxDiagonal method performs as well as the EP method in terms of the information bit. Note that the KLApproxDiagonal method enforces sparsity in the off-diagonal entries of the posterior covariance matrix.

Another Real-world Example

We use the US Postal Service (USPS) database of handwritten digits as another example. The dataset refers to numeric data obtained from the scanning of handwritten digits from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size normalized, resulting in 16 x 16 grayscale images.

In this example we consider classifying images of digit 3 versus images of digit 5, which means only images of digits 3 and 5 are used. Note that this example is also used in Section 3.7.3 of the textbook, Gaussian Processes for Machine Learning.


In [ ]:
def binary_extract(idx, features, labels):
    """
    This function is used to extract a training set and a test set
    
    Parameters:
        idx - a 2-wide list of digits (e.g., [0,3] means extracting the subset of images and labels for digit 0 and digit 3)
        features - the whole image set 
        labels - the whole label set
        
    Returns:
        binary_features - X (images for binary classification)
        binary_labels - y (labels for binary classification)
    """
    #binary classification
    assert len(idx) == 2
    positive_idx = (labels[idx[0],:] == 1)
    negative_idx = (labels[idx[1],:] == 1)
    binary_idx = (positive_idx | negative_idx)
    ds = binary_idx.shape[-1]

    bin_labels = np.zeros(ds) 
    bin_labels[positive_idx] = 1
    bin_labels[negative_idx] = -1

    binary_features = (features[:,binary_idx])
    binary_labels = (bin_labels[binary_idx])
 
    positive_count = bin_labels[positive_idx].shape[-1]
    negative_count = bin_labels[negative_idx].shape[-1]
    binary_count = binary_labels.shape[-1]
 
    print "There are %d positive samples and %d negative samples" %(positive_count, negative_count)
    return binary_features, binary_labels

In [1]:
def learning_example2(inference, likelihood, x_train, x_test, y_train, y_test):
    """
    This function is used to train a GP classifier given an inference method
        
    Parameters:
        inference - an inference method used to train the classifier
        likelihood - likelihood function to model data
        x_train, x_test - X for training and testing
        y_train, y_test - Y for training and testing 
        
    Returns:
        Nothing
    """
        
    #train a GP classifer
    error_eval = ErrorRateMeasure()
    mean_func = ConstMean()
    #set hyper-parameters of covariance function
    kernel_log_sigma = 1.0
    kernel_sigma = 2*exp(2*kernel_log_sigma);
    kernel_func = GaussianKernel(10, kernel_sigma)

    #sample by 1
    labels_train = BinaryLabels(y_train)
    labels_test = BinaryLabels(y_test)
    #feature by sample
    features_train=RealFeatures(x_train)
    features_test=RealFeatures(x_test)

    kernel_log_scale = 1.0

    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood)
    print "\nusing %s"%inf.get_name()
    
    inf.set_scale(exp(kernel_log_scale))
    try:
        inf.set_lbfgs_parameters(100,80,3,80)
    except:
        pass

    start = time.time()
    gp = GaussianProcessClassification(inf)
    gp.train()
    end = time.time()
    print "cost %.2f seconds at training"%(end-start)
    nlz=inf.get_negative_log_marginal_likelihood()
    print "the negative_log_marginal_likelihood is %.4f"%nlz
    start = time.time()
    #classification on train_data
    pred_labels_train = gp.apply_binary(features_train)
    #classification on test_data
    pred_labels_test = gp.apply_binary(features_test)
    end = time.time()    
    print "cost %.2f seconds at prediction"%(end-start)
    
    error_train = error_eval.evaluate(pred_labels_train, labels_train)
    error_test = error_eval.evaluate(pred_labels_test, labels_test)
    
    print "Train error : %.2f Test error: %.2f\n" % (error_train, error_test)

In [ ]:
#an example for the USPS data set
inf='../../../data/toy/usps/usps_resampled/usps_resampled.mat'
data=scipy.io.loadmat(inf)
train_labels=data['train_labels']
test_labels=data['test_labels']
train_features=data['train_patterns']
test_features=data['test_patterns']
#using images of digit 3 and digit 5 from the dataset
idx=[3,5]
#Note that
#y_train and y_test follow the definition in the first section
#the transposes of x_train and x_test follow the definition in the first section
print "Training set statistics"
(x_train, y_train)=binary_extract(idx,train_features, train_labels)
print "Test set statistics"
(x_test, y_test)=binary_extract(idx,test_features, test_labels)

Laplace Method in GPC Models

The Laplace method can be viewed as a special case of a variational method. The idea of the Laplace method is to use a second-order Taylor expansion at the mode, $\mathbf{\hat{f}}$, of $p(\mathbf{f}|\mathbf{y})$ to approximate $p(\mathbf{f}|\mathbf{y})$, as given below:

$p(\mathbf{f}|\mathbf{y}) \approx \hat{p}(\mathbf{f}|\mathbf{y})=N(\mathbf{\hat{f}}, H_{\mathbf{\hat{f}}}^{-1})$, where $H_{\mathbf{\hat{f}}}$ is the negative Hessian of $\ln(p(\mathbf{f}|\mathbf{y}))$ evaluated at $\mathbf{\hat{f}}$.

Therefore, the KL divergence, ${\mathrm{KL}}(Q\|P)$, is approximated by $\int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{\hat{p}(\mathbf{f}|\mathbf{y})}\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f}$.

By minimizing $\int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{\hat{p}(\mathbf{f}|\mathbf{y})}\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f}$ we get $q(\mathbf{f}|\mathbf{y})= \hat{p}(\mathbf{f}|\mathbf{y})$. In practice, we can use the Newton-Raphson optimizer or a quasi-Newton optimizer (e.g., limited-memory Broyden-Fletcher-Goldfarb-Shanno, LBFGS) to find $\mathbf{\hat{f}}$. Since the likelihood is the inverse-logit, the objective function, $\ln(p(\mathbf{f}|\mathbf{y}))$, is strictly concave, so in theory these optimizers find the same global solution $\mathbf{\hat{f}}$. We demonstrate below how to apply the Laplace method with the Newton-Raphson optimizer and the LBFGS optimizer in Shogun to the USPS data set.
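
Before turning to Shogun, the following minimal NumPy sketch illustrates the plain Newton-Raphson iteration for finding the mode $\mathbf{\hat{f}}$ under a zero-mean GP prior $N(0,K)$ and the inverse-logit likelihood (an illustration only, not the numerically stable formulation used inside Shogun; all names are ours).


In [ ]:
import numpy as np

def laplace_mode(K, y, n_iter=20):
    """Find the mode f_hat of p(f|y) for the Bernoulli-logit likelihood and a
    zero-mean GP prior N(0, K) via plain (non-stabilized) Newton-Raphson steps."""
    n = len(y)
    f = np.zeros(n)
    Kinv = np.linalg.inv(K)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))        # sigmoid(f)
        grad = (y + 1.0) / 2.0 - pi          # d/df ln p(y|f) for y in {-1, +1}
        W = pi * (1.0 - pi)                  # -d^2/df^2 ln p(y|f) (diagonal entries)
        # Newton step: f <- (K^{-1} + W)^{-1} (W f + grad)
        f = np.linalg.solve(Kinv + np.diag(W), W * f + grad)
    return f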


In [ ]:
inference_methods=[
                  SingleLaplacianInferenceMethodWithLBFGS,     #using LBFGS optimizer
                  SingleLaplacianInferenceMethod,              #using Newton-Raphson optimizer
                  ]
likelihood = LogitLikelihood()
#using default optimizer and default linesearch method
for inference in inference_methods:
    learning_example2(inference, likelihood, x_train, x_test, y_train, y_test)

Remark: In practice, the Laplace method is the fastest among all variational methods implemented in Shogun. However, the second-order Taylor approximation at the posterior mode is not always a good approximation, as can be observed in the figures for the sonar data set in the previous section.

Covariance Variational Method in GPC Models

This method is mentioned in the 2008 paper of Nickisch and Rasmussen, where it is called the KL method. The idea is to maximize $E_q[\ln(p(\mathbf{f}))] + E_q[\ln(p(\mathbf{y}|\mathbf{f}))] - E_q[\ln(q(\mathbf{f}|\mathbf{y}))]$ with respect to $\mu$ and $\Sigma$, reparametrizing $\Sigma$ to reduce the number of variables to be optimized. However, such a parametrization does not preserve convexity (concavity in this setting), according to the 2013 paper of Khan et al. Apart from this method, convexity (concavity) is preserved in the other variational methods implemented in Shogun. We demonstrate below how to apply this variational method in Shogun to the USPS example.


In [ ]:
likelihood = LogitVGLikelihood()
#using the default linesearch method
learning_example2(KLCovarianceInferenceMethod, likelihood, x_train, x_test, y_train, y_test)

Remark: This method may be the first variational method used in GPC models. However, in practice, it is the slowest method among all implemented variational methods in Shogun.

Mean-field Variational Method in GPC Models

This method is also called the factorial variational method in the 2008 paper of Nickisch and Rasmussen. The idea of the mean-field variational method is to enforce an artificial structure on the covariance matrix $\Sigma$ of $q(\mathbf{f}|\mathbf{y})$: in the mean-field variational method, $\Sigma$ is a diagonal positive-definite matrix. We demonstrate below how to apply this variational method in Shogun to the USPS example.
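
Before the Shogun call, here is a minimal, Shogun-independent sketch of the idea: a diagonal-covariance Gaussian is fitted by maximizing the lower bound from the earlier section with a generic L-BFGS optimizer and numerical gradients (the closed-form KL term plus Gauss-Hermite quadrature for the likelihood term). All names are illustrative, and this is far less efficient than Shogun's implementation.


In [ ]:
import numpy as np
from scipy.optimize import minimize

def meanfield_fit(K, y, degree=20):
    """Fit q(f|y) = N(mu, diag(s)) to the GPC posterior with a zero-mean prior N(0, K)
    and Bernoulli-logit likelihood by maximizing the variational lower bound."""
    n = len(y)
    nodes, weights = np.polynomial.hermite.hermgauss(degree)
    Kinv = np.linalg.inv(K)
    logdetK = np.linalg.slogdet(K)[1]

    def neg_bound(params):
        mu, log_s = params[:n], params[n:]
        s = np.exp(log_s)                    # diagonal of the covariance of q
        # KL( N(mu, diag(s)) || N(0, K) ) in closed form
        kl = 0.5 * ((np.diag(Kinv) * s).sum() + mu.dot(Kinv).dot(mu)
                    - n + logdetK - log_s.sum())
        # sum_i E_{q_i}[ln p(y_i|f_i)] via Gauss-Hermite quadrature
        f = mu[:, None] + np.sqrt(2.0 * s)[:, None] * nodes[None, :]
        e_loglik = (weights * -np.logaddexp(0.0, -y[:, None] * f)).sum() / np.sqrt(np.pi)
        return kl - e_loglik                 # negative lower bound

    x0 = np.zeros(2 * n)                     # start from mu = 0, s = 1
    res = minimize(neg_bound, x0, method="L-BFGS-B")
    return res.x[:n], np.exp(res.x[n:])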


In [ ]:
likelihood = LogitVGLikelihood()
#using the default linesearch method
learning_example2(KLApproxDiagonalInferenceMethod, likelihood, x_train, x_test, y_train, y_test)

Remark: In practice, this method can be as fast as the Laplace method in Shogun, but it ignores all off-diagonal elements of the covariance matrix. However, in some cases (e.g., when the data set is very noisy), such structural regularization may bring promising results compared to other variational methods, as can be observed for the sonar data set in the previous section.

Cholesky Variational Method in GPC Models

This method is proposed in the 2011 paper of Challis and Barber. The idea is to maximize $E_q[\ln(p(\mathbf{f}))] + E_q[\ln(p(\mathbf{y}|\mathbf{f}))] - E_q[\ln(q(\mathbf{f}|\mathbf{y}))]$ in terms of the Cholesky representation, $C$, of the covariance matrix of $q(\mathbf{f}|\mathbf{y})$, where $\Sigma=CC^T$ and $C$ is a lower triangular matrix. We demonstrate below how to apply this variational method in Shogun to the USPS data set.
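
The essential trick is that any lower-triangular $C$ with a positive diagonal yields a valid covariance $\Sigma=CC^T$, so the positive-definiteness constraint disappears and a generic unconstrained optimizer (as in the mean-field sketch above) can be used. A tiny illustrative helper (the name is ours; Shogun's internal parameterization may differ):


In [ ]:
import numpy as np

def chol_to_cov(params, n):
    """Unpack an unconstrained vector of length n*(n+1)/2 into Sigma = C C^T,
    where C is lower triangular with an exponentiated (hence positive) diagonal."""
    C = np.zeros((n, n))
    C[np.tril_indices(n)] = params                 # fill the lower triangle row by row
    C[np.diag_indices(n)] = np.exp(np.diag(C))     # make the diagonal positive
    return C.dot(C.T)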


In [ ]:
likelihood = LogitVGLikelihood()
#using the default linesearch method
learning_example2(KLCholeskyInferenceMethod, likelihood, x_train, x_test, y_train, y_test)

Remark: This method may learn the complete structure of the covariance matrix from data faster than the covariance variational method. The reason is that solving a linear system involving $\Sigma$ is required at each optimization step; in the Cholesky representation, the time complexity of each such step is reduced from O(n^3) to O(n^2) compared to the covariance variational method. However, the overall time complexity is still O(n^3).

Dual Variational Method in GPC Models

This method is proposed in the 2013 paper of Khan et al. The idea is to optimize $E_q[\ln(p(\mathbf{f}))] + E_q[\ln(p(\mathbf{y}|\mathbf{f}))] - E_q[\ln(q(\mathbf{f}|\mathbf{y}))]$ in its Lagrange dual form instead of the primal form used in the covariance variational method, by explicitly expressing the constraints on the covariance matrix, $\Sigma$, in terms of auxiliary variables. However, the time complexity of each optimization step is still O(n^3), and a variational bound is usually required to approximate $E_{q_\text{i}}[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]$ for GPC in this method. We demonstrate below how to apply this variational method in Shogun to the USPS data set.


In [ ]:
likelihood = LogitDVGLikelihood()
likelihood.set_strict_scale(0.1)
#using the default linesearch method
learning_example2(KLDualInferenceMethod, likelihood, x_train, x_test, y_train, y_test)

Remark: This method requires O(n) variables to be optimized, while the Cholesky method requires O(n^2) variables. This fact is important for large-scale cases. A promising observation in practice is that this method is faster than the covariance variational method and may not be slower than the Cholesky method.

Variational Bounds

Now, we discuss how to approximate $E_{q_\text{i}}[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]$. Since this term is a one-dimensional integral, for GPC we can use Gauss-Hermite quadrature to approximate it. In Shogun, the corresponding implementation of the Bernoulli-logistic likelihood for variational inference is called LogitVGLikelihood. This likelihood can be used with all variational methods except the dual variational method.

The piecewise variational bound proposed in the 2011 paper of Benjamin M. Marlin et al. is also implemented in Shogun, as LogitVGPiecewiseBoundLikelihood. This likelihood can also be used with all variational methods except the dual variational method.

For the dual variational method, we use the variational bound from the paper of D. M. Blei et al. in 2007. The corresponding implementation in Shogun is called LogitDVGLikelihood. Note that when using LogitDVGLikelihood, the bound is only enabled for the dual variational method; when other variational methods use this class, one-dimensional numerical integration is used instead of the variational bound. For a detailed discussion of variational bounds in GPC models, please refer to Khan's work.


In [ ]:
likelihood = LogitVGPiecewiseBoundLikelihood()
likelihood.set_default_variational_bound()
likelihood.set_noise_factor(1e-15)
inference_methods=[
                  KLCholeskyInferenceMethod,
                  KLApproxDiagonalInferenceMethod,
                  KLCovarianceInferenceMethod,
                  ]
for inference in inference_methods:
    #using the default linesearch method
    learning_example2(inference, likelihood, x_train, x_test, y_train, y_test)

Optimizing Hyper-parameters

We demonstrate how to optimize hyper-parameters in Shogun when applying the mean-field variational method to the USPS data set.


In [ ]:
def learning_example3(inference, likelihood, x_train, x_test, y_train, y_test):
    """
    This function is used to train a GP classifier given an inference method, with hyper-parameters optimized automatically
        
    Parameters:
        inference - an inference method used to train the classifier
        likelihood - likelihood function to model data
        x_train, x_test - X for training and testing
        y_train, y_test - Y for training and testing
        
    Returns:
        Nothing
    """
    error_eval = ErrorRateMeasure()
    mean_func = ZeroMean()
    kernel_log_sigma = 1.0
    kernel_sigma = 2*exp(2*kernel_log_sigma);
    kernel_func = GaussianKernel(10, kernel_sigma)

    #sample by 1
    labels_train = BinaryLabels(y_train)
    labels_test = BinaryLabels(y_test)
    #feature by sample
    features_train=RealFeatures(x_train)
    features_test=RealFeatures(x_test)

    kernel_log_scale = 1.0

    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood)
    print "\nusing %s"%inf.get_name()
    
    inf.set_scale(exp(kernel_log_scale))
    try:
            inf.set_lbfgs_parameters(100,80,3,80)
    except:
            pass

    gp = GaussianProcessClassification(inf)

    # evaluate our inference method for its derivatives
    grad = GradientEvaluation(gp, features_train, labels_train, GradientCriterion(), False)
    grad.set_function(inf)

    # handles all of the above structures in memory
    grad_search = GradientModelSelection(grad)

    # search for best parameters and store them
    best_combination = grad_search.select_model()

    # apply best parameters to GP
    best_combination.apply_to_machine(gp)

    # we have to "cast" objects to the specific kernel interface we used (soon to be easier)
    best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()
    best_scale=inf.get_scale()
    print "Selected kernel bandwidth:", best_width
    print "Selected kernel scale:", best_scale

    start = time.time()
    gp.train()
    end = time.time()
    print "cost %s seconds at training"%(end-start)
    nlz=inf.get_negative_log_marginal_likelihood()
    print "the negative_log_marginal_likelihood is %.4f"%nlz
    start = time.time()
    #classification on train_data
    pred_labels_train = gp.apply_binary(features_train)
    #classification on test_data
    pred_labels_test = gp.apply_binary(features_test)
    end = time.time()    
    print "cost %s seconds at prediction"%(end-start)
    
    error_train = error_eval.evaluate(pred_labels_train, labels_train)
    error_test = error_eval.evaluate(pred_labels_test, labels_test)
    
    print "Train error : %.2f Test error: %.2f\n" % (error_train, error_test);

In [ ]:
likelihood = LogitVGLikelihood()
likelihood.set_noise_factor(1e-15)
#using the default linesearch method
learning_example3(KLApproxDiagonalInferenceMethod, likelihood, x_train, x_test, y_train, y_test)

Reference

  • Rasmussen, Carl Edward. "Gaussian processes for machine learning." (2006).
  • Bishop, Christopher M. Pattern recognition and machine learning. Vol. 1. New York: Springer, 2006.
  • Nickisch, Hannes, and Carl Edward Rasmussen. "Approximations for binary Gaussian process classification." Journal of Machine Learning Research 9 (2008): 2035-2078.
  • Challis, Edward, and David Barber. "Concave Gaussian variational approximations for inference in large-scale Bayesian linear models." International conference on Artificial Intelligence and Statistics. 2011.
  • Emtiyaz Khan, Mohammad, et al. "Fast dual variational inference for non-conjugate latent Gaussian models." Proceedings of the 30th International Conference on Machine Learning. 2013.
  • Marlin, Benjamin M., Mohammad Emtiyaz Khan, and Kevin P. Murphy. "Piecewise Bounds for Estimating Bernoulli-Logistic Latent Gaussian Models." ICML. 2011.
  • Khan, Mohammad. "Variational learning for latent Gaussian model of discrete data." (2012).
  • Wang, Chong, and David M. Blei. "Variational inference in nonconjugate models." The Journal of Machine Learning Research 14.1 (2013): 1005-1031.
  • Paisley, John, David Blei, and Michael Jordan. "Variational Bayesian inference with stochastic search." ICML. 2012.

Soon to Come

  • Large-scale sparse variational inference in GPC models
  • Stochastic variational inference in GPC models

In [ ]: