Mixtures of Gaussian processes with GPclust

This notebook accompanies the paper

Nonparametric Clustering of Structured Time Series
James Hensman, Magnus Rattray and Neil D. Lawrence
IEEE TPAMI 2014

The code is available at https://github.com/jameshensman/gpclust. The GPclust module depends on GPy.

The hierarchical Gaussian process model was fleshed out in

Hierarchical Bayesian modelling of gene expression time series
across irregularly sampled replicates and clusters

James Hensman, Neil D. Lawrence and Magnus Rattray

http://www.biomedcentral.com/1471-2105/14/252

A simple implementation of hierarchical GPs is available as part of GPy. You may also be interested in the related notebook on hierarchical GPs.
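As a taster, here is a minimal sketch of a two-level hierarchical GP in GPy. It assumes GPy's Hierarchical kernel (the one used in that notebook), which takes the replicate index as an extra column of the input matrix; treat it as an illustration under those assumptions rather than a reference implementation.

import numpy as np
import GPy

# two noisy replicates of one underlying function, observed at the same times
t = np.linspace(0, 1, 20)[:, None]
Y = np.vstack([np.sin(6*t) + 0.1*np.random.randn(20, 1),
               np.sin(6*t) + 0.1*np.random.randn(20, 1)])

# the second column of X indexes the replicate that each row belongs to
X = np.vstack([np.hstack([t, np.zeros_like(t)]),
               np.hstack([t, np.ones_like(t)])])

# one kernel for the shared function, one for per-replicate deviations
k_upper = GPy.kern.Matern32(input_dim=1, variance=1.0, lengthscale=0.5, name='upper')
k_lower = GPy.kern.Matern32(input_dim=1, variance=0.1, lengthscale=0.5, name='lower')
m = GPy.models.GPRegression(X, Y, kernel=GPy.kern.Hierarchical([k_upper, k_lower]))
m.optimize()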


In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'  # 'svg' would be better, but eats memory for these big plots.
from matplotlib import pyplot as plt
import numpy as np
import GPy
import sys
sys.path.append('/home/james/work/gpclust/')  # the author's local checkout; adjust to where you cloned GPclust
import GPclust

A simple sinusoid dataset

Here's a simulated dataset that contains the simple features that we expect to have in real data sets: smooth processes (here, sinusoids) corrupted by further smooth processes (here, more sinusoids) as well as noise.


In [3]:
#generate a data set. Here's the sinusoid demo from the manuscript.
Nclust = 10
Nx = 12
Nobs = [np.random.randint(20,31) for i in range(Nclust)] #a random number of realisations in each cluster
X = np.random.rand(Nx,1)
X.sort(0)

#random frequency and phase for each cluster
base_freqs = 2*np.pi + 0.3*(np.random.rand(Nclust)-.5)
base_phases = 2*np.pi*np.random.rand(Nclust)
means = np.vstack([np.tile(np.sin(f*X+p).T,(Ni,1)) for f,p,Ni in zip(base_freqs,base_phases,Nobs)])

#add a lower frequency sinusoid for the noise
freqs = .4*np.pi + 0.01*(np.random.rand(means.shape[0])-.5)
phases = 2*np.pi*np.random.rand(means.shape[0])
offsets = 0.3*np.vstack([np.sin(f*X+p).T for f,p in zip(freqs,phases)])
Y = means + offsets + np.random.randn(*means.shape)*0.05

In the plot below, we show the underlying function for each cluster as a smooth red function, and the data associated with the cluster as thinly connected blue crosses.


In [4]:
#plotting. 
x_plot, xmin, xmax = GPy.plotting.matplot_dep.base_plots.x_frame1D(X)
plt.figure(figsize=(18,6))
index_starts = np.hstack([0, np.cumsum(Nobs[:-1])])
index_stops = np.cumsum(Nobs)
for n in range(Nclust):
    plt.subplot(2, Nclust//2, n+1)  # integer division: subplot counts must be ints under Python 3
    plt.plot(X, Y[index_starts[n]:index_stops[n]].T, 'b', marker='x',ms=4, mew=1, linewidth=0.2)
    plt.plot(x_plot, np.sin(base_freqs[n]*x_plot+base_phases[n]), 'r', linewidth=2)
    
GPy.plotting.matplot_dep.base_plots.align_subplots(2, Nclust//2, xlim=(xmin, xmax))


Constructing and optimizing a model

Now that we have generated a data set, it's straightforward to build and optimize a clustering model. First, we need to build two GPy kernels (covariance functions), which will be used to model the underlying function and the replication noise, respectively. We'll take a wild stab at the parameters of these covariances, and let the model optimize them for us later.

The two kernels model the underlying function of the cluster, and the deviations of each gene from that underlying function. If we believe that the only corruption of the data from the cluster mean is i.i.d. noise, we can specify a GPy.kern.White covariance. In practice, it's helpful to allow correlated noise. The model of any cluster of genes then has a hierarchical structure: the unknown cluster-specific mean is drawn from a GP, and each gene in that cluster is then drawn from a GP with said unknown mean function.
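To make that hierarchy concrete, here is a small sampling sketch (ours, purely for illustration; GPclust performs inference, not sampling). It draws one cluster mean from the underlying GP, then draws a gene from a GP centred on that mean:

import numpy as np
import GPy

X = np.linspace(0, 1, 12)[:, None]
k_underlying = GPy.kern.RBF(input_dim=1, variance=0.1, lengthscale=0.1)
k_corruption = GPy.kern.RBF(input_dim=1, variance=0.01, lengthscale=0.1)

# the unknown cluster-specific mean is a draw from the underlying GP (jitter for stability)
f_cluster = np.random.multivariate_normal(np.zeros(X.shape[0]), k_underlying.K(X) + 1e-8*np.eye(X.shape[0]))

# a gene in the cluster is a draw from a GP centred on that mean, plus i.i.d. noise
y_gene = np.random.multivariate_normal(f_cluster, k_corruption.K(X) + 1e-8*np.eye(X.shape[0]))
y_gene += 0.05*np.random.randn(X.shape[0])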

To optimize the model with the default optimization settings, we call m.optimize(). To invoke the recommended merge-split procedure, call m.systematic_splits(). Note that during the splitting procedure, many calls are made to the optimize function.


In [12]:
k_underlying = GPy.kern.RBF(input_dim=1, variance=0.1, lengthscale=0.1)
k_corruption = GPy.kern.RBF(input_dim=1, variance=0.01, lengthscale=0.1) + GPy.kern.White(1, variance=0.001)

m = GPclust.MOHGP(X, k_underlying, k_corruption, Y, K=10, prior_Z='DP', alpha=1.0)
m.optimize()
m.systematic_splits(verbose=False)


iteration 12 bound=75.926774702 grad=5.05137047842e-07, beta=0.18167151314 vb converged (ftol)
vb converged (gtol)

Plotting and examining the posterior

The model has quite extensive plotting built in, with various options for colour, display of the data as points or connected lines, etc. Here we find that the model manages to separate all but two of the true clusters. The number of 'genes' found in each cluster is labeled in the corner of each plot.


In [13]:
plt.figure(figsize=(14,9))
m.plot(on_subplots=True, colour=True, newfig=False)

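The plots summarise the variational posterior over cluster assignments, but it can also be examined numerically. Here is a short sketch, assuming the responsibilities are exposed as m.phi with one row per time-course and one column per cluster (as in the GPclust source):

# hard assignments and cluster occupancies from the responsibilities
labels = np.argmax(m.phi, axis=1)
print(np.bincount(labels, minlength=m.phi.shape[1]))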

Structure is important

Why do we have to specify two kernels in GPclust? The first kernel describes the properties of the functions which underlie each cluster. The second describes how each time-course (gene) deviates from its cluster's underlying function.

This structure is important: if we model the deviation of each time-course from the cluster as simply noise, it's more difficult to infer the correct clusters. Such a model can be constructed in GPclust by using a white (noise) kernel for the structure, as follows.


In [10]:
#exactly as above, but with a white-noise kernel for the structure.
k_underlying = GPy.kern.RBF(input_dim=1, variance=0.1, lengthscale=0.1)
k_corruption = GPy.kern.White(1, variance=0.1)

m = GPclust.MOHGP(X, k_underlying, k_corruption, Y, K=10, prior_Z='DP', alpha=1.0)
m.optimize()
m.systematic_splits(verbose=False)


iteration 23 bound=-748.166941963 grad=6.96257878533e-07, beta=0.0793985614504 vb converged (ftol)
vb converged (gtol)

In [11]:
plt.figure(figsize=(14,9))
m.plot(on_subplots=True, colour=True, newfig=False)


Here we can see that the procedure finds too many clusters (we know the ground truth contains 10). Without the ability to model deviations from the cluster mean in a structured fashion, that structure shows up as additional clusters.

This point is exaggerated a little by the toy data that we have generated, but the same issue exists in real data. For more detail and further consideration of cluster structure, see the subsequent notebook on clustering Drosophila development.