This notebook accompanies the paper
Nonparameteric Clustering of Structured Time Series
James Hensman, Magnus Rattray and Neil D. Lawrence
IEEE TPAMI 2014
The code is available at https://github.com/jameshensman/gpclust . The GPclust module depends on GPy.
The hierachical Gaussian process model was fleshed out in
Hierarchical Bayesian modelling of gene expression time series
across irregularly sampled replicates and clusters
James Hensman, Neil D. Lawrence and Magnus Rattray
http://www.biomedcentral.com/1471-2105/14/252
A simple implementation of hierarchical GPs is available as part of GPy. You may also be interested in the related notebook on hierarchical GPs.
In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'#'svg' would be better, but eats memory for these big plots.
from matplotlib import pyplot as plt
import numpy as np
import GPy
import sys
sys.path.append('/home/james/work/gpclust/')
import GPclust
In [3]:
#generate a data set. Here's the sinusoid demo from the manuscript.
Nclust = 10
Nx = 12
Nobs = [np.random.randint(20,31) for i in range(Nclust)] #a random number of realisations in each cluster
X = np.random.rand(Nx,1)
X.sort(0)
#random frequency and phase for each cluster
base_freqs = 2*np.pi + 0.3*(np.random.rand(Nclust)-.5)
base_phases = 2*np.pi*np.random.rand(Nclust)
means = np.vstack([np.tile(np.sin(f*X+p).T,(Ni,1)) for f,p,Ni in zip(base_freqs,base_phases,Nobs)])
#add a lower frequency sinusoid for the noise
freqs = .4*np.pi + 0.01*(np.random.rand(means.shape[0])-.5)
phases = 2*np.pi*np.random.rand(means.shape[0])
offsets = 0.3*np.vstack([np.sin(f*X+p).T for f,p in zip(freqs,phases)])
Y = means + offsets + np.random.randn(*means.shape)*0.05
In the plot below, we show the underlying function for each cluster as a smooth red function, and the data associated with the cluster as thinly connected blue crosses.
In [4]:
#plotting.
x_plot, xmin, xmax = GPy.plotting.matplot_dep.base_plots.x_frame1D(X)
plt.figure(figsize=(18,6))
index_starts = np.hstack([0, np.cumsum(Nobs[:-1])])
index_stops = np.cumsum(Nobs)
for n in range(Nclust):
plt.subplot(2,Nclust/2, n+1)
plt.plot(X, Y[index_starts[n]:index_stops[n]].T, 'b', marker='x',ms=4, mew=1, linewidth=0.2)
plt.plot(x_plot, np.sin(base_freqs[n]*x_plot+base_phases[n]), 'r', linewidth=2)
GPy.plotting.matplot_dep.base_plots.align_subplots(2, Nclust/2, xlim=(xmin, xmax))
Now that we have generated a data set, it's straightforward to build and optimize a clustering model. First, we need to build two GPy kernels (covariance functions), which will be used to model the underlying function and the replication noise, respecively. We'll take a wild stab at the parameters of these covariances, and let the model optimize them for us later.
The two kernels model the underlying function of the cluster, and the deviations of each gene from that underlying function. If we believe that the only corruption of the data from the cluster mean is i.i.d. noise, we can specify a GPy.kern.White
covariance. In practise, it's helpful to allow correlated noise. The model of any cluster of genes then has a hierarchical structure, with the unknown cluster-specific mean drawn from a GP, and then each gene in that cluster being drawn from a GP with said unknown mean function.
To optimize the model with the default optimization settings, we call m.optimize(). To invoke the recommended merge-split procedure, call m.systematic_splits(). Note that during the splitting procedure, many calls are made to the optimize function.
In [12]:
k_underlying = GPy.kern.RBF(input_dim=1, variance=0.1, lengthscale=0.1)
k_corruption = GPy.kern.RBF(input_dim=1, variance=0.01, lengthscale=0.1) + GPy.kern.White(1, variance=0.001)
m = GPclust.MOHGP(X, k_underlying, k_corruption, Y, K=10, prior_Z='DP', alpha=1.0)
m.optimize()
m.systematic_splits(verbose=False)
The model has quite extensive plotting built in, with various options for colour, display of the data as points or connected lines, etc. Here we find that the model manages to separate all but two of the true clusters. The number of 'genes' found in each cluster is labeled in the corner of each plot.
In [13]:
plt.figure(figsize=(14,9))
m.plot(on_subplots=True, colour=True, newfig=False)
Why do we have to specify two kernels in GPclust? The first kernel describes the properties of the functions which underly each cluster. The second describes the properties of the functions which describe how each time-course (gene) deviates from the cluster.
This structure is important: if we model the deviation of each time-course from the cluster as simply noise, it's more difficult to infer the correct clusters. Such a model can be constructed in GPclust by using a white (noise) kernel for the structure, as follows.
In [10]:
#exactly as above, but with a white-noise kernel for the structure.
k_underlying = GPy.kern.RBF(input_dim=1, variance=0.1, lengthscale=0.1)
k_corruption = GPy.kern.White(1, variance=0.1)
m = GPclust.MOHGP(X, k_underlying, k_corruption, Y, K=10, prior_Z='DP', alpha=1.0)
m.optimize()
m.systematic_splits(verbose=False)
In [11]:
plt.figure(figsize=(14,9))
m.plot(on_subplots=True, colour=True, newfig=False)
Here we can see that the procedure finds too many clusters (we know that the ground truth is that there are 10). This is because without the ability to model the deviation from the cluster mean in a structured fashion, that structure appears as additional clusters.
This point is exaggerated a little by the toy data that we have generated, but the same issue exists in real data. For more details and consideration of the cluster structure, see the subsequent notebook on clustering Drosophila development.