In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'#'svg' would be better, but eats memory for these big plots.
from matplotlib import pyplot as plt
import numpy as np
import GPy
import sys
sys.path.append('/home/james/work/gpclust/')
import GPclust

Clustering Drosophila development with GPclust

Here we present a clustering of the data in

Gene expression divergence recapitulates the developmental hourglass model
Alex T. Kalinka, Karolina M. Varga, Dave T. Gerrard, Stephan Preibisch, David L. Corcoran, Julia Jarrells, Uwe Ohler, Casey M. Bergman and Pavel Tomancak
Nature 468, 811–814 (09 December 2010)

The data are available at http://publications.mpi-cbg.de/publications-sites/4240/Supplemental_Information_Summary.html#Raw_data. These are a time series experiment across several species, with several replicates. We have downloaded these data, and filtered for only the Melanogaster rows. This results in 57 measurements of 3590 genes, spread (unevenly) across 8 replicates.


In [2]:
import urllib
urllib.urlretrieve('http://staffwww.dcs.sheffield.ac.uk/people/J.Hensman/data/kalinka09_mel.csv', 'kalinka_data.csv')
urllib.urlretrieve('http://staffwww.dcs.sheffield.ac.uk/people/J.Hensman/data/kalinka09_mel_pdata.csv', 'kalinka_pdata.csv')
expression = np.loadtxt('kalinka_data.csv', delimiter=',', usecols=range(1, 57))
gene_names = np.loadtxt('kalinka_data.csv', delimiter=',', usecols=[0], dtype=np.str)
replicates, times = np.loadtxt('kalinka_pdata.csv', delimiter=',').T

#normalize data row-wise
expression -= expression.mean(1)[:,np.newaxis]
expression /= expression.std(1)[:,np.newaxis]

3590 genes is a few too many to work with for the sake of this demo, so we'll filter down to 500 genes that appear to have a good signal-to-noise ratio. To do this, we'll rank the genes by the ratio of the mean replicate-wise variance to the variance of the replicate-wise means (for replicates points that have at least three measurements).


In [3]:
#compute the scores as desribed
means, stds = [], []
for t in np.unique(times):
    index = times == t
    if len(index)<3: continue
    e = expression[:,index]
    means.append(e.mean(1))
    stds.append(e.std(1))
means, stds = np.vstack(means).T, np.vstack(stds).T
scores = means.std(1) / stds.mean(1)

#ranks the genes and cut off the top 500
index = np.argsort(scores)[::-1]
expression = expression[index[:500]]
gene_names = gene_names[index[:500]]

Now we can build a model. The first thing to do is to select covariance functions to describe 1) the underlying process in each cluster, and 2) the process which corrupts each time-course (here associated with individual genes) away from the cluster center.

For both of these, we'll use a Matern52 kernel. For each, we must specify:

  • The number of input dimensions.
    • Here, we're just clustering one dimensional functions (of time), so input_dim=1
  • The signal variance
    • An initial guess at the amount of variance attributed to this kernel. Since we've normalized the data (above) to have unit variance, we'll set the variance of the underlying function to be 1. The variance which corrupts the measurements away from the center of the cluster will (hopefully) be smaller than the variance of the cluster, so we rather arbitrarily pick 0.1
  • The lengthscale
    • The lengthscale describes how rapidly the function changes as a function of the input (here, time). Our measurements are spread across 24 hours, and as a ballpark we pick a lengthscale of half the signal length, for both kernels.

Additinoally, we allow the corrupting kernel to include some white noise, with small variance.

The parameters of the kernels will be optimized against the bound on the marginal likelihood of the model, every hyperparam_opt_interval iterations. For large problems, we recommend setting this value to be quite big, so that the model has a chance to find a reasonable clustering before attempting to fit the GP parameters. Fitting the GP parameters too early can lead to pathologies.

Finally, we can call systematic_splits on the model in the same way as for the mixture of Gaussians.


In [4]:
k_underlying = GPy.kern.Matern52(input_dim=1, variance=1.0, lengthscale=12.)
k_corruption = GPy.kern.Matern52(input_dim=1, variance=0.1, lengthscale=12.) + GPy.kern.White(1, variance=0.05)

m = GPclust.MOHGP(times.reshape(-1,1), k_underlying, k_corruption, expression, K=20, prior_Z='DP', alpha=1.0)
m.hyperparam_opt_interval = 1000 # how often to optimize the hyperparameters

m.hyperparam_opt_args['messages'] = False # turn off the printing of the optimization
m.optimize()
m.systematic_splits(verbose=False)


iteration 39 bound=-90.6533748186 grad=3.36570566717e-07, beta=0.301297993742 vb converged (ftol)
vb converged (gtol)
iteration 128 bound=6565.727327 grad=5.83813710117e-07, beta=0.616222135238 vb converged (gtol)
iteration 143 bound=6566.16912151 grad=9.10176927213e-07, beta=0.686378598813 vb converged (gtol)

Now that the optimization has converged, we can explore the posterior as for the mixture of Gaussians demo. First thing is to plot the clusters. See m.plot? for help on the arguments to the plotting function.


In [5]:
m.reorder()# plot the biggest clusters at the top
plt.figure(figsize=(18,18))
m.plot(on_subplots=True, colour=True, in_a_row=False, newfig=False, min_in_cluster=1, joined=False, ylim=(-2,2))


We can also inspect the optimized values of the kernel parameters


In [7]:
print m


Name                 : MOHGP
Log-likelihood       : 6830.17881231
Number of Parameters : 5
Parameters:
  MOHGP.                 |       Value       |  Constraint  |  Prior  |  Tied to
  Mat52.variance         |    2.15976471178  |     +ve      |         |         
  Mat52.lengthscale      |     5.5162478493  |     +ve      |         |         
  add.Mat52.variance     |  0.0346778691282  |     +ve      |         |         
  add.Mat52.lengthscale  |    2.05324747389  |     +ve      |         |         
  add.white.variance     |  0.0215093090535  |     +ve      |         |         

In [6]: