In this notebook we design a covariance function for reconstructing transcription factor activities given gene expression profiles and a connectivity matrix (binding data) between genes and transcription factors. Our modelling framework builds on ideas in Sanguinetti et al (2006), who used a linear-Gaussian state-space modelling framework to infer the transcription factor activity of a group of genes.
We note that the linear Gaussian model is equivalent to a Gaussian process with a particular covariance function. We therefore build a model directly from the Gaussian process perspective to achieve the same effect. We introduce a computational trick, based on judicious application of the singular value decomposition, to enable us to efficiently fit the Gaussian process in a reduced 'TF activity' space.
In [1]:
import numpy as np
import pylab as pb
import GPy
import scipy.io
import matplotlib.pyplot as plt
%matplotlib inline
First we load in the classic Spellman et al (1998) Yeast Cell Cycle data set. The cdc15 time series data has 23 time points. We can load this gene expression data in with GPy.
In [2]:
data = GPy.util.datasets.spellman_yeast_cdc15()
Y = data['Y'].fillna(0) # Replace missing values with zero following Sanguinetti et al.
t = data['t']
print(data['info'], data['details'])
More details on the data are given in this notebook.
We can make a simple helper function to plot genes from the data set (which is provided as a pandas DataFrame).
In [3]:
def plot_gene(gene_name='YER124C'):
    plt.plot(data['t'], data['Y'][gene_name], 'rx')
    plt.title('Gene: ' + gene_name)
    plt.xlabel('time/minutes')
plot_gene('YER124C')
Our second data set is from ChIP-chip experiments performed on yeast by Lee et al (2002). These give us the binding information between transcription factors and genes. In this notebook we are going to try to combine this binding information with the gene expression information to infer transcription factor activities.
In [4]:
data = GPy.util.datasets.lee_yeast_ChIP()
# set S to find relationships where p-value is less than 1e-3
S = data['Y'].T<1e-3
More details on the data are given in this notebook.
The two data sets have slightly different overlapping sets of genes. Fortunately, with pandas it is easy to do the matching.
In [5]:
intersect = list(set(S.columns) & set(Y.columns))
# this is the new value for n
print('There are n=', len(intersect), 'overlapping genes.')
# Now reset S and Y to have the overlap of genes
Y = Y[intersect].T # now an n by T matrix
S = S[intersect].T # now an n by q matrix
We are working with log expression levels in a matrix $\mathbf{Y} \in \Re^{n\times T}$ and we will assume a linear (additive) model giving the relationship between the expression level of each gene and the corresponding transcription factor activities, which are unobserved but which we represent by a matrix $\mathbf{F} \in \Re^{q\times T}$. Our basic assumptions are as follows. The transcription factor activities are time series, so they are likely to be temporally smooth. Further, we assume that the transcription factors are potentially correlated with one another (to account for transcription factors that operate in unison).
If there are $q$ transcription factors then correlation between different transcription factors is encoded in a covariance matrix, $\boldsymbol{\Sigma}$ which is $q\times q$ in dimensionality.
Further we assume that the log of the transcription factors' activities is temporally smooth, and drawn from an underlying Gaussian process with covariance $\mathbf{K}_t$.
We assume that the joint process across all $q$ transcription factor activities and across all time points is well represented by an intrinsic model of coregionalization where the covariance is given by the Kronecker product of these terms. $$\mathbf{K}_f = \mathbf{K}_t \otimes \boldsymbol{\Sigma}$$
This is known as an intrinsic coregionalization model Wackernagel, (2003). See Alvarez et al (2012) for a machine learning orientated review of these methods. The matrix $\boldsymbol{\Sigma}$ is known as the coregionalization matrix.
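As a small illustration of this structure (not part of the original analysis), the Kronecker covariance can be assembled directly with numpy; the sizes, lengthscale and coregionalization matrix below are made up purely for the example.
import numpy as np
import GPy
# Illustrative sizes only: q transcription factors observed at T time points.
q_demo, T_demo = 3, 5
t_demo = np.linspace(0., 100., T_demo)[:, None]
# Temporal covariance K_t from a smooth RBF kernel over time (lengthscale is a placeholder).
K_t_demo = GPy.kern.RBF(1, lengthscale=50.).K(t_demo)
# A made-up positive definite coregionalization matrix Sigma.
A_demo = np.random.randn(q_demo, q_demo)
Sigma_demo = A_demo.dot(A_demo.T) + np.eye(q_demo)
# Intrinsic coregionalization model covariance: K_f = K_t kron Sigma.
K_f_demo = np.kron(K_t_demo, Sigma_demo)
print(K_f_demo.shape)  # (T_demo*q_demo, T_demo*q_demo) = (15, 15)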
We now assume that the $j$th gene's expression is given by the product of the transcription factors that bind to that gene. Because we are working in log space, that implies a log-linear relationship. At the $i$th time point, the log of the $j$th gene's expression, $\mathbf{y}_{i,j}$, is linearly related to the log of the transcription factor activities at the corresponding time point, $\mathbf{f}_{i, :}$. This relationship is given by the binding information from $\mathbf{S}$. We then assume that there is some corrupting Gaussian noise to give us the final observation, $$\mathbf{y}_{i, :} = \mathbf{S}\mathbf{f}_{i, :} + \boldsymbol{\epsilon}_{i, :},$$
where the Gaussian noise is sampled from $$\boldsymbol{\epsilon}_{i, :} \sim \mathcal{N}\left(\mathbf{0}, \sigma^2\mathbf{I}\right).$$
We consider a vector operator which takes all the separate time series in $\mathbf{Y}$ and stacks them to form a new vector $\mathbf{y}$ of length $n\times T$. A similar operation is applied to form a vector $\mathbf{f}$ of length $q \times T$. Using Kronecker products we can now represent the relationship between $\mathbf{y}$ and $\mathbf{f}$ as follows: $$\mathbf{y} = \left[\mathbf{I} \otimes \mathbf{S}\right]\mathbf{f} + \boldsymbol{\epsilon}.$$ Marginalizing the transcription factor activities then gives a zero-mean Gaussian process over $\mathbf{y}$, $$\mathbf{y} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{K}\right),$$
where $$\mathbf{K} = \mathbf{K}_t \otimes \mathbf{S} \boldsymbol{\Sigma} \mathbf{S}^\top + \sigma^2 \mathbf{I}.$$ This results in a covariance matrix of size $nT \times nT$, where $n$ is the number of genes and $T$ is the number of time points. However, we can get a drastic reduction in the size of the covariance by considering the singular value decomposition of $\mathbf{S}$.
The matrix $\mathbf{S}$ is an $n\times q$ matrix, where $q$ is the number of transcription factors. It contains a 1 if a given transcription factor binds to a given gene, and zero otherwise.
In the worst case, because the vector $\mathbf{y}$ contains $T\times n$ points ($T$ time points for each of $n$ genes) we are faced with $O(T^3n^3)$ computational complexity. We are going to use a rotation trick to help.
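Before introducing the rotation, it is worth seeing what the naive route looks like. The sketch below (purely illustrative, and never run on the real data precisely because the resulting matrix is unmanageable) forms $\mathbf{K}$ explicitly with numpy; `K_t`, `S`, `Sigma` and `sigma2` stand for the quantities defined above.
import numpy as np
def full_covariance(K_t, S, Sigma, sigma2):
    # Naive construction of K = K_t kron (S Sigma S^T) + sigma2 * I.
    # K_t is T x T, S is n x q and Sigma is q x q, so the result is nT x nT,
    # which is exactly why we avoid forming it for the real data.
    SSigmaSt = np.dot(np.dot(S, Sigma), S.T)  # n x n matrix with rank at most q
    K = np.kron(K_t, SSigmaSt)                # nT x nT Kronecker product
    K += sigma2 * np.eye(K.shape[0])          # add the noise variance on the diagonal
    return K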
For any multivariate Gaussian you can rotate the data set and compute a new rotated covariance which is valid for the rotated data set. Mathematically this works by first inserting $\mathbf{R}\mathbf{R}^\top$ into the log likelihood at three points as follows: $$L = -\frac{1}{2} \log \left|\mathbf{K}\mathbf{R}\mathbf{R}^\top\right| - \frac{1}{2} \mathbf{y}^\top\mathbf{R}\mathbf{R}^\top \mathbf{K}^{-1}\mathbf{R}\mathbf{R}^\top\mathbf{y} + \text{const.}$$
The rules of determinants and a transformation of the data allow us to rewrite the likelihood as $$L = -\frac{1}{2} \log \left|\mathbf{R}^\top\mathbf{K}\mathbf{R}\right| - \frac{1}{2} \hat{\mathbf{y}}^\top\left[\mathbf{R}^\top \mathbf{K}\mathbf{R}\right]^{-1}\hat{\mathbf{y}} + \text{const,}$$
where we have introduced the rotated data $\hat{\mathbf{y}}=\mathbf{R}^\top \mathbf{y}$. Geometrically, what this says is that if we want to maintain the same likelihood, then when we rotate our data set by $\mathbf{R}$ we need to rotate either side of the covariance matrix by $\mathbf{R}$, which makes perfect sense when we recall the properties of the multivariate Gaussian.
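This invariance is easy to verify numerically on a small made-up Gaussian (a quick sanity check, not part of the original analysis): rotating the data by $\mathbf{R}^\top$ and the covariance by $\mathbf{R}^\top \cdot \mathbf{R}$ leaves the log likelihood unchanged.
import numpy as np
from scipy.stats import multivariate_normal
rng = np.random.RandomState(0)
d = 5
A = rng.randn(d, d)
K_check = A.dot(A.T) + d * np.eye(d)        # a made-up covariance matrix
y_check = rng.randn(d)                      # a made-up data vector
R_check, _ = np.linalg.qr(rng.randn(d, d))  # a random orthogonal matrix standing in for R
log_lik = multivariate_normal.logpdf(y_check, mean=np.zeros(d), cov=K_check)
log_lik_rot = multivariate_normal.logpdf(R_check.T.dot(y_check), mean=np.zeros(d),
                                         cov=R_check.T.dot(K_check).dot(R_check))
print(np.allclose(log_lik, log_lik_rot))    # True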
In this notebook we are using a particular structure of covariance which involves a Kronecker product. The rotation we consider will be a Kronecker rotation (see Stegle et al, 2011). We are going to take advantage of the fact that the matrix $\mathbf{S}$ is not square, meaning that $\mathbf{S}\boldsymbol{\Sigma}\mathbf{S}^\top$ is not full rank (it has rank at most $q$, but is of size $n\times n$, and we expect the number of transcription factors $q$ to be less than the number of genes $n$).
When ranks are involved, it is always a good idea to look at singular value decompositions (SVDs). The SVD of $\mathbf{S}$ is given by: $$\mathbf{S} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{V}^\top$$ where $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$, $\boldsymbol{\Lambda}$ is a diagonal matrix of positive values, $\mathbf{Q}$ is a matrix of size $n\times q$: it matches the dimensionality of $\mathbf{S}$, but we have $\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}$. Note that because it is not square, $\mathbf{Q}$ is not in itself a rotation matrix. However it could be seen as the first $q$ columns of an $n$ dimensional rotation matrix (assuming $n$ is larger than $q$, i.e. there are more genes than transcription factors).
If we call the $n-q$ missing columns of this rotation matrix $\mathbf{U}$, then we have a valid rotation matrix $\mathbf{R}=\begin{bmatrix} \mathbf{Q}& \mathbf{U}\end{bmatrix}$. This rotation matrix, however, only rotates across the $n$ dimensions of the genes, not the additional dimensions across time; in other words, we are choosing $\mathbf{K}_t$ to be unrotated. To represent this properly for our covariance we need to set $\mathbf{R} = \mathbf{I} \otimes \begin{bmatrix} \mathbf{Q}& \mathbf{U}\end{bmatrix}$. This gives us a structure that, when applied to a covariance of the form $\mathbf{K}_t\otimes \mathbf{K}_n$, rotates $\mathbf{K}_n$ whilst leaving $\mathbf{K}_t$ untouched.
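This claim follows from the mixed-product property of Kronecker products, and can be checked numerically with made-up matrices (a sketch only; none of these quantities are the real data).
import numpy as np
rng = np.random.RandomState(2)
T_demo, n_demo = 4, 6
B = rng.randn(T_demo, T_demo)
K_t_demo = B.dot(B.T) + np.eye(T_demo)           # made-up temporal covariance
C = rng.randn(n_demo, n_demo)
K_n_demo = C.dot(C.T) + np.eye(n_demo)           # made-up gene-space covariance
QU, _ = np.linalg.qr(rng.randn(n_demo, n_demo))  # orthogonal matrix standing in for [Q U]
R_demo = np.kron(np.eye(T_demo), QU)             # R = I kron [Q U]
lhs = R_demo.T.dot(np.kron(K_t_demo, K_n_demo)).dot(R_demo)
rhs = np.kron(K_t_demo, QU.T.dot(K_n_demo).dot(QU))
print(np.allclose(lhs, rhs))                     # True: K_t is untouched, K_n is rotated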
When we apply this rotation matrix to $\mathbf{K}$ we have to consider two terms, the rotation of $\mathbf{K}_t \otimes \mathbf{S}\boldsymbol{\Sigma}\mathbf{S}^\top$, and the rotation of $\sigma^2 \mathbf{I}$.
Rotating the latter is easy, because it is just the identity multiplied by a scalar so it remains unchanged $$\mathbf{R}^\top\mathbf{I}\sigma^2 \mathbf{R}= \mathbf{I}\sigma^2$$ The former is slightly more involved, for that term we have $$\left[\mathbf{I}\otimes \begin{bmatrix}\mathbf{Q} & \mathbf{U}\end{bmatrix}^\top \right]\mathbf{K}_t \otimes \mathbf{S}\boldsymbol{\Sigma}\mathbf{S}^\top\left[ \mathbf{I} \otimes \begin{bmatrix}\mathbf{Q} & \mathbf{U}\end{bmatrix}\right]=\mathbf{K}_t \otimes \begin{bmatrix}\mathbf{Q} & \mathbf{U}\end{bmatrix}^\top \mathbf{S} \boldsymbol{\Sigma}\mathbf{S}^\top \begin{bmatrix}\mathbf{Q} & \mathbf{U}\end{bmatrix}.$$
Since $\mathbf{S} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{V}^\top$ we have $$\begin{bmatrix}\mathbf{Q} & \mathbf{U}\end{bmatrix}^\top \mathbf{S}\boldsymbol{\Sigma}\mathbf{S}^\top\begin{bmatrix}\mathbf{Q} & \mathbf{U}\end{bmatrix} = \begin{bmatrix}\boldsymbol{\Lambda} \mathbf{V}^\top \boldsymbol{\Sigma}\mathbf{V} \boldsymbol{\Lambda} &\mathbf{0} \\ \mathbf{0} & \mathbf{0}\end{bmatrix}.$$
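The block structure can also be checked numerically with a small made-up binding matrix (again just a sketch, independent of the real $\mathbf{S}$):
import numpy as np
rng = np.random.RandomState(1)
n_demo, q_demo = 8, 3
S_demo = (rng.rand(n_demo, q_demo) < 0.4).astype(float)  # made-up binary binding matrix
A_demo = rng.randn(q_demo, q_demo)
Sigma_demo = A_demo.dot(A_demo.T) + np.eye(q_demo)       # made-up coregionalization matrix
# Full SVD: the n x n orthogonal factor is [Q U]; np.linalg.svd returns V transposed.
QU_demo, Lambda_demo, Vt_demo = np.linalg.svd(S_demo, full_matrices=True)
rotated = QU_demo.T.dot(S_demo.dot(Sigma_demo).dot(S_demo.T)).dot(QU_demo)
top_left = np.diag(Lambda_demo).dot(Vt_demo).dot(Sigma_demo).dot(Vt_demo.T).dot(np.diag(Lambda_demo))
print(np.allclose(rotated[:q_demo, :q_demo], top_left))  # True
print(np.allclose(rotated[q_demo:, :], 0.0))             # remaining blocks are zero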
This prompts us to split our vector $\hat{\mathbf{y}}$ into a $qT$-length vector $\hat{\mathbf{y}}_q = \left[\mathbf{I}\otimes\mathbf{Q}\right]^\top \mathbf{y}$ and an $(n-q)T$-length vector $\hat{\mathbf{y}}_u = \left[\mathbf{I}\otimes\mathbf{U}\right]^\top \mathbf{y}$. The Gaussian likelihood then splits into two independent terms, $$L = L_q + L_u,$$
where $$L_q = -\frac{1}{2} \log \left|\mathbf{K}_t \otimes \boldsymbol{\Lambda}\mathbf{V}^\top\boldsymbol{\Sigma}\mathbf{V}\boldsymbol{\Lambda} + \sigma^2\mathbf{I}\right| - \frac{1}{2} \hat{\mathbf{y}}_q^\top\left[\mathbf{K}_t \otimes \boldsymbol{\Lambda}\mathbf{V}^\top\boldsymbol{\Sigma}\mathbf{V}\boldsymbol{\Lambda} + \sigma^2\mathbf{I}\right]^{-1}\hat{\mathbf{y}}_q + \text{const}$$
and $$L_u = -\frac{(n-q)T}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \hat{\mathbf{y}}_u^\top\hat{\mathbf{y}}_u + \text{const.}$$
Strictly speaking we should fit these models jointly, but for the purposes of illustration we will use a simpler two-step procedure. First, we fit the noise variance $\sigma^2$ on $\hat{\mathbf{y}}_u$ alone using $L_u$. Once this is done, we fix the value of $\sigma^2$ in $L_q$ and optimize with respect to the other parameters.
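Given the form of $L_u$ above, maximizing with respect to $\sigma^2$ has a closed-form solution, $$\sigma^2 = \frac{\hat{\mathbf{y}}_u^\top \hat{\mathbf{y}}_u}{(n-q)T},$$ which is the quantity computed in the code below after projecting the data onto $\mathbf{U}$.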
In [6]:
# step 1, find the SVD of S.
n, q = S.shape
T = Y.shape[1]
R, Lambda, V = np.linalg.svd(S, full_matrices=True) # R is the full n x n orthogonal factor [Q U]; note np.linalg.svd returns V transposed
# Extract first q columns for Q
Q = R[:, :q]
# remaining columns for U
U = R[:, q:]
In [7]:
Q.shape
#U.shape
#Y.shape
In [8]:
# Find sigma2 by looking at variance of y_u
Y_u = np.dot(U.T, Y)
sigma2 = 1./(T*(n-q))*(Y_u*Y_u).sum()
print('sigma2 found as', sigma2)
In [10]:
# Prepare the data for processing in GPy
Y_q = np.dot(Q.T, Y) # project data onto the column space of S
# Generate the input associated with each Y, the TF and the time point.
x0, x1 = np.asarray(np.meshgrid(t.flatten(),np.arange(q)))
X = np.hstack([x0.flatten()[:, None], x1.flatten()[:, None]])
y = Y_q.flatten()[:, None]
In [50]:
kern = GPy.kern.RBF(1, active_dims=[0])*GPy.kern.Coregionalize(1,q,rank=5, active_dims=[1])
m = GPy.models.GPRegression(X, y, kern)
m.mul.rbf.lengthscale = 50
m.Gaussian_noise.variance = sigma2
#m.Gaussian_noise.variance.constrain_fixed()
#m.optimize(messages=True)
In [51]:
print(m)
In [52]:
m.optimize(messages=True)
#m.plot(fixed_inputs=[(1, 1)]) # this would plot ACE2.
In [53]:
print(m)
With the current design the model is switching off the temporal correlation. The next step in the analysis will be to reimplement the same model as described by Sanguinetti et al (2006) and recover their results. That will involve using an Ornstein-Uhlenbeck covariance and joint maximisation of the likelihoods $L_u$ and $L_q$.
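As a rough sketch of that next step (not implemented or validated here, and using placeholder parameter values), the temporal part of the kernel could be swapped for GPy's Exponential covariance, which corresponds to the Ornstein-Uhlenbeck process; the joint maximisation of $L_u$ and $L_q$ would still need to be added on top of this.
# Sketch only: replace the RBF temporal kernel with an exponential (OU) covariance.
kern_ou = GPy.kern.Exponential(1, active_dims=[0])*GPy.kern.Coregionalize(1, q, rank=5, active_dims=[1])
m_ou = GPy.models.GPRegression(X, y, kern_ou)
m_ou.Gaussian_noise.variance = sigma2
m_ou.Gaussian_noise.variance.constrain_fixed()  # fix the noise variance estimated from y_u
m_ou.optimize(messages=True)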
This work is funded by a Commonwealth Scholarship to Aif Rahman and by the EU FP7 RADIANT project.