Crystal

Crystal is a framework for writing functions that test regions for differential methylation. Or, practically, it is a DMR (differentially methylated region) caller. The challenge of DMR calling lies in handling correlated data as we see in DNA methylation.

Crystal works by generating clusters using aclust and then testing them against a chosen method. New methods can be implemented easily, but currently, there is:

  1. GEE: generalized estimating equation using sample as the grouping variable
  2. Mixed Effect Model: with a random intercept by sample
  3. Combining P-values: test each site in a region and combine the resulting p-values using the Stouffer-Liptak or Z-score method
  4. Bump(hunt)ing: A modification of bumphunting that works on local clusters and allows arbitrary metrics (not just the sum of the smoothed coefficients).

Methods using Robust regression are also available for the above.

Note that these are cleanly implemented in python thanks to the excellent statsmodels package

Clustering

Crystal first clusters DNA methylation by grouping correlated probes using streaming agglomerative clustering. This idea was first implemented in a different form by Sofer et al here. The advantage of clustering before modeling are:

  1. it allows us to perform a fixed number of tests
  2. different testing methods can be performed on the same clusters
  3. the clustering is done without knowing the study design, so we can test the same clusters with different studies
  4. since we test a fixed number of clusters, traditional multiple-testing methods are appropriate

Modeling

Once we have clusters, we can evaluate different modeling strategies to see which perform best. It is simple to write your own modeling function in a few lines of python, for this example, we will use the z-score method which models each probe in a cluster independently and then combines the p-values accounting for their correlation.

Below we go through an example. You can run this on your own data as long as it is in the same format (e.g. methylation row-labels are chrom:position rather than illumina cg id).

Example

First we import the needed modules and take a loot at the files that we will be using. The example data is from GSE51032 which contains 845 samples with 450K data.

We have downloaded and prepared the data using this script.

We have normalized the methylation using minfi with this script


In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

import toolshed as ts
import numpy as np
import pandas as pd

import crystal
from aclust import mclust # to create correlated clusters

methylation = "v0-450K-12.meth.txt.gz"
covariates = "v0-450K-12.covariates.csv"


/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module readline was already imported from /usr/lib/python2.7/lib-dynload/readline.x86_64-linux-gnu.so, but /home/brentp/.local/lib/python2.7/site-packages is being added to sys.path
  from pkg_resources import resource_stream
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module docutils was already imported from /usr/lib/python2.7/dist-packages/docutils/__init__.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream

In [2]:
%%bash
zcat v0-450K-12.meth.txt.gz | head -5 | cut -f 1-6


probe	GSM1235900	GSM1235902	GSM1235983	GSM1236050	GSM1236090
chr1:834183	0.595	0.661	0.573	0.612	0.696
chr1:839752	0.418	0.559	0.380	0.278	0.372
chr1:843581	0.815	0.803	0.803	0.808	0.855
chr1:844499	0.891	0.894	0.894	0.869	0.914
gzip: stdout: Broken pipe

In [3]:
%%bash 
head -5 v0-450K-12.covariates.csv


,gender,age,Basename,ID,id,plate
GSM1235900,F,58.231,6042324088_R04C01,age58.231_F,GSM1235900,6042324088
GSM1235902,M,52.632,6042324088_R06C01,age52.632_M,GSM1235902,6042324088
GSM1235983,M,64.679,6042324088_R01C01,age64.679_M,GSM1235983,6042324088
GSM1236050,F,55.299,6042324088_R04C02,age55.299_F,GSM1236050,6042324088

Note that the rows from covariates.csv match the column labels from the methylation file.

Now we create a feature_gen function that yields Features that have a chrom, a position, and a numpy array of methylation values.

These are clustered using aclust and yielded lazily.


In [4]:
def feature_gen(fname):         
    for i, toks in enumerate(ts.reader(fname, header=False)):
        if i == 0: continue
        chrom, pos = toks[0].split(":")
        yield crystal.Feature(chrom, int(pos), crystal.logit(np.array(map(float, toks[1:]))))
        
# generate clusters lazily.
cluster_iter =(c for c in mclust(feature_gen(methylation),
                                max_dist=400, max_skip=1) if len(c) > 3)

A cluster is basically just a list of Feature objects for positions with correlated methylation


In [5]:
first_cluster = next(cluster_iter)
[repr(f) for f in first_cluster]


Out[5]:
['Feature(chr1:855046)',
 'Feature(chr1:855425)',
 'Feature(chr1:855445)',
 'Feature(chr1:855544)']

We can see that the data in the cluster is highly correlated as expected:


In [6]:
np.corrcoef(first_cluster[2].values, first_cluster[3].values)[0, 1]


Out[6]:
0.67649504687119966

Now we want to model the cluster. We use a model in R / patsy syntax and a pandas dataframe for the covariates. With that, we can immediately get a p-value for the first cluster above:


In [7]:
formula = "methylation ~ age + gender"

covs = pd.read_csv(covariates, index_col=0)
covs['id'] = covs.index

crystal.zscore_cluster(formula, first_cluster, covs, "age")


Out[7]:
{'coef': 0.0028188864527111186,
 'covar': 'age',
 'p': 0.59214971958573348,
 't': 0.69457856354596847}

We can do the same with different methods. Note that methods which do not model the CpG's separately should include "CpG" as a covariate in the model. It is added to the tested data automatically.


In [8]:
crystal.liptak_cluster(formula, first_cluster, covs, "age")


Out[8]:
{'coef': 0.0028188864527111186,
 'covar': 'age',
 'p': 0.4367180997149267,
 't': 0.69457856354596847}

In [9]:
crystal.gee_cluster(formula + " + CpG" , first_cluster, covs, "age")


Out[9]:
{'coef': 0.0028188864527111204,
 'covar': 'age',
 'p': 0.22466731941029672,
 't': 1.2142105852120666}

In [10]:
crystal.gls_cluster(formula + "+CpG", first_cluster, covs, "age")


Out[10]:
{'coef': 0.0028188864527110931,
 'covar': 'age',
 'p': 0.18104097755147489,
 't': 1.3601428367835633}

In [11]:
crystal.bump_cluster(formula, first_cluster, covs, "age")


Out[11]:
{'coef': 0.0028188864527111186,
 'covar': 'age',
 'n_sim': 116,
 'p': 0.06837606837606838,
 't': 0.69457856354596847}

Note that:

  1. the different methods come up with different p-values but the coefficient estimates are nearly identical. (the other notebook in this directory compares the specificity and sensitivity of these methods).
  2. every method takes the same arguments: formula, methylation, covariates, coefficient-of-interest

So we could easily test gender as (note that it is very significant):


In [12]:
crystal.gee_cluster(formula + " + CpG" , first_cluster, covs, "gender")


Out[12]:
{'coef': -0.10249678935226518,
 'covar': 'gender[T.M]',
 'p': 0.0063973953064402666,
 't': -2.7266856506820076}

In general we don't want to mess around with a single cluster, we want to test every cluster. There is a simple calling function to do this:


In [13]:
header = "{chrom}\t{start}\t{end}\t{p:.6g}\t{coef:.4f}\t{n_sites}"
formula = "methylation ~ age + gender + CpG"
print ts.fmt2header(header)
for i, cluster in enumerate(crystal.model_clusters(cluster_iter, covs, formula, "age", crystal.gee_cluster, n_cpu=1)):
    if i > 5: break
    print header.format(**cluster)


chrom	start	end	p	coef	n_sites
chr1	869345	872385	0.542112	-0.0053	13
chr1	901448	901892	0.898846	0.0006	6
chr1	917248	917819	0.2135	-0.0078	5
chr1	935455	936229	0.000654375	0.0179	4
chr1	946874	947343	0.74954	0.0015	6
chr1	948892	949893	0.759461	-0.0038	6

Note the n_cpu argument which will parallelize the model calling.

We can see how similar those results are to using another method that combines p-values:


In [14]:
cluster_iter =(c for c in mclust(feature_gen(methylation),
                                max_dist=400, max_skip=1) if len(c) > 3)
next(cluster_iter) # discard first one as above

formula = "methylation ~ age + gender"
for i, cluster in enumerate(crystal.model_clusters(cluster_iter, covs, formula, "age", crystal.liptak_cluster, n_cpu=1)):
    print header.format(**cluster)
    if i > 5: break


chr1	869345	872385	0.732089	-0.0053	13
chr1	901448	901892	0.820732	0.0006	6
chr1	917248	917819	0.292244	-0.0078	5
chr1	935455	936229	0.327064	0.0179	4
chr1	946874	947343	0.54557	0.0015	6
chr1	948892	949893	0.264239	-0.0038	6
chr1	960760	964506	0.0346717	0.0009	5

We can see what is returned in a single cluster afer the modeling:


In [15]:
cluster


Out[15]:
{'chrom': 'chr1',
 'cluster': [Feature(chr1:960761),
  Feature(chr1:962651),
  Feature(chr1:963599),
  Feature(chr1:964497),
  Feature(chr1:964506)],
 'coef': 0.00094014901993787572,
 'covar': 'age',
 'end': 964506,
 'n_sites': 5,
 'p': 0.034671694115228517,
 'sites': ['chr1:960761',
  'chr1:962651',
  'chr1:963599',
  'chr1:964497',
  'chr1:964506'],
 'start': 960760,
 't': 0.20485323448674161,
 'time': 0.022290945053100586,
 'var': 'age'}

In general, regardless of the format of your input data, if you can make a generator like feature_gen above, that yields Feature objects in chromosome and position order, then it will be very simple to run these methods.

See the included minfi-norm.R for how to get your 450k data in a format where you can use the above without modification.

Plotting

Below, we show how to plot a single cluster. For dichotomous variables, such as gender, in this example, it will plot a histogram of the methylation values at each site. For continuous variables, such as age it will do a scatter plot vs methylation


In [16]:
from pylab import rcParams
rcParams['figure.figsize'] = 10, 10
cluster['var'] = 'gender'
fig, axs = crystal.plot.plot_cluster(cluster, covs, normed=True)
cluster['var'] = 'age'
fig, axs = crystal.plot.plot_cluster(cluster, covs)



In [17]:
# I run this to get a cluster that looks different by gender
for i, cluster in enumerate(crystal.model_clusters(cluster_iter, covs, formula + " + CpG" , "gender",
                                                   crystal.gee_cluster, n_cpu=1)):
    if cluster['p'] < 1e-4 and abs(cluster['coef']) > 0.1: break
    
_ = crystal.plot.plot_cluster(cluster, covs, normed=True)



In [18]:
cluster


Out[18]:
{'chrom': 'chr1',
 'cluster': [Feature(chr1:3432342),
  Feature(chr1:3435887),
  Feature(chr1:3435989),
  Feature(chr1:3436115)],
 'coef': -0.11960754737043301,
 'covar': 'gender[T.M]',
 'end': 3436115,
 'n_sites': 4,
 'p': 2.2051046688007441e-06,
 'sites': ['chr1:3432342', 'chr1:3435887', 'chr1:3435989', 'chr1:3436115'],
 'start': 3432341,
 't': -4.7336567988880249,
 'time': 0.017721891403198242,
 'var': 'gender'}

In the case of a dichotomous covariate, we can make a nice horizontal bar plot with the genomic positions increasing along the x-axis and the distribution of each group plotted in histograms to the left/right of the position:


In [19]:
cluster['var'] = 'gender'
ax = crystal.plot.spaghetti_plot(cluster, covs, ilogit=True)


Remember, we can modle a cluster using a variety of methods, simply by using a different function call:


In [20]:
crystal.gee_cluster(formula + " + CpG", cluster['cluster'], covs, "gender")


Out[20]:
{'coef': -0.11960754737043326,
 'covar': 'gender[T.M]',
 'p': 2.2051046688006298e-06,
 't': -4.7336567988880356}

In [21]:
crystal.ols_cluster_robust(formula + " + CpG", cluster['cluster'], covs, "gender")


Out[21]:
{'coef': -0.11960754737043293,
 'covar': 'gender[T.M]',
 'p': 1.8332901224481242e-05,
 't': -4.2842835185198584}

In [22]:
crystal.glsar_cluster(formula + " + CpG", cluster['cluster'], covs, "gender")


Out[22]:
{'coef': -0.12026443023136824,
 'covar': 'gender[T.M]',
 'p': 1.1027813962481873e-07,
 't': -6.6007724009502775}

In [23]:
crystal.liptak_cluster(formula, cluster['cluster'], covs, "gender")


Out[23]:
{'coef': -0.11960754737043294,
 'covar': 'gender[T.M]',
 'p': 0.069717794982398509,
 't': -2.2495214627902937}

Note that the default is to use OLS method for modeling. We can specify to use robust regression


In [24]:
crystal.liptak_cluster(formula, cluster['cluster'][:2], covs, "gender", method=crystal.RLM)


Out[24]:
{'coef': -0.10013375009542361,
 'covar': 'gender[T.M]',
 'p': 0.15559143540404846,
 't': -2.0856923514839645}

In [25]:
crystal.zscore_cluster(formula, cluster['cluster'][:2], covs, "gender", method=crystal.RLM)


Out[25]:
{'coef': -0.10013375009542361,
 'covar': 'gender[T.M]',
 'p': 0.026926296550330703,
 't': -2.0856923514839645}

In [25]: