Crystal is a framework for writing functions that test regions for differential methylation. Or, practically, it is a DMR (differentially methylated region) caller. The challenge of DMR calling lies in handling correlated data as we see in DNA methylation.
Crystal works by generating clusters using aclust and then testing them against a chosen method. New methods can be implemented easily, but currently, there is:
Methods using Robust regression are also available for the above.
Note that these are cleanly implemented in python thanks to the excellent statsmodels package
Crystal first clusters DNA methylation by grouping correlated probes using streaming agglomerative clustering. This idea was first implemented in a different form by Sofer et al here. The advantage of clustering before modeling are:
Once we have clusters, we can evaluate different modeling strategies to see which perform best. It is simple to write your own modeling function in a few lines of python, for this example, we will use the z-score method which models each probe in a cluster independently and then combines the p-values accounting for their correlation.
Below we go through an example. You can run this on your own data as long as it is in the same format (e.g. methylation row-labels are chrom:position rather than illumina cg id).
First we import the needed modules and take a loot at the files that we will be using. The example data is from GSE51032 which contains 845 samples with 450K data.
We have downloaded and prepared the data using this script.
We have normalized the methylation using minfi with this script
In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
import toolshed as ts
import numpy as np
import pandas as pd
import crystal
from aclust import mclust # to create correlated clusters
methylation = "v0-450K-12.meth.txt.gz"
covariates = "v0-450K-12.covariates.csv"
In [2]:
%%bash
zcat v0-450K-12.meth.txt.gz | head -5 | cut -f 1-6
In [3]:
%%bash
head -5 v0-450K-12.covariates.csv
Note that the rows from covariates.csv match the column labels from the methylation file.
Now we create a feature_gen function that yields Features that have a chrom, a position, and a numpy array of methylation values.
These are clustered using aclust and yielded lazily.
In [4]:
def feature_gen(fname):
for i, toks in enumerate(ts.reader(fname, header=False)):
if i == 0: continue
chrom, pos = toks[0].split(":")
yield crystal.Feature(chrom, int(pos), crystal.logit(np.array(map(float, toks[1:]))))
# generate clusters lazily.
cluster_iter =(c for c in mclust(feature_gen(methylation),
max_dist=400, max_skip=1) if len(c) > 3)
A cluster is basically just a list of Feature objects for positions with correlated methylation
In [5]:
first_cluster = next(cluster_iter)
[repr(f) for f in first_cluster]
Out[5]:
We can see that the data in the cluster is highly correlated as expected:
In [6]:
np.corrcoef(first_cluster[2].values, first_cluster[3].values)[0, 1]
Out[6]:
Now we want to model the cluster. We use a model in R / patsy syntax and a pandas dataframe for the covariates. With that, we can immediately get a p-value for the first cluster above:
In [7]:
formula = "methylation ~ age + gender"
covs = pd.read_csv(covariates, index_col=0)
covs['id'] = covs.index
crystal.zscore_cluster(formula, first_cluster, covs, "age")
Out[7]:
We can do the same with different methods. Note that methods which do not model the CpG's separately should include "CpG" as a covariate in the model. It is added to the tested data automatically.
In [8]:
crystal.liptak_cluster(formula, first_cluster, covs, "age")
Out[8]:
In [9]:
crystal.gee_cluster(formula + " + CpG" , first_cluster, covs, "age")
Out[9]:
In [10]:
crystal.gls_cluster(formula + "+CpG", first_cluster, covs, "age")
Out[10]:
In [11]:
crystal.bump_cluster(formula, first_cluster, covs, "age")
Out[11]:
Note that:
So we could easily test gender as (note that it is very significant):
In [12]:
crystal.gee_cluster(formula + " + CpG" , first_cluster, covs, "gender")
Out[12]:
In general we don't want to mess around with a single cluster, we want to test every cluster. There is a simple calling function to do this:
In [13]:
header = "{chrom}\t{start}\t{end}\t{p:.6g}\t{coef:.4f}\t{n_sites}"
formula = "methylation ~ age + gender + CpG"
print ts.fmt2header(header)
for i, cluster in enumerate(crystal.model_clusters(cluster_iter, covs, formula, "age", crystal.gee_cluster, n_cpu=1)):
if i > 5: break
print header.format(**cluster)
Note the n_cpu argument which will parallelize the model calling.
We can see how similar those results are to using another method that combines p-values:
In [14]:
cluster_iter =(c for c in mclust(feature_gen(methylation),
max_dist=400, max_skip=1) if len(c) > 3)
next(cluster_iter) # discard first one as above
formula = "methylation ~ age + gender"
for i, cluster in enumerate(crystal.model_clusters(cluster_iter, covs, formula, "age", crystal.liptak_cluster, n_cpu=1)):
print header.format(**cluster)
if i > 5: break
We can see what is returned in a single cluster afer the modeling:
In [15]:
cluster
Out[15]:
In general, regardless of the format of your input data, if you can make a generator like feature_gen above, that yields Feature objects in chromosome and position order, then it will be very simple to run these methods.
See the included minfi-norm.R for how to get your 450k data in a format where you can use the above without modification.
Below, we show how to plot a single cluster. For dichotomous variables, such as gender, in this example, it will plot a histogram of the methylation values at each site. For continuous variables, such as age it will do a scatter plot vs methylation
In [16]:
from pylab import rcParams
rcParams['figure.figsize'] = 10, 10
cluster['var'] = 'gender'
fig, axs = crystal.plot.plot_cluster(cluster, covs, normed=True)
cluster['var'] = 'age'
fig, axs = crystal.plot.plot_cluster(cluster, covs)
In [17]:
# I run this to get a cluster that looks different by gender
for i, cluster in enumerate(crystal.model_clusters(cluster_iter, covs, formula + " + CpG" , "gender",
crystal.gee_cluster, n_cpu=1)):
if cluster['p'] < 1e-4 and abs(cluster['coef']) > 0.1: break
_ = crystal.plot.plot_cluster(cluster, covs, normed=True)
In [18]:
cluster
Out[18]:
In the case of a dichotomous covariate, we can make a nice horizontal bar plot with the genomic positions increasing along the x-axis and the distribution of each group plotted in histograms to the left/right of the position:
In [19]:
cluster['var'] = 'gender'
ax = crystal.plot.spaghetti_plot(cluster, covs, ilogit=True)
Remember, we can modle a cluster using a variety of methods, simply by using a different function call:
In [20]:
crystal.gee_cluster(formula + " + CpG", cluster['cluster'], covs, "gender")
Out[20]:
In [21]:
crystal.ols_cluster_robust(formula + " + CpG", cluster['cluster'], covs, "gender")
Out[21]:
In [22]:
crystal.glsar_cluster(formula + " + CpG", cluster['cluster'], covs, "gender")
Out[22]:
In [23]:
crystal.liptak_cluster(formula, cluster['cluster'], covs, "gender")
Out[23]:
Note that the default is to use OLS method for modeling. We can specify to use robust regression
In [24]:
crystal.liptak_cluster(formula, cluster['cluster'][:2], covs, "gender", method=crystal.RLM)
Out[24]:
In [25]:
crystal.zscore_cluster(formula, cluster['cluster'][:2], covs, "gender", method=crystal.RLM)
Out[25]:
In [25]: