In this document, we evaluate the above methods on a real dataset. We also create data that should serve as true-negatives by shuffling the observed data. In this way, we can evaluate sensitivity by the number of DMR's found for a given $\alpha$ by the number of DMR's that meet that cutoff in the real dataset. Likewise, we can evaluate the specificity (actually the false positives) by finding the number of DMR's that meet a given cutoff in the shuffled data.
In [1]:
%matplotlib inline
import toolshed as ts
import itertools as it
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", context="talk")
colors = sns.color_palette("Set1", 8)
pool = ts.pool()
import crystal
from aclust import mclust # to create correlated clusters
We need a generator that reads data and generates features to sent to aclust
In [2]:
def feature_gen(fname, samples, logit=True):
for i, d in enumerate(ts.reader(fname, header=True)):
chrom, pos = d['probe'].split(":")
if chrom[-1] in "XY": break
values = np.array([float(d[s]) for s in samples])
yield crystal.Feature(chrom, int(pos), crystal.logit(values) if logit else values)
Now we need to define our model and read in our covariates. First, we want to test 6 vs 6 (male vs female).
In [3]:
methylation_file = "d0-450K-12.meth.txt.gz"
covariates_file = "d0-450K-12.covariates.csv"
np.random.seed(42)
covs = pd.read_csv(covariates_file, index_col=0)
Our input data is a methylation matrix (rows of probes and columns of samples). Here, we have encoded the position into the row names, but you can also add these on the fly as you stream over the data.
In [4]:
# a function to restart the cluster generation each time.
def get_cluster_iter(samples, fmeth, min_len=3, logit=True):
return (c for c in mclust(feature_gen(fmeth, samples, logit=logit),
max_dist=300, max_skip=0) if len(c) >= min_len)
Mclust returns an iterator of clusters given an iterator of features. The clusters are have correlated methylation. The clustering is done without knowing the study design so it is unbiased.
Each cluster is a list of features from your own feature_gen()
In [5]:
print covs.gender.value_counts()
covs.head()
Out[5]:
Our analysis will be looking for DMRs related to gender we run a number of methods, repeat some under robust regression, and the evaluate the results:
In [6]:
clusters = list(get_cluster_iter(covs.index, methylation_file, 1))
Now we will evaluate performance using the continuous variable 'age' by using a discovery and validation cohort randomly drawn from the full dataset. We will create the plots as above using sites that replicate in the validation as 'true' positives and sites that do not as 'false' positives. There are some issues with that interpretation in the context of an ROC analysis, but it is an informative metric because we often are interested in sites that will replicate across cohorts.
First, we model the original clusters and find sites related to age:
In [7]:
formula_base = "methylation ~ age + gender"
formula_cpg = formula_base + " + CpG"
coef = 'age'
labels = ['zscore', 'gee', 'mixed-model']
age_modeled = [list(crystal.model_clusters(clusters, covs, formula, coef, fn, pool=pool, **kwargs))
for fn, formula, kwargs in [
(crystal.zscore_cluster, formula_base, {}),
#(crystal.zscore_cluster, formula_base, dict(method=crystal.RLM)),
#(crystal.zscore_cluster, formula_base, dict(method=crystal.Beta, transform=crystal.ilogit)),
(crystal.gee_cluster, formula_cpg, {}),
#(crystal.gee_cluster, formula_cpg, dict(cov_struct=Independence())),
#(crystal.ols_cluster_robust, formula_cpg, {}),
(crystal.mixed_model_cluster, formula_cpg, {})
]]
Then, we write those regions to a file; we will want to replicate sites with p < 1e-3.
In [73]:
CUTOFF = 1e-4
from crystal.evaluate import write_modeled_regions
region_files = []
for label, method in zip(labels, age_modeled):
out_fh = open('/tmp/%s.regions.bed' % label.replace(' ', '-'), 'w')
write_modeled_regions(method, CUTOFF, out_fh)
region_files.append(out_fh.name)
out_fh.close()
In [74]:
external = ('betabinomial', 'bumphunter', 'dmrcate', 'limma')
disc_feats = [f for c in clusters for f in c]
for label in external:
out_fh = open('/tmp/%s.regions.bed' % label, 'w')
df = "../work/discovery%s.output.bed" % label
subset = ["\t".join(t.values()) for t in ts.reader(df, header="ordered") if float(t['p']) < CUTOFF ]
print label, len(subset)
write_region_bed(disc_feats, subset, out_fh)
out_fh.close()
In [9]:
rmethylation_file = "v0-450K-12.meth.txt.gz"
rcovariates_file = "v0-450K-12.covariates.csv"
rcovs = pd.read_csv(rcovariates_file, index_col=0)
replication_clusters = list(get_cluster_iter(rcovs.index, rmethylation_file))
In [10]:
age_replicated = [list(crystal.model_clusters(replication_clusters, rcovs, formula, coef, fn, pool=pool, **kwargs))
for fn, formula, kwargs in [
(crystal.zscore_cluster, formula_base, {}),
(crystal.gee_cluster, formula_cpg, {}),
(crystal.mixed_model_cluster, formula_cpg, {})
]]
In [75]:
from crystal.evaluate import evaluate, cluster_bediter
fig, ax = plt.subplots(1)
# first do a simple transform to make use of the regions function
rep_results = [list(cluster_bediter(r)) for r in age_replicated]
for label, rep, regions in zip(labels, rep_results, region_files):
evaluate(rep, regions, label=label, ax=ax)
for label in external:
if label == "bumphunter": continue
regions = "/tmp/%s.regions.bed" % label
biter = ((d['chrom'], int(float(d['start'])), int(float(d['end'])), float(d['p']), 1) for
d in ts.reader("../work/validation%s.output.bed" % label))
evaluate(biter, regions, label=label, ax=ax)
ax.plot([0, 1], [0, 1], 'k--', zorder=-1)
ax.set_xlim(0, 0.3)
ax.set_ylim(0, 0.3)
ax.legend(loc='best')
plt.savefig('../figures/figure3.eps')
In [ ]: