Evaluation

In this document, we evaluate the above methods on a real dataset. We also create data that should serve as true-negatives by shuffling the observed data. In this way, we can evaluate sensitivity by the number of DMR's found for a given $\alpha$ by the number of DMR's that meet that cutoff in the real dataset. Likewise, we can evaluate the specificity (actually the false positives) by finding the number of DMR's that meet a given cutoff in the shuffled data.


In [1]:
%matplotlib inline

import toolshed as ts
import itertools as it
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", context="talk")
colors = sns.color_palette("Set1", 8)
pool = ts.pool()

import crystal
from aclust import mclust # to create correlated clusters


/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module readline was already imported from /usr/lib/python2.7/lib-dynload/readline.x86_64-linux-gnu.so, but /home/brentp/.local/lib/python2.7/site-packages is being added to sys.path
  from pkg_resources import resource_stream

We need a generator that reads data and generates features to sent to aclust


In [2]:
def feature_gen(fname, samples, logit=True):
    for i, d in enumerate(ts.reader(fname, header=True)):
        chrom, pos = d['probe'].split(":")
        if chrom[-1] in "XY": break
        values = np.array([float(d[s]) for s in samples])
        yield crystal.Feature(chrom, int(pos), crystal.logit(values) if logit else values)

Now we need to define our model and read in our covariates. First, we want to test 6 vs 6 (male vs female).


In [3]:
methylation_file = "d0-450K-12.meth.txt.gz"
covariates_file = "d0-450K-12.covariates.csv"
np.random.seed(42)

covs = pd.read_csv(covariates_file, index_col=0)

Our input data is a methylation matrix (rows of probes and columns of samples). Here, we have encoded the position into the row names, but you can also add these on the fly as you stream over the data.


In [4]:
# a function to restart the cluster generation each time.
def get_cluster_iter(samples, fmeth, min_len=3, logit=True):
    return (c for c in mclust(feature_gen(fmeth, samples, logit=logit),
                              max_dist=300, max_skip=0) if len(c) >= min_len)

Mclust returns an iterator of clusters given an iterator of features. The clusters are have correlated methylation. The clustering is done without knowing the study design so it is unbiased.

Each cluster is a list of features from your own feature_gen()


In [5]:
print covs.gender.value_counts()
covs.head()


M    7
F    5
dtype: int64
Out[5]:
gender age Basename ID id plate
GSM1235879 M 56.120 6042324097_R03C02 age56.12_M GSM1235879 6042324097
GSM1235909 M 54.790 6042324097_R01C01 age54.79_M GSM1235909 6042324097
GSM1236002 F 44.304 6042324097_R05C02 age44.304_F GSM1236002 6042324097
GSM1236011 M 53.719 6042324097_R02C01 age53.719_M GSM1236011 6042324097
GSM1236114 M 58.450 6042324097_R03C01 age58.45_M GSM1236114 6042324097

Our analysis will be looking for DMRs related to gender we run a number of methods, repeat some under robust regression, and the evaluate the results:


In [6]:
clusters = list(get_cluster_iter(covs.index, methylation_file, 1))

Evaluate On Replication

Now we will evaluate performance using the continuous variable 'age' by using a discovery and validation cohort randomly drawn from the full dataset. We will create the plots as above using sites that replicate in the validation as 'true' positives and sites that do not as 'false' positives. There are some issues with that interpretation in the context of an ROC analysis, but it is an informative metric because we often are interested in sites that will replicate across cohorts.

First, we model the original clusters and find sites related to age:


In [7]:
formula_base = "methylation ~ age + gender"
formula_cpg = formula_base + " + CpG"
coef = 'age'

labels = ['zscore', 'gee', 'mixed-model']

age_modeled = [list(crystal.model_clusters(clusters, covs, formula, coef, fn, pool=pool, **kwargs))
             for fn, formula, kwargs in [
                  (crystal.zscore_cluster, formula_base, {}),
                  #(crystal.zscore_cluster, formula_base, dict(method=crystal.RLM)),
                  #(crystal.zscore_cluster, formula_base, dict(method=crystal.Beta, transform=crystal.ilogit)),

                  (crystal.gee_cluster, formula_cpg, {}),
                  #(crystal.gee_cluster, formula_cpg, dict(cov_struct=Independence())),
                  #(crystal.ols_cluster_robust, formula_cpg, {}),
                  (crystal.mixed_model_cluster, formula_cpg, {})
 
           ]]


/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

Then, we write those regions to a file; we will want to replicate sites with p < 1e-3.


In [73]:
CUTOFF = 1e-4
from crystal.evaluate import write_modeled_regions        
region_files = []        
for label, method in zip(labels, age_modeled):
    out_fh = open('/tmp/%s.regions.bed' % label.replace(' ', '-'), 'w')
    write_modeled_regions(method, CUTOFF, out_fh)
    region_files.append(out_fh.name)
    out_fh.close()

In [74]:
external = ('betabinomial', 'bumphunter', 'dmrcate', 'limma')
disc_feats = [f for c in clusters for f in c]
for label in external:
    out_fh = open('/tmp/%s.regions.bed' % label, 'w')
    df = "../work/discovery%s.output.bed" % label
    subset = ["\t".join(t.values()) for t in ts.reader(df, header="ordered") if float(t['p']) < CUTOFF ]
    print label, len(subset)
    write_region_bed(disc_feats, subset, out_fh)
    out_fh.close()


betabinomial 3992
bumphunter 0
dmrcate 91
limma 77

In [9]:
rmethylation_file = "v0-450K-12.meth.txt.gz"
rcovariates_file = "v0-450K-12.covariates.csv"

rcovs = pd.read_csv(rcovariates_file, index_col=0)

replication_clusters = list(get_cluster_iter(rcovs.index, rmethylation_file))

In [10]:
age_replicated = [list(crystal.model_clusters(replication_clusters, rcovs, formula, coef, fn, pool=pool, **kwargs))
             for fn, formula, kwargs in [
                  (crystal.zscore_cluster, formula_base, {}),
                  (crystal.gee_cluster, formula_cpg, {}),
                  (crystal.mixed_model_cluster, formula_cpg, {})
           ]]

In [75]:
from crystal.evaluate import evaluate, cluster_bediter
fig, ax = plt.subplots(1)

# first do a simple transform to make use of the regions function
rep_results = [list(cluster_bediter(r)) for r in age_replicated]
for label, rep, regions in zip(labels, rep_results, region_files):
    evaluate(rep, regions, label=label, ax=ax)
    
for label in external:
    if label == "bumphunter": continue
    regions = "/tmp/%s.regions.bed" % label
    biter = ((d['chrom'], int(float(d['start'])), int(float(d['end'])), float(d['p']), 1) for 
                                          d in ts.reader("../work/validation%s.output.bed" % label))
    evaluate(biter, regions, label=label, ax=ax)
    
ax.plot([0, 1], [0, 1], 'k--', zorder=-1) 
ax.set_xlim(0, 0.3)
ax.set_ylim(0, 0.3)
ax.legend(loc='best')    
plt.savefig('../figures/figure3.eps')


AUC: 0.5309 | zscore 443170
AUC: 0.6003 | gee 443170
AUC: 0.6052 | mixed-model 443170
AUC: 0.5137 | betabinomial 443170
AUC: 0.4891 | dmrcate 443170
AUC: 0.5815 | limma 443170

In [ ]: