Evaluation

In this document, we evaluate the above methods on a real dataset. We also create data that should serve as true-negatives by shuffling the observed data. In this way, we can evaluate sensitivity by the number of DMR's found for a given $\alpha$ by the number of DMR's that meet that cutoff in the real dataset. Likewise, we can evaluate the specificity (actually the false positives) by finding the number of DMR's that meet a given cutoff in the shuffled data.



In [1]:

    
%matplotlib inline

import toolshed as ts
import itertools as it
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", context="talk")
colors = sns.color_palette("Set1", 8)
pool = ts.pool()

import crystal
from aclust import mclust # to create correlated clusters









    



/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module readline was already imported from /usr/lib/python2.7/lib-dynload/readline.x86_64-linux-gnu.so, but /home/brentp/.local/lib/python2.7/site-packages is being added to sys.path
  from pkg_resources import resource_stream

We need a generator that reads data and generates features to sent to aclust



In [2]:

    
def feature_gen(fname, samples, logit=True):
    for i, d in enumerate(ts.reader(fname, header=True)):
        chrom, pos = d['probe'].split(":")
        if chrom[-1] in "XY": break
        values = np.array([float(d[s]) for s in samples])
        yield crystal.Feature(chrom, int(pos), crystal.logit(values) if logit else values)

Now we need to define our model and read in our covariates. First, we want to test 6 vs 6 (male vs female).



In [3]:

    
methylation_file = "d0-450K-12.meth.txt.gz"
covariates_file = "d0-450K-12.covariates.csv"
np.random.seed(42)

covs = pd.read_csv(covariates_file, index_col=0)

Our input data is a methylation matrix (rows of probes and columns of samples). Here, we have encoded the position into the row names, but you can also add these on the fly as you stream over the data.



In [4]:

    
# a function to restart the cluster generation each time.
def get_cluster_iter(samples, fmeth, min_len=3, logit=True):
    return (c for c in mclust(feature_gen(fmeth, samples, logit=logit),
                              max_dist=300, max_skip=0) if len(c) >= min_len)

Mclust returns an iterator of clusters given an iterator of features. The clusters are have correlated methylation. The clustering is done without knowing the study design so it is unbiased.

Each cluster is a list of features from your own feature_gen()



In [5]:

    
print covs.gender.value_counts()
covs.head()









    



M    7
F    5
dtype: int64






    Out[5]:






  
    
      
      gender
      age
      Basename
      ID
      id
      plate
    
  
  
    
      GSM1235879
       M
       56.120
       6042324097_R03C02
        age56.12_M
       GSM1235879
       6042324097
    
    
      GSM1235909
       M
       54.790
       6042324097_R01C01
        age54.79_M
       GSM1235909
       6042324097
    
    
      GSM1236002
       F
       44.304
       6042324097_R05C02
       age44.304_F
       GSM1236002
       6042324097
    
    
      GSM1236011
       M
       53.719
       6042324097_R02C01
       age53.719_M
       GSM1236011
       6042324097
    
    
      GSM1236114
       M
       58.450
       6042324097_R03C01
        age58.45_M
       GSM1236114
       6042324097

Our analysis will be looking for DMRs related to gender we run a number of methods, repeat some under robust regression, and the evaluate the results:



In [6]:

    
clusters = list(get_cluster_iter(covs.index, methylation_file, 1))

Evaluate On Replication

Now we will evaluate performance using the continuous variable 'age' by using a discovery and validation cohort randomly drawn from the full dataset. We will create the plots as above using sites that replicate in the validation as 'true' positives and sites that do not as 'false' positives. There are some issues with that interpretation in the context of an ROC analysis, but it is an informative metric because we often are interested in sites that will replicate across cohorts.

First, we model the original clusters and find sites related to age:



In [7]:

    
formula_base = "methylation ~ age + gender"
formula_cpg = formula_base + " + CpG"
coef = 'age'

labels = ['zscore', 'gee', 'mixed-model']

age_modeled = [list(crystal.model_clusters(clusters, covs, formula, coef, fn, pool=pool, **kwargs))
             for fn, formula, kwargs in [
                  (crystal.zscore_cluster, formula_base, {}),
                  #(crystal.zscore_cluster, formula_base, dict(method=crystal.RLM)),
                  #(crystal.zscore_cluster, formula_base, dict(method=crystal.Beta, transform=crystal.ilogit)),

                  (crystal.gee_cluster, formula_cpg, {}),
                  #(crystal.gee_cluster, formula_cpg, dict(cov_struct=Independence())),
                  #(crystal.ols_cluster_robust, formula_cpg, {}),
                  (crystal.mixed_model_cluster, formula_cpg, {})
 
           ]]









    



/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1676: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/base/model.py:445: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

Then, we write those regions to a file; we will want to replicate sites with p < 1e-3.



In [73]:

    
CUTOFF = 1e-4
from crystal.evaluate import write_modeled_regions        
region_files = []        
for label, method in zip(labels, age_modeled):
    out_fh = open('/tmp/%s.regions.bed' % label.replace(' ', '-'), 'w')
    write_modeled_regions(method, CUTOFF, out_fh)
    region_files.append(out_fh.name)
    out_fh.close()



In [74]:

    
external = ('betabinomial', 'bumphunter', 'dmrcate', 'limma')
disc_feats = [f for c in clusters for f in c]
for label in external:
    out_fh = open('/tmp/%s.regions.bed' % label, 'w')
    df = "../work/discovery%s.output.bed" % label
    subset = ["\t".join(t.values()) for t in ts.reader(df, header="ordered") if float(t['p']) < CUTOFF ]
    print label, len(subset)
    write_region_bed(disc_feats, subset, out_fh)
    out_fh.close()









    



betabinomial 3992
bumphunter 0
dmrcate 91
limma 77



In [9]:

    
rmethylation_file = "v0-450K-12.meth.txt.gz"
rcovariates_file = "v0-450K-12.covariates.csv"

rcovs = pd.read_csv(rcovariates_file, index_col=0)

replication_clusters = list(get_cluster_iter(rcovs.index, rmethylation_file))



In [10]:

    
age_replicated = [list(crystal.model_clusters(replication_clusters, rcovs, formula, coef, fn, pool=pool, **kwargs))
             for fn, formula, kwargs in [
                  (crystal.zscore_cluster, formula_base, {}),
                  (crystal.gee_cluster, formula_cpg, {}),
                  (crystal.mixed_model_cluster, formula_cpg, {})
           ]]



In [75]:

    
from crystal.evaluate import evaluate, cluster_bediter
fig, ax = plt.subplots(1)

# first do a simple transform to make use of the regions function
rep_results = [list(cluster_bediter(r)) for r in age_replicated]
for label, rep, regions in zip(labels, rep_results, region_files):
    evaluate(rep, regions, label=label, ax=ax)
    
for label in external:
    if label == "bumphunter": continue
    regions = "/tmp/%s.regions.bed" % label
    biter = ((d['chrom'], int(float(d['start'])), int(float(d['end'])), float(d['p']), 1) for 
                                          d in ts.reader("../work/validation%s.output.bed" % label))
    evaluate(biter, regions, label=label, ax=ax)
    
ax.plot([0, 1], [0, 1], 'k--', zorder=-1) 
ax.set_xlim(0, 0.3)
ax.set_ylim(0, 0.3)
ax.legend(loc='best')    
plt.savefig('../figures/figure3.eps')









    



AUC: 0.5309 | zscore 443170
AUC: 0.6003 | gee 443170
AUC: 0.6052 | mixed-model 443170
AUC: 0.5137 | betabinomial 443170
AUC: 0.4891 | dmrcate 443170
AUC: 0.5815 | limma 443170



In [ ]:

	gender	age	Basename	ID	id	plate
GSM1235879	M	56.120	6042324097_R03C02	age56.12_M	GSM1235879	6042324097
GSM1235909	M	54.790	6042324097_R01C01	age54.79_M	GSM1235909	6042324097
GSM1236002	F	44.304	6042324097_R05C02	age44.304_F	GSM1236002	6042324097
GSM1236011	M	53.719	6042324097_R02C01	age53.719_M	GSM1236011	6042324097
GSM1236114	M	58.450	6042324097_R03C01	age58.45_M	GSM1236114	6042324097