HIV Methylation Age: Epigenetic Coorelates of Age Advancement

Imports and Helper Functions


In [1]:
cd ..


/cellar/users/agross/TCGA_Code/Methlation

In [2]:
import NotebookImport
from Setup.Imports import *


importing IPython notebook from Setup/Imports
Populating the interactive namespace from numpy and matplotlib

In [3]:
import Parallel.Age_HIV_Features as fx


importing IPython notebook from Parallel/Age_HIV_Features
importing IPython notebook from HIV_Age_Advancement
importing IPython notebook from Setup/MethylationAgeModels
importing IPython notebook from Setup/Read_HIV_Data

In [4]:
from Setup.DX_Imports import ttest_df


importing IPython notebook from Setup/DX_Imports

In [5]:
(fx.rr > 0).value_counts()


Out[5]:
False    411452
True      61592
dtype: int64

In [6]:
fx.rr.value_counts()


Out[6]:
0    411452
1     34665
2     26927
dtype: int64

In [7]:
r2 = ti(fx.rr > 1)
direction = fx.res.in_set_s1.multi_variate.age > 0
r3 = ti(((fx.mm < .5).ix[r2] == direction.ix[r2]) == False)
len(r3)


Out[7]:
7967

Unsupervised Analysis of Age-Associated Probes

  • Now we are taking these age associated probes and looking in our new dataset for broad trends
  • For now, mainly I'm just doing basic PCA on the normalized dataset

In [96]:
df_hiv2 = fx.df_hiv.ix[:, fx.duration.index]
df_hiv2 = df_hiv2.dropna(1, how='all')

dd = logit_adj(df_hiv2)
m = dd.mean(1)
s = dd.std(1)
df_norm = dd.subtract(m, axis=0).divide(s, axis=0)

In [97]:
idx = r3
U,S,vH = frame_svd(df_norm.ix[idx])

The first PC captures about 25% of the variance in the data


In [98]:
p = S ** 2 / sum(S ** 2)
p[:5]


Out[98]:
array([ 0.14613607,  0.09732037,  0.04516041,  0.03142676,  0.02604059])

In [99]:
fig, ax = subplots(1,1, figsize=(4,3))
rr = 1*vH[0]
k = fx.pred_c.index
hiv = fx.duration.ix[rr.index].dropna() != 'Control'
age = fx.age

sns.regplot(*match_series(age.ix[k], rr.ix[ti(hiv==0)]),
            ax=ax, label='HIV+', ci=None)
sns.regplot(*match_series(age.ix[k], rr.ix[ti(hiv>0)]),
            ax=ax, label='Control', ci=None)
ax.set_ylabel('First PC', size=12)
ax.set_xlabel('Chronological age (years)', size=14)

ax.set_yticks([0])
ax.axhline(0, ls='--', lw=2.5, color='grey', zorder=-1)
ax.set_xbound(23,70)
ax.set_ybound(-.25,.25)
prettify_ax(ax)
fig.tight_layout()
fig.savefig(FIGDIR + 'PC1_validated.pdf')


Here I am doing a drop-1 test for an interaction term of age X HIV. In the second act of this analysis we filter a few more patients with data that doesn't agree between aging models. In general I think these are bad samples, and removing them does improve this result and make the lines more parallel than shown here. I'm not using these filters here because it seemed a little ad-hoc to go back and apply them before we even looked at aging models and it is generally hard to report an iterative analysis in a journal paper.


In [100]:
import statsmodels.api as sm

In [101]:
y = vH[0]
intercept = pd.Series(1, index=y.index)

X = pd.concat([intercept, age, hiv], axis=1, keys=['Intercept', 'age', 'HIV'])
X = X.dropna().ix[y.index]
m1 = sm.OLS(y, X).fit()

X = pd.concat([intercept, age, hiv, hiv*age], axis=1, keys=['Intercept', 'age', 'HIV', 'int'])
X = X.dropna().ix[y.index]
m2 = sm.OLS(y, X).fit()

m2.compare_lr_test(m1)


Out[101]:
(1.1298061889274322, 0.28781668290682894, 1.0)

In [102]:
m2.pvalues


Out[102]:
Intercept    5.41e-08
age          1.19e-04
HIV          3.11e-01
int          2.94e-01
dtype: float64

In [103]:
m1.pvalues


Out[103]:
Intercept    4.79e-18
age          6.42e-12
HIV          3.07e-18
dtype: float64

Here I'm calculating the effect sise of HIV presence across the top principal components. The idea is that an effect of 25 years means the the presence of HIV has an equivalent effect as 25 years of age on the molecular signal.


In [106]:
cc = {}
for i,y in vH.iteritems():
    mod_all = sm.OLS(y, X)
    res_ref = mod_all.fit()
    cc[i] = res_ref.params['HIV'] / res_ref.params['age']
cc = pd.Series(cc)
cc.head()


Out[106]:
0    16.23
1    48.20
2   -10.29
3    19.65
4   -35.27
dtype: float64

In [107]:
def entropy(p):
    '''
    Entropy of a methylaion vector. Here we assume 50% methylation is 
    random and 0 or 1 constitute an informative measument. 
    '''
    q = 1. - p
    s = np.sum((p*np.log(p)) + (q*np.log(q))) / (np.log(.5) * len(p))
    s.name = 'Entropy'
    return s

In [108]:
def disorder_frac(direction, beta, test_set):
    ct = pd.crosstab(beta < .5, direction.ix[test_set]>0)
    return (1.*ct[1][1] + ct[0][0]) / ct.sum().sum()

In [109]:
from Setup.MethylationAgeModels import *

In [110]:
model_idx = hannum_model.index.union(horvath_model.index)

In [111]:
disorder_frac(fx.res.in_set_s1.multi_variate.age, fx.mm, model_idx)


Out[111]:
0.5298165137614679

In [112]:
disorder_frac(fx.res.in_set_s1.multi_variate.age, fx.mm, model_idx) * len(model_idx)


Out[112]:
231.0

In [113]:
231 / 436.


Out[113]:
0.5298165137614679

In [114]:
stats.binom_test(231, n=436)


Out[114]:
0.2311578847176522

In [115]:
model_idx = hannum_model.index.union(horvath_model.index)
d1 = disorder_frac(fx.res.in_set_s1.multi_variate.age, fx.mm, model_idx)
d2 = disorder_frac(fx.r4.multi_variate.HIV, fx.mm, model_idx)

In [116]:
fig, ax = subplots(figsize=(4,3))
ax.bar([0,1], [d1,d2])


Out[116]:
<Container object of 2 artists>

In [118]:
model_entropy = entropy(df_hiv2.ix[model_idx, hiv.index])
all_entropy = entropy(df_hiv2.ix[fx.probe_idx, hiv.index])

In [129]:
zscore = lambda s: (s - s.mean()) / s.std()
e2 = all_entropy.groupby(hiv).apply(zscore)
all_entropy[e2.abs() > 3]


Out[129]:
Sample_Name
METI-8         0.68
RM067          0.66
Name: Entropy, dtype: float64

In [121]:
e2 = model_entropy.groupby(hiv).apply(zscore)
model_entropy[e2.abs() > 3]


Out[121]:
Sample_Name
RA223          0.68
RM067          0.71
Name: Entropy, dtype: float64

In [123]:
fx.hiv.value_counts()


Out[123]:
HIV+    137
HIV-     42
dtype: int64

In [124]:
fig, axs = subplots(1,2, figsize=(7,3))

zscore = lambda s: (s - s.mean()) / s.std()
e2 = all_entropy.groupby(hiv).apply(zscore)
sns.violinplot(all_entropy[e2.abs() < 3], fx.hiv, ax=axs[0], 
               order=['HIV-','HIV+'],
               inner='points', inner_kws={"ms": 8}, alpha=.5,
               bw='scott')
axs[0].scatter([1,2], all_entropy[e2.abs() > 3], color='white', edgecolor='red',
               alpha=1, s=20, lw=1)

e2 = model_entropy.groupby(hiv).apply(zscore)
sns.violinplot(model_entropy[e2.abs() < 3], fx.hiv, ax=axs[1], 
               order=['HIV-','HIV+'],
               inner='points', inner_kws={"ms": 8}, alpha=.5,
               bw='scott')
axs[1].scatter([2,2], model_entropy[e2.abs() > 3], color='white', edgecolor='red',
               alpha=1, s=20, lw=1)

for ax in axs:
    ax.set_xlabel('')
    ax.set_yticks([])
    ax.set_ylabel('Entropy')
    prettify_ax(ax)
fig.tight_layout()



In [128]:
fig, axs = subplots(1,3, figsize=(11,3), sharey=False)

rr = vH[0]
hiv = (fx.duration != 'Control').map({True:'HIV+',False:'HIV-'})

ax=axs[0]
sns.regplot(*match_series(age, rr.ix[ti(hiv=='HIV-')]),
            ax=ax, label='HIV+', ci=None)
sns.regplot(*match_series(age, rr.ix[ti(hiv=='HIV+')]),
            ax=ax, label='Control', ci=None)
ax.set_ylabel('First PC')
ax.set_xlabel('Age (years)')
ax.set_yticks([0])
ax.axhline(0, ls='--', color='grey', zorder=-1)
ax.set_xbound(23,70)
ax.set_ybound(-.25,.25)
prettify_ax(ax)

ax = axs[1]
zscore = lambda s: (s - s.mean()) / s.std()
e2 = all_entropy.groupby(hiv).apply(zscore)
sns.violinplot(all_entropy[e2.abs() < 3], hiv, ax=ax, 
               order=['HIV-','HIV+'],
               inner='points', inner_kws={"ms": 8}, alpha=.5,
               bw='scott')
ax.scatter([1,2], all_entropy[e2.abs() > 3], color='white', edgecolor='red',
               alpha=1, s=20, lw=1)

ax = axs[2]
e2 = model_entropy.groupby(hiv).apply(zscore)
sns.violinplot(model_entropy[e2.abs() < 3], hiv, ax=ax, 
               order=['HIV-','HIV+'],
               inner='points', inner_kws={"ms": 8}, alpha=.5,
               bw='scott')
ax.scatter([2,2], model_entropy[e2.abs() > 3], color='white', edgecolor='red',
               alpha=1, s=20, lw=1)
ax.set_ybound(.67,.72)

for ax in axs[1:]:
    ax.set_xlabel('')
    ax.set_yticks([])
    ax.set_ylabel('Entropy')
    prettify_ax(ax)
fig.tight_layout()
fig.savefig(FIGDIR + 'xd_F4.pdf')



In [147]:
hiv.value_counts()


Out[147]:
HIV+    137
HIV-     42
dtype: int64

In [146]:
kruskal_pandas(hiv, all_entropy)


Out[146]:
H    5.61e+01
p    6.97e-14
dtype: float64

In [148]:
kruskal_pandas(hiv, model_entropy)


Out[148]:
H    1.02
p    0.31
dtype: float64

In [73]:
import Setup.DX_Imports as dx

In [74]:
features = {'Age only': fx.features['Age (BH)'],
            'HIV + Age': fx.features['HIV + Age (BH)'],
            'HIV only': fx.features['HIV (BH)']}

probes_sets = {'PRC2': dx.prc2_probes, 'CpG island': dx.cpg_island,
               'DHS': dx.dhs_site, 'Enhancer': dx.enhancer,
               'Gene body': dx.gene_body, 'TSS': dx.gene_tss,
               'Promoter': dx.promoter}

In [77]:
associations = {}
for i,a in features.iteritems():
    for j,b in probes_sets.iteritems():
        associations[(i,j)] = fisher_exact_test(a,b)

In [82]:
fig, ax = subplots(figsize=(10,4.5))
cmap = sns.diverging_palette(h_neg=210, h_pos=350, s=90, l=30, as_cmap=True)
o = ['DHS','PRC2','CpG island','Enhancer', 'TSS','Gene body','Promoter']
p = ['HIV only', 'Age only', 'HIV + Age']
#p = ['Age', 'HIV']
df = pd.concat(associations).unstack()['odds_ratio'].unstack().T
r = sns.heatmap(df.ix[o,p].T, cmap=cmap, center=1, annot=True, fmt='0.1f')
ax.set_xticklabels(o, rotation=20, ha='center')
fig.tight_layout()