Unsupervised analysis of HIV and age associated methylation signals

Here we are looking at our HIV+ cases and HIV- controls together in an unsupervised way. The idea is to look at age-associated probes and look at the cohort for broad differences between the cases and controls.

Imports and Helper Functions


In [1]:
import NotebookImport
from Setup.Imports import *


importing IPython notebook from Setup/Imports
Populating the interactive namespace from numpy and matplotlib

In [2]:
import Parallel.Age_HIV_Features as fx


importing IPython notebook from Parallel/Age_HIV_Features
importing IPython notebook from HIV_Age_Advancement
importing IPython notebook from Setup/MethylationAgeModels
importing IPython notebook from Setup/Read_HIV_Data

In [3]:
from Setup.DX_Imports import ttest_df


importing IPython notebook from Setup/DX_Imports

Taking a look age associated probes

We see a number of age associated markers in our two external cohorts of (more or less) healthy indivisuals. This is processed in the Parallel.Age_HIV_Features notebook. The (poorly named) variable fx.rr stores the results of our two tiered screen for age associated probes. A value of one indicates that a marker is age assoicated in the Hannum dataset, while a value of two indicates that it is age associated in both the Hannum dataset and the EPIC dataset.


In [4]:
(fx.rr > 0).value_counts()


Out[4]:
False    411452
True      61592
dtype: int64

In [5]:
fx.rr.value_counts()


Out[5]:
0    411452
1     34665
2     26927
dtype: int64

Unsupervised Analysis of Age-Associated Probes

  • Now we are taking these age associated probes and looking in our new dataset for broad trends
  • For now, mainly I'm just doing basic PCA on the normalized dataset

In [7]:
df_hiv2 = fx.df_hiv.ix[:, fx.duration.index]
df_hiv2 = df_hiv2.dropna(1, how='all')

dd = logit_adj(df_hiv2)
m = dd.mean(1)
s = dd.std(1)
df_norm = dd.subtract(m, axis=0).divide(s, axis=0)

In [8]:
idx = ti(fx.rr > 1)
U,S,vH = frame_svd(df_norm.ix[fx.probe_idx].ix[idx])

The first PC captures about 19% of the variance in the data


In [9]:
p = S ** 2 / sum(S ** 2)
p[:5]


Out[9]:
array([ 0.19263648,  0.09834372,  0.04463382,  0.03695052,  0.02871613])

In [10]:
fig, ax = subplots(1,1, figsize=(4,3))
rr = -1*vH[0]
k = fx.pred_c.index
hiv = fx.duration != 'Control'
age = fx.age

sns.regplot(*match_series(age.ix[k], rr.ix[ti(hiv==0)]),
            ax=ax, label='HIV+', ci=None)
sns.regplot(*match_series(age.ix[k], rr.ix[ti(hiv>0)]),
            ax=ax, label='Control', ci=None)
ax.set_ylabel('First PC of validated markers', size=12)
ax.set_xlabel('Chronological age (years)', size=14)

ax.set_yticks([0])
ax.axhline(0, ls='--', lw=2.5, color='grey', zorder=-1)
ax.set_xbound(23,70)
ax.set_ybound(-.25,.28)
prettify_ax(ax)
fig.tight_layout()
fig.savefig(FIGDIR + 'PC1_validated.pdf')


Here I am doing a drop-1 test for an interaction term of age X HIV. In the second act of this analysis we filter a few more patients with data that doesn't agree between aging models. In general I think these are bad samples, and removing them does improve this result and make the lines more parallel than shown here. I'm not using these filters here because it seemed a little ad-hoc to go back and apply them before we even looked at aging models and it is generally hard to report an iterative analysis in a journal paper.

  • m1 is a model with age and HIV fit independently
  • m2 is a model with age, HIV, and an interaction term

In [11]:
import statsmodels.api as sm

In [19]:
y = vH[0]
intercept = pd.Series(1, index=y.index)

X = pd.concat([intercept, age, hiv], axis=1, keys=['Intercept', 'age', 'HIV'])
X = X.dropna().ix[y.index]
m1 = sm.OLS(y, X).fit()

X = pd.concat([intercept, age, hiv, hiv*age], axis=1, keys=['Intercept', 'age', 'HIV', 'int'])
X = X.dropna().ix[y.index]
m2 = sm.OLS(y, X).fit()

In [17]:
m1.pvalues


Out[17]:
Intercept    8.78e-15
age          2.64e-09
HIV          2.25e-16
dtype: float64

In [16]:
m2.pvalues


Out[16]:
Intercept    1.13e-05
age          4.83e-03
HIV          7.05e-01
int          1.19e-01
dtype: float64

The interaction term does not have a significant contribution to the model.


In [20]:
m2.compare_lr_test(m1)


Out[20]:
(2.4912199645619921, 0.11448295270975471, 1.0)

Here I'm calculating the effect sise of HIV presence across the top principal components. The idea is that an effect of 25 years means the the presence of HIV has an equivalent effect as 25 years of age on the molecular signal.


In [18]:
cc = {}
for i,y in vH.iteritems():
    mod_all = sm.OLS(y, X)
    res_ref = mod_all.fit()
    cc[i] = res_ref.params['HIV'] / res_ref.params['age']
cc = pd.Series(cc)
cc.head()


Out[18]:
0     8.34
1    67.84
2     1.08
3    38.76
4   -77.23
dtype: float64

Association with HIV of Age Associated Probes


In [21]:
t_hiv = ttest_df(hiv, df_norm)

In [22]:
fig, ax = subplots(figsize=(4,4))
draw_dist(t_hiv, fx.rr, ax=ax, colors={0: 'grey', 1: colors_th[1], 2: colors_th[3]})
ax.set_xlabel('HIV association (t-statistic)')
ax.set_ylabel('density')
ax.axvline(0, ls='--', lw=2.5, color='grey')
ax.set_yticks([])
prettify_ax(ax)
fig.tight_layout()
fig.savefig(FIGDIR + 't_dens.pdf')



In [24]:
anova(fx.rr>1, t_hiv)


Out[24]:
F    2586.27
p       0.00
dtype: float64

Association with HIV of Age Associated Probes- Split by Hypo/Hyper methylation


In [26]:
fig, axs = subplots(1,2, figsize=(13,4))
draw_dist(t_hiv.ix[ti(fx.mm > .5)], fx.rr, ax=axs[0])
draw_dist(t_hiv.ix[ti(fx.mm < .5)], fx.rr, ax=axs[1])



In [27]:
direction = fx.res.in_set_s1.multi_variate.age > 0

In [29]:
fisher_exact_test(fx.mm < .5, direction.ix[ti(fx.rr > 0)])


Out[29]:
odds_ratio    6.03
p             0.00
dtype: float64

Looking at Upstream Probes

These are probes that trend away from random, and validate in both cohorts. The idea of this analysis is to show that the shared effects of HIV and age are not soely due to a shared effect of general disordering of the methylome.


In [30]:
r2 = ti(fx.rr > 1)
direction = fx.res.in_set_s1.multi_variate.age > 0
r3 = ti(((fx.mm < .5).ix[r2] == direction.ix[r2]) == False)
len(r3)


Out[30]:
7967

In [31]:
idx = r3
U,S,vH = frame_svd(df_norm.ix[idx])

The first PC capture about 15% of the variation


In [34]:
p = S ** 2 / sum(S ** 2)
p[:5]


Out[34]:
array([ 0.14613607,  0.09732037,  0.04516041,  0.03142676,  0.02604059])

In [35]:
fig, ax = subplots(1,1, figsize=(4,3))
rr = vH[0]
hiv = fx.duration != 'Control'

sns.regplot(*match_series(age, rr.ix[ti(hiv==0)]),
            ax=ax, label='HIV+', ci=None)
sns.regplot(*match_series(age, rr.ix[ti(hiv>0)]),
            ax=ax, label='Control', ci=None)
ax.set_ylabel('first PC')
ax.set_yticks([0])
ax.axhline(0, ls='--', color='grey', zorder=-1)
ax.set_xbound(23,70)
ax.set_ybound(-.25,.25)
prettify_ax(ax)
fig.savefig(FIGDIR + 'PC1_validated.pdf')


Repeating the drop-1 model on this subset of the probes.


In [36]:
y = vH[0]
intercept = pd.Series(1, index=y.index)

X = pd.concat([intercept, age, hiv], axis=1, 
              keys=['Intercept', 'age', 'HIV'])
X = X.dropna().ix[y.index]
m1 = sm.OLS(y, X).fit()

X = pd.concat([intercept, age, hiv, hiv*age], axis=1, 
              keys=['Intercept', 'age', 'HIV', 'int'])
X = X.dropna().ix[y.index]
m2 = sm.OLS(y, X).fit()

In [37]:
m2.pvalues


Out[37]:
Intercept    5.41e-08
age          1.19e-04
HIV          3.11e-01
int          2.94e-01
dtype: float64

In [38]:
m1.pvalues


Out[38]:
Intercept    4.79e-18
age          6.42e-12
HIV          3.07e-18
dtype: float64

In [39]:
m2.compare_lr_test(m1)


Out[39]:
(1.1298061889274322, 0.28781668290682894, 1.0)