Here we are looking at our HIV+ cases and HIV- controls together in an unsupervised way. The idea is to look at age-associated probes and look at the cohort for broad differences between the cases and controls.
In [1]:
import NotebookImport
from Setup.Imports import *
In [2]:
import Parallel.Age_HIV_Features as fx
In [3]:
from Setup.DX_Imports import ttest_df
We see a number of age associated markers in our two external cohorts of (more or less) healthy indivisuals. This is processed in the Parallel.Age_HIV_Features notebook. The (poorly named) variable fx.rr stores the results of our two tiered screen for age associated probes. A value of one indicates that a marker is age assoicated in the Hannum dataset, while a value of two indicates that it is age associated in both the Hannum dataset and the EPIC dataset.
In [4]:
(fx.rr > 0).value_counts()
Out[4]:
In [5]:
fx.rr.value_counts()
Out[5]:
In [7]:
df_hiv2 = fx.df_hiv.ix[:, fx.duration.index]
df_hiv2 = df_hiv2.dropna(1, how='all')
dd = logit_adj(df_hiv2)
m = dd.mean(1)
s = dd.std(1)
df_norm = dd.subtract(m, axis=0).divide(s, axis=0)
In [8]:
idx = ti(fx.rr > 1)
U,S,vH = frame_svd(df_norm.ix[fx.probe_idx].ix[idx])
The first PC captures about 19% of the variance in the data
In [9]:
p = S ** 2 / sum(S ** 2)
p[:5]
Out[9]:
In [10]:
fig, ax = subplots(1,1, figsize=(4,3))
rr = -1*vH[0]
k = fx.pred_c.index
hiv = fx.duration != 'Control'
age = fx.age
sns.regplot(*match_series(age.ix[k], rr.ix[ti(hiv==0)]),
ax=ax, label='HIV+', ci=None)
sns.regplot(*match_series(age.ix[k], rr.ix[ti(hiv>0)]),
ax=ax, label='Control', ci=None)
ax.set_ylabel('First PC of validated markers', size=12)
ax.set_xlabel('Chronological age (years)', size=14)
ax.set_yticks([0])
ax.axhline(0, ls='--', lw=2.5, color='grey', zorder=-1)
ax.set_xbound(23,70)
ax.set_ybound(-.25,.28)
prettify_ax(ax)
fig.tight_layout()
fig.savefig(FIGDIR + 'PC1_validated.pdf')
Here I am doing a drop-1 test for an interaction term of age X HIV. In the second act of this analysis we filter a few more patients with data that doesn't agree between aging models. In general I think these are bad samples, and removing them does improve this result and make the lines more parallel than shown here. I'm not using these filters here because it seemed a little ad-hoc to go back and apply them before we even looked at aging models and it is generally hard to report an iterative analysis in a journal paper.
In [11]:
import statsmodels.api as sm
In [19]:
y = vH[0]
intercept = pd.Series(1, index=y.index)
X = pd.concat([intercept, age, hiv], axis=1, keys=['Intercept', 'age', 'HIV'])
X = X.dropna().ix[y.index]
m1 = sm.OLS(y, X).fit()
X = pd.concat([intercept, age, hiv, hiv*age], axis=1, keys=['Intercept', 'age', 'HIV', 'int'])
X = X.dropna().ix[y.index]
m2 = sm.OLS(y, X).fit()
In [17]:
m1.pvalues
Out[17]:
In [16]:
m2.pvalues
Out[16]:
The interaction term does not have a significant contribution to the model.
In [20]:
m2.compare_lr_test(m1)
Out[20]:
Here I'm calculating the effect sise of HIV presence across the top principal components. The idea is that an effect of 25 years means the the presence of HIV has an equivalent effect as 25 years of age on the molecular signal.
In [18]:
cc = {}
for i,y in vH.iteritems():
mod_all = sm.OLS(y, X)
res_ref = mod_all.fit()
cc[i] = res_ref.params['HIV'] / res_ref.params['age']
cc = pd.Series(cc)
cc.head()
Out[18]:
In [21]:
t_hiv = ttest_df(hiv, df_norm)
In [22]:
fig, ax = subplots(figsize=(4,4))
draw_dist(t_hiv, fx.rr, ax=ax, colors={0: 'grey', 1: colors_th[1], 2: colors_th[3]})
ax.set_xlabel('HIV association (t-statistic)')
ax.set_ylabel('density')
ax.axvline(0, ls='--', lw=2.5, color='grey')
ax.set_yticks([])
prettify_ax(ax)
fig.tight_layout()
fig.savefig(FIGDIR + 't_dens.pdf')
In [24]:
anova(fx.rr>1, t_hiv)
Out[24]:
In [26]:
fig, axs = subplots(1,2, figsize=(13,4))
draw_dist(t_hiv.ix[ti(fx.mm > .5)], fx.rr, ax=axs[0])
draw_dist(t_hiv.ix[ti(fx.mm < .5)], fx.rr, ax=axs[1])
In [27]:
direction = fx.res.in_set_s1.multi_variate.age > 0
In [29]:
fisher_exact_test(fx.mm < .5, direction.ix[ti(fx.rr > 0)])
Out[29]:
These are probes that trend away from random, and validate in both cohorts. The idea of this analysis is to show that the shared effects of HIV and age are not soely due to a shared effect of general disordering of the methylome.
In [30]:
r2 = ti(fx.rr > 1)
direction = fx.res.in_set_s1.multi_variate.age > 0
r3 = ti(((fx.mm < .5).ix[r2] == direction.ix[r2]) == False)
len(r3)
Out[30]:
In [31]:
idx = r3
U,S,vH = frame_svd(df_norm.ix[idx])
The first PC capture about 15% of the variation
In [34]:
p = S ** 2 / sum(S ** 2)
p[:5]
Out[34]:
In [35]:
fig, ax = subplots(1,1, figsize=(4,3))
rr = vH[0]
hiv = fx.duration != 'Control'
sns.regplot(*match_series(age, rr.ix[ti(hiv==0)]),
ax=ax, label='HIV+', ci=None)
sns.regplot(*match_series(age, rr.ix[ti(hiv>0)]),
ax=ax, label='Control', ci=None)
ax.set_ylabel('first PC')
ax.set_yticks([0])
ax.axhline(0, ls='--', color='grey', zorder=-1)
ax.set_xbound(23,70)
ax.set_ybound(-.25,.25)
prettify_ax(ax)
fig.savefig(FIGDIR + 'PC1_validated.pdf')
Repeating the drop-1 model on this subset of the probes.
In [36]:
y = vH[0]
intercept = pd.Series(1, index=y.index)
X = pd.concat([intercept, age, hiv], axis=1,
keys=['Intercept', 'age', 'HIV'])
X = X.dropna().ix[y.index]
m1 = sm.OLS(y, X).fit()
X = pd.concat([intercept, age, hiv, hiv*age], axis=1,
keys=['Intercept', 'age', 'HIV', 'int'])
X = X.dropna().ix[y.index]
m2 = sm.OLS(y, X).fit()
In [37]:
m2.pvalues
Out[37]:
In [38]:
m1.pvalues
Out[38]:
In [39]:
m2.compare_lr_test(m1)
Out[39]: