The purpose of this notebook is to investigate the Schubert results. Specifically, it looks like 100% of tests are being rejected by qvalue at an alpha cutoff of 0.1. That seems fishy; let's investigate!
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('white')
In [2]:
fpvals = 'schubert-sb-table.txt'
fotu = 'data/cdi_schubert_results/RDP/cdi_schubert.otu_table.100.denovo.rdp_assigned'
fmeta = 'data/cdi_schubert_results/cdi_schubert.metadata.txt'
pvals = pd.read_csv(fpvals, sep=' ')
pvals.index = pvals['otu']
df = pd.read_csv(fotu, sep='\t', index_col=0).T
abundf = df.divide(df.sum(axis=1), axis=0)
meta = pd.read_csv(fmeta, sep='\t', index_col=0)
In [3]:
pvals.head()
Out[3]:
In [4]:
plt.hist(pvals['unadjusted'])
Out[4]:
In [85]:
df.head()
Out[85]:
In [86]:
meta.head()
Out[86]:
In [87]:
pvals.columns
Out[87]:
In [88]:
pvals.sort_values(by='qvalue', ascending=False)[['pval', 'qvalue', 'ubiquity']].head(15)
Out[88]:
It looks like OTUs with an uncorrected p-value of ~0.9 get smushed down to a q-value of 0.08 - this seems fishy!
Let's plot a few of these bugs and see if we can understand what's happening with the covariate...
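Before plotting, a toy reimplementation helps show why this can happen. Storey's q-value is essentially the BH-adjusted p-value multiplied by an estimate of pi0, the fraction of truly null tests; when nearly every p-value is tiny, pi0-hat collapses, and even p = 0.9 gets dragged below 0.1. This is a minimal sketch with pi0 passed in by hand (an assumption for illustration - the real qvalue package estimates pi0 from the p-value histogram):

```python
import numpy as np

def storey_qvalues(pvals, pi0):
    # Storey q-values for a *given* pi0 (the real qvalue package
    # estimates pi0 from the data; here we fix it to see its effect).
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # raw value at sorted rank i (1-based): pi0 * m * p_(i) / i
    raw = pi0 * m * p[order] / np.arange(1, m + 1)
    # enforce monotonicity: q_(i) = min over ranks j >= i of raw_(j)
    raw = np.minimum.accumulate(raw[::-1])[::-1]
    q = np.empty(m)
    q[order] = raw
    return q

# 90 tiny p-values plus two large ones: with pi0 = 0.09, even
# p = 0.9 gets q = 0.09 * 0.9 = 0.081, i.e. below an alpha of 0.1.
p = np.concatenate([np.full(90, 1e-4), [0.5, 0.9]])
q = storey_qvalues(p, pi0=0.09)
```

Note that the maximum possible q-value here is pi0 itself, so with a small enough pi0-hat, *every* test clears the 0.1 cutoff - exactly the 100% rejection pattern we're chasing.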
In [89]:
# Tidyfy the OTU table
df.index.name = 'sample'
tidydf = df.reset_index().melt(id_vars='sample', var_name='otu', value_name='abun')
# Add disease state
tidydf = tidydf.join(meta['DiseaseState'], on='sample')
tidydf.head()
Out[89]:
In [90]:
otus = pvals.sort_values(by='qvalue', ascending=False).index[0:12].tolist()
In [91]:
fig, ax = plt.subplots(3, 4, figsize=(14, 12))
ax = ax.flatten()
for i in range(len(ax)):
    o = otus[i]
    sns.stripplot(
        data=tidydf.query('otu == @o'),
        x='DiseaseState', y='abun',
        ax=ax[i],
        jitter=True)
In [92]:
# Tidyfy the relative abundance OTU table
abundf.index.name = 'sample'
tidyabundf = abundf.reset_index().melt(id_vars='sample', var_name='otu', value_name='abun')
# Add disease state
tidyabundf = tidyabundf.join(meta['DiseaseState'], on='sample')
fig, ax = plt.subplots(3, 4, figsize=(14, 12))
ax = ax.flatten()
for i in range(len(ax)):
    o = otus[i]
    sns.stripplot(
        data=tidyabundf.query('otu == @o'),
        x='DiseaseState', y='abun',
        ax=ax[i],
        jitter=True)
So these are, for the most part, singletons, maybe?
Let's check that I calculated the ubiquity correctly here...
In [93]:
kept_otus = pvals['otu'].tolist()
df_fromR = df.loc[:, kept_otus]
keep_dis = ['H', 'CDI', 'nonCDI']
df_fromR = df_fromR.loc[meta.query('DiseaseState == @keep_dis').index, :]
df_fromR.shape
# Hm, okay - maybe my problems are coming from
df_fromR.shape, df_fromR.dropna().shape
Out[93]:
In [94]:
# Recalculate ubiquity with python
fig, ax = plt.subplots(2, 2, figsize=(10, 8))
ax = ax.flatten()
ax[0].hist((df_fromR.dropna() > 0).sum() / df_fromR.dropna().shape[0])
ax[0].set_title('Ubiquity calculated from python')
ax[1].hist(pvals['ubiquity'])
ax[1].set_title('Ubiquity calculated from R')
ax[2].hist(np.log10((df_fromR.dropna() > 0).sum() / df_fromR.dropna().shape[0]))
ax[2].set_title('Ubiquity calculated from python')
ax[3].hist(np.log10(pvals['ubiquity']))
ax[3].set_title('Ubiquity calculated from R')
Out[94]:
Hm, ok - so it looks like they're the same. I didn't mess up the ubiquity calculation...
It looks like the least significant bugs (which get pushed down below q = 0.1) are in the lower covariate group - they have ubiquity of around 0.02-0.03. I want to look back at the covariate boxplots and see if this makes sense, I guess...?
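For context on where that global shrinkage comes from: Storey's pi0-hat is estimated from the upper tail of the p-value histogram, so when almost all p-values pile up near zero, pi0-hat collapses toward zero and rescales every q-value. A minimal sketch of the estimator at a single lambda (an assumption for illustration - the real qvalue package smooths the estimate over a grid of lambda values):

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    # pi0-hat at one lambda: the fraction of p-values above lambda,
    # scaled up by 1 / (1 - lambda), since nulls are uniform on [0, 1].
    p = np.asarray(pvals, dtype=float)
    return np.mean(p > lam) / (1.0 - lam)

# 95 tiny p-values and 5 large ones: only 5% of the mass sits above
# lambda = 0.5, so pi0-hat = 0.05 / 0.5 = 0.1.
p = np.concatenate([np.full(95, 1e-4), [0.6, 0.7, 0.8, 0.9, 0.95]])
pi0_hat = storey_pi0(p)
```

With a pi0-hat like 0.1, multiplying the BH-adjusted p-values by it pushes essentially everything below the 0.1 cutoff, regardless of ubiquity.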
In [95]:
np.log10(0.02)
Out[95]:
In [96]:
pvals.columns
Out[96]:
In [97]:
pcols = ['bonf', 'bh',
         'qvalue', 'ihw-a10',
         'bl-df03', 'lfdr']
fig, ax = plt.subplots(2, 3, figsize=(14, 8))
ax = ax.flatten()
i = 0
for c in pcols:
    ax[i].scatter(pvals['unadjusted'], pvals[c])
    ax[i].set_title(c)
    if i > 2:
        ax[i].set_xlabel('unadjusted')
    if i in [0, 3]:
        ax[i].set_ylabel('corrected')
    i += 1
Y-axis is the corrected p-value (method given in the subplot title); x-axis is the original unadjusted p-value.
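For reference, the 'bh' panel is plain Benjamini-Hochberg, which differs from qvalue essentially only by the pi0 factor (BH implicitly assumes pi0 = 1, which is why its values never dip the way qvalue's do). A minimal BH sketch, which should match R's p.adjust(method='BH'):

```python
import numpy as np

def bh_adjust(pvals):
    # BH-adjusted p-values: p_(i) * m / i at sorted rank i,
    # made monotone from the largest p downward, capped at 1.
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    raw = p[order] * m / np.arange(1, m + 1)
    raw = np.minimum(np.minimum.accumulate(raw[::-1])[::-1], 1.0)
    adj = np.empty(m)
    adj[order] = raw
    return adj

adj = bh_adjust([0.01, 0.02, 0.04, 0.9])  # [0.04, 0.04, 0.0533..., 0.9]
```

Multiplying these by a pi0-hat of ~0.1 reproduces the qvalue behavior seen in the scatter above.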
In [98]:
fig, ax = plt.subplots(1, 4, figsize=(14, 4))
pcols = ['unadjusted', 'ihw-a10', 'bl-df03', 'lfdr']
i = 0
for c in pcols:
    ax[i].scatter(pvals['qvalue'], pvals[c])
    ax[i].set_title(c)
    ax[i].axhline(0.1)
    ax[i].set_xlabel('qvalue')
    i += 1
In [99]:
pvals.columns
Out[99]:
In [100]:
pcols = ['qvalue', 'unadjusted', 'ihw-a10', 'bl-df03', 'lfdr']
fig, ax = plt.subplots(1, len(pcols), figsize=(15, 4))
i = 0
for c in pcols:
    ax[i].scatter(pvals['effect_size'], pvals[c])
    ax[i].set_title(c)
    ax[i].set_xlabel('effect size')
    i += 1
In [101]:
fig, ax = plt.subplots(1, len(pcols), figsize=(15, 4))
i = 0
for c in pcols:
    ax[i].scatter(pvals['ubiquity'], -np.log10(pvals[c]))
    ax[i].set_title(c)
    ax[i].set_xlabel('ubiquity')
    i += 1
print('Ubiquity vs -log10(pvalue)')
In [102]:
fig, ax = plt.subplots(1, len(pcols), figsize=(15, 4))
i = 0
for c in pcols:
    ax[i].scatter(pvals['ubiquity'].rank(), pvals[c])
    ax[i].set_title(c)
    ax[i].set_xlabel('ubiquity rank')
    i += 1
print('Rank ubiquity vs pvalue')
In [ ]: