In [2]:
import NotebookImport
from DX_screen import *
Here I'm running GSEA on the fraction upregulated signal across genes.
In [3]:
gs2 = gene_sets.ix[dx_rna.index].fillna(0)
rr = screen_feature(dx_rna.frac, rev_kruskal, gs2.T,
align=False)
fp = (1.*gene_sets.T * dx_rna.frac).T.dropna().replace(0, np.nan).mean().order()
fp.name = 'mean frac'
First I do a greedy filter based on p-values to find non-overlapping gene sets that are significantly associated with the tumor signal. For details see the filter_pathway_hits funciton.
In [4]:
ff_u = filter_pathway_hits(rr.ix[ti(fp>.5)].p.order(), gs2)
ff_p = filter_pathway_hits(rr.ix[ti(fp<.5)].p.order(), gs2)
ff = ff_u.append(ff_p)
selected = rr.ix[ff[ff < .00001].index].join(fp)
selected.sort('p')
Out[4]:
The cell-cycle is a large pathway with lots of subsets in the mSigDB database. Here I'm looking for significant subsets within this pathway.
In [5]:
d = pd.DataFrame({g: gs2['REACTOME_CELL_CYCLE'] for g in gs2.columns})
a,b = odds_ratio_df(d.T>0, gs2.T>0)
dd = rr.ix[ti((a > 100) & (rr.q < 10e-15))].join(fp).sort(fp.name, ascending=False)
filter_pathway_hits(dd, gs2)
Out[5]:
These two gene sets are completely non-overlapping subsets of the cell-cycle.
In [9]:
m_g1 = 'REACTOME_M_G1_TRANSITION'
cepna = 'REACTOME_DEPOSITION_OF_NEW_CENPA_CONTAINING_NUCLEOSOMES_AT_THE_CENTROMERE'
combine(gs2[m_g1]>0, gs2[cepna]>0).value_counts()
Out[9]:
In [10]:
fig, ax = subplots()
v = pd.concat([dx_rna.frac,
dx_rna.frac.ix[ti(gs2['REACTOME_CELL_CYCLE']>0)],
dx_rna.frac.ix[ti(gs2[m_g1]>0)],
dx_rna.frac.ix[ti(gs2[cepna]>0)]
]).dropna()
v1 = pd.concat([pd.Series('All Genes', dx_rna.frac.index),
pd.Series('Cell Cycle', ti(gs2['REACTOME_CELL_CYCLE']>0)),
pd.Series('M/G1\nTransition', ti(gs2[m_g1]>0)),
pd.Series('CEPNA\nDeposition', ti(gs2[cepna]>0))
])
v1.name = ''
v.name = 'Fraction Overexpressed'
o = ['All Genes','Cell Cycle','CEPNA\nDeposition',
'M/G1\nTransition']
violin_plot_pandas(v1, v, order=o, ann=None, ax=ax)
prettify_ax(ax)
ax.spines['bottom'].set_visible(False)
ax.axhline(.5, color='grey', lw=2, ls='--');
I can also change the greedy filter to look for gene-sets with large effect sizes as opposed to p-values. This is going to give us smaller, but more specific gene-sets.
In [11]:
f2 = fp.ix[ti(rr.q < .00001)]
ff_u = filter_pathway_hits(fp.ix[ti(f2>.5)].order()[::-1], gs2)
ff_p = filter_pathway_hits(fp.ix[ti(f2<.5)].order(), gs2)
ff = ff_u.append(ff_p)
selected = rr.ix[ff.index].join(f2)
selected.ix[(f2 - .5).abs().order().index[::-1]].dropna()
Out[11]:
Interestingly unwinding of DNA has a very large effect size but is a relatviely small gene set at only 11 genes.
In [15]:
unwind = 'REACTOME_UNWINDING_OF_DNA'
telo = 'REACTOME_EXTENSION_OF_TELOMERES'
fig, ax = subplots()
v = pd.concat([dx_rna.frac,
dx_rna.frac.ix[ti(gs2[cepna]>0)],
dx_rna.frac.ix[ti(gs2[unwind]>0)],
dx_rna.frac.ix[ti(gs2[telo]>0)]
]).dropna()
v1 = pd.concat([pd.Series('All Genes', dx_rna.frac.index),
pd.Series('CEPNA\nDeposition', ti(gs2[cepna]>0)),
pd.Series('Unwinding\nof DNA', ti(gs2[unwind]>0)),
pd.Series('Extension\nof Telomeres', ti(gs2[telo]>0))
])
v1.name = ''
v.name = 'Fraction Overexpressed'
o = ['All Genes', 'CEPNA\nDeposition',
'Extension\nof Telomeres', 'Unwinding\nof DNA', ]
violin_plot_pandas(v1, v, order=o, ann=None, ax=ax)
prettify_ax(ax)
ax.spines['bottom'].set_visible(False)
ax.axhline(.5, color='grey', lw=2, ls='--');