Import Libraries


In [1]:
from __future__ import division
import os
import math
%matplotlib inline
import numpy as np
import pylab
import matplotlib.pyplot as plt
from matplotlib import rc
import scipy.stats as stats
import pandas as pd
from clean_data import CANCER_TYPES

from IPython.html.widgets import interact
from IPython.html import widgets
from IPython.display import display

Pathway Mutation Enrichment Analysis

Rationale

It is well known that the accumulation of the deleterious somatic mutations cause cancer. However, it has been observed that only very small number of genes are mutated with high frequency across tumor samples. The rest of mutations are found only in small number of patients. In other words mutational landscape of cancer has only few mountains and many hills (Vogelestien et. al., Science, 2013). It is hypothesized that there are only a few "driver" mutations that provide selective advantage to the tumor cells and the remaining mutations are passengers. The passenger mutations do not offer any selection benefit happen due to the random mutation process. The genes with driver and passenger mutations are also referred to as drivers and passengers as well.

However, there are many "hallmark" characteristics such as uncontrolled proliferation, evading immune response and programmed cell death, showing angiogenesis and metastasizing to remote tissues that are shared not only across all the tumors of a particular type but also across different cancers. Note that each of these characteristics requires cancer to manipulated multiple cellular pathways. Could it be achieved by alteration in only a handful number of gene? This seems unlikely. Then how to reconcile the both of these observations?

We wish to propose an alternative hypothesis. In addition to the "drivers" and "passengers" there is another class of mutations: "conductors". The conductor mutations are not drivers, but they assist driver mutations to cause cancers. They are distinct from passengers too. In contrast to the passenger mutations, the conductor mutations offer small selection advantage. In contrast to the driver mutations, they are larger in number and can have a great variety. On functional side they assist driver genes to acquire "hallmark" functionalities. Intuitively, since these conductor mutations need to provide only small selective advantage, there exists a wider range of options to choose from the genes involved in the "hallmark" pathways. The same idea can be understood from the perspective of pathways. The cancer is outcome of mis-regulated pathways that comprise several potential conductor genes in addition to a few driver and large number of potential passenger genes. Therefore, we propose to investigate mutation patterns aggregated at the level of pathways. At the minimum, such an analysis will complement extant gene-centric notion of driver genes.

With the above background, it is natural to ask if there are some pathways that show statistically significant difference between "silent" and "non-silent" mutation rates. In the following section, we attempt to identify pathways that indeed are enriched for non-silent mutations. This pan-cancer analysis is based on TCGA data.

Input Data

The MAF files were downloaded from Broad Institute's GDAC website http://gdac.broadinstitute.org website. The MAF files were preprocessed with Firehose pipeline.

  1. No of samples? No of cancers. Cancer wise table of the samples count. Only primary tumors were considered in the analysis excluding recurrent samples.
  2. Variant calls were categorized in two groups: silent and non-silent (that includes nonsense, missense, indels and others). The number of silent and non-silent mutations were counted for each gene in every samples.
  3. The pathway information was aggregated from four pathway sources: NCI-Nature curated Protein Interaction database (PID), KEGG, BioCarta and positional (cytogenic bands". These were compiled from mSigDB. Number of pathways?

In [2]:
can_types = []
for c in CANCER_TYPES:
    f1 = '../results/' + c + os.sep + 'silent_mutation_pathway_score.txt'
    f2 = '../results/' + c + os.sep + 'nsilent_mutation_pathway_score.txt'
    
    if os.path.exists(f1) and os.path.exists(f2):
        can_types.append(c)

print "There are %d cancer types ready to be analysed" % len(can_types)
can_type_wid = widgets.DropdownWidget(description="Select Cancer Type", values=can_types)
display(can_type_wid)


There are 25 cancer types ready to be analysed

In [11]:
can = can_type_wid.value
opt = 'silent'
input_fpath = '../results/' + can + os.sep + opt +'_mutation_pathway_score.txt'
sdf = pd.read_table(input_fpath, sep='\t', header=0, index_col=0)
opt = 'nsilent'
input_fpath = '../results/' + can + os.sep + opt +'_mutation_pathway_score.txt'
nsdf = pd.read_table(input_fpath, sep='\t', header=0, index_col=0)

# Take care of the fact that non-silent mutation has a prior probability that is
# twice of the silent mutation probability
nsdf = nsdf/1
print nsdf.shape

common = nsdf.index.intersection(sdf.index)                                    
res = pd.Series(index=sdf.index)
fwer = nsdf.shape[0]
for p in common:
    res.loc[p] = -1*math.log10(stats.ttest_ind(sdf.loc[p], nsdf.loc[p], equal_var=False)[1])


@interact(pval=widgets.FloatSliderWidget(min=res.min(), max=res.max(), value=max(res.max()-2, res.min()), step=1))
def plot_entiched(pval):
    pylab.rcParams['figure.figsize'] = (12.0, 8.0)
    res[res > pval].order().plot(title=can + " Enriched Pathways vs Boneferroni t-test P-values (-log10)", kind='barh', rot=0)

print('Common pathways: %d, Patients: %d' %(len(common), sdf.shape[1]))



In [46]:
res.order(ascending=False).head(10)


Out[46]:
Pathway
BIOCARTA_ARF_PATHWAY                  72.859240
BIOCARTA_CTCF_PATHWAY                 65.954918
PID_ECADHERIN_KERATINOCYTE_PATHWAY    55.027621
PID_IL5_PATHWAY                       47.383835
KEGG_ENDOMETRIAL_CANCER               45.717277
PID_TRAIL_PATHWAY                     44.766116
BIOCARTA_ACH_PATHWAY                  42.009037
BIOCARTA_LONGEVITY_PATHWAY            41.657147
BIOCARTA_PTEN_PATHWAY                 40.129975
BIOCARTA_HCMV_PATHWAY                 40.040087
dtype: float64

In [48]:
from pathway_analysis import preprocess_pathway_data
pdf = pd.DataFrame(preprocess_pathway_data().count(), columns=['Genes'])
x = pd.DataFrame(res.order(ascending=False).head(20), columns=['minus-log10-pval']).join(pdf, how='inner')
x.sort_index(by=['minus-log10-pval'], ascending=False)


Out[48]:
minus-log10-pval Genes
BIOCARTA_ARF_PATHWAY 72.859240 17
BIOCARTA_CTCF_PATHWAY 65.954918 23
PID_ECADHERIN_KERATINOCYTE_PATHWAY 55.027621 21
PID_IL5_PATHWAY 47.383835 14
KEGG_ENDOMETRIAL_CANCER 45.717277 52
PID_TRAIL_PATHWAY 44.766116 28
BIOCARTA_ACH_PATHWAY 42.009037 16
BIOCARTA_LONGEVITY_PATHWAY 41.657147 15
BIOCARTA_PTEN_PATHWAY 40.129975 18
BIOCARTA_HCMV_PATHWAY 40.040087 17
KEGG_MELANOMA 38.616070 71
BIOCARTA_RB_PATHWAY 38.082660 13
BIOCARTA_CTLA4_PATHWAY 37.757477 21
BIOCARTA_CDC42RAC_PATHWAY 37.326363 16
BIOCARTA_TRKA_PATHWAY 36.549321 12
PID_NECTIN_PATHWAY 36.480670 30
BIOCARTA_GLEEVEC_PATHWAY 35.397204 23
BIOCARTA_TID_PATHWAY 35.246091 19
BIOCARTA_IL7_PATHWAY 34.760617 17
KEGG_PANCREATIC_CANCER 32.747762 70

Correlating pathways scores with clinical outcomes


In [22]:
from pathway_analysis import preprocess_pathway_data
pdf = preprocess_pathway_data()
pdf.count().head()


Out[22]:
PID_ILK_PATHWAY                    45
BIOCARTA_VEGF_PATHWAY              29
BIOCARTA_AKAPCENTROSOME_PATHWAY    15
BIOCARTA_TEL_PATHWAY               18
KEGG_NON_HOMOLOGOUS_END_JOINING    14
dtype: int64

In [48]:
nsdf.ix['KEGG_MELANOMA'].plot(kind='hist', bins=50)


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c3a5490>

In [50]:
sdf.ix['KEGG_MELANOMA'].plot(kind='hist', bins=50)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-ca22740d1a0d> in <module>()
----> 1 0.93*sdf.ix['KEGG_MELANOMA'].plot(kind='hist', bins=50)

TypeError: unsupported operand type(s) for *: 'float' and 'AxesSubplot'