Pathway analysis


Copyright (c) 2015-2017 The Hyve B.V. This notebook is licensed under the GNU General Public License, version 3. Authors: Ruslan Forostianov, Pieter Lukasse, Ward Weistra.

Make sure to have protobuf installed: pip install protobuf

Before running other code cells: Authenticate with tranSMART


In [ ]:
import getpass
from transmart_api import TransmartApi

api = TransmartApi(
    host = 'http://postgres-test.transmartfoundation.org/transmart',
    user = input('Username:'),
    password = getpass.getpass('Password:'),
    apiversion = 1)

api.access()

Part 1


Get observations data

We start by importing pandas utility package:


In [ ]:
import pandas
from pandas.io.json import json_normalize

pandas.set_option('max_colwidth', 1000)
pandas.set_option("display.max_rows",100)

Pandas is useful for converting the JSON format, received by the REST API, into an easy to use Python Pandas DataFrame object.


In [ ]:
studies = api.get_studies()
json_normalize(studies['studies'])

In [ ]:
study_id = 'GSE8581'
obs = api.get_observations(study = study_id)
obs_df1 = json_normalize(obs)

In [ ]:
obs_df1

Here we reorganize the data such that each line represents a subject and the columns represent the different clinical attributes.


In [ ]:
obs_df2 = obs_df1.pivot(index = 'subject.inTrialId', columns = 'label', values = 'value')
obs_df3 = obs_df2.convert_objects()
obs_df4 = obs_df3.rename(
    columns = lambda c: c.replace('\Public Studies\COPD_Bhattacharya_GSE8581\\', '')[:-1],
    inplace = False)
obs_df4

Split the control and chronic obstructive pulmonary disease sets:


In [ ]:
control = obs_df4['Subjects\Lung Disease'] == 'control'
treatment = obs_df4['Subjects\Lung Disease'] == 'chronic obstructive pulmonary disease'

Some basic statistics comparing the numerical attributes of both sets:


In [ ]:
obs_df4[control].describe()

In [ ]:
obs_df4[treatment].describe()

Part 2 - Retrieving molecular (aka "high dimensional") data


Downloading the expression data for our chosen study

This can take a while (~1 minute)


In [ ]:
(hdHeader, hdRows) = api.get_hd_node_data(study = study_id,
                                          node_name = 'Lung',
                                          projection='log_intensity')

Select only the data rows where the probes are mapped to a gene (aka bioMarker). Gene ID\probe ID will form the row names while the patientId values are used for the columns.


In [ ]:
rows = [row.value[0].doubleValue._values for row in hdRows if row.bioMarker != 'null']
row_names = [row.bioMarker + "\\" + row.label for row in hdRows if row.bioMarker != 'null']
col_names = [assay.patientId for assay in hdHeader.assay]

Build the table as a DataFrame in Pandas:


In [ ]:
from pandas import DataFrame

hd_df = DataFrame(rows, columns = col_names, index = row_names)
hd_df

In [ ]:
%load_ext rpy2.ipython

In [ ]:
%%R -i hd_df

plot(density(hd_df[,1]))

Using the control and treatment sets made a few cells above, we build the "design" data frame showing which subjects are part of which set.


In [ ]:
design_df = DataFrame({'control': control[col_names], 'treatment': treatment[col_names]})
design_df

Calling R differential analysis code (aka Advanced Analysis>'Marker Selection' in tranSMART)

Here we pass the data frames, initialized above in Python, to R. Jupyter will convert this to R. Handy feature!! Now we can also reuse all the nice R libraries which are not (yet) in Python.


In [ ]:
%load_ext rpy2.ipython

In [ ]:
%%R -i hd_df -i design_df -o top_fit

library(limma)

design.matrix <- sapply(design_df, as.numeric)
contrast.matrix <- makeContrasts(control - treatment, levels = design.matrix)

fit <- lmFit(hd_df, design.matrix)
fit <- contrasts.fit(fit, contrast.matrix)
fit <- eBayes(fit)

contr=1
log_fc = fit$coefficients[, contr]/log(2)
p_value = -log10(fit$p.value[, contr])
plot(log_fc, p_value, ylab="-log10 p-value", xlab="log2 fold change")

# alternative plots log fc x log odds: 
#volcanoplot(fit)

# "send" data back to Python: make a dataframe containing the important data and store 
# in top_fit (the "-o" variable above) so that it can be accessed in python code 
# in the subsequent code cells of this notebook (see flag '-o top_fit' above):
top_fit = data.frame(
    probe = rownames(fit$coefficients),
    log_fc = fit$coefficients[, contr],
    t = fit$t[, contr],
    p_value = fit$p.value[, contr],
    b = fit$lods[, contr],
    stringsAsFactors = FALSE
)

In [ ]:
top_fit

Filtering results on p-value

Back to Python....we can access top_fit variable because it was passed above as an "output" parameter to the R code. We use it to select the probes below a specific p-value threshold.


In [ ]:
top_probes = top_fit[top_fit.p_value < 0.05] \
              .set_index(['probe']).sort(['p_value'], ascending = True)
top_probes

In [ ]:
def get_gene_name(probe):
    return probe.split('\\')[0]

top_probes['gene_name'] = top_probes.index.map(get_gene_name)
top_probes

Map the gene names to Entrez Gene IDs (which are supported by KEGG API). The output table shows top_probes again, now with mapped gene_id column:


In [ ]:
from utils import Entrez

entrez = Entrez()

top_probes['gene_id'] = top_probes.gene_name.map(entrez.get_gene_id)
top_probes

Search via the KEGG API

Here we use our custom Python class Kegg which calls the KEGG API to find the pathways in KEGG that match one or more of the gene_ids. The output table displays the top_probes again, now with a column containing the pathway(s) in which this gene is found. The pathway names are also links to their corresponding entry in the KEGG website. We expect some genes to be found in one pathway, some genes in many pathways and many genes in none of the pathways.


In [ ]:
from utils import Kegg
# find the pathways in KEGG that match one or more of the gene_ids:
kegg = Kegg(gene_ids = top_probes.gene_id)

top_probes['pathways_ids'] = top_probes.gene_id.map(kegg.get_pathways_image_links_by_gene)

from IPython.display import HTML
HTML(top_probes.to_html(escape=False))

Results + KEGG visualization

Here we select the pathways that match 2 or more genes. The pathway name is also a link to the KEGG visualization, which displays the pathway and all the genes of our dataset that were found in this pathway. The genes of our dataset are highlighted in red in the KEGG visualization. The figure below shows this.


In [ ]:
pathways_df = DataFrame(kegg.get_all_pathways_rows(), columns=['link to pathway', 'nr genes matched'])

pathways_df = pathways_df[pathways_df['nr genes matched'] > 1].sort(['nr genes matched'], ascending = False)

HTML(pathways_df.to_html(escape=False))

Exercise


Change p-value constraint to 0.05 in step Filtering results on p-value


In [ ]:
# optional exercise: duplicate only the necessary code here, now with new p-value. 
# Generate only the table with the list of pathways and 'nr of genes matched'