In [ ]:
import getpass
from transmart_api import TransmartApi
api = TransmartApi(
host = 'http://postgres-test.transmartfoundation.org/transmart',
user = input('Username:'),
password = getpass.getpass('Password:'),
apiversion = 1)
api.access()
In [ ]:
import pandas
from pandas.io.json import json_normalize
pandas.set_option('max_colwidth', 1000)
pandas.set_option("display.max_rows",100)
Pandas is useful for converting the JSON format, received by the REST API, into an easy to use Python Pandas DataFrame object.
In [ ]:
studies = api.get_studies()
json_normalize(studies['studies'])
In [ ]:
study_id = 'GSE8581'
obs = api.get_observations(study = study_id)
obs_df1 = json_normalize(obs)
In [ ]:
obs_df1
Here we reorganize the data such that each line represents a subject and the columns represent the different clinical attributes.
In [ ]:
obs_df2 = obs_df1.pivot(index = 'subject.inTrialId', columns = 'label', values = 'value')
obs_df3 = obs_df2.convert_objects()
obs_df4 = obs_df3.rename(
columns = lambda c: c.replace('\Public Studies\COPD_Bhattacharya_GSE8581\\', '')[:-1],
inplace = False)
obs_df4
Split the control and chronic obstructive pulmonary disease sets:
In [ ]:
control = obs_df4['Subjects\Lung Disease'] == 'control'
treatment = obs_df4['Subjects\Lung Disease'] == 'chronic obstructive pulmonary disease'
Some basic statistics comparing the numerical attributes of both sets:
In [ ]:
obs_df4[control].describe()
In [ ]:
obs_df4[treatment].describe()
In [ ]:
(hdHeader, hdRows) = api.get_hd_node_data(study = study_id,
node_name = 'Lung',
projection='log_intensity')
Select only the data rows where the probes are mapped to a gene (aka bioMarker). Gene ID\probe ID will form the row names while the patientId values are used for the columns.
In [ ]:
rows = [row.value[0].doubleValue._values for row in hdRows if row.bioMarker != 'null']
row_names = [row.bioMarker + "\\" + row.label for row in hdRows if row.bioMarker != 'null']
col_names = [assay.patientId for assay in hdHeader.assay]
Build the table as a DataFrame in Pandas:
In [ ]:
from pandas import DataFrame
hd_df = DataFrame(rows, columns = col_names, index = row_names)
hd_df
In [ ]:
%load_ext rpy2.ipython
In [ ]:
%%R -i hd_df
plot(density(hd_df[,1]))
Using the control and treatment sets made a few cells above, we build the "design" data frame showing which subjects are part of which set.
In [ ]:
design_df = DataFrame({'control': control[col_names], 'treatment': treatment[col_names]})
design_df
Here we pass the data frames, initialized above in Python, to R. Jupyter will convert this to R. Handy feature!! Now we can also reuse all the nice R libraries which are not (yet) in Python.
In [ ]:
%load_ext rpy2.ipython
In [ ]:
%%R -i hd_df -i design_df -o top_fit
library(limma)
design.matrix <- sapply(design_df, as.numeric)
contrast.matrix <- makeContrasts(control - treatment, levels = design.matrix)
fit <- lmFit(hd_df, design.matrix)
fit <- contrasts.fit(fit, contrast.matrix)
fit <- eBayes(fit)
contr=1
log_fc = fit$coefficients[, contr]/log(2)
p_value = -log10(fit$p.value[, contr])
plot(log_fc, p_value, ylab="-log10 p-value", xlab="log2 fold change")
# alternative plots log fc x log odds:
#volcanoplot(fit)
# "send" data back to Python: make a dataframe containing the important data and store
# in top_fit (the "-o" variable above) so that it can be accessed in python code
# in the subsequent code cells of this notebook (see flag '-o top_fit' above):
top_fit = data.frame(
probe = rownames(fit$coefficients),
log_fc = fit$coefficients[, contr],
t = fit$t[, contr],
p_value = fit$p.value[, contr],
b = fit$lods[, contr],
stringsAsFactors = FALSE
)
In [ ]:
top_fit
In [ ]:
top_probes = top_fit[top_fit.p_value < 0.05] \
.set_index(['probe']).sort(['p_value'], ascending = True)
top_probes
In [ ]:
def get_gene_name(probe):
return probe.split('\\')[0]
top_probes['gene_name'] = top_probes.index.map(get_gene_name)
top_probes
Map the gene names to Entrez Gene IDs (which are supported by KEGG API). The output table shows top_probes again, now with mapped gene_id column:
In [ ]:
from utils import Entrez
entrez = Entrez()
top_probes['gene_id'] = top_probes.gene_name.map(entrez.get_gene_id)
top_probes
Here we use our custom Python class Kegg which calls the KEGG API to find the pathways in KEGG that match one or more of the gene_ids. The output table displays the top_probes again, now with a column containing the pathway(s) in which this gene is found. The pathway names are also links to their corresponding entry in the KEGG website. We expect some genes to be found in one pathway, some genes in many pathways and many genes in none of the pathways.
In [ ]:
from utils import Kegg
# find the pathways in KEGG that match one or more of the gene_ids:
kegg = Kegg(gene_ids = top_probes.gene_id)
top_probes['pathways_ids'] = top_probes.gene_id.map(kegg.get_pathways_image_links_by_gene)
from IPython.display import HTML
HTML(top_probes.to_html(escape=False))
Here we select the pathways that match 2 or more genes. The pathway name is also a link to the KEGG visualization, which displays the pathway and all the genes of our dataset that were found in this pathway. The genes of our dataset are highlighted in red in the KEGG visualization. The figure below shows this.
In [ ]:
pathways_df = DataFrame(kegg.get_all_pathways_rows(), columns=['link to pathway', 'nr genes matched'])
pathways_df = pathways_df[pathways_df['nr genes matched'] > 1].sort(['nr genes matched'], ascending = False)
HTML(pathways_df.to_html(escape=False))
In [ ]:
# optional exercise: duplicate only the necessary code here, now with new p-value.
# Generate only the table with the list of pathways and 'nr of genes matched'