Examples of interaction with TranSMART RESTful API from Jupyter Notebook

What is Jupyter Notebook

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. The Notebook has support for over 40 programming languages, including those popular in Data Science such as Julia, Python, R and Scala.

Simple TranSMART API Python Client

See transmart api client python module.

Authorize



In [2]:

    
import getpass
from transmart_api import TransmartApi

api = TransmartApi(
    host = 'http://localhost:8080',
    user = raw_input('Username:'),
    password = getpass.getpass('Password:'))

api.access()









    



Username:admin
Password:········






    Out[2]:





'SUCCESS'

Fetch some clinical data



In [3]:

    
obs = api.get_observations(study = 'GSE8581')
obs[0:3]









    Out[3]:





[{u'label': u'\\Public Studies\\GSE8581\\Biomarker Data\\GPL570\\',
  u'subject': {u'age': 65,
   u'birthDate': None,
   u'deathDate': None,
   u'id': 1000384597,
   u'inTrialId': u'GSE8581GSM210006',
   u'maritalStatus': None,
   u'race': u'Afro American',
   u'religion': None,
   u'sex': u'FEMALE',
   u'trial': u'GSE8581'},
  u'value': None},
 {u'label': u'\\Public Studies\\GSE8581\\Endpoints\\Diagnosis\\',
  u'subject': {u'age': 65,
   u'birthDate': None,
   u'deathDate': None,
   u'id': 1000384597,
   u'inTrialId': u'GSE8581GSM210006',
   u'maritalStatus': None,
   u'race': u'Afro American',
   u'religion': None,
   u'sex': u'FEMALE',
   u'trial': u'GSE8581'},
  u'value': u'non-small cell adenocarcinoma'},
 {u'label': u'\\Public Studies\\GSE8581\\Endpoints\\FEV1\\',
  u'subject': {u'age': 65,
   u'birthDate': None,
   u'deathDate': None,
   u'id': 1000384597,
   u'inTrialId': u'GSE8581GSM210006',
   u'maritalStatus': None,
   u'race': u'Afro American',
   u'religion': None,
   u'sex': u'FEMALE',
   u'trial': u'GSE8581'},
  u'value': 1.41}]

Fetch some high dimensonal data

The api returns data serialized with Google Protocol Buffers. To deserialize binary stream it's important to have a protobuf definition file for the data structure. See it.



In [4]:

    
(hdHeader, hdRows) = api.get_hd_node_data(study = 'GSE8581',
                                          node_name = 'Lung',
                                          projection='default_real_projection',
                                          genes = ['TP53', 'AURKA'])



In [5]:

    
hdHeader









    Out[5]:





<highdim_pb2.HighDimHeader at 0x7f1260c28050>



In [6]:

    
hdRows[0:5]









    Out[6]:





[<highdim_pb2.Row at 0x7f1260c32b90>,
 <highdim_pb2.Row at 0x7f1260c32c80>,
 <highdim_pb2.Row at 0x7f1260c32d70>,
 <highdim_pb2.Row at 0x7f1260c32e60>,
 <highdim_pb2.Row at 0x7f1260c32f50>]



In [7]:

    
#type:
#1 - double
#2 - string
[(x.name, x.type) for x in hdHeader.columnSpec]









    Out[7]:





[(u'value', 1)]



In [8]:

    
hdDataDic = {row.label: row.value[0].doubleValue for row in hdRows}

Pandas

Change the clinical data shape



In [9]:

    
import pandas
from pandas.io.json import json_normalize
obs_df1 = json_normalize(obs)
obs_df1[0:3]









    Out[9]:






  
    
      
      label
      subject.age
      subject.birthDate
      subject.deathDate
      subject.id
      subject.inTrialId
      subject.maritalStatus
      subject.race
      subject.religion
      subject.sex
      subject.trial
      value
    
  
  
    
      0
      \Public Studies\GSE8581\Biomarker Data\GPL570\
      65
      None
      None
      1000384597
      GSE8581GSM210006
      None
      Afro American
      None
      FEMALE
      GSE8581
      None
    
    
      1
      \Public Studies\GSE8581\Endpoints\Diagnosis\
      65
      None
      None
      1000384597
      GSE8581GSM210006
      None
      Afro American
      None
      FEMALE
      GSE8581
      non-small cell adenocarcinoma
    
    
      2
      \Public Studies\GSE8581\Endpoints\FEV1\
      65
      None
      None
      1000384597
      GSE8581GSM210006
      None
      Afro American
      None
      FEMALE
      GSE8581
      1.41



In [10]:

    
obs_df2 = obs_df1.pivot(index = 'subject.inTrialId', columns = 'label', values = 'value')
obs_df2[0:3]









    Out[10]:






  
    
      label
      \Public Studies\GSE8581\Biomarker Data\GPL570\
      \Public Studies\GSE8581\Endpoints\Diagnosis\
      \Public Studies\GSE8581\Endpoints\FEV1\
      \Public Studies\GSE8581\Endpoints\Forced Expiratory Volume Ratio\
      \Public Studies\GSE8581\Subjects\Age\
      \Public Studies\GSE8581\Subjects\Height (inch)\
      \Public Studies\GSE8581\Subjects\Lung Disease\
      \Public Studies\GSE8581\Subjects\Organism\
      \Public Studies\GSE8581\Subjects\Race\
      \Public Studies\GSE8581\Subjects\Sex\
    
    
      subject.inTrialId
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      GSE8581GSM210004
      None
      non-small cell squamous cell carcinoma
      2.54
      58
      63
      72
      chronic obstructive pulmonary disease
      Homo sapiens
      Caucasian
      male
    
    
      GSE8581GSM210005
      None
      non-small cell adenocarcinoma
      1.69
      83.66
      84
      60
      control
      Homo sapiens
      Afro American
      female
    
    
      GSE8581GSM210006
      None
      non-small cell adenocarcinoma
      1.41
      51
      65
      66
      chronic obstructive pulmonary disease
      Homo sapiens
      Afro American
      female



In [11]:

    
obs_df2.dtypes









    Out[11]:





label
\Public Studies\GSE8581\Biomarker Data\GPL570\                       object
\Public Studies\GSE8581\Endpoints\Diagnosis\                         object
\Public Studies\GSE8581\Endpoints\FEV1\                              object
\Public Studies\GSE8581\Endpoints\Forced Expiratory Volume Ratio\    object
\Public Studies\GSE8581\Subjects\Age\                                object
\Public Studies\GSE8581\Subjects\Height (inch)\                      object
\Public Studies\GSE8581\Subjects\Lung Disease\                       object
\Public Studies\GSE8581\Subjects\Organism\                           object
\Public Studies\GSE8581\Subjects\Race\                               object
\Public Studies\GSE8581\Subjects\Sex\                                object
dtype: object



In [12]:

    
#to make pandas guess data types of the columns
obs_df3 = obs_df2.convert_objects()
obs_df3.dtypes









    Out[12]:





label
\Public Studies\GSE8581\Biomarker Data\GPL570\                        object
\Public Studies\GSE8581\Endpoints\Diagnosis\                          object
\Public Studies\GSE8581\Endpoints\FEV1\                              float64
\Public Studies\GSE8581\Endpoints\Forced Expiratory Volume Ratio\    float64
\Public Studies\GSE8581\Subjects\Age\                                float64
\Public Studies\GSE8581\Subjects\Height (inch)\                      float64
\Public Studies\GSE8581\Subjects\Lung Disease\                        object
\Public Studies\GSE8581\Subjects\Organism\                            object
\Public Studies\GSE8581\Subjects\Race\                                object
\Public Studies\GSE8581\Subjects\Sex\                                 object
dtype: object



In [13]:

    
obs_df4 = obs_df3.rename(
    columns = lambda c: c.replace('\Public Studies\GSE8581\\', '')[:-1],
    inplace = False)
obs_df4[0:3]









    Out[13]:






  
    
      label
      Biomarker Data\GPL570
      Endpoints\Diagnosis
      Endpoints\FEV1
      Endpoints\Forced Expiratory Volume Ratio
      Subjects\Age
      Subjects\Height (inch)
      Subjects\Lung Disease
      Subjects\Organism
      Subjects\Race
      Subjects\Sex
    
    
      subject.inTrialId
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      GSE8581GSM210004
      None
      non-small cell squamous cell carcinoma
      2.54
      58.00
      63
      72
      chronic obstructive pulmonary disease
      Homo sapiens
      Caucasian
      male
    
    
      GSE8581GSM210005
      None
      non-small cell adenocarcinoma
      1.69
      83.66
      84
      60
      control
      Homo sapiens
      Afro American
      female
    
    
      GSE8581GSM210006
      None
      non-small cell adenocarcinoma
      1.41
      51.00
      65
      66
      chronic obstructive pulmonary disease
      Homo sapiens
      Afro American
      female

Change the shape of HD data



In [14]:

    
from pandas import DataFrame
hdDataDic['patientId'] = [assay.patientId for assay in hdHeader.assay]
assayIds = [assay.assayId for assay in hdHeader.assay]
hd_df = DataFrame(data=hdDataDic, index = assayIds)
hd_df[0:10]









    Out[14]:






  
    
      
      201746_at
      204092_s_at
      208079_s_at
      208080_at
      211300_s_at
      patientId
    
  
  
    
      45741
      45.25530
      61.428400
      53.77660
      1.08084
      31.6424
      GSE8581GSM213034
    
    
      45742
      61.31550
      42.516700
      34.59850
      2.42510
      13.8971
      GSE8581GSM212811
    
    
      45743
      56.84860
      29.488600
      26.76440
      2.79179
      17.3414
      GSE8581GSM213036
    
    
      45744
      88.71880
      44.719800
      39.41060
      1.94118
      24.0898
      GSE8581GSM212075
    
    
      45745
      57.22760
      25.578800
      33.30950
      5.15415
      33.0352
      GSE8581GSM211008
    
    
      45746
      77.81250
      25.868200
      27.15200
      9.03928
      50.2931
      GSE8581GSM210090
    
    
      45747
      32.36190
      24.203600
      18.65140
      4.20431
      19.9424
      GSE8581GSM212855
    
    
      45748
      105.82100
      35.938300
      47.17160
      2.29847
      67.8302
      GSE8581GSM212070
    
    
      45749
      6.98541
      0.362184
      2.11288
      0.97270
      4.2149
      GSE8581GSM212810
    
    
      45750
      82.68780
      37.258200
      28.35180
      3.19601
      34.2873
      GSE8581GSM210193

Select subgroups



In [15]:

    
males_abv_50 = (obs_df4['Subjects\Age'] > 50) & (obs_df4['Subjects\Sex'] == 'male')
males_abv_50[0:10]









    Out[15]:





subject.inTrialId
GSE8581GSM210004     True
GSE8581GSM210005    False
GSE8581GSM210006    False
GSE8581GSM210007    False
GSE8581GSM210008    False
GSE8581GSM210009    False
GSE8581GSM210010    False
GSE8581GSM210011     True
GSE8581GSM210012     True
GSE8581GSM210014    False
dtype: bool



In [16]:

    
obs_df4[males_abv_50][0:3]









    Out[16]:






  
    
      label
      Biomarker Data\GPL570
      Endpoints\Diagnosis
      Endpoints\FEV1
      Endpoints\Forced Expiratory Volume Ratio
      Subjects\Age
      Subjects\Height (inch)
      Subjects\Lung Disease
      Subjects\Organism
      Subjects\Race
      Subjects\Sex
    
    
      subject.inTrialId
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      GSE8581GSM210004
      None
      non-small cell squamous cell carcinoma
      2.54
      58.00
      63
      72
      chronic obstructive pulmonary disease
      Homo sapiens
      Caucasian
      male
    
    
      GSE8581GSM210011
      E
      non-small cell squamous cell carcinoma
      1.87
      56.00
      56
      72
      chronic obstructive pulmonary disease
      Homo sapiens
      Caucasian
      male
    
    
      GSE8581GSM210012
      E
      non-small cell adenocarcinoma
      2.76
      70.58
      61
      69
      not specified
      Homo sapiens
      Caucasian
      male

Join data

See pandas documentation on how to merge



In [17]:

    
hd_df.join(obs_df4, on='patientId')[0:3]









    Out[17]:






  
    
      
      201746_at
      204092_s_at
      208079_s_at
      208080_at
      211300_s_at
      patientId
      Biomarker Data\GPL570
      Endpoints\Diagnosis
      Endpoints\FEV1
      Endpoints\Forced Expiratory Volume Ratio
      Subjects\Age
      Subjects\Height (inch)
      Subjects\Lung Disease
      Subjects\Organism
      Subjects\Race
      Subjects\Sex
    
  
  
    
      45741
      45.2553
      61.4284
      53.7766
      1.08084
      31.6424
      GSE8581GSM213034
      E
      non-small cell adenocarcinoma
      2.56
      61
      70
      69
      not specified
      Homo sapiens
      Caucasian
      male
    
    
      45742
      61.3155
      42.5167
      34.5985
      2.42510
      13.8971
      GSE8581GSM212811
      E
      non-small cell squamous cell carcinoma
      1.37
      82
      77
      58
      control
      Homo sapiens
      Caucasian
      female
    
    
      45743
      56.8486
      29.4886
      26.7644
      2.79179
      17.3414
      GSE8581GSM213036
      E
      non-small cell squamous cell carcinoma
      2.68
      71
      59
      69
      control
      Homo sapiens
      Caucasian
      female

Interaction with R

The Notebook gives possibility to interact with R environunment easily. You need just to mark a cell with so called magic notation.



In [19]:

    
%load_ext rpy2.ipython



In [20]:

    
%%R -i obs_df4
str(obs_df4)









    





'data.frame':	58 obs. of  10 variables:
 $ Biomarker.Data.GPL570                   : Factor w/ 2 levels "E","None": 2 2 2 1 1 1 1 1 1 1 ...
 $ Endpoints.Diagnosis                     : Factor w/ 14 levels "carcinoid","emphysema",..: 12 11 11 11 11 12 11 12 11 11 ...
 $ Endpoints.FEV1                          : num [1:58(1d)] 2.54 1.69 1.41 2.51 1.64 2.72 1.45 1.87 2.76 1.98 ...
 $ Endpoints.Forced.Expiratory.Volume.Ratio: num [1:58(1d)] 58 83.7 51 81 57 ...
 $ Subjects.Age                            : num [1:58(1d)] 63 84 65 46 53 53 77 56 61 71 ...
 $ Subjects.Height..inch.                  : num [1:58(1d)] 72 60 66 66 65 64 63 72 69 63 ...
 $ Subjects.Lung.Disease                   : Factor w/ 3 levels "chronic obstructive pulmonary disease",..: 1 2 1 3 1 2 3 1 3 2 ...
 $ Subjects.Organism                       : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ...
 $ Subjects.Race                           : Factor w/ 2 levels "Afro American",..: 2 1 1 2 2 2 2 2 2 2 ...
 $ Subjects.Sex                            : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 2 2 1 ...



In [21]:

    
%%R -i obs_df4
library(ggplot2)

ggplot(obs_df4, aes(x = Subjects.Sex, y = Endpoints.FEV1)) + geom_boxplot()



In [22]:

    
%%R -i obs_df4 -o tmodel

tmodel <- t.test(obs_df4$Endpoints.FEV1~obs_df4$Subjects.Sex)
tmodel









    





	Welch Two Sample t-test

data:  obs_df4$Endpoints.FEV1 by obs_df4$Subjects.Sex
t = -3.2644, df = 45.884, p-value = 0.002077
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1037123 -0.2617163
sample estimates:
mean in group female   mean in group male 
            1.703000             2.385714



In [23]:

    
tmodel[3]









    Out[23]:





<FloatVector - Python:0x7f1241a84758 / R:0x5f55dd8>
[-1.103712, -0.261716]



In [30]:

    
%%R -i hd_df
library(dplyr)

pc <- princomp(select(hd_df, -patientId))$loadings[, 1:2]
plot(pc[, 1], pc[, 2], pch = 19, xlab = "Comp. 1", ylab = "Comp. 2", main = "", cex = 1.3)

Widgets



In [31]:

    
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

def f(x):
    return x

interact(f, x=widgets.IntSlider(min=-10,max=30,step=1,value=10));



In [26]:

    
concepts = api.get_concepts('GSE8581', hal = True)



In [27]:

    
import ipywidgets as widgets
from IPython.display import display

d = widgets.Dropdown()
d.options = [c['name'] for c in concepts['_embedded']['ontology_terms'] if c['type'] == 'NUMERIC']
def f(prop, val):
    print val
d.on_trait_change(f, 'value')
display(d)









    



Forced Expiratory Volume Ratio
FEV1
Height (inch)

	label	subject.age	subject.birthDate	subject.deathDate	subject.id	subject.inTrialId	subject.maritalStatus	subject.race	subject.religion	subject.sex	subject.trial	value
0	\Public Studies\GSE8581\Biomarker Data\GPL570\	65	None	None	1000384597	GSE8581GSM210006	None	Afro American	None	FEMALE	GSE8581	None
1	\Public Studies\GSE8581\Endpoints\Diagnosis\	65	None	None	1000384597	GSE8581GSM210006	None	Afro American	None	FEMALE	GSE8581	non-small cell adenocarcinoma
2	\Public Studies\GSE8581\Endpoints\FEV1\	65	None	None	1000384597	GSE8581GSM210006	None	Afro American	None	FEMALE	GSE8581	1.41

label	\Public Studies\GSE8581\Biomarker Data\GPL570\	\Public Studies\GSE8581\Endpoints\Diagnosis\	\Public Studies\GSE8581\Endpoints\FEV1\	\Public Studies\GSE8581\Endpoints\Forced Expiratory Volume Ratio\	\Public Studies\GSE8581\Subjects\Age\	\Public Studies\GSE8581\Subjects\Height (inch)\	\Public Studies\GSE8581\Subjects\Lung Disease\	\Public Studies\GSE8581\Subjects\Organism\	\Public Studies\GSE8581\Subjects\Race\	\Public Studies\GSE8581\Subjects\Sex\
subject.inTrialId
GSE8581GSM210004	None	non-small cell squamous cell carcinoma	2.54	58	63	72	chronic obstructive pulmonary disease	Homo sapiens	Caucasian	male
GSE8581GSM210005	None	non-small cell adenocarcinoma	1.69	83.66	84	60	control	Homo sapiens	Afro American	female
GSE8581GSM210006	None	non-small cell adenocarcinoma	1.41	51	65	66	chronic obstructive pulmonary disease	Homo sapiens	Afro American	female

	201746_at	204092_s_at	208079_s_at	208080_at	211300_s_at	patientId
45741	45.25530	61.428400	53.77660	1.08084	31.6424	GSE8581GSM213034
45742	61.31550	42.516700	34.59850	2.42510	13.8971	GSE8581GSM212811
45743	56.84860	29.488600	26.76440	2.79179	17.3414	GSE8581GSM213036
45744	88.71880	44.719800	39.41060	1.94118	24.0898	GSE8581GSM212075
45745	57.22760	25.578800	33.30950	5.15415	33.0352	GSE8581GSM211008
45746	77.81250	25.868200	27.15200	9.03928	50.2931	GSE8581GSM210090
45747	32.36190	24.203600	18.65140	4.20431	19.9424	GSE8581GSM212855
45748	105.82100	35.938300	47.17160	2.29847	67.8302	GSE8581GSM212070
45749	6.98541	0.362184	2.11288	0.97270	4.2149	GSE8581GSM212810
45750	82.68780	37.258200	28.35180	3.19601	34.2873	GSE8581GSM210193

	201746_at	204092_s_at	208079_s_at	208080_at	211300_s_at	patientId	Biomarker Data\GPL570	Endpoints\Diagnosis	Endpoints\FEV1	Endpoints\Forced Expiratory Volume Ratio	Subjects\Age	Subjects\Height (inch)	Subjects\Lung Disease	Subjects\Organism	Subjects\Race	Subjects\Sex
45741	45.2553	61.4284	53.7766	1.08084	31.6424	GSE8581GSM213034	E	non-small cell adenocarcinoma	2.56	61	70	69	not specified	Homo sapiens	Caucasian	male
45742	61.3155	42.5167	34.5985	2.42510	13.8971	GSE8581GSM212811	E	non-small cell squamous cell carcinoma	1.37	82	77	58	control	Homo sapiens	Caucasian	female
45743	56.8486	29.4886	26.7644	2.79179	17.3414	GSE8581GSM213036	E	non-small cell squamous cell carcinoma	2.68	71	59	69	control	Homo sapiens	Caucasian	female