Prepare: authenticate with tranSMART

Authenticate with tranSMART first if you want to re-run any of the analyses in the cells below.

Step 1: Please open URL http://localhost:8080/transmart/oauth/authorize?response_type=code&client_id=api-client&client_secret=api-client&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2Ftransmart%2Foauth%2Fverify

Step 2: Paste the token you receive into the prefetched.request.token parameter below.
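
The URL in Step 1 can also be assembled in R, which helps when the server is not running on localhost. This is a minimal sketch: the variable names are illustrative, and the client ID and secret are simply the values visible in the URL above, so adjust them for your own installation.

```r
# Assumed values, copied from the URL in Step 1; adjust them for your own server.
base_url      <- "http://localhost:8080/transmart"
client_id     <- "api-client"
client_secret <- "api-client"
redirect_uri  <- paste0(base_url, "/oauth/verify")

# URLencode(reserved = TRUE) percent-encodes ":" and "/" so that the redirect
# URI can be passed as a single query parameter.
authorize_url <- paste0(
  base_url, "/oauth/authorize",
  "?response_type=code",
  "&client_id=", client_id,
  "&client_secret=", client_secret,
  "&redirect_uri=", utils::URLencode(redirect_uri, reserved = TRUE)
)
print(authorize_url)
```

Open the printed URL in a browser and continue with Step 2.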


In [1]:
require("transmartRClient")
connectToTransmart("http://localhost:8080/transmart", prefetched.request.token = "4xIZYZ")


Loading required package: transmartRClient
Loading required package: RCurl
Loading required package: bitops
Loading required package: RJSONIO
Loading required package: plyr
Loading required package: RProtoBuf

Attaching package: ‘RProtoBuf’

The following object is masked from ‘package:RCurl’:

    clone

Loading required package: hash
hash-2.2.6 provided by Decision Patterns


Attaching package: ‘hash’

The following object is masked from ‘package:RProtoBuf’:

    clear

Loading required package: reshape

Attaching package: ‘reshape’

The following objects are masked from ‘package:plyr’:

    rename, round_any

Authentication completed.
Connection successful.

If the output above ends with "Authentication completed." and "Connection successful.", you can continue below.

Get studies and observations data


In [3]:
# Get studies
studies <- getStudies()
studies


Out[3]:
        | id      | api.link.self.href | ontologyTerm.fullName
GSE8581 | GSE8581 | /studies/gse8581   | \Public Studies\GSE8581\

In [4]:
study <- "GSE8581"  

# Retrieve Clinical Data
allObservations <- getObservations(study, as.data.frame = TRUE)
# show the first 3 rows to get an impression of the available fields
allObservations$observations[1:3,]


Out[4]:
  | subject.id | Biomarker Data_GPL570 | Endpoints_Diagnosis | Endpoints_FEV1 | Endpoints_Forced Expiratory Volume Ratio | Subjects_Age | Subjects_Height (inch) | Subjects_Lung Disease | Subjects_Organism | Subjects_Race | Subjects_Sex
1 | 1000384597 | NA | non-small cell adenocarcinoma | 1.41 | 51 | 65 | 66 | chronic obstructive pulmonary disease | Homo sapiens | Afro American | female
2 | 1000384598 | E | non-small cell squamous cell carcinoma | 1.29 | 53 | 77 | 67 | chronic obstructive pulmonary disease | Homo sapiens | Caucasian | female
3 | 1000384599 | E | inflammation | 4.04 | 79 | 55 | 69 | control | Homo sapiens | Caucasian | male

Making subsets based on attributes (aka "concepts")


In [5]:
# get the concepts for this study
concepts <- getConcepts(study)
concepts


Out[5]:
   | name | fullName | type | api.link.self.href | api.link.observations.href | api.link.parent.href | api.link.children.NA.href | api.link.children.NA.title | api.link.highdim.href
1 | Afro American | \Public Studies\GSE8581\Subjects\Race\Afro American\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Subjects/Race/Afro%20American | /studies/gse8581/concepts/Subjects/Race/Afro%20American/observations | /studies/gse8581/concepts/Subjects/Race | NA | NA | NA
2 | Age | \Public Studies\GSE8581\Subjects\Age\ | NUMERIC | /studies/gse8581/concepts/Subjects/Age | /studies/gse8581/concepts/Subjects/Age/observations | /studies/gse8581/concepts/Subjects | NA | NA | NA
3 | Biomarker Data | \Public Studies\GSE8581\Biomarker Data\ | UNKNOWN | /studies/gse8581/concepts/Biomarker%20Data | /studies/gse8581/concepts/Biomarker%20Data/observations | /studies/gse8581/concepts/ROOT | /studies/gse8581/concepts/Biomarker%20Data/GPL570 | GPL570 | NA
4 | carcinoid | \Public Studies\GSE8581\Endpoints\Diagnosis\carcinoid\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/carcinoid | /studies/gse8581/concepts/Endpoints/Diagnosis/carcinoid/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
5 | Caucasian | \Public Studies\GSE8581\Subjects\Race\Caucasian\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Subjects/Race/Caucasian | /studies/gse8581/concepts/Subjects/Race/Caucasian/observations | /studies/gse8581/concepts/Subjects/Race | NA | NA | NA
6 | chronic obstructive pulmonary disease | \Public Studies\GSE8581\Subjects\Lung Disease\chronic obstructive pulmonary disease\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Subjects/Lung%20Disease/chronic%20obstructive%20pulmonary%20disease | /studies/gse8581/concepts/Subjects/Lung%20Disease/chronic%20obstructive%20pulmonary%20disease/observations | /studies/gse8581/concepts/Subjects/Lung%20Disease | NA | NA | NA
7 | control | \Public Studies\GSE8581\Subjects\Lung Disease\control\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Subjects/Lung%20Disease/control | /studies/gse8581/concepts/Subjects/Lung%20Disease/control/observations | /studies/gse8581/concepts/Subjects/Lung%20Disease | NA | NA | NA
8 | Diagnosis | \Public Studies\GSE8581\Endpoints\Diagnosis\ | UNKNOWN | /studies/gse8581/concepts/Endpoints/Diagnosis | /studies/gse8581/concepts/Endpoints/Diagnosis/observations | /studies/gse8581/concepts/Endpoints | /studies/gse8581/concepts/Endpoints/Diagnosis/Unknown | Unknown | NA
9 | emphysema | \Public Studies\GSE8581\Endpoints\Diagnosis\emphysema\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/emphysema | /studies/gse8581/concepts/Endpoints/Diagnosis/emphysema/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
10 | Endpoints | \Public Studies\GSE8581\Endpoints\ | UNKNOWN | /studies/gse8581/concepts/Endpoints | /studies/gse8581/concepts/Endpoints/observations | /studies/gse8581/concepts/ROOT | /studies/gse8581/concepts/Endpoints/Forced%20Expiratory%20Volume%20Ratio | Forced Expiratory Volume Ratio | NA
11 | female | \Public Studies\GSE8581\Subjects\Sex\female\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Subjects/Sex/female | /studies/gse8581/concepts/Subjects/Sex/female/observations | /studies/gse8581/concepts/Subjects/Sex | NA | NA | NA
12 | FEV1 | \Public Studies\GSE8581\Endpoints\FEV1\ | NUMERIC | /studies/gse8581/concepts/Endpoints/FEV1 | /studies/gse8581/concepts/Endpoints/FEV1/observations | /studies/gse8581/concepts/Endpoints | NA | NA | NA
13 | Forced Expiratory Volume Ratio | \Public Studies\GSE8581\Endpoints\Forced Expiratory Volume Ratio\ | NUMERIC | /studies/gse8581/concepts/Endpoints/Forced%20Expiratory%20Volume%20Ratio | /studies/gse8581/concepts/Endpoints/Forced%20Expiratory%20Volume%20Ratio/observations | /studies/gse8581/concepts/Endpoints | NA | NA | NA
14 | giant bullae | \Public Studies\GSE8581\Endpoints\Diagnosis\giant bullae\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/giant%20bullae | /studies/gse8581/concepts/Endpoints/Diagnosis/giant%20bullae/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
15 | Giant Cell Tumor | \Public Studies\GSE8581\Endpoints\Diagnosis\Giant Cell Tumor\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/Giant%20Cell%20Tumor | /studies/gse8581/concepts/Endpoints/Diagnosis/Giant%20Cell%20Tumor/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
16 | GPL570 | \Public Studies\GSE8581\Biomarker Data\GPL570\ | UNKNOWN | /studies/gse8581/concepts/Biomarker%20Data/GPL570 | /studies/gse8581/concepts/Biomarker%20Data/GPL570/observations | /studies/gse8581/concepts/Biomarker%20Data | /studies/gse8581/concepts/Biomarker%20Data/GPL570/Lung | Lung | NA
17 | Height (inch) | \Public Studies\GSE8581\Subjects\Height (inch)\ | NUMERIC | /studies/gse8581/concepts/Subjects/Height%20%28inch%29 | /studies/gse8581/concepts/Subjects/Height%20%28inch%29/observations | /studies/gse8581/concepts/Subjects | NA | NA | NA
18 | hematoma | \Public Studies\GSE8581\Endpoints\Diagnosis\hematoma\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/hematoma | /studies/gse8581/concepts/Endpoints/Diagnosis/hematoma/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
19 | Homo sapiens | \Public Studies\GSE8581\Subjects\Organism\Homo sapiens\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Subjects/Organism/Homo%20sapiens | /studies/gse8581/concepts/Subjects/Organism/Homo%20sapiens/observations | /studies/gse8581/concepts/Subjects/Organism | NA | NA | NA
20 | inflammation | \Public Studies\GSE8581\Endpoints\Diagnosis\inflammation\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/inflammation | /studies/gse8581/concepts/Endpoints/Diagnosis/inflammation/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
21 | Lung | \Public Studies\GSE8581\Biomarker Data\GPL570\Lung\ | HIGH_DIMENSIONAL | /studies/gse8581/concepts/Biomarker%20Data/GPL570/Lung | NA | /studies/gse8581/concepts/Biomarker%20Data/GPL570 | NA | NA | /studies/gse8581/concepts/Biomarker%20Data/GPL570/Lung/highdim
22 | Lung Disease | \Public Studies\GSE8581\Subjects\Lung Disease\ | UNKNOWN | /studies/gse8581/concepts/Subjects/Lung%20Disease | /studies/gse8581/concepts/Subjects/Lung%20Disease/observations | /studies/gse8581/concepts/Subjects | /studies/gse8581/concepts/Subjects/Lung%20Disease/not%20specified | not specified | NA
23 | lymphoma | \Public Studies\GSE8581\Endpoints\Diagnosis\lymphoma\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/lymphoma | /studies/gse8581/concepts/Endpoints/Diagnosis/lymphoma/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
24 | male | \Public Studies\GSE8581\Subjects\Sex\male\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Subjects/Sex/male | /studies/gse8581/concepts/Subjects/Sex/male/observations | /studies/gse8581/concepts/Subjects/Sex | NA | NA | NA
25 | metastatic non-small cell adenocarcinoma | \Public Studies\GSE8581\Endpoints\Diagnosis\metastatic non-small cell adenocarcinoma\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/metastatic%20non-small%20cell%20adenocarcinoma | /studies/gse8581/concepts/Endpoints/Diagnosis/metastatic%20non-small%20cell%20adenocarcinoma/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
26 | metastatic renal cell carcinoma | \Public Studies\GSE8581\Endpoints\Diagnosis\metastatic renal cell carcinoma\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/metastatic%20renal%20cell%20carcinoma | /studies/gse8581/concepts/Endpoints/Diagnosis/metastatic%20renal%20cell%20carcinoma/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
27 | no malignancy | \Public Studies\GSE8581\Endpoints\Diagnosis\no malignancy\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/no%20malignancy | /studies/gse8581/concepts/Endpoints/Diagnosis/no%20malignancy/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
28 | non-small cell adenocarcinoma | \Public Studies\GSE8581\Endpoints\Diagnosis\non-small cell adenocarcinoma\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/non-small%20cell%20adenocarcinoma | /studies/gse8581/concepts/Endpoints/Diagnosis/non-small%20cell%20adenocarcinoma/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
29 | non-small cell squamous cell carcinoma | \Public Studies\GSE8581\Endpoints\Diagnosis\non-small cell squamous cell carcinoma\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/non-small%20cell%20squamous%20cell%20carcinoma | /studies/gse8581/concepts/Endpoints/Diagnosis/non-small%20cell%20squamous%20cell%20carcinoma/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
30 | not specified | \Public Studies\GSE8581\Subjects\Lung Disease\not specified\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Subjects/Lung%20Disease/not%20specified | /studies/gse8581/concepts/Subjects/Lung%20Disease/not%20specified/observations | /studies/gse8581/concepts/Subjects/Lung%20Disease | NA | NA | NA
31 | NSC-Mixed | \Public Studies\GSE8581\Endpoints\Diagnosis\NSC-Mixed\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/NSC-Mixed | /studies/gse8581/concepts/Endpoints/Diagnosis/NSC-Mixed/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
32 | Organism | \Public Studies\GSE8581\Subjects\Organism\ | UNKNOWN | /studies/gse8581/concepts/Subjects/Organism | /studies/gse8581/concepts/Subjects/Organism/observations | /studies/gse8581/concepts/Subjects | /studies/gse8581/concepts/Subjects/Organism/Homo%20sapiens | Homo sapiens | NA
33 | Race | \Public Studies\GSE8581\Subjects\Race\ | UNKNOWN | /studies/gse8581/concepts/Subjects/Race | /studies/gse8581/concepts/Subjects/Race/observations | /studies/gse8581/concepts/Subjects | /studies/gse8581/concepts/Subjects/Race/Caucasian | Caucasian | NA
34 | Sex | \Public Studies\GSE8581\Subjects\Sex\ | UNKNOWN | /studies/gse8581/concepts/Subjects/Sex | /studies/gse8581/concepts/Subjects/Sex/observations | /studies/gse8581/concepts/Subjects | /studies/gse8581/concepts/Subjects/Sex/male | male | NA
35 | Subjects | \Public Studies\GSE8581\Subjects\ | UNKNOWN | /studies/gse8581/concepts/Subjects | /studies/gse8581/concepts/Subjects/observations | /studies/gse8581/concepts/ROOT | /studies/gse8581/concepts/Subjects/Sex | Sex | NA
36 | Unknown | \Public Studies\GSE8581\Endpoints\Diagnosis\Unknown\ | CATEGORICAL_OPTION | /studies/gse8581/concepts/Endpoints/Diagnosis/Unknown | /studies/gse8581/concepts/Endpoints/Diagnosis/Unknown/observations | /studies/gse8581/concepts/Endpoints/Diagnosis | NA | NA | NA
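
The concepts table can be filtered like any other data frame, for example to collect the links of all numeric concepts. The sketch below runs on a small hand-made stand-in (concepts_demo, an illustrative name) that reproduces only three of the columns shown above; it assumes the real result of getConcepts() has the same column names.

```r
# Toy stand-in for the data frame returned by getConcepts(); only three of its
# columns are reproduced, with values copied from the table above.
concepts_demo <- data.frame(
  name = c("Age", "FEV1", "Sex", "female"),
  type = c("NUMERIC", "NUMERIC", "UNKNOWN", "CATEGORICAL_OPTION"),
  api.link.self.href = c(
    "/studies/gse8581/concepts/Subjects/Age",
    "/studies/gse8581/concepts/Endpoints/FEV1",
    "/studies/gse8581/concepts/Subjects/Sex",
    "/studies/gse8581/concepts/Subjects/Sex/female"
  ),
  stringsAsFactors = FALSE
)

# Keep the links of all numeric concepts. These strings have the same form as
# the values passed to the concept.links argument of getObservations().
numeric_links <- concepts_demo$api.link.self.href[concepts_demo$type == "NUMERIC"]
print(numeric_links)
```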

In [6]:
observations <- getObservations(study, 
                                # concept names from api.link.self.href column above: 
                                concept.links =
                                  c("/studies/gse8581/concepts/Subjects/Age",
                                    "/studies/gse8581/concepts/Subjects/Sex")
                                )
# make two groups based on gender :
observations_female <- subset(observations$observations, Sex == 'female')
observations_male <- subset(observations$observations, Sex == 'male')
# show age distribution:
d <- density(as.integer(observations_male$Age)) # returns the density data 
plot(d, col="blue", main="Male vs Female groups age distribution") # plots the results
legend("topright", c("Males","Females"), pch = 1, col=c("blue", "red"))
d <- density(as.integer(observations_female$Age)) # returns the density data 
lines(d, col="red") # plots the results
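
A density plot is easier to interpret next to a numeric summary per group. The sketch below uses hypothetical values in a toy table (obs_demo) instead of the real observations. The explicit conversion mirrors the as.integer() call above, on the assumption that Age may not arrive as a number.

```r
# Hypothetical observations; the real table comes from getObservations().
obs_demo <- data.frame(
  Age = c("65", "77", "55", "60", "48"),
  Sex = c("female", "female", "male", "male", "male"),
  stringsAsFactors = FALSE
)

# Convert Age to a number before summarising (assumption: it may be text).
obs_demo$Age <- as.numeric(obs_demo$Age)

# Mean age per group; tapply() splits the first vector by the second.
age_by_sex <- tapply(obs_demo$Age, obs_demo$Sex, mean)
print(age_by_sex)
```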


Exercise 1

Make a distribution plot similar to the one above, but now compare the ages of the "control" and "chronic obstructive pulmonary disease" groups. (If you did the previous RStudio exercise, you can paste your code into a code cell below.)

NB: also run all the cells above that fetch the data your script requires.


In [15]:



Out[15]:
1

Downloading the expression data

This can take a while (~1 minute)


In [7]:
dataDownloaded <- getHighdimData(study.name = study, concept.match = "Lung", projection = "log_intensity")


Retrieving data from server. This can take some time, depending on your network connection speed. 2015-10-13 05:36:26
Retrieving data: 
 24.363 MiB downloaded.
Download complete.
Received data for 55 assays. Unpacking data. 2015-10-13 05:37:54
  |======================================================================| 100%
Data unpacked. Converting to data.frame. 2015-10-13 05:38:16
Additional biomarker information is available.
This function will return a list containing a dataframe containing the high dimensional data and a hash describing which (column) labels refer to which bioMarker

In [8]:
summary(dataDownloaded)


Out[8]:
                    Length Class      Mode
data                54680  data.frame list
labelToBioMarkerMap 54674  hash       S4  

In [9]:
# preview part of the data
data<-dataDownloaded[["data"]]
data[1:10,1:10]


Out[9]:
   | assayId | patientId | sampleTypeName | timepointName | tissueTypeName | platform | X235956_at | X226260_x_at | X232632_at | X214503_x_at
1 | 45741 | GSE8581GSM213034 | Human | | Lung | GPL570 | 6.461792 | 4.969012 | 5.546975 | 0.3130368
2 | 45742 | GSE8581GSM212811 | Human | | Lung | GPL570 | 7.461758 | 1.053695 | 5.507547 | -1.114216
3 | 45743 | GSE8581GSM213036 | Human | | Lung | GPL570 | 6.907191 | 5.064348 | 7.637386 | 0.5699965
4 | 45744 | GSE8581GSM212075 | Human | | Lung | GPL570 | 7.775558 | 4.766791 | 6.595917 | -0.1566947
5 | 45745 | GSE8581GSM211008 | Human | | Lung | GPL570 | 7.308785 | 3.559235 | 8.249853 | 0.2687817
6 | 45746 | GSE8581GSM210090 | Human | | Lung | GPL570 | 6.286654 | 4.967538 | 7.318986 | 0.1080092
7 | 45747 | GSE8581GSM212855 | Human | | Lung | GPL570 | 4.924318 | 4.085595 | 4.552512 | 1.08789
8 | 45748 | GSE8581GSM212070 | Human | | Lung | GPL570 | 8.0049 | 4.767745 | 6.762535 | 1.19427
9 | 45749 | GSE8581GSM212810 | Human | | Lung | GPL570 | 2.411315 | -1.447194 | 0.2749713 | -1.321647
10 | 45750 | GSE8581GSM210193 | Human | | Lung | GPL570 | 6.670685 | 4.025853 | 7.538934 | 0.9389105

Prepare the data for easy usage in different standard R functions

The steps below turn the table above into a simpler table that contains only expression values, with the patient identifiers as row names.


In [10]:
# select gene expression data, which is the data *excluding* columns 1 to 6:
expression_data<-data[,-c(1:6)]
expression_data[1:3,1:3]
dim(expression_data)
# add patientId as the row name for the expression_data matrix:
rownames(expression_data)<-data$patientId
expression_data[1:3,1:3]


Out[10]:
  | X235956_at | X226260_x_at | X232632_at
1 | 6.461792 | 4.969012 | 5.546975
2 | 7.461758 | 1.053695 | 5.507547
3 | 6.907191 | 5.064348 | 7.637386
Out[10]:
  1. 55
  2. 54674
Out[10]:
                 | X235956_at | X226260_x_at | X232632_at
GSE8581GSM213034 | 6.461792 | 4.969012 | 5.546975
GSE8581GSM212811 | 7.461758 | 1.053695 | 5.507547
GSE8581GSM213036 | 6.907191 | 5.064348 | 7.637386
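
The same reshaping can be tried on a toy table before running it on the full download. The sketch below uses illustrative names (data_demo, expr_demo) and rounded values, and assumes, as above, that the first six columns are metadata. It also checks that the patient identifiers are unique, because R rejects duplicate row names on a data frame.

```r
# Toy stand-in for dataDownloaded[["data"]]: six metadata columns followed by
# probe columns. Patient IDs are taken from the preview above; values are rounded.
data_demo <- data.frame(
  assayId        = c(45741, 45742, 45743),
  patientId      = c("GSE8581GSM213034", "GSE8581GSM212811", "GSE8581GSM213036"),
  sampleTypeName = "Human",
  timepointName  = NA,
  tissueTypeName = "Lung",
  platform       = "GPL570",
  X235956_at     = c(6.46, 7.46, 6.91),
  X226260_x_at   = c(4.97, 1.05, 5.06),
  stringsAsFactors = FALSE
)

# Row names must be unique, so check before assigning them.
stopifnot(anyDuplicated(data_demo$patientId) == 0)

# Drop the six metadata columns and label each row with its patient ID.
expr_demo <- data_demo[, -c(1:6)]
rownames(expr_demo) <- data_demo$patientId
print(expr_demo)
```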

Heatmap

If the expression_data table is large, you may want to create a subset of the data first. Here we restrict the probes to the list found in: "Bhattacharya S., Srisuma S., Demeo D. L., et al., Molecular biomarkers for quantitative and discrete COPD phenotypes. American Journal of Respiratory Cell and Molecular Biology. 2009;40(3):359–367. doi: 10.1165/rcmb.2008-0114OC."


In [11]:
#Make a heatmap
probeNames<- c("1552622_s_at","1555318_at","1557293_at","1558280_s_at","1558411_at","1558515_at","1559964_at","204284_at","205051_s_at","205528_s_at","208835_s_at","209377_s_at","209815_at","211548_s_at","212179_at","212263_at","213156_at","213269_at","213650_at","213878_at","215359_x_at","215933_s_at","218352_at","218490_s_at","220094_s_at","220906_at","220925_at","222108_at","224711_at","225318_at","225595_at","225835_at","225892_at","226316_at","226492_at","226534_at","226666_at","226800_at","227095_at","227105_at","227148_at","227812_at","227852_at","227930_at","227947_at","228157_at","228630_at","228665_at","228760_at","228850_s_at","228875_at","228963_at","229111_at","229572_at","230142_s_at","230986_at","232014_at","235423_at","235810_at","238712_at","238992_at","239842_x_at","239847_at","241936_x_at","242389_at")
# Note: R automatically prepends "X" to column names that start with a digit, so prepend "X" to the probe names as well.
probeNames<- paste("X", probeNames, sep = "")
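
Selecting columns by name fails with "undefined columns selected" if even one probe is absent from the table, so it can be worth checking the list first. The sketch below uses a toy table and a hypothetical probe list (the "0000000_at" probe is made up). It also shows that base R's make.names() applies the same "X" prefix rule described above.

```r
# make.names() turns a name starting with a digit into a syntactically valid
# one by prepending "X".
stopifnot(identical(make.names("1552622_s_at"), "X1552622_s_at"))

# Toy expression table with two probe columns (names from the preview above).
expr_demo <- data.frame(X235956_at = c(6.46, 7.46), X226260_x_at = c(4.97, 1.05))

# Hypothetical probe list: one probe present in the table, one made-up probe.
probes_demo <- paste0("X", c("235956_at", "0000000_at"))

# Split the list into probes that can be selected and probes that cannot.
present <- intersect(probes_demo, colnames(expr_demo))
absent  <- setdiff(probes_demo, colnames(expr_demo))
print(present)
print(absent)

# Subsetting with the checked list is safe; drop = FALSE keeps a data frame
# even when only one column remains.
subset_demo <- expr_demo[, present, drop = FALSE]
```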

In [12]:
# select only the cases and controls (excluding the patients for which the lung disease is not specified). Note: in the observation table the database IDs 
# are used to identify the patients and not the patient IDs that are used in the gene expression dataset
cases <- allObservations$observations$subject.id[allObservations$observations$'Subjects_Lung Disease' == "chronic obstructive pulmonary disease"]
controls <- allObservations$observations$subject.id[allObservations$observations$'Subjects_Lung Disease' == "control"]

In [13]:
# now we have the *internal database* IDs for the patients, but we need to get the patient IDs because 
# this is the index of the expression_data matrix. 
# These can be retrieved from the subjectInfo table: 
subjectInfo <- allObservations$subjectInfo
patientIDsCase    <- subjectInfo$subject.inTrialId[ subjectInfo$subject.id %in% cases ] 
patientIDsControl <- subjectInfo$subject.inTrialId[ subjectInfo$subject.id %in% controls] 

# patient sets containing case and control patientIDs
patientSets <- c(patientIDsCase, patientIDsControl)
patientSets <- patientSets[which(patientSets %in% rownames(expression_data))]
# make a subset of the data based on the selected patientSets and the probelist, and transpose the 
# table so that the rows now contain probe names
subset<-t(expression_data[patientSets,probeNames]) 
# for ease of recognition: append "Case" and "Control" to the patient names
colnames(subset)[colnames(subset)%in% patientIDsCase] <- paste(colnames(subset)[colnames(subset)%in% patientIDsCase],"Case", sep="_" )
colnames(subset)[colnames(subset)%in% patientIDsControl] <- paste( colnames(subset)[colnames(subset)%in% patientIDsControl] , "Control",sep= "_")

# make heatmap
heatmap(as.matrix(subset), scale = "row")

# there is one patient that seems to be an outlier: GSE8581GSM212810_Case.
# remove this outlier and plot heatmap again
subset_without_outlier <- subset[,colnames(subset)!= "GSE8581GSM212810_Case"]
heatmap(as.matrix(subset_without_outlier), scale = "row")
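
The translation from internal database IDs to patient IDs, and the labelling that follows, can be checked on a toy lookup table. All IDs in this sketch are made up. It uses match() rather than %in%, which returns the patient IDs in the order of the input IDs rather than in the order of the lookup table.

```r
# Hypothetical subjectInfo table: internal database IDs next to patient IDs.
subject_info_demo <- data.frame(
  subject.id        = c(101, 102, 103),
  subject.inTrialId = c("P1", "P2", "P3"),
  stringsAsFactors  = FALSE
)
cases_demo    <- c(101, 103)
controls_demo <- c(102)

# match() gives, for each internal ID, its row position in the lookup table.
case_ids <- subject_info_demo$subject.inTrialId[
  match(cases_demo, subject_info_demo$subject.id)]
control_ids <- subject_info_demo$subject.inTrialId[
  match(controls_demo, subject_info_demo$subject.id)]

# Append the group to every patient ID, as done for the heatmap column names.
labels <- c(paste(case_ids, "Case", sep = "_"),
            paste(control_ids, "Control", sep = "_"))
print(labels)
```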


PCA visualization


In [14]:
# PCA analysis : 
options(warn=-1)# to turn warnings back on, use : options(warn=0)
subset_t <- t(subset_without_outlier)
prcomp_result <- prcomp(x = subset_t)
result_pca <- prcomp_result$x
#result_pca
rownames_pca_result <- rownames(result_pca)
colors <- c()
for (row in rownames_pca_result){ colors <- c(colors, ifelse(grepl("Case", row), "red", "blue")) }
plot(result_pca[,1], result_pca[,2], col=colors)
legend("topright", c("Case","Control"), pch = 1, col=c("red", "blue"))
#3D
# install.packages("plot3D")
plot3D::scatter3D(result_pca[,1], result_pca[,2], result_pca[,3], col=colors, phi=0, theta=0)
legend("topright", c("Case","Control"), pch = 1, col=c("red", "blue"))


Error in loadNamespace(name): there is no package called ‘plot3D’
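
The error above means the plot3D package is not installed. The sketch below runs a PCA on random toy data and defines a hypothetical helper, plot_pca(), that draws the 3D plot only when plot3D is available and otherwise falls back to the 2D plot. It also computes the share of variance per component and replaces the colour loop with a vectorised ifelse(). All names and data are illustrative.

```r
set.seed(1)  # reproducible toy data

# Hypothetical matrix: 6 samples (rows) by 4 probes (columns).
toy <- matrix(
  rnorm(24), nrow = 6,
  dimnames = list(paste0("P", 1:6, c("_Case", "_Control")), paste0("probe", 1:4))
)

pca <- prcomp(toy)

# Share of the total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Vectorised alternative to the colour loop: grepl() handles the whole vector.
point_colors <- ifelse(grepl("Case", rownames(pca$x)), "red", "blue")

# Draw in 3D when plot3D is installed, otherwise fall back to the first two PCs.
plot_pca <- function(scores, colors, phi = 0, theta = 0) {
  if (requireNamespace("plot3D", quietly = TRUE)) {
    plot3D::scatter3D(scores[, 1], scores[, 2], scores[, 3], phi = phi, theta = theta)
  } else {
    plot(scores[, 1], scores[, 2], col = colors)
    legend("topright", c("Case", "Control"), pch = 1, col = c("red", "blue"))
  }
}
```

Calling plot_pca(pca$x, point_colors) then produces whichever plot the installed packages allow.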

Exercise 2

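Try different phi and theta angles in the script above to improve the 3D PCA visualization. Tip: also try 0, 60. The 3D plot needs the plot3D package; if it is missing, as in the error above, uncomment the install.packages("plot3D") line first.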


In [ ]: