Authenticate with tranSMART first if you want to execute any of the analysis in the boxes below again.
Step 1: Please open URL http://localhost:8080/transmart/oauth/authorize?response_type=code&client_id=api-client&client_secret=api-client&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2Ftransmart%2Foauth%2Fverify
Step 2: paste token in the token parameter below.
In [1]:
require("transmartRClient")
connectToTransmart("http://localhost:8080/transmart", prefetched.request.token = "4xIZYZ")
If the output above is: Authentication completed. TRUE , then you can continue below.
In [3]:
# Get studies
studies <- getStudies()
studies
Out[3]:
In [4]:
study <- "GSE8581"
# Retrieve Clinical Data
allObservations <- getObservations(study, as.data.frame = T)
# show first 3 rows, just to get impression of the fields available
allObservations$observations[1:3,]
Out[4]:
In [5]:
# get the concepts for this study
concepts <- getConcepts(study)
concepts
Out[5]:
In [6]:
observations <- getObservations(study,
# concept names from api.link.self.href column above:
concept.links =
c("/studies/gse8581/concepts/Subjects/Age",
"/studies/gse8581/concepts/Subjects/Sex")
)
# make two groups based on gender :
observations_female <- subset(observations$observations, Sex == 'female')
observations_male <- subset(observations$observations, Sex == 'male')
# show age distribution:
d <- density(as.integer(observations_male$Age)) # returns the density data
plot(d, col="blue", main="Male vs Female groups age distribution") # plots the results
legend("topright", c("Males","Females"), pch = 1, col=c("blue", "red"))
d <- density(as.integer(observations_female$Age)) # returns the density data
lines(d, col="red") # plots the results
Make a distribution plot similar to the plot above, but now comparing the ages of "control" vs "chronic obstructive pulmonary disease" (if you did the previous RStudio exercise, you can just paste your code in a code cell below).
NB: run also all necessary cells above to fetch the data that your script requires.
In [15]:
Out[15]:
In [7]:
dataDownloaded <- getHighdimData(study.name = study, concept.match = "Lung", projection = "log_intensity")
In [8]:
summary(dataDownloaded)
Out[8]:
In [9]:
# preview part of the data
data<-dataDownloaded[["data"]]
data[1:10,1:10]
Out[9]:
In [10]:
# select gene expression data, which is the data *excluding* columns 1 to 6:
expression_data<-data[,-c(1:6)]
expression_data[1:3,1:3]
dim(expression_data)
# add patientId as the row name for the expression_data matrix:
rownames(expression_data)<-data$patientId
expression_data[1:3,1:3]
Out[10]:
Out[10]:
Out[10]:
If the dimensions of the expression_data table are large, you may want to create a subset of the data first. Here we use a probelist as a subset for the probes, based on the list found in: "Bhattacharya S., Srisuma S., Demeo D. L., et al., Molecular biomarkers for quantitative and discrete COPD phenotypes.American Journal of Respiratory Cell and Molecular Biology. 2009;40(3):359–367. doi: 10.1165/rcmb.2008-0114OC."
In [11]:
#Make a heatmap
probeNames<- c("1552622_s_at","1555318_at","1557293_at","1558280_s_at","1558411_at","1558515_at","1559964_at","204284_at","205051_s_at","205528_s_at","208835_s_at","209377_s_at","209815_at","211548_s_at","212179_at","212263_at","213156_at","213269_at","213650_at","213878_at","215359_x_at","215933_s_at","218352_at","218490_s_at","220094_s_at","220906_at","220925_at","222108_at","224711_at","225318_at","225595_at","225835_at","225892_at","226316_at","226492_at","226534_at","226666_at","226800_at","227095_at","227105_at","227148_at","227812_at","227852_at","227930_at","227947_at","228157_at","228630_at","228665_at","228760_at","228850_s_at","228875_at","228963_at","229111_at","229572_at","230142_s_at","230986_at","232014_at","235423_at","235810_at","238712_at","238992_at","239842_x_at","239847_at","241936_x_at","242389_at")
#note: this is because R automatically prepends "X" in front of column names that start with a numerical value. Therefore prepend "X"
probeNames<- paste("X", probeNames, sep = "")
In [12]:
# select only the cases and controls (excluding the patients for which the lung disease is not specified). Note: in the observation table the database IDs
# are used to identify the patients and not the patient IDs that are used in the gene expression dataset
cases <- allObservations$observations$subject.id[allObservations$observations$'Subjects_Lung Disease' == "chronic obstructive pulmonary disease"]
controls <- allObservations$observations$subject.id[allObservations$observations$'Subjects_Lung Disease' == "control"]
In [13]:
# now we have the *internal database* IDs for the patients, but we need to get the patient IDs because
# this is the index of the expression_data matrix.
# These can be retrieved from the subjectInfo table:
subjectInfo <- allObservations$subjectInfo
patientIDsCase <- subjectInfo$subject.inTrialId[ subjectInfo$subject.id %in% cases ]
patientIDsControl <- subjectInfo$subject.inTrialId[ subjectInfo$subject.id %in% controls]
# patient sets containing case and control patientIDs
patientSets <- c(patientIDsCase, patientIDsControl)
patientSets <- patientSets[which(patientSets %in% rownames(expression_data))]
# make a subset of the data based on the selected patientSets and the probelist, and transpose the
# table so that the rows now contain probe names
subset<-t(expression_data[patientSets,probeNames])
# for ease of recognition: append "Case" and "Control" to the patient names
colnames(subset)[colnames(subset)%in% patientIDsCase] <- paste(colnames(subset)[colnames(subset)%in% patientIDsCase],"Case", sep="_" )
colnames(subset)[colnames(subset)%in% patientIDsControl] <- paste( colnames(subset)[colnames(subset)%in% patientIDsControl] , "Control",sep= "_")
# make heatmap
heatmap(as.matrix(subset), scale = "row")
# there is one patient that seems to be an outlier: GSE8581GSM212810_Case.
# remove this outlier and plot heatmap again
subset_without_outlier <- subset[,colnames(subset)!= "GSE8581GSM212810_Case"]
heatmap(as.matrix(subset_without_outlier), scale = "row")
In [14]:
# PCA analysis :
options(warn=-1)# to turn warnings back on, use : options(warn=0)
subset_t <- t(subset_without_outlier)
prcomp_result <- prcomp(x = subset_t)
result_pca <- prcomp_result$x
#result_pca
rownames_pca_result <- rownames(result_pca)
colors <- c()
for (row in rownames_pca_result){ colors <- c(colors, ifelse(grepl("Case", row), "red", "blue")) }
plot(result_pca[,1], result_pca[,2], col=colors)
legend("topright", c("Case","Control"), pch = 1, col=c("red", "blue"))
#3D
# install.packages("plot3D")
plot3D::scatter3D(result_pca[,1], result_pca[,2], result_pca[,3], col=colors, phi=0, theta=0)
legend("topright", c("Case","Control"), pch = 1, col=c("red", "blue"))
In [ ]: