Medulloblastoma is highly malignant brain cancer that disproportionately affects children. While clinical outcomes associated with the disease are often poor— the 20 year survival rate for children is only 51 percent— they can be different between subgroups of the cancer. We extracted our data from a previously published paper on the subtypes of Medulloblastoma: SHH, WNT, Group 3 and Group 4. After realizing there was a significant gender disparity between some of the author’s own molecular subgroups we decided to try to isolate the genes associated with this gender difference.
We got our gene expression data from a study by Robinson et al in 2012$^1$, in which the the subtypes of cancer discovered by the original authors of the paper had significantly different gender distributions, especially across the WNT and G3 subtypes which were 75% female and 93.75% male, respectively. We wanted to isolate the gene cluster that is associated with the gender disparity between the molecular subgroups.
Because differentially expressed genes do not necessarily affect tumor development, we first located genes that were differentially expressed between male and female patients, then out of these genes, we found those that were also differentially expressed between subgroups to ensure that the gene cluster we obtained was significant between both gender and disease subtype, and actually affected tumor development and clinical outcomes.
Note: We only analyzed data on about 76 patients, 3 of which were either outliers or could not be subtyped. Furthermore, of the patients who could be subtyped 21 were female while 52 were male, so of course our results are necessarily hesitant.
We wished to do a project using biological data but were unsure how to go about obtaining a sufficiently large dataset that was freely available for analysis. Because of this, we decided to work with a gene expression data set, as many interesting data sets are available from the Gene Expression Omnibus repository of the National Center for Biotechnology Information’s website which are amenable to analysis, as soft.gz files.
It took our group a while to find a dataset that was of sufficient interest to all members before we settled on a study about gene expression levels in tumor tissue samples of patients with the brain cancer Medulloblastoma. Once we had chosen our topic, actually getting the data was fairly easy: we simply used the provided download link.
The format of the data did make it somewhat challenging to load as it had been designed to be analyzed with a very specific set of tools. Although we did create a .csv
file of the data, some of the gene names were corrupted and we were unable to index through it by category (i.e. gender). However, once we obtained the required software and installed it properly, we were able to load and analyze our data with ease.
The NCBI recommends using the software Bioconductor
for analysis of this type of data set, as it was specifically designed for working with gene expression data sets and efficiently indexing through them. One Bioconductor
package we used to analyze our data was Biobase
, a necessary package for both the GEOquery
and limma
packages. GEOquery
was used to transform the .soft
file into a GDS
object, which gave us access to the file's meta data and a table of our expression data. GEOquery
was also used to transform our data from a GDS object to an expression set (eset
) object, so that it could be analyzed using limma
. limma
was used to find differentially expressed genes between our chosen samples.
The main challenge we encountered as a group was understanding exactly what our dataset represented. In particular, our data set contained gene expression information of tumors from various cancer patients, but it was not intuitively clear what the matrix of gene names and expression levels actually meant, which, correspondingly, made it harder to decide on an interesting question.
Loading our data and properly installing Bioconductor proved to be the true bottleneck in our project. We initially did not realize we had to use specialized software to load our .soft
file and so we spent several hours loading it into a .csv
file, only to lose key parts of the data set. We discovered that we needed Bioconductor
to properly read and index through our dataset. However, it was quite a Sisyphean ordeal to install Bioconductor
and its necessary packages, taking several days and a visit with the professor. (“My God”)
To get GEOquery
, we first had to install libcurl
and r-cran-xml
. This was particularly arduous as installation constantly failed across multiple Oskiboxes; we had the sublime opportunity to witness the natural enforcement of Murphy’s Law.
However, once these problems were resolved, the only difficulties were determining what further packages were required to help analyze the data (i.e. gplots
, pvca
).
For anyone who may not remember their intro to Biology class, all living things are made of cells, all cells have genes, and depending on what that function that particular cell is carrying out, different genes will be translated into proteins. Even though all our cells contain our full genome, only certain parts of our genome will be used within any single cell. Cancer cells, which are cells whose growth can no longer be controlled through normal cellular pathways, will translate a different set of genes into proteins than healthy cells will. By using a DNA microarray scientists are able to quantify which genes are being 'expressed' (i.e. which genes are having their DNA translated into protein) within cells. The data we analyzed was the ‘expression’ or translation levels of over 54,000 genes (of which around 29,000 were unique) for 76 different tumors from patients with a form of cancer known as Medulloblastoma.
In [1]:
%load_ext rmagic
In [11]:
%%R
c_names = paste('Patient', 1:3, sep=' ')
r_names = paste('Gene', LETTERS[1:3], sep=' ')
categories = c('High', 'Medium', 'Low')
data = sample(categories, size=9, replace=TRUE, prob=c(1/3, 1/3, 1/3))
df = data.frame(matrix(data, nrow=3), row.names=r_names)
colnames(df) = c_names
print(df)
Above we have built a toy example of an expression file to help visualize what one looks like.
In the source paper for our dataset, the authors created molecular subgroups which had vastly different gender distributions. We were quite interested in identifying genes that were associated with this difference. To do this we identified genes that were differentially expressed across gender for the entire patient population and using limma, found 34 genes that were significantly differentially expressed between the male and female subpopulations. However, we are only interested in identifying genes that were relevant to cancer development, and it is possible that the tumors that have differentially expressed genes do not not actually influence development and so are not used in molecular subtyping, such as the kind the authors of our source paper did. Since we do not have the biological expertise to decide which genes are truly implicated in cancer development, we used the authors’ own subgroups to determine this. By checking which of the 34 genes that were differentially expressed between gender for the whole populations were also differentially expressed between subgroups we were able to identify genes that were both differentially expressed and likely important for cancer development.
The below heatmap shows gene expression values for the 34 genes determined to be differentially expressed between male and female patients. Patients are displayed across the x-axis while genes are displayed across the y-axis. The red and blue bar across the top of the heatmap color codes for female and male patients respectively.
In [1]:
from IPython.core.display import Image
Image(filename="../visualizations/malefemale.png", width=485)
Out[1]:
The heatmap below shows gene expression values for the two genes determined to be differentially expressed between the SHH and G3 subgroups out of the 34 genes determined to be differentially expressed between gender. Again, patients are depicted on the x-axis and genes on the y-axis. The colored bar across the top of the heatmap codes for the subgroups of patients based on disease state, red corresponds to the WNT subgroup, green to the SHH subgroup, yellow to the G3 subgroup, and blue to the G4 subgroup.
In [2]:
from IPython.core.display import Image
Image(filename="../visualizations/SHH.png", width=485)
Out[2]:
The below heatmap shows gene expression values for the 17 genes determined to be differentially expressed between the WNT and G3 subgroups out of the 34 genes that are differentially expressed between male and female patients. Patients are displayed across the x-axis while genes are displayed across the y-axis. The red, green, yellow, and blue color coded bar across the top corresponds to the, WNT, SHH, G3, and G4 subgroups respectively.
In [1]:
from IPython.core.display import Image
Image(filename="../visualizations/WNT.png", width=485)
Out[1]:
In conclusion we've tentatively identified the gene clusters that affect differential cancer development between some male and female patients at least between subgroups WNT and G3 and subgroups SHH and G3. Of the differentially expressed genes between male and female patients 17 were found to be differentially expressed between WNT and G3 while two were found to be differentially expressed between SHH and G3. We can possibly surmise that these genes might be associated with different paths of cancer development between some male and female patients.
Once again we would like to emphasize that Medulloblastoma is exceedingly rare cancer and that our sample size is possibly inadequate for us to draw firm conclusions from our data, especially with the discrepancy between the amount of males and amount of females in our sample.
1 Robinson et al. "Novel Mutations Target Distinct Subgroups of Medulloblastoma." Nature 448 (2012): 43-48. NCBI. Web. 20 Apr. 2014. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3412905/.
In [5]:
%load_ext rmagic
In [16]:
%%R
print(citation('Biobase'))
print(citation('GEOquery'))
print(citation('limma'))