In this notebook we gathered the data for our project and turned it into a GDS object.
We began by creating directories to organize our data and analysis.
In [1]:
%%bash
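# -p creates missing parent directories (here ../data) and does not fail
# if a directory already exists, so this cell can be re-run safely
mkdir -p ../data/raw
mkdir -p ../data/cleaned
mkdir -p ../data/simulated
mkdir -p ../visualizations
mkdir -p ../script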
We decided to analyze a microarray gene expression dataset from the Gene Expression Omnibus (GEO), the repository of gene expression arrays maintained by the NCBI (National Center for Biotechnology Information). We settled on one that contains gene expression information for 76 tissue samples from pediatric patients with medulloblastoma, a form of brain cancer.
The data set we chose is GDS4471.
Load rmagic:
In [2]:
%load_ext rmagic
In the cell below we downloaded the data into the ../data/raw directory and renamed it bio_data.soft.gz, using R and the download link for the dataset provided by the NCBI.
In [3]:
%%R
data_url = 'ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS4nnn/GDS4471/soft/GDS4471.soft.gz'
bio_dir = '../data/raw/bio_data.soft.gz'
download.file(data_url, destfile=bio_dir)
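A quick integrity check on the downloaded archive can catch a truncated or failed transfer before any parsing is attempted. The sketch below is not part of the original notebook: check_soft_gz is a hypothetical helper name, and it assumes the gzip command-line tool is available. It tests the archive and prints its first few lines, which in a SOFT file are header lines beginning with ^ or !.

```shell
# Hypothetical helper (not from the original notebook): confirm that a
# downloaded SOFT archive is a valid gzip file and show its first lines.
check_soft_gz() {
    file="$1"
    # gzip -t tests archive integrity without writing any output files
    if ! gzip -t "$file" 2>/dev/null; then
        echo "corrupt or missing: $file"
        return 1
    fi
    # SOFT files open with '^' entity lines and '!' attribute lines
    gzip -dc "$file" | head -n 5
}

# Only run against the notebook's path if the download has already happened
if [ -f ../data/raw/bio_data.soft.gz ]; then
    check_soft_gz ../data/raw/bio_data.soft.gz
fi
```

In the notebook this could be run in a %%bash cell right after the download cell.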
In order to analyze the GEO Dataset (GDS) file that our data set is contained in, we decided to use the software the NCBI recommends, Bioconductor, which was specifically designed to work with gene expression data sets. However, Bioconductor requires several packages to work properly, which we have installed below:
Installed libcurl:
sudo apt-get install libcurl4-gnutls-dev
Installed r-cran-xml:
sudo apt-get install r-cran-xml
Please note: This had to be done in the terminal. It could not be done in the IPython notebook because the install prompts for user permission.
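One possible way around the confirmation prompt is apt-get's -y flag, which answers "yes" automatically. The sketch below only prints the resulting command rather than running it, so nothing is installed. apt_install_cmd is a hypothetical helper name, and we did not test this route in our notebook. Note that sudo may still ask for a password, which a notebook cell cannot supply.

```shell
# Sketch only: prints the command instead of running it, so nothing is installed.
# The -y flag tells apt-get to assume "yes" at its confirmation prompt.
# Assumption: sudo itself may still ask for a password.
apt_install_cmd() {
    echo "sudo apt-get install -y $*"
}

apt_install_cmd libcurl4-gnutls-dev r-cran-xml
```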
Once we had installed the correct packages in our oskiboxes, we were able to install Bioconductor and the required Bioconductor packages.
Installed Bioconductor
In [4]:
%%R
source("http://bioconductor.org/biocLite.R")
biocLite()
Installed the GEOquery package of Bioconductor, which turns the soft.gz or GDS file in our directory into a "GDS object" that gives us access to metadata and a table of our expression data. These can be accessed through the functions Meta() and Table(), respectively.
In [5]:
%%R
biocLite("GEOquery")
library(Biobase)
library(GEOquery)
After loading GEOquery, we used the function getGEO() to turn our GDS file into a GDS object.
In [6]:
%%R
data = getGEO(filename=bio_dir)
We then examined the object to ensure that it had loaded correctly, and that both the Meta() and Table() functions performed properly.
In [7]:
%%R
print(mode(data))
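The reported mode is consistent with a GDS object, which GEOquery implements as an S4 class. Note that mode() reports the storage mode rather than the class name; class(data) would report the class directly.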
In [8]:
%%R
print(head(Table(data)))
print(Meta(data)$description)
The object is consistent with the original file. This means we were able to successfully create the GDS object from our dataset!
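The consistency check can also be made independently of R by counting the data rows in the raw file and comparing that number with nrow(Table(data)). The sketch below is not from the original notebook: count_soft_rows is a hypothetical helper, and it assumes the expression table in the SOFT file sits between !dataset_table_begin and !dataset_table_end marker lines, with one column-header line at the top.

```shell
# Hypothetical helper (not from the original notebook): count the data rows
# of the expression table inside a gzipped GDS SOFT file.
# Assumption: the table is delimited by !dataset_table_begin / !dataset_table_end
# and its first line is a column header.
count_soft_rows() {
    gzip -dc "$1" | awk '
        /^!dataset_table_end/   { inside = 0 }   # stop before counting the end marker
        inside                  { n++ }          # header line plus data rows
        /^!dataset_table_begin/ { inside = 1 }   # start counting on the next line
        END { print (n > 0 ? n - 1 : 0) }        # subtract the column-header line
    '
}

# Only run against the notebook's path if the file is present
if [ -f ../data/raw/bio_data.soft.gz ]; then
    count_soft_rows ../data/raw/bio_data.soft.gz
fi
```

If the printed count matches the number of rows reported in R, the table was read in full.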
We all worked together on data gathering, and we ended up using Renee's method.