Load rmagic:
In [1]:
%load_ext rmagic
(From Notebook 1) We loaded Biobase and GEOquery, then used GEOquery's getGEO function to load bio_data.soft.gz into data, an S4 object.
In [2]:
%%R
library(Biobase)
library(GEOquery)
bio_dir = '../data/raw/bio_data.soft.gz'
data = getGEO(filename=bio_dir)
Since our primary interest is in expression data for the tissue samples, we needed to convert our GDS (GEO DataSet) object into an expression set object in order to take advantage of the tools Bioconductor provides for exploring gene expression. To do this we used the built-in GEOquery function GDS2eSet, which extracts expression data from GDS objects. We set do.log2=TRUE so that the expression values are converted to base-two logarithms, a scale that is easier to work with in later computations.
In [3]:
%%R
bio_eset = GDS2eSet(data, do.log2=TRUE)
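As a rough illustration of what a base-two log transform does to expression values, the base R sketch below applies log2 to a small made-up intensity matrix. The probe names, sample names, and numbers are hypothetical and are not taken from our dataset.

```r
# Toy intensity matrix (hypothetical values, not from bio_data.soft.gz)
raw <- matrix(c(8, 64, 1024, 16, 128, 4096),
              nrow = 3,
              dimnames = list(c("probe_a", "probe_b", "probe_c"),
                              c("sample_1", "sample_2")))

# Taking log base two compresses the wide range of raw intensities
log_expr <- log2(raw)
print(log_expr)
```

Raising 2 to the transformed values recovers the raw intensities, so no information is lost by working on the log scale.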
To ensure that the data was correctly converted, we examined bio_eset below. First we viewed the object to make sure it was the right type.
In [4]:
%%R
print(bio_eset)
We then further checked to make sure that all of the data was converted correctly. To this end, we retrieved several expression values for different gene/patient pairs and checked that they matched the expression values in the original dataset. An example for the gene 1007_s_at of subject GSM918603 is below. All our data points matched the original set.
In [5]:
%%R
print(bio_eset["1007_s_at", "GSM918603"])
print(exprs(bio_eset["1007_s_at", "GSM918603"]))
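This kind of spot check can also be automated. The sketch below is a minimal base R version that uses toy stand-ins: orig_table is a hypothetical data frame assumed to be shaped like the original value table (an ID_REF column plus one column per sample), and expr_mat is a hypothetical matrix playing the role of the log2 expression values. None of these names or numbers come from our dataset.

```r
# Hypothetical stand-in for the original value table
orig_table <- data.frame(
  ID_REF = c("probe_a", "probe_b"),
  sample_1 = c(8, 64),
  sample_2 = c(16, 128),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the log2 expression matrix
expr_mat <- matrix(c(3, 6, 4, 7), nrow = 2,
                   dimnames = list(c("probe_a", "probe_b"),
                                   c("sample_1", "sample_2")))

# Compare one probe/sample pair, log-transforming the original value first
matches_original <- function(probe, sample) {
  original <- orig_table[orig_table$ID_REF == probe, sample]
  isTRUE(all.equal(log2(original), expr_mat[probe, sample]))
}

# Check every probe/sample combination in the toy data
pairs <- expand.grid(probe = rownames(expr_mat),
                     sample = colnames(expr_mat),
                     stringsAsFactors = FALSE)
all_match <- all(mapply(matches_original, pairs$probe, pairs$sample))
print(all_match)
```

Using all.equal rather than == tolerates the small floating-point differences that a log transform can introduce.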
We installed the Bioconductor package pvca and then saved our cleaned and properly formatted data set to the ../data/cleaned directory within our project repository. Note that save and load, which write and read .Rda files, are base R functions and work without pvca.
In [6]:
%%R
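# biocLite was the Bioconductor installer when this notebook was written;
# newer Bioconductor releases use BiocManager::install("pvca") instead.
source("http://bioconductor.org/biocLite.R")
biocLite("pvca")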
In the cell below we saved our expression set as a .Rda file so that we can easily access our data for the remainder of the project.
In [7]:
%%R
library(pvca)
save(bio_eset, file="../data/cleaned/bio_eset.Rda")
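For reference, a minimal sketch of the save/load round trip using only base R is shown below. The object toy_obj is a hypothetical stand-in for bio_eset, and the file is written to a temporary path rather than to our project directory.

```r
# Hypothetical small object standing in for bio_eset
toy_obj <- list(probes = c("probe_a", "probe_b"), values = c(3, 6))

# Write to a temporary .Rda file rather than the project directory
rda_path <- tempfile(fileext = ".Rda")
save(toy_obj, file = rda_path)

# load() restores objects under their saved names and returns those names;
# loading into a fresh environment avoids overwriting existing variables
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)
print(loaded_names)
print(identical(restored$toy_obj, toy_obj))
```

Because load restores objects under the names they were saved with, a later notebook that loads bio_eset.Rda would get the object back as bio_eset.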