Load rmagic:
In [1]:
%load_ext rmagic
(From Notebook 1)
We loaded Biobase and GEOquery, then used GEOquery's getGEO function to read bio_data.soft.gz into data, an S4 object.
In [2]:
%%R
library(Biobase)
library(GEOquery)
bio_dir = '../data/raw/bio_data.soft.gz'
data = getGEO(filename=bio_dir)
Since our primary interest is in expression data for the tissue samples, we needed to convert our GDS (GEO DataSet) object into an ExpressionSet object in order to take advantage of the tools Bioconductor provides for exploring gene expression. To do this we used the built-in GEOquery function GDS2eSet, which extracts expression data from GDS objects. To put the expression values on a log scale, we set do.log2=TRUE, which applies a base-two logarithm.
In [3]:
%%R
bio_eset = GDS2eSet(data, do.log2=TRUE)
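To illustrate what the base-two log transform does to expression values, here is a minimal sketch in plain R. The matrix, probe names, and sample names are hypothetical toy values, not taken from bio_data.soft.gz, and the sketch only mirrors the kind of element-wise transform that do.log2=TRUE requests.

```r
# Toy expression matrix (hypothetical values, not from bio_data.soft.gz).
raw_values <- matrix(
  c(8, 16, 1024, 2, 64, 4096),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("probe_a", "probe_b"),
                  c("sample_1", "sample_2", "sample_3"))
)

# Element-wise base-two logarithm; row and column names are preserved.
log_values <- log2(raw_values)
print(log_values)
```

On the log scale, a doubling of the raw intensity becomes a difference of one unit, which is why fold changes are usually reported this way.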
To ensure that the data was correctly converted, we examined bio_eset below:
First we viewed the object to make sure it was the right type.
In [4]:
%%R
print(bio_eset)
We then checked that the expression values themselves were converted correctly. To this end, we retrieved several expression values for different gene/patient pairs and checked that they matched the corresponding values in the original dataset. An example for probe 1007_s_at and sample GSM918603 is below. All of the data points we checked matched the original set.
In [5]:
%%R
print(bio_eset["1007_s_at", "GSM918603"])
print(exprs(bio_eset["1007_s_at", "GSM918603"]))
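A spot check like this can also be automated. The sketch below is one possible way to do it, using toy stand-ins so that it runs on its own: raw_table imitates a raw table with an ID_REF column and one column per sample, expr_matrix imitates the output of exprs(bio_eset), and spot_check is a hypothetical helper of ours, not a GEOquery or Biobase function.

```r
# Hypothetical stand-in for the raw table: one row per probe, one column per sample.
raw_table <- data.frame(
  ID_REF = c("probe_a", "probe_b", "probe_c"),
  sample_1 = c(8, 2, 32),
  sample_2 = c(16, 64, 128),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for exprs(bio_eset): log2 values with probes as row names.
expr_matrix <- log2(as.matrix(raw_table[, c("sample_1", "sample_2")]))
rownames(expr_matrix) <- raw_table$ID_REF

# Compare n randomly chosen probe/sample pairs; TRUE only if every pair agrees.
spot_check <- function(raw_table, expr_matrix, n = 5) {
  probes <- sample(rownames(expr_matrix), n, replace = TRUE)
  samples <- sample(colnames(expr_matrix), n, replace = TRUE)
  matches <- mapply(function(p, s) {
    raw_value <- raw_table[raw_table$ID_REF == p, s]
    # all.equal tolerates floating-point rounding in the log transform.
    isTRUE(all.equal(log2(raw_value), expr_matrix[p, s]))
  }, probes, samples)
  all(matches)
}

set.seed(1)  # make the random selection reproducible
all_match <- spot_check(raw_table, expr_matrix)
print(all_match)
```

Comparing on the log scale matters here: the raw values have to be log2-transformed before they can be expected to equal the values stored in the expression set.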
We installed the Bioconductor package pvca, and then saved our cleaned and properly formatted data set to the ../data/cleaned directory within our project repository. Note that save and load are base R functions, so pvca is not strictly required for writing or reading .Rda files.
In [6]:
%%R
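# biocLite is the legacy Bioconductor installer; recent Bioconductor
# releases use BiocManager::install("pvca") instead.
source("http://bioconductor.org/biocLite.R")
biocLite("pvca")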
In the cell below we saved our expression set as a .Rda file so that we could easily access our data for the remainder of the project.
In [7]:
%%R
library(pvca)
save(bio_eset, file="../data/cleaned/bio_eset.Rda")
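For later notebooks, the saved object can be restored with load. The sketch below shows the round trip using only base R; toy_eset is a hypothetical stand-in for bio_eset, and a temporary file is used in place of ../data/cleaned/bio_eset.Rda so the example leaves no files behind.

```r
# Hypothetical stand-in object; in the notebook this would be bio_eset.
toy_eset <- list(values = c(1.5, 2.5), samples = c("sample_1", "sample_2"))

# Write to a temporary file so the sketch has no side effects on the project.
rda_path <- tempfile(fileext = ".Rda")
save(toy_eset, file = rda_path)

# load() restores objects under their original names and returns those names.
# A fresh environment keeps the restored copy apart from the one in memory.
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)

print(loaded_names)
print(identical(restored$toy_eset, toy_eset))
```

Because load restores objects under the names they were saved with, a notebook that loads bio_eset.Rda gets a variable called bio_eset without any assignment.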