Data Cleaning

Load rmagic:



In [1]:

    
%load_ext rmagic

(From Notebook 1) We loaded Biobase and GEOquery, then used GEOquery's getGEO function to load bio_data.soft.gz into data, an S4 object.

S4 is R's formal system for object-oriented programming (classes with typed slots, generic functions, and methods) and is widely used in Bioconductor.
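As a minimal sketch of what "an S4 object" means, the snippet below defines a toy S4 class with a generic and a method. The class ToyDataset and the generic nFeatures are hypothetical names made up for illustration; they are not part of GEOquery or Biobase.

```r
library(methods)  # S4 machinery (attached by default in interactive R)

# Hypothetical toy class: two typed slots.
setClass("ToyDataset",
         slots = c(values = "matrix", title = "character"))

# A generic function plus a method that dispatches on the class.
setGeneric("nFeatures", function(object) standardGeneric("nFeatures"))
setMethod("nFeatures", "ToyDataset", function(object) nrow(object@values))

toy <- new("ToyDataset",
           values = matrix(1:6, nrow = 3),
           title = "example")

isS4(toy)        # TRUE
slotNames(toy)   # "values" "title"
nFeatures(toy)   # 3
```

The same inspection calls (isS4, slotNames, is) should also work on Bioconductor objects such as data, which is a quick way to see what an unfamiliar object contains.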



In [2]:

    
%%R

library(Biobase)
library(GEOquery)

bio_dir = '../data/raw/bio_data.soft.gz'
data = getGEO(filename=bio_dir)

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
    is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, sort, table, tapply, union, unique,
    unlist

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Setting options('download.file.method.GEOquery'='curl')

Creating an Expression Set Object

Since our primary interest is in expression data for the tissue samples, we needed to turn our GDS (GEO DataSet) object into an ExpressionSet object in order to take advantage of the powerful tools Bioconductor provides for exploring gene expression. To do this, we used the built-in GEOquery function GDS2eSet, which extracts expression data from GDS objects. We passed do.log2=TRUE so that the expression values are stored as base-2 logarithms, the scale commonly used in downstream expression analysis.



In [3]:

    
%%R

bio_eset = GDS2eSet(data, do.log2=TRUE)

File stored at: 
/tmp/RtmpfFNwMG/GPL570.annot.gz
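
To illustrate what a base-2 log transform does to expression values, here is a small sketch in base R. The matrix raw holds made-up intensities and is purely hypothetical; it only mirrors the kind of transform that do.log2=TRUE requests and is not taken from our dataset.

```r
# Hypothetical raw intensities: 3 probes x 2 samples (made-up values).
raw <- matrix(c(8, 64, 1024, 16, 128, 2048),
              nrow = 3,
              dimnames = list(c("probe_a", "probe_b", "probe_c"),
                              c("sample_1", "sample_2")))

# Base-2 logarithm of every entry.
logged <- log2(raw)
print(logged)

# On the log2 scale, ratios become differences:
# each probe doubles from sample_1 to sample_2, so every difference is 1.
fold_diff <- logged[, "sample_2"] - logged[, "sample_1"]
print(fold_diff)
```

Working on the log2 scale compresses the wide range of raw intensities and makes a two-fold change read as a difference of one unit.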

To ensure that the data were converted correctly, we examined bio_eset below.

First, we printed the object to confirm that it is an ExpressionSet.



In [4]:

    
%%R
print(bio_eset)

ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 76 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM918603 GSM918641 ... GSM918644 (76 total)
  varLabels: sample disease.state ... description (5 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
  fvarLabels: ID Gene title ... GO:Component ID (21 total)
  fvarMetadata: Column labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 22722829 
Annotation:

We then checked further that the data had been converted correctly. To this end, we retrieved several expression values for different probe set/sample pairs and checked that they matched the expression values in the original dataset. An example for probe set 1007_s_at in sample GSM918603 is below. Every data point we checked matched the original set.



In [5]:

    
%%R
print(bio_eset["1007_s_at", "GSM918603"])
print(exprs(bio_eset["1007_s_at", "GSM918603"]))

ExpressionSet (storageMode: lockedEnvironment)
assayData: 1 features, 1 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM918603
  varLabels: sample disease.state ... description (5 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at
  fvarLabels: ID Gene title ... GO:Component ID (21 total)
  fvarMetadata: Column labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 22722829 
Annotation:  
          GSM918603
1007_s_at  2.757211
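
A spot check like the one above can also be automated. The sketch below is written against toy matrices rather than our real objects: original, converted, and spot_check are hypothetical names, and the default transform = log2 is an assumption that the converted values are the base-2 logarithm of the source values (pass transform = identity if no transform was applied).

```r
# Hypothetical stand-ins: `original` for the source table,
# `converted` for the expression matrix of the converted object.
set.seed(1)
original <- matrix(runif(20, min = 1, max = 100), nrow = 5,
                   dimnames = list(paste0("probe_", 1:5),
                                   paste0("sample_", 1:4)))
converted <- log2(original)

# Compare n randomly chosen (probe, sample) pairs between the two matrices.
spot_check <- function(original, converted, n = 5, transform = log2) {
  rows <- sample(rownames(original), n, replace = TRUE)
  cols <- sample(colnames(original), n, replace = TRUE)
  idx <- cbind(rows, cols)  # two-column character matrix selects (row, col) pairs
  isTRUE(all.equal(transform(original[idx]), converted[idx]))
}

spot_check(original, converted)
```

Using all.equal rather than == tolerates tiny floating-point differences between the two sources.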

Saving the Cleaned File

We installed the Bioconductor package pvca and then saved our cleaned and properly formatted data set to the ../data/cleaned directory within our project repository. The save and load functions that write and read .Rda files are part of base R, so they are available whether or not pvca is loaded.



In [6]:

    
%%R
source("http://bioconductor.org/biocLite.R")
biocLite("pvca")

Bioconductor version 3.0 (BiocInstaller 1.15.1), ?biocLite for help
BioC_mirror: http://bioconductor.org
Using Bioconductor version 3.0 (BiocInstaller 1.15.1), R version 3.1.0.
Installing package(s) 'pvca'
trying URL 'http://bioconductor.org/packages/3.0/bioc/src/contrib/pvca_1.5.0.tar.gz'
Content type 'application/x-gzip' length 128153 bytes (125 Kb)
opened URL
==================================================
downloaded 125 Kb


The downloaded source packages are in
	‘/tmp/RtmpfFNwMG/downloaded_packages’

In the cell below, we saved our expression set as an .Rda file so that we could easily reload it for the remainder of the project.



In [7]:

    
%%R

library(pvca)

save(bio_eset, file="../data/cleaned/bio_eset.Rda")
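
To confirm that a saved .Rda file restores the same object, a round trip can be sketched as follows. The data frame toy and the temporary path are hypothetical placeholders standing in for bio_eset and ../data/cleaned/bio_eset.Rda.

```r
# Hypothetical small object standing in for the expression set.
toy <- data.frame(id = c("a", "b"), value = c(1.5, 2.5))

# Write to a temporary file so nothing in the project is touched.
path <- file.path(tempdir(), "toy.Rda")
save(toy, file = path)

# load() restores objects under their original names and returns those names.
# Loading into a fresh environment avoids overwriting anything in the workspace.
env <- new.env()
restored_names <- load(path, envir = env)

print(restored_names)       # "toy"
identical(env$toy, toy)     # TRUE
```

Because load() restores objects under the names they were saved with, later notebooks that load bio_eset.Rda will find the data in a variable called bio_eset.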

Team members responsible for this notebook:

Team member 1: Yucheng Zhang - made expression set and wrote markdown, saved eset as .Rda file.

Team member 2: Renee Rao - figured out limma and Bioconductor.

Team member 3: Philip Hong - helped troubleshoot Bioconductor and limma installation.

Team member 4: Rebeccah Lavelle Terhune - helped troubleshoot Bioconductor and limma installation.