Compile Probe Annotations


In [16]:
import NotebookImport
from Imports import *


importing IPython notebook from Imports

Got to do this in rpy2


In [1]:
%load_ext rpy2.ipython

In [2]:
import pandas as pd

from pandas.rpy.common import convert_robj
import rpy2.robjects as robjects
from pandas.rpy.common import convert_to_r_dataframe, convert_robj

In [3]:
%%R
require('IlluminaHumanMethylation450kanno.ilmn12.hg19')
ann = IlluminaHumanMethylation450kanno.ilmn12.hg19
data = ann@data


Loading required package: IlluminaHumanMethylation450kanno.ilmn12.hg19
Loading required package: minfi
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    duplicated, eval, evalq, Filter, Find, get, intersect, is.unsorted,
    lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rep.int, rownames,
    sapply, setdiff, sort, table, tapply, union, unique, unlist

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: lattice
Loading required package: reshape
Loading required package: GenomicRanges
Loading required package: IRanges

Attaching package: ‘IRanges’

The following objects are masked from ‘package:reshape’:

    expand, rename

Loading required package: XVector
Loading required package: Biostrings
Loading required package: bumphunter
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Loading required package: locfit
locfit 1.5-9.1 	 2013-03-22

Attaching package: ‘locfit’

The following objects are masked from ‘package:GenomicRanges’:

    left, right


In [4]:
islands = convert_robj(robjects.r("data.frame(data[['Islands.UCSC']])"))
locations = convert_robj(robjects.r("data.frame(data[['Locations']])"))
other = convert_robj(robjects.r("data.frame(data[['Other']])"))
other = other.ix[:, 2:] #first two columns are probe sequences which take up a lot of space
snps = convert_robj(robjects.r("data.frame(data[['SNPs.Illumina']])"))

In [5]:
PATH = './Data/methylation450_annotations.csv'
probe_annotations = pd.read_csv(PATH, index_col=0)

In [7]:
store = pd.HDFStore(DATA_STORE)
store.append('islands', islands)
store.create_table_index('islands', optlevel=9, kind='full')
store.append('locations', locations)
store.create_table_index('locations', optlevel=9, kind='full')
store.append('other', other)
store.create_table_index('other', optlevel=9, kind='full')
store.append('snps', snps)
store.create_table_index('snps', optlevel=9, kind='full')
store.append('probe_annotations', probe_annotations)
store.create_table_index('probe_annotations', optlevel=9, kind='full')

Mapping Islands on to genes


In [11]:
ti = lambda s: s[s].index 
isl = islands.sort(['Islands_Name','Relation_to_Island']).dropna()
isl = isl[isl.Islands_Name.isin(ti(isl.Islands_Name.value_counts() > 7))]
isl = isl[isl.Islands_Name != '']

ot = other.Regulatory_Feature_Name
ot = ot[ot.isin(ti(ot.value_counts()> 7))]

gb = pd.concat([isl, probe_annotations], axis=1)
gb = gb[gb.Gene_Symbol.notnull() & gb.Islands_Name.notnull()]
g2 = gb.sort('Islands_Name')

top_gene = lambda s: s.Gene_Symbol.value_counts().index[0]
island_to_gene = g2.groupby('Islands_Name').apply(top_gene)
probe_to_island = isl
store['probe_to_island'] = probe_to_island
store['island_to_gene'] = island_to_gene

Read in Flow Sorted Data and Convert to More Usable Format


In [8]:
%%R
require(minfi)
require(minfiData)
require(doParallel)

referencePkg <- "FlowSorted.Blood.450k"
require(referencePkg, character.only = TRUE)
data(list = referencePkg)
referenceRGset <- get(referencePkg)


Loading required package: minfiData
Loading required package: IlluminaHumanMethylation450kmanifest
Loading required package: doParallel
Loading required package: FlowSorted.Blood.450k

In [9]:
%%R
Mset.quantile = preprocessQuantile(referenceRGset, removeBadSamples = TRUE)
beta = getBeta(Mset.quantile)


[preprocessQuantile] Mapping to genome.
[preprocessQuantile] Fixing outliers.
[preprocessQuantile] Quantile normalizing.

In [10]:
beta = convert_robj(robjects.r("data.frame(beta)"))

In [11]:
labels = convert_robj(robjects.r('referenceRGset$CellType'))
label_map = pd.Series(labels, beta.columns)

In [12]:
store['label_map'] = label_map

In [13]:
beta.to_hdf(store,'flow_sorted_data')

In [14]:
store.close()