Horvath Normalization Pipeline

Here we run the data normalization pipeline following Steve Horvath's method. Briefly he uses a variant of BMIQ which normalizes all of the data to a gold standard. His method only uses a subset of the probes, so we can throw out the majority of the data here.

Unlike the normalization with the full data I do not yet do an adjustment for cellular composition as this was not in Horvath's pipeline. Rather, I do the adjustment after the normalization.



In [1]:

    
cd ..









    



/cellar/users/agross/TCGA_Code/Methlation



In [2]:

    
import NotebookImport
from Setup.Imports import *









    



/cellar/users/agross/anaconda2/lib/python2.7/site-packages/IPython/nbformat/current.py:19: UserWarning: IPython.nbformat.current is deprecated.

- use IPython.nbformat for read/write/validate public API
- use IPython.nbformat.vX directly to composing notebooks of a particular version

  """)






    




importing IPython notebook from Setup/Imports






    



Populating the interactive namespace from numpy and matplotlib



In [3]:

    
import os as os
import pandas as pd
from pandas.rpy.common import convert_to_r_dataframe, convert_robj
import rpy2.robjects as robjects
from IPython.display import clear_output

Load Horvath normalization source into R namespace.



In [4]:

    
robjects.r.library('WGCNA');
robjects.r.source("/cellar/users/agross/Data/MethylationAge/Horvath/NORMALIZATION.R")
clear_output()

Read in Betas



In [5]:

    
path = '/cellar/users/agross/TCGA_Code/Methlation/data/'
f = path + 'all_betas_raw.csv'
df = pd.read_csv(f, low_memory=True, header=0, index_col=0)



In [6]:

    
labels = pd.read_csv(path + 'all_betas_raw_pdata.csv',
                     index_col=0)

Normalization Step



In [7]:

    
gold_standard = pd.read_csv('/cellar/users/agross/Data/MethylationAge/Horvath/probeAnnotation21kdatMethUsed.csv', index_col=0)
horvath = pd.read_table('/cellar/users/agross/TCGA_Code/Methlation/data/Horvath_Model.csv', index_col=0, skiprows=[0,1])
intercept = horvath.CoefficientTraining['(Intercept)']
horvath = horvath.iloc[1:]



In [8]:

    
df = df.ix[gold_standard.index]
df = df.T.fillna(gold_standard.goldstandard2).T



In [9]:

    
df_r = robjects.r.t(convert_to_r_dataframe(df))
gs = list(gold_standard.goldstandard2.ix[df.index])
gs_r = robjects.FloatVector(gs)



In [ ]:

    
del df



In [ ]:

    
data_n = robjects.r.BMIQcalibration(df_r, gs_r)
data_n = convert_robj(data_n).T
clear_output()

Now we need to fix the labels a little bit.



In [ ]:

    
c1 = pd.read_excel(ucsd_path + 'DESIGN_Fox_v2_Samples-ChipLAyout-Clinical UNMC-UCSD methylomestudy.xlsx', 
                   'HIV- samples from OldStudy', index_col=0)
c2 = pd.read_excel(ucsd_path + 'DESIGN_Fox_v2_Samples-ChipLAyout-Clinical UNMC-UCSD methylomestudy.xlsx', 
                   'HIV+ samples', index_col=0)
clinical = c1.append(c2)



In [ ]:

    
s = labels[labels.studyIndex == 's2'].sampleNames
ss = clinical[['Sample_Plate','Sample_Well']].sort(['Sample_Plate','Sample_Well'])
assert(alltrue(ss.Sample_Well == s))



In [ ]:

    
new_label = clinical.sort(['Sample_Plate','Sample_Well']).index
new_label = pd.Series(new_label, s.index)
new_names = labels['sampleNames'].replace(new_label.to_dict())
new_labels = labels['sampleNames'].ix[new_label.index] = new_label
new_labels = new_labels.combine_first(labels.sampleNames)



In [ ]:

    
labels['sampleNames'] = new_labels
new_labels2 = labels[labels.studyIndex == 's3'].sampleNames
new_labels2 = new_labels2.map(lambda s: '_'.join(s.split('_')[1:]))
new_labels2 = new_labels2.combine_first(new_labels)
labels['sampleNames'] = new_labels2



In [ ]:

    
data_n.columns = list(new_labels2)
data_n = data_n.astype(float)
data_n.to_hdf(HDFS_DIR + 'methylation_norm.h5','BMIQ_Horvath')









    



/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas/io/pytables.py:2441: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->axis0] [items->None]

  warnings.warn(ws, PerformanceWarning)
/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas/io/pytables.py:2441: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_items] [items->None]

  warnings.warn(ws, PerformanceWarning)



In [ ]:

    
labels.to_hdf(HDFS_DIR + 'methylation_norm.h5','labels')









    



/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas/io/pytables.py:2441: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['sampleNames', 'studyIndex']]

  warnings.warn(ws, PerformanceWarning)