BMIQ Normalization

Here we are running BMIQ normalization on all of our quantile-normalized data together. I am using the implementation provided by Steve Horvath along with his recent methylation-age paper. We are doing this as a result of this paper's recomendation to run both quantile normalization and BMIQ in series on the same datasets.

One very important step to note is that we are adjusting each patient's beta-values by the expected value given their cell composition. We are including this step to reduce the variablility in methylation levels that is purely resulting from differing cellular compositions. We are preforming this step after quantile normalization, but before the BMIQ step. This could be done in other sections of the pipeline but should not have too big of an impact on downstream results.


In [1]:
cd /cellar/users/agross/TCGA_Code/Methlation/


/cellar/users/agross/TCGA_Code/Methlation

In [2]:
import NotebookImport
from Setup.Imports import *


/cellar/users/agross/anaconda2/lib/python2.7/site-packages/IPython/nbformat/current.py:19: UserWarning: IPython.nbformat.current is deprecated.

- use IPython.nbformat for read/write/validate public API
- use IPython.nbformat.vX directly to composing notebooks of a particular version

  """)
importing IPython notebook from Setup/Imports
Populating the interactive namespace from numpy and matplotlib

In [3]:
from IPython.display import clear_output

Load Horvath normalization source into R namespace.


In [4]:
robjects.r.library('WGCNA');
robjects.r.source("/cellar/users/agross/Data/MethylationAge/Horvath/NORMALIZATION.R");
clear_output()

Read in quantile-normalized beta values and cell counts from the MINFI pipeline.


In [5]:
betas = pd.read_hdf(HDFS_DIR + 'dx_methylation.h5', 'betas')
cell_counts = pd.read_hdf(HDFS_DIR + 'dx_methylation.h5', 'cell_counts')

Read in cell compositions from probe annotations.


In [6]:
flow_sorted_data = pd.read_hdf(HDFS_DIR + 'methylation_annotation.h5','flow_sorted_data')
cell_type = pd.read_hdf(HDFS_DIR + 'methylation_annotation.h5', 'label_map')

In [7]:
n2 = flow_sorted_data.groupby(cell_type, axis=1).mean()
avg = n2[cell_counts.columns].dot(cell_counts.T)

Adjust data to account for differences in cellular composition.


In [ ]:
study = betas.columns
study = pd.Series(study.get_level_values(0), study.get_level_values(1))
betas.columns = betas.columns.get_level_values(1)

In [ ]:
cc = avg.mean(1)
adj = (betas - avg).add(cc, axis=0)
adj = adj.dropna(how='all', axis=1)

Here s1 is the Hannum data, which we used to train our model. We are normalizeing all of our data with the median value of the Hannum cohort as our reference for each probe.


In [ ]:
gold_standard_ah = adj.ix[:, ti(study=='s1')].median(1)

In [ ]:
del betas
del avg

In [ ]:
df = adj
df = df.ix[gold_standard_ah.index]
df = df.T.fillna(gold_standard_ah).T

In [ ]:
df_r = robjects.r.t(convert_to_r_dataframe(df))
gs = list(gold_standard_ah.ix[df.index])
gs_r = robjects.FloatVector(gs)

In [ ]:
del df

In [ ]:
data_n = robjects.r.BMIQcalibration(df_r, gs_r)
data_n = convert_robj(data_n).T
clear_output()

In [ ]:
data_n.columns = data_n.columns.map(lambda s: s.replace('.','-'))
data_n.columns = data_n.columns.map(lambda s: s[1:] if s.startswith('X') else s)

In [ ]:
store = pd.HDFStore(HDFS_DIR + 'methylation_norm.h5')
store.append('quant_BMIQ_adj', data_n)
store.create_table_index('quant_BMIQ_adj', optlevel=9, kind='full')