Data Processing Pipeline

Processing Summary

  • Primary processing done with the minfi R package, described in this notebook
    • All data quantile normalized together
    • Requires a large-memory machine (~100 GB)
  • Cell counts calculated with estimateCellCounts
  • Detection p-values obtained with the detectionP function
  • Data are read in from the flat .csv generated in the R pipeline and dumped into HDF5 files in this notebook (see the pandas sketch after this list)
    • Considerable savings in both time and memory usage
    • I like to keep these HDF5 files on my SSD to speed things up even more
  • All data normalized to a reference distribution via BMIQ
  • Each probe adjusted for cell composition (one approach is sketched after this list)
    • Hannum: quantile normalization -> adjustment -> BMIQ
    • Horvath: quantile normalization -> BMIQ -> adjustment
    • Exploration of this processing step is done here
  • Probe annotations obtained from the R Bioconductor 450k annotation package (link), dumped into HDF5 format in this notebook
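
A minimal sketch of the .csv-to-HDF5 round trip described above, using pandas; the file names and store key are hypothetical, not necessarily the ones used in the notebooks:

```python
import pandas as pd

# Hypothetical file names; the store key "betas" is also an assumption.
betas = pd.read_csv("quantile_normalized_betas.csv", index_col=0)
betas.to_hdf("methylation.h5", key="betas", mode="w")

# Downstream notebooks reload the matrix from the HDF5 store, which is much
# faster than re-parsing the flat .csv (more so when the .h5 lives on an SSD).
betas = pd.read_hdf("methylation.h5", "betas")
```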
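
The per-probe cell-composition adjustment can be sketched as follows: regress each probe's beta values on the cell fractions from estimateCellCounts and keep the residuals, re-centered at the probe mean. This is one standard way to do the adjustment, not necessarily the exact code used in these notebooks, and all names are hypothetical. Per the list above, the Hannum pipeline applies this step before BMIQ and the Horvath pipeline applies it after.

```python
import numpy as np
import pandas as pd

def adjust_cell_composition(betas: pd.DataFrame,
                            cell_counts: pd.DataFrame) -> pd.DataFrame:
    """Regress cell-composition effects out of each probe (illustrative).

    betas:       probes x samples matrix of beta values.
    cell_counts: samples x cell-types matrix, e.g. from estimateCellCounts.
    """
    cc = cell_counts.loc[betas.columns]                  # align samples to columns
    X = np.column_stack([np.ones(len(cc)), cc.values])   # intercept + fractions
    Y = betas.values.T                                   # samples x probes
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)         # per-probe least squares
    resid = (Y - X @ coef).T                             # remove fitted effects
    # Re-center residuals at each probe's mean so values stay on the beta scale.
    return pd.DataFrame(resid + betas.values.mean(axis=1, keepdims=True),
                        index=betas.index, columns=betas.columns)
```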

Pipeline Dependencies

  1. Process Raw Data
  2. Prepare data, save into HDF5
  3. Adjust for cell composition and perform BMIQ normalization

Datasets

HIV Dataset

  • Our dataset collected for this study
  • 142 cases, 50 controls

Hannum Dataset

EPIC Dataset

Summary

  • Summarized in Read HIV Data notebook
  • 142 cases, 50 controls
  • All participants are white; 2 report current alcohol use and 1 reports cannabis use
  • None are diabetic or HCV+
  • All are adherent to their medication

Filters (QC)

  • 5 patients not treated with HAART
  • 2 female controls removed
  • 2 controls removed due to data quality (minfi pipeline, probe detection criteria; see the sketch below)
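
The probe-detection criterion can be sketched like this, assuming the detection p-value matrix from minfi's detectionP: a sample is dropped when too large a fraction of its probes has a high detection p-value. The cutoffs here are illustrative defaults, not the study's exact values.

```python
import pandas as pd

def failing_samples(det_p: pd.DataFrame,
                    p_cut: float = 0.01, frac_cut: float = 0.01) -> pd.Index:
    """Flag samples where more than frac_cut of probes fail detection.

    det_p: probes x samples matrix of detection p-values (from detectionP).
    Both cutoffs are illustrative, not the exact thresholds used in the QC.
    """
    fail_frac = (det_p > p_cut).mean(axis=0)  # per-sample fraction of failures
    return fail_frac.index[fail_frac > frac_cut]
```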

Filters (age models)

Looking at agreement between the Hannum and Horvath models. The idea here is that when they disagree, it is likely a sign of poor data quality, or that the models do a poor job of describing the aging process for a particular patient (Relevant code). A minimal sketch of this check follows the list below.

  • Drop 5 cases, 1 control
  • 2 of the 6 dropped samples have non-zero detection p-values for probes used by one or both models
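
A minimal sketch of the agreement filter, assuming the predicted ages from each model are available as pandas Series; the 10-year cutoff and all names are hypothetical:

```python
import pandas as pd

def discordant_samples(hannum_age: pd.Series, horvath_age: pd.Series,
                       max_gap: float = 10.0) -> pd.Index:
    """Flag samples where the two age predictions disagree by > max_gap years."""
    gap = (hannum_age - horvath_age).abs()
    return gap.index[gap > max_gap]
```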