Guide to Running

Main Data Analysis

Here is where I do the main data analysis for the manuscript.

  • Unsupervised Age HIV Analysis
    Unsupervised analysis of age-associated probes, with the aim of showing a shared influence of age and HIV on the methylome. This contains the code for the generation of Figure 1.
  • HIV Age Advancement
    Here I read in the data and run the methylation age models.
  • HIV Age Advancement: Confounders
    Here I look at confounding from patients' blood composition, as well as the association of age advancement with other clinical variables we have available.
  • Figure 2
    Generation of Figure 2 for the manuscript.
  • Validation_figure
    Generation of Figure 3 of the manuscript.
  • Figure 3
    Generation of Figure 4 for the manuscript. Figure 3 was added in revisions, so I'm keeping this notebook's name to preserve its version-control history.
  • Figure 5_top
    Generation of Figures 5a and 5b for the manuscript. Also includes a look at general disorder in response to age and HIV, as well as some post-hoc analysis of the HLA and surrounding regions.
  • Figure_5_bottom
    Generation of Figure 5c-f for the manuscript.

Benchmarks, QC, etc.

Quality control and benchmarking that are important to the manuscript but not necessarily referred to directly in the paper.

  • Model Comparison
    Comparison of the Hannum and Horvath epigenetic aging models, compiling statistics on each across multiple datasets. I then construct a consensus model and use the agreement of the two models to filter out subjects that may not have good methylation data.
  • Sorted Cell Benchmark
    Analysis of methylation age models in populations of sorted cells.
  • Validation Cohort Power Calculation
    Power calculation done prior to validation of the age advancement effect in a smaller cohort of patients with DNA extracted from purified cells.
  • CD14 aging
    Analysis of epigenetic aging in 1202 patients with DNA extracted from a purified population of monocytes from this paper.
  • Lupus cohort, multiple sorted cell types
    Here I look at epigenetic aging in a case/control lupus cohort. I pulled this dataset because it contains a group of patients with sorted (purified) blood cell populations profiled across multiple different cell types. It shows that the epigenetic aging models are fairly robust across cell types.
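The consensus-and-agreement idea from the Model Comparison notebook can be sketched roughly as follows. The function name and the disagreement threshold here are illustrative only, not the values used in the manuscript:

```python
import pandas as pd

def consensus_age(hannum, horvath, max_disagreement=10.0):
    """Average the two models' predicted ages and drop subjects where
    the models disagree by more than `max_disagreement` years, on the
    theory that large disagreement hints at poor methylation data.

    Both inputs are Series of predicted ages indexed by subject ID.
    The threshold is purely illustrative.
    """
    keep = (hannum - horvath).abs() <= max_disagreement
    return ((hannum + horvath) / 2.0)[keep]
```

The key design point from the notebook description is that disagreement between two independently trained models is used as a quality filter, not just averaged away.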

Analysis Setup Notebooks

These are notebooks which I generally run as upstream dependencies of other notebooks; think of them as Python modules. I include these as notebooks rather than modules because there are some global variables and ad-hoc decisions being invoked, and I do not think it is appropriate to abstract them away into modules. Is this the best software development process? Likely not, but this is data analysis and this is also grad-ware, both of which have a certain level of sloppiness that is more or less unavoidable.

  • Imports
    Imports packages into the global namespace, sets up a few helper functions, and defines global path variables. If you are running this yourself, you must change the global parameters here.
  • DX Imports
    Sets up some helper functions for doing analysis of differential methylation and reads in probe annotations for methylation arrays.
  • Methylation Age Models
    Reads in the linear data for the two epigenetic aging models and creates helper functions to apply each model across a set of patients.
  • Read HIV Data
    This reads in the methylation data for our primary case-control cohort, as well as a lot of associated clinical data. I also do a little pre-filtering based on basic selection criteria (age range, HAART status, gender).
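As a rough illustration of the Methylation Age Models helpers, applying a linear aging model across a set of patients might look like the sketch below. The function name is hypothetical, and the real Hannum and Horvath pipelines also involve their own probe sets and transforms:

```python
import pandas as pd

def make_age_predictor(intercept, coefs):
    """Build a predictor from a linear aging model's intercept and
    per-probe coefficients (a Series indexed by probe ID).

    Hypothetical helper for illustration; the actual notebooks define
    their own versions with model-specific details.
    """
    def predict(betas):
        # betas: DataFrame of beta values, probes x patients
        shared = coefs.index.intersection(betas.index)
        return intercept + betas.loc[shared].T.dot(coefs[shared])
    return predict
```

Intersecting on probe IDs first keeps the dot product well defined when an array is missing some of the model's probes.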

Running Parallel Jobs for Linear Models

I have created various scripts for running linear models on the methylation data. With about half a million probes on the chip, this can take a while to run in series, so I use a relatively primitive map-reduce paradigm:

  • A dataset is saved in 100 chunks in HDF5 format
  • Covariates are saved in the same data-store
  • A Python script runs the linear models, taking the table chunk number as an input parameter
  • An SGE driver script is created to run a parallel job
  • The parallel job is run on the cluster
  • The results are merged back together on the local machine

In the Parallel folder, I have a number of notebooks which contain the hard-coded Python script and the code for generating the SGE driver. Our cluster shares its filesystem with our network storage, so some modifications to the protocol will likely be needed to apply this on other systems.
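The chunked save/merge steps above can be sketched as follows. The key names and helper functions here are hypothetical, a minimal stand-in for what the Parallel notebooks actually do:

```python
import numpy as np
import pandas as pd

N_CHUNKS = 100  # matches the 100-chunk layout described above

def chunk_frame(df, n_chunks=N_CHUNKS):
    """Row-wise split of a probes-x-samples table into roughly
    equal pieces."""
    return [df.loc[idx] for idx in np.array_split(df.index, n_chunks)]

def save_chunks(df, store_path, n_chunks=N_CHUNKS):
    """Write each chunk under its own HDF5 key so a worker job can
    read only the one chunk it was assigned (requires PyTables)."""
    with pd.HDFStore(store_path, mode='w') as store:
        for i, piece in enumerate(chunk_frame(df, n_chunks)):
            store.put('chunk_{}'.format(i), piece)

def merge_chunks(frames):
    """Reassemble per-chunk results after the cluster run."""
    return pd.concat(frames)
```

Because each worker reads a single key, the jobs never contend for the same rows, and the final reduce step is just a concatenation on the local machine.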

Pre Processing

Here I am doing some pre-processing on the raw datasets to make them more usable for downstream analysis.

This stuff is pretty computationally intensive. If you want to run this part do the following things:

  • Ask yourself if you really need to do this or if you can start with the data already processed (I can provide this in a nice small binary file which is much easier to handle).
  • Get yourself a pretty high-memory machine. You are going to need to do quantile normalization on about 2000 450k arrays; this took up about 100GB of memory. You could probably figure out how to do this out of core, but it's only really a one-time thing, so probably not necessary.
  • Download the raw EPIC dataset from GEO link.
  • Contact me for the raw dataset from Hannum et al. (This data is on GEO but not in the most raw form... I will try and fix this soon)

Methylation Normalization Protocol

This mainly uses functions provided in the R bioconductor minfi package.

  • Read in all of the raw .IDAT methylation files for our datasets as well as the two controls
  • Extract detection p-values for all probes across all samples
  • Run quantile normalization on the concatenated dataset
  • Run estimateCellCounts on the concatenated dataset
  • Save the results of this analysis

Saving things in HDF5 Format

These 450k arrays can get pretty unwieldy due to their size when trying to do in-memory data analysis. To help with this, and generally to speed up I/O, I like to store these data as Pandas HDF5 data objects. This format has the advantage of being much faster than handling everything in .csvs, as you would generally do with expression datasets.
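A minimal sketch of the HDF5 round trip with Pandas is below. The file path, key name, and helper functions are arbitrary placeholders, not the repository's actual layout, and `to_hdf` requires the PyTables package:

```python
import pandas as pd

def save_betas(betas, path):
    # Store a probes-x-samples beta-value table under one key.
    betas.to_hdf(path, key='betas', mode='w')

def load_betas(path):
    # Read the whole table back into memory.
    return pd.read_hdf(path, 'betas')
```

For large tables, reading back a binary HDF5 file avoids re-parsing half a million rows of text every time a notebook starts.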

BMIQ Processing

We run quantile normalization in the minfi processing pipeline, but it has been recommended to also run BMIQ on the normalized data. In addition, the processing protocol for use of the Horvath aging model requires a modified version of this. This is done for various datasets in the following notebooks.