Here is where I do the main data analysis for the manuscript.
Quality control and benchmarking that is important to the manuscript, but not necessarily directly referred to in the paper.
These are notebooks which I generally run as upstream dependencies of other notebooks; think of them as Python modules. I include these as notebooks rather than modules because there are some global variables and ad-hoc decisions being invoked, and I do not think it is appropriate to abstract them away into modules. Is this the best software development process? Likely not, but this is data analysis and this is also grad-ware, both of which have a certain level of sloppiness which is more or less unavoidable.
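I have created various scripts for running linear models on the methylation data. With about half a million probes on the chip, this can take a while to do in series, so I use a relatively primitive map-reduce paradigm: the probes are split into chunks, each chunk is fit as a separate job, and the per-chunk results are merged at the end.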
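A minimal sketch of the chunk, map, and reduce steps, to make the idea concrete. This is not the code from the notebooks: the function names, the toy data, and the single-covariate least-squares fit are all assumptions for illustration, and the chunks run in series here rather than as cluster jobs.

```python
from statistics import mean


def fit_slope(x, y):
    """Least-squares slope of y on x (one covariate plus an intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def chunk(items, size):
    """Split a list of probe IDs into consecutive chunks of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(probe_ids, betas, covariate):
    """Map step: fit one model per probe in this chunk."""
    return {p: fit_slope(covariate, betas[p]) for p in probe_ids}


def reduce_results(partials):
    """Reduce step: merge the per-chunk result dictionaries into one."""
    merged = {}
    for part in partials:
        merged.update(part)
    return merged


# Toy data (hypothetical): three probes measured in four samples.
age = [30, 40, 50, 60]
betas = {
    "probe_a": [0.10, 0.20, 0.30, 0.40],
    "probe_b": [0.80, 0.70, 0.60, 0.50],
    "probe_c": [0.50, 0.50, 0.50, 0.50],
}

chunks = chunk(sorted(betas), size=2)
# On a cluster each chunk would be its own job; here they simply run in series.
partials = [map_chunk(c, betas, age) for c in chunks]
slopes = reduce_results(partials)
```

Because each probe is fit independently, the chunk size only trades off the number of jobs against the run time per job; it does not change the results.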
In the Parallel folder, I have a number of notebooks which contain the hard-coded Python script and the code for generating the SGE driver. Our cluster shares its filesystem with our network storage, so applying this to a different system will likely require some modifications to the protocol.
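For orientation, generating an SGE driver from Python can look roughly like the sketch below. The function name, the script name `fit_chunk.py`, the job name, and the log directory are hypothetical and not taken from the notebooks; the `#$` directives are standard SGE options for an array job, but queue and resource settings will differ between clusters.

```python
def make_sge_driver(script_path, n_chunks, job_name="lm_chunks", log_dir="logs"):
    """Build the text of an SGE array-job driver, one task per chunk."""
    lines = [
        "#!/bin/bash",
        "#$ -S /bin/bash",        # shell used to run the job
        "#$ -cwd",                # run from the submission directory
        f"#$ -N {job_name}",      # job name
        f"#$ -t 1-{n_chunks}",    # array job: task IDs 1..n_chunks
        f"#$ -o {log_dir}",       # where stdout logs go
        f"#$ -e {log_dir}",       # where stderr logs go
        "",
        # SGE sets SGE_TASK_ID for each task; the script uses it to pick its chunk.
        f"python {script_path} $SGE_TASK_ID",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical usage: 50 chunks processed by a script called fit_chunk.py.
driver = make_sge_driver("fit_chunk.py", n_chunks=50)
print(driver)
```

The resulting text would be written to a file and submitted with `qsub`. This assumes every node sees the same paths, which holds on a shared filesystem like ours but not necessarily elsewhere.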
Here I am doing some pre-processing on the raw datasets to make them more usable for downstream analysis.
This stuff is pretty computationally intensive. If you want to run this part, do the following things:
Methylation Normalization Protocol
This mainly uses functions provided in the R Bioconductor minfi package.
These 450k arrays can get pretty unwieldy due to their size when trying to do in-memory data analysis. To help out with this and generally speed up I/O, I like to store these data as Pandas HDF5 data objects. This data format has the advantage of being much faster than handling everything in .csv files like you would generally do with expression datasets.
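As a rough illustration of the storage pattern, here is a small round trip through HDF5 with pandas. The function names, the key `betas`, and the toy values are assumptions rather than what the notebooks actually use, and pandas needs the optional PyTables package (`tables`) installed for HDF5 support.

```python
import os
import tempfile

import pandas as pd


def example_betas():
    """Tiny stand-in for a probes-by-samples table of beta values."""
    return pd.DataFrame(
        {"sample_1": [0.12, 0.85], "sample_2": [0.10, 0.91]},
        index=["probe_a", "probe_b"],
    )


def save_betas(df, path, key="betas"):
    """Write the table to an HDF5 file, replacing any existing file."""
    df.to_hdf(path, key=key, mode="w")


def load_betas(path, key="betas"):
    """Read the table back from the HDF5 file."""
    return pd.read_hdf(path, key=key)


def roundtrip(df):
    """Save and reload a table via a temporary HDF5 file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "betas.h5")
        save_betas(df, path)
        return load_betas(path)
```

Calling `roundtrip(example_betas())` should return the same table that went in. With the real arrays, the gain comes from reading a binary file instead of parsing hundreds of thousands of rows of text.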
We run quantile normalization in the minfi processing pipeline, but it has been recommended to also run BMIQ on this normalized data. In addition, the processing protocol for use of the Horvath aging model requires a modified version of this. This is done for various datasets in the following notebooks.