In this tutorial, we'll walk through downloading and preprocessing the compendium of ENCODE and Epigenomics Roadmap data.

This part is less interactive than a typical IPython tutorial, because most of the work is done by command-line scripts.

First, cd to the data directory in a terminal and run the script get_dnase.sh.

That will download all of the BED files from ENCODE and Epigenomics Roadmap. Read the script to see where I'm getting those files from. More may become available in the future, in which case you'll want to edit the links.

Once that has finished, we need to merge all of the BED files into one BED and an activity table.

I typically use the -y option to exclude the Y chromosome, since I don't know which samples were sequenced from male cells and which from female cells.

I'll use my defaults of extending the sequences to 600 bp (-s 600) and merging sites that overlap by more than 200 bp (-m 200), but you might want to change these values.


In [4]:
!cd ../data; preprocess_features.py -y -m 200 -s 600 -o er -c genomes/human.hg19.genome sample_beds.txt
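
To make the -s and -m settings more concrete, here is a minimal Python sketch of the two ideas described above: resizing each site to a fixed length around its midpoint, and counting how many bases two sites share. It is an illustration only, not the code in preprocess_features.py. The function names `extend_site` and `overlap_bp` and the example coordinates are hypothetical, and the real script may treat chromosome ends, strands, and the coordinates of merged sites differently.

```python
def extend_site(start, end, length=600):
    """Resize a [start, end) interval to `length` bp centered on its midpoint."""
    mid = (start + end) // 2
    new_start = max(0, mid - length // 2)  # clamp at the chromosome start
    return new_start, new_start + length


def overlap_bp(a, b):
    """Number of bases shared by two [start, end) intervals on one chromosome."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


# Two hypothetical peaks on the same chromosome, extended to 600 bp each.
site_a = extend_site(1000, 1150)
site_b = extend_site(1300, 1420)

# With -m 200, sites sharing more than 200 bp would be candidates for merging.
should_merge = overlap_bp(site_a, site_b) > 200
print(site_a, site_b, overlap_bp(site_a, site_b), should_merge)
```

Raising the merge threshold keeps more nearby sites separate, while lowering it collapses them into fewer, broader sites.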

To convert the sequences to the format needed by Torch, we'll first convert to FASTA.


In [6]:
!bedtools getfasta -fi ../data/genomes/hg19.fa -bed ../data/er.bed -s -fo ../data/er.fa
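
Before moving on, it can be worth checking that every extracted sequence has the expected length. The sketch below is a plain-Python sanity check, not part of the Basset scripts. The helper names `fasta_lengths` and `unexpected_lengths` are hypothetical, and the code assumes the FASTA headers are unique.

```python
def fasta_lengths(path):
    """Return a dict mapping each FASTA header to its sequence length."""
    lengths = {}
    header = None
    with open(path) as fasta:
        for line in fasta:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                lengths[header] = 0
            elif header is not None:
                lengths[header] += len(line)  # sequences may span several lines
    return lengths


def unexpected_lengths(path, expected=600):
    """List (header, length) pairs whose length differs from `expected`."""
    return [(h, n) for h, n in fasta_lengths(path).items() if n != expected]
```

For example, `unexpected_lengths("../data/er.fa")` returns an empty list if every site came out at 600 bp. Any entries it does return are worth inspecting before building the HDF5 file.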

Finally, we convert to HDF5 for Torch and set aside some data for validation and testing.

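The options are as follows: -r permutes the sequences, -c tells the script that we're providing raw counts, -v sets the size of the validation set, and -t sets the size of the test set. Note that the command below does not pass -r; add it if you want the sequences permuted.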


In [7]:
!seq_hdf5.py -c -t 71886 -v 70000 ../data/er.fa ../data/er_act.txt ../data/er.h5


Ignoring header line
1880000 training sequences 
71886 test sequences 
70000 validation sequences 
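
The counts above add up as expected: the test and validation sets are carved out of the total, and everything left over is used for training. Here is a minimal sketch of that bookkeeping, assuming a simple permute-then-slice scheme. The function `split_indexes` is hypothetical, and the order in which seq_hdf5.py actually sets aside each subset is an assumption.

```python
import random


def split_indexes(num_seqs, test_size, valid_size, permute=True, seed=1):
    """Split sequence indexes into train, validation, and test lists."""
    if test_size + valid_size > num_seqs:
        raise ValueError("test and validation sets exceed the number of sequences")
    order = list(range(num_seqs))
    if permute:
        random.Random(seed).shuffle(order)  # similar in spirit to -r
    test = order[:test_size]
    valid = order[test_size:test_size + valid_size]
    train = order[test_size + valid_size:]  # everything left over
    return train, valid, test


# Total taken from the output above: 1880000 + 70000 + 71886 sequences.
total = 1880000 + 70000 + 71886
train, valid, test = split_indexes(total, test_size=71886, valid_size=70000)
print(len(train), len(valid), len(test))
```

Permuting before slicing matters when neighboring sequences in the input are related, for example when they are sorted by genomic position, because otherwise the held-out sets would come from one contiguous stretch of the data.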

And you're good to go!