In this tutorial, we'll walk through running Basset by adding your datasets to the full compendium of ENCODE and Epigenomics Roadmap DNaseI hypersensitivity sites. These datasets make for a rich training set.

The input you bring to the pipeline is:

  • BED file(s) of peak calls

First, let's grab some data from the original ATAC-seq paper here http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47753


In [25]:
!wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47753/suppl/GSE47753_CD4%2B_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz
!mv GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz atac_cd4.bed.gz
!gunzip -f atac_cd4.bed.gz


--2015-09-18 11:14:06--  ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47753/suppl/GSE47753_CD4%2B_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz
           => 'GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz'
Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::13
Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /geo/series/GSE47nnn/GSE47753/suppl ... done.
==> SIZE GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz ... 908175
==> PASV ... done.    ==> RETR GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz ... done.
Length: 908175 (887K) (unauthoritative)

100%[======================================>] 908,175     2.36MB/s   in 0.4s   

2015-09-18 11:14:06 (2.36 MB/s) - 'GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz' saved [908175]
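
This step is optional, but it's worth a quick glance at the peak calls we just downloaded; the first three columns of a BED file are chromosome, start, and end.


In [ ]:
# glance at the first few peak calls (chromosome, start, end, ...)
!head -n 3 atac_cd4.bed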

Let's first arrange those BED files into a table for Basset. Each line of the table is a sample label, a tab, and the path to that sample's BED file. Here we have just one sample, but if you had, say, a before- and an after-treatment BED file, you'd add a line for each (see the sketch after the next cell).


In [26]:
samples_out = open('cd4_sample.txt', 'w')
print >> samples_out, 'CD4+\tatac_cd4.bed'
samples_out.close()
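
As a sketch of what a multi-sample table would look like, here's the same cell with a second, hypothetical BED file added (atac_cd4_treated.bed is a made-up name used only for illustration).


In [ ]:
# hypothetical two-sample version of the table above;
# 'atac_cd4_treated.bed' is a placeholder file name
samples_out = open('cd4_sample.txt', 'w')
print >> samples_out, 'CD4+\tatac_cd4.bed'
print >> samples_out, 'CD4+_treated\tatac_cd4_treated.bed'
samples_out.close()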

Next, we need to merge this BED file into the ENCODE/Roadmap database to form one combined BED file and an activity table.

(If you want to change the parameters here for extension size and/or allowed overlap, you'll need to go back to the BED files. Check out the tutorial prepare_compendium.ipynb.)

I typically use the -y option to avoid the Y chromosome, since I don't know which samples were sequenced from male versus female cells.

I'll use my default of extending the sequences to 600 bp, and merging sites that overlap by more than 200 bp.

-a and -b together let us provide the database activity table and BED file respectively.


In [28]:
!preprocess_features.py -y -m 200 -s 600 -a ../data/encode_roadmap_act.txt -b ../data/encode_roadmap.bed -o db_cd4 -c ../data/genomes/human.hg19.genome cd4_sample.txt
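
As a rough, optional sanity check (assuming the activity table is plain tab-delimited text with a header line), counting the fields in its first line should reflect the ENCODE/Roadmap targets plus our new CD4+ column.


In [ ]:
# rough check: count tab-delimited fields in the first line of the activity table
!head -n 1 db_cd4_act.txt | awk -F'\t' '{print NF}'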

To convert the sequences to the format needed by Torch, we'll first convert to FASTA.


In [29]:
!bedtools getfasta -fi ../data/genomes/hg19.fa -bed db_cd4.bed -s -fo db_cd4.fa
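
Another optional check: the extracted sequences should all be 600 bp after the extension above. This assumes each sequence sits on a single line, which is how getfasta writes its output by default.


In [ ]:
# print the unique sequence lengths in the FASTA (sequence lines are the even lines)
!awk 'NR % 2 == 0 {print length($0)}' db_cd4.fa | sort -u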

Let's see how many sequences we have in the merged set.


In [30]:
!wc -l db_cd4.bed


   91261 db_cd4.bed

~91k by my count, so let's take out 3k for validation and 3k for testing, leaving ~85k for training.

Finally, we convert to HDF5 for Torch.

-r permutes the sequences.

-c informs the script we're providing raw counts.

-v specifies the size of the validation set.

-t specifies the size of the test set.


In [32]:
!seq_hdf5.py -r -c -v 3000 -t 3000 db_cd4.fa db_cd4_act.txt db_cd4.h5


85261 training sequences 
3000 test sequences 
3000 validation sequences 
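
If you'd like to confirm what was written to the HDF5 file (optional, and assuming h5py is installed in this environment), you can list its datasets and their shapes.


In [ ]:
# list the datasets written by seq_hdf5.py and their shapes
import h5py
h5 = h5py.File('db_cd4.h5', 'r')
for key in sorted(h5.keys()):
    # datasets report a shape; any non-dataset entries print blank
    print key, getattr(h5[key], 'shape', '')
h5.close()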

Now we're ready to train!

Let's define the hyper-parameters of the network.


In [ ]:
# hyper-parameter settings for basset_train.lua go in this file (specific values omitted here)
params_out = open('params.txt', 'w')
print >> params_out, ''
params_out.close()

And let 'er rip!


In [ ]:
!basset_train.lua -job params.txt -save db_cd4_cnn db_cd4.h5

After training, the model will be saved in db_cd4_cnn_best.th.