In this tutorial, we'll walk through running Basset on only your own dataset.

The input you bring to the pipeline is:

  • BED file(s) of peak calls

First, let's grab some data from the original ATAC-seq paper, available here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47753


In [25]:
!wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47753/suppl/GSE47753_CD4%2B_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz
!mv GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz atac_cd4.bed.gz
!gunzip -f atac_cd4.bed.gz


--2015-09-18 11:14:06--  ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47753/suppl/GSE47753_CD4%2B_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz
           => 'GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz'
Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::13
Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /geo/series/GSE47nnn/GSE47753/suppl ... done.
==> SIZE GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz ... 908175
==> PASV ... done.    ==> RETR GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz ... done.
Length: 908175 (887K) (unauthoritative)

100%[======================================>] 908,175     2.36MB/s   in 0.4s   

2015-09-18 11:14:06 (2.36 MB/s) - 'GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz' saved [908175]
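
Before moving on, it's worth a quick peek at the peak calls we just downloaded (an optional check; output not shown):


In [ ]:
!head -3 atac_cd4.bed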

Let's first arrange those BED files into a table for Basset: one row per sample, with a sample label and the path to its BED file separated by a tab. If you had two BED files representing a before and after treatment, each would get its own row; here we have just the one CD4+ sample.


In [26]:
samples_out = open('cd4_sample.txt', 'w')
print >> samples_out, 'CD4+\tatac_cd4.bed'
samples_out.close()
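
For reference, the two-BED-file scenario above would just mean two rows in the table (a sketch; these file names are hypothetical):


In [ ]:
samples_out = open('treatment_sample.txt', 'w')
# one row per sample: <label> <tab> <BED path>; file names here are made up
print >> samples_out, 'pre\tatac_pre.bed'
print >> samples_out, 'post\tatac_post.bed'
samples_out.close()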

Next, we need a set of negative training examples. If you came in with a more diverse set of input data, then you could probably count on having enough negative sequences for each sample. But here, we'll grab a set of negative sequences from the compendium of ENCODE and Epigenomics Roadmap data. There are about 43k positive examples in this BED file, so I'll grab an additional 50k sequences to match them. (Note: run install_data.py to get this dataset.)

(I should note that I haven't experimented much with what makes the best negative set, so feel encouraged to explore!)


In [27]:
!basset_sample.py ../data/encode_roadmap.bed ../data/encode_roadmap_act.txt 50000 neg
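
The final argument to basset_sample.py is an output prefix, so the sampled negatives should land in neg.bed (the file that -b points at below). A quick count verifies we got the 50k we asked for (output not shown):


In [ ]:
!wc -l neg.bed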

Next, we need to merge all of the BED files into one BED file and an activity table.

I typically use the -y option to avoid the Y chromosome, since I don't know which samples were sequenced from male versus female cells.

I'll use my defaults of extending the sequences to 600 bp and merging sites that overlap by more than 200 bp.

-n and -b together let us provide the negative examples and tell Basset to treat them accordingly.


In [28]:
!preprocess_features.py -y -m 200 -s 600 -b neg.bed -n -o learn_cd4 -c ../data/genomes/human.hg19.genome cd4_sample.txt
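
The -o prefix determines the output names: learn_cd4.bed holds the merged regions, and learn_cd4_act.txt the per-sample activity labels we'll feed to the later steps. A quick peek at each is a reasonable sanity check (output not shown):


In [ ]:
!head -3 learn_cd4.bed learn_cd4_act.txt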

To convert the sequences to the format needed by Torch, we'll first extract them as FASTA.


In [29]:
!bedtools getfasta -fi ../data/genomes/hg19.fa -bed learn_cd4.bed -s -fo learn_cd4.fa

Let's see how many sequences we have in the merged set.


In [30]:
!wc -l learn_cd4.bed


   91261 learn_cd4.bed

~91k by my count, so let's take out 3k for validation and 3k for testing, leaving ~85k for training.

Finally, we convert to HDF5 for Torch.

-r permutes the sequences. -c informs the script we're providing raw counts. -v specifies the size of the validation set. -t specifies the size of the test set.


In [32]:
!seq_hdf5.py -c -r -v 3000 -t 3000 learn_cd4.fa learn_cd4_act.txt learn_cd4.h5


85261 training sequences 
3000 test sequences 
3000 validation sequences 
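
Optionally, we can confirm what landed in the HDF5 file from Python (a quick sketch with h5py, assuming seq_hdf5.py writes flat datasets, one per train/validation/test split):


In [ ]:
import h5py
h5f = h5py.File('learn_cd4.h5', 'r')
for key in sorted(h5f.keys()):
    print key, h5f[key].shape  # dataset name and dimensions
h5f.close()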

Now we're ready to train!

Let's define the hyper-parameters of the network. basset_train.lua reads these from a job file of name/value pairs, one per line; any parameter you leave out falls back on the script's default.


In [1]:
params_out = open('params.txt', 'w')
# one 'name value' pair per line; we leave it empty here, deferring to defaults
print >> params_out, ''
params_out.close()
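
If you'd rather control the architecture yourself, you can fill in the job file explicitly. Below is a hedged sketch: the values are patterned on the architecture described in the Basset paper, and the parameter names are assumptions to verify against the example job files in the Basset repo.


In [ ]:
params_out = open('params.txt', 'w')
# illustrative hyper-parameters; verify names/values against Basset's examples
print >> params_out, 'conv_filters\t300,200,200'   # filters per conv layer
print >> params_out, 'conv_filter_sizes\t19,11,7'  # conv filter widths
print >> params_out, 'pool_width\t3,4,4'           # max-pooling widths
print >> params_out, 'hidden_units\t1000,1000'     # fully connected sizes
print >> params_out, 'hidden_dropouts\t0.3,0.3'    # dropout probabilities
print >> params_out, 'learning_rate\t0.002'
print >> params_out, 'momentum\t0.98'
params_out.close()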

And let 'er rip!


In [ ]:
!basset_train.lua -job params.txt -save cd4_cnn learn_cd4.h5

After training, the model will be saved to cd4_cnn_best.th.
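
From here, a natural next step is to measure accuracy on the held-out test set with Basset's basset_test.lua script. The invocation below is an assumption about its argument order (model, data, output directory), so check the script's usage before running:


In [ ]:
!basset_test.lua cd4_cnn_best.th learn_cd4.h5 test_eval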