In this tutorial, we'll walk through running Basset by adding your datasets to the full compendium of ENCODE and Epigenomics Roadmap DNaseI hypersensitivity sites. These datasets are a rich description of regulatory activity across many cell types, and training alongside them lets the model learn from far more data than your experiment alone provides.
The input you bring to the pipeline is one or more BED files of the open chromatin regions called in your samples.
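If the format is unfamiliar: a BED file is plain text with one region per line, and only the first three tab-separated columns (chromosome, start, end) are strictly needed here. A made-up example:

chr1	713984	714547
chr1	762073	763459
chr2	839587	840565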
First, let's grab some data from the original ATAC-seq paper, available from GEO here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47753
In [25]:
!wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47753/suppl/GSE47753_CD4%2B_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz
!mv GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz atac_cd4.bed.gz
!gunzip -f atac_cd4.bed.gz
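Before moving on, a quick sanity check that the download and decompression worked never hurts; standard shell tools are enough (output not shown):

In [ ]:
!wc -l atac_cd4.bed
!head -3 atac_cd4.bed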
Let's first arrange those BED files into a sample table for Basset. Each row is one sample: a label, a tab, and the path to its BED file. Say you have two BED files representing a before and after treatment; the table would then get one row per file (see the sketch after the next cell). Here, we have just the one CD4+ sample.
In [26]:
# One row per sample: label <tab> BED file path.
samples_out = open('cd4_sample.txt', 'w')
samples_out.write('CD4+\tatac_cd4.bed\n')
samples_out.close()
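For a multi-sample design like the before/after example above, the table simply grows by one row per sample. A hypothetical sketch (the labels and the files atac_pre.bed and atac_post.bed are made up for illustration):

In [ ]:
# Hypothetical two-sample table: pre- and post-treatment peak calls.
samples_out = open('treatment_sample.txt', 'w')
samples_out.write('pre\tatac_pre.bed\n')
samples_out.write('post\tatac_post.bed\n')
samples_out.close()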
Next, we need to merge this BED file into the ENCODE/Roadmap database to form one combined BED file and an activity table.
(If you want to change the parameters here for extension size and/or allowed overlap, you'll need to go back to the BED files; check out the tutorial prepare_compendium.ipynb.)
I typically use the -y option to exclude the Y chromosome, since I don't know whether each sample sequenced male or female cells.
I'll use my defaults of extending the sequences to 600 bp and merging sites that overlap by more than 200 bp. (That is, each site is stretched to 600 bp around its midpoint, and sites sharing more than 200 bp are collapsed into one.)
The -a and -b options provide the database activity table and BED file, respectively.
In [28]:
!preprocess_features.py -y -m 200 -s 600 -a ../data/encode_roadmap_act.txt -b ../data/encode_roadmap.bed -o db_cd4 -c ../data/genomes/human.hg19.genome cd4_sample.txt
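The -o argument sets the output prefix, so the merged results land in db_cd4.bed and db_cd4_act.txt (both used below). It's worth a quick peek at each before moving on (output not shown; cut just trims the activity table to its first few columns):

In [ ]:
!head -3 db_cd4.bed
!head -3 db_cd4_act.txt | cut -f 1-4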
To get the sequences into the format needed by Torch, we'll first extract them as FASTA.
In [29]:
!bedtools getfasta -fi ../data/genomes/hg19.fa -bed db_cd4.bed -s -fo db_cd4.fa
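As an optional check, the number of FASTA entries should match the number of BED lines; grep -c '>' counts the header lines:

In [ ]:
!grep -c '>' db_cd4.fa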
Let's see how many sequences we have in the merged set.
In [30]:
!wc -l db_cd4.bed
The merged set is on the order of two million sequences (the database itself contributes roughly two million sites), so let's take out 75k for validation and 75k for testing.
Finally, we convert to HDF5 for Torch.
-r permutes the sequences (not used in the command below).
-c informs the script we're providing raw counts.
-v specifies the size of the validation set.
-t specifies the size of the test set.
In [32]:
!seq_hdf5.py -c -v 75000 -t 75000 db_cd4.fa db_cd4_act.txt db_cd4.h5
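If you'd like to verify the HDF5 contents before training, h5py makes it easy. The datasets printed should include train/valid/test inputs and targets; the exact keys are whatever your version of seq_hdf5.py writes, so treat this as a generic inspection sketch:

In [ ]:
import h5py

# List every dataset in the file along with its shape.
h5 = h5py.File('db_cd4.h5', 'r')
for key in sorted(h5.keys()):
    print('%-14s %s' % (key, h5[key].shape))
h5.close()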
Now we're ready to train!
Let's define the hyper-parameters of the network.
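basset_train.lua reads the hyper-parameters from a text file of key/value pairs (passed via -job below). Here's a sketch of writing such a params.txt, modeled loosely on the parameter files shipped with the Basset repo; the keys and values are assumptions to check against your copy of the code, not a tuned recipe:

In [ ]:
# Assumed hyper-parameter keys; verify against the example params
# files in the Basset repo before relying on them.
params_out = open('params.txt', 'w')
params_out.write('conv_filters 300,200,200\n')    # filters per convolution layer
params_out.write('conv_filter_sizes 19,11,7\n')   # convolution filter widths
params_out.write('pool_width 3,4,4\n')            # max-pooling widths
params_out.write('hidden_units 1000,1000\n')      # fully connected layer sizes
params_out.write('hidden_dropouts 0.3,0.3\n')     # dropout rates on those layers
params_out.write('learning_rate 0.002\n')
params_out.write('momentum 0.98\n')
params_out.close()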
And let 'er rip!
In [ ]:
!basset_train.lua -job params.txt -save db_cd4_cnn db_cd4.h5
After training, the best model will be saved to db_cd4_cnn_best.th.