In this tutorial, we'll walk through running Basset on your own dataset alone.
The input you bring to the pipeline is a BED file (or several) of accessible genomic regions, one per sample.
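Basset only needs the standard first three BED columns (chromosome, start, end); a file might begin like this (coordinates made up for illustration):

chr1	713852	714442
chr1	762036	763324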
First, let's grab some data from the original ATAC-seq paper: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47753
In [25]:
!wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47753/suppl/GSE47753_CD4%2B_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz
!mv GSE47753_CD4+_ATACseq_AllDays_AllReps_ZINBA_pp08.bed.gz atac_cd4.bed.gz
!gunzip -f atac_cd4.bed.gz
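Let's take a quick peek at what we downloaded:

In [ ]:
!head -3 atac_cd4.bed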
Let's first arrange that BED file into a sample table for Basset. Each row pairs a sample label with its BED file path; if you had several samples (say, before and after a treatment), each would get its own row (see the example after the next cell).
In [26]:
samples_out = open('cd4_sample.txt', 'w')
print('CD4+\tatac_cd4.bed', file=samples_out)
samples_out.close()
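For a multi-sample experiment, the table would just gain more rows. Purely for illustration (these BED file names are hypothetical):

In [ ]:
samples_out = open('treatment_sample.txt', 'w')
# Hypothetical before/after treatment samples; one 'label<TAB>bed_path' row each.
print('pre_treatment\tatac_pre.bed', file=samples_out)
print('post_treatment\tatac_post.bed', file=samples_out)
samples_out.close()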
Next, we need a set of negative training examples. If you came in with a more diverse set of input data, you could probably count on having enough negative sequences for each sample. But here, we'll grab a set of negative sequences from the compendium of ENCODE and Epigenomics Roadmap data. There are about 43k positive examples in this BED file, so I'll grab an additional 50k sequences to roughly match them. (Note: run install_data.py to get this dataset.)
(I should note that I haven't experimented much with what makes the best negative set, so feel encouraged to explore!)
In [27]:
!basset_sample.py ../data/encode_roadmap.bed ../data/encode_roadmap_act.txt 50000 neg
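If you're curious what that step amounts to: conceptually it just draws random rows from the compendium BED and its activity table. A minimal sketch of the BED side in Python (this is not basset_sample.py's actual code):

In [ ]:
import random

# Sketch: draw 50,000 random lines from the compendium BED.
# basset_sample.py also subsets the matching activity table rows.
with open('../data/encode_roadmap.bed') as bed_in:
    lines = bed_in.readlines()
random.seed(1)
with open('neg_sketch.bed', 'w') as bed_out:
    bed_out.writelines(random.sample(lines, 50000))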
Next, we need to merge all of the BED files into one BED and an activity table.
I typically use the -y option to avoid the Y chromosome, since I don't know whether each sample's cells came from male or female donors.
I'll use my default of extending the sequences to 600 bp, and merging sites that overlap by more than 200 bp.
-n and -b together let us provide the negative examples and tell Basset to treat them accordingly.
In [28]:
!preprocess_features.py -y -m 200 -s 600 -b neg.bed -n -o learn_cd4 -c ../data/genomes/human.hg19.genome cd4_sample.txt
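To build intuition for the -s option: extending to 600 bp means re-centering each peak and growing (or trimming) it to a fixed width, roughly like this sketch (not the preprocess_features.py code):

In [ ]:
def extend_interval(start, end, width=600):
    # Grow or trim a BED interval to a fixed width around its midpoint.
    mid = (start + end) // 2
    new_start = max(0, mid - width // 2)
    return new_start, new_start + width

print(extend_interval(713852, 714442))  # -> (713847, 714447)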
To convert the sequences to the format needed by Torch, we'll first convert to FASTA.
In [29]:
!bedtools getfasta -fi ../data/genomes/hg19.fa -bed learn_cd4.bed -s -fo learn_cd4.fa
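A quick check that the FASTA looks sensible:

In [ ]:
!head -2 learn_cd4.fa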
Let's see how many sequences we have in the merged set.
In [30]:
!wc -l learn_cd4.bed
~91k by my count, so let's take out 3k for validation and 3k for testing, which leaves ~85k for training.
Finally, we convert to HDF5 for Torch.
-r permutes the sequences. -c informs the script we're providing raw counts. -v specifies the size of the validation set. -t specifies the size of the test set.
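Under the hood, each sequence gets one-hot encoded into a 4 x length binary matrix. A minimal sketch of that encoding (the A/C/G/T row order here is illustrative; seq_hdf5.py also handles the shuffling and the train/valid/test split):

In [ ]:
import numpy as np

def one_hot(seq):
    # Encode a DNA string as a 4 x len(seq) binary matrix (rows: A, C, G, T).
    idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    mat = np.zeros((4, len(seq)), dtype=np.uint8)
    for i, nt in enumerate(seq.upper()):
        if nt in idx:  # ambiguous bases like N stay all-zero
            mat[idx[nt], i] = 1
    return mat

print(one_hot('ACGTN'))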
In [32]:
!seq_hdf5.py -c -r -v 3000 -t 3000 learn_cd4.fa learn_cd4_act.txt learn_cd4.h5
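To verify the split, you can peek inside the HDF5 file (assuming you have the h5py package installed):

In [ ]:
import h5py
h5 = h5py.File('learn_cd4.h5', 'r')
for key in h5.keys():
    print(key, h5[key].shape)
h5.close()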
Now we're ready to train!
Let's define the hyper-parameters of the network.
In [1]:
params_out = open('params.txt', 'w')
# params.txt takes one 'name value' hyper-parameter per line; leaving it
# empty here leans on basset_train.lua's defaults.
print('', file=params_out)
params_out.close()
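If you want to override the defaults, you'd fill params.txt with one name/value pair per line. Purely as an illustration (these parameter names are my recollection of Basset's example params files, so verify them against the ones shipped in the repo):

In [ ]:
params_out = open('params.txt', 'w')
# Illustrative values only; confirm the parameter names against
# the example params files in the Basset repo before training.
print('conv_filters 300', file=params_out)
print('conv_filter_sizes 19', file=params_out)
print('pool_width 3', file=params_out)
print('hidden_units 500', file=params_out)
print('learning_rate 0.002', file=params_out)
params_out.close()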
And let 'er rip!
In [ ]:
!basset_train.lua -job params.txt -save cd4_cnn learn_cd4.h5
After training, the model will be saved in cd4_cnn_best.th.