Preparing a dataset for Basenji training involves a series of design choices.
The inputs you bring to the pipeline are a genome FASTA file and BigWig coverage tracks for the experiments you want to model.
First, make sure you have an hg19 FASTA file available. If you already have one, put a symbolic link to it in the data directory. Otherwise, I have a machine-learning-friendly simplified version you can download in the next cell.
In [1]:
import os, subprocess

if not os.path.isfile('data/hg19.ml.fa'):
    # download the simplified hg19 FASTA and its faidx index
    subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
    subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)
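As a quick optional check (not part of the pipeline itself), you can peek at the FASTA index to confirm the download and see which chromosomes it covers:
In [ ]:
! head -n3 data/hg19.ml.fa.fai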
Next, let's grab a few CAGE datasets from FANTOM5 related to heart biology.
These data have already been processed into BigWig coverage tracks, so we can use them directly.
In [2]:
if not os.path.isfile('data/CNhs11760.bw'):
    subprocess.call('curl -o data/CNhs11760.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs11760.bw', shell=True)
    subprocess.call('curl -o data/CNhs12843.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12843.bw', shell=True)
    subprocess.call('curl -o data/CNhs12856.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12856.bw', shell=True)
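If you want to verify that the coverage tracks downloaded correctly, a minimal sketch like the following opens one with pyBigWig (assumed installed; the track and the 1 Mb window queried here are arbitrary, purely illustrative choices):
In [ ]:
import pyBigWig

# optional sanity check on one of the downloaded CAGE tracks
bw = pyBigWig.open('data/CNhs11760.bw')
print(len(bw.chroms()), 'chromosomes')            # chromosomes covered by the track
print(bw.stats('chr1', 0, 1000000, type='mean'))  # mean coverage over an arbitrary 1 Mb window
bw.close()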
Then we'll write out these BigWig files and labels to a samples table.
In [3]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])
samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
    print('\t'.join(line), file=samples_out)
samples_out.close()
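As an optional check, we can print the table back to confirm its contents:
In [ ]:
! cat data/heart_wigs.txt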
Next, we want to choose genomic sequences to form batches for stochastic gradient descent, divide them into training/validation/test sets, and construct TFRecords to provide to downstream programs.
The script basenji_data.py implements this procedure.
The most relevant options here are:
Option/Argument | Value | Note |
---|---|---|
-d | 0.1 | Down-sample the genome to 10% to speed things up here. |
-g | data/unmap_macro.bed | Dodge large-scale unmappable regions like assembly gaps. |
-l | 131072 | Sequence length. |
--local | True | Run locally, as opposed to on my SLURM scheduler. |
-o | data/heart_l131k | Output directory. |
-p | 8 | Use multiple concurrent processes to read/write. |
-t | .1 | Hold out 10% of the sequences for testing. |
-v | .1 | Hold out 10% of the sequences for validation. |
-w | 128 | Pool the nucleotide-resolution values to 128 bp bins. |
fasta_file | data/hg19.ml.fa | FASTA file to extract sequences from. |
targets_file | data/heart_wigs.txt | Target samples table with BigWig paths. |
In [10]:
! basenji_data.py -d .1 -g data/unmap_macro.bed -l 131072 --local -o data/heart_l131k -p 8 -t .1 -v .1 -w 128 data/hg19.ml.fa data/heart_wigs.txt
Now, data/heart_l131k contains relevant data for training.
contigs.bed contains the original large contiguous regions from which training sequences were taken (possibly strided). sequences.bed contains the train/valid/test sequences.
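You can list the output directory to see what was produced before inspecting individual files:
In [ ]:
! ls data/heart_l131k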
In [11]:
! cut -f4 data/heart_l131k/sequences.bed | sort | uniq -c
In [12]:
! head -n3 data/heart_l131k/sequences.bed
In [13]:
! grep valid data/heart_l131k/sequences.bed | head -n3
In [14]:
! grep test data/heart_l131k/sequences.bed | head -n3
In [15]:
! ls -l data/heart_l131k/tfrecords/*.tfr
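To confirm the TFRecords are readable, a minimal sketch like the one below parses the first serialized example from the first shard. The ZLIB compression setting is an assumption about how basenji_data.py writes its records; if parsing fails, check the source of your version and adjust.
In [ ]:
import glob
import tensorflow as tf

# Peek at the first serialized example of the first TFRecord shard.
# ASSUMPTION: records are ZLIB-compressed; try compression_type='' if this fails.
tfr_files = sorted(glob.glob('data/heart_l131k/tfrecords/*.tfr'))
dataset = tf.data.TFRecordDataset(tfr_files[:1], compression_type='ZLIB')

for raw in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    # print the feature keys stored in the record (the sequence and target tensors)
    print(sorted(example.features.feature.keys()))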