Preparing a dataset for Basenji training involves a series of design choices.

The inputs you bring to the pipeline are:

  • BigWig coverage tracks
  • Genome FASTA file

First, make sure you have an hg19 FASTA file available. If you already have one, place a symbolic link to it in the data directory. Otherwise, I provide a machine-learning-friendly, simplified version that you can download in the next cell.


In [1]:
import os, subprocess

if not os.path.isfile('data/hg19.ml.fa'):
    subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
    subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)
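
If you already have a suitable hg19 FASTA on disk, the symbolic link mentioned above can be created as sketched below; the source path is a placeholder, and an accompanying .fai index (e.g. from samtools faidx) is expected alongside it.

In [ ]:
import os

# '/path/to/hg19.ml.fa' is a placeholder for wherever your FASTA (and its .fai index) lives.
existing_fasta = '/path/to/hg19.ml.fa'

os.makedirs('data', exist_ok=True)
if not os.path.isfile('data/hg19.ml.fa'):
    os.symlink(existing_fasta, 'data/hg19.ml.fa')
    os.symlink(existing_fasta + '.fai', 'data/hg19.ml.fa.fai')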

Next, let's grab a few CAGE datasets from FANTOM5 related to heart biology.

These data were processed by

  1. Aligning with Bowtie2 using very sensitive alignment parameters.
  2. Distributing multi-mapping reads and estimating genomic coverage with bam_cov.py (a rough command sketch follows).
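
We won't rerun that preprocessing here, but for reference the commands would look roughly like the sketch below. The index name, FASTQ file, and thread counts are placeholders, and the bam_cov.py arguments in particular are an assumption; check bowtie2 --help and bam_cov.py --help before running anything like this.

In [ ]:
import subprocess

# For reference only: a rough sketch of the upstream processing (not needed for this tutorial).
# 'hg19_index' and 'CNhs11760.fastq.gz' are placeholders, and the bam_cov.py
# arguments are assumptions; consult each tool's --help.
upstream_cmds = [
    'bowtie2 --very-sensitive -p 4 -x hg19_index -U CNhs11760.fastq.gz -S CNhs11760.sam',
    'samtools sort -@ 4 -o CNhs11760.bam CNhs11760.sam',
    'samtools index CNhs11760.bam',
    'bam_cov.py CNhs11760.bam data/CNhs11760.bw',
]

run_upstream = False  # flip to True only if you have the raw reads and tools installed
if run_upstream:
    for cmd in upstream_cmds:
        subprocess.call(cmd, shell=True)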

In [2]:
if not os.path.isfile('data/CNhs11760.bw'):
    subprocess.call('curl -o data/CNhs11760.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs11760.bw', shell=True)
    subprocess.call('curl -o data/CNhs12843.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12843.bw', shell=True)
    subprocess.call('curl -o data/CNhs12856.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12856.bw', shell=True)

Then we'll write out a samples table listing these BigWig files and their labels.


In [3]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])

samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
    print('\t'.join(line), file=samples_out)
samples_out.close()
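
As a quick sanity check, the table can be read back with pandas the same way basenji_data.py reads it (tab-separated, indexed by the first column):

In [ ]:
import pandas as pd

# Re-load the samples table to confirm it was written as expected.
targets_df = pd.read_csv('data/heart_wigs.txt', sep='\t', index_col=0)
print(targets_df)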

Next, we want to choose genomic sequences to form batches for stochastic gradient descent, divide them into training/validation/test sets, and construct TFRecords to provide to downstream programs.

The script basenji_data.py implements this procedure.

The most relevant options here are:

| Option/Argument | Value | Note |
|---|---|---|
| -d | 0.1 | Down-sample the genome to 10% to speed things up here. |
| -g | data/unmap_macro.bed | Dodge large-scale unmappable regions like assembly gaps. |
| -l | 131072 | Sequence length. |
| --local | True | Run locally, as opposed to on my SLURM scheduler. |
| -o | data/heart_l131k | Output directory. |
| -p | 8 | Use multiple concurrent processes to read/write. |
| -t | .1 | Hold out 10% of sequences for testing. |
| -v | .1 | Hold out 10% of sequences for validation. |
| -w | 128 | Pool the nucleotide-resolution values into 128 bp bins. |
| fasta_file | data/hg19.ml.fa | FASTA file to extract sequences from. |
| targets_file | data/heart_wigs.txt | Target samples table with BigWig paths. |

With -l 131072 and -w 128, each sequence is summarized into 131072/128 = 1024 target bins.

In [10]:
! basenji_data.py -d .1 -g data/unmap_macro.bed -l 131072 --local -o data/heart_l131k -p 8 -t .1 -v .1 -w 128 data/hg19.ml.fa data/heart_wigs.txt


stride_train 1 converted to 131072.000000
stride_test 1 converted to 131072.000000
Contigs divided into
 Train:  4701 contigs, 2169074921 nt (0.8005)
 Valid:   572 contigs,  270358978 nt (0.0998)
 Test:    584 contigs,  270330829 nt (0.0998)
/Users/drk/code/Basenji/bin/basenji_data.py:263: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  targets_df = pd.read_table(targets_file, index_col=0)
basenji_data_read.py --crop 0 -w 128 -u sum -c 384.000000 -s 1.000000 data/CNhs11760.bw data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov/0.h5
basenji_data_read.py --crop 0 -w 128 -u sum -c 384.000000 -s 1.000000 data/CNhs12843.bw data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov/1.h5
basenji_data_read.py --crop 0 -w 128 -u sum -c 384.000000 -s 1.000000 data/CNhs12856.bw data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov/2.h5
basenji_data_write.py -s 0 -e 256 data/hg19.ml.fa data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov data/heart_l131k/tfrecords/train-0.tfr &> data/heart_l131k/tfrecords/train-0.err
basenji_data_write.py -s 256 -e 512 data/hg19.ml.fa data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov data/heart_l131k/tfrecords/train-1.tfr &> data/heart_l131k/tfrecords/train-1.err
basenji_data_write.py -s 512 -e 768 data/hg19.ml.fa data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov data/heart_l131k/tfrecords/train-2.tfr &> data/heart_l131k/tfrecords/train-2.err
basenji_data_write.py -s 768 -e 1024 data/hg19.ml.fa data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov data/heart_l131k/tfrecords/train-3.tfr &> data/heart_l131k/tfrecords/train-3.err
basenji_data_write.py -s 1024 -e 1280 data/hg19.ml.fa data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov data/heart_l131k/tfrecords/train-4.tfr &> data/heart_l131k/tfrecords/train-4.err
basenji_data_write.py -s 1280 -e 1499 data/hg19.ml.fa data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov data/heart_l131k/tfrecords/train-5.tfr &> data/heart_l131k/tfrecords/train-5.err
basenji_data_write.py -s 1499 -e 1679 data/hg19.ml.fa data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov data/heart_l131k/tfrecords/valid-0.tfr &> data/heart_l131k/tfrecords/valid-0.err
basenji_data_write.py -s 1679 -e 1858 data/hg19.ml.fa data/heart_l131k/sequences.bed data/heart_l131k/seqs_cov data/heart_l131k/tfrecords/test-0.tfr &> data/heart_l131k/tfrecords/test-0.err

Now, data/heart_l131k contains relevant data for training.

contigs.bed contains the original large contiguous regions from which training sequences were taken (possibly strided).

sequences.bed contains the train/valid/test sequences.


In [11]:
! cut -f4 data/heart_l131k/sequences.bed | sort | uniq -c


 179 test
1499 train
 180 valid

In [12]:
! head -n3 data/heart_l131k/sequences.bed


chr22	30113858	30244930	train
chr8	76979074	77110146	train
chr3	89515975	89647047	train

In [13]:
! grep valid data/heart_l131k/sequences.bed | head -n3


chr9	132920986	133052058	valid
chr9	47160133	47291205	valid
chr13	79498687	79629759	valid

In [14]:
! grep test data/heart_l131k/sequences.bed | head -n3


chr14	55143579	55274651	test
chr3	4825239	4956311	test
chr19	48710409	48841481	test

In [15]:
! ls -l data/heart_l131k/tfrecords/*.tfr


-rw-r--r--  1 drk  staff  10600666 Nov 10 17:41 data/heart_l131k/tfrecords/test-0.tfr
-rw-r--r--  1 drk  staff  15181536 Nov 10 17:41 data/heart_l131k/tfrecords/train-0.tfr
-rw-r--r--  1 drk  staff  15182571 Nov 10 17:41 data/heart_l131k/tfrecords/train-1.tfr
-rw-r--r--  1 drk  staff  15180501 Nov 10 17:41 data/heart_l131k/tfrecords/train-2.tfr
-rw-r--r--  1 drk  staff  15173820 Nov 10 17:41 data/heart_l131k/tfrecords/train-3.tfr
-rw-r--r--  1 drk  staff  15160686 Nov 10 17:41 data/heart_l131k/tfrecords/train-4.tfr
-rw-r--r--  1 drk  staff  12962227 Nov 10 17:41 data/heart_l131k/tfrecords/train-5.tfr
-rw-r--r--  1 drk  staff  10687464 Nov 10 17:41 data/heart_l131k/tfrecords/valid-0.tfr
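
If you want to peek inside one of these TFRecords, something like the cell below works. The ZLIB compression setting is an assumption about how basenji_data_write.py serializes the examples, so try '' or 'GZIP' if reading fails.

In [ ]:
import tensorflow as tf

# Inspect the first serialized example in a training TFRecord.
# compression_type='ZLIB' is an assumption; adjust if the read fails.
dataset = tf.data.TFRecordDataset('data/heart_l131k/tfrecords/train-0.tfr',
                                  compression_type='ZLIB')
for raw_record in dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    # Print the feature keys without assuming their names.
    print(list(example.features.feature.keys()))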

In [ ]: