Although Basenji is unaware of the locations of known genes in the genome, we can go in afterwards and ask what a model predicts for those locations to interpret it as a gene expression prediction.
To do this, you'll need
First, make sure you have an hg19 FASTA file visible. If you have it already, put a symbolic link into the data directory. Otherwise, I have a machine learning friendly simplified version you can download in the next cell.
In [1]:
import os, subprocess
if not os.path.isfile('data/hg19.ml.fa'):
subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)
Next, let's grab a few CAGE datasets from FANTOM5 related to heart biology.
These data were processed by
In [2]:
if not os.path.isfile('data/CNhs11760.bw'):
subprocess.call('curl -o data/CNhs11760.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs11760.bw', shell=True)
subprocess.call('curl -o data/CNhs12843.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12843.bw', shell=True)
subprocess.call('curl -o data/CNhs12856.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12856.bw', shell=True)
Then we'll write out these BigWig files and labels to a samples table.
In [3]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])
samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
print('\t'.join(line), file=samples_out)
samples_out.close()
Predictions in the portion of the genome that we trained might inflate our accuracy, so we'll focus on chr9 genes, which have formed my typical test set. Then we use basenji_hdf5_genes.py to create the file.
The most relevant options are:
Option/Argument | Value | Note |
---|---|---|
-g | data/human.hg19.genome | Genome assembly chromosome length to bound gene sequences. |
-l | 262144 | Sequence length. |
-c | 0.333 | Multiple genes per sequence are allowed, but the TSS must be in the middle 1/3 of the sequence. |
-p | 3 | Use 3 threads via |
-t | data/heart_wigs.txt | Save coverage values from this table of BigWig files. |
-w | 128 | Bin the coverage values at 128 bp resolution. |
fasta_file | data/hg19.ml.fa | Genome FASTA file for extracting sequences. |
gtf_file | data/gencode_chr9.gtf | Gene annotations in gene transfer format. |
hdf5_file | data/gencode_chr9_l262k_w128.h5 | Gene sequence output HDF5 file. |
In [4]:
! basenji_hdf5_genes.py -g data/human.hg19.genome -l 131072 -c 0.333 -p 3 -t data/heart_wigs.txt -w 128 data/hg19.ml.fa data/gencode_chr9.gtf data/gencode_chr9.h5
Now, you can either train your own model in the Train/test tutorial or download one that I pre-trained.
In [5]:
if not os.path.isdir('models/heart'):
os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.tf.meta'):
subprocess.call('curl -o models/heart/model_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.index', shell=True)
subprocess.call('curl -o models/heart/model_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.meta', shell=True)
subprocess.call('curl -o models/heart/model_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.data-00000-of-00001', shell=True)
Finally, you can offer data/gencode_chr9_l262k_w128.h5 and the model to basenji_test_genes.py to make gene expression predictions and benchmark them.
The most relevant options are:
Option/Argument | Value | Note |
---|---|---|
-o | data/gencode_chr9_test | Output directory. |
--rc | Average the forward and reverse complement to form prediction. | |
-s | Make scatter plots, comparing predictions to experiment values. | |
--table | Print gene expression table. | |
params_file | models/params_small.txt | Table of parameters to setup the model architecture and optimization. |
model_file | models/gm12878_best.tf | Trained saved model prefix. |
genes_hdf5_file | data/gencode_chr9_l262k_w128.h5 | HDF5 file containing the gene sequences, annotations, and experiment values. |
In [6]:
! basenji_test_genes.py -o output/gencode_chr9_test --rc -s --table models/params_small.txt models/heart/model_best.tf data/gencode_chr9.h5
In the output directory output/gencode_chr9_test/ are several tables and plots describing gene prediction accuracy. For example gene_cors.txt contains Spearman and Pearson correlations for predictions versus experimental measurements for all genes and nonzero genes.
In [7]:
! cat output/gencode_chr9_test/gene_cors.txt
gene_table.txt.gz contains specific gene predictions and experimental measurements.
In [8]:
! gunzip -c output/gencode_chr9_test/gene_table.txt.gz | head
And gene_scatterX.pdf plots gene predictions versus experimental measurements for each dataset indexed by X.
In [9]:
from IPython.display import IFrame
IFrame('output/gencode_chr9_test/gene_scatter0.pdf', width=600, height=500)
Out[9]:
In [ ]: