Although Basenji is unaware of the locations of known genes in the genome, we can go in afterwards and ask what a model predicts for those locations to interpret it as a gene expression prediction.

To do this, you'll need

Trained model
Gene Transfer Format (GTF) gene annotations
BigWig coverage tracks
Gene sequences saved in my HDF5 format.

First, make sure you have an hg19 FASTA file visible. If you have it already, put a symbolic link into the data directory. Otherwise, I have a machine learning friendly simplified version you can download in the next cell.



In [1]:

    
import os, subprocess

if not os.path.isfile('data/hg19.ml.fa'):
    subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
    subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)

Next, let's grab a few CAGE datasets from FANTOM5 related to heart biology.

These data were processed by

Aligning with Bowtie2 with very sensitive alignment parameters.
Distributing multi-mapping reads and estimating genomic coverage with bam_cov.py



In [2]:

    
if not os.path.isfile('data/CNhs11760.bw'):
    subprocess.call('curl -o data/CNhs11760.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs11760.bw', shell=True)
    subprocess.call('curl -o data/CNhs12843.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12843.bw', shell=True)
    subprocess.call('curl -o data/CNhs12856.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12856.bw', shell=True)

Then we'll write out these BigWig files and labels to a samples table.



In [3]:

    
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])

samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
    print('\t'.join(line), file=samples_out)
samples_out.close()

Predictions in the portion of the genome that we trained might inflate our accuracy, so we'll focus on chr9 genes, which have formed my typical test set. Then we use basenji_hdf5_genes.py to create the file.

The most relevant options are:

Option/Argument	Value	Note
-g	data/human.hg19.genome	Genome assembly chromosome length to bound gene sequences.
-l	262144	Sequence length.
-c	0.333	Multiple genes per sequence are allowed, but the TSS must be in the middle 1/3 of the sequence.
-p	3	Use 3 threads via
-t	data/heart_wigs.txt	Save coverage values from this table of BigWig files.
-w	128	Bin the coverage values at 128 bp resolution.
fasta_file	data/hg19.ml.fa	Genome FASTA file for extracting sequences.
gtf_file	data/gencode_chr9.gtf	Gene annotations in gene transfer format.
hdf5_file	data/gencode_chr9_l262k_w128.h5	Gene sequence output HDF5 file.



In [4]:

    
! basenji_hdf5_genes.py -g data/human.hg19.genome -l 131072 -c 0.333 -p 3 -t data/heart_wigs.txt -w 128 data/hg19.ml.fa data/gencode_chr9.gtf data/gencode_chr9.h5









    



/Users/davidkelley/code/Basenji/bin/basenji_hdf5_genes.py:356: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  target_wigs_df = pd.read_table(target_wigs_file, index_col=0)
/Users/davidkelley/code/Basenji/bin/basenji_hdf5_genes.py:212: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  target_wigs_df = pd.read_table(options.target_wigs_file, index_col=0)

Now, you can either train your own model in the Train/test tutorial or download one that I pre-trained.



In [5]:

    
if not os.path.isdir('models/heart'):
    os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.tf.meta'):
    subprocess.call('curl -o models/heart/model_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.index', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.meta', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.data-00000-of-00001', shell=True)

Finally, you can offer data/gencode_chr9_l262k_w128.h5 and the model to basenji_test_genes.py to make gene expression predictions and benchmark them.

The most relevant options are:

Option/Argument	Value	Note
-o	data/gencode_chr9_test	Output directory.
--rc		Average the forward and reverse complement to form prediction.
-s		Make scatter plots, comparing predictions to experiment values.
--table		Print gene expression table.
params_file	models/params_small.txt	Table of parameters to setup the model architecture and optimization.
model_file	models/gm12878_best.tf	Trained saved model prefix.
genes_hdf5_file	data/gencode_chr9_l262k_w128.h5	HDF5 file containing the gene sequences, annotations, and experiment values.



In [6]:

    
! basenji_test_genes.py -o output/gencode_chr9_test --rc -s --table models/params_small.txt models/heart/model_best.tf data/gencode_chr9.h5









    



{'batch_size': 4, 'batch_buffer': 4096, 'link': 'softplus', 'loss': 'poisson', 'optimizer': 'adam', 'adam_beta1': 0.97, 'adam_beta2': 0.98, 'learning_rate': 0.002, 'num_targets': 3, 'target_pool': 128, 'seq_length': 131072, 'target_length': 1024, 'cnn_dropout': 0.1, 'cnn_filter_sizes': [20, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 1], 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0], 'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]}
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Targets pooled by 128 to length 1024
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:53: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:74: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.batch_normalization instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:82: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:103: max_pooling1d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/seqnn.py:295: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
Convolution w/ 3 384x1 filters to final targets
Computing gene predictions.2019-04-17 13:51:34.980700: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
 Done in 265s.
Computing correlations./Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
/Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
  warnings.warn("tight_layout : falling back to Agg renderer")
/Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
/Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
  warnings.warn("tight_layout : falling back to Agg renderer")
 Done in 0s.
Printing predictions. Done in 3s.

In the output directory output/gencode_chr9_test/ are several tables and plots describing gene prediction accuracy. For example gene_cors.txt contains Spearman and Pearson correlations for predictions versus experimental measurements for all genes and nonzero genes.



In [7]:

    
! cat output/gencode_chr9_test/gene_cors.txt









    



0       0.607    0.687    0.573    0.591  CNhs11760 aorta
1       0.530    0.649    0.559    0.511  CNhs12843 artery
2       0.570    0.672    0.589    0.655  CNhs12856 pulmonic_valve

gene_table.txt.gz contains specific gene predictions and experimental measurements.



In [8]:

    
! gunzip -c output/gencode_chr9_test/gene_table.txt.gz | head









    



ENSG00000277631.4_1   2.906  0.462     0                 aorta
ENSG00000227917.2_1   0.000  0.024     0                 aorta
ENSG00000231808.2_1   0.762  1.200     0                 aorta
ENSG00000170122.5_1   0.000  0.118     0                 aorta
ENSG00000172785.18_2  5.128  2.111     0                 aorta
ENSG00000183784.6_1   0.068  0.218     0                 aorta
ENSG00000235880.1_1   3.355  0.068     0                 aorta
ENSG00000224836.1_1   0.000  0.017     0                 aorta
ENSG00000236594.2_1   0.000  0.045     0                 aorta
ENSG00000225893.1_1   1.054  0.044     0                 aorta
gunzip: error writing to output: Broken pipe
gunzip: output/gencode_chr9_test/gene_table.txt.gz: uncompress failed

And gene_scatterX.pdf plots gene predictions versus experimental measurements for each dataset indexed by X.



In [9]:

    
from IPython.display import IFrame
IFrame('output/gencode_chr9_test/gene_scatter0.pdf', width=600, height=500)









    Out[9]:



In [ ]: