Although Basenji is unaware of the locations of known genes in the genome, we can go in afterwards and ask what a model predicts for those locations to interpret it as a gene expression prediction.

To do this, you'll need

  • Trained model
  • Gene Transfer Format (GTF) gene annotations
  • BigWig coverage tracks
  • Gene sequences saved in my HDF5 format.

First, make sure you have an hg19 FASTA file visible. If you have it already, put a symbolic link into the data directory. Otherwise, I have a machine learning friendly simplified version you can download in the next cell.


In [1]:
import os, subprocess

if not os.path.isfile('data/hg19.ml.fa'):
    subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
    subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)

Next, let's grab a few CAGE datasets from FANTOM5 related to heart biology.

These data were processed by

  1. Aligning with Bowtie2 with very sensitive alignment parameters.
  2. Distributing multi-mapping reads and estimating genomic coverage with bam_cov.py

In [2]:
if not os.path.isfile('data/CNhs11760.bw'):
    subprocess.call('curl -o data/CNhs11760.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs11760.bw', shell=True)
    subprocess.call('curl -o data/CNhs12843.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12843.bw', shell=True)
    subprocess.call('curl -o data/CNhs12856.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12856.bw', shell=True)

Then we'll write out these BigWig files and labels to a samples table.


In [3]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])

samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
    print('\t'.join(line), file=samples_out)
samples_out.close()

Predictions in the portion of the genome that we trained might inflate our accuracy, so we'll focus on chr9 genes, which have formed my typical test set. Then we use basenji_hdf5_genes.py to create the file.

The most relevant options are:

Option/Argument Value Note
-g data/human.hg19.genome Genome assembly chromosome length to bound gene sequences.
-l 262144 Sequence length.
-c 0.333 Multiple genes per sequence are allowed, but the TSS must be in the middle 1/3 of the sequence.
-p 3 Use 3 threads via
-t data/heart_wigs.txt Save coverage values from this table of BigWig files.
-w 128 Bin the coverage values at 128 bp resolution.
fasta_file data/hg19.ml.fa Genome FASTA file for extracting sequences.
gtf_file data/gencode_chr9.gtf Gene annotations in gene transfer format.
hdf5_file data/gencode_chr9_l262k_w128.h5 Gene sequence output HDF5 file.

In [4]:
! basenji_hdf5_genes.py -g data/human.hg19.genome -l 131072 -c 0.333 -p 3 -t data/heart_wigs.txt -w 128 data/hg19.ml.fa data/gencode_chr9.gtf data/gencode_chr9.h5


/Users/davidkelley/code/Basenji/bin/basenji_hdf5_genes.py:356: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  target_wigs_df = pd.read_table(target_wigs_file, index_col=0)
/Users/davidkelley/code/Basenji/bin/basenji_hdf5_genes.py:212: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  target_wigs_df = pd.read_table(options.target_wigs_file, index_col=0)

Now, you can either train your own model in the Train/test tutorial or download one that I pre-trained.


In [5]:
if not os.path.isdir('models/heart'):
    os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.tf.meta'):
    subprocess.call('curl -o models/heart/model_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.index', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.meta', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.data-00000-of-00001', shell=True)

Finally, you can offer data/gencode_chr9_l262k_w128.h5 and the model to basenji_test_genes.py to make gene expression predictions and benchmark them.

The most relevant options are:

Option/Argument Value Note
-o data/gencode_chr9_test Output directory.
--rc Average the forward and reverse complement to form prediction.
-s Make scatter plots, comparing predictions to experiment values.
--table Print gene expression table.
params_file models/params_small.txt Table of parameters to setup the model architecture and optimization.
model_file models/gm12878_best.tf Trained saved model prefix.
genes_hdf5_file data/gencode_chr9_l262k_w128.h5 HDF5 file containing the gene sequences, annotations, and experiment values.

In [6]:
! basenji_test_genes.py -o output/gencode_chr9_test --rc -s --table models/params_small.txt models/heart/model_best.tf data/gencode_chr9.h5


{'batch_size': 4, 'batch_buffer': 4096, 'link': 'softplus', 'loss': 'poisson', 'optimizer': 'adam', 'adam_beta1': 0.97, 'adam_beta2': 0.98, 'learning_rate': 0.002, 'num_targets': 3, 'target_pool': 128, 'seq_length': 131072, 'target_length': 1024, 'cnn_dropout': 0.1, 'cnn_filter_sizes': [20, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 1], 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0], 'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]}
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Targets pooled by 128 to length 1024
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:53: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:74: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.batch_normalization instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:82: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:103: max_pooling1d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/seqnn.py:295: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
Convolution w/ 3 384x1 filters to final targets
Computing gene predictions.2019-04-17 13:51:34.980700: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
 Done in 265s.
Computing correlations./Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
/Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
  warnings.warn("tight_layout : falling back to Agg renderer")
/Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
/Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
  warnings.warn("tight_layout : falling back to Agg renderer")
 Done in 0s.
Printing predictions. Done in 3s.

In the output directory output/gencode_chr9_test/ are several tables and plots describing gene prediction accuracy. For example gene_cors.txt contains Spearman and Pearson correlations for predictions versus experimental measurements for all genes and nonzero genes.


In [7]:
! cat output/gencode_chr9_test/gene_cors.txt


0       0.607    0.687    0.573    0.591  CNhs11760 aorta
1       0.530    0.649    0.559    0.511  CNhs12843 artery
2       0.570    0.672    0.589    0.655  CNhs12856 pulmonic_valve

gene_table.txt.gz contains specific gene predictions and experimental measurements.


In [8]:
! gunzip -c output/gencode_chr9_test/gene_table.txt.gz | head


ENSG00000277631.4_1   2.906  0.462     0                 aorta
ENSG00000227917.2_1   0.000  0.024     0                 aorta
ENSG00000231808.2_1   0.762  1.200     0                 aorta
ENSG00000170122.5_1   0.000  0.118     0                 aorta
ENSG00000172785.18_2  5.128  2.111     0                 aorta
ENSG00000183784.6_1   0.068  0.218     0                 aorta
ENSG00000235880.1_1   3.355  0.068     0                 aorta
ENSG00000224836.1_1   0.000  0.017     0                 aorta
ENSG00000236594.2_1   0.000  0.045     0                 aorta
ENSG00000225893.1_1   1.054  0.044     0                 aorta
gunzip: error writing to output: Broken pipe
gunzip: output/gencode_chr9_test/gene_table.txt.gz: uncompress failed

And gene_scatterX.pdf plots gene predictions versus experimental measurements for each dataset indexed by X.


In [9]:
from IPython.display import IFrame
IFrame('output/gencode_chr9_test/gene_scatter0.pdf', width=600, height=500)


Out[9]:

In [ ]: