In [1]:
import os
import subprocess

Precursors


In [2]:
if not os.path.isfile('data/hg19.ml.fa'):
    subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
    subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)

In [3]:
if not os.path.isdir('models/heart'):
    os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.tf.meta'):
    subprocess.call('curl -o models/heart/model_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.index', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.meta', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.data-00000-of-00001', shell=True)

In [4]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])

samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
    print('\t'.join(line), file=samples_out)
samples_out.close()

Compute scores

Saturation mutagenesis is a powerful tool both for dissecting a specific sequence of interest and understanding what the model learned. basenji_sat_bed.py enables this analysis from a test set of data. basenji_sat_vcf.py lets you provide a VCF file for variant-centered mutagenesis.

To do this, you'll need

  • Trained model
  • BED file

First, you can either train your own model in the Train/test tutorial or use one that I pre-trained from the models subdirectory.

We'll bash the GATA4 promoter to see what motifs drive its expression. I placed a BED file surrounding the GATA4 TSS in data/gata4.bed, so we'll use basenji_sat_bed.py.

The most relevant options are:

Option/Argument Value Note
-f data/hg19.ml.fa Genome FASTA to extract sequences.
-l 200 Saturation mutagenesis region in the center of the given sequence(s)
-o gata4_sat Outplot plot directory.
--rc True Predict forward and reverse complement versions and average the results.
-t data/heart_wigs.txt Target indexes to analyze.
params_file models/params_small.txt Table of parameters to setup the model architecture and optimization parameters.
model_file models/heart/model_best.tf Trained saved model prefix.
input_file data/gata4.bed BED regions.

In [5]:
! basenji_sat_bed.py -f data/hg19.ml.fa -l 200 -o output/gata4_sat --rc -t data/heart_wigs.txt models/params_small.txt models/heart/model_best.tf data/gata4.bed


{'batch_size': 4, 'batch_buffer': 4096, 'link': 'softplus', 'loss': 'poisson', 'optimizer': 'adam', 'adam_beta1': 0.97, 'adam_beta2': 0.98, 'learning_rate': 0.002, 'num_targets': 3, 'target_pool': 128, 'seq_length': 131072, 'target_length': 1024, 'cnn_dropout': 0.1, 'cnn_filter_sizes': [20, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 1], 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0], 'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]}
/Users/davidkelley/code/Basenji/bin/basenji_sat_bed.py:117: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  targets_df = pd.read_table(options.targets_file, index_col=0)
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Targets pooled by 128 to length 1024
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:53: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:74: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.batch_normalization instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:82: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:103: max_pooling1d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/seqnn.py:303: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
Convolution w/ 3 384x1 filters to final targets
2019-05-22 16:26:01.494229: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING:tensorflow:From /Users/davidkelley/code/Basenji/bin/basenji_sat_bed.py:197: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:`tf.train.start_queue_runners()` was called when no queue runners were defined. You can safely remove the call to this deprecated function.
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Predicting 0
Writing 0
Waiting for threads to finish.

Plot

The saturation mutagenesis scores go into output/gata4_sat/scores.h5. Then we can use basenji_sat_plot.py to visualize the scores.

The most relevant options are:

Option/Argument Value Note
-g True Draw a sequence logo for the gain score, too, identifying repressor motifs.
-l 200 Saturation mutagenesis region in the center of the given sequence(s)
-o output/gata4_sat/plots Outplot plot directory.
-t data/heart_wigs.txt Target indexes to analyze.
scores_file output/gata4_sat/scores.h5 Scores HDF5 from above.

In [6]:
! basenji_sat_plot.py --png -l 200 -o output/gata4_sat/plots -t data/heart_wigs.txt output/gata4_sat/scores.h5


/Users/davidkelley/code/Basenji/bin/basenji_sat_plot.py:95: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  targets_df = pd.read_table(options.targets_file, index_col=0)

In [7]:
! ls output/gata4_sat/plots


seq0_t0.pdf seq0_t0.png seq0_t1.pdf seq0_t1.png seq0_t2.pdf seq0_t2.png

The resulting plots reveal a low level of activity, with a GC-rich motif driving the only signal.