In [1]:
import h5py
import os
import subprocess

Precursors


In [2]:
if not os.path.isfile('data/hg19.ml.fa'):
    subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
    subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)

In [3]:
if not os.path.isdir('models/heart'):
    os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.tf.meta'):
    subprocess.call('curl -o models/heart/model_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.index', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.meta', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.data-00000-of-00001', shell=True)

In [4]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])

samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
    print('\t'.join(line), file=samples_out)
samples_out.close()

SNP activity difference compute

Analyzing noncoding variation associated with disease is a major application of Basenji. I now offer several tools to enable that analysis. If you have a small set of variants and know what datasets are most relevant, basenji_sat_vcf.py lets you perform a saturation mutagenesis of the variant and surrounding region to see the relevant nearby motifs.

If you want scores measuring the influence of those variants on all datasets,

  • basenji_sad.py computes my SNP activity difference (SAD) score--the predicted change in aligned fragments to the region.
  • basenji_sed.py computes my SNP expression difference (SED) score--the predicted change in aligned fragments to gene TSS's.

Here, I'll demonstrate those two programs. You'll need

  • Trained model
  • Input file (FASTA or HDF5 with test_in/test_out)

First, you can either train your own model in the Train/test tutorial or use one that I pre-trained from the models subdirectory.

As an example, we'll study a prostate cancer susceptibility allele of rs339331 that increases RFX6 expression by modulating HOXB13 chromatin binding (http://www.nature.com/ng/journal/v46/n2/full/ng.2862.html).

First, we'll use basenji_sad.py to predict across the region for each allele and compute stats about the mean and max differences.

The most relevant options are:

Option/Argument Value Note
--cpu True Run on CPU (and avoid a GPU prefetch op.)
-f data/hg19.ml.fa Genome fasta.
-g data/human.hg19.genome Genome assembly chromosome length to bound gene sequences.
--h5 True Write output to HDF5.
-o rfx6_sad Outplot plot directory.
--rc True Ensemble predictions for forward and reverse complement sequences.
--shift 1,0,-1 Ensemble predictions for sequences shifted by 1, 0, and -1 bp.
-t data/heart_wigs.txt Target labels.
params_file models/params_small.txt Table of parameters to setup the model architecture and optimization parameters.
model_file models/heart/model_best.tf Trained saved model prefix.
vcf_file data/rs339331.vcf VCF file specifying variants to score.

In [5]:
! basenji_sad.py --cpu -f data/hg19.ml.fa -g data/human.hg19.genome --h5 -o output/rfx6_sad --rc --shift "1,0,-1" -t data/heart_wigs.txt models/params_small.txt models/heart/model_best.tf data/rs339331.vcf


{'batch_size': 4, 'batch_buffer': 4096, 'link': 'softplus', 'loss': 'poisson', 'optimizer': 'adam', 'adam_beta1': 0.97, 'adam_beta2': 0.98, 'learning_rate': 0.002, 'num_targets': 3, 'target_pool': 128, 'seq_length': 131072, 'target_length': 1024, 'cnn_dropout': 0.1, 'cnn_filter_sizes': [20, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 1], 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0], 'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]}
/Users/davidkelley/code/Basenji/bin/basenji_sad.py:156: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  targets_df = pd.read_table(options.targets_file, index_col=0)
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Targets pooled by 128 to length 1024
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:53: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:74: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.batch_normalization instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:82: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:103: max_pooling1d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/seqnn.py:303: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
Convolution w/ 3 384x1 filters to final targets
Model building time 2.653474
2019-05-22 16:01:20.331637: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING:tensorflow:From /Users/davidkelley/code/Basenji/bin/basenji_sad.py:267: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:`tf.train.start_queue_runners()` was called when no queue runners were defined. You can safely remove the call to this deprecated function.
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.

SNP activity difference output

The output HDF5 stores the SNP and target information and predicted scores.


In [6]:
sad_h5 = h5py.File('output/rfx6_sad/sad.h5', 'r')
list(sad_h5.keys())


Out[6]:
['SAD',
 'SAD_pct',
 'chr',
 'percentiles',
 'pos',
 'ref',
 'snp',
 'target_ids',
 'target_labels']

In [7]:
for snp_key in ['snp', 'chr', 'pos', 'ref']:
    print(snp_key, sad_h5[snp_key][:])


snp [b'rs339331']
chr [b'chr6']
pos [117210052]
ref [b'T']

In [8]:
for ti in range(3):
    cols = (ti, sad_h5['SAD'][0,ti], sad_h5['target_ids'][ti], sad_h5['target_labels'][ti])
    print('%2d  %7.4f  %12s  %s' % cols)


 0   0.1210  b'CNhs11760'  b'aorta'
 1   0.0208  b'CNhs12843'  b'artery'
 2   0.1587  b'CNhs12856'  b'pulmonic_valve'

These are inconclusive small effect sizes, not surprising given that we're only studying heart CAGE. The proper cell types and experiments would shed more light.

SNP expression difference compute

Alternatively, we can directly query the predictions at gene TSS's using basenji_sed.py. Note that I haven't revised these scripts to make use of tf.data, so they are a bit less wieldy right now.

basenji_sed.py takes as input the gene sequence HDF5 format described in genes.ipynb. There's no harm to providing an HDF5 that describes all genes, but it's too big to easily move around so I constructed one that focuses on RFX6.

The most relevant options are:

Option/Argument Value Note
-g data/human.hg19.genome Genome assembly chromosome length to bound gene sequences.
-o rfx6_sed Outplot plot directory.
--rc Predict forward and reverse complement versions and average the results.
-w 128 Sequence bin width at which predictions are made.
params_file models/params_med.txt Table of parameters to setup the model architecture and optimization parameters.
model_file models/gm12878.tf Trained saved model prefix.
genes_hdf5_file data/rfx6.h5 HDF5 file specifying gene sequences to query.
vcf_file data/rs339331.vcf VCF file specifying variants to score.

Before running basenji_sed.py, we need to generate an input data file for RFX6. Using an included GTF file that contains only RFX6, one can use basenji_hdf5_genes.py to create the required format.


In [9]:
! basenji_hdf5_genes.py -g data/human.hg19.genome -l 131072 -c 0.333 -w 128 data/hg19.ml.fa data/rfx6.gtf data/rfx6.h5

In [16]:
! basenji_sed.py -a -g data/human.hg19.genome -o output/rfx6_sed --rc models/params_small.txt models/heart/model_best.tf data/rfx6.h5 data/rs339331.vcf


Intersecting gene sequences with SNPs...1 sequences w/ SNPs
{'batch_size': 4, 'batch_buffer': 4096, 'link': 'softplus', 'loss': 'poisson', 'optimizer': 'adam', 'adam_beta1': 0.97, 'adam_beta2': 0.98, 'learning_rate': 0.002, 'num_targets': 3, 'target_pool': 128, 'seq_length': 131072, 'target_length': 1024, 'cnn_dropout': 0.1, 'cnn_filter_sizes': [20, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 1], 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0], 'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]}
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Targets pooled by 128 to length 1024
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:53: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:74: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.batch_normalization instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:82: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/layers.py:103: max_pooling1d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling1d instead.
WARNING:tensorflow:From /Users/davidkelley/code/basenji/basenji/seqnn.py:303: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
Convolution w/ 3 384x1 filters to final targets
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Targets pooled by 128 to length 1024
Convolution w/ 3 384x1 filters to final targets
2019-05-22 16:25:36.134283: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING:tensorflow:From /Users/davidkelley/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
chr6:117136639-117267711 2 TSSs

SNP expression difference compute


In [17]:
! sort -k9 -g output/rfx6_sed/sed_gene.txt


rsid ref alt gene tss_dist ref_pred alt_pred sed ser target_index target_id target_label
rs339331      T     C ENSG00000185002.9_1  4100  1.4131  1.4111 -0.0020 -0.0020    1           t1 
rs339331      T     C ENSG00000185002.9_1  4100  1.3027  1.3018 -0.0010 -0.0010    2           t2 
rs339331      T     C ENSG00000185002.9_1  4100  0.8306  0.8320  0.0015  0.0022    0           t0 

In [18]:
! sort -k9 -gr output/rfx6_sed/sed_gene.txt


rs339331      T     C ENSG00000185002.9_1  4100  0.8306  0.8320  0.0015  0.0022    0           t0 
rs339331      T     C ENSG00000185002.9_1  4100  1.3027  1.3018 -0.0010 -0.0010    2           t2 
rs339331      T     C ENSG00000185002.9_1  4100  1.4131  1.4111 -0.0020 -0.0020    1           t1 
rsid ref alt gene tss_dist ref_pred alt_pred sed ser target_index target_id target_label

In [ ]: