In [1]:
import h5py
import os
import subprocess
In [2]:
if not os.path.isfile('data/hg19.ml.fa'):
subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)
In [3]:
if not os.path.isdir('models/heart'):
os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.tf.meta'):
subprocess.call('curl -o models/heart/model_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.index', shell=True)
subprocess.call('curl -o models/heart/model_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.meta', shell=True)
subprocess.call('curl -o models/heart/model_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.data-00000-of-00001', shell=True)
In [4]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])
samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
print('\t'.join(line), file=samples_out)
samples_out.close()
Analyzing noncoding variation associated with disease is a major application of Basenji. I now offer several tools to enable that analysis. If you have a small set of variants and know what datasets are most relevant, basenji_sat_vcf.py lets you perform a saturation mutagenesis of the variant and surrounding region to see the relevant nearby motifs.
If you want scores measuring the influence of those variants on all datasets,
Here, I'll demonstrate those two programs. You'll need
First, you can either train your own model in the Train/test tutorial or use one that I pre-trained from the models subdirectory.
As an example, we'll study a prostate cancer susceptibility allele of rs339331 that increases RFX6 expression by modulating HOXB13 chromatin binding (http://www.nature.com/ng/journal/v46/n2/full/ng.2862.html).
First, we'll use basenji_sad.py to predict across the region for each allele and compute stats about the mean and max differences.
The most relevant options are:
Option/Argument | Value | Note |
---|---|---|
--cpu | True | Run on CPU (and avoid a GPU prefetch op.) |
-f | data/hg19.ml.fa | Genome fasta. |
-g | data/human.hg19.genome | Genome assembly chromosome length to bound gene sequences. |
--h5 | True | Write output to HDF5. |
-o | rfx6_sad | Outplot plot directory. |
--rc | True | Ensemble predictions for forward and reverse complement sequences. |
--shift | 1,0,-1 | Ensemble predictions for sequences shifted by 1, 0, and -1 bp. |
-t | data/heart_wigs.txt | Target labels. |
params_file | models/params_small.txt | Table of parameters to setup the model architecture and optimization parameters. |
model_file | models/heart/model_best.tf | Trained saved model prefix. |
vcf_file | data/rs339331.vcf | VCF file specifying variants to score. |
In [5]:
! basenji_sad.py --cpu -f data/hg19.ml.fa -g data/human.hg19.genome --h5 -o output/rfx6_sad --rc --shift "1,0,-1" -t data/heart_wigs.txt models/params_small.txt models/heart/model_best.tf data/rs339331.vcf
The output HDF5 stores the SNP and target information and predicted scores.
In [6]:
sad_h5 = h5py.File('output/rfx6_sad/sad.h5', 'r')
list(sad_h5.keys())
Out[6]:
In [7]:
for snp_key in ['snp', 'chr', 'pos', 'ref']:
print(snp_key, sad_h5[snp_key][:])
In [8]:
for ti in range(3):
cols = (ti, sad_h5['SAD'][0,ti], sad_h5['target_ids'][ti], sad_h5['target_labels'][ti])
print('%2d %7.4f %12s %s' % cols)
These are inconclusive small effect sizes, not surprising given that we're only studying heart CAGE. The proper cell types and experiments would shed more light.
Alternatively, we can directly query the predictions at gene TSS's using basenji_sed.py. Note that I haven't revised these scripts to make use of tf.data, so they are a bit less wieldy right now.
basenji_sed.py takes as input the gene sequence HDF5 format described in genes.ipynb. There's no harm to providing an HDF5 that describes all genes, but it's too big to easily move around so I constructed one that focuses on RFX6.
The most relevant options are:
Option/Argument | Value | Note |
---|---|---|
-g | data/human.hg19.genome | Genome assembly chromosome length to bound gene sequences. |
-o | rfx6_sed | Outplot plot directory. |
--rc | Predict forward and reverse complement versions and average the results. | |
-w | 128 | Sequence bin width at which predictions are made. |
params_file | models/params_med.txt | Table of parameters to setup the model architecture and optimization parameters. |
model_file | models/gm12878.tf | Trained saved model prefix. |
genes_hdf5_file | data/rfx6.h5 | HDF5 file specifying gene sequences to query. |
vcf_file | data/rs339331.vcf | VCF file specifying variants to score. |
Before running basenji_sed.py, we need to generate an input data file for RFX6. Using an included GTF file that contains only RFX6, one can use basenji_hdf5_genes.py to create the required format.
In [9]:
! basenji_hdf5_genes.py -g data/human.hg19.genome -l 131072 -c 0.333 -w 128 data/hg19.ml.fa data/rfx6.gtf data/rfx6.h5
In [16]:
! basenji_sed.py -a -g data/human.hg19.genome -o output/rfx6_sed --rc models/params_small.txt models/heart/model_best.tf data/rfx6.h5 data/rs339331.vcf
In [17]:
! sort -k9 -g output/rfx6_sed/sed_gene.txt
In [18]:
! sort -k9 -gr output/rfx6_sed/sed_gene.txt
In [ ]: