In [1]:
import os
import subprocess
In [2]:
if not os.path.isfile('data/hg19.ml.fa'):
subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)
In [3]:
if not os.path.isdir('models/heart'):
os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.tf.meta'):
subprocess.call('curl -o models/heart/model_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.index', shell=True)
subprocess.call('curl -o models/heart/model_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.meta', shell=True)
subprocess.call('curl -o models/heart/model_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.data-00000-of-00001', shell=True)
In [4]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'CNhs11760', 'data/CNhs11760.bw', '384', 'sum', 'aorta'])
lines.append(['1', 'CNhs12843', 'data/CNhs12843.bw', '384', 'sum', 'artery'])
lines.append(['2', 'CNhs12856', 'data/CNhs12856.bw', '384', 'sum', 'pulmonic_valve'])
samples_out = open('data/heart_wigs.txt', 'w')
for line in lines:
print('\t'.join(line), file=samples_out)
samples_out.close()
Saturation mutagenesis is a powerful tool both for dissecting a specific sequence of interest and understanding what the model learned. basenji_sat_bed.py enables this analysis from a test set of data. basenji_sat_vcf.py lets you provide a VCF file for variant-centered mutagenesis.
To do this, you'll need
First, you can either train your own model in the Train/test tutorial or use one that I pre-trained from the models subdirectory.
We'll bash the GATA4 promoter to see what motifs drive its expression. I placed a BED file surrounding the GATA4 TSS in data/gata4.bed, so we'll use basenji_sat_bed.py.
The most relevant options are:
Option/Argument | Value | Note |
---|---|---|
-f | data/hg19.ml.fa | Genome FASTA to extract sequences. |
-l | 200 | Saturation mutagenesis region in the center of the given sequence(s) |
-o | gata4_sat | Outplot plot directory. |
--rc | True | Predict forward and reverse complement versions and average the results. |
-t | data/heart_wigs.txt | Target indexes to analyze. |
params_file | models/params_small.txt | Table of parameters to setup the model architecture and optimization parameters. |
model_file | models/heart/model_best.tf | Trained saved model prefix. |
input_file | data/gata4.bed | BED regions. |
In [5]:
! basenji_sat_bed.py -f data/hg19.ml.fa -l 200 -o output/gata4_sat --rc -t data/heart_wigs.txt models/params_small.txt models/heart/model_best.tf data/gata4.bed
The saturation mutagenesis scores go into output/gata4_sat/scores.h5. Then we can use basenji_sat_plot.py to visualize the scores.
The most relevant options are:
Option/Argument | Value | Note |
---|---|---|
-g | True | Draw a sequence logo for the gain score, too, identifying repressor motifs. |
-l | 200 | Saturation mutagenesis region in the center of the given sequence(s) |
-o | output/gata4_sat/plots | Outplot plot directory. |
-t | data/heart_wigs.txt | Target indexes to analyze. |
scores_file | output/gata4_sat/scores.h5 | Scores HDF5 from above. |
In [6]:
! basenji_sat_plot.py --png -l 200 -o output/gata4_sat/plots -t data/heart_wigs.txt output/gata4_sat/scores.h5
In [7]:
! ls output/gata4_sat/plots
The resulting plots reveal a low level of activity, with a GC-rich motif driving the only signal.