Saturated mutagenesis is a powerful tool both for dissecting a specific sequence of interest and understanding what the model learned. basset_sat.py enables this analysis from a test set of data.
To run this tutorial, you'll need to either download the pre-trained model from https://www.dropbox.com/s/rguytuztemctkf8/pretrained_model.th.gz and preprocess the consortium data, or just substitute your own files here:
In :model_file = '../data/models/pretrained_model.th' seqs_file = '../data/encode_roadmap.h5'
This analysis is somewhat time consuming, proportional to the number of sequences, number of center nucleotides mutated, and number of plots. Typically, I just focus on one or a few targets of interest and provide a few hundred sequences.
Then we run the script as follows:
In :import subprocess cmd = 'basset_sat.py -t 46 -n 200 -s 10 -o satmut %s %s' % (model_file, seqs_file) subprocess.call(cmd, shell=True)
Then I often just browse through and check out the various sequences. In this set of 10, there isn't a whole lot going on. (It's a big genome.) But this one contains a weak CTCF motif, which is highlighted by loss scores. Note that the motif also contains a few positions marked by high gain scores where CTCF binding and accessibility would increase after mutation.
In :# actual file is satmut/chr17_4904020-4904620\(+\)_c46_heat.pdf from IPython.display import Image Image(filename='satmut_eg/chr17_4904020-4904620(+)_c46_heat.png')
If you have a particular sequence of interest, you can also provide basset_sat.py with a FASTA file in place of the HDF5 test set.
To demonstrate, I chose a fascinating boundary domain, separating rostral from caudal gene expression programs in the HOXA locus. The exact region that I grabbed is chr7:27183235-27183835.
By specifying -1 to -t, we'll print heat maps for all cells.
In :cmd = 'basset_sat.py -t -1 -n 200 -o satmut_hox %s satmut_eg/hoxa_boundary.fa' % model_file subprocess.call(cmd, shell=True)
As mentioned in the paper, the CTCF motif here does not match the consensus; a C->T mutation would cause it to match far better. Accordingly, Basset does not predict high accessibility in any cell type, but does highlight the motif and a GC-rich region to it's left.
In :# actual file is satmut_hox/chr7_27183235-27183835_c127_heat.pdf Image(filename='satmut_eg/chr7_27183235-27183835_c127_heat.png')