Once we've trained a model, we might want to better understand what sequence motifs the first convolutional layer has discovered and how it's using them. Basset offers two methods to help users explore these filters.
You'll want to double-check that you have the Tomtom motif comparison tool from the MEME suite installed. Tomtom provides rigorous methods for comparing the filters to a database of known motifs. You can download it from here: http://meme-suite.org/doc/download.html
To run this tutorial, you'll need to either download the pre-trained model from https://www.dropbox.com/s/rguytuztemctkf8/pretrained_model.th.gz and preprocess the consortium data, or just substitute your own files here:
In :
model_file = '../data/models/pretrained_model.th'
seqs_file = '../data/encode_roadmap.h5'
First, we'll run basset_motifs.py, which extracts a bunch of basic information about the first-layer filters. The script takes an HDF5 file (such as one preprocessed with preprocess_features.py) and samples sequences from the test set. It sends those sequences through the neural network and examines the hidden unit values to describe what the filters are doing.
In :
import subprocess
cmd = 'basset_motifs.py -s 1000 -t -o motifs_out %s %s' % (model_file, seqs_file)
subprocess.call(cmd, shell=True)
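Under the hood, the first convolutional layer is just cross-correlating each filter against the one-hot coded sequence. Here's a minimal numpy sketch of that operation; the names (`filter_activations`, `seq_1hot`) are illustrative stand-ins, not Basset's actual API:

```python
import numpy as np

def filter_activations(seq_1hot, filter_weights):
    """Cross-correlate a one-hot sequence (4 x L) with one
    convolutional filter (4 x w); return per-position activations."""
    _, L = seq_1hot.shape
    _, w = filter_weights.shape
    acts = np.zeros(L - w + 1)
    for i in range(L - w + 1):
        acts[i] = np.sum(seq_1hot[:, i:i + w] * filter_weights)
    return acts

# toy example: a 4 x 10 one-hot sequence and a 4 x 3 filter
rng = np.random.default_rng(0)
seq = np.eye(4)[rng.integers(0, 4, size=10)].T  # (4, 10)
filt = rng.normal(size=(4, 3))
acts = filter_activations(seq, filt)
print(acts.shape)  # (8,)
```

A real network applies a ReLU on top of this and pools across positions, but the position-by-position dot product is the core of what a "filter activation" means below.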
Now there's plenty of information output in motifs_out. My favorite way to get started is to open the HTML file output by Tomtom's comparison of the motifs to a database. It displays all of the motifs and their database matches in a neat table.
Before we take a look though, let me describe where these position weight matrices came from. Inside the neural network, the filters are represented by real-valued matrices. Here's one:
In :
# actual file is motifs_out/filter9_heat.pdf
from IPython.display import Image
Image(filename='motifs_eg/filter9_heat.png')
Although it's a matrix of values, this doesn't quite match up with the conventional notion of a position weight matrix that we typically use to represent sequence motifs in genome biology. To make one, basset_motifs.py pulls out the underlying sequences that activate the filter in the test sequences and passes them to WebLogo.
In :
# actual file is motifs_out/filter9_logo.eps
Image(filename='motifs_eg/filter9_logo.png')
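The conversion described above can be sketched in numpy: gather the subsequences where the filter's activation exceeds a threshold and count nucleotides. The half-of-maximum threshold here is an assumption for illustration; Basset's exact cutoff may differ:

```python
import numpy as np

def pwm_from_activations(seqs_1hot, acts, width, thresh_frac=0.5):
    """seqs_1hot: (N, 4, L) one-hot sequences.
       acts: (N, L - width + 1) per-position filter activations.
       Return a (4, width) position frequency matrix built from
       subsequences whose activation exceeds thresh_frac * max."""
    counts = np.zeros((4, width))
    for n, pos in zip(*np.where(acts > thresh_frac * acts.max())):
        counts += seqs_1hot[n, :, pos:pos + width]
    return counts / counts.sum(axis=0, keepdims=True)

# toy check: two one-hot sequences of length 8, a width-3 filter
rng = np.random.default_rng(1)
seqs = np.eye(4)[rng.integers(0, 4, size=(2, 8))].transpose(0, 2, 1)  # (2, 4, 8)
acts = rng.random((2, 6))  # 8 - 3 + 1 activation positions
pwm = pwm_from_activations(seqs, acts, width=3)
print(pwm.sum(axis=0))  # each column sums to 1
```

The resulting frequency matrix is exactly the kind of object WebLogo renders as a sequence logo.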
The Tomtom output will annotate the filters, but it won't tell you how the model is using them. So alongside it, let's print out a table of the filters with the greatest output standard deviation over this test set. This variability correlates strongly with how influential the filter is for making predictions.
The columns here are the filter index, its consensus sequence, the best Tomtom database match (or "." if none), the information content, the activation mean, and the activation standard deviation (column 6, which we sort by):
In :
!sort -k6 -gr motifs_out/table.txt | head -n20
118 TCAGGTGGGTATTGTATAC .               3.86  0.0729 0.0894
88  AAATAGTTAATAATTTTTA HMG20B          8.63  0.0463 0.0889
208 CATACGTATGTAGTATCAA .               5.47  0.0332 0.0782
24  TCCGGGCCGCGTATTGCAG .               4.98  0.0306 0.0743
96  ATAAATGATATTAGCCTGG .               4.85  0.0467 0.0668
112 TCAATACGACGGGCGTAAA .               4.92  0.0313 0.0664
219 TTTGGGGGGCGCCACCCGC .               4.11  0.0322 0.0636
75  TCTCTTGATAATATGATTA .               6.16  0.0244 0.0625
84  GAATGACTTGGTTGGNTGG .               6.50  0.0106 0.0594
289 GCCCGATTTTATAACATCA HOXB13          7.93  0.0099 0.0582
285 ACTTGATCAGGGCTTCTTA ARID3B          8.19  0.0083 0.0569
171 AATATTGGAATACATGGAC .               8.38  0.0131 0.0557
105 TTAATTAATTACGTAAAAT ENSG00000250542 5.62  0.0080 0.0534
19  CGGCCAGTCCGTGGCTAGA .               5.71  0.0141 0.0505
138 ATATTAATCCAAATGTTTT .               8.63  0.0059 0.0499
248 CACTATAGCAACTTACTGG .               5.55  0.0173 0.0496
117 CCCATTGACAAGGCCGTGG .               7.23  0.0112 0.0495
59  ATTTCAGAACGCACCGCAC .               5.36  0.0123 0.0491
87  NNCCNCAAGTATGCNNNNN TEAD1           9.59  0.0041 0.0486
277 NNNNACCGGAAGTGGNNNN FLI1            10.74 0.0024 0.0481
As I discuss in the paper, unannotated low-complexity filters tend to rise to the top here because low-order sequence complexity influences accessibility.
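For intuition, this ranking can be reproduced from raw activations: compute each filter's standard deviation over sequences and positions and sort descending. This is a sketch with made-up data, and Basset's table may pool activations somewhat differently:

```python
import numpy as np

def rank_filters_by_std(acts):
    """acts: (N, F, P) activations for N sequences, F filters,
       P positions. Return filter indices sorted by descending
       standard deviation, plus the stds themselves."""
    stds = acts.std(axis=(0, 2))          # pool over sequences and positions
    return np.argsort(stds)[::-1], stds

# toy data: inflate one filter's variability so it ranks first
rng = np.random.default_rng(3)
acts = rng.random((10, 4, 6))
acts[:, 2, :] *= 5.0
order, stds = rank_filters_by_std(acts)
print(order[0])  # 2
```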
The Tomtom output HTML is here:
In :
!open motifs_out/tomtom/tomtom.html
The other primary tool we have for understanding the filters is to remove a filter from the model and assess the impact. Rather than truly removing it and re-training, we can nullify it within the model by setting all of the filter's output to its mean. This way, the rest of the model isn't drastically affected, but no information flows through.
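In numpy terms, nullifying filter f amounts to overwriting its output channel with its batch mean before the activations flow downstream. A sketch, where `conv_out` is a hypothetical stand-in for the first layer's outputs, not Basset's internals:

```python
import numpy as np

def nullify_filter(conv_out, f):
    """conv_out: (N, F, P) first-layer activations over a batch.
       Overwrite filter f's channel with its mean so that no
       sequence-specific information flows through it, while the
       downstream layers still see a plausible input level."""
    nulled = conv_out.copy()
    nulled[:, f, :] = conv_out[:, f, :].mean()
    return nulled

# toy example: channel 1 becomes constant, other channels untouched
rng = np.random.default_rng(2)
out = rng.random((5, 3, 7))
nulled = nullify_filter(out, 1)
print(np.allclose(nulled[:, 1, :], out[:, 1, :].mean()))  # True
```

The influence of the filter is then estimated by comparing the model's loss on the original versus nullified activations.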
This analysis requires considerably more compute time, so I separated it into a different script. To give it a sufficient number of sequences to obtain a good estimate of influence, I typically run it overnight. If your computer is using too much memory, decrease the batch size. I'm going to run here with 2,000 sequences, but I used 20,000 for the paper.
To get really useful output, the script needs a few additional pieces of information: the filter table from basset_motifs.py, a subset of cells to display, a table annotating cell activity, and some plotting dimensions:
In :
cmd = 'basset_motifs_infl.py'
cmd += ' -m motifs_out/table.txt'
cmd += ' -s 2000 -b 500'
cmd += ' -o infl_out'
cmd += ' --subset motifs_eg/primary_cells.txt'
cmd += ' -t motifs_eg/cell_activity.txt'
cmd += ' --width 7 --height 40 --font 0.5'
cmd += ' %s %s' % (model_file, seqs_file)
subprocess.call(cmd, shell=True)