This tutorial uses a Jupyter Notebook - an interactive computational environment that combines live code, visualizations, and explanatory text. The notebook is organized into a series of cells. You can run the next cell by clicking the play button:
In this tutorial, we will:
1) Simulate a regulatory DNA sequence classification task
2) Train DragoNN models of varying complexity to solve the simulation
3) Interpret trained DragoNN models
4) Show how to train your DragoNN on your own, non-simulated data and use it to interpret data
This tutorial is implemented in Python (see this online Python course for an introduction).
We start by loading dragonn's tutorial utilities. Let's review properties of regulatory sequences while the utilities load.
In [9]:
%reload_ext autoreload
%autoreload 2
from dragonn.tutorial_utils import *
%matplotlib inline
In this tutorial, we will simulate a heterodimer motif grammar detection task. Specifically, we will simulate a "positive" class of sequences containing a SIX5-ZNF143 grammar with relatively fixed spacing between the motifs, and a "negative" class of sequences containing both motifs positioned independently:
Let's run the print_available_simulations function and see it in action.
In [10]:
print_available_simulations()
To get simulation data we:
1) Define the simulation parameters
- obtain a description of the simulation parameters using the print_simulation_info function
2) Call the get_simulation_data function, which takes as input the simulation name and the simulation parameters, and outputs the simulation data.
We simulate the SIX5-ZNF143 heterodimer motif grammar using the "simulate_heterodimer_grammar" simulation function. To get a description of the simulation parameters we use the print_simulation_info function, which takes as input the simulation function name, and outputs documentation for the simulation including the simulation parameters:
In [11]:
print_simulation_info("simulate_heterodimer_grammar")
Next, we define parameters for a heterodimer grammar simulation of 500bp long sequences with 0.4 GC fraction, 10000 positive and 10000 negative sequences, and SIX5 and ZNF143 motifs spaced 2-10bp apart in the positive sequences:
In [12]:
heterodimer_grammar_simulation_parameters = {
    "seq_length": 500,
    "GC_fraction": 0.4,
    "num_pos": 10000,
    "num_neg": 10000,
    "motif1": "SIX5_known5",
    "motif2": "ZNF143_known2",
    "min_spacing": 2,
    "max_spacing": 10}
We get the simulation data by calling the get_simulation_data function with the simulation name and the simulation parameters as inputs.
In [13]:
simulation_data = get_simulation_data("simulate_heterodimer_grammar", heterodimer_grammar_simulation_parameters)
simulation_data provides training, validation, and test sets of input sequences X and sequence labels y. The inputs X are matrices with a one-hot-encoding of the sequences:
In [14]:
simulation_data.X_train[0, :, :, :10]
Out[14]:
This matrix represents the first 10bp of the sequence, TTGGTAGATA.
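To build intuition for this encoding, here is a small sketch that decodes a one-hot matrix back into a DNA string. It assumes the four rows correspond to the bases A, C, G, T in that order; verify this against your encoder's convention before relying on it.
In [ ]:
import numpy as np

def one_hot_to_sequence(one_hot_matrix, bases="ACGT"):
    # one_hot_matrix has shape (4, seq_length); each column is all zeros
    # except for a single 1 marking the base at that position
    return "".join(bases[i] for i in np.argmax(one_hot_matrix, axis=0))

# decode the first 10bp of the first training sequence (assumes ACGT row order)
one_hot_to_sequence(simulation_data.X_train[0, 0, :, :10])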
Next, we will provide a brief overview of DragoNNs and proceed to train a DragoNN to classify the sequences we simulated:
A locally connected linear unit in a DragoNN model can represent a PSSM (part a). A sequence PSSM score is obtained by multiplying the PSSM across the sequence, thresholding the PSSM scores, and taking the max (part b). A PSSM score can also be computed by a DragoNN model with tiled locally connected linear units, amounting to a convolutional layer with a single convolutional filter representing the PSSM, followed by ReLU thresholding and maxpooling (part c).
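To make part (c) concrete, here is a toy numpy sketch of that computation: a single filter playing the role of a PSSM is slid across a one-hot sequence, the scores are thresholded with a ReLU (the threshold acting like a bias), and the maximum is taken. The PSSM values and threshold here are illustrative, not taken from the tutorial.
In [ ]:
import numpy as np

def pssm_score(one_hot_seq, pssm, threshold=0.0):
    # one_hot_seq: (4, seq_length); pssm: (4, motif_width)
    motif_width = pssm.shape[1]
    # convolution step: dot the PSSM with every window of the sequence
    scores = np.array([np.sum(one_hot_seq[:, i:i + motif_width] * pssm)
                       for i in range(one_hot_seq.shape[1] - motif_width + 1)])
    thresholded = np.maximum(scores - threshold, 0)  # ReLU thresholding
    return thresholded.max()                         # maxpooling over positions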
The main DragoNN model class is SequenceDNN, which provides a simple interface to a range of models and methods to train, test, and interpret DragoNNs. SequenceDNN uses Keras, a deep learning library that runs on top of Theano and TensorFlow, two popular deep learning backends.
To get a DragoNN model we:
1) Define the DragoNN architecture parameters
- obtain a description of the architecture parameters using the inspect_SequenceDNN() function
2) Call the get_SequenceDNN function, which takes as input the DragoNN architecture parameters, and outputs a randomly initialized DragoNN model.
To get a description of the architecture parameters we use the inspect_SequenceDNN function, which outputs documentation for the model class including the architecture parameters:
In [15]:
inspect_SequenceDNN()
"Available methods" display what can be done with a SequenceDNN model. These include common operations such as training and testing the model, and more complex operations such as extracting insight from trained models. We define a simple DragoNN model with one convolutional layer with one convolutional filter, followed by maxpooling of width 35.
In [16]:
one_filter_dragonn_parameters = {
    'seq_length': 500,
    'num_filters': [1],
    'conv_width': [45],
    'pool_width': 45}
We get a randomly initialized DragoNN model by calling the get_SequenceDNN function with one_filter_dragonn_parameters as the input:
In [17]:
one_filter_dragonn = get_SequenceDNN(one_filter_dragonn_parameters)
Next, we train the one_filter_dragonn by calling train_SequenceDNN with one_filter_dragonn and simulation_data as the inputs. In each epoch, the one_filter_dragonn performs a complete pass over the training data and updates its parameters to minimize the loss, which quantifies the error in the model predictions. After each epoch, the code prints performance metrics for the one_filter_dragonn on the validation data. Training stops once the loss on the validation data stops improving for multiple consecutive epochs. The performance metrics include balanced accuracy, area under the receiver operating characteristic curve (auROC), area under the precision-recall curve (auPRC), area under the precision-recall-gain curve (auPRG), and recall at multiple false discovery rates (Recall at FDR).
In [18]:
train_SequenceDNN(one_filter_dragonn, simulation_data)
We can see that the validation loss is not decreasing and the auROC is not increasing, which indicates that this model is not learning. A simple plot of the learning curve, showing the loss on the training and validation data over the course of training, demonstrates this visually:
In [19]:
SequenceDNN_learning_curve(one_filter_dragonn)
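As an aside, metrics such as auROC and auPRC can also be computed directly with scikit-learn. The sketch below assumes the model exposes a predict method that returns positive-class probabilities and that the validation set is available as X_valid and y_valid; the tutorial utilities already report these metrics for you.
In [ ]:
from sklearn.metrics import average_precision_score, roc_auc_score

# assumed API: predict returns positive-class probabilities per sequence
predictions = one_filter_dragonn.predict(simulation_data.X_valid)
print("auROC: %.3f" % roc_auc_score(simulation_data.y_valid, predictions))
print("auPRC: %.3f" % average_precision_score(simulation_data.y_valid, predictions))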
Next, we train a DragoNN model with 15 convolutional filters to see whether additional filters improve performance.
In [20]:
multi_filter_dragonn_parameters = {
    'seq_length': 500,
    'num_filters': [15],  ## notice the change from 1 filter to 15 filters
    'conv_width': [45],
    'pool_width': 45,
    'dropout': 0.1}
multi_filter_dragonn = get_SequenceDNN(multi_filter_dragonn_parameters)
train_SequenceDNN(multi_filter_dragonn, simulation_data)
SequenceDNN_learning_curve(multi_filter_dragonn)
Let's examine the sequence filters this model learned.
In [21]:
interpret_SequenceDNN_filters(multi_filter_dragonn, simulation_data)
As expected, the sequence filters don't reveal patterns that resemble the simulated motifs. Next, we explore methods to interpret specific sequences with this DragoNN model.
Using in-silico mutagenesis (ISM) and DeepLIFT, we can obtain scores for a specific sequence that indicate the importance of each position in the sequence. To assess these methods, we compare ISM and DeepLIFT scores to motif scores for each simulated motif at each position in the sequence. These motif scores represent the "ground truth" importance of each position because they are based on the motifs used to simulate the data. We provide comparisons for a positive class sequence on the left and a negative class sequence on the right.
In [22]:
interpret_data_with_SequenceDNN(multi_filter_dragonn, simulation_data)
We can see that neither DeepLIFT nor ISM highlight the locations of the simulated motifs (highlighted in grey). This is expected because this model doesn't perform well on this simulation.
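Before moving on, it helps to see how simple the idea behind ISM is. The sketch below scores a single sequence by mutating each position to each alternative base and recording the change in the model's predicted probability. It is illustrative only: it assumes a predict method that accepts a (N, 1, 4, seq_length) array, and it is much slower than dragonn's own implementation.
In [ ]:
import numpy as np

def ism_scores(model, one_hot_seq):
    # one_hot_seq: (1, 4, seq_length) for a single sequence
    reference = model.predict(one_hot_seq[np.newaxis]).item()
    seq_length = one_hot_seq.shape[-1]
    scores = np.zeros((4, seq_length))
    for pos in range(seq_length):
        for base in range(4):
            mutant = one_hot_seq.copy()
            mutant[0, :, pos] = 0      # erase the original base
            mutant[0, base, pos] = 1   # substitute the mutant base
            scores[base, pos] = model.predict(mutant[np.newaxis]).item() - reference
    return scores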
Next, we train a deeper DragoNN model with three convolutional layers to see whether a multi-layered model learns the simulation better.
In [23]:
multi_layer_dragonn_parameters = {
    'seq_length': 500,
    'num_filters': [15, 15, 15],  ## notice the change to multiple filter values, one for each layer
    'conv_width': [25, 25, 25],   ## convolutional filter width has been modified to 25 from 45
    'pool_width': 45,
    'dropout': 0.1}
multi_layer_dragonn = get_SequenceDNN(multi_layer_dragonn_parameters)
train_SequenceDNN(multi_layer_dragonn, simulation_data)
SequenceDNN_learning_curve(multi_layer_dragonn)
The multi-layered DragoNN model achieves a higher auROC and a lower training and validation loss than the multi-filter DragoNN model. Try the same model without dropout regularization: how important is dropout?
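To try that exercise, a minimal sketch is below; it assumes get_SequenceDNN accepts dropout=0.0 to disable dropout, and it uses only functions introduced above.
In [ ]:
no_dropout_parameters = dict(multi_layer_dragonn_parameters, dropout=0.0)  # assumed to disable dropout
no_dropout_dragonn = get_SequenceDNN(no_dropout_parameters)
train_SequenceDNN(no_dropout_dragonn, simulation_data)
SequenceDNN_learning_curve(no_dropout_dragonn)  # compare to the curves above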
Let's see what the model learns in its sequence filters.
In [24]:
interpret_SequenceDNN_filters(multi_layer_dragonn, simulation_data)
The sequence filters here are not amenable to interpretation based on visualization alone. In multi-layered models, sequence features are learned compositionally across the layers. As a result, sequence filters in the first layer focus more on simple features that can be combined in higher layers to learn motif features more efficiently, and their interpretation becomes less clear based on simple visualizations. Let's see where ISM and DeepLIFT get us with this model.
In [25]:
interpret_data_with_SequenceDNN(multi_layer_dragonn, simulation_data)
DeepLIFT and ISM scores for this model on representative positive (left) and negative (right) sequences expose what the model is doing. The SIX5-ZNF143 grammar is clearly highlighted by both methods in the positive class sequence. However, ISM assigns higher scores to false features around position 250, so we would not be able to distinguish between false and true features in this example based on ISM score magnitude. DeepLIFT, on the other hand, assigns the highest scores to the true features, and therefore it could be used in this example to detect the SIX5-ZNF143 grammar.
Finally, we show how to train a DragoNN on your own, non-simulated data using the dragonn command-line interface. We train a model on example positive and negative sequences provided in FASTA format:
In [26]:
!dragonn train --pos-sequences example_pos_sequences.fa --neg-sequences example_neg_sequences.fa --prefix training_example
Based on the provided prefix, this command stores a model architecture file, training_example.arch.json, and a weights file, training_example.weights.h5, with the parameters of the trained model. We test the model by running:
In [27]:
!dragonn test --pos-sequences example_pos_sequences.fa --neg-sequences example_neg_sequences.fa \
--arch-file training_example.arch.json --weights-file training_example.weights.h5
This command prints the model's test performance metrics on the provided data. Model predictions on sequence data can be obtained by running:
In [28]:
!dragonn predict --sequences example_pos_sequences.fa --arch-file training_example.arch.json \
--weights-file training_example.weights.h5 --output-file example_predictions.txt
This command stores the model predictions for sequences in example_pos_sequences.fa in the output file example_predictions.txt. We can interpret sequence data with a dragonn model by running:
In [29]:
!dragonn interpret --sequences example_pos_sequences.fa --arch-file training_example.arch.json \
--weights-file training_example.weights.h5 --prefix example_interpretation
This will write the most important subsequence in each input sequence, along with its location in the input sequence, to the file example_interpretation.task_0.important_sequences.txt. Note: by default, only examples with predicted positive class probability >0.5 are interpreted. Examples below this threshold yield an important subsequence of Ns with location -1. Let's look at the first few lines of this file:
In [30]:
!head example_interpretation.task_0.important_sequences.txt
The tutorial example here touches on general principles of DragoNN model development and interpretation. To gain a deeper insight into the difference between DeepLIFT and ISM for model interpretation, consider the following exercise:
Train, test, and run sequence-centric interpretation for the one-layer CNN model used here on the following simulations:
1. single motif detection simulation of TAL1 in 1000bp sequences with 40% GC content (run print_simulation_info("simulate_single_motif_detection") to see the exact simulation parameters)
2. motif density localization simulation of 2-4 TAL1 motif instances in the central 150bp of 1000bp sequences with 40% GC content (run print_simulation_info("simulate_motif_density_localization") to see the exact simulation parameters)
Key questions:
1) What could explain the difference in ISM's sensitivity to the TAL1 motif sequence between the simulations?
2) What does that tell us about the scope of ISM for feature discovery? Under what conditions is it likely to show sensitivity to sequence features?
Starter code is provided below to get the data for each simulation and a new DragoNN model.
In [ ]:
single_motif_detection_simulation_parameters = {
    "motif_name": "TAL1_known4",
    "seq_length": 1000,
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction": 0.4}

density_localization_simulation_parameters = {
    "motif_name": "TAL1_known4",
    "seq_length": 1000,
    "center_size": 150,
    "min_motif_counts": 2,
    "max_motif_counts": 4,
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction": 0.4}

single_motif_detection_simulation_data = get_simulation_data(
    "simulate_single_motif_detection", single_motif_detection_simulation_parameters)
density_localization_simulation_data = get_simulation_data(
    "simulate_motif_density_localization", density_localization_simulation_parameters)
In [ ]:
new_dragonn_parameters = dict(multi_layer_dragonn_parameters, seq_length=1000)  # the simulated sequences are 1000bp, so seq_length must match
new_dragonn_model = get_SequenceDNN(new_dragonn_parameters)
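One possible workflow for the exercise, using only functions introduced above: train a fresh model on each simulation, inspect its learning curve, and run sequence-centric interpretation. This is a sketch, not the only way to proceed.
In [ ]:
for data in [single_motif_detection_simulation_data,
             density_localization_simulation_data]:
    model = get_SequenceDNN(new_dragonn_parameters)  # fresh model per simulation
    train_SequenceDNN(model, data)
    SequenceDNN_learning_curve(model)
    interpret_data_with_SequenceDNN(model, data)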