Tutorial length: 25-30 minutes with a CPU.
* How to use this tutorial
* Review of patterns in transcription factor binding sites
* Learning to localize homotypic motif density
* Sequence model definition
* Training and interpretation of:
- single layer, single filter DragoNN
- single layer, multiple filters DragoNN
- Multi-layer DragoNN
- Regularized multi-layer DragoNN
* Critical questions in this tutorial:
- What is the "right" way to get insight from a DragoNN model?
- What are the limitations of different interpretation methods?
- Do those limitations depend on the model and the target pattern?
* Suggestions for further exploration
Github issues on the dragonn repository with feedback, questions, and discussion are always welcome.
This tutorial utilizes a Jupyter/IPython Notebook - an interactive computational environment that combines live code, visualizations, and explanatory text. The notebook is organized into a series of cells. You can run the next cell by clicking the play button:
We start by loading dragonn's tutorial utilities and reviewing properties of regulatory sequence that transcription factors bind.
In [1]:
%reload_ext autoreload
%autoreload 2
#from tutorial_utils import *
%matplotlib inline
In this tutorial we will learn how to localize a homotypic motif cluster. We will simulate a positive set of sequences with multiple instances of a motif in the center, and a negative set of sequences with multiple motif instances positioned anywhere in the sequence.
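To make the setup concrete, below is a toy version of such a simulation, for intuition only: the real simulate_motif_density_localization samples motif instances from the TAL1_known4 position weight matrix, whereas this sketch plants a fixed placeholder consensus string (the E-box-like "CAGATG" is just a stand-in).
In [ ]:
# Toy sketch of a motif density localization simulation (not the real code).
import random

def toy_simulate(seq_length=1000, center_size=150, motif="CAGATG",
                 min_counts=2, max_counts=4, gc_fraction=0.4, positive=True):
    # Sample background bases A, C, G, T with the requested GC content.
    weights = [(1 - gc_fraction) / 2, gc_fraction / 2,
               gc_fraction / 2, (1 - gc_fraction) / 2]
    seq = random.choices("ACGT", weights=weights, k=seq_length)
    # Positives place motif instances only in the central window;
    # negatives place them anywhere in the sequence.
    center_start = (seq_length - center_size) // 2
    lo, hi = ((center_start, center_start + center_size - len(motif))
              if positive else (0, seq_length - len(motif)))
    for _ in range(random.randint(min_counts, max_counts)):
        start = random.randint(lo, hi)
        seq[start:start + len(motif)] = motif
    return "".join(seq)

print(toy_simulate(positive=True)[:80])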
We start by getting the simulation data.
In [2]:
from simulations import simulate_motif_density_localization
print(simulate_motif_density_localization.__doc__)
Next, we define parameters for TAL1 motif density localization in 1000bp-long sequences, with 0.4 GC fraction and 2-4 instances of the motif in the central 150bp of the positive sequences. We simulate a total of 3000 positive and 3000 negative sequences.
In [3]:
motif_density_localization_simulation_parameters = {
"motif_name": "TAL1_known4",
"seq_length": 1000,
"center_size": 150,
"min_motif_counts": 2,
"max_motif_counts": 4,
"num_pos": 3000,
"num_neg": 3000,
"GC_fraction": 0.4}
We get the simulation data by calling the simulate_motif_density_localization function with the simulation parameters as input. We then one-hot encode the sequences and split them: 900 sequences are held out for a test set, 900 sequences for a validation set, and the remaining 4200 sequences form the training set.
In [4]:
sequences, y, embed = simulate_motif_density_localization(**motif_density_localization_simulation_parameters)
In [5]:
import deepchem as dc
from utils import one_hot_encode
from sklearn.model_selection import train_test_split

# One-hot encode the raw sequences, then split 70/15/15 into
# training, validation, and test sets.
X = one_hot_encode(sequences)
X_train, X_rem, y_train, y_rem, embed_train, embed_rem = train_test_split(X, y, embed, test_size=.3)
X_valid, X_test, y_valid, y_test, embed_valid, embed_test = train_test_split(X_rem, y_rem, embed_rem, test_size=.5)
train = dc.data.NumpyDataset(X_train, y_train)
valid = dc.data.NumpyDataset(X_valid, y_valid)
test = dc.data.NumpyDataset(X_test, y_test)

# Inspect the simulated data: raw sequence, label, embedded motif
# instances, and the shapes of the splits.
print(type(X), type(y), type(embed))
print(X.shape, y.shape, len(embed))
print(sequences[:1])
print(y[:1])
for emb in embed[0]:
    print(emb)
for name, arr in [("X_train", X_train), ("X_valid", X_valid), ("X_test", X_test),
                  ("y_train", y_train), ("y_valid", y_valid), ("y_test", y_test)]:
    print(name + ".shape:", arr.shape)
print(X[0, :, :, :10])
The simulation provides input sequences X and labels y, which we split into training, validation, and test sets above. The inputs X are matrices with a one-hot encoding of the sequences:
In [6]:
#simulation_data.X_train[0, :, :, :10]
X_train[0, :, :, :10]
Out[6]:
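For reference, a minimal one-hot encoder might look like the sketch below. The actual utils.one_hot_encode may differ in its details, but the shapes printed above suggest the same (num_sequences, 1, 4, seq_length) layout with rows ordered A, C, G, T.
In [ ]:
import numpy as np

def one_hot_encode_sketch(sequences):
    # Map each sequence to a (1, 4, length) binary matrix with rows
    # ordered A, C, G, T; unknown characters (e.g. N) stay all-zero.
    base_to_row = {"A": 0, "C": 1, "G": 2, "T": 3}
    X = np.zeros((len(sequences), 1, 4, len(sequences[0])), dtype=np.float32)
    for i, seq in enumerate(sequences):
        for j, base in enumerate(seq.upper()):
            if base in base_to_row:
                X[i, 0, base_to_row[base], j] = 1.0
    return X

print(one_hot_encode_sketch(["ACGTN"])[0, 0])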
A locally connected linear unit in a DragoNN model can represent a PSSM (part a). A sequence PSSM score is obtained by multiplying the PSSM across the sequence, thresholding the PSSM scores, and taking the max (part b). A PSSM score can also be computed by a DragoNN model with tiled locally connected linear units, amounting to a convolutional layer with a single convolutional filter representing the PSSM, followed by ReLU thresholding and maxpooling (part c).
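The equivalence in part (c) is easy to verify with numpy: dotting the PSSM against every window of a one-hot sequence is a convolution, and ReLU thresholding followed by a max over positions yields the sequence's PSSM score. A sketch with a made-up 4 x 3 PSSM, reusing one_hot_encode_sketch from the cell above:
In [ ]:
import numpy as np

# Made-up 4 x 3 PSSM; rows correspond to A, C, G, T.
pssm = np.array([[ 1., -1., -1.],
                 [-1.,  1., -1.],
                 [-1., -1.,  1.],
                 [-1., -1., -1.]])
seq = one_hot_encode_sketch(["GACGACG"])[0, 0]   # shape (4, 7)

width = pssm.shape[1]
# Convolution: dot the PSSM against every window of the sequence.
scores = np.array([np.sum(pssm * seq[:, i:i + width])
                   for i in range(seq.shape[1] - width + 1)])
thresholded = np.maximum(scores, 0)   # ReLU thresholding
print(thresholded.max())              # max pooling gives the PSSM score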
The main DragoNN model class is SequenceDNN, which provides a simple interface to a range of models and methods to train, test, and interpret DragoNNs. SequenceDNN uses Keras, a deep learning library that runs on top of Theano and TensorFlow, two popular software packages for deep learning.
To get a DragoNN model we:
1) Define the DragoNN architecture parameters
- a description of the architecture parameters is available from the inspect_SequenceDNN() function
2) Call the get_SequenceDNN function, which takes the DragoNN architecture parameters as input and outputs a randomly initialized DragoNN model.
To get a description of the architecture parameters we use the inspect_SequenceDNN function, which outputs documentation for the model class, including the architecture parameters:
In [7]:
#inspect_SequenceDNN()
from deepchem.models.tensorgraph.models.sequence_dnn import SequenceDNN
print(SequenceDNN.__doc__)
"Available methods" display what can be done with a SequenceDNN model. These include common operations such as training and testing the model, and more complex operations such as extracting insight from trained models. We define a simple DragoNN model with one convolutional layer with one convolutional filter, followed by maxpooling of width 35.
In [8]:
one_filter_dragonn_parameters = {
'seq_length': 1000,
'num_filters': [1],
'conv_width': [10],
'pool_width': 35}
We get a randomly initialized DragoNN model by instantiating the SequenceDNN class with the sequence length and the number of filters as inputs (in the original tutorial, this is done by calling get_SequenceDNN with one_filter_dragonn_parameters):
In [12]:
seq_dnn = SequenceDNN(1000, num_filters=1)
Next, we train the model by calling its fit method with the training data as input (in the original tutorial, train_SequenceDNN with one_filter_dragonn and simulation_data). In each epoch, the model performs a complete pass over the training data and updates its parameters to minimize the loss, which quantifies the error in the model predictions. After each epoch, the code prints performance metrics for the model on the validation data. Training stops once the loss on the validation set stops improving for multiple consecutive epochs. The performance metrics include balanced accuracy, area under the receiver-operating characteristic curve (auROC), area under the precision-recall curve (auPRC), and recall at multiple false discovery rates (Recall at FDR).
In [13]:
#train_SequenceDNN(one_filter_dragonn, simulation_data)
#seq_dnn.build()
seq_dnn.fit(train, nb_epoch=1)  # one pass over the training data
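The metrics described above are straightforward to compute with scikit-learn once we have predictions. Below is a sketch that assumes seq_dnn.predict(valid) returns one positive-class probability per validation sequence (the exact output shape may differ, hence the reshape):
In [ ]:
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

probs = np.asarray(seq_dnn.predict(valid)).reshape(-1)   # assumed: one probability per sequence
labels = np.asarray(y_valid).reshape(-1)

def recall_at_fdr(labels, probs, fdr=0.1):
    # Highest recall achievable while precision stays at or above 1 - FDR.
    precision, recall, _ = precision_recall_curve(labels, probs)
    ok = precision >= 1 - fdr
    return recall[ok].max() if ok.any() else 0.0

print("auROC:", roc_auc_score(labels, probs))
print("auPRC:", average_precision_score(labels, probs))
print("Recall at 10% FDR:", recall_at_fdr(labels, probs))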
A single layer, single filter model gets good performance and doesn't overfit much. Let's look at the learning curve to demonstrate this visually:
In [11]:
SequenceDNN_learning_curve(one_filter_dragonn)
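SequenceDNN_learning_curve comes from the original dragonn tutorial utilities. If you are running the deepchem port instead, a rough substitute is to train one epoch at a time and record a validation metric yourself, as in the sketch below (it assumes seq_dnn.predict behaves as in the previous sketch):
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

val_aurocs = []
for epoch in range(10):
    seq_dnn.fit(train, nb_epoch=1)   # one pass over the training data
    probs = np.asarray(seq_dnn.predict(valid)).reshape(-1)
    val_aurocs.append(roc_auc_score(np.asarray(y_valid).reshape(-1), probs))

plt.plot(range(1, len(val_aurocs) + 1), val_aurocs, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Validation auROC")
plt.show()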
In [ ]:
multi_filter_dragonn_parameters = {
'seq_length': 1000,
'num_filters': [15], ## notice the change from 1 filter to 15 filters
'conv_width': [10],
'pool_width': 35}
multi_filter_dragonn = get_SequenceDNN(multi_filter_dragonn_parameters)
train_SequenceDNN(multi_filter_dragonn, simulation_data)
SequenceDNN_learning_curve(multi_filter_dragonn)
It slightly outperforms the single filter model. Let's check if the learned filters capture the simulated pattern.
In [ ]:
interpret_SequenceDNN_filters(multi_filter_dragonn, simulation_data)
Only some of the filters closely match the simulated pattern. This illustrates that interpreting model parameters directly works only partially for multi-filter models. Another way to deduce learned patterns is to examine feature importances for specific examples. Next, we explore methods for feature importance scoring.
Using in-silico mutagenesis (ISM) and DeepLIFT, we can obtain scores for a specific sequence indicating the importance of each position in the sequence. To assess these methods, we compare ISM and DeepLIFT scores to motif scores for each simulated motif at each position in the sequence. These motif scores represent the "ground truth" importance of each position because they are based on the motifs used to simulate the data. We provide comparisons for a positive-class sequence on the left and a negative-class sequence on the right.
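ISM itself is simple enough to sketch directly: mutate each position to each alternative base, re-run the model, and record how much the predicted probability drops. A minimal version for a single one-hot sequence, assuming a predict_prob function that maps a batch of one-hot sequences to positive-class probabilities:
In [ ]:
import numpy as np

def ism_scores(x, predict_prob):
    # x: one one-hot sequence of shape (1, 4, length). predict_prob maps
    # a batch of such arrays to positive-class probabilities.
    ref = predict_prob(x[np.newaxis])[0]
    scores = np.zeros(x.shape[2])
    for pos in range(x.shape[2]):
        drops = []
        for base in range(4):
            if x[0, base, pos] == 1:
                continue   # skip the reference base
            mutant = x.copy()
            mutant[0, :, pos] = 0
            mutant[0, base, pos] = 1
            drops.append(ref - predict_prob(mutant[np.newaxis])[0])
        # Score each position by the largest drop any mutation causes.
        scores[pos] = max(drops)
    return scores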
In [ ]:
interpret_data_with_SequenceDNN(multi_filter_dragonn, simulation_data)
In the positive example (left side), ISM correctly highlights the two motif instances in the central 150bp. DeepLIFT highlights them as well. DeepLIFT also slightly highlights a false positive feature on the left side, but its score is sufficiently small that we can discriminate between the false positive feature and the true positive features. In the negative example (right side), ISM doesn't highlight anything, but DeepLIFT highlights a couple of false positive features almost as much as it highlights the true positive features in the positive example.
In [ ]:
multi_layer_dragonn_parameters = {
'seq_length': 1000,
'num_filters': [15, 15, 15], ## notice the change to multiple filter values, one for each layer
'conv_width': [10, 10, 10], ## one convolutional filter width per layer
'pool_width': 35}
multi_layer_dragonn = get_SequenceDNN(multi_layer_dragonn_parameters)
train_SequenceDNN(multi_layer_dragonn, simulation_data)
SequenceDNN_learning_curve(multi_layer_dragonn)
This model performs about the same as the single layer model but it overfits more. We will try to address that with dropout regularization. But first, what do the first layer filters look like?
In [ ]:
interpret_SequenceDNN_filters(multi_layer_dragonn, simulation_data)
The filters now make less sense than in the single-layer model case. In multi-layered models, sequence features are learned compositionally across the layers. As a result, first-layer filters tend to capture simple features that higher layers combine into motif features, and simple visualizations of those filters become less interpretable. Let's see where ISM and DeepLIFT get us with this model.
In [ ]:
interpret_data_with_SequenceDNN(multi_layer_dragonn, simulation_data)
As in the single layer model case, ISM correctly highlights the two true positive features in the positive example (left side) and correctly ignores features in the negative example (right side). DeepLIFT still highlights the same false positive feature in the positive example as before, but we can still separate it from the true positive features. In the negative example, it still highlights some false positive features.
In [ ]:
regularized_multi_layer_dragonn_parameters = {
'seq_length': 1000,
'num_filters': [15, 15, 15],
'conv_width': [10, 10, 10],
'pool_width': 35,
'dropout': 0.2} ## we introduce dropout of 0.2 on every convolutional layer for regularization
regularized_multi_layer_dragonn = get_SequenceDNN(regularized_multi_layer_dragonn_parameters)
train_SequenceDNN(regularized_multi_layer_dragonn, simulation_data)
SequenceDNN_learning_curve(regularized_multi_layer_dragonn)
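For reference, the architecture this parameter set describes is roughly the following Keras model. This is a sketch of what get_SequenceDNN builds under these parameters, not its exact implementation; note it uses a channels-last (seq_length, 4) input layout rather than the (1, 4, seq_length) layout above.
In [ ]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Dropout, MaxPooling1D, Flatten, Dense

model = Sequential([
    # Three convolutional layers of 15 filters of width 10,
    # each followed by 0.2 dropout.
    Conv1D(15, 10, activation="relu", input_shape=(1000, 4)),
    Dropout(0.2),
    Conv1D(15, 10, activation="relu"),
    Dropout(0.2),
    Conv1D(15, 10, activation="relu"),
    Dropout(0.2),
    MaxPooling1D(pool_size=35),
    Flatten(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()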
As expected, dropout decreased the overfitting this model displayed previously and increased validation performance. Let's see the effect on feature discovery.
In [ ]:
interpret_data_with_SequenceDNN(regularized_multi_layer_dragonn, simulation_data)
ISM now highlights a false positive feature in the positive example (left side) more than the true positive features. What happened? A sufficiently accurate model should not change its confidence that there are two or more motif instances in the central 150 base pairs (bp) because of a single-bp change. So it makes sense that, in the limit of the "perfect" model, ISM loses its power to discover features in this example.
How about DeepLIFT? DeepLIFT correctly highlights the only two positive features in the positive example. So it seems that in the limit of the "perfect" model, DeepLIFT gets closer to the true positive features.
Why did this happen? Why does ISM fail to highlight the true positive features as we regularize the model and improve its performance? Here is a hint: in the limit of the "perfect" model for this simulation, will a single base-pair perturbation of the positive example change the model's confidence that it is still a positive example? I encourage you to open GitHub issues on the dragonn repo to discuss these questions.
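The argument is easy to check with a toy "perfect" model: an oracle that counts motif matches in the central 150bp window and predicts positive whenever it sees at least two. When three or more instances are present, mutating one base of one instance leaves the count at two or more, so the prediction is unchanged and the ISM delta at that position is exactly zero. A sketch, reusing toy_simulate from the start of the tutorial:
In [ ]:
def oracle_prob(seq, motif="CAGATG", center=(425, 575)):
    # "Perfect" model: positive iff >= 2 motif matches in the central
    # 150bp window of a 1000bp sequence.
    window = seq[center[0]:center[1]]
    count = sum(window[i:i + len(motif)] == motif
                for i in range(len(window) - len(motif) + 1))
    return 1.0 if count >= 2 else 0.0

seq = toy_simulate(positive=True)        # toy positive example
idx = seq.find("CAGATG", 425)            # first motif instance in the center
mutant = seq[:idx] + "T" + seq[idx + 1:] # single base-pair perturbation
# With 3+ instances the prediction is unchanged, so ISM scores this position 0.
print("ISM delta:", oracle_prob(seq) - oracle_prob(mutant))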
Below is an overview of patterns and simulations for further exploration.
In this tutorial we explored modeling of homotypic motif density. Regulatory DNA sequence displays other patterns as well, and DragoNN provides simulations that formulate learning these patterns as classification problems.
You can view the available simulation functions by running print_available_simulations:
In [ ]:
print_available_simulations()