In this example, we will use a SeqToSeq
model to generate fingerprints for classifying molecules. This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).
Many types of models require their inputs to have a fixed shape. Since molecules vary widely in the number of atoms and bonds they contain, it is hard to apply those models to them directly. We need a way of generating a fixed-length "fingerprint" for each molecule. Various ways of doing this have been designed, such as Extended-Connectivity Fingerprints (ECFPs). But in this example, instead of designing a fingerprint by hand, we will let a SeqToSeq
model learn its own method of creating fingerprints.
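For comparison, a hand-designed fixed-length fingerprint such as ECFP can be computed directly with DeepChem's CircularFingerprint featurizer. The following is just an illustrative sketch; the two SMILES strings and the 1024-bit size are arbitrary choices.
In [0]:
from rdkit import Chem
import deepchem as dc

example_smiles = ['CCO', 'c1ccccc1']                    # ethanol and benzene, for illustration
mols = [Chem.MolFromSmiles(s) for s in example_smiles]  # parse SMILES with RDKit
featurizer = dc.feat.CircularFingerprint(size=1024)     # 1024-bit ECFP featurizer
ecfp = featurizer.featurize(mols)                       # array of shape (2, 1024), one row per molecule
print(ecfp.shape)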
A SeqToSeq
model performs sequence-to-sequence translation. For example, such models are often used to translate text from one language to another. It consists of two parts called the "encoder" and "decoder". The encoder is a stack of recurrent layers. The input sequence is fed into it, one token at a time, and it generates a fixed-length vector called the "embedding vector". The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input and generates the output sequence. By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.
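To make the architecture concrete, here is a toy encoder-decoder sketch in Keras. It is not DeepChem's SeqToSeq implementation, and the vocabulary size, sequence length, and embedding dimension are made-up values; it only illustrates how an encoder compresses a token sequence into a fixed-length vector that a decoder then expands back into a sequence.
In [0]:
import tensorflow as tf

vocab_size, max_len, embed_dim = 40, 120, 256                # illustrative sizes only
tokens_in = tf.keras.Input(shape=(max_len, vocab_size))      # one-hot encoded input sequence
embedding = tf.keras.layers.GRU(embed_dim)(tokens_in)        # encoder: sequence -> fixed-length vector
repeated = tf.keras.layers.RepeatVector(max_len)(embedding)  # feed the embedding to every decoder step
decoded = tf.keras.layers.GRU(embed_dim, return_sequences=True)(repeated)  # decoder: vector -> sequence
probs = tf.keras.layers.Dense(vocab_size, activation='softmax')(decoded)   # per-step token probabilities
toy_autoencoder = tf.keras.Model(tokens_in, probs)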
In this case, we will use SMILES strings describing molecules as the input sequences. We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences. For that to work, the encoder must create embedding vectors that contain all information from the original sequence. That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment. The full notebook will take a few hours to run on a GPU machine, so we encourage you to run it on Google Colab unless you have a good GPU machine available.
In [0]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')
Let's start by loading the data. We will use the MUV dataset. It includes 74,501 molecules in the training set and 9,313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.
In [0]:
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv()
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
We need to define the "alphabet" for our SeqToSeq
model, the list of all tokens that can appear in sequences. (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.) Make a list of every character that appears in any training sequence.
In [0]:
tokens = set()
for s in train_smiles:
    tokens = tokens.union(set(c for c in s))
tokens = sorted(list(tokens))
Create the model and define the optimization method to use. In this case, learning works much better if we gradually decrease the learning rate. We use an ExponentialDecay
to multiply the learning rate by 0.9 after each epoch.
In [0]:
from deepchem.models.optimizers import Adam, ExponentialDecay
max_length = max(len(s) for s in train_smiles)
batch_size = 100
batches_per_epoch = len(train_smiles)/batch_size
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           model_dir='fingerprint',
                           batch_size=batch_size,
                           learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch))
Let's train it! The input to fit_sequences()
is a generator that produces input/output pairs. On a good GPU, this should take a few hours or less.
In [0]:
def generate_sequences(epochs):
    for i in range(epochs):
        for s in train_smiles:
            yield (s, s)

model.fit_sequences(generate_sequences(40))
Let's see how well it works as an autoencoder. We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.
In [0]:
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s, p in zip(valid_smiles[:500], predicted):
    if ''.join(p) == s:
        count += 1
print('reproduced', count, 'of 500 validation SMILES strings')
Now we'll try using the encoder as a way to generate molecular fingerprints. We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors. The amount of data is small enough that we can just store everything in memory.
In [0]:
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w,
                                                train_dataset.ids)
valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w,
                                                valid_dataset.ids)
For classification, we'll use a simple fully connected network with one hidden layer.
In [0]:
classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                           n_features=256,
                                           layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)
Find out how well it worked. Compute the ROC AUC for the training and validation datasets.
In [0]:
import numpy as np
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub. This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
Join the DeepChem Gitter. The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!