SeqToSeq Fingerprint

In this example, we will use a SeqToSeq model to generate fingerprints for classifying molecules. This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

Many types of models require their inputs to have a fixed shape. Because molecules can vary widely in the number of atoms and bonds they contain, it is hard to apply those models to them directly. We need a way of generating a fixed-length "fingerprint" for each molecule. Various ways of doing this have been designed, such as Extended-Connectivity Fingerprints (ECFPs). But in this example, instead of designing a fingerprint by hand, we will let a SeqToSeq model learn its own method of creating fingerprints.
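
For comparison, here is a minimal sketch of computing a hand-designed fingerprint with DeepChem's CircularFingerprint featurizer (an ECFP-style fingerprint). The molecules and fingerprint size are just illustrative choices, not part of this tutorial's workflow.

from rdkit import Chem
import deepchem as dc

# Hand-designed alternative: fixed-length circular (ECFP-style) fingerprints.
featurizer = dc.feat.CircularFingerprint(size=1024, radius=2)
mols = [Chem.MolFromSmiles(s) for s in ['CCO', 'c1ccccc1']]  # ethanol, benzene
fingerprints = featurizer.featurize(mols)
print(fingerprints.shape)  # (2, 1024): one fixed-length vector per molecule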

A SeqToSeq model performs sequence to sequence translation. For example, such models are often used to translate text from one language to another. A SeqToSeq model consists of two parts called the "encoder" and the "decoder". The encoder is a stack of recurrent layers. The input sequence is fed into it, one token at a time, and it generates a fixed-length vector called the "embedding vector". The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input and generates the output sequence. By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.
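
To make the architecture concrete, here is a minimal conceptual sketch of an encoder/decoder pair built from Keras GRU layers. This is not DeepChem's SeqToSeq implementation; the vocabulary size, layer sizes, and sequence length below are placeholder values chosen only for illustration.

import tensorflow as tf

vocab_size = 64      # number of distinct tokens (placeholder)
embedding_dim = 256  # size of the fixed-length embedding vector
max_length = 100     # maximum sequence length (placeholder)

# Encoder: read the input tokens and keep only the final hidden state,
# which serves as the fixed-length embedding vector.
encoder_inputs = tf.keras.Input(shape=(max_length,))
x = tf.keras.layers.Embedding(vocab_size, 128)(encoder_inputs)
_, embedding = tf.keras.layers.GRU(embedding_dim, return_state=True)(x)

# Decoder: feed the embedding at every time step and predict one token per step.
repeated = tf.keras.layers.RepeatVector(max_length)(embedding)
y = tf.keras.layers.GRU(embedding_dim, return_sequences=True)(repeated)
decoder_outputs = tf.keras.layers.Dense(vocab_size, activation='softmax')(y)

autoencoder = tf.keras.Model(encoder_inputs, decoder_outputs)
autoencoder.compile(optimizer='adam', loss='sparse_categorical_crossentropy')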

In this case, we will use SMILES strings describing molecules as the input sequences. We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences. For that to work, the encoder must create embedding vectors that contain all information from the original sequence. That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

Let's start by loading the data. We will use the MUV dataset. It includes 74,501 molecules in the training set and 9,313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.


In [1]:
import deepchem as dc

# Load MUV; each dataset's ids field holds the SMILES string for each molecule.
tasks, datasets, transformers = dc.molnet.load_muv()
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids


About to load MUV dataset.
Loading dataset from disk.
Loading dataset from disk.
Loading dataset from disk.

We need to define the "alphabet" for our SeqToSeq model: the list of all tokens that can appear in sequences. (It's also possible for the input and output sequences to have different alphabets, but since we're training the model as an autoencoder, they're identical in this case.) We make a list of every character that appears in any training sequence.


In [2]:
# Collect every character that appears in any training SMILES string.
tokens = set()
for s in train_smiles:
  tokens = tokens.union(set(s))
tokens = sorted(list(tokens))

Create the model and define the optimization method to use. In this case, learning works much better if we gradually decrease the learning rate. We use an ExponentialDecay to multiply the learning rate by 0.9 after each epoch.


In [3]:
from deepchem.models.tensorgraph.optimizers import Adam, ExponentialDecay

# The longest training SMILES string determines the maximum sequence length.
max_length = max(len(s) for s in train_smiles)

# Input and output alphabets are the same because we train the model as an autoencoder.
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           model_dir='fingerprint')

# Multiply the learning rate by 0.9 after each pass through the training set.
batches_per_epoch = len(train_smiles)/model.batch_size
model.set_optimizer(Adam(learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch)))

Let's train it! The input to fit_sequences() is a generator that produces input/output pairs. On a good GPU, this should take a few hours or less.


In [4]:
def generate_sequences(epochs):
  # Yield (input, output) pairs; for an autoencoder the output equals the input.
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(40))


Ending global_step 999: Average loss 72.0029
Ending global_step 1999: Average loss 40.7221
Ending global_step 2999: Average loss 31.5364
Ending global_step 3999: Average loss 26.4576
Ending global_step 4999: Average loss 22.814
Ending global_step 5999: Average loss 19.5248
Ending global_step 6999: Average loss 16.4594
Ending global_step 7999: Average loss 18.8898
Ending global_step 8999: Average loss 13.476
Ending global_step 9999: Average loss 11.5528
Ending global_step 10999: Average loss 10.1594
Ending global_step 11999: Average loss 10.6434
Ending global_step 12999: Average loss 6.57057
Ending global_step 13999: Average loss 6.46177
Ending global_step 14999: Average loss 7.53559
Ending global_step 15999: Average loss 4.95809
Ending global_step 16999: Average loss 4.35039
Ending global_step 17999: Average loss 3.39137
Ending global_step 18999: Average loss 3.5216
Ending global_step 19999: Average loss 3.08579
Ending global_step 20999: Average loss 2.80738
Ending global_step 21999: Average loss 2.92217
Ending global_step 22999: Average loss 2.51032
Ending global_step 23999: Average loss 1.86265
Ending global_step 24999: Average loss 1.67088
Ending global_step 25999: Average loss 1.87016
Ending global_step 26999: Average loss 1.61166
Ending global_step 27999: Average loss 1.40708
Ending global_step 28999: Average loss 1.4488
Ending global_step 29801: Average loss 1.33917
TIMING: model fitting took 5619.924 s

Let's see how well it works as an autoencoder. We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.


In [5]:
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s,p in zip(valid_smiles[:500], predicted):
  if ''.join(p) == s:
    count += 1
print('reproduced', count, 'of 500 validation SMILES strings')


reproduced 363 of 500 validation SMILES strings
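
If you're curious which molecules it gets wrong, a quick diagnostic (not part of the original notebook) is to print a few of the mismatched reconstructions:

# Show a few validation SMILES the autoencoder failed to reproduce exactly.
shown = 0
for s, p in zip(valid_smiles[:500], predicted):
  if ''.join(p) != s and shown < 5:
    print('original: ', s)
    print('predicted:', ''.join(p))
    shown += 1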

Now we'll try using the encoder as a way to generate molecular fingerprints. We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those vectors as their feature vectors. The amount of data is small enough that we can just store everything in memory.


In [6]:
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w,
                                                train_dataset.ids)

valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w,
                                                valid_dataset.ids)
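
As a quick sanity check (not in the original notebook), each molecule should now be represented by a single 256-dimensional embedding vector:

# Each row is the fixed-length fingerprint for one molecule.
print(train_embeddings.shape)  # expected: (len(train_smiles), 256)
print(valid_embeddings.shape)  # expected: (len(valid_smiles), 256)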

For classification, we'll use a simple fully connected network with one hidden layer.


In [7]:
classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                           n_features=256,
                                           layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)


Ending global_step 999: Average loss 829.805
Ending global_step 1999: Average loss 450.42
Ending global_step 2999: Average loss 326.079
Ending global_step 3999: Average loss 265.199
Ending global_step 4999: Average loss 246.724
Ending global_step 5999: Average loss 224.64
Ending global_step 6999: Average loss 202.624
Ending global_step 7460: Average loss 213.885
TIMING: model fitting took 19.780 s

Find out how well it worked. Compute the ROC AUC for the training and validation datasets.


In [8]:
import numpy as np
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)


computed_metrics: [0.97828427249789751, 0.98705973960125326, 0.966007068438685, 0.9874401066031584, 0.97794394675150698, 0.98021719680962449, 0.95318452689781941, 0.97185747562764213, 0.96389538770053473, 0.96798988621997473, 0.9690779239145807, 0.98544402211472004, 0.97762497271338133, 0.96843239633294886, 0.97753648081489997, 0.96504683675485614, 0.93547151958366914]
computed_metrics: [0.90790686952512678, 0.79891461649782913, 0.61900937081659968, 0.75241212956581671, 0.58678903240426017, 0.72765072765072758, 0.34929006085192693, 0.83986814712005553, 0.82379943502824859, 0.61844636844636847, 0.863620199146515, 0.68106930272108857, 0.98020477815699669, 0.85073580939032944, 0.781015678254942, 0.75399733510992673, nan]
Training set ROC AUC: {'mean-roc_auc_score': 0.97132433878689139}
Validation set ROC AUC: {'mean-roc_auc_score': 0.74592061629292239}
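
Finally, here is a hypothetical sketch of how you might use the trained encoder and classifier together on new molecules. The SMILES strings below are arbitrary examples, not part of MUV.

# Fingerprint new molecules with the encoder, then classify the fingerprints.
new_smiles = ['CCO', 'c1ccccc1']  # placeholder molecules
new_embeddings = model.predict_embeddings(new_smiles)
new_dataset = dc.data.NumpyDataset(new_embeddings)
predictions = classifier.predict(new_dataset)
print(predictions.shape)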