SMILES enumeration is the process of writing out all possible SMILES forms of a molecule. It's a useful technique for data augmentation before sequence-based modeling of molecules. You can read more about the background in this blog post or in this preprint on arxiv.org.
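As a minimal illustration of this redundancy (assuming RDKit is installed), three different SMILES strings for toluene all collapse to the same canonical form:
In [ ]:
from rdkit import Chem
#Three equivalent SMILES for toluene; canonicalization maps them to one string
for smi in ["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"]:
    print(Chem.MolToSmiles(Chem.MolFromSmiles(smi)))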
Import the SmilesEnumerator and instantiate the object
In [1]:
from SmilesEnumerator import SmilesEnumerator
sme = SmilesEnumerator()
help(SmilesEnumerator)
A few SMILES strings will be enumerated as a demonstration.
In [2]:
for i in range(10):
    print(sme.randomize_smiles("CCC(=O)O[C@@]1(CC[NH+](C[C@H]1CC=C)C)c2ccccc2"))
Before vectorization the SMILES must be stored as strings in a numpy array. The transform function takes numpy arrays or pandas series with the SMILES as strings.
In [5]:
import numpy as np
smiles = np.array(["CCC(=O)O[C@@]1(CC[NH+](C[C@H]1CC=C)C)c2ccccc2"])
print(smiles.shape)
Fit the charset and the padding to the SMILES array; alternatively, they can be specified when instantiating the object.
In [6]:
sme.fit(smiles)
print(sme.charset)
print(sme.pad)
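If the charset and padding are already known, they can be set up front instead. A sketch, assuming the constructor accepts charset and pad keyword arguments matching the attributes printed above:
In [ ]:
#Hypothetical explicit configuration: the charset must cover every character
#in the dataset and pad must be at least the longest SMILES length
sme_fixed = SmilesEnumerator(charset="C(=O)[@+H12cn]", pad=60)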
Some extra padding has been added to the maximum length observed in the SMILES array. The SMILES can be transformed to one-hot encoded vectors and shown with matplotlib.
In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
vect = sme.transform(smiles)
plt.imshow(vect[0])
Out[7]:
[image: one-hot encoded SMILES matrix]
It's a nice piano roll. If the vectorization is repeated, the result will differ due to the enumeration, as sme.enumerate and sme.canonical are set to True and False, respectively (the default settings).
In [8]:
print(sme.enumerate, sme.canonical)
vect = sme.transform(smiles)
plt.imshow(vect[0])
Out[8]:
[image: one-hot encoded SMILES matrix of a different enumerated form]
The reverse_transform() function can be used to translate back to a SMILES string, as long as the charset is the same as the one used for vectorization.
In [7]:
print(sme.reverse_transform(vect))
The SmilesEnumerator class can be used together with the SmilesIterator batch generator for on-the-fly vectorization for RNN modeling of molecules. Below is a brief demonstration of how this can be done.
In [10]:
import pandas as pd
data = pd.read_csv("Example_data/Sutherland_DHFR.csv")
print(data.head())
In [11]:
from sklearn.model_selection import train_test_split
#We ignore the > signs, and use random splitting for simplicity
X_train, X_test, y_train, y_test = train_test_split(data["smiles_parent"],
                                                    np.log(data["PC_uM_value"]).values.reshape(-1,1),
                                                    random_state=42)
from sklearn.preprocessing import RobustScaler
rbs = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(5.0, 95.0), copy=True)
y_train = rbs.fit_transform(y_train)
y_test = rbs.transform(y_test)
_ = plt.hist(y_train, bins=25)
In [12]:
import keras.backend as K
from SmilesEnumerator import SmilesIterator
#The SmilesEnumerator must be fit to the entire dataset, so that all chars are registered
sme.fit(data["smiles_parent"])
sme.leftpad = True
print(sme.charset)
print(sme.pad)
#The dtype is set for the K.floatx(), which is the numerical type configured for Tensorflow or Theano
generator = SmilesIterator(X_train, y_train, sme, batch_size=200, dtype=K.floatx())
In [83]:
X, y = next(generator)
print(X.shape)
print(y.shape)
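As a quick sanity check, the first entry of the batch can be decoded back to a SMILES string with the reverse_transform() shown earlier (assuming the generator produces the same one-hot layout as transform):
In [ ]:
#Decode the first one-hot matrix in the batch back to its (enumerated) SMILES
print(sme.reverse_transform(X[0:1]))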
Build a SMILES based RNN QSAR model with Keras.
In [84]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from keras.callbacks import ReduceLROnPlateau
from keras import regularizers
from keras.optimizers import RMSprop, Adam
In [131]:
input_shape = X.shape[1:]
output_shape = 1
model = Sequential()
model.add(LSTM(64,
               input_shape=input_shape,
               dropout=0.19
               #unroll=True
               ))
model.add(Dense(output_shape,
                kernel_regularizer=regularizers.l1_l2(0.005, 0.01),
                activation="linear"))
model.compile(loss="mse", optimizer=RMSprop(lr=0.005))
model.summary()
Use the generator object for training.
In [132]:
model.fit_generator(generator, steps_per_epoch=100, epochs=25, workers=4)
Out[132]:
In [133]:
y_pred_train = model.predict(sme.transform(X_train))
y_pred_test = model.predict(sme.transform(X_test))
plt.scatter(y_train, y_pred_train, label="Train")
plt.scatter(y_test, y_pred_test, label="Test")
plt.legend()
Out[133]:
[image: scatter plot of predicted vs. true values for train and test sets]
Not the best model so far. However, prolonged training with a lowering of the learning rate towards the end will improve the model.
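As a sketch of such a schedule, the ReduceLROnPlateau callback imported earlier can lower the learning rate when the loss stops improving; the monitor, factor, and patience values below are illustrative choices, not tuned settings:
In [ ]:
#Halve the learning rate when the training loss plateaus (illustrative values)
lr_schedule = ReduceLROnPlateau(monitor="loss", factor=0.5, patience=5, min_lr=1e-5)
model.fit_generator(generator, steps_per_epoch=100, epochs=100,
                    callbacks=[lr_schedule], workers=4)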
In [149]:
#The Enumerator can be used in sampling
i = 1
y_true = y_test[i]
y_pred = model.predict(sme.transform(X_test.iloc[i:i+1]))
print(y_true - y_pred)
In [151]:
#Enumeration of the SMILES before sampling stabilises the result
smiles_repeat = np.array([X_test.iloc[i:i+1].values[0]]*50)
y_pred = model.predict(sme.transform(smiles_repeat))
print(y_pred.std())
print(y_true - np.median(y_pred))
_ = plt.hist(y_pred)
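The pattern can be wrapped in a small helper. predict_enumerated below is a hypothetical convenience function (not part of SmilesEnumerator) that repeats one SMILES string n times and reduces the enumerated predictions to a median and a spread:
In [ ]:
def predict_enumerated(model, sme, smiles, n=50):
    #Hypothetical helper: each repeated copy is enumerated differently by
    #transform, so the median over n predictions is more stable
    batch = np.array([smiles] * n)
    preds = model.predict(sme.transform(batch))
    return np.median(preds), preds.std()

pred, spread = predict_enumerated(model, sme, X_test.iloc[i])
print(pred, spread)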
Please cite: SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules
@article{DBLP:journals/corr/Bjerrum17,
author = {Esben Jannik Bjerrum},
title = {{SMILES} Enumeration as Data Augmentation for Neural Network Modeling
of Molecules},
journal = {CoRR},
volume = {abs/1703.07076},
year = {2017},
url = {http://arxiv.org/abs/1703.07076},
timestamp = {Wed, 07 Jun 2017 14:40:38 +0200},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/Bjerrum17},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
If you find it useful, feel welcome to leave a comment on the blog.