Thanks for checking out Salty!
The purpose of Salty is to extend some of the machine learning ecosystem to ionic liquid (IL) data from ILThermo.
Salty operates using the simplified molecular-input line-entry system (SMILES). One of the core methods of Salty is the check_name() function that converts IUPAC naming to Smiles:
In [1]:
import salty
smiles = salty.check_name("1-butyl-3-methylimidazolium")
print(smiles)
once we have a smiles representation of a molecule, we can convert it into a molecular object with RDKit:
In [2]:
%matplotlib inline
from rdkit import Chem
from rdkit.Chem import Draw
fig = Draw.MolToMPL(Chem.MolFromSmiles(smiles),figsize=(5,5))
Once we have a molecular object, we can generate many kinds of bitvector representations or fingerprints.
Fingerprints can be used as descriptors in machine learning models, uncertainty estimators in structure search algorithms, or, as shown below, to simply compare two molecular structures:
In [3]:
ms = [Chem.MolFromSmiles("OC(=O)C(N)Cc1ccc(O)cc1"), Chem.MolFromSmiles(smiles)]
fig=Draw.MolsToGridImage(ms[:],molsPerRow=2,subImgSize=(400,200))
fig.save('compare.png')
from rdkit.Chem.AtomPairs import Pairs
from rdkit.Chem import AllChem
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit import DataStructs
radius = 2
fpatom = [Pairs.GetAtomPairFingerprintAsBitVect(x) for x in ms]
print("atom pair score: {:8.4f}".format(DataStructs.TanimotoSimilarity(fpatom[0], fpatom[1])))
fpmorg = [AllChem.GetMorganFingerprint(ms[0],radius,useFeatures=True),
AllChem.GetMorganFingerprint(ms[1],radius,useFeatures=True)]
fptopo = [FingerprintMols.FingerprintMol(x) for x in ms]
print("morgan score: {:11.4f}".format(DataStructs.TanimotoSimilarity(fpmorg[0], fpmorg[1])))
print("topological score: {:3.4f}".format(DataStructs.TanimotoSimilarity(fptopo[0], fptopo[1])))
check_name is based on a curated data file of known cations and anions
In [4]:
print("Cations in database: {}".format(len(salty.load_data("cationInfo.csv"))))
print("Anions in database: {}".format(len(salty.load_data("anionInfo.csv"))))
In [5]:
rawdata = salty.load_data("cpt.csv")
rawdata.columns
Out[5]:
In [6]:
devmodel = salty.aggregate_data(['cpt', 'density']) # other option is viscosity
aggregate_data returns a devmodel object that contains a pandas dataframe of the raw data and a data summary:
In [7]:
devmodel.Data_summary
Out[7]:
and has scaled and centered 2D features from rdkit:
In [8]:
devmodel.Data.columns
Out[8]:
The purpose of the data summary is to provide historical information when ML models are ported over into GAINS. Once we have a devmodel the underlying data can be interrogated:
In [9]:
import matplotlib.pyplot as plt
import numpy as np
df = devmodel.Data
with plt.style.context('seaborn-whitegrid'):
fig=plt.figure(figsize=(5,5), dpi=300)
ax=fig.add_subplot(111)
scat = ax.scatter(np.exp(df["Heat capacity at constant pressure, J/K/mol"]), np.exp(
df["Specific density, kg/m<SUP>3</SUP>"]),
marker="*", c=df["Temperature, K"]/max(df["Temperature, K"]), cmap="Purples")
plt.colorbar(scat)
ax.grid()
ax.set_ylabel("Density $(kg/m^3)$")
ax.set_xlabel("Heat Capacity $(J/K/mol)$")
In [58]:
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.multioutput import MultiOutputRegressor
model = MLPRegressor(activation='relu', alpha=0.92078, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=75, learning_rate='constant',
learning_rate_init=0.001, max_iter=1e8, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='lbfgs', tol=1e-08, validation_fraction=0.1,
verbose=False, warm_start=False)
multi_model = MultiOutputRegressor(model)
X_train, Y_train, X_test, Y_test = salty.devmodel_to_array\
(devmodel, train_fraction=0.8)
multi_model.fit(X_train, Y_train)
Out[58]:
We can then see how the model is performing with matplotlib:
In [24]:
X=X_train
Y=Y_train
with plt.style.context('seaborn-whitegrid'):
fig=plt.figure(figsize=(5,2.5), dpi=300)
ax=fig.add_subplot(122)
ax.plot([0,2000], [0,2000], linestyle="-", label=None, c="black", linewidth=1)
ax.plot(np.exp(Y)[:,0],np.exp(multi_model.predict(X))[:,0],\
marker="*",linestyle="",alpha=0.4)
ax.set_ylabel("Predicted $C_{pt}$ $(K/J/mol)$")
ax.set_xlabel("Actual $C_{pt}$ $(K/J/mol)$")
ax.text(0.1,.9,"R: {0:5.3f}".format(multi_model.score(X,Y)), transform = ax.transAxes)
plt.xlim(200,1700)
plt.ylim(200,1700)
ax.grid()
ax=fig.add_subplot(121)
ax.plot([0,2000], [0,2000], linestyle="-", label=None, c="black", linewidth=1)
ax.plot(np.exp(Y)[:,1],np.exp(multi_model.predict(X))[:,1],\
marker="*",linestyle="",alpha=0.4)
ax.set_ylabel("Predicted Density $(kg/m^3)$")
ax.set_xlabel("Actual Density $(kg/m^3)$")
plt.xlim(850,1600)
plt.ylim(850,1600)
ax.grid()
plt.tight_layout()
In [17]:
from keras.layers import Dense, Dropout, Input
from keras.models import Model, Sequential
from keras.optimizers import Nadam
X_train, Y_train, X_test, Y_test = salty.devmodel_to_array\
(devmodel, train_fraction=0.8)
model = Sequential()
model.add(Dense(75, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.5))
model.add(Dense(2, activation='linear'))
model.compile(optimizer="adam",
loss="mean_squared_error",
metrics=['mse'])
In [18]:
model.fit(X_train,Y_train,epochs=1000,verbose=False)
Out[18]:
In [19]:
X=X_train
Y=Y_train
with plt.style.context('seaborn-whitegrid'):
fig=plt.figure(figsize=(5,2.5), dpi=300)
ax=fig.add_subplot(122)
ax.plot([0,2000], [0,2000], linestyle="-", label=None, c="black", linewidth=1)
ax.plot(np.exp(Y)[:,0],np.exp(model.predict(X))[:,0],\
marker="*",linestyle="",alpha=0.4)
ax.set_ylabel("Predicted $C_{pt}$ $(K/J/mol)$")
ax.set_xlabel("Actual $C_{pt}$ $(K/J/mol)$")
plt.xlim(200,1700)
plt.ylim(200,1700)
ax.grid()
ax=fig.add_subplot(121)
ax.plot([0,2000], [0,2000], linestyle="-", label=None, c="black", linewidth=1)
ax.plot(np.exp(Y)[:,1],np.exp(model.predict(X))[:,1],\
marker="*",linestyle="",alpha=0.4)
ax.set_ylabel("Predicted Density $(kg/m^3)$")
ax.set_xlabel("Actual Density $(kg/m^3)$")
plt.xlim(850,1600)
plt.ylim(850,1600)
ax.grid()
plt.tight_layout()
In [50]:
model.save("_static/cpt_density_qspr.h5")
devmodel.Data_summary.to_csv("_static/cpt_density_summ.csv")
devmodel.Coef_data.to_csv("_static/cpt_density_desc.csv", index=False)
These are all the basic Salty functions for now!