This is an example of main.py in the ionic_liquids folder. First, we import the libraries needed to run this program, including train_test_split, which splits a dataset into the training and test sets required for machine learning.
In [38]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator as Calculator
For this example, we will use the following input filename, machine learning model, and directory name for saving the trained model.
In [55]:
FILENAME = 'inputdata2.xlsx'
MODEL = 'mlp_regressor'
DIRNAME = 'my_test'
The following step prepares the data to be read by the machine_learning methods. First, we need to get the data into a readable form and parse it if necessary. In our case, the values and their errors are combined in the last column of the input file and must be split apart.
In [56]:
def read_data(filename):
    """
    Reads data in from given file to Pandas DataFrame

    Inputs
    ------
    filename : string of path to file

    Returns
    -------
    df : Pandas DataFrame
    """
    cols = filename.split('.')
    name = cols[0]
    filetype = cols[1]
    if filetype == 'csv':
        df = pd.read_csv(filename)
    elif filetype in ['xls', 'xlsx']:
        df = pd.read_excel(filename)
    else:
        raise ValueError('Filetype not supported')
    # split the 'value ± error' column into separate columns
    df['EC_value'], df['EC_error'] = zip(*df['ELE_COD'].map(lambda x: x.split('±')))
    df = df.drop('EC_error', axis=1)
    df = df.drop('ELE_COD', axis=1)
    return df

df = read_data(FILENAME)
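To see what the '±' parsing step does, here is a minimal, self-contained sketch on hypothetical data (the sample values below are made up, not taken from inputdata2.xlsx):

```python
import pandas as pd

# Hypothetical rows mimicking the ELE_COD column: 'value±error' strings
sample = pd.DataFrame({'ELE_COD': ['0.52±0.01', '1.10±0.05']})
sample['EC_value'], sample['EC_error'] = zip(
    *sample['ELE_COD'].map(lambda x: x.split('±')))
print(sample['EC_value'].tolist())  # ['0.52', '1.10']
```

Note that the split pieces are still strings; downstream code that needs numbers would cast them, e.g. with `astype(float)`.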
Second, we will create an X matrix and y vector that are sent to the molecular descriptor function in utils.py. The X matrix holds all of the inputs for the machine learning, whereas the y vector holds the actual electrical conductivity values.
In [57]:
def molecular_descriptors(data):
    """
    Use RDKit to prepare the molecular descriptors

    Inputs
    ------
    data : DataFrame, cleaned csv data

    Returns
    -------
    prenorm_X : input features (before normalization)
    Y : experimental electrical conductivity
    """
    n = data.shape[0]
    # Choose which molecular descriptors we want
    list_of_descriptors = ['NumHeteroatoms', 'ExactMolWt',
                           'NOCount', 'NumHDonors',
                           'RingCount', 'NumAromaticRings',
                           'NumSaturatedRings', 'NumAliphaticRings']
    # Get the molecular descriptors and their dimension
    calc = Calculator(list_of_descriptors)
    D = len(list_of_descriptors)
    d = len(list_of_descriptors)*2 + 4
    Y = data['EC_value']
    X = np.zeros((n, d))
    X[:, -3] = data['T']
    X[:, -2] = data['P']
    X[:, -1] = data['MOLFRC_A']
    for i in range(n):
        A = Chem.MolFromSmiles(data['A'][i])
        B = Chem.MolFromSmiles(data['B'][i])
        X[i][:D] = calc.CalcDescriptors(A)
        X[i][D:2*D] = calc.CalcDescriptors(B)
    prenorm_X = pd.DataFrame(X, columns=['NUM', 'NumHeteroatoms_A',
                                         'MolWt_A', 'NOCount_A', 'NumHDonors_A',
                                         'RingCount_A', 'NumAromaticRings_A',
                                         'NumSaturatedRings_A',
                                         'NumAliphaticRings_A',
                                         'NumHeteroatoms_B', 'MolWt_B',
                                         'NOCount_B', 'NumHDonors_B',
                                         'RingCount_B', 'NumAromaticRings_B',
                                         'NumSaturatedRings_B',
                                         'NumAliphaticRings_B',
                                         'T', 'P', 'MOLFRC_A'])
    prenorm_X = prenorm_X.drop('NumAliphaticRings_A', axis=1)
    prenorm_X = prenorm_X.drop('NumAliphaticRings_B', axis=1)
    return prenorm_X, Y

X, y = molecular_descriptors(df)
We can prepare our training and testing data sets for machine learning by calling train_test_split, a function from the sklearn module. Here, test_size=0.1 holds out 10% of the data for testing.
In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
Next, the program normalizes the test data using statistics computed from the training set. This step also gives us the mean and standard deviation of each column of X.
In [51]:
def normalization(data, means=None, stdevs=None):
    """
    Normalizes the data using the means and standard
    deviations given, calculating them otherwise.
    Returns the means and standard deviations of columns.

    Inputs
    ------
    data : Pandas DataFrame
    means : optional numpy argument of column means
    stdevs : optional numpy argument of column st. devs

    Returns
    -------
    normed : the normalized DataFrame
    means : the numpy row vector of column means
    stdevs : the numpy row vector of column st. devs
    """
    cols = data.columns
    data = data.values
    if (means is None) or (stdevs is None):
        means = np.mean(data, axis=0)
        stdevs = np.std(data, axis=0, ddof=1)
    else:
        means = np.array(means)
        stdevs = np.array(stdevs)
    # handle special case of one row
    if (len(data.shape) == 1) or (data.shape[0] == 1):
        for i in range(len(data)):
            data[i] = (data[i] - means[i]) / stdevs[i]
    else:
        for i in range(data.shape[1]):
            data[:, i] = (data[:, i] - means[i]) / stdevs[i]
    normed = pd.DataFrame(data, columns=cols)
    return normed, means, stdevs

X_train, X_mean, X_std = normalization(X_train)
X_test, _, _ = normalization(X_test, X_mean, X_std)
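As a quick sanity check of the z-score normalization used above, here is a toy example (synthetic data, not from the package) showing that columns normalized with their own statistics end up with mean 0 and sample standard deviation 1:

```python
import numpy as np
import pandas as pd

# Synthetic toy data standing in for the feature matrix
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(300.0, 25.0, size=(50, 2)), columns=['T', 'P'])

means = toy.mean().values
stdevs = toy.std(ddof=1).values  # ddof=1 matches the function above

# Column-wise z-scoring via broadcasting
normed = (toy - means) / stdevs
print(normed.mean().round(6).tolist())      # [0.0, 0.0]
print(normed.std(ddof=1).round(6).tolist())  # [1.0, 1.0]
```

Applying the *training* means and stdevs to the test set, as the code above does, keeps both sets on the same scale without leaking test statistics into training.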
We coded three models into our program: MLP regressor, LASSO, and SVR. Each of these models is well documented in sklearn, a Python library. In the actual program you can use any of the three, but for the purpose of this example we chose mlp_regressor. A ValueError is raised only if you request a model other than these three, for example if you changed MODEL to 'MLP_classifier'.
In [52]:
if MODEL.lower() == 'mlp_regressor':
    obj = methods.do_MLP_regressor(X_train, y_train)
elif MODEL.lower() == 'lasso':
    obj = methods.do_lasso(X_train, y_train)
elif MODEL.lower() == 'svr':
    obj = methods.do_svr(X_train, y_train)
else:
    raise ValueError("Model not supported")
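The methods module ships with the package, so its implementation is not repeated here. For orientation, a minimal sketch of what do_MLP_regressor might look like, assuming sklearn's MLPRegressor (the hyperparameters below are illustrative assumptions, not the package's actual settings):

```python
from sklearn.neural_network import MLPRegressor

def do_MLP_regressor(X_train, y_train):
    # Fit a multilayer-perceptron regressor; hyperparameters here are
    # placeholder assumptions, not the package's real configuration
    mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500,
                       random_state=0)
    mlp.fit(X_train, y_train)
    return mlp
```

The returned object supports `predict`, which is all the later plotting and saving steps rely on.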
After the method is called, it returns a trained model object. This object is saved, along with the mean and standard deviation and the training set, in the directory named DIRNAME. This step is not essential to the workflow itself, but it is vital to the graphical user interface.
In [53]:
import os
import joblib
from datetime import datetime

def save_model(obj, X_mean, X_stdev, X=None, y=None, dirname='default'):
    """
    Save the trained regressor model to file

    Inputs
    ------
    obj : model object
    X_mean : mean for each column of training X
    X_stdev : stdev for each column of training X
    X : predictor matrix
    y : response vector
    dirname : the directory to save contents

    Returns
    -------
    None
    """
    if dirname == 'default':
        # fall back to a timestamped directory name
        timestamp = str(datetime.now())[:19]
        dirname = 'model_' + timestamp.replace(' ', '_')
    if not os.path.exists(dirname):
        os.makedirs(dirname)
    joblib.dump(obj, dirname + '/model.pkl')
    joblib.dump(X_mean, dirname + '/X_mean.pkl')
    joblib.dump(X_stdev, dirname + '/X_stdev.pkl')
    if X is not None:
        joblib.dump(X, dirname + '/X_data.pkl')
    if y is not None:
        joblib.dump(y, dirname + '/y_data.pkl')
    return

save_model(obj, X_mean, X_std, X_train, y_train, dirname=DIRNAME)
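Since the graphical user interface later reloads these files, a matching loader is straightforward. This is a sketch only; `load_model` is a hypothetical name, and the package's own loader may differ:

```python
import joblib

def load_model(dirname):
    # Reload the pieces written by save_model from the given directory
    # (hypothetical helper; the package may organize this differently)
    obj = joblib.load(dirname + '/model.pkl')
    X_mean = joblib.load(dirname + '/X_mean.pkl')
    X_stdev = joblib.load(dirname + '/X_stdev.pkl')
    return obj, X_mean, X_stdev
```

With the model and the training-set statistics in hand, new inputs can be normalized with the same means and stdevs before calling `obj.predict`.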
Lastly, the experimental values are scatter-plotted against the predicted values using parity_plot. The plt.show() call simply displays the figure.
In [54]:
FIG_SIZE = (8, 6)  # assumed figure size in inches; not defined elsewhere in this example

def parity_plot(y_pred, y_act):
    """
    Creates a parity plot

    Inputs
    ------
    y_pred : predicted values from the model
    y_act : 'true' (actual) values

    Returns
    -------
    fig : matplotlib figure
    """
    fig = plt.figure(figsize=FIG_SIZE)
    plt.scatter(y_act, y_pred)
    # reference line y = x: points on it are perfect predictions
    plt.plot([y_act.min(), y_act.max()], [y_act.min(), y_act.max()],
             lw=4, color='r')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    return fig

my_plot = parity_plot(obj.predict(X_train), y_train)
plt.show()
Feel free to look at the other examples, which are more explicit about the individual functions. We hope you enjoy our package and adapt it to fit your needs!