Example of the Workflow

This is an example of main.py in the ionic_liquids folder. I will first have to import the libraries that are necessary to run this program, including train_test_split that allows for splitting datasets into training sets and test sets necessary to run machine learning.


In [38]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator as Calculator

For this example, I will utilize the following filename, machine learning model, and directory name to save the model.


In [55]:
FILENAME = 'inputdata2.xlsx'
MODEL = 'mlp_regressor'
DIRNAME = 'my_test'

The following step prepares the data to be read in the machine_learning methods. First, we need to get the data into a readable form and parse, if necessary. In our case, we need to parse the values and errors in the last column of the FILENAME.


In [56]:
def read_data(filename):
    """
    Reads data in from given file to Pandas DataFrame

    Inputs
    -------
    filename : string of path to file

    Returns
    ------
    df : Pandas DataFrame

    """
    cols = filename.split('.')
    name = cols[0]
    filetype = cols[1]
    if (filetype == 'csv'):
        df = pd.read_csv(filename)
    elif (filetype in ['xls', 'xlsx']):
        df = pd.read_excel(filename)
    else:
        raise ValueError('Filetype not supported')

    # clean the data if necessary
    df['EC_value'], df['EC_error'] = zip(*df['ELE_COD'].map(lambda x: x.split('±')))
    df = df.drop('EC_error', 1)
    df = df.drop('ELE_COD', 1)

    return df

df = read_data(FILENAME)

Secondly, we will create a X matrix and y vector that are send to the molecular descriptor function in utils.py. The X matrix will hold all of our inputs for the machine learning whereas y vector will be the actual electronic conductivity values.


In [57]:
def molecular_descriptors(data):
    """
    Use RDKit to prepare the molecular descriptor

    Inputs
    ------
    data: dataframe, cleaned csv data

    Returns
    ------
    prenorm_X: normalized input features
    Y: experimental electrical conductivity

    """

    n = data.shape[0]
    # Choose which molecular descriptor we want
    list_of_descriptors = ['NumHeteroatoms', 'ExactMolWt',
        'NOCount', 'NumHDonors',
        'RingCount', 'NumAromaticRings', 
        'NumSaturatedRings', 'NumAliphaticRings']
    # Get the molecular descriptors and their dimension
    calc = Calculator(list_of_descriptors)
    D = len(list_of_descriptors)
    d = len(list_of_descriptors)*2 + 4

    Y = data['EC_value']
    X = np.zeros((n, d))
    X[:, -3] = data['T']
    X[:, -2] = data['P']
    X[:, -1] = data['MOLFRC_A']
    for i in range(n):
        A = Chem.MolFromSmiles(data['A'][i])
        B = Chem.MolFromSmiles(data['B'][i])
        X[i][:D]    = calc.CalcDescriptors(A)
        X[i][D:2*D] = calc.CalcDescriptors(B)

    prenorm_X = pd.DataFrame(X,columns=['NUM', 'NumHeteroatoms_A', 
        'MolWt_A', 'NOCount_A','NumHDonors_A', 
        'RingCount_A', 'NumAromaticRings_A', 
        'NumSaturatedRings_A',
        'NumAliphaticRings_A', 
        'NumHeteroatoms_B', 'MolWt_B', 
        'NOCount_B', 'NumHDonors_B',
        'RingCount_B', 'NumAromaticRings_B', 
        'NumSaturatedRings_B', 
        'NumAliphaticRings_B',
        'T', 'P', 'MOLFRC_A'])

    prenorm_X = prenorm_X.drop('NumAliphaticRings_A', 1)
    prenorm_X = prenorm_X.drop('NumAliphaticRings_B', 1)

    return prenorm_X, Y
X, y = molecular_descriptors(df)


Traceback (most recent call last):
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/ML/Descriptors/MoleculeDescriptors.py", line 87, in CalcDescriptors
    res[i] = fn(mol)
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/Chem/Lipinski.py", line 64, in <lambda>
    NumHeteroatoms = lambda x: rdMolDescriptors.CalcNumHeteroatoms(x)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.CalcNumHeteroatoms(NoneType)
did not match C++ signature:
    CalcNumHeteroatoms(RDKit::ROMol mol)
Traceback (most recent call last):
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/ML/Descriptors/MoleculeDescriptors.py", line 87, in CalcDescriptors
    res[i] = fn(mol)
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/Chem/Descriptors.py", line 83, in <lambda>
    ExactMolWt = lambda *x, **y: _rdMolDescriptors.CalcExactMolWt(*x, **y)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.CalcExactMolWt(NoneType)
did not match C++ signature:
    CalcExactMolWt(RDKit::ROMol mol, bool onlyHeavy=False)
Traceback (most recent call last):
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/ML/Descriptors/MoleculeDescriptors.py", line 87, in CalcDescriptors
    res[i] = fn(mol)
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/Chem/Lipinski.py", line 72, in <lambda>
    NOCount = lambda x: rdMolDescriptors.CalcNumLipinskiHBA(x)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.CalcNumLipinskiHBA(NoneType)
did not match C++ signature:
    CalcNumLipinskiHBA(RDKit::ROMol mol)
Traceback (most recent call last):
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/ML/Descriptors/MoleculeDescriptors.py", line 87, in CalcDescriptors
    res[i] = fn(mol)
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/Chem/Lipinski.py", line 56, in <lambda>
    NumHDonors = lambda x: rdMolDescriptors.CalcNumHBD(x)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.CalcNumHBD(NoneType)
did not match C++ signature:
    CalcNumHBD(RDKit::ROMol mol)
Traceback (most recent call last):
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/ML/Descriptors/MoleculeDescriptors.py", line 87, in CalcDescriptors
    res[i] = fn(mol)
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/Chem/Lipinski.py", line 79, in <lambda>
    RingCount = lambda x: rdMolDescriptors.CalcNumRings(x)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.CalcNumRings(NoneType)
did not match C++ signature:
    CalcNumRings(RDKit::ROMol mol)
Traceback (most recent call last):
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/ML/Descriptors/MoleculeDescriptors.py", line 87, in CalcDescriptors
    res[i] = fn(mol)
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/Chem/Lipinski.py", line 97, in <lambda>
    _fn = lambda x, y=_cfn: y(x)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.CalcNumAromaticRings(NoneType)
did not match C++ signature:
    CalcNumAromaticRings(RDKit::ROMol mol)
Traceback (most recent call last):
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/ML/Descriptors/MoleculeDescriptors.py", line 87, in CalcDescriptors
    res[i] = fn(mol)
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/Chem/Lipinski.py", line 97, in <lambda>
    _fn = lambda x, y=_cfn: y(x)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.CalcNumSaturatedRings(NoneType)
did not match C++ signature:
    CalcNumSaturatedRings(RDKit::ROMol mol)
Traceback (most recent call last):
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/ML/Descriptors/MoleculeDescriptors.py", line 87, in CalcDescriptors
    res[i] = fn(mol)
  File "/Users/SarahsAdventure/miniconda3/lib/python3.5/site-packages/rdkit/Chem/Lipinski.py", line 97, in <lambda>
    _fn = lambda x, y=_cfn: y(x)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.CalcNumAliphaticRings(NoneType)
did not match C++ signature:
    CalcNumAliphaticRings(RDKit::ROMol mol)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-57-16f9291a4afd> in <module>()
     52 
     53     return prenorm_X, Y
---> 54 X, y = molecular_descriptors(df)

<ipython-input-57-16f9291a4afd> in molecular_descriptors(data)
     31     X[:, -1] = data['MOLFRC_A']
     32     for i in range(n):
---> 33         A = Chem.MolFromSmiles(data['A'][i])
     34         B = Chem.MolFromSmiles(data['B'][i])
     35         X[i][:D]    = calc.CalcDescriptors(A)

TypeError: No registered converter was able to produce a C++ rvalue of type std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float

We can prepare our testing and training data set for the machine learning calling using train_test_split, a function called from sklearn module of python.


In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

Followingly, the program will normalize the testing data using the training data set. This will also provide us with the mean value and standard deviation of X.


In [51]:
def normalization(data, means=None, stdevs=None):
    """
    Normalizes the data using the means and standard
    deviations given, calculating them otherwise.
    Returns the means and standard deviations of columns.

    Inputs
    ------
    data : Pandas DataFrame
    means : optional numpy argument of column means
    stdevs : optional numpy argument of column st. devs

    Returns
    ------
    normed : the normalized DataFrame
    means : the numpy row vector of column means
    stdevs : the numpy row vector of column st. devs

    """
    cols = data.columns
    data = data.values

    if (means is None) or (stdevs is None):
        means = np.mean(data, axis=0)
        stdevs = np.std(data, axis=0, ddof=1)
    else:
        means = np.array(means)
        stdevs = np.array(stdevs)

    # handle special case of one row
    if (len(data.shape) == 1) or (data.shape[0] == 1):
        for i in range(len(data)):
            data[i] = (data[i] - means[i]) / stdevs[i]
    else: 
        for i in range(data.shape[1]):
            data[:,i] = (data[:,i] - means[i]*np.ones(data.shape[0])) / stdevs[i]

    normed = pd.DataFrame(data, columns=cols)

    return normed, means, stdevs

X_train, X_mean, X_std = normalization(X_train)
X_test, trash, trash = normalization(X_test, X_mean, X_std)

We coded three models into our program: MLP_regressor, LASSO, and SVR. Each of these models are well documented in sklearn, a library in python. In the actual program, you can use all three models, but for the purpose of this example, we chose mlp_regressor. The ValueError will only raise if you do not use one of the three models. A good example is if you were to change the MODEL used to 'MLP_classifier'.


In [52]:
if (MODEL.lower() == 'mlp_regressor'):
    obj = methods.do_MLP_regressor(X_train, y_train)
elif (MODEL.lower() == 'lasso'):
    obj = methods.do_lasso(X_train, y_train)
elif (MODEL.lower() == 'svr'):
    obj = methods.do_svr(X_train, y_train)
else:
    raise ValueError("Model not supported")


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-52-9c39a8132214> in <module>()
      1 if (MODEL.lower() == 'mlp_regressor'):
----> 2     obj = methods.do_MLP_regressor(X_train, y_train)
      3 elif (MODEL.lower() == 'lasso'):
      4     obj = methods.do_lasso(X_train, y_train)
      5 elif (MODEL.lower() == 'svr'):

NameError: name 'methods' is not defined

After the method is called , it will be saved to an objective. This objective is saved along with the mean and standard deviation and the training set in the directory, named DIRNAME. This step is not as important for the workflow but vital to the success of the graphical user interface.


In [53]:
def save_model(obj, X_mean, X_stdev, X=None, y=None, dirname='default'):
    """
    Save the trained regressor model to the file

    Input
    ------
    obj: model object
    X_mean : mean for each column of training X
    X_stdev : stdev for each column of training X
    X : Predictor matrix
    y : Response vector
    dirname : the directory to save contents

    Returns
    ------
    None
    """
    if (dirname == 'default'):
        timestamp = str(datetime.now())[:19]
        dirname = 'model_'+timestamp.replace(' ', '_')
    else:
        pass
    if not os.path.exists(dirname):
        os.makedirs(dirname)

    filename = dirname + '/model.pkl'
    joblib.dump(obj, filename)

    joblib.dump(X_mean, dirname+'/X_mean.pkl')
    joblib.dump(X_stdev, dirname+'/X_stdev.pkl')

    if (X is not None):
        filename = dirname + '/X_data.pkl'
        joblib.dump(X, filename)
    else:
        pass

    if (y is not None):
        filename = dirname + '/y_data.pkl'
        joblib.dump(y, filename)
    else:
        pass

    return

save_model(obj, X_mean, X_std, X_train, y_train, dirname=DIRNAME)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-53-c049027ad263> in <module>()
     44     return
     45 
---> 46 save_model(obj, X_mean, X_std, X_train, y_train, dirname=DIRNAME)

NameError: name 'obj' is not defined

Lastly, the experimental values will be scatter plotted against the predicted values. We will use the parity_plot to do so. plt.show() function will just allow the plot to show up.


In [54]:
def parity_plot(y_pred, y_act):
    """
    Creates a parity plot

    Input
    -----
    y_pred : predicted values from the model
    y_act : 'true' (actual) values

    Output
    ------
    fig : matplotlib figure

    """

    fig = plt.figure(figsize=FIG_SIZE)
    plt.scatter(y_act, y_pred)
    plt.plot([y_act.min(), y_act.max()], [y_act.min(), y_act.max()],
             lw=4, color='r')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')

    return fig

my_plot = parity_plot(y_train, obj.predict(X_train))
plt.show(my_plot)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-54-298c74b981b0> in <module>()
     23     return fig
     24 
---> 25 my_plot = parity_plot(y_train, obj.predict(X_train))
     26 plt.show(my_plot)

NameError: name 'obj' is not defined

Feel free to look at the other examples that will be more explicit about the functions. I hope you enjoy our package and use it to fit your needs!