This notebook was put together by [wesley beckner](http://wesleybeckner.github.io)
In [1]:
import statistics
import requests
import json
import pickle
import salty
import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as LINA
from scipy import stats
from scipy.stats import uniform as sp_rand
from scipy.stats import mode
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
import os
import sys
import pandas as pd
from collections import OrderedDict
from numpy.random import randint
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
from math import log
from time import sleep
%matplotlib inline
class dev_model():
    def __init__(self, coef_data, data):
        self.Coef_data = coef_data
        self.Data = data
ILThermo uses specific 4-letter tags for the properties in its database. These can be determined by inspecting the web elements on the ILThermo website.
Melting point: prp=lcRG (note this tag in the paper_url string below)
All that needs to be changed to scrape other property data is the 4-letter tag and the directory in which to save the information.
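For example, here is a small sketch of how the tag and output directory could be parameterized (the density tag below is a placeholder, not the real one; look it up in the web inspector):

def search_url(prop_tag):
    #build the ILThermo search URL for a given 4-letter property tag
    return ("http://ilthermo.boulder.nist.gov/ILT2/ilsearch?"
            "cmp=&ncmp=1&year=&auth=&keyw=&prp=" + prop_tag)

#melting point uses lcRG (from the cell below); "XXXX" is a placeholder for another property's tag
PROPERTIES = {"melting_point": ("lcRG", "../salty/data/MELTING_POINT"),
              "density": ("XXXX", "../salty/data/DENSITY")}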
In [12]:
paper_url = "http://ilthermo.boulder.nist.gov/ILT2/ilsearch?"\
            "cmp=&ncmp=1&year=&auth=&keyw=&prp=lcRG"
r = requests.get(paper_url)
header = r.json()['header']
papers = r.json()['res']
i = 1
data_url = 'http://ilthermo.boulder.nist.gov/ILT2/ilset?set={paper_id}'
for paper in papers[:1]:
    r = requests.get(data_url.format(paper_id=paper[0]))
    data = r.json()['data']
    with open("../salty/data/MELTING_POINT/%s.json" % i, "w") as outfile:
        json.dump(r.json(), outfile)
    #then do whatever you want with the data, like writing to a file
    sleep(0.5)  #important step to avoid getting banned by the server
    i += 1
In [13]:
###add JSON files to density.csv
outer_old = pd.DataFrame()
outer_new = pd.DataFrame()
number_of_files = 2266
for i in range(10):  #only looping over the first 10 of the 2266 files here
    with open("../salty/data/DENSITY/%s.json" % str(i+1)) as json_file:
        #grab the data, the data headers (names), and the salt name
        json_full = json.load(json_file)
        json_data = pd.DataFrame(json_full['data'])
        json_datanames = np.array(json_full['dhead'])
        json_data.columns = json_datanames
        json_saltname = pd.DataFrame(json_full['components'])
        print(json_saltname.iloc[0][3])
        inner_old = pd.DataFrame()
        inner_new = pd.DataFrame()
        #loop through the columns of the data; note that some of the
        #json files are missing pressure data.
        for indexer in range(len(json_data.columns)):
            grab = json_data.columns[indexer]
            column_values = json_data[grab]
            my_list = [l[0] for l in column_values]
            dfmy_list = pd.DataFrame(my_list)
            dfmy_list.columns = [json_datanames[indexer][0]]
            inner_new = pd.concat([dfmy_list, inner_old], axis=1)
            inner_old = inner_new
        #add the name of the salt
        inner_old['salt_name'] = json_saltname.iloc[0][3]
        #add to the growing dataframe
        outer_new = pd.concat([inner_old, outer_old], axis=0)
        outer_old = outer_new
print(outer_old)
# pd.DataFrame.to_csv(outer_old, path_or_buf='../salty/data/density.csv', index=False)
Dealing with messy data is commonplace, even with the highly vetted data in ILThermo.
I addressed inaccuracies in the IUPAC naming by first parsing the IUPAC names into two strings (cation and anion) and then hand-checking the strings that had more than two components. I then matched these irregular IUPAC names to their correct SMILES representations. These are stored in the salty database files cationInfo.csv and anionInfo.csv.
I've taken care of most of them, but I've left a few unaddressed; you can see these after executing the cell below.
In [14]:
###a hacky hack solution to cleaning raw ILThermo data
# df = pd.read_csv("../salty/data/viscosity_full.csv")
df = pd.read_csv('../salty/data/density.csv', delimiter=',')
salts = pd.DataFrame(df["salt_name"])
salts = salts.rename(columns={"salt_name": "salts"})
anions = []
cations = []
missed = 0
for i in range(df.shape[0]):
    if len(salts['salts'].iloc[i].split()) == 2:
        cations.append(salts['salts'].iloc[i].split()[0])
        anions.append(salts['salts'].iloc[i].split()[1])
    elif len(salts['salts'].iloc[i].split()) == 3:
        #two word cation
        if "tris(2-hydroxyethyl) methylammonium" in salts['salts'].iloc[i]:
            first = salts['salts'].iloc[i].split()[0]
            second = salts['salts'].iloc[i].split()[1]
            anions.append(salts['salts'].iloc[i].split()[2])
            cations.append(first + ' ' + second)
        #these strings have two word anions
        elif("sulfate" in salts['salts'].iloc[i] or
             "phosphate" in salts['salts'].iloc[i] or
             "phosphonate" in salts['salts'].iloc[i] or
             "carbonate" in salts['salts'].iloc[i]):
            first = salts['salts'].iloc[i].split()[1]
            second = salts['salts'].iloc[i].split()[2]
            cations.append(salts['salts'].iloc[i].split()[0])
            anions.append(first + ' ' + second)
        elif("bis(trifluoromethylsulfonyl)imide" in salts['salts'].iloc[i]):
            #this string contains 2 word cations
            first = salts['salts'].iloc[i].split()[0]
            second = salts['salts'].iloc[i].split()[1]
            third = salts['salts'].iloc[i].split()[2]
            cations.append(first + ' ' + second)
            anions.append(third)
        else:
            print(salts['salts'].iloc[i])
            missed += 1
    elif len(salts['salts'].iloc[i].split()) == 4:
        #this particular string block contains (1:1) at end of name
        if("1,1,2,3,3,3-hexafluoro-1-propanesulfonate" in salts['salts'].iloc[i]):
            first = salts['salts'].iloc[i].split()[0]
            second = salts['salts'].iloc[i].split()[1]
            cations.append(first + ' ' + second)
            anions.append(salts['salts'].iloc[i].split()[2])
        else:
            #and two word anion
            first = salts['salts'].iloc[i].split()[1]
            second = salts['salts'].iloc[i].split()[2]
            anions.append(first + ' ' + second)
            cations.append(salts['salts'].iloc[i].split()[0])
    elif("2-aminoethanol-2-hydroxypropanoate" in salts['salts'].iloc[i]):
        #one of the ilthermo salts is missing a space between cation/anion
        anions.append("2-hydroxypropanoate")
        cations.append("2-aminoethanol")
    elif len(salts['salts'].iloc[i].split()) == 5:
        if("bis[(trifluoromethyl)sulfonyl]imide" in salts['salts'].iloc[i]):
            anions.append("bis(trifluoromethylsulfonyl)imide")
            first = salts['salts'].iloc[i].split()[0]
            second = salts['salts'].iloc[i].split()[1]
            third = salts['salts'].iloc[i].split()[2]
            fourth = salts['salts'].iloc[i].split()[3]
            cations.append(first + ' ' + second + ' ' + third + ' ' + fourth)
        if("trifluoro(perfluoropropyl)borate" in salts['salts'].iloc[i]):
            anions.append("trifluoro(perfluoropropyl)borate")
            cations.append("N,N,N-triethyl-2-methoxyethan-1-aminium")
        else:
            print(salts['salts'].iloc[i])
            missed += 1
anions = pd.DataFrame(anions, columns=["name-anion"])
cations = pd.DataFrame(cations, columns=["name-cation"])
salts = pd.read_csv('../salty/data/salts_with_smiles.csv', delimiter=',')
new_df = pd.concat([salts["name-cation"], salts["name-anion"], salts["Temperature, K"],
                    salts["Pressure, kPa"], salts["Specific density, kg/m<SUP>3</SUP>"]],
                   axis=1)
print(missed)
After appending SMILES to the dataframe, we're ready to add RDKit descriptors. Because the descriptors are specific to a given cation and anion, and there are many repeats of these within the data (~10,000 datapoints spanning only ~300 cations and ~150 anions), it is much faster to merge pre-computed descriptor dataframes onto our growing ILThermo dataframe with pandas than to recompute descriptors row by row.
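The descriptor tables themselves only need to be built once per unique ion. As a rough sketch of how such a table could be generated with RDKit (the file path and the "name"/"smiles" column names here are assumptions, not necessarily what salty actually uses):

from rdkit import Chem
from rdkit.Chem import Descriptors

ions = pd.read_csv("../salty/data/cationInfo.csv")  #assumed columns: name, smiles
rows = []
for _, ion in ions.iterrows():
    mol = Chem.MolFromSmiles(ion["smiles"])
    if mol is None:
        continue  #skip any SMILES RDKit can't parse
    #Descriptors.descList is a list of (descriptor name, function) pairs
    desc = {name: func(mol) for name, func in Descriptors.descList}
    desc["name"] = ion["name"]
    rows.append(desc)
pd.DataFrame(rows).to_csv("../salty/data/cationDescriptors.csv", index=False)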
In [15]:
cationDescriptors = salty.load_data("cationDescriptors.csv")
cationDescriptors.columns = [str(col) + '-cation' for col in cationDescriptors.columns]
anionDescriptors = salty.load_data("anionDescriptors.csv")
anionDescriptors.columns = [str(col) + '-anion' for col in anionDescriptors.columns]
In [16]:
# new_df = pd.concat([cations, anions, df["Temperature, K"], df["Pressure, kPa"],\
# df["Specific density, kg/m<SUP>3</SUP>"]], axis=1)
new_df = pd.merge(cationDescriptors, new_df, on="name-cation", how="right")
new_df = pd.merge(anionDescriptors, new_df, on="name-anion", how="right")
new_df.dropna(inplace=True)  #remove entries not in the smiles database
In [17]:
pd.DataFrame.to_csv(new_df, path_or_buf='../salty/data/density_premodel.csv', index=False)
I like to shrink my feature space before feeding it into a neural network.
This is useful for two reasons: it combats overfitting in the neural network, and it speeds up our genetic search algorithm by reducing the number of computations needed in the fitness test--more on this later.
Scikit-learn has a random search algorithm that is easy to implement and quite useful. I've personally used the bootstrap, cross-validation, and shuffle-split to parameterize LASSO on ILThermo data, and they all agree pretty well with each other.
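As a quick illustration (not the workflow used below), a handful of candidate alphas could also be compared directly with cross_val_score, which is already imported above; the alpha values are arbitrary, and X_train/Y_train are the arrays prepared a couple of cells down:

#minimal sketch: 5-fold CV R^2 for a few illustrative LASSO alphas
for alpha in [0.001, 0.01, 0.1]:
    lasso = Lasso(alpha=alpha, max_iter=int(1e5), tol=1e-8)
    scores = cross_val_score(lasso, X_train, Y_train, cv=5)
    print("alpha=%.3f  mean R^2=%.3f (+/- %.3f)" % (alpha, scores.mean(), scores.std()))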
In [2]:
property_model = "density"
df = pd.read_csv('../salty/data/%s_premodel.csv' % property_model, index_col=None)
metaDf = df.select_dtypes(include=["object"])
dataDf = df.select_dtypes(include=[np.number])
property_scale = dataDf["Specific density, kg/m<SUP>3</SUP>"].apply(lambda x: log(float(x)))
cols = dataDf.columns.tolist()
instance = StandardScaler()
data = pd.DataFrame(instance.fit_transform(dataDf.iloc[:,:-1]), columns=cols[:-1])
df = pd.concat([data, property_scale, metaDf], axis=1)
mean_std_of_coeffs = pd.DataFrame([instance.mean_,instance.scale_], columns=cols[:-1])
density_devmodel = dev_model(mean_std_of_coeffs, df)
pickle_out = open("../salty/data/%s_devmodel.pkl" % property_model, "wb")
pickle.dump(density_devmodel, pickle_out)
pickle_out.close()
At this point I introduce a new class of objects called dev_model.
A dev_model is a pickle-able object. Its .Data attribute contains the scaled/centered feature data and the log of the property data, along with the original IUPAC names and SMILES. This makes it easy to consistently unpickle a dev_model and start using it in a scikit-learn algorithm without making changes to the dataframe. Its .Coef_data attribute contains the mean and standard deviation of each feature so that structure candidates in our genetic algorithm can be scaled and centered appropriately.
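For example (a minimal sketch, with a made-up candidate vector standing in for a GA-proposed structure), the stored means and standard deviations can be applied like this once the dev_model is unpickled as in the next cell:

#row 0 of Coef_data holds the feature means, row 1 the standard deviations
means = devmodel.Coef_data.iloc[0].values
stds = devmodel.Coef_data.iloc[1].values
candidate = np.random.rand(len(means))  #stand-in for a GA-generated descriptor vector
candidate_scaled = (candidate - means) / stds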
In [3]:
pickle_in = open("../salty/data/%s_devmodel.pkl" % property_model, "rb")
devmodel = pickle.load(pickle_in)
df = devmodel.Data
metaDf = df.select_dtypes(include=["object"])
dataDf = df.select_dtypes(include=[np.number])
X_train = dataDf.values[:,:-1]
Y_train = dataDf.values[:,-1]
#metaDf["Specific density, kg/m<SUP>3</SUP>"].str.split().apply(lambda x: log(float(x[0])))
And now we can parameterize our LASSO model:
In [21]:
#param_grid = {"alpha": sp_rand(0,0.1), "hidden_layer_sizes" : [randint(10)]}
# model = MLPRegressor(max_iter=10000,tol=1e-8)
param_grid = {"alpha": sp_rand(0.001,0.1)}
model = Lasso(max_iter=1e5,tol=1e-8)
grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs=-1,\
n_iter=15)
grid_result = grid.fit(X_train, Y_train)
print(grid_result.best_estimator_)
It can be incredibly useful to look at how our coefficients respond to changes in the underlying training data (e.g., does it look like one of our features is being selected because of a single type of training datum, category of salt, etc.?).
This can be assessed using the bootstrap.
In [50]:
iterations = 2
averages = np.zeros(iterations)
variances = np.zeros(iterations)
test_MSE_array = []
property_model = "density"
pickle_in = open("../salty/data/%s_devmodel.pkl" % property_model, "rb")
devmodel = pickle.load(pickle_in)
df = devmodel.Data
df = df.sample(frac=1)
# df["Viscosity, Pas"] = df["Viscosity, Pas"].str.split().apply(lambda x: log(float(x[0])))
metadf = df.select_dtypes(include=["object"])
datadf = df.select_dtypes(include=[np.number])
data = np.array(datadf)
n = data.shape[0]
d = data.shape[1]
d -= 1
n_train = int(n*0.8)  #set fraction of data to be for training
n_test = n - n_train
deslist = datadf.columns
score = np.zeros(len(datadf.columns))
feature_coefficients = np.zeros((len(datadf.columns), iterations))
test_MSE_array = []
model_intercept_array = []
for i in range(iterations):
    data = np.random.permutation(data)
    X_train = np.zeros((n_train, d))  #prepare train/test arrays
    X_test = np.zeros((n_test, d))
    Y_train = np.zeros((n_train))
    Y_test = np.zeros((n_test))
    ###sample from the training set with replacement
    for k in range(n_train):
        x = randint(0, n_train)
        X_train[k] = data[x, :-1]
        Y_train[k] = float(data[x, -1])
    n = data.shape[0]
    ###sample from the test set with replacement
    for k in range(n_test):
        x = randint(n_train, n)
        X_test[k] = data[x, :-1]
        Y_test[k] = float(data[x, -1])
    ###train the lasso model
    model = Lasso(alpha=0.007115873059701538, tol=1e-10, max_iter=4000)
    model.fit(X_train, Y_train)
    ###check which features are selected
    p = 0
    avg_size = []
    for a in range(len(data[0])-1):
        if model.coef_[a] != 0:
            score[a] = score[a] + 1
            feature_coefficients[a, i] = model.coef_[a]  ###store the model coefs
            p += 1
    avg_size.append(p)
    ###calculate the test set MSE
    Y_hat = model.predict(X_test)
    n = len(Y_test)
    test_MSE = np.sum((Y_test - Y_hat)**2)/n
    test_MSE_array.append(test_MSE)
    ###grab the intercept
    model_intercept_array.append(model.intercept_)
print("{}\t{}".format("average feature length:", np.average(avg_size)))
print("{}\t{}".format("average y-intercept:", "%.2f" % np.average(model_intercept_array)))
print("{}\t{}".format("average test MSE:", "%.2E" % np.average(test_MSE_array)))
print("{}\t{}".format("average MSE std dev:", "%.2E" % np.std(test_MSE_array)))
select_score = []
select_deslist = []
feature_coefficient_averages = []
feature_coefficient_variance = []
feature_coefficients_all = []
for a in range(len(deslist)):
    if score[a] != 0:
        select_score.append(score[a])
        select_deslist.append(deslist[a])
        feature_coefficient_averages.append(np.average(feature_coefficients[a, :]))
        feature_coefficient_variance.append(np.std(feature_coefficients[a, :]))
        feature_coefficients_all.append(feature_coefficients[a, :])
Executing the following cell will overwrite saved files that were generated from many more bootstrap iterations.
In [10]:
#save the selected feature coeffs and their scores
df = pd.DataFrame(select_score, select_deslist)
df.to_pickle("../salty/data/bootstrap_list_scores.pkl")
#save the selected feature coefficients
df = pd.DataFrame(data=np.array(feature_coefficients_all).T, columns=select_deslist)
df = df.T.sort_values(by=1, ascending=False)
df.to_pickle("../salty/data/bootstrap_coefficients.pkl")
#save all the bootstrap data to create a box & whiskers plot
df = pd.DataFrame(data=[feature_coefficient_averages,\
feature_coefficient_variance], columns=select_deslist)
df = df.T.sort_values(by=1, ascending=False)
df.to_pickle("../salty/data/bootstrap_coefficient_estimates.pkl")
#save the coefficients sorted by their abs() values
df = pd.DataFrame(select_score, select_deslist)
df = df.sort_values(by=0, ascending=False).iloc[:]
cols = df.T.columns.tolist()
df = pd.read_pickle('../salty/data/bootstrap_coefficient_estimates.pkl')
df = df.loc[cols]
med = df.T.median()
med.sort_values()
newdf = df.T[med.index]
newdf.to_pickle('../salty/data/bootstrap_coefficient_estimates_top_sorted.pkl')
df = pd.DataFrame(select_score, select_deslist)
df.sort_values(by=0, ascending=False)
Out[10]:
In [55]:
model = pd.read_pickle('../salty/data/bootstrap_coefficient_estimates_top_sorted.pkl')
model2 = model.abs()
df = model2.T.sort_values(by=0, ascending=False).iloc[:]
cols = df.T.columns.tolist()
df = pd.read_pickle('../salty/data/bootstrap_coefficients.pkl')
df = df.loc[cols]
med = df.T.median()
med.sort_values()
newdf = df.T[med.index]
newdf = newdf.replace(0, np.nan)
#tableau20 is assumed to be a list of RGB color tuples (the standard Tableau palette) defined elsewhere
props = dict(boxes=tableau20[0], whiskers=tableau20[8], medians=tableau20[4],
             caps=tableau20[6])
newdf.abs().plot(kind='box', figsize=(5,12), subplots=False, fontsize=18,\
showmeans=True, logy=False, sharey=True, sharex=True, whis='range', showfliers=False,\
color=props, vert=False)
plt.xticks(np.arange(0,0.1,0.02))
print(df.shape)
# plt.savefig(filename='paper_images/Box_Plot_All_Salts.eps', bbox_inches='tight', format='eps',\
# transparent=True)
It can also be useful to evaluate the t-scores for the coefficients.
In [51]:
df = pd.read_pickle('../salty/data/bootstrap_coefficients.pkl')
med = df.T.median()
med.sort_values()
newdf = df.T[med.index]
df = newdf
for index, string in enumerate(newdf.columns):
    print(string)
    #get mean, std, N, and SEM from our sample
    samplemean = np.mean(df[string])
    print('sample mean', samplemean)
    samplestd = np.std(df[string], ddof=1)
    print('sample std', samplestd)
    sampleN = 1000  #number of bootstrap iterations behind the saved coefficients
    samplesem = stats.sem(df[string])
    print('sample SEM', samplesem)
    #t, the significance level of our sample mean, is defined as
    #(samplemean - 0) / standard error of the sample mean
    #in other words, the number of standard errors
    #the coefficient value is from 0
    #the t value by itself does not tell us very much
    t = samplemean/samplesem
    print('t', t)
    #the p-value tells us the probability of achieving a value
    #at least as extreme as the one for our dataset if the null
    #hypothesis were true
    p = stats.t.sf(np.abs(t), sampleN-1)*2  #multiply by two for a two-sided test
    print('p', p)
    #test rejection of the null hypothesis at a
    #significance level of 0.05
    alpha = 0.05
    if p < alpha:
        print('reject null hypothesis')
    else:
        print('fail to reject null hypothesis')
A last check that I find very useful is progressively dropping features from the LASSO model (based on their average coefficients--see the box-and-whisker plot above). At some point the inclusion of additional features should stop improving the performance of the model. In this case we see the improvement fall off at about 15-20 features.
In [16]:
mse_scores = []
for i in range(df.shape[0]):
    model = pd.read_pickle('../salty/data/bootstrap_coefficient_estimates_top_sorted.pkl')
    model2 = model.abs()
    df = model2.T.sort_values(by=0, ascending=False).iloc[:i]
    cols = df.T.columns.tolist()
    model = model[cols]
    cols = model.columns.tolist()
    cols.append("Specific density, kg/m<SUP>3</SUP>")
    property_model = "density"
    pickle_in = open("../salty/data/%s_devmodel.pkl" % property_model, "rb")
    devmodel = pickle.load(pickle_in)
    df = devmodel.Data
    df = df.sample(frac=1)
    metadf = df.select_dtypes(include=["object"])
    datadf = df.select_dtypes(include=[np.number])
    df = datadf.T.loc[cols]
    data = np.array(df.T)
    n = data.shape[0]
    d = data.shape[1]
    d -= 1
    n_train = 0  #int(n*0.8) #set fraction of data to be for training
    n_test = n - n_train
    X_train = np.zeros((n_train, d))  #prepare train/test arrays
    X_test = np.zeros((n_test, d))
    Y_train = np.zeros((n_train))
    Y_test = np.zeros((n_test))
    X_train[:] = data[:n_train, :-1]  #fill arrays according to train/test split
    Y_train[:] = (data[:n_train, -1].astype(float))
    X_test[:] = data[n_train:, :-1]
    Y_test[:] = (data[n_train:, -1].astype(float))
    Y_hat = np.dot(X_test, model.loc[0]) + np.mean(Y_test[:] - np.dot(X_test[:], model.loc[0]))
    n = len(Y_test)
    test_MSE = np.sum((Y_test - Y_hat)**2)/n
    mse_scores.append(test_MSE)
In [17]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(5,5), dpi=300)
    ax = fig.add_subplot(111)
    ax.plot(mse_scores)
    ax.grid(False)
    # plt.xticks(np.arange(0,31,10))
    # plt.yticks(np.arange(0,1.7,.4))
In [56]:
####Create dataset according to LASSO selected features
df = pd.read_pickle("../salty/data/bootstrap_list_scores.pkl")
df = df.sort_values(by=0, ascending=False)
avg_selected_features=20
df = df.iloc[:avg_selected_features]
# coeffs = mean_std_of_coeffs[cols]
property_model = "density"
pickle_in = open("../salty/data/%s_devmodel.pkl" % property_model, "rb")
devmodel = pickle.load(pickle_in)
rawdf = devmodel.Data
rawdf = rawdf.sample(frac=1)
metadf = rawdf.select_dtypes(include=["object"])
datadf = rawdf.select_dtypes(include=[np.number])
to_add=[]
for i in range(len(df)):
    to_add.append(df.index[i])
cols = [col for col in datadf.columns if col in to_add]
cols.append("Specific density, kg/m<SUP>3</SUP>")
df = datadf.T.loc[cols]
data = np.array(df.T)  #use only the LASSO-selected feature columns (plus the target)
n = data.shape[0]
d = data.shape[1]
d -= 1
n_train = int(n*0.8) #set fraction of data to be for training
n_test = n - n_train
X_train = np.zeros((n_train,d)) #prepare train/test arrays
X_test = np.zeros((n_test,d))
Y_train = np.zeros((n_train))
Y_test = np.zeros((n_test))
X_train[:] = data[:n_train,:-1] #fill arrays according to train/test split
Y_train[:] = (data[:n_train,-1].astype(float))
X_test[:] = data[n_train:,:-1]
Y_test[:] = (data[n_train:,-1].astype(float))
I usually optimize my MLP regressor hyperparameters for any new type of dataset. This takes a long time to run, so I use the Hyak supercomputer.
In [41]:
###Randomized Search NN Characterization
param_grid = {"activation": ["identity", "logistic", "tanh", "relu"],\
"solver": ["lbfgs", "sgd", "adam"], "alpha": sp_rand(),\
"learning_rate" :["constant", "invscaling", "adaptive"],\
"hidden_layer_sizes": [randint(100)]}
model = MLPRegressor(max_iter=400,tol=1e-8)
grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid,\
n_jobs=-1, n_iter=10)
grid_result = grid.fit(X_train, Y_train)
print(grid_result.best_estimator_)
In [53]:
model = MLPRegressor(activation='logistic', alpha=0.92078, batch_size='auto',
                     beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
                     hidden_layer_sizes=75, learning_rate='constant',
                     learning_rate_init=0.001, max_iter=int(1e8), momentum=0.9,
                     nesterovs_momentum=True, power_t=0.5, random_state=None,
                     shuffle=True, solver='lbfgs', tol=1e-08, validation_fraction=0.1,
                     verbose=False, warm_start=False)
model.fit(X_train, Y_train)
Out[53]:
In [54]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(6,6), dpi=300)
    ax = fig.add_subplot(111)
    ax.plot(np.exp(Y_test), np.exp(model.predict(X_test)),
            marker=".", linestyle="")
In [37]:
#note: prod_model and coeffs are not defined in this notebook; prod_model is presumably a
#container analogous to dev_model, and coeffs the mean/std dataframe of the selected features
prodmodel = prod_model(coeffs, model)
pickle_out = open("../salty/data/%s_prodmodel.pkl" % property_model, "wb")
pickle.dump(prodmodel, pickle_out)
pickle_out.close()