A data driven, Quantum Mechanical understanding of Chemistry

AKA Trying to make sense of 134k quantum calculations

by Benjamin Sanchez Lengeling (bsanchezlengeling@g.harvard.edu)

Load custom CSS styling


In [10]:
from IPython.core.display import HTML
import os
def css_styling():
    """Load default custom.css file from ipython profile"""
    base = os.getcwd()
    styles = "<style>\n%s\n</style>" % (open(os.path.join(base,'custom.css'),'r').read())
    return HTML(styles)
css_styling()


Out[10]:

Molecular party!

0 Intro

0.1 Motivation

This CS 209 project is an exploration of Chemistry using data science techniques. I want to use the methods learned in the class to apply them in my own context, which is Quantum Chemistry.

Quantum chemistry is a branch of theoretical chemistry which applies quantum mechanics to address problems in chemistry.

In Quantum Chemistry you can calculate a molecule on a computer, simulating and solving the Schrödinger equation to obtain molecular properties.

Some pretty awesome examples of Quantum Chemistry applications are:

  • Designing and finding better molecules and dyes, to construct more efficient solar cells
  • Finding out why things have the color they do with no previous knowledge of the world, for example why is a tomato red? You can simulate a photon (light) hitting the molecules on the frontier of a tomato and seeing what type of wavelength you get back.
  • Studying possible new materials like graphene, or carbon nanotubes.
  • Designing better of newer drugs by modeling the molecules when they come in "contact" with other molecules in our bodies.
  • Finding alternative reactions for common industrial products, typically you want these reactions to be more ecological (better byproducts), cheaper or more efficient.
  • Studying the composition of molecules and reactions that can occur easily in space, to better understand all the material world surrounding us right now, including life.

Due to advances in computing now we can calculate a large number of molecules in relatively small time frame, creating huge masses of data.

0.1 Motivation

The pharmaceutical industry has established for a long time now the practice of using molecular calculations to drive drug discovery. This has olny been done with classical mechanics and with very limited quantum treatment of molecules. There is now a trend of trying to adapt the tools and strategies of this industry to material science.

So a current open question is: How can we incorporate quantum effects/information into chemoinformatics?

Very little work has been done on this aspect, in part due to the computational constraint of the treatment of quantum effects and also due to the complexity of this information.

I want to put my newly earned skills to the test, on data I will most probably be working with during my graduate studies.

One aspect of this work is the research, there are two main projects/works that have tried to utilize data science in Quantum Chemistry:

  • The Harvard Clean Energy project, the largest database of quantum calculation in the world, all seeking to discover the new generation of plastic solar cell materials. http://cleanenergy.molecularspace.org/ , http://www.molecularspace.org/.
  • A recent 2014 Scientific Data article "Quantum chemistry structures and properties of 134 kilo molecules" (http://www.nature.com/articles/sdata201422). One of the first projects to grant access to a large database of quantum calculations. The article sets-up the rational behind the creation of the dataset, they plan to use machine learning to predict molecular properties.

The other aspect relates to accesibility in science, in using great tools as ipython notebooks, data repositories, open source libraries to accelerate and make more open the process of doing science. In Chemistry two side projects which are of great interest are:

  • OpenBabel (http://openbabel.org): library designed to support molecular modeling, chemistry, and many related areas, including interconversion of file formats and data
  • RDkit (http://www.rdkit.org/): library for Cheminformatics and Machine Learning.

0.3 Goals

  1. Cleanup the dataset and present it in a more user and sharable format (pandas dataframe, pickle, excel, json database).
  2. Visualize the dataset, explore all parameters of the data and present global trends.
  3. Find patterns within the data, how do these patterns reflect with chemistry concepts?

1 Data Wrangling

1.0 The Dataset

Dataset is hosted on http://dx.doi.org/10.6084/m9.figshare.978904, it is a few hundred MB's.

I also have pandas dataframe pickle in my dropbox:

https://www.dropbox.com/sh/6iysi4w0xmmevlt/AABrTLUFZJvrJPDeDCzuhxJYa?dl=0.

They reported computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF atoms. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules.

All calculations have been carried out at the B3LYP/6-31G(2df,p) level of quantum chemistry. (Decent, but not the best)

Results of each calculation has been embeded within a xyz file in a very specific way. The data provided by the article is not very user friendly ( see: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12803), a better format would have been a json or excel database.

We will parse out the information acoording to the readme.txt file and create for each molecule a row within a pandas dataframe. This dataframe can thn easily be exported to a variety of popular data formats.

1.1 Libraries


In [1]:
#Libraries
import numpy as np
import scipy as sp
import pandas as pd
import sklearn
import seaborn as sns
from matplotlib import pyplot as plt
import random
#chemo-informatics
import imolecule
import openbabel
import pybel
import rdkit
# visual stuff
%matplotlib inline
from IPython.core import display
from PIL import Image
#rdkit
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import rdkit.rdBase
from rdkit.Chem.MACCSkeys import GenMACCSKeys
from rdkit import DataStructs
from rdkit.DataStructs import BitVectToText
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit import DataStructs
from rdkit.Chem import MCS as MCS
from rdkit.Chem import Descriptors as Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem import PandasTools as PandasTools
from rdkit.Chem import Descriptors as Descriptors
#awesome plot options
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12.0, 6.0)
# utility
from io import BytesIO


#functions for ipython

def display_pil_image(im):
    """Displayhook function for PIL Images, rendered as PNG."""

    b = BytesIO()
    im.save(b, format='png')
    data = b.getvalue()

    ip_img = display.Image(data=data, format='png', embed=True)
    return ip_img._repr_png_()

# register display func with PNG formatter:
png_formatter = get_ipython().display_formatter.formatters['image/png']
dpi = png_formatter.for_type(Image.Image, display_pil_image)

#functions for plotting
def sns_cycle_palette():
    """Cycle seaborn's color palete to have a variety of colors"""

    pal = sns.color_palette()
    pal=pal[1:]+[pal[0]]
    sns.set_palette(pal)
    return

1.2 Read XYZ files and load to DataFrame

We create a simple function to read an xyz file and return a row of data to be appended to a dataframe. The we iterate over all the xyz files, this took a few hours. We save the dataframe periodically unto a python pickle.


In [7]:
def readFormatedXYZ(xyzfile):
    fileData=[]
    with(open(xyzFile,'r')) as afile:
        numAtoms=int(afile.next())
        #1 num atoms
        fileData.append(numAtoms)
        datline=afile.next().split()
        # 2 molecular index within the database
        fileData.append(int(datline[1]))
        # 3-5 rotational constants
        fileData.append(float(datline[2]))
        fileData.append(float(datline[3]))
        fileData.append(float(datline[4]))
        # 6 dipole moment
        fileData.append(float(datline[5]))
        # 7 polarizability
        fileData.append(float(datline[6]))
        # 8-10 orbital energies (lumo,homo)
        fileData.append(float(datline[7]))
        fileData.append(float(datline[8]))
        fileData.append(float(datline[9]))
        # 11 spatial extent
        fileData.append(float(datline[10]))
        # 12 zpe
        fileData.append(float(datline[11]))
        # 13-14 Internal Energy
        fileData.append(float(datline[12]))
        fileData.append(float(datline[13]))
        # 15 Enthalpy Energy
        fileData.append(float(datline[14]))
        # 16 Free Energy
        fileData.append(float(datline[15]))
        # 17 Heat Capacity
        fileData.append(float(datline[16]))
        
        #read geometry
        atomCoords=np.zeros((numAtoms,3))
        atomList=[]
        for i in range(numAtoms):
            datline=afile.next().split()
            atomList.append(datline[0])
            atomCoords[i]=np.array([float(d) for d in datline[1:4]])
        fileData.append(atomList)
        fileData.append(atomCoords)
        # next frequencies
        fileData.append(np.array([float(f) for f in afile.next().split()]))
        #relaxed geometry smile
        fileData.append(afile.next().split()[1])
        #relaxed geometry InChI
        fileData.append( afile.next().split()[1])
        
    return fileData

In [61]:
#===== Initial Variables ===
molFolder = 'mols/'
rmin = 0
rmax = 100
pandasFile = "mols.pkl"
#===== Check that the pickle file exists ===
if os.path.isfile(pandasFile):
    masterdf = pd.read_pickle(pandasFile)
    rmin = masterdf.shape[0] + 1
else:
    datCols = ['numAtoms', 'dbindex', 'A', 'B', 'C', 'dipole', 'polar',
               'homo', 'lumo', 'gap', 'spatialSize', 'zpe',
               'U0', 'U', 'H', 'G', 'Cv',
               'atomList', 'atomCoords', 'freqs',
               'SMILES', 'InChI']
    masterdf = pd.DataFrame(columns=datCols)
    rmin = 0
    
masterdf=pd.DataFrame( columns=datCols)
#===== Iterate over file
for indx in indexes:
    xyzFile="mols/dsgdb9nsd_%06d.xyz"%(indx+1)
    masterdf.loc[indx]=readFormatedXYZ(xyzFile)
    if (indx % 10000 == 0):
        print(indx)
        masterdf.to_pickle(pandasFile)
#====== Save and
masterdf.to_pickle(pandasFile)

In [6]:
#pickle it!
pandasFile = "original.pkl"
masterdf = pd.read_pickle(pandasFile)
print(masterdf.columns)
masterdf.head()


Index([u'numAtoms', u'dbindex', u'A', u'B', u'C', u'dipole', u'polar', u'homo', u'lumo', u'gap', u'spatialSize', u'zpe', u'U0', u'U', u'H', u'G', u'Cv', u'atomList', u'atomCoords', u'freqs', u'SMILES', u'InChI'], dtype='object')
Out[6]:
numAtoms dbindex A B C dipole polar homo lumo gap ... U0 U H G Cv atomList atomCoords freqs SMILES InChI
0 5 1 157.71180 157.709970 157.706990 0.0000 13.21 -0.3877 0.1171 0.5048 ... -40.478930 -40.476062 -40.475117 -40.498597 6.469 [C, H, H, H, H] [[-0.0126981359, 1.0858041578, 0.0080009958], ... [1341.307, 1341.3284, 1341.365, 1562.6731, 156... C InChI=1S/CH4/h1H4
1 4 2 293.60975 293.541110 191.393970 1.6256 9.46 -0.2570 0.0829 0.3399 ... -56.525887 -56.523026 -56.522082 -56.544961 6.316 [N, H, H, H] [[-0.0404260543, 1.0241077531, 0.0625637998], ... [1103.8733, 1684.1158, 1684.3072, 3458.7145, 3... N InChI=1S/H3N/h1H3
2 3 3 799.58812 437.903860 282.945450 1.8511 6.31 -0.2928 0.0687 0.3615 ... -76.404702 -76.401867 -76.400922 -76.422349 6.002 [O, H, H] [[-0.0343604951, 0.9775395708, 0.0076015923], ... [1671.4222, 3803.6305, 3907.698] O InChI=1S/H2O/h1H2
3 4 4 0.00000 35.610036 35.610036 0.0000 16.28 -0.2845 0.0506 0.3351 ... -77.308427 -77.305527 -77.304583 -77.327429 8.574 [C, C, H, H] [[0.5995394918, 0.0, 1.0], [-0.5995394918, 0.0... [549.7648, 549.7648, 795.2713, 795.2713, 2078.... C#C InChI=1S/C2H2/c1-2/h1-2H
4 3 5 0.00000 44.593883 44.593883 2.8937 12.99 -0.3604 0.0191 0.3796 ... -93.411888 -93.409370 -93.408425 -93.431246 6.278 [C, N, H] [[-0.0133239314, 1.1324657151, 0.0082758861], ... [799.0101, 799.0101, 2198.4393, 3490.3686] C#N InChI=1S/CHN/c1-2/h1H

5 rows × 22 columns

1.3 Remove uncharacterized molecules

Within the article they note that 3k molecules did not obtain converged geoemtries consistant with the intial guess, i.e. radically different geometries. We filter these out.


In [76]:
#preprocessing the dataframes
df=masterdf.copy()
#Filter out the uncharacterized molecules
badmols=pd.read_csv('uncharacterized.csv')
df=df[~df['dbindex'].isin(badmols['Index'])]
print("Bad molecules: %s"%(str(badmols.shape)))
print("Before %s -> After %s "%(masterdf.shape[0],df.shape[0]))


Bad molecules: (3054, 5)
Before 133884 -> After 130830 

1.4 Add other computable descriptors

We add other descriptors to the dataframe. We create a row operating function and use pd.apply().

1.4.1 Molecular Weight (molWeight)


In [77]:
#load atomic data
atomdf=pd.read_csv('atomref.csv')
weightDict={ row['Element']:row['Mass'] for indx,row in atomdf.iterrows()}
print(weightDict)
#molecular weight function
def molecularWeight( row ):
    weightDict={'H': 1.00794, 'C': 12.0107, 'F': 18.998403, 'O': 15.9994, 'N': 14.0067}
    return np.sum([ weightDict[atom] for atom in row['atomList'] ])
#apply function to each row
df['molWeight']=df.apply(molecularWeight, axis=1)
atomdf.head()


{'H': 1.00794, 'C': 12.0107, 'F': 18.998403, 'O': 15.9994, 'N': 14.0067}
Out[77]:
Element Mass ZPVE_Hartree U (0 K)_Hartree U (298.15 K)_Hartree H (298.15 K)_Hartree G (298.15 K)_Hartree CV_Hartree_Cal/(Mol Kelvin)
0 H 1.007940 0 -0.500273 -0.498857 -0.497912 -0.510927 2.981
1 C 12.010700 0 -37.846772 -37.845355 -37.844411 -37.861317 2.981
2 N 14.006700 0 -54.583861 -54.582445 -54.581501 -54.598897 2.981
3 O 15.999400 0 -75.064579 -75.063163 -75.062219 -75.079532 2.981
4 F 18.998403 0 -99.718730 -99.717314 -99.716370 -99.733544 2.981

1.4.2 Standard Enthalphy of Atomization (Hatom)

$$ H_a=\Delta_{at}H^{\theta}= \sum_{atom} H^{\theta}_{atom} - H^{\theta}_{mol} $$

Can be understood as the energy required to break the molecule completely into its component atoms.


In [78]:
#get dict to put into function
enthalphyDict={ row['Element']:row['H (298.15 K)_Hartree'] for indx,row in atomdf.iterrows()}
print(enthalphyDict)
# define function
def enthalphyAtomization( row ):
    enthalphyDict={'H': -0.49791199999999997, 'C': -37.844411, 'F': -99.71637, 'O': -75.062219, 'N': -54.581501}
    return np.sum([ enthalphyDict[atom] for atom in row['atomList'] ])-row['H']

#apply function to each row
df['Hatom']=df.apply(enthalphyAtomization, axis=1)

df.head()


{'H': -0.49791199999999997, 'C': -37.844411, 'F': -99.71637, 'O': -75.062219, 'N': -54.581501}
Out[78]:
numAtoms dbindex A B C dipole polar homo lumo gap ... H G Cv atomList atomCoords freqs SMILES InChI molWeight Hatom
0 5 1 157.71180 157.709970 157.706990 0.0000 13.21 -0.3877 0.1171 0.5048 ... -40.475117 -40.498597 6.469 [C, H, H, H, H] [[-0.0126981359, 1.0858041578, 0.0080009958], ... [1341.307, 1341.3284, 1341.365, 1562.6731, 156... C InChI=1S/CH4/h1H4 16.04246 0.639058
1 4 2 293.60975 293.541110 191.393970 1.6256 9.46 -0.2570 0.0829 0.3399 ... -56.522082 -56.544961 6.316 [N, H, H, H] [[-0.0404260543, 1.0241077531, 0.0625637998], ... [1103.8733, 1684.1158, 1684.3072, 3458.7145, 3... N InChI=1S/H3N/h1H3 17.03052 0.446845
2 3 3 799.58812 437.903860 282.945450 1.8511 6.31 -0.2928 0.0687 0.3615 ... -76.400922 -76.422349 6.002 [O, H, H] [[-0.0343604951, 0.9775395708, 0.0076015923], ... [1671.4222, 3803.6305, 3907.698] O InChI=1S/H2O/h1H2 18.01528 0.342879
3 4 4 0.00000 35.610036 35.610036 0.0000 16.28 -0.2845 0.0506 0.3351 ... -77.304583 -77.327429 8.574 [C, C, H, H] [[0.5995394918, 0.0, 1.0], [-0.5995394918, 0.0... [549.7648, 549.7648, 795.2713, 795.2713, 2078.... C#C InChI=1S/C2H2/c1-2/h1-2H 26.03728 0.619937
4 3 5 0.00000 44.593883 44.593883 2.8937 12.99 -0.3604 0.0191 0.3796 ... -93.408425 -93.431246 6.278 [C, N, H] [[-0.0133239314, 1.1324657151, 0.0082758861], ... [799.0101, 799.0101, 2198.4393, 3490.3686] C#N InChI=1S/CHN/c1-2/h1H 27.02534 0.484601

5 rows × 24 columns


In [103]:
# save new descriptors to file
df.to_pickle('mol.pkl')

In [100]:
# load pickle file
df = pd.read_pickle('mol.pkl')

2) Data Visualization: Exploring the Dataset

2.1) Visualization

For visualization we use imolecule and rdkit which already has some ipython bindings.


In [30]:
nMols=10
indexes=np.random.randint(0,df.shape[0],nMols)
mols=[ Chem.MolFromSmiles(df.ix[i,'SMILES']) for i in indexes]
names=[ "Mol # %d"%(i) for i in indexes]
p = Draw.MolsToGridImage( mols, molsPerRow=5, subImgSize=(200, 200), legends=names)
print("=== Random selection of %d molecules ==="%nMols)
p


=== Random selection of 10 molecules ===
Out[30]:

In [99]:
randomMolecule=np.random.randint(0,df.shape[0])
smileStr=df.loc[randomMolecule,'SMILES']
print("== 2D ==")
mymol=pybel.readstring("smiles", smileStr)
mymol


== 2D ==
Out[99]:
Multiple Molecules - Open Babel Depiction C N O N

In [100]:
print("== 3D Rotatable Molecule==")
imolecule.draw(smileStr, drawing_type="ball and stick", shader="phong")


== 3D Rotatable Molecule==

2.2) Size: Number of atoms, Molecular Weight, Spatial Extensivity

spatialSize = Electronic Spatial Extent

The electronic spatial extent is a single number that attempts to describe the size of a molecule. This number is computed as the expectation value of electron density times the distance from the center of mass of a molecule. Because the information is condensed down to a single number, it does not distinguish between long chains and more globular molecules.


In [30]:
interest=['numAtoms','spatialSize','molWeight']
for i in interest:
    rangeSize=max(df[i])-min(df[i])
    sns_cycle_palette()
    # histogram or a distribution plot?
    if rangeSize < 40:
        bins = np.arange(min(df[i]),max(df[i])+1)
        plt.hist(df[i],bins)
    else:
        sns.distplot(df[i])
    plt.title('Distribution of molecules')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.show()
# show stats
print('=== Stats ===')
df[interest].describe().transpose()


=== Stats ===
Out[30]:
count mean std min 25% 50% 75% max
numAtoms 130830 18.032538 2.943693 3.00000 16.000000 18.00000 20.00000 29.000000
spatialSize 130830 1189.416537 280.471126 19.00020 1017.441875 1147.22465 1309.04710 3374.753200
molWeight 130830 122.721698 7.570695 16.04246 121.179640 125.12528 127.14116 152.038398

Observations:

  • Electronic Spatial Extent follows a normal distribution.
  • Molecular Weight has a bi-modal distribution, it appears to be the mix of 2 normal distributions around 110 and 130.
  • The number of atoms is heavily concentrated between 16 and 20.

2.2 Energetics: Internal energy (U), Enthalpy (H), Gibbs free energy (G), Zero Point Energy (zpe)

Internal energy (U)

Is the sum of all the microscopic energies such as translational kinetic energy, vibrational and rotational kinetic energy and potential energy from intermolecular forces.

Enthalpy $ H = U + pV $

Includes the pressure and Volume of the system. In a sense it accounts for energy transferred to the environment at constant pressure through expansion or heating.

Gibbs free energy $ G = U + pV - TS $

Measures the "usefulness" or process-initiating work obtainable from a thermodynamic system at a constant temperature and pressure.

Zero Point Energy (zpe)

Zero-point energy is the energy that remains when all other energy is removed from a system, i.e the energy of a molecule at T=0.


In [44]:
sns_cycle_palette()
sns.jointplot(df['G'], df['U'], kind="scatter",size=8);
plt.show()
sns_cycle_palette()
sns.jointplot(df['U0'], df['zpe'],kind="reg",size=8);
plt.show()

# show stats
interest=['U','H','G','zpe']
print('=== Stats ===')
df[interest].describe().transpose()


=== Stats ===
Out[44]:
count mean std min 25% 50% 75% max
U 130830 -410.812341 39.891166 -714.560153 -437.870873 -416.800560 -387.031228 -40.476062
H 130830 -410.811397 39.891166 -714.559209 -437.869929 -416.799616 -387.030284 -40.475117
G 130830 -410.854217 39.891884 -714.602138 -437.911833 -416.841337 -387.074526 -40.498597
zpe 130830 0.149090 0.033138 0.015951 0.125638 0.148630 0.171397 0.273944

Observations:

  • U0,U,H,G are all extremely correlated, this makes sense since they defined by vary similar formulas (i.e U=U0+zpe ). We can just keep one measure to reduce data size and number of variables.
  • zpe is a very small number compared to the rest, while quite similar in distribuition to the other energetics it does have some variability.

2.3 Rotational Constants (A,B,C)

Molecules are quantized (due to Quantum Mecchanics) rotating systems, so rotational energy and the angular momentum only takes certain fixed values, these are related to the rotational constants. These values are normally ordered as $A\ge B \ge C$. Based on the relationship of these constants we can classify the molecules as rotors:

  • Spherical rotors where $A = B = C$.
  • Linear molecules where $A >> B=C $ .
  • Oblate rotors where olny $A=B$
  • Prolate where olny $B=C$.
  • Asymmetric rotors where $A \neq B \neq C$.

We can count how many of these we might find in our data to make sense if we should classify molecules this way:


In [49]:
AeqB=np.abs(df['A'] - df['B'])/df['A'] < 0.005
BeqC=np.abs(df['B'] - df['C'])/df['B'] < 0.005
AeqC=np.abs(df['A'] - df['C'])/df['A'] < 0.005

largeRatio= df['A']/df['B'] < 0.25
smallRatio= df['A']/df['B'] > 100
largeA= np.logical_and(df['A'] >= 80 , df['A']/df['B'] > 4)
##clasifiers
spherical=np.logical_and(np.logical_and(AeqB,BeqC),AeqC)

linear=np.logical_or(np.logical_or(largeRatio,smallRatio),largeA)

checked=np.logical_or(linear,spherical)
oblate=np.logical_and(AeqB,np.logical_not(checked)) 

checked=np.logical_or(checked,oblate)
prolate=np.logical_and(BeqC,np.logical_not(checked)) 

checked=np.logical_or(checked,prolate)
asym=np.logical_not(checked)

print("Spherical %d mols"%( sum( spherical)) )
print("Linear %d mols"%( sum(linear) ) )
print("Oblate %d mols"%( sum(oblate) ) )
print("Prolate %d mols"%( sum(prolate) ) )
print("Asymmetric %d mols"%( sum(asym) ) )
# create new data column
df['rotType']='Asym'
for index,row in df.iterrows():
    if spherical[index]:
        df.ix[index,'rotType']='sph'
    elif linear[index]:
        df.ix[index,'rotType']='lin'
    elif oblate[index]:
        df.ix[index,'rotType']='ob'
    elif prolate[index]:
        df.ix[index,'rotType']='pro'


Spherical 7 mols
Linear 25 mols
Oblate 83 mols
Prolate 367 mols
Asymmetric 130348 mols

In [50]:
print("== Spherical Molecules ==")
subdf=df[ spherical ]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ] 
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C'])  for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p


== Spherical Molecules ==
Out[50]:

In [51]:
print("== Linear ==")
subdf=df[ linear ]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ] 
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C'])  for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p


== Linear ==
Out[51]:

In [52]:
print("== Some Oblate ==")
subdf=df[ oblate ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
#subdf=subdf[:15]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ] 
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C'])  for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p


== Some Oblate ==
Out[52]:

In [53]:
print("== Some Prolate ==")
subdf=df[ prolate ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ] 
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C'])  for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p


== Some Prolate ==
Out[53]:

In [54]:
print("== Some Asymmetric ==")
subdf=df[ asym ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ] 
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C'])  for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p


== Some Asymmetric ==
Out[54]:

In [85]:
# show stats
interest=['A','B','C']

print('=== Stats ===')
df.groupby('rotType')[interest].describe()


=== Stats ===
Out[85]:
A B C
rotType
Asym count 130348.000000 130348.000000 130348.000000
mean 3.428143 1.400986 1.122086
std 2.674430 1.296514 0.849326
min 1.405150 0.337120 0.331180
25% 2.554785 1.091817 0.911447
50% 3.088760 1.370480 1.081650
75% 3.832645 1.654280 1.281630
max 799.588120 437.903860 282.945450
lin count 25.000000 25.000000 25.000000
mean 34181.745950 8.506396 8.283852
std 130569.093627 13.280076 12.812553
min 0.000000 0.377100 0.377100
25% 0.000000 0.782570 0.782570
50% 80.462250 2.048960 2.048960
75% 159.871170 8.593230 8.593210
max 619867.683140 44.593883 44.593883
ob count 83.000000 83.000000 83.000000
mean 6.354241 6.350577 4.146749
std 32.011338 32.004122 20.864110
min 1.382500 1.381480 0.806320
25% 1.825840 1.820495 1.013025
50% 2.100840 2.096210 1.430060
75% 2.660040 2.655590 2.097040
max 293.609750 293.541110 191.393970
pro count 367.000000 367.000000 367.000000
mean 4.787481 1.378116 1.375165
std 2.701848 0.684939 0.684509
min 1.534850 0.373300 0.371660
25% 3.159955 0.907110 0.904440
50% 4.125120 1.292390 1.292330
75% 5.799905 1.728520 1.721730
max 24.204020 5.385450 5.385240
sph count 7.000000 7.000000 7.000000
mean 25.382989 25.381147 25.380613
std 58.370757 58.370755 58.369675
min 1.497270 1.497230 1.497190
25% 2.442800 2.442665 2.442625
50% 3.455190 3.445180 3.445020
75% 5.065530 5.065160 5.064920
max 157.711800 157.709970 157.706990

In [102]:
#exclude large coefficient molecules since they have very different values than skew a lot the stats
subdf=df[np.logical_and(df['A'] < 10,df['B'] < 10) ]
groupsA=[group['A'] for index,group in subdf.groupby('rotType') ]
groupsB=[group['B'] for index,group in subdf.groupby('rotType') ]
groupsC=[group['C'] for index,group in subdf.groupby('rotType') ]
names=[index for index,group in subdf.groupby('rotType')  ]
plt.title("Distribuition for different type of rotors")
sns.violinplot(groupsA,names=names,alpha=0.5,color='green',label='A')
sns.violinplot(groupsB,names=names,alpha=0.5,color='blue')
sns.violinplot(groupsC,names=names,alpha=0.5,color='red',label='A')
#custom legend
import matplotlib.patches as mpatches

green_patch = mpatches.Patch(color='green',alpha=0.5, label='A')
blue_patch = mpatches.Patch(color='blue',alpha=0.5, label='B')
red_patch = mpatches.Patch(color='red',alpha=0.5, label='C')
plt.legend(handles=[green_patch,blue_patch,red_patch])
plt.legend()
plt.show()
print("")




In [38]:
# show stats
interest=['A','B','C']
#exclude linear molecules since they have very different A values than skew a lot the stats
subdf=df[ np.logical_not(linear) ]

for i in interest:
    sns_cycle_palette()
    asubdf=subdf[ subdf[i] < 50 ]
    # histogram, seaborn has better
    sns.distplot(asubdf[i],rug=False, hist=True,hist_kws={"alpha": 0.5},label=i);
    plt.xlim([0,10])

plt.xlabel("Rot Constant value")
plt.title('Distribution of molecules')
plt.ylabel('Frequency')
plt.legend()
plt.show()


Observations:

  • B and C have normal distributions, $B \ge C$ explains the slight skew. A has around 2x bigger mean and std.
  • There are a few outlier cases, such as small molecules which have high rotational numbers ($H_2O$) or linear molecules (Huge $A$). These outliers mess-up the statistics of the distribuitions, removing them, we get ''nice'' distribuitions.
  • Would be interesting to see if each class of rotator also reflects on the distribution of other properties.
  • Might be worth considering only using the subset of asymmetric molecules since they represent the great majority of all the data.
  • How does molecular weight affect the roational constants? Intuition would dictate that more weight = less freedom of movement and this would reflect on smaller rot-constants. Linear molecules are except from this rational.

2.3 Special Properties

Electronic Band Gap (gap)

Calculated as $E_{gap}=E_{HOMO}-E_{LUMO}$, also known as "band gap", where HOMO (highest occupied molecular orbital) and LUMO (lowest unoccupied molecular orbital) are ''frontier'' orbitals for the molecule. Basically, it's how much energy you have to feed into the molecule to kick it from the ground (most stable) state into an excited state. The gap energy can tell us about what wavelengths the compound can absorb.

Molar Heat Capacity (Cv)

At the molecular level, temperature is the average kinetic energy of an ensamble of molecules. Heat capacity characterizes the amount of heat required to change a body's temperature by a given amount.

It is related to molecular complexity, more complex molecules have multiple degrees of freedom (translation, rotation and vibration)...when an ideal gas of ''simple'' molecules (low heat capacity) absorbs heat, all the energy will go to a low number of degrees of freedom...while for a gas with more complex molecules the absorbed energy is partitioned among many more kinds of motions, giving a smaller rise temperature (high heat capacity).

Dipole Moment (dipole)

Dipole moments arise from differences in electronegativity.

It measures the separation of positive and negative electrical charges in a system of electric charges, in this case electrons. The larger the difference in electronegativity, the larger the dipole moment. A zero dipole implies that the negative and positive forces are canceling each other.

Polarizability (polar)

Neutral nonpolar species have spherically symmetric arrangements of electrons in their electron clouds. When in the presence of an electric field, their electron clouds can be distorted. Polarizability is a measure of this distortion.

More electrons or electrons more far away from the nuclear charge, meaning increased polarizability.


In [47]:
interest=['Hatom','Cv','gap','dipole','polar']
# show stats
print('=== Stats ===')
df[interest].describe().transpose()


=== Stats ===
Out[47]:
count mean std min 25% 50% 75% max
Hatom 130830 2.830376 0.385466 0.342879 2.581609 2.835485 3.078835 4.211903
Cv 130830 31.620447 4.067484 6.002000 28.955250 31.578500 34.298000 46.969000
gap 130830 0.252044 0.047192 0.024600 0.217000 0.250200 0.289400 0.622100
dipole 130830 2.672963 1.503480 0.000000 1.577800 2.475300 3.596375 29.556400
polar 130830 75.281410 8.173457 6.310000 70.480000 75.600000 80.610000 196.620000

In [43]:
names=["Enthalphy of atomization","Heat Capacity","Electronic band gap","Dipole Moment","Polarizability"]
for index,i in enumerate(interest):
    sns_cycle_palette()
    # histogram, seaborn has better
    sns.distplot(df[i],rug=False, hist=True);
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.title(names[index]+" across all molecules")
    plt.show()


Observations:

  • Overall all are normally distribuited, not surprising considering the sample size of our data. Some are more ''normal'' than others.
  • Dipole Moment and Electronic Band Gap could be a mixture of three gaussians since there are three visibles peaks.
  • $H_{a}$ has around 5 local maxima, could be modeled with a mixture of five gaussians.

2.4 Results of the Exploration

  • We can just keep one energy feature.
  • It might be advantegous to treat first a small subset of molecules, for example 2k from the asymetric rotor class. We might have bias due to the different representation of rotor class molecules.
  • Most of the data follows a normal distribution, some features have interesting peaks suggesting we should try to predict a Gaussian Mixture Model.

3 Machine Learning: Finding Patterns in the data

3.1 Prepare Data: Reduce Variables, do substructure matching

3.1.1 Calculate number of rings and decide if it contains a benzene ring


In [101]:
#get Ring Data
def getRingData( row ):
    mol=Chem.MolFromSmiles(row['SMILES'])
    if mol is None:
        numRing=0
    else:
        numRing=len(mol.GetAromaticAtoms())
    return numRing

#apply function to each row
df['ringNum']=df.apply(getRingData, axis=1)

In [102]:
benzeneMol=Chem.MolFromSmiles('c1ccccc1')
print("=== Our reference substructure ===")
benzeneMol


=== Our reference substructure ===
Out[102]:

In [103]:
#get Ring Data
refMol=benzeneMol
matchName='benzene'
def smileFriendly( row):
    mol=Chem.MolFromSmiles(row['SMILES'])
    if mol is None:
        match=False
    else:
        match=True
    return match

df['smileGood']=df.apply(smileFriendly, axis=1)
print(" finished classifing good smiles ")
def matchSubstructure( row):
    if row['smileGood']:
        mol=Chem.MolFromSmiles(row['SMILES'])
        match=(len(mol.GetSubstructMatch(refMol)) > 0)
    else:
        match=False
    return match

#apply function to each row
df[matchName]=df.apply(matchSubstructure, axis=1)
nMatched=df[df[matchName]].shape[0]
print("Found %d matched substructures "%(nMatched))


 finished classifing good smiles 
Found 291 matched substructures 

3.1.2 Create subset of data to try the classification


In [109]:
subdf=df[df['rotType']=='Asym']
# divide classes
ringdf=subdf[subdf['benzene']]
sampleSize=min(ringdf.shape[0],500);
if nMatched > 500:
    ringdf=ringdf[ringdf['dbindex'].isin(random.sample(ringdf['dbindex'], sampleSize))]

candidates=np.logical_and(np.logical_and(~subdf['benzene'],subdf['numAtoms'] >=11 ),
                          subdf['ringNum'] == 0)
noRingdf=subdf[ candidates]
noRingdf=noRingdf[noRingdf['dbindex'].isin(random.sample(noRingdf['dbindex'], sampleSize)) ]

# merge both data frames
subdf=pd.concat([ringdf,noRingdf])
print(subdf.shape)
subdf.head()


(566, 28)
Out[109]:
numAtoms dbindex A B C dipole polar homo lumo gap ... atomCoords freqs SMILES InChI molWeight Hatom rotType ringNum smileGood benzene
929 15 930 5.55639 2.51908 1.75211 0.3250 69.91 -0.2361 0.0045 0.2407 ... [[-0.0117195668, 1.5234800402, 0.0103391293], [-0.0089052882, 0.0143135821, 0.0022205061], [-0.0038616072, -0.6965524191, -1.2016066613], [0.0269483994, -2.0884265906, -1.2122990032], [0.05096882, -2.7969187218, -0.0134208178], [0.0414768819, -2.1015164974, 1.1932643885], [0.0106302783, -0.7095775849, 1.1980688052], [1.0096050652, 1.9245101689, 0.002531113], [-0.5230439031, 1.9290081911, -0.8677920221], [-0.5059025374, 1.9194936281, 0.9024884495], [-0.026631914, -0.1513605007, -2.1408448831], [0.0280624747, -2.6201883726, -2.1583697581], [0.0718047116, -3.8815216588, -0.0194480703], [0.0539945616, -2.6435172098, 2.1334233204], [-0.0007653961, -0.1746249648, 2.1433776041]] [13.2326, 211.1698, 340.3382, 414.4366, 478.4558, 526.0378, 632.4802, 716.2096, 752.6377, 797.8522, 866.3421, 909.755, 952.1108, 985.938, 1000.2668, 1014.519, 1052.7545, 1063.0599, 1112.2099, 1181.1798, 1204.2909, 1231.6192, 1333.0293, 1348.4539, 1413.1856, 1467.7377, 1488.3888, 1502.3855, 1529.7724, 1625.4066, 1646.9112, 3026.4157, 3083.6896, 3111.3874, 3161.7098, 3163.4093, 3176.0685, 3184.2829, 3197.4286] Cc1ccccc1 InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3 92.138420 2.557181 Asym 6 True True
939 14 940 5.63915 2.59726 1.78123 1.6318 67.30 -0.1991 0.0086 0.2077 ... [[-0.0163224155, 1.4136802572, 0.1493342031], [0.0020312448, 0.020282505, 0.0475987977], [-1.1934551094, -0.7118247733, 0.0330451422], [-1.1694332794, -2.1010422192, -0.0062715702], [0.0397958132, -2.7928589821, -0.0332417297], [1.2300081891, -2.068436551, -0.0200152233], [1.216722451, -0.6790716275, 0.0192391113], [0.8087264938, 1.8733958315, -0.2078715935], [-0.8575919538, 1.8507521535, -0.1983342042], [-2.1422553477, -0.1831707261, 0.0563723871], [-2.1079538779, -2.6464255708, -0.0184299143], [0.0543362436, -3.8763067051, -0.0651406315], [2.1828012149, -2.5881178795, -0.0430067941], [2.1510138634, -0.1248304826, 0.0317785795]] [224.2427, 295.3001, 382.2724, 416.2446, 509.7066, 534.047, 616.1564, 631.5351, 710.7599, 772.7953, 832.1737, 836.367, 884.2922, 941.097, 970.5967, 1007.4543, 1050.4507, 1068.2887, 1138.9363, 1179.0525, 1201.1902, 1309.3874, 1351.983, 1359.2593, 1500.7314, 1534.1737, 1626.929, 1645.5227, 1661.851, 3159.109, 3160.213, 3176.3061, 3181.4197, 3200.4375, 3561.2957, 3659.2629] Nc1ccccc1 InChI=1S/C6H7N/c7-6-4-2-1-3-5-6/h1-5H,7H2 93.126480 2.369230 Asym 6 True True
948 13 949 5.67429 2.62513 1.79479 1.2550 62.42 -0.2191 0.0010 0.2201 ... [[0.097246341, 1.3675180399, 0.1311290224], [0.0678524749, 0.0068424163, 0.0400805622], [0.0761418569, -0.6646100168, -1.1834359131], [0.0447916339, -2.0568390099, -1.2095635585], [0.0052224321, -2.7854492589, -0.024379913], [-0.0027741535, -2.1059217819, 1.1939332909], [0.0282225363, -0.7173008208, 1.2333985161], [0.1226305024, 1.7401784789, -0.7561445864], [0.1070304048, -0.0987829965, -2.1112642157], [0.0515297324, -2.5700062722, -2.1656883557], [-0.019110342, -3.8689687908, -0.0478804758], [-0.0335038287, -2.6625358966, 2.1250038973], [0.0224610795, -0.1755467506, 2.1723285893]] [233.3224, 369.7802, 406.1505, 420.3493, 521.672, 533.1194, 628.597, 707.1698, 772.525, 830.0212, 832.7948, 891.6718, 943.8885, 974.7781, 1009.9947, 1045.7032, 1095.0609, 1177.2813, 1191.5389, 1198.5986, 1299.4243, 1349.953, 1368.2879, 1502.4366, 1533.009, 1637.5621, 1650.164, 3152.604, 3174.4128, 3182.7701, 3197.7086, 3204.7191, 3823.8246] Oc1ccccc1 InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H 94.111240 2.265065 Asym 6 True True
4207 12 4208 5.69508 2.57472 1.77311 1.2641 58.19 -0.2430 -0.0087 0.2343 ... [[0.0018123068, 0.0185526922, 0.0036786849], [-0.0163561014, 1.3566680459, 0.0106225625], [-1.2391398138, 2.0133412731, 0.0302940082], [-1.2496148565, 3.4061416766, 0.037416568], [-0.0538997981, 4.1218111685, 0.024991164], [1.1608026955, 3.439035941, 0.0053010093], [1.1881459921, 2.0464657894, -0.0020488151], [-2.1548410159, 1.4339368461, 0.0396547131], [-2.1986717599, 3.9313475557, 0.0527604747], [-0.0686188958, 5.2058980344, 0.0306258009], [2.0952486386, 3.9899455166, -0.0044499949], [2.1192387346, 1.4922640204, -0.0172973855]] [242.2515, 405.6341, 423.7619, 517.3056, 523.0544, 623.0781, 704.1833, 775.0643, 827.7738, 844.7107, 910.6343, 953.3747, 981.6342, 1014.9794, 1039.7573, 1089.9935, 1176.5394, 1178.4956, 1279.5097, 1311.1498, 1344.25, 1484.1072, 1528.3084, 1639.4333, 1640.871, 3177.1152, 3185.5145, 3198.6562, 3206.416, 3209.576] Fc1ccccc1 InChI=1S/C6H5F/c7-6-4-2-1-3-5-6/h1-5H 96.102303 2.133165 Asym 6 True True
4335 18 4336 3.58374 1.76010 1.19792 0.2951 82.66 -0.2298 0.0072 0.2371 ... [[-0.0191174654, 1.5277262784, 0.0048015465], [-0.008809547, 0.0181560366, -0.0088748709], [0.0188826203, -0.6864007888, -1.2152454422], [0.0502716541, -2.0775172563, -1.2189635356], [0.0536892744, -2.7833395215, -0.0197776703], [0.0264362783, -2.1048493087, 1.2014363329], [0.0537987347, -2.8618296084, 2.5073176092], [-0.0090061639, -0.7084944427, 1.1841457571], [0.9974410891, 1.9330366688, -0.0736450489], [-0.592033362, 1.9316640357, -0.8354533269], [-0.4531027076, 1.9171123855, 0.930214611], [0.0095100845, -0.1419569203, -2.1550250235], [0.064035518, -2.6149199055, -2.1620089268], [0.0713794806, -3.8692823209, -0.0300819074], [1.0813295133, -3.10873827, 2.8023461725], [-0.3849082441, -2.2753739066, 3.3197026885], [-0.4953835587, -3.805496322, 2.4344728166], [-0.0409565487, -0.1705457734, 2.1286873183]] [14.6838, 38.7789, 198.5832, 220.4659, 272.4767, 401.4202, 444.3633, 516.7992, 534.2659, 540.9004, 715.4553, 734.084, 795.1243, 899.3324, 901.4801, 920.2229, 961.7697, 998.7352, 1013.9928, 1032.1354, 1060.2953, 1063.5493, 1122.9596, 1182.9099, 1197.541, 1273.8214, 1323.4551, 1344.1816, 1410.6583, 1412.8541, 1449.7127, 1486.0, 1488.3176, 1488.5226, 1506.1266, 1525.5303, 1629.0915, 1648.929, 3026.2645, 3026.7556, 3083.951, 3084.1287, 3111.9609, 3112.0999, 3152.4465, 3160.0866, 3168.5559, 3187.0146] Cc1cccc(C)c1 InChI=1S/C8H10/c1-7-4-3-5-8(2)6-7/h3-6H,1-2H3 106.165000 3.008906 Asym 6 True True

5 rows × 28 columns

3.1.3 Visually verify molecular structures


In [110]:
sampleSize=20;
smalldf=subdf[subdf['benzene']]
smalldf=smalldf[smalldf['dbindex'].isin(random.sample(smalldf['dbindex'], sampleSize))]

mols = [ Chem.MolFromSmiles(row['SMILES']) for indx, row in smalldf.iterrows() ]
print("== Benzene Ringed atoms ==")
names=["# %d "%(row['dbindex']) for indx, row in smalldf.iterrows() ]
Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)


== Benzene Ringed atoms ==
Out[110]:

In [113]:
sampleSize=20;
smalldf=subdf[~subdf['benzene']]
smalldf=smalldf[smalldf['dbindex'].isin(random.sample(smalldf['dbindex'], sampleSize))]

mols = [ Chem.MolFromSmiles(row['SMILES']) for indx, row in smalldf.iterrows() ]
print("== Non-Benzene-Ringed atoms ==")
names=["# %d "%(row['dbindex']) for indx, row in smalldf.iterrows() ]
Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)


== Non-Benzene-Ringed atoms ==
Out[113]:

In [114]:
# save our pickle!
subdf.to_pickle('benzene.pkl')

In [119]:
# save our pickle!
subdf=pd.read_pickle('benzene.pkl')
subdf.columns


Out[119]:
Index([u'numAtoms', u'dbindex', u'A', u'B', u'C', u'dipole', u'polar', u'homo', u'lumo', u'gap', u'spatialSize', u'zpe', u'U0', u'U', u'H', u'G', u'Cv', u'atomList', u'atomCoords', u'freqs', u'SMILES', u'InChI', u'molWeight', u'Hatom', u'rotType', u'ringNum', u'smileGood', u'benzene'], dtype='object')

3.2 Create Classifiers!

3.2.1 Create X and Y matrices


In [255]:
print(subdf.columns)
nonTrainFeatures=['numAtoms','homo','lumo','dbindex','atomList', 'atomCoords', 'freqs', 'SMILES',
                  'InChI','rotType', 'ringNum', 'smileGood', 'benzene','U','H','G']

trainData=subdf.drop(nonTrainFeatures,axis=1)
targetData=subdf['benzene']
X = np.array([np.array(row) for indx,row in trainData.iterrows()])
Y = np.array( targetData)
features=[str(col) for col in trainData.columns]
features


Index([u'numAtoms', u'dbindex', u'A', u'B', u'C', u'dipole', u'polar', u'homo', u'lumo', u'gap', u'spatialSize', u'zpe', u'U0', u'U', u'H', u'G', u'Cv', u'atomList', u'atomCoords', u'freqs', u'SMILES', u'InChI', u'molWeight', u'Hatom', u'rotType', u'ringNum', u'smileGood', u'benzene'], dtype='object')
Out[255]:
['A',
 'B',
 'C',
 'dipole',
 'polar',
 'gap',
 'spatialSize',
 'zpe',
 'U0',
 'Cv',
 'molWeight',
 'Hatom']

3.2.2 RandomForestClassifier


In [256]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score


maxtrees=40
kfoldn=10
ntrees=np.arange(1,maxtrees+1)
scores=np.zeros((ntrees.shape[0],kfoldn))
for indx,i in enumerate(ntrees):
    rf = RandomForestClassifier(n_estimators=i)
    scores[indx]= cross_val_score(rf,X,Y, cv=kfoldn)

scores=np.array(scores)

print("== Random Forest Classifier, KFold=10==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))

plt.figure(figsize=(18, 6), dpi=80)
sns.boxplot(scores.transpose())
plt.plot(ntrees,[ i.mean() for i in scores],'r--',alpha=0.75,label="mean")
plt.legend()
plt.xlabel("Number of trees")
plt.ylabel("Score")
plt.title("RandomForest with varying number of trees")
plt.show()


== Random Forest Classifier, KFold=10==
Score Mean = 0.919, Std = 0.085

3.2.3 Feature Importance


In [260]:
goodTrees=11
rf=RandomForestClassifier(n_estimators=goodTrees)
scores= cross_val_score(rf,X,Y, cv=kfoldn)
kfold=sklearn.cross_validation.KFold(n=len(Y), n_folds=kfoldn)
importances=[]
for indj, indices in enumerate(kfold):
    train,test = indices
    X_train, X_test = X[train], X[test]
    Y_train, Y_test= Y[train], Y[test]
    rf.fit(X_train, Y_train)
    Y_pred=rf.predict(X_test)
    importances.append(rf.feature_importances_)
#pre-process data
importances=np.array(importances)
mimportances=[ i.mean() for i in importances.transpose()]

# sort
sortedtup=sorted(zip(mimportances,features))
mimportances = np.array([i[0] for i in sortedtup])
sfeatures = np.array([i[1] for i in sortedtup])
indexes=np.arange(len(mimportances))

In [261]:
#plot away!
fig, ax = plt.subplots()
sns.barplot(indexes,y=mimportances, palette="BuGn_d")
ax.set_xticklabels(sfeatures, rotation=45)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Random Forest Classifier (%d Trees), Score = %1.3f ( %1.3f) "%(goodTrees,np.mean(scores),np.std(scores)))
print("")



3.2.4 Visualizing the distribution of the 5 most important features


In [269]:
interest=sfeatures[-5::]
mostImportance=mimportances[-5::]
names=['Ring','Non-ring']
print("=== 5 most important features ===")
print(interest)
realNames=["Heat Capacity","Molecular Weight","Zero Point Energy",
           "Rotational Const C","Electronic Band Gap"]
for index,i in enumerate(interest):
    # histogram, seaborn has better
    matchdf=subdf[subdf['benzene']]
    nomatchdf=subdf[~subdf['benzene']]
    data=[matchdf[i],nomatchdf[i]]
    sns.violinplot(data,names=names);
    plt.xlabel("Type of molecule")
    plt.ylabel(i)
    plt.title("Distribuition of %s , Importance= %0.3f "%(realNames[index],mostImportance[index]) )
    plt.show()


=== 5 most important features ===
['Cv' 'molWeight' 'zpe' 'C' 'gap']

3.3 Other Classifiers!


In [276]:
from sklearn.naive_bayes import GaussianNB 
from sklearn import svm
from sklearn import grid_search 
from sklearn import neighbors
from sklearn.metrics import classification_report

gnb = GaussianNB()
scores= cross_val_score(gnb,X,Y, cv=kfoldn)
print("== Naive_Bayes Classifier ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))

clf_KNN = neighbors.KNeighborsClassifier(2, weights="uniform" )
scores= cross_val_score(clf_KNN,X,Y, cv=kfoldn)
print("== K-NN ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))

## Grid for the CV
gammas = np.logspace(-10,4,8,base=2)
Cost = [0.001,0.001,0.01,1,10,100,1000]
param_grid = [
  {'C': Cost , 'gamma': gammas, 'kernel': ['rbf']},
 ]
clf_svm  = grid_search.GridSearchCV(estimator=svm.SVC(), param_grid=param_grid) 
scores= cross_val_score(clf_svm,X,Y, cv=kfoldn)
print("== SVM ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))


== Naive_Bayes Classifier ==
Score Mean = 0.865, Std = 0.066
== K-NN ==
Score Mean = 0.585, Std = 0.087
== SVM ==
Score Mean = 0.745, Std = 0.121

3.4 Exploring Similiarity between Molecules


In [127]:
mols = [ pybel.readstring("Smiles",row['SMILES']) for indx,row in subdf.iterrows() ]
indexes = [ indx for indx,row in subdf.iterrows() ]
fps= [ mol.calcfp() for mol in mols ]

def pairwise_sim(fps):
    #create pairwise matrix
    pairwise=np.zeros((len(fps),len(fps)))
    #iterate over all pairs
    for indx,imol in enumerate(fps):
        for jndx,jmol in enumerate(fps):
            #tanimoto similarity
            pairwise[indx,jndx]=imol|jmol
            
    return pairwise

sim=pairwise_sim(fps)

In [129]:
plt.figure(figsize=(12,12))
plt.imshow(sim,interpolation='None' )
plt.title("Similarity matrix for %d Molecules (1=Equal)"%20)
plt.xlabel("Mol #id")
plt.ylabel("Mol #id")
plt.colorbar()
plt.show()



In [130]:
ref_indx=10
sim_ref=sim[ref_indx,:]
plt.hist(sim_ref)
plt.title("Similarity for a reference molecule")
plt.xlabel("Tanito Similarity")
plt.ylabel("Frequency")
print("== Reference Molecule ==")
smileStr=subdf.ix[indexes[ref_indx],'SMILES']
imolecule.draw(smileStr)
mols[ref_indx]


== Reference Molecule ==
Out[130]:
Multiple Molecules - Open Babel Depiction C H O O O

In [131]:
nmols = [ Chem.MolFromSmiles(subdf.ix[indx,'SMILES']) for indx in indexes ]

# sorting the molecule as a function of their similarity to a query

nbrs = sorted(zip(sim_ref,nmols,indexes),reverse=True)
print("== Similar molecules, Sorted from most similar to disimilar ==")
# depict the molecules in decreasing order of similarity
names=["# %d - sim=%f"%(subdf.ix[nbr[2],'dbindex'],nbr[0]) for nbr in nbrs ]
Draw.MolsToGridImage( [nbr[1] for nbr in nbrs], molsPerRow=5, legends=names)


== Similar molecules, Sorted from most similar to disimilar ==
Out[131]:

3.2.1 Bigger samplesize


In [133]:
sampleSize=1000;
subdf=df[df['rotType']=='Asym']
subdf=subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], sampleSize))]
#remove some variables
interest=[u'numAtoms', u'dbindex', u'A', u'B', u'C', u'dipole', u'polar', u'gap', u'spatialSize', u'zpe', u'U0', u'Cv', u'SMILES', u'molWeight', u'Hatom', u'rotType']
subdf=subdf[interest]
mols = [ pybel.readstring("Smiles",row['SMILES']) for indx,row in subdf.iterrows() ]
indexes = [ indx for indx,row in subdf.iterrows() ]
fps= [ mol.calcfp() for mol in mols ]
sim=pairwise_sim(fps)
#plot!
plt.figure(figsize=(12,12))
plt.imshow(sim )
plt.title("Similarity matrix for %d Molecules (1=Equal)"%sampleSize)
plt.xlabel("Mol #id")
plt.ylabel("Mol #id")
plt.colorbar()
plt.show()



In [134]:
ref_indx= np.argmax( np.linalg.norm(sim,axis=1) )
sim_ref=sim[ref_indx,:]
plt.hist(sim_ref)
plt.title("Reference molecule = biggest colunm norm in the matrix")
plt.xlabel("Tanito Similarity")
plt.ylabel("Frequency")
print("== Reference Molecule ==")
smileStr=subdf.ix[indexes[ref_indx],'SMILES']
imolecule.draw(smileStr)
mols[ref_indx]


== Reference Molecule ==
Out[134]:
Multiple Molecules - Open Babel Depiction H O H

In [135]:
nmols = [ Chem.MolFromSmiles(subdf.ix[indx,'SMILES']) for indx in indexes ]

# sorting the molecule as a function of their similarity to a query

nbrs = sorted(zip(sim_ref,nmols,indexes),reverse=True)
nbrs=nbrs[:30]
print("== 30 most Similar molecules, Sorted from most similar to disimilar ==")
# depict the molecules in decreasing order of similarity
names=["# %d - sim=%f"%(subdf.ix[nbr[2],'dbindex'],nbr[0]) for nbr in nbrs ]
Draw.MolsToGridImage( [nbr[1] for nbr in nbrs], molsPerRow=5, legends=names)


== 30 most Similar molecules, Sorted from most similar to disimilar ==
Out[135]:

4 Summary, Results, the End

Goals reached

The first two goals were met sucessfully. The last goal, which is the most interesting was met for one substructure search. With some partial results. We tried various classifiers, such as randomForest, SVM, K-NN and Naive Bayes. We obtained the best results with the randomForest ensemble method. One of the advantageous of this methods is that we can a decision tree that gives an idea how important the variables are for such classification. For example with benzene we found that band gap was the best descriptor for these types of molecules. Visualizing the distribuition of these two clases is is obvious they are very different. This also coincides with the chemical notion of an aromatic ring, these types of structures are being used to design organic semi-conductors, which are characterized by a low electronic gap. We see that on average a bezene containing molecule will have a lower band gap.

Extra ideas (not tested)

Due to time constrains, some ideas were not testet, but they might work for future directions:

  • Search the most common functional groups, chemical substructures that contain key chemical functionality. For this purpose I wanted to use RDKit, but had problems running it. Tried to think how to formulate this with a Kmeans classifier.
  • Create a labeling for each molecule based on a molecule having the substructure or not. This would create a membership of a molecule to within different clusters.
  • Reduce dimentionality, maybe with a TruncatedSVD or finding the best features by ranking their importance via a random forest classifier.
  • See if membership into a cluster would correspond to general differences in the molecular properties.
  • Use SVM/Random Forest to create a classifier of molecular substructure based on molecular properties.

On the experience Starting this course I had never used an ipython notebook or python. Had always programmed in C++ or Fortran. Now I want to use python for many more projects, it offers a very fast way of prototyping ideas.

Future Directions There is certaintly much more work to be done and many venues to explore with data science and Quantum Chemistry. For the msot part I stuck to the available dataset, yet I feel there could be more interesting data embedded in the calculations which is not included in the data set. Examples:

  • Frequencies, only frequencies were reported but amplitudes were not. Using amplitudes and frequencies together you could devise spectra features such as spectral centroids, brightness o cSpectrum.
  • Coulombic Matrix, The matrix of electromagnetic interaction between different molecules.
  • Orbital Occupation Numbers, Occupation numbers for the electrons within the molecular orbitals.

The main idea would be design and find descriptor that can not be found in a classical way, to demonstrate that a quantum calculation has more conection to reality than a classical one. (at the molecular level).


In [ ]: