In [10]:
from IPython.core.display import HTML
import os
def css_styling():
"""Load default custom.css file from ipython profile"""
base = os.getcwd()
styles = "<style>\n%s\n</style>" % (open(os.path.join(base,'custom.css'),'r').read())
return HTML(styles)
css_styling()
Out[10]:
This CS 209 project is an exploration of chemistry using data science techniques. I want to apply the methods learned in class to my own context, which is Quantum Chemistry.
Quantum chemistry is a branch of theoretical chemistry that applies quantum mechanics to address problems in chemistry.
In Quantum Chemistry you can model a molecule on a computer, solving the Schrödinger equation (approximately) to obtain molecular properties.
Some pretty awesome examples of Quantum Chemistry applications are:
Thanks to advances in computing, we can now calculate a large number of molecules in a relatively short time frame, producing huge amounts of data.
The pharmaceutical industry long ago established the practice of using molecular calculations to drive drug discovery. This has only been done with classical mechanics and with very limited quantum treatment of molecules. There is now a trend of adapting the tools and strategies of this industry to materials science.
So a current open question is: how can we incorporate quantum effects/information into chemoinformatics?
Very little work has been done on this, partly because of the computational cost of treating quantum effects and partly because of the complexity of this information.
I want to put my newly acquired skills to the test, on data I will most probably be working with during my graduate studies.
One aspect of this work is the research itself; there are two main projects that have tried to apply data science to Quantum Chemistry:
The other aspect relates to accessibility in science: using tools such as IPython notebooks, data repositories, and open-source libraries to accelerate and open up the process of doing science. In Chemistry, two side projects of great interest are:
The dataset is hosted at http://dx.doi.org/10.6084/m9.figshare.978904; it is a few hundred MB.
I also have a pandas dataframe pickle in my Dropbox:
https://www.dropbox.com/sh/6iysi4w0xmmevlt/AABrTLUFZJvrJPDeDCzuhxJYa?dl=0.
They reported computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF atoms. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules.
All calculations have been carried out at the B3LYP/6-31G(2df,p) level of quantum chemistry (a decent level of theory, but not the best).
The results of each calculation have been embedded within an xyz file in a very specific way. The data provided with the article is not very user friendly (see: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12803); a better format would have been a JSON or Excel database.
We will parse out the information according to the readme.txt file and create, for each molecule, a row within a pandas dataframe. This dataframe can then easily be exported to a variety of popular data formats, as sketched below.
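As a minimal sketch of that last export step (the toy dataframe, column values, and output file names here are placeholders for illustration, not the actual dataset), pandas can write the parsed dataframe to several common formats:
import pandas as pd
# toy dataframe standing in for the parsed masterdf built below
example = pd.DataFrame({'dbindex': [1, 2],
                        'SMILES': ['C', 'N'],
                        'gap': [0.1, 0.2]})
example.to_csv('mols.csv', index=False)           # plain-text CSV
example.to_json('mols.json', orient='records')    # one record per molecule
example.to_pickle('mols.pkl')                     # fast pandas round-trip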
In [1]:
#Libraries
import numpy as np
import scipy as sp
import pandas as pd
import sklearn
import seaborn as sns
from matplotlib import pyplot as plt
import random
#chemo-informatics
import imolecule
import openbabel
import pybel
import rdkit
# visual stuff
%matplotlib inline
from IPython.core import display
from PIL import Image
#rdkit
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import rdkit.rdBase
from rdkit.Chem.MACCSkeys import GenMACCSKeys
from rdkit import DataStructs
from rdkit.DataStructs import BitVectToText
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import MCS as MCS
from rdkit.Chem import Descriptors as Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem import PandasTools as PandasTools
#awesome plot options
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12.0, 6.0)
# utility
from io import BytesIO
#functions for ipython
def display_pil_image(im):
"""Displayhook function for PIL Images, rendered as PNG."""
b = BytesIO()
im.save(b, format='png')
data = b.getvalue()
ip_img = display.Image(data=data, format='png', embed=True)
return ip_img._repr_png_()
# register display func with PNG formatter:
png_formatter = get_ipython().display_formatter.formatters['image/png']
dpi = png_formatter.for_type(Image.Image, display_pil_image)
#functions for plotting
def sns_cycle_palette():
"""Cycle seaborn's color palete to have a variety of colors"""
pal = sns.color_palette()
pal=pal[1:]+[pal[0]]
sns.set_palette(pal)
return
In [7]:
def readFormatedXYZ(xyzFile):
    """Parse one formatted xyz file from the dataset into a flat list of values."""
    fileData = []
    with open(xyzFile, 'r') as afile:
        numAtoms = int(next(afile))
        # 1 number of atoms
        fileData.append(numAtoms)
        datline = next(afile).split()
        # 2 molecular index within the database
        fileData.append(int(datline[1]))
        # 3-5 rotational constants A, B, C
        fileData.append(float(datline[2]))
        fileData.append(float(datline[3]))
        fileData.append(float(datline[4]))
        # 6 dipole moment
        fileData.append(float(datline[5]))
        # 7 polarizability
        fileData.append(float(datline[6]))
        # 8-10 orbital energies (homo, lumo, gap)
        fileData.append(float(datline[7]))
        fileData.append(float(datline[8]))
        fileData.append(float(datline[9]))
        # 11 electronic spatial extent
        fileData.append(float(datline[10]))
        # 12 zero-point energy
        fileData.append(float(datline[11]))
        # 13-14 internal energy (U0, U)
        fileData.append(float(datline[12]))
        fileData.append(float(datline[13]))
        # 15 enthalpy
        fileData.append(float(datline[14]))
        # 16 free energy
        fileData.append(float(datline[15]))
        # 17 heat capacity
        fileData.append(float(datline[16]))
        # read geometry
        atomCoords = np.zeros((numAtoms, 3))
        atomList = []
        for i in range(numAtoms):
            datline = next(afile).split()
            atomList.append(datline[0])
            atomCoords[i] = np.array([float(d) for d in datline[1:4]])
        fileData.append(atomList)
        fileData.append(atomCoords)
        # next line: harmonic frequencies
        fileData.append(np.array([float(f) for f in next(afile).split()]))
        # relaxed geometry SMILES
        fileData.append(next(afile).split()[1])
        # relaxed geometry InChI
        fileData.append(next(afile).split()[1])
    return fileData
In [61]:
#===== Initial variables =====
molFolder = 'mols/'
rmin = 0
rmax = 100
pandasFile = "mols.pkl"
#===== Resume from the pickle file if it already exists =====
if os.path.isfile(pandasFile):
    masterdf = pd.read_pickle(pandasFile)
    rmin = masterdf.shape[0] + 1
else:
    datCols = ['numAtoms', 'dbindex', 'A', 'B', 'C', 'dipole', 'polar',
               'homo', 'lumo', 'gap', 'spatialSize', 'zpe',
               'U0', 'U', 'H', 'G', 'Cv',
               'atomList', 'atomCoords', 'freqs',
               'SMILES', 'InChI']
    masterdf = pd.DataFrame(columns=datCols)
#===== Iterate over the files =====
indexes = range(rmin, rmax)   # indexes to process in this run
for indx in indexes:
    xyzFile = "mols/dsgdb9nsd_%06d.xyz" % (indx + 1)
    masterdf.loc[indx] = readFormatedXYZ(xyzFile)
    if (indx % 10000 == 0):
        print(indx)
        masterdf.to_pickle(pandasFile)
#===== Save the final dataframe =====
masterdf.to_pickle(pandasFile)
In [6]:
# load the previously pickled dataframe
pandasFile = "original.pkl"
masterdf = pd.read_pickle(pandasFile)
print(masterdf.columns)
masterdf.head()
Out[6]:
In [76]:
#preprocessing the dataframes
df=masterdf.copy()
#Filter out the uncharacterized molecules
badmols=pd.read_csv('uncharacterized.csv')
df=df[~df['dbindex'].isin(badmols['Index'])]
print("Bad molecules: %s"%(str(badmols.shape)))
print("Before %s -> After %s "%(masterdf.shape[0],df.shape[0]))
In [77]:
#load atomic data
atomdf=pd.read_csv('atomref.csv')
weightDict={ row['Element']:row['Mass'] for indx,row in atomdf.iterrows()}
print(weightDict)
#molecular weight function
def molecularWeight(row):
    weightDict = {'H': 1.00794, 'C': 12.0107, 'F': 18.998403, 'O': 15.9994, 'N': 14.0067}
    return np.sum([weightDict[atom] for atom in row['atomList']])
#apply function to each row
df['molWeight']=df.apply(molecularWeight, axis=1)
atomdf.head()
Out[77]:
In [78]:
#get dict to put into function
enthalphyDict={ row['Element']:row['H (298.15 K)_Hartree'] for indx,row in atomdf.iterrows()}
print(enthalphyDict)
# define function
def enthalphyAtomization(row):
    enthalphyDict = {'H': -0.49791199999999997, 'C': -37.844411, 'F': -99.71637, 'O': -75.062219, 'N': -54.581501}
    return np.sum([enthalphyDict[atom] for atom in row['atomList']]) - row['H']
#apply function to each row
df['Hatom']=df.apply(enthalphyAtomization, axis=1)
df.head()
Out[78]:
In [103]:
# save new descriptors to file
df.to_pickle('mol.pkl')
In [100]:
# load pickle file
df = pd.read_pickle('mol.pkl')
In [30]:
nMols=10
indexes=np.random.randint(0,df.shape[0],nMols)
mols=[ Chem.MolFromSmiles(df.ix[i,'SMILES']) for i in indexes]
names=[ "Mol # %d"%(i) for i in indexes]
p = Draw.MolsToGridImage( mols, molsPerRow=5, subImgSize=(200, 200), legends=names)
print("=== Random selection of %d molecules ==="%nMols)
p
Out[30]:
In [99]:
randomMolecule=np.random.randint(0,df.shape[0])
smileStr=df.loc[randomMolecule,'SMILES']
print("== 2D ==")
mymol=pybel.readstring("smiles", smileStr)
mymol
Out[99]:
In [100]:
print("== 3D Rotatable Molecule==")
imolecule.draw(smileStr, drawing_type="ball and stick", shader="phong")
The electronic spatial extent is a single number that attempts to describe the size of a molecule. It is computed as the expectation value of $R^2$ of the electron density relative to the molecule's center of mass. Because the information is condensed into a single number, it does not distinguish between long chains and more globular molecules.
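As a rough illustration of the idea (a sketch only: this uses the parsed nuclear coordinates in the atomCoords column, not the electron density, so it is merely a crude geometric size proxy, and nuclearExtent is a hypothetical helper, not a dataset quantity):
def nuclearExtent(atomCoords):
    """Mean squared distance of the nuclei from their geometric centroid."""
    coords = np.asarray(atomCoords)
    centroid = coords.mean(axis=0)
    return np.mean(np.sum((coords - centroid) ** 2, axis=1))
# e.g. df['nuclearExtent'] = df['atomCoords'].apply(nuclearExtent)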
In [30]:
interest=['numAtoms','spatialSize','molWeight']
for i in interest:
    rangeSize = max(df[i]) - min(df[i])
    sns_cycle_palette()
    # histogram or a distribution plot?
    if rangeSize < 40:
        bins = np.arange(min(df[i]), max(df[i]) + 1)
        plt.hist(df[i], bins)
    else:
        sns.distplot(df[i])
    plt.title('Distribution of molecules')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.show()
# show stats
print('=== Stats ===')
df[interest].describe().transpose()
Out[30]:
Observations:
Internal energy (U) is the sum of all the microscopic energies, such as translational, vibrational and rotational kinetic energy and the potential energy from intermolecular forces.
Enthalpy (H) additionally includes the pressure and volume of the system. In a sense, it accounts for energy transferred to the environment at constant pressure through expansion or heating.
Free energy (G) measures the "usefulness", or process-initiating work, obtainable from a thermodynamic system at constant temperature and pressure.
Zero-point energy (zpe) is the energy that remains when all other energy is removed from a system, i.e. the energy of a molecule at T = 0.
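For reference, these quantities are linked by the standard thermodynamic identities $H = U + pV$ and $G = H - TS$; at $T=0$ the internal energy reduces to the electronic energy plus the zero-point energy, which is presumably why the dataset reports both a 0 K internal energy (U0) and a room-temperature one (U).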
In [44]:
sns_cycle_palette()
sns.jointplot(df['G'], df['U'], kind="scatter",size=8);
plt.show()
sns_cycle_palette()
sns.jointplot(df['U0'], df['zpe'],kind="reg",size=8);
plt.show()
# show stats
interest=['U','H','G','zpe']
print('=== Stats ===')
df[interest].describe().transpose()
Out[44]:
Observations:
Molecules are quantized rotating systems (due to Quantum Mechanics), so the rotational energy and angular momentum take only certain fixed values; these are related to the rotational constants, which are normally ordered as $A \ge B \ge C$. Based on the relationships between these constants we can classify molecules into rotor types: spherical tops ($A = B = C$), linear rotors ($B = C$ with $A$ much larger or undefined), oblate symmetric tops ($A = B > C$), prolate symmetric tops ($A > B = C$), and asymmetric tops (all three different).
We can count how many of each type we find in our data to see whether classifying molecules this way makes sense:
In [49]:
AeqB=np.abs(df['A'] - df['B'])/df['A'] < 0.005
BeqC=np.abs(df['B'] - df['C'])/df['B'] < 0.005
AeqC=np.abs(df['A'] - df['C'])/df['A'] < 0.005
largeRatio= df['A']/df['B'] < 0.25
smallRatio= df['A']/df['B'] > 100
largeA= np.logical_and(df['A'] >= 80 , df['A']/df['B'] > 4)
## classifiers
spherical=np.logical_and(np.logical_and(AeqB,BeqC),AeqC)
linear=np.logical_or(np.logical_or(largeRatio,smallRatio),largeA)
checked=np.logical_or(linear,spherical)
oblate=np.logical_and(AeqB,np.logical_not(checked))
checked=np.logical_or(checked,oblate)
prolate=np.logical_and(BeqC,np.logical_not(checked))
checked=np.logical_or(checked,prolate)
asym=np.logical_not(checked)
print("Spherical %d mols"%( sum( spherical)) )
print("Linear %d mols"%( sum(linear) ) )
print("Oblate %d mols"%( sum(oblate) ) )
print("Prolate %d mols"%( sum(prolate) ) )
print("Asymmetric %d mols"%( sum(asym) ) )
# create new data column
df['rotType']='Asym'
for index, row in df.iterrows():
    if spherical[index]:
        df.ix[index, 'rotType'] = 'sph'
    elif linear[index]:
        df.ix[index, 'rotType'] = 'lin'
    elif oblate[index]:
        df.ix[index, 'rotType'] = 'ob'
    elif prolate[index]:
        df.ix[index, 'rotType'] = 'pro'
In [50]:
print("== Spherical Molecules ==")
subdf=df[ spherical ]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[50]:
In [51]:
print("== Linear ==")
subdf=df[ linear ]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[51]:
In [52]:
print("== Some Oblate ==")
subdf=df[ oblate ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
#subdf=subdf[:15]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[52]:
In [53]:
print("== Some Prolate ==")
subdf=df[ prolate ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[53]:
In [54]:
print("== Some Asymmetric ==")
subdf=df[ asym ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[54]:
In [85]:
# show stats
interest=['A','B','C']
print('=== Stats ===')
df.groupby('rotType')[interest].describe()
Out[85]:
In [102]:
# exclude large-coefficient molecules, since their very different values skew the stats
subdf=df[np.logical_and(df['A'] < 10,df['B'] < 10) ]
groupsA=[group['A'] for index,group in subdf.groupby('rotType') ]
groupsB=[group['B'] for index,group in subdf.groupby('rotType') ]
groupsC=[group['C'] for index,group in subdf.groupby('rotType') ]
names=[index for index,group in subdf.groupby('rotType') ]
plt.title("Distribution for the different types of rotors")
sns.violinplot(groupsA,names=names,alpha=0.5,color='green')
sns.violinplot(groupsB,names=names,alpha=0.5,color='blue')
sns.violinplot(groupsC,names=names,alpha=0.5,color='red')
# custom legend
import matplotlib.patches as mpatches
green_patch = mpatches.Patch(color='green',alpha=0.5, label='A')
blue_patch = mpatches.Patch(color='blue',alpha=0.5, label='B')
red_patch = mpatches.Patch(color='red',alpha=0.5, label='C')
plt.legend(handles=[green_patch,blue_patch,red_patch])
plt.show()
print("")
In [38]:
# show stats
interest=['A','B','C']
# exclude linear molecules, since their very different A values skew the stats
subdf=df[ np.logical_not(linear) ]
for i in interest:
    sns_cycle_palette()
    asubdf = subdf[subdf[i] < 50]
    # overlay the three distributions with seaborn's distplot
    sns.distplot(asubdf[i], rug=False, hist=True, hist_kws={"alpha": 0.5}, label=i)
    plt.xlim([0, 10])
plt.xlabel("Rot Constant value")
plt.title('Distribution of molecules')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Observations:
Calculated as $E_{gap}=E_{LUMO}-E_{HOMO}$, also known as the "band gap", where the HOMO (highest occupied molecular orbital) and LUMO (lowest unoccupied molecular orbital) are the "frontier" orbitals of the molecule. Basically, it is how much energy you have to feed into the molecule to kick it from the ground (most stable) state into an excited state. The gap energy can tell us about what wavelengths the compound can absorb (a quick consistency check against the homo and lumo columns is sketched after these notes).
At the molecular level, temperature is the average kinetic energy of an ensemble of molecules. Heat capacity characterizes the amount of heat required to change a body's temperature by a given amount.
It is related to molecular complexity: more complex molecules have more degrees of freedom (translation, rotation and vibration). When an ideal gas of "simple" molecules (low heat capacity) absorbs heat, all the energy goes into a small number of degrees of freedom, while for a gas of more complex molecules the absorbed energy is partitioned among many more kinds of motion, giving a smaller temperature rise (high heat capacity).
Dipole moments arise from differences in electronegativity.
The dipole moment measures the separation of positive and negative electrical charge in a system of charges, in this case the nuclei and the electrons. The larger the difference in electronegativity, the larger the dipole moment. A zero dipole implies that the positive and negative charge distributions cancel each other out.
Neutral nonpolar species have spherically symmetric arrangements of electrons in their electron clouds. When placed in an electric field, their electron clouds can be distorted; polarizability is a measure of this distortion.
More electrons, or electrons farther from the nuclear charge, mean increased polarizability.
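As a quick sanity check (a sketch assuming the homo, lumo and gap columns parsed above), the stored gap should simply reproduce the difference of the frontier orbital energies:
# the reported gap should equal lumo - homo for every molecule
gap_check = df['lumo'] - df['homo']
print("max |gap - (lumo - homo)| = %g" % np.abs(df['gap'] - gap_check).max())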
In [47]:
interest=['Hatom','Cv','gap','dipole','polar']
# show stats
print('=== Stats ===')
df[interest].describe().transpose()
Out[47]:
In [43]:
names=["Enthalphy of atomization","Heat Capacity","Electronic band gap","Dipole Moment","Polarizability"]
for index,i in enumerate(interest):
sns_cycle_palette()
# histogram, seaborn has better
sns.distplot(df[i],rug=False, hist=True);
plt.xlabel(i)
plt.ylabel('Frequency')
plt.title(names[index]+" across all molecules")
plt.show()
Observations:
In [101]:
# aromatic-atom count (used as a crude ring indicator)
def getRingData(row):
    mol = Chem.MolFromSmiles(row['SMILES'])
    if mol is None:
        numRing = 0
    else:
        # number of aromatic atoms in the molecule
        numRing = len(mol.GetAromaticAtoms())
    return numRing
#apply function to each row
df['ringNum']=df.apply(getRingData, axis=1)
In [102]:
benzeneMol=Chem.MolFromSmiles('c1ccccc1')
print("=== Our reference substructure ===")
benzeneMol
Out[102]:
In [103]:
# substructure matching: flag molecules containing the reference substructure
refMol=benzeneMol
matchName='benzene'
def smileFriendly(row):
    # True if RDKit can parse this row's SMILES
    mol = Chem.MolFromSmiles(row['SMILES'])
    if mol is None:
        match = False
    else:
        match = True
    return match
df['smileGood']=df.apply(smileFriendly, axis=1)
print(" finished classifying good smiles ")
def matchSubstructure(row):
    # True if the (parseable) molecule contains refMol as a substructure
    if row['smileGood']:
        mol = Chem.MolFromSmiles(row['SMILES'])
        match = (len(mol.GetSubstructMatch(refMol)) > 0)
    else:
        match = False
    return match
#apply function to each row
df[matchName]=df.apply(matchSubstructure, axis=1)
nMatched=df[df[matchName]].shape[0]
print("Found %d matched substructures "%(nMatched))
In [109]:
subdf=df[df['rotType']=='Asym']
# divide classes
ringdf=subdf[subdf['benzene']]
sampleSize=min(ringdf.shape[0],500);
if nMatched > 500:
    ringdf=ringdf[ringdf['dbindex'].isin(random.sample(ringdf['dbindex'], sampleSize))]
candidates=np.logical_and(np.logical_and(~subdf['benzene'],subdf['numAtoms'] >=11 ),
                          subdf['ringNum'] == 0)
noRingdf=subdf[ candidates]
noRingdf=noRingdf[noRingdf['dbindex'].isin(random.sample(noRingdf['dbindex'], sampleSize)) ]
# merge both data frames
subdf=pd.concat([ringdf,noRingdf])
print(subdf.shape)
subdf.head()
Out[109]:
In [110]:
sampleSize=20;
smalldf=subdf[subdf['benzene']]
smalldf=smalldf[smalldf['dbindex'].isin(random.sample(smalldf['dbindex'], sampleSize))]
mols = [ Chem.MolFromSmiles(row['SMILES']) for indx, row in smalldf.iterrows() ]
print("== Benzene Ringed atoms ==")
names=["# %d "%(row['dbindex']) for indx, row in smalldf.iterrows() ]
Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
Out[110]:
In [113]:
sampleSize=20;
smalldf=subdf[~subdf['benzene']]
smalldf=smalldf[smalldf['dbindex'].isin(random.sample(smalldf['dbindex'], sampleSize))]
mols = [ Chem.MolFromSmiles(row['SMILES']) for indx, row in smalldf.iterrows() ]
print("== Non-Benzene-Ringed atoms ==")
names=["# %d "%(row['dbindex']) for indx, row in smalldf.iterrows() ]
Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
Out[113]:
In [114]:
# save our pickle!
subdf.to_pickle('benzene.pkl')
In [119]:
# load our pickle!
subdf=pd.read_pickle('benzene.pkl')
subdf.columns
Out[119]:
In [255]:
print(subdf.columns)
nonTrainFeatures=['numAtoms','homo','lumo','dbindex','atomList', 'atomCoords', 'freqs', 'SMILES',
'InChI','rotType', 'ringNum', 'smileGood', 'benzene','U','H','G']
trainData=subdf.drop(nonTrainFeatures,axis=1)
targetData=subdf['benzene']
X = np.array([np.array(row) for indx,row in trainData.iterrows()])
Y = np.array( targetData)
features=[str(col) for col in trainData.columns]
features
Out[255]:
In [256]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
maxtrees=40
kfoldn=10
ntrees=np.arange(1,maxtrees+1)
scores=np.zeros((ntrees.shape[0],kfoldn))
for indx,i in enumerate(ntrees):
    rf = RandomForestClassifier(n_estimators=i)
    scores[indx] = cross_val_score(rf, X, Y, cv=kfoldn)
scores=np.array(scores)
print("== Random Forest Classifier, KFold=10==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))
plt.figure(figsize=(18, 6), dpi=80)
sns.boxplot(scores.transpose())
plt.plot(ntrees,[ i.mean() for i in scores],'r--',alpha=0.75,label="mean")
plt.legend()
plt.xlabel("Number of trees")
plt.ylabel("Score")
plt.title("RandomForest with varying number of trees")
plt.show()
In [260]:
goodTrees=11
rf=RandomForestClassifier(n_estimators=goodTrees)
scores= cross_val_score(rf,X,Y, cv=kfoldn)
kfold=sklearn.cross_validation.KFold(n=len(Y), n_folds=kfoldn)
importances=[]
for indj, indices in enumerate(kfold):
    train, test = indices
    X_train, X_test = X[train], X[test]
    Y_train, Y_test = Y[train], Y[test]
    rf.fit(X_train, Y_train)
    Y_pred = rf.predict(X_test)
    importances.append(rf.feature_importances_)
#pre-process data
importances=np.array(importances)
mimportances=[ i.mean() for i in importances.transpose()]
# sort
sortedtup=sorted(zip(mimportances,features))
mimportances = np.array([i[0] for i in sortedtup])
sfeatures = np.array([i[1] for i in sortedtup])
indexes=np.arange(len(mimportances))
In [261]:
#plot away!
fig, ax = plt.subplots()
sns.barplot(indexes,y=mimportances, palette="BuGn_d")
ax.set_xticklabels(sfeatures, rotation=45)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Random Forest Classifier (%d Trees), Score = %1.3f ( %1.3f) "%(goodTrees,np.mean(scores),np.std(scores)))
print("")
In [269]:
interest=sfeatures[-5::]
mostImportance=mimportances[-5::]
names=['Ring','Non-ring']
print("=== 5 most important features ===")
print(interest)
realNames=["Heat Capacity","Molecular Weight","Zero Point Energy",
"Rotational Const C","Electronic Band Gap"]
for index,i in enumerate(interest):
    # split into benzene / non-benzene molecules and compare their distributions
    matchdf=subdf[subdf['benzene']]
    nomatchdf=subdf[~subdf['benzene']]
    data=[matchdf[i],nomatchdf[i]]
    sns.violinplot(data,names=names)
    plt.xlabel("Type of molecule")
    plt.ylabel(i)
    plt.title("Distribution of %s , Importance = %0.3f "%(realNames[index],mostImportance[index]) )
    plt.show()
In [276]:
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn import grid_search
from sklearn import neighbors
from sklearn.metrics import classification_report
gnb = GaussianNB()
scores= cross_val_score(gnb,X,Y, cv=kfoldn)
print("== Naive_Bayes Classifier ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))
clf_KNN = neighbors.KNeighborsClassifier(2, weights="uniform" )
scores= cross_val_score(clf_KNN,X,Y, cv=kfoldn)
print("== K-NN ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))
## Grid for the CV
gammas = np.logspace(-10,4,8,base=2)
Cost = [0.001,0.001,0.01,1,10,100,1000]
param_grid = [
{'C': Cost , 'gamma': gammas, 'kernel': ['rbf']},
]
clf_svm = grid_search.GridSearchCV(estimator=svm.SVC(), param_grid=param_grid)
scores= cross_val_score(clf_svm,X,Y, cv=kfoldn)
print("== SVM ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))
In [127]:
mols = [ pybel.readstring("Smiles",row['SMILES']) for indx,row in subdf.iterrows() ]
indexes = [ indx for indx,row in subdf.iterrows() ]
fps= [ mol.calcfp() for mol in mols ]
def pairwise_sim(fps):
#create pairwise matrix
pairwise=np.zeros((len(fps),len(fps)))
#iterate over all pairs
for indx,imol in enumerate(fps):
for jndx,jmol in enumerate(fps):
#tanimoto similarity
pairwise[indx,jndx]=imol|jmol
return pairwise
sim=pairwise_sim(fps)
In [129]:
plt.figure(figsize=(12,12))
plt.imshow(sim,interpolation='None' )
plt.title("Similarity matrix for %d Molecules (1=Equal)"%20)
plt.xlabel("Mol #id")
plt.ylabel("Mol #id")
plt.colorbar()
plt.show()
In [130]:
ref_indx=10
sim_ref=sim[ref_indx,:]
plt.hist(sim_ref)
plt.title("Similarity for a reference molecule")
plt.xlabel("Tanito Similarity")
plt.ylabel("Frequency")
print("== Reference Molecule ==")
smileStr=subdf.ix[indexes[ref_indx],'SMILES']
imolecule.draw(smileStr)
mols[ref_indx]
Out[130]:
In [131]:
nmols = [ Chem.MolFromSmiles(subdf.ix[indx,'SMILES']) for indx in indexes ]
# sorting the molecule as a function of their similarity to a query
nbrs = sorted(zip(sim_ref,nmols,indexes),reverse=True)
print("== Similar molecules, Sorted from most similar to disimilar ==")
# depict the molecules in decreasing order of similarity
names=["# %d - sim=%f"%(subdf.ix[nbr[2],'dbindex'],nbr[0]) for nbr in nbrs ]
Draw.MolsToGridImage( [nbr[1] for nbr in nbrs], molsPerRow=5, legends=names)
Out[131]:
In [133]:
sampleSize=1000;
subdf=df[df['rotType']=='Asym']
subdf=subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], sampleSize))]
#remove some variables
interest=[u'numAtoms', u'dbindex', u'A', u'B', u'C', u'dipole', u'polar', u'gap', u'spatialSize', u'zpe', u'U0', u'Cv', u'SMILES', u'molWeight', u'Hatom', u'rotType']
subdf=subdf[interest]
mols = [ pybel.readstring("Smiles",row['SMILES']) for indx,row in subdf.iterrows() ]
indexes = [ indx for indx,row in subdf.iterrows() ]
fps= [ mol.calcfp() for mol in mols ]
sim=pairwise_sim(fps)
#plot!
plt.figure(figsize=(12,12))
plt.imshow(sim )
plt.title("Similarity matrix for %d Molecules (1=Equal)"%sampleSize)
plt.xlabel("Mol #id")
plt.ylabel("Mol #id")
plt.colorbar()
plt.show()
In [134]:
ref_indx= np.argmax( np.linalg.norm(sim,axis=1) )
sim_ref=sim[ref_indx,:]
plt.hist(sim_ref)
plt.title("Reference molecule = biggest colunm norm in the matrix")
plt.xlabel("Tanito Similarity")
plt.ylabel("Frequency")
print("== Reference Molecule ==")
smileStr=subdf.ix[indexes[ref_indx],'SMILES']
imolecule.draw(smileStr)
mols[ref_indx]
Out[134]:
In [135]:
nmols = [ Chem.MolFromSmiles(subdf.ix[indx,'SMILES']) for indx in indexes ]
# sorting the molecule as a function of their similarity to a query
nbrs = sorted(zip(sim_ref,nmols,indexes),reverse=True)
nbrs=nbrs[:30]
print("== 30 most Similar molecules, Sorted from most similar to disimilar ==")
# depict the molecules in decreasing order of similarity
names=["# %d - sim=%f"%(subdf.ix[nbr[2],'dbindex'],nbr[0]) for nbr in nbrs ]
Draw.MolsToGridImage( [nbr[1] for nbr in nbrs], molsPerRow=5, legends=names)
Out[135]:
Goals reached
The first two goals were met successfully. The last goal, which is the most interesting, was met for one substructure search, with some partial results. We tried various classifiers, such as random forest, SVM, K-NN and Naive Bayes. We obtained the best results with the random forest ensemble method. One of the advantages of this method is that it provides feature importances, which give an idea of how important each variable is for the classification. For example, with benzene we found that the band gap was the best descriptor for these types of molecules. Visualizing the distributions of these two classes, it is obvious they are very different. This also coincides with the chemical notion of an aromatic ring: these types of structures are being used to design organic semiconductors, which are characterized by a low electronic gap. We see that on average a benzene-containing molecule will have a lower band gap.
Extra ideas (not tested)
Due to time constraints, some ideas were not tested, but they might work as future directions:
On the experience: Starting this course I had never used an IPython notebook or Python; I had always programmed in C++ or Fortran. Now I want to use Python for many more projects, since it offers a very fast way of prototyping ideas.
Future Directions: There is certainly much more work to be done and many avenues to explore with data science and Quantum Chemistry. For the most part I stuck to the available dataset, yet I feel there could be more interesting data embedded in the calculations which is not included in the dataset. Examples:
The main idea would be to design and find descriptors that cannot be obtained in a classical way, to demonstrate that a quantum calculation has more connection to reality (at the molecular level) than a classical one.
In [ ]: