In [10]:
from IPython.core.display import HTML
import os
def css_styling():
"""Load default custom.css file from ipython profile"""
base = os.getcwd()
styles = "<style>\n%s\n</style>" % (open(os.path.join(base,'custom.css'),'r').read())
return HTML(styles)
css_styling()
Out[10]:
This CS 209 project is an exploration of chemistry using data science techniques. I want to apply the methods learned in class to my own context, which is Quantum Chemistry.
Quantum chemistry is a branch of theoretical chemistry that applies quantum mechanics to address problems in chemistry.
In Quantum Chemistry you can model a molecule on a computer, solving the Schrödinger equation (approximately) to obtain molecular properties.
Some pretty awesome examples of Quantum Chemistry applications are:
Thanks to advances in computing, we can now calculate a large number of molecules in a relatively short time frame, producing huge amounts of data.
The pharmaceutical industry long ago established the practice of using molecular calculations to drive drug discovery. This has only been done with classical mechanics and with very limited quantum treatment of molecules. There is now a trend of adapting the tools and strategies of this industry to materials science.
So a current open question is: how can we incorporate quantum effects/information into chemoinformatics?
Very little work has been done on this, partly because of the computational cost of treating quantum effects and partly because of the complexity of this information.
I want to put my newly acquired skills to the test, on data I will most probably be working with during my graduate studies.
One aspect of this work is the research itself; there are two main projects that have tried to apply data science to Quantum Chemistry:
The other aspect relates to accessibility in science: using tools such as IPython notebooks, data repositories, and open-source libraries to accelerate and open up the process of doing science. In Chemistry, two side projects of great interest are:
The dataset is hosted at http://dx.doi.org/10.6084/m9.figshare.978904; it is a few hundred MB.
I also have a pandas dataframe pickle in my Dropbox:
https://www.dropbox.com/sh/6iysi4w0xmmevlt/AABrTLUFZJvrJPDeDCzuhxJYa?dl=0.
They reported computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF atoms. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules.
All calculations have been carried out at the B3LYP/6-31G(2df,p) level of quantum chemistry (a decent level of theory, but not the best).
The results of each calculation have been embedded within an xyz file in a very specific way. The data provided with the article is not very user friendly (see: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12803); a better format would have been a JSON or Excel database.
We will parse out the information according to the readme.txt file and create, for each molecule, a row within a pandas dataframe. This dataframe can then easily be exported to a variety of popular data formats, as sketched below.
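As a minimal sketch of that last export step (the toy dataframe, column values, and output file names here are placeholders for illustration, not the actual dataset), pandas can write the parsed dataframe to several common formats:
import pandas as pd
# toy dataframe standing in for the parsed masterdf built below
example = pd.DataFrame({'dbindex': [1, 2],
                        'SMILES': ['C', 'N'],
                        'gap': [0.1, 0.2]})
example.to_csv('mols.csv', index=False)           # plain-text CSV
example.to_json('mols.json', orient='records')    # one record per molecule
example.to_pickle('mols.pkl')                     # fast pandas round-trip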
In [1]:
#Libraries
import numpy as np
import scipy as sp
import pandas as pd
import sklearn
import seaborn as sns
from matplotlib import pyplot as plt
import random
#chemo-informatics
import imolecule
import openbabel
import pybel
import rdkit
# visual stuff
%matplotlib inline
from IPython.core import display
from PIL import Image
#rdkit
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import rdkit.rdBase
from rdkit.Chem.MACCSkeys import GenMACCSKeys
from rdkit import DataStructs
from rdkit.DataStructs import BitVectToText
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import MCS as MCS
from rdkit.Chem import Descriptors as Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem import PandasTools as PandasTools
#awesome plot options
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12.0, 6.0)
# utility
from io import BytesIO
#functions for ipython
def display_pil_image(im):
"""Displayhook function for PIL Images, rendered as PNG."""
b = BytesIO()
im.save(b, format='png')
data = b.getvalue()
ip_img = display.Image(data=data, format='png', embed=True)
return ip_img._repr_png_()
# register display func with PNG formatter:
png_formatter = get_ipython().display_formatter.formatters['image/png']
dpi = png_formatter.for_type(Image.Image, display_pil_image)
#functions for plotting
def sns_cycle_palette():
"""Cycle seaborn's color palete to have a variety of colors"""
pal = sns.color_palette()
pal=pal[1:]+[pal[0]]
sns.set_palette(pal)
return
In [7]:
def readFormatedXYZ(xyzFile):
    """Parse one formatted xyz file from the dataset into a flat list of values."""
    fileData = []
    with open(xyzFile, 'r') as afile:
        numAtoms = int(next(afile))
        # 1 number of atoms
        fileData.append(numAtoms)
        datline = next(afile).split()
        # 2 molecular index within the database
        fileData.append(int(datline[1]))
        # 3-5 rotational constants A, B, C
        fileData.append(float(datline[2]))
        fileData.append(float(datline[3]))
        fileData.append(float(datline[4]))
        # 6 dipole moment
        fileData.append(float(datline[5]))
        # 7 polarizability
        fileData.append(float(datline[6]))
        # 8-10 orbital energies (homo, lumo, gap)
        fileData.append(float(datline[7]))
        fileData.append(float(datline[8]))
        fileData.append(float(datline[9]))
        # 11 electronic spatial extent
        fileData.append(float(datline[10]))
        # 12 zero-point energy
        fileData.append(float(datline[11]))
        # 13-14 internal energy (U0, U)
        fileData.append(float(datline[12]))
        fileData.append(float(datline[13]))
        # 15 enthalpy
        fileData.append(float(datline[14]))
        # 16 free energy
        fileData.append(float(datline[15]))
        # 17 heat capacity
        fileData.append(float(datline[16]))
        # read geometry
        atomCoords = np.zeros((numAtoms, 3))
        atomList = []
        for i in range(numAtoms):
            datline = next(afile).split()
            atomList.append(datline[0])
            atomCoords[i] = np.array([float(d) for d in datline[1:4]])
        fileData.append(atomList)
        fileData.append(atomCoords)
        # next line: harmonic frequencies
        fileData.append(np.array([float(f) for f in next(afile).split()]))
        # relaxed geometry SMILES
        fileData.append(next(afile).split()[1])
        # relaxed geometry InChI
        fileData.append(next(afile).split()[1])
    return fileData
In [61]:
#===== Initial variables =====
molFolder = 'mols/'
rmin = 0
rmax = 100
pandasFile = "mols.pkl"
#===== Resume from the pickle file if it already exists =====
if os.path.isfile(pandasFile):
    masterdf = pd.read_pickle(pandasFile)
    rmin = masterdf.shape[0] + 1
else:
    datCols = ['numAtoms', 'dbindex', 'A', 'B', 'C', 'dipole', 'polar',
               'homo', 'lumo', 'gap', 'spatialSize', 'zpe',
               'U0', 'U', 'H', 'G', 'Cv',
               'atomList', 'atomCoords', 'freqs',
               'SMILES', 'InChI']
    masterdf = pd.DataFrame(columns=datCols)
#===== Iterate over the files =====
indexes = range(rmin, rmax)   # indexes to process in this run
for indx in indexes:
    xyzFile = "mols/dsgdb9nsd_%06d.xyz" % (indx + 1)
    masterdf.loc[indx] = readFormatedXYZ(xyzFile)
    if (indx % 10000 == 0):
        print(indx)
        masterdf.to_pickle(pandasFile)
#===== Save the final dataframe =====
masterdf.to_pickle(pandasFile)
In [6]:
# load the previously pickled dataframe
pandasFile = "original.pkl"
masterdf = pd.read_pickle(pandasFile)
print(masterdf.columns)
masterdf.head()
Out[6]:
In [76]:
#preprocessing the dataframes
df=masterdf.copy()
#Filter out the uncharacterized molecules
badmols=pd.read_csv('uncharacterized.csv')
df=df[~df['dbindex'].isin(badmols['Index'])]
print("Bad molecules: %s"%(str(badmols.shape)))
print("Before %s -> After %s "%(masterdf.shape[0],df.shape[0]))
In [77]:
#load atomic data
atomdf=pd.read_csv('atomref.csv')
weightDict={ row['Element']:row['Mass'] for indx,row in atomdf.iterrows()}
print(weightDict)
#molecular weight function
def molecularWeight(row):
    weightDict = {'H': 1.00794, 'C': 12.0107, 'F': 18.998403, 'O': 15.9994, 'N': 14.0067}
    return np.sum([weightDict[atom] for atom in row['atomList']])
#apply function to each row
df['molWeight']=df.apply(molecularWeight, axis=1)
atomdf.head()
Out[77]:
In [78]:
#get dict to put into function
enthalphyDict={ row['Element']:row['H (298.15 K)_Hartree'] for indx,row in atomdf.iterrows()}
print(enthalphyDict)
# define function
def enthalphyAtomization(row):
    enthalphyDict = {'H': -0.49791199999999997, 'C': -37.844411, 'F': -99.71637, 'O': -75.062219, 'N': -54.581501}
    return np.sum([enthalphyDict[atom] for atom in row['atomList']]) - row['H']
#apply function to each row
df['Hatom']=df.apply(enthalphyAtomization, axis=1)
df.head()
Out[78]:
In [103]:
# save new descriptors to file
df.to_pickle('mol.pkl')
In [100]:
# load pickle file
df = pd.read_pickle('mol.pkl')
In [30]:
nMols=10
indexes=np.random.randint(0,df.shape[0],nMols)
mols=[ Chem.MolFromSmiles(df.ix[i,'SMILES']) for i in indexes]
names=[ "Mol # %d"%(i) for i in indexes]
p = Draw.MolsToGridImage( mols, molsPerRow=5, subImgSize=(200, 200), legends=names)
print("=== Random selection of %d molecules ==="%nMols)
p
Out[30]:
In [99]:
randomMolecule=np.random.randint(0,df.shape[0])
smileStr=df.loc[randomMolecule,'SMILES']
print("== 2D ==")
mymol=pybel.readstring("smiles", smileStr)
mymol
Out[99]:
In [100]:
print("== 3D Rotatable Molecule==")
imolecule.draw(smileStr, drawing_type="ball and stick", shader="phong")
The electronic spatial extent is a single number that attempts to describe the size of a molecule. It is computed as the expectation value of $R^2$ of the electron density relative to the molecule's center of mass. Because the information is condensed into a single number, it does not distinguish between long chains and more globular molecules.
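As a rough illustration of the idea (a sketch only: this uses the parsed nuclear coordinates in the atomCoords column, not the electron density, so it is merely a crude geometric size proxy, and nuclearExtent is a hypothetical helper, not a dataset quantity):
def nuclearExtent(atomCoords):
    """Mean squared distance of the nuclei from their geometric centroid."""
    coords = np.asarray(atomCoords)
    centroid = coords.mean(axis=0)
    return np.mean(np.sum((coords - centroid) ** 2, axis=1))
# e.g. df['nuclearExtent'] = df['atomCoords'].apply(nuclearExtent)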
In [30]:
interest=['numAtoms','spatialSize','molWeight']
for i in interest:
    rangeSize = max(df[i]) - min(df[i])
    sns_cycle_palette()
    # histogram or a distribution plot?
    if rangeSize < 40:
        bins = np.arange(min(df[i]), max(df[i]) + 1)
        plt.hist(df[i], bins)
    else:
        sns.distplot(df[i])
    plt.title('Distribution of molecules')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.show()
# show stats
print('=== Stats ===')
df[interest].describe().transpose()
Out[30]:
Observations:
Internal energy (U) is the sum of all the microscopic energies, such as translational, vibrational and rotational kinetic energy and the potential energy from intermolecular forces.
Enthalpy (H) additionally includes the pressure and volume of the system. In a sense, it accounts for energy transferred to the environment at constant pressure through expansion or heating.
Free energy (G) measures the "usefulness", or process-initiating work, obtainable from a thermodynamic system at constant temperature and pressure.
Zero-point energy (zpe) is the energy that remains when all other energy is removed from a system, i.e. the energy of a molecule at T = 0.
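For reference, these quantities are linked by the standard thermodynamic identities $H = U + pV$ and $G = H - TS$; at $T=0$ the internal energy reduces to the electronic energy plus the zero-point energy, which is presumably why the dataset reports both a 0 K internal energy (U0) and a room-temperature one (U).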
In [44]:
sns_cycle_palette()
sns.jointplot(df['G'], df['U'], kind="scatter",size=8);
plt.show()
sns_cycle_palette()
sns.jointplot(df['U0'], df['zpe'],kind="reg",size=8);
plt.show()
# show stats
interest=['U','H','G','zpe']
print('=== Stats ===')
df[interest].describe().transpose()
Out[44]:
Observations:
Molecules are quantized rotating systems (due to Quantum Mechanics), so the rotational energy and angular momentum take only certain fixed values; these are related to the rotational constants, which are normally ordered as $A \ge B \ge C$. Based on the relationships between these constants we can classify molecules into rotor types: spherical tops ($A = B = C$), linear rotors ($B = C$ with $A$ much larger or undefined), oblate symmetric tops ($A = B > C$), prolate symmetric tops ($A > B = C$), and asymmetric tops (all three different).
We can count how many of each type we find in our data to see whether classifying molecules this way makes sense:
In [49]:
AeqB=np.abs(df['A'] - df['B'])/df['A'] < 0.005
BeqC=np.abs(df['B'] - df['C'])/df['B'] < 0.005
AeqC=np.abs(df['A'] - df['C'])/df['A'] < 0.005
largeRatio= df['A']/df['B'] < 0.25
smallRatio= df['A']/df['B'] > 100
largeA= np.logical_and(df['A'] >= 80 , df['A']/df['B'] > 4)
## classifiers
spherical=np.logical_and(np.logical_and(AeqB,BeqC),AeqC)
linear=np.logical_or(np.logical_or(largeRatio,smallRatio),largeA)
checked=np.logical_or(linear,spherical)
oblate=np.logical_and(AeqB,np.logical_not(checked))
checked=np.logical_or(checked,oblate)
prolate=np.logical_and(BeqC,np.logical_not(checked))
checked=np.logical_or(checked,prolate)
asym=np.logical_not(checked)
print("Spherical %d mols"%( sum( spherical)) )
print("Linear %d mols"%( sum(linear) ) )
print("Oblate %d mols"%( sum(oblate) ) )
print("Prolate %d mols"%( sum(prolate) ) )
print("Asymmetric %d mols"%( sum(asym) ) )
# create new data column
df['rotType']='Asym'
for index, row in df.iterrows():
    if spherical[index]:
        df.ix[index, 'rotType'] = 'sph'
    elif linear[index]:
        df.ix[index, 'rotType'] = 'lin'
    elif oblate[index]:
        df.ix[index, 'rotType'] = 'ob'
    elif prolate[index]:
        df.ix[index, 'rotType'] = 'pro'
In [50]:
print("== Spherical Molecules ==")
subdf=df[ spherical ]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[50]:
In [51]:
print("== Linear ==")
subdf=df[ linear ]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[51]:
In [52]:
print("== Some Oblate ==")
subdf=df[ oblate ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
#subdf=subdf[:15]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[52]:
In [53]:
print("== Some Prolate ==")
subdf=df[ prolate ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[53]:
In [54]:
print("== Some Asymmetric ==")
subdf=df[ asym ]
subdf= subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], 15))]
mols=[ Chem.MolFromSmiles(row['SMILES']) for indx,row in subdf.iterrows() ]
names=[ "(# %d) %3.2f, %3.2f, %3.2f"%(row['dbindex'],row['A'],row['B'],row['C']) for indx,row in subdf.iterrows() ]
p = Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
p
Out[54]:
In [85]:
# show stats
interest=['A','B','C']
print('=== Stats ===')
df.groupby('rotType')[interest].describe()
Out[85]:
In [102]:
# exclude large-coefficient molecules, since their very different values skew the stats
subdf=df[np.logical_and(df['A'] < 10,df['B'] < 10) ]
groupsA=[group['A'] for index,group in subdf.groupby('rotType') ]
groupsB=[group['B'] for index,group in subdf.groupby('rotType') ]
groupsC=[group['C'] for index,group in subdf.groupby('rotType') ]
names=[index for index,group in subdf.groupby('rotType') ]
plt.title("Distribution for the different types of rotors")
sns.violinplot(groupsA,names=names,alpha=0.5,color='green')
sns.violinplot(groupsB,names=names,alpha=0.5,color='blue')
sns.violinplot(groupsC,names=names,alpha=0.5,color='red')
# custom legend
import matplotlib.patches as mpatches
green_patch = mpatches.Patch(color='green',alpha=0.5, label='A')
blue_patch = mpatches.Patch(color='blue',alpha=0.5, label='B')
red_patch = mpatches.Patch(color='red',alpha=0.5, label='C')
plt.legend(handles=[green_patch,blue_patch,red_patch])
plt.show()
print("")
In [38]:
# show stats
interest=['A','B','C']
# exclude linear molecules, since their very different A values skew the stats
subdf=df[ np.logical_not(linear) ]
for i in interest:
    sns_cycle_palette()
    asubdf = subdf[subdf[i] < 50]
    # overlay the three distributions with seaborn's distplot
    sns.distplot(asubdf[i], rug=False, hist=True, hist_kws={"alpha": 0.5}, label=i)
    plt.xlim([0, 10])
plt.xlabel("Rot Constant value")
plt.title('Distribution of molecules')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Observations:
Calculated as $E_{gap}=E_{LUMO}-E_{HOMO}$, also known as the "band gap", where the HOMO (highest occupied molecular orbital) and LUMO (lowest unoccupied molecular orbital) are the "frontier" orbitals of the molecule. Basically, it is how much energy you have to feed into the molecule to kick it from the ground (most stable) state into an excited state. The gap energy can tell us about what wavelengths the compound can absorb (a quick consistency check against the homo and lumo columns is sketched after these notes).
At the molecular level, temperature is the average kinetic energy of an ensemble of molecules. Heat capacity characterizes the amount of heat required to change a body's temperature by a given amount.
It is related to molecular complexity: more complex molecules have more degrees of freedom (translation, rotation and vibration). When an ideal gas of "simple" molecules (low heat capacity) absorbs heat, all the energy goes into a small number of degrees of freedom, while for a gas of more complex molecules the absorbed energy is partitioned among many more kinds of motion, giving a smaller temperature rise (high heat capacity).
Dipole moments arise from differences in electronegativity.
The dipole moment measures the separation of positive and negative electrical charge in a system of charges, in this case the nuclei and the electrons. The larger the difference in electronegativity, the larger the dipole moment. A zero dipole implies that the positive and negative charge distributions cancel each other out.
Neutral nonpolar species have spherically symmetric arrangements of electrons in their electron clouds. When placed in an electric field, their electron clouds can be distorted; polarizability is a measure of this distortion.
More electrons, or electrons farther from the nuclear charge, mean increased polarizability.
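As a quick sanity check (a sketch assuming the homo, lumo and gap columns parsed above), the stored gap should simply reproduce the difference of the frontier orbital energies:
# the reported gap should equal lumo - homo for every molecule
gap_check = df['lumo'] - df['homo']
print("max |gap - (lumo - homo)| = %g" % np.abs(df['gap'] - gap_check).max())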
In [47]:
interest=['Hatom','Cv','gap','dipole','polar']
# show stats
print('=== Stats ===')
df[interest].describe().transpose()
Out[47]:
In [43]:
names=["Enthalphy of atomization","Heat Capacity","Electronic band gap","Dipole Moment","Polarizability"]
for index,i in enumerate(interest):
sns_cycle_palette()
# histogram, seaborn has better
sns.distplot(df[i],rug=False, hist=True);
plt.xlabel(i)
plt.ylabel('Frequency')
plt.title(names[index]+" across all molecules")
plt.show()
Observations:
In [101]:
# aromatic-atom count (used as a crude ring indicator)
def getRingData(row):
    mol = Chem.MolFromSmiles(row['SMILES'])
    if mol is None:
        numRing = 0
    else:
        # number of aromatic atoms in the molecule
        numRing = len(mol.GetAromaticAtoms())
    return numRing
#apply function to each row
df['ringNum']=df.apply(getRingData, axis=1)
In [102]:
benzeneMol=Chem.MolFromSmiles('c1ccccc1')
print("=== Our reference substructure ===")
benzeneMol
Out[102]:
In [103]:
# substructure matching: flag molecules containing the reference substructure
refMol=benzeneMol
matchName='benzene'
def smileFriendly(row):
    # True if RDKit can parse this row's SMILES
    mol = Chem.MolFromSmiles(row['SMILES'])
    if mol is None:
        match = False
    else:
        match = True
    return match
df['smileGood']=df.apply(smileFriendly, axis=1)
print(" finished classifying good smiles ")
def matchSubstructure(row):
    # True if the (parseable) molecule contains refMol as a substructure
    if row['smileGood']:
        mol = Chem.MolFromSmiles(row['SMILES'])
        match = (len(mol.GetSubstructMatch(refMol)) > 0)
    else:
        match = False
    return match
#apply function to each row
df[matchName]=df.apply(matchSubstructure, axis=1)
nMatched=df[df[matchName]].shape[0]
print("Found %d matched substructures "%(nMatched))
In [109]:
subdf=df[df['rotType']=='Asym']
# divide classes
ringdf=subdf[subdf['benzene']]
sampleSize=min(ringdf.shape[0],500);
if nMatched > 500:
    ringdf=ringdf[ringdf['dbindex'].isin(random.sample(ringdf['dbindex'], sampleSize))]
candidates=np.logical_and(np.logical_and(~subdf['benzene'],subdf['numAtoms'] >=11 ),
                          subdf['ringNum'] == 0)
noRingdf=subdf[ candidates]
noRingdf=noRingdf[noRingdf['dbindex'].isin(random.sample(noRingdf['dbindex'], sampleSize)) ]
# merge both data frames
subdf=pd.concat([ringdf,noRingdf])
print(subdf.shape)
subdf.head()
Out[109]:
In [110]:
sampleSize=20;
smalldf=subdf[subdf['benzene']]
smalldf=smalldf[smalldf['dbindex'].isin(random.sample(smalldf['dbindex'], sampleSize))]
mols = [ Chem.MolFromSmiles(row['SMILES']) for indx, row in smalldf.iterrows() ]
print("== Benzene Ringed atoms ==")
names=["# %d "%(row['dbindex']) for indx, row in smalldf.iterrows() ]
Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
Out[110]:
In [113]:
sampleSize=20;
smalldf=subdf[~subdf['benzene']]
smalldf=smalldf[smalldf['dbindex'].isin(random.sample(smalldf['dbindex'], sampleSize))]
mols = [ Chem.MolFromSmiles(row['SMILES']) for indx, row in smalldf.iterrows() ]
print("== Non-Benzene-Ringed atoms ==")
names=["# %d "%(row['dbindex']) for indx, row in smalldf.iterrows() ]
Draw.MolsToGridImage( mols, molsPerRow=5, legends=names)
Out[113]:
In [114]:
# save our pickle!
subdf.to_pickle('benzene.pkl')
In [119]:
# load our pickle!
subdf=pd.read_pickle('benzene.pkl')
subdf.columns
Out[119]:
In [255]:
print(subdf.columns)
nonTrainFeatures=['numAtoms','homo','lumo','dbindex','atomList', 'atomCoords', 'freqs', 'SMILES',
'InChI','rotType', 'ringNum', 'smileGood', 'benzene','U','H','G']
trainData=subdf.drop(nonTrainFeatures,axis=1)
targetData=subdf['benzene']
X = np.array([np.array(row) for indx,row in trainData.iterrows()])
Y = np.array( targetData)
features=[str(col) for col in trainData.columns]
features
Out[255]:
In [256]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
maxtrees=40
kfoldn=10
ntrees=np.arange(1,maxtrees+1)
scores=np.zeros((ntrees.shape[0],kfoldn))
for indx,i in enumerate(ntrees):
    rf = RandomForestClassifier(n_estimators=i)
    scores[indx] = cross_val_score(rf, X, Y, cv=kfoldn)
scores=np.array(scores)
print("== Random Forest Classifier, KFold=10==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))
plt.figure(figsize=(18, 6), dpi=80)
sns.boxplot(scores.transpose())
plt.plot(ntrees,[ i.mean() for i in scores],'r--',alpha=0.75,label="mean")
plt.legend()
plt.xlabel("Number of trees")
plt.ylabel("Score")
plt.title("RandomForest with varying number of trees")
plt.show()
In [260]:
goodTrees=11
rf=RandomForestClassifier(n_estimators=goodTrees)
scores= cross_val_score(rf,X,Y, cv=kfoldn)
kfold=sklearn.cross_validation.KFold(n=len(Y), n_folds=kfoldn)
importances=[]
for indj, indices in enumerate(kfold):
    train, test = indices
    X_train, X_test = X[train], X[test]
    Y_train, Y_test = Y[train], Y[test]
    rf.fit(X_train, Y_train)
    Y_pred = rf.predict(X_test)
    importances.append(rf.feature_importances_)
#pre-process data
importances=np.array(importances)
mimportances=[ i.mean() for i in importances.transpose()]
# sort
sortedtup=sorted(zip(mimportances,features))
mimportances = np.array([i[0] for i in sortedtup])
sfeatures = np.array([i[1] for i in sortedtup])
indexes=np.arange(len(mimportances))
In [261]:
#plot away!
fig, ax = plt.subplots()
sns.barplot(indexes,y=mimportances, palette="BuGn_d")
ax.set_xticklabels(sfeatures, rotation=45)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Random Forest Classifier (%d Trees), Score = %1.3f ( %1.3f) "%(goodTrees,np.mean(scores),np.std(scores)))
print("")
In [269]:
interest=sfeatures[-5::]
mostImportance=mimportances[-5::]
names=['Ring','Non-ring']
print("=== 5 most important features ===")
print(interest)
realNames=["Heat Capacity","Molecular Weight","Zero Point Energy",
"Rotational Const C","Electronic Band Gap"]
for index,i in enumerate(interest):
    # split into benzene / non-benzene molecules and compare their distributions
    matchdf=subdf[subdf['benzene']]
    nomatchdf=subdf[~subdf['benzene']]
    data=[matchdf[i],nomatchdf[i]]
    sns.violinplot(data,names=names)
    plt.xlabel("Type of molecule")
    plt.ylabel(i)
    plt.title("Distribution of %s , Importance = %0.3f "%(realNames[index],mostImportance[index]) )
    plt.show()
In [276]:
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn import grid_search
from sklearn import neighbors
from sklearn.metrics import classification_report
gnb = GaussianNB()
scores= cross_val_score(gnb,X,Y, cv=kfoldn)
print("== Naive_Bayes Classifier ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))
clf_KNN = neighbors.KNeighborsClassifier(2, weights="uniform" )
scores= cross_val_score(clf_KNN,X,Y, cv=kfoldn)
print("== K-NN ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))
## Grid for the CV
gammas = np.logspace(-10,4,8,base=2)
Cost = [0.001,0.001,0.01,1,10,100,1000]
param_grid = [
{'C': Cost , 'gamma': gammas, 'kernel': ['rbf']},
]
clf_svm = grid_search.GridSearchCV(estimator=svm.SVC(), param_grid=param_grid)
scores= cross_val_score(clf_svm,X,Y, cv=kfoldn)
print("== SVM ==")
print("Score Mean = %1.3f, Std = %1.3f"%(np.mean(scores),np.std(scores)))
In [127]:
mols = [ pybel.readstring("Smiles",row['SMILES']) for indx,row in subdf.iterrows() ]
indexes = [ indx for indx,row in subdf.iterrows() ]
fps= [ mol.calcfp() for mol in mols ]
def pairwise_sim(fps):
#create pairwise matrix
pairwise=np.zeros((len(fps),len(fps)))
#iterate over all pairs
for indx,imol in enumerate(fps):
for jndx,jmol in enumerate(fps):
#tanimoto similarity
pairwise[indx,jndx]=imol|jmol
return pairwise
sim=pairwise_sim(fps)
In [129]:
plt.figure(figsize=(12,12))
plt.imshow(sim,interpolation='None' )
plt.title("Similarity matrix for %d Molecules (1=Equal)"%20)
plt.xlabel("Mol #id")
plt.ylabel("Mol #id")
plt.colorbar()
plt.show()
In [130]:
ref_indx=10
sim_ref=sim[ref_indx,:]
plt.hist(sim_ref)
plt.title("Similarity for a reference molecule")
plt.xlabel("Tanito Similarity")
plt.ylabel("Frequency")
print("== Reference Molecule ==")
smileStr=subdf.ix[indexes[ref_indx],'SMILES']
imolecule.draw(smileStr)
mols[ref_indx]
Out[130]:
In [131]:
nmols = [ Chem.MolFromSmiles(subdf.ix[indx,'SMILES']) for indx in indexes ]
# sorting the molecule as a function of their similarity to a query
nbrs = sorted(zip(sim_ref,nmols,indexes),reverse=True)
print("== Similar molecules, Sorted from most similar to disimilar ==")
# depict the molecules in decreasing order of similarity
names=["# %d - sim=%f"%(subdf.ix[nbr[2],'dbindex'],nbr[0]) for nbr in nbrs ]
Draw.MolsToGridImage( [nbr[1] for nbr in nbrs], molsPerRow=5, legends=names)
Out[131]:
In [133]:
sampleSize=1000;
subdf=df[df['rotType']=='Asym']
subdf=subdf[subdf['dbindex'].isin(random.sample(subdf['dbindex'], sampleSize))]
#remove some variables
interest=[u'numAtoms', u'dbindex', u'A', u'B', u'C', u'dipole', u'polar', u'gap', u'spatialSize', u'zpe', u'U0', u'Cv', u'SMILES', u'molWeight', u'Hatom', u'rotType']
subdf=subdf[interest]
mols = [ pybel.readstring("Smiles",row['SMILES']) for indx,row in subdf.iterrows() ]
indexes = [ indx for indx,row in subdf.iterrows() ]
fps= [ mol.calcfp() for mol in mols ]
sim=pairwise_sim(fps)
#plot!
plt.figure(figsize=(12,12))
plt.imshow(sim )
plt.title("Similarity matrix for %d Molecules (1=Equal)"%sampleSize)
plt.xlabel("Mol #id")
plt.ylabel("Mol #id")
plt.colorbar()
plt.show()
In [134]:
ref_indx= np.argmax( np.linalg.norm(sim,axis=1) )
sim_ref=sim[ref_indx,:]
plt.hist(sim_ref)
plt.title("Reference molecule = biggest colunm norm in the matrix")
plt.xlabel("Tanito Similarity")
plt.ylabel("Frequency")
print("== Reference Molecule ==")
smileStr=subdf.ix[indexes[ref_indx],'SMILES']
imolecule.draw(smileStr)
mols[ref_indx]
Out[134]:
In [135]:
nmols = [ Chem.MolFromSmiles(subdf.ix[indx,'SMILES']) for indx in indexes ]
# sorting the molecule as a function of their similarity to a query
nbrs = sorted(zip(sim_ref,nmols,indexes),reverse=True)
nbrs=nbrs[:30]
print("== 30 most Similar molecules, Sorted from most similar to disimilar ==")
# depict the molecules in decreasing order of similarity
names=["# %d - sim=%f"%(subdf.ix[nbr[2],'dbindex'],nbr[0]) for nbr in nbrs ]
Draw.MolsToGridImage( [nbr[1] for nbr in nbrs], molsPerRow=5, legends=names)
Out[135]:
Goals reached
The first two goals were met successfully. The last goal, which is the most interesting, was met for one substructure search, with some partial results. We tried various classifiers, such as random forest, SVM, K-NN and Naive Bayes. We obtained the best results with the random forest ensemble method. One of the advantages of this method is that it provides feature importances, which give an idea of how important each variable is for the classification. For example, with benzene we found that the band gap was the best descriptor for these types of molecules. Visualizing the distributions of these two classes, it is obvious they are very different. This also coincides with the chemical notion of an aromatic ring: these types of structures are being used to design organic semiconductors, which are characterized by a low electronic gap. We see that on average a benzene-containing molecule will have a lower band gap.
Extra ideas (not tested)
Due to time constraints, some ideas were not tested, but they might work as future directions:
On the experience: Starting this course I had never used an IPython notebook or Python; I had always programmed in C++ or Fortran. Now I want to use Python for many more projects, since it offers a very fast way of prototyping ideas.
Future Directions: There is certainly much more work to be done and many avenues to explore with data science and Quantum Chemistry. For the most part I stuck to the available dataset, yet I feel there could be more interesting data embedded in the calculations which is not included in the dataset. Examples:
The main idea would be to design and find descriptors that cannot be obtained in a classical way, to demonstrate that a quantum calculation has more connection to reality (at the molecular level) than a classical one.
In [ ]: