In [1]:
import pandas as pd
There are many pandas tutorials out there, so I will not attempt to recreate one of them. Instead, I will show a practical example relating to quantum chemistry calculations of the hydrogen molecule ($H_2$). Hopefully this example will inspire you to start using pandas in your own research.
To collect simulation data scattered across many folders, I build a list of dictionaries, one dictionary per run.
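To see why this works, recall that pandas can build a table straight from a list of dictionaries, one row per dictionary. A tiny sketch with made-up numbers (the real keys come from the parsers below):

```python
import pandas as pd

# each dictionary describes one run (values here are made up)
runs = [
    {'dist': 1.4, 'E0': -1.151, 'E1': -0.668},
    {'dist': 1.6, 'E0': -1.145, 'E1': -0.702},
]
df_runs = pd.DataFrame(runs)  # one row per run, one column per key
print(df_runs)
```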
In [2]:
import sys
sys.path.insert(0,'scripts')
from parse_gms import parse_gms_inp,parse_gms_out
In the data/h2_cis folder, I ran GAMESS to get the ground- and first-excited-state energies of the H2 molecule at a number of bond lengths. I put a script in the scripts folder to parse GAMESS input and output. (Do NOT open the script file; you will be horrified.)
In [3]:
# take input file and turn into dictionary
parse_gms_inp('data/h2_cis/sep1.40/sep1.40-cis.inp')
Out[3]:
In [4]:
# take output file and turn into dictionary
parse_gms_out('data/h2_cis/sep1.40/sep1.40-cis.out')
Out[4]:
Construct a list of dictionaries by looping through the runs and parsing every input-output pair.
In [5]:
import os
import subprocess as sp
def collect(database, rundir):
    if not os.path.isfile(database):
        # find all input files
        all_inps = sp.check_output(['find', rundir, '-ipath', '*.inp']).decode().split('\n')[:-1]
        data = []
        for gms_inp in all_inps:
            settings = parse_gms_inp(gms_inp)        # get inputs
            gms_out = gms_inp.replace('inp', 'out')  # assume output name
            outputs = parse_gms_out(gms_out)         # get outputs
            # glue input and output together
            entry = outputs
            entry.update({'settings': settings})
            # simply entry.update(settings), if you want all settings to be exposed
            # I also store the run path, because the folder name is sometimes useful
            path = os.path.dirname(gms_inp)
            entry.update({'path': path})
            data.append(entry)
        # end for
        df = pd.DataFrame(data)
        df.to_json(database)  # I always save data to file to avoid rerunning analysis
    else:
        df = pd.read_json(database)
    # end if
    return df
# end def collect
database = 'data/h2_cct.json'
rundir = 'data/h2_cis/'
df0 = collect(database,rundir)
In [51]:
# if you want all settings to be exposed
flat_df = pd.concat( [df0['settings'].apply(pd.Series),df0],axis=1
).drop('settings',axis=1)
flat_df.head()
Out[51]:
First, let me show you the energy surfaces of $H_2$.
In [9]:
import matplotlib.pyplot as plt
%matplotlib inline
# select columns, and sort by bond length
mydf = flat_df[['dist','E0','E1']].sort_values('dist')
mydf.head()
Out[9]:
In [10]:
# Fig. 1: visualize data
fig,ax = plt.subplots(1,1)
ax.set_xlabel('r (bohr)',fontsize=16)
ax.set_ylabel('energy (Ha)',fontsize=16)
ax.plot(mydf['dist'],mydf['E0'],'o-',label='ground')
ax.plot(mydf['dist'],mydf['E1'],'^-',label='excited')
ax.legend()
Out[10]:
In [52]:
mydf.head()
Out[52]:
Rows can be selected by index,
In [11]:
mydf.loc[[21,10]] # select rows with index 21 and 10
Out[11]:
or by position.
In [12]:
mydf.iloc[0:2] # select first two rows in mydf
Out[12]:
The most useful selection method is probably by boolean array.
In [13]:
sel = mydf['dist']<1.0 # all points with bond length < 1 bohr
mydf.loc[sel]
Out[13]:
Column selection can be done at the same time.
In [14]:
mydf.loc[sel,['E0','E1']]
Out[14]:
I did a lousy job with the second batch of runs: some data points are missing. Is that a problem?
In [15]:
# first load new data, add to current DataFrame
database = 'data/h2_ccd.json'
rundir = 'data/ccd_cis/'
df1 = collect(database,rundir)
df_combined = pd.concat([df0,df1]).reset_index(drop=True)
# unwrap settings
df = pd.concat( [df_combined['settings'].apply(pd.Series),df_combined],axis=1
).drop('settings',axis=1)
In [16]:
# I changed the quantum chemistry basis set
df['gbasis'].unique()
Out[16]:
In [17]:
# Fig. 2: visualize new data
ccd_df = df.loc[df['gbasis']=='ccd'
,['dist','E0','E1']].sort_values('dist')
cct_df = df.loc[df['gbasis']=='cct'
,['dist','E0','E1']].sort_values('dist')
fig,ax = plt.subplots(1,1)
ax.set_xlabel('r (bohr)',fontsize=16)
ax.set_ylabel('energy (Ha)',fontsize=16)
ax.plot(cct_df['dist'],cct_df['E0'],'o-',label='ground cct')
ax.plot(cct_df['dist'],cct_df['E1'],'^-',label='excited cct')
ax.plot(ccd_df['dist'],ccd_df['E0'],'x--',label='ground ccd',mew=2)
ax.plot(ccd_df['dist'],ccd_df['E1'],'+--',label='excited ccd',mew=2)
ax.legend()
Out[17]:
We see that the answers are not sensitive to the basis set, phew...
Now let us quantify the effect of the parameter change.
In [18]:
# tabulate E0 as a function of bond length and basis set
compare = df.pivot(index='dist', columns='gbasis', values='E0')
compare.head()
# we see two missing entries for gbasis=ccd
Out[18]:
You can drop missing data with df.dropna(), but many functions work even in the presence of missing data.
In [19]:
import numpy as np
# by default, pandas silently skips missing data
print(compare.mean())
# however, in this case, we need to drop the missing data for a fair comparison
print(compare.dropna().mean())
Only after dropping the NaNs do we see that the cct basis set gives a lower energy, as is evident from the plot. Lesson: check your NaNs!
In [20]:
compare.isnull().any()
Out[20]:
In [21]:
# aggregate is also helpful for high level view of data
df.groupby('gbasis')['E0'].agg([np.mean,np.std,len])
Out[21]:
In [22]:
# different sets of functions can be applied to different columns!
df.groupby('gbasis')[['E0','E1']].agg({'E0':[np.mean,np.std,len],'E1':np.mean})
Out[22]:
Some chemists do not like atomic units, so I have to convert bohr to angstrom and hartree to eV (face palm). This is where column arithmetic comes in handy.
In [23]:
df['dist_ang'] = df['dist'] * 0.529  # 1 bohr is roughly 0.529 angstrom
df['E0_ev'] = df['E0'] * 27          # 1 hartree is roughly 27 eV
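The factors above are rounded. If the extra digits matter, the CODATA values are 1 bohr = 0.52917721067 Å and 1 hartree = 27.21138602 eV; a sketch with toy data (the toy frame and column names are mine, following the ones above):

```python
import pandas as pd

BOHR_TO_ANG = 0.52917721067  # CODATA Bohr radius in angstrom
HA_TO_EV = 27.21138602       # CODATA hartree in eV

toy = pd.DataFrame({'dist': [1.4, 1.6], 'E0': [-1.151, -1.145]})
toy['dist_ang'] = toy['dist'] * BOHR_TO_ANG  # bohr -> angstrom
toy['E0_ev'] = toy['E0'] * HA_TO_EV          # hartree -> eV
```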
Suppose I want to plot the difference between the ccd and cct curves in Fig. 2. We will have to select out the appropriate entries and subtract. Pandas handles this elegantly.
In [24]:
# energy gained by switching from ccd to cct
diff = ccd_df.set_index('dist') - cct_df.set_index('dist')
In [25]:
diff.head() # notice the NaNs are automatically taken care of
Out[25]:
In [26]:
# you can access 'dist' with diff.index, but I like 'dist' to be a column
diff.reset_index(inplace=True)
In [27]:
# Fig. 3: visualize the difference
fig,ax = plt.subplots(1,1)
ax.set_xlabel('r (bohr)',fontsize=16)
ax.set_ylabel('energy (Ha)',fontsize=16)
ax.plot(diff['dist'],diff['E0'],'o-',label='ground ccd-cct')
ax.plot(diff['dist'],diff['E1'],'^-',label='excited ccd-cct')
ax.legend()
Out[27]:
The bond distance is of type float. Sometimes the above procedure will fail because the decimals do not match exactly. In that case, make a unique string index column:
In [28]:
df['dist_str'] = df['dist'].apply(lambda x:str(round(x,2)))
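With dist_str in place, the earlier subtraction can be repeated using the string column as the index, so rows align by exact string match rather than float comparison. A sketch with toy frames standing in for ccd_df and cct_df:

```python
import pandas as pd

# toy frames standing in for ccd_df and cct_df; note the float noise in 'dist'
a = pd.DataFrame({'dist': [1.4000000001, 1.6], 'E0': [-1.10, -1.12]})
b = pd.DataFrame({'dist': [1.4, 1.6], 'E0': [-1.15, -1.14]})

# round and stringify so 1.4000000001 and 1.4 map to the same key, '1.4'
for frame in (a, b):
    frame['dist_str'] = frame['dist'].apply(lambda x: str(round(x, 2)))

# align on the string index instead of the noisy float values
diff = a.set_index('dist_str')['E0'] - b.set_index('dist_str')['E0']
print(diff)
```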
If the data is already in tabulated form, then pd.read_X() will convert data from file type X into a pandas DataFrame. I usually check the size of the table with df.info(), which prints the number of rows (26729) and all column labels. df.head() shows the first few entries.
In [29]:
df_animals = pd.read_csv('data/train.csv')
df_animals.info()
df_animals.columns;df_animals.index; # these can be directly modified (quick&dirty)
#df_animals.rename(columns={'Name':'my name'},inplace=True) # cleaner rename
df_animals.head()
#pd.read_ # uncomment and double tap to see all the file types that can be read
#df_animals.to_ # double tap to see all file types can be saved to
Out[29]:
In [30]:
def bar_percentage(series, bar_width=0.5, xlabel_font=12, ylim=(0,1)):
    # series should have the pandas.Series interface
    # return a bar plot of probabilities
    fig = plt.figure()
    ax = fig.add_subplot(111, aspect=len(series)-1)
    ax.set_ylim(ylim)
    ax.axhline(0, c="black")
    ax.set_ylabel("Percentage", fontsize=14)
    # enumerate series.index and plot series.values on top
    x = np.array(range(len(series.index)))
    ax.bar(x, series.values, bar_width, fill=False, hatch='\\', color=None)
    # format x axis
    ax.set_xticks(x)
    ax.set_xlim(min(x)-bar_width/2., max(x)+bar_width*1.5)
    ax.set_xticklabels(series.index, fontsize=xlabel_font, rotation=45)
    fig.tight_layout()
    return fig, ax
# end def
In [31]:
fig,ax = bar_percentage( df_animals['OutcomeType'].value_counts(normalize=True) )
In [32]:
# why are animals euthanized?
sel = df_animals['OutcomeType'] == 'Euthanasia'
fig,ax = bar_percentage( df_animals.loc[sel,'OutcomeSubtype'].value_counts(normalize=True) )
In [33]:
df_animals.loc[ df_animals['OutcomeType'] == 'Adoption','SexuponOutcome' ].value_counts()
Out[33]:
In [34]:
df_animals.groupby('AnimalType')['OutcomeType'].value_counts(normalize=True)
Out[34]: