In [ ]:
# First let's import some libraries we will use...
import numpy as np
import scipy as sp
import pandas as pd
xyz_path = '100.xyz' # File path
nframe = 100 # Number of frames (or snapshots)
nat = 195 # Number of atoms
a = 12.55 # Cell size
Write a function that reads an xyz trajectory file in. We are going to need to be able to separate numbers from atomic symbols; an XYZ trajectory file looks like:
nat [unit]
[first frame]
symbol1 x11 y11 z11
symbol2 x21 y21 z21
nat [unit]
[second frame]
symbol1 x12 y12 z12
symbol2 x22 y22 z22
Items in [ ] are optional: if the unit is absent, angstroms are assumed, and if there is no comment, the comment line is left blank.
Here is an example file parser. All it does is read line by line and return a list of these lines.
In [ ]:
def skeleton_naive_xyz_parser(path):
    '''
    Simple xyz parser.
    '''
    # Read in file
    lines = None
    with open(path) as f:
        lines = f.readlines()
    # Process lines
    # ...
    # Return processed lines
    # ...
    return lines
lines = skeleton_naive_xyz_parser(xyz_path)
lines
CODING TIME: Try to expand the skeleton above to convert the line strings into a list of xyz data rows (i.e. convert the coordinate strings to floats).
If you can't figure out an approach, run the cell below, which will print one of many possible ways of approaching this problem.
Note that you may have to run "%load" cells twice, once to load the code and once to instantiate the function.
In [ ]:
%load -s naive_xyz_parser, snippets/parsing.py
In [ ]:
data = naive_xyz_parser(xyz_path)
data
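For reference, here is a minimal sketch of one way the skeleton could be expanded. It is not the content of snippets/parsing.py; the name `sketch_naive_xyz_parser` and the tiny two-frame file are made up for illustration, and the sketch assumes every frame has the layout described above (a count line, a comment line, then one line per atom).

```python
import os
import tempfile


def sketch_naive_xyz_parser(path):
    '''
    Hypothetical naive parser: returns a list of [symbol, x, y, z] rows.
    '''
    with open(path) as f:
        lines = f.readlines()
    rows = []
    i = 0
    while i < len(lines):
        header = lines[i].split()
        if not header:
            # Tolerate stray blank lines between or after frames
            i += 1
            continue
        nat = int(header[0])  # first token of the header is the atom count
        # Skip the count line and the comment line, then read nat atom lines
        for line in lines[i + 2:i + 2 + nat]:
            symbol, x, y, z = line.split()[:4]
            rows.append([symbol, float(x), float(y), float(z)])
        i += 2 + nat
    return rows


# Made-up two-frame file: the first frame has a blank comment line,
# the second has a unit and a comment.
example_text = (
    "2\n"
    "\n"
    "O 0.0 0.0 0.0\n"
    "H 0.0 0.0 1.0\n"
    "2 angstrom\n"
    "second frame\n"
    "O 0.1 0.0 0.0\n"
    "H 0.1 0.0 1.0\n"
)
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, 'example.xyz')
    with open(example_path, 'w') as f:
        f.write(example_text)
    rows = sketch_naive_xyz_parser(example_path)
```

Reading the atom count from each header means the function does not need nframe or nat passed in, at the cost of trusting the file to be well formed.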
In [ ]:
np.random.seed(1)  # Call seed(); assigning to it would overwrite the function and seed nothing
df = pd.DataFrame(np.random.randint(0, 10, size=(6, 4)), columns=['A', 'B', 'C', 'D'])
df
In [ ]:
df += 1
df
In [ ]:
df.loc[:, 'A'] = [0, 0, 1, 1, 2, 2]
df
In [ ]:
df.groupby('A')[['B', 'C', 'D']].apply(lambda f: f.sum())
Like 99% (my estimate) of all widely established Python packages, pandas is very well documented.
Let's use pandas' read_csv function to read in our well-structured xyz data.
CODING TIME: Figure out what options we need to correctly parse in the XYZ trajectory data using pandas.read_csv
In [ ]:
def skeleton_pandas_xyz_parser(path):
    '''
    Parses xyz files using pandas read_csv function.
    '''
    # Read from disk (sep=r'\s+' splits on whitespace; delim_whitespace is deprecated in recent pandas)
    df = pd.read_csv(path, sep=r'\s+', names=['symbol', 'x', 'y', 'z'])
    # Remove nats and comments
    # ...
    # ...
    return df
In [ ]:
df = skeleton_pandas_xyz_parser(xyz_path)
df.head()
One possible solution (run this only if you have already finished the above!):
In [ ]:
%load -s pandas_xyz_parser, snippets/parsing.py
In [ ]:
df = pandas_xyz_parser(xyz_path)
df.head()
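As a rough illustration of the "remove nats and comments" step, the sketch below drops every row that lacks coordinates. It is not the loaded snippet; `sketch_pandas_xyz_parser` is a hypothetical name, and the approach assumes blank comment lines (which read_csv skips by default) and count lines that hold only the number of atoms.

```python
import io

import pandas as pd


def sketch_pandas_xyz_parser(path_or_buffer):
    '''
    Hypothetical parser: read everything, then drop rows without coordinates.
    '''
    df = pd.read_csv(path_or_buffer, sep=r'\s+',
                     names=['symbol', 'x', 'y', 'z'])
    # Count lines have a single token, so their x, y, z are NaN
    df = df.dropna(subset=['x', 'y', 'z']).reset_index(drop=True)
    return df


# Made-up two-frame trajectory with blank comment lines
example = io.StringIO(
    "2\n"
    "\n"
    "O 0.0 0.0 0.0\n"
    "H 0.0 0.0 1.0\n"
    "2\n"
    "\n"
    "O 0.1 0.0 0.0\n"
    "H 0.1 0.0 1.0\n"
)
frames = sketch_pandas_xyz_parser(example)
```

A file whose comment lines contain text would need extra handling, for example filtering rows by position rather than by missing values.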
In [ ]:
print(len(df) == nframe * nat) # Make sure that we have the correct number of rows
print(df.dtypes) # Make sure that each column's type is correct
In [ ]:
df = pandas_xyz_parser(xyz_path)
df.index = pd.MultiIndex.from_product((range(nframe), range(nat)), names=['frame', 'atom'])
df
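To see what the (frame, atom) index buys us, here is a small sketch on made-up data. The names `toy`, `n_frames_toy`, and `n_atoms_toy` are illustrative only; the selection calls are standard pandas.

```python
import numpy as np
import pandas as pd

n_frames_toy, n_atoms_toy = 2, 3  # made-up sizes for illustration
coords = np.arange(n_frames_toy * n_atoms_toy * 3, dtype=float)
toy = pd.DataFrame(coords.reshape(-1, 3), columns=['x', 'y', 'z'])
toy.index = pd.MultiIndex.from_product(
    (range(n_frames_toy), range(n_atoms_toy)), names=['frame', 'atom'])

# All atoms of frame 1 (result is indexed by atom)
second_frame = toy.loc[1]
# Atom 0 in every frame (result is indexed by frame)
atom0 = toy.xs(0, level='atom')
# Per-frame mean position
centroids = toy.groupby(level='frame').mean()
```

Because the index is built with from_product, it only lines up with the data if the rows are ordered frame by frame with exactly the same number of atoms in each frame.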
CODING TIME: Put parsing and indexing together into a single function.
In [ ]:
%load -s parse, snippets/parsing.py
We did all of this work parsing our data, but this Python kernel won't stay alive forever, so let's save our data so that we can load it later (i.e. in the next notebook!).
We are going to create an HDF5 store to save our DataFrame(s) to disk.
HDF is a high performance, portable, binary data storage format designed with scientific data exchange in mind. Use it!
Also note that pandas has extensive IO functionality.
In [ ]:
xyz = parse(xyz_path, nframe, nat)
store = pd.HDFStore('xyz.hdf5', mode='w')
store.put('xyz', xyz)
store.close()
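Loading the data back is the other half of the round trip. The sketch below writes a made-up DataFrame to a temporary file and reads it back with pd.read_hdf; the file name and data are illustrative. pandas' HDF5 support depends on the optional PyTables package (`tables`), so the sketch checks for it first.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Made-up stand-in for the parsed trajectory
xyz_toy = pd.DataFrame(
    {'symbol': ['O', 'H', 'O', 'H'],
     'x': [0.0, 0.0, 0.1, 0.1],
     'y': [0.0, 0.0, 0.0, 0.0],
     'z': [0.0, 1.0, 0.0, 1.0]},
    index=pd.MultiIndex.from_product((range(2), range(2)),
                                     names=['frame', 'atom']))

restored = None
if importlib.util.find_spec('tables') is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sketch.hdf5')
        # Using the store as a context manager closes it automatically
        with pd.HDFStore(path, mode='w') as store:
            store.put('xyz', xyz_toy)
        # Read the same key back, e.g. from a later notebook
        restored = pd.read_hdf(path, 'xyz')
```

The key passed to read_hdf must match the one used in store.put.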
Though there are a bunch of improvements/features we could make to our parse function...