In [ ]:
# First let's import some libraries we will use...
import numpy as np
import scipy as sp
import pandas as pd

xyz_path = '100.xyz'    # File path
nframe = 100            # Number of frames (or snapshots)
nat = 195               # Number of atoms
a = 12.55               # Cell size

First approach

Write a function that reads in an XYZ trajectory file. We need to be able to separate numbers from atomic symbols; an XYZ trajectory file looks like:

nat [unit]
[first frame]
symbol1 x11 y11 z11
symbol2 x21 y21 z21
nat [unit]
[second frame]
symbol1 x12 y12 z12
symbol2 x22 y22 z22

Items in [ ] are optional. If the unit is absent, angstroms are assumed; if a frame has no comment, a blank line takes its place.

Here is an example file parser. All it does is read line by line and return a list of these lines.


In [ ]:
def skeleton_naive_xyz_parser(path):
    '''
    Simple xyz parser.
    '''
    # Read in file
    lines = None
    with open(path) as f:    
        lines = f.readlines()
    # Process lines
    # ...
    # Return processed lines
    # ...
    return lines

lines = skeleton_naive_xyz_parser(xyz_path)
lines

CODING TIME: Try to expand the skeleton above to convert the line strings into a list of xyz data rows (i.e. convert the coordinate strings to floats).

If you can't figure out an approach, run the cell below, which loads one of many possible ways of approaching this problem.

Note that you may have to run "%load" cells twice, once to load the code and once to instantiate the function.


In [ ]:
%load -s naive_xyz_parser, snippets/parsing.py

In [ ]:
data = naive_xyz_parser(xyz_path)
data
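
For reference, here is a minimal sketch of how the skeleton could be expanded. It is not the code stored in snippets/parsing.py; the name sketch_naive_xyz_parser and the tiny two-frame sample are made up for illustration. It assumes every frame starts with an atom count line followed by exactly one comment (or blank) line.

```python
import io

def sketch_naive_xyz_parser(lines):
    '''
    Convert raw xyz trajectory lines into rows of [symbol, x, y, z].
    Hypothetical sketch; not the version stored in snippets/parsing.py.
    '''
    rows = []
    i = 0
    while i < len(lines):
        header = lines[i].split()
        if not header:
            # Tolerate stray blank lines between frames or at the end of the file
            i += 1
            continue
        # The first token of a frame header is the atom count (a unit may follow)
        n = int(header[0])
        # Skip the count line and the comment line, then read n atom lines
        for line in lines[i + 2 : i + 2 + n]:
            symbol, x, y, z = line.split()[:4]
            rows.append([symbol, float(x), float(y), float(z)])
        i += n + 2
    return rows

# Made-up sample: two frames of two atoms; the second frame has a blank comment
sample = """2
first frame
O 0.0 0.0 0.0
H 0.0 0.0 0.96
2

O 0.1 0.0 0.0
H 0.1 0.0 0.96
"""
rows = sketch_naive_xyz_parser(io.StringIO(sample).readlines())
print(rows)
```

To use it on the real trajectory, pass it the list returned by skeleton_naive_xyz_parser(xyz_path).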

DataFrames

People spend a lot of time reading code, especially their own code.

Let's do two things by using DataFrames: make our code more readable and avoid reinventing the wheel (i.e. parsers). We have pride in the code we write!

First an example of using DataFrames...


In [ ]:
np.random.seed(1)    # Call the function; assigning to it (np.random.seed = 1) would overwrite it without seeding
df = pd.DataFrame(np.random.randint(0, 10, size=(6, 4)), columns=['A', 'B', 'C', 'D'])
df

In [ ]:
df += 1
df

In [ ]:
df.loc[:, 'A'] = [0, 0, 1, 1, 2, 2]
df

In [ ]:
df.groupby('A')[['B', 'C', 'D']].apply(lambda f: f.sum())

Second approach: pandas.read_csv

Like 99% (my estimate) of all widely established Python packages, pandas is very well documented.

Let's use this function of pandas to read in our well structured xyz data.

  • names: specifies column names (and implicitly number of columns)
  • delim_whitespace: tab or space separated files (pandas 2.2 and later deprecate this option in favor of the equivalent sep=r'\s+')

CODING TIME: Figure out what options we need to correctly parse in the XYZ trajectory data using pandas.read_csv


In [ ]:
def skeleton_pandas_xyz_parser(path):
    '''
    Parses xyz files using pandas read_csv function.
    '''
    # Read from disk
    # sep=r'\s+' is equivalent to delim_whitespace=True, which newer pandas deprecates
    df = pd.read_csv(path, sep=r'\s+', names=['symbol', 'x', 'y', 'z'])
    # Remove nats and comments
    # ...
    # ...
    return df

In [ ]:
df = skeleton_pandas_xyz_parser(xyz_path)
df.head()

One possible solution (run this only if you have already finished the above!):


In [ ]:
%load -s pandas_xyz_parser, snippets/parsing.py

In [ ]:
df = pandas_xyz_parser(xyz_path)
df.head()
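
For comparison, here is one way the "remove nats and comments" step could be sketched. This is an assumption about a workable approach, not the contents of snippets/parsing.py, and the name sketch_pandas_xyz_parser is hypothetical. The idea: coerce the coordinate columns to numbers, so that atom count lines and comment lines end up with missing values, then drop those rows. It assumes comment lines have at most four whitespace-separated tokens and do not themselves look like an atom row.

```python
import io
import pandas as pd

def sketch_pandas_xyz_parser(path_or_buffer):
    '''
    Parse an xyz trajectory with pandas.read_csv and drop non-atom rows.
    Hypothetical sketch; not the version stored in snippets/parsing.py.
    '''
    coords = ['x', 'y', 'z']
    # Blank comment lines are skipped by read_csv by default (skip_blank_lines=True)
    df = pd.read_csv(path_or_buffer, sep=r'\s+', names=['symbol'] + coords)
    # Anything that is not a number becomes NaN (e.g. the words of a comment)
    for col in coords:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    # Atom count lines and comment lines lack three numeric coordinates
    return df.dropna(subset=coords).reset_index(drop=True)

# Made-up sample: two frames of two atoms; the second frame has a blank comment
sample = """2
first frame
O 0.0 0.0 0.0
H 0.0 0.0 1.0
2

O 0.5 0.0 0.0
H 0.5 0.0 1.0
"""
demo = sketch_pandas_xyz_parser(io.StringIO(sample))
print(demo)
```

pandas.read_csv accepts a file path as well as an open buffer, so the same function can be called with xyz_path.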

Testing your functions is key

A couple of quick tests should suffice...though these barely make the cut...


In [ ]:
print(len(df) == nframe * nat)    # Make sure that we have the correct number of rows
print(df.dtypes)                  # Make sure that each column's type is correct
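
The printed checks above rely on someone reading the output. A slightly stricter variant, sketched here with a hypothetical helper named check_xyz_frame and a made-up four-row DataFrame, turns them into assertions that fail loudly.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

def check_xyz_frame(df, nframe, nat):
    '''
    Raise AssertionError if df does not look like a parsed xyz trajectory.
    Hypothetical helper for illustration only.
    '''
    # One row per atom per frame
    assert len(df) == nframe * nat, 'unexpected number of rows'
    for col in ('x', 'y', 'z'):
        # Coordinates must be floats with no missing values
        assert is_float_dtype(df[col]), col + ' is not a float column'
        assert df[col].notna().all(), col + ' contains missing values'

# Made-up data: two frames of two atoms
demo = pd.DataFrame({
    'symbol': ['O', 'H', 'O', 'H'],
    'x': [0.0, 0.0, 0.5, 0.5],
    'y': [0.0, 0.0, 0.0, 0.0],
    'z': [0.0, 1.0, 0.0, 1.0],
})
check_xyz_frame(demo, nframe=2, nat=2)    # Passes silently
```

For the real trajectory the call would be check_xyz_frame(df, nframe, nat).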

Let's attach a meaningful index

This is easy since we know the number of atoms and number of frames...


In [ ]:
df = pandas_xyz_parser(xyz_path)
df.index = pd.MultiIndex.from_product((range(nframe), range(nat)), names=['frame', 'atom'])
df

CODING TIME: Put parsing and indexing together into a single function.


In [ ]:
%load -s parse, snippets/parsing.py
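
If you want something to compare against, here is a sketch of a combined function. It is only an illustration: the name sketch_parse is hypothetical, the (path, nframe, nat) signature is assumed to mirror the parse call used when saving the data, and the row-dropping strategy (coerce coordinates to numbers, drop rows with missing values) is one choice among several.

```python
import io
import pandas as pd

def sketch_parse(path_or_buffer, nframe, nat):
    '''
    Parse an xyz trajectory and index it by (frame, atom).
    Hypothetical sketch; not the version stored in snippets/parsing.py.
    '''
    coords = ['x', 'y', 'z']
    df = pd.read_csv(path_or_buffer, sep=r'\s+', names=['symbol'] + coords)
    # Atom count lines and comment lines have no valid coordinates; drop them
    for col in coords:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    df = df.dropna(subset=coords)
    # Fail early if the file does not match the expected dimensions
    if len(df) != nframe * nat:
        raise ValueError('expected %d rows, found %d' % (nframe * nat, len(df)))
    df.index = pd.MultiIndex.from_product((range(nframe), range(nat)),
                                          names=['frame', 'atom'])
    return df

# Made-up sample: two frames of two atoms; the second frame has a blank comment
sample = """2
first frame
O 0.0 0.0 0.0
H 0.0 0.0 1.0
2

O 0.5 0.0 0.0
H 0.5 0.0 1.0
"""
xyz_demo = sketch_parse(io.StringIO(sample), nframe=2, nat=2)
print(xyz_demo.loc[1])    # All atoms of the second frame
```

With the MultiIndex in place, xyz_demo.loc[1] selects a whole frame and xyz_demo.loc[(1, 0), 'x'] selects a single coordinate.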

Saving your work!

We did all of this work parsing our data, but this Python kernel won't stay alive forever, so let's save our data so that we can load it later (i.e. in the next notebook!).

We are going to create an HDF5 store to save our DataFrame(s) to disk.

HDF5 is a high-performance, portable, binary data storage format designed with scientific data exchange in mind. Use it!

Also note that pandas has extensive IO functionality.


In [ ]:
xyz = parse(xyz_path, nframe, nat)
store = pd.HDFStore('xyz.hdf5', mode='w')
store.put('xyz', xyz)
store.close()

Though there are a bunch of improvements/features we could make to our parse function...

...let's move on to step two