Python

Python is a dynamically typed, multi-paradigm programming language that supports procedural, object-oriented, and functional styles.

  • Has simple, readable syntax and is easy to learn
  • Companies are building entire data analytics industries on Jupyter Notebook and the PyData stack (e.g. IBM Bluemix)
  • Intel is compiling the PyData stack linked to their MKL library (thanks to Barry for pointing this out!)
  • Every time you check your Gmail, Google's servers execute Python
  • Python developers were the second most in-demand programming language developers in 2015 (first was C#; data available here)

Pandas by example

Pandas is a high-performance library for manipulating labeled, multidimensional data.

Other notable libraries:

  • numpy: N-D array mathematics
  • scipy: Mathematical operations on N-D data
  • scikit-learn: Compound operations (grouping, nearest-neighbor searches, clustering, pattern recognition, etc.)
  • jupyter: Specifically, Jupyter Notebook for interactive data work in many languages
  • blaze: A suite of tools for data storage, manipulation, and processing

There are a number of other important packages, but these are the core building blocks for a Python data scientist and a good place to start.

Learning by doing

Let's explore the power of these tools by examining a problem that many of us have seen before: processing the result of a molecular dynamics (MD) simulation.

MD simulations are an integral part of the tools at the computational chemist's disposal. These simulations provide a description of atomic and molecular behavior.

They can be used to predict novel properties, aid in chemical design, and support experimental conclusions. Performing these simulations can be tricky from a theoretical standpoint, but assuming we can get past that, let's consider how to analyze the result of a simulation, with a focus on reproducibility.

Task

Given an XYZ trajectory file containing 2000 frames (geometry snapshots) with 195 atoms each where each frame is in a 12.55 Å x 12.55 Å x 12.55 Å cell (thanks Adam!),

1) parse in the data

2) compute all of the interatomic distances (accounting for periodic boundary conditions), and

3) plot the pair correlation function(s) (sometimes called radial distribution functions)
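The three steps above can be sketched roughly as follows. This is a minimal illustration and not necessarily the approach the rest of the notebook takes. It assumes the standard XYZ layout (an atom-count line and a comment line before each frame's atoms), a cubic cell, and that all atoms are treated as a single species. The function names `parse_xyz`, `pair_distances`, and `pair_correlation` are hypothetical, and the tiny two-frame trajectory is made up for demonstration.

```python
import io

import numpy as np
import pandas as pd


def parse_xyz(lines, natoms):
    """Step 1: read an XYZ trajectory into a DataFrame, one row per atom per frame.

    `lines` is any iterable of text lines (an open file works).
    Assumes every frame is: atom-count line, comment line, then `natoms` atom lines.
    """
    block = natoms + 2
    rows = []
    for i, line in enumerate(lines):
        if i % block >= 2:  # skip the two header lines of each frame
            symbol, x, y, z = line.split()[:4]
            rows.append((i // block, symbol, float(x), float(y), float(z)))
    return pd.DataFrame(rows, columns=["frame", "symbol", "x", "y", "z"])


def pair_distances(xyz, box):
    """Step 2: unique interatomic distances in one frame of a cubic periodic cell.

    Uses the minimum image convention: each displacement component is
    shifted by a whole number of box lengths so that it lies within +/- box/2.
    """
    delta = xyz[:, None, :] - xyz[None, :, :]
    delta -= box * np.round(delta / box)
    dist = np.sqrt((delta ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(xyz), k=1)  # each pair once, no self-pairs
    return dist[i, j]


def pair_correlation(distances, natoms, nframes, box, nbins=100):
    """Step 3: histogram the distances and normalize by the ideal-gas expectation."""
    rmax = box / 2  # minimum image distances are only complete up to box/2
    counts, edges = np.histogram(distances, bins=nbins, range=(0, rmax))
    r = 0.5 * (edges[1:] + edges[:-1])
    shell = 4 / 3 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    density = natoms / box ** 3
    # Unique pairs expected per shell in an ideal gas, summed over all frames.
    ideal = 0.5 * natoms * density * shell * nframes
    return pd.Series(counts / ideal, index=pd.Index(r, name="r"), name="g(r)")


# Tiny synthetic trajectory (illustration only, not the real 2000-frame data).
demo = io.StringIO(
    "3\nframe 0\nO 0.5 0.0 0.0\nH 9.5 0.0 0.0\nH 0.5 3.0 0.0\n"
    "3\nframe 1\nO 1.0 0.0 0.0\nH 9.0 0.0 0.0\nH 1.0 4.0 0.0\n"
)
atoms = parse_xyz(demo, natoms=3)
dists = np.concatenate([
    pair_distances(frame[["x", "y", "z"]].to_numpy(), box=10.0)
    for _, frame in atoms.groupby("frame")
])
g = pair_correlation(dists, natoms=3, nframes=2, box=10.0, nbins=50)
```

For the trajectory described in the task, the same calls would presumably use `natoms=195`, `nframes=2000`, and `box=12.55`, with an open file handle passed in place of `demo`. Because the result is a pandas Series indexed by `r`, calling `g.plot()` in a notebook with matplotlib installed should draw the curve. Per-species functions (e.g. O-O or O-H) would need the distances filtered by atom symbol first, along with a matching change to the normalization.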

