Managing Experimental Data with Pandas

FitEnsemble uses Numpy to store the predictions, measurements, and uncertainties. However, Numpy cannot help you keep track of where your experiments come from. Pandas is a Python library that combines Numpy arrays with row and column labels to simplify data management. You can think of Pandas as a combination of Numpy (matrix math) and Excel (spreadsheets).

Here, we will load our data as Pandas objects using a helper function that has been included in FitEnsemble.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fitensemble import belt, example_loader

num_samples = 20000  # Generate 20,000 MCMC samples
thin = 25  # Subsample (i.e. thin) the MCMC traces by 25X to reduce correlation between successive samples
burn = 5000  # Discard the first 5000 samples as "burn-in"

regularization_strength = 3.0  # How strongly do we prefer a "uniform" ensemble (the "raw" MD)? 

predictions, measurements, uncertainties = example_loader.load_alanine_pandas()

The main difference between Numpy and Pandas objects is the presence of labels for each row and column. In Pandas, a 1D object is called a Series, while a 2D object is called a DataFrame. Here, predictions is a DataFrame, while measurements and uncertainties are Series objects.
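
To make the distinction concrete, here is a minimal sketch that builds a small DataFrame and Series by hand. The numbers and the toy_ variable names are made up for illustration; only the J3_HN_HA label and the 5.68 measurement are taken from the example data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows of predictions for a single experiment column.
labels = pd.Index(["J3_HN_HA"], name="experiment")
toy_predictions = pd.DataFrame(
    np.array([[4.1], [6.3], [5.0], [7.2]]),
    columns=labels,
)
toy_measurements = pd.Series([5.68], index=labels, name="measurements")

# A DataFrame is 2D (rows x columns); a Series is 1D.
print(type(toy_predictions).__name__, toy_predictions.shape)
print(type(toy_measurements).__name__, toy_measurements.shape)

# Values can be looked up by label rather than by position.
print(toy_measurements["J3_HN_HA"])
```

The column labels of the DataFrame and the index labels of the Series are the same object here, which is what lets you match a prediction column to its measurement by name.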

We can examine these objects to get a better feel for them:


In [2]:
print(predictions)
print("\n")
print(measurements)
print("\n")
print(uncertainties)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 295189 entries, 0 to 295188
Data columns (total 1 columns):
J3_HN_HA    295189  non-null values
dtypes: float64(1)


experiment
J3_HN_HA      5.68
Name: measurements, dtype: float64


experiment
J3_HN_HA      0.36
Name: uncertainties, dtype: float64

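You will notice that all three objects share the label J3_HN_HA: it names the single column of predictions and the single entry of measurements and uncertainties. predictions has two dimensions; its other dimension (the rows) has length 295189.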

When using Pandas to store your data, you must pay attention to one difference. Instead of passing the Pandas objects to FitEnsemble, you must pass the Numpy arrays that they contain. Luckily, you can do this using the values attribute, which is present on every Series and DataFrame.


In [3]:
belt_model = belt.MaxEntBELT(predictions.values, measurements.values, uncertainties.values, regularization_strength)
belt_model.sample(num_samples, thin=thin, burn=burn)


[****************100%******************]  20000 of 20000 complete
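
As a quick check of what the values attribute hands over, the sketch below uses small made-up objects (the toy_ names and numbers are hypothetical, not the alanine data) and confirms that the results are plain Numpy arrays with compatible shapes. Recent Pandas releases also offer a to_numpy() method that serves the same purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same layout as the example: one experiment column.
labels = pd.Index(["J3_HN_HA"], name="experiment")
toy_predictions = pd.DataFrame(np.array([[4.1], [6.3], [5.0]]), columns=labels)
toy_measurements = pd.Series([5.68], index=labels)

# .values strips the labels and returns the underlying Numpy array.
pred_array = toy_predictions.values
meas_array = toy_measurements.values

print(type(pred_array).__name__, pred_array.shape)
print(type(meas_array).__name__, meas_array.shape)

# The number of prediction columns should match the number of measurements.
assert pred_array.shape[1] == meas_array.shape[0]
```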

With only one or two experiments, you can easily keep track of things manually. However, if you have more than two experiments, Pandas offers a powerful approach to keeping track of datasets.
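
One way the labels help with several experiments is sketched below. Because the values attribute discards labels, the columns of the predictions and the entries of the measurements and uncertainties must already be in the same order before conversion. The experiment names (expt_A, expt_B, expt_C) and all numbers are hypothetical; reindex is a standard Pandas method, but using it this way is a suggestion rather than part of FitEnsemble.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 rows of predictions for 3 made-up experiments.
rng = np.random.default_rng(0)
toy_predictions = pd.DataFrame(
    rng.normal(size=(5, 3)), columns=["expt_A", "expt_B", "expt_C"]
)

# Measurements and uncertainties happen to be stored in different orders.
toy_measurements = pd.Series({"expt_C": 3.0, "expt_A": 1.0, "expt_B": 2.0})
toy_uncertainties = pd.Series({"expt_B": 0.2, "expt_C": 0.3, "expt_A": 0.1})

# Reorder both Series by label so they match the prediction columns.
aligned_measurements = toy_measurements.reindex(toy_predictions.columns)
aligned_uncertainties = toy_uncertainties.reindex(toy_predictions.columns)

# A label missing from either Series would appear as NaN after reindexing.
assert not aligned_measurements.isnull().any()
assert not aligned_uncertainties.isnull().any()

print(aligned_measurements.values)   # [1. 2. 3.]
print(aligned_uncertainties.values)  # [0.1 0.2 0.3]
```

After this step, the three .values arrays line up column by column and can be passed on as in the single-experiment case.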