FitEnsemble uses Numpy to store the predictions, measurements, and uncertainties. However, Numpy cannot help you keep track of where your experiments come from. Pandas is a Python library that combines Numpy arrays with row and column labels to simplify data management. You can think of Pandas as a combination of Numpy (matrix math) and Excel (spreadsheets).
Here, we will load our data as Pandas objects using a helper function that has been included in FitEnsemble.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fitensemble import belt, example_loader
num_samples = 20000 # Generate 20,000 MCMC samples
thin = 25 # Subsample (i.e. thin) the MCMC traces by 25X to reduce correlation between samples
burn = 5000 # Discard the first 5000 samples as "burn-in"
regularization_strength = 3.0 # How strongly do we prefer a "uniform" ensemble (the "raw" MD)?
predictions, measurements, uncertainties = example_loader.load_alanine_pandas()
The main difference between a Numpy array and a Pandas object is the presence of labels for each row and column. In Pandas, a 1D object is called a Series, while a 2D object is called a DataFrame. Here, predictions is a DataFrame, while measurements and uncertainties are Series objects.
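To make the distinction concrete, here is a minimal sketch that builds a small DataFrame and two Series by hand. All values are made up for illustration, the second label (J3_HN_CB) is hypothetical, and the layout (one row per conformation, one column per experiment) is an assumption based on the shapes described in this tutorial rather than a statement about what the loader returns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 conformations (rows) and 2 experiments (columns).
# "J3_HN_HA" is the label used in this tutorial; "J3_HN_CB" is illustrative only.
experiment_labels = ["J3_HN_HA", "J3_HN_CB"]

# 2D object: a DataFrame with one labeled column per experiment.
toy_predictions = pd.DataFrame(
    np.array([[4.1, 2.0], [7.9, 1.5], [5.2, 2.2], [6.3, 1.8]]),
    columns=experiment_labels,
)

# 1D objects: Series whose index holds the same experiment labels.
toy_measurements = pd.Series([5.7, 1.9], index=experiment_labels)
toy_uncertainties = pd.Series([0.4, 0.3], index=experiment_labels)

# Labels let us look things up by name instead of by position.
print(toy_predictions.shape)
print(toy_predictions["J3_HN_HA"].mean())
print(toy_measurements["J3_HN_HA"])
```

The toy_ names are separate from the predictions, measurements, and uncertainties objects loaded above; they only show how labels attach to the underlying arrays.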
We can examine these objects to get a better feel for them:
In [2]:
print(predictions)
print("\n")
print(measurements)
print("\n")
print(uncertainties)
You will notice that all three objects carry the label J3_HN_HA. Unlike the other two, predictions has two dimensions; its other dimension has length 295189.
When using Pandas to store your data, you must pay attention to one difference: instead of passing the Pandas objects to FitEnsemble, you must pass the Numpy data that they contain. Luckily, you can do this using the values attribute present on both Series and DataFrame objects.
In [3]:
belt_model = belt.MaxEntBELT(predictions.values, measurements.values, uncertainties.values, regularization_strength)
belt_model.sample(num_samples, thin=thin, burn=burn)
With only one or two experiments, you can easily keep track of things manually. However, if you have more than two experiments, the row and column labels in Pandas make it much easier to keep each measurement matched to its prediction.
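Because the values attribute drops the labels, it may be worth lining everything up by label before extracting the arrays. The sketch below shows one way to do that with three experiments stored in different orders. The labels (exp_A, exp_B, exp_C) and all numbers are hypothetical, and this check is an optional suggestion rather than something FitEnsemble requires.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 conformations, 3 experiments.
toy_predictions = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    columns=["exp_A", "exp_B", "exp_C"],
)
# The Series are deliberately stored in a different order than the columns.
toy_measurements = pd.Series([2.0, 0.5, 1.0], index=["exp_C", "exp_A", "exp_B"])
toy_uncertainties = pd.Series([0.1, 0.2, 0.3], index=["exp_B", "exp_C", "exp_A"])

# Reorder both Series so they follow the column order of the predictions.
labels = toy_predictions.columns
aligned_measurements = toy_measurements.reindex(labels)
aligned_uncertainties = toy_uncertainties.reindex(labels)

# reindex fills labels that are missing from a Series with NaN, so check for that.
if aligned_measurements.isnull().any() or aligned_uncertainties.isnull().any():
    raise ValueError("Some experiments lack a measurement or an uncertainty.")

# Only now strip the labels and keep the plain Numpy arrays.
prediction_array = toy_predictions.values
measurement_array = aligned_measurements.values
uncertainty_array = aligned_uncertainties.values

print(prediction_array.shape, measurement_array, uncertainty_array)
```

The three arrays at the end share the same experiment order, so they could be passed on in place of the .values expressions used earlier. In recent Pandas versions, to_numpy() can be used instead of values.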