We often want to calculate additional quantities using our MCMC samples. Tracing these quantities allows us to characterize the posterior distribution of arbitrary structural features. For the case of tri-alanine, we would like to track the populations of each of the four conformational states ($\alpha$, $\beta$, $PP_{II}$, other). As before, we start by constructing and sampling an LVBP model.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fitensemble import belt, example_loader
predictions, measurements, uncertainties = example_loader.load_alanine_numpy()
phi, psi, assignments, indicators = example_loader.load_alanine_dihedrals()
num_samples = 20000 # Generate 20,000 MCMC samples
thin = 25 # Subsample (i.e. thin) the MCMC traces by 25X to ensure independent samples
burn = 5000 # Discard the first 5000 samples as "burn-in"
regularization_strength = 3.0 # How strongly do we prefer a "uniform" ensemble (the "raw" MD)?
belt_model = belt.MaxEntBELT(predictions, measurements, uncertainties, regularization_strength)
belt_model.sample(num_samples, thin=thin, burn=burn)
Now we run the new code that outputs the trace of an additional "observable". We simply call belt_model.trace_observable()
. This function takes a numpy array as input; here we input the transpose of the matrix of state indicators.
Thus, for each conformational ensemble in our MCMC trace, we calculate the population of the four torsional states:
In [2]:
state_pops_trace = belt_model.trace_observable(indicators.T)
In [3]:
state_pops_raw = np.bincount(assignments) / float(len(assignments))
state_pops = state_pops_trace.mean(0)
We would like to view the conformational populations as a table. We use the pandas library to construct a tabular view of the populations:
In [4]:
pd.DataFrame([state_pops_raw,state_pops],columns=[r"$PP_{II}$",r"$\beta$",r"$\alpha_r$",r"$\alpha_l$"],index=["Raw (MD)", "LVBP"])
Out[4]:
In [5]:
state_uncertainties = state_pops_trace.std(0)
pd.DataFrame([state_uncertainties],columns=[r"$PP_{II}$",r"$\beta$",r"$\alpha_r$",r"$\alpha_l$"],index=["LVBP"])
Out[5]:
In the first tutorial, we evaluated model quality using the reduced $\chi^2$. To calculate the reduced $\chi^2$, we first calculated the posterior average populations. Now that we know how to use our MCMC samples, however, we should estimate the model quality using a more "Bayesian" approach. To do so, we are going to iterate over the MCMC samples of conformational populations to calculate the reduced $\chi^2$ for every MCMC sample.
In [6]:
chi2 = []
for pi in belt_model.iterate_populations():
mu = predictions.T.dot(pi)
chi2.append((((mu - measurements) / uncertainties)**2).mean(0))
chi2_MCMC = np.mean(chi2)
p = belt_model.accumulate_populations()
mu = predictions.T.dot(p)
chi2_AVE = (((mu - measurements) / uncertainties)**2).mean(0)
pd.DataFrame([[chi2_MCMC, chi2_AVE]],columns=["MCMC Sampled", "Posterior Average Populations"],index=["Reduced $\chi^2$"])
Out[6]:
Notice that the MCMC sampled reduced $\chi^2$ is approximately equal to one, while the posterior average result is approximately equal to zero. A value of reduced $\chi^2$ near zero suggests an overfit model. Our model, however, is not overfit; we just need to calculate the reduced $\chi^2$ in a "Bayesian" way.
You've finished this tutorial. In your own research, you can use a similar approach to characterize quantities like RMSD, radius of gyration, distances between atoms, side chain torsions, etc.
Continue on to the third tutorial [here].
Beauchamp, K. A., Das, R. , and Pande, V. S. Inferring Conformational Ensembles from Noisy Experiments. In Prep.
Sosnick et al. Helix, sheet, and polyproline II frequencies and strong nearest neighbor effects in a restricted coil library. Biochemistry, 2005.