Tracing Additional Random Variables

We often want to calculate additional quantities using our MCMC samples. Tracing these quantities allows us to characterize the posterior distribution of arbitrary structural features. For the case of tri-alanine, we would like to track the populations of each of the four conformational states ($\alpha$, $\beta$, $PP_{II}$, other). As before, we start by constructing and sampling an LVBP model.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fitensemble import belt, example_loader

predictions, measurements, uncertainties = example_loader.load_alanine_numpy()
phi, psi, assignments, indicators = example_loader.load_alanine_dihedrals()

num_samples = 20000  # Generate 20,000 MCMC samples
thin = 25  # Subsample (i.e. thin) the MCMC trace by 25X to reduce autocorrelation between retained samples
burn = 5000  # Discard the first 5000 samples as "burn-in"

regularization_strength = 3.0  # How strongly do we prefer a "uniform" ensemble (the "raw" MD)? 

belt_model = belt.MaxEntBELT(predictions, measurements, uncertainties, regularization_strength)
belt_model.sample(num_samples, thin=thin, burn=burn)


[****************100%******************]  20000 of 20000 complete
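As a sanity check on these settings: assuming the sampler discards the burn-in and then keeps every `thin`-th draw (the usual PyMC-style convention), the number of retained posterior samples is:

```python
num_samples = 20000  # Total MCMC iterations
burn = 5000          # Iterations discarded as burn-in
thin = 25            # Keep every 25th draw after burn-in

# Retained posterior samples after burn-in and thinning
num_retained = (num_samples - burn) // thin
print(num_retained)  # 600
```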

Trace an Observable

Now we compute the trace of an additional "observable" by calling belt_model.trace_observable(). This function takes a numpy array as input; here we pass the transpose of the matrix of state indicators.

Thus, for each conformational ensemble in our MCMC trace, we calculate the population of the four torsional states:


In [2]:
state_pops_trace = belt_model.trace_observable(indicators.T)
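Each row of the resulting state_pops_trace is simply a weighted average of the state indicators under one sampled set of conformational weights. Here is a minimal, self-contained sketch of that arithmetic for a hypothetical 3-state, 5-frame system (the exact matrix layout expected by `trace_observable` may differ; this only illustrates the computation):

```python
import numpy as np

# Hypothetical one-hot indicators: indicators[i, j] = 1 if frame j belongs to state i
indicators = np.array([[1, 0, 0, 1, 0],
                       [0, 1, 0, 0, 1],
                       [0, 0, 1, 0, 0]], dtype=float)

# One sampled vector of conformational weights (sums to 1)
pi = np.array([0.10, 0.20, 0.30, 0.25, 0.15])

# Population of each state under these weights
state_pops = indicators.dot(pi)
assert np.isclose(state_pops.sum(), 1.0)  # disjoint state populations sum to 1
```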

Estimate Average Properties

We have calculated a trace of the state populations for each conformational ensemble in the MCMC chain. We first characterize the average (over all MCMC samples) state populations. We also look at the state populations of the raw MD simulation.


In [3]:
state_pops_raw = np.bincount(assignments) / float(len(assignments))
state_pops = state_pops_trace.mean(0)
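One caveat about the np.bincount line above: if the highest-numbered state never appears in `assignments`, the returned array is shorter than the number of states. Passing `minlength` (a small robustness tweak, not used in the cell above) guards against this:

```python
import numpy as np

# Hypothetical assignments for 8 frames over 4 states; state 3 never occurs
assignments = np.array([0, 0, 1, 2, 1, 0, 2, 0])

pops = np.bincount(assignments, minlength=4) / float(len(assignments))
assert len(pops) == 4  # state 3 still gets a slot, with population 0.0
```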

We would like to view the conformational populations as a table. We use the pandas library to construct a tabular view of the populations:


In [4]:
pd.DataFrame([state_pops_raw, state_pops], columns=[r"$PP_{II}$", r"$\beta$", r"$\alpha_r$", r"$\alpha_l$"], index=["Raw (MD)", "LVBP"])


Out[4]:
          $PP_{II}$   $\beta$  $\alpha_r$  $\alpha_l$
Raw (MD)   0.438756  0.509199   0.049334    0.002710
LVBP       0.559800  0.386472   0.050921    0.002808

Estimate Uncertainties

It is also useful to look at the uncertainties associated with each of the state populations:


In [5]:
state_uncertainties = state_pops_trace.std(0)
pd.DataFrame([state_uncertainties], columns=[r"$PP_{II}$", r"$\beta$", r"$\alpha_r$", r"$\alpha_l$"], index=["LVBP"])


Out[5]:
      $PP_{II}$   $\beta$  $\alpha_r$  $\alpha_l$
LVBP   0.033894  0.034513   0.000611    0.000014
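The trace also supports interval estimates, not just standard deviations. For example, a 95% credible interval for each state population can be read off the percentiles of the trace columns (sketched here with a synthetic Dirichlet trace standing in for state_pops_trace; the 2.5/97.5 levels are a conventional choice, not part of the tutorial):

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic stand-in for state_pops_trace: 600 MCMC samples x 4 states
trace = rng.dirichlet([50.0, 40.0, 5.0, 1.0], size=600)

# 95% credible interval for each state's population
lower, upper = np.percentile(trace, [2.5, 97.5], axis=0)
interval_widths = upper - lower
```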

A Better Way to Evaluate Model Quality

In the first tutorial, we evaluated model quality using the reduced $\chi^2$, which we computed from the posterior average populations. Now that we know how to work with individual MCMC samples, however, we can estimate model quality in a more "Bayesian" way: we iterate over the MCMC samples of conformational populations and calculate the reduced $\chi^2$ for every sample.


In [6]:
chi2 = []
for pi in belt_model.iterate_populations():
    mu = predictions.T.dot(pi)  # Predicted measurements for this MCMC sample
    chi2.append((((mu - measurements) / uncertainties)**2).mean(0))

chi2_MCMC = np.mean(chi2)  # Average of the per-sample reduced chi squared values

p = belt_model.accumulate_populations()  # Posterior average populations
mu = predictions.T.dot(p)
chi2_AVE = (((mu - measurements) / uncertainties)**2).mean(0)

pd.DataFrame([[chi2_MCMC, chi2_AVE]], columns=["MCMC Sampled", "Posterior Average Populations"], index=[r"Reduced $\chi^2$"])


Out[6]:
                  MCMC Sampled  Posterior Average Populations
Reduced $\chi^2$      0.497532                       0.012774

Notice that the MCMC-sampled reduced $\chi^2$ is of order one, while the posterior average result is nearly zero. A reduced $\chi^2$ near zero would normally suggest an overfit model. Our model, however, is not overfit; the near-zero value is an artifact of averaging the populations before computing $\chi^2$, and we just need to calculate the reduced $\chi^2$ in a "Bayesian" way.
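The gap between the two estimates is expected: by Jensen's inequality, the $\chi^2$ of the averaged predictions can never exceed the average of the per-sample $\chi^2$ values. A tiny synthetic demonstration (made-up numbers, not the tri-alanine data):

```python
import numpy as np

rng = np.random.RandomState(42)
measurement, uncertainty = 0.0, 1.0

# Per-sample predicted measurements, scattered around the true value
mu_samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# Average of the per-sample chi^2 values ("Bayesian" estimate)
chi2_mcmc = (((mu_samples - measurement) / uncertainty) ** 2).mean()

# chi^2 of the averaged prediction (posterior-average estimate)
chi2_ave = ((mu_samples.mean() - measurement) / uncertainty) ** 2

assert chi2_ave <= chi2_mcmc  # averaging first can only shrink chi^2
```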

You've finished this tutorial. In your own research, you can use a similar approach to characterize quantities like RMSD, radius of gyration, distances between atoms, side chain torsions, etc.

Continue on to the third tutorial [here].

References

  • Beauchamp, K. A., Das, R., and Pande, V. S. Inferring Conformational Ensembles from Noisy Experiments. In preparation.

  • Sosnick et al. Helix, sheet, and polyproline II frequencies and strong nearest neighbor effects in a restricted coil library. Biochemistry, 2005.