Even More Monte Carlo Sampling

Goals:

  • Peruse the menu of sampling options beyond those that we've seen so far

References

  • Gelman ch. 12
  • Handbook of MCMC (Brooks, Gelman, Jones and Meng, eds.), parts of which are on the web

The wide world of sampling

  1. Evolving a single state
    • Metropolis sampling, and related methods
    • Hamiltonian Monte Carlo (HMC)
  2. Evolving an ensemble of states
    • Goodman-Weare sampling
  3. Non-Markov methods
    • Population Monte Carlo (PMC)
    • Nested sampling

Codes for some of these are in our list of MCMC packages.

Hamiltonian Monte Carlo (HMC)

(Also sometimes called Hybrid Monte Carlo)

While standard MCMC is analogous to the evolution of a thermodynamic system, HMC is (almost) analogous to the evolution of a single particle. Consider our free parameters as coordinates of a position, $\theta$, and minus the log-posterior as a potential energy, $U(\theta)$.

HMC introduces momentum parameters, $\phi$, corresponding to each position parameter, and an associated "kinetic energy",

$K(\phi) = \sum_i \frac{\phi_i^2}{2m_i}$.

Looking at the analogy the other way, the probability associated with $K(\phi)$

$p(\phi) \propto e^{-K(\phi)}$

is a multivariate Gaussian with mean zero and a diagonal covariance given by the "masses".

The HMC algorithm alternates Gibbs samples of $\phi|\theta$ with joint updates of $\phi$ and $\theta$.

  1. Generate a sample of $\phi$ from the distribution defined by $K(\phi)$.
  2. Evolve $(\theta,\phi)$ for some time as a dynamical system, according to $K(\phi)$ and $U(\theta)$.
  3. The state at the end of this process is the proposal. Apply the standard Metropolis acceptance test to the initial and proposed probabilities $e^{-(K+U)}$. (This is trivial if we conserve energy in the "evolution" phase, but in practice the evolution is often done more approximately to save cycles.)

Note that this procedure requires derivatives of the log-posterior, unlike standard MCMC methods.

There's a lot more literature on HMC and its optimization than we can cover - see e.g. this chapter. In a nutshell, the introduction of "momentum" into the evolution of the chain is supposed to reduce the random-walk behavior (autocorrelation) of traditional MCMC.

What's the catch?

To evolve the state, we need the gradient of the log-posterior (the "force"). If this can be computed analytically, great. If we need to do it numerically, it isn't obvious that HMC will be more efficient than standard MCMC.

Affine-invariant sampling

This is a class of methods that evolve an ensemble of states rather than a single state. After convergence, the ensemble can be regarded as a set of samples from the target distribution.

This approach provides some of the benefits of running multiple chains - but remember that these are not independent chains!

The currently fashionable variant is coded in a python package hilariously called emcee, and implements the evolution rule proposed by Goodman and Weare (2010).

The Goodman-Weare method

The algorithm for moving each point in the ensemble is:

  1. Randomly pick a different point from the ensemble (total size $N$).
  2. Propose a move in the direction of that point, by the distance between them multiplied by a random from this distribution: $g(z) \propto \frac{1}{\sqrt z}; ~ \frac{1}{2}\leq z \leq 2$
  3. Accept or reject the move based on the ratio of posterior densities multiplied by $z^{N-1}$.

Note that there is some magic in the density $g$. We are not free to choose just any function.

This algorithm is relatively easy to use - there is no tuning required and it's straightforward to parallelize.

Important cautions:

  • if the ensemble is not started in a region of high probability, convergence will be extremely slow. You have been warned.
  • as the walkers are not independent, the Gelman-Rubin convergence criterion doesn't apply
  • assessing convergence visually is not always straightforward

Population Monte Carlo (PMC)

Recall the essence of MC integration: $\int w(x)\,p(x)\,dx = \int w(x)\,dP(x) \approx \overline{w(x_i)}; ~ x_i\sim P$.

We used this (way back) to set up Simple Monte Carlo, where $p$ is the prior and $w$ is the likelihood.

The key to SMC is that any set of points can provide an unbiased estimate of an integral, as long as the weighting is correct.

PMC is a more refined version of this - instead of sampling from the prior, we iteratively build up an approximation to the posterior distribution, called the generating distribution, along with a number of samples.

Each iteration consists of

  1. Selecting the generating distribution, $q$, and
  2. Sampling a number of points from the generating distribution and computing the corresponding posterior densities, $\pi$. These points can then be weighted by the ratio $\pi/q$ and treated as samples of the posterior.

Clearly, getting this to converge to the target quickly will require a clever selection of $q$.

A nice feature of PMC is that the generating distribution can depend on the previous crops of points while still producing unbiased results, thanks to the explicit re-weighting.

With a suitably intelligent implementation (e.g. see this), the generating distribution can be evolved towards the posterior distribution, providing samples for a posteriori analysis as well as an estimate for the evidence.

Nested sampling

This is a particular form of PMC that is primarily aimed at calculating the Bayesian evidence, the integral of the likelihood function over the prior, $\int p(D|\theta)\,p(\theta)d\theta$.

We begin with a large number of points sampled from the prior, and gradually evolve them towards higher-likelihood positions, keeping track of how the volume filled changes.

The evidence can then be calculated as a numerical approximation of $\int_0^1 L(V)\,dV$ (assuming the prior volume is normalized to 1).

By virtue of spamming the parameter space with points at the start, nested sampling is likely (though never guaranteed) to find multiple modes, if they exist (same for PMC).

The computational challenge here is to maximize the efficiency of moving a given point to a position of higher likelihood. Math is involved - see e.g. the MultiNest paper.

Take-aways

  • We looked at a few (more) complicated sampling algorithms that are in use.
  • Each has pros and cons, in terms of generality, robustness, speed, interpretability, ...
  • There's no silver bullet in general - many problems can be solved perfectly well with simple methods, but sometimes you might be compelled to look farther, or to mix and match samplers.