Efficient Monte Carlo Sampling

Goals:

  • Introduce some of the approaches used to make sampling more efficient
  • Experiment with using some of these techniques on difficult PDFs

References

  • Ivezic 5.8
  • MacKay chapters 29-30
  • Gelman chapters 11 and 13
  • Handbook of MCMC (Brooks, Gelman, Jones and Meng, eds.), parts of which are on the web
  • Recent and less recent notes on CosmoMC by Antony Lewis

Motivation

Quite often, repeated evaluation of the posterior function becomes a bottleneck. Any tweaks that speed up convergence can be worthwhile.

The simple methods introduced so far especially struggle with

  • strongly degenerate parameter combinations
  • PDFs with multiple, separated modes

Motivation: degeneracies

Cosmology is bananas
The Rosenbrock function

Motivation: multiple modes

Marginalized bananas
The eggbox function

Speeding things up

Broadly speaking, we can try to

  1. tailor faster algorithms to specific classes of PDF
  2. look for ways to make the general samplers more intelligent

We can also use different samplers for different subsets of parameters - the only rule is that every parameter must get updated somehow.

Gibbs Sampling

is a specialization of Metropolis-Hastings

  • Instead of making a general proposal, we cycle through the parameters proposing changes to one at a time
  • A proposal for $\theta_i$ is drawn from the fully conditional posterior $P(\theta_i|\theta_{-i},x)$, where $-i$ means all indices other than $i$.

Gibbs Sampling

while we want more samples
    propose theta1 | theta2, theta3, ..., data
    accept/reject theta1
    propose theta2 | theta1, theta3, ..., data
    accept/reject theta2
    ...

Conjugate Gibbs Sampling

In general, this is not obviously an improvement over proposing changes to all $\theta$ simultaneously.

But, something interesting happens if the fully conditional likelihood and prior are conjugate - then we know the conditional posterior exactly. If we use independent samples of the conditional posterior as proposals, then the Metropolis-Hastings acceptance ratio becomes

$\frac{P(y)Q(x|y)}{P(x)Q(y|x)} = \frac{P(y)P(x)}{P(x)P(y)} = 1$

and every proposal is automatically accepted!

Conjugate Gibbs Sampling

while we want more samples
    draw th1 from P(th1|th2,th3,...,data)
    draw th2 from P(th2|th1,th3,...,data)
    ...

Conjugate Gibbs Sampling

Pros

  • No cycles "wasted" on rejected proposals
  • No pesky tuning of the proposal scale

Cons

  • Only works for conjugate or partially conjugate models
  • Occasionally still slower than proposing multi-parameter Metropolis updates

Conjugate Gibbs Sampling

Several packages can work out which conjugate relationships hold, given a model definition, and do Gibbs and/or Metropolis sampling accordingly.

Example: normal distribution

Suppose we have some data, $x_i$, that we're modelling as being generated by a common normal (aka Gaussian) distribution.

$P(x_i|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp-\frac{(x_i-\mu)^2}{2\sigma^2}$

The model parameters are the mean, $\mu$, and variance, $\sigma^2$.

Example: normal distribution

Consulting Wikipedia, we see that we can Gibbs sample this posterior using two steps:

  • the conjugate distribution for $\mu|\sigma^2,x$ is normal
  • the conjugate distribution for $\sigma^2|\mu,x$ is scaled inverse $\chi^2$

Remember that using conjugacy limits our options for choosing priors!
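As a concrete sketch, here is a minimal conjugate Gibbs sampler for this model, assuming the improper priors $p(\mu)\propto$ const and $p(\sigma^2)\propto 1/\sigma^2$ (other conjugate prior choices would change the conditional parameters; the simulated data values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# simulated data from the assumed model (true values are arbitrary)
x = rng.normal(loc=2.0, scale=1.5, size=100)
n, xbar = len(x), x.mean()

def gibbs_normal(nsamples=5000):
    """Conjugate Gibbs sampler for (mu, sigma2) given normal data,
    assuming p(mu) ~ const and p(sigma2) ~ 1/sigma2 (a choice!)."""
    samples = np.empty((nsamples, 2))
    sigma2 = x.var()  # starting guess
    for i in range(nsamples):
        # mu | sigma2, x is normal with mean xbar and variance sigma2/n
        mu = rng.normal(xbar, np.sqrt(sigma2 / n))
        # sigma2 | mu, x is scaled inverse chi^2 with n degrees of freedom
        ss = np.sum((x - mu)**2)
        sigma2 = ss / rng.chisquare(n)
        samples[i] = mu, sigma2
    return samples

chain = gibbs_normal()
```

Because each draw comes from the exact conditional posterior, every "proposal" is accepted.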

Slice sampling

A less specialized modification of Metropolis-Hastings is slice sampling. This method also avoids having to manually tune a proposal scale, although it usually involves a rejection step.

Given a starting point $\theta_0$, we uniformly choose a probability $q\leq P(\theta_0)$. This defines a slice where $P(\theta)\geq q$, and we sample the next $\theta$ uniformly within the slice.

Slice sampling

Slice sampling with rejection

See also Neal 2003

start with position x
guess a width W
while we want samples
    evaluate p = P(x)
    draw q from Uniform(0,p)
    choose L,R: R-L=W and L<=x<=R
    while P(L) > q, set L = L - W
    while P(R) > q, set R = R + W
    loop forever
        draw y from Uniform(L,R)
        if P(y) < q,
            if y < x, set L = y
            otherwise, set R = y
        otherwise, break
    set x = y and record x
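The pseudocode above translates fairly directly into Python. This is a minimal 1D sketch with stepping out; the target density and the initial width are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def slice_sample(P, x0, width=1.0, nsamples=2000):
    """1D slice sampler with stepping out. P is the (unnormalized)
    target density; width is the initial bracket size."""
    samples = np.empty(nsamples)
    x = x0
    for i in range(nsamples):
        q = rng.uniform(0.0, P(x))      # slice level
        L = x - width * rng.uniform()   # random bracket containing x
        R = L + width
        while P(L) > q:                 # step out until both ends
            L -= width                  # are outside the slice
        while P(R) > q:
            R += width
        while True:                     # sample within the bracket,
            y = rng.uniform(L, R)       # shrinking it on rejection
            if P(y) >= q:
                break
            if y < x:
                L = y
            else:
                R = y
        x = y
        samples[i] = x
    return samples

# example: sample a standard normal (unnormalized density)
chain = slice_sample(lambda t: np.exp(-0.5 * t**2), x0=0.0)
```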

Accelerating Metropolis

The simple Metropolis implementation we tinkered with in the Basic Monte Carlo chunk can be improved on in a number of ways.

choose a starting point and a proposal scale
while we want more samples
    propose a new point from a scaled Gaussian centered on our current position
    accept/reject the new point

Accelerating Metropolis: proposal lengths

A Gaussian proposal distribution makes the most likely proposals very near the current point. This is not necessarily efficient. We could instead

  • choose a proposal direction (see later)
  • propose a step length from a heavier-tailed distribution
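As a sketch of the second option, here is a Metropolis sampler whose steps are drawn per coordinate from a heavy-tailed Student's $t$ distribution (the simplest variant; drawing a direction and a separate length would be a refinement). The proposal is symmetric, so the acceptance rule is unchanged; the target, scale, and degrees of freedom below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_t(logpost, x0, scale=1.0, df=2.0, nsamples=5000):
    """Metropolis sampler with heavy-tailed Student's t steps
    (df degrees of freedom) instead of Gaussian ones."""
    x = np.array(x0, dtype=float)
    lp = logpost(x)
    samples = np.empty((nsamples, x.size))
    for i in range(nsamples):
        y = x + scale * rng.standard_t(df, size=x.size)
        lq = logpost(y)
        if np.log(rng.uniform()) < lq - lp:  # symmetric-proposal acceptance
            x, lp = y, lq
        samples[i] = x
    return samples

# try it on a 2D Gaussian target, starting away from the peak
chain = metropolis_t(lambda t: -0.5 * np.sum(t**2), x0=[5.0, -5.0])
```

The occasional long step helps the chain escape a poor starting position or jump between well-separated regions.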

Accelerating Metropolis: proposal lengths

Accelerating Metropolis: proposal basis

Parameter degeneracies are inconvenient.

Accelerating Metropolis: proposal basis

Degeneracies can usually be reduced or eliminated by reparametrizing the model.

  • But beware - a nonlinear reparametrization does not leave your prior invariant - recall the transformation of PDFs.
  • Linear reparametrizations are equivalent to a change of basis for the parameter space. To gain the benefits of this, we merely need to change the way we propose steps.

Accelerating Metropolis: proposal basis

Original
Parameter basis change
Proposal basis change

Accelerating Metropolis: proposal basis

For a nice, elliptical PDF, the optimal basis diagonalizes the covariance of the parameters.

  • This doesn't mean we can only propose along the preferred directions
  • For a given direction, we know the optimal length scale for the proposal
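A minimal sketch of such a proposal, assuming we already have an estimated parameter covariance (the matrix below is made up, and the $\sim 2.4/\sqrt{d}$ factor is the classic optimal scaling for Gaussian targets):

```python
import numpy as np

rng = np.random.default_rng(3)

def basis_proposal(cov):
    """Build a proposal that steps along the eigenvectors of an
    estimated parameter covariance (e.g. from a test chain), with
    each direction's length scale set by its eigenvalue."""
    evals, evecs = np.linalg.eigh(cov)
    ndim = len(evals)
    def propose(x):
        i = rng.integers(ndim)  # pick a principal direction at random
        # ~2.4/sqrt(ndim) is the standard optimal Gaussian scaling
        return x + evecs[:, i] * rng.normal(0.0, 2.4 * np.sqrt(evals[i] / ndim))
    return propose

# example: covariance of a strongly correlated 2D Gaussian (made-up numbers)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
propose = basis_proposal(cov)
y = propose(np.zeros(2))
```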

Accelerating Metropolis: fast-slow decompositions

If changes to some parameters require much more costly calculations than others, straightforward diagonalization may not be optimal in practice.

  • We can choose to propose changes to fast parameters more often than slow parameters
  • Given a hierarchy of speeds, we can also triangularize the covariance matrix - so the fastest parameter appears in every basis vector and the slowest only in one
  • ... ad nauseam

Accelerating Metropolis: multiple chains

Unless we know what to do a priori, the best information about how to tune proposal lengths and directions comes from running test chains started from different positions and using what they learn about the posterior.

Parallel, communicating chains (e.g. using MPI) that tune themselves this way can vastly decrease the time to convergence compared with single chains.

Accelerating Metropolis: multiple chains

Strictly speaking, tuning the proposal distribution on the fly breaks the rules of MCMC (specifically, detailed balance).

  • We should stop tuning after some time and throw away all steps before this point
  • In practice, if we've actually managed to converge to the target distribution, subsequent updates to the proposal distribution will be negligible anyway...
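One way to implement this (retune during burn-in, then freeze the proposal and discard the burn-in) is sketched below; the tuning interval, burn-in length, and target are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def adaptive_metropolis(logpost, x0, nburn=2000, nsamples=5000):
    """Sketch of on-the-fly tuning: during burn-in, periodically
    re-estimate the proposal covariance from the chain so far; then
    freeze it and discard the burn-in, so the kept samples satisfy
    detailed balance."""
    ndim = len(x0)
    x = np.array(x0, dtype=float)
    lp = logpost(x)
    chol = np.linalg.cholesky(np.eye(ndim))  # initial proposal covariance
    history = []
    for i in range(nburn + nsamples):
        y = x + chol @ rng.normal(size=ndim) * 2.4 / np.sqrt(ndim)
        lq = logpost(y)
        if np.log(rng.uniform()) < lq - lp:
            x, lp = y, lq
        history.append(x.copy())
        if i < nburn and i % 500 == 499:  # retune only during burn-in
            cov = np.cov(np.array(history).T) + 1e-6 * np.eye(ndim)
            chol = np.linalg.cholesky(cov)
    return np.array(history[nburn:])  # tuning phase discarded

chain = adaptive_metropolis(lambda t: -0.5 * np.sum(t**2), np.full(2, 3.0))
```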

Tempering

Consider the function $[P(x)]^{1/T}$, where $P(x)$ is the target PDF.

  • We say a chain sampling this function has temperature $T$. For $T>1$, $P^{1/T}$ is smoothed out compared with $P$ (cf simulated annealing).
  • This allows chains to move among multiple peaks more easily.
  • Of course, we're only actually interested in $T=1$...

Parallel tempering

With parallel tempering, we run one chain with $T=1$ and several more chains with $T>1$. A modified Metropolis-Hastings update occasionally allows the chains to exchange positions, giving the $T=1$ chain a mechanism for sampling regions of parameter space it might otherwise have low probability of proposing. Samples from the $T=1$ chain can be used for inference.
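A minimal parallel tempering sketch (the temperature ladder, swap frequency, and bimodal target are illustrative). Swaps between neighbouring temperatures are accepted with the Metropolis-Hastings probability for exchanging positions, which depends on the difference of inverse temperatures:

```python
import numpy as np

rng = np.random.default_rng(5)

def parallel_tempering(logpost, x0, temps=(1.0, 3.0, 10.0), nsamples=5000):
    """One Metropolis chain per temperature, each sampling P^(1/T),
    with occasional proposed swaps between neighbouring temperatures.
    Only the T=1 chain's samples are returned for inference."""
    betas = 1.0 / np.asarray(temps)
    xs = [np.array(x0, dtype=float) for _ in temps]
    lps = [logpost(x) for x in xs]
    kept = []
    for step in range(nsamples):
        for k, b in enumerate(betas):   # ordinary Metropolis update at each T
            y = xs[k] + rng.normal(size=len(xs[k]))
            lq = logpost(y)
            if np.log(rng.uniform()) < b * (lq - lps[k]):
                xs[k], lps[k] = y, lq
        if step % 10 == 0:              # occasionally propose a swap
            k = rng.integers(len(temps) - 1)
            dlp = (betas[k] - betas[k + 1]) * (lps[k + 1] - lps[k])
            if np.log(rng.uniform()) < dlp:
                xs[k], xs[k + 1] = xs[k + 1], xs[k]
                lps[k], lps[k + 1] = lps[k + 1], lps[k]
        kept.append(xs[0].copy())
    return np.array(kept)

# example: a bimodal 1D target with well-separated modes
logp = lambda t: np.logaddexp(-0.5*np.sum((t-4)**2), -0.5*np.sum((t+4)**2))
chain = parallel_tempering(logp, x0=[0.0])
```

The hot chains see a flattened landscape and cross between modes; swaps then ferry those positions down to the $T=1$ chain.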

Numerical exercise: play

This sandbox notebook contains code for a simple Metropolis sampler and 4 PDFs:

  1. The Rosenbrock function (actually minus this; it's normally used for testing minimizers), illustrated above
  2. The eggbox function, illustrated above
  3. A spherical shell function
  4. A Gaussian PDF (suitable for Gibbs sampling)

Choose one of the speed-ups above (something that sounds implementable in a reasonable time) and modify/replace the sampler to use it on one or more of these PDFs. Compare performance (burn-in length, acceptance rate, etc.) with simple Metropolis. (Apart from PDF 4, chances are that nothing you do will work brilliantly, but you should see a difference!)

For Gibbs sampling, interpret PDF 4 as a likelihood function and put some thought into how you define your priors. Otherwise, just take the given PDF to be the posterior distribution.