Efficient Monte Carlo Sampling

Goals:

  • Introduce some of the approaches used to make sampling more efficient

References

  • Gelman chapters 11 and 12
  • Ivezic 5.8
  • MacKay chapters 29-30
  • Handbook of MCMC (Brooks, Gelman, Jones and Meng, eds.), parts of which are on the web
  • Recent and less recent notes on CosmoMC by Antony Lewis

Motivation

Quite often, repeated evaluation of the posterior function becomes a bottleneck. Any tweaks that speed up convergence and reduce autocorrelation can be worthwhile.

The simple methods introduced so far especially struggle with

  • strongly degenerate parameter combinations
  • PDFs with multiple, separated modes

Some strong degeneracies:

Cosmology is bananas
The Rosenbrock function

Some multiple modes:

Some spectral model
The eggbox function

Speeding things up

Broadly speaking, we can try to

  1. tailor algorithms to specific classes of PDF
  2. look for ways to make the general samplers more intelligent

We can also use different samplers for different subsets of parameters - the only rule is that every parameter must get updated somehow.

Gibbs Sampling

is a specialization of Metropolis-Hastings:

  • Instead of making a general proposal, we cycle through the parameters proposing changes to one at a time
  • A proposal for $\theta_i$ targets the fully conditional posterior $p(\theta_i|\theta_{-i},x)$, where $-i$ means all subscripts other than $i$.

Gibbs sampling:

while we want more samples
    propose theta1 | theta2, theta3, ..., data
    accept/reject theta1
    propose theta2 | theta1, theta3, ..., data
    accept/reject theta2
    ...

In general, this is not obviously an improvement over proposing changes to all $\theta$ simultaneously.

But, something interesting happens if the fully conditional likelihood and prior are conjugate - then we know the conditional posterior exactly. If we use independent samples of the conditional posterior as proposals, then the Metropolis-Hastings acceptance ratio becomes

$\frac{p(y|\theta_{-i})Q(x|y,\theta_{-i})}{p(x|\theta_{-i})Q(y|x,\theta_{-i})} = \frac{p(y|\theta_{-i})p(x|\theta_{-i})}{p(x|\theta_{-i})p(y|\theta_{-i})} = 1$

and every proposal is automatically accepted!

Conjugate Gibbs Sampling:

while we want more samples
    draw th1 from p(th1|th2,th3,...,data)
    draw th2 from p(th2|th1,th3,...,data)
    ...

Example: normal PDF

25 Metropolis iterations (left) vs. 25 Gibbs transitions (right)

Color goes blue$\rightarrow$red with time (step number)

Gibbs Sampling Pros:

  • No cycles "wasted" on rejected proposals
  • No pesky tuning of the proposal scale

Gibbs Sampling Cons:

  • Only works for conjugate or partially conjugate models (hence must choose conjugate priors)
  • Can still be slower than well tuned multi-parameter Metropolis updates (e.g. when degeneracies are strong)

There are some packages that can figure out what conjugate relationships exist based on a model definition, and do Gibbs sampling.

(See also our list of MCMC packages)

Example: normal likelihood

Suppose we have some data, $x_i$, that we're modelling as being generated by a common normal distribution.

$p(x_i|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]$

The model parameters are the mean, $\mu$, and variance, $\sigma^2$.

Consulting Wikipedia's table of conjugate priors, we see that we can Gibbs sample this posterior using two steps:

  • the fully conditional posterior for $\mu|\sigma^2,x$ is normal
  • the fully conditional posterior for $\sigma^2|\mu,x$ is scaled inverse $\chi^2$

Remember that using conjugacy limits our options for choosing priors!
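As a sketch, these two updates can be coded directly. This assumes the non-informative conjugate prior $p(\mu,\sigma^2)\propto 1/\sigma^2$ (one possible choice; other conjugate priors change the update formulas), under which $\mu|\sigma^2,x \sim \mathrm{Normal}(\bar{x},\sigma^2/n)$ and $\sigma^2|\mu,x$ is scaled inverse $\chi^2$ with $n$ degrees of freedom and scale $s^2=\frac{1}{n}\sum_i(x_i-\mu)^2$:

```python
import numpy as np

def gibbs_normal(x, nsamples=1000, rng=None):
    """Conjugate Gibbs sampler for the normal model, assuming the
    non-informative prior p(mu, sigma^2) ~ 1/sigma^2 (an assumption;
    other conjugate priors change these update formulas)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    xbar = np.mean(x)
    mu, var = xbar, np.var(x)  # starting point
    samples = np.empty((nsamples, 2))
    for i in range(nsamples):
        # mu | sigma^2, x is normal with mean xbar and variance sigma^2/n
        mu = rng.normal(xbar, np.sqrt(var / n))
        # sigma^2 | mu, x is scaled inverse chi^2(n, s^2):
        # draw g ~ chi^2_n and set sigma^2 = n s^2 / g
        s2 = np.mean((x - mu)**2)
        var = n * s2 / rng.chisquare(n)
        samples[i] = mu, var
    return samples

x = np.random.default_rng(42).normal(loc=1.0, scale=2.0, size=100)
chain = gibbs_normal(x, 2000)
```

Note that every iteration produces a new, automatically accepted sample; there is no proposal scale to tune.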

Slice sampling

A less specialized modification of Metropolis-Hastings is slice sampling. This method also reduces the need to manually tune a proposal scale, although it usually involves a rejection step.

Given a starting point, $\theta_0$, we draw $q$ uniformly from $[0, p(\theta_0)]$. This defines a slice, $\{\theta : p(\theta)\geq q\}$, and we sample the next $\theta$ uniformly within the slice.

How to do this without actually mapping out the PDF?

See also Neal 2003

start with position x
guess a width W
while we want more samples
    evaluate p = P(x)
    draw q from Uniform(0,p)
    choose L,R such that R-L=W and L<=x<=R
    while P(L) > q, set L = L - W
    while P(R) > q, set R = R + W
    loop forever
        draw y from Uniform(L,R)
        if P(y) < q,
            if y < x, set L = y
            otherwise, set R = y
        otherwise, break
    set x = y and record x
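One possible translation of this pseudocode into Python, working with $\ln P$ rather than $P$ (so the slice level is drawn as $\ln q = \ln P(x) + \ln U$ with $U\sim\mathrm{Uniform}(0,1)$):

```python
import numpy as np

def slice_sample(lnP, x0, width=1.0, nsamples=1000, rng=None):
    """1D slice sampler with stepping-out and shrinkage (Neal 2003),
    operating on log-probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    samples = np.empty(nsamples)
    for i in range(nsamples):
        lnq = lnP(x) + np.log(rng.uniform())  # slice level, in log
        L = x - width * rng.uniform()         # randomly place the bracket
        R = L + width                         # around the current point
        while lnP(L) > lnq:                   # step out until both ends
            L -= width                        # are outside the slice
        while lnP(R) > lnq:
            R += width
        while True:                           # sample within the bracket,
            y = rng.uniform(L, R)             # shrinking it on rejection
            if lnP(y) > lnq:
                break
            if y < x:
                L = y
            else:
                R = y
        x = y
        samples[i] = x
    return samples

# quick check: sample a standard normal (up to a constant, in log)
chain = slice_sample(lambda t: -0.5 * t**2, 0.0, width=2.0, nsamples=5000)
```

The width `W` still needs a guess, but a poor guess costs extra evaluations rather than a badly mixing chain.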

Slice Sampling Pros:

  • Tends to need fewer evaluations per move than Metropolis with a poorly tuned proposal scale

Slice Sampling Cons:

  • Tends to need more evaluations per move than Metropolis with a well tuned proposal scale

Slicing can be a nice option for "test chains" intended to explore the parameter space, find the high-posterior region, and roughly probe its covariance.

Accelerating Metropolis

The simple Metropolis implementation introduced in our Tutorial lesson can be improved on in a number of ways. Recall:

choose a starting point and a proposal scale
while we want more samples
    propose a new point from a scaled Gaussian centered on our current position
    accept/reject the new point

These strategies often boil down to cleverly, and automatically, tuning our proposal distribution.

Strictly speaking, tuning the proposal distribution on the fly breaks the rules of MCMC (specifically, detailed balance).

  • Technically, we should stop tuning after some time and throw away all steps before this point.
  • In practice, if we've actually managed to converge to the target distribution, subsequent updates to the proposal distribution will be negligible anyway.

Intelligent proposal lengths

A Gaussian proposal distribution makes the most likely proposals very near the current point. This is not necessarily efficient. We could instead

  • propose a step length from a heavier-tailed distribution
  • tune the scale of the proposed move based on the acceptances so far (our estimate of the width of the posterior)

A non-Gaussian proposal distribution to encourage larger steps:
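As an illustrative sketch of the first idea, a step drawn from a Student-$t$ distribution (the degrees of freedom, `df`, is a free tuning choice, not something specified above):

```python
import numpy as np

def heavy_tailed_step(d, scale, rng, df=3):
    """Draw a d-dimensional Metropolis step whose length has heavier
    tails than a Gaussian's, using a multivariate Student-t (one common
    choice; df is an assumed tuning parameter)."""
    # a t-distributed vector is a Gaussian divided by an independent
    # chi-squared factor; small chi-squared draws give occasional long steps
    g = rng.normal(size=d)
    u = rng.chisquare(df) / df
    return scale * g / np.sqrt(u)
```

Since this proposal is symmetric, the Metropolis acceptance rule is unchanged.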

Tuning the proposal basis

Parameter degeneracies are inconvenient.

Degeneracies can usually be reduced or eliminated by reparametrizing the model.

  • But beware - a nonlinear reparametrization does not leave your prior invariant - recall the transformation of PDFs.
  • Linear reparametrizations are equivalent to a change of basis for the parameter space. To gain the benefits of this, we merely need to change the way we propose steps (which can be done automatically).

Original


Parameter basis change


Proposal basis change


For a nice, elliptical PDF, the optimal basis diagonalizes the covariance of the parameters.

  • This doesn't mean we can only propose along the preferred directions.
  • It does mean that, in a given direction, we know the optimal length scale for the proposal.
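A sketch of a proposal built this way, assuming we already have chain samples from which to estimate the covariance (the $2.4/\sqrt{d}$ factor is a standard rule of thumb for Gaussian targets, not something derived above):

```python
import numpy as np

def adapted_proposal(chain_so_far, rng, scale=2.4):
    """Propose a Metropolis step in the eigenbasis of the sample
    covariance estimated from the chain so far. The scale/sqrt(d)
    factor is the usual rule of thumb for Gaussian targets (an
    assumption, not part of the notes above)."""
    d = chain_so_far.shape[1]
    cov = np.cov(chain_so_far, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)  # principal axes and their scales
    # independent components along each eigenvector, each scaled by the
    # corresponding standard deviation
    step = evecs @ (np.sqrt(np.maximum(evals, 0.0)) * rng.normal(size=d))
    return chain_so_far[-1] + (scale / np.sqrt(d)) * step
```

This is equivalent to drawing from a multivariate normal with covariance $(\mathrm{scale}^2/d)\,\hat{C}$; the eigendecomposition just makes the per-direction length scales explicit.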

Fast-slow decompositions

If changes to some parameters require much more costly calculations than others, straightforward diagonalization may not be optimal in practice.

  • We can choose to propose changes to fast parameters more often than slow parameters.
  • Given a hierarchy of speeds, we can also triangularize the covariance matrix - so the fastest parameter appears in every basis vector and the slowest only in one.
  • ... ad nauseam
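One concrete way to build such a triangular basis (a sketch, under the assumption that parameters are ordered slowest first) is via the Cholesky factor of the parameter covariance:

```python
import numpy as np

def fast_slow_basis(cov):
    """Return proposal basis vectors (the columns of the lower-triangular
    Cholesky factor of the covariance). Assumes parameters are ordered
    slowest first: the slowest parameter then appears only in the first
    basis vector, while the fastest appears in every one."""
    return np.linalg.cholesky(cov)

# example: slow parameter first; steps x -> x + B[:, j] * eps with
# eps ~ N(0,1) along randomly chosen columns reproduce this covariance,
# since B @ B.T = cov
cov = np.array([[4.0, 1.0],
                [1.0, 1.0]])
B = fast_slow_basis(cov)
```

Proposals along the later columns change only fast parameters, so most updates avoid the expensive recalculations.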

Original


Diagonalized covariance


Triangular fast-slow


Running multiple chains

Unless we know what to do a priori, the best information for how to tune proposal lengths and directions comes from running test chains started from different positions and using the information they build up about the posterior.

Parallel, communicating chains (e.g. using MPI) that tune themselves this way can vastly decrease the time to convergence compared with single chains.

Information about the principal axes shows up relatively quickly, and more quickly if we look at more than 1 chain.

Here's an example of adaptive Metropolis sampling of a highly degenerate 2D Gaussian ($\rho=0.99$). The start points are dispersed, and one happens to fall quite near the degenerate axis.

Case 1 has 4 chains that adapt independently of one another by diagonalizing the parameter covariance every 100 steps.

Case 2 is the same, except that the chains pool their knowledge of the parameter covariance when adapting.

Independent, adaptive chains

Communicating, adaptive chains

Tempering

Consider the function $[p(x)]^{1/T}$, where $p(x)$ is the target PDF.

  • We say a chain sampling this function has temperature $T$. For $T>1$, $p^{1/T}$ is smoothed out compared with $p$ (cf simulated annealing).
  • This allows chains to move among multiple peaks more easily.
  • Of course, we're only actually interested in $T=1$...

Parallel tempering

With parallel tempering, we run one chain with $T=1$ and several more chains with $T>1$. A modified Metropolis-Hastings update occasionally allows the chains to exchange positions, giving the $T=1$ chain a mechanism for sampling regions of parameter space it might otherwise have low probability of proposing. Samples from the $T=1$ chain can be used for inference.
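A minimal sketch of this scheme, with Gaussian within-chain Metropolis updates and occasional state swaps between adjacent temperatures (the temperature ladder, proposal scale, and swap frequency here are arbitrary illustrative choices):

```python
import numpy as np

def parallel_tempering(lnP, x0, temps=(1.0, 2.0, 4.0, 8.0),
                       nsteps=5000, prop_scale=0.5, swap_every=10, rng=None):
    """Parallel tempering sketch: one Metropolis chain per temperature,
    each sampling lnP/T, with occasional swaps between neighbouring
    temperatures. Returns the T=1 chain only."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(temps)
    x = [np.array(x0, dtype=float) for _ in range(K)]
    lp = [lnP(xi) for xi in x]
    cold = []
    for step in range(nsteps):
        for k in range(K):
            # within-chain Metropolis update at temperature temps[k]
            y = x[k] + prop_scale * np.sqrt(temps[k]) * rng.normal(size=x[k].size)
            lpy = lnP(y)
            if np.log(rng.uniform()) < (lpy - lp[k]) / temps[k]:
                x[k], lp[k] = y, lpy
        if step % swap_every == 0:
            # propose swapping the states of a random adjacent pair;
            # log acceptance = (1/T_k - 1/T_{k+1}) (lnP_{k+1} - lnP_k)
            k = rng.integers(K - 1)
            dbeta = 1.0/temps[k] - 1.0/temps[k+1]
            if np.log(rng.uniform()) < dbeta * (lp[k+1] - lp[k]):
                x[k], x[k+1] = x[k+1], x[k]
                lp[k], lp[k+1] = lp[k+1], lp[k]
        cold.append(x[0].copy())
    return np.array(cold)
```

Applied to a multimodal target like the eggbox below, the hot chains roam between peaks and the swaps let the $T=1$ chain inherit those moves.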

Exercise: think

Recall the ugly PDF features that motivated us, namely strong/nonlinear degeneracies and multiple modes. For each of the methods above, would you expect an improvement over standard Metropolis in these situations? Why, or why not?

Coping with multiple modes

Multiple, well separated posterior modes are a serious challenge for many samplers.

  • In general, the only way to discover that they exist is by exploring the parameter space with many widely dispersed chains.
  • To do inference, our chains need to be able to efficiently transition between modes - so far the most reliable general method we've seen for this is parallel tempering. (We'll look at others later.)

Bonus numerical exercise: play

Play around with sampling these 4 difficult PDFs (coded below):

  1. The Rosenbrock function (actually minus this; it's normally used for testing minimizers), illustrated above
  2. The eggbox function, illustrated above
  3. A spherical shell function
  4. A Gaussian PDF (suitable for Gibbs sampling)

First, see how well simple Metropolis works, or doesn't work. Then choose one of the speed-ups above and modify/replace the sampler to use it on one or more of these PDFs. Compare performance (burn-in length, acceptance rate, etc.) with simple Metropolis. (Apart from PDF 4, chances are that nothing you do will work brilliantly, but you should see a difference!)

For Gibbs sampling, interpret PDF 4 as a likelihood function and put some thought into how you define your priors. Otherwise, just take the given PDF to be the posterior distribution.


In [ ]:
import numpy as np

def Rosenbrock_lnP(x, y, a=1.0, b=100.0):
    if y < 0.0: return -np.inf
    return -( (a-x)**2 + b*(y-x**2)**2 )

def eggbox_lnP(x, y):
    return (2.0 + np.cos(0.5*x)*np.cos(0.5*y))**3

def sphshell_lnP(x, y, s=0.1):
    return -(np.sqrt(x**2+y**2) - 1.0)**2/(2.0*s**2)

gaussian_data = np.random.normal(size=20) # arbitrary data set

def gaussian_lnP(m, v):
    if v <= 0.0: return -np.inf
    return -0.5*np.sum((m-gaussian_data)**2)/v - 0.5*len(gaussian_data)*np.log(v)
