[Image: MCMC chains from CosmoMC; credit Antony Lewis]

Quite often, repeated evaluation of the posterior function becomes a bottleneck. Any tweaks that speed up convergence and reduce autocorrelation can be worthwhile.
The simple methods introduced so far especially struggle with strongly degenerate (highly correlated) and multimodal posteriors.
Gibbs sampling is a specialization of Metropolis-Hastings:
Gibbs sampling:
while we want more samples
    propose theta1 | theta2, theta3, ..., data
    accept/reject theta1
    propose theta2 | theta1, theta3, ..., data
    accept/reject theta2
    ...
In general, this is not obviously an improvement over proposing changes to all the $\theta_i$ simultaneously.
But something interesting happens if the fully conditional likelihood and prior are conjugate: then we know the conditional posterior exactly. If we use independent draws from the conditional posterior as proposals, the Metropolis-Hastings acceptance ratio becomes
$\frac{p(y|\theta_{-i})Q(x|y,\theta_{-i})}{p(x|\theta_{-i})Q(y|x,\theta_{-i})} = \frac{p(y|\theta_{-i})p(x|\theta_{-i})}{p(x|\theta_{-i})p(y|\theta_{-i})} = 1$
and every proposal is automatically accepted!
Conjugate Gibbs Sampling:
while we want more samples
    draw th1 from p(th1|th2,th3,...,data)
    draw th2 from p(th2|th1,th3,...,data)
    ...
Gibbs Sampling Pros:
* every proposal is accepted, so no cycles are wasted on rejections
* there is no proposal scale to tune

Gibbs Sampling Cons:
* it requires conjugate (or at least easily sampled) conditional posteriors, which limits our choice of priors
* strongly correlated parameters can still make the chain move slowly
There are some packages that can figure out what conjugate relationships exist based on a model definition, and do Gibbs sampling.
(See also our list of MCMC packages)
As an example, consulting Wikipedia's list of conjugate relationships, we see that we can Gibbs sample a posterior like PDF 4 below (a Gaussian likelihood with unknown mean $m$ and variance $v$) using two steps:
* draw $m$ from its conditional posterior, which is Gaussian
* draw $v$ from its conditional posterior, which is inverse-gamma (equivalently, scaled inverse-$\chi^2$)
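To make this concrete, here is a minimal sketch of such a sampler, assuming a uniform prior on $m$ and an inverse-gamma prior on $v$ with hyperparameters alpha0 and beta0 (the function and parameter names here are our own, not from any package):

import numpy as np

def gibbs_gaussian(data, nsteps=10000, alpha0=1.0, beta0=1.0, seed=None):
    # Conjugate Gibbs sampling for a Gaussian model with unknown mean m and
    # variance v, assuming (our choice) a uniform prior on m and an
    # Inverse-Gamma(alpha0, beta0) prior on v.
    rng = np.random.default_rng(seed)
    n, ybar = len(data), np.mean(data)
    m, v = ybar, np.var(data)  # starting point
    samples = np.empty((nsteps, 2))
    for i in range(nsteps):
        # p(m | v, data) is Normal(ybar, v/n) under the uniform prior on m
        m = rng.normal(ybar, np.sqrt(v / n))
        # p(v | m, data) is Inverse-Gamma(alpha0 + n/2, beta0 + S/2)
        S = np.sum((data - m)**2)
        v = 1.0 / rng.gamma(alpha0 + 0.5*n, 1.0 / (beta0 + 0.5*S))
        samples[i] = m, v
    return samples

Note that there is no accept/reject step anywhere: every draw is kept.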
Remember that using conjugacy limits our options for choosing priors!
A less specialized modification of Metropolis-Hastings is slice sampling. This method also reduces the need to manually tune a proposal scale, although it usually involves a rejection step.
Given a starting point, $\theta_0$, we draw a value $q$ uniformly between 0 and $p(\theta_0)$. This defines a slice, the set of $\theta$ where $p(\theta)\geq q$, and we sample the next $\theta$ uniformly within the slice.
How to do this without actually mapping out the PDF?
See also Neal 2003
start with position x
evaluate p = P(x)
guess a width W
while we want more samples
    draw q from Uniform(0,p)
    choose L,R: R-L=W and L<=x<=R
    while P(L) > q, set L = L - W
    while P(R) > q, set R = R + W
    loop forever
        draw x1 from Uniform(L,R)
        if P(x1) < q,
            if x1 < x, set L = x1
            otherwise, set R = x1
        otherwise, break
    set x = x1, record x, and set p = P(x)
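In Python, a minimal sketch of this 1D stepping-out slice sampler (our own implementation; P is an unnormalized density, not a log-density) might look like:

import numpy as np

def slice_sample_1d(P, x0, W=1.0, nsteps=1000, seed=None):
    rng = np.random.default_rng(seed)
    x, p = x0, P(x0)
    samples = np.empty(nsteps)
    for i in range(nsteps):
        q = rng.uniform(0.0, p)          # slice level
        L = x - rng.uniform(0.0, W)      # random bracket of width W containing x
        R = L + W
        while P(L) > q: L -= W           # step out until both ends
        while P(R) > q: R += W           # are outside the slice
        while True:
            x1 = rng.uniform(L, R)
            if P(x1) < q:                # outside the slice: shrink the bracket
                if x1 < x: L = x1
                else: R = x1
            else:
                break                    # inside the slice: keep it
        x, p = x1, P(x1)
        samples[i] = x
    return samples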
Slice Sampling Pros:
* no proposal scale to tune by hand; the bracket adapts by stepping out and shrinking
* the initial guess for W affects efficiency, not correctness

Slice Sampling Cons:
* each recorded sample can cost several evaluations of the posterior
* in many dimensions, it is usually applied one parameter at a time, like Gibbs
Slicing can be a nice option for "test chains" intended to explore the parameter space, find the high-posterior region, and roughly probe its covariance.
The simple Metropolis implementation introduced in our Tutorial lesson can be improved on in a number of ways. Recall:
choose a starting point and a proposal scale
while we want more samples
    propose a new point from a scaled Gaussian centered on our current position
    accept/reject the new point
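In code, that recalled algorithm might look like the following minimal sketch (our own implementation; it works with 2D log-posteriors like those defined at the end of this notebook):

import numpy as np

def simple_metropolis(lnP, x0, scale=1.0, nsteps=10000, seed=None):
    # Random-walk Metropolis with an isotropic Gaussian proposal.
    # x0 should be a point where lnP is finite.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    lnp = lnP(*x)
    samples = np.empty((nsteps, len(x)))
    for i in range(nsteps):
        y = x + scale * rng.standard_normal(len(x))  # propose
        lnp_y = lnP(*y)
        if np.log(rng.uniform()) < lnp_y - lnp:      # Metropolis accept/reject
            x, lnp = y, lnp_y
        samples[i] = x
    return samples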
These strategies often boil down to cleverly, and automatically, tuning our proposal distribution.
Strictly speaking, tuning the proposal distribution on the fly breaks the rules of MCMC (specifically, detailed balance). In practice, we can either confine the adaptation to an initial period whose samples we then throw away (freezing the proposal afterwards), or let the amount of adaptation diminish as the chain grows.
A Gaussian proposal distribution makes the most likely proposals very near the current point. This is not necessarily efficient. We could instead draw the proposed step length from a distribution that peaks away from zero, or one with heavier tails that occasionally attempts large jumps.
Degeneracies can usually be reduced or eliminated by reparametrizing the model.
For a nice, elliptical PDF, the optimal basis diagonalizes the covariance of the parameters.
If changes to some parameters require much more costly calculations than others, straightforward diagonalization may not be optimal in practice.
Unless we know what to do a priori, the best information for how to tune proposal lengths and directions comes from running test chains started from different positions and using the information they build up about the posterior.
Parallel, communicating chains (e.g. using MPI) that tune themselves this way can vastly decrease the time to convergence compared with single chains.
Here's an example of adaptive Metropolis sampling of a highly degenerate 2D Gaussian ($\rho=0.99$). The start points are dispersed, and one happens to fall quite near the degenerate axis.
Case 1 has 4 chains that adapt independently of one another by diagonalizing the parameter covariance every 100 steps.
Case 2 is the same, except that the chains pool their knowledge of the parameter covariance when adapting.
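To make the adaptation step concrete, here is a minimal sketch (names are our own) of drawing a proposal from the diagonalized empirical covariance of the chain so far:

import numpy as np

def adapted_step(samples, rng):
    # Propose a Metropolis step along the principal axes of the empirical
    # parameter covariance of the chain so far (samples: nsteps x npars).
    C = np.cov(samples, rowvar=False)
    evals, evecs = np.linalg.eigh(C)         # diagonalize: eigenbasis of C
    scale = 2.4 / np.sqrt(samples.shape[1])  # rule-of-thumb scaling for Gaussian targets
    z = rng.standard_normal(len(evals))
    return scale * evecs @ (np.sqrt(evals) * z)

In the example above, the covariance would be recomputed only every 100 steps, and pooled across chains in Case 2.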
Consider the function $[p(x)]^{1/T}$, where $p(x)$ is the target PDF. For temperatures $T>1$, this is a flattened version of $p(x)$: its peaks are less pronounced, so a chain sampling it can move between modes more easily.
With parallel tempering, we run one chain with $T=1$ and several more chains with $T>1$. A modified Metropolis-Hastings update occasionally allows the chains to exchange positions, giving the $T=1$ chain a mechanism for sampling regions of parameter space it might otherwise have low probability of proposing. Samples from the $T=1$ chain can be used for inference.
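For reference, a proposed swap of positions $x_i$ and $x_j$ between chains at temperatures $T_i$ and $T_j$ preserves detailed balance if accepted with probability

$\alpha = \min\left[1,\ \left(\frac{p(x_j)}{p(x_i)}\right)^{1/T_i}\left(\frac{p(x_i)}{p(x_j)}\right)^{1/T_j}\right]$

which is just the product of the two chains' ordinary Metropolis ratios for the exchanged positions.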
Multiple, well separated posterior modes are a serious challenge for many samplers.
Play around with sampling these 4 difficult PDFs (coded below):
1. the Rosenbrock function, a banana-shaped degeneracy
2. an eggbox, with many well separated modes
3. a thin spherical shell
4. a Gaussian likelihood with unknown mean and variance
First, see how well simple Metropolis works, or doesn't work. Then choose one of the speed-ups above and modify/replace the sampler to use it on one or more of these PDFs. Compare performance (burn-in length, acceptance rate, etc.) with simple Metropolis. (Apart from PDF 4, chances are that nothing you do will work brilliantly, but you should see a difference!)
For Gibbs sampling, interpret PDF 4 as a likelihood function and put some thought into how you define your priors. Otherwise, just take the given PDF to be the posterior distribution.
In [ ]:
import numpy as np

def Rosenbrock_lnP(x, y, a=1.0, b=100.0):
    # a banana-shaped degeneracy (restricted to y >= 0)
    if y < 0.0: return -np.inf
    return -( (a-x)**2 + b*(y-x**2)**2 )

def eggbox_lnP(x, y):
    # many well separated modes
    return (2.0 + np.cos(0.5*x)*np.cos(0.5*y))**3

def sphshell_lnP(x, y, s=0.1):
    # a thin shell of radius 1 and width s
    return -(np.sqrt(x**2+y**2) - 1)**2/(2.0*s**2)

gaussian_data = np.random.normal(size=20) # arbitrary data set

def gaussian_lnP(m, v):
    # Gaussian likelihood with unknown mean m and variance v;
    # the n/2 log(v) normalization matters because v is a free parameter
    if v <= 0.0: return -np.inf
    return -0.5*np.sum((m-gaussian_data)**2)/v - 0.5*len(gaussian_data)*np.log(v)
In [ ]: