No one ever said the proposal density had to be Gaussian! Modified densities that discourage very short steps can speed up convergence.
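For instance, here is one hypothetical choice (a sketch, not a prescription from these notes): a symmetric 1D proposal whose density vanishes at zero step length, so that tiny moves become rare while the usual Metropolis acceptance rule still applies.

import numpy as np
import scipy.stats

def propose_step(s=1.0, rng=np.random.default_rng()):
    # |dx|/s follows a chi(3) density, proportional to r^2 exp(-r^2/2),
    # which goes to zero at r=0 and so discourages very short steps
    r = scipy.stats.chi.rvs(df=3, random_state=rng)
    # a random sign keeps the proposal symmetric, so the standard
    # Metropolis acceptance probability is unchanged
    return s * r * rng.choice([-1.0, 1.0])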
Typically, our proposal distribution is some multidimensional function that can be characterized by a set of basis vectors and associated scale lengths. For example, the covariance matrix of an N-dimensional Gaussian encodes an N-dimensional ellipsoid. Proposal adaptation uses the set of samples accepted so far to adjust the proposal basis, aligning it with our best guess at the posterior. The simplest implementation of this is to diagonalize the empirical covariance of the existing samples and propose along the resulting eigenvectors. Proposing along these directions navigates degeneracies in the posterior distribution much more efficiently.
One can also propose along directions that are not basis vectors, in which case having an estimate of the proposal covariance allows us to calculate the optimal scale length for a proposal in any given direction.
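As a concrete sketch (function names invented for illustration), assuming samples holds the chain so far as an (nsamples, ndim) array:

import numpy as np

def adapted_proposal_basis(samples):
    # diagonalize the empirical covariance of the samples accepted so far
    C = np.cov(samples, rowvar=False)
    scales2, vecs = np.linalg.eigh(C)   # eigenvalues and eigenvectors
    return np.sqrt(scales2), vecs.T     # 1-sigma lengths; unit vectors as rows

def propose(x, scales, directions, rng=np.random):
    # Gaussian step along one randomly chosen eigenvector
    i = rng.randint(len(scales))
    return x + rng.randn() * scales[i] * directions[i]

def scale_along(C, n):
    # 1-sigma width of a Gaussian with covariance C along the line through a
    # point in unit direction n, i.e. 1/sqrt(n^T C^-1 n) - one natural choice
    # of scale length for a proposal in an arbitrary direction
    return 1.0 / np.sqrt(n @ np.linalg.solve(C, n))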
Strictly speaking, on-the-fly adaptation breaks detailed balance, which is one of the requirements for MCMC to converge to the posterior. One should really run an initial chain, use that to improve the proposal distribution, and then run a second chain with that proposal distribution fixed, using only samples from the second chain to do inference. In practice, if our chain is really converging to the target distribution then at some point on-the-fly updates to the proposal distribution become negligible anyway, so using all samples after convergence is functionally the same. (But you've been warned!)
A more extreme variant is to adapt not just the proposal directions and lengths, but to actually build an approximation to the posterior out of basis functions and propose into that. Apart from the violation of detailed balance (same as above), this kind of scheme can work, as long as proposals anywhere in the parameter space remain a possibility - but there's always the danger of adapting too quickly, such that in practice an important part of parameter space is never explored. This is not highly recommended, but you may see it done occasionally. (Population Monte Carlo, covered later, is a mathematically sound and conceptually similar approach.)
The initial step of conditioning the proposal distribution is a bottleneck for on-the-fly adaptation. A single chain, potentially starting far from the posterior mode, will initially be highly correlated, and so it can take a long time for the proposal distribution to begin to resemble the posterior. Pooling the information from many chains, run in parallel from different starting points, can accumulate information about the posterior much more quickly (in both CPU and wall time). Like on-the-fly adaptation, this strategy is technically wrong, and also like on-the-fly adaptation it is frequently forgivable. The speed-up provided by these two modifications working together is, in a word, dramatic.
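In code, pooling is as simple as concatenating the chains before estimating the covariance (stand-in arrays here, feeding the hypothetical adaptation function sketched above):

import numpy as np

chains = [np.random.randn(500, 3) for _ in range(8)]  # stand-ins for real chains
pooled = np.concatenate(chains, axis=0)
C_pooled = np.cov(pooled, rowvar=False)  # use this to update the proposal basis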
Sometimes, the most expensive parts of our posterior calculation depend on only a subset of parameters. In these cases, it makes sense to sample the "fast" parameters more often than the "slow" ones, even if that means not proposing along the eigenvectors of the posterior degeneracy.
A clever scheme for mixing fast-slow decompositions with proposal adaptation involves triangularizing the posterior covariance rather than diagonalizing it. If we propose along vectors that form a triangular basis matrix, the fastest parameter can be varied in every proposal, no matter which vector it is along, while the slowest is varied only when proposing along one of the basis vectors. This can be refined even further by running a short chain over the fast parameter(s) for a given proposal of the slow parameter(s) and using one of these resulting sub-chain samples as the proposal which feeds into the Metropolis acceptance rule.
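A minimal sketch of the triangular scheme, assuming parameters are ordered from slowest (index 0) to fastest, with C our current estimate of the posterior covariance; the Cholesky factorization is one standard way to obtain a triangular basis:

import numpy as np

C = np.array([[1.0, 0.6, 0.3],   # parameter 0: slowest
              [0.6, 2.0, 0.5],
              [0.3, 0.5, 0.5]])  # parameter 2: fastest
L = np.linalg.cholesky(C)        # lower triangular, with C = L @ L.T
# Column j of L is nonzero only for parameters j and faster: every column
# moves the fastest parameter, but only column 0 touches the slowest one.
def propose(x, L, j, rng=np.random):
    # Gaussian step along the j-th (triangular) basis vector
    return x + rng.randn() * L[:, j]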
In a similar vein to fast-slow sampling, we sometimes know that certain parameter subsets can be sampled efficiently using a non-Metropolis method while the others are kept fixed. For example, we might be able to Gibbs sample$^1$ some of the parameters ($\theta_1$), conditional on the rest ($\theta_2$). In that case, Gibbs updates of $\theta_1$ could be alternated with Metropolis updates of $\theta_2$. Whether this is more efficient depends sensitively on what the model actually is. Sometimes the improvement is easily worth the effort, but in principle a mixed sampler could be less efficient than pure Metropolis.
$^1$ You'll learn about Gibbs sampling in this week's homework.
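As a minimal sketch of this alternation (a toy example, not from the notes): data $y \sim \mathrm{Normal}(\mu, \sigma^2)$ with flat priors on $\mu$ and $\ln\sigma$. The conditional posterior for $\mu$ given $\sigma$ is exactly Normal, so $\mu$ can be Gibbs sampled, while $\sigma$ gets a random-walk Metropolis update in $\ln\sigma$.

import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(3.0, 2.0, size=100)   # simulated data
n, ybar = len(y), y.mean()

mu, sigma = 0.0, 1.0
chain = []
for step in range(5000):
    # Gibbs update: mu | sigma, y is exactly Normal(ybar, sigma^2/n)
    mu = rng.normal(ybar, sigma / np.sqrt(n))
    # Metropolis update of sigma | mu, y (symmetric random walk in log sigma)
    def loglike(s):
        return -n * np.log(s) - 0.5 * np.sum((y - mu)**2) / s**2
    s_try = sigma * np.exp(0.1 * rng.normal())
    if np.log(rng.uniform()) < loglike(s_try) - loglike(sigma):
        sigma = s_try
    chain.append((mu, sigma))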
In [27]:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (6.0, 4.0)

xx = np.arange(-10., 10., 0.1)
# a standard Gaussian density...
plt.plot(xx, scipy.stats.norm.pdf(xx));
# ...and its square root, renormalized to unit integral on the grid
# (spacing 0.1): a broader Gaussian, with variance 2 instead of 1
yy = np.sqrt(scipy.stats.norm.pdf(xx))
yy = yy / (np.sum(yy) * 0.1)
plt.plot(xx, yy);
In Hamiltonian Monte Carlo (HMC), each parameter $\theta_i$ is paired with a fictitious momentum $\phi_i$, and the chain evolves under a Hamiltonian whose potential energy is $-\ln p(\theta)$ and whose kinetic energy is $K(\phi) = \sum_i \frac{\phi_i^2}{2m_i}$, with tunable "masses" $m_i$.
There's a lot more literature on HMC than we can cover - see e.g. this chapter. In a nutshell, the introduction of "momentum" into the evolution of the chain is supposed to reduce the autocorrelation, speeding convergence.
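A bare-bones sketch of HMC updates for a standard Gaussian target, taking unit masses ($m_i = 1$) and hand-picked step size and trajectory length (both of which would need tuning in practice):

import numpy as np

def U(x):
    # potential energy: minus the log-posterior, up to a constant
    return 0.5 * np.sum(x**2)

def gradU(x):
    return x

def hmc_step(x, eps=0.1, nsteps=20, rng=np.random):
    phi = rng.randn(*x.shape)              # draw fresh momenta
    H0 = U(x) + 0.5 * np.sum(phi**2)       # initial total energy
    # leapfrog integration: half kick, alternating drifts and kicks, half kick
    xnew = x.copy()
    p = phi - 0.5 * eps * gradU(xnew)
    for _ in range(nsteps - 1):
        xnew = xnew + eps * p
        p = p - eps * gradU(xnew)
    xnew = xnew + eps * p
    p = p - 0.5 * eps * gradU(xnew)
    H1 = U(xnew) + 0.5 * np.sum(p**2)      # final total energy
    # Metropolis acceptance based on the change in total energy
    return xnew if np.log(rng.rand()) < H0 - H1 else x

x = 5.0 + np.zeros(3)                      # start away from the mode
samples = []
for _ in range(2000):
    x = hmc_step(x)
    samples.append(x)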
For more detail, see the MultiNest paper.

An ensemble-based alternative is implemented in a Python package hilariously called emcee, which uses the evolution rule proposed by Goodman and Weare (2010). The algorithm for moving an individual point $X_k$ in the ensemble is quite simple:
1. Choose a different point $X_j$ from the ensemble at random.
2. Draw a stretch factor $z$ from the density $g(z) \propto 1/\sqrt{z}$ on the interval $[1/a, a]$ (conventionally $a = 2$).
3. Propose the move $X_k \rightarrow Y = X_j + z\,(X_k - X_j)$.
4. Accept the proposal with probability $\mathrm{min}\left[1, z^{N-1}\,p(Y)/p(X_k)\right]$, where $N$ is the dimensionality of the parameter space.
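A compact sketch of this update for a toy Gaussian target (invented names; this is not the emcee implementation itself):

import numpy as np

def log_post(x):
    # toy target: a standard bivariate Gaussian
    return -0.5 * np.sum(x**2)

def stretch_move(ensemble, logp, a=2.0, rng=np.random):
    nwalkers, ndim = ensemble.shape
    lnp = np.array([logp(x) for x in ensemble])
    for k in range(nwalkers):
        # choose a different walker at random
        j = rng.choice([i for i in range(nwalkers) if i != k])
        # inverse-CDF draw of z from g(z) ~ 1/sqrt(z) on [1/a, a]
        z = (1.0 + (a - 1.0) * rng.rand())**2 / a
        y = ensemble[j] + z * (ensemble[k] - ensemble[j])
        lnp_y = logp(y)
        # accept with probability min[1, z^(ndim-1) p(y)/p(x_k)]
        if np.log(rng.rand()) < (ndim - 1) * np.log(z) + lnp_y - lnp[k]:
            ensemble[k], lnp[k] = y, lnp_y
    return ensemble

ensemble = np.random.randn(10, 2) + 5.0    # start away from the mode
for _ in range(1000):
    ensemble = stretch_move(ensemble, log_post)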
The nice things about this approach include the ease of using the emcee package and the lack of forethought/tuning needed compared with non-super Metropolis. Points can also move more freely throughout highly nonlinear, degenerate regions compared with Metropolis-like steps, although jumps from one end of a banana to the other, or between well-separated modes, will still be unlikely. A downside is that convergence can be very slow if the ensemble is not started in a region of high probability.