Importance Sampling and PMC

Recall that our inferences are usually based on integrals of the posterior, e.g. a marginal distribution:

$P(\Theta_0=\theta_0|D) = \int d\theta_1\,P(\Theta_0=\theta_0,\Theta_1=\theta_1|D)$,

where $D$ is our data, $\Theta$ are parameters and $\theta$ is a particular parameter value. In an MCMC analysis, we produce a chain in which the density of points is proportional to the posterior, and then approximate these marginals as Monte Carlo integrals,

$P(\Theta_0\approx\theta_0|D) = \int d\theta_1\,P(\Theta_0\approx\theta_0,\Theta_1=\theta_1|D) \approx \sum_{i:\,\theta_0^{(i)}\approx\theta_0} 1 =$ the number of samples falling in the $\Theta_0$ bin containing $\theta_0$, i.e. the height of an ordinary histogram.

Note that this is the simplest example of a Monte Carlo integral - replacing an integration by a weighted sum over samples randomly generated from a helpful distribution - where in this case the leftover weight is simply unity.
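For concreteness, here is a minimal sketch of the histogram-as-integral idea in Python. The `chain` array is fabricated stand-in data (draws from an arbitrary 2D Gaussian), not the output of any real analysis; in practice it would be the samples from your MCMC run.

```python
import numpy as np

# Fabricated stand-in for an MCMC chain: rows are samples, columns are
# (theta_0, theta_1). In practice this would come from your sampler.
rng = np.random.default_rng(42)
chain = rng.multivariate_normal(mean=[0.0, 0.0],
                                cov=[[1.0, 0.5], [0.5, 1.0]],
                                size=10000)

# The marginal posterior of theta_0 is approximated by simply histogramming
# the theta_0 column - a Monte Carlo integral with unit weights.
density, edges = np.histogram(chain[:, 0], bins=40, density=True)
```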

Suppose that, after doing our MCMC analysis, additional information comes along and we want to produce a combined inference. This could be in the form of new data or new prior information, but for concreteness, say that there is a new data set, $D_2$ (with the original data now denoted $D_1$). Assuming independence of the two data sets, the combined posterior is

$P(\Theta_0=\theta_0|D_1,D_2) \propto \int d\theta_1\,P(D_1|\Theta_0=\theta_0,\Theta_1=\theta_1)\,P(D_2|\Theta_0=\theta_0,\Theta_1=\theta_1)\,P(\Theta_0=\theta_0,\Theta_1=\theta_1)$.

We can either run a brand new chain incorporating both likelihoods, or we can notice that the samples from our first analysis can be re-weighted in the Monte Carlo integration above:

$P(\Theta_0\approx\theta_0|D_1,D_2) \propto \sum_{i:\,\theta_0^{(i)}\approx\theta_0} P(D_2|\Theta_0=\theta_0^{(i)},\Theta_1=\theta_1^{(i)})$,

where $i$ runs over the samples of the original chain. In other words, instead of counting each sample with unit weight, we weight it by the new likelihood evaluated at that sample's parameter values.
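Continuing the sketch above (again with fabricated stand-ins: the `chain` array and a hypothetical `log_like2` function playing the role of $\ln P(D_2|\theta)$), importance re-weighting amounts to nothing more than a weighted histogram:

```python
import numpy as np

# Stand-in chain from the D1-only analysis (see the previous sketch).
rng = np.random.default_rng(42)
chain = rng.multivariate_normal(mean=[0.0, 0.0],
                                cov=[[1.0, 0.5], [0.5, 1.0]],
                                size=10000)

def log_like2(theta):
    """Hypothetical new-data log-likelihood, ln P(D2 | theta)."""
    return -0.5 * ((theta[0] - 0.5) / 0.3) ** 2

# One weight per sample; subtract the max before exponentiating for stability.
log_w = np.apply_along_axis(log_like2, 1, chain)
w = np.exp(log_w - log_w.max())

# Weighted histogram ~ the combined marginal posterior of theta_0.
density, edges = np.histogram(chain[:, 0], bins=40, weights=w, density=True)

# A standard diagnostic: the effective sample size of the re-weighted chain.
n_eff = w.sum()**2 / (w**2).sum()
```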

Questions: ponder these for a couple minutes, and then we'll regroup.

  1. How would you expect the total computing time involved in importance sampling (running the initial chain and calculating the new weights) to compare to running a combined chain from scratch?
  2. Under what circumstances will this approach fail to provide a good approximation to the combined posterior? Are there circumstances where the scheme above is not even applicable? Can you think of remedies to salvage these situations?

Optimal or not, importance sampling does get used in practice, often as a relatively easy way for people without MCMC skillz to do a Bayesian analysis. Fortunately, you now have the tools to use whichever approach makes the most sense for a given problem.

Population Monte Carlo

The idea of importance sampling is at the core of another advanced technique called Population Monte Carlo (PMC). In PMC, an approximation to the posterior distribution, called the generating distribution, is iteratively produced, along with a number of samples. Each iteration consists of

  1. Selecting the generating distribution, $q$, and
  2. Sampling a number of points from the generating distribution and computing the corresponding posterior densities, $\pi$. These points can then be weighted by the ratio $\pi/q$ and treated as samples of the posterior.

Mathematically, this is straightforward enough, but clearly the selection of $q$ needs to be clever in order for the resulting points (weighted or not) to provide useful information about the posterior. The nice, if counterintuitive, thing about PMC is that the generating distribution can depend on the previous crops of points while still producing unbiased results, thanks to the explicit re-weighting. With a suitably clever implementation (not covered here!), the generating distribution can be evolved towards the posterior distribution, providing samples for a posteriori analysis as well as an estimate for the evidence.
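To make the loop concrete, here is a deliberately over-simplified sketch, not a real PMC implementation: serious codes use mixture generating distributions and more careful update rules. The target `log_posterior` is a made-up Gaussian, and $q$ is a single multivariate Gaussian re-fitted to each weighted population by moment matching.

```python
import numpy as np
from scipy import stats

def log_posterior(theta):
    """Made-up target density (stand-in for the real un-normalized posterior)."""
    return stats.multivariate_normal.logpdf(theta, mean=[1.0, -1.0], cov=np.eye(2))

rng = np.random.default_rng(1)
mean, cov = np.zeros(2), 4.0 * np.eye(2)   # initial guess for q

for iteration in range(5):
    # 1. Select the generating distribution q (here: a single refit Gaussian).
    q = stats.multivariate_normal(mean=mean, cov=cov)

    # 2. Sample from q, evaluate the posterior, and weight each point by pi/q.
    samples = q.rvs(size=2000, random_state=rng)
    log_w = log_posterior(samples) - q.logpdf(samples)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Update q from the weighted population (moment matching) for the next pass.
    mean = np.average(samples, axis=0, weights=w)
    cov = np.cov(samples.T, aweights=w) + 1e-6 * np.eye(2)
```

The final weighted population can be histogrammed just like the re-weighted chain earlier; and because the $\pi/q$ weights are computed explicitly, the mean of the raw (un-normalized) weights estimates the evidence when $\pi$ is the un-normalized likelihood times prior.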