Basic Monte Carlo Sampling

Goals:

  • Review the motivation for doing MC sampling in inference, and the basic approaches used.
  • Dig into Markov Chain Monte Carlo and the workhorse Metropolis-Hastings algorithm.

In [ ]:
exec(open('code/montecarlo1.py').read()) # see code here for later demos
%matplotlib inline

References

  • Gelman 1.9, 11.1-11.2
  • MacKay 29.1-29.4, 29.6
  • Fishman 3.3, 3.6, 6.1-6.4
  • Ivezic 5.8
  • Ross 10.2

Motivation: why sample?

So far, we've restricted ourselves to inferences with

  • exact solutions - rare!
  • low-dimensional parameter spaces - limiting!

In general, evaluating the posterior throughout the entire parameter space is too costly. We want to focus resources on mapping the posterior where it is non-tiny. Generating samples from the posterior itself automatically accomplishes this.

Sampling and numerical integration

Almost always, we are ultimately interested in integrals of the posterior, i.e. marginal distributions of parameters. The tasks of Monte Carlo sampling and Monte Carlo integration are essentially indistinguishable. (Similar machinery is useful for difficult optimization problems.)

The essence of MC integration:

$\int w(x)\,p(x)\,dx = \int w(x)\,dP(x) \approx \overline{w(x_i)}; ~ x_i\sim P$

i.e., if we can factor the integrand into a PDF and a weight, and sample from the PDF, then our integral becomes an average over the samples.
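
As a quick illustration (a sketch invented here, not part of the notes' code), consider $\int x^2\,N(x|0,1)\,dx=1$, with $w(x)=x^2$ and $P$ the standard normal:

In [ ]:
# Sketch: MC integration of int x^2 N(x|0,1) dx = 1 as an average over samples
import numpy as np
x = np.random.randn(100000)   # x_i ~ P = N(0,1)
print(np.mean(x**2))          # approximates the integral; ~1.00 up to MC noise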

In other words, given a list of samples of $\theta$ from $p(\theta)$,

  • the marginalized 1D posterior for $\theta_0$ is estimated by making a histogram of $\theta_0$ samples
  • the marginalized 2D posterior for $\theta_0,\theta_1$ is estimated from a 2D histogram of $\theta_0,\theta_1$ samples
  • statistics like the mean or percentiles of the posterior are estimated directly from the samples

All of these computations would be weighted if $w(\theta)\neq1$.
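
For concreteness, here is a minimal sketch with a toy 2D Gaussian standing in for a posterior (everything in it is made up for illustration):

In [ ]:
# Sketch: unit-weight "posterior" samples from a toy 2D Gaussian;
# the marginal for theta_0 is just a histogram of the first column
import numpy as np
import matplotlib.pyplot as plt
samples = np.random.multivariate_normal([0.0, 1.0], [[1.0, 0.5], [0.5, 2.0]], size=10000)
plt.hist(samples[:, 0], bins=50, density=True)                              # 1D marginal of theta_0
print(np.mean(samples[:, 0]), np.percentile(samples[:, 0], [16, 50, 84]))   # mean and percentiles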

Simple Monte Carlo

A posterior is already naturally factored into a likelihood function and a prior PDF.

$p(\theta|x) \propto p(x|\theta)\,p(\theta)$

Applying this in the MC integration context leads to the Simple Monte Carlo algorithm:

while we want more samples
    draw theta from p(theta)
    compute weight = p(x|theta)
    store theta and weight

Obtaining marginal distribution(s) for $\theta$ then reduces to constructing weighted histograms of the samples.
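
Here is a hedged sketch of SMC for a toy 1D problem invented for illustration (not the notebook's code): prior $\theta\sim N(0,5^2)$ and a Gaussian likelihood for a single datum $x=3$ with unit variance.

In [ ]:
# Sketch of Simple Monte Carlo for a toy 1D problem (made up for illustration):
# prior theta ~ N(0, 5^2); likelihood p(x|theta) = N(x|theta, 1) with one datum x = 3
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x_obs = 3.0
theta = np.random.normal(0.0, 5.0, size=100000)          # draw theta from the prior
weight = norm.pdf(x_obs, loc=theta, scale=1.0)           # weight = likelihood p(x|theta)
plt.hist(theta, bins=100, weights=weight, density=True)  # weighted histogram ~ posterior

The weighted histogram should peak near the analytic posterior mean, $\approx2.9$ in this toy case.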

SMC is indeed simple (as long as the prior is simple to draw from), but if the priors are not very informative then it still wastes many likelihood evaluations where the posterior is small. However, refinements of this approach lead to some of the advanced algorithms we'll cover later.

For now, we'll focus on the most common methods, which use a unit weight function (i.e. obtain draws directly from the posterior).

But first, a bit more context re: random number generation.

Random number generation

Useful terms to know:

  • Random: predictable only in the sense of following a PDF

  • Pseudorandom: not random, but "unpredictable enough" for practical purposes. Various computer algorithms produce pseudorandom sequences that approximate the uniform distribution on [0,1).

  • Quasirandom: sequence that doesn't even pretend to be random, but does converge to a target PDF more quickly than a random or pseudorandom process would

Here we assume that we have a reliable source of uniform pseudorandom numbers, and want to turn these into samples of another PDF.

Two simple approaches are

  1. Inverse transformation
  2. Rejection sampling

Inverse transform

Recall the definition of the CDF (and its inverse, $F^{-1}$, the quantile function),

$F(x) = P(X \leq x) = \int_{-\infty}^x p(x')\,dx'$

By this definition, the quantile $F(X)$ of a random draw $X$ is uniformly distributed on [0,1]. If $F^{-1}$ is easy to evaluate, we can use this straightforwardly:

draw u from Uniform(0,1)
compute x = F_inverse(u)
[Figure: a PDF and its CDF, illustrating the inverse transform]

Example: exponential distribution

This distribution has $p(x)=\lambda e^{-\lambda x}$ and $F(x)=1-e^{-\lambda x}$ for $x\geq0$.

The quantile function is, therefore, $F^{-1}(P) = -\ln(1-P)/\lambda$.

Check it out (see code):


In [ ]:
lam = 1.0
u = np.random.rand(1000)
x = -np.log(1-u)/lam
plt.rcParams['figure.figsize'] = (15.0,8.0)
# plots a histogram of these x's and the exponential PDF
inv_trans_demo(x, lam)

Rejection sampling

For this method, we need to define an envelope function which everywhere exceeds the target PDF, $p(x)$, and can be sampled. Let this be $Ag(x)$ where $A$ is a scaling factor and $g(x)$ is a PDF we know.

Then the algorithm is

while we want more samples
    draw x from g(x)
    draw u from Uniform(0,1)
    if u <= p(x)/(A*g(x)), keep the sample x
    otherwise, reject x

Visually, this corresponds to drawing points that uniformly fill in the space under $p(x)$.
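
As a minimal sketch with a toy target chosen here (deliberately different from the exercise at the end of these notes), take $p(x)=6x(1-x)$ on $[0,1]$, i.e. the Beta(2,2) PDF, with $g=$ Uniform(0,1) and $A=1.6$, so that $Ag(x)\geq\max p = 1.5$ everywhere:

In [ ]:
# Sketch of rejection sampling for a toy target (chosen for illustration):
# p(x) = 6 x (1-x) on [0,1] (the Beta(2,2) PDF); envelope g = Uniform(0,1) with A = 1.6
import numpy as np
import matplotlib.pyplot as plt
A = 1.6                                       # A*g(x) = 1.6 >= max p(x) = 1.5 on [0,1]
x = np.random.rand(100000)                    # draw x from g(x) = Uniform(0,1)
u = np.random.rand(100000)                    # draw u from Uniform(0,1)
keep = u <= 6.0 * x * (1.0 - x) / A           # keep x with probability p(x) / (A*g(x))
plt.hist(x[keep], bins=50, density=True)      # accepted samples approximate p(x)

The acceptance fraction is $1/A$ (here 62.5%), which is why an envelope that hugs the target closely is desirable.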

Markov Chain Monte Carlo

In general, there's a trade-off between efficiency and generality in sampling algorithms. We've already seen examples of the very specialized and efficient (independent sampling) and the very general but inefficient (SMC).

We'll cover a number of different algorithms later, but for now let's focus on one that is widely used and simple to implement: the Metropolis-Hastings algorithm, which is one example of Markov Chain Monte Carlo.

A Markov Chain is a sequence where the $n$th entry depends explicitly on the $(n-1)$th, but not (explicitly) on previous steps.

For us, the chain will be a random walk through parameter space.

Formalities of MCMC

Markov chains provide a powerful way to sample PDFs, provided that the transition kernel satisfies a few requirements

  • Detailed balance: any transition must be reversible; the probability of being at $x$ and moving to $y$ must equal the probability of being at $y$ and moving to $x$
  • Ergodicity: the process must not be periodic, yet it must be possible to return to any given state within a finite number of steps
  • It must be possible, in principle, to reach any state with non-zero prior probability

Why does it work?

The probability of an arbitrary point from such a chain being located at $y$ is (marginalizing over the possible immediately preceding points)

$p(y) = \int dx \, p(x) \, T(y|x)$

where $T(y|x)$ is the transition probability of a step from $x$ to $y$.

If we have detailed balance, $p(x)T(y|x) = p(y)T(x|y)$, then

$p(y) = \int dx \, p(y) \, T(x|y) = p(y) \int dx \, T(x|y) = p(y)$

since the transition kernel is normalized, $\int dx \, T(x|y) = 1$. This shows that $P$ is a stationary distribution of the chain; together with ergodicity, it follows that the chain converges to $P$.

Practicalities of MCMC

[Figure: 1st and 2nd 100 iterations of a chain (blue$\rightarrow$red with step number)]

Chains take time to converge to the target distribution.

Samples are also correlated with one another.

Consequences:

  • We need to verify convergence and throw away this "burn-in" period.
  • We can keep only every $N$th sample without losing much information.

We'll spend more time looking at these features later.

Metropolis-Hastings

This algorithm can be thought of as an MCMC adaptation of rejection sampling. We need to define

  1. An initial state (parameter values)
  2. A proposal distribution, $Q(y|x)$, giving the probability that we attempt to move from $x$ to $y$.

Let $P$ be the distribution we want to sample. The algorithm is then

set x to an initial state
compute p(x)
while we want more samples
    draw y from Q(y|x)
    compute p(y)
    draw u from Uniform(0,1)
    if u <= p(y)/p(x) * Q(x|y)/Q(y|x), set x=y
    otherwise, x stays the same
    store x as a sample
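
Here is a minimal sketch in Python, using a toy 1D target and a symmetric Gaussian proposal invented for illustration (so the $Q$ ratio is 1, the Metropolis special case discussed below); log densities are used for numerical stability:

In [ ]:
# Sketch of Metropolis(-Hastings) for a toy 1D target, with a symmetric Gaussian proposal
# (the Q(x|y)/Q(y|x) factor is then 1); work with log densities to avoid underflow
import numpy as np
def log_p(theta):                                     # toy log-posterior: standard normal
    return -0.5 * theta**2
x = 0.0                                               # initial state
log_px = log_p(x)
samples = []
for i in range(10000):
    y = x + np.random.normal(0.0, 1.0)                # draw y from Q(y|x) = N(x, 1)
    log_py = log_p(y)
    if np.log(np.random.rand()) <= log_py - log_px:   # accept with prob min[1, p(y)/p(x)]
        x, log_px = y, log_py                         # accepted: move to y
    samples.append(x)                                 # store x (repeated if the step was rejected)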

Notice that the probability of accepting a step (once it's proposed) is

$\alpha(y,x) = \mathrm{min}\left[1, \frac{p(y)Q(x|y)}{p(x)Q(y|x)}\right]$

Let's look again at the requirement of detailed balance

the probability of being at $x$ and moving to $y$ must equal the probability of being at $y$ and moving to $x$

The first of these is $p(x)Q(y|x)\alpha(y,x)$, where

  • $p(x)$ is the posterior density (probability of being at $x$, if we're sampling $P$ properly)

  • $Q(y|x)$ is the proposal distribution (probability of attempting a move to $y$ from $x$)

  • $\alpha(y,x)$ is the probability of accepting the proposed move

With this definition of $\alpha$, detailed balance is automatically satisfied!

$p(x)Q(y|x)\alpha(y,x) \equiv p(y)Q(x|y)\alpha(x,y)$

To see this, suppose $p(y)Q(x|y) \geq p(x)Q(y|x)$: then $\alpha(y,x)=1$ and $\alpha(x,y)=\frac{p(x)Q(y|x)}{p(y)Q(x|y)}$, so both sides equal $p(x)Q(y|x)$; the opposite case follows by symmetry.

Note that even if a step is rejected, we still keep a sample (the original state, repeated without moving). The difficulty of finding a better point to move to is itself important information!

It has been claimed that an acceptance fraction of $\sim25\%$ is optimal.

Metropolis

If the proposal distribution is symmetric, $Q(y|x)=Q(x|y)$ (e.g. a distribution that depends only on $\left|y-x\right|$), then it drops out of the acceptance ratio that decides whether to accept a step.

$\alpha(y,x) = \mathrm{min}\left[1, \frac{p(y)}{p(x)}\right]$

This is the original Metropolis algorithm, and is the easiest case to implement.

In this case, we always accept a jump to higher $p$, and sometimes accept one to lower $p$.

Take-aways

  • Monte Carlo sampling provides a way to characterize posterior distributions without exhaustively exploring the parameter space.

  • Markov Chain Monte Carlo methods are general enough to accommodate the full power of our Bayesian approach to inference, and are reasonably straightforward to implement.

  • Later on, we'll look at ways to make MCMC sampling more efficient without sacrificing this generality.

  • But first, we'll get some experience building and running basic MCMC algorithms, and assessing whether they have successfully converged to the posterior distribution.

Bonus numerical exercise: SMC

Implement the Simple Monte Carlo algorithm and use it to sample a toy PDF of your choice.


In [ ]:

Bonus numerical exercise: rejection sampling

Implement a rejection sampler corresponding to the example figure above that illustrates the method. For this example,

  • $p(x)$ is the $\chi^2$ distribution with 3 degrees of freedom

  • $A=\pi$

  • $g(x)$ is a normal distribution with mean 0 and standard deviation 5

Use a different envelope function if you wish. Verify that your samples do indeed approximate the target PDF.


In [ ]:

Bonus brain exercise: Metropolis

Hypothetical: a collaborator of yours is having trouble with the fact that rejected steps lead to repeated samples in the Metropolis algorithm. They maintain that just recording the new parameter values whenever a step is accepted ought to be good enough.

Without resorting to "mathematical proofs" or "reliable references", and without doing any actual sampling, can you come up with a toy problem that convincingly shows that your collaborator's approach is wrong? The simpler the better!