Basic Monte Carlo Sampling

Goals:

  • Review the motivation for doing MC sampling in inference, and the basic approaches used.
  • Get some hands-on experience with the workhorse Metropolis algorithm, and its strengths and weaknesses.

In [ ]:
exec(open('../code/montecarlo1.py').read()) # see code here for later demos
%matplotlib inline

References

Motivation: why sample?

So far, we've restricted ourselves to inferences with

  • exact solutions - rare!
  • low-dimensional parameter spaces - limiting!

In general, evaluating the posterior throughout the entire parameter space is too costly. We want to focus resources on mapping the posterior where it is non-tiny. Generating samples from the posterior itself automatically accomplishes this.

Sampling and numerical integration

Almost always, we are ultimately interested in integrals of the posterior, i.e. marginal distributions of parameters. The tasks of Monte Carlo sampling and Monte Carlo integration are essentially indistinguishable. (Similar machinery is useful for difficult optimization problems.)

The essence of MC integration:

$\int w(x)\,p(x)\,dx = \int w(x)\,dP(x) \approx \overline{w(x_i)}; ~ x_i\sim P$

i.e., if we can factor the integrand into a PDF and a weight, and sample from the PDF, then our integral becomes an average.

Simple Monte Carlo

A posterior is already naturally factored into a likelihood function and a prior PDF.

$P(\theta|x) \propto P(x|\theta)\,P(\theta)$

Applying this in the MC integration context leads to the Simple Monte Carlo algorithm:

while we want more samples
    draw theta from P(theta)
    compute weight = P(x|theta)
    store theta and weight

Obtaining marginal distribution(s) for $\theta$ then reduces to constructing weighted histograms of the samples.

Simple Monte Carlo

SMC is indeed simple (as long as the prior is simple to draw from), but if the priors are not very informative then it still wastes many likelihood evaluations where the posterior is small. However, refinements of this approach lead to some of the advanced algorithms we'll cover later.

For now, we'll focus on the most common methods, which use a weight function of unity (i.e. obtain draws directly from the posterior.)

But first, a bit more context re: random number generation.

Random number generation

Useful terms to know:

  • Random: predictable only in the sense of following a PDF

  • Pseudorandom: not random, but "unpredictable enough" for practical purposes. Various computer algorithms produce pseudorandom sequences that approximate the uniform distribution on [0,1).

  • Quasirandom: sequence that doesn't even pretend to be random, but does converge to a target PDF more quickly than a random or pseudorandom process would

Random number generation

Here we assume that we have a reliable source of uniform pseudorandom numbers, and want to turn these into samples of another PDF.

Two simple approaches are

  1. Inverse transformation
  2. Rejection sampling

Inverse transform

Recall the definition of the CDF (and it's inverse, $F^{-1}$, the quantile function),

$F(x) = P(X \leq x) = \int_{-\infty}^x p(x')\,dx'$

By this definition, quantiles of $X$ are uniformly distributed on [0,1]. If $F^{-1}$ is easy to evaluate, we can use this straightforwardly:

draw u from Uniform(0,1)
compute x = F_inverse(u)

Inverse transform

Example: exponential distribution

This distribution has $p(x)=\lambda e^{-\lambda x}$ and $F(x)=1-e^{-\lambda x}$ for $x\geq0$.

The quantile function is, therefore, $F^{-1}(P) = -\ln(1-P)/\lambda$.

Check it out:


In [ ]:
lam = 1.0
u = np.random.rand(1000)
x = -np.log(1-u)/lam
plt.rcParams['figure.figsize'] = (15.0,8.0)
# plots a histogram of these x's and the exponential PDF
inv_trans_demo(x, lam)

Rejection sampling

For this method, we need to define an envelope function which everywhere exceeds the target PDF, $p(x)$, and can be sampled. Let this be $A\,g(x)$ where $A$ is a scaling factor and $g(x)$ is a PDF we know.

Then the algorithm is

while we want more samples
    draw x from g(x)
    draw u from Uniform(0,1)
    if u <= p(x)/(A*g(x)), keep the sample x
    otherwise, reject x

Rejection sampling

Visually, this corresponds to drawing points that uniformly fill in the space under $p(x)$.

Markov Chain Monte Carlo

In general, there's a trade-off between efficiency and generality in sampling algorithms. We've seen already seen examples of the very specialized and efficient (independent sampling) and the very general but inefficient (SMC).

We'll cover a number of different algorithms later, but for now let's focus on one that is widely used and simple to implement: the Metropolis-Hastings algorithm, which is one example of Markov Chain Monte Carlo.

MCMC

A Markov Chain is a sequence where the $n$th entry depends explicitly on the $(n-1)$th, but not (explicitly) on previous steps (e.g. a random walk).

For our purposes, the chain will be a sequence of positions in parameter space.

Formalities of MCMC

Markov chains provide a powerful way to sample PDFs, provided that the transition kernel satisfies a few requirements

  • Detailed balance: any transition must be reversible; the probability of being at $x$ and moving to $y$ must equal the probability of being at $y$ and moving to $x$
  • Ergodicity: the process may not be periodic, but it nevertheless must be possible to return to a given state in finite time
  • It must be possible, in principle, to reach any state with non-zero prior probability

Practicalities of MCMC

Consequences:

  • Samples are correlated
    • We can keep only every $N$th sample without losing much information
  • Chains take time to converge to the target distribution, depending on where they start

    • We need to verify convergence and throw away this "burn-in" period

    We'll spend more time looking at these features later.

Metropolis-Hastings

This algorithm can be thought of as an MCMC adaptation of rejection sampling. We need to define

  1. An initial state (parameter values)
  2. A proposal distribution, $Q(y|x)$, giving the probability that we attempt to move to $y$ from $x$.

Metropolis-Hastings

Let $P$ be the distribution we want to sample. The algorithm is then

set x to an initial state
compute P(x)
while we want more samples
    draw y from Q(y|x)
    compute P(y)
    draw u from Uniform(0,1)
    if u <= P(y)/P(x) * Q(x|y)/Q(y|x), set x=y
    otherwise, x stays the same
    store x as a sample

Metropolis-Hastings

Notice that the accept/reject step depended on the acceptance ratio

$\frac{P(y)Q(x|y)}{P(x)Q(y|x)}$, where

  • $P(x)$ is the posterior density (probability of being at $x$, if we've sampling $P$ properly)

  • $Q(y|x)$ is the proposal distribution (probability of attempting a move to $y$ from $x$)

This construction allows us to satisfy detailed balance.

Metropolis-Hastings

Note that even if a step is rejected, we still keep a sample (the original state, without moving). The difficulty of finding a temptingly better point is important information!

It has been claimed that an acceptance fraction of $\sim25\%$ is optimal.

Metropolis

If the proposal distribution is translation invariant, $Q(y|x)=Q\left(\left|y-x\right|\right)$, then it drops out of the acceptance ratio that decides whether to accept a step. This is the original Metropolis algorithm, and is the easiest case to implement.

In this case, we always accept a jump to higher $P$, and sometimes accept one to lower $P$.

Metropolis implementation

This is so important to understand that we'll go through an example and exercise in detail in this sandbox notebook.

MCMC take-aways

We need to be careful about

  • correlation - we care about not the total number of samples, but the number of independent samples
  • convergence - we need to verify that chains have converged to the posterior distribution

To illustrate, we have 4 chains from the sandbox with different starting points and width=0.1.

Convergence tests

  • Inspection! There is no substitute.
  • Gelman-Rubin statistic

Convergence tests - inspection

Convergence tests - inspection

How stationary does each sequence appear?

Convergence tests - inspection

Conservatively, we might remove the first $\sim2000$ steps based on this.

Convergence tests - Gelman-Rubin statistic

This approach tests the similarlity of independent chains intended to sample the same PDF. To be meaningful, they should start from different locations and burn-in should be removed.

For a given parameter, $\theta$, the $R$ statistic compares the variance across chains with the variance within a chain. Intuitively, if the chains are random-walking in very different places, i.e. not sampling the same distribution, $R$ will be large.

We'd like to see $R\approx 1$ (e.g. $R<1.1$ is often used).

Convergence tests - Gelman-Rubin statistic

In detail, given chains $J=1,\ldots,m$, each of length $n$,

  • Let $B=\frac{n}{m-1} \sum_j \left(\bar{\theta}_j - \bar{\theta}\right)^2$, where $\bar{\theta_j}$ is the average $\theta$ for chain $j$ and $\bar{\theta}$ is the global average. This is proportional to the variance of the individual-chain averages for $\theta$.

  • Let $W=\frac{1}{m}\sum_j s_j^2$, where $s_j^2$ is the estimated variance of $\theta$ within chain $j$. This is the average of the individual-chain variances for $\theta$.

  • Let $V=\frac{n-1}{n}W + \frac{1}{n}B$. This is an estimate for the overall variance of $\theta$.

Convergence tests - Gelman-Rubin statistic

Finally, $R=\sqrt{\frac{V}{W}}$.

Note that this calculation can also be used to track convergence of combinations of parameters, or anything else derived from them.

Correlation tests

  • Inspection! Again, no substitute.
  • Autocorrelation of parameters

Correlation tests - inspection

Do subsequent samples look particularly independent?

Correlation tests - inspection

The autocorrelation of a sequence, as a function of lag, $k$, is defined thusly:

$\rho_k = \frac{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)\left(\theta_{i+k} - \bar{\theta}\right)}{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)^2} = \frac{\mathrm{Cov}_i\left(\theta_i,\theta_{i+k}\right)}{\mathrm{Var}(\theta)}$

The larger lag one needs to get a small autocorrelation, the less informative individual samples are.

The pandas function autocorrelation_plot() may be useful for this.

Correlation tests

Correlation tests

We would be justified in thinning the chains by a factor of $\sim200$, apparently!

Bonus numerical exercise: SMC

Implement the Simple Monte Carlo algorithm and use it to sample a toy PDF - perhaps the posterior from the linear model in the sandbox.


In [ ]:

Bonus numerical exercise: rejection sampling

Implement a rejection sampler for the example figure shown above. For this example,

  • $p(x)$ is the $\chi^2$ distribution with 3 degrees of freedom

  • $A=\pi$

  • $g(x)$ is a normal distribution with mean 0 and standard deviation 5

Use a different envelope function if you wish. Verify that your samples do indeed approximate the target PDF.


In [ ]:

Bonus brain exercise: Metropolis

Hypothetical: a collaborator of yours is having trouble with the fact that rejected steps lead to repeated samples in the Metropolis algorithm. They maintain that just recording the new parameter values whenever a step is accepted ought to be good enough.

Without resorting to "mathematical proofs" or "reliable references", and without doing any actual sampling, can you come up with a toy problem that convincingly shows that your collaborator's approach is wrong? The simpler the better!

Bonus numerical making-your-life-easier exercise: convergence

Write some code to perform the Gelman-Rubin convergence test. Try it out on

  1. multiple chains from the sandbox notebook. Fiddle with the sampler to get chains that do/do not display nice convergence after e.g. 5000 steps.

  2. multiple "chains" produced from independent sampling, e.g. from the inverse-transform or rejection examples above or one of the examples in previous chunks.

You'll be expected to test convergence from now on, so having a function to do so will be helpful.

MegaBonus: modify your code to compute the $R$ statistic for the eigenvectors of the covariance of the posterior (yikes). This can be informative when there are strong parameter degeneracies, as in the convergence example above. The eigenvectors can be estimated efficiently from a chain using singular value decomposition.


In [ ]: