MCMC Diagnostics

Goals:

  • Learn how to determine whether a Markov chain can reliably be used for inference

References

Diagnostics

To be useful to us, a chain must

  1. have converged to the posterior distribution
  2. provide enough effectively independent samples to characterize it

What would make us confident of convergence?

  • Is the chain stationary?
  • Do independent chains started from overdispersed positions find the same solution?

How do we guess the number of independent samples?

  • Check how well the chain appears to be exploring the distribution.
  • Compare the autocorrelation length scale with the chain length.

There are numerical estimates that can help with this, but they are not a substitute for human visual inspection.

Common misuses of/misconceptions about convergence

Convergence does not mean

  • that parameters are "well constrained" by the data
  • that the autocorrelation length is small
  • that the chain never makes occasional excursions outside the main locus in parameter space

Convergence tests

  • Inspection! There is no substitute.
  • Gelman-Rubin statistic
  • How stationary does each sequence appear?
  • Are all chains sampling the same PDF?

Conservatively, we might remove the first $\sim500$ steps as burn-in based on this inspection.
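
In code, removing burn-in is just a matter of slicing off the early steps. A minimal sketch, assuming the chains are stored as numpy arrays (the variable names below are hypothetical):

```python
import numpy as np

# Hypothetical list of chains, each an (n_steps, n_params) array.
chains = [np.random.randn(5000, 2) for _ in range(4)]

burn = 500  # number of initial steps to discard as burn-in
clean_chains = [chain[burn:] for chain in chains]
```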

Gelman-Rubin convergence statistic

This approach tests the similarity of independent chains intended to sample the same PDF. To be meaningful, the chains should be started from different (overdispersed) locations, and burn-in should be removed before the comparison.

For a given parameter, $\theta$, the $R$ statistic compares the variance across chains with the variance within a chain. Intuitively, if the chains are random-walking in very different places, i.e. not sampling the same distribution, $R$ will be large.

We'd like to see $R\approx 1$ (e.g. $R<1.1$ is often used).

In detail, given chains $j=1,\ldots,m$, each of length $n$,

  • Let $B=\frac{n}{m-1} \sum_j \left(\bar{\theta}_j - \bar{\theta}\right)^2$, where $\bar{\theta}_j$ is the average $\theta$ for chain $j$ and $\bar{\theta}$ is the global average. This is proportional to the variance of the individual-chain averages for $\theta$.

  • Let $W=\frac{1}{m}\sum_j s_j^2$, where $s_j^2$ is the estimated variance of $\theta$ within chain $j$. This is the average of the individual-chain variances for $\theta$.

  • Let $V=\frac{n-1}{n}W + \frac{1}{n}B$. This is an estimate for the overall variance of $\theta$.

Finally, $R=\sqrt{\frac{V}{W}}$.

Note that this calculation can also be used to track convergence of combinations of parameters, or anything else derived from them.
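
As a concrete illustration, here is a minimal sketch of the calculation in numpy, assuming the post-burn-in chains for a single parameter have been stacked into an $(m,n)$ array (the function name and array layout are just for illustration, not a prescribed interface):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R for one parameter; `chains` is an (m, n) array
    of m independent chains of length n, with burn-in already removed."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)                          # theta-bar_j for each chain
    grand_mean = chain_means.mean()                            # global theta-bar
    B = n / (m - 1.0) * np.sum((chain_means - grand_mean)**2)  # between-chain term
    W = np.mean(chains.var(axis=1, ddof=1))                    # mean within-chain variance
    V = (n - 1.0) / n * W + B / n                              # overall variance estimate
    return np.sqrt(V / W)

# e.g. four well-mixed "chains" of independent samples should give R close to 1:
# print(gelman_rubin(np.random.randn(4, 5000)))
```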

Correlation tests

  • Inspection! Again, no substitute.
  • Autocorrelation of parameters

Do successive samples appear to be independent?

The autocorrelation of a sequence (after removing burn-in), as a function of lag, $k$, is defined thusly:

$\rho_k = \frac{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)\left(\theta_{i+k} - \bar{\theta}\right)}{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)^2} = \frac{\mathrm{Cov}_i\left(\theta_i,\theta_{i+k}\right)}{\mathrm{Var}(\theta)}$

The larger lag one needs to get a small autocorrelation, the less informative individual samples are.

The pandas function autocorrelation_plot() (in pandas.plotting) may be useful for this.
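
For instance, here is a sketch of both the direct calculation and the pandas plot, using a stand-in array in place of a real post-burn-in chain of one parameter:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

# Stand-in for a post-burn-in chain of one parameter.
chain = np.random.randn(5000)

def autocorr(theta, k):
    """Sample autocorrelation at lag k >= 1, following the definition above."""
    d = np.asarray(theta) - np.mean(theta)
    return np.sum(d[:-k] * d[k:]) / np.sum(d[:-k]**2)

print(autocorr(chain, 10))

# pandas plots the autocorrelation as a function of lag
# (its normalization differs slightly from the formula above).
autocorrelation_plot(pd.Series(chain))
plt.show()
```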

Note that the positive/negative oscillations basically tell us when the lag is so large compared with the chain length that the autocorrelation is too noisy to be meaningful.

We would be justified in thinning the chains by a factor of $\sim150$, apparently!

Effective number of samples

From $m$ chains of length $n$, we can estimate the effective number of independent samples as

$n_{eff} = \frac{mn}{1+2\sum_{t=1}^\infty \hat{\rho}_t}$, with

$\hat{\rho}_t = 1 - \frac{V_t}{2V}$, ($V$ as in the Gelman-Rubin calculation)

$V_t = \frac{1}{m(n-t)} \sum_{j=1}^m \sum_{i=t+1}^n (\theta_{i,j} - \theta_{i-t,j})^2$

In practice, the sum in the expression for $n_{eff}$ is cut off when the estimates $\hat{\rho}_t$ become too noisy (see references).
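
Here is a sketch of this estimate in numpy, again assuming an $(m,n)$ array of post-burn-in chains for one parameter, and using the simple cutoff from the exercise below (stop when two successive $\hat{\rho}_t$ estimates are negative):

```python
import numpy as np

def effective_samples(chains):
    """Estimate n_eff from an (m, n) array of chains (burn-in removed)."""
    m, n = chains.shape
    # Overall variance estimate V, exactly as in the Gelman-Rubin calculation.
    chain_means = chains.mean(axis=1)
    B = n / (m - 1.0) * np.sum((chain_means - chain_means.mean())**2)
    W = np.mean(chains.var(axis=1, ddof=1))
    V = (n - 1.0) / n * W + B / n

    def rho_hat(t):
        # Variogram V_t and the corresponding autocorrelation estimate.
        Vt = np.mean((chains[:, t:] - chains[:, :-t])**2)
        return 1.0 - Vt / (2.0 * V)

    rho_sum = 0.0
    for t in range(1, n - 1):
        if rho_hat(t) < 0.0 and rho_hat(t + 1) < 0.0:
            break
        rho_sum += rho_hat(t)
    return m * n / (1.0 + 2.0 * rho_sum)

# For independent samples, this should come out close to m*n:
# print(effective_samples(np.random.randn(4, 5000)))
```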

The example shown turns out to have $n_{eff} \sim 600$, compared with the $\sim 250$ samples we would have left if thinning by a factor of 150.

Exercise: can we declare victory?

In each of the following trace plots, different colors show independent chains. For each, decide by inspection whether the following claims are plausible:

  1. The chains have converged to the posterior.
  2. The chains have a reasonable number of independent samples.

Take-away

There are a handful of useful metrics for assessing convergence of chains and independence of samples. None is foolproof, and visual inspection is a critical check.

Extra reading: this short discussion provides some interesting context regarding how experts in the field (and the inventors of said metrics) approach the practical side of MCMC.

Bonus numerical making-your-life-easier exercise: convergence

Write some code to perform the Gelman-Rubin convergence test. Try it out on

  1. multiple chains from the sandbox notebook. Fiddle with the sampler to get chains that do/do not display nice convergence after e.g. 5000 steps.

  2. multiple "chains" produced from independent sampling, e.g. from the inverse-transform or rejection examples above or one of the examples in previous chunks.

You'll be expected to test convergence from now on, so having a function to do so will be helpful.



Bonus numerical making-your-life-easier exercise: effective samples

Write code to compute the effective number of samples, as described above. Cut off the sum in the expression for $n_{eff}$ at the point where two successive values, $\hat{\rho}_t$ and $\hat{\rho}_{t+1}$, are negative. Try it out on the same two test cases.



Megabonus numerical exercise

Modify your code from the exercises to compute $R$ and $n_{eff}$ for the eigenvectors of the covariance of the posterior, instead of for individual parameters themselves. This can be informative when there are strong parameter degeneracies, as in the example above. The eigenvectors can be estimated efficiently from a chain using singular value decomposition.
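
As a hint for the linear-algebra step (a sketch; the variable names are illustrative), the eigenvectors can be obtained from an SVD of the mean-subtracted samples, and the chains projected onto them before computing $R$ and $n_{eff}$ as before:

```python
import numpy as np

# Stand-in for an (N, p) array of post-burn-in samples (all chains concatenated).
samples = np.random.randn(10000, 3)

centered = samples - samples.mean(axis=0)
# Rows of vt are the eigenvectors of the sample covariance of the parameters.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Coordinates of each sample along the eigenvectors; split these back into
# per-chain sequences and feed them to the R and n_eff calculations above.
projected = centered @ vt.T
```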

