What would make us confident of convergence?
How do we guess the number of independent samples?
There are numerical estimates that can help with this, but they are not a substitute for human visual inspection.
This approach tests the similarity of independent chains intended to sample the same PDF. To be meaningful, the chains should start from different locations, and burn-in should be removed.
For a given parameter, $\theta$, the $R$ statistic compares the variance across chains with the variance within a chain. Intuitively, if the chains are random-walking in very different places, i.e. not sampling the same distribution, $R$ will be large.
We'd like to see $R\approx 1$ (e.g. $R<1.1$ is often used).
In detail, given chains $j=1,\ldots,m$, each of length $n$,
Let $B=\frac{n}{m-1} \sum_j \left(\bar{\theta}_j - \bar{\theta}\right)^2$, where $\bar{\theta}_j$ is the average $\theta$ for chain $j$ and $\bar{\theta}$ is the global average. This is proportional to the variance of the individual-chain averages for $\theta$.
Let $W=\frac{1}{m}\sum_j s_j^2$, where $s_j^2$ is the estimated variance of $\theta$ within chain $j$. This is the average of the individual-chain variances for $\theta$.
Let $V=\frac{n-1}{n}W + \frac{1}{n}B$. This is an estimate for the overall variance of $\theta$.
Finally, $R=\sqrt{\frac{V}{W}}$.
Note that this calculation can also be used to track convergence of combinations of parameters, or anything else derived from them.
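As a concrete illustration (not the official solution to the exercise below), here is a minimal sketch of the $R$ calculation in Python, assuming the chains for a single parameter are stored as an $(m,n)$ array with burn-in already removed; the function name and array layout are just illustrative choices.

```python
import numpy as np

def gelman_rubin_R(chains):
    """R for one parameter, given an (m, n) array of m chains of length n
    (burn-in already removed). Name and layout are illustrative."""
    chains = np.atleast_2d(chains)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)                    # theta-bar_j for each chain
    B = n / (m - 1) * np.sum((chain_means - chain_means.mean())**2)
    W = np.mean(chains.var(axis=1, ddof=1))              # average within-chain variance s_j^2
    V = (n - 1) / n * W + B / n                          # pooled variance estimate
    return np.sqrt(V / W)
```

For chains that really are sampling the same distribution, this should return something close to 1.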
The autocorrelation of a sequence (after removing burn-in), as a function of lag $k$, is defined as:
$\rho_k = \frac{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)\left(\theta_{i+k} - \bar{\theta}\right)}{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)^2} = \frac{\mathrm{Cov}_i\left(\theta_i,\theta_{i+k}\right)}{\mathrm{Var}(\theta)}$
The larger the lag needed for the autocorrelation to become small, the less informative the individual samples are.
The pandas function autocorrelation_plot() may be useful for this.
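For illustration, the definition above can be computed directly, and pandas can plot it as a function of lag. The `theta` array below is a deliberately correlated toy stand-in for your own post-burn-in chain.

```python
import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot

def autocorrelation(theta, k):
    """Autocorrelation at lag k of the 1D array `theta`,
    following the definition above."""
    if k == 0:
        return 1.0
    d = theta - theta.mean()
    return np.sum(d[:-k] * d[k:]) / np.sum(d[:-k]**2)

# toy stand-in: a random walk, which is strongly autocorrelated
theta = np.random.randn(5000).cumsum()

# pandas plots the autocorrelation as a function of lag
ax = autocorrelation_plot(pd.Series(theta))
ax.set_xlim(0, 500)   # zoom in on the lags of interest
```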
Note that the positive/negative oscillations tell us that the lag has become so large compared with the chain length that the autocorrelation estimate is too noisy to be meaningful.
We would be justified in thinning the chains by a factor of $\sim150$, apparently!
From $m$ chains of length $n$, we can estimate the effective number of independent samples as
$n_{eff} = \frac{mn}{1+2\sum_{t=1}^\infty \hat{\rho}_t}$, with
$\hat{\rho}_t = 1 - \frac{V_t}{2V}$, ($V$ as in the Gelman-Rubin calculation)
$V_t = \frac{1}{m(n-t)} \sum_{j=1}^m \sum_{i=t+1}^n (\theta_{i,j} - \theta_{i-t,j})^2$
In practice, the sum in $n_{eff}$ is cut off when the estimates $\hat{\rho}_t$ become too noisy (see references).
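A minimal sketch of this estimate, reusing $V$ from the Gelman-Rubin calculation and truncating the lag sum as soon as $\hat{\rho}_t$ goes negative (a cruder cutoff than the one described in the references), might look like this; as before, the $(m,n)$ array layout is an assumption.

```python
import numpy as np

def effective_samples(chains):
    """Effective number of independent samples from an (m, n) array of
    m chains of length n (burn-in removed), using the variogram estimate
    above. The lag sum stops when rho_hat first goes negative, a simple
    stand-in for the cutoff discussed in the references."""
    chains = np.atleast_2d(chains)
    m, n = chains.shape
    # V as in the Gelman-Rubin calculation
    chain_means = chains.mean(axis=1)
    B = n / (m - 1) * np.sum((chain_means - chain_means.mean())**2)
    W = np.mean(chains.var(axis=1, ddof=1))
    V = (n - 1) / n * W + B / n
    rho_sum = 0.0
    for t in range(1, n):
        Vt = np.sum((chains[:, t:] - chains[:, :-t])**2) / (m * (n - t))
        rho_t = 1.0 - Vt / (2.0 * V)
        if rho_t < 0.0:   # estimate has become too noisy; stop summing
            break
        rho_sum += rho_t
    return m * n / (1.0 + 2.0 * rho_sum)
```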
The example shown turns out to have $n_{eff} \sim 600$, compared with the $\sim 250$ samples we would have left if thinning by a factor of 150.
There are a handful of useful metrics for assessing convergence of chains and independence of samples. None is foolproof, and visual inspection is a critical check.
Extra reading: this short discussion provides some interesting context regarding how experts in the field (and the inventors of said metrics) approach the practical side of MCMC.
Write some code to perform the Gelman-Rubin convergence test. Try it out on
* multiple chains from the sandbox notebook; fiddle with the sampler to get chains that do/do not display nice convergence after e.g. 5000 steps;
* multiple "chains" produced from independent sampling, e.g. from the inverse-transform or rejection examples above or one of the examples in previous chunks.
You'll be expected to test convergence from now on, so having a function to do so will be helpful.
Modify your code from the exercises to compute $R$ and $n_{eff}$ for the eigenvectors of the covariance of the posterior, instead of for individual parameters themselves. This can be informative when there are strong parameter degeneracies, as in the example above. The eigenvectors can be estimated efficiently from a chain using singular value decomposition.
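One possible way to get the projections (an illustrative sketch, not the only approach): subtract the sample mean from the concatenated chain samples, take the SVD, and project onto the right singular vectors, which are the eigenvectors of the sample covariance. The projected coordinates can then be passed to the same $R$ and $n_{eff}$ functions as if they were parameters. The `samples` array below is a toy stand-in.

```python
import numpy as np

# toy stand-in for an (N, p) array of post-burn-in samples
# (all chains concatenated), with strong parameter degeneracy
samples = np.random.multivariate_normal([0.0, 0.0],
                                        [[1.0, 0.9], [0.9, 1.0]],
                                        size=10000)

# SVD of the mean-subtracted samples: rows of vt are the eigenvectors
# of the sample covariance (principal directions of the posterior)
deviations = samples - samples.mean(axis=0)
u, s, vt = np.linalg.svd(deviations, full_matrices=False)

# project onto the eigenvectors; each column of `projected` can now be
# fed to the R and n_eff functions in place of a raw parameter
projected = deviations @ vt.T
```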