Gaussian approximations and model comparison

The previous notebooks have introduced a variety of methods for performing parameter inference using complex models. All these methods depend on Bayes' Theorem to estimate something proportional to the posterior probability density.

We've also covered a variety of ways for searching complex posteriors, ranging from simple Monte Carlo approaches like importance sampling and GLUE, to basic MCMC (e.g. Metropolis) and some more sophisticated MCMC algorithms such as the affine invariant sampler in emcee.

So far, we've only focussed on inferring parameter distributions for a single model, but what if we have several different models that we'd like to compare? This notebook introduces some of the basic ideas behind Bayesian model comparison. At the same time, we'll take a step back from the earlier considerations of sampling schemes (MC, MCMC etc.) to take a look at the Gaussian approximation approach to model calibration.

The Gaussian approximation method is both simpler and more computationally efficient than the methods discussed in earlier notebooks. It also provides a very complete approach to the inference process, allowing both parameter & uncertainty estimation and model comparison (as we will see, some of the more complex methods introduced earlier can struggle when it comes to model comparison).

The Gaussian approximation method is only valid in particular circumstances, but you may be surprised by just how often you can get away with using it. Because of it's power and simplicity, this approach is well worth considering for an initial attempt at the inference process - especially if you're planning to compare multiple model structures.

Work in progress

Plan

Ability to approximate the posterior in the vicinity of the MAP by a multi-dimensional Gaussian
Justification for why this often works OK for large $N$ (link to Central Limit Theorem?)
Estimates of uncertainty from parameter covariance matrix (from Hessian?)
Introduce model comparison and the idea that the normalising constant from a parameter inference step becomes the model likelihood in the model comparison step
Describe how the Bayesian method automatically penalises "over-fitting"
Estimate the "evidence" for a model from determinant of Hessian
Remember to stress that for model comparison we need proper priors - see page 354 of Information Theory, Inference and Learning Algorithms
As long as the Gaussian approximation is appropriate, provides very efficient and comprehensive method of finding:
1. The optimum parameter set for each model under consideration
2. The uncertainty associated with the parameter optimum for each model
3. The relative likelihoods of each model
Only requires a single optimisation run; no very intensive numerical sampling (unless required for optimisation)
Good place to start!
How do we know if the Gaussian approximation is valid/appropriate?

Notes so far are based on Chapter 28 of David MacKay's superb book Information Theory, Inference and Learning Algorithms.

Still need to produce some working code

1. A simple Bayesian workflow

Suppose we have several different models for a particular process or system of interest. We have a single observed dataset that we will use to calibrate each of the models. Some of the models are more complex (e.g. have more poorly-constrained parameters) than others. We are interested in:

Calibrating each model. This involves identifying plausible parameter sets and associated credible intervals.
Comparing the models. Are the more complex models significantly better at simulating the data than the simple ones? Is the additional complexity "worth it"?
Making predictions. What is our best estimate for the next observation? Should we select the "best" model and use it, or can we integrate output from all the models, weighted according to their relative probability?

We have already considered the problem of model calibration (i.e. inferring posterior parameter distributions) in some detail. However, in the discussions so far, we have generally ignored the normalising constant in Bayes' equation by writing

$$posterior \propto likelihood \times prior$$

instead of the full version

$$posterior = \frac{likelihood \times prior}{normalising \; constant}$$

Recall from notebook 3 that the normalising constant is sometimes referred to as the "probability of the data" or the "evidence".

If we're just interested in parameter inference for a single model, it is OK to ignore the normalising constant. However, if we want to compare several competing models, this constant becomes important. Below, we'll consider a slightly more general case of model calibration where more than one model is available.

1.1. Parameter inference

If we assume that a particular model, $M_i$, is correct, we can write Bayes' equation as

$$P(\theta|D, M_i) = \frac{P(D|\theta, M_i)P(\theta|M_i)}{P(D|M_i)}$$

where $\theta$ is the vector of parameters for model $M_i$ and $D$ is the data (i.e. the observations) used for calibration.

If all of the model parameters, $\theta$, are continuous variables, we can re-write the denominator in this equation as

$$P(D|M_i) = \int_\theta{P(D|\theta, M_i)P(\theta|M_i) d\theta}$$

(See the end of notebook 1 for an explanation of where these equations come from).

To calibrate a single model to a dataset, we are primarily interested in finding two things:

Some kind of central estimate for an appropriate parameter set (e.g. the median, mean or "best" parameter set), and
An indication of the uncertainty associated with the estimate for each parameter.

There are many ways of achieving these goals, depending on the complexity of the problem. In the previous notebooks we have taken a very comprehensive approach, by using MC or MCMC techniques to estimate marginal posterior distributions for each parameter. From these distributions we can calculate a whole range of different statistics for the central estimate (e.g. the MAP) and the uncertainty bounds (e.g. a 95% credible interval). The truly Bayesian approach, however, is simply to report the distribution (i.e. a histogram) in its entirety.

As we have seen, though, estimating marginal posteriors is not always easy and can be computationally expensive. In some cases, simpler approaches may provide adequate answers much more quickly.

1.1.1. Gaussian approximation

For the simplest possible example of a one parameter model with an approximately Gaussian posterior distribution, we could simply report the mean and the variance to achieve the two aims listed above. Most real problems are more complex, but for some multi-parameter models it may still be reasonable to assume an approximately Gaussian posterior in the region around the most promising parameter set. In this case, the two aims listed above can be achieved by:

Finding the Maximum a posteriori (MAP) estimate. This is just the maximum of the posterior distribution, so finding it is an optimisation problem rather than a sampling problem, so it is likely to be faster (see examples in notebooks 2 and 6).
Obtain confidence intervals for each parameter by considering the local curvature of the posterior distribution in the region surrounding the MAP. This can be done by evaluating the Hessian, which can provide an estimate of the parameter covariance matrix.

As long as the posterior can be reasonably approximated by a multi-dimensional Gaussian in the vicinity of the MAP, reporting the location of the MAP and the parameter covariance matrix is simply the multi-dimensional equivalent of reporting the mean and variance in the 1D example mentioned above. This approach therefore provides a very efficient (though less generally applicable) method of model auto-calibration and uncertainty estimation.

Note also that, as the amount of data collected increases, the Gaussian approximation can often be surprisingly effective. (Link to the Central Limit Theorem?)

1.2. Model comparison

For a discrete set of models, $M_i$, we can write another version of Bayes' equation

$$P(M_i|D) = \frac{P(D|M_i)P(M_i)}{P(D)}$$

In this case, because the set of models is discrete, we can re-write the denominator as a sum, rather than as an integral like we did above

$$P(D) = \sum_iP(D|M_i)P(M_i)$$

As we have done previosuly, in many cases it is usual to ignore the "evidence" in this equation and just write

$$P(M_i|D) \propto P(D|M_i)P(M_i)$$

The prior probability for each model, $P(M_i)$, represents our initial belief about how likely model $M_i$ is to be correct. If we have no reason to prefer some models over others, $P(M_i)$ will be the same for all models (i.e. a uniform prior).

The likelihood term, $P(D|M_i)$, encapsulates what the data can tell us about the probability of $M_i$ being correct. Note from section 1.1 above that this term is exactly the same as the evidence (i.e. the denominator) in the form of Bayes' equation used for parameter inference. This is generally the case: the "evidence" from the parameter inference stage becomes a crucial component of Bayes' equation at the model comparison stage.

In previous notebooks, we have generally ignored the normalising constant in Bayes' equation when performing parameter inference. This is fine as long as you're only interested in evaluating parameter-related uncertainty for a single model. However, as you can see, it is necessary to include it if you're interested in comparing multiple models.

1.2.1 Occam's razor

Occam's razor embodies the intuitive principle that, if two models provide equally good explanations of the data, the simplest model should be preferred. This principle is automatically incorporated into Bayesian model comparison in a quantitative way, which is extremely useful. This is because simple models have less flexibility and therefore make a narrower range of predictions than complex models. Because the total volume under the probabiltiy distribution $P(D|M_i)$ must equal $1$, the probabiltiy density for more complex models tends to be more "spread out" (and therefore lower) than for simple models. If a simple model manages to match the data just as well, it is therefore likely to have a higher probability density in the region of credible parameter sets than a more complex model with a wider range of output. The simpler model will therefore "win" during model comparison.

1.2.2. Gaussian approximation

If the posterior can be well approximated by a Gaussian in the vicinity of the MAP (which, perhaps surprisingly, is often the case given enough data), then the evidence for model $M_i$, $P(D|M_i)$, can be estimated using the best-fit likelihood and prior and the determinant of the corresponding covariance matrix (which is obtained from the Hessian). See equation 28.10 in David MacKay's book for further details.

For the case where the Gaussian approximation is valid, a full Bayesian model evaluation can therefore be performed in a surprisingly efficient way, without requiring any complicated and computationally intensive numerical sampling (e.g. MCMC). First, for each model:

Find the MAP.
Use the Hessian evaluated at the MAP to find the parameter covariance matrix.

These two steps provide, for each model, an estimate of the "best" parameter set and an indication of the parameter-related uncertainty. Then, for model comparison:

Estimate the evidence for each model using the best-fit likelihood, $P(D|\theta_{MAP}, M_i)$, the best-fit prior, $P(\theta_{MAP}, M_i)$ and the determinant of the Hessian (from above).
Multiply the evidence by the prior for each model and use the result to rank the models according their probability. Predictions can be made by calculating a weighted sum of model predictions, with the weights based on the posetrior probabiltiy of each model.

Note that this workflow uses only a single optimisation step - no complex sampling algorithms are required. As long as the Gaussian approximation is valid, it is therefore an extremely efficient way to perform model calibration, uncertainty analysis and comparison.