Detour into "less probabilistic" methods

The essence of Bayesian analysis is model building, but every so often we find a situation where we simply have no idea how to define a model, or are unwilling to do so. While there are fairly general model-building tools that can be used here (e.g. mixture models), it can also be useful to fall back on more empirical methods.

Bootstrap

The simple idea behind the bootstrap is that, since our data set is a draw from the underlying sampling distribution, the empirical distribution of the data can be used directly as an estimate of that sampling distribution. Typically, the thing we're not willing to model is an intrinsic scatter. The classical procedure, sketched in code after the list, is

  1. Generate a new data set of the same size as the real data by sampling with replacement from the real data points.
  2. Calculate whatever statistic or estimate is of interest from the bootstrap data set.
  3. Do this many times.
  4. Interpret the width of the distribution of these estimates as a fair guess at the uncertainty in your measurement. (However, the "best" value is generally still taken to be the estimate calculated on the real data, since the bootstrap has no claim to being unbiased.)
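
Here is a minimal sketch in Python (using numpy, with the median standing in as the statistic of interest; the function name and defaults are just for illustration):

```python
import numpy as np

def bootstrap_uncertainty(data, statistic=np.median, n_boot=10000, rng=None):
    """Bootstrap estimate of the uncertainty in `statistic` applied to `data`."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        # 1. Resample the data with replacement to make a bootstrap data set.
        fake = rng.choice(data, size=len(data), replace=True)
        # 2. Calculate the statistic of interest on the bootstrap data set.
        estimates[i] = statistic(fake)
    # 4. The spread of the bootstrap estimates stands in for the uncertainty;
    #    the "best" value is still the statistic computed from the real data.
    return statistic(data), np.std(estimates)
```

For example, `bootstrap_uncertainty(y)` would return the sample median of `y` along with a bootstrap guess at its uncertainty.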

There is a certain amount of contention as to what the "statistic or estimate" is allowed to be. The most common case, and the most robustly defended in the literature, is a simple function of the data, e.g. the mean or median, or an unweighted regression. When in doubt, remember that the validity of the bootstrap rests on your ability to say, with a straight face, "the measured values of $X$ that I'm bootstrapping are a fair representation of the underlying distribution of $X$."

Parametric bootstrap

In this variant, instead of resampling the rows of a data table, each data point is scattered according to its measurement error. This is often done in weighted regression problems, for example.
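
As a sketch of the idea, assuming Gaussian measurement errors on the $y$ values of a regression (the `fit_function` argument and the straight-line example below are just placeholders for whatever estimator you would actually use):

```python
import numpy as np

def parametric_bootstrap(x, y, sigma_y, fit_function, n_boot=1000, rng=None):
    """Scatter each y value by its (assumed Gaussian) error and refit each time."""
    rng = np.random.default_rng() if rng is None else rng
    params = []
    for _ in range(n_boot):
        # Perturb the measured values according to their quoted errors.
        y_fake = y + rng.normal(0.0, sigma_y)
        params.append(fit_function(x, y_fake, sigma_y))
    # One row of fitted parameters per bootstrap realization.
    return np.array(params)

# Example estimator: a weighted straight-line fit.
def weighted_line_fit(x, y, sigma_y):
    return np.polyfit(x, y, deg=1, w=1.0 / sigma_y)
```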

Bayesian bootstrap

Because the bootstrap interprets the data as a kernel estimate of the sampling distribution, in principle it can be fit into a Bayesian analysis. The most obvious route is to attach a weight to each data point encoding how "real" it is, with the weights summing to the number of data points. This is not widely done, since it's not obviously easier than the alternative of building a flexible hierarchical mixture model.
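
One way to realize this (a sketch, assuming flat Dirichlet weights, normalized here to sum to one, which is the same thing up to a factor of the sample size), for the humble mean:

```python
import numpy as np

def bayesian_bootstrap_mean(data, n_draws=10000, rng=None):
    """Draws of the weighted mean under flat Dirichlet weights on the data points."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    # Each row is a set of non-negative weights summing to 1.
    weights = rng.dirichlet(np.ones(len(data)), size=n_draws)
    # Weighted means, one per draw, interpreted as (approximate) posterior samples.
    return weights @ data
```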

Brain food: in what limit would the distribution of estimates in the simple bootstrap above correspond to a Bayesian posterior?

Jackknife

Similar to (but pre-dating) the bootstrap, the jackknife procedure (sketched in code after the list) is

  1. Remove 1 (or more) data points from the data set.
  2. Calculate the statistic or estimate using the reduced data set.
  3. Repeat this for every possible reduced data set, or at least a bunch of times.
  4. Interpret the distribution of results as an estimate of the uncertainty and bias (due to finite sample size) of the measurement.
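
A leave-one-out sketch, again with the median standing in for the statistic of interest, using the standard jackknife bias and variance formulas:

```python
import numpy as np

def jackknife(data, statistic=np.median):
    """Leave-one-out jackknife estimates of bias and uncertainty for `statistic`."""
    data = np.asarray(data)
    n = len(data)
    # 1-3. Compute the statistic on every possible leave-one-out data set.
    estimates = np.array([statistic(np.delete(data, i)) for i in range(n)])
    full = statistic(data)
    # 4. Standard jackknife estimates of the bias and uncertainty due to finite sample size.
    bias = (n - 1) * (estimates.mean() - full)
    sigma = np.sqrt((n - 1) / n * np.sum((estimates - estimates.mean())**2))
    return full, bias, sigma
```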

Note: as far as I can tell, our CMB colleagues use "jackknife" to refer to a different procedure.

Approximate Bayesian Computation (ABC)

The idea behind ABC is to provide a way forward for Bayesian analysis in cases where the likelihood function is too expensive to be practical or simply too difficult to write down. However, it does still use (and require) a generative model, which in principle contains the same information as the likelihood. To perform ABC, we need to be able to actually use the generative model, i.e. to create fake data sets using all the components of the model.

The simplest implementation (sketched in code after the list) is

  1. Generate a sample of the model parameters from the prior.
  2. For this set of parameters, generate a fake data set.
  3. Calculate some quantitative measure of the similarity of the fake and real data using a distance function.
  4. If this distance exceeds some tolerance, throw away the sample.
  5. Repeat this many times, and interpret the samples that survive as (approximately) coming from the posterior.
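
A sketch of this rejection scheme is below. Here `draw_from_prior`, `simulate` and `summarize` are placeholders for the generative model and summary statistics of your particular problem, and the distance is Euclidean; all of these choices are discussed next.

```python
import numpy as np

def abc_rejection(real_data, draw_from_prior, simulate, summarize,
                  tolerance, n_tries=100000, rng=None):
    """Rejection ABC: keep prior draws whose fake data resemble the real data."""
    rng = np.random.default_rng() if rng is None else rng
    s_real = np.asarray(summarize(real_data))
    kept = []
    for _ in range(n_tries):
        theta = draw_from_prior(rng)   # 1. sample parameters from the prior
        fake = simulate(theta, rng)    # 2. generate a fake data set from those parameters
        # 3. measure the (Euclidean) distance between summary statistics
        distance = np.linalg.norm(np.asarray(summarize(fake)) - s_real)
        # 4. keep only samples whose fake data land close enough to the real data
        if distance <= tolerance:
            kept.append(theta)
    # 5. the retained samples approximate draws from the posterior
    return np.array(kept)
```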

There are clearly some choices to be made here:

  1. Summary statistics of the data. Technically these are not necessary, and we could cook up a measure of distance between data sets in their full many-dimensional glory. However, often the information in a data set can be boiled down to a smaller set of numbers. (Think back to the ordinary least squares problem.) A set of statistics that accomplishes this without loss of information is called "sufficient".
  2. The distance function itself. Usually just Euclidean distance, after some suitable choice of summary statistics.
  3. The tolerance. Lower tolerance will get us closer to the true posterior, but at the expense of wasting many more samples.

The logic here is simple: by brute force, we're trying to generate a list of model parameter values that can produce a data set very like the one we have. By drawing from the prior to start with, then requiring samples to (almost) reproduce our data, we end up with samples whose density is proportional to the prior $\times$ likelihood. How efficient this is in practice ultimately depends on the choices above.

Hopefully, you can see a similarity between the procedure above and some of the stupider algorithms we've looked at for sampling posterior distributions. As you might guess, there are more intelligent ways to do this than drawing samples straight from the prior and rejecting most of them, and they look a lot like MCMC.