Approximate Methods

Goals:

  • Appreciate the curse of dimensionality, and some of its consequences

  • See how the Laplace approximation can provide an approximate posterior PDF, in very convenient form

  • Gain a new appreciation for summary statistics, and see how ABC can provide an approximate posterior PDF despite the data likelihood never being computed

Big Data

This term refers to any data set whose volume, velocity or variety is sufficiently high that we need to fundamentally change the way we approach it.

For example: SDSS gave astronomy a big data problem, in that we needed web-based SQL queries to extract initial subsamples that we could work with. Before, we would just have downloaded the whole dataset.

Big Models

Big datasets ought to support more interesting models of complex phenomena such as galaxy evolution, allowing many more parameters to be fitted.

The problem is that posterior characterization gets more difficult - quickly - when the dimensionality of the model parameter space increases.

We saw this when we abandoned simple grids in favor of MCMC sampling, but sampling can only get us so far.

Exercise: Hypervolumes

Consider an $N$-dimensional hypersphere of radius $R$. Derive an expression for the fraction of its volume contained in a surface shell of thickness 1% of $R$. What is this volume fraction if $N=3$? How about $N=100$? $N=10,000$?

NB. $(1-x)^N \approx e^{-Nx}$ when $N$ is large and $x$ is small
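
If your expression for the shell volume fraction comes out as $1 - (1 - 0.01)^N$, here is a quick numerical check (a minimal Python sketch; the function name is just for illustration):

```python
def shell_fraction(N, f=0.01):
    # The volume of an N-dimensional ball scales as R^N, so the fraction of
    # the volume in an outer shell of fractional thickness f is 1 - (1 - f)^N.
    return 1.0 - (1.0 - f)**N

for N in [3, 100, 10_000]:
    print(N, shell_fraction(N))
```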

The Vastness of High-dimensional Space

Consequent problems:

  • Uniform priors are a bad idea

  • Prior sampling is an even worse idea

  • Separation distances (and other misfit functions) become difficult to distinguish: in high dimensions, nearly all points end up approximately equidistant from each other

These (and related) problems are often known as "the curse of dimensionality"

Solution 1: Don't Go There

"Dimensionality reduction" is an excellent practical way to avoid the curse.

  • The dimensionality of the data can be reduced by compressing the data into a (much) smaller set of "summary statistics"
  • The resulting compressed dataset may be modelable with a much smaller number of parameters.

Example: Cosmic Shear

  • The "data" are automatically-measured galaxy shapes, whose tiny distortions are due to weak gravitational lensing by a complex framework of large scale dark matter structure pervading the entire universe between those background galaxies and us
  • Rather than hierarchically model this structure, with cosmological hyper-parameters, we compress the shape data into "correlation function" summary statistics which can be directly predicted from a cosmological model
  • Issue: the exact form, and the multi-variate width, of the sampling distribution for the correlation function data is unknown (and dependent on cosmology)

Unknown Likelihood Function

What should we do if we don't know (or don't trust our model for) the sampling distribution of our data?

  1. Model it in a flexible way (for more on this, see the lesson on "model-free models")

  2. If we can generate mock data anyway - try ABC

Approximate Bayesian Computation

ABC is a family of sampling methods, all based on the idea that if we can generate a mock dataset that is close to the observed data, then the parameter values that generated it are approximately a plausible draw from the posterior PDF.

The closer we get, the better the approximation

We are in this situation when the model is too complex to enable a likelihood to be evaluated (or even written down).

Examples: Human ancestry given DNA sequence data. Galaxy evolution given survey data.

The Simplest ABC Algorithm

Suppose we have a dataset $d$, and a generative model $H$ with parameters $\theta$:

  • Draw $\theta$ from the prior PDF $P(\theta|H)$

  • Generate a mock dataset $d'$ using those parameter values, $d' \sim P(d|\theta,H)$

  • Compute the "distance" between the observed and mock datasets, $\rho(d,d')$

  • If the distance $\rho(d,d') < \epsilon$, store $\theta$ as a sample

  • Repeat until you have enough samples (a minimal sketch of this loop follows below)
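
A minimal sketch of this rejection loop in Python, for an illustrative toy problem (Gaussian data with unknown mean $\theta$, the sample mean as summary statistic, and a uniform prior; all data, names and settings here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "observed" data (purely illustrative): 100 Gaussian draws with unknown mean.
d_obs = rng.normal(loc=3.0, scale=1.0, size=100)

def summary(d):
    # Summary statistic S(d): here, just the sample mean.
    return np.mean(d)

def distance(s1, s2):
    # Distance between summaries, rho(S(d), S(d')).
    return abs(s1 - s2)

def generate_mock(theta, size=100):
    # Generative model H: Gaussian data with mean theta and unit scatter.
    return rng.normal(loc=theta, scale=1.0, size=size)

epsilon = 0.05
samples = []
while len(samples) < 1000:
    theta = rng.uniform(-10.0, 10.0)     # draw theta from the prior
    d_mock = generate_mock(theta)        # generate a mock dataset d'
    if distance(summary(d_obs), summary(d_mock)) < epsilon:
        samples.append(theta)            # accept as an approximate posterior sample

print(np.mean(samples), np.std(samples))  # approximate posterior mean and width
```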

ABC Notes

  • What does distance mean? We decide! (Note that $-2 \log L$ is "like" a distance.)

  • Typically it is not practical to compute the distance between datasets $\rho(d,d')$ directly. Instead, we first compress the data into a set of summary statistics $S(d)$, and accept samples for which $\rho(S(d),S(d')) < \epsilon$

  • ABC gives us samples from $P(\theta | \rho(d,d') < \epsilon, H)$

  • In the limit $\epsilon \to 0$, $P(\theta | \rho(d,d') < \epsilon, H) \to P(\theta | d, H)$ (provided the summary statistics are sufficient)

  • As $\epsilon \to \infty$, $P(\theta | \rho(d,d') < \epsilon, H) \to P(\theta | H)$: we just recover the prior

ABC Pitfalls and Extensions

  • If the summary statistics are not close to being sufficient statistics for $\theta$, the approximate posterior is too broad

  • With too many summary statistics, accuracy and stability also degrade

  • If $\epsilon$ is not set well, sampling can be either too inefficient (if $\epsilon$ is too small) or not meaningful (if it is too large); adaptive schemes help: tempering, sequential sampling, etc.

  • How do you know when your approximate posterior is good? PPP, cross-validation

ABC Exercise

With your neighbor, jot down the steps involved in using ABC to approximately sample the posterior PDF for the slope and intercept parameters in a straight line model for one of the Cepheids datasets. Which summary statistics would you use? What distance metric would you use?

Solution 2: Go Straight There

Exploring high-dimensional parameter space is hard, unless you know where you are going.

  • Linear models with Gaussian likelihoods have log posterior surfaces that are quadratic in the parameter values: they are convex, and their peak can be found by (steepest/stochastic/conjugate/etc) gradient ascent (a minimal sketch follows this list)
  • Models that are close to linear may be similarly numerically optimizable
  • Issues: local maxima, log posterior evaluation CPU time (and its scaling)
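
As a minimal illustration of the first bullet, here is a sketch of fixed-step gradient ascent on a toy two-parameter quadratic log posterior (the numbers and names are made up for illustration):

```python
import numpy as np

# Toy quadratic log posterior: log P(theta|d) = -0.5 (theta - mu)^T Cinv (theta - mu) + const
mu = np.array([1.0, -2.0])
Cinv = np.linalg.inv(np.array([[2.0, 0.5],
                               [0.5, 1.0]]))

def grad_log_post(theta):
    # Gradient of the log posterior: -Cinv (theta - mu)
    return -Cinv @ (theta - mu)

theta = np.zeros(2)      # starting point
step = 0.1               # fixed step size
for _ in range(1000):
    theta = theta + step * grad_log_post(theta)   # move uphill on the log posterior

print(theta)             # converges to mu for this convex toy problem
```

In practice you would use a library optimizer rather than a hand-rolled loop; a sketch of that appears at the end of this lesson.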

Laplace Approximation

Once the peak of the log posterior has been found by numerical optimization, we can approximate its shape with a Gaussian even if the model is non-linear:

  • Taylor expand around the peak position $\hat{\theta}$ (where the first-derivative term vanishes):

$\;\;\;\;\;\log P(\theta|d) \approx \log P(\hat{\theta}|d) + \frac{1}{2} \frac{\partial^2 \log P}{\partial \theta^2} \bigg\rvert_{\theta=\hat{\theta}} (\theta - \hat{\theta})^2 + O[(\theta - \hat{\theta})^3]$

  • Ignore the higher order terms and exponentiate:

$\;\;\;\;\;P(\theta|d) \approx P(\hat{\theta}|d) \exp \left[ -\frac{1}{2} (\theta - \hat{\theta})^T H (\theta - \hat{\theta}) \right]$

where $H$ is the "Hessian" matrix of second derivatives of the negative log posterior, $H_{ij} = -\frac{\partial^2 \log P}{\partial \theta_i \partial \theta_j} \bigg\rvert_{\theta=\hat{\theta}}$, which is positive definite at the peak. This is proportional to a multivariate Gaussian, with covariance matrix $C = H^{-1}$
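
For example (a standard one-parameter illustration, not tied to any dataset above): a Poisson likelihood with $n$ observed counts and a uniform prior has $\log P(\lambda|d) = n \log \lambda - \lambda + {\rm const}$, which peaks at $\hat{\lambda} = n$, with

$\;\;\;\;\;H = -\frac{\partial^2 \log P}{\partial \lambda^2} \bigg\rvert_{\lambda=\hat{\lambda}} = \frac{n}{\hat{\lambda}^2} = \frac{1}{n}, \;\;\;\; C = H^{-1} = n$

so the Laplace approximation is a Gaussian with mean $n$ and variance $n$.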

The Laplace Approximation in Practice

  • Ask your optimizer (which minimizes the negative log posterior) to return the inv_hessian, and assign that as your covariance matrix (a minimal sketch follows this list).
  • Note that the approximating Gaussian can be integrated analytically: you can use the Laplace approximation to compute approximate Evidences
  • The Laplace approximation may not be very accurate: but it often provides a good starting point for a sampler
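
A minimal sketch of this workflow using scipy.optimize.minimize with the BFGS method, whose hess_inv output approximates the inverse Hessian of the function being minimized (the quadratic toy posterior here is purely illustrative, so the Laplace approximation happens to be exact):

```python
import numpy as np
from scipy.optimize import minimize

# Toy unnormalized log posterior (quadratic in theta, for illustration only).
mu = np.array([1.0, -2.0])
Cinv = np.linalg.inv(np.array([[2.0, 0.5],
                               [0.5, 1.0]]))

def neg_log_post(theta):
    # The optimizer minimizes, so hand it the *negative* log posterior.
    r = theta - mu
    return 0.5 * r @ Cinv @ r

result = minimize(neg_log_post, x0=np.zeros(2), method="BFGS")

theta_hat = result.x       # peak (MAP estimate)
C = result.hess_inv        # approximate inverse Hessian of -log P: the Laplace covariance

# Approximate log evidence from the Laplace approximation:
#   log Z ~ log P*(theta_hat) + (k/2) log(2 pi) + (1/2) log |C|,
# where P* is the unnormalized posterior (likelihood times prior).
k = len(theta_hat)
log_Z = -result.fun + 0.5 * k * np.log(2.0 * np.pi) + 0.5 * np.log(np.linalg.det(C))

print(theta_hat, C, log_Z)
```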

Good luck out there!