Goals:
Appreciate the curse of dimensionality, and some of its consequences
See how the Laplace approximation can provide an approximate posterior PDF, in very convenient form
Gain a new appreciation for summary statistics, and see how ABC can provide an approximate posterior PDF despite the data likelihood never being computed
"Big data" refers to any data set whose volume, velocity or variety is sufficiently high that we need to fundamentally change the way we approach it.
For example: SDSS gave astronomy a big data problem, in that we needed web-based SQL queries to extract initial subsamples that we could work with; before, we would have just downloaded the whole dataset.
Big datasets ought to support more interesting models of complex phenomena such as galaxy evolution, with many more parameters to be fitted.
The problem is that posterior characterization gets more difficult - quickly - when the dimensionality of the model parameter space increases.
We saw this when we abandoned simple grids for MCMC samples - but sampling can only get us so far
Consider an $N$-dimensional hypersphere of radius $R$. Derive an expression for the fraction of its volume contained in a surface shell of thickness 1% of $R$. What is this volume fraction if $N=3$? How about $N=100$? $N=10,000$?
NB. $(1-x)^N \approx e^{-Nx}$ when $x$ is small
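A quick numerical check of this exercise (a minimal sketch, assuming the derived volume fraction is $f = 1 - (1 - 0.01)^N$, since hypersphere volume scales as $R^N$):

```python
import numpy as np

# Fraction of an N-dimensional hypersphere's volume lying in a surface
# shell of thickness 0.01*R: f = 1 - (1 - 0.01)**N, since volume ~ R**N.
for N in [3, 100, 10_000]:
    exact = 1.0 - (1.0 - 0.01)**N
    approx = 1.0 - np.exp(-N * 0.01)  # using (1-x)^N ~ exp(-Nx) for small x
    print(f"N = {N:6d}: exact = {exact:.4f}, approx = {approx:.4f}")
```

For $N = 10{,}000$ essentially all of the volume lies in the thin outer shell: this is the curse of dimensionality at work.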
"Dimensionality reduction" is an excellent practical way to avoid the curse.
What should we do if we don't know (or don't trust our model for) the sampling distribution of our data?
Model it in a flexible way (for more on this, see the lesson on "model-free models")
If we can generate mock data anyway - try ABC
ABC (approximate Bayesian computation) is a family of sampling methods, all based on the idea that if we can generate a mock dataset that is close to the observed data, then the parameter values used to generate it are approximately a plausible draw from the posterior PDF
The closer we get, the better the approximation
We are in this situation when the model is too complex for its likelihood to be evaluated (or even written down).
Examples: Human ancestry given DNA sequence data. Galaxy evolution given survey data.
Suppose we have a dataset $d$, and a generative model $H$ with parameters $\theta$ (a code sketch of this loop follows the list):
Draw $\theta$ from the prior PDF $P(\theta|H)$
Generate a mock dataset $d'$ from the model, using parameters $\theta$
Compute the "distance" between the observed and mock datasets, $\rho(d,d')$
If the distance $\rho(d,d') < \epsilon$, store $\theta$ as a sample
Repeat
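Here is a minimal sketch of this rejection loop in Python. The callables draw_from_prior, simulate, and distance are hypothetical placeholders for your prior draw, your mock-data generator, and your chosen distance $\rho$:

```python
import numpy as np

def abc_rejection(data, draw_from_prior, simulate, distance, epsilon, n_samples):
    """Basic ABC rejection sampler.

    draw_from_prior() -> theta : one draw from the prior P(theta|H)
    simulate(theta)   -> mock  : one mock dataset d' generated with theta
    distance(d, dp)   -> float : the chosen distance rho(d, d')
    """
    samples = []
    while len(samples) < n_samples:
        theta = draw_from_prior()           # 1. draw theta from the prior
        mock = simulate(theta)              # 2. generate a mock dataset d'
        if distance(data, mock) < epsilon:  # 3-4. keep theta if d' is close to d
            samples.append(theta)
    return np.array(samples)
```

In practice the acceptance rate can become very low as $\epsilon$ shrinks, which motivates the tempering and sequential schemes mentioned below.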
What does distance mean? We decide! (Note that $-2 \log L$ is "like" a distance.)
Typically it is not practical to compute the distance between full datasets $\rho(d,d')$ directly. Instead, we first compress the data into a set of summary statistics $S(d)$, and accept a sample only if $\rho(S(d),S(d')) < \epsilon$
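For example, a simple (and entirely hypothetical) choice, which could be passed as the distance function in the sketch above, is to compare a couple of summary statistics:

```python
import numpy as np

def summarize(d):
    # Hypothetical summary statistics: the sample mean and standard deviation.
    return np.array([np.mean(d), np.std(d)])

def distance(d, d_prime):
    # Distance between datasets = Euclidean distance between their summaries.
    return np.linalg.norm(summarize(d) - summarize(d_prime))
```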
ABC gives us samples from $P(\theta | \rho(d,d') < \epsilon, H)$
If $\epsilon = 0$, then $P(\theta | \rho(d,d') < \epsilon, H) = P(\theta | d, H)$
If $\epsilon = \infty$, then $P(\theta | \rho(d,d') < \epsilon, H) = P(\theta | H)$
If the summary statistics are not close to being sufficient statistics for $\theta$, the approximate posterior is too broad
With too many summary statistics, accuracy and stability also degrade
If $\epsilon$ is not set correctly, sampling can be either too inefficient or not meaningful; tempering, sequential sampling, and similar schemes can help
How do you know when your approximate posterior is good? PPP, cross-validation
Exploring a high-dimensional parameter space is hard, unless you know where you are going
Once the peak of the log posterior has been found by numerical optimization, we can approximate its shape with a Gaussian even if the model is non-linear:
$\;\;\;\;\;\log P(\theta|d) \approx \log P(\hat{\theta}|d) + \frac{1}{2} \frac{\partial^2 \log P}{\partial \theta^2} \bigg\rvert_{\theta=\hat{\theta}} (\theta - \hat{\theta})^2 + O[(\theta - \hat{\theta})^3]$
$\;\;\;\;\;P(\theta|d) \approx P(\hat{\theta}|d) \exp \left[ -\frac{1}{2} (\theta - \hat{\theta})^T H (\theta - \hat{\theta}) \right]$
where $H$ is the "Hessian" matrix of second derivatives of the negative log posterior, $H_{ij} = -\frac{\partial^2 \log P}{\partial \theta_i \partial \theta_j} \bigg\rvert_{\theta=\hat{\theta}}$. This is proportional to a multivariate Gaussian, with covariance matrix $C = H^{-1}$
Handily, many numerical optimizers can return an estimate of this inverse Hessian at the optimum: take that inv_hessian and assign it as your covariance matrix.
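As a minimal sketch (assuming you can supply a log_posterior(theta) function), scipy.optimize.minimize with method='BFGS' returns an approximate inverse Hessian of its objective; since the objective here is the negative log posterior, that matrix can serve as the covariance $C$:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(log_posterior, theta0):
    """Fit a Gaussian to the posterior peak (Laplace approximation).

    log_posterior(theta) -> float : log P(theta|d), up to a constant
    theta0                        : starting point for the optimizer
    Returns the peak location theta_hat and covariance matrix C.
    """
    # Minimize the negative log posterior to find the peak theta_hat.
    result = minimize(lambda theta: -log_posterior(theta), theta0, method='BFGS')
    theta_hat = result.x
    # BFGS accumulates an approximate inverse Hessian of -log P, which is
    # exactly the covariance matrix of the Laplace approximation.
    C = result.hess_inv
    return theta_hat, C
```

Approximate posterior samples can then be drawn with, e.g., np.random.multivariate_normal(theta_hat, C, size=1000).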