Examples:
(Strictly speaking, "non-parametric" methods tend to correspond to models with very many parameters, but the terminology persists.)
Resampling methods try to compensate for "small sample" effects in the data, or for otherwise not knowing the sampling distribution.
The classical example is the sample average in the presence of heavy-tailed scatter.
Resampling is usually seen in frequentist estimation rather than Bayesian inference - but there are Bayesian adaptations.
In the jackknife, we remove each data point in turn and recompute the estimate from the remaining points. The average of these leave-one-out estimates (compared with the full-data-set calculation) and their scatter provide some idea of the small-sample bias.
(Note that our CMB colleagues have invented an unrelated test that they like to call a jackknife. Don't get confused!)
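Here is a minimal sketch of the leave-one-out jackknife for the sample mean (the function names and toy data below are illustrative, not part of the course materials):

```python
import numpy as np

def jackknife(x, estimator=np.mean):
    """Leave-one-out jackknife for a simple estimator (illustrative sketch)."""
    x = np.asarray(x)
    n = len(x)
    full = estimator(x)
    # Recompute the estimate with each data point removed in turn
    loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - full)                             # jackknife bias estimate
    scatter = np.sqrt((n - 1) / n * np.sum((loo - loo.mean())**2))   # jackknife error estimate
    return full, bias, scatter

# Toy data with heavy-tailed (Student-t) scatter
rng = np.random.default_rng(42)
x = rng.standard_t(df=3, size=50)
print(jackknife(x))
```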
The bootstrap is a little more sophisticated. The idea is that we have data that sample a distribution, so they can be used as a direct (if crude) estimate of that distribution without further assumptions. A key requirement is that the measured data are a fair representation of what we might have gotten.
Again, the classic example is estimating a sample mean or unweighted regression.
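A correspondingly minimal sketch of the bootstrap, again for the sample mean (illustrative names and toy data):

```python
import numpy as np

def bootstrap_mean(x, n_boot=1000, rng=None):
    """Bootstrap estimate of the mean and its uncertainty (illustrative sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.asarray(x)
    # Each realization resamples the data with replacement, treating the
    # measured values themselves as a crude estimate of the underlying distribution
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    return means.mean(), means.std()

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=30)   # heavy-tailed toy data again
print(bootstrap_mean(x, rng=rng))
```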
Since the bootstrap interprets the data as a kernel estimate of some distribution, in principle it can be fit into a Bayesian analysis. The most obvious route is to attach a weight to each data point encoding how "real" it is, with the weights summing to the number of data points.
(This is not widely done, since hierarchical mixture models provide a simpler and arguably more natural Bayesian approach.)
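If you did want to try the weighting scheme above, one standard realization (sometimes called the "Bayesian bootstrap") draws the weights from a flat Dirichlet distribution; the sketch below assumes that choice, which is not prescribed in the text:

```python
import numpy as np

def weighted_mean_draws(x, n_draws=1000, rng=None):
    """Draws of the weighted mean using random data-point weights.

    The weights come from a flat Dirichlet distribution, rescaled to sum to
    the number of data points -- one conventional choice, not the only one."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    w = rng.dirichlet(np.ones(n), size=n_draws) * n   # shape (n_draws, n)
    return (w * x).sum(axis=1) / n                    # one weighted mean per draw

x = np.random.default_rng(2).normal(size=20)
draws = weighted_mean_draws(x)
print(draws.mean(), draws.std())
```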
Mixture modeling refers to the general practice of building a complicated distribution out of simpler components:
$p(x) = \sum_i \pi_i \, q_i(x)$,
where the mixing coefficients $\pi_i$ satisfy $\sum_i \pi_i=1$, and the $q_i(x)$ are normalized PDFs.
We could generate from this PDF by drawing from $q_i$ with probability $\pi_i$.
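As a quick generative sketch of that step (assuming, purely for illustration, that the $q_i$ are Gaussians):

```python
import numpy as np

# Illustrative mixture weights and Gaussian component parameters
pi = np.array([0.5, 0.3, 0.2])        # mixing coefficients, summing to 1
mu = np.array([-2.0, 0.0, 3.0])       # component means
sigma = np.array([0.5, 1.0, 0.7])     # component widths

rng = np.random.default_rng(3)
n = 10000

# Draw a component index i with probability pi_i, then draw x from q_i
i = rng.choice(len(pi), size=n, p=pi)
x = rng.normal(mu[i], sigma[i])
```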
When might we use mixture models?
e.g.
a) supernova luminosities $L$ (without spectroscopic typing $T$) $\longrightarrow$ conditional (prior) PDF $P(L|T)$
b) source vs. background photon energies $E$ $\longrightarrow$ sampling distribution $P(E|T)$
[Figure: a mixture of 3 Gaussians]
How would we decide on the number of mixture components? Depending on the application, we might take quite different approaches.
The term "non-parametric" is used vaguely (and often inaccurately), so it's best explained by example:
Example 1:
In gravitational lensing, image shear (or stronger distortions) can be measured at the positions of background galaxies in the image plane. Often, the mass distribution of the lens is modeled as the sum of a small number of idealized structures with parametrized mass distributions.
Alternatively, Bradac et al (2005) model the deflection potential on a regular grid (e.g. their Figure 5), interpolating to the positions of the measured galaxies and avoiding explicit assumptions about the nature of the lens.
Example 2:
In cosmological studies that use distance measurements, the standard technique involves adopting a parametrized model for the energy budget of the Universe ($\Omega_m$, $\Omega_{\rm DE}$, $w_0$, $w_a$) and predicting the distance-redshift relation using that model.
However, not everyone is happy with this Dark Energy parametrization, and in particular the question of how best to test whether $w$ is constant with time is much discussed.
e.g.
Huterer & Starkman (2003) advocated a principal component-based model for $w(z)$, in which the functional forms that the data are most sensitive to are determined first, and the amplitude of each component is then fit.
More recently, various authors, including Seikel et al (2012), have used Gaussian Process Regression, a sophisticated interpolation technique (see Seikel et al's Figure 6).
With your neighbor, discuss one of the simply-parametrized model inferences that you carried out for homework, and design a non-parametric model for the same data. Be prepared to describe to the group:
What makes your model non-parametric?
What are the parameters of your non-parametric model?
What assumptions would you be making when drawing conclusions from it?
How do you expect it to perform in a model comparison with its simpler counterpart?
Under what circumstances would you be in favor of using this model?
A common feature of non-parametric models is that they bypass the usual business of defining a physically-motivated model.
Instead, they are usually "data-driven":
"Non-parametric" models are not assumption-free - they just involve different assumptions than more simply-parametrized, physics-based models.
For the remainder of this lesson, we'll take a look at a specific class of non-parametric models, stochastic processes.
We'll then look at how non-parametric models are used in automated data analysis, or "machine learning."
A Gaussian process has the property that
$P\left[y(x) | y(x_1), y(x_2), \ldots\right]$
is a Gaussian depending on the $x_i$ and $y(x_i)$. The process is specified by a "mean function" $\mu(x)$ and a "covariance function" $C(x,x')$, or "kernel," which determines how quickly $y(x)$ can vary.
A draw from $P[y(x^*)]$ would represent a prior prediction for the function value $y(x^*)$.
Typically we are more interested in the posterior prediction, drawn from $P[y(x^*)\vert y^{\rm obs}(x_{\rm obs})]$.
The posterior PDF for $y(x^*)$ is a Gaussian, whose mean and standard deviation can be computed algebraically, and which is constrained by all the previously observed $y(x)$.
GPs provide a natural way to achieve high flexibility (and uncertainty) when interpolating data.
With the appropriate assumptions (e.g. Gaussian measurement errors), the calculation of the posterior for $y(x)$ is an algebraic operation (no Monte Carlo required).
Marginalization over the GP hyperparameters (the width of the kernel, for example) is more computationally expensive, since each evaluation involves the determinant (and inverse) of the covariance matrix, but fast methods have been developed.
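To make the algebra concrete, here is a minimal sketch of the posterior mean and standard deviation of $y(x^*)$, assuming a zero mean function, a squared-exponential kernel with fixed hyperparameters, and Gaussian measurement errors (the GP regression notebook has its own implementation; this is just an illustration):

```python
import numpy as np

def sqexp_kernel(x1, x2, amp=1.0, scale=1.0):
    """Squared-exponential covariance function C(x, x')."""
    return amp**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / scale**2)

def gp_posterior(x_obs, y_obs, sigma_obs, x_star, **kw):
    """Posterior mean and standard deviation of y(x*) given noisy observations.

    Pure linear algebra: no Monte Carlo is needed once the hyperparameters are fixed."""
    K = sqexp_kernel(x_obs, x_obs, **kw) + np.diag(sigma_obs**2)
    Ks = sqexp_kernel(x_star, x_obs, **kw)
    Kss = sqexp_kernel(x_star, x_star, **kw)
    mean = Ks @ np.linalg.solve(K, y_obs)            # zero mean function assumed
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0, None))

# Toy data: noisy samples of a smooth function
rng = np.random.default_rng(4)
x_obs = np.sort(rng.uniform(0, 10, 8))
y_obs = np.sin(x_obs) + rng.normal(0, 0.1, size=x_obs.size)
x_star = np.linspace(0, 10, 200)
mean, std = gp_posterior(x_obs, y_obs, np.full(x_obs.size, 0.1), x_star, scale=1.5)
```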
Gaussian processes appear to be "non-parametric" because the algebraic evaluation of the posterior PDF includes analytic marginalization over all the (nuisance) parameters in the model (the true values of $y$ at each $x_{\rm obs}$).
As with all non-parametric models, GPs are not "assumption-free" or "model-independent": they are just not simply or physically parametrized, and so involve different types of assumptions.
The trade-off between simply-parametrized and non-parametric models is between interpretability (typically high for simply-parametrized physical models) and prediction accuracy (typically high for non-parametric models).
The GP regression notebook walks you through the code used to make the figure below, and suggests some exercises to probe your understanding of Gaussian processes.