"Model-free" Models

Goals:

  • Introduce and use techniques that purport to be "model independent".

What does "model-free" mean?

Sometimes we simply don't have a good first-principles model for what's going on in our data, but we're also confident that making a simple assumption (e.g. Gaussian scatter) is dead wrong.

Examples:

  • Photometric redshifts (catastrophic errors)
  • Photometric supernova detections (multiple populations)

In these situations, we're motivated to avoid strong modeling assumptions and instead be more empirical.

Common adjectives:

  • non-parametric
  • model-independent
  • data-driven
  • empirical

(Strictly speaking, these tend to correspond to models with very many parameters, but the terminology persists.)

What's here

  1. Resampling methods
  2. Mixture models
  3. "Non-parametric" models and stochastic processes

1. Resampling methods: jackknife and bootstrap

These methods try to compensate for "small sample" effects in the data, or for our not knowing the sampling distribution.

The classical example is the sample average in the presence of heavy-tailed scatter.

Resampling is usually seen in frequentist estimation rather than Bayesian inference - but there are Bayesian adaptations.

Jackknife procedure

  1. Remove 1 (or more) data points from the data set.
  2. Calculate the estimate of interest using the reduced data set.
  3. Repeat this for every possible reduced data set.

The average of these estimates (compared to the full-data-set calculation) and their scatter provide some idea of the small-sample bias and variance.
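
A minimal sketch, assuming a 1D NumPy array of measurements and taking the sample mean as the statistic of interest (any other statistic could be substituted):

```python
import numpy as np

def jackknife(data, statistic=np.mean):
    """Leave-one-out jackknife for a 1D data array."""
    n = len(data)
    full = statistic(data)                       # statistic from the full data set
    # statistic recomputed on every reduced data set (one point removed)
    estimates = np.array([statistic(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (estimates.mean() - full)   # standard jackknife bias estimate
    return estimates, bias

# toy data with heavy-tailed scatter
rng = np.random.default_rng(42)
data = rng.standard_t(df=2, size=50)
estimates, bias = jackknife(data)
print(estimates.mean(), estimates.std(), bias)
```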

(Note that our CMB colleagues have invented an unrelated test that they like to call a jackknife. Don't get confused!)

Bootstrap

The bootstrap is a little more sophisticated. The idea is that we have data that sample a distribution, so they can be used as a direct (if crude) estimate of that distribution without further assumptions. A key requirement is that the measured data are a fair representation of what we might have gotten.

Bootstrap procedure

  1. Generate a new data set of the same size as the real data by sampling with replacement from the real data points.
  2. Calculate whatever statistic or estimate is of interest from the bootstrap data set.
  3. Do this many times, and interpret the resulting distribution as indicative of the true uncertainty in the measurement.

Again, the classic example is estimating a sample mean or unweighted regression.
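
A minimal sketch along the same lines (toy heavy-tailed data, sample mean as the statistic):

```python
import numpy as np

def bootstrap(data, statistic=np.mean, n_resamples=5000, seed=None):
    """Ordinary (non-parametric) bootstrap for a 1D data array."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # resample the data with replacement and recompute the statistic each time
    return np.array([statistic(rng.choice(data, size=n, replace=True))
                     for _ in range(n_resamples)])

# toy data with heavy-tailed scatter
rng = np.random.default_rng(42)
data = rng.standard_t(df=2, size=50)
boot = bootstrap(data, seed=1)
print(boot.mean(), boot.std())   # the spread indicates the uncertainty in the mean
```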

Bootstrap variants: parametric

Instead of resampling data points, each point is varied randomly within its measurement errors. This is often done in weighted regression problems.
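
A sketch for a toy weighted straight-line fit, assuming Gaussian measurement errors of known size (the data and error bars here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 20)
sigma = 0.5 * np.ones_like(x)                  # reported measurement errors
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)     # toy data around a straight line

slopes = []
for _ in range(2000):
    # perturb each point within its (assumed Gaussian) error bar
    y_perturbed = y + rng.normal(0.0, sigma)
    # weighted least-squares straight-line fit
    coeffs = np.polyfit(x, y_perturbed, deg=1, w=1.0 / sigma)
    slopes.append(coeffs[0])

print(np.mean(slopes), np.std(slopes))   # spread ~ uncertainty in the slope
```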

Bootstrap variants: Bayesian

Since the bootstrap interprets the data as a kernel estimate of some distribution, in principle it can be fit into a Bayesian analysis. The most obvious route is to attach a weight to each data point encoding how "real" it is, with the weights summing to the number of data points.

(This is not widely done, since hierarchical mixture models provide a simpler and arguably more natural Bayesian approach.)
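
A minimal sketch of one such weighting scheme, drawing the weights from a flat Dirichlet distribution (as in the standard Bayesian bootstrap) and rescaling so they sum to the number of data points:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.standard_t(df=2, size=50)    # toy heavy-tailed sample
n = len(data)

means = []
for _ in range(5000):
    # weights from a flat Dirichlet, rescaled to sum to n (average weight of 1)
    w = rng.dirichlet(np.ones(n)) * n
    means.append(np.average(data, weights=w))

print(np.mean(means), np.std(means))    # posterior-like distribution for the mean
```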

2. Mixture models

This refers to the general practice of building a complicated distribution out of simpler components.

$p(x) = \sum_i \pi_i \, q_i(x)$,

where the coefficients satisfy $\pi_i \geq 0$ and $\sum_i \pi_i=1$, and the $q_i(x)$ are normalized PDFs.

We could generate from this PDF by drawing from $q_i$ with probability $\pi_i$.
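
As a sketch (with arbitrary, illustrative weights and component parameters), generating samples from a 3-component Gaussian mixture this way and evaluating $p(x)$ on a grid:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# illustrative mixture weights (summing to 1) and Gaussian component parameters
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([1.0, 0.5, 1.5])

# generate: draw a component index i with probability pi_i, then x ~ q_i
i = rng.choice(len(pi), size=10000, p=pi)
x = rng.normal(mu[i], sigma[i])

# evaluate the mixture density p(x) = sum_i pi_i q_i(x) on a grid
grid = np.linspace(-6.0, 8.0, 200)
p = np.sum(pi[:, None] * norm.pdf(grid, mu[:, None], sigma[:, None]), axis=0)
```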

When might we use mixture models?

  • If the data being modeled really are suspected to have come from multiple origins

e.g.

a) supernova luminosities $L$ (without spectroscopic typing $T$) $\longrightarrow$ conditional (prior) PDF $P(L|T)$

b) source vs. background photon energies $E$ $\longrightarrow$ sampling distribution $P(E|T)$

  • If we want a flexible (but still somewhat restricted) model to describe the data,
e.g. a mixture of 3 Gaussians can already describe quite complex, skewed, or multi-modal distributions

How would we decide on the number of mixture components? Depending on the application, we might

  1. Test how sensitive our inferences are to the number
  2. Do formal model comparison (e.g. via an information criterion, or the Evidence) to decide (see the sketch after this list)
  3. Explicitly marginalize over the number of components (either with reversible-jump, i.e. Metropolis-Hastings-Green, sampling, or using something called a Dirichlet process)
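
A sketch of option 2, assuming scikit-learn is available and using the BIC (an information criterion that stands in for a full evidence calculation) on toy 1D data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# toy 1D data actually drawn from two Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.7, 200)])
X = x.reshape(-1, 1)   # scikit-learn expects a 2D array of shape (n_samples, n_features)

# compare the Bayesian Information Criterion across candidate component numbers
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, gm.bic(X))   # smaller BIC is preferred
```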

3. "Non-parametric" Models

The term "non-parametric" is used vaguely (and often inaccurately), so it's best explained by example:

Example 1:

In gravitational lensing, image shear (or stronger distortions) can be measured at the positions of background galaxies in the image plane. Often, the mass distribution of the lens is modeled as the sum of a small number of idealized structures with parametrized mass distributions.

Alternatively, Bradac et al (2005) model the deflection potential on a regular grid (e.g. their Figure 5), interpolating to the positions of the measured galaxies and thereby avoiding explicit assumptions about the nature of the lens.

Example 2:

In cosmological studies that use distance measurements, the standard technique involves adopting a parametrized model for the energy budget of the Universe ($\Omega_m$, $\Omega_{\rm DE}$, $w_0$, $w_a$) and predicting the distance-redshift relation using that model.

However, not everyone is happy with this Dark Energy parametrization, and in particular the question of how best to test whether $w$ is constant with time is much discussed.

e.g.

  • Huterer & Starkman (2003) advocated a principal component-based model for $w(z)$, in which the functional forms to which the data are most sensitive are determined, and the amplitude of each component is then fit.

  • More recently, various authors, including Seikel et al (2012), have used Gaussian Process Regression, a sophisticated interpolation technique (see Seikel et al's Figure 6).

Non-Parametric Gedanken Exercise

With your neighbor, discuss one of the simply-parametrized model inferences that you carried out for homework, and design a non-parametric model for the same data. Be prepared to describe to the group:

  • What makes your model non-parametric?

  • What are the parameters of your non-parametric model?

  • What assumptions would you be making when drawing conclusions from it?

  • How do you expect it to perform in a model comparison with its simpler counterpart?

  • Under what circumstances would you be in favor of using this model?

A common feature of non-parametric models is that they bypass the usual business of defining a physically-motivated model.

Instead, they are usually "data-driven":

  • They usually attempt to define a "physics-agnostic" model, but with enough flexibility to describe the data.
  • This flexibility scales with the size of the dataset, so that the data continue to be well described.

"Non-parametric" models are not assumption-free - they just involve different assumptions than more simply-parametrized, physics-based models.

For the remainder of this lesson, we'll take a look at a specific class of non-parametric models, stochastic processes.

We'll then look at how non-parametric models are used in automated data analysis, or "machine learning."

Stochastic Processes

A stochastic process is a collection of random variables drawn from a probability distribution over functions.

In other words, if our function of interest is $y(x)$, a stochastic process assigns probabilities $P\left[y(x)\right]$.

Gaussian Processes

A Gaussian process has the property that

$P\left[y(x) | y(x_1), y(x_2), \ldots\right]$

is a Gaussian depending on the $x_i$ and $y(x_i)$. The process is specified by a "mean function" $\mu(x)$ and a "covariance function" $C(x, x')$, or "kernel," which determines how quickly $y(x)$ can vary.
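
As a small illustration of "a probability distribution over functions", here is a sketch that draws sample functions from a GP prior, assuming a squared-exponential kernel (a common but by no means unique choice):

```python
import numpy as np

def rbf_kernel(x1, x2, amp=1.0, scale=1.0):
    """Squared-exponential covariance function C(x, x')."""
    return amp**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / scale**2)

x = np.linspace(0.0, 10.0, 100)
rng = np.random.default_rng(5)

# three independent draws from a GP prior with zero mean function
K = rbf_kernel(x, x) + 1e-8 * np.eye(x.size)   # small jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(x.size), K, size=3)
```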

Gaussian Processes in Data Analysis

A draw from $P[y(x^*)]$ would represent a prior prediction for the function value $y(x^*)$.

Typically we are more interested in the posterior prediction, drawn from $P[y(x^*)\vert y^{\rm obs}(x_{\rm obs})]$.

The posterior PDF for $y(x^*)$ is a Gaussian, whose mean and standard deviation can be computed algebraically, and which is constrained by all the previously observed $y(x)$.
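
For reference, a sketch of this standard result, assuming a zero mean function, covariance function $C$, and independent Gaussian measurement errors of variance $\sigma^2$: writing $K_{ij} = C(x_{{\rm obs},i}, x_{{\rm obs},j})$, $(k_*)_i = C(x^*, x_{{\rm obs},i})$, and $k_{**} = C(x^*, x^*)$,

$\langle y(x^*) \rangle = k_*^{\rm T} \left(K + \sigma^2 I\right)^{-1} y^{\rm obs}$,

${\rm Var}\left[y(x^*)\right] = k_{**} - k_*^{\rm T} \left(K + \sigma^2 I\right)^{-1} k_*$.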

GP Regression

GPs provide a natural way to achieve high flexibility (and uncertainty quantification) when interpolating data.

With the appropriate assumptions (e.g. Gaussian measurement errors), the calculation of the posterior for $y(x)$ is an algebraic operation (no Monte Carlo required).

Marginalization over the GP hyperparameters (the width of the kernel, for example) is more computationally expensive (it involves evaluating determinants of the covariance matrices), but fast methods have been developed.
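
A minimal NumPy sketch of the algebraic calculation above, assuming a squared-exponential kernel with fixed hyperparameters, a zero mean function, and known Gaussian measurement errors (the data here are invented for illustration):

```python
import numpy as np

def rbf_kernel(x1, x2, amp=1.0, scale=1.0):
    """Squared-exponential covariance function C(x, x')."""
    return amp**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / scale**2)

rng = np.random.default_rng(4)
x_obs = np.sort(rng.uniform(0.0, 10.0, 15))
sigma = 0.2                                                  # known Gaussian error bar
y_obs = np.sin(x_obs) + rng.normal(0.0, sigma, x_obs.size)   # toy data
x_star = np.linspace(0.0, 10.0, 200)                         # prediction points

# algebraic GP posterior (zero mean function, fixed hyperparameters)
K = rbf_kernel(x_obs, x_obs) + sigma**2 * np.eye(x_obs.size)
K_s = rbf_kernel(x_star, x_obs)
K_ss = rbf_kernel(x_star, x_star)

mean = K_s @ np.linalg.solve(K, y_obs)             # posterior mean of y(x*)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)       # posterior covariance
std = np.sqrt(np.diag(cov))                        # pointwise 1-sigma band
```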

Parting thoughts

Gaussian processes appear to be "non-parametric" because the algebraic evaluation of the posterior PDF includes analytic marginalization over all the (nuisance) parameters in the model (the true values of $y$ at each $x_{\rm obs}$).

As with all non-parametric models, GPs are not "assumption-free" or "model-independent": they are just not simply or physically parametrized, and so involve different types of assumptions.

The trade-off between simply-parametrized and non-parametric models is between interpretability (typically high for simply-parametrized physical models) and prediction accuracy (typically high for non-parametric models).

Tutorial

The GP regression notebook walks you through the code to make an example GP regression figure - and suggests some exercises to probe your understanding of Gaussian processes.