Hierarchical Models

Goals:

  • Introduce ourselves to the use of nuisance parameters and of hierarchical structure in models.
  • Practice devising models for more complex situations than we've seen so far.

References

  • Gelman Ch. 5 (5.1-5.3)

Real data sets tend to be complicated. Some common features:

  • Heteroscedastic measurement errors
  • Measurement errors on all quantities
  • Correlated measurement errors
  • Intrinsic scatter
  • Selection effects/systematically incomplete data
  • Presence of unwanted sources in the data set
  • Systematic uncertainties every which way

All of these things can be addressed through appropriate modeling.

Very often, this involves introducing additional nuisance parameters and/or expanding the model such that it has a hierarchical structure.

Nuisance parameters

A nuisance parameter is any model parameter that we are not particularly interested in at the end of the day.

This is an entirely subjective label - nuisance parameters are not formally or functionally treated any differently than other parameters.

Including nuisance parameters in a model provides a way to explicitly account for and marginalize over systematic uncertainties.

This can be as simple as assigning a prior distribution to some quantity that would otherwise remain fixed, or it can involve a more explicit expansion of the model.

Often nuisance parameters represent latent variables - things that are logically or physically part of a model, but which cannot be directly measured.

Example: measuring flux with a background

Recall our simple measurement of a galaxy's flux based on a number of detected counts from the Bayes Theorem chunk:

  • $N|\mu \sim \mathrm{Poisson}(\mu)$
  • $\mu \sim \mathrm{Gamma}(\alpha,\beta)$

Let's explicitly include a constant conversion between counts and flux:

  • $N|\mu \sim \mathrm{Poisson}(\mu)$
  • $\mu = C\,F$
  • $F \sim \mathrm{Gamma}(\alpha,\beta)$

Q: Why does it make sense to keep a Gamma prior on $F$ (with slightly redefined parameters)?

Q: What secret message describing this class is encoded in the PGM?
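
Incidentally, this parameterization keeps everything conjugate: a Gamma-distributed $F$ scaled by the constant $C$ is still Gamma-distributed, so the posterior for $F$ can be written down in closed form. Below is a minimal sketch in Python; all numerical values ($\alpha$, $\beta$, $C$, the true flux) are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical values for this sketch - not from the notes.
alpha, beta = 2.0, 1.0   # Gamma shape and rate hyperparameters for F
C = 50.0                 # known conversion between flux and expected counts
F_true = 3.0

rng = np.random.default_rng(42)
N = rng.poisson(C * F_true)  # simulate the measurement

# Conjugacy: the posterior is F | N ~ Gamma(alpha + N, rate = beta + C).
post = stats.gamma(a=alpha + N, scale=1.0 / (beta + C))
print("posterior mean of F:", post.mean(), "+/-", post.std())
```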

Now let's expand the model to account for the fact that we actually measure counts from both the galaxy and the background. Being good astronomers, we also remembered to take a second measurement of a source-free (background-only) patch of sky.

Subscripts: $b$ for background, $g$ for galaxy, $s$ for the science observation that includes both.

  • $N_b|\mu_b \sim \mathrm{Poisson}(\mu_b)$
  • $\mu_b = C_b\,F_b$
  • $F_b \sim \mathrm{Gamma}(\alpha_b,\beta_b)$
  • $N_s|\mu_s \sim \mathrm{Poisson}(\mu_s)$
  • $\mu_s = C_s(F_g+F_b)$
  • $F_g \sim \mathrm{Gamma}(\alpha_g,\beta_g)$

The background quantities could be regarded as nuisance parameters here - we need to account for our uncertainty in $F_b$, but measuring it isn't the point of the analysis.

Another case: suppose we want to marginalize over systematic uncertainty in the instrument gain, assuming it's the same for both observations.

Maybe the instrument team told us it was known to $\sim5\%$. We might write

  • $\mu_b = \gamma C_b F_b$
  • $\mu_s = \gamma C_s (F_g+F_b)$
  • $\gamma \sim \mathrm{Normal}(1.0, 0.05)$

introducing $\gamma$ as a nuisance parameter.
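
To make this concrete, here is a minimal sketch (in Python) of the unnormalized log posterior for the expanded model with $\gamma$ included; all numerical values are invented for illustration. Whatever sampler we use would then marginalize over the nuisance parameters $F_b$ and $\gamma$ for us.

```python
import numpy as np
from scipy import stats

# Invented data and constants for this sketch.
N_b, N_s = 120, 310      # measured counts (background-only, science)
C_b, C_s = 40.0, 50.0    # known conversion factors
a_b, b_b = 2.0, 1.0      # Gamma (shape, rate) hyperparameters for F_b
a_g, b_g = 2.0, 1.0      # Gamma (shape, rate) hyperparameters for F_g

def log_posterior(F_g, F_b, gamma):
    """Unnormalized log posterior for (F_g, F_b, gamma)."""
    if F_g <= 0 or F_b <= 0 or gamma <= 0:
        return -np.inf
    mu_b = gamma * C_b * F_b
    mu_s = gamma * C_s * (F_g + F_b)
    return (stats.poisson.logpmf(N_b, mu_b)          # background likelihood
            + stats.poisson.logpmf(N_s, mu_s)        # science likelihood
            + stats.gamma.logpdf(F_b, a_b, scale=1.0 / b_b)
            + stats.gamma.logpdf(F_g, a_g, scale=1.0 / b_g)
            + stats.norm.logpdf(gamma, 1.0, 0.05))   # ~5% gain prior
```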

Hierarchical models

Often, especially in physics, the model for a data set naturally takes a hierarchical structure.

  • e.g. measurements of multiple sources inform us about a source class

In statistics, this is related to the concept of exchangeability - as far as we know, individual sources of a given class are equivalent (until we measure them).

In practice, the hierarchy usually takes the form of common priors for the parameters describing individual measured sources.

  • The prior parameters describe the statistical properties of the source class, and are often what we're trying to measure.
  • Those prior parameters are therefore often left free, with priors of their own, aka hyperpriors.

General form for a hierarchical model:

  • $p(x|\theta)$ describes the measurement process
  • $p(\theta)$ decomposes as $p(\theta|\phi_1)\,p(\phi_1)$
  • $p(\phi_1)$ decomposes as $p(\phi_1|\phi_2)\,p(\phi_2)$
  • $\ldots$
  • $p(\phi_n)$, usually taken to be "uninformative"
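
Multiplying these conditional densities together, the joint posterior for all layers of the hierarchy is

$p(\theta,\phi_1,\ldots,\phi_n|x) \propto p(x|\theta)\,p(\theta|\phi_1)\,p(\phi_1|\phi_2)\,\cdots\,p(\phi_{n-1}|\phi_n)\,p(\phi_n)$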

Example: galaxy luminosity function

Let's modify the previous example as follows

  • We're interested in luminosity rather than flux - if we know the distance to the target, this just means including another known factor in $C_g$, which now converts between counts and $L$.
  • We'll measure $m>1$ galaxies, and are interested in constraining the luminosity function, traditionally modelled as

$n(x) = \phi^\ast x^\alpha e^{-x}; \quad x=\frac{L}{L^\ast}$

Here $n$ is the number density of galaxies; this form is known as the Schechter function.

  • For simplicity, we'll assume that we've measured every galaxy above a given luminosity in some volume. This is not very realistic, but we'll tackle the issues raised by incomplete data sets some other time.
  • Let's also assume that the same background applies to each galaxy measurement, and that we have a galaxy-free observation of it, as before.

Now - what does the PGM for our experiment look like?

Compressing the $L\rightarrow N$ and $F\rightarrow N$ conversions into single steps simplifies the PGM.

Q: What is the prior on $L_{g,i}$?

Q: Which of these things are nuisance parameters?
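
To make the generative structure concrete, here is a minimal forward simulation of this model in Python. All numerical values are invented, and the simple rejection sampler for the truncated Schechter shape is just one way to draw the latent luminosities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented hyperparameter values - these are what we'd ultimately infer.
alpha, Lstar = -1.2, 1.0e10   # Schechter slope and characteristic luminosity
x_min = 0.1                   # completeness limit in x = L / Lstar
m = 100                       # number of galaxies

def sample_schechter(n):
    """Rejection-sample x^alpha e^{-x} on [x_min, 10] (valid for alpha < 0)."""
    out = []
    while len(out) < n:
        x = rng.uniform(x_min, 10.0)
        if rng.uniform() < (x / x_min) ** alpha * np.exp(-(x - x_min)):
            out.append(x)
    return np.array(out)

L = Lstar * sample_schechter(m)       # latent luminosities

# Each science observation includes the shared background.
C_g, C_b = 1.0e-9, 200.0              # made-up conversion factors
F_b = 0.5                             # true background flux
N = rng.poisson(C_g * L + C_b * F_b)  # science counts, one per galaxy
N_b = rng.poisson(C_b * F_b)          # background-only observation
```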

Exercises

The exercises below will get you working with PGMs and model expressions (the $A\sim B$ type relationships) for a few scenarios that illustrate advanced (but common) features of data analysis.

When drawing a PGM, do note which parameters would need to have a prior defined, but don't worry about explicitly including prior parameters that are not part of the described model.

Exercise: Intrinsic scatter

Produce the PGM and model expressions for the following scenario:

Type Ia supernova luminosities are "standardizable". This means that, using observable properties of the lightcurve (color $c$ and "stretch" $s$), SNIa luminosities can be corrected to a scale where they are almost equal. Almost, but not quite - there is still a residual intrinsic scatter, even after this correction.

The problem, described in a non-generative way, is as follows. We measure a number of counts for each of many SNIa, and can use the known telescope properties to convert that to a flux. The flux is related to the SNIa's intrinsic luminosity by a distance which is a function of its (measured) redshift, $z$. The color and stretch of the SNIa are also measured; these relate the actual luminosity to the "standard" SNIa luminosity. However, there is a scatter, which we assume to be Gaussian, between the actual luminosities of SNIa and this standard luminosity. The parameters that we ultimately care about are the average standard luminosity, $\bar{L}$, and the intrinsic scatter, $\sigma$.

In practice, a common use of SNIa involves adding free cosmological parameters in the distance-redshift relation, but we can leave that out here.

Exercise: Measurement uncertainties

It's pretty common for multiple quantities that can be derived from the same observation to have correlated measurement errors. X-ray spectroscopy of galaxy clusters provides a good example. From this type of data, we can measure gas temperatures and estimate the total gravitating mass (assuming the gas is in equilibrium with the gravitational potential). However, temperature is one of the observables needed in the hydrostatic equation to compute the mass (the other being gas density), so any uncertainty in the temperature measurement propagates to the mass.

Naturally, determining the gas temperature as a function of total mass for galaxy clusters is interesting. A standard model for this is a power-law, $T\propto M^\beta$, with a log-normal intrinsic scatter (see the previous exercise). Slightly more simply, we could write $t=\alpha+\beta\,m$, with normal scatter $\tau$, for $t=\log(T)$ and $m=\log(M)$.

The new wrinkle is that we have correlated measurement errors on $t$ and $m$. We'll follow the common practice of assuming that the measurement errors on $t$ and $m$ are Gaussian and that we can use a fixed estimate of them - but instead of simply having some $\sigma_t$ and $\sigma_m$ in the problem, we actually have an estimated $2\times2$ covariance matrix, $\Sigma$, for each source.
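
For concreteness, simulating one source's correlated measurement is then a single multivariate-normal draw; a minimal sketch in Python, with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# One cluster's true (log) temperature and mass, plus its estimated
# measurement covariance - all values invented for illustration.
t_true, m_true = 0.8, 14.5
Sigma = np.array([[0.04, 0.03],
                  [0.03, 0.09]])  # 2x2 covariance of the (t, m) errors

t_obs, m_obs = rng.multivariate_normal([t_true, m_true], Sigma)
```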

Draw the PGM and write down the model for the above scenario.

Exercise: Group membership

Future time-domain photometric surveys will discover too many supernovae to follow up and type spectroscopically. This means that if we want to do our SNIa experiment from the "intrinsic scatter" example, our data set will be contaminated by other supernova types.

Here we simplify things a little by eliminating the standardization process, and just assume that each of the $m$ SN types has luminosities that follow its own distribution. We also introduce group membership as a latent parameter.

Based on the model written out below, draw a PGM and come up with a more complete narration of what's going on than is provided above.

Some priors on $q_i$, $\bar{L}_i$, $\sigma_i$, for $i=1,\ldots,m$. ($q_i$ is the a priori probability of a SN being of type $i$.)

For each SN observed,

  • $g \sim \mathrm{Multinomial}(\mathbf{q})$ (sometimes measured)
  • $L \sim \mathrm{Normal}(\bar{L}_g, \sigma_g)$
  • $\mu = C\,L$, with $C$ a known constant
  • $N \sim \mathrm{Poisson}(\mu)$ (measured)
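
As a sanity check on your narration, here is a minimal forward simulation of the model above in Python (all numerical values invented). Note that a Multinomial draw for a single SN reduces to a categorical choice of type.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented hyperparameter values for m = 3 SN types.
q     = np.array([0.5, 0.3, 0.2])     # a priori type probabilities
Lbar  = np.array([1.0, 0.6, 0.3])     # mean luminosity of each type
sigma = np.array([0.05, 0.05, 0.03])  # scatter of each type
C = 100.0                             # known counts-per-luminosity constant

n_sn = 1000
g = rng.choice(len(q), size=n_sn, p=q)   # latent type of each SN
L = rng.normal(Lbar[g], sigma[g])        # luminosity given type
L = np.clip(L, 1e-6, None)               # guard against unphysical draws
N = rng.poisson(C * L)                   # measured counts
```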

Hierarchical models for linear regression

An extremely common task is fitting a linear model to data that have measurement uncertainties in both the $x$ and $y$ values, as well as intrinsic scatter. Non-Bayesian approaches were developed in the 1990s, but they have serious flaws (e.g., they cannot fit for the scatter); in contrast, hierarchical modeling addresses this problem neatly (see exercises).
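
As a sketch of what such a model looks like (not any particular package's implementation), the log posterior below treats the true $x$ values as latent parameters and analytically convolves the intrinsic scatter with the $y$ measurement error; all names are invented, and priors are omitted.

```python
import numpy as np
from scipy import stats

def log_post(params, x_obs, y_obs, sx, sy):
    """y = alpha + beta*x with intrinsic scatter sigma_int, plus Gaussian
    measurement errors sx, sy on each point. params packs (alpha, beta,
    sigma_int, x_true_1, ..., x_true_n); priors are left out of this sketch."""
    alpha, beta, sigma_int = params[:3]
    x_true = params[3:]                  # latent true x's, one per data point
    if sigma_int <= 0:
        return -np.inf
    # Observed x's scatter about the latent true values...
    ll = stats.norm.logpdf(x_obs, x_true, sx).sum()
    # ...and observed y's scatter about the line, with the intrinsic and
    # measurement variances adding (a Gaussian convolved with a Gaussian).
    ll += stats.norm.logpdf(y_obs, alpha + beta * x_true,
                            np.sqrt(sy**2 + sigma_int**2)).sum()
    return ll
```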

For the also-common assumptions of Gaussian uncertainties and scatter, there are some generic codes that fit these models using Gibbs sampling (see Efficient Monte Carlo).

Some codes written specifically for hierarchical linear regression

Code          Language    multiple $x$'s   multiple $y$'s
------------  ----------  ---------------  ---------------
linmix_err    IDL         no               no
mlinmix_err   IDL         yes              no
linmix        Python      no               no
lrgs          R, Python   yes              yes

Excerpted from our list of MCMC packages