Real data sets tend to be complicated. Some common features:
All of these things can be addressed through appropriate modeling.
Very often, this involves introducing additional nuisance parameters and/or expanding the model such that it has a hierarchical structure.
Including nuisance parameters in a model provides a way to explicitly account for and marginalize over systematic uncertainties.
This can be as simple as assigning a prior distribution to some quantity that would otherwise remain fixed, or it can involve a more explicit expansion of the model.
Often nuisance parameters represent latent variables - things that are logically or physically part of a model, but which cannot be directly measured.
Recall our simple measurement of a galaxy's flux based on a number of detected counts from the Bayes Theorem chunk:
Let's explicitly include a constant conversion between counts and flux:
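As a minimal sketch of what these expressions might look like (assuming Poisson-distributed counts and writing the conversion as a constant $C$; the exact symbols and parameterization may differ from the original notes):

$F \sim \mathrm{Gamma}(\alpha, \beta)$

$\mu = C\,F$

$N \sim \mathrm{Poisson}(\mu)$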
Q: Why does it make sense to keep a Gamma prior on $F$ (with slightly redefined parameters)?
Q: What secret message describing this class is encoded in the PGM?
Now let's expand the model to account for the fact that we actually measure counts from both the galaxy and the background. Being good astronomers, we also remembered to take a second measurement of a source-free (background-only) patch of sky.
Subscripts: $b$ for background, $g$ for galaxy, $s$ for the science observation that includes both.
The background quantities could be regarded as nuisance parameters here - we need to account for our uncertainty in $F_b$, but measuring it isn't the point of the analysis.
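A rough sketch of how this expanded model might be written (illustrative assumptions: Poisson counts, a common counts-per-flux constant $C$, and equal exposures for the two observations):

$F_g \sim \mathrm{Gamma}(\alpha_g, \beta_g)$ (or some other prior)

$F_b \sim \mathrm{Gamma}(\alpha_b, \beta_b)$

$N_b \sim \mathrm{Poisson}(C\,F_b)$

$N_s \sim \mathrm{Poisson}\left[C\,(F_g + F_b)\right]$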
Another case: suppose we want to marginalize over systematic uncertainty in the instrument gain, assuming it's the same for both observations.
Maybe the instrument team told us it was known to $\sim5\%$. We might then write down a correspondingly informative prior for the gain, introducing $\gamma$ as a nuisance parameter.
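For illustration only (the specific distribution and numbers are assumptions, not taken from the original notes), such a prior might look like

$\gamma \sim \mathrm{Normal}(1.0, 0.05^2),$

with $\gamma$ multiplying the nominal flux-to-counts conversion in both observations.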
Often, especially in physics, the model for a data set naturally takes a hierarchical structure.
In statistics, this is related to the concept of exchangeability - as far as we know, individual sources of a given class are equivalent (until we measure them).
In practice, the hierarchy usually takes the form of common priors for the parameters describing individual measured sources.
General form for a hierarchical model:
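Schematically, with data $y$, source-level parameters $\theta$, and hyperparameters $\phi$ (generic symbols, just to illustrate the structure):

$\phi \sim p(\phi)$ (hyperprior)

$\theta_i \sim p(\theta|\phi)$ (common prior for each source $i$)

$y_i \sim p(y|\theta_i)$ (sampling distribution for each source's data)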
Let's modify the previous example as follows:
$n(x) = \phi^\ast x^\alpha e^{-x}; \quad x=\frac{L}{L^\ast}$
Here $n(x)$ is the number density of galaxies as a function of scaled luminosity $x$ (the Schechter luminosity function).
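As a direct transcription of the formula above (the function and variable names are just for illustration):

```python
import numpy as np

def schechter(x, phi_star, alpha):
    # n(x) = phi_star * x**alpha * exp(-x), with x = L / L_star
    return phi_star * x**alpha * np.exp(-x)

# e.g., evaluate the shape over a grid of scaled luminosities
x = np.logspace(-2.0, 1.0, 100)
n = schechter(x, phi_star=1.0, alpha=-1.2)
```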
Now - what does the PGM for our experiment look like?
Compressing the $L\rightarrow N$ and $F\rightarrow N$ conversions:
Q: What is the prior on $L_{g,i}$?
Q: Which of these things are nuisance parameters?
The exercises below will get you working with PGMs and model expressions (the $A\sim B$ type relationships) for a few scenarios that illustrate advanced (but common) features of data analysis.
When drawing a PGM, do note which parameters would need to have a prior defined, but don't worry about explicitly including prior parameters that are not part of the described model.
Produce the PGM and model expressions for the following scenario:
Type Ia supernova luminosities are "standardizable". This means that, using observable properties of the lightcurve (color $c$ and "stretch" $s$), SNIa luminosities can be corrected to a scale where they are almost equal. Almost, but not quite - there is still a residual intrinsic scatter, even after this correction.
The problem, described in a non-generative way, is as follows. We measure a number of counts for each of many SNIa, and can use the known telescope properties to convert that to a flux. The flux is related to the SNIa's intrinsic luminosity by a distance which is a function of its (measured) redshift, $z$. The color and stretch of the SNIa are also measured; these relate the actual luminosity to the "standard" SNIa luminosity. However, there is a scatter, which we assume to be Gaussian, between the actual luminosities of SNIa and this standard luminosity. The parameters that we ultimately care about are the average standard luminosity, $\bar{L}$, and the intrinsic scatter, $\sigma$.
In practice, a common use of SNIa involves adding free cosmological parameters in the distance-redshift relation, but we can leave that out here.
It's pretty common for multiple quantities that can be derived from the same observation to have correlated measurement errors. X-ray spectroscopy of galaxy clusters provides a good example. From this type of data, we can measure gas temperatures and estimate the total gravitating mass (assuming the gas is in equilibrium with the gravitational potential). However, temperature is one of the observables needed in the hydrostatic equation to compute the mass (the other being gas density), so any uncertainty in the temperature measurement propagates to the mass.
Naturally, determining the gas temperature as a function of total mass for galaxy clusters is interesting. A standard model for this is a power-law, $T\propto M^\beta$, with a log-normal intrinsic scatter (see last exercise). Slightly more simply, we could write $t=\alpha+\beta\,m$, with normal scatter $\tau$, for $t=\log(T)$ and $m=\log(M)$.
The new wrinkle is that we have correlated measurement errors on $t$ and $m$. We'll follow the common practice of assuming that the measurement errors on $t$ and $m$ are Gaussian and that we can use a fixed estimate of them - but instead of simply having some $\sigma_t$ and $\sigma_m$ in the problem, we actually have an estimated $2\times2$ covariance matrix, $\Sigma$, for each source.
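To make the per-source covariance matrix concrete, here is a small illustration of the machinery (not the exercise solution; all numbers and names are made up for this example):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical latent (true) log-temperature and log-mass for one cluster
t_true, m_true = 0.8, 14.5

# Estimated 2x2 measurement covariance for (t, m); the off-diagonal term
# encodes the correlation between the two measurement errors
Sigma = np.array([[0.010, 0.006],
                  [0.006, 0.020]])

# The sampling distribution for the measured pair (t_hat, m_hat) is a single
# bivariate normal centered on the latent values, rather than two
# independent Gaussians
t_hat, m_hat = 0.85, 14.6
like = multivariate_normal.pdf([t_hat, m_hat], mean=[t_true, m_true], cov=Sigma)
```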
Draw the PGM and write down the model for the above scenario.
Future time-domain photometric surveys will discover too many supernovae to follow up and type spectroscopically. This means that if we want to do our SNIa experiment from the "intrinsic scatter" example, our data set will be contaminated by other supernova types.
Here we simplify things a little by eliminating the standardization process, and just assume that each of the $m$ SN types has luminosities that follow its own distribution. We also introduce group membership as a latent parameter.
Based on the model written out below, draw a PGM and come up with a more complete narration of what's going on than is provided above.
Some priors on $q_i$, $\bar{L}_i$, $\sigma_i$, for $i=1,\ldots,m$. ($q_i$ is the a priori probability of a SN being of type $i$.)
For each SN observed,
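one plausible set of per-SN expressions, offered purely as an illustrative assumption (indexing supernovae by $j$):

$g_j \sim \mathrm{Categorical}(q_1, \ldots, q_m)$ (latent type membership)

$L_j \sim \mathrm{Normal}(\bar{L}_{g_j}, \sigma_{g_j}^2)$ (luminosity drawn from that type's distribution)

with the measured quantities (counts or flux, and redshift) related to $L_j$ through the distance, as in the intrinsic scatter example.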
An extremely common task is fitting a linear model to data that have measurement uncertainties in both the $x$ and $y$ values, as well as intrinsic scatter. Non-Bayesian approaches were developed in the 1990s, but they have serious flaws (e.g., they cannot fit for the scatter), whereas this problem is neatly addressed by hierarchical modeling (see the exercises).
For the also-common assumptions of Gaussian uncertainties and scatter, there are some generic codes that fit these models using Gibbs sampling (see Efficient Monte Carlo).
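To make the structure of these models concrete, here is a short generative sketch (all names and numbers are invented for illustration) of data with a linear relation, Gaussian intrinsic scatter, and Gaussian measurement errors on both axes; this is the type of model the codes below fit:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50

# Latent ("true") x values, drawn from some population distribution
x_true = rng.normal(0.0, 1.0, size=n)

# Latent y values: linear relation plus Gaussian intrinsic scatter
alpha, beta, scatter = 1.0, 0.5, 0.2
y_true = alpha + beta * x_true + rng.normal(0.0, scatter, size=n)

# Observed values: latent values plus heteroscedastic Gaussian measurement errors
x_err = rng.uniform(0.05, 0.2, size=n)
y_err = rng.uniform(0.05, 0.2, size=n)
x_obs = x_true + rng.normal(0.0, x_err)
y_obs = y_true + rng.normal(0.0, y_err)
```

A hierarchical fit treats `x_true` and `y_true` as latent parameters and infers `alpha`, `beta`, and the scatter jointly, rather than regressing `y_obs` on `x_obs` directly.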
Some codes written specifically for hierarchical linear regression:
Code | Language | multiple $x$'s | multiple $y$'s |
---|---|---|---|
linmix_err | IDL | no | no |
mlinmix_err | IDL | yes | no |
linmix | Python | no | no |
lrgs | R, Python | yes | yes |
Excerpted from our list of MCMC packages