This notebook is adapted from a lesson from the 2019 KIPAC/StatisticalMethods course, (c) 2019 Adam Mantz and Phil Marshall, licensed under the GPLv2.
Most of this material is culled from this note, as well as the methods section of this thing and the appendix of this other thing. The first of these (in the introduction) cites some other papers elsewhere in astrophysics where the same formalism was independently arrived at.
Chapter 7 of Gelman et al.'s Bayesian Data Analysis (3rd ed.) is also a nice resource, although it's written for a non-astro audience.
What does "missing information" mean?
In physics, we're used to the idea that we never have complete information about a system.
Trivial example: non-zero measurement errors mean that we're missing some information, namely the true value of whatever we've measured. We deal with this by incorporating that fact into our model, via the sampling distribution.
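For example, for a measurement $\hat{x}$ of a true value $x$ with Gaussian noise of known width $\sigma_{\hat{x}}$, the sampling distribution would simply be $p(\hat{x}|x) = \mathrm{Normal}(\hat{x}|x,\sigma_{\hat{x}}^2)$.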
Hierarchical models tend to be full of such unobservable parameters, including things like group membership.
Key messages
In astronomical surveys, you might hear the (historical) terms Malmquist bias and Eddington bias.
Malmquist bias: flux-limited surveys have an effective luminosity limit for detection that rises with distance (redshift). Thus, the sample of measured luminosities is not representative of the whole population.
Eddington bias refers to the effect of noise or scatter on a luminosity function, the number of sources in some population as a function of $L$. Since fainter sources are more numerous, scatter moves more sources up the steeply falling luminosity function than down it, biasing the inferred function.
These are two particular manifestations of selection effects, which we can treat in a more general framework.
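As a quick numerical illustration of Eddington bias, here's a minimal simulation; the power-law slope, scatter, and threshold are all arbitrary choices for demonstration purposes:

```python
import numpy as np

rng = np.random.default_rng(42)

# True luminosities from a steep power law, dN/dL ~ L^(-2.5) for L >= 1
# (numpy's pareto is the Lomax distribution, so add 1 for the classical form).
L_true = 1.0 + rng.pareto(1.5, size=100000)

# Lognormal measurement scatter of 0.2 dex.
L_obs = L_true * 10**(0.2 * rng.standard_normal(L_true.size))

# Sources *observed* to be bright are, on average, intrinsically fainter:
# many more faint sources scatter up than bright sources scatter down.
bright = L_obs > 10.0
print(np.mean(np.log10(L_obs[bright] / L_true[bright])))  # > 0: overestimated
```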
First, some terminology you might see in relevant statistics literature: censoring and truncation.
(These are not one-to-one with Malmquist and Eddington bias.)
Censoring: a given data point (e.g. astronomical source) is known to exist, but a relevant measurement for it is not available.
This covers both completely absent measurements and upper limits/non-detections, although in the latter case the non-detection itself still provides us with information through the sampling distribution.
Truncation: not only are measurements missing, but the total number of sources that should be in the data set is unknown.
In other words, the lack of a measurement means that we don't even know about a particular source's existence.
LSST will include a galaxy cluster survey, finding clusters as overdensities of red galaxies (richness). The underlying cosmological signal is their number as a function of mass. Log-richness ($y$) and log-mass ($x$) are presumed to have a linear relationship with some scatter.
Let's start by considering a complete generative model (no selection effects yet). It needs

* a distribution for the total number of clusters, $p(N|\Omega)$;
* the population distribution of true log-masses, $p(x|\Omega)$;
* the scaling relation, including its intrinsic scatter, $p(y|x,\theta)$;
* sampling distributions for the measurements, $p(\hat{x}|x)$ and $p(\hat{y}|y)$.
Let's draw a PGM.
Here I'm anticipating that $N$ will be some Poisson variable based on $\Omega$. In the absence of selection effects, i.e. with a complete data set, we know what it is.
With some qualitatively reasonable parameter choices, here's a possible mock data set:
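In code, drawing such a mock data set might look like the sketch below. The specific distributions (Gaussian population, linear relation with Gaussian intrinsic scatter, Gaussian measurement errors) and all parameter values are illustrative assumptions, not necessarily those used for the figure:

```python
import numpy as np

rng = np.random.default_rng(1)

# p(N|Omega): number of clusters, Poisson with an expectation set by Omega
N = rng.poisson(200.0)

# p(x|Omega): true log-masses from the population distribution
mu, tau = 0.0, 1.0
x = rng.normal(mu, tau, size=N)

# p(y|x,theta): linear scaling relation with intrinsic scatter
alpha, beta, sigma_int = 1.0, 0.8, 0.2
y = rng.normal(alpha + beta * x, sigma_int)

# p(xhat|x), p(yhat|y): Gaussian measurement errors
sigma_xhat, sigma_yhat = 0.1, 0.1
xhat = rng.normal(x, sigma_xhat)
yhat = rng.normal(y, sigma_yhat)
```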
This is an inference you already know how to solve, given the modeling details. The likelihood is something like
$p(\hat{x},\hat{y},N|\Omega,\theta) = p(N|\Omega) \prod_{k=1}^{N} p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)$
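Concretely, with the Gaussian choices sketched above, the log of this likelihood (apart from the $p(N|\Omega)$ term) could be evaluated as follows, with the true $x_k$ and $y_k$ treated as latent parameters to be sampled along with $\Omega$ and $\theta$; again, a sketch under assumed forms rather than a unique implementation:

```python
import numpy as np
from scipy import stats

def log_like_complete(mu, tau, alpha, beta, sigma_int,
                      x, y, xhat, yhat, sigma_xhat, sigma_yhat):
    """Complete-data log-likelihood, apart from the p(N|Omega) term."""
    return (stats.norm.logpdf(x, mu, tau).sum()                        # p(x|Omega)
            + stats.norm.logpdf(y, alpha + beta * x, sigma_int).sum()  # p(y|x,theta)
            + stats.norm.logpdf(yhat, y, sigma_yhat).sum()             # p(yhat|y)
            + stats.norm.logpdf(xhat, x, sigma_xhat).sum())            # p(xhat|x)
```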
Needless to say, it is not safe to "simply fit a line" to the detected data, even if we only care about the $y$-$x$ relation and not the mass function.
How to modify the generative model to account for truncation?
Definitions: Let

* $I_k$ be an indicator variable: 1 if cluster $k$ is detected (included in the data set), 0 if not;
* $N_\mathrm{det}$ be the number of detected clusters;
* $\phi$ stand for any additional parameters that the detection probability depends on.
The differences from the complete-data model:

* the detection indicators, $I_k$, and their sampling distribution, $P(I_k|\hat{y}_k,\phi)$, are new;
* the total number of clusters, $N$, is now unknown; only $N_\mathrm{det}$ is observed;
* the measurements $\hat{x}_k$ and $\hat{y}_k$ of non-detected clusters are missing.
The new likelihood can be written
$P(\hat{x},\hat{y},I,N_\mathrm{det}|x,y,\theta,\Omega,\phi,N)=$
$ \quad {N \choose N_\mathrm{det}} \,P(\mathrm{detected}~\mathrm{clusters}) \,P(\mathrm{missing}~\mathrm{clusters})$
Let's break this down.
$P(\mathrm{detected}~\mathrm{clusters})$ is more or less what we've worked with before, with the additional sampling distribution for $I_k=1$.
$P(\mathrm{detected}~\mathrm{clusters}) =$
$ \quad \prod_{k=1}^{N_{det}} p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)\,p(I_k=1|\hat{y}_k,\phi)$
$P(\mathrm{missing}~\mathrm{clusters})$ is almost the same, but since these $\hat{x}_k$ and $\hat{y}_k$ are actually unobserved, they need to be marginalized over:
$P(\mathrm{missing}~\mathrm{clusters}) = \prod_{k=1}^{N-N_{det}}$
$ \quad \int d\hat{x}_k\,d\hat{y}_k\, p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)\,P(I_k=0|\hat{y}_k,\phi)$
With no observed values to distinguish them, these factors are all equal, so the product simplifies:

$P(\mathrm{missing}~\mathrm{clusters}) = P_{mis}^{N-N_{det}}$
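Although $P_{mis}$ is a multidimensional integral, it's easy to estimate by simple Monte Carlo: generate clusters from the model and average the non-detection probability. Below is a sketch, assuming the Gaussian model from earlier and a hard detection threshold, $I_k=1$ iff $\hat{y}_k > y_{th}$ (so $P(I_k=0|\hat{y}_k,\phi)$ is a step function):

```python
import numpy as np

def P_mis(mu, tau, alpha, beta, sigma_int, sigma_yhat, y_th,
          n=100000, seed=0):
    """Monte Carlo estimate of the per-cluster non-detection probability."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, tau, size=n)              # p(x|Omega)
    y = rng.normal(alpha + beta * x, sigma_int)  # p(y|x,theta)
    yhat = rng.normal(y, sigma_yhat)             # p(yhat|y)
    return np.mean(yhat <= y_th)                 # average of P(I=0|yhat,phi)
```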
What about this binomial term, ${N \choose N_\mathrm{det}} = \frac{N!}{N_{det}!(N-N_{det})!}$?
It's sneaky, and has to do with the statistical concept of exchangeability (a priori equivalence of data points).
We normally don't think about the fact that the order of data points is meaningless to us when mapping generative models to inferences (and a fixed $N_{det}!$ can be ignored anyway).
But now, with $N$ unknown, we have to worry about the fact that our complete model contains two non-exchangeable classes (detected and non-detected); the binomial coefficient counts the number of ways to assign $N$ clusters to these two classes.
If $p(N|\Omega)$ is Poisson, it can be marginalized out to produce
$P(\hat{x},\hat{y},I,N_\mathrm{det}|x,y,\theta,\Omega,\phi)= e^{-\langle N_{det} \rangle} \, \langle N \rangle^{N_{det}} \,P(\mathrm{detected}~\mathrm{clusters})$
with $\langle N \rangle$ the expected total number of clusters (a function of $\Omega$), and $\langle N_{det} \rangle = \langle N \rangle (1 - P_{mis})$ the expected number of detections,
i.e., a relatively simple change from where we started.
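To see where this comes from, substitute the Poisson distribution, $p(N|\Omega)=e^{-\langle N \rangle}\langle N \rangle^N/N!$, into the previous likelihood and sum over the unknown $N$ (with $m=N-N_\mathrm{det}$):

$\sum_{N=N_\mathrm{det}}^{\infty} \frac{e^{-\langle N \rangle}\langle N \rangle^{N}}{N!} {N \choose N_\mathrm{det}} \,P(\mathrm{detected}~\mathrm{clusters}) \,P_{mis}^{N-N_\mathrm{det}}$

$\quad = \frac{\langle N \rangle^{N_\mathrm{det}}}{N_\mathrm{det}!} \,P(\mathrm{detected}~\mathrm{clusters}) \, e^{-\langle N \rangle} \sum_{m=0}^{\infty} \frac{\left(\langle N \rangle P_{mis}\right)^m}{m!}$

$\quad = \frac{\langle N \rangle^{N_\mathrm{det}}}{N_\mathrm{det}!} \,P(\mathrm{detected}~\mathrm{clusters}) \, e^{-\langle N \rangle (1-P_{mis})}$

The exponent is just $-\langle N_\mathrm{det} \rangle$, and the leftover $1/N_\mathrm{det}!$ is a constant that doesn't affect inference.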
There's a more intuitive, but less generally applicable, way to get to the same place. As an exercise, try writing down the likelihood of the detected data alone, treating the detected clusters as an independently occurring (Poisson) population whose rate is the complete-model rate multiplied by the probability of detection.
Up to a constant factor of $N_{det}!$, you'll arrive at the expression above, but explicitly marginalized over $x$ and $y$.
For a general problem, selection effects are ignorable if both of these are true:

1. Selection doesn't depend on (potentially) unobserved data.
2. Selection doesn't depend on the parameters of interest (more precisely, the priors for the parameters of interest are independent of any parameters governing selection).
The second is often a non-starter in survey science, since whatever we're interested in will often correlate with our detection signal.
Really, these conditions boil down to whether the posterior for the parameters of interest depends on selection.
A less strict version of the same question is: does ignoring selection effects bias me at a level I care about?
Analyzing mock data is the best way of answering this.
We haven't worked one of these problems fully, but typically (when we assume independently occurring sources) our likelihood only becomes a little more complicated due to selection. We just need to be able to evaluate the selection probability and predict the number of selected sources from the model.
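For the running example, both pieces are in hand: the Monte Carlo `P_mis` from above supplies the selection probability and the predicted $\langle N_\mathrm{det} \rangle$. A sketch of the marginalized log-likelihood under the same illustrative assumptions follows; note that for a hard cut, $p(I_k=1|\hat{y}_k,\phi)=1$ for every detected cluster, so no extra per-cluster factor appears:

```python
import numpy as np
from scipy import stats

def log_like_selected(mu, tau, alpha, beta, sigma_int, N_expected,
                      x, y, xhat, yhat, sigma_xhat, sigma_yhat, y_th):
    """Log-likelihood of the detected sample, with N marginalized out
    (Poisson case), dropping the constant N_det! factor."""
    pmis = P_mis(mu, tau, alpha, beta, sigma_int, sigma_yhat, y_th)
    Ndet_expected = N_expected * (1.0 - pmis)  # <N_det>
    # e^{-<N_det>} <N>^{N_det} times the detected-cluster terms:
    return (-Ndet_expected + len(xhat) * np.log(N_expected)
            + stats.norm.logpdf(x, mu, tau).sum()
            + stats.norm.logpdf(y, alpha + beta * x, sigma_int).sum()
            + stats.norm.logpdf(yhat, y, sigma_yhat).sum()
            + stats.norm.logpdf(xhat, x, sigma_xhat).sum())
```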
The need to model a hidden population places additional demands on our data, so the size/quality of data set required to get a data-dominated (rather than prior-dominated) answer can be non-intuitive. Be careful.
Consider the following variants of the galaxy cluster example:
In each case, sketch the PGM and decide whether selection effects are ignorable for inference about (a) the scaling relation parameters, $\theta$, and (b) the population parameters, $\Omega$.
If not, can you identify special cases where selection becomes ignorable?