What does "missing information" mean?
In physics, we're used to the idea that we never have complete information about a system.
Trivial example: non-zero measurement errors mean that we're missing some information, namely the true value of whatever we've measured. We deal with this by incorporating that fact into our model, via the sampling distribution.
Hierarchical models tend to be full of such unobservable parameters, including things like group membership.
Key messages
More missingness mechanisms
Two more ways that data can be missing are extremely common in astrophysics, and especially in surveys. In statistics, these are called censoring and truncation.
These are related (though not one-to-one) to the astronomical terms Malmquist bias and Eddington bias.
Censoring: a given data point (astronomical source) is known to exist, but a relevant measurement for it is not available.
This refers both to completely absent measurements and upper limits/non-detections, although in principle the latter case still provides us with a sampling distribution.
Truncation: not only are measurements missing, but the total number of sources that should be in the data set is unknown.
In other words, the lack of a measurement means that we don't even know about a particular source's existence.
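To make censoring concrete, here is a minimal sketch (a toy Gaussian model with invented numbers, not tied to any particular survey): we estimate a mean $\mu$ from measurements with unit noise, where values at or below a threshold are reported only as non-detections. Each non-detection still contributes a term to the likelihood, namely the probability of falling below the threshold:

```python
import numpy as np
from scipy import stats

def censored_loglike(mu, y_det, n_nondet, sigma=1.0, y_th=0.0):
    """Log-likelihood for a Gaussian mean with censoring at y_th (toy model).

    Detected points (y > y_th) contribute the usual density; each
    non-detection contributes P(y < y_th | mu) = Phi((y_th - mu)/sigma).
    """
    ll = stats.norm.logpdf(y_det, loc=mu, scale=sigma).sum()
    ll += n_nondet * stats.norm.logcdf(y_th, loc=mu, scale=sigma)
    return ll

# simulate: true mu = 1, unit noise, detection threshold at 0
rng = np.random.default_rng(42)
y = rng.normal(1.0, 1.0, size=500)
y_det, n_nondet = y[y > 0.0], int((y <= 0.0).sum())

# maximize over a grid: ignoring the non-detections biases mu high
grid = np.linspace(0.0, 2.0, 401)
mu_naive = grid[np.argmax([stats.norm.logpdf(y_det, m, 1.0).sum() for m in grid])]
mu_cens = grid[np.argmax([censored_loglike(m, y_det, n_nondet) for m in grid])]
print(mu_naive, mu_cens)  # mu_naive overshoots; mu_cens lands near 1
```

Dropping the non-detections amounts to (incorrectly) treating the detected sample as complete; the `logcdf` term is what restores an unbiased estimate.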
Malmquist bias refers to the fact that flux-limited surveys have an effective luminosity limit for detection that rises with distance (redshift). Thus, the sample of measured luminosities is not representative of the whole population.
Eddington bias refers to the effect of noise or scatter on a luminosity function, $N(>L)$, the number of sources in some population more luminous than $L$.
Because the true $N(>L)$ is usually steeply decreasing in practice, and extends below the survey flux limit, scatter in measurements of $L$ can have a big impact on the measured luminosity function.
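The Eddington effect is easy to demonstrate with a toy Monte Carlo (all numbers here are invented): draw fluxes from a steep power law, add Gaussian measurement noise, and count sources above a limit. Because there are far more faint sources just below the limit than bright ones just above it, noise scatters a net excess of sources upward across it:

```python
import numpy as np

rng = np.random.default_rng(1)

# steep power-law population: P(L_true > l) = (0.1 / l)^2 for l >= 0.1 (toy numbers)
n = 200_000
L_true = 0.1 * (1.0 + rng.pareto(2.0, size=n))   # classical Pareto via numpy's Lomax
L_obs = L_true + rng.normal(0.0, 0.08, size=n)   # Gaussian measurement scatter

L_lim = 0.3  # survey detection limit (toy)
n_true_above = (L_true > L_lim).sum()
n_obs_above = (L_obs > L_lim).sum()
print(n_obs_above / n_true_above)  # > 1: noise inflates the counts above the limit
```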
The terms Malmquist and Eddington bias were coined in relatively specific contexts. Usually, it's more accurate to say that a given data set is impacted by the selection procedure.
Example
Consider the (real) case of a flux-limited galaxy cluster survey. Cluster luminosities scale with mass, and the mass function (hence also the luminosity function) is steeply decreasing. The number as a function of mass and redshift, and the luminosity-mass relation, are both of interest.
Coping with missing information
Ad hoc approaches exist, but won't be covered. You might hear the terms "debiasing" or "deboosting" in this context.
Ideally, we should include the "selection" process that determines which data are observed and which are not in our generative model. This may involve expanding the model to include things like undetected sources.
Formally Modelling Missing Information
Adopting notation from *Bayesian Data Analysis* (Gelman et al. 2004 - highly recommended!): $y$ is the complete data set, split into observed ($y_\mathrm{obs}$) and missing ($y_\mathrm{mis}$) parts; $I$ is the inclusion vector, encoding which data are observed; $\theta$ parametrizes the model for $y$; and $\phi$ parametrizes the model for $I$.
We'll assume that $\theta$ and $\phi$ can always be separated.
The likelihood associated with a complete data set would be just
$P(y|\theta)$
For our partially missing data set, this needs to also account for the inclusion parameters, $I$
$P(y,I|\theta,\phi) = P(y|\theta)\,P(I|\phi,y)$
In other words, inclusion is part of the observed data vector.
Expanding out the $y$s,
$P(y_\mathrm{obs},y_\mathrm{mis},I|\theta,\phi) = P(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$
This isn't yet a likelihood for the observed data, however. For that we need to marginalize over the $y_\mathrm{mis}$.
$P(y_\mathrm{obs},I|\theta,\phi) = \int dy_\mathrm{mis} \, P(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$
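The marginalization over $y_\mathrm{mis}$ can be checked numerically in a simple case (a hypothetical toy model, not part of the notes): a single missing Gaussian datum with a soft, probit selection function $P(I=1|y)=\Phi[(y-y_\mathrm{th})/w]$. The integral then has a closed form, which a Monte Carlo average reproduces:

```python
import numpy as np
from scipy import stats

# one missing datum: y ~ N(theta, sigma^2); selection P(I=1|y) = Phi((y - y_th)/w)
# P(I=0|theta) = int dy N(y|theta, sigma^2) Phi((y_th - y)/w)
#             = Phi((y_th - theta) / sqrt(sigma^2 + w^2))   (Gaussian-probit identity)
theta, sigma, y_th, w = 1.0, 1.0, 0.5, 0.3

rng = np.random.default_rng(0)
y = rng.normal(theta, sigma, size=1_000_000)
mc = stats.norm.cdf((y_th - y) / w).mean()                 # Monte Carlo marginalization
exact = stats.norm.cdf((y_th - theta) / np.hypot(sigma, w))
print(mc, exact)  # the two agree closely
```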
Note that we no longer have a clean separation between data and parameters in the PGM sense.
Thinking of drawing a PGM, the $y$ nodes can be fixed by observation or be nuisance parameters, depending on the corresponding element of $I$.
When can we ignore selection?
Consider the likelihood in this form
$P(y_\mathrm{obs},I|\theta,\phi) = \int dy_\mathrm{mis} \, P(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$
We can get away with ignoring the selection process if the posterior for the parameters of interest $P(\theta|y_\mathrm{obs},I)$ is equivalent to simply $P(\theta|y_\mathrm{obs})$.
$P(\theta|y_\mathrm{obs},I)$
$\propto \int d\phi\int dy_\mathrm{mis} \, P(y_\mathrm{obs},y_\mathrm{mis}|\theta) \, P(I|\phi,y_\mathrm{obs},y_\mathrm{mis}) \, P(\theta,\phi)$
$\propto P(\theta|y_\mathrm{obs})$ ?
This requires two things to be true: the priors for $\theta$ and $\phi$ must be independent, and selection must not depend on the potentially unobserved data.
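To see how these conditions deliver ignorability (a quick check of the algebra, filling in steps not written out above): suppose the priors factor, $P(\theta,\phi)=P(\theta)\,P(\phi)$, and selection does not depend on the missing data, $P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})=P(I|\phi,y_\mathrm{obs})$. Then the selection term pulls out of the $y_\mathrm{mis}$ integral:

$P(\theta|y_\mathrm{obs},I) \propto P(\theta) \int d\phi \, P(\phi)\,P(I|\phi,y_\mathrm{obs}) \int dy_\mathrm{mis}\, P(y_\mathrm{obs},y_\mathrm{mis}|\theta)$

$= P(\theta)\,P(y_\mathrm{obs}|\theta) \int d\phi \, P(\phi)\,P(I|\phi,y_\mathrm{obs})$

$\propto P(\theta|y_\mathrm{obs}),$

since the leftover integral over $\phi$ is a constant with respect to $\theta$ and drops out on normalization.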
To start with, we'll assume a complete data set. The generative model then needs two additions:
The data need to be augmented by the inclusion vector, $I$, which implicitly encodes the number of detected clusters, $N_\mathrm{det}$.
The model must expand to contain $\phi$ and the total number of clusters, $N$ (since this is a truncation problem).
Before drawing the PGM, let's have a look at the new likelihood (here $x$ and $y$ are the true mass and luminosity of each cluster, and $\hat{x}$ and $\hat{y}$ the corresponding measurements):
$P(\hat{x},\hat{y},I,N_\mathrm{det}|x,y,\theta,\phi,N) = {N \choose N_\mathrm{det}} \,P(\mathrm{detected}~\mathrm{clusters}) \,P(\mathrm{missing}~\mathrm{clusters})$
Note that a binomial term, ${N \choose N_\mathrm{det}}$, has sneakily appeared.
The reason for this is subtle, and has to do with the statistical concept of exchangeability (a priori equivalence of data points).
Here the fully observed data are a priori exchangeable with one another, as are the partially observed data, but the full data set contains these two non-exchangeable classes.
It helps to think in terms of the generative model here. Namely, because the order of data points holds no meaning for us, the binomial term is there to reflect the number of ways we might generate completely equivalent (except for the ordering) data sets.
The term for detected clusters is what we had before, with the addition of the detection probability:
$P(\mathrm{detected}~\mathrm{clusters}) =$
$\prod_{k=1}^{N_\mathrm{det}} P(y_k|x_k,\theta)\,P(\hat{y}_k|y_k)\,P(\hat{x}_k|x_k)\,P(I_k=1|\hat{y}_k,\phi)$
The term for missing clusters is similar, but must be marginalized over the unobserved $\hat{x}_k$ and $\hat{y}_k$, subject to the constraint that these clusters not be detected:
$P(\mathrm{missing}~\mathrm{clusters}) =$
$\prod_{k=1}^{N-N_\mathrm{det}} \int d\hat{x}_k\,d\hat{y}_k\, P(y_k|x_k,\theta)\,P(\hat{y}_k|y_k)\,P(\hat{x}_k|x_k)\,P(I_k=0|\hat{y}_k,\phi)$.
All terms in the product are equal once we marginalize over $x_k$ and $y_k$, so this will simplify to
$P_\mathrm{mis}^{N_\mathrm{mis}}$
with $N_\mathrm{mis}=N-N_\mathrm{det}$ and $P_\mathrm{mis}$ the a priori probability of a cluster going undetected.
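Here is a minimal numerical sketch of this likelihood structure (a toy 1-D model with invented numbers, not the notes' cluster scaling-relation setup): true values $y_k \sim N(\theta,1)$, measurements $\hat{y}_k = y_k + N(0,s^2)$, and detection iff $\hat{y}_k > y_\mathrm{th}$. Marginalizing the true value gives $\hat{y}_k \sim N(\theta, 1+s^2)$, so the detected-source terms and $P_\mathrm{mis}$ are both available in closed form:

```python
import numpy as np
from scipy import stats, special

def loglike(theta, N, yhat_det, s=0.5, y_th=0.0):
    """Toy truncated-sample log-likelihood (illustrative only).

    True values y ~ N(theta, 1); measurements yhat = y + N(0, s^2);
    a source is detected iff yhat > y_th.  Marginalizing y gives
    yhat ~ N(theta, 1 + s^2), so
      detected terms: N(yhat_k | theta, sqrt(1 + s^2)) for each yhat_k > y_th,
      P_mis:          Phi((y_th - theta) / sqrt(1 + s^2)).
    """
    n_det = len(yhat_det)
    if N < n_det:
        return -np.inf
    scale = np.hypot(1.0, s)  # sqrt(1 + s^2)
    ll = (special.gammaln(N + 1) - special.gammaln(n_det + 1)
          - special.gammaln(N - n_det + 1))                  # log binomial(N, n_det)
    ll += stats.norm.logpdf(yhat_det, loc=theta, scale=scale).sum()
    ll += (N - n_det) * stats.norm.logcdf(y_th, loc=theta, scale=scale)
    return ll

# check: with theta = 0 and y_th = 0, P_det = 0.5, so at the true theta the
# likelihood as a function of N should peak near 2 * N_det
rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=1000)
yhat = y + rng.normal(0.0, 0.5, size=1000)
yhat_det = yhat[yhat > 0.0]
Ns = np.arange(len(yhat_det), 3001)
N_best = Ns[np.argmax([loglike(0.0, N, yhat_det) for N in Ns])]
print(len(yhat_det), N_best)  # N_best sits at about twice the number detected
```

At fixed $\theta$, the binomial term combined with $P_\mathrm{mis}^{N-N_\mathrm{det}}$ makes the likelihood in $N$ peak near $N_\mathrm{det}/P_\mathrm{det}$, which is how the data constrain the size of the hidden population.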
Rather than going on to manipulate this further, just note that the additions to the data/model boil down to the detection probability, $P(I_k|\hat{y}_k,\phi)$, and the total number of clusters, $N$.
Are the priors for $\theta$ and $\phi$ independent?
Yes, at least as drawn in the PGM. And usually this is the assumption.
Is selection independent of (potentially) unobserved data ($\hat{x}_k$ and $\hat{y}_k$)?
Hell, no. The detection probability explicitly depends on $\hat{y}_k$.
Let's say there somehow isn't a threshold for detection in the above problem. Ignoring large-scale correlations (pretty accurate for clusters), the a priori probability of detecting a cluster is simply $f_\mathrm{sky}$, the fraction of the sky surveyed.
Is selection ignorable in this case? This is not a trick question, but justify the answer in terms of the discussion above.
Consider the following variants of the galaxy cluster example:
In each case, sketch the PGM and decide whether selection effects are ignorable for inference about
If not, can you identify special cases where selection becomes ignorable?
We haven't worked one of these problems fully, but typically (when we assume independently occurring sources) our likelihood only becomes a little more complicated due to selection. We just need to be able to evaluate the selection probability and predict the number of selected sources from the model.
The need to model a hidden population places additional demands on our data, so the size/quality of data set required to get a data-dominated (rather than prior-dominated) answer can be non-intuitive. Be careful.