Missing Information and Selection Effects

Goals:

  • Incorporate models for data selection into our toolkit
  • Understand when selection effects are ignorable, and when they must be accounted for

What does "missing information" mean?

In physics, we're used to the idea that we never have complete information about a system.

Trivial example: non-zero measurement errors mean that we're missing some information, namely the true value of whatever we've measured. We deal with this by incorporating that fact into our model, via the sampling distribution.

Hierarchical models tend to be full of such unobservable parameters, including things like group membership.

Key messages

  1. No data set is perfectly complete (especially in astronomy!)
  2. It's our job to know whether that incompleteness can be ignored for the purpose of our inference
  3. If not, we need to model it appropriately and marginalize over our ignorance

More missingness mechanisms

Two more ways that data can be missing are extremely common in astrophysics, and especially in surveys. In statistics, these are called censoring and truncation.

These are related (though not one-to-one) to the astronomical terms Malmquist bias and Eddington bias.

Censoring: a given data point (astronomical source) is known to exist, but a relevant measurement for it is not available.

This refers both to completely absent measurements and to upper limits/non-detections, although in principle the latter case still provides us with a sampling distribution.

Truncation: not only are measurements missing, but the total number of sources that should be in the data set is unknown.

In other words, the lack of a measurement means that we don't even know about a particular source's existence.

Malmquist bias refers to the fact that flux-limited surveys have an effective luminosity limit for detection that rises with distance (redshift). Thus, the sample of measured luminosities is not representative of the whole population.
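As a minimal numerical sketch (all numbers hypothetical), a fixed flux limit $F_\mathrm{lim}$ translates into a minimum detectable luminosity $L_\mathrm{lim}(d) = 4\pi d^2 F_\mathrm{lim}$ that grows with distance:

```python
import numpy as np

# Hypothetical survey flux limit; distances spanning roughly Mpc to Gpc scales.
F_lim = 1e-13                 # erg/s/cm^2
d = np.logspace(24, 27, 4)    # cm

# The luminosity limit for detection rises as distance squared, so distant
# subsamples contain only intrinsically bright sources.
L_lim = 4.0 * np.pi * d**2 * F_lim
for di, Li in zip(d, L_lim):
    print(f"d = {di:.1e} cm  ->  L_lim = {Li:.1e} erg/s")
```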

Eddington bias refers to the effect of noise or scatter on a luminosity function, $N(>L)$, the number of sources in some population more luminous than $L$.

Because the true $N(>L)$ is usually steeply decreasing in practice, and extends below the survey flux limit, scatter in measurements of $L$ can have a big impact on the measured luminosity function.

(The accompanying figure shows a histogram rather than $N(>L)$, but you get the idea.)
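To see the effect numerically, here is a minimal simulation (my own illustration; the power-law slope, scatter, and cut are all made up). Because faint sources vastly outnumber bright ones, more sources scatter upward across the cut than downward below it, inflating the measured counts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Steeply falling population: dN/dL ~ L^(-2.5) above L_min (arbitrary units).
alpha, L_min, L_cut = 2.5, 1.0, 10.0
L_true = L_min * (1.0 - rng.uniform(size=1_000_000))**(-1.0 / (alpha - 1.0))

# ~30% lognormal measurement scatter on each luminosity.
L_meas = L_true * rng.lognormal(mean=0.0, sigma=0.3, size=L_true.size)

# Eddington bias: the measured count above the cut exceeds the true count.
print("N(true L > cut):    ", np.sum(L_true > L_cut))
print("N(measured L > cut):", np.sum(L_meas > L_cut))
```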

The terms Malmquist and Eddington bias were coined in relatively specific contexts. Usually, it's more accurate to say that a given data set is impacted by the selection procedure.

Example

Consider the (real) case of a flux-limited galaxy cluster survey. Cluster luminosities scale with mass, and the mass function (hence also the luminosity function) is steeply decreasing. The number as a function of mass and redshift, and the luminosity-mass relation, are both of interest.

Compilation of ROSAT All-Sky Survey cluster detections

Fictional luminosity-mass data, applying a threshold for detection
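Data like those in the figure can be mocked up in a few lines. This is purely an illustrative sketch, with an invented mass function slope, scaling relation, scatter, and threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fictional population: masses from a power law, dN/dx ~ x^(-2) above 1e14,
# luminosities from a power-law mean relation with lognormal intrinsic scatter.
N = 1000
x = 1e14 * (1.0 - rng.uniform(size=N))**(-1.0)            # masses (Msun)
y = 1e44 * (x / 1e14)**1.3 * rng.lognormal(0.0, 0.4, N)   # luminosities (erg/s)

# A hard detection threshold in luminosity truncates the sample.
detected = y > 5e44
print(f"{detected.sum()} of {N} clusters detected")
```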

Coping with missing information

Ad hoc approaches exist, but won't be covered. You might hear the terms "debiasing" or "deboosting" in this context.

Ideally, we should include in our generative model the "selection" process that determines which data are observed and which are not. This may involve expanding the model to include things like undetected sources.

  • Priors (models for things that aren't observed) are going to matter!

Formally Modelling Missing Information

Adopting notation from Bayesian Data Analysis (Gelman et al. 2004)

  • $y_\mathrm{obs}$ and $y_\mathrm{mis}$ are the observed and unobserved data, and $y=y_\mathrm{obs}\cup y_\mathrm{mis}$
  • $I$ is a vector of indicator variables (0 or 1) telling us whether a given $y$ is observed or not
  • $\theta$ is the set of parameters needed to model a completely observed data set
  • $\phi$ are any additional parameters needed to model the selection process

The likelihood associated with a complete data set would be just

$p(y|\theta)$

For our partially missing data set, this also needs to account for the inclusion indicators, $I$

$p(y,I|\theta,\phi) = p(y|\theta)\,P(I|\phi,y)$

In other words, inclusion is part of the observed data vector.

Expanding out the $y$s,

$p(y_\mathrm{obs},y_\mathrm{mis},I|\theta,\phi) = p(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$

This isn't yet a likelihood for the observed data, however. For that we need to marginalize over the $y_\mathrm{mis}$.

$p(y_\mathrm{obs},I|\theta,\phi) = \int dy_\mathrm{mis} \, p(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$
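As a concrete (toy) illustration of this marginalization, suppose each datum has a Gaussian sampling distribution with mean $\theta$, and is observed iff its value exceeds a known threshold, so that $P(I|y)$ is a step function. In the censoring case, where the number of missing points is known, the integral over each missing $y$ reduces to a Gaussian CDF. The model choices here are mine, purely for illustration:

```python
import numpy as np
from scipy.stats import norm

def log_like(theta, y_obs, n_mis, sigma=1.0, y_th=0.0):
    """Observed-data log likelihood for a toy censored-Gaussian model.

    Each datum is drawn from N(theta, sigma^2) and observed iff it exceeds
    y_th. The integral over each missing y then reduces to the probability
    of falling below threshold, i.e. a Gaussian CDF.
    """
    ll = np.sum(norm.logpdf(y_obs, loc=theta, scale=sigma))  # observed points
    ll += n_mis * norm.logcdf(y_th, loc=theta, scale=sigma)  # censored points
    return ll

print(log_like(theta=1.0, y_obs=np.array([0.5, 1.7, 2.2]), n_mis=2))
```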

Note that we no longer have a clean separation between data and parameters in the PGM sense.

Thinking of drawing a PGM, the $y$ nodes can be fixed by observation or be nuisance parameters, depending on the corresponding element of $I$.

Either way, they go in a double circle because they come from a sampling distribution. This reflects the fact that our model must include a sampling distribution even for putative sources that aren't in our data set!

When can we ignore selection?

Consider the likelihood in this form

$p(y_\mathrm{obs},I|\theta,\phi) = \int dy_\mathrm{mis} \, p(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$

We can get away with ignoring the selection process if the posterior for the parameters of interest $p(\theta|y_\mathrm{obs},I)$ is equivalent to simply $p(\theta|y_\mathrm{obs})$.

$p(\theta|y_\mathrm{obs},I)$

$\propto \int d\phi\int dy_\mathrm{mis} \, p(y_\mathrm{obs},y_\mathrm{mis}|\theta) \, P(I|\phi,y_\mathrm{obs},y_\mathrm{mis}) \, p(\theta,\phi)$

$\propto p(\theta|y_\mathrm{obs})$ ?

This requires two things to be true:

  1. Selection doesn't depend on (potentially) unobserved data: $P(I|\phi,y_\mathrm{obs},y_\mathrm{mis}) = P(I|\phi,y_\mathrm{obs})$
  2. Priors for the interesting ($\theta$) and selection-related ($\phi$) parameters are independent: $p(\theta,\phi)=p(\theta)p(\phi)$
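A quick numerical check of condition 1 (my own toy illustration): when selection is independent of the data values, naive inference is fine; when selection depends on the values themselves, it is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=100_000)  # complete data; true mean 0

# Selection independent of y ("missing at random"): ignorable.
keep_random = rng.uniform(size=y.size) < 0.2
# Selection depending on the value itself: not ignorable.
keep_bright = y > 0.5

print("mean, random selection:", y[keep_random].mean())  # ~0.0
print("mean, value selection: ", y[keep_bright].mean())  # biased high (~1.14)
```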

Example: galaxy cluster scaling relations

Imagine we're fitting the relation between mass ($x$) and luminosity ($y$) for clusters. (Fictional, error-free data for illustration.)

To start with, we'll assume a complete data set. Then the generative model needs

  • true values of mass ($x$) for the $N$ clusters
  • true values of luminosity $y$ for each cluster, determined by a mean relation and scatter, parametrized by $\theta$
  • sampling distributions for $x$ and $y$, which we'll assume are independent
  • prior distributions for $x$ (with some parameters $\Omega$) and $\theta$

$p(\hat{x},\hat{y},x,y,\theta,\Omega)= p(\theta,\Omega)\prod_{k=1}^N p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)$
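Here is a sketch of that density as code, with illustrative distributional choices of my own (lognormal mass function, power-law mean relation with lognormal scatter, lognormal measurement errors; the priors on $\theta$ and $\Omega$ are taken flat and left out of the sum):

```python
import numpy as np
from scipy.stats import norm

def log_joint(xhat, yhat, x, y, theta, Omega, sig_x=0.1, sig_y=0.1):
    """Complete-data log density, term by term as in the equation above.

    Assumed forms (illustrative only): log10(x) ~ N(Omega[0], Omega[1]);
    log10(y)|x ~ N(theta[0] + theta[1]*log10(x), theta[2]); Gaussian
    measurement errors of width sig_x, sig_y in the logs.
    """
    lp = norm.logpdf(np.log10(x), Omega[0], Omega[1])       # p(x_k|Omega)
    lp += norm.logpdf(np.log10(y),
                      theta[0] + theta[1] * np.log10(x),
                      theta[2])                             # p(y_k|x_k,theta)
    lp += norm.logpdf(np.log10(yhat), np.log10(y), sig_y)   # p(yhat_k|y_k)
    lp += norm.logpdf(np.log10(xhat), np.log10(x), sig_x)   # p(xhat_k|x_k)
    return np.sum(lp)
```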

Now let's imagine we have data only for clusters that exceed some threshold luminosity for detection (blue points).

The data need to be augmented by the inclusion vector, $I$, which implicitly encodes the number of detected clusters, $N_\mathrm{det}$.

The model must expand to contain $\phi$ and the total number of clusters, $N$ (since this is a truncation problem).

Before drawing the PGM, let's have a look at the new likelihood:

$P(\hat{x},\hat{y},I,N_\mathrm{det}|x,y,\theta,\Omega,\phi,N)= {N \choose N_\mathrm{det}} \,P(\mathrm{detected}~\mathrm{clusters}) \,P(\mathrm{missing}~\mathrm{clusters})$

Note that a binomial term, ${N \choose N_\mathrm{det}}$, has sneakily appeared.

The reason for this is subtle, and has to do with the statistical concept of exchangeability (a priori equivalence of data points).

Here the fully observed data are a priori exchangeable with one another, as are the partially observed data, but the full data set contains these two non-exchangeable classes.

It helps to think in terms of the generative model here. Namely, because the order of data points holds no meaning for us, the binomial term is there to reflect the number of ways we might generate completely equivalent (except for the ordering) data sets.

The term for detected clusters is what we had before, with the addition of the detection probability:

$P(\mathrm{detected}~\mathrm{clusters}) =$

$\prod_{k=1}^{N_\mathrm{det}} p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)\,P(I_k=1|\hat{y}_k,\phi)$

The term for missing clusters is similar, but must be marginalized over the unobserved $\hat{x}$ and $\hat{y}$ subject to the constraint that these clusters not be detectable:

$P(\mathrm{missing}~\mathrm{clusters}) =$

$\prod_{k=1}^{N-N_\mathrm{det}} \int d\hat{x}_k\,d\hat{y}_k\, p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)\,P(I_k=0|\hat{y}_k,\phi)$.

All terms in the product are equal once we marginalize over $x_k$ and $y_k$, so this will simplify to

$P_\mathrm{mis}^{N_\mathrm{mis}}$

with $N_\mathrm{mis}=N-N_\mathrm{det}$ and $P_\mathrm{mis}$ the a priori probability of a cluster going undetected.
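$P_\mathrm{mis}$ rarely has a closed form, but it's straightforward to estimate by Monte Carlo: draw from the generative model and count non-detections. A sketch, using the same invented distributional choices as above and a hard threshold $\phi$ on $\hat{y}$:

```python
import numpy as np

def P_mis(theta, Omega, phi, sig_y=0.1, n=100_000, seed=0):
    """Monte Carlo estimate of the a priori non-detection probability.

    Draws (x, y, yhat) from the assumed generative model and applies a
    hard detection threshold phi to the measured luminosity yhat.
    """
    rng = np.random.default_rng(seed)
    logx = rng.normal(Omega[0], Omega[1], size=n)            # x_k ~ p(x|Omega)
    logy = rng.normal(theta[0] + theta[1] * logx, theta[2])  # y_k ~ p(y|x,theta)
    logyhat = rng.normal(logy, sig_y)                        # yhat_k ~ p(yhat|y)
    return np.mean(logyhat < np.log10(phi))                  # P(I_k = 0)
```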

Rather than going on to manipulate this further, just note that the additions to the data/model boil down to

  1. A $P(I_k|\ldots)$ term within the product over clusters
  2. Additional terms depending on $N$, $N_\mathrm{det}$, $\phi$, and other global parameters (assembled in the sketch below).
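Putting the pieces together, a skeleton of the truncated-data log likelihood might look like this (my sketch; it reuses the $P_\mathrm{mis}$ estimate above and treats the total $N$ as a free parameter):

```python
import numpy as np
from scipy.special import gammaln

def log_like_truncated(N, N_det, loglike_det, p_mis):
    """Skeleton of the truncated-data log likelihood.

    N: total number of clusters (an unknown parameter!); N_det: number
    detected; loglike_det: summed log of the detected-cluster terms,
    including their detection probabilities; p_mis: a priori non-detection
    probability, e.g. from the Monte Carlo sketch above.
    """
    N_mis = N - N_det
    if N_mis < 0:
        return -np.inf  # can't have fewer clusters than we detected
    # log of the binomial coefficient (N choose N_det), via log-Gamma
    log_binom = gammaln(N + 1) - gammaln(N_det + 1) - gammaln(N_mis + 1)
    return log_binom + loglike_det + N_mis * np.log(p_mis)
```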

Hence the PGM:

For comparison, the PGM for the complete data set:

Now the big question: is selection ignorable? Do we need all this formalism to do inference on $\Omega$ and/or $\theta$?

Are the priors for $\theta$ and $\phi$ independent?

Yes, at least as drawn in the PGM. And this is usually the assumption.

Is selection independent of (potentially) unobserved data ($\hat{x}_k$ and $\hat{y}_k$)?

Hell, no. The detection probability explicitly depends on $\hat{y}_k$.

Exercise: data missing at random

Let's say there somehow isn't a threshold for detection in the above problem. Ignoring large-scale correlations (pretty accurate for clusters), the a priori probability of detecting a cluster is simply $f_\mathrm{sky}$, the fraction of the sky surveyed.

Is selection ignorable in this case? This is not a trick question, but justify the answer in terms of the discussion above.

Exercise: other truncation mechanisms

Consider the following variants of the galaxy cluster example:

  1. Selection is on the observed mass ($\hat{x}$)
  2. Selection is on $\hat{y}\rightarrow\hat{y}_2$ as before, and for detected clusters we have an additional measured observable $y_1$ whose scaling with $x$ is interesting

In each case, sketch the PGM and decide whether selection effects are ignorable for inference about

  1. The distribution of $x$ (parametrized by $\Omega$)
  2. The scaling relation parameters $\theta$ (for $y_1$ and $y_2$ or $y_1$ alone in case 2)

If not, can you identify special cases where selection becomes ignorable?

Parting words

  • We haven't worked one of these problems fully, but typically (when we assume independently occurring sources) our likelihood only becomes a little more complicated due to selection. We just need to be able to evaluate the selection probability and predict the number of selected sources from the model.

  • The need to model a hidden population places additional demands on our data, so the size/quality of data set required to get a data-dominated (rather than prior-dominated) answer can be non-intuitive. Be careful.