This notebook is adapted from a lesson from the 2019 KIPAC/StatisticalMethods course, (c) 2019 Adam Mantz and Phil Marshall, licensed under the GPLv2.
Most of this material is culled from this note, as well as the methods section of this thing and the appendix of this other thing. The first of these (in the introduction) cites some other papers elsewhere in astrophysics where the same formalism was independently arrived at.
Chapter 7 of Gelman et al.'s Bayesian Data Analysis (3rd ed.) is also a nice resource, although it's written for a non-astro audience.
What does "missing information" mean?
In physics, we're used to the idea that we never have complete information about a system.
Trivial example: non-zero measurement errors mean that we're missing some information, namely the true value of whatever we've measured. We deal with this by incorporating that fact into our model, via the sampling distribution.
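For example, for a measurement $\hat{x}$ of a true value $x$ with Gaussian noise of known width $\sigma_{\hat{x}}$, the sampling distribution would simply be $p(\hat{x}|x) = \mathrm{Normal}(\hat{x}|x,\sigma_{\hat{x}}^2)$.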
Hierarchical models tend to be full of such unobservable parameters, including things like group membership.
Key messages
In astronomical surveys, you might hear the (historical) terms Malmquist bias and Eddington bias.
Malmquist bias: flux-limited surveys have an effective luminosity limit for detection that rises with distance (redshift). Thus, the sample of measured luminosities is not representative of the whole population.
Eddington bias refers to the effect of noise or scatter on a luminosity function, the number of sources in some population as a function of $L$. Since fainter sources are more numerous, scatter moves more sources up the steeply falling luminosity function than down it, biasing the inferred function.
These are two particular manifestations of selection effects, which we can treat in a more general framework.
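As a quick numerical illustration of Eddington bias, here's a minimal simulation; the power-law slope, scatter, and threshold are all arbitrary choices for demonstration purposes:

```python
import numpy as np

rng = np.random.default_rng(42)

# True luminosities from a steep power law, dN/dL ~ L^(-2.5) for L >= 1
# (numpy's pareto is the Lomax distribution, so add 1 for the classical form).
L_true = 1.0 + rng.pareto(1.5, size=100000)

# Lognormal measurement scatter of 0.2 dex.
L_obs = L_true * 10**(0.2 * rng.standard_normal(L_true.size))

# Sources *observed* to be bright are, on average, intrinsically fainter:
# many more faint sources scatter up than bright sources scatter down.
bright = L_obs > 10.0
print(np.mean(np.log10(L_obs[bright] / L_true[bright])))  # > 0: overestimated
```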
First, some terminology you might see in relevant statistics literature: censoring and truncation.
(These are not one-to-one with Malmquist and Eddington bias.)
Censoring: a given data point (e.g. astronomical source) is known to exist, but a relevant measurement for it is not available.
This covers both completely absent measurements and upper limits/non-detections, although in the latter case the non-detection itself still provides us with information through the sampling distribution.
Truncation: not only are measurements missing, but the total number of sources that should be in the data set is unknown.
In other words, the lack of a measurement means that we don't even know about a particular source's existence.
LSST will include a galaxy cluster survey, finding clusters as overdensities of red galaxies (richness). The underlying cosmological signal is their number as a function of mass. Log-richness ($y$) and log-mass ($x$) are presumed to have a linear relationship with some scatter.
Let's start by considering a complete generative model (no selection effects yet). It needs

* a distribution for the total number of clusters, $p(N|\Omega)$;
* the population distribution of true log-masses, $p(x|\Omega)$;
* the scaling relation, including its intrinsic scatter, $p(y|x,\theta)$;
* sampling distributions for the measurements, $p(\hat{x}|x)$ and $p(\hat{y}|y)$.
Let's draw a PGM.
Here I'm anticipating that $N$ will be some Poisson variable based on $\Omega$. In the absence of selection effects, i.e. with a complete data set, we know what it is.
With some qualitatively reasonable parameter choices, here's a possible mock data set:
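In code, drawing such a mock data set might look like the sketch below. The specific distributions (Gaussian population, linear relation with Gaussian intrinsic scatter, Gaussian measurement errors) and all parameter values are illustrative assumptions, not necessarily those used for the figure:

```python
import numpy as np

rng = np.random.default_rng(1)

# p(N|Omega): number of clusters, Poisson with an expectation set by Omega
N = rng.poisson(200.0)

# p(x|Omega): true log-masses from the population distribution
mu, tau = 0.0, 1.0
x = rng.normal(mu, tau, size=N)

# p(y|x,theta): linear scaling relation with intrinsic scatter
alpha, beta, sigma_int = 1.0, 0.8, 0.2
y = rng.normal(alpha + beta * x, sigma_int)

# p(xhat|x), p(yhat|y): Gaussian measurement errors
sigma_xhat, sigma_yhat = 0.1, 0.1
xhat = rng.normal(x, sigma_xhat)
yhat = rng.normal(y, sigma_yhat)
```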
This is an inference you already know how to solve, given the modeling details. The likelihood is something like
$p(\hat{x},\hat{y},N|\Omega,\theta) = p(N|\Omega) \prod_{k=1}^{N} p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)$
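Concretely, with the Gaussian choices sketched above, the log of this likelihood (apart from the $p(N|\Omega)$ term) could be evaluated as follows, with the true $x_k$ and $y_k$ treated as latent parameters to be sampled along with $\Omega$ and $\theta$; again, a sketch under assumed forms rather than a unique implementation:

```python
import numpy as np
from scipy import stats

def log_like_complete(mu, tau, alpha, beta, sigma_int,
                      x, y, xhat, yhat, sigma_xhat, sigma_yhat):
    """Complete-data log-likelihood, apart from the p(N|Omega) term."""
    return (stats.norm.logpdf(x, mu, tau).sum()                        # p(x|Omega)
            + stats.norm.logpdf(y, alpha + beta * x, sigma_int).sum()  # p(y|x,theta)
            + stats.norm.logpdf(yhat, y, sigma_yhat).sum()             # p(yhat|y)
            + stats.norm.logpdf(xhat, x, sigma_xhat).sum())            # p(xhat|x)
```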
Needless to say, it is not safe to "simply fit a line" to the detected data, even if we only care about the $y$-$x$ relation and not the mass function.
How to modify the generative model to account for truncation?
Definitions: Let

* $I_k$ be an indicator variable: 1 if cluster $k$ is detected (included in the data set), 0 if not;
* $N_\mathrm{det}$ be the number of detected clusters;
* $\phi$ stand for any additional parameters that the detection probability depends on.
The differences from the complete-data model:

* the detection indicators, $I_k$, and their sampling distribution, $P(I_k|\hat{y}_k,\phi)$, are new;
* the total number of clusters, $N$, is now unknown; only $N_\mathrm{det}$ is observed;
* the measurements $\hat{x}_k$ and $\hat{y}_k$ of non-detected clusters are missing.
The new likelihood can be written
$P(\hat{x},\hat{y},I,N_\mathrm{det}|x,y,\theta,\Omega,\phi,N)=$
$ \quad {N \choose N_\mathrm{det}} \,P(\mathrm{detected}~\mathrm{clusters}) \,P(\mathrm{missing}~\mathrm{clusters})$
Let's break this down.
$P(\mathrm{detected}~\mathrm{clusters})$ is more or less what we've worked with before, with the additional sampling distribution for $I_k=1$.
$P(\mathrm{detected}~\mathrm{clusters}) =$
$ \quad \prod_{k=1}^{N_{det}} p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)\,p(I_k=1|\hat{y}_k,\phi)$
$P(\mathrm{missing}~\mathrm{clusters})$ is almost the same, but since these $\hat{x}_k$ and $\hat{y}_k$ are actually unobserved, they need to be marginalized over:
$P(\mathrm{missing}~\mathrm{clusters}) = \prod_{k=1}^{N-N_{det}}$
$ \quad \int d\hat{x}_k\,d\hat{y}_k\, p(x_k|\Omega)\,p(y_k|x_k,\theta)\,p(\hat{y}_k|y_k)\,p(\hat{x}_k|x_k)\,P(I_k=0|\hat{y}_k,\phi)$
With no observed values to distinguish them, these factors are all equal, so the product simplifies:

$P(\mathrm{missing}~\mathrm{clusters}) = P_{mis}^{N-N_{det}}$
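Although $P_{mis}$ is a multidimensional integral, it's easy to estimate by simple Monte Carlo: generate clusters from the model and average the non-detection probability. Below is a sketch, assuming the Gaussian model from earlier and a hard detection threshold, $I_k=1$ iff $\hat{y}_k > y_{th}$ (so $P(I_k=0|\hat{y}_k,\phi)$ is a step function):

```python
import numpy as np

def P_mis(mu, tau, alpha, beta, sigma_int, sigma_yhat, y_th,
          n=100000, seed=0):
    """Monte Carlo estimate of the per-cluster non-detection probability."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, tau, size=n)              # p(x|Omega)
    y = rng.normal(alpha + beta * x, sigma_int)  # p(y|x,theta)
    yhat = rng.normal(y, sigma_yhat)             # p(yhat|y)
    return np.mean(yhat <= y_th)                 # average of P(I=0|yhat,phi)
```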
What about this binomial term, ${N \choose N_\mathrm{det}} = \frac{N!}{N_{det}!(N-N_{det})!}$?
It's sneaky, and has to do with the statistical concept of exchangeability (a priori equivalence of data points).
We normally don't think about the fact that the order of data points is meaningless to us when mapping generative models to inferences (and a fixed $N_{det}!$ can be ignored anyway).
But now, with $N$ unknown, we have to worry about the fact that our complete model contains two non-exchangeable classes (detected and non-detected); the binomial coefficient counts the number of ways to assign $N$ clusters to these two classes.
If $p(N|\Omega)$ is Poisson, it can be marginalized out to produce
$P(\hat{x},\hat{y},I,N_\mathrm{det}|x,y,\theta,\Omega,\phi)= e^{-\langle N_{det} \rangle} \, \langle N \rangle^{N_{det}} \,P(\mathrm{detected}~\mathrm{clusters})$
with $\langle N \rangle$ the expected total number of clusters (a function of $\Omega$), and $\langle N_{det} \rangle = \langle N \rangle (1 - P_{mis})$ the expected number of detections,
i.e., a relatively simple change from where we started.
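To see where this comes from, substitute the Poisson distribution, $p(N|\Omega)=e^{-\langle N \rangle}\langle N \rangle^N/N!$, into the previous likelihood and sum over the unknown $N$ (with $m=N-N_\mathrm{det}$):

$\sum_{N=N_\mathrm{det}}^{\infty} \frac{e^{-\langle N \rangle}\langle N \rangle^{N}}{N!} {N \choose N_\mathrm{det}} \,P(\mathrm{detected}~\mathrm{clusters}) \,P_{mis}^{N-N_\mathrm{det}}$

$\quad = \frac{\langle N \rangle^{N_\mathrm{det}}}{N_\mathrm{det}!} \,P(\mathrm{detected}~\mathrm{clusters}) \, e^{-\langle N \rangle} \sum_{m=0}^{\infty} \frac{\left(\langle N \rangle P_{mis}\right)^m}{m!}$

$\quad = \frac{\langle N \rangle^{N_\mathrm{det}}}{N_\mathrm{det}!} \,P(\mathrm{detected}~\mathrm{clusters}) \, e^{-\langle N \rangle (1-P_{mis})}$

The exponent is just $-\langle N_\mathrm{det} \rangle$, and the leftover $1/N_\mathrm{det}!$ is a constant that doesn't affect inference.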
There's a more intuitive, but less generally applicable, way to get to the same place. As an exercise, try writing down the likelihood of the detected data alone, treating the detected clusters as an independently occurring (Poisson) population whose rate is the complete-model rate multiplied by the probability of detection.
Up to a constant factor of $N_{det}!$, you'll arrive at the expression above, but explicitly marginalized over $x$ and $y$.
For a general problem, selection effects are ignorable if both of these are true:

1. Selection doesn't depend on (potentially) unobserved data.
2. Selection doesn't depend on the parameters of interest (more precisely, the priors for the parameters of interest are independent of any parameters governing selection).
The second is often a non-starter in survey science, since whatever we're interested in will often correlate with our detection signal.
Really, these conditions boil down to whether the posterior for the parameters of interest depends on selection.
A less strict version of the same question is: does ignoring selection effects bias me at a level I care about?
Analyzing mock data is the best way of answering this.
We haven't worked one of these problems fully, but typically (when we assume independently occurring sources) our likelihood only becomes a little more complicated due to selection. We just need to be able to evaluate the selection probability and predict the number of selected sources from the model.
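For the running example, both pieces are in hand: the Monte Carlo `P_mis` from above supplies the selection probability and the predicted $\langle N_\mathrm{det} \rangle$. A sketch of the marginalized log-likelihood under the same illustrative assumptions follows; note that for a hard cut, $p(I_k=1|\hat{y}_k,\phi)=1$ for every detected cluster, so no extra per-cluster factor appears:

```python
import numpy as np
from scipy import stats

def log_like_selected(mu, tau, alpha, beta, sigma_int, N_expected,
                      x, y, xhat, yhat, sigma_xhat, sigma_yhat, y_th):
    """Log-likelihood of the detected sample, with N marginalized out
    (Poisson case), dropping the constant N_det! factor."""
    pmis = P_mis(mu, tau, alpha, beta, sigma_int, sigma_yhat, y_th)
    Ndet_expected = N_expected * (1.0 - pmis)  # <N_det>
    # e^{-<N_det>} <N>^{N_det} times the detected-cluster terms:
    return (-Ndet_expected + len(xhat) * np.log(N_expected)
            + stats.norm.logpdf(x, mu, tau).sum()
            + stats.norm.logpdf(y, alpha + beta * x, sigma_int).sum()
            + stats.norm.logpdf(yhat, y, sigma_yhat).sum()
            + stats.norm.logpdf(xhat, x, sigma_xhat).sum())
```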
The need to model a hidden population places additional demands on our data, so the size/quality of data set required to get a data-dominated (rather than prior-dominated) answer can be non-intuitive. Be careful.
Consider the following variants of the galaxy cluster example:
In each case, sketch the PGM and decide whether selection effects are ignorable for inference about (a) the scaling relation parameters, $\theta$, and (b) the population parameters, $\Omega$.
If not, can you identify special cases where selection becomes ignorable?