Avoiding Fooling Ourselves: Detection, Fishing, Blinding, and Dataset Consistency

Goals:

  • See how discoveries can be assessed in both the Bayesian and Frequentist frameworks

  • Understand the dangers of fishing expeditions

  • Understand unconscious experimenter bias, and how blinding can mitigate it

  • See how to quantify dataset consistency, when bringing multiple observations together

"The first principle [of science] is that you must not fool yourself — and you are the easiest person to fool." - Richard Feynman

Detection

  • Source detection is different in character from parameter estimation. Initially, we are less interested in the properties of a new source than we are in its existence.
  • Detection is therefore a model comparison or hypothesis testing problem:

    • $H_0$: the "Null Hypothesis," that there is no source present

    • $H_1$: the "Alternative Hypothesis," that there is a source present, e.g. with flux $f$ and position $(x,y)$

Bayesian Detection

  • We calculate and compare the Bayesian Evidences for each model, ${\rm Pr}(d\,|\,H_0)$ and ${\rm Pr}(d\,|\,H_1)$
  • Their ratio gives the relative probability of getting the data under the alternative and null hypotheses (and is equal to the relative probabilities of the hypotheses being true, up to an unknown model prior ratio).
  • This is an odds ratio (like "ten to one") that guides any bet we might want to make (staking, for example, some of our professional reputation)
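
As a minimal sketch of this calculation, the toy example below computes the two evidences for a one-dimensional problem in which the source position is assumed known, so that the only free parameter of $H_1$ is the flux $f$. The noise level, true flux, and prior range are all made-up numbers.

In [ ]:
# Toy Bayesian detection: evidence ratio for "source" vs "no source", in 1D (flux only).
# All numbers below are illustrative.
import numpy as np
from scipy import stats, integrate

np.random.seed(42)

# Simulated aperture flux measurements: true flux 2.0, unit Gaussian noise:
sigma, f_true, N = 1.0, 2.0, 10
d = f_true + sigma * np.random.randn(N)

def likelihood(f):
    # Product of the Gaussian sampling distributions of the N measurements
    return np.prod(stats.norm.pdf(d, loc=f, scale=sigma))

# Evidence for H0 (no source): the likelihood evaluated at f = 0
Z0 = likelihood(0.0)

# Evidence for H1: marginalize the likelihood over a flat prior, f ~ U(0, f_max)
f_max = 10.0
Z1, _ = integrate.quad(lambda f: likelihood(f) / f_max, 0.0, f_max,
                       points=[np.mean(d)], epsabs=0)

print("Evidence ratio Pr(d|H1)/Pr(d|H0) = {:.3g}".format(Z1 / Z0))
# Large values correspond to long odds in favor of there being a source.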

Prior Dependence

  • Calculating the Bayesian Evidence ratio involves marginalizing over the alternative hypothesis' model parameters, given the prior PDFs that we assigned when writing down $H_1$
  • Weakening the prior on the source position and flux (by, for example, expanding their ranges) makes any given point in parameter space less probable, the detected model correspondingly more contrived, and the data less likely
  • So, weaker priors decrease the evidence for $H_1$, making the detection less significant (but only linearly, remember).
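
To see this prior dependence explicitly, we can re-run the same one-dimensional toy calculation with flat flux priors of increasing width (again, all numbers are illustrative):

In [ ]:
# Prior dependence of the H1 evidence, using the same 1D (flux-only) toy model as above.
import numpy as np
from scipy import stats, integrate

np.random.seed(42)
sigma, f_true, N = 1.0, 2.0, 10
d = f_true + sigma * np.random.randn(N)

def likelihood(f):
    return np.prod(stats.norm.pdf(d, loc=f, scale=sigma))

for f_max in [10.0, 20.0, 40.0]:
    # Flat prior f ~ U(0, f_max), with prior density 1/f_max:
    Z1, _ = integrate.quad(lambda f: likelihood(f) / f_max, 0.0, f_max,
                           points=[np.mean(d)], epsabs=0)
    print("f_max = {:4.0f}:  Pr(d|H1) = {:.3e}".format(f_max, Z1))

# Once the likelihood peak is fully enclosed by the prior, each doubling of the
# prior range halves the evidence for H1: the penalty is linear in the prior
# width, not exponential.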

Frequentist Detection

  • Instead of working towards the relative probabilities of two hypotheses, the Frequentist approach is to attempt to reject the null hypothesis, by showing that the observed data would be too improbable if the null hypothesis were true
  • It turns out that the most powerful statistic to use in this hypothesis test is the likelihood ratio

"Power" is defined to be the probability that the null hypothesis test is rejected when the alternative hypothesis is true.

Likelihood Ratios

  • The procedure is to maximize the likelihood for the parameters of each hypothesis, and form the following test statistic:

$\;\;\;\;\;\;\;T_d = 2 \log \frac{L(\hat{f},\,\hat{x},\,\hat{y}\,;\,H_1)}{L(H_0)}$

  • We then compare $T_d$ with the distribution of the test statistic $T$ over an ensemble of hypothetical datasets generated under the null hypothesis (no source), and compute the $p$-value $p = P(T > T_d\,|\,H_0)$
  • If $p < \alpha$ we then say that "we can reject the null hypothesis at the $100\,\alpha$ percent confidence level."

The Distribution of Test Statistics

1) Simulation, in the "Toy Monte Carlo" approach: for a range of model parameters, generate large numbers of mock datasets, fit them, and compute $T$ for each one. Pros: reliable. Cons: CPU-intensive, depends on "parameter ranges"...

2) Approximation: for large dataset size, $T \sim \chi^2(\Delta\nu)$, where $\Delta\nu$ is the difference in the number of degrees of freedom between the two hypotheses (3, in our example). This is Wilks' Theorem

In case 2), we get a huge computational speed-up, but there's a catch.
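
Before turning to that catch, here is a quick toy Monte Carlo check of Wilks' theorem. It uses a deliberately simplified one-parameter case (a single free mean rather than flux and position, so $\Delta\nu = 1$), for which the maximum likelihood estimate and the likelihood ratio have closed forms; the sample size, noise level, and number of mocks are arbitrary.

In [ ]:
# Toy Monte Carlo check of Wilks' theorem, for a single-parameter hypothesis test:
# under H0 the data have mean zero; under H1 the mean mu is free. Illustrative numbers.
import numpy as np
from scipy import stats

np.random.seed(1)
n, sigma, n_mock = 20, 1.0, 100000

# Generate mock datasets from the null hypothesis (no signal):
mocks = sigma * np.random.randn(n_mock, n)

# For Gaussian data with known sigma, the MLE of mu is the sample mean, and the
# log likelihood ratio has a closed form:
#   T = 2 log [ L(mu_hat; H1) / L(H0) ] = n * mean(d)^2 / sigma^2
mu_hat = mocks.mean(axis=1)
T = n * mu_hat**2 / sigma**2

# Compare the simulated distribution of T with chi2(1), as Wilks' theorem predicts:
for threshold in [1.0, 2.71, 3.84, 6.63]:
    print("P(T > {:4.2f}):  simulated = {:.4f},  chi2(1) = {:.4f}".format(
          threshold, np.mean(T > threshold), stats.chi2(1).sf(threshold)))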

Fishing Expeditions

The test statistic is evaluated at the best-fit source parameters, $(\hat{f},\,\hat{x},\,\hat{y})$, and so reports the likelihood ratio between just two discrete hypotheses: "no source anywhere" and "a source at the best-fit position"

We therefore need to account for the fact that we "went fishing" for the source, i.e. that we looked elsewhere (at many possible positions) before settling on this one

The Bonferroni Correction

If we carry out $m$ independent "trials", each with false positive probability $\alpha$, we would expect to get at least one positive result by chance in a fraction $\alpha' = 1 - (1-\alpha)^m \leq m \alpha$ of cases

Even if the trials are not independent, this last inequality still holds: the "Bonferroni Correction" therefore involves comparing $p$-values to a threshold $\alpha / m$, if you want to test (and report) at an overall false positive rate $\alpha$.

This $1/m$ is sometimes referred to as the "trials factor," and the issue described here is known in the statistics literature as the "multiple comparisons" problem, and in (astro)particle physics as the "look elsewhere effect."
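
Here is a quick simulation of the trials factor: $m$ independent $p$-values are drawn per "experiment", all under the null hypothesis, and we count how often at least one of them crosses the detection threshold, with and without the Bonferroni correction. The values of $m$ and $\alpha$ below are arbitrary.

In [ ]:
# Simulating the trials factor and the Bonferroni correction. Illustrative numbers.
import numpy as np

np.random.seed(2)
m, alpha, n_experiments = 20, 0.05, 100000

# Under the null hypothesis, p-values are uniformly distributed on [0, 1]:
p = np.random.uniform(size=(n_experiments, m))

# Family-wise false positive rate without any correction:
fwer_raw = np.mean((p < alpha).any(axis=1))

# ... and with the Bonferroni-corrected threshold alpha / m:
fwer_bonferroni = np.mean((p < alpha / m).any(axis=1))

print("Expected 1 - (1 - alpha)^m = {:.3f}".format(1 - (1 - alpha)**m))
print("Uncorrected rate           = {:.3f}".format(fwer_raw))
print("Bonferroni-corrected rate  = {:.3f} (close to alpha = {})".format(
      fwer_bonferroni, alpha))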

Question:

How many discrete hypothesis tests are performed:

  • In the cartoon?

  • When detecting sources in an astronomical image?

Another Question:

  • Does the simulation approach (1) suffer from the look elsewhere effect as well?

  • Does the Bayesian approach to detection suffer from the look elsewhere effect?

Be prepared to give reasons...

Endnote: Classification and Detection

Detection is similar to classification, in that we seek to assign labels to objects ("detected" or "undetected")

Both involve a hypothesis test, and they share some terminology:

  • False positives arise when the null hypothesis is incorrectly rejected: a "Type I Error"

  • False negatives arise when the null hypothesis is incorrectly retained: a "Type II Error"

The Bonferroni Correction serves to reduce Type I errors (at the risk of increasing the Type II error rate)

Post-Hoc Analysis

Part of the problem in the cartoon was that more statistical tests were carried out after the first one was reported.

Such "post-hoc" ("after the fact") "data dredging" inflates the number of hypotheses, and care has to be taken.

Post-hoc analysis is extremely common in astronomy, since our science is so often driven by new observations

Avoiding Type I Errors

Models suggested by an initial analysis of a dataset will naturally fit that dataset better.

Allowing information from the same dataset to enter the analysis twice leads to two primary risks:

  • Generating false positives (in detection/classification) or
  • Under-estimating parameter uncertainties (in regression/measurement).

The best way to avoid this problem is to test the new models on new data

This is the logic behind cross-validation
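
As a minimal sketch of this idea, the snippet below fits polynomials of increasing degree to one half of a made-up dataset and scores them on the held-out half: the over-flexible (post-hoc) models win on the training set but not on the validation set. The data and the family of models are purely illustrative.

In [ ]:
# Cross-validation sketch: compare training error with held-out error as model
# flexibility increases. Data and models here are made up for illustration.
import numpy as np

np.random.seed(3)

# Noisy data drawn from a quadratic "truth":
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.3 * np.random.randn(x.size)

# Randomly split the data into a training half and a held-out validation half:
idx = np.random.permutation(x.size)
train, valid = idx[:25], idx[25:]

for degree in [1, 2, 4, 8]:
    coeffs = np.polyfit(x[train], y[train], degree)
    rms_train = np.sqrt(np.mean((np.polyval(coeffs, x[train]) - y[train])**2))
    rms_valid = np.sqrt(np.mean((np.polyval(coeffs, x[valid]) - y[valid])**2))
    print("degree {:2d}:  training RMS = {:.3f},  validation RMS = {:.3f}".format(
          degree, rms_train, rms_valid))

# The training RMS always improves with extra flexibility; the held-out RMS
# typically does not, flagging the over-fitted (post-hoc) models.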

Bayesian Analyses are Not Immune

Recall that the Evidence is maximized by choosing a delta function prior centered on the maximum likelihood parameters

The fact that a Bayesian analysis necessarily involves writing down the prior PDF is helpful here: if everyone can see that garbage is going in, it should be clear that garbage will come out.

Experimenter Bias

A particularly insidious post-hoc analysis problem is "unconscious experimenter bias:"

  • After completing their analysis, a researcher compares their conclusion with those of others

  • If there is disagreement, the researcher is more likely to try and "fix the problem" before publishing

  • Net result: there is a natural tendency towards "concordance" in the literature

Blinding

A practical solution to the problem of unconscious experimenter bias is to blind the analysis

  • The analysis team is prevented from completing their inference and viewing the results until they deem their model complete (or at least, adequate)
  • The final inference is done once, and the results tagged as "blinded"
  • Post-hoc analysis is then enabled, but with results tagged as "unblinded" (since bias could be creeping in again)

Approaches to Blinding

Before "opening the box" for the final inference run, one might use a method like:

  • Hidden signal box: the subset of data believed most likely to contain the signal is removed

  • Hidden answer: the numerical values of the parameters being measured are hidden, e.g. H0LiCOW (parameter means offset to zero), DES 3 yr (linear offsets added to correlation functions)

  • Adding/removing data: so that the team don't know whether they have detected a signal or not ("salting")

  • Training on a "pre-scaling" subset: as in cross-validation

Mitigating against unconscious experimenter bias involves doing some qualitatively different things:

  • Organizing analyses in teams, and agreeing to abide by rules
  • Temporarily censoring or adjusting datasets while inferences are developed

The results may not be pleasing, but blinding can increase confidence in the robustness of results
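
As a toy illustration of the "hidden answer" approach listed above, the sketch below adds a secret random offset to a made-up headline measurement; the analysis team works only with the blinded value until the model is frozen. The measurement value and the offset scale are assumptions made purely for illustration.

In [ ]:
# Toy "hidden answer" blinding: a secret random offset is added to the headline
# parameter, and only removed once the analysis is frozen. Numbers are made up.
import numpy as np

rng = np.random.default_rng(20170315)

# Drawn once, stored securely, and kept hidden from the analysis team:
secret_offset = rng.normal(loc=0.0, scale=5.0)

def blind(value):
    # What the team sees while they develop and debug their inference
    return value + secret_offset

def unblind(blinded_value):
    # To be run exactly once, when the model is deemed complete
    return blinded_value - secret_offset

measurement = 70.3   # some headline parameter value, in arbitrary units
print("Blinded value:  ", blind(measurement))
print("Unblinded value:", unblind(blind(measurement)))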

Examples from Astrophysics & Cosmology

Check out the talks given at the 2017 KIPAC Workshop, "Blind Analysis in High-Stakes Survey Science: When, Why, and How?": http://kipac.github.io/Blinding/

Combining Datasets

Blinding helps ensure that analysis of a model is not stopped too soon

Systematic errors may still be present, and may not be revealed until multiple independent datasets are compared and found to be statistically inconsistent, or in tension, with each other.

Let's look at a couple of ways to quantify the consistency of two datasets.

The most common quantification of tension between two datasets is the distance between the central values of their likelihoods (or posterior PDFs), in units of "sigma"

"sigma" is usually taken to be the sum of the two posterior widths, in quadrature

An alternative quantification, that works in any number of dimensions, is to use the Bayesian evidence to quantify the overlap between likelihoods.

Consider the following two models:

  • $H_1$: Both datasets $d_A$ and $d_B$ were generated from the same global model with parameters $\theta$.
    • The evidence for $H_1$ is $P(\{d_A,d_B\}|H_1) = \int P(d_A|\theta,H_1)P(d_B|\theta,H_1)P(\theta|H_1)\,d\theta$

This would be computed during a joint fit.

  • $H_2$: Each dataset $d_A$ and $d_B$ was generated from its own local model with parameters $\theta_A$ and $\theta_B$.

    • The evidence for $H_2$ is $P(\{d_A,d_B\}|H_2)$

      $= \int P(d_A|\theta_A,H_2)P(d_B|\theta_B,H_2)P(\theta_A|H_2)P(\theta_B|H_2)\,d\theta_A d\theta_B$

      $= P(d_A|H_2)P(d_B|H_2)$

i.e. under $H_2$ the evidence is just the product of the evidences computed during the two separate fits.

The Bayes factor is therefore

$\frac{P(\{d_A,d_B\}|H_2)}{P(\{d_A,d_B\}|H_1)} = \frac{P(d_A|H_2)P(d_B|H_2)}{P(\{d_A,d_B\}|H_1)}$

If the inferences of $\theta_A$ and $\theta_B$ under $H_2$ are very different, we would say that the datasets were inconsistent and the Bayes factor would be high: can you see why?

The reduced goodness of fit in the joint model $H_1$ is balanced against the inclusion of a complete set of additional parameters in $H_2$.

An unknown systematic error, causing model $H_1$ to fail to fit both datasets simultaneously, would lead to an unknown parameter vector offset - these additional parameters are present in $H_2$, which is better able to fit the data as a result.
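
For concreteness, here is a one-parameter toy version of this test: each dataset is a set of Gaussian measurements of the same quantity $\theta$, a systematic offset is injected into the second dataset, and the evidences are computed by direct numerical marginalization. The noise level, prior range, dataset sizes, and offsets are all made-up numbers.

In [ ]:
# Toy evidence-ratio consistency test: two datasets measure the same quantity theta;
# compare the joint fit (H1) with separate fits (H2). All numbers are illustrative.
import numpy as np
from scipy import stats, integrate

sigma = 1.0                                   # per-point measurement uncertainty
prior = stats.uniform(loc=-20.0, scale=40.0)  # flat prior, theta ~ U(-20, 20)

def log_like(theta, d):
    return np.sum(stats.norm.logpdf(d, loc=theta, scale=sigma))

def log_evidence(datasets):
    # Numerically marginalize the joint likelihood of the listed datasets over theta
    logL = lambda theta: sum(log_like(theta, d) for d in datasets)
    theta0 = np.mean(np.concatenate(datasets))  # roughly where the likelihood peaks
    peak = logL(theta0)                         # factored out for numerical stability
    integrand = lambda theta: np.exp(logL(theta) - peak) * prior.pdf(theta)
    Z, _ = integrate.quad(integrand, -20.0, 20.0, points=[theta0])
    return peak + np.log(Z)

rng = np.random.default_rng(4)
for offset in [0.0, 1.0, 3.0]:
    d_A = 0.0 + sigma * rng.standard_normal(10)
    d_B = offset + sigma * rng.standard_normal(10)   # dataset B carries a systematic offset
    logR = log_evidence([d_A]) + log_evidence([d_B]) - log_evidence([d_A, d_B])
    print("offset = {:.1f}:  R = Pr(d|H2)/Pr(d|H1) = {:.3g}".format(offset, np.exp(logR)))

# As the systematic offset between the datasets grows, H1 can no longer fit both
# at once, and the Bayes factor swings towards H2, flagging the tension.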

This Bayes factor approach has been used by the DES and H0LiCOW teams to quantify the consistency of their cosmological probes / lens datasets.

Summary

Model evaluation (hypothesis testing) appears in a variety of situations.

  • Source detection and classification

  • Dataset combination

  • Systematic error modeling

Blinding helps mitigate against unconscious experimenter bias when inferring parameters and evaluating models.

Appendix: Dataset Consistency Demo

Below is the code used to make the dataset consistency illustration, in case it's useful.


In [ ]:
# Dataset consistency demo: the distance between two posterior means, in "sigma"
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline

# Two posterior PDFs for the same parameter, given two different datasets:
mu1, sigma1 = 25, 10
mu2, sigma2 = 75, 20

# Add the sigmas in quadrature:
sigma = np.sqrt(sigma1**2 + sigma2**2)
# Compute the distance between the means in units of this sigma:
n = (mu2 - mu1) / sigma

# Plot the two distributions and annotate their widths:
x = np.linspace(0, 100, 100)

p1 = scipy.stats.norm.pdf(x, loc=mu1, scale=sigma1)
plt.plot(x, p1)
yy = 0.6 * max(p1)
plt.annotate('', xy=(mu1, yy), xytext=(mu1 + sigma1, yy), arrowprops=dict(arrowstyle='<->', lw=2))
plt.text(mu1 + 0.5 * sigma1, yy * 0.9, r'$\sigma_1$', horizontalalignment='center', fontsize=14)

p2 = scipy.stats.norm.pdf(x, loc=mu2, scale=sigma2)
plt.plot(x, p2)
yy = 0.6 * max(p2)
plt.annotate('', xy=(mu2, yy), xytext=(mu2 + sigma2, yy), arrowprops=dict(arrowstyle='<->', lw=2))
plt.text(mu2 + 0.5 * sigma2, yy * 0.8, r'$\sigma_2$', horizontalalignment='center', fontsize=14)

# Annotate the separation between the two means, in units of sigma:
plt.xlim(0, 100)
plt.xlabel("x")
plt.ylabel("Posterior probability P(x|data)")
yy = 0.75 * max(max(p1), max(p2))
plt.annotate('', xy=(mu1, yy), xytext=(mu2, yy), arrowprops=dict(arrowstyle='<->', lw=2))
plt.text(0.5 * (mu1 + mu2), yy * 1.1, r'"{:.1f} $\sigma$"'.format(n), horizontalalignment='center', fontsize=18)

plt.text(0.52 * (mu1 + mu2), yy * 0.63, r'$\sigma=\sqrt{\sigma_1^2+\sigma_2^2}$', horizontalalignment='center', fontsize=14)

plt.savefig("consistency.png")
