Goals:
See how discoveries can be assessed in both the Bayesian and Frequentist frameworks
Understand the dangers of fishing expeditions
Understand unconscious experimenter bias, and how blinding can mitigate it
See how to quantify dataset consistency, when bringing multiple observations together
“The first principle [of science] is that you must not fool yourself — and you are the easiest person to fool.” - Richard Feynman
James Berger, "The Bayesian Approach to Discovery" (PHYSTAT 2011)
Kyle Cranmer, "Practical Statistics for the LHC"
Gross & Vitells (2010), "Trial factors for the look elsewhere effect in high energy physics"
MacCoun & Perlmutter (2015), "Hide Results to Seek the Truth"
Klein & Roodman (2005), "Blind Analysis in Nuclear and Particle Physics"
Detection is therefore a model comparison or hypothesis testing problem:
$H_0$: the "Null Hypothesis," that there is no source present
$H_1$: the "Alternative Hypothesis," that there is a source present, e.g. with flux $f$ and position $(x,y)$
"Power" is defined to be the probability that the null hypothesis test is rejected when the alternative hypothesis is true.
The distribution of the test statistic $T$ (e.g. the likelihood ratio $-2\ln[L(H_0)/L(H_1)]$) under the null hypothesis can be calibrated in two ways:
1) Simulation, in the "Toy Monte Carlo" approach: for a range of model parameters, generate large numbers of mock datasets, fit them, and compute $T$ for each one. Pros: reliable. Cons: CPU-intensive, and dependent on the assumed parameter ranges.
2) Approximation: for large datasets, $T \sim \chi^2(\Delta\nu)$, where $\Delta\nu$ is the difference in the number of degrees of freedom between the two hypotheses (3, in our example). This is Wilks' Theorem.
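As a quick sanity check of the approximation, here is a minimal toy Monte Carlo sketch (with an assumed Gaussian dataset and a single extra free parameter, so $\Delta\nu = 1$ rather than the 3 of our example): the tail probability of $T$ estimated from the toys should match the $\chi^2(1)$ prediction.
In [ ]:
# Toy Monte Carlo check of Wilks' theorem (a minimal sketch, 1 free parameter):
import numpy as np
import scipy.stats

np.random.seed(42)
ntrials, ndata, mu0, sigma = 100000, 20, 0.0, 1.0

# Mock datasets generated under the null hypothesis (no signal, mean = mu0):
y = np.random.normal(mu0, sigma, size=(ntrials, ndata))

# For a Gaussian likelihood with the mean as the single free parameter of H1,
# T = -2 ln[L(H0)/L(H1)] reduces to ndata * (ybar - mu0)^2 / sigma^2:
ybar = y.mean(axis=1)
T = ndata * (ybar - mu0)**2 / sigma**2

# Compare the toy Monte Carlo tail probability with the chi^2(1) prediction:
threshold = 9.0  # i.e. "3 sigma" for one degree of freedom
print("Toy MC: P(T > {:.0f}) = {:.5f}".format(threshold, np.mean(T > threshold)))
print("Wilks:  P(T > {:.0f}) = {:.5f}".format(threshold, scipy.stats.chi2.sf(threshold, df=1)))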
In case 2), we get a huge computational speed-up, but there's a catch.
If we carry out $m$ independent "trials," and are aiming to detect at significance level $\alpha$, we would expect to get a positive result by chance in a fraction $\alpha' = 1 - (1-\alpha)^m \le m \alpha$ of cases.
Even if the trials are not independent, this last inequality still holds: the "Bonferroni Correction" involves comparing $p$-values to a threshold $\alpha / m$, if you want to test (and report) at an overall significance level $\alpha$.
This factor of $m$ is sometimes referred to as the "trials factor," and the issue described here is known in the statistics literature as the "multiple comparisons" problem, and in (astro)particle physics as the "look elsewhere effect."
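To see the size of the effect, here is a minimal sketch (assuming a per-trial significance level of roughly "3 sigma") of how the chance of at least one false positive grows with the number of independent trials.
In [ ]:
# Chance of at least one false positive among m independent tests, each
# carried out at significance level alpha:
alpha = 0.0027   # roughly "3 sigma," two-sided (an assumed illustrative value)
for m in [1, 10, 100, 1000]:
    alpha_prime = 1.0 - (1.0 - alpha)**m
    # The Bonferroni bound m*alpha always holds, though it becomes trivial once it exceeds 1:
    print("m = {:4d}:  alpha' = {:.3f}   (bound m*alpha = {:.3f})".format(m, alpha_prime, m * alpha))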
Detection is similar to classification, in that we seek to assign labels to objects ("detected" or "undetected")
Both involve a hypothesis test, and they share some terminology:
False positives arise when the null hypothesis is incorrectly rejected: a "Type I Error"
False negatives arise when the null hypothesis is incorrectly retained: a "Type II Error"
The Bonferroni Correction serves to reduce Type I errors (at the risk of increasing the Type II error rate)
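The trade-off can be made quantitative. The sketch below assumes a Gaussian test statistic and an illustrative source strength of 4 "sigma": raising the detection threshold suppresses Type I errors but inflates Type II errors, i.e. reduces the power.
In [ ]:
# Type I vs Type II error rates for a simple threshold test (assumed numbers):
import scipy.stats

# Under H0 the test statistic is a unit Gaussian; under H1 it is shifted by the
# (assumed) source strength, in "sigma":
signal_to_noise = 4.0
for threshold in [2.0, 3.0, 5.0]:
    type1 = scipy.stats.norm.sf(threshold)                      # P(claim detection | no source)
    type2 = scipy.stats.norm.cdf(threshold - signal_to_noise)   # P(miss | source present)
    print("threshold = {:.0f} sigma:  Type I = {:.1e},  Type II = {:.2f},  power = {:.2f}".format(
        threshold, type1, type2, 1.0 - type2))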
Part of the problem in the cartoon was that more statistical tests were carried out after the first one was reported.
Such "post-hoc" ("after the fact") "data dredging" inflates the number of hypotheses, and care has to be taken.
Post-hoc analysis is extremely common in astronomy, since our science is so often driven by new observations
Models suggested by an initial analysis of the data will naturally fit better.
The primary risk comes from allowing information from the same dataset to enter the analysis twice: once to suggest the model, and once to assess it.
The best way to avoid this problem is to test the new models on new data
This is the logic behind cross-validation
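Here is a minimal cross-validation sketch (all numbers are illustrative assumptions): candidate models are fit on one subset of a mock dataset and scored on the held-out subset, so that an over-flexible model does not get credit for fitting the noise.
In [ ]:
# Cross-validation sketch: fit on a training subset, score on a held-out subset.
import numpy as np

np.random.seed(13)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 + 0.5 * x + np.random.normal(0.0, 0.1, size=x.size)   # truth: a straight line

# Simple even/odd split into training and validation subsets:
train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)

# Fit increasingly flexible polynomial models on the training set, then score
# each one by its residuals on the held-out test set:
for degree in [1, 3, 7]:
    coeffs = np.polyfit(x[train], y[train], degree)
    resid = y[test] - np.polyval(coeffs, x[test])
    print("degree {}: RMS residual on held-out data = {:.3f}".format(degree, resid.std()))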
Recall that the Evidence is maximized by choosing a delta function prior centered on the maximum likelihood parameters
The fact that Bayesian analysis necessarily involves writing down the prior PDF is helpful: if everyone can see that it's garbage going in, it should be clear that it will be garbage coming out.
A particularly insidious post-hoc analysis problem is "unconscious experimenter bias:"
After completing their analysis, a researcher compares their conclusion with those of others
If there is disagreement, the researcher is more likely to try and "fix the problem" before publishing
Net result: there is a natural tendency towards "concordance" in the literature
A practical solution to the problem of unconscious experimenter bias is to blind the analysis
Before "opening the box" for the final inference run, one might use a method like:
Hidden signal box: the subset of data believed most likely to contain the signal is removed
Hidden answer: the numerical values of the parameters being measured are hidden, e.g. H0LiCOW (parameter means offset to zero), DES 3 yr (linear offsets added to correlation functions)
Adding/removing data: so that the team don't know whether they have detected a signal or not ("salting")
Training on a "prescaled" subset of the data, as in cross-validation
Mitigating unconscious experimenter bias involves doing some qualitatively different things:
The results may not be pleasing, but blinding can increase confidence in the robustness of results
Check out the talks given at the 2017 KIPAC Workshop, "Blind Analysis in High-Stakes Survey Science: When, Why, and How?": http://kipac.github.io/Blinding/
Blinding helps ensure that analysis of a model is not stopped too soon
Systematic errors may still be present, and may not be revealed until multiple independent datasets are compared and found to be statistically inconsistent, or in tension, with each other.
Let's look at a couple of ways to quantify the consistency of two datasets.
The most common quantification of tension between two datasets is the distance between the central values of their likelihoods (or posterior PDFs), in units of "sigma"
"sigma" is usually taken to be the sum of the two posterior widths, in quadrature
An alternative quantification, that works in any number of dimensions, is to use the Bayesian evidence to quantify the overlap between likelihoods.
Consider the following two models:
$H_1$: both datasets were generated from the same global model, with shared parameters $\theta$
The evidence for $H_1$ is $P(\{d_A,d_B\}|H_1) = \int P(d_A|\theta,H_1)\,P(d_B|\theta,H_1)\,P(\theta|H_1)\,d\theta$
This would be computed during a joint fit.
$H_2$: Each dataset $d_A$ and $d_B$ was generated from its own local model with parameters $\theta_A$ and $\theta_B$.
The evidence for $H_2$ is $P(\{d_A,d_B\}|H_2)$
$= \int P(d_A|\theta_A,H_2)P(d_B|\theta_B,H_2)P(\theta_A|H_2)P(\theta_B|H_2)\,d\theta_A d\theta_B$
$= P(d_A|H_2)P(d_B|H_2)$
i.e. In $H_2$ the evidence is just the product of the evidences computed during the two separate fits.
The Bayes factor is therefore
$\frac{P(\{d_A,d_B\}|H_2)}{P(\{d_A,d_B\}|H_1)} = \frac{P(d_A|H_2)P(d_B|H_2)}{P(\{d_A,d_B\}|H_1)}$
If the inferences of $\theta_A$ and $\theta_B$ under $H_2$ are very different, we would say that the datasets were inconsistent and the Bayes factor would be high: can you see why?
The reduced goodness of fit in the joint model $H_1$ is balanced against the inclusion of a complete set of additional parameters in $H_2$.
An unknown systematic error, causing model $H_1$ to fail to fit both datasets simultaneously, would lead to an unknown parameter vector offset - these additional parameters are present in $H_2$, which is better able to fit the data as a result.
This Bayes factor approach has been used by the DES and H0LiCOW teams to quantify the consistency of their cosmological probes / lens datasets.
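As an illustrative sketch (the Gaussian likelihood summaries and uniform prior below are assumptions, chosen to match the consistency demo that follows), the three evidences can be estimated by direct numerical integration over a single shared parameter $\theta$; a Bayes factor well above one favours $H_2$, i.e. suggests tension.
In [ ]:
# Bayes factor sketch for the consistency of two datasets (assumed numbers):
import numpy as np
import scipy.stats

# Each dataset's likelihood for the shared parameter theta, summarized as a Gaussian:
muA, sigmaA = 25.0, 10.0
muB, sigmaB = 75.0, 20.0

# A broad uniform prior on theta, and a grid for numerical integration:
theta = np.linspace(0.0, 100.0, 10001)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta) / (theta[-1] - theta[0])

LA = scipy.stats.norm.pdf(theta, loc=muA, scale=sigmaA)   # likelihood of d_A
LB = scipy.stats.norm.pdf(theta, loc=muB, scale=sigmaB)   # likelihood of d_B

# H1: one global theta describes both datasets (the joint fit):
Z1 = np.sum(LA * LB * prior) * dtheta
# H2: each dataset has its own theta (the product of the two separate fits):
Z2 = (np.sum(LA * prior) * dtheta) * (np.sum(LB * prior) * dtheta)

print("Bayes factor P(d|H2)/P(d|H1) = {:.1f}".format(Z2 / Z1))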
In [ ]:
# Dataset consistency demo - the distance between two posterior means, in "sigma"
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
# Two model posterior PDFs for the same parameter given different datasets:
mu1,sigma1 = 25,10
mu2,sigma2 = 75,20
# Add the sigmas in quadrature:
sigma = np.sqrt(sigma1**2 + sigma2**2)
# Compute the distance in units of this sigma:
n = (mu2 - mu1)/sigma
# Plot the distributions and annotate:
x = np.linspace(0,100,100)
p1 = scipy.stats.norm.pdf(x,loc=mu1,scale=sigma1)
plt.plot(x,p1)
yy = 0.6*max(p1)
plt.annotate('', xy=(mu1,yy), xytext=(mu1+sigma1,yy), arrowprops=dict(arrowstyle='<->',lw=2))
plt.text(mu1+0.5*sigma1,yy*0.9,r'$\sigma_1$',horizontalalignment='center',fontsize=14);
p2 = scipy.stats.norm.pdf(x,loc=mu2,scale=sigma2)
plt.plot(x,p2)
yy = 0.6*max(p2)
plt.annotate('', xy=(mu2,yy), xytext=(mu2+sigma2,yy), arrowprops=dict(arrowstyle='<->',lw=2))
plt.text(mu2+0.5*sigma2,yy*0.8,r'$\sigma_2$',horizontalalignment='center',fontsize=14);
plt.xlim(0,100)
plt.xlabel("x")
plt.ylabel("Posterior probability P(x|data)")
yy = 0.75*max(max(p1),max(p2))
plt.annotate('', xy=(mu1,yy), xytext=(mu2,yy), arrowprops=dict(arrowstyle='<->',lw=2))
plt.text(0.5*(mu1+mu2),yy*1.1,r'"{:.1f} $\sigma$"'.format(n),horizontalalignment='center',fontsize=18);
plt.text(0.52*(mu1+mu2),yy*0.63,r'$\sigma=\sqrt{\sigma_1^2+\sigma_2^2}$',horizontalalignment='center',fontsize=14);
plt.savefig("consistency.png")