Scientific Context: Indirect (i.e., astronomical) dark matter searches with Fermi data
Analysis Context: Fermi-LAT data analysis chain
Example 1: Fitting the flux normalization of a target source (the dwarf galaxy Draco)
Example 2: Extracting upper limits on the thermally averaged dark matter annihilation cross section
One sentence summary: if dark matter consists of massive particles with weak-force-scale interactions (such as in many supersymmetric scenarios), then signals may be detectable in astronomical gamma-ray data.
The most important equation gives the expected gamma-ray flux from annihilations in a dark matter halo:
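In a standard form (written here for self-conjugate dark matter; the exact grouping of the prefactors varies slightly between references) it reads:

$$ \frac{d\Phi_{\gamma}}{dE}(E, \hat{n}) = \frac{\langle \sigma_{\chi} v \rangle}{2} \times \sum_{f} B_{f} \frac{dN_{f}}{dE} \times \frac{1}{4 \pi m_{\chi}^{2}} \int_{\rm l.o.s.} \rho_{\chi}^{2}(l, \hat{n}) \, dl $$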
The left hand side of the equation is the observed flux as a function of energy and direction in the sky.
$ \langle \sigma_{\chi} v \rangle $ is the "thermally averaged cross section", i.e., the relative velocity times the cross section, averaged over the velocity distribution, which gives an interaction rate per unit density ( $cm^{3} s^{-1}$ ).
The summation is over annihilation final states (e.g., $b \bar{b}$ or $\tau^+\tau^-$), labeled by $f$; $dN_{f}/dE$ is the average spectrum of gamma-rays for final state $f$, and $B_{f}$ is the branching ratio to that final state.
Together the first two terms on the right hand side of the equation give all of our knowledge of the particle physics of the dark matter annihilation. In practice we don't know the cross section, the velocity distribution, or the branching fractions. For a given final state we can use numerical codes to estimate the resulting gamma-ray spectrum $dN_{f}/dE$.
Here are some examples of the gamma-ray spectra for different final states:
The final term in the flux equation contains all of the astrophysical information, and is usually called the "J-factor" (actually the common definition of the J-factor doesn't include the factors of $1/4\pi$ and $1/m_{\chi}^2$, and is just the integral along the line-of-sight of the square of the density of the dark matter).
Note that the line-of-sight integration intrinsically handles the distance-squared projection effects, i.e., when you compare J-factors you don't have to account for distance; that has already been done.
Here are some examples of the radial density distribution of dark matter for different models of the Milky Way:
and here are the resulting J-factors integrated out from the Galactic center:
The J-factors given here for the center of the Milky Way are much higher than for other targets. Typical J-factors for a dwarf galaxy such as Draco, which we will consider later, are more like $10^{19}$ GeV$^2$ cm$^{-5}$.
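To get a feel for where such numbers come from, here is a rough sketch of the calculation for a distant, nearly point-like target, where the line-of-sight integral reduces to $J \approx \frac{1}{d^2} \int \rho^2 \, dV$. The NFW profile parameters below are purely illustrative placeholders (not fitted values for Draco or any other dwarf), but they land at roughly the $10^{19}$ GeV$^2$ cm$^{-5}$ scale quoted above.

In [ ]:

import numpy as np
from scipy.integrate import quad

# Purely illustrative NFW profile parameters -- NOT fitted values for any real dwarf
rho_s = 1.0e8   # scale density [Msun / kpc^3]
r_s = 1.0       # scale radius [kpc]
r_max = 2.0     # radius out to which we integrate [kpc]
dist = 76.      # distance to the target [kpc]

def rho_nfw(r):
    """NFW dark matter density [Msun / kpc^3]."""
    x = r / r_s
    return rho_s / (x * (1. + x)**2)

# For a distant source, J ~ (1/d^2) * integral of rho^2 over the halo volume
integrand = lambda r: 4. * np.pi * r**2 * rho_nfw(r)**2
j_msun2_kpc5 = quad(integrand, 0., r_max)[0] / dist**2

# Approximate unit conversion: 1 Msun^2 kpc^-5 ~ 4.4e6 GeV^2 cm^-5
j_gev2_cm5 = j_msun2_kpc5 * 4.4e6
print("log10(J / GeV^2 cm^-5) = %.1f" % np.log10(j_gev2_cm5))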
The gamma-ray sky is dominated by three types of emission:
Here is an image of the gamma-ray sky corresponding to 6 years of data, shown in Galactic coordinates in a Hammer-Aitoff projection:
Dark matter signals might be a small contribution to the gamma-ray sky. We can significantly improve our sensitivity to such signals if we know where to look.
Dwarf spheroidal galaxies are known to be very dark-matter dominated.
So, we can set up a search to look for excess gamma-ray emission in the direction of the known dwarf galaxies. Today we are going to do that for one particular dwarf, Draco. (Not Segue I, which is shown in this slide: )
Draco is located at (l,b) = (86.4, 34.7), is ~76 kpc from us, has a J-factor of $\log_{10} (J/GeV^{2} cm^{-5}) = 18.8$, and has an angular extent of less than half a degree (it would appear almost as a point source to the Fermi-LAT).
We will be performing binned likelihood fitting on counts data. The function that we will be minimizing is the negative log of the Poisson likelihood to observe a particular number of counts in each pixel / bin $n_i$ given that our model predicts $m_i$ counts.
The negative log likelihood is:
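$$ -\ln \mathcal{L} = \sum_{i} \left[ m_{i} - n_{i} \ln m_{i} + \ln (n_{i}!) \right] $$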
(We can neglect that last term because it does not depend on the model.)
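As a concrete sketch of what this objective looks like in code (the function name and signature below are illustrative, not part of the Fermi Science Tools):

In [ ]:

import numpy as np
from scipy.special import gammaln

def binned_poisson_nll(model_counts, observed_counts, keep_constant_term=False):
    """Negative log of the Poisson likelihood, summed over pixels / energy bins."""
    model_counts = np.asarray(model_counts, dtype=float)
    observed_counts = np.asarray(observed_counts, dtype=float)
    nll = np.sum(model_counts - observed_counts * np.log(model_counts))
    if keep_constant_term:
        # the ln(n_i!) term; it does not depend on the model, so it is usually dropped
        nll += np.sum(gammaln(observed_counts + 1.))
    return nll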
In gamma-ray astronomy we tend to use frequentist formulations. There is a very simple reason for this: we don't really know a lot about the gamma-ray sky, so we don't have very much prior knowledge of what to expect in our measurements.
While it is true that Bayesian methods can be used with un-informative priors, such priors are often designed (at least in part) to give the same answers as frequentist methods. They can be useful if you really want to put things in a Bayesian context (i.e., if you need a posterior distribution to sample). However, by themselves they don't really tell you about the science, so in Fermi we generally don't bother with them. (We get the same results, and reduce our carbon footprint to boot.)
In the frequentist formulation the main lenses through which we interpret results are Wilks' and Chernoff's theorems. They are both statements about the distribution of the "Test Statistic" in the case that the null hypothesis is true.
Frequentist test statistics depend on the data, $N$, and our estimates of the model parameters, $\hat{\alpha}$. The distribution of the test statistic $TS$ is a transformation of the sampling distribution: ${\rm Pr}(TS(N,\hat{\alpha})|H_0)$. In frequentism there is no "probability distribution for the model parameters," only estimates of the model parameters.
In our case the test statistic is the likelihood ratio:
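Writing $\mathcal{L}_{\rm max}$ for the likelihood maximized over all free parameters and $\mathcal{L}_{\rm null}$ for the likelihood maximized under the null hypothesis, the form conventionally used in Fermi-LAT analyses is:

$$ TS = -2 \ln \frac{\mathcal{L}_{\rm null}}{\mathcal{L}_{\rm max}} = 2 \left( \ln \mathcal{L}_{\rm max} - \ln \mathcal{L}_{\rm null} \right) $$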
Wilks’ theorem: as the sample size approaches infinity, the test statistic for a nested model will be asymptotically $\chi^2$-distributed with degrees of freedom equal to the difference in dimensionality of the test hypothesis and the null-hypothesis.
Chernoff’s theorem: as above, but if the null hypothesis is at a parameter boundary then 1/2 of the trials will have TS = 0. This is particularly important in searches, where, for example, the parameter in question might be the cross section for a process that might not exist, so we will be exploring the region near zero, and negative cross sections are unphysical and may cause problems with the fitting.
Caveat: there are some subtleties in what is considered the null hypothesis in standard usage (more later)
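Before looking at the asymptotic curves, here is a quick toy Monte Carlo illustrating the Chernoff behavior: a single Poisson bin with a known expected background (the value below is made up) and a signal normalization bounded at zero. Roughly half of the null-hypothesis trials give $TS = 0$, and the tail of the rest follows half of a 1 DOF $\chi^2$ distribution.

In [ ]:

import numpy as np
import scipy.stats as stats

np.random.seed(42)
background = 100.                                   # known expected background counts (made up)
n_obs = np.random.poisson(background, size=20000)   # pseudo-experiments under the null hypothesis

# Maximum-likelihood estimate of the signal, bounded at zero
s_hat = np.clip(n_obs - background, 0., None)

# TS = 2 * [ ln L(s_hat) - ln L(0) ] for a single Poisson bin
ts = 2. * (n_obs * np.log((background + s_hat) / background) - s_hat)

print("fraction of trials with TS = 0 : %.3f" % np.mean(ts == 0.))
print("fraction of trials with TS > 4 : %.4f vs 0.5 * chi2(1 DOF) tail : %.4f" %
      (np.mean(ts > 4.), 0.5 * stats.chi2.sf(4., 1)))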
In [1]:
%matplotlib inline
import numpy as np
import pylab as plt
import scipy.stats as stats
x = np.linspace(0,25,2500)
# Divide the pdf by 100 to convert from a density to the expected fraction of trials
# per 0.01-wide bin in TS (matching the y-axis label below)
chi2_1dof = stats.chi2.pdf(x,1) / 100.
fig, ax = plt.subplots(1, 1)
ax.set_xlabel('Test Statistic')
ax.set_ylabel('Fraction of Trials / 0.01 TS')
ax.set_xlim(0.,25.)
curve = ax.semilogy(x, chi2_1dof,'r-', lw=1, label='chi2 pdf')
What we typically talk about are "$p$-values": the probability to observe data where the null-hypothesis is at least as disfavored as what we actually observed if the null hypothesis were true. To go from the $\chi^2$ distribution to the $p$-value we take the complement of the cumulative distribution (sometimes called the "survival function").
In [2]:
pvalue_1dof = 1. - stats.chi2.cdf(x,1)
fig2, ax2 = plt.subplots(1, 1)
ax2.set_xlabel('Test Statistic')
ax2.set_ylabel('p-value')
ax2.set_xlim(0.,25.)
curve = ax2.semilogy(x, pvalue_1dof,'r-', lw=1, label='p-value')
By way of comparison, here is what the $p$-value looks like for 1,2,3, and 4 degrees of freedom.
In [3]:
pvalue_2dof = 1. - stats.chi2.cdf(x,2)
pvalue_3dof = 1. - stats.chi2.cdf(x,3)
pvalue_4dof = 1. - stats.chi2.cdf(x,4)
fig3, ax3 = plt.subplots(1, 1)
ax3.set_xlabel('Test Statistic')
ax3.set_ylabel('p-value')
ax3.set_xlim(0.,25.)
ax3.semilogy(x, pvalue_1dof,'r-', lw=1, label='1 DOF')
ax3.semilogy(x, pvalue_2dof,'b-', lw=1, label='2 DOF')
ax3.semilogy(x, pvalue_3dof,'g-', lw=1, label='3 DOF')
ax3.semilogy(x, pvalue_4dof,'y-', lw=1, label='4 DOF')
leg = ax3.legend()
We often choose to report signal significance in terms of the equivalent number of standard deviations ($\sigma$) away from the mean of a normal (Gaussian) distribution you would have to go to obtain a given $p$-value. Here are the (two-sided) $p$-values corresponding to 1, 2, 3, 4, and 5 sigma.
In [4]:
sigma_p = []
for i in range(1,6):
    # Two-sided p-value for an i-sigma deviation of a standard normal variable
    print("%i sigma = %.2e p-value" % (i, 2*stats.norm.sf(i)))
    sigma_p.append(2*stats.norm.sf(i))
Here is a plot showing how those p-values map onto values of the TS.
In [5]:
fig6, ax6 = plt.subplots(1, 1)
ax6.set_xlabel('Test Statistic')
ax6.set_ylabel('p-value')
ax6.set_xlim(0.,25.)
ax6.semilogy(x, pvalue_1dof,'r-', lw=1, label='1 DOF')
ax6.semilogy(x, pvalue_2dof,'b-', lw=1, label='2 DOF')
ax6.semilogy(x, pvalue_3dof,'g-', lw=1, label='3 DOF')
ax6.semilogy(x, pvalue_4dof,'y-', lw=1, label='4 DOF')
ax6.hlines(sigma_p,0,25.0,linestyles=u'dotted')
leg = ax6.legend()
You will notice that for 1 DOF the significance expressed in standard deviations is simply $\sigma = \sqrt{TS}$, i.e., for 1 DOF, $\sqrt{TS}$ follows the positive half of a normal distribution.
In [6]:
for i in range(1,6):
    # The two-sided normal p-value at i sigma equals the chi2 (1 DOF) tail probability at TS = i**2
    print("%i sigma = %.2e == %.2e" % (i, 2*stats.norm.sf(i), stats.chi2.sf(i*i,1)))
Chernoff's theorem applies here because, under the null hypothesis, half of the trials are expected to give negative fluctuations where we look for the signal. Since we have bounded the parameter at zero, in those trials the likelihood is maximized at zero, which is the same as the null hypothesis, so they give $TS = 0$.
Here is a comparison of the $p$-value for cases where Wilks' and Chernoff's theorems apply, for a single degree of freedom.
In [7]:
pvalue_1dof_cher = 0.5*(1. - stats.chi2.cdf(x,1))
fig4, ax4 = plt.subplots(1, 1)
ax4.set_xlabel('Test Statistic')
ax4.set_ylabel('p-value')
ax4.set_xlim(0.,25.)
ax4.semilogy(x, pvalue_1dof,'r-', lw=1, label='Unbounded')
ax4.semilogy(x, pvalue_1dof_cher,'r--', lw=1, label='Bounded')
leg = ax4.legend()
A very common mistake that people make is to confuse degrees of freedom with trials factors. As a concrete example, consider a search for a new point source, where we allow the position of the source to vary in our fitting procedure. In that case we would typically have (at least) 3 additional degrees of freedom w.r.t. the null hypothesis (the magnitude of the source and 2 spatial coordinates). We can consider two limiting cases:
If we only allow the position of the source to move by a small amount compared to the image resolution, then we expect the distribution of $TS$ to be $\chi^2$-distributed with 3 DOF.
If, on the other hand, we were to fit for a signal at three different locations separated by much more than the image resolution, then we would have three independent trials with a single degree of freedom each. In that case the global $p$-value corresponding to the smallest per-trial $p$-value $p$ would be: $p_{\rm glob} = 1 - (1-p)^3$.
Here is a comparison of those two situations, for bounded parameters.
In [9]:
# Apply the same factor-of-1/2 boundary correction to the 3 DOF case (a simplification; see the caveat above)
pvalue_3dof_cher = 0.5*(1. - stats.chi2.cdf(x,3))
pvalue_1dof_cher_3trial = 1. - (1 - 0.5*(1. - stats.chi2.cdf(x,1)))**3
fig5, ax5 = plt.subplots(1, 1)
ax5.set_xlabel('Test Statistic')
ax5.set_ylabel('p-value')
ax5.set_xlim(0.,25.)
ax5.semilogy(x, pvalue_3dof_cher,'r--', lw=1, label='1 Trial, 3 DOF')
ax5.semilogy(x, pvalue_1dof_cher_3trial,'r-.', lw=1, label='3 Trials, 1 DOF')
leg = ax5.legend()
In [ ]: