Application in Gamma-ray Astronomy:

Searching for Dark Matter Signals in Fermi-LAT data

Lecture outline

  1. Scientific Context: Indirect (i.e., astronomical) dark matter searches with Fermi data

    1. Fermi Large Area Telescope (LAT) and gamma-ray sky
    2. Dark matter signals from dwarf spheroidal (dSph) galaxies
    3. Frequentist (as opposed to Bayesian) methodology
  2. Analysis Context: Fermi-LAT data analysis chain

    1. Event data and binned data
    2. Instrument response, livetime and exposure
    3. Flux model templates
    4. Likelihood fitting
  3. Example 1: Fitting the flux normalization of a target source (the dwarf galaxy Draco)

  4. Example 2: Extracting upper limits on the thermally averaged dark matter annihilation cross section

Scientific Context: Indirect Detection Dark Matter Searches

One sentence summary: if dark matter consists of massive particles with weak-force-scale interactions (such as in many super-symmetric scenarios) then signals may be detectable in astronomical gamma-ray data.

The most important equation gives the expected gamma-ray flux from annihilations in a dark matter halo:
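
Schematically, writing it out from the term-by-term description below (the overall normalization convention varies; e.g., an extra factor of 1/2 appears for Majorana particles):

$$ \frac{d\Phi_{\gamma}}{dE}(E, \psi) \;=\; \langle \sigma_{\chi} v \rangle \; \sum_{f} B_{f} \frac{dN_{f}}{dE} \; \times \; \frac{1}{4 \pi m_{\chi}^{2}} \int_{\rm l.o.s.} \rho_{\chi}^{2}\bigl(\ell(\psi)\bigr) \, d\ell $$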

The left hand side of the equation is the observed flux as a function of energy and direction in the sky.

$ \langle \sigma_{\chi} v \rangle $ is the "thermally averaged cross section", i.e., the cross section times the relative velocity, which has units of $\mathrm{cm}^{3}\,\mathrm{s}^{-1}$ (an interaction rate per unit number density).

The summation is over annihilation final states (e.g., $b \bar{b}$ or $\tau^+\tau^-$), labeled by $f$; $dN_{f}/dE$ is the average gamma-ray spectrum for final state $f$, and $B_{f}$ is the branching ratio to that final state.

Together the first two terms on the right-hand side of the equation contain all of our knowledge of the particle physics of the dark matter annihilation. A priori we don't know the cross section, the velocity distribution, or the branching fractions. For a given final state we can use numerical codes to estimate the resulting gamma-ray spectrum $dN_{f}/dE$.

Here are some examples of the gamma-ray spectra for different final states:

The final term in the flux equation contains all of the astrophysical information and is usually called the "J-factor" (strictly, the common definition of the J-factor does not include the factors of $1/4\pi$ and $1/m_{\chi}^2$; it is just the integral along the line of sight of the square of the dark matter density).

Note that the line-of-sight integration intrinsically handles the distance-squared projection effect, i.e., when you compare J-factors you don't have to account for distance; that has already been done.

Here are some examples of the radial density distribution of dark matter for different models of the Milky Way:

and here are the resulting J-factors integrated out from the Galactic center:

The J-factors given here for the center of the Milky Way are much higher than for other targets. Typical J-factors for a dwarf galaxy such as Draco, which we will consider later, are more like $10^{19}$ GeV$^2$ cm$^{-5}$.
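
To make the line-of-sight integral concrete, here is a minimal numerical sketch of a dwarf-like J-factor in the point-source (small angular size) approximation. The NFW parameters rho_s and r_s below are made up for illustration; they are not a fit to Draco.


In [ ]:
import numpy as np
from scipy.integrate import quad

KPC_TO_CM = 3.086e21   # cm per kpc

def rho_nfw(r_kpc, rho_s=1.5, r_s=1.0):
    """Toy NFW dark matter density [GeV cm^-3] at radius r [kpc]."""
    x = r_kpc / r_s
    return rho_s / (x * (1.0 + x)**2)

def j_factor_point_source(dist_kpc, r_max_kpc=1.0):
    """J ~ (1/D^2) * volume integral of rho^2, in GeV^2 cm^-5.

    Valid when the halo is distant enough to look point-like; the 1/D^2 is
    exactly the distance-squared dilution that the line-of-sight integral encodes.
    """
    vol = quad(lambda r: 4.0 * np.pi * r**2 * rho_nfw(r)**2, 1.0e-6, r_max_kpc)[0]
    return vol / dist_kpc**2 * KPC_TO_CM

# a Draco-like distance of 76 kpc, with the toy profile above
print("log10(J / GeV^2 cm^-5) ~ %.1f" % np.log10(j_factor_point_source(76.0)))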

The Fermi Large Area Telescope and the Gamma-ray Sky

Here is a schematic view of the Fermi Large Area Telescope. Basically it is a particle physics detector in space.

Dark matter signals from dwarf spheroidal (dSph) galaxies

The gamma-ray sky is dominated by three types of emission:

  1. Galactic diffuse emission. These are gamma rays produced by the interactions of high-energy cosmic rays with the interstellar gas, dust, and radiation fields.
  2. Point sources. The most recent LAT source catalog (3FGL) identifies over 3000 gamma-ray point sources; the largest class of extragalactic sources is Active Galactic Nuclei (AGN), and the largest class of Galactic sources is pulsars.
  3. Isotropic emission. This is largely (~85%) attributable to unresolved emission from AGN.

Here is an image of the gamma-ray sky corresponding to 6 years of data, shown in Galactic coordinates in a Hammer-Aitoff projection:

Dark matter signals might be a small contribution to the gamma-ray sky. We can significantly improve our sensitivity to such signals if we know where to look.

Dwarf spheroidal galaxies are known to be very dark-matter dominated.

So, we can set up a search to look for excess gamma-ray emission in the direction of the known dwarf galaxies. Today we are going to do that for one particular dwarf, Draco (not Segue I, which is the dwarf shown in this slide).

Draco is located at $(l, b) = (86.4, 34.7)$, is ~76 kpc away, has a J-factor of $\log_{10}(J / \mathrm{GeV}^{2}\,\mathrm{cm}^{-5}) = 18.8$, and has an angular extent of less than half a degree (it appears almost as a point source to the Fermi-LAT).

Poisson Likelihood

We will be performing binned likelihood fitting on counts data. The function that we will be minimizing is the negative log of the Poisson likelihood to observe a particular number of counts in each pixel / bin $n_i$ given that our model predicts $m_i$ counts.

The negative log likelihood is:
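
$$ -\ln \mathcal{L}(m \,|\, n) \;=\; \sum_{i} \bigl[\, m_{i} \;-\; n_{i} \ln m_{i} \;+\; \ln ( n_{i}! ) \,\bigr] $$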

(We can neglect the last term because it does not depend on the model.)
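
As a quick numerical sketch (the counts below are made up, not LAT data), here is that function in code:


In [ ]:
import numpy as np
from scipy.special import gammaln

def poisson_nll(model_counts, observed_counts, keep_constant=False):
    """Negative log of the binned Poisson likelihood, summed over pixels/bins.

    The gammaln term is the ln(n_i!) piece; it does not depend on the model,
    so it is normally dropped when minimizing.
    """
    nll = np.sum(model_counts - observed_counts * np.log(model_counts))
    if keep_constant:
        nll += np.sum(gammaln(observed_counts + 1.0))
    return nll

# toy example: a model predicting 5.0 counts in each of 4 bins
model = np.full(4, 5.0)
observed = np.array([3.0, 7.0, 5.0, 4.0])
print("-ln L (model-dependent part) = %.3f" % poisson_nll(model, observed))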

Frequentist (as opposed to Bayesian) methodology

In gamma-ray astronomy we tend to use frequentist formulations. There is a very simple reason for this: we don't really know a lot about the gamma-ray sky, so we don't have very much prior knowledge of what to expect in our measurements.

While it is true that Bayesian methods can be used with un-informative priors, such priors are often designed, at least in part, to give the same answers as frequentist methods. They can be useful if you really want to put things in a Bayesian context (i.e., if you need a posterior distribution to sample). However, by themselves they don't really tell you about the science, so in Fermi we generally don't bother with them. (We get the same results, and reduce our carbon footprint to boot.)

In the frequentist formulation the main lenses through which we interpret results are Wilks' and Chernoff's theorems. Both are statements about the distribution of the "Test Statistic" ($TS$) in the case that the null hypothesis is true.

Frequentist test statistics depend on the data, $N$, and our estimates of the model parameters, $\hat{\alpha}$. The distribution of the test statistic $TS$ is a transformation of the sampling distribution: ${\rm Pr}(TS(N,\hat{\alpha})|H_0)$. In frequentism there is no "probability distribution for the model parameters," only estimates of the model parameters.

In our case the test statistic is the likelihood ratio:
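
$$ TS \;=\; 2 \ln \frac{\mathcal{L}(\hat{\alpha} \,|\, N)}{\mathcal{L}(H_{0} \,|\, N)} \;=\; 2 \, \bigl[ \ln \mathcal{L}(\hat{\alpha} \,|\, N) \;-\; \ln \mathcal{L}(H_{0} \,|\, N) \bigr] $$

so that a larger $TS$ means the data favor the test hypothesis over the null hypothesis.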

Wilks’ theorem: as the sample size approaches infinity, the test statistic for a nested model will be asymptotically $\chi^2$-distributed with degrees of freedom equal to the difference in dimensionality of the test hypothesis and the null-hypothesis.

Chernoff’s theorem: as above, but if the null hypothesis lies at a parameter boundary, then 1/2 of the trials will have $TS = 0$. This is particularly important in searches where, for example, the parameter in question is the cross section for a process that might not exist: we will be exploring the region near zero, and negative cross sections are unphysical and may cause problems with the fitting.

Caveat: there are some subtleties in what is considered the null hypothesis in standard usage (more later)

Examples of the $\chi^2$ distribution and p-values for Wilks' Theorem

Here are a few examples of the $\chi^2$ distribution and its application with Wilks' theorem.

Here is the $\chi^2$-distribution for a single extra degree of freedom.


In [1]:
%matplotlib inline

import numpy as np
import pylab as plt
import scipy.stats as stats
x = np.linspace(0,25,2501)              # TS values from 0 to 25 in steps of 0.01
chi2_1dof = stats.chi2.pdf(x,1) / 100.  # pdf scaled to fraction of trials per 0.01-wide TS bin
fig, ax = plt.subplots(1, 1)
ax.set_xlabel('Test Statistic')
ax.set_ylabel('Fraction of Trials / 0.01 TS')
ax.set_xlim(0.,25.)
curve = ax.semilogy(x, chi2_1dof,'r-', lw=1, label='chi2 pdf')


What we typically talk about are "$p$-values": the probability to observe data where the null-hypothesis is at least as disfavored as what we actually observed if the null hypothesis were true. To go from the $\chi^2$ distribution to the $p$-value we take the complement of the cumulative distribution (sometimes called the "survival function").


In [2]:
pvalue_1dof = 1. - stats.chi2.cdf(x,1)  # equivalently stats.chi2.sf(x,1), the survival function
fig2, ax2 = plt.subplots(1, 1)
ax2.set_xlabel('Test Statistic')
ax2.set_ylabel('p-value')
ax2.set_xlim(0.,25.)
curve = ax2.semilogy(x, pvalue_1dof,'r-', lw=1, label='p-value')


By way of comparison, here is what the $p$-value looks like for 1,2,3, and 4 degrees of freedom.


In [3]:
pvalue_2dof = 1. - stats.chi2.cdf(x,2)
pvalue_3dof = 1. - stats.chi2.cdf(x,3)
pvalue_4dof = 1. - stats.chi2.cdf(x,4)

fig3, ax3 = plt.subplots(1, 1)
ax3.set_xlabel('Test Statistic')
ax3.set_ylabel('p-value')
ax3.set_xlim(0.,25.)
ax3.semilogy(x, pvalue_1dof,'r-', lw=1, label='1 DOF')
ax3.semilogy(x, pvalue_2dof,'b-', lw=1, label='2 DOF')
ax3.semilogy(x, pvalue_3dof,'g-', lw=1, label='3 DOF')
ax3.semilogy(x, pvalue_4dof,'y-', lw=1, label='4 DOF')
leg = ax3.legend()


Converting p-values to standard deviations

We often choose to report signal significance in terms of the equivalent number of standard deviations ($\sigma$) away from the mean of a normal (Gaussian) distribution that you would have to go to obtain a given $p$-value. Here are the (two-sided) $p$-values corresponding to 1, 2, 3, 4, and 5 sigma.


In [4]:
sigma_p = []
for i in range(1,6):
    print "%i sigma = %.2e p-value"%(i,2*stats.norm.sf(i))
    sigma_p.append(2*stats.norm.sf(i))


1 sigma = 3.17e-01 p-value
2 sigma = 4.55e-02 p-value
3 sigma = 2.70e-03 p-value
4 sigma = 6.33e-05 p-value
5 sigma = 5.73e-07 p-value

Here is a plot showing how those p-values map onto values of the TS.


In [5]:
fig6, ax6 = plt.subplots(1, 1)
ax6.set_xlabel('Test Statistic')
ax6.set_ylabel('p-value')
ax6.set_xlim(0.,25.)
ax6.semilogy(x, pvalue_1dof,'r-', lw=1, label='1 DOF')
ax6.semilogy(x, pvalue_2dof,'b-', lw=1, label='2 DOF')
ax6.semilogy(x, pvalue_3dof,'g-', lw=1, label='3 DOF')
ax6.semilogy(x, pvalue_4dof,'y-', lw=1, label='4 DOF')
ax6.hlines(sigma_p,0,25.0,linestyles=u'dotted')
leg = ax6.legend()


You will notice that for 1 DOF, the significance expressed in standard deviations is simply $\sigma = \sqrt{TS}$, i.e., in terms of $\sqrt{TS}$ the $\chi^2$ distribution with 1 DOF is simply the positive half of a normal distribution.


In [6]:
for i in range(1,6):
    print "%i sigma = %.2e == %.2e"%(i,2*stats.norm.sf(i),stats.chi2.sf(i*i,1))


1 sigma = 3.17e-01 == 3.17e-01
2 sigma = 4.55e-02 == 4.55e-02
3 sigma = 2.70e-03 == 2.70e-03
4 sigma = 6.33e-05 == 6.33e-05
5 sigma = 5.73e-07 == 5.73e-07

Examples of the $\chi^2$ distribution and p-values for Chernoff's theorem

Chernoff's theorem applies when the parameter of interest is bounded. Under the null hypothesis, 1/2 of the trials are expected to give negative fluctuations where we look for the signal; since we have bounded the parameter at zero, the likelihood in those trials is maximized at zero, which is the same as the null hypothesis, so they give $TS = 0$.
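
A quick way to see this is with a toy Monte Carlo. The sketch below uses a simple bounded Gaussian measurement rather than a LAT fit: we measure $x \sim N(\mu, 1)$ with the physical constraint $\mu \geq 0$, the null hypothesis $\mu = 0$ is true, the maximum-likelihood estimate is $\hat{\mu} = \max(x, 0)$, and therefore $TS = \max(x, 0)^2$.


In [ ]:
import numpy as np
import scipy.stats as stats

np.random.seed(42)
x = np.random.normal(0., 1., 100000)   # measurements with the null hypothesis (mu = 0) true
ts = np.maximum(x, 0.)**2              # TS for the bounded fit: max(x, 0)^2

# about half the trials fluctuate downward and land on the boundary (TS = 0);
# the rest follow a chi2 distribution with 1 degree of freedom
print("fraction of trials with TS = 0: %.3f" % np.mean(ts == 0.))
print("P(TS > 4) = %.4f vs Chernoff prediction %.4f"
      % (np.mean(ts > 4.), 0.5 * stats.chi2.sf(4., 1)))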

Here is a comparison of the $p$-value for cases where Wilks' and Chernoff's theorems apply, for a single degree of freedom.


In [7]:
pvalue_1dof_cher = 0.5*(1. - stats.chi2.cdf(x,1))  # Chernoff: half of the trials pile up at TS = 0
fig4, ax4 = plt.subplots(1, 1)
ax4.set_xlabel('Test Statistic')
ax4.set_ylabel('p-value')
ax4.set_xlim(0.,25.)
ax4.semilogy(x, pvalue_1dof,'r-', lw=1, label='Unbounded')
ax4.semilogy(x, pvalue_1dof_cher,'r--', lw=1, label='Bounded')
leg = ax4.legend()


Degrees of freedom vs. trials factor

A very common mistake that people make is to confuse degrees of freedom with trials factors. As a concrete example, consider a search for a new point source, where we allow the position of the source to vary in our fitting procedure. In that case we would typically have (at least) 3 additional degrees of freedom w.r.t. the null hypothesis (the magnitude of the source and 2 spatial coordinates). We can consider two limiting cases:

If we only allow the position of the source to move a small amount compared to the image resolution, then we expect the distribution of $TS$ to be $\chi^2$-distributed with 3 DOF.

If, on the other hand, we were to fit for a signal at three different locations separated by much more than the image resolution, then we would have three independent trials with a single degree of freedom each. In that case the global $p$-value corresponding to the smallest of the local $p$-values would be $p_{\rm glob} = 1 - (1-p)^3$.

Here is a comparison of the single-trial and three-trial cases, for a single bounded degree of freedom.


In [9]:
pvalue_1dof_cher_3trial = 1. - (1 - 0.5*(1. - stats.chi2.cdf(x,1)))**3
fig5, ax5 = plt.subplots(1, 1)
ax5.set_xlabel('Test Statistic')
ax5.set_ylabel('p-value')
ax5.set_xlim(0.,25.)
ax5.semilogy(x, pvalue_1dof_cher,'r--', lw=1, label='1 Trial, 1 DOF')
ax5.semilogy(x, pvalue_1dof_cher_3trial,'r-.', lw=1, label='3 Trials, 1 DOF')
leg = ax5.legend()


Procedure:

  • Fit the data with the model, by finding the parameters $\hat{\alpha}$ that maximize the likelihood (and minimize the negative log likelihood).
  • Compute the test statistic $TS$ for the data and these parameter estimates.
  • Inspect the $\chi^2$ distribution, our approximation for ${\rm Pr}(TS|H_0)$, and compute the $p$-value, the probability of getting a test statistic larger than that observed (in a long sequence of repeated trials).
  • We often convert $p$-values to standard deviations when we report results: i.e., rather than giving the $p$-value we give the equivalent number of standard deviations. Standard deviations are defined in terms of the Gaussian distribution; helpfully, however, for 1 degree of freedom $\sigma = \sqrt{TS}$ (see the short sketch below).
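
As a minimal sketch of the last two steps (the TS value below is hypothetical, just to illustrate the conversion):


In [ ]:
import numpy as np
import scipy.stats as stats

ts_observed = 29.0                              # hypothetical TS from a fit
p_value = 0.5 * stats.chi2.sf(ts_observed, 1)   # bounded, 1-DOF (Chernoff) case
sigma = stats.norm.isf(p_value)                 # equivalent one-sided significance
print("TS = %.1f -> p = %.2e -> %.1f sigma (= sqrt(TS) = %.1f)"
      % (ts_observed, p_value, sigma, np.sqrt(ts_observed)))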
