Spectral Energy Density Fitting and Dark Matter Limit Extraction

Motivation

Now we are going to discuss how we can use build a summary data product that can be used to quickly fit a wide variety of different DM spectra.

Recall that the previous example involved fitting the data across all energy bins using a powerlaw with index -2 for the dark matter target.

What we would like to do is extract the spectrum of any excess (i.e., the flux or limits associated with the various energy bins) and then fit the various DM model spectra to the observed spectra.

Something like this:

So then we would be fitting various DM models against the spectral data points, rather than the counts data, as we did in the previous example. Typically we might use the uncertainties of the data points and do a $\chi^2$ fit for the DM spectrum.

There are two main issues with that approach.

  1. Because many of our energy bins have very low statistics, the symmetric error bars that you would want to use, which are obtained by approximating the log-Likelihood surface as a parabola near the minimum, are not actually a good representation of the true log-Likelihood.
  2. Since we are doing a search for signal of new physics, it is likely that in many of the energy bins we will actually be reporting upper limits instead of flux points with error bars. Because upper limits combine to pieces of information (the mean value and the uncertainty) into a single number ( the upper limit ) there isn't a good way to combine upper limits. Consider, for example ,two measurements, the first being $1.0 \pm 0.5$ and the second being $1.9 \pm 0.05$. If we took upper limits as the best-fit value plus 2 sigma both results would give us upper limits of 2.0. What we have lost in reporting only the upper limit is the information about if the data are consistent with the null-hypothesis.

So the best way to combine the information from the various energy bins is to combine the likelihoods.

Overview of the Methodology

First we need to extract the log-likelihood versus flux in each energy bin seperately. In any one energy bin, the analysis is just the same as what we did in the previous example, except that we only use the data and model in a single energy bin.

For a single energy bin the results may look something like this:

Where the delta log-likelihood is being plot on the color scale.

For two energy bins the results might look like this:

And finally, for all of the energy bins the results might look like this:

This last figure is called a "Castro" plot.

Basically, the dark red bands show the regions favored by the data, and the other colors show the regions increasingly disfavored by the data.

Here is another version of the same plot, where we have added the 95% CL upper limits in each of the energy bands.

Recall: the confidence level here is not quantifying the probability of the energy flux taking a value below the given 95% limit - that would be a Bayesian statement.

Question:

What is the corect phrasing we should use when describing the meaning of these upper limits?

If we assume a particular spectral form for the DM signal, we can use the data that went into the Castro plot to construct the log-likelihood as a function of the paramaters of the function we assumed. In our case we will be assuming the annihilation channel and mass of the DM, so the only free parameter is the normalization of the signal.

Here are what the 95% CL upper limits would look like in this simulation for DM annihilating to b-quarks, for several different DM masses.

By way of comparison, here is the upper limits on the spectrum you would get if you simply required that the curve did not exceed any of the single bin upper limits, which you can see is markedly worse.

Question:

Why does this plot not tell us the correct upper limit on the spectrum normalization?

Finally, here is what a positive dection of a signal might look like:

Our example

There are two file that we will use to work this example that we should take a close look at.
The first is the draco_sed.yaml file in the results directory. It constains the likelihood versus flux results from the same simulation of 6 years of data we used in the first example.


In [ ]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import LikeFitUtils as lfu
import SedUtils as SED

# lets open the file and have a look
import yaml
f_sed = yaml.load(open("results/draco_sed.yaml"))
len(f_sed)

Ok, the file contains 24 sets of curves, one for each energy bin.

Let's have a look at one of the sets of results.


In [ ]:
f_sed[0].keys()

In [ ]:
print "Energy range of bin 0 is %.1e to %.1e MeV"%(f_sed[0]['emin'],f_sed[0]['emax'])
print "Flux values scanned range from %.1e to %.1e ph cm^-2 s^-1"%(f_sed[0]['flux'][0],f_sed[0]['flux'][-1])
print "The corresponding energy flux values range from %.1e to %.1e MeV cm^-2 s^-1"%(f_sed[0]['eflux'][0],f_sed[0]['eflux'][-1])
print "The resulting delta log-Likelihood values at the edges of the scan are %.1f and %.1f"%(f_sed[0]['logLike'][0],f_sed[0]['logLike'][-1])
print "The conversion factor from energy flux to number of predicted counts is %.1e"%f_sed[0]['eflux2npred']

So, as stated, that file contains everything we need to make the Castro plot.

I've put some utilities in SedUtils.py. These are functions to do things like interpolate the log-likelihood in each energy bin and then sum them together. I've added a small python class to manage things.


In [ ]:
import SedUtils as SED

sed = SED.SED(f_sed)
help(sed)

Ok, lets go ahead and take a look at the SED that we have.


In [ ]:
sed.binByBinUls = None
binByBinULs = sed.BinByBinULs()
figSED,axSED = SED.PlotSED(sed.energyBins,binByBinULs)

Ok, recall that we should never plot upper limits without also giving information about the expected upper limits. There is a file in the "ancil" sub-directory that has the quantiles for the upper limits from 300 Monte Carlo simulations of the analysis.

A pretty standard way to give a sense of the consistency of the results is to show the so called "Brazil" bands for the upper limits. I.e., expectation bands made from simulating the analysis chain numerous times. Typically people show the 1 and 2 $\sigma$ expectation bands and plot them in yellow and green, thus the name "Brazil".


In [ ]:
# let's get the file with the expected upper limits
f_sed_bands = yaml.load(open("ancil/draco_sed_mc_bands.yaml"))

In [ ]:
figSED2,axSED2 = SED.PlotSED(sed.energyBins,binByBinULs,f_sed_bands)

Question:

Does this SED plot look reasonable to you?

The second file is the DM_spectra.yaml file in the "ancil" directory. This file gives the DM spectra for several different masses for the $b\hat{b}$ and $\tau^+\tau^-$ channels. I made this file specifically to match our analysis and our energy binning.


In [ ]:
f_dmspec = yaml.load(open("ancil/DM_spectra.yaml"))
print "Channels loaded are",f_dmspec.keys()
masses_bb = f_dmspec['bb'].keys()
masses_bb.sort()
masses_tau = f_dmspec['tautau'].keys()
masses_tau.sort()
fluxVals = f_dmspec['bb'][100]
print "Masses for the bb channel are",masses_bb
print "Masses for the tautau channel are",masses_tau
print "Flux values for 100GeV bb dark matter:\n",fluxVals

Similary to the previous example, the SED object will make a function that we can then pass to the optimizer, this is the NLL_Func function.


In [ ]:
help(sed.NLL_func)

In [ ]:
nll_func = sed.NLL_func(fluxVals)
nll_null = nll_func(0.)
nll_test = nll_func(1.)   # Warning, this is in units of 10^-26 cm^3 s-1
print nll_null,nll_test

There is also a Minimize function that finds the normalization value that minimizes the negative log likelihood:


In [ ]:
result = sed.Minimize(fluxVals,1.0)
mle = result[0][0]
nll_mle = result[1]
ts = 2.*(nll_null-nll_mle)
print "Best-fit value %.1f"%(mle)
print "Test Statistic %.1f"%(ts)

So, it looks like there is no signal and we should set upper limits. As before we construct the upper limits at the point were the delta log-likelihood reaches 1.35.


In [ ]:
import LikeFitUtils as lfu
xbounds = (1e-4,1e1)
error_level = 1.35
ul = lfu.SolveForErrorLevel(nll_func,nll_mle,error_level,mle,xbounds)
print "Upper limit on <sigma v> is %.2e cm^3 s^-1"%(1e-26*ul)

In the SedUtils.py file you will find a small piece of code to loop over all the channels and masses and to write the output to ../results/draco_dm_results.yaml. Let's go ahead and open that file.


In [ ]:
f_dmlims = yaml.load(open("results/draco_dm_results.yaml"))
print "Channels are:",f_dmlims.keys()
print "Data saved for each channel:",f_dmlims['bb'].keys()

In [ ]:
print "Upper limits for bb channel are:\n",1e-26*f_dmlims['bb']['UL95']

Displaying the results

Recall the point about how presenting upper limits alone gets rid of the information about the uncertainties and if the result is consistent with the null hypothesis.

Once again, you should never present upper limits without also presenting something that allows people to determine if they think the result is consistent with the null hypothesis.

You can find the quantiles calculated from 300 Monte Carlo simulated instances of the analysis chain in the file draco_spectral_mc_bands.yaml in the ancil folder.


In [ ]:
# Ok, first we will load the bands
bands = yaml.load(open("ancil/draco_spectral_mc_bands.yaml"))   

print "MC expectation bands for channels: ",bands.keys()
print "Quantities available are: \n",bands['bb'].keys()

Ok, there as you can see, the file has a lot more information than the simple limits. The various types of limits presented in the file are:

  • ulimits and ulimits99: The simple upper limits fo 95% and 99% confidence levels i.e., the thing we want.
  • pulimits and pulimits99: The upper limits profiled over the unceratintiy in the J-factor of Draco
  • p2ulimits and p2ulimits99: The upper limits profiled over the unceratintiy in the J-factor of Draco, using a different representation of the unceratintiy of the J-factor
  • bulimits and bulimits99: The Baysian upper limits, calculated with a flat prior.
  • b2ulimits and b2ulimits99: The Baysian upper limits, calculated with an exponential prior (appropriate for Poisson data, as we have here)

For each type of limit the file contains inforamation about several quantiles from the Monte Carlo simulation runs.


In [ ]:
# Ok, let's go ahead and plot the limits against the expectation
f,a = SED.PlotLimits(f_dmlims['bb']['Masses'],f_dmlims['bb']['UL95'],bands['bb']['ulimits'])

Question:

Do these upper limits seem reasonable?

Additional questions / exercises

  1. How would you calculate upper limits in a Baysian framework? What potential pit-falls should you be careful of?
  2. How do Baysian upper limits compare with what we have done?
  3. In a real analysis we would want to incorporate the uncertainty of the J-factor into the upper limits on $\langle \sigma v \rangle$, how could we do that?
  4. Several groups have claimed that excess emission near the Galactic center around 3GeV is consistent with a DM signal, e.g. arXiv:1402.6703. How do the limits we have computed from Draco compare to those claims?

Thank you for your attention