Example Notebook for A One Bin Model

model: Can be derived from a ROOT.RooStats.HistFactory.Measurement() with MakeModelAndMeasurementFast()

Ex:

meas = ROOT.RooStats.HistFactory.Measurement("meas", "meas")
# Add lots of things to the measurement
model_workspace = ROOT.RooStats.HistFactory.MakeModelAndMeasurementFast(meas)

observable:
model parameters:
parameter of interest (POI) $\left(\theta\right)$: ROOT.Measurement().AddPOI()

nuisance parameter $\left(\nu\right)$: nuisance parameters parametrize the impact of systematic uncertainties. Each systematic uncertainty, $i$, is described with a nuisance parameter, $\nu_{i}$ , that continuously interpolates between the variation and nominal templates. [5]

Ex:

sample = ROOT.RooStats.HistFactory.Sample("sample")
# A sample uncertainity that affectst the entire sample
sample.AddOverallSys("sample_uncertainty", uncertainty_down, uncertainty_up)
# Shape uncertainity
sample_shape = ROOT.RooStats.HistFactory.HistoSys("sample_shape")
sample_shape.SetHistoHigh(sample_up_hist)
sample_shape.SetHistoLow(sample_down_hist)
sample.AddHistoSys(sample_shape)

hypothesis:

Components of a Model

Models are built by HistFactory from measurements that consist of channels which hold samples.

The measurement is the object which collects all the information that will then be turned into a model by HistFactory. It is the top level combination of the channels that form the model.

A channel is a representation of observed data. Each channel is required to have a histogram of measured data associated with it. [3] Loosely, channels can be though of as "regions" in an analysis (e.g., signal regions, control regions, validation regions).

Samples are used to describe the channel (data) they are associated with. They can represent signals and backgrounds. Each sample has a histogram describing its shape and a list of systematics describing how that shape transforms based on a number of parameters. [3]



In [1]:

    
import ROOT
ROOT.RooMsgService.instance().setGlobalKillBelow(5)
%jsroot on

import sys
import os
# Don't require pip install to test out
sys.path.append(os.getcwd() + '/../../src')
from dfgmark import histfactorybench as hfbench

#import rootpy
#from rootpy.stats.histfactory import utils as hfutils









    



Welcome to JupyROOT 6.11/01

Overview of Project

Goal: The goal of the project is to be able to benchmark the speed/performance of the current implementation of HistFactory to perform a maximum likelihood fit for a given binned model generated by the user. The user can do this by using the project Python library (which will essentially just be a collection of helper functions).

The binned model is supposed to be parameterized by:

number of events
number of bins
number of channels
number of signal components (samples)
number of background components (samples)
number of parameters of interest
number of nuisance parameters

If this is all being made "on the fly" then all components are necessary. However, if the channels are being loaded in from ROOT files, then it would seem that the number of events and number of bins could be inferred(?).

As nuisance parameters are any model parameter which is not a parameter of interest, then if a parameter is not explicitly set as a POI (Measurement.SetPOI("parameter name")) then it is treated as a nuisance parameter.

Building a Simple Model

Let's build the simplest model: A counting experiment. One bin with a flat background sample and signal model.

Functions



In [2]:

    
help(hfbench.sample_func_histogram)









    



Help on function sample_func_histogram in module dfgmark.histfactorybench:

sample_func_histogram(shape_function, n_events=1, n_bins=10, range_low=0, range_high=1, hist_name='data_hist')
    Sample a function. create an pdf from it, then histogram it
    
    Args:
        shape_function (TF1): The mathematical form of the function to be sampled
        n_events (int): The number of points to sample
        range_low (int, float): The lower bound on the sampling range
        range_high (int, float): The upper bound on the sampling range
        n_bins (int): The number of bins in the generated histogram
        hist_name (str): The name of the histogram object
    Returns:
        hist (TH1): The generated histogram of n_events



In [3]:

    
help(hfbench.make_sample)









    



Help on function make_sample in module dfgmark.histfactorybench:

make_sample(sample_name='sample', sample_histogram=None, uncertainty_down=None, uncertainty_up=None, shape_up_hist=None, shape_down_hist=None, has_theory_norm=False)
    Construct a HistFactory.Sample()
    
    Args:
        sample_name (str): The name of the signal sample
        sample_histogram: A TH1 object that encodes the shape of the sample
        uncertainty_down (float): The downward normalisation systematic uncertainty (1.0 - uncertainty_down)
        uncertainty_up (float): The upward normalisation systematic uncertainty (1.0 + uncertainty_up)
        shape_up_hist: A TH1 object that encodes the updward sample shape uncertainty
        shape_down_hist: A TH1 object that encodes the downward sample shape uncertainty
        has_theory_norm (bool): Is the sample normalized by theory or not. Default is False.
    Returns:
        sample (HistFactory.Sample()): The constructed sample. Ex: signal sample, background sample



In [4]:

    
help(hfbench.make_channel)









    



Help on function make_channel in module dfgmark.histfactorybench:

make_channel(channel_name='channel', channel_data=None, channel_samples=None)
    Make a channel with data and associated samples
    
    Args:
        channel_name (str): The name of the channel.
            Example: If this channel represents a signal region, the name might be "SR"
        channel_data (TH1): The data histogram that describes the channel
        channel_samples (list of HistFactory.Sample()): The samples associated with the channel that describe
                                                        the model for the channel
    
    Returns:
        channel (HistFactor.Channel()): The channel object



In [5]:

    
help(hfbench.make_model)









    



Help on function make_model in module dfgmark.histfactorybench:

make_model(n_evnets, n_bins, channels, POI, workspace_name=None, workspace_save_path=None)
    Args:
        n_events: The number of events
        n_bins: The numbe of bins
        channels (dictionary of HistFactory.Channel()): The channels in the model with associated samples
        # n_sig_comp: The number of signal components (automatically determined from the channels)
        # n_bkg_comp: The number of background components (automatically
        # determined from the channels)
        POI (list of str): The parameters of interest
        # n_nuisance: The number of nuisance parameters (automatically
        # determined from the samples)
        workspace_name (str): The name of the resulting RooStats.RooWorkspace()
        workspace_save_path (str): The path and filename (sans ".root") to save the workspace
    Returns:
        workspace (RooStats.RooWorkspace): The workspace for the measruement
        model (RooStats.RooSimultaneous): The model generated from the workspace



In [6]:

    
help(hfbench.benchmark_fit)









    



Help on function benchmark_fit in module dfgmark.histfactorybench:

benchmark_fit(var, model, n_events, n_trials, verbose=False)
    Fit the given model and return the time it take
    
    Args:
        var (RooAbsReal and inheritors): The variable with range over which to fit the model
        model (RooAbsPdf): The pdf of the model
        n_events (int): The number of data points to generate from the model
        n_trials (int): The number of times to fit the model
        verbose (bool):
            False: Don't print information
            True: Print information on fit
    Returns:
        mean_fit_time (float): The mean time it takes to fit the model over n_trials times
    Example:
        ws, model = make_model(n_events, n_bins, channels, POI)
        benchmark_fit(ws.var("obs_x_SR"), model, n_events, n_trials)



In [7]:

    
help(hfbench.benchmark_fit_with_workspace)









    



Help on function benchmark_fit_with_workspace in module dfgmark.histfactorybench:

benchmark_fit_with_workspace(work_space, model, n_events, n_trials, verbose=False)
    Fit the given model and return the time it takes
    
    Args:
        work_space (RooWorkspace): The work space for the model
        model (RooAbsPdf): The pdf of the model
        n_events (int): The number of data points to generate from the model
        n_trials (int): The number of times to fit the model
        verbose (bool):
            False: Don't print information
            True: Print information on fit
    Returns:
        mean_fit_time (float): The mean time it takes to fit the model over n_trials times
    Example:
        ws, model = make_model(n_events, n_bins, channels, POI)
        benchmark_fit_with_workspace(ws, model, n_events, n_trials)

Examples

Now let's create the samples (1 signal and 1 background), the channel (1 signal region), and the model.

First, set the parameters for the model and make the flat 1 bin histograms



In [8]:

    
n_events = 1000
n_bins = 1
range_low = 0
range_high = 10

frac_sig = 0.1
frac_bkg = 1 - frac_sig

signal_hist = ROOT.TH1F("signal_hist", "signal_hist", n_bins, range_low, range_high)
signal_hist.SetBinContent(1, frac_sig * n_events)

background_hist = ROOT.TH1F("background_hist", "background_hist", n_bins, range_low, range_high)
background_hist.SetBinContent(1, frac_bkg * n_events)

data_hist = ROOT.TH1F("data_hist", "data_hist", n_bins, range_low, range_high)
data_hist.SetBinContent(1, n_events)

As this is a counting experiment, should have Poisson uncertainties ($\sqrt{n_{\text{events}}}$) on all histograms, so let's check:



In [9]:

    
hist_list = [signal_hist, background_hist, data_hist]

for hist in hist_list:
    print("{0}: {1} events and uncertainty of {2} with sqrt({1}) being {3}".format(
        hist.GetName().rsplit('_')[0],
        hist.GetBinContent(1),
        hist.GetBinError(1),
        ROOT.TMath.Sqrt(hist.GetBinContent(1))))









    



signal: 100.0 events and uncertainty of 10.0 with sqrt(100.0) being 10.0
background: 900.0 events and uncertainty of 30.0 with sqrt(900.0) being 30.0
data: 1000.0 events and uncertainty of 31.622776601683793 with sqrt(1000.0) being 31.622776601683793

Now make the samples and add them and the channel data histogram to the channel



In [10]:

    
samples = []
samples.append(hfbench.make_sample("signal", signal_hist, 0.01, 0.01))
samples.append(hfbench.make_sample("background", background_hist, 0.01, 0.01))

channels = []
SR = hfbench.make_channel(channel_name="SR", channel_data=data_hist, channel_samples=samples)
channels.append(SR)

Define the parameter of interest (POI), and then make the workspace and model



In [11]:

    
POI = ["SigXsecOverSM"]
ws, model = hfbench.make_model(n_events, n_bins, channels, POI)









    




-------------------
Starting to process SR channel with 1 observables
lumi str = [1,0,10]
lumi Error str = nominalLumi[1,0,1.2],0.02
making normFactor: SignalStrength
signal_SR has no variation histograms 
processing hist signal_hist
background_SR has no variation histograms 
processing hist background_hist
-----------------------------------------
import model into workspace
RooDataSet::AsimovData[obs_x_SR,weight:binWeightAsimov] = 1 entries (1000 weighted)

RooWorkspace(SR) SR workspace contents

variables
---------
(Lumi,SignalStrength,alpha_background_uncertainty,alpha_signal_uncertainty,binWidth_obs_x_SR_0,binWidth_obs_x_SR_1,nom_alpha_background_uncertainty,nom_alpha_signal_uncertainty,nominalLumi,obs_x_SR,weightVar)

p.d.f.s
-------
RooRealSumPdf::SR_model[ binWidth_obs_x_SR_0 * L_x_signal_SR_overallSyst_x_Exp + binWidth_obs_x_SR_1 * L_x_background_SR_overallSyst_x_Exp ] = 100/1000
RooGaussian::alpha_background_uncertaintyConstraint[ x=alpha_background_uncertainty mean=nom_alpha_background_uncertainty sigma=1 ] = 1
RooGaussian::alpha_signal_uncertaintyConstraint[ x=alpha_signal_uncertainty mean=nom_alpha_signal_uncertainty sigma=1 ] = 1
RooGaussian::lumiConstraint[ x=Lumi mean=nominalLumi sigma=0.02 ] = 1
RooProdPdf::model_SR[ lumiConstraint * alpha_signal_uncertaintyConstraint * alpha_background_uncertaintyConstraint * SR_model(obs_x_SR) ] = 100

functions
--------
RooProduct::L_x_background_SR_overallSyst_x_Exp[ 1 * background_SR_overallSyst_x_Exp ] = 900
RooProduct::L_x_signal_SR_overallSyst_x_Exp[ 1 * signal_SR_overallSyst_x_Exp ] = 100
RooStats::HistFactory::FlexibleInterpVar::background_SR_epsilon[ paramList=(alpha_background_uncertainty) ] = 1
RooHistFunc::background_SR_nominal[ depList=(obs_x_SR) ] = 900
RooProduct::background_SR_overallNorm_x_sigma_epsilon[ SignalStrength * background_SR_epsilon ] = 1
RooProduct::background_SR_overallSyst_x_Exp[ background_SR_nominal * background_SR_overallNorm_x_sigma_epsilon ] = 900
RooStats::HistFactory::FlexibleInterpVar::signal_SR_epsilon[ paramList=(alpha_signal_uncertainty) ] = 1
RooHistFunc::signal_SR_nominal[ depList=(obs_x_SR) ] = 100
RooProduct::signal_SR_overallNorm_x_sigma_epsilon[ SignalStrength * signal_SR_epsilon ] = 1
RooProduct::signal_SR_overallSyst_x_Exp[ signal_SR_nominal * signal_SR_overallNorm_x_sigma_epsilon ] = 100

datasets
--------
RooDataSet::asimovData(obs_x_SR)
RooDataSet::obsData(obs_x_SR)

embedded datasets (in pdfs and functions)
-----------------------------------------
RooDataHist::signal_SRnominalDHist(obs_x_SR)
RooDataHist::background_SRnominalDHist(obs_x_SR)

named sets
----------
ModelConfig_GlobalObservables:(nom_alpha_signal_uncertainty,nom_alpha_background_uncertainty)
ModelConfig_Observables:(obs_x_SR)
coefList:(binWidth_obs_x_SR_0,binWidth_obs_x_SR_1)
constraintTerms:(lumiConstraint,alpha_signal_uncertaintyConstraint,alpha_background_uncertaintyConstraint)
globalObservables:(nom_alpha_signal_uncertainty,nom_alpha_background_uncertainty)
likelihoodTerms:(SR_model)
obsAndWeight:(weightVar,obs_x_SR)
observables:(obs_x_SR)
observablesSet:(obs_x_SR)
shapeList:(L_x_signal_SR_overallSyst_x_Exp,L_x_background_SR_overallSyst_x_Exp)

generic objects
---------------
RooStats::ModelConfig::ModelConfig

Setting Parameter(s) of Interest as: SigXsecOverSM 
WARNING: Can't find parameter of interest: SigXsecOverSM in Workspace. Not setting in ModelConfig.



In [12]:

    
ws.Print()









    



RooWorkspace(HistFactoryWorkspace) SR workspace contents

variables
---------
(Lumi,SignalStrength,alpha_background_uncertainty,alpha_signal_uncertainty,binWidth_obs_x_SR_0,binWidth_obs_x_SR_1,nom_alpha_background_uncertainty,nom_alpha_signal_uncertainty,nominalLumi,obs_x_SR,weightVar)

p.d.f.s
-------
RooRealSumPdf::SR_model[ binWidth_obs_x_SR_0 * L_x_signal_SR_overallSyst_x_Exp + binWidth_obs_x_SR_1 * L_x_background_SR_overallSyst_x_Exp ] = 100/1000
RooGaussian::alpha_background_uncertaintyConstraint[ x=alpha_background_uncertainty mean=nom_alpha_background_uncertainty sigma=1 ] = 1
RooGaussian::alpha_signal_uncertaintyConstraint[ x=alpha_signal_uncertainty mean=nom_alpha_signal_uncertainty sigma=1 ] = 1
RooGaussian::lumiConstraint[ x=Lumi mean=nominalLumi sigma=0.02 ] = 1
RooProdPdf::model_SR[ lumiConstraint * alpha_signal_uncertaintyConstraint * alpha_background_uncertaintyConstraint * SR_model(obs_x_SR) ] = 100

functions
--------
RooProduct::L_x_background_SR_overallSyst_x_Exp[ 1 * background_SR_overallSyst_x_Exp ] = 900
RooProduct::L_x_signal_SR_overallSyst_x_Exp[ 1 * signal_SR_overallSyst_x_Exp ] = 100
RooStats::HistFactory::FlexibleInterpVar::background_SR_epsilon[ paramList=(alpha_background_uncertainty) ] = 1
RooHistFunc::background_SR_nominal[ depList=(obs_x_SR) ] = 900
RooProduct::background_SR_overallNorm_x_sigma_epsilon[ SignalStrength * background_SR_epsilon ] = 1
RooProduct::background_SR_overallSyst_x_Exp[ background_SR_nominal * background_SR_overallNorm_x_sigma_epsilon ] = 900
RooStats::HistFactory::FlexibleInterpVar::signal_SR_epsilon[ paramList=(alpha_signal_uncertainty) ] = 1
RooHistFunc::signal_SR_nominal[ depList=(obs_x_SR) ] = 100
RooProduct::signal_SR_overallNorm_x_sigma_epsilon[ SignalStrength * signal_SR_epsilon ] = 1
RooProduct::signal_SR_overallSyst_x_Exp[ signal_SR_nominal * signal_SR_overallNorm_x_sigma_epsilon ] = 100

datasets
--------
RooDataSet::asimovData(obs_x_SR)
RooDataSet::obsData(obs_x_SR)

embedded datasets (in pdfs and functions)
-----------------------------------------
RooDataHist::signal_SRnominalDHist(obs_x_SR)
RooDataHist::background_SRnominalDHist(obs_x_SR)

parameter snapshots
-------------------
NominalParamValues = (Lumi=1[C],nominalLumi=1[C],alpha_signal_uncertainty=0,nom_alpha_signal_uncertainty=0[C],SignalStrength=1,obs_x_SR=5,alpha_background_uncertainty=0,nom_alpha_background_uncertainty=0[C],binWidth_obs_x_SR_0=0.1[C],binWidth_obs_x_SR_1=0.1[C],weightVar=0)

named sets
----------
ModelConfig_GlobalObservables:(nom_alpha_signal_uncertainty,nom_alpha_background_uncertainty)
ModelConfig_NuisParams:(SignalStrength,alpha_background_uncertainty,alpha_signal_uncertainty)
ModelConfig_Observables:(obs_x_SR)
ModelConfig_POI:()
coefList:(binWidth_obs_x_SR_0,binWidth_obs_x_SR_1)
constraintTerms:(lumiConstraint,alpha_signal_uncertaintyConstraint,alpha_background_uncertaintyConstraint)
globalObservables:(nom_alpha_signal_uncertainty,nom_alpha_background_uncertainty)
likelihoodTerms:(SR_model)
obsAndWeight:(weightVar,obs_x_SR)
observables:(obs_x_SR)
observablesSet:(obs_x_SR)
shapeList:(L_x_signal_SR_overallSyst_x_Exp,L_x_background_SR_overallSyst_x_Exp)

generic objects
---------------
RooStats::ModelConfig::ModelConfig



In [13]:

    
#model = ws.pdf("model_SR")

##hfbench.benchmark_fit(ws.var("obs_x_SR"), model, n_events, n_events, verbose=True)
#hfbench.benchmark_fit_with_workspace(ws, model, n_events, n_events, verbose=True)



In [14]:

    
#f = ws.var("obs_channel1").frame()
c = ROOT.TCanvas()
f = ws.var("obs_x_SR").frame()
print(ws.var("obs_x_SR"))
ws.data("obsData").plotOn(f)
f.Draw()
c.Draw()
print(ws.data("obsData"))

#h_test = super(ROOT.RooDataSet,ws.data("obsData")).createHistogram("h_test", ws.var("obs_x_SR"), ROOT.RooFit.Binning(n_bins, range_low, range_high))
roo_data_hist = ROOT.RooDataHist("h_test", "h_test", ROOT.RooArgSet(ws.var("obs_x_SR")), ws.data("obsData"))
h_test = roo_data_hist.createHistogram("h_test", ws.var("obs_x_SR"))
h_test.Sumw2(False)
h_test.Sumw2()
h_test.SetMinimum(0)
h_test.SetMaximum(h_test.GetBinContent(1) * 1.1)
print(roo_data_hist)
print(h_test)
roo_data_hist.Print("v")
c_test = ROOT.TCanvas()
h_test.Draw()
c_test.Draw()









    



<ROOT.RooRealVar object ("obs_x_SR") at 0x6af89d0>
<ROOT.RooDataSet object ("obsData") at 0x6e4d70c>
<ROOT.RooDataHist object ("h_test") at 0x6fd5fd0>
<ROOT.TH1F object ("h_test__obs_x_SR") at 0x6901d10>
DataStore h_test (h_test)
  Contains 1 entries
  Observables: 
    1)  obs_x_SR = 5  L(0 - 10) B(1)  "obs_x_SR"
Binned Dataset h_test (h_test)
  Contains 1 bins with a total weight of 1000
  Observables:     1)  obs_x_SR = 5  L(0 - 10) B(1)  "obs_x_SR"

It seems that the uncertainties on the model data (which as I understand it is stored as "obsData" in the model) have blown up here after just creating the model. So I must be doing something wrong in my make_model() function.



In [15]:

    
x = ROOT.RooRealVar('x','x', range_low, range_high)
fr = x.frame()
f = ws.var("obs_x_SR").frame()
data = ws.data('obsData')
#data.plotOn(fr) #plotOn() is complaining that it can not find a working method :?
print(data) # check to make sure that it is not a null pointer
#h_data = super(ROOT.RooDataSet,data).createHistogram("data_hist", x, ROOT.RooFit.Binning(n_bins, range_low, range_high))
h_data = super(ROOT.RooDataSet,data).createHistogram("data_hist", ws.var("obs_x_SR"), ROOT.RooFit.Binning(n_bins, range_low, range_high))
h_data.Sumw2(False)
h_data.Sumw2()
h_data.SetMinimum(0)
h_data.SetMaximum(h_data.GetBinContent(1) * 1.1)

c = ROOT.TCanvas()
c.cd()
h_data.Draw("E")
c.Draw()

print("{0}: {1} events and uncertainty of {2} with sqrt({1}) being {3}".format(
        h_data.GetName().rsplit('_')[0],
        h_data.GetBinContent(1),
        h_data.GetBinError(1),
        ROOT.TMath.Sqrt(h_data.GetBinContent(1))))









    



<ROOT.RooDataSet object ("obsData") at 0x6e4d70c>
data: 1000.0 events and uncertainty of 31.622776601683793 with sqrt(1000.0) being 31.622776601683793



In [16]:

    
# draw the histogram before the model was made
c = ROOT.TCanvas()
c.cd()
data_hist.Draw("E")
data_hist.SetMinimum(0)
data_hist.SetMaximum(data_hist.GetBinContent(1) * 1.1)
c.Draw()

model_config = ws.obj("ModelConfig")

hist_factory_nav = ROOT.RooStats.HistFactory.HistFactoryNavigation(model_config)
#print(hist_factory_nav)

# draw the histogram stack
stack = hist_factory_nav.GetChannelStack("SR")
#channel_data = hist_factory_nav.GetDataHist(ws.data("obsData"),"SR","channel_data")
channel_data = h_data
sig_hist = hist_factory_nav.GetSampleHist("SR", "signal", "sig_hist")
bkg_hist = hist_factory_nav.GetSampleHist("SR", "background", "bkg_hist")

c1 = ROOT.TCanvas()
stack.Draw()
channel_data.Draw("SAME")
leg = ROOT.TLegend(0.65, 0.65, 0.9, 0.9)
for hist in stack.GetHists():
    samplename = hist.GetName().rsplit('_')[0]
    leg.AddEntry(hist, samplename,"f")
leg.AddEntry(channel_data, "channel data","lep")
leg.Draw()
c1.Draw()

# draw the histogram stored as "obsData" in the model
c2 = ROOT.TCanvas()
channel_data.Draw()
channel_data.SetMinimum(0)
channel_data.SetMaximum(channel_data.GetBinContent(1) * 1.1)
c2.Draw()



In [17]:

    
# To avoid a segfault at kerenel exit
canvas_list = [c, c_test, c1, c2]
for canvas in canvas_list:
    canvas.IsA().Destructor(canvas)

References

RooFit/RooStats tutorials for INFN School of Statistics 2013
Vince Croft's Jupyter Notebooks on RooFit and HistFactory
Building HistFactory models using C++ and Python
RooStats::HistFactory::Measurement Class Reference
Baak, et. al, HistFitter software framework for statistical data analysis
Lukas Heinrich, Plans for HistFactory integration in HEPData, "(Re)interpreting the results of new physics searches at the LHC" workshop, June 2016