Tutorial on statistical methods

The following is a sequence of simple exercises designed to familiarize you with using Minuit for maximum likelihood fits and emcee for determining parameters by MCMC. Some commands are commented out; to activate them, simply uncomment them. A few functions are still to be defined... which is part of the exercise. Have fun!


In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Generate a dataset to be fitted


In [2]:
np.random.seed(12345)
# Inverse transform sampling: if y is uniform on (0, 1), then x = y**(-1/2)
# follows the power-law density 2*x**(-3) on 1 < x < infinity.
y = np.random.random(10000)
x = 1./np.sqrt(y)
plt.hist(x, bins=100, range=(1, 10), histtype='stepfilled', color='blue')
plt.yscale('log')
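
Since y is uniform on (0, 1), x = y**(-1/2) follows the power-law density 2*x**(-3) on 1 < x < infinity, so the fits below should recover an exponent close to 3. As a quick sketch of a cross-check (the overlay scaling assumes essentially all entries fall in the plotted range), one can draw the analytic density on top of the histogram:

In [ ]:
# hedged cross-check: overlay the analytic density 2*x**(-3), scaled by the
# number of entries times the bin width, on the histogram
xs = np.linspace(1, 10, 200)
binwidth = (10 - 1)/100.
plt.hist(x, bins=100, range=(1, 10), histtype='stepfilled', color='blue')
plt.plot(xs, 2*xs**(-3)*len(x)*binwidth, 'r-')
plt.yscale('log')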


Maximum likelihood fit of a simple power law

First define the negative log-likelihood function for a density proportional to x**(-a) in the range 1 < x < infinity. Normalising over that range gives the density (a-1)*x**(-a).


In [3]:
def nllp(a):
    # negative log-likelihood for the normalised pdf (a-1)*x**(-a), valid for a > 1
    return np.sum(a*np.log(x) - np.log(a-1))
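
For this particular model the minimum is also available in closed form: setting the derivative of nllp to zero gives a_hat = 1 + N/sum(log x). A minimal sanity check (a_hat is an illustrative name, not part of the exercise):

In [ ]:
# closed-form MLE for the pure power law:
# d/da sum(a*log(x) - log(a-1)) = sum(log(x)) - N/(a-1) = 0
a_hat = 1. + len(x)/np.sum(np.log(x))
print(a_hat)  # should agree with the Minuit result below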

Then minimize it using iminuit


In [5]:
import iminuit

In [6]:
minp = iminuit.Minuit(nllp, a=3, error_a=0.1, errordef=0.5)
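Note that error_a and errordef follow the iminuit 1.x keyword interface used throughout this tutorial. With iminuit 2.x the equivalent setup would, to the best of our knowledge, look like the following (commented out, as it is not used here):

In [ ]:
# minp = iminuit.Minuit(nllp, a=3)
# minp.errors["a"] = 0.1
# minp.errordef = iminuit.Minuit.LIKELIHOOD  # = 0.5 for a negative log-likelihood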

In [8]:
minp.migrad();



FCN = 7888.70469829  TOTAL NCALL = 29  NCALLS = 11
EDM = 2.39178870231e-10  GOAL EDM = 5e-06  UP = 0.5

Valid   Valid Param   Accurate Covar   PosDef   Made PosDef
True    True          True             True     False

Hesse Fail   HasCov   Above EDM   Reach calllim
False        True     False       False

  Name   Value     Parab Error   Minos Error-   Minos Error+   Limit-   Limit+   FIXED
1 a      3.02417   0.0202417     0              0

Error analysis

First determine the parabolic errors using hesse() and then do a parameter scan using minos() to determine the 68% confidence level errors.


In [9]:
minp.hesse();


  Name   Value     Parab Error   Minos Error-   Minos Error+   Limit-   Limit+   FIXED
1 a      3.02417   0.0202417     0              0

Correlation matrix:
     a
a  1.00

In [10]:
minp.minos();
minp.draw_profile('a');


Minos status for a: VALID
  Error      -0.0201738617587   0.0203091857502
  Valid      True               True
  At Limit   False              False
  Max FCN    False              False
  New Min    False              False
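
The Minos errors are the points where the profile of the negative log-likelihood rises by errordef = 0.5 above its minimum. A minimal manual cross-check (the scan range is chosen by hand for this one-parameter problem):

In [ ]:
# hedged cross-check: find where nllp crosses its minimum + 0.5
a_scan = np.linspace(2.95, 3.10, 301)
nll_scan = np.array([nllp(a) for a in a_scan])
inside = a_scan[nll_scan < nll_scan.min() + 0.5]
print(inside[0] - minp.values["a"], inside[-1] - minp.values["a"])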

Use of an un-normalised PDF

The above example shall now be modified such that the normalisation of the likelihood function, which so far was determined analytically, is determined numerically in the fit. This is the more realistic case, since in many cases no (simple) analytical normalisation exists. As a first step, this requires loading the integration package.


In [11]:
from scipy.integrate import quad

In [12]:
def pdfpn(x, a):
    # un-normalised pdf
    return x**(-a)
def pdfpn_norm(a):
    # numerical normalisation over 1 < x < infinity
    return quad(pdfpn, 1, np.inf, args=(a,))[0]
def nllpn(a):
    return -np.sum(np.log(pdfpn(x, a))) + np.log(pdfpn_norm(a))*len(x)
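
Since the analytic normalisation of x**(-a) on 1 < x < infinity is 1/(a-1), nllpn should reproduce nllp up to numerical precision; a quick check:

In [ ]:
# the numeric and the analytic normalisation should agree
for a in (2.5, 3.0, 3.5):
    print(a, nllp(a), nllpn(a))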

Then do the same minimization steps as before.


In [13]:
minpn = iminuit.Minuit(nllpn, a=3, error_a=0.1, errordef=0.5)

In [14]:
minpn.migrad();



FCN = 7888.70469776  TOTAL NCALL = 18  NCALLS = 18
EDM = 1.2507327863e-08  GOAL EDM = 5e-06  UP = 0.5

Valid   Valid Param   Accurate Covar   PosDef   Made PosDef
True    True          True             True     False

Hesse Fail   HasCov   Above EDM   Reach calllim
False        True     False       False

  Name   Value     Parab Error   Minos Error-   Minos Error+   Limit-   Limit+   FIXED
1 a      3.02417   0.0202402     0              0

Extend the fit model by an exponential cutoff

The exponential cutoff is implemented as exp(-b*b*x), i.e. exponential growth is not possible for real-valued parameters b. The implications of this ansatz shall be discussed when looking at the solution. After that, the example can be modified to use exp(-b*x).

Here the likelihood function no longer has a (simple) analytical normalisation, so we directly take the numerical approach.


In [15]:
def pdfcn(x, a, b):
    # un-normalised pdf: power law with exponential cutoff
    return x**(-a)*np.exp(-b*b*x)
def pdfcn_norm(a, b):
    return quad(pdfcn, 1, np.inf, args=(a, b))[0]
def nllcn(a, b):
    return -np.sum(np.log(pdfcn(x, a, b))) + np.log(pdfcn_norm(a, b))*len(x)
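
Before fitting, a quick check of the numeric normalisation at the no-cutoff point b = 0, where the integral is again analytic:

In [ ]:
# at b = 0 the cutoff disappears, so pdfcn_norm(a, 0) = 1/(a-1)
print(pdfcn_norm(3.0, 0.0))  # expect 0.5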

As before, use Minuit for minimisation and error analysis, but now in two dimensions. Study the parabolic and minos errors, the latter both for the individual parameters and for both together.


In [16]:
mincn = iminuit.Minuit(nllcn, a=3, b=1, error_a=0.1, error_b=0.1, errordef=0.5)

In [17]:
mincn.migrad();



FCN = 7887.87601212  TOTAL NCALL = 85  NCALLS = 85
EDM = 1.54019145338e-06  GOAL EDM = 5e-06  UP = 0.5

Valid   Valid Param   Accurate Covar   PosDef   Made PosDef
True    True          True             True     False

Hesse Fail   HasCov   Above EDM   Reach calllim
False        True     False       False

  Name   Value       Parab Error   Minos Error-   Minos Error+   Limit-   Limit+   FIXED
1 a      2.99501     0.0285219     0              0
2 b      0.0895059   0.0316701     0              0


In [18]:
mincn.hesse();


  Name   Value       Parab Error   Minos Error-   Minos Error+   Limit-   Limit+   FIXED
1 a      2.99501     0.0282245     0              0
2 b      0.0895059   0.0309978     0              0

Correlation matrix:
      a      b
a   1.00  -0.69
b  -0.69   1.00

In [19]:
mincn.minos();


Minos status for a: VALID
  Error      -0.0326062951032   0.031209848349
  Valid      True               True
  At Limit   False              False
  Max FCN    False              False
  New Min    False              False

Minos status for b: VALID
  Error      -0.0500402203176   0.0350458653407
  Valid      True               True
  At Limit   False              False
  Max FCN    False              False
  New Min    False              False
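
One implication of the exp(-b*b*x) ansatz: the physical cutoff parameter is c = b**2, and the linear error propagation sigma_c = 2*|b|*sigma_b is only a rough sketch here, since b is barely two standard deviations away from zero:

In [ ]:
# hedged error propagation to the physical cutoff c = b**2
b_val = mincn.values["b"]
b_err = mincn.errors["b"]
print("c =", b_val**2, "+/-", 2*abs(b_val)*b_err)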

In [20]:
mincn.draw_profile('a');



In [21]:
mincn.draw_profile('b');



In [23]:
mincn.draw_contour('a','b');
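
draw_contour shows contours of the likelihood function itself; for the Minos confidence contours of both parameters together, iminuit also provides draw_mncontour (commented out, to be tried as part of the exercise):

In [ ]:
# mincn.draw_mncontour('a', 'b');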


Do the same analysis by MCMC


In [24]:
import emcee

emcee requires as input the log-likelihood of the posterior in the parameters a and b. In the following it is composed of the log of the prior and the log-likelihood of the data. Initially use a simple uniform prior in a and b with the constraint b > 0. Afterwards one can play with the prior to see how strongly it affects the result.


In [25]:
def log_prior(theta):
    a, b = theta
    if b < 0:
        return -np.inf  # log(0)
    else:
        return 0.

def log_likelihood(theta, x):
    # log-likelihood of the un-normalised pdf x**(-a)*exp(-b*b*x)
    a, b = theta
    return np.sum(-a*np.log(x) - b*b*x)

def log_posterior(theta, x):
    # log prior + log likelihood, with the numerical normalisation added here
    a, b = theta
    return log_prior(theta) + log_likelihood(theta, x) - np.log(pdfcn_norm(a, b))*len(x)

Here we'll set up the computation. emcee combines multiple "walkers", each of which is its own MCMC chain. The number of trace results will be nwalkers * nsteps.


In [26]:
ndim     = 2     # number of parameters in the model
nwalkers = 50    # number of MCMC walkers
nburn    = 100   # "burn-in" period to let chains stabilize
nsteps   = 1000  # number of MCMC steps to take

# random starting point
np.random.seed(0)
starting_guesses = np.random.random((nwalkers, ndim))
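
The uniform starting guesses in [0, 1)^2 place the walkers far from the best-fit a of about 3, which is why the burn-in period matters. As an alternative (a sketch, commented out and not used below), the walkers could be started in a small ball around the Minuit optimum:

In [ ]:
# starting_guesses = (np.array([mincn.values["a"], mincn.values["b"]])
#                     + 0.01*np.random.randn(nwalkers, ndim))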

Run the MCMC (and time it using IPython's %time magic).


In [28]:
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x])
%time sampler.run_mcmc(starting_guesses, nsteps)
print("done")


CPU times: user 33 s, sys: 35 ms, total: 33 s
Wall time: 33 s
done

sampler.chain is of shape (nwalkers, nsteps, ndim). Before the analysis, throw out the burn-in points and reshape.


In [29]:
emcee_trace = sampler.chain[:, nburn:, :].reshape(-1, ndim).T
len(emcee_trace[0])


Out[29]:
45000

Analyse the results. Plot the projected (marginalised) posteriors for the parameters a and b and also the joint density as sampled by the MCMC.


In [31]:
plt.hist(emcee_trace[0], 100, range=(2.5, 3.5), histtype='stepfilled', color='cyan');



In [33]:
plt.hist(emcee_trace[1], 100, range=(0., 0.5), histtype='stepfilled', color='cyan');



In [34]:
plt.plot(emcee_trace[0], emcee_trace[1], ',k');
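
For a comparison with the Minuit fit it is handy to also print simple posterior summaries, i.e. the mean and standard deviation of each marginal:

In [ ]:
# posterior mean and standard deviation for a and b
for name, trace in zip(("a", "b"), emcee_trace):
    print(name, trace.mean(), "+/-", trace.std())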


As a final step, generate 2-dim Bayesian confidence level contours containing 68.3% and 95.5% probability content. For that, define a convenient helper function and use it. Overlay the contours with the scatter plot.


In [35]:
def compute_sigma_level(trace1, trace2, nbins=20):
    """From a set of traces, compute the cumulative probability content of 2D bins."""
    L, xbins, ybins = np.histogram2d(trace1, trace2, nbins)

    shape = L.shape
    L = L.ravel()

    # obtain the indices to sort and unsort the flattened array
    i_sort = np.argsort(L)[::-1]
    i_unsort = np.argsort(i_sort)

    # cumulative probability content, filled from the highest bin downwards
    L_cumsum = L[i_sort].cumsum()
    L_cumsum /= L_cumsum[-1]

    # bin centres
    xbins = 0.5 * (xbins[1:] + xbins[:-1])
    ybins = 0.5 * (ybins[1:] + ybins[:-1])

    return xbins, ybins, L_cumsum[i_unsort].reshape(shape)


xbins, ybins, sigma = compute_sigma_level(emcee_trace[0], emcee_trace[1])
plt.contour(xbins, ybins, sigma.T, levels=[0.683, 0.955])
plt.plot(emcee_trace[0], emcee_trace[1], ',k', alpha=0.1)


Out[35]:
[<matplotlib.lines.Line2D at 0x10793f150>]
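
As an aside (assuming the corner package is installed; it is not used elsewhere in this tutorial), the marginal and joint densities can also be drawn in a single figure:

In [ ]:
# import corner
# corner.corner(emcee_trace.T, labels=["a", "b"]);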

In [ ]: