Goals:
Explore the relationship between "characterizing the posterior PDF" and "fitting a model to data."
Understand how to derive maximum likelihood estimators and their confidence intervals
Be able to compare, contrast and appreciate the Bayesian and Frequentist approaches to statistics
We'll work through a simple Bayesian inference (fitting a straight line and reporting summaries of marginalized posterior PDFs), and then compare with the frequentist procedure of estimating the gradient and intercept and reporting confidence intervals.
A lot of monitoring data - repeated imaging and subsequent "photometry" of the star - can provide a measurement of the absolute magnitude (if we know the distance to its host galaxy) and the period of the oscillation.
Let's look at some Cepheid measurements reported by Riess et al (2011, R11). The data are in the form of datapoints, one for each star.
The periods are well measured, while the magnitudes come with reported error bars.
$\;\;\;\;\;\;\;m = a\;\log_{10} P + b$
$\;\;\;\;\;\;\;m^{\rm obs} = 24.51 \pm 0.31$ at $\log_{10} P = \log_{10} (13.0/{\rm days})$
$\;\;\;\;\;P(a,b|\boldsymbol{m}^{\rm obs},H) \propto P(\boldsymbol{m}^{\rm obs}|a,b,H)\;P(a,b|H)$
Let's draw a PGM for this inverse problem, imagining our way through what we would do to generate a mock dataset like the one we have from R11.
If we were generating mock data, then for any plausible choice of parameters $a$ and $b$ we could predict the true magnitude $m_k$ of each star given its period $P_k$, and then add noise to simulate each observed magnitude $m^{\rm obs}_k$.
Now let's assign PDFs for each node in the PGM, and derive the unnormalized posterior PDF for $a$ and $b$.
We'll need:
The sampling distribution: $P(\boldsymbol{m}^{\rm obs}|\boldsymbol{m},H)$
The conditional PDF for the latent variables $m_k$, $P(m_k|a,b,\log_{10}{P_k},H)$
A prior PDF for our parameters: $P(a,b|H)$
We were given points ($m^{\rm obs}_k$) with error bars ($\sigma_k$), which suggests a Gaussian sampling distribution for each one:
$\;\;\;\;\;\;\;P(m^{\rm obs}_k|m_k,\sigma_k,H) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left[-\frac{(m^{\rm obs}_k - m_k)^2}{2\sigma_k^2}\right]$
Note that we are never given the form of the sampling distribution: it always has to be assumed. It turns out that the Gaussian distribution is the "Maximum Entropy" (minimally informative) choice given the information provided.
If we assume that the measurements of each Cepheid star are independent of each other, then we can define predicted and observed data "vectors" $\boldsymbol{m}$ and $\boldsymbol{m}^{\rm obs}$ (plus a corresponding observational uncertainty "vector" $\boldsymbol{\sigma}$) and compute the joint sampling distribution as:
$\;\;\;\;\;\;\;P(\boldsymbol{m}^{\rm obs}|\boldsymbol{m},\boldsymbol{\sigma},H) = \prod_k P(m^{\rm obs}_k|m_k,\sigma_k,H)$
The PDF for everything inside the PGM plate is the following product:
$\;\;\;\;\;\;\;P(\boldsymbol{m}^{\rm obs}|\boldsymbol{m},\boldsymbol{\sigma},H)\;P(\boldsymbol{m}|a,b,H) = \prod_k P(m^{\rm obs}_k|m_k,\sigma_k,H)\;\delta(m_k - a\log_{10}{P_k} - b)$
Marginalizing over the latent true magnitudes $m_k$ is straightforward, because the delta functions make the integrals trivial:
$\;\;\;\;\;\;\;P(\boldsymbol{m}^{\rm obs}|a,b,H) = \prod_k \int P(m^{\rm obs}_k|m_k,\sigma_k,H)\;\delta(m_k - a\log_{10}{P_k} - b)\,dm_k$
$\;\;\;\;\;\;\;\;\;\;\;\;\;\; = \prod_k P(m^{\rm obs}_k|(a\log_{10}{P_k} + b),\sigma_k,H)$
Taking logs, for numerical stability, the product in the joint likelihood becomes the following sum:
$\;\;\;\;\;\;\;\log P(\boldsymbol{m}^{\rm obs}|a,b,H) = \sum_k \log P(m^{\rm obs}_k|(a\log_{10}{P_k} + b),\sigma_k,H)$
which, substituting in our Gaussian form, gives us:
$\;\log P(\boldsymbol{m}^{\rm obs}|a,b,H) = -\frac{1}{2}\sum_k \log{(2\pi\sigma_k^2)} - \frac{1}{2} \sum_k \frac{(m^{\rm obs}_k - a\log_{10}{P_k} - b)^2}{\sigma_k^2}$
Note that the log likelihood $\log P(\boldsymbol{m}^{\rm obs}|a,b,H)$ is a function, $\log \mathcal{L}(a,b;\boldsymbol{m}^{\rm obs})$ that can be evaluated, as a function of $a$ and $b$, at constant $\boldsymbol{m}^{\rm obs}$
Astronomers often call the term in the log likelihood that depends on the parameters $\chi^2$ ("chi-squared"):
$\;\;\;\;\;\;\;\chi^2 = \sum_k \frac{(m^{\rm obs}_k - a\log_{10}{P_k} - b)^2}{\sigma_k^2}$
$\chi^2$ is a "misfit" statistic that quantifies the difference between the observed and predicted data. Under our assumptions, it is equal to $-2$ times the log likelihood (up to a constant). The "predicted data" are $m_k = a\log_{10}{P_k} + b$.
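To make the relationship between the Gaussian log likelihood and $\chi^2$ concrete, here is a minimal sketch (using made-up stand-in arrays rather than the R11 data) that evaluates both and checks that they differ only by a constant:

import numpy as np
from scipy.stats import norm

# Made-up stand-in arrays (not the R11 data), just to exercise the formulae:
logP  = np.array([1.11, 0.87, 1.45])      # x_k = log10(period/days)
mobs  = np.array([24.5, 25.3, 23.6])      # observed apparent magnitudes
sigma = np.array([0.31, 0.25, 0.28])      # reported magnitude uncertainties

def log_like(a, b):
    # Sum of independent Gaussian log PDFs, one per star:
    return np.sum(norm.logpdf(mobs, loc=a*logP + b, scale=sigma))

def chisq(a, b):
    # Misfit statistic: sum of squared, inverse-variance-weighted residuals:
    return np.sum((mobs - a*logP - b)**2 / sigma**2)

# -2 log L and chi^2 differ only by the constant sum over log(2 pi sigma_k^2):
a, b = -3.0, 26.0
print(-2*log_like(a, b) - chisq(a, b), np.sum(np.log(2*np.pi*sigma**2)))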
Let's assume the prior PDFs for $a$ and $b$ to be independent, such that $P(a,b|H) = P(a|H)P(b|H)$.
For now, let's assume uniform prior PDFs for both $a$ and $b$, supposing that we know roughly what size they are:
$\;\;\;\;\;\;\;P(a|H) = \frac{1}{a_{\rm max} - a_{\rm min}}$ with $(a_{\rm min}, a_{\rm max}) = (-10, 10)$
$\;\;\;\;\;\;\;P(b|H) = \frac{1}{b_{\rm max} - b_{\rm min}}$ with $(b_{\rm min}, b_{\rm max}) = (10, 30)$
The joint PDF is:
$\;\;\;\;\;\;\;P(\boldsymbol{m}^{\rm obs},a,b|H) = P(\boldsymbol{m}^{\rm obs}|a,b,H) P(a|H) P(b|H)$
Since we marginalized out the $m_k$ analytically, we could have drawn the PGM more simply, jumping directly to $P(\boldsymbol{m}^{\rm obs}|a,b,H)$. However, it's often helpful to show the latent variables (here, the true magnitudes $m_k$) explicitly, to distinguish them from the parameters we are inferring.
With the completed factorization of the joint PDF for all variables, we have the following product:
$\;\;P(a,b|\boldsymbol{m}^{\rm obs},H) \propto P(\boldsymbol{m}^{\rm obs}|a,b,H) P(a|H) P(b|H)$
We can evaluate the posterior PDF $P(a,b|\boldsymbol{m}^{\rm obs},H)$ for any choice of parameters $(a,b)$, up to a normalization constant.
Our 2-D posterior PDF can be visualized as a contour plot
We can choose contours that display the credible regions that enclose 68% and 95% of the posterior probability.
Given our assumption that the model is true, the probability that the true values of the model parameters lie within the 95% credible region given the data is 0.95.
Typically, we will want to (or will be expected to) report "answers" for our model parameters
This can be difficult: our result is the posterior PDF for the model parameters given the data!
A convenient, and in this case appropriate, choice is to report quantiles of the 1D marginalized PDFs
In general, the most important thing when summarizing inferences is to state clearly what you are doing, preferably with critical commentary
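As an illustration of the quantile-based summary mentioned above, here is a minimal sketch of what a helper like the compress_1D_pdf function used later in this notebook might do (the function below is hypothetical, for illustration only), given a parameter grid and a 1D marginalized PDF evaluated on it:

import numpy as np

def summarize_1d_pdf(grid, prob, ci=68):
    # Report the median and a central credible interval from a gridded 1D PDF
    prob = prob / np.sum(prob)                    # normalize on the grid
    cdf = np.cumsum(prob)
    lo, med, hi = np.interp([0.5 - ci/200.0, 0.5, 0.5 + ci/200.0], cdf, grid)
    return med, med - lo, hi - med                # median, minus error, plus error

# Toy example: a Gaussian-shaped PDF on a grid of slope values
grid = np.linspace(-3.4, -2.4, 100)
prob = np.exp(-0.5 * ((grid + 2.95) / 0.1)**2)
print("a = %.2f -%.2f / +%.2f" % summarize_1d_pdf(grid, prob))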
Question: What choice of (proper) prior would we have had to make in order for the posterior PDF to be exactly Gaussian?
The Bayesian solution is not a single set of "best-fit" parameters.
We can think of the posterior PDF as providing us with a continuous distribution of model fits that are plausible given the data and our assumptions.
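One way to visualize this is to draw parameter samples from the gridded posterior and over-plot the corresponding lines on the data. A minimal sketch, assuming a normalized 2D posterior array prob[j,i] evaluated on grids agrid (for $a$) and bgrid (for $b$) like the ones computed later in this notebook:

import numpy as np

def sample_from_grid(prob, agrid, bgrid, n=10, seed=42):
    # Draw plausible (a, b) pairs, with probability proportional to the posterior pixel values
    rng = np.random.default_rng(seed)
    flat = rng.choice(prob.size, size=n, p=prob.ravel())   # sample flattened pixel indices
    j, i = np.unravel_index(flat, prob.shape)              # back to (b, a) grid indices
    return agrid[i], bgrid[j]

# Each sampled (a, b) pair defines one plausible straight line, which could be
# over-plotted on the period-magnitude data to show the spread of allowed fits.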
There are other ways of defining the parameters that best fit the data: the primary one is "the method of Maximum Likelihood"
Instead of asking for the posterior probability for the parameters given the data, $P(a,b|\boldsymbol{m}^{\rm obs},H)$, we could find the parameters that maximize the probability of getting the data: $P(\boldsymbol{m}^{\rm obs}|a,b,H)$
In astronomy, "best fit" often (but not always) means "maximum likelihood"
Where does the emphasis on the likelihood, rather than the posterior come from?
In the frequentist school of statistics, parameters do not have probability distributions. Probability can only be used to describe frequencies, not degrees of belief (or odds).
In the frequentist view, it's only the data that can be modeled as having been drawn from a probability distribution, because we can imagine doing the experiment or observation multiple times, and building up a frequency distribution of results.
This PDF over datasets is the sampling distribution, e.g. $P(\boldsymbol{m}^{\rm obs}|a,b,H)$
Given an assumed model, the frequentist view is that there is only one set of parameters, the true ones, and our job is to estimate them.
Derivation of good estimators is a major activity in frequentist statistics, and has led to some powerful mathematical results and fast computational shortcuts - some of which are useful in Bayesian inference too
This was evident in our Bayesian treatment, PGMs etc too: Frequentists and Bayesians are in full agreement about the importance of the likelihood function!
MLEs can be "biased" but this bias goes to zero as $N_{\rm data} \rightarrow \infty$
Efficiency: asymptotically, MLEs attain the minimum possible variance among unbiased estimators (the Cramér-Rao bound) when sampled over datasets
Asymptotic Normality: as the dataset size increases, the distribution of MLEs over datasets tends to a Gaussian centred at the true parameter value.
The covariance of this ultimate Gaussian distribution is the inverse of the "Fisher information matrix"
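For reference, the Fisher information matrix is the expected curvature of the log likelihood with respect to the parameters, and the asymptotic covariance of the MLEs is its inverse:
$\;\;\;\;\;\;\;F_{ij} = -\,{\rm E}\left[\frac{\partial^2 \log L(\boldsymbol{\theta})}{\partial \theta_i\,\partial \theta_j}\right], \qquad {\rm Cov}(\hat{\boldsymbol{\theta}}) \rightarrow F^{-1} \;\;{\rm as}\;\; N_{\rm data} \rightarrow \infty$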
$\;\;\;\;\;\;\;\log L(a,b) = \log P(\boldsymbol{m}^{\rm obs}|a,b,H) = -\frac{1}{2}\sum_k \log{(2\pi\sigma_k^2)} - \frac{1}{2} \sum_k \frac{(m^{\rm obs}_k - a\log_{10}{P_k} - b)^2}{\sigma_k^2}$
$\;\;\;\;\;\;\; -2\,\nabla \log L(a,b) = \nabla\,\chi^2 = 0$
NB: Maximizing a Gaussian likelihood is equivalent to minimizing $\chi^2$ - and gives a "weighted least squares" fit
The result in this case is a pair of equations that we can solve for the best-fit parameters $(\hat{a}, \hat{b})$, that give the smallest misfit between observed and model-predicted data
Writing $x = \log_{10}{P}$ and $y = m^{\rm obs}$, we have
$\frac{\partial \log L}{\partial a}\Bigr|_{\hat{a},\hat{b}} = \sum_k \frac{x_k(y_k - \hat{a}x_k - \hat{b})}{\sigma_k^2} = 0 \longrightarrow \hat{a} \sum_k \frac{x_k^2}{\sigma_k^2} + \hat{b} \sum_k \frac{x_k}{\sigma_k^2} = \sum_k \frac{x_k y_k}{\sigma_k^2}$
$\frac{\partial \log L}{\partial b}\Bigr|_{\hat{a},\hat{b}} = \sum_k \frac{(y_k - \hat{a}x_k - \hat{b})}{\sigma_k^2} = 0 \longrightarrow \hat{a} \sum_k \frac{x_k}{\sigma_k^2} + \hat{b} \sum_k \frac{1}{\sigma_k^2} = \sum_k \frac{y_k}{\sigma_k^2}$
This set of linear equations can be solved straightforwardly to find the estimators $\hat{a}$ and $\hat{b}$:
$\begin{pmatrix} S_{xx} & S_{x} \\ S_x & S_0 \end{pmatrix} \begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} = \begin{pmatrix} S_{xy} \\ S_{y} \end{pmatrix}$
where $S_0 = \sum_k \frac{1}{\sigma_k^2}$, $S_x = \sum_k \frac{x_k}{\sigma_k^2}$, $S_y = \sum_k \frac{y_k}{\sigma_k^2}$, $S_{xx} = \sum_k \frac{x_k^2}{\sigma_k^2}$ and $S_{xy} = \sum_k \frac{x_k y_k}{\sigma_k^2}$
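Here is a minimal sketch of this weighted least squares solution, assuming plain numpy arrays x (= $\log_{10}{P}$), y (= $m^{\rm obs}$) and sigma; the data.sufficient_statistics() method used later in this notebook presumably assembles the same quantities:

import numpy as np

def weighted_least_squares(x, y, sigma):
    # Solve the 2x2 normal equations for the straight-line MLEs (a_hat, b_hat)
    w = 1.0 / sigma**2                           # inverse-variance weights
    S0, Sx, Sy = np.sum(w), np.sum(w*x), np.sum(w*y)
    Sxx, Sxy = np.sum(w*x**2), np.sum(w*x*y)
    M = np.array([[Sxx, Sx],
                  [Sx,  S0]])
    v = np.array([Sxy, Sy])
    return np.linalg.solve(M, v)                 # (a_hat, b_hat)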
All the information in the data that is needed to find the best-fit parameters $\boldsymbol{\hat{\theta}}$ is contained in a set of so-called sufficient statistics $S_{xx}$, $S_y$ etc. This is a common feature of maximum likelihood estimators
Computing and combining the sufficient statistics, the maximum likelihood estimators are as follows:
$ \hat{a} = -2.95 $
$ \hat{b} = 26.26 $
Questions: Why do you think the MLE $\hat{b}$ is not exactly the same as the posterior median value for $b$?
Under what circumstances might $\hat{b}$ or $\hat{a}$ be very different from the posterior medians?
In frequentism, we think of the estimators having distributions, since each dataset that we imagine being drawn from the sampling distribution will produce one estimator. An ensemble of (hypothetical) datasets leads to a (hypothetical) distribution of estimators
One straightforward approximate way to estimate these distributions is to use the asymptotic normality property of MLEs, and associate a Gaussian approximation to the likelihood with the Gaussian distribution for the MLEs we expect to see when averaging over datasets
An estimator can be thought of as a summary of the data, and so can the value of the log likelihood evaluated at the estimators.
The distribution of the log likelihood itself over the hypothetical ensemble of datasets provides a route to a confidence interval.
In our simple Gaussian likelihood example, and more generally in the large dataset limit, the increase in $\chi^2$ above its minimum, $\Delta\chi^2 = \chi^2 - \chi^2_{\rm min}$ (i.e. $-2$ times the log likelihood ratio), follows a $\chi^2$ distribution with a number of degrees of freedom $D$ equal to the dimensionality of the parameter space. Integrating this distribution from 0 up to some boundary value $\Delta \chi^2_{D}$ defines a confidence region in parameter space.
For example, the 68.3% confidence region is bounded by the contour at $\chi^2_{\rm min} + \Delta \chi^2_{D}$, where $\Delta \chi^2_{D} = 1.0$ for $D = 1$ and $\Delta \chi^2_{D} = 2.30$ for $D = 2$.
In the 1D case, the boundary of the 68.3% confidence interval lies 1 standard deviation (or "1-sigma") from the mean, while the 95.4% CI lies 2-sigma from the mean.
In general, $\Delta \chi^2_{D}$ can be computed from the $\chi^2$ distribution quantile (or "percentage point") function, e.g.
scipy.stats.chi2.ppf(0.683, D)
In [ ]:
import scipy.stats, numpy as np
# Delta chi^2 threshold that encloses 68.3% of the probability in a 2D parameter space:
dimensions = 2
level = 0.683
dchisq = scipy.stats.chi2.ppf(level, dimensions)
np.round(dchisq, 2)
For our linear model, the likelihood is exactly Gaussian in the parameters, with covariance matrix
$\;\;\;\;\;\;\;V = \left(\mathcal{M}^{\rm T} C^{-1} \mathcal{M}\right)^{-1}$
$\;\;\;\;\;$where $C$ is the covariance matrix of the data (i.e. $C$ is diagonal, with elements equal to the squared uncertainties on each datapoint) and $\mathcal{M}$ is the $N \times 2$ design matrix that predicts data given parameters via $\mathcal{M}\boldsymbol{\theta} = \boldsymbol{m}$.
In general, the covariance matrix of a Gaussian approximation to the likelihood can be calculated by taking second derivatives of the log likelihood at its peak, and inverting the resulting Hessian matrix.
This (inverse Fisher information) matrix provides a lower limit, the Cramér-Rao bound, on the covariance of a set of estimators:
$\;\;\;\;\;\;\;V \approx \left[\,-\nabla\nabla \log L\,\right]^{-1}\Big|_{\hat{a},\hat{b}}$
$V$ is what scipy.optimize.minimize returns in its hess_inv field if you pass it the negative log likelihood.
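As a sketch (assuming arrays logP and sigma like the data.logP and data.sigma attributes used elsewhere in this notebook), the design-matrix version of this covariance for the straight-line model could be computed as follows; the square roots of its diagonal should roughly match the hess_inv output at the end of the notebook:

import numpy as np

def parameter_covariance(logP, sigma):
    # V = (M^T C^-1 M)^-1 for the linear model theta = [a, b]
    M = np.column_stack([logP, np.ones_like(logP)])   # N x 2 design matrix
    Cinv = np.diag(1.0 / sigma**2)                    # inverse data covariance
    F = M.T @ Cinv @ M                                # Fisher matrix (Gaussian errors)
    return np.linalg.inv(F)

# np.sqrt(np.diag(parameter_covariance(logP, sigma))) gives the 1-sigma
# uncertainties on a_hat and b_hat.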
In Frequentism there is no concept of marginalization: parameters are assumed to be single-valued and fixed, and the only probability distribution considered is P(data|pars), not P(pars|data).
Instead, 1D confidence intervals are usually derived by maximizing the likelihood over the other parameters, in a profile likelihood analysis. The repeated optimizations involved can be computationally expensive.
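A minimal sketch of a profile likelihood for the slope in this problem (assuming arrays x = $\log_{10}{P}$, y = $m^{\rm obs}$ and sigma): for each trial value of $a$, the likelihood can be maximized over $b$ analytically, since the optimal intercept is just the inverse-variance-weighted mean of $y - a x$:

import numpy as np

def profile_chisq_for_a(a_grid, x, y, sigma):
    # Profile out the intercept: minimize chi^2 over b at each trial slope a
    w = 1.0 / sigma**2
    chisq = np.empty_like(a_grid)
    for i, a in enumerate(a_grid):
        b_hat = np.sum(w * (y - a*x)) / np.sum(w)     # best-fit b at this a
        chisq[i] = np.sum(w * (y - a*x - b_hat)**2)
    return chisq

# A 68.3% confidence interval on a is then the range where
# chisq - chisq.min() < 1.0 (Delta chi^2 < 1 for one parameter).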
"68% of datasets would give a 68% frequentist confidence interval that contains the true parameter value"
"The probability of the true parameter value lying within the 68% Bayesian credible region is 68%"
The covariance matrix of a Gaussian approximation to the likelihood defines a 1-sigma, 2D, elliptical, frequentist confidence region
Since this came from transforming the sampling distribution, which is a PDF over datasets, the confidence interval enables conclusions in terms of fractions of an ensemble of (hypothetical) datasets
A 68% confidence interval is constructed so that, over the ensemble of (hypothetical) repeated datasets, 68% of the intervals obtained this way would contain the true parameter value
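The coverage statement above can be checked directly by simulation. Here is a toy sketch (with made-up true parameters and data design, not the R11 data) that repeatedly generates mock datasets, computes the weighted least squares MLE and its 1-sigma error for the slope, and counts how often the interval contains the truth:

import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = -3.0, 26.0                    # made-up "true" parameters
x = np.linspace(0.5, 2.0, 25)                  # fixed log10(P) sampling
sigma = 0.3 * np.ones_like(x)                  # homogeneous magnitude errors
w = 1.0 / sigma**2

# Normal-equations matrix and the resulting 1-sigma error on the slope MLE:
M = np.array([[np.sum(w*x**2), np.sum(w*x)],
              [np.sum(w*x),    np.sum(w)  ]])
sigma_a = np.sqrt(np.linalg.inv(M)[0, 0])

covered, ntrials = 0, 2000
for _ in range(ntrials):
    y = a_true*x + b_true + rng.normal(0.0, sigma)               # mock dataset
    a_hat, b_hat = np.linalg.solve(M, [np.sum(w*x*y), np.sum(w*y)])
    covered += (abs(a_hat - a_true) < sigma_a)                   # 1-sigma interval covers truth?
print(covered / ntrials)                                         # should be close to 0.683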
In Frequentism, the data are considered to be random variables (in large sets of hypothetical trials described with probability distributions) while parameters are considered fixed (and to be estimated)
In Bayesianism, the data are considered to be fixed (as constants, in datafiles) while parameters are considered random variables (to be inferred, with uncertainty described by probability distributions)
Given an assumed model:
Frequentists seek to transform the frequency distribution of the data into a frequency distribution of their estimators, and hence quantify their uncertainty in terms of what they expect would happen if the observation were to be repeated
Bayesians seek to update their knowledge of their model parameters, and hence quantify their uncertainty in terms of what might have been had the observation been different, and what they knew before the data were taken
The most important thing is to know what you are doing, and to communicate that clearly to others: both approaches involve assumptions which must be recorded and tested
The Bayesian approach provides a logical framework for combining datasets and additional information, and provides answers in terms of the probability distribution for the model parameters
The Frequentist approach provides a way of studying the model independent of additional information beyond the dataset in hand, and provides answers in terms of the probability of getting the data
The astronomy literature contains a mixture of frequentist and Bayesian analysis, sometimes within the same paper
Frequentist estimators often make good summary statistics with well understood sampling distributions: astronomical catalogs are full of them
In most of this course we follow the Bayesian approach: Bayes' Theorem gives you a framework for deriving the solution to any inference problem you encounter. Having said that, we'll keep our eyes open.
In [ ]:
exec(open('graphics/cepheids.py').read())
%matplotlib inline
plt.rcParams['figure.figsize'] = (15.0, 8.0)
data = Cepheids('../data/R11ceph.dat')
def log_likelihood(logP, mobs, sigma, a, b):
    # Gaussian log likelihood: normalization term plus the -chi^2/2 misfit term
    return -0.5*np.sum(np.log(2*np.pi*sigma**2)) - \
           0.5*np.sum((mobs - a*logP - b)**2/(sigma**2))
def log_prior(a, b):
amin,amax = -10.0,10.0
bmin,bmax = 10.0,30.0
if (b > bmin)*(b < bmax):
value = 0.0
        # Interesting alternative: a Cauchy prior on the slope a, equivalent to a
        # uniform prior on the angle of orientation of the line:
        # value = np.log(1.0/(bmax-bmin)) - \
        #         np.log(np.pi) - np.log(1 + a**2)
else:
value = -np.inf
if (a > amin)*(a < amax):
value += 0.0
else:
value += -np.inf
return value
def log_posterior(logP, mobs, sigma, a, b):
return log_likelihood(logP,mobs,sigma,a,b) + log_prior(a,b)
In [ ]:
# Limits of parameter grids, focused on the high likelihood region:
amin, amax = -3.4, -2.4
bmin, bmax = 25.7, 26.8
limits = (amin, amax, bmin, bmax)
def evaluate_posterior_on_a_grid(limits, NGC_ID=4258, npix=100):
# Make grids:
amin, amax, bmin, bmax = limits
agrid, bgrid, logprob = np.linspace(amin,amax,npix), np.linspace(bmin,bmax,npix), np.zeros([npix,npix])
# Select a Cepheid dataset:
data.select(NGC_ID)
    # Loop over parameters, computing the unnormalized log posterior PDF:
for i,a in enumerate(agrid):
for j,b in enumerate(bgrid):
logprob[j,i] = log_posterior(data.logP, data.mobs, data.sigma, a, b)
# Exponentiate and normalize to get posterior density:
prob = np.exp(logprob - np.max(logprob))
prob /= np.sum(prob)
return prob, agrid, bgrid
In [ ]:
%%time
prob, a, b = evaluate_posterior_on_a_grid(limits, NGC_ID=4258, npix=100)
Typically we want to be able to see the centroid, size and shape of the posterior PDF
In particular we want to see the credible regions that enclose 68% and 95% of the posterior probability. These are best plotted as contours
In [ ]:
sorted = np.sort(prob.flatten())
C = sorted.cumsum()
# Find the pixel values that lie at the levels that contain 68% and 95% of the probability:
lvl68 = np.min(sorted[C > (1.0 - 0.68)])
lvl95 = np.min(sorted[C > (1.0 - 0.95)])
plt.figure(figsize=(10,10))
plt.imshow(prob, origin='lower', cmap='Blues', interpolation='none', extent=limits)
plt.contour(prob,[lvl95,lvl68],colors='black',extent=limits)
plt.grid()
plt.xlabel('slope a', fontsize=20)
plt.ylabel('intercept b / AB magnitudes', fontsize=20);
# plt.savefig("cepheids_2d-posterior.png");
In [ ]:
prob_a_given_data = np.sum(prob, axis=0) # Approximate the integral as a sum
prob_b_given_data = np.sum(prob, axis=1) # Approximate the integral as a sum
# Check that we do have a 1D PDF:
# print(prob_a_given_data.shape, np.sum(prob_a_given_data))
plot_1d_marginalized_pdfs(a, b, prob_a_given_data, prob_b_given_data)
# plt.savefig("cepheids_1d-posteriors.png")
print("a = ",compress_1D_pdf(a, prob_a_given_data, ci=68, dp=2))
print("b = ",compress_1D_pdf(b, prob_b_given_data, ci=68, dp=2))
In [ ]:
data.plot(4258)
data.overlay_straight_line_with(a=-2.95, b=26.25, label='Model')
data.add_legend()
# plt.savefig("cepheids_posterior-median-check.png")
In [ ]:
exec(open('graphics/cepheids.py').read())
%matplotlib inline
plt.rcParams['figure.figsize'] = (15.0, 8.0)
data = Cepheids('../data/R11ceph.dat')
In [ ]:
data.select(4258)
M, v = data.sufficient_statistics()
a, b = np.linalg.solve(M, v)
print(r'$ \hat{a} = %.2f $' % a)
print(r'$ \hat{b} = %.2f $' % b)
In [ ]:
data.plot(4258)
data.overlay_straight_line_with(a=a, b=b, label='Maximum Likelihood fit')
data.add_legend()
In [ ]:
# Generalized maximum likelihood approach:
import scipy.optimize
pars = np.array([0.0, 20])
result = scipy.optimize.minimize(data.negative_log_likelihood, pars, method='BFGS', tol=0.001)
print(result)
In [ ]:
# Approximate 1-sigma uncertainties on a and b, from the inverse Hessian:
C = result.hess_inv
np.sqrt(C[0,0]), np.sqrt(C[1,1])