Goals:
Explore the relationship between "characterizing the posterior PDF" and "fitting a model to data."
Understand how to derive maximum likelihood estimators and their confidence intervals
Be able to compare, contrast and appreciate the Bayesian and Frequentist approaches to statistics
We'll work through a simple Bayesian inference (fitting a straight line and reporting summaries of marginalized posterior PDFs), and then compare with the frequentist procedure of estimating the gradient and intercept and reporting confidence intervals.
A lot of monitoring data - repeated imaging and subsequent "photometry" of the star - can provide a measurement of the absolute magnitude (if we know the distance to its host galaxy) and the period of the oscillation.
Let's look at some Cepheid measurements reported by Riess et al (2011, R11). The data are in the form of datapoints, one for each star.
The periods are well measured, while the magnitudes come with reported error bars.
$\;\;\;\;\;\;\;m = a\;\log_{10} P + b$
$\;\;\;\;\;\;\;m^{\rm obs} = 24.51 \pm 0.31$ at $\log_{10} P = \log_{10} (13.0/{\rm days})$
$\;\;\;\;\;P(a,b|\boldsymbol{m}^{\rm obs},H) \propto P(\boldsymbol{m}^{\rm obs}|a,b,H)\;P(a,b|H)$
Let's draw a PGM for this inverse problem, imagining our way through what we would do to generate a mock dataset like the one we have from R11.
If we were generating mock data, then for any plausible choice of parameters $a$ and $b$ we could predict the true magnitude $m_k$ of each star given its period $P_k$, and then add noise to simulate each observed magnitude $m^{\rm obs}_k$.
Now let's assign PDFs for each node in the PGM, and derive the unnormalized posterior PDF for $a$ and $b$.
We'll need:
The sampling distribution: $P(\boldsymbol{m}^{\rm obs}|\boldsymbol{m},H)$
The conditional PDF for the latent variables $m_k$, $P(m_k|a,b,\log_{10}{P_k},H)$
A prior PDF for our parameters: $P(a,b|H)$
We were given points ($m^{\rm obs}_k$) with error bars ($\sigma_k$), which suggests a Gaussian sampling distribution for each one:
$\;\;\;\;\;\;\;P(m^{\rm obs}_k|m_k,\sigma_k,H) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left[-\frac{(m^{\rm obs}_k - m_k)^2}{2\sigma_k^2}\right]$
Note that we are never given the form of the sampling distribution: it always has to be assumed. It turns out that the Gaussian distribution is the "Maximum Entropy" (minimally informative) choice given the information provided.
If we assume that the measurements of each Cepheid star are independent of each other, then we can define predicted and observed data "vectors" $\boldsymbol{m}$ and $\boldsymbol{m}^{\rm obs}$ (plus a corresponding observational uncertainty "vector" $\boldsymbol{\sigma}$) and compute the joint sampling distribution as:
$\;\;\;\;\;\;\;P(\boldsymbol{m}^{\rm obs}|\boldsymbol{m},\boldsymbol{\sigma},H) = \prod_k P(m^{\rm obs}_k|m_k,\sigma_k,H)$
The PDF for everything inside the PGM plate is the following product:
$\;\;\;\;\;\;\;P(\boldsymbol{m}^{\rm obs}|\boldsymbol{m},\boldsymbol{\sigma},H)\;P(\boldsymbol{m}|a,b,H) = \prod_k P(m^{\rm obs}_k|m_k,\sigma_k,H)\;\delta(m_k - a\log_{10}{P_k} - b)$
Marginalizing over the latent true magnitudes $m_k$ is straightforward, because the delta functions make the integrals trivial:
$\;\;\;\;\;\;\;P(\boldsymbol{m}^{\rm obs}|a,b,H) = \prod_k \int P(m^{\rm obs}_k|m_k,\sigma_k,H)\;\delta(m_k - a\log_{10}{P_k} - b)\,dm_k$
$\;\;\;\;\;\;\;\;\;\;\;\;\;\; = \prod_k P(m^{\rm obs}_k|(a\log_{10}{P_k} + b),\sigma_k,H)$
Taking logs, for numerical stability, the product in the joint likelihood becomes the following sum:
$\;\;\;\;\;\;\;\log P(\boldsymbol{m}^{\rm obs}|a,b,H) = \sum_k \log P(m^{\rm obs}_k|(a\log_{10}{P_k} + b),\sigma_k,H)$
which, substituting in our Gaussian form, gives us:
$\;\log P(\boldsymbol{m}^{\rm obs}|a,b,H) = -\frac{1}{2}\sum_k \log{(2\pi\sigma_k^2)} - \frac{1}{2} \sum_k \frac{(m^{\rm obs}_k - a\log_{10}{P_k} - b)^2}{\sigma_k^2}$
Note that the log likelihood $\log P(\boldsymbol{m}^{\rm obs}|a,b,H)$ is a function, $\log \mathcal{L}(a,b;\boldsymbol{m}^{\rm obs})$ that can be evaluated, as a function of $a$ and $b$, at constant $\boldsymbol{m}^{\rm obs}$
Astronomers often call the term in the log likelihood that depends on the parameters $\chi^2$ ("chi-squared"):
$\;\;\;\;\;\;\;\chi^2 = \sum_k \frac{(m^{\rm obs}_k - a\log_{10}{P_k} - b)^2}{\sigma_k^2}$
$\chi^2$ is a "misfit" statistic that quantifies the difference between the observed and predicted data. Under our assumptions, it is equal to $-2$ times the log likelihood (up to a constant). The "predicted data" are $m_k = a\log_{10}{P_k} + b$.
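To make the relationship between the Gaussian log likelihood and $\chi^2$ concrete, here is a minimal sketch (using made-up stand-in arrays rather than the R11 data) that evaluates both and checks that they differ only by a constant:

import numpy as np
from scipy.stats import norm

# Made-up stand-in arrays (not the R11 data), just to exercise the formulae:
logP  = np.array([1.11, 0.87, 1.45])      # x_k = log10(period/days)
mobs  = np.array([24.5, 25.3, 23.6])      # observed apparent magnitudes
sigma = np.array([0.31, 0.25, 0.28])      # reported magnitude uncertainties

def log_like(a, b):
    # Sum of independent Gaussian log PDFs, one per star:
    return np.sum(norm.logpdf(mobs, loc=a*logP + b, scale=sigma))

def chisq(a, b):
    # Misfit statistic: sum of squared, inverse-variance-weighted residuals:
    return np.sum((mobs - a*logP - b)**2 / sigma**2)

# -2 log L and chi^2 differ only by the constant sum over log(2 pi sigma_k^2):
a, b = -3.0, 26.0
print(-2*log_like(a, b) - chisq(a, b), np.sum(np.log(2*np.pi*sigma**2)))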
Let's assume the prior PDFs for $a$ and $b$ to be independent, such that $P(a,b|H) = P(a|H)P(b|H)$.
For now, let's assume uniform prior PDFs for both $a$ and $b$, supposing that we know roughly what size they are:
$\;\;\;\;\;\;\;P(a|H) = \frac{1}{a_{\rm max} - a_{\rm min}}$ with $(a_{\rm min}, a_{\rm max}) = (-10, 10)$
$\;\;\;\;\;\;\;P(b|H) = \frac{1}{b_{\rm max} - b_{\rm min}}$ with $(b_{\rm min}, b_{\rm max}) = (10, 30)$
The joint PDF is:
$\;\;\;\;\;\;\;P(\boldsymbol{m}^{\rm obs},a,b|H) = P(\boldsymbol{m}^{\rm obs}|a,b,H) P(a|H) P(b|H)$
Since we marginalized out the $m_k$ analytically, we could have drawn the PGM more simply, jumping directly to $P(\boldsymbol{m}^{\rm obs}|a,b,H)$. However, it's often helpful to show the latent variables (here, the true magnitudes $m_k$) explicitly, to distinguish them from the parameters we are inferring.
With the completed factorization of the joint PDF for all variables, we have the following product:
$\;\;P(a,b|\boldsymbol{m}^{\rm obs},H) \propto P(\boldsymbol{m}^{\rm obs}|a,b,H) P(a|H) P(b|H)$
We can evaluate the posterior PDF $P(a,b|\boldsymbol{m}^{\rm obs},H)$ for any choice of parameters $(a,b)$, up to a normalization constant.
Our 2-D posterior PDF can be visualized as a contour plot
We can choose contours that display the credible regions that enclose 68% and 95% of the posterior probability.
Given our assumption that the model is true, the probability that the true values of the model parameters lie within the 95% credible region given the data is 0.95.
Typically, we will want to (or will be expected to) report "answers" for our model parameters
This can be difficult: our result is the posterior PDF for the model parameters given the data!
A convenient, and in this case appropriate, choice is to report quantiles of the 1D marginalized PDFs
In general, the most important thing when summarizing inferences is to state clearly what you are doing, preferably with critical commentary
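As an illustration of the quantile-based summary mentioned above, here is a minimal sketch of what a helper like the compress_1D_pdf function used later in this notebook might do (the function below is hypothetical, for illustration only), given a parameter grid and a 1D marginalized PDF evaluated on it:

import numpy as np

def summarize_1d_pdf(grid, prob, ci=68):
    # Report the median and a central credible interval from a gridded 1D PDF
    prob = prob / np.sum(prob)                    # normalize on the grid
    cdf = np.cumsum(prob)
    lo, med, hi = np.interp([0.5 - ci/200.0, 0.5, 0.5 + ci/200.0], cdf, grid)
    return med, med - lo, hi - med                # median, minus error, plus error

# Toy example: a Gaussian-shaped PDF on a grid of slope values
grid = np.linspace(-3.4, -2.4, 100)
prob = np.exp(-0.5 * ((grid + 2.95) / 0.1)**2)
print("a = %.2f -%.2f / +%.2f" % summarize_1d_pdf(grid, prob))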
Question: What choice of (proper) prior would we have had to make in order for the posterior PDF to be exactly Gaussian?
The Bayesian solution is not a single set of "best-fit" parameters.
We can think of the posterior PDF as providing us with a continuous distribution of model fits that are plausible given the data and our assumptions.
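One way to visualize this is to draw parameter samples from the gridded posterior and over-plot the corresponding lines on the data. A minimal sketch, assuming a normalized 2D posterior array prob[j,i] evaluated on grids agrid (for $a$) and bgrid (for $b$) like the ones computed later in this notebook:

import numpy as np

def sample_from_grid(prob, agrid, bgrid, n=10, seed=42):
    # Draw plausible (a, b) pairs, with probability proportional to the posterior pixel values
    rng = np.random.default_rng(seed)
    flat = rng.choice(prob.size, size=n, p=prob.ravel())   # sample flattened pixel indices
    j, i = np.unravel_index(flat, prob.shape)              # back to (b, a) grid indices
    return agrid[i], bgrid[j]

# Each sampled (a, b) pair defines one plausible straight line, which could be
# over-plotted on the period-magnitude data to show the spread of allowed fits.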
There are other ways of defining the parameters that best fit the data: the primary one is "the method of Maximum Likelihood"
Instead of asking for the posterior probability for the parameters given the data, $P(a,b|\boldsymbol{m}^{\rm obs},H)$, we could find the parameters that maximize the probability of getting the data: $P(\boldsymbol{m}^{\rm obs}|a,b,H)$
In astronomy, "best fit" often (but not always) means "maximum likelihood"
Where does the emphasis on the likelihood, rather than the posterior come from?
In the frequentist school of statistics, parameters do not have probability distributions. Probability can only be used to describe frequencies, not degrees of belief (or odds).
In the frequentist view, it's only the data that can be modeled as having been drawn from a probability distribution, because we can imagine doing the experiment or observation multiple times, and building up a frequency distribution of results.
This PDF over datasets is the sampling distribution, e.g. $P(\boldsymbol{m}^{\rm obs}|a,b,H)$
Given an assumed model, the frequentist view is that there is only one set of parameters, the true ones, and our job is to estimate them.
Derivation of good estimators is a major activity in frequentist statistics, and has led to some powerful mathematical results and fast computational shortcuts - some of which are useful in Bayesian inference too
This was evident in our Bayesian treatment, PGMs etc too: Frequentists and Bayesians are in full agreement about the importance of the likelihood function!
MLEs can be "biased" but this bias goes to zero as $N_{\rm data} \rightarrow \infty$
Efficiency: asymptotically, MLEs attain the minimum possible variance among unbiased estimators (the Cramér-Rao bound) when sampled over datasets
Asymptotic Normality: as the dataset size increases, the distribution of MLEs over datasets tends to a Gaussian centred at the true parameter value.
The covariance of this ultimate Gaussian distribution is the inverse of the "Fisher information matrix"
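For reference, the Fisher information matrix is the expected curvature of the log likelihood with respect to the parameters, and the asymptotic covariance of the MLEs is its inverse:
$\;\;\;\;\;\;\;F_{ij} = -\,{\rm E}\left[\frac{\partial^2 \log L(\boldsymbol{\theta})}{\partial \theta_i\,\partial \theta_j}\right], \qquad {\rm Cov}(\hat{\boldsymbol{\theta}}) \rightarrow F^{-1} \;\;{\rm as}\;\; N_{\rm data} \rightarrow \infty$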
$\;\;\;\;\;\;\;\log L(a,b) = \log P(\boldsymbol{m}^{\rm obs}|a,b,H) = -\frac{1}{2}\sum_k \log{(2\pi\sigma_k^2)} - \frac{1}{2} \sum_k \frac{(m^{\rm obs}_k - a\log_{10}{P_k} - b)^2}{\sigma_k^2}$
$\;\;\;\;\;\;\; -2\,\nabla \log L(a,b) = \nabla\,\chi^2 = 0$
NB: Maximizing a Gaussian likelihood is equivalent to minimizing $\chi^2$ - and gives a "weighted least squares" fit
The result in this case is a pair of equations that we can solve for the best-fit parameters $(\hat{a}, \hat{b})$, that give the smallest misfit between observed and model-predicted data
Writing $x = \log_{10}{P}$ and $y = m^{\rm obs}$, we have
$\frac{\partial \log L}{\partial a}\Bigr|_{\hat{a},\hat{b}} = \sum_k \frac{x_k(y_k - \hat{a}x_k - \hat{b})}{\sigma_k^2} = 0 \longrightarrow \hat{a} \sum_k \frac{x_k^2}{\sigma_k^2} + \hat{b} \sum_k \frac{x_k}{\sigma_k^2} = \sum_k \frac{x_k y_k}{\sigma_k^2}$
$\frac{\partial \log L}{\partial b}\Bigr|_{\hat{a},\hat{b}} = \sum_k \frac{(y_k - \hat{a}x_k - \hat{b})}{\sigma_k^2} = 0 \longrightarrow \hat{a} \sum_k \frac{x_k}{\sigma_k^2} + \hat{b} \sum_k \frac{1}{\sigma_k^2} = \sum_k \frac{y_k}{\sigma_k^2}$
This set of linear equations can be solved straightforwardly to find the estimators $\hat{a}$ and $\hat{b}$:
$\begin{pmatrix} S_{xx} & S_{x} \\ S_x & S_0 \end{pmatrix} \begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} = \begin{pmatrix} S_{xy} \\ S_{y} \end{pmatrix}$
where $S_0 = \sum_k \frac{1}{\sigma_k^2}$, $S_x = \sum_k \frac{x_k}{\sigma_k^2}$, $S_y = \sum_k \frac{y_k}{\sigma_k^2}$, $S_{xx} = \sum_k \frac{x_k^2}{\sigma_k^2}$ and $S_{xy} = \sum_k \frac{x_k y_k}{\sigma_k^2}$
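Here is a minimal sketch of this weighted least squares solution, assuming plain numpy arrays x (= $\log_{10}{P}$), y (= $m^{\rm obs}$) and sigma; the data.sufficient_statistics() method used later in this notebook presumably assembles the same quantities:

import numpy as np

def weighted_least_squares(x, y, sigma):
    # Solve the 2x2 normal equations for the straight-line MLEs (a_hat, b_hat)
    w = 1.0 / sigma**2                           # inverse-variance weights
    S0, Sx, Sy = np.sum(w), np.sum(w*x), np.sum(w*y)
    Sxx, Sxy = np.sum(w*x**2), np.sum(w*x*y)
    M = np.array([[Sxx, Sx],
                  [Sx,  S0]])
    v = np.array([Sxy, Sy])
    return np.linalg.solve(M, v)                 # (a_hat, b_hat)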
All the information in the data that is needed to find the best-fit parameters $\boldsymbol{\hat{\theta}}$ is contained in a set of so-called sufficient statistics $S_{xx}$, $S_y$ etc. This is a common feature of maximum likelihood estimators
Computing and combining the sufficient statistics, the maximum likelihood estimators are as follows:
$ \hat{a} = -2.95 $
$ \hat{b} = 26.26 $
Questions: Why do you think the MLE $\hat{b}$ is not exactly the same as the posterior median value for $b$?
Under what circumstances might $\hat{b}$ or $\hat{a}$ be very different from the posterior medians?
In frequentism, we think of the estimators having distributions, since each dataset that we imagine being drawn from the sampling distribution will produce one estimator. An ensemble of (hypothetical) datasets leads to a (hypothetical) distribution of estimators
One straightforward approximate way to estimate these distributions is to use the asymptotic normality property of MLEs, and associate a Gaussian approximation to the likelihood with the Gaussian distribution for the MLEs we expect to see when averaging over datasets
An estimator can be thought of as a summary of the data, and so can the value of the log likelihood evaluated at the estimators.
The distribution of the log likelihood itself over the hypothetical ensemble of datasets provides a route to a confidence interval.
In our simple Gaussian likelihood example, and more generally in the large dataset limit, the increase in $\chi^2$ above its minimum, $\Delta\chi^2 = \chi^2 - \chi^2_{\rm min}$ (i.e. $-2$ times the log likelihood ratio), follows a $\chi^2$ distribution with a number of degrees of freedom $D$ equal to the dimensionality of the parameter space. Integrating this distribution from 0 up to some boundary value $\Delta \chi^2_{D}$ defines a confidence region in parameter space.
For example, the 68.3% confidence region is bounded by the contour at $\chi^2_{\rm min} + \Delta \chi^2_{D}$, where $\Delta \chi^2_{D} = 1.0$ for $D = 1$ and $\Delta \chi^2_{D} = 2.30$ for $D = 2$.
In the 1D case, the boundary of the 68.3% confidence interval lies 1 standard deviation (or "1-sigma") from the mean, while the 95.4% CI lies 2-sigma from the mean.
In general, $\Delta \chi^2_{D}$ can be computed from the $\chi^2$ distribution quantile (or "percentage point") function, e.g.
scipy.stats.chi2.ppf(0.683, D)
In [ ]:
import scipy.stats, numpy as np
# Delta chi^2 threshold that encloses 68.3% of the probability in a 2D parameter space:
dimensions = 2
level = 0.683
dchisq = scipy.stats.chi2.ppf(level, dimensions)
np.round(dchisq, 2)
For our linear model, the likelihood is exactly Gaussian in the parameters, with covariance matrix
$\;\;\;\;\;\;\;V = \left(\mathcal{M}^{\rm T} C^{-1} \mathcal{M}\right)^{-1}$
$\;\;\;\;\;$where $C$ is the covariance matrix of the data (i.e. $C$ is diagonal, with elements equal to the squared uncertainties on each datapoint) and $\mathcal{M}$ is the $N \times 2$ design matrix that predicts data given parameters via $\mathcal{M}\boldsymbol{\theta} = \boldsymbol{m}$.
In general, the covariance matrix of a Gaussian approximation to the likelihood can be calculated by taking second derivatives of the log likelihood at its peak, and inverting the resulting Hessian matrix.
This (inverse Fisher information) matrix provides a lower limit, the Cramér-Rao bound, on the covariance of a set of estimators:
$\;\;\;\;\;\;\;V \approx \left[\,-\nabla\nabla \log L\,\right]^{-1}\Big|_{\hat{a},\hat{b}}$
$V$ is what scipy.optimize.minimize returns in its hess_inv field if you pass it the negative log likelihood.
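As a sketch (assuming arrays logP and sigma like the data.logP and data.sigma attributes used elsewhere in this notebook), the design-matrix version of this covariance for the straight-line model could be computed as follows; the square roots of its diagonal should roughly match the hess_inv output at the end of the notebook:

import numpy as np

def parameter_covariance(logP, sigma):
    # V = (M^T C^-1 M)^-1 for the linear model theta = [a, b]
    M = np.column_stack([logP, np.ones_like(logP)])   # N x 2 design matrix
    Cinv = np.diag(1.0 / sigma**2)                    # inverse data covariance
    F = M.T @ Cinv @ M                                # Fisher matrix (Gaussian errors)
    return np.linalg.inv(F)

# np.sqrt(np.diag(parameter_covariance(logP, sigma))) gives the 1-sigma
# uncertainties on a_hat and b_hat.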
In Frequentism there is no concept of marginalization: parameters are assumed to be single-valued and fixed, and the only probability distribution considered is P(data|pars), not P(pars|data).
Instead, 1D confidence intervals are usually derived by maximizing the likelihood over the other parameters, in a profile likelihood analysis. The repeated optimizations involved can be computationally expensive.
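A minimal sketch of a profile likelihood for the slope in this problem (assuming arrays x = $\log_{10}{P}$, y = $m^{\rm obs}$ and sigma): for each trial value of $a$, the likelihood can be maximized over $b$ analytically, since the optimal intercept is just the inverse-variance-weighted mean of $y - a x$:

import numpy as np

def profile_chisq_for_a(a_grid, x, y, sigma):
    # Profile out the intercept: minimize chi^2 over b at each trial slope a
    w = 1.0 / sigma**2
    chisq = np.empty_like(a_grid)
    for i, a in enumerate(a_grid):
        b_hat = np.sum(w * (y - a*x)) / np.sum(w)     # best-fit b at this a
        chisq[i] = np.sum(w * (y - a*x - b_hat)**2)
    return chisq

# A 68.3% confidence interval on a is then the range where
# chisq - chisq.min() < 1.0 (Delta chi^2 < 1 for one parameter).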
"68% of datasets would give a 68% frequentist confidence interval that contains the true parameter value"
"The probability of the true parameter value lying within the 68% Bayesian credible region is 68%"
The covariance matrix of a Gaussian approximation to the likelihood defines a 1-sigma, 2D, elliptical, frequentist confidence region
Since this came from transforming the sampling distribution, which is a PDF over datasets, the confidence interval enables conclusions in terms of fractions of an ensemble of (hypothetical) datasets
A 68% confidence interval is constructed so that, over the ensemble of (hypothetical) repeated datasets, 68% of the intervals obtained this way would contain the true parameter value
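The coverage statement above can be checked directly by simulation. Here is a toy sketch (with made-up true parameters and data design, not the R11 data) that repeatedly generates mock datasets, computes the weighted least squares MLE and its 1-sigma error for the slope, and counts how often the interval contains the truth:

import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = -3.0, 26.0                    # made-up "true" parameters
x = np.linspace(0.5, 2.0, 25)                  # fixed log10(P) sampling
sigma = 0.3 * np.ones_like(x)                  # homogeneous magnitude errors
w = 1.0 / sigma**2

# Normal-equations matrix and the resulting 1-sigma error on the slope MLE:
M = np.array([[np.sum(w*x**2), np.sum(w*x)],
              [np.sum(w*x),    np.sum(w)  ]])
sigma_a = np.sqrt(np.linalg.inv(M)[0, 0])

covered, ntrials = 0, 2000
for _ in range(ntrials):
    y = a_true*x + b_true + rng.normal(0.0, sigma)               # mock dataset
    a_hat, b_hat = np.linalg.solve(M, [np.sum(w*x*y), np.sum(w*y)])
    covered += (abs(a_hat - a_true) < sigma_a)                   # 1-sigma interval covers truth?
print(covered / ntrials)                                         # should be close to 0.683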
In Frequentism, the data are considered to be random variables (in large sets of hypothetical trials described with probability distributions) while parameters are considered fixed (and to be estimated)
In Bayesianism, the data are considered to be fixed (as constants, in datafiles) while parameters are considered random variables (to be inferred, with uncertainty described by probability distributions)
Given an assumed model:
Frequentists seek to transform the frequency distribution of the data into a frequency distribution of their estimators, and hence quantify their uncertainty in terms of what they expect would happen if the observation were to be repeated
Bayesians seek to update their knowledge of their model parameters, and hence quantify their uncertainty in terms of what might have been had the observation been different, and what they knew before the data were taken
The most important thing is to know what you are doing, and to communicate that clearly to others: both approaches involve assumptions which must be recorded and tested
The Bayesian approach provides a logical framework for combining datasets and additional information, and provides answers in terms of the probability distribution for the model parameters
The Frequentist approach provides a way of studying the model independent of additional information beyond the dataset in hand, and provides answers in terms of the probability of getting the data
The astronomy literature contains a mixture of frequentist and Bayesian analysis, sometimes within the same paper
Frequentist estimators often make good summary statistics with well understood sampling distributions: astronomical catalogs are full of them
In most of this course we follow the Bayesian approach: Bayes' Theorem gives you a framework for deriving the solution to any inference problem you encounter. Having said that, we'll keep our eyes open.
In [ ]:
exec(open('graphics/cepheids.py').read())
%matplotlib inline
plt.rcParams['figure.figsize'] = (15.0, 8.0)
data = Cepheids('../data/R11ceph.dat')
def log_likelihood(logP, mobs, sigma, a, b):
    # Gaussian log likelihood: normalization term plus the -chi^2/2 misfit term
    return -0.5*np.sum(np.log(2*np.pi*sigma**2)) - \
           0.5*np.sum((mobs - a*logP - b)**2/(sigma**2))
def log_prior(a, b):
amin,amax = -10.0,10.0
bmin,bmax = 10.0,30.0
if (b > bmin)*(b < bmax):
value = 0.0
        # Interesting alternative: a Cauchy prior on the slope a, equivalent to a
        # uniform prior on the angle of orientation of the line:
        # value = np.log(1.0/(bmax-bmin)) - \
        #         np.log(np.pi) - np.log(1 + a**2)
else:
value = -np.inf
if (a > amin)*(a < amax):
value += 0.0
else:
value += -np.inf
return value
def log_posterior(logP, mobs, sigma, a, b):
return log_likelihood(logP,mobs,sigma,a,b) + log_prior(a,b)
In [ ]:
# Limits of parameter grids, focused on the high likelihood region:
amin, amax = -3.4, -2.4
bmin, bmax = 25.7, 26.8
limits = (amin, amax, bmin, bmax)
def evaluate_posterior_on_a_grid(limits, NGC_ID=4258, npix=100):
# Make grids:
amin, amax, bmin, bmax = limits
agrid, bgrid, logprob = np.linspace(amin,amax,npix), np.linspace(bmin,bmax,npix), np.zeros([npix,npix])
# Select a Cepheid dataset:
data.select(NGC_ID)
    # Loop over parameters, computing the unnormalized log posterior PDF:
for i,a in enumerate(agrid):
for j,b in enumerate(bgrid):
logprob[j,i] = log_posterior(data.logP, data.mobs, data.sigma, a, b)
# Exponentiate and normalize to get posterior density:
prob = np.exp(logprob - np.max(logprob))
prob /= np.sum(prob)
return prob, agrid, bgrid
In [ ]:
%%time
prob, a, b = evaluate_posterior_on_a_grid(limits, NGC_ID=4258, npix=100)
Typically we want to be able to see the centroid, size and shape of the posterior PDF
In particular we want to see the credible regions that enclose 68% and 95% of the posterior probability. These are best plotted as contours
In [ ]:
sorted = np.sort(prob.flatten())
C = sorted.cumsum()
# Find the pixel values that lie at the levels that contain 68% and 95% of the probability:
lvl68 = np.min(sorted[C > (1.0 - 0.68)])
lvl95 = np.min(sorted[C > (1.0 - 0.95)])
plt.figure(figsize=(10,10))
plt.imshow(prob, origin='lower', cmap='Blues', interpolation='none', extent=limits)
plt.contour(prob,[lvl95,lvl68],colors='black',extent=limits)
plt.grid()
plt.xlabel('slope a', fontsize=20)
plt.ylabel('intercept b / AB magnitudes', fontsize=20);
# plt.savefig("cepheids_2d-posterior.png");
In [ ]:
prob_a_given_data = np.sum(prob, axis=0) # Approximate the integral as a sum
prob_b_given_data = np.sum(prob, axis=1) # Approximate the integral as a sum
# Check that we do have a 1D PDF:
# print(prob_a_given_data.shape, np.sum(prob_a_given_data))
plot_1d_marginalized_pdfs(a, b, prob_a_given_data, prob_b_given_data)
# plt.savefig("cepheids_1d-posteriors.png")
print("a = ",compress_1D_pdf(a, prob_a_given_data, ci=68, dp=2))
print("b = ",compress_1D_pdf(b, prob_b_given_data, ci=68, dp=2))
In [ ]:
data.plot(4258)
data.overlay_straight_line_with(a=-2.95, b=26.25, label='Model')
data.add_legend()
# plt.savefig("cepheids_posterior-median-check.png")
In [ ]:
exec(open('graphics/cepheids.py').read())
%matplotlib inline
plt.rcParams['figure.figsize'] = (15.0, 8.0)
data = Cepheids('../data/R11ceph.dat')
In [ ]:
data.select(4258)
M, v = data.sufficient_statistics()
a, b = np.linalg.solve(M, v)
print(r'$ \hat{a} = %.2f $' % a)
print(r'$ \hat{b} = %.2f $' % b)
In [ ]:
data.plot(4258)
data.overlay_straight_line_with(a=a, b=b, label='Maximum Likelihood fit')
data.add_legend()
In [ ]:
# Generalized maximum likelihood approach:
import scipy.optimize
pars = np.array([0.0, 20])
result = scipy.optimize.minimize(data.negative_log_likelihood, pars, method='BFGS', tol=0.001)
print(result)
In [ ]:
# Approximate 1-sigma uncertainties on a and b, from the inverse Hessian:
C = result.hess_inv
np.sqrt(C[0,0]), np.sqrt(C[1,1])