This notebook is adapted from a lesson from the 2019 KIPAC/StatisticalMethods course, (c) 2019 Adam Mantz and Phil Marshall, licensed under the GPLv2.

Evaluating Models


  • Be able to design and carry out tests of model adequacy (goodness of fit), and comparisons between models

  • Understand and be prepared to use the Bayesian Evidence


You can't do inference without making assumptions.

$\longrightarrow$ We must test the hypotheses defined by our models.

Three related but distinct questions come under the heading of model evaluation.

  1. Does a model describe (fit) the data well?
  2. Does a model make accurate predictions about new data?
  3. How probable are our competing models in light of the data?

Often (2) and (3) are directly related to model comparison or selection.

Throughout this (and always!), "model" means a complete generative model.

That is, a "model" includes the specification of a prior.

A familiar example: imagine we have a data set like this


  • we have precisely known $x$ values
  • we have precisely known, Gaussian errors on $y$
  • we're fitting a linear model, $\bar{y}(x)=b+mx$

How do we decide whether the model is any good?

Visual comparison of models drawn from the posterior with the data:


How might we decide whether our model adequately explains how the data were generated?

We can make this quantitative:

  • In this specific case, the likelihood is $\propto e^{-\chi^2/2}$.

  • So is the posterior, given uniform priors on $m$ and $b$.

Classical Hypothesis Testing

Assuming this model (line plus Gaussian errors) is correct, the distribution over data sets of $\hat{\chi}^2$ must follow a $\chi^2_\nu$ distribution, where

  • $\hat{\chi}^2$ is the best-fit $\chi^2$ over parameters for a given data set
  • the number of degrees of freedom $\nu=N_\mathrm{data}-N_\mathrm{params}$

Hence, the classical $\chi^2$ test looks at whether $\hat{\chi}^2$ is consistent with this distribution. If not, it's unlikely that our data came from the assumed model.

In this case, the value of $\hat{\chi}^2\approx104$ doesn't look good in light of the expectation.

The probability $P(\chi^2\geq\hat{\chi}^2|\nu)$ ($\sim10^{-10}$ in this case) is called the $p$-value or significance.

  • If the "null hypothesis" (our assumed model, with fitted parameters $[\hat{m},\hat{b}]$) is true, we expect the fraction of hypothetical new datasets to have $\chi^2$ values greater than $\hat{\chi}^2$ to be $p$.

The $p$-value is not the probability of the model $(m,b)$ being true. Like the sampling distribution from which it is derived, it characterizes the probability of getting the data given the assumed model and its estimated parameters.

The result of a classical hypothesis test is of the following form:

"We reject the null hypothesis at the $p$ significance level"

(i.e. on the grounds that it inadequately predicts the data.)

Practical Chi-squared Testing

  • We can compute the p-value assuming a chi-squared distribution using scipy.stats:
    import scipy.stats
    chisq = scipy.stats.chi2(Ndof)
    pvalue = chisq.sf(chisq_min)
  • The "reduced chi-squared", $\hat{\chi}^2_{R} = \hat{\chi}^2 / N_{\rm dof}$, is often used by astronomers to quantify goodness of fit - but note that you need to know the number of degrees of freedom separately from $\hat{\chi}^2$ to be able to interpret it.
  • A useful, quick way to make sense of $\hat{\chi}^2$ and $N_{\rm dof}$ values is to use Fisher's Gaussian approximation to the chi-squared distribution:

$\;\;\;\;\;\sqrt{2\hat{\chi}^2} \sim \mathcal{N}\left( \sqrt{2 N_{\rm dof}-1}, 1 \right)$ (approximately)

$\longrightarrow$ The difference between $\sqrt{2\hat{\chi}^2}$ and $\sqrt{2 N_{\rm dof}-1}$ is the "number of sigma" ($n_{\sigma}$) we are away from a good fit.

In our case, the MLE model is about 7-sigma away from being a good fit.

Bayesian Hypothesis Testing

  • In general our likelihood won't have nice, analytic properties.

  • We will want to evaluate the success of our model at explaining the data taking our uncertainty in the model parameters into account.

We can construct analogous hypothesis tests by simulating many "replica" data sets realized from the posterior distribution, and

comparing the observed data with the replica data via a suitable summary "test statistic", and its "posterior predictive distribution".

We are free to design our test statistic to focus on the aspect of the data that we want the model to fit well.

Posterior predictive model checking - logic:

  • If our model is the true one, then replica data generated by it should "look like" the one dataset we have.

  • This means that any summary $T$ of both the real dataset, $T(d)$, and the replica datasets, $T(d^{\rm rep})$, should follow the same distribution over noise realizations and model parameters.

  • If the real dataset was not generated with our model, then its summary may be an outlier from the distribution of summaries of replica datasets.

Note the similarity to the logic of the classical hypothesis test. The difference is that the Bayesian replica datasets were generated with plausible values of the parameters (drawn from the posterior PDF), while all the hypothetical datasets in frequentism (each with their own $\hat{\chi}^2$) are drawn using the same model parameters (the estimated ones).

Example test statistic: Pearson Correlation $r_{12}$

  • Focuses on the tightness of linear correlation between $x$ and $y$

  • $T(d) = r_{12} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2 \right]^{1/2}}$

For each one of many posterior samples, we draw a replica dataset from the sampling distribution given the sample parameter vector, and compute $T(d^{\rm rep})$, building up a histogram of $T$.

${\rm P}[T(d^{\rm rep})>T(d)\,|\,d] = 99.43\%$ - our dataset $d$ is clearly an outlier.

  • The posterior predictive probability distribution for the test statistic $T(d)$ generated by sampling in this way is marginalized over both parameters and (replica) datasets.

  • It takes into account both the uncertainty in the data (captured by the sampling distribution) and the uncertainty in the parameters (propagated from our one dataset and our prior knowledge during posterior sampling).

  • Posterior predictive model checking can be seen as the Bayesian extension of classical hypothesis testing, and is a useful test of model adequacy.

  • As with classical hypothesis testing, a model can be discarded (or retained) on the basis of a posterior predictive model check.

  • Note that we did not have to make any approximations in order to use a standard distribution for our summary $T$: we just used the posterior PDF we already had.

Test statistics $T(d,\theta)$ that are functions of both the data and the parameters are called discrepancy measures.

The maximum log-likelihood is a common example.

Discrepancy measure: $T = \hat{\chi}^2$; ${\rm Pr}(T(d^{\rm rep},\theta)>T(d,\theta)\,|\,d) \approx 0.0$

Any way we look at it, it's unlikely that we'd conclude the linear model explains these data adequately. How do we choose an alternative?

One way to compare the fitness of models is to look at question (2) in model evaluation: How accurately do they predict new data?

Generalized Predictive Accuracy and "Information Criteria"

  • We typically want a fit that works well with any potential data set, rather than just reproducing the one we have.
  • In general, this means an "Occam's Razor"-like penalty for complexity should be involved (to avoid focusing on models that "over-fit" the data).

In our example, we might add a quadratic term to the model: $y = b + m x + q x^2$. How do we quantify the improvement?

The gold standard for testing predictive accuracy is to get more data.

Short of that, the best option is cross-validation: fitting a model on many random subsets of the data and seeing how well it describes the complementary "out of sample" subsets.

This method is ubiquitous in machine learning, where accurate out-of-sample prediction is usually the goal.

Short of exhaustive cross-validation, a number of information criteria exist that (asymptotically) relate to generalized predictive accuracy.

These have the advantage of being relatively quick to calculate from the results of a fit - either an MLE or a set of posterior samples - and include a penalty for models with greater freedom.

Some information criteria:

  • Akaike information criterion (AIC)
  • Deviance information criterion (DIC)
  • Watanabe-Akaike information criterion (WAIC)

The DIC has the advantage of being compatible with Bayesian analysis (unlike AIC), and not requiring the data to be cleanly separable into conditionally independent subsets (unlike WAIC).

$\mathrm{DIC} = \langle D(\theta) \rangle + 2p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$

where $D(\theta)=-2\log P(\mathrm{data}|\theta)$ and averages $\langle\rangle$ are over the posterior.

$p_D$ is an effective number of free parameters, i.e. the number of parameters primarily constrained by the data rather than by their priors.

The DIC thus doesn't necessarily count unconstrained nuisance parameters used to marginalize out systematics as "added complexity".

Note that for all of these information criteria, a lower value is preferable (larger likelihood and/or less model complexity).

A somewhat motivated scale for interpreting differences in IC exists (named for Jeffreys):

$$e^{(\mathrm{IC}_1-\mathrm{IC}_2)/2}$$Strength of evidence for model 2
$<1$ Negative
$1$-$3$ Barely worth mentioning
$3$-$10$ Substantial
$10$-$30$ Strong
$30$-$100$ Very strong
$>100$ Decisive

Exercise: Priors and the DIC

Say our model has 1 parameter, $\theta$, and the likelihood is a unit width Gaussian centered on $\theta=0$ with peak value $L_{\rm max}$.

For each of the priors on $\theta$ below, (a) sketch the likelihood and prior as a function of theta, (b) roughly approximate the DIC and $p_D$ for that model (just well enough for a qualitative comparison between the models).

  1. $P(\theta|H_1)$ uniform on $[-1,+1]$
  2. $P(\theta|H_2)$ uniform on $[-100,+100]$
  3. $P(\theta|H_3)$ uniform on $[+3,+5]$

Recall: $\mathrm{DIC} = \langle D(\theta) \rangle + 2p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$

In [ ]:
import numpy as np
import scipy.stats as st
def DIC_thingy(lower, upper):
    y = st.truncnorm.rvs(lower, upper, size=100000)
    av_of_D = np.mean(-2.0*st.norm.logpdf(y))
    D_of_av = -2.0*st.norm.logpdf( np.mean(y) )
    pD = av_of_D - D_of_av
    DIC = av_of_D + 2*pD
    return av_of_D, D_of_av, pD, DIC

print(DIC_thingy(-1.0, 1.0))
print(DIC_thingy(-100.0, 100.0))
print(DIC_thingy(3.0, 5.0))

DIC exercise: notes

1) Models that are less prescriptive (in terms of their priors) are penalized in the DIC.

2) However, there is a limit to this penalty. As the prior becomes less prescriptive, we get the penalty associated with "another free parameter", and that's it.

3) Sufficiently large improvements to the likelihood will overcome this.

How about the third question - How probable are our competing models in the light of the data?

  • This question cannot be asked in classical statistics - where only data have probability distributions.

  • Bayes theorem gives us a framework for assessing relative model probabilities which naturally includes Occam's razor.

Bayesian Model Comparison

Inference on parameters $\theta$ given model $H$:


Inference on models $H$:


NB. $H$ is a list of all of our assumptions - including our prior PDF assignments.

Here $\Omega$ is some space of all allowed models. As we normally do for parameter inference, we'll work with a simplified version:

$P(H|D)\propto P(D|H)P(H)$

$P(H)$ is a prior on the model, and

$P(D|H)=\int P(D|\theta,H) \, P(\theta|H) d\theta$

is the evidence - the normalizing denominator in Bayesian parameter inference (also known as the fully marginalized likelihood).

Ideally, we would compare models by looking at


General difficulties in computing the terms in this ratio:

  • Assigning meaningful priors to models
  • Assigning meaningful priors to parameters
  • Calculating the evidence integral

Exercise: Priors and the evidence

Say we have a model with 1 parameter, $\theta$, and a likelihood that works out to be a unit width Gaussian centered on $\theta=0$ with peak value $L_{\rm max}$.

For each of the priors on $\theta$ below, (a) sketch the likelihood and prior as a function of theta, (b) roughly approximate the evidence for that model (just well enough for a qualitative comparison between the models).

  1. $P(\theta|H_1)$ uniform on $[-1,+1]$
  2. $P(\theta|H_2)$ uniform on $[-100,+100]$
  3. $P(\theta|H_3)$ uniform on $[+3,+5]$

Recall: $P(D|H)=\int P(D|\theta,H) \, P(\theta|H) d\theta$

In [ ]:
def Evidence_thingy(lower, upper):
    return (st.norm.cdf(upper) - st.norm.cdf(lower)) / (upper - lower)

print(Evidence_thingy(-1.0, 1.0))
print(Evidence_thingy(-100.0, 100.0))
print(Evidence_thingy(3.0, 5.0))

Evidence exercise: notes

1) Models that are less prescriptive (in terms of their priors) are penalized in the evidence. This is a feature, although it means we need to put real thought into those priors.

2) The evidence can be made arbitrarily small by increasing the prior volume: comparing evidences is more conservative than focusing on the goodness of fit ($L_{\rm max}$) alone.

3) The evidence is linearly sensitive to prior volume, but exponentially sensitive to goodness of fit ($L_{\rm max} \propto e^{-\hat{\chi}^2/2}$). It's still a likelihood, after all.

The evidence for model $H$, $P(D\,|\,H)$, enables a form of Bayesian hypothesis testing: model comparison with the "evidence ratio" or "odds ratio" or "Bayes Factor" $R$

$R = \frac{P(D|H_2)}{P(D|H_1)}$

$R$ is a fully marginalized likelihood ratio - which is to say that it takes into account our uncertainty about values of the parameters of each model by integrating over all plausible values of them.

Notice that if your two models are equally probable a priori, then

$\frac{P(H_2)}{P(H_1)} = 1$ such that $\frac{P(H_2|D)}{P(H_1|D)} = R$

This assumption is often not always easy to justify, but it makes $R$ easy to interpret: its just the ratio of model probabilities in our ideal comparison.

A more practical way to interpret the Bayes factor is to note that it updates the model prior ratio into a posterior one. This means that:

  • If you believe, despite having seen the data and computed $R$, that your two models are still equally probable,

  • then $R$ gives _the odds that you would have had to have been willing to take against $H_2$, before seeing the data._

In our linear model fit example, we can compute the evidence for the linear and quadratic models, and form the odds ratio $R$.

log Evidence for Straight Line Model: -157.2
log Evidence for Quadratic Model: -120.7
Evidence ratio in favour of the Quadratic Model:
  7e15 to 1

The 26 unit difference in log evidence between the two models translates to a huge odds ratio in favour of the quadratic model.

Incidentally those data did not come from either a linear or quadratic model...

The same Jeffreys scale used to interpret the information criteria can be used to interpret evidence ratios:

$R$Strength of evidence for model 2
$<1$ Negative
1-3 Barely worth mentioning
3-10 Substantial
10-30 Strong
30-100 Very strong
$>100$ Decisive

The Bayesian Information Criterion (BIC) is an approximation of $R$ (assuming $N$ datapoints greatly outnumber $k$ parameters, and the priors are uninformative).

Calculating the evidence

Estimates directly calculated from Markov chains produced for parameter inference are generally not reliable.

Good methods include nested sampling (e.g. MultiNest) and parallel tempering / thermodynamc integration (e.g. emcee).

Bayesian Evidence: closing thoughts

  • The Bayesian evidence is qualitatively different from other model assessments. While they focus primarily on prediction accuracy, the evidence is the way in which information from the prior PDF propagates through into our posterior beliefs about the model as a whole.

  • There are no inherent mathematical limitations to its use, in contrast to various other hypothesis tests that are only valid under certain assumptions (such as the models being nested, e.g. the classical $F$ test for comparing $\chi^2$ values). Any two models can be compared and the odds ratio computed.

Model Evaluation Summary

  1. Does a model describe (fit) the data well?

    Posterior predictive model checks (visual, test stats, discrepancy measures)

  1. Does a model make accurate predictions about new data?

    Cross validation; information criteria to quantify generalized predictive accuracy

  1. How probable are our competing models in light of the data?

    Bayesian Evidence ratios ("Bayes factors")