Three related but distinct questions come under the heading of **model evaluation**.

1. Does a model describe (fit) the data well?
2. Does a model make accurate predictions about new data?
3. How probable are our competing models in light of the data?

Often (2) and (3) are directly related to **model comparison** or **selection**.

Throughout this discussion (and always!), **"model" means a complete generative model**.

That is, a "model" includes the specification of a prior.

**A familiar example:** imagine we have a data set like this

Specifically,

- we have precisely known $x$ values
- we have precisely known, Gaussian errors on $y$
- we're fitting a linear model, $\bar{y}(x)=b+mx$

How do we decide whether the model is any good?

Visual comparison of models drawn from the posterior with the data:

We can make this quantitative:

In this specific case, the likelihood is $\propto e^{-\chi^2/2}$.

So is the posterior, given uniform priors on $m$ and $b$.

Assuming this model (line plus Gaussian errors) is correct, the distribution over data sets of $\hat{\chi}^2$ must follow a $\chi^2_\nu$ distribution, where

- $\hat{\chi}^2$ is the best-fit $\chi^2$ over parameters for a given data set
- the number of degrees of freedom $\nu=N_\mathrm{data}-N_\mathrm{params}$

In this case, the value of $\hat{\chi}^2\approx104$ doesn't look good in light of the expectation.

The probability $P(\chi^2\geq\hat{\chi}^2|\nu)$ ($\sim10^{-10}$ in this case) is called the **$p$-value** or **significance**.

- If the "null hypothesis" (our assumed model, with fitted parameters $[\hat{m},\hat{b}]$) is true, we expect a fraction $p$ of hypothetical new datasets to have $\chi^2$ values greater than $\hat{\chi}^2$.

The $p$-value is *not* the probability of the model $(m,b)$ being true. Like the sampling distribution from which it is derived, it characterizes the probability of getting the data given the assumed model and its estimated parameters.

The result of a classical hypothesis test is of the following form:

*"We reject the null hypothesis at the $p$ significance level"*

(i.e. on the grounds that it inadequately predicts the data.)

- A useful, quick way to make sense of $\hat{\chi}^2$ and $N_{\rm dof}$ values is to use
**Fisher's Gaussian approximation to the chi-squared distribution**:

$\;\;\;\;\;\sqrt{2\hat{\chi}^2} \sim \mathcal{N}\left( \sqrt{2 N_{\rm dof}-1}, 1 \right)$ (approximately)

$\longrightarrow$ The difference between $\sqrt{2\hat{\chi}^2}$ and $\sqrt{2 N_{\rm dof}-1}$ is the "number of sigma" ($n_{\sigma}$) we are away from a good fit.

In our case, the MLE model is about 7-sigma away from being a good fit.
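These numbers are easy to verify directly; a minimal sketch, assuming (hypothetically) $\hat{\chi}^2=104$ with $\nu=28$ degrees of freedom (e.g. 30 data points minus 2 fitted parameters), values chosen here only because they are broadly consistent with the $p\sim10^{-10}$ and 7-sigma figures quoted above:

```python
import numpy as np
import scipy.stats as st

# Hypothetical values, consistent with the example above
chi2_hat = 104.0
ndof = 28  # e.g. 30 data points minus 2 fitted parameters (assumed)

# Exact tail probability: P(chi^2 >= chi2_hat | ndof)
p_value = st.chi2.sf(chi2_hat, ndof)

# Fisher's Gaussian approximation: "number of sigma" away from a good fit
n_sigma = np.sqrt(2.0 * chi2_hat) - np.sqrt(2.0 * ndof - 1.0)

print(p_value, n_sigma)
```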

We can construct analogous hypothesis tests by *simulating many "replica" data sets realized from the posterior distribution*, and comparing the observed data with the replica data via a suitable summary "test statistic" and its **"posterior predictive distribution"**.

We are free to design our test statistic to focus on the aspect of the data that we want the model to fit well.

**Posterior predictive model checking** - logic:

If our model is the true one, then *replica* data generated by it should "look like" the one dataset we have.

This means that any *summary* $T$ of both the real dataset, $T(d)$, and the replica datasets, $T(d^{\rm rep})$, should follow the same distribution over noise realizations *and* model parameters.

If the real dataset was not generated with our model, then its summary may be an *outlier* from the distribution of summaries of replica datasets.

Example test statistic: Pearson Correlation $r_{12}$

Focuses on the tightness of linear correlation between $x$ and $y$

$T(d) = r_{12} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2 \right]^{1/2}}$

For each one of many posterior samples, we draw a *replica dataset* from the sampling distribution given the sample parameter vector, and compute $T(d^{\rm rep})$, building up a histogram of $T$.

${\rm P}[T(d^{\rm rep})>T(d)\,|\,d] = 99.43\%$ - our dataset $d$ is clearly an outlier.
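A minimal sketch of this procedure for the Pearson-$r$ statistic; the posterior samples, $x$ values, error bar, and observed $T(d)$ below are stand-in placeholders, not the actual chain and data from the example:

```python
import numpy as np

rng = np.random.default_rng(42)

def pearson_r(x, y):
    # Test statistic T(d): Pearson correlation coefficient r_12
    return np.corrcoef(x, y)[0, 1]

# Hypothetical setup: fixed x values, Gaussian error bar sigma,
# and stand-in posterior samples of the intercept b and slope m
x = np.linspace(0.0, 10.0, 30)
sigma = 1.0
b_samples = rng.normal(1.0, 0.1, size=2000)
m_samples = rng.normal(0.5, 0.05, size=2000)

# One replica dataset per posterior sample; compute T(d_rep) for each
T_rep = np.empty(len(b_samples))
for i, (b, m) in enumerate(zip(b_samples, m_samples)):
    y_rep = b + m * x + rng.normal(0.0, sigma, size=x.size)
    T_rep[i] = pearson_r(x, y_rep)

# Posterior predictive p-value P[T(d_rep) > T(d) | d]
# (T_obs here is a placeholder for the observed statistic)
T_obs = 0.9
p_ppp = np.mean(T_rep > T_obs)
```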

The posterior predictive probability distribution for the test statistic $T(d)$ generated by sampling in this way is marginalized over both parameters and (replica) datasets.

It takes into account both the uncertainty in the data (captured by the sampling distribution) *and* the uncertainty in the parameters (propagated from our one dataset and our prior knowledge during posterior sampling).

Posterior predictive model checking can be seen as the Bayesian extension of classical hypothesis testing, and is a useful test of *model adequacy*.

As with classical hypothesis testing, a model can be discarded (or retained) on the basis of a posterior predictive model check.

Note that we did not have to make any approximations in order to use a standard distribution for our summary $T$:

*we just used the posterior PDF we already had*.

Test statistics $T(d,\theta)$ that are functions of both the data and the parameters are called **discrepancy measures**.

The log-likelihood, $\log P(d|\theta)$, is a common example.

Any way we look at it, it's unlikely that we'd conclude the linear model explains these data adequately. How do we choose an alternative?

One way to compare the fitness of models is to look at question (2) in model evaluation: **How accurately do they predict new data?**

- We typically want a fit that works well with any *potential* data set, rather than just reproducing the one we have.
- In general, this means an "Occam's Razor"-like penalty for complexity should be involved (to avoid focusing on models that "over-fit" the data).

The gold standard for testing predictive accuracy is to *get more data*.

Short of that, the best option is **cross-validation**: fitting a model on many random subsets of the data and seeing how well it describes the complementary "out of sample" subsets.

This method is ubiquitous in machine learning, where accurate out-of-sample prediction is usually the goal.
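A minimal sketch of random-split cross-validation for the straight-line model, on synthetic data (all numbers here are illustrative, not from the example dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dataset: straight line plus unit Gaussian noise
x = np.linspace(0.0, 10.0, 40)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)
sigma = 1.0

def cv_score(x, y, n_splits=100, f_train=0.7):
    # Fit on a random training subset, score (chi^2 per point)
    # on the held-out "out of sample" subset; average over splits
    n = x.size
    n_train = int(f_train * n)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        train, test = idx[:n_train], idx[n_train:]
        m, b = np.polyfit(x[train], y[train], 1)
        resid = (y[test] - (b + m * x[test])) / sigma
        scores.append(np.mean(resid**2))
    return np.mean(scores)

score = cv_score(x, y)
```

A model that over-fits the training subsets would show an inflated out-of-sample score, which is how the complexity penalty arises naturally here.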

Short of exhaustive cross-validation, a number of **information criteria** exist that (asymptotically) relate to generalized predictive accuracy.

These have the advantage of being relatively quick to calculate from the results of a fit - either an MLE or a set of posterior samples - and include a penalty for models with greater freedom.

Some information criteria:

- Akaike information criterion (AIC)
- Deviance information criterion (DIC)
- Watanabe-Akaike information criterion (WAIC)

The DIC has the advantage of being compatible with Bayesian analysis (unlike AIC), and not requiring the data to be cleanly separable into conditionally independent subsets (unlike WAIC).

$\mathrm{DIC} = \langle D(\theta) \rangle + 2p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$

where $D(\theta)=-2\log P(\mathrm{data}|\theta)$ and averages $\langle\rangle$ are over the posterior.

$p_D$ is an *effective number of free parameters*, i.e. the number of parameters primarily constrained by the data rather than by their priors.

The DIC thus doesn't necessarily count unconstrained nuisance parameters used to marginalize out systematics as "added complexity".

Note that for all of these information criteria, a **lower** value is preferable (larger likelihood and/or less model complexity).

A somewhat motivated scale for interpreting differences in IC exists (named for Jeffreys):

| $e^{(\mathrm{IC}_1-\mathrm{IC}_2)/2}$ | Strength of evidence for model 2 |
| --- | --- |
| $<1$ | Negative |
| $1$-$3$ | Barely worth mentioning |
| $3$-$10$ | Substantial |
| $10$-$30$ | Strong |
| $30$-$100$ | Very strong |
| $>100$ | Decisive |
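This mapping from an IC difference to a verbal strength of evidence can be captured in a small helper; a sketch (the function name is ours):

```python
import numpy as np

def jeffreys_strength(delta_IC):
    # Interpret IC_1 - IC_2 (lower IC is better, so positive
    # delta_IC favours model 2) on the Jeffreys scale
    odds = np.exp(delta_IC / 2.0)
    if odds < 1:
        return "Negative"
    elif odds < 3:
        return "Barely worth mentioning"
    elif odds < 10:
        return "Substantial"
    elif odds < 30:
        return "Strong"
    elif odds < 100:
        return "Very strong"
    return "Decisive"
```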

Say our model has 1 parameter, $\theta$, and the likelihood is a unit width Gaussian centered on $\theta=0$ with peak value $L_{\rm max}$.

For each of the priors on $\theta$ below, (a) sketch the likelihood and prior as a function of $\theta$, (b) roughly approximate the DIC and $p_D$ for that model (just well enough for a qualitative comparison between the models).

- $P(\theta|H_1)$ uniform on $[-1,+1]$
- $P(\theta|H_2)$ uniform on $[-100,+100]$
- $P(\theta|H_3)$ uniform on $[+3,+5]$

Recall: $\mathrm{DIC} = \langle D(\theta) \rangle + 2p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$

```python
import numpy as np
import scipy.stats as st

def DIC_thingy(lower, upper):
    # Posterior samples: unit Gaussian likelihood truncated to the uniform prior's support
    y = st.truncnorm.rvs(lower, upper, size=100000)
    av_of_D = np.mean(-2.0 * st.norm.logpdf(y))   # <D(theta)>
    D_of_av = -2.0 * st.norm.logpdf(np.mean(y))   # D(<theta>)
    pD = av_of_D - D_of_av
    DIC = av_of_D + 2*pD
    return av_of_D, D_of_av, pD, DIC

print(DIC_thingy(-1.0, 1.0))
print(DIC_thingy(-100.0, 100.0))
print(DIC_thingy(3.0, 5.0))
```

**DIC exercise: notes**

1) Models that are less prescriptive (in terms of their priors) are penalized in the DIC.

2) However, there is a limit to this penalty. As the prior becomes less prescriptive, we get the penalty associated with "another free parameter", and that's it.

3) Sufficiently large improvements to the likelihood will overcome this.

How about the third question - **How probable are our competing models in the light of the data?**

This question cannot be asked in classical statistics - where only data have probability distributions.

Bayes theorem gives us a framework for assessing relative model probabilities which naturally includes Occam's razor.

Inference on models $H$:

$P(H|D,\Omega)=\frac{P(D|H,\Omega)P(H|\Omega)}{P(D|\Omega)}$

NB. $H$ is a list of all of our assumptions - including our prior PDF assignments.

Here $\Omega$ is some space of all allowed models. As we normally do for parameter inference, we'll work with a simplified version:

$P(H|D)\propto P(D|H)P(H)$

$P(H)$ is a prior on the model, and

$P(D|H)=\int P(D|\theta,H) \, P(\theta|H) d\theta$

is the **evidence** - the normalizing denominator in Bayesian parameter inference (also known as the **fully marginalized likelihood**).

Ideally, we would compare models by looking at

$\frac{P(H_2|D)}{P(H_1|D)}=\frac{P(D|H_2)\,P(H_2)}{P(D|H_1)\,P(H_1)}$

General difficulties in computing the terms in this ratio:

- Assigning meaningful priors to models
- Assigning meaningful priors to parameters
- Calculating the evidence integral

Say we have a model with 1 parameter, $\theta$, and a likelihood that works out to be a unit width Gaussian centered on $\theta=0$ with peak value $L_{\rm max}$.

For each of the priors on $\theta$ below, (a) sketch the likelihood and prior as a function of $\theta$, (b) roughly approximate the evidence for that model (just well enough for a qualitative comparison between the models).

- $P(\theta|H_1)$ uniform on $[-1,+1]$
- $P(\theta|H_2)$ uniform on $[-100,+100]$
- $P(\theta|H_3)$ uniform on $[+3,+5]$

Recall: $P(D|H)=\int P(D|\theta,H) \, P(\theta|H) d\theta$

```python
import scipy.stats as st

def Evidence_thingy(lower, upper):
    # Evidence = integral of (unit Gaussian likelihood) x (uniform prior on [lower, upper])
    return (st.norm.cdf(upper) - st.norm.cdf(lower)) / (upper - lower)

print(Evidence_thingy(-1.0, 1.0))
print(Evidence_thingy(-100.0, 100.0))
print(Evidence_thingy(3.0, 5.0))
```

**Evidence exercise: notes**

1) Models that are less prescriptive (in terms of their priors) are penalized in the evidence. This is a feature, although it means we need to put real thought into those priors.

2) The evidence can be made arbitrarily small by increasing the prior volume: comparing evidences is more conservative than focusing on the goodness of fit ($L_{\rm max}$) alone.

3) The evidence is linearly sensitive to prior volume, but exponentially sensitive to goodness of fit ($L_{\rm max} \propto e^{-\hat{\chi}^2/2}$). It's still a likelihood, after all.

The evidence for model $H$, $P(D\,|\,H)$, enables a form of Bayesian hypothesis testing: model comparison with the "evidence ratio" or "odds ratio" or "Bayes Factor" $R$

$R = \frac{P(D|H_2)}{P(D|H_1)}$

$R$ is a *fully marginalized likelihood ratio* - which is to say that it *takes into account our uncertainty about values of the parameters of each model by integrating over all plausible values of them.*

Notice that if your two models are equally probable *a priori*, then

$\frac{P(H_2)}{P(H_1)} = 1$ such that $\frac{P(H_2|D)}{P(H_1|D)} = R$

This assumption is not always easy to justify, but it makes $R$ easy to interpret: it's just the ratio of model probabilities in our ideal comparison.

A more practical way to interpret the Bayes factor is to note that it updates the model prior ratio into a posterior one. This means that:

If you believe, despite having seen the data and computed $R$, that your two models are *still equally probable*, then $R$ gives _the odds that you would have had to have been willing to take against $H_2$, before seeing the data._

In our linear model fit example, we can compute the evidence for the linear and quadratic models, and form the odds ratio $R$.

```
log Evidence for Straight Line Model: -157.2
log Evidence for Quadratic Model: -120.7
Evidence ratio in favour of the Quadratic Model:
7e15 to 1
```

The 36.5 unit difference in log evidence between the two models translates to a *huge* odds ratio in favour of the quadratic model.

Incidentally those data did not come from *either* a linear or quadratic model...
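The odds ratio quoted above follows directly from the two log evidences; a quick check:

```python
import numpy as np

# Log evidences quoted in the output above
logZ_line = -157.2
logZ_quad = -120.7

# Odds ratio (Bayes factor) in favour of the quadratic model
R = np.exp(logZ_quad - logZ_line)
print(f"{R:.1e} to 1")
```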

The same Jeffreys scale used to interpret the information criteria can be used to interpret evidence ratios:

| $R$ | Strength of evidence for model 2 |
| --- | --- |
| $<1$ | Negative |
| 1-3 | Barely worth mentioning |
| 3-10 | Substantial |
| 10-30 | Strong |
| 30-100 | Very strong |
| $>100$ | Decisive |

**Bayesian Evidence: closing thoughts**

The Bayesian evidence is *qualitatively different* from other model assessments. While they focus primarily on *prediction accuracy*, the evidence is the way in which information from the prior PDF propagates through into our posterior beliefs about the model as a whole.

There are no inherent mathematical limitations to its use, in contrast to various other hypothesis tests that are only valid under certain assumptions (such as the models being nested, e.g. the classical $F$ test for comparing $\chi^2$ values). *Any two models can be compared and the odds ratio computed.*

- Does a model describe (fit) the data well?
  Hypothesis tests, classical or posterior predictive
- Does a model make accurate predictions about new data?
  Cross-validation; information criteria to quantify generalized predictive accuracy
- How probable are our competing models in light of the data?
  Bayesian Evidence ratios ("Bayes factors")