- Set up and carry out a simple Bayesian inference, characterizing a simple posterior PDF

- Compare brute force, analytic and MCMC sampled results

- Check models, with posterior predictive distributions of test statistics

- Use the Bayesian Evidence to compare various models, and understand its properties

- Suggest some simple hacks to get some practice with Bayesian inference

```
In [ ]:
```%load_ext autoreload
%autoreload 2
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (6.0, 6.0)
plt.rcParams['savefig.dpi'] = 100

```
In [ ]:
```from straightline_utils import *

```
In [ ]:
```(x,y,sigmay) = generate_data()
plot_yerr(x, y, sigmay)

- An industry standard: find the slope and intercept of a straight line that minimizes the mean square residual. Since the predicted data depend linearly on the parameters, the least squares solution can be found by a matrix inversion and multiplication, conveneniently packed in
`numpy.linalg`

.

- Suppose that the $y$ values are related to the $x$ values by $y = m x + b$, where $x$ and $y$ are "vectors" (
`numpy`

arrays). We might say that we want to find $m$ and $b$ that minimize

$W^2 = \sum_k \frac{(y_k - m x_k - b)^2}{\sigma_k^2}$

- Here, $1/\sigma_k^2$ is a sensible weighting factor that reduces the influence of $y_k$ values with large uncertainty $\sigma_k$

- Differentiating $W^2$ with respect to $m$ and $b$ and setting the result to zero leads to a matrix equation that can be solved - this is what
`np.linalg.lstsq`

is doing.

```
In [ ]:
```# Linear algebra to implement weighted least squares.
N = len(x)
A = np.zeros((N,2))
A[:,0] = 1. / sigmay
A[:,1] = x / sigmay
yy = y / sigmay
theta,nil,nil,nil = np.linalg.lstsq(A,yy)
b_ls,m_ls = theta
print('Weighted least-squares estimated b,m:', b_ls,m_ls)
plot_yerr(x, y, sigmay)
plot_line(m_ls, b_ls);

- The Bayesian answer to the question "What are the values of $b$ and $m$ in a straight line model for the above data?" is the posterior PDF for those parameters, ${\rm Pr}(m,b|y)$

- The posterior PDF doesn't tell us the
*best-fit parameter combination*- it gives us the means to simulate*all plausible parameter combinations, given the data.*

- Let's derive this posterior PDF, starting with a
*probabilistic graphical model*.

```
In [ ]:
```# import straightline_pgm
# straightline_pgm.draw();
from IPython.display import Image
Image(filename="straightline_pgm.png",width=600)

- The PGM illustrates the
*forward*problem of simulating a dataset $\{y_k\}$ by drawing random numbers from*conditional*probability distributions, which are depicted by the*nodes*. The conditional dependences are depicted by the*edges*.

- The PGM is a complete description of the forward problem: it is a representation of the
*joint PDF for all our variables*, ${\rm Pr}(m,b,\{y_k\},\{y^{\rm true}_k\}\,|\,\{x_k\},\{\sigma_k\})$, factorized such that we can simulate data:

${\rm Pr}(m,b,\{y_k\},\{y^{\rm true}_k\}\,|\,\{x_k\},\{\sigma_k\},H_1) = \prod_k {\rm Pr}(y_k\,|\,y^{\rm true}_k,\sigma_k,H_1) \;{\rm Pr}(y^{\rm true}_k\,|\,x_k,m,b,H_1) \;\;\;{\rm Pr}(m,b\,|\,H_1)$

Where did that product come from? What assumption does it encode?

Why are the $\{x_k\}$ and the $\{\sigma_k\}$ always on the right hand side of the bar, but $\{y_k\}$ is sometimes on the left?

What is the meaning of "$H_1$"?

*Hint: notice that it too is always on the right hand side of the bar.*What functional form does the conditional PDF ${\rm Pr}(y^{\rm true}_k\,|\,x_k,m,b,H_1)$ have?

What functional form do you think we should assume for the sampling distribution, ${\rm Pr}(y_k\,|\,y^{\rm true}_k,x_k,\sigma_k,H_1)$?

*Hint: imagine you were generating mock data.*What functional form should we assume for ${\rm Pr}(m,b\,|\,H_1)$?

Talk to your neighbor(s) for 5 minutes about these, note down some brief answers, and be prepared to provide them when called upon.

```
In [ ]:
``````
# %load straightline_answers.md
```

- Let's simplify our joint probability expression a bit, by marginalizing out the latent variables $\{y^{\rm true}_k\}$, and absorbing the $\{x_k\}$ and $\{\sigma_k\}$ into $H_1$ wherever it's helpful.

${\rm Pr}(m,b,\{y_k\}|H_1) = \int\, \prod_k {\rm Pr}(y_k\,|\,y^{\rm true}_k,H_1) \;{\rm Pr}(y^{\rm true}_k\,|\,m,b,H_1) \;\;\;{\rm Pr}(m,b\,|\,H_1) \; d y^{\rm true}_k $

$\longrightarrow {\rm Pr}(m,b,\{y_k\}|H_1) = \prod_k {\rm Pr}(y_k\,|\,(m x_k + b),H_1) \;\;\;{\rm Pr}(m,b\,|\,H_1) $

- Now we just factorize the left hand side and rearrange (like Bayes did), to get:

${\rm Pr}(m,b\,|\,\{y_k\},H_1) = \frac{1}{Z} \prod_k {\rm Pr}(y_k\,|\,(m x_k + b),H_1) \;\;\;{\rm Pr}(m,b\,|\,H_1) $

- Here, $Z = {\rm Pr}(\{y_k\}\,|\,H_1)$, and is known as the "evidence" for $H_1$. It also goes by the name "Fully Marginalized Likelihood" because

$Z = {\rm Pr}(\{y_k\}\,|\,H_1) = \int \prod_k {\rm Pr}(y_k\,|\,(m x_k + b),H_1) \;{\rm Pr}(m,b\,|\,H_1)\;dm\,db$

and $\prod_k {\rm Pr}(y_k\,|\,(m x_k + b),H_1)$ is the *joint likelihood* for the parameters. ${\rm Pr}(m,b\,|\,H_1)$ is the *prior PDF for the parameters*.

- For simple, 2-dimensional parameter spaces like this one, evaluating on a grid is not a bad way to go. We'll see that the least squares solution lies at the peak of the posterior PDF - for a certain set of assumptions about the data and the model.

- To compute the evidence $Z$ and normalize the posterior PDF, we need to be able to integrate the likelihood function over the prior - on a grid, we can do this numerically, to normlize the posterior PDF.

- Un-normalized, the likelihood function can take very small values, so we usually work with the log likelihood instead.

```
In [ ]:
```def straight_line_log_likelihood(theta, x, y, sigmay):
'''
Returns the log-likelihood of drawing data values *y* at
known values *x* given Gaussian measurement noise with standard
deviation with known *sigmay*, where the "true" y values are
*y_t = m * x + b*
x: list of x coordinates
y: list of y coordinates
sigmay: list of y uncertainties
m: scalar slope
b: scalar line intercept
and the parameter vector theta = (m,b).
Returns: scalar log likelihood
'''
m,b = theta
L = 1.0
for k in range(len(y)):
L *= np.exp(-(y[k] - m*x[k] - b)**2/(2.0*sigmay[k]**2))
L /= np.sqrt(2.0 * np.pi * sigmay[k]**2)
return np.log(L)

```
In [ ]:
``````
# %load straightline_log_likelihood.py
```

```
In [ ]:
```def straight_line_log_prior(theta, theta_limits):
m, b = theta
mlimits, blimits = theta_limits
# Uniform in m:
if (m < mlimits[0]) | (m > mlimits[1]):
log_m_prior = -np.inf
else:
log_m_prior = np.log(1.0/(mlimits[1] - mlimits[0]))
# Uniform in b:
if (b < blimits[0]) | (b > blimits[1]):
log_b_prior = -np.inf
else:
log_b_prior = np.log(1.0/(blimits[1] - blimits[0]))
return log_m_prior + log_b_prior
def straight_line_log_posterior(theta, x, y, sigmay, theta_limits):
return (straight_line_log_likelihood(theta, x, y, sigmay) +
straight_line_log_prior(theta, theta_limits))

```
In [ ]:
```# Evaluate log P(m,b | x,y,sigmay) on a grid.
# Define uniform prior limits, enforcing positivity in both parameters:
mlimits = [0.0, 2.0]
blimits = [0.0, 200.0]
theta_limits = (mlimits, blimits)
# Set up grid:
mgrid = np.linspace(mlimits[0], mlimits[1], 101)
bgrid = np.linspace(blimits[0], blimits[1], 101)
log_posterior = np.zeros((len(mgrid),len(bgrid)))
# Evaluate log posterior PDF:
for im,m in enumerate(mgrid):
for ib,b in enumerate(bgrid):
theta = (m,b)
log_posterior[im,ib] = straight_line_log_posterior(theta, x, y, sigmay, theta_limits)
# Convert to probability density and plot, taking care with very small values:
posterior = np.exp(log_posterior - log_posterior.max())
plt.imshow(posterior, extent=[blimits[0],blimits[1],mlimits[0],mlimits[1]],cmap='Blues',
interpolation='none', origin='lower', aspect=(blimits[1]-blimits[0])/(mlimits[1]-mlimits[0]),
vmin=0, vmax=1)
plt.contour(bgrid, mgrid, posterior, pdf_contour_levels(posterior), colors='k')
i = np.argmax(posterior)
i,j = np.unravel_index(i, posterior.shape)
print('Grid maximum posterior values (m,b) =', mgrid[i], bgrid[j])
plt.title('Straight line: posterior PDF for parameters');
plt.plot(b_ls, m_ls, 'r+', ms=12, mew=4);
plot_mb_setup(*theta_limits);

- In problems with higher dimensional parameter spaces, we need a more efficient way of approximating the posterior PDF - both when characterizing it in the first place, and then when doing integrals over that PDF (to get the marginalized PDFs for the parameters, or to compress them in to single numbers with uncertainties that can be easily reported).

- In most applications it's sufficient to approximate a PDF with a (relatively) small number of samples drawn from it; MCMC is a procedure for drawing samples from PDFs.

- The following piece of code implements the
*Metropolis algorithm*, whose derivation can be read in many textbooks. You can read a brief justification aimed at Phil's grad student self in Chapter 2 of Phil's thesis, for example.

```
In [ ]:
```def metropolis(log_posterior, theta, theta_limits, stepsize, nsteps=10000):
'''
log_posterior: function of theta
theta_limits: uniform prior ranges
stepsize: scalar or vector proposal distribution width
nsteps: desired number of samples
'''
# Start in the center of the prior volume:
log_prob = log_posterior(theta, x, y, sigmay, theta_limits)
# Store Markov chain as an array of samples:
chain = np.empty((nsteps, len(theta)))
log_probs = np.empty(nsteps)
# Count our accepted proposals:
naccept = 0
for i in range(nsteps):
theta_new = theta + stepsize * np.random.randn(len(theta))
log_prob_new = log_posterior(theta_new, x, y, sigmay, theta_limits)
if np.log(np.random.rand()) < (log_prob_new - log_prob):
# accept, and move to the proposed position:
theta = theta_new
log_prob = log_prob_new
naccept += 1
else:
# reject, and store the same sample as before:
pass
chain[i] = theta
log_probs[i] = log_prob
acceptance_rate = naccept/float(nsteps)
return chain,log_probs,acceptance_rate

```
In [ ]:
```# Initialize m, b at the center of prior:
m = 0.5*(mlimits[0]+mlimits[1])
b = 0.5*(blimits[0]+blimits[1])
theta = np.array([m,b])
# Step sizes, 2% or 5% of the prior
mstep = 0.02*(mlimits[1]-mlimits[0])
bstep = 0.05*(blimits[1]-blimits[0])
stepsize = np.array([mstep,bstep])
# How many steps?
nsteps = 10000
print('Running Metropolis Sampler for', nsteps, 'steps...')
chain, log_probs, acceptance_rate = metropolis(
straight_line_log_posterior, theta, theta_limits, stepsize, nsteps=nsteps
)
print('Acceptance fraction:', acceptance_rate)

```
In [ ]:
```# Pull m and b arrays out of the Markov chain and plot them:
mm = [m for m,b in chain]
bb = [b for m,b in chain]
# Traces, for convergence inspection:
plt.figure(figsize=(8,5))
plt.subplot(2,1,1)
plt.plot(mm, 'k-')
plt.ylim(mlimits)
plt.ylabel('m')
plt.subplot(2,1,2)
plt.plot(bb, 'k-')
plt.ylabel('Intercept b')
plt.ylim(blimits)

- This looks pretty good: no plateauing, or drift.

- A more rigorous test for convergence is due to Gelman & Rubin, and involves comparing the intrachain variance with the inter-chain variance in an ensemble. It's worth reading up on.

- Foreman-Mackey & Hogg recommend looking at the autocorrelation length, and whether it stablizes during the run.

- Note that we kept all the samples, and did not discard any at the beginning of the run as "burn-in". Nor did we thin the chain at all. Both of these are good ideas.

- Now let's look at the samples in parameter space, overlaid on our gridded posterior.

```
In [ ]:
```# First show contours from gridding calculation:
plt.contour(bgrid, mgrid, posterior, pdf_contour_levels(posterior), colors='k')
plt.gca().set_aspect((blimits[1]-blimits[0])/(mlimits[1]-mlimits[0]))
# Scatterplot of m,b posterior samples, overlaid:
plt.plot(bb, mm, 'b.', alpha=0.01)
plot_mb_setup(mlimits,blimits)

```
In [ ]:
```!pip install --upgrade --no-deps corner

```
In [ ]:
```import corner
corner.corner(chain, labels=['m','b'], range=[mlimits,blimits],quantiles=[0.16,0.5,0.84],
show_titles=True, title_args={"fontsize": 12},
plot_datapoints=True, fill_contours=True, levels=[0.68, 0.95],
color='b', bins=80, smooth=1.0);

- How do we know if our model is any good? There are two properties that "good" models have: the first is accuracy, and the second is efficiency. We'll return to th equestion of efficiency below.

- Accurate models generate data that is
*like*the observed data. What does this mean? First we have to define what similarity is, in this context. Visual impression is one very important way, that's usually best done first.

- Test statistics that capture relevant features of the data are another. We'll look at the posterior predictive distributions for the datapoints, and for a particularly interesting test statistic, the reduced chi-squared.

```
In [ ]:
```# Generate a straight line model for each parameter combination, and plot:
X = np.linspace(xlimits[0],xlimits[1],50)
for i in (np.random.rand(100)*len(chain)).astype(int):
m,b = chain[i]
plt.plot(X, b+X*m, 'b-', alpha=0.1)
plot_line(m_ls, b_ls)
# Overlay the data, for comparison:
plot_yerr(x, y, sigmay);

- If our model is the true one, then
*replica*data generated by it should look like the one dataset we have. This means that*summaries*of both the real dataset, $T(d)$, and the replica datasets, $T(d^{\rm rep})$, should be drawn from the same distribution.

- If the real dataset was not generated with our model, then its summary may be an outlier from the distribution of summaries of replica datasets.

- We can account for our uncertainty in the parameters $\theta$ by marginalizing them out, which can be easily done by just making the histogram of $T(d^{\rm rep}(\theta))$ from our posterior samples, after drawing one replica dataset $d^{\rm rep}$ from the model sampling distribution ${\rm Pr}(d^{\rm rep}\,|\,\theta)$ for each one.

- Then, we can ask: what is the posterior probability for the summary $T$ to be greater than the observed summary $T(d)$? If this is very small or very large, we should be suspicious of our model - because it is not predicting the data very accurately.

${\rm Pr}(T(d^{\rm rep})>T(d)\,|\,d) = \int I(T(d^{\rm rep})>T(d))\,{\rm Pr}(d^{\rm rep}\,|\,\theta)\,{\rm Pr}(\theta\,|\,d)\;d\theta\,dd^{\rm rep}$

- Here $I$ is the "indicator function" - 1 or 0 according to the condition.

```
In [ ]:
```# Test statistics: functions of the data, not the parameters.
# 1) Reduced chisq for the best fit model:
# def test_statistic(x,y,sigmay,b_ls,m_ls):
# return np.sum((y - m_ls*x - b_ls)**2.0/sigmay**2.0)/(len(y)-2)
# 2) Reduced chisq for the best fit m=0 model:
# def test_statistic(x,y,sigmay,dummy1,dummy2):
# return np.sum((y - np.mean(y))**2.0/sigmay**2.0)/(len(y)-1)
# 3) Weighted mean y:
# def test_statistic(x,y,sigmay,dummy1,dummy2):
# return np.sum(y/sigmay**2.0)/np.sum(1.0/sigmay**2.0)
# 4) Variance of y:
# def test_statistic(x,y,sigmay,dummy1,dummy2):
# return np.var(y)
# 5) Correlation coefficient:
import scipy.stats
def test_statistic(x,y,dummy1,dummy2,dummy3):
'''
Pearson r correlation coefficient:
r12 = \sum [(xi - xbar)*(yi - ybar)] / [\sum (xi - xbar)^2 * \sum (yi - ybar)^2]^1/2
'''
r12 = np.sum((x - np.mean(x))*(y - np.mean(y))) / \
np.sqrt(np.sum((x - np.mean(x))**2) * np.sum((y - np.mean(y))**2))
return r12
# Approximate the posterior predictive distribution for T,
# by drawing a replica dataset for each sample (m,b) and computing its T:
T = np.zeros(len(chain))
for k,(m,b) in enumerate(chain):
yrep = b + m*x + np.random.randn(len(x)) * sigmay
T[k] = test_statistic(x,yrep,sigmay,b_ls,m_ls)
# Compare with the test statistic of the data, on a plot:
Td = test_statistic(x, y, sigmay, b_ls, m_ls)
plt.hist(T, 100, histtype='step', color='blue', lw=2, range=(0.0,np.percentile(T,99.0)))
plt.axvline(Td, color='black', linestyle='--', lw=2)
plt.xlabel('Test statistic')
plt.ylabel('Posterior predictive distribution')
# What is Pr(T>T(d)|d)?
greater = (T > Td)
P = 100*len(T[greater])/(1.0*len(T))
print("Pr(T>T(d)|d) = ",P,"%")

- If our model is true (and we're just uncertain about its parameters, given the data), we can compute the probability of getting a $T$ less than that observed, where T is the reduced chisq relative to a straight line with some reference $(m,b)$.

- Note that we did not have to look up the "chi squared distribution" - we can simply compute the posterior predictive distribution given our generative model.

- Still, this test statistic looks a little bit strange: it's computed relative to the best fit straight line - which is a sensible reference but somehow not really in the spirit of fitting the data...

- Instead, lets look at a
*discrepancy measure*$T(d,\theta)$ that is a function of both the data and the parameters, and compute the posterior probability of getting $T(d^{\rm rep},\theta) > T(d,\theta)$:

${\rm Pr}(T(d^{\rm rep},\theta)>T(d,\theta)\,|\,d) = \int I(T(d^{\rm rep},\theta)>T(d,\theta))\,{\rm Pr}(d^{\rm rep}\,|\,\theta)\,{\rm Pr}(\theta\,|\,d)\;d\theta\,dd^{\rm rep}$

```
In [ ]:
```# Discrepancy: functions of the data AND parameters.
# 1) Reduced chisq for the model:
def test_statistic(x,y,sigmay,b,m):
return np.sum((y - m*x - b)**2.0/sigmay**2.0)/(len(y)-2)
# Approximate the posterior predictive distribution for T,
# by drawing a replica dataset for each sample (m,b) and computing its T,
# AND ALSO its Td (which now depends on the parameters, too):
T = np.zeros(len(chain))
Td = np.zeros(len(chain))
for k,(m,b) in enumerate(chain):
yrep = b + m*x + np.random.randn(len(x)) * sigmay
T[k] = test_statistic(x,yrep,sigmay,b,m)
Td[k] = test_statistic(x,y,sigmay,b,m)
# Compare T with Td, on a scatter plot - how often is T>Td?
plt.scatter(Td, T, color='blue',alpha=0.1)
plt.plot([0.0, 100.0], [0.0, 100.], color='k', linestyle='--', linewidth=2)
plt.xlabel('Observed discrepancy $T(d,\\theta)$')
plt.ylabel('Replicated discrepancy $T(d^{\\rm rep},\\theta)$')
plt.ylim([0.0,np.percentile(Td,99.0)])
plt.xlim([0.0,np.percentile(Td,99.0)])

```
In [ ]:
```# Histogram of differences:
diff = T-Td
plt.hist(diff, 100, histtype='step', color='blue', lw=2, range=(np.percentile(diff,1.0),np.percentile(diff,99.0)))
plt.axvline(0.0, color='black', linestyle='--', lw=2)
plt.xlabel('Difference $T(d^{\\rm rep},\\theta) - T(d,\\theta)$')
plt.ylabel('Posterior predictive distribution')
# What is Pr(T>T(d)|d)?
greater = (T > Td)
Pline = 100*len(T[greater])/(1.0*len(T))
print("Pr(T>T(d)|d) = ",Pline,"%")

- The conclusion drawn from the discrepancy is more interesting, in this case. All our $\theta = (m,b)$ samples are plausible, so replica datasets generated by them should also be plausible. The straight line defined by each $(m,b)$ should go through the real data points as readily (
*on average*) as it does its replica dataset.

- Do our posterior predictive $p-$values suggest we need to improve our model? What about the visual check?

- The reduced chi-squared is actually not such an interesting test statistic - because we already used it in the likelihood! Better discrepancy measures would stress-test
*different*aspects of the model - such as the correlations between datapoints, or the behavior of the model in particular places, etc.

- Maybe I see some curvature in the data - or maybe I have a new astrophysical idea for how this data was generated. Let's try adding an extra parameter to the model, to make a quadratic function:

$y = m x + b + q x^2$

- The coefficient $q$ is probably pretty small (we were originally expecting to only have to use a straight line model for these data!), so I guess we can assign a fairly narrow prior, centered on zero.

```
In [ ]:
```def quadratic_log_likelihood(theta, x, y, sigmay):
'''
Returns the log-likelihood of drawing data values y at
known values x given Gaussian measurement noise with standard
deviation with known sigmay, where the "true" y values are
y_t = m*x + b + q**2
x: list of x coordinates
y: list of y coordinates
sigmay: list of y uncertainties
m: scalar slope
b: scalar line intercept
q: quadratic term coefficient
where theta = (m, b, q)
Returns: scalar log likelihood
'''
m, b, q = theta
return (np.sum(np.log(1./(np.sqrt(2.*np.pi) * sigmay))) +
np.sum(-0.5 * (y - (m*x + b + q*x**2))**2 / sigmay**2))
def quadratic_log_prior(theta, theta_limits):
m, b, q = theta
mlimits, blimits, qpars = theta_limits
# m and b:
log_mb_prior = straight_line_log_prior(np.array([m,b]), np.array([mlimits, blimits]))
# q:
log_q_prior = np.log(1./(np.sqrt(2.*np.pi) * qpars[1])) - \
0.5 * (q - qpars[0])**2 / qpars[1]**2
return log_mb_prior + log_q_prior
def quadratic_log_posterior(theta, x, y, sigmay, theta_limits):
return (quadratic_log_likelihood(theta, x, y, sigmay) +
quadratic_log_prior(theta, theta_limits))

```
In [ ]:
```# Define uniform prior limits, enforcing positivity in m and b:
mlimits = [0.0, 2.0]
blimits = [0.0, 200.0]
# Define Gaussian prior centered on zero for q:
qpars = [0.0,0.003]
# Initialize m, b at the center of prior:
m = 0.5*(mlimits[0]+mlimits[1])
b = 0.5*(blimits[0]+blimits[1])
q = qpars[0]
# Arrays to pass to the sampler:
qtheta = np.array([m,b,q])
qtheta_limits = (mlimits, blimits, qpars)
# Step sizes, small fractions of the prior width:
mstep = 0.02*(mlimits[1]-mlimits[0])
bstep = 0.05*(blimits[1]-blimits[0])
qstep = 0.02*qpars[1]
stepsize = np.array([mstep,bstep,qstep])
# How many steps?
nsteps = 10000
print('Running Metropolis Sampler for', nsteps, 'steps...')
qchain, log_probs, acceptance_rate = metropolis(
quadratic_log_posterior, qtheta, qtheta_limits, stepsize, nsteps=nsteps
)
print('Acceptance fraction:', acceptance_rate)

```
In [ ]:
```# Pull m, b and q arrays out of the Markov chain and plot them:
mm = [m for m,b,q in qchain]
bb = [b for m,b,q in qchain]
qq = [q for m,b,q in qchain]
# Traces, for convergence inspection:
plt.figure(figsize=(8,5))
plt.subplot(3,1,1)
plt.plot(mm, 'k-')
plt.ylim(mlimits)
plt.ylabel('Slope m')
plt.subplot(3,1,2)
plt.plot(bb, 'k-')
plt.ylim(blimits)
plt.ylabel('Intercept b')
plt.subplot(3,1,3)
plt.plot(qq, 'k-')
plt.ylim([qpars[0]-3*qpars[1],qpars[0]+3*qpars[1]])
plt.ylabel('Quadratic coefficient q')

```
In [ ]:
```corner.corner(qchain, labels=['m','b','q'], range=[mlimits,blimits,(qpars[0]-3*qpars[1],qpars[0]+3*qpars[1])],quantiles=[0.16,0.5,0.84],
show_titles=True, title_args={"fontsize": 12},
plot_datapoints=True, fill_contours=True, levels=[0.68, 0.95],
color='green', bins=80, smooth=1.0);

- All parameters are again precisely constrained.

- The gradient and intercept $m$ and $b$ are significantly different from before, though...

```
In [ ]:
```# Posterior visual check, in data space:
X = np.linspace(xlimits[0],xlimits[1],100)
for i in (np.random.rand(100)*len(chain)).astype(int):
m,b,q = qchain[i]
plt.plot(X, b + X*m + q*X**2, 'g-', alpha=0.1)
plot_line(m_ls, b_ls);
plot_yerr(x, y, sigmay)

```
In [ ]:
```# Discrepancy: functions of the data AND parameters.
# 1) Reduced chisq for the model:
def test_statistic(x,y,sigmay,m,b,q):
return np.sum((y - m*x - b - q*x**2)**2.0/sigmay**2.0)/(len(y)-3)
# 2) Some other discrepancy measure:
# Approximate the posterior predictive distribution for T,
# by drawing a replica dataset for each sample (m,b) and computing its T,
# AND ALSO its Td:
T = np.zeros(len(qchain))
Td = np.zeros(len(qchain))
for k,(m,b,q) in enumerate(qchain):
yp = b + m*x + q*x**2 + sigmay*np.random.randn(len(x))
T[k] = test_statistic(x,yp,sigmay,m,b,q)
Td[k] = test_statistic(x,y,sigmay,m,b,q)
# Histogram of differences:
diff = T - Td
plt.hist(diff, 100, histtype='step', color='green', lw=2, range=(np.percentile(diff,1.0),np.percentile(diff,99.0)))
plt.axvline(0.0, color='black', linestyle='--', lw=2)
plt.xlabel('Difference $T(d^{\\rm rep},\\theta) - T(d,\\theta)$')
plt.ylabel('Posterior predictive distribution')
# What is Pr(T>T(d)|d)?
greater = (T > Td)
Pquad = 100*len(T[greater])/(1.0*len(T))
print("Pr(T>T(d)|d,quadratic) = ",Pquad,"%, cf. Pr(T>T(d)|d,straightline) = ",Pline,"%")

- How do the two models compare? Which one matches the data better?

All of the above discussion was about *model checking* - assessment of the ability of a model to make accurate predictions.

- We used
*replicated data*, drawn from the model assuming it was true, and then looked at the probability of getting (a summary of) the data, given this assumption.

- Note the parallel with frequentist hypothesis testing:

**Frequentist:** assume the null hypothesis $H_0$, and calculate the probability of getting the estimators for the parameters by chance, by studying the distribution of a test statistic over a large ensemble of replica datasets (which under certain assumptions is a standard distribution).

**Bayesian:** assume the alternative hypothesis $H_1$, and calculate the posterior predictive probability distribution for the test statistic given the data, relative to the one test statistic we observed, marginalizing over the model parameters. In general this distribution will be non-standard, but we can often draw samples from the posterior PDF anyway.

- The Bayesian posterior predictive check is fairly conservative, by comparison with tests that focus on the best-fit parameters. This can be a good thing, to err on the side of caution.

- An interesting alternative to posterior predictive model checking is to follow the machine learning approach, and use cross-validation to assess model prediction accuracy.

- What should the algorithm be? Here's a starter suggestion:
`split data into training set dt and holdback test set dh infer Pr(x|dt), drawing posterior samples in x predict dhp compute score, using dhp and dh combine scores`

- What should the "score" be?
`Pr(dh|dt)`

, marginalizing over x? Machine learning "Accuracy"? How should we combine the scores?

`scikit-learn`

already has a nice API, with (I guess) a`Model`

class and some very useful cross-validation machinery. Maybe we could write`BayesianModel`

that inherits from`Model`

, and give it a sampler, priors, etc, as well as`fit`

and`predict`

methods (which could return*ensembles of samples*instead of point estimates).

- The evidence for model $H$, ${\rm Pr}(d\,|\,H)$, enables a form of Bayesian hypothesis testing via the evidence ratio (or "odds ratio", or "Bayes Factor"):

$R = \frac{{\rm Pr}(d\,|\,H_1)}{{\rm Pr}(d\,|\,H_0)}$

- This quantity is similar to a likelihood ratio, but it's a
*fully marginalized likelihood ratio*- which is to say that it*takes into account our uncertainty about values of the parameters of each model by integrating over them all.*

- As well as predictive accuracy, the other virtue a model can have is
*efficiency*. The evidence summarizes*all*the information we put into our model inferences, via both the data*and*our prior beliefs. You can see this by inspection of the integrand of the fully marginalized likelihood (#FML) integral:

${\rm Pr}(d\,|\,H) = \int\;{\rm Pr}(d\,|\,\theta,H)\;{\rm Pr}(\theta\,|\,H)\;d\theta$

The consequence of this is that the evidence depends on

*both*model accuracy (goodness of fit, ability to predict data)*and*efficiency (not over-fitting by having implausible flexibility).The following figure might help illustrate how the evidence depends on both goodness of fit (through the likelihood) and the complexity of the model (via the prior). In this 1D case, a Gaussian likelihood (red) is integrated over a uniform prior (blue): the evidence can be shown to be given by $E = f \times L_{\rm max}$, where $L_{\rm max}$ is the maximum possible likelihood, and $f$ is the fraction of the blue dashed area that is shaded red. $f$ is 0.31, 0.98, and 0.07 in each case.

```
In [ ]:
```from IPython.display import Image
Image('evidence.png')

The illustration above shows us a few things:

1) The evidence can be made arbitrarily small by increasing the prior volume: the evidence is more conservative than focusing on the goodness of fit ($L_{\rm max}$) alone. Of course if you assign a prior you don't believe, then you should not expect to get out a meaningful answer for ${\rm Pr}(d\,|\,H)$.

2) The evidence is linearly sensitive to prior volume ($f$), but exponentially sensitive to goodness of fit ($L_{\rm max}$). It's still a likelihood, after all.

- The odds ratio can, in principle, be combined with the ratio of priors for each model to give us the relative probability for each model being true, given the data:

$\frac{{\rm Pr}(H_1|d)}{{\rm Pr}(H_0|d)} = \frac{{\rm Pr}(d|H_1)}{{\rm Pr}(d|H_0)} \; \frac{{\rm Pr}(H_1)}{{\rm Pr}(H_0)}$

Prior probabilities are very difficult to assign in most practical problems (notice that no theorist ever provides them). So, one way to interpret the evidence ratio is to note that:

- If you think that having seen the data, the two models are
*still equally probable,* - then the evidence ratio in favor of $H_1$ is you the odds that you would have had to have been willing to take against $H_1$, before seeing the data.
- That is: the evidence ratio updates the prior ratio into a posterior one - as usual.

- If you think that having seen the data, the two models are

- The FML is in general quite difficult to calculate, since it involves averaging the likelihood over the prior. MCMC gives us samples from the posterior - and these cannot, it turns out, be reprocessed so as to estimate the evidence stably.

A number of sampling algorithms have been developed that

*do*calculate the evidence, during the process of sampling. These include:- Nested Sampling (including MultiNest and DNest)
- Parallel Tempering, Thermodynamic Integration
- ...

- If we draw samples from the
*prior*, we can then estimate the evidence via the usual sum over samples,

${\rm Pr}(d\,|\,H) \approx \frac{1}{N_s} \sum_k\;{\rm Pr}(d\,|\,\theta_s,H)$

## Simple Monte Carlo:

Consider the simplest possbile posterior inference, the mean of parameter $\theta$: $\bar{\theta} = \int\,\theta\,{\rm Pr}(\theta\,|\,d,H)\,d\theta$

This is approximated by an average over posterior samples:

$\bar{\theta} \approx \frac{1}{N_s}\,\sum_s\,\theta_s$

Now replace the posterior PDF with Bayes' Theorem:

$\bar{\theta} = \frac{1}{Z} \int\,\theta\,{\rm Pr}(d\,|\,\theta,H)\,{\rm Pr}(\theta\,|\,H)\,d\theta$

and the posterior samples with prior samples:

$\bar{\theta} \approx \frac{1}{(Z N_s)}\,\sum_s\,\theta_s\,{\rm Pr}(d\,|\,\theta_s,H)$

This is a

weighted mean- if we don't want $Z$, we just define weights $w_s = {\rm Pr}(d\,|\,\theta_s,H)$ compute the approximate posterior mean as the likelihood-weighted prior mean:$\bar{\theta} \approx \frac{\sum_s\,w_s\,\theta_s}{\sum_s\,w_s}$

To compute Z, we multiply both sides by $Z$ and replace $\theta$ with 1:

$Z \approx \frac{1}{N_s} \sum_s\,\,w_s$

- Simple Monte Carlo works well in certain low-dimensional situations, but in general it is very inefficient (at best).

- Still, let's give it a try on our two models, and attempt to compute the Bayes Factor

$R = \frac{{\rm Pr}(d\,|\,{\rm quadratic})}{{\rm Pr}(d\,|\,{\rm straight line})}$

```
In [ ]:
```# Draw a large number of prior samples and calculate the log likelihood for each one:
N = 50000
# Set the priors:
mlimits = [0.0, 2.0]
blimits = [0.0, 200.0]
qpars = [0.0,0.003]
# Sample from the prior:
mm = np.random.uniform(mlimits[0],mlimits[1], size=N)
bb = np.random.uniform(blimits[0],blimits[1], size=N)
qq = qpars[0] + qpars[1]*np.random.randn(N)
# We'll store the posterior samples as a "chain" again
schain = []
log_likelihood_straight_line = np.zeros(N)
log_likelihood_quadratic = np.zeros(N)
for i in range(N):
theta = np.array([mm[i], bb[i]])
log_likelihood_straight_line[i] = straight_line_log_likelihood(theta, x, y, sigmay)
qtheta = np.array([mm[i], bb[i], qq[i]])
log_likelihood_quadratic[i] = quadratic_log_likelihood(qtheta, x, y, sigmay)
schain.append((mm[i],bb[i],qq[i]))

*not* want to plot the points as well as the contours, since the points are prior samples not posterior ones!):

```
In [ ]:
```# Unnormalized likelihoods for plotting:
unnormalized_likelihood_straight_line = np.exp(log_likelihood_straight_line - log_likelihood_straight_line.max())
unnormalized_likelihood_quadratic = np.exp(log_likelihood_quadratic - log_likelihood_quadratic.max())

```
In [ ]:
```corner.corner(schain, labels=['m','b','q'], range=[mlimits,blimits,(qpars[0]-3*qpars[1],qpars[0]+3*qpars[1])],quantiles=[0.16,0.5,0.84],
weights=unnormalized_likelihood_straight_line,
show_titles=True, title_args={"fontsize": 12},
plot_datapoints=False, fill_contours=True, levels=[0.68, 0.95],
color='blue', bins=80, smooth=1.0);

```
In [ ]:
```corner.corner(schain, labels=['m','b','q'], range=[mlimits,blimits,(qpars[0]-3*qpars[1],qpars[0]+3*qpars[1])],quantiles=[0.16,0.5,0.84],
weights=unnormalized_likelihood_quadratic,
show_titles=True, title_args={"fontsize": 12},
plot_datapoints=False, fill_contours=True, levels=[0.68, 0.95],
color='green', bins=80, smooth=1.0);

```
In [ ]:
```def logaverage(x):
mx = x.max()
return np.log(np.sum(np.exp(x - mx))) + mx - np.log(len(x))
log_evidence_straight_line = logaverage(log_likelihood_straight_line)
log_evidence_quadratic = logaverage(log_likelihood_quadratic)
print('log Evidence for Straight Line Model:', log_evidence_straight_line)
print('log Evidence for Quadratic Model:', log_evidence_quadratic)
print('Odds ratio in favour of the Quadratic Model:', np.int(np.exp(log_evidence_quadratic - log_evidence_straight_line)),"to 1")

Play around with the Simple Monte Carlo sampler, and see if you can break it.

Try running it a few times, with the same

`N`

: how precise are these evidence estimates from 50000 samples?We said we believed

*a priori*that $q$ should be small, and set the prior width accordingly. What would happen if we started to doubt ourselves, and re-assigned a broader prior? Try running with some larger values for the Gaussian prior width $q[1]$, and note what happens to the Bayes Factor.

- The Bayesian evidence is
*qualitatively different*from other model assessments. While they focus primarily on*prediction accuracy,*the evidence is the way in which information from the prior PDF propagates through into our posterior beliefs about the model as a whole.

- There are no mathematical limitations to its use, in contrast to various other hypothesis tests that are only valid under certain assumptions (such as the models being nested). Any two models can be compared and the odds ratio computed.

Good things to keep in mind when using, or reading about, the evidence for a model:

Garbage in, garbage out: if you don't believe your prior, why should you or anyone else believe your evidence?

The evidence is only linearly sensitive to prior volume, but exponentially sensitive to goodness of fit. If you want a clearer sense of which model makes the data more probable,

*get more data.*If you don't believe your priors, the evidence can be a distraction from things you care about more - like accuracy, or cost.

If you do believe your priors, the evidence is very useful: it appears in second-level inferences, when parts of your model that you previously considered to be constant now need to be varied and inferred. The FML is only one parameter liberation away from becoming a likelihood.

.

**Intrinsic Scatter**: Implement the log likelihood for the straight line plus intrinsic scatter, and compare it with the other models via the evidence. This is more of an exercise than a hack, but it's good to practice.

**Better Discrepancy Measures:**Design and implement some more interesting discrepancy measures $T(d,\theta)$ that do a better job of highlighting the failures of the straight line model.

**Improve This Notebook**: The coding in this notebook leaves a lot to be desired. The Simple Monte Carlo sampler could be packaged into a`def`

for much easier re-use, and the plots could be combined to allow much easier comparison. Also, who does linear algebra these days? Surely`scikit-learn`

has a linear model we could use. Pull requests are most welcome!

**Simple Monte Carlo**: Write a general purpose Simple Monte Carlo sampler whose API mimics something well-used (like`emcee`

), package it up, and compete for business. Can you automate the calculation of the evidence to a desired precision? If you can, I think that will probably guarantee that you have sampled the posterior well.

**Simple Monte Carlo Sci-Kit Learn Model:**So that the machine learning community can experiment with scoring with the Bayesian evidence.

.

Copyright (C) Phil Marshall, Daniel Foreman-Mackey & Dustin Lang

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

```
In [ ]:
```