In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# fig_code is in the repository linked above
import fig_code
# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()
Bayesian approaches to model fitting are all about probability. Previously we looked at probability within the likelihood; for a single measured number, we have a probability density that looks like this:
$$ p(y~|~I) $$Here $y$ is the value that we observe, and $I$ is whatever information is out there that helps us specify this. This expression should be read the probability density of $y$ given $I$.
Often statistics courses will go to great lengths to distinguish discrete probabilities from continuous probability densities; here we're going to ignore the discrete case and just look at the continuous case. A probability density must satisfy the following:
$$ \int_{-\infty}^\infty p(y~|~I)dy = 1 $$That is, the probability density is normalized.
Question 1: what are the units of $p(y~|~I)$?
Question 2: what does $\int p(y~|~I)dI$ mean?
Answer 1: units of $p(y~|~I)$ are inverse of the units of $y$! So, for example, if $y$ is a distance, $p(y~|~I)$ has units 1/distance
Answer 2: it's meaningless! You should never do that integral!
As we saw in the frequentist section, we can write joint probability densities for multiple variables. This is written as:
$$ p(x, y~|~I) $$This should be read "the probability density of $x$ and $y$ given $I$". In analogy to above, it must satisfy the equation
$$ \iint p(x, y~|~I) dxdy = 1 $$Take a moment and think about this: what are the units of this joint probability?
In [2]:
fig_code.plot_venn_diagram()
One nice way to think about joint probabilities is in terms of a Venn Diagram. Areas in the Venn diagram are proportional to probabilities, and here we're just looking at the part of parameter space where $\theta$ holds.
Question: what would $p(x~|~y,I)$ mean?
If we write something like $p(x~|~y,I)$, this is known as the conditional probability of $x$ given $y$ (and $I$).
Question: Looking at the Venn Diagram, how could you express this conditional probability in terms of the joint probability?
One of the fundamental rules of conditional probability is this:
$$ p(x~|~y,I) \equiv \frac{p(x, y~|~I)}{p(y~|~I)} $$Or, rewriting this, we have the expression:
$$ p(x, y~|~I) = p(x~|~y,I) p(y~|~I) $$Of course, this is symmetric so we could also write:
$$ p(x, y~|~I) = p(y~|~x,I) p(x~|~I) $$Putting these last two together, we see
$$ p(x~|~y,I) p(y~|~I) = p(y~|~x,I) p(x~|~I) $$Which, rearranged slightly, looks like this
$$ p(x~|~y,I) = \frac{p(y~|~x,I) p(x~|~I)}{p(y~|~I)} $$This little result is known as Bayes' Rule or Bayes' Theorem, after the reverend Thomas Bayes. We'll come back to this in a bit, but first we have to get a bit philosophical.
Above we've been happily exploring the mathematical characteristics of probability; but we haven't stopped to ask what the probability is measuring. For example, when I say "the probability that this coin will land heads is 50%", what claim am I actually making?
A full answer to this would probably require a semester philosophy class, but history has distilled for us two camps of thought on this question:
frequentist or classical statistics would say that the probability is a statement about limiting frequencies in repeated trials: in the limit of many coin flips, 50% of the results will be heads.
Bayesian statistics extends this notion of probability to include degrees of belief. So, even absent a large number of (real or imaginary) trials, we could say that the 50% probability indicates our certainty of a single result: given what we know about the coin, we are 50% certain that the flip will land heads.
But enough of this philosophy stuff... how does this relate to model fitting?
Let's re-write Bayes' rule here, but we'll replace $x$ with $\theta$, representing our model parameters (e.g. $\theta = [m, b]$, the slope and intercept of the line), and we'll replace $y$ with $D$, representing our observed data (e.g. $D = \{x_i, y_i, \sigma_i\}$):
$$ p(\theta~|~D,I) = \frac{p(D~|~\theta,I)~p(\theta~|~I)}{p(D~|~I)} $$This shows us a formula for computing the probability (i.e. degree of belief) of our model parameters given the observed data.
When we write Bayes' rule this way, we're all of a sudden doing something controversial: can you see where this controversy lies?
Answer: two controversial points:
Nevertheless, applying Bayes' rule in this manner gives us a means of quantifying our knowledge of the parameters $\theta$ given observed data
We have four terms in the above expression, and we need to make sure we understand them:
This contains all the information about the answer we're interested in.
notice that it's identical to the frequentist likelihood that we discussed in the earlier section.
This contains any information we know about the model before updating it with our data $D$
This amounts basically to a normalization term: notice that $p(D~|~I) = \int p(D~|~\theta,I)p(\theta~|~I)d\theta$. You (almost) never need to actually calulate this.
In [3]:
from fig_code import linear_data_sample
x, y, dy = linear_data_sample()
plt.errorbar(x, y, dy, fmt='o');
And our model is $y_M(x; \theta) = mx + b$ where $\theta = [m, b]$.
We write Bayes' theorem as follows:
$$ p(\theta~|~D) \propto p(D~|~\theta)~p(\theta) $$And we'll maximize this expression with respect to $\theta$. Notice that the only difference between this and the maximum likelihood approach is that we multiply the likelihood by a prior before maximization. We have to specify this prior... so what should we choose?
Question: what are some potential prior choices?
But note that in the presence of data which strongly constrains the model, the effect of these priors is extremely small.
You might be wondering what the point of all this is. Now rather than maximizing the likelihood, we're maximizing the likelihood after multiplying it by a function (the prior) which is seeminly arbitrarily chosen. What's the point of it?
Well, the posterior actually has some nice properties. For example, it is possible to integrate over some (but not all) of the parameters in the posterior to find marginalized posteriors. This lets us ignore some parts of the fit, usually known as nuisance parameters.
For example, our posterior here is
$$ p(m,b~|~D) $$Say we don't care about the intercept $b$, but it is required because of some unknown bias in the measurement. We simply marginalize it out, and compute
$$ p(m~|~D) = \int_{-\infty}^\infty p(m,b~|~D)~{\rm d}b $$This marginalized posterior only gives us what we're interested in: the posterior probability of the slope given our data. But now this is becoming a hard computation. Analytically integrating our likelihood over $b$ (even with a simple prior) is hard! For this reason, Bayesian computations usually proceed using sampling methods such as Markov Chain Monte Carlo
We'll not go through the theoretical details of sampling here, but for now just say that a sampling approach bases results on drawing samples from the posterior. Once these samples are drawn, we can use the characteristics of our sample to solve our problem.
We'll use the excellend emcee package to do this. To use emcee, we'll need to define a function which computes the log-posterior:
In [4]:
def model(theta, x):
return theta[0] * x + theta[1]
def log_prior(theta):
# Use a flat prior for now
return 0
def log_likelihood(theta, x, y, dy):
y_model = model(theta, x)
return -0.5 * np.sum(np.log(2 * np.pi * dy ** 2) + (y - y_model) ** 2 / (2 * dy) ** 2)
def log_posterior(theta, x, y, dy):
return log_prior(theta) + log_likelihood(theta, x, y, dy)
In [5]:
# run this to install emcee:
# !pip install emcee
In [6]:
import emcee
np.random.seed(0)
ndim = 2 # number of parameters in the model
nwalkers = 50 # number of MCMC walkers
nburn = 1000 # "burn-in" period to let chains stabilize
nsteps = 2000 # number of MCMC steps to take
# we'll start at random locations between 0 and 2000
starting_guesses = np.random.rand(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, dy])
sampler.run_mcmc(starting_guesses, nsteps)
sample = sampler.chain # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, 2)
Now we have a sample of points drawn from the posterior! Let's plot these using Dan Foreman-Mackey's triangle
package:
In [7]:
# run this to install triangle:
# !pip install triangle_plot
In [8]:
import triangle
triangle.corner(sample, labels=['slope', 'intercept']);
We see that the slope and intercept are roughly Gaussian-distributed, and are negatively correlated. If we want to see the "best-fit" value, we can choose the maximum of the posterior to find the for Maximum a Posteriori fit, which in the Gaussian case will be very close to the mean of the samples:
In [9]:
theta_MAP = sample.mean(0)
print(theta_MAP)
Let's plot this result over the data:
In [10]:
xfit = np.linspace(0, 10)
plt.errorbar(x, y, dy, fmt='o')
plt.plot(xfit, model(theta_MAP, xfit), label='Mean Posterior Fit')
plt.legend(loc='upper left');
Note though that in the Bayesian problem, the result is not a single number. The Bayesian result is the full posterior, which our lines have sampled. One way to examine the results is to plot many lines based on these samples:
In [11]:
xfit = np.linspace(0, 10)
# Choose 200 random indices
indices = np.random.choice(np.arange(len(sample)), 200)
plt.errorbar(x, y, dy, fmt='o')
for i in indices:
plt.plot(xfit, model(sample[i], xfit), '-k', alpha=0.02)
This gives us an idea of the spread in the posterior, which is related to the error in the derived parameters.
In the afternoon breakout, we'll explore what is perhaps the simplest sampling algorithm, the Metropolis-Hastings sampler.
This sampler is a procedure which selects a chain of points which (in the long-term limit) will be a representative sample from the posterior. The procedure is surprisingly simple:
Repeat the following:
Given $\theta_i$, draw a new $\theta_{i + 1}$
Compute the acceptance ratio $$ a = \frac{p(\theta_{i + 1}~|~D,I)}{p(\theta_i~|~D,I)} $$
If $a \ge 1$, the proposal is more likely: accept the draw and add $\theta_{i + 1}$ to the chain.
If $a < 1$, then accept the point with probability $a$: this can be done by drawing a uniform random number $r$ and checking if $a < r$. If the point is accepted, add $\theta_{i + 1}$ to the chain. If not, then add $\theta_i$ to the chain again.
During the breakout, you'll have a chance to create this sampler from scratch and use it to solve the straight line fit problem!