This notebook is a continuation of the previous one (Introduction to fitting). The first part is identical: make some fake data (an emission line) and fit it with a non-linear model (Gaussian + background). But this time, we place priors on the parameters and use Markov Chain Monte Carlo to solve the problem. Another notebook will do the same, but using a different MCMC sampler (`pystan`).

This notebook requires the `emcee` and `corner` packages. You can install them by running:

```
pip install emcee
pip install corner
```

In [ ]:
```
from numpy import *   # mmmmmm crunchy

# Start by defining some parameters. Change these if you like!
cont_zp = 500.0
cont_slope = 5.0
amplitude = 150.0
width = 0.5
center = 5.0

# Next, a grid of wavelength channels (assumed to have no uncertainty)
wave = linspace(0, 10, 100)

# The 'true' observations
flux = amplitude*exp(-0.5*power(wave-center, 2)/width**2) + \
       cont_zp + cont_slope*wave

# The actual observations = true observations + Poisson noise
obs_flux = random.poisson(flux)
```

In [ ]:
```
%matplotlib inline
from matplotlib.pyplot import plot, step, xlabel, ylabel, show, subplots

# Plot the true model (red) and the noisy observations (black histogram)
plot(wave, flux, 'r-')
step(wave, obs_flux, color='k')
xlabel('Wavelength (Angstroms)')
ylabel('Counts')
show()
```

Bayesian statistics is based on Bayes' theorem (for an excellent intro, see 3blue1brown's video). It's actually a very simple idea and an equally simple equation. It's *dealing* with the equation that gets complicated. Let's say we have some data $\vec{D}$. In the case of our emission line, the data is the number of counts in each wavelength bin. We have a model with some number of parameters $\vec{\theta}$. Bayes' theorem simply states:
$$P\left(\vec{\theta}\left|\vec{D}\right.\right) = \frac{P\left(\vec{D}\left|\vec{\theta}\right.\right)P\left(\vec{\theta}\right)}{P\left(\vec{D}\right)}$$
What this says is that the probability that we get a particular set of parameters given a fixed set of data (which is what we want) is proportional to the probability that we get the data given a fixed set of parameters (which we can calculate) times the probability of the parameters (the priors). The denominator is the probability that we got the data we did, which requires integrating over all possible parameters:
$$P\left(\vec{D}\right) = \int P\left(\vec{D}\left|\vec{\theta}\right.\right)P\left(\vec{\theta}\right)\ d\vec{\theta}$$
and really just ensures that the probability is normalized to 1.
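
As a concrete illustration of that normalization, here is a minimal one-parameter sketch with made-up numbers (not tied to the emission-line fit below), in which the integral over $\vec{\theta}$ becomes a simple numerical one:

```
# Hypothetical example: posterior of a single parameter `mu` given one datum.
# The evidence P(D) is just the integral of likelihood*prior over the grid.
from numpy import linspace, exp, trapz

mu_grid = linspace(-5, 5, 1001)                   # candidate values of mu
datum, sigma = 1.2, 1.0                           # one measurement and its error
prior = exp(-0.5*mu_grid**2/3.0**2)               # e.g. a broad Gaussian prior on mu
like = exp(-0.5*(datum - mu_grid)**2/sigma**2)    # P(D|mu)
evidence = trapz(like*prior, mu_grid)             # P(D)
posterior = like*prior/evidence                   # P(mu|D), normalized
print(trapz(posterior, mu_grid))                  # ~ 1.0
```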

You might wonder what the difference is between the priors $P\left(\vec{\theta}\right)$ and the likelihood $P\left(\vec{D}\left|\vec{\theta}\right.\right)$. The likelihood is what your data tells you about the parameters. The priors are constraints that are external to the data. It could be a previous experiment's result that you are incorporating into your own. It could be a purely logical constraint (e.g., the age of the universe must be greater than 0), or it could even be a *gut feeling*.

Working with the above equation isn't too bad if the number of parameters is small and the priors and likelihoods are all simple. In fact, if you use uniform priors and normally-distributed errors, you get the good old least-squares formalism. But pretty quickly you can get into a situation where the equation (and integrals of it) cannot be evaluated analytically. This is where Markov Chain Monte Carlo (MCMC) is useful.
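
As an aside (a quick sanity check, not part of the original derivation): with uniform priors the posterior is proportional to the likelihood, and for independent, normally-distributed errors $\sigma_i$,
$$-2\ln P\left(\vec{D}\left|\vec{\theta}\right.\right) = \sum_i \frac{\left(f_i - M_i\left(\vec{\theta}\right)\right)^2}{\sigma_i^2} + \mathrm{const} = \chi^2 + \mathrm{const},$$
where $M_i\left(\vec{\theta}\right)$ is the model prediction in channel $i$. Maximizing the posterior is then exactly the same as minimizing $\chi^2$, i.e., least squares.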

Our model has five parameters: `cont_zp`, `cont_slope`, `amp`, `center`, and `width`. As in the previous tutorial, the order of these parameters will be fixed. The MCMC module we will be using is called `emcee`. Let's first define the model: a function that, given the parameters, predicts the observations.

In [ ]:
```
def model(x, cont, slope, amp, center, width):
    model = amp*exp(-0.5*power(x-center, 2)/width**2) + cont + \
            slope*x
    return model
```

Next, we define the priors on the parameters. The width of the line must be positive, and we are looking for an *emission* line, so `amp` should be positive. If we don't specify anything, parameters are assumed to have a uniform (equal) probability. Emcee also wants the natural logarithm of the probability, so we call the function `lnprior()`.

In [ ]:
```
def lnprior(p):
    cont, slope, amp, center, width = p
    # zero prior probability (ln(0) = -inf) for non-physical parameters
    if width <= 0 or amp < 0:
        return -inf
    return 0
```

Next, we need the likelihood: the probability of the data given a set of parameters, which we compute with `model()`. Under our assumptions, this model will differ from the observed data because of Poisson errors. For large counts, the Poisson distribution is well-approximated by a normal distribution with variance ($\sigma^2$) equal to the counts. So, given a set of parameters $\vec{\theta}$, the probability that we measure the flux in channel $i$ to be $f_i$ is given by:
$$P\left(f_i\left|\vec{\theta}\right.\right) = N\left(M_i\left(\vec{\theta}\right), \sqrt{f_i}\right),$$
where $N$ is the normal distribution and $M_i\left(\vec{\theta}\right)$ is the model prediction in channel $i$. For the entire data set, we multiply the probabilities of all the individual channels:
$$P\left(\vec{D}\left|\vec{\theta}\right.\right) = \prod_i P\left(f_i\left|\vec{\theta}\right.\right)$$
We'll use scipy's stats module, which has the normal distribution (and its logarithm) built in. Just like the priors, emcee wants the natural logarithm of the probability, so instead of multiplying all the probabilities, we sum their logarithms.

In [ ]:
```
from scipy.stats import norm

def lnlike(p, wave, flux):
    cont, slope, amp, center, width = p
    m = model(wave, *p)
    # Gaussian approximation to the Poisson errors: variance = counts
    return sum(norm.logpdf(flux, loc=m, scale=sqrt(flux)))
```

In [ ]:
```
def lnprob(p, wave, flux):
    # ln(posterior) = ln(prior) + ln(likelihood), up to a constant
    lp = lnprior(p)
    if not isfinite(lp):
        return -inf
    return lp + lnlike(p, wave, flux)
```

Now that we have the probability all figured out, we could in principle figure out where it is maximal and compute the 1-2-3-sigma intervals. This may or may not be possible in "closed form". The more parameters and priors there are, and the more complicated the model gets, the less likely you'll be able to compute the derivatives (for optimization) and integrals (for expectation values and confidence intervals). But we can always compute these numerically, and that's what MCMC is all about. With the `emcee` module, we do this by creating a bunch of "walkers" that wander around parameter space, always seeking higher-probability regions, but also randomly sampling the space. After a certain amount of time, they wander around randomly enough that they lose track of where they started. When this happens, the steps the walkers take are a reflection of $P\left(\vec{\theta}\left|\vec{D}\right.\right)$. So inferences about the moments of $P\left(\vec{\theta}\left|\vec{D}\right.\right)$ can be made by doing statistics on the walkers' steps. For example, the expectation (mean) value of the amplitude is:
$$\left<A\right> \equiv \int P\left(\vec{\theta}\left|\vec{D}\right.\right)A\,d\vec{\theta} \simeq \mathrm{mean}\left(A_i\right),$$
where $A_i$ are the values of `amp` at each step $i$. The more steps you take, the more accurate the estimate.
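
As a quick numerical sanity check of that statement (a hypothetical example, independent of the fit), draw samples from a known distribution and compare the sample mean with the true expectation value:

```
# Sketch: the mean of samples drawn from a distribution approximates its
# expectation value, and the estimate improves with the number of samples.
from numpy import random, mean

samples = random.normal(150.0, 10.0, size=100000)  # stand-in for chain samples of `amp`
print(mean(samples))   # close to 150.0; the scatter shrinks like 1/sqrt(Nsamples)
```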

So now we create a number of walkers and start them off at random locations in parameter space. In this example, we know the true values, so we just perturb around them. When you don't know the true values, you could start at completely random locations or use other tools (like `curve_fit`) to find an initial starting point.
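
If you did not know the true values, a sketch of that `curve_fit` approach might look like this (the initial guess below is just made up by eye and is purely illustrative):

```
# Sketch (hypothetical starting guess): use scipy's curve_fit to locate a
# reasonable starting point for the walkers when the true values are unknown.
from scipy.optimize import curve_fit

p_guess = [400., 1., 100., 4., 1.]   # rough guess: cont_zp, cont_slope, amp, center, width
popt, pcov = curve_fit(model, wave, obs_flux, p0=p_guess)
print(popt)   # could be used in place of `ptrue` below to seed the walkers
```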

In [ ]:
```
Nwalker, Ndim = 50, 5
ptrue = array([500., 5.0, 150., 5.0, 0.5])
# add a random vector 0.1 times the true vector to the true vector
p0 = [ptrue + 0.1*random.randn(Ndim)*ptrue for i in range(Nwalker)]
```

We now have `Nwalker` initial points. We can run the emcee sampler, giving it the `lnprob` function and any extra arguments it needs. The `run_mcmc` function takes the initial starting points and how many steps you want each walker to take. It returns the last position, probability, and state of each walker.

In [ ]:
```
import emcee
sampler = emcee.EnsembleSampler(Nwalker, Ndim, lnprob, args=(wave, obs_flux))
pos, prob, state = sampler.run_mcmc(p0, 500)
```

Let's plot the chains of the walkers for each parameter to see how they are doing (the horizontal lines mark the true values):

In [ ]:
```
fig, ax = subplots(5, 1)
res = [ax[i].plot(sampler.chain[:, :, i].T, '-', color='k', alpha=0.3) for i in range(5)]
res = [ax[i].axhline(ptrue[i]) for i in range(5)]
```

The walkers need some time to converge (the "burn-in"). We can reset the sampler to throw away those early steps and then run for longer, starting from the walkers' last positions:

In [ ]:
```
sampler.reset()
pos, prob, state = sampler.run_mcmc(pos, 1000)
```

The `sampler` object has an attribute `flatchain`, in which the chains of all the walkers are combined. This gives us `Nwalker*Nsteps` samples from the posterior. We can get the best-fit values and their errors by doing statistics on the chains:

In [ ]:
```
print(mean(sampler.flatchain, axis=0))  # best-fit, well really expectation value
print(std(sampler.flatchain, axis=0))   # errors
# deviation from true parameters in units of standard error
print((mean(sampler.flatchain, axis=0)-ptrue)/std(sampler.flatchain, axis=0))
print(cov(sampler.flatchain.T))         # covariance matrix
```
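
As a quick visual check (a sketch using only pieces defined above), we can overplot the model evaluated at the posterior-mean parameters on the data:

```
# Sketch: compare the observations with the model at the posterior-mean parameters.
pmean = mean(sampler.flatchain, axis=0)
step(wave, obs_flux, color='k')
plot(wave, model(wave, *pmean), 'r-')
xlabel('Wavelength (Angstroms)')
ylabel('Counts')
show()
```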

Finally, we can visualize the *posterior* probabilities of the parameters, as well as the covariances between them, by plotting a `corner` plot.

In [ ]:
```
import corner
rmp = corner.corner(sampler.flatchain, labels=['cont_zp', 'cont_slope', 'amp', 'cent', 'width'],
                    truths=[cont_zp, cont_slope, amplitude, center, width])
```