In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
import emcee
from astroML.plotting import plot_mcmc
# These defaults are the ones I like, but feel free to adapt them
sns.set_style('white')
sns.set_context('talk')
These are exercises belonging to the workshop material on Bayesian statistics and inference "Bayesian Statistics: the what, the why, and the how". I leave it totally up to you whether you do any if this or not. In the Solutions notebook you will find one way to tackle the problems posed here. Most, if not all, problems can be solved in a plethora of ways. Feel free to experiment around, it will only teach you more in the end.
Here are the equations for the disease testing exercise. Plug in some numbers and play around if you like.
In [2]:
P_positive_if_ill = .99
P_positive_if_notill = .01
P_ill = 1.e-3
def P_ill_if_positive(P_positive_if_ill=.99, P_positive_if_notill=.01, P_ill=1.e-3):
"""Function that simply plugs the numbers into Bayes' theorem"""
P_notill = 1 - P_ill
return P_positive_if_ill * P_ill / (P_positive_if_ill * P_ill + P_positive_if_notill * P_notill )
In [3]:
# Feel free to play around with these numbers!
print(P_ill_if_positive(P_positive_if_ill=.99, P_positive_if_notill=.01, P_ill=1.e-3))
Questions you could ask yourself are:
In [ ]:
In [ ]:
Given the numbers in the original example, as stated above, how many positive tests do you need to get before your certainty about being ill rises above 99%? It's probably easiest to write a while-loop (or a for-loop with a break statement) in which Bayes' theorem is used to update the knowledge with the new data every time.
In [ ]:
In [ ]:
For the situation where we try to determine the mean flux of that star, while it is itself fluctuating in brightness, please recreate the sampler, try to do so from scratch, but if that is too much to ask, then find some inspiration in the notebook.
If that works well: try the following:
In [ ]:
# First, here's the code that generates the data set:
np.random.seed(42)
N = 100
mu_true, sigma_true = 1000, 10 # True flux at time of measurement is distributed following a gaussian.
F_true = stats.norm(mu_true, sigma_true).rvs(N) # Onbekende werkelijke aantallen, nu met scatter
F = stats.poisson(F_true).rvs() # Waargenomen aantallen, met errors
e = np.sqrt(F) # root-N error
# For the visual, a graph of that:
fig, ax = plt.subplots()
ax.errorbar(F, np.arange(N), xerr=e, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([F_true], 0, N, linewidth=5, alpha=0.1)
ax.set_xlabel("F");ax.set_ylabel("Measurement number");
In [ ]:
# The likelihood, prior and resulting posterior:
# Setting up the emcee run (think of dimensions, walkers, starting guesses)
# Use the tab completion help from the docstring to see how to call the run!
In [ ]:
In [ ]:
In [ ]:
x = np.array([ 0, 3, 9, 14, 15, 19, 20, 21, 30, 35,
40, 41, 42, 43, 54, 56, 67, 69, 72, 88])
y = np.array([33, 68, 34, 34, 37, 71, 37, 44, 48, 49,
53, 49, 50, 48, 56, 60, 61, 63, 44, 71])
e = np.array([ 3.6, 3.9, 2.6, 3.4, 3.8, 3.8, 2.2, 2.1, 2.3, 3.8,
2.2, 2.8, 3.9, 3.1, 3.4, 2.6, 3.4, 3.7, 2.0, 3.5])
plt.errorbar(x, y, e, fmt='.k', ecolor='gray')
plt.xlabel('X');plt.ylabel('Y');
As a reminder, we will give the definition of the likelihood, which is a bit complicated, and the definition of the posterior, we leave the definition of the prior to you! Do make sure to check if you understand the likelihood function!
In [ ]:
def log_likelihood(theta, x, y, e, sigma_B):
dy = y - theta[0] - theta[1] * x
g = np.clip(theta[2:], 0, 1) # g<0 or g>1 leads to NaNs in logarithm
logL1 = np.log(g) - 0.5 * np.log(2 * np.pi * e ** 2) - 0.5 * (dy / e) ** 2
logL2 = np.log(1 - g) - 0.5 * np.log(2 * np.pi * sigma_B ** 2) - 0.5 * (dy / sigma_B) ** 2
return np.sum(np.logaddexp(logL1, logL2))
def log_posterior(theta, x, y, e, sigma_B):
return log_prior(theta) + log_likelihood(theta, x, y, e, sigma_B)
The definition of the posterior shows what will need to go into the prior and what comes. What is it? Also define the prior as a function. Think of the allowed parameter ranges, don't cheat and look the solutions notebook.
In [ ]:
Now let's run the MCMC, in the same set-up as in the notebook, to make sure we get the weird points:
In [ ]:
ndim = 2 + len(x) # number of parameters in the model
nwalkers = 50 # number of MCMC walkers
nburn = 10000 # "burn-in" period to let chains stabilize
nsteps = 15000 # number of MCMC steps to take
# set theta near the maximum likelihood, with
from scipy import optimize
def squared_loss(theta, x=x, y=y, e=e):
dy = y - theta[0] - theta[1] * x
return np.sum(0.5 * (dy / e) ** 2)
theta1 = optimize.fmin(squared_loss, [0, 0], disp=False)
np.random.seed(42)
starting_guesses = np.zeros((nwalkers, ndim))
starting_guesses[:, :2] = np.random.normal(theta1, 1, (nwalkers, 2))
starting_guesses[:, 2:] = np.random.normal(0.5, 0.1, (nwalkers, ndim - 2))
import emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, e, 50])
sampler.run_mcmc(starting_guesses, nsteps)
sample = sampler.chain # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, ndim)
In [ ]:
fig = plt.figure()
ax = plot_mcmc(sample[:,:2].T, fig=fig, labels=[r'Intercept', r'slope'], colors='k')
ax[0].plot(sample[:, 0], sample[:, 1], '.k', alpha=0.1, ms=4);
ax[0].plot([mu_true], [sigma_true], 'o', color='red', ms=10);
Investigate the attributes that come with teh sampler object. Is there any way in which you can assess the evolution of the sampled posterior probability density? What does it look like? You probably want to investigate the evolution of that pdf along all the walkers.
In [ ]:
If you happen to run this with a different radom number seed, most likely you find no deviant points. Somehow, this particular walker took very long to get the equilibrium region where it should be sampling. Always check if your choice of burn-in period seems appropriate!
In [ ]:
We have seen the Jaynes' truncated exponential in the instruction notebook. We have found an analytical way to come up with the 95% confidence interval. Because we like numerical sampling, let's have a try and construct an MCMC sampling of this posterior and get the 95% confidence interval numerically.
In [ ]:
In [ ]:
You may notice that there are valus for $\theta$ in the sample that are higher than the common sense upper limit (likely). Investigate if your burn-in period is taken long enough. Can there be other explanations?
Clean up your sample after burn in by discarding points that have an infintely low probabiilty (the attributes of the sample that is the result of the MCMC run should have some hints) and estimate the 95% credible region again.
In [ ]:
In [ ]:
Think of your own work. Where would it all fit in? Do you maximize likelihood functions every once in a while? And if so: do you actually have prior knowledge you would like to take into account? Are there nuisance parameters that you ignore and just pick some value for because that made the statistics easier?
In [ ]:
In [ ]: