In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
from scipy import stats

import emcee
from astroML.plotting import plot_mcmc


# These defaults are the ones I like, but feel free to adapt them
sns.set_style('white')
sns.set_context('talk')

Exercises on Bayesian statistics and inference

These exercises accompany the workshop material on Bayesian statistics and inference, "Bayesian Statistics: the what, the why, and the how". It is entirely up to you whether you do any of this or not. In the Solutions notebook you will find one way to tackle the problems posed here. Most, if not all, of the problems can be solved in many different ways. Feel free to experiment; it will only teach you more in the end.

1) Testing positively to a serious disease

Here are the equations for the disease testing exercise. Plug in some numbers and play around if you like.



In [2]:

    
P_positive_if_ill = .99
P_positive_if_notill = .01
P_ill = 1.e-3


def P_ill_if_positive(P_positive_if_ill=.99, P_positive_if_notill=.01, P_ill=1.e-3):
    """Function that simply plugs the numbers into Bayes' theorem"""
    P_notill = 1 - P_ill
    return P_positive_if_ill * P_ill / (P_positive_if_ill * P_ill + P_positive_if_notill * P_notill )



In [3]:

    
# Feel free to play around with these numbers!
print(P_ill_if_positive(P_positive_if_ill=.99, P_positive_if_notill=.01, P_ill=1.e-3))

0.09016393442622951

Questions you could ask yourself are:

What number would the pharmaceutical industry like to see optimized compared to the example?
Most people guess a much higher probability than 0.09 when given the question. What aspect are they forgetting about, and how/where does this appear in Bayes' theorem?
What happens when you forget the terms about false positives?
Investigate the dependence of the posterior probability graphically.
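
Before plotting anything, it can help to tabulate the posterior for a range of prevalences. The sketch below is only one possible starting point. The helper name `posterior_ill` and the grid of prevalences are illustrative assumptions, not part of the workshop material.

```python
def posterior_ill(p_pos_ill=0.99, p_pos_notill=0.01, p_ill=1e-3):
    """Bayes' theorem for P(ill | positive test)."""
    evidence = p_pos_ill * p_ill + p_pos_notill * (1 - p_ill)
    return p_pos_ill * p_ill / evidence

# Illustrative grid of prevalences (an assumption, pick your own)
prevalences = [1e-4, 1e-3, 1e-2, 1e-1, 0.5]
posteriors = [posterior_ill(p_ill=p) for p in prevalences]

for p, post in zip(prevalences, posteriors):
    print(f"P(ill) = {p:7.4f}  ->  P(ill | positive) = {post:.4f}")
```

The same loop works for the other two inputs; feeding the lists to `plt.plot` (perhaps with a logarithmic x-axis) gives the graphical version.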



In [ ]:



In [ ]:

2) Updating one's knowledge of a system

Given the numbers in the original example, as stated above, how many positive tests do you need to get before your certainty about being ill rises above 99%? It's probably easiest to write a while-loop (or a for-loop with a break statement) in which Bayes' theorem is used to update the knowledge with the new data every time.
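
If you are unsure how to start, the loop could look roughly like the sketch below. It assumes that repeated tests are independent given your true health status, so that the posterior after one test can serve as the prior for the next. The names `update`, `n_tests` and `history` are made up for this illustration.

```python
def update(prior, p_pos_ill=0.99, p_pos_notill=0.01):
    """One application of Bayes' theorem after a positive test."""
    numerator = p_pos_ill * prior
    return numerator / (numerator + p_pos_notill * (1 - prior))

p_ill = 1e-3       # prior before any test, as in the example above
n_tests = 0
history = [p_ill]  # keep track of how the knowledge evolves

# Assumption: every test is positive and tests are independent
while p_ill <= 0.99:
    p_ill = update(p_ill)  # yesterday's posterior is today's prior
    n_tests += 1
    history.append(p_ill)

print(n_tests, history)
```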



In [ ]:



In [ ]:

3) A world of Gaussians

For the situation in which we try to determine the mean flux of a star that is itself fluctuating in brightness, please recreate the sampler. Try to do so from scratch; if that is too much to ask, find some inspiration in the notebook.

If that works well, try the following:

Play with the number of walkers. Does it make much of a difference if you use only one (hint: emcee will tell you!)?
Instead of a flat prior, use a very wide Gaussian prior for the mean, while keeping the prior on the standard deviation flat. Try this with the Gaussian centered on the "correct" value, with Gaussians centered a small but non-zero number of standard deviations away from the correct value, and with one that is far off. Discuss the outcomes! It may help to make the intrinsic distribution of fluxes wider (and/or the sample really small).
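
A Gaussian prior on the mean combined with a flat prior on the standard deviation could be written along the following lines. This is a sketch under assumptions: the function name `log_prior_gaussian_mean`, the prior center `mu0` and the prior width `tau` are all hypothetical choices for illustration, and the flat prior on the standard deviation is simply taken to be constant for all positive values.

```python
import math

def log_prior_gaussian_mean(theta, mu0=1000.0, tau=500.0):
    """Log-prior: Gaussian N(mu0, tau) on the mean, flat on sigma > 0."""
    mu, sigma = theta
    if sigma <= 0:
        return -math.inf  # a zero or negative spread is not allowed
    return -0.5 * ((mu - mu0) / tau) ** 2 - math.log(tau * math.sqrt(2 * math.pi))

# Same parameter point, priors centered at increasing distance from it
centered = log_prior_gaussian_mean((1000.0, 10.0))
off_center = log_prior_gaussian_mean((1000.0, 10.0), mu0=1500.0)  # 1 tau away
far_off = log_prior_gaussian_mean((1000.0, 10.0), mu0=5000.0)     # 8 tau away
forbidden = log_prior_gaussian_mean((1000.0, -1.0))

print(centered, off_center, far_off, forbidden)
```

Adding this to the log-likelihood gives the log-posterior; how much the prior center matters should depend on how informative the data are compared to `tau`.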



In [ ]:

    
# First, here's the code that generates the data set:
np.random.seed(42)  
N = 100  
mu_true, sigma_true = 1000, 10  # True flux at time of measurement is distributed following a gaussian.

F_true = stats.norm(mu_true, sigma_true).rvs(N)  # Unknown true counts, now with scatter
F = stats.poisson(F_true).rvs()  # Observed counts, with errors
e = np.sqrt(F)  # root-N error

# For the visual, a graph of that:
fig, ax = plt.subplots()
ax.errorbar(F, np.arange(N), xerr=e, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([F_true], 0, N, linewidth=5, alpha=0.1)
ax.set_xlabel("F");ax.set_ylabel("Measurement number");



In [ ]:

    
# The likelihood, prior and resulting posterior:

# Setting up the emcee run (think of dimensions, walkers, starting guesses)

# Use the tab completion help from the docstring to see how to call the run!



In [ ]:



In [ ]:

4) The linear fit with outliers, with outliers.

In the notebook we saw how to fit a slope and intercept to points with errors in the vertical direction. We reproduce the data set here:



In [ ]:

    
x = np.array([ 0,  3,  9, 14, 15, 19, 20, 21, 30, 35,
              40, 41, 42, 43, 54, 56, 67, 69, 72, 88])
y = np.array([33, 68, 34, 34, 37, 71, 37, 44, 48, 49,
              53, 49, 50, 48, 56, 60, 61, 63, 44, 71])
e = np.array([ 3.6, 3.9, 2.6, 3.4, 3.8, 3.8, 2.2, 2.1, 2.3, 3.8,
               2.2, 2.8, 3.9, 3.1, 3.4, 2.6, 3.4, 3.7, 2.0, 3.5])

plt.errorbar(x, y, e, fmt='.k', ecolor='gray')
plt.xlabel('X');plt.ylabel('Y');

As a reminder, we give the definition of the likelihood, which is a bit complicated, and the definition of the posterior. We leave the definition of the prior to you! Do make sure you understand the likelihood function!



In [ ]:

    
def log_likelihood(theta, x, y, e, sigma_B):
    dy = y - theta[0] - theta[1] * x
    g = np.clip(theta[2:], 0, 1)  # g<0 or g>1 leads to NaNs in logarithm
    logL1 = np.log(g) - 0.5 * np.log(2 * np.pi * e ** 2) - 0.5 * (dy / e) ** 2
    logL2 = np.log(1 - g) - 0.5 * np.log(2 * np.pi * sigma_B ** 2) - 0.5 * (dy / sigma_B) ** 2
    return np.sum(np.logaddexp(logL1, logL2))

def log_posterior(theta, x, y, e, sigma_B):
    return log_prior(theta) + log_likelihood(theta, x, y, e, sigma_B)
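
To check your understanding of the likelihood, it can help to evaluate its two terms for a single data point. The sketch below does that for a point close to the line and for one far away. The helper `point_log_terms` and all numbers in it are illustrative assumptions; only the formulas are copied from `log_likelihood` above.

```python
import numpy as np

def point_log_terms(dy, err, sigma_B, g):
    """The 'good point' and 'background' log-terms for one data point."""
    log_good = np.log(g) - 0.5 * np.log(2 * np.pi * err ** 2) - 0.5 * (dy / err) ** 2
    log_bad = (np.log(1 - g) - 0.5 * np.log(2 * np.pi * sigma_B ** 2)
               - 0.5 * (dy / sigma_B) ** 2)
    return log_good, log_bad

# Illustrative values: error bar 3, background width 50, weight g = 0.5
results = {}
for dy in (1.0, 30.0):  # residual of 1 (near the line) and 30 (far away)
    log_good, log_bad = point_log_terms(dy, err=3.0, sigma_B=50.0, g=0.5)
    results[dy] = (log_good, log_bad, np.logaddexp(log_good, log_bad))
    print(f"dy = {dy:5.1f}: good = {log_good:8.2f}, background = {log_bad:8.2f}, "
          f"combined = {results[dy][2]:8.2f}")
```

For the nearby point the narrow Gaussian dominates the sum; for the distant point the broad background Gaussian takes over, which is how the model limits the pull of outliers on the fit.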

The definition of the posterior shows what needs to go into the prior and what should come out of it. What is it? Also define the prior as a function. Think of the allowed parameter ranges, and don't cheat by looking at the solutions notebook.



In [ ]:

Now let's run the MCMC, in the same set-up as in the notebook, to make sure we get the weird points:



In [ ]:

    
ndim = 2 + len(x)  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 10000  # "burn-in" period to let chains stabilize
nsteps = 15000  # number of MCMC steps to take


# Start the walkers near the least-squares (maximum-likelihood) fit, with some random scatter

from scipy import optimize

def squared_loss(theta, x=x, y=y, e=e):
    dy = y - theta[0] - theta[1] * x
    return np.sum(0.5 * (dy / e) ** 2)

theta1 = optimize.fmin(squared_loss, [0, 0], disp=False)


np.random.seed(42)
starting_guesses = np.zeros((nwalkers, ndim))
starting_guesses[:, :2] = np.random.normal(theta1, 1, (nwalkers, 2))
starting_guesses[:, 2:] = np.random.normal(0.5, 0.1, (nwalkers, ndim - 2))

import emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, e, 50])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, ndim)



In [ ]:

    
fig = plt.figure()
ax = plot_mcmc(sample[:,:2].T, fig=fig, labels=[r'Intercept', r'slope'], colors='k')
ax[0].plot(sample[:, 0], sample[:, 1], '.k', alpha=0.1, ms=4);
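# Mark the least-squares estimate theta1 for reference (mu_true and sigma_true belong to exercise 3)
ax[0].plot([theta1[0]], [theta1[1]], 'o', color='red', ms=10);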

Investigate the attributes that come with the sampler object. Is there any way in which you can assess the evolution of the sampled posterior probability density? What does it look like? You probably want to investigate the evolution of that pdf for all the walkers.
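
Once you have found the log-probability per walker and step, a quick numerical check can complement a plot. The sketch below works on a plain array of shape `(nwalkers, nsteps)`; as far as we know that is what `sampler.lnprobability` returns, while `sampler.get_log_prob()` in emcee 3 returns the transposed shape, so check the documentation of your version. The function `find_lagging_walkers`, its `tolerance` and the synthetic stand-in data are assumptions made for this illustration.

```python
import numpy as np

def find_lagging_walkers(log_prob, nburn, tolerance=10.0):
    """Indices of walkers whose median log-probability after burn-in lies
    more than `tolerance` below the median over the whole ensemble.

    log_prob is expected to have shape (nwalkers, nsteps).
    """
    after_burn = log_prob[:, nburn:]
    walker_median = np.median(after_burn, axis=1)
    return np.flatnonzero(walker_median < np.median(walker_median) - tolerance)

# Synthetic stand-in for a real run: 5 walkers, 200 steps
rng = np.random.default_rng(0)
demo_log_prob = rng.normal(-80.0, 2.0, size=(5, 200))
demo_log_prob[3] -= 40.0  # pretend walker 3 is stuck in a low-probability region

lagging = find_lagging_walkers(demo_log_prob, nburn=100)
print(lagging)
```

Plotting each row of the array against step number shows the same thing visually: a walker that has not converged stays well below the others.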



In [ ]:

If you happen to run this with a different random number seed, you will most likely find no deviant points. Somehow, one particular walker took very long to reach the equilibrium region where it should be sampling. Always check whether your choice of burn-in period seems appropriate!



In [ ]:

5) Truncated exponentials the numeric way

We have seen Jaynes' truncated exponential in the instruction notebook, where we found an analytical way to come up with the 95% interval. Because we like numerical sampling, let's construct an MCMC sampling of this posterior and get the 95% credible region numerically.



In [ ]:



In [ ]:

You may notice that there are values for $\theta$ in the sample that are higher than the common-sense upper limit. Investigate whether your burn-in period is long enough. Can there be other explanations?

Clean up your sample after burn-in by discarding points that have an infinitely low probability (the attributes of the sampler that result from the MCMC run should give some hints) and estimate the 95% credible region again.
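
The clean-up step could look roughly like the sketch below. It does not run emcee; it builds a synthetic stand-in for a flattened chain so that the masking logic can be tried in isolation. Assumed here (not stated in this notebook): the model is $p(x|\theta) = e^{\theta - x}$ for $x \geq \theta$, the smallest data point is 10, there are 3 data points, and the prior is flat, so the posterior rises towards $\min(x)$ and the shortest credible region is one-sided.

```python
import numpy as np

rng = np.random.default_rng(1)
x_min, n_data = 10.0, 3  # assumed: smallest data point and number of points

# Stand-in for valid posterior samples: theta = x_min + ln(U) / n_data
u = 1.0 - rng.uniform(size=100_000)  # uniform on (0, 1], avoids log(0)
theta_ok = x_min + np.log(u) / n_data
log_p_ok = n_data * (theta_ok - x_min)

# Stand-in for stray points above min(x), where the posterior is zero
theta_bad = x_min + rng.uniform(0.1, 1.0, size=50)
log_p_bad = np.full(theta_bad.shape, -np.inf)

theta = np.concatenate([theta_ok, theta_bad])
log_p = np.concatenate([log_p_ok, log_p_bad])

# The actual clean-up: keep only points with a finite log-probability
keep = np.isfinite(log_p)
theta_clean = theta[keep]

# One-sided 95% region: from the 5th percentile up to the largest sample
lower = np.percentile(theta_clean, 5)
upper = theta_clean.max()
print(f"95% credible region: ({lower:.2f}, {upper:.2f})")
```

With a real run you would replace `theta` and `log_p` by the flattened chain and the matching log-probabilities after burn-in.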



In [ ]:



In [ ]:

If you have some time left...

Think of your own work. Where would it all fit in? Do you maximize likelihood functions every once in a while? And if so: do you actually have prior knowledge you would like to take into account? Are there nuisance parameters that you ignore and just pick some value for because that made the statistics easier?



In [ ]:



In [ ]: