In this tutorial we'll look at some simple, realistic, simulated data and carry out some model evaluation on it. For homework, you will extend the analysis to a more complex model of your own design.
Since you'll need to repeat these steps for the homework, we will aim to write this notebook such that, when completed, it provides a module that you can extend at home.
After pulling down the tutorial notebook, immediately make a copy. Then do not modify the original. Do your work in the copy. This will prevent the possibility of git conflicts should the version-controlled file change at any point in the future.
In [ ]:
class SolutionMissingError(Exception):
    def __init__(self):
        Exception.__init__(self, "You need to edit the code provided for it to run!")

def REPLACE_WITH_YOUR_SOLUTION():
    raise SolutionMissingError
Our data is just a list of numbers. Each one represents a measured distance $y$ between two different estimates of the center of a galaxy cluster: the location of the brightest galaxy and a centroid of the emissive, diffuse gas. The context here is that automated algorithms sometimes fail to choose the central galaxy correctly (because of image artifacts or other problems), whereas the gas centroid is more reliable but also more expensive to measure. Therefore, we'd like to use this data set to characterize the distribution of mis-centerings, so that the galaxy-based centers can be used for large samples, with the resulting errors propagated forward through future processing, e.g. weak lensing estimates.
Let's load up the data and have a look.
In [ ]:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
plt.rcParams['xtick.labelsize'] = 'large'
plt.rcParams['ytick.labelsize'] = 'large'
%matplotlib inline
# or %matplotlib notebook
In [ ]:
# Load data into global variable y.
# You'll need to make sure of the data file's location! A good guess would be:
# y = np.loadtxt('../../data/model_comparison.dat')
y = REPLACE_WITH_YOUR_SOLUTION()
In [ ]:
plt.rcParams['figure.figsize'] = (8.0, 6.0)
bins = np.linspace(0,1000,20)
plt.hist(y, bins=bins, color="lightgray")
plt.xlabel("Measured distance $y$")
plt.ylabel("Frequency");
The hypothesis we will test in this tutorial is the model outlined in the next section - but how well that model fits the data is a question we will answer in part using a test statistic.
Having understood what the data represent (and had a quick look at them), what feature in the data do you want your model to explain well?
With this in mind, what is a good test statistic to summarize the data? Spend a few minutes thinking about this and discussing it, before writing it down below. You'll then use this "pre-registered" test statistic in a Bayesian model check below.
"Registering a hypothesis" in this way is a good idea (and standard practice in clinical trials in medical research) - it helps you avoid falling into the trap of a posteriori reasoning. Here's an interesting article in Science magazine on the topic, to read at home.
In [ ]:
# Test statistic: ...
First, let's assume a simple model $H_1$, that the sampling distribution is an exponential:
Model 1: $P(y|a_1, H_1) = \frac{1}{a_1}e^{-y/a_1}$; $y\geq0$
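We'll evaluate this distribution with scipy.stats.expon throughout the class below. As a quick, purely illustrative sanity check (the numbers here are arbitrary), scipy's scale parameterization matches the formula above:
In [ ]:
# Illustrative check (not part of the exercises): st.expon with scale=a
# has pdf (1/a) * exp(-y/a) for y >= 0, i.e. exactly Model 1 above.
a_test, y_test = 200.0, 300.0
print(st.expon.pdf(y_test, scale=a_test), np.exp(-y_test / a_test) / a_test)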
We'll then want to fit that model to the data, and then evaluate the model using the methods we saw in the model evaluation lesson. These include:
- a visual check of the model's adequacy in data space
- a quantitative posterior predictive model check, using a suitable test statistic
- computing the DIC (and the effective number of parameters, $p_D$)
- computing the Bayesian evidence
Notice that each of these bulleted operations can be coded as a function of the model (e.g. a visual check of the model, the evidence of the model, and so on). That suggests that we should write a class that completely describes the model, and then a set of functions that act on model objects passed to them.
Look carefully through the SimpleModel class below. It contains methods for evaluating its various PDFs, and also for drawing samples (or replica datasets) from them.
Note that some methods have lines that need to be replaced with your own code before they will run! The first part of the tutorial will involve fixing these lines. Each time you edit the class, you'll need to re-run its cell, and also re-make any instances of the class that you are using.
When you are done reading the code, scroll down to the exercises.
In [ ]:
class SimpleModel(object):
    """
    Parameter inference and model evaluation in a simple cluster mis-centering analysis.
    """
    def __init__(self):
        # Sometimes it will be convenient to compute many log_likelihood values at once:
        self.vectorized_log_likelihood = np.vectorize(self.log_likelihood)
        # NB Higher-dimensional models would need an additional kwarg, signature="(Ndim)->()"
        # Dimensionality of the parameter space:
        self.Ndim = 1
        # Every prior needs some hyperparameters to define it:
        self.hyperparameters = REPLACE_WITH_YOUR_SOLUTION()
        return

    def __repr__(self):
        return "Simple model assuming exponential sampling distribution with mean a, and a <to be named> prior for a, with hyperparameters {} and {}.".format(self.hyperparameters[0], self.hyperparameters[1])

    def log_prior(self, a):
        """
        Evaluate the log prior PDF P(a|H)
        """
        value = REPLACE_WITH_YOUR_SOLUTION()
        return value

    def draw_samples_from_prior(self, Ns):
        """
        Draw Ns samples from the prior PDF P(a|H)
        """
        samples = REPLACE_WITH_YOUR_SOLUTION()
        return samples

    def log_likelihood(self, a):
        """
        Evaluate the log of the likelihood function L(a) = P(y|a,H)
        May need to guard against values of a<0...
        """
        value = np.sum(st.expon.logpdf(y, scale=a))
        return value

    def log_posterior(self, a):
        """
        Evaluate the log of the (unnormalized) posterior PDF P(a|y,H)

        Notes
        -----
        We'll use this with an MCMC sampler, so it should call the non-vectorized likelihood.
        """
        value = REPLACE_WITH_YOUR_SOLUTION()
        return value

    def sampling_distribution(self, yy, a):
        """
        Evaluate the sampling distribution P(yy|a,H) at a point in data space yy, given parameter a

        Notes
        -----
        This is useful for making plots of "the model" overlaid on the histogram of the data
        """
        value = st.expon.pdf(yy, scale=a)
        return value

    def generate_replica_dataset(self, a):
        """
        Draw a replica dataset y_rep from the sampling distribution P(y_rep|a,H)
        """
        Nd = len(y)
        y_rep = st.expon.rvs(size=Nd, scale=a)
        return y_rep

    def draw_samples_from_posterior(self, guess=None, nwalkers=None, nsteps=None, burn=None, thinby=None):
        """
        Use emcee to draw samples from P(a|y,H)
        """
        # Deal with unset inputs:
        if guess is None: print("You need to specify a starting point in parameter space with the `guess=` kwarg...")
        if nwalkers is None: print("You need to specify the `nwalkers=` kwarg...")
        if nsteps is None: print("You need to specify the chain length `nsteps=` kwarg...")
        if burn is None: print("You need to specify the burn-in length with the `burn=` kwarg...")
        if thinby is None: print("You need to specify the thinning factor with the `thinby=` kwarg...")

        import emcee
        # The density to sample is this model's own posterior PDF:
        lnprob = self.log_posterior
        npars = self.Ndim

        self.sampler = emcee.EnsembleSampler(nwalkers, npars, lnprob)
        # You could add e.g. threads=4 to speed things up with multiprocessing

        # Generate an ensemble of walkers within +/-1% of the guess:
        theta_0 = np.array([guess*(1.0 + 0.01*np.random.randn(npars)) for j in range(nwalkers)])
        # Note that the initial parameter array theta_0 has dimensions nwalkers x npars

        # Evolve the ensemble:
        self.sampler.run_mcmc(theta_0, nsteps)

        # Plot the raw samples:
        plt.rcParams['figure.figsize'] = (12.0, 6.0)
        plt.subplot(211)
        for j in range(nwalkers):
            plt.plot(self.sampler.chain[j,:,0], 'o', alpha=0.2)
        plt.title("Raw Markov chains")

        # Extract the chain, remove burn-in, merge, and thin:
        samples = self.sampler.chain[:, burn:, :].reshape((-1, npars))
        samples = samples[range(0, samples.shape[0], thinby), :]

        # Keep the samples with the model for future use!
        self.samples = samples
        self.Nsamples = len(samples)

        # Plot the thinned chain:
        plt.subplot(212)
        plt.plot(samples[:,0], 'o')
        plt.title("Thinned, post-burn-in chains");

        return samples
First, decide on a prior PDF for the parameter $a$, and make sure it's a proper (normalizable) distribution: we don't want to deal with improper distributions when calculating the evidence later on.
Now implement your prior PDF in the class above. For example, to adopt a uniform prior, you could just use two hyperparameters to define the upper and lower limits of a uniform distribution. Note that you need to write code both to evaluate the log prior and to draw samples from the prior.
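If you get stuck, here is a minimal sketch of the two prior methods, assuming a uniform prior with self.hyperparameters = (lower, upper); your own choice of prior (and of hyperparameter values) may well differ.
In [ ]:
# Sketch only (one possible choice): a uniform prior U(lo, hi), with (lo, hi)
# stored in self.hyperparameters, e.g. self.hyperparameters = (0.0, 1000.0).
def log_prior(self, a):
    lo, hi = self.hyperparameters
    return st.uniform.logpdf(a, loc=lo, scale=hi - lo)   # -inf outside [lo, hi]

def draw_samples_from_prior(self, Ns):
    lo, hi = self.hyperparameters
    return st.uniform.rvs(loc=lo, scale=hi - lo, size=Ns)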
Once you have completed your functions, instantiate a model and draw a dozen samples from its prior, like this:
In [ ]:
Model1 = SimpleModel()
print(Model1)
a = Model1.draw_samples_from_prior(12)
print("12 sample a values drawn from the prior: ",a)
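Before moving on to sampling, you'll also need to complete the log_posterior method. A minimal sketch (correct up to the unimportant normalizing constant), with a guard against parameter values outside the prior support (e.g. $a \leq 0$):
In [ ]:
# Sketch of the log_posterior method: log posterior = log prior + log likelihood (+ const.)
def log_posterior(self, a):
    lnp = self.log_prior(a)
    if not np.isfinite(lnp):
        return -np.inf   # outside the prior support (e.g. a <= 0), so don't call the likelihood
    return lnp + self.log_likelihood(a)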
The draw_samples_from_posterior method carries out a parameter inference with emcee, displaying its Markov chains, removing burn-in, thinning, and concatenating the chains. Since this step isn't really the point of this problem, the code is given to you, but you'll still need to experiment with the keyword argument ("kwarg") inputs (and read the code to see what they do) in order to get good results. (The suggestions in the cell below are pretty terrible.)
NB. For future reference, the MCMC samples are also stored in the Model.samples array.
If you do not already have emcee installed, you should be able to pip install it. The documentation is available here.
In [ ]:
samples = Model1.draw_samples_from_posterior(guess=1000.0, nwalkers=100, nsteps=10, burn=0, thinby=100)
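After running the cell above and inspecting the chain plots, you'll want to adjust the kwargs and re-run. Purely as an illustration of the kind of call you might converge on (these numbers are guesses, not recommendations):
In [ ]:
# Illustrative only -- tune these by looking at the chain plots above:
# samples = Model1.draw_samples_from_posterior(guess=500.0, nwalkers=8, nsteps=2000, burn=1000, thinby=10)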
It will be useful later on to know the posterior mean:
In [ ]:
pm = np.mean(Model1.samples, axis=0)
print("Posterior mean value of a = ", pm)
In [ ]:
plt.rcParams['figure.figsize'] = (8.0, 6.0)
# First the histogram of observed data, as backdrop:
plt.hist(y, bins=bins, color="lightgray", density=True, label="Observed")
# Now overlay a curve following the sampling distribution conditioned on the posterior mean value of a:
pp = REPLACE_WITH_YOUR_SOLUTION()
plt.plot(bins, pp, linestyle="dashed", color="red", label="Posterior mean model")
plt.xlabel("Measured distance $y$")
plt.ylabel("Normalized Frequency")
plt.legend();
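If you're unsure how to fill in pp above, one possibility (a hint, not the only option) uses the model's own sampling_distribution method and the posterior mean pm computed earlier:
In [ ]:
# One possible completion of the pp line above:
# pp = Model1.sampling_distribution(bins, pm)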
This kind of plot should be familiar: it's often a good idea to evaluate model adequacy in data space. You should already be able to see telling differences between a well-fitting model's sampling distribution and the data histogram.
Now, let's compare a random predicted ("replica") data set, drawn from the posterior predictive distribution, with the data. To do this we first draw a parameter value from the posterior PDF, and then generate a dataset from the sampling distribution conditioned on that value. The result is a sample from $P(y_{rep}|y)$.
In [ ]:
plt.rcParams['figure.figsize'] = (8.0, 6.0)
# First the histogram of observed data, as backdrop:
plt.hist(y, bins=bins, color="lightgray", density=True, label="Observed")
# Choose a posterior sample "at random" and generate a replica dataset, and show its histogram
j = REPLACE_WITH_YOUR_SOLUTION()
a = Model1.samples[j]
mock = REPLACE_WITH_YOUR_SOLUTION()
plt.hist(mock, bins=bins, alpha=1.0, histtype="step", color="red", density=True, label="Sample posterior prediction")
plt.xlabel("Measured distance $y$")
plt.ylabel("Normalized Frequency")
plt.legend();
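One possible way to complete the placeholders above (a hint; any fair random draw from the stored posterior samples will do):
In [ ]:
# Hint (one option): pick a posterior sample at random, then condition a replica on it.
# j = np.random.randint(len(Model1.samples))
# a = Model1.samples[j]
# mock = Model1.generate_replica_dataset(a)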
This plot is nice because it is comparing apples with apples: do mock datasets drawn from our model sampling distribution with any plausible parameter value "look like" the real data?
To best evaluate this, we want to visualize the posterior predictive distribution of replica datasets. We can do this by plotting many replica datasets drawn from the posterior predictive PDF, e.g. one for each of our posterior samples. Let's put this plot in a function, so we can re-use it later.
In [ ]:
def visual_check(Model, Nreps=None):
    plt.rcParams['figure.figsize'] = (8.0, 6.0)

    # First the histogram of observed data, as a backdrop:
    bins = np.linspace(0, 1000, 20)
    plt.hist(y, bins=bins, color="lightgray", density=True, label="Observed")

    # Compute the posterior mean parameter (vector):
    pm = np.mean(Model.samples, axis=0)

    # Make a large number of replica datasets, and overlay histograms of them all:
    if Nreps is None: Nreps = len(Model.samples)
    alpha = 5.0 / Nreps
    for j in range(Nreps):
        if j == 0:
            # Plot a dataset drawn using a = the posterior mean a, to give a good legend
            mock = REPLACE_WITH_YOUR_SOLUTION()
            plt.hist(mock, bins=bins, histtype="step", alpha=1.0, color="red", density=True, label="Sample posterior predictions")
        else:
            # Take the next posterior sample a and generate a replica dataset
            mock = REPLACE_WITH_YOUR_SOLUTION()
            plt.hist(mock, bins=bins, histtype="step", alpha=alpha, color="red", density=True)

    # Include the posterior mean model for comparison:
    pp = REPLACE_WITH_YOUR_SOLUTION()
    plt.plot(bins, pp, linestyle="dashed", color="red", label="Posterior mean model")
    plt.xlabel("Measured distance $y$")
    plt.ylabel("Normalized Frequency")
    plt.legend();
    return
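If you get stuck on the placeholders in visual_check, here is one possible set of completions (a sketch, using the class methods and the posterior mean pm already defined inside the function):
In [ ]:
# Hint for visual_check (one option):
#   for j == 0:  mock = Model.generate_replica_dataset(pm)                # posterior-mean replica, for the legend
#   otherwise:   mock = Model.generate_replica_dataset(Model.samples[j])
#   and:         pp = Model.sampling_distribution(bins, pm)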
In [ ]:
visual_check(Model1)
Now let's quantify the (in)adequacy of the fit with a quantitative posterior predictive model check. We define a test statistic $T(y)$ that summarizes a dataset $y$, and look at its posterior predictive distribution.
Code up the test statistic you came up with when looking at the data at the start of this tutorial, the one that relates to your "pre-registered" hypothesis.
In [ ]:
def T(y):
    """
    Compute a test statistic to summarize a single dataset y
    """
    T = REPLACE_WITH_YOUR_SOLUTION()
    return T
To sample the posterior predictive distribution of test statistics $P(T(y_{rep})|y)$, we need to generate replica datasets from the model:
In [ ]:
def distribution_of_T(Model):
    """
    Compute T(y_rep) for each y_rep drawn from the posterior predictive distribution,
    using the posterior samples stored in Model
    """
    TT = np.array([T(Model.generate_replica_dataset(a)) for a in Model.samples])
    return TT
We can now compare the observed value $T(y)$ with the distribution of $T(y_{\rm rep})$ values returned by distribution_of_T, and compute the corresponding p-value. And we want all of that packaged in functions of the model, so that we can re-use it later (on different models!).
First let's write a function to compute the p-value, $P(T > T(y)|y,H)$:
In [ ]:
def pvalue(Model):
    """
    Compute the posterior predictive p-value, P(T > T(y)|y,H)
    """
    p = REPLACE_WITH_YOUR_SOLUTION()
    # Print out the result and return it:
    print("The (approximate) posterior predictive probability P(T > T(y)|y,H) = {}".format(p))
    return p
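If you need a hint: one simple estimator of this p-value is the fraction of replica test statistics that exceed the observed one.
In [ ]:
# Hint (one possible estimator):
# TT = distribution_of_T(Model)
# p = np.mean(TT > T(y))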
And then write a function that plots the distribution of T, and reports the p-value:
In [ ]:
def posterior_predictive_check(Model):
    """
    Compute the posterior predictive distribution of the test statistic T(y_rep), and compare it with T(y_obs)
    """
    # Compute the distribution of T(y_rep):
    TT = distribution_of_T(Model)

    # Plot it:
    plt.rcParams['figure.figsize'] = (8.0, 6.0)
    plt.hist(TT, bins=50, histtype="step", color="red", label="$P(T(y_{\\rm rep})|y)$")

    # Overlay T(y_obs):
    plt.axvline(x=T(y), color="gray", linestyle="dashed", label="$T(y_{\\rm observed})$")
    plt.xlabel("Test statistic T(y)")
    plt.ylabel("Posterior predictive probability density")
    plt.legend();

    # Compute the p-value:
    p = pvalue(Model)
    return p
In [ ]:
p1 = posterior_predictive_check(Model1)
We saw in class that the Deviance Information Criterion is given by:
$\mathrm{DIC} = \langle D(\theta) \rangle + p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$
where the deviance $D(\theta)=-2\log P(\mathrm{data}|\theta)$, and averages $\langle\rangle$ are over the posterior.
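As a concrete sketch of how these averages can be estimated from the stored MCMC samples (the helper name below is ours, and it assumes a one-parameter model like SimpleModel; the exercise below asks you to fold the same logic into the DIC function):
In [ ]:
# Sketch (hypothetical helper, assuming a 1D parameter like SimpleModel's a):
def deviance_summary(Model):
    theta = Model.samples[:, 0]                               # posterior samples of the parameter
    D = -2.0 * Model.vectorized_log_likelihood(theta)         # D(theta) at each sample
    D_mean = np.mean(D)                                       # <D(theta)>
    D_at_mean = -2.0 * Model.log_likelihood(np.mean(theta))   # D(<theta>)
    pD = D_mean - D_at_mean
    return D_mean + pD, pD                                    # DIC, pD
# e.g. DIC1, pD1 = deviance_summary(Model1)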
In the homework we'll compare these to the corresponding values for a different model, so again we need a re-usable function that takes a model object as input and returns the DIC and $p_D$. Let's write this function, and execute it for the simple model.
In [ ]:
def DIC(Model):
    """
    Compute the Deviance Information Criterion for the given model
    """
    pD = REPLACE_WITH_YOUR_SOLUTION()
    DIC = REPLACE_WITH_YOUR_SOLUTION()
    print("The effective number of parameters fitted, pD = {:.3f}. The value of the DIC is {:.1f}".format(pD, DIC))
    return DIC, pD
In [ ]:
DIC1, pD1 = DIC(Model1)
Does your value of $p_D$ make sense?
Finally, let's estimate the Bayesian evidence for the model. To do this, note that
$P(D|H)=\int P(D|\theta,H) \, P(\theta|H) d\theta$
can be approximated by an average over samples from the prior
$P(D|H) \approx \frac{1}{m}\sum_{k=1}^m P(D|\theta_k,H)$; $\theta_k\sim P(\theta|H)$.
This approach is not especially efficient or numerically stable in general (because the likelihood typically is large in only a small fraction of the prior volume), but it will do for this exercise.
In a function, draw a large number of samples from the prior and use them to calculate the evidence. To avoid numerical over/underflows, we'll use the special scipy function logsumexp to do the sum.
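For reference, one possible shape for this calculation (a sketch; it assumes draw_samples_from_prior returns a 1D array of parameter values, as in the simple model here):
In [ ]:
# Sketch (one possible approach): simple Monte Carlo estimate of log P(D|H).
from scipy.special import logsumexp

def log_evidence_sketch(Model, N=1000):
    a = Model.draw_samples_from_prior(N)         # theta_k ~ P(theta|H)
    logL = Model.vectorized_log_likelihood(a)    # log P(D|theta_k, H)
    return logsumexp(logL) - np.log(N)           # log[(1/N) * sum_k exp(logL_k)]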
Roughly how precisely do we need to know the log Evidence, to be able to compare models?
In [ ]:
def log_evidence(Model, N=1000):
    """
    Compute the log evidence for the model, using N samples from the prior
    """
    from scipy.special import logsumexp
    log_evidence = REPLACE_WITH_YOUR_SOLUTION()
    print("The log Evidence for the model is {}".format(log_evidence))
    return log_evidence
In [ ]:
%%time
logE1 = log_evidence(Model1, N=1000)
We now have a lot of useful machinery for evaluating models for our data. We can do both visual and quantitative posterior predictive model checks to assess the adequacy of the model fit, and we can compute the DIC and the Bayesian evidence for the model, for use in model comparison. In the homework you will use this machinery to investigate a more complex model of your own design. Go to "File" -> "Download as..." -> "Python (.py)" to export the code from this notebook into a module that you can %load into your homework notebook.