In [ ]:
%pylab inline
You will need to pip install these two packages, if you haven't already.
In [ ]:
import emcee
import corner
This first section is a walk-through that introduces the terminology and defines some useful functions. Read it carefully, evaluate each cell, then move on to the following sections, where you will repeat this basic study with variations.
We will study one of the simplest models, the straight line (which still has many subtleties): $$ f(x) = m x + b $$ This exercise is based on a more complicated example in the emcee documentation.
First, specify our true parameters. A dictionary is overkill with only two parameters, but is useful when tracking a larger number.
In [ ]:
truth = dict(m=-1, b=5)
Plot the true model over the range [0, 10]:
In [ ]:
x_lo, x_hi = 0, 10
x_plot = np.linspace(x_lo, x_hi, 100) # Really only need 2 points here
y_plot = truth['m'] * x_plot + truth['b']
plt.plot(x_plot, y_plot, 'r-', label='Truth')
plt.grid()
plt.legend();
Next, generate a "toy Monte Carlo" sample of n_data = 50 observations using the true model:
In [ ]:
# Reproducible "randomness"
gen = np.random.RandomState(seed=123)
# Random x values, in increasing order.
n_data = 50
x_data = np.sort(gen.uniform(low=0, high=10, size=n_data))
# True y values without any noise.
y_true = truth['m'] * x_data + truth['b']
# Add Gaussian noise.
rms_noise = 0.5
y_error = np.full(n_data, rms_noise)
y_data = y_true + gen.normal(scale=y_error)
Plot the generated sample with the model superimposed:
In [ ]:
plt.errorbar(x_data, y_data, yerr=y_error, fmt='.k', label='Toy MC')
plt.plot(x_plot, y_plot, 'r-', label='Truth')
plt.grid()
plt.legend();
We chose a linear problem so that we would have an alternate method (weighted least squares linear regression) to calculate the expected answer, for cross checks. For details on this method, see:
https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Weighted_linear_least_squares
The appropriate weights here are "inverse variances":
In [ ]:
weights = y_error ** -2
scikit-learn provides a convenient class for (weighted) linear least squares:
In [ ]:
import sklearn.linear_model
linear_model = sklearn.linear_model.LinearRegression()
linear_model.fit(x_data.reshape(-1, 1), y_data, sample_weight=weights)
fit = dict(m=linear_model.coef_[0], b=linear_model.intercept_)
print(fit)
The weighted linear least squares method also provides a covariance matrix for the best-fit parameters, but unfortunately sklearn does not calculate this for you. However, we will soon have something even better using MCMC.
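If you do want those parameter errors now, here is one possible sketch (the names A and fit_cov below are just for illustration) that computes the covariance by hand from the weighted normal equations, $C = (A^T W A)^{-1}$, where $A$ is the design matrix with columns $(x, 1)$ and $W = \mathrm{diag}(w_i)$:
In [ ]:
# Sketch: covariance of (m, b) from the weighted normal equations.
A = np.vstack([x_data, np.ones(n_data)]).T
fit_cov = np.linalg.inv(np.dot(A.T, weights[:, np.newaxis] * A))
# 1-sigma uncertainties on (m, b).
print(np.sqrt(np.diag(fit_cov)))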
Add this fit to our plot:
In [ ]:
y_fit = fit['m'] * x_plot + fit['b']
In [ ]:
plt.errorbar(x_data, y_data, yerr=rms_noise, fmt='.k', label='Toy MC')
plt.plot(x_plot, y_plot, 'r-', label='Truth')
plt.plot(x_plot, y_fit, 'b--', label='Fit')
plt.grid()
plt.legend();
Define our priors $P(\theta)$ on the two fit parameters $\theta$ with a function to calculate $\log P(\theta)$. The function should return -np.inf for any parameter values that are forbidden.
Start with simple "top hat" priors on both parameters:
In [ ]:
def log_prior(theta):
    m, b = theta
    if -5 < m < 0.5 and 0 < b < 10:
        return 0  # log(1) = 0
    else:
        return -np.inf  # some parameter is outside its allowed range.
Define the likelihood $P(D|\theta)$ of our data $D$ given the parameters $\theta$ with a function that returns $\log P(D|\theta)$:
In [ ]:
def log_likelihood(theta):
    m, b = theta
    # Calculate the predicted value at each x.
    y_pred = m * x_data + b
    # Calculate the residuals at each x.
    dy = y_data - y_pred
    # Assume Gaussian errors on each data point.
    return -0.5 * np.sum(dy ** 2) / rms_noise ** 2
Combine the priors and likelihood to calculate the log of the posterior $\log P(\theta|D)$:
In [ ]:
def log_posterior(theta):
    lp = log_prior(theta)
    if np.isfinite(lp):
        # Only calculate likelihood if necessary.
        lp += log_likelihood(theta)
    return lp
Initialize a Markov chain sampler using the emcee package:
In [ ]:
ndim = 2 # the number of parameters
nwalkers = 100 # bigger is better
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
Give each "walker" a unique starting point in parameter space. The advice on this in the docs is:
The best technique seems to be to start in a small ball around the a priori preferred position. Don’t worry, the walkers quickly branch out and explore the rest of the space.
Pick something close to, but not exactly at, the truth for our a priori preferred position.
In [ ]:
a_priori = np.array([-1.1, 5.5])
ball_size = 1e-4
initial = [a_priori + ball_size * gen.normal(size=ndim) for i in range(nwalkers)]
Generate 500 samples for each walker (and ignore the sampler return value).
In [ ]:
sampler.run_mcmc(initial, 500);
The sampler object has a chain attribute with our generated chains for each walker:
In [ ]:
sampler.chain.shape
Plot the chains from the first 3 walkers for each parameter. Notice how they all start from the same small ball around our a priori initial point.
In [ ]:
plt.plot(sampler.chain[0, :, 0])
plt.plot(sampler.chain[1, :, 0])
plt.plot(sampler.chain[2, :, 0]);
In [ ]:
plt.plot(sampler.chain[0, :, 1])
plt.plot(sampler.chain[1, :, 1])
plt.plot(sampler.chain[2, :, 1]);
Combine all the walkers into a single array of generated samples. You could trim off some initial "burn-in" here, but I don't recommend it (see lecture notes):
In [ ]:
samples = sampler.chain[:, :, :].reshape((-1, ndim))
In [ ]:
samples.shape
Consecutive samples are certainly correlated, but the ensemble of all samples should have the correct distribution for inferences. Make a corner plot to see this distribution.
In [ ]:
fig = corner.corner(samples, labels=['m', 'b'], truths=[truth['m'], truth['b']])
Each sample represents a possible model. Pick a few at random to superimpose on our data:
In [ ]:
shuffled = samples[gen.permutation(len(samples))]
shuffled.shape
In [ ]:
plt.errorbar(x_data, y_data, yerr=rms_noise, fmt='.k', label='Toy MC')
plt.plot(x_plot, y_plot, 'r-', label='Truth')
for m, b in shuffled[:100]:
    y_sample = m * x_plot + b
    plt.plot(x_plot, y_sample, 'b', alpha=0.05)
plt.grid()
plt.legend();
Create a function that uses the samples array to plot the distribution of predicted values for $y(x)^8$ given $x$ and prints the (central) 68% and 95% confidence limits on this prediction. (Hint: np.percentile)
In [ ]:
def marginal(x):
    # ...
Note that you can use this technique to make inferences about any non-linear function of the parameters. I chose $y(x)^8$ here so that the result would have an asymmetric distribution.
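In case you get stuck, here is one possible sketch (it uses the samples array defined above and is certainly not the only valid approach); the percentile pairs [16, 84] and [2.5, 97.5] give the central 68% and 95% intervals:
In [ ]:
def marginal(x):
    # Evaluate y(x)**8 for every sample (m, b) in the chain.
    y8 = (samples[:, 0] * x + samples[:, 1]) ** 8
    plt.hist(y8, bins=50)
    plt.xlabel('y({0})**8'.format(x))
    # Central 68% and 95% confidence limits.
    lo68, hi68 = np.percentile(y8, [16, 84])
    lo95, hi95 = np.percentile(y8, [2.5, 97.5])
    print('68% CL: [{0:.3f}, {1:.3f}]'.format(lo68, hi68))
    print('95% CL: [{0:.3f}, {1:.3f}]'.format(lo95, hi95))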
In the example above, the RMS error is known and the same for every point. In this exercise, you will repeat the whole walk-through study above, but with known but non-constant (aka heteroscedastic) errors for each point. Copy each cell into this section and then edit it, rather than editing the cells above directly, so you can compare the two versions later on.
For your new errors use:
In [ ]:
y_error = gen.uniform(low=0.1, high=1.0, size=n_data)
Before starting, take a moment to review the results above and predict how they will change...
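As a hint (this is just one possible sketch, building on the functions above), the key change is that each residual must now be weighted by its own reported error in the likelihood:
In [ ]:
def log_likelihood(theta):
    m, b = theta
    # Residuals at each x.
    dy = y_data - (m * x_data + b)
    # Each point now has its own Gaussian error, so weight each residual individually.
    return -0.5 * np.sum((dy / y_error) ** 2)
As before, the normalization term is dropped since it does not depend on $\theta$.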
In this exercise, the errors reported for each point are uncertain: the true errors are different from the reported errors by an unknown constant scale factor. This introduces a new parameter, the scale factor, that we will call s:
In [ ]:
truth = dict(m=-1, b=5, s=1.5)
Regenerate the data using this larger error, while keeping the reported error the same:
In [ ]:
y_data = y_true + gen.normal(scale=y_error * truth['s'])
The model now has three parameters instead of two. Repeat the previous exercise with this additional parameter. Before starting, take a moment to review your results above and predict how they will change...
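As a hint for the new likelihood (again just one possible sketch), note that the Gaussian normalization now depends on the parameter $s$, so the $\log$ term can no longer be dropped:
In [ ]:
def log_likelihood(theta):
    m, b, s = theta
    dy = y_data - (m * x_data + b)
    # The true errors are the reported errors scaled by the unknown factor s.
    sigma = s * y_error
    # Keep the normalization term because it depends on the parameter s.
    return -0.5 * np.sum((dy / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))
You will also need to extend log_prior so that it only allows positive values of s (for example, a top-hat in 0 < s < 10), and give each walker a third starting coordinate.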
Repeat the walk-through above (constant known error) using the pymc3 package instead of emcee. Be sure to use pymc3 and not pymc2 (since they are quite different). You will find that pymc3 has a different approach that takes some getting used to, but it is very helpful to have both packages in your toolkit.
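To get you started, here is a minimal sketch of the constant-error model in pymc3 (assuming a 3.x version of pymc3; the model and variable names below are arbitrary). Instead of a hand-written log_posterior, you declare the priors and likelihood inside a model context and let pymc3 choose a sampler:
In [ ]:
import pymc3 as pm

with pm.Model() as linear_fit:
    # Top-hat priors, matching log_prior above.
    m = pm.Uniform('m', lower=-5, upper=0.5)
    b = pm.Uniform('b', lower=0, upper=10)
    # Gaussian likelihood with the known constant error.
    y_obs = pm.Normal('y_obs', mu=m * x_data + b, sd=rms_noise, observed=y_data)
    # Draw samples (pymc3 picks a sampler, such as NUTS, automatically).
    trace = pm.sample(1000)
# Stack the traces into an array for corner.corner(), as before.
samples_pm = np.column_stack([trace['m'], trace['b']])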