Bayesian approach with emcee - Test case - 3 free parameters

An example of applying the Bayesian approach with 3 free parameters (erosion rate, exposure time and inheritance), using the emcee package.

For more info about the method used, see the notebook Inference_Notes.

This example (a test case) is based on a generic dataset of 10Be concentration vs. depth, which is drawn from a distribution with given "true" parameters.

This notebook has the following external dependencies:


In [1]:
import math

import numpy as np
import pandas as pd
from scipy import stats
from scipy import optimize
import emcee
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
clr_plt = sns.color_palette()

The mathematical (deterministic, forward) model

An implementation of the mathematical model used for predicting profiles of 10Be concentrations is available in the models Python module (see the notebook Models). The 10Be model assumes that the soil density is constant along the depth profile and that the inheritance is the same for the whole sample of 10Be concentration vs. depth.
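For orientation, a commonly used form of such a depth-profile model is sketched below. It is an assumption that models.C_10Be follows this form. The symbols $P_i$ (surface production rate of production pathway $i$), $\Lambda_i$ (attenuation length of that pathway) and $\lambda$ (radioactive decay constant of 10Be) do not appear in this notebook and are introduced here for illustration only; $\varepsilon$, $t$, $\rho$ and $C_{\mathrm{inh}}$ stand for eps, t, rho and inh.

```latex
C(z, t) = C_{\mathrm{inh}}\, e^{-\lambda t}
  + \sum_i \frac{P_i}{\lambda + \rho\,\varepsilon / \Lambda_i}\,
    e^{-\rho z / \Lambda_i}
    \left[ 1 - e^{-(\lambda + \rho\,\varepsilon / \Lambda_i)\, t} \right]
```

In this form the constant soil density $\rho$ and the single inheritance term $C_{\mathrm{inh}}$ reflect the two assumptions stated above.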


In [2]:
import models

The data

The dataset is generated using the following parameter values. eps is the erosion rate, t is the exposure time, rho is the soil density and inh is the inheritance.


In [3]:
# the true parameters 
eps_true = 5e-4
t_true = 3e5
rho_true = 2.
inh_true = 5e4

# depths and sample size
depth_minmax = [50, 500]
N = 8

# perturbations
err_magnitude = 20.
err_variability = 5.

The gendata Python module is used to generate the dataset (see the notebook Datasets).


In [4]:
import gendata

profile_data = gendata.generate_dataset(
    models.C_10Be,
    (eps_true, t_true, rho_true, inh_true),
    zlimits=depth_minmax,
    n=N,
    err=(err_magnitude, err_variability)
)

Make a plot of the dataset


In [5]:
sns.set_context('notebook')

fig, ax = plt.subplots()

profile_data.plot(
    y='depth', x='C', xerr='std',
    kind="scatter", ax=ax, rot=45
)

ax.invert_yaxis()


The statistical model used for computing the posterior probability density PPD

Below we define a data model by the tuple m = (eps, t, inh). It corresponds to a given location in the 3-d parameter space. The soil density is assumed to be known.

  • Define the parameter names. It is important to use the same order to further define the priors and bounds tuples!

In [6]:
param_names = 'erosion rate', 'time exposure', 'inheritance'
  • Create a pd.Series with the true parameter values. It will be used for plotting purposes.

In [7]:
param_true = pd.Series((eps_true, t_true, inh_true), index=param_names)
  • Define the prior probability distribution for each free parameter. Here the uniform distribution is used, with given bounds (the loc and scale arguments of scipy.stats.uniform are the lower bound and the width of the interval, respectively).

In [8]:
eps_prior = stats.uniform(loc=0., scale=1e-3)
t_prior = stats.uniform(loc=0., scale=8e5)
inh_prior = stats.uniform(loc=0., scale=1.5e5)

priors = eps_prior, t_prior, inh_prior
param_priors = pd.Series(priors, index=param_names)
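As a quick check of the loc/scale convention, the sketch below builds a uniform prior on a hypothetical interval [2, 5] (values chosen for illustration only, unrelated to the priors above) and inspects its support and log-density.

```python
import numpy as np
from scipy import stats

# Hypothetical prior, for illustration: uniform on [2, 5]
# (loc is the lower bound, scale is the width of the interval).
prior = stats.uniform(loc=2.0, scale=3.0)

# The extreme quantiles give the bounds of the support.
lower, upper = prior.ppf(0.0), prior.ppf(1.0)

# Inside the support the log-density is log(1 / scale);
# outside the support it is -inf.
inside = prior.logpdf(3.0)
outside = prior.logpdf(6.0)

print(lower, upper, inside, outside)
```

The -inf value outside the support is what allows a log-prior function to reject out-of-bounds models.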
  • Define (min, max) bounds for each free parameter. They are given by the lower and upper quantiles (lower_qtl, upper_qtl) of the prior distribution. Choose the extreme quantiles (0, 1) if the distribution is uniform. The bounds will be used for plotting purposes and also for constrained optimization (see below).

In [9]:
def get_bounds(f, lower_qtl=0., upper_qtl=1.):
    return f.ppf(lower_qtl), f.ppf(upper_qtl)

eps_bounds = get_bounds(eps_prior, 0, 1)
t_bounds = get_bounds(t_prior, 0, 1)
inh_bounds = get_bounds(inh_prior, 0, 1)

bounds = eps_bounds, t_bounds, inh_bounds
param_bounds = pd.DataFrame(
    np.array(bounds), columns=('min', 'max'), index=param_names
)

param_bounds


Out[9]:
                min         max
erosion rate      0       0.001
time exposure     0  800000.000
inheritance       0  150000.000
  • Plot the prior probability density for each parameter.

In [10]:
fig, axes = plt.subplots(1, 3, figsize=(13, 3))

for ax, p, b, name in zip(axes.flatten(),
                          param_priors.values,
                          param_bounds.values,
                          param_names):
    xmin, xmax = b
    eps = 0.1 * (xmax - xmin)
    x = np.linspace(xmin - eps, xmax + eps, 200)
    d = p.pdf(x)
    ax.plot(x, d)
    ax.fill(x, d, alpha=0.4)
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)
    plt.setp(ax, ylim=(0, None), yticklabels=[],
             xlabel=name)

plt.subplots_adjust()


  • Define a function that returns the (logarithm of the) prior probability density for a given data model m.

In [11]:
def lnprior(m):
    lps = [p.logpdf(v) for (p, v) in zip(priors, m)]
    if not np.all(np.isfinite(lps)):
        return -np.inf
    return np.sum(lps)
  • Define a function that returns the log-likelihood. It is an $n$-dimensional Gaussian ($n$ nuclide concentrations sampled along the depth profile) with the mean given by the forward model and the variance given by the error estimated from the measurements of the nuclide concentration of each sample. This Gaussian implies that (1) the error on each measurement is random, (2) the sampled nuclide concentrations are measured independently of each other, (3) the forward model - i.e., the deterministic model that predicts the nuclide concentration profile - represents the real physics and (4) the values of the non-free parameters of the forward model - e.g., nuclide surface production rate, attenuation lengths... - are exactly known.

In [12]:
def lnlike(m):
    eps, t, inh = m
    
    mean = models.C_10Be(profile_data['depth'].values,
                         eps, t, rho_true, inh)
    var = profile_data['std']**2
    
    lngauss = -0.5 * np.sum(
        np.log(2. * np.pi * var) +
        (profile_data['C'] - mean)**2 / var
    )   
    
    return lngauss
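Because the measurements are treated as independent, the Gaussian expression used in lnlike is the sum of one normal log-density per sample. The sketch below checks this on hypothetical numbers (obs, pred and sigma are made up for illustration and stand in for the measured concentrations, the forward-model predictions and the measurement errors).

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, model predictions and 1-sigma errors.
obs = np.array([10.0, 8.0, 5.5])
pred = np.array([9.5, 8.2, 5.0])
sigma = np.array([0.5, 0.4, 0.6])

# Same expression as in lnlike.
var = sigma**2
lngauss = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (obs - pred)**2 / var)

# Sum of independent normal log-densities, one per sample.
lnsum = np.sum(stats.norm.logpdf(obs, loc=pred, scale=sigma))

print(lngauss, lnsum)
```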
  • Define a function that returns the log-posterior probability density, according to Bayes' theorem.

In [13]:
def lnprob(m):
    lp = lnprior(m)
    if not np.isfinite(lp):
        return -np.inf
    return lp + lnlike(m)
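Written out, the three functions above correspond to the log form of Bayes' theorem. The notation is introduced here for illustration: $d$ denotes the data, $C_i$ and $\sigma_i$ the measured concentration and its error at depth $z_i$, and $C_{\mathrm{mod}}$ the forward model.

```latex
\ln p(m \mid d) = \ln p(m) + \ln p(d \mid m) - \ln p(d),
\qquad
\ln p(d \mid m) = -\frac{1}{2} \sum_{i=1}^{n}
  \left[ \ln\!\left(2 \pi \sigma_i^2\right)
  + \frac{\left(C_i - C_{\mathrm{mod}}(z_i; m)\right)^2}{\sigma_i^2} \right]
```

The evidence term $\ln p(d)$ does not depend on $m$, so lnprob omits it; the value returned is the log-posterior up to an additive constant, which is sufficient for MCMC sampling.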

Sampling the posterior probability density using MCMC

In our case, the form of the PPD may be highly anisotropic; it may present strong (negative or positive) correlations between its parameters (erosion rate, exposure time, inheritance). Usually, these relationships are even non-linear.

It is therefore important to use a robust algorithm to sample this complex PPD. The affine-invariant Markov chain Monte Carlo (MCMC) ensemble sampler implemented in the emcee package is expected to be more efficient in our case than standard MCMC algorithms such as the Metropolis-Hastings method.
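To give an idea of how an affine-invariant ensemble sampler handles correlated parameters, here is a minimal NumPy sketch of the "stretch move" of Goodman and Weare. It is not emcee's implementation: this serial variant updates the walkers one at a time, and the function name stretch_move_sweep, the toy target and all numerical settings are hypothetical choices made for illustration.

```python
import numpy as np

def stretch_move_sweep(walkers, log_prob_fn, rng, a=2.0):
    """Apply one stretch move to each walker in turn (serial update)."""
    n_walkers, n_dim = walkers.shape
    log_p = np.array([log_prob_fn(w) for w in walkers])
    for k in range(n_walkers):
        # Pick a different walker at random.
        j = rng.integers(n_walkers - 1)
        if j >= k:
            j += 1
        # Draw z with density proportional to 1/sqrt(z) on [1/a, a].
        z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
        # Propose a point on the line through walkers j and k.
        proposal = walkers[j] + z * (walkers[k] - walkers[j])
        log_p_new = log_prob_fn(proposal)
        # Accept with probability min(1, z**(n_dim - 1) * p_new / p_old).
        log_ratio = (n_dim - 1) * np.log(z) + log_p_new - log_p[k]
        if np.log(rng.random()) < log_ratio:
            walkers[k] = proposal
            log_p[k] = log_p_new
    return walkers

# Toy target (hypothetical): a strongly correlated 2-d Gaussian.
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
cov_inv = np.linalg.inv(cov)

def log_prob(x):
    return -0.5 * x @ cov_inv @ x

rng = np.random.default_rng(42)
walkers = rng.normal(size=(20, 2))
chain = []
for _ in range(2000):
    walkers = stretch_move_sweep(walkers, log_prob, rng)
    chain.append(walkers.copy())

# Discard the first 500 sweeps and pool all walkers.
samples = np.concatenate(chain[500:])
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```

Each proposal lies on the line joining two walkers, so the step direction and scale adapt to the shape of the ensemble without any tuning of a proposal covariance.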

The emcee sampler lets us define multiple walkers. This requires first setting the initial position of each walker in the parameter space. The emcee documentation suggests initializing the walkers in a tiny Gaussian ball around the maximum likelihood result. We can obtain the maximum likelihood estimate by applying an optimization algorithm such as one of those implemented in the scipy.optimize module. Note that non-linear optimization usually requires an initial guess.

Given the complex, non-linear form of the PPD, which may be flat in some areas of the parameter space, we prefer to set the initial positions of the walkers to the maximum likelihood estimates obtained from initial guesses chosen randomly in the parameter space according to the prior probability density. Note that we use a constrained optimization algorithm to ensure that the initial positions are within the bounds defined above.


In [14]:
n_params, n_walkers = len(param_names), 100

# randomly choose initial guesses according to the prior
init_guesses = np.array(
    [p.rvs(size=n_walkers) for p in priors]
).T

# perform bounded non-linear optimization from each initial guess
op_lnlike = lambda *args: -lnlike(*args)
init_walkers = np.empty_like(init_guesses)

for i, g in enumerate(init_guesses):
    res = optimize.minimize(op_lnlike, g,
                            method='TNC',
                            bounds=bounds)
    init_walkers[i] = res['x']
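The multi-start strategy can be illustrated on a toy problem. The sketch below is self-contained and uses a hypothetical one-dimensional objective with two local minima; the names objective, toy_bounds, starts and solutions are illustrative and unrelated to the 10Be model.

```python
import numpy as np
from scipy import optimize

# Hypothetical objective with two local minima (near x = -1 and x = +1)
# inside the box [-2, 2]; the one near x = -1 is the global minimum.
def objective(x):
    return (x[0]**2 - 1.0)**2 + 0.1 * x[0]

toy_bounds = [(-2.0, 2.0)]

# Random starting points drawn uniformly within the bounds.
rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=(20, 1))

# One bounded optimization per starting point.
solutions = np.array([
    optimize.minimize(objective, x0, method='TNC', bounds=toy_bounds).x
    for x0 in starts
])

# The best of all local solutions.
best = solutions[np.argmin([objective(s) for s in solutions])]
print(best)
```

Different starting points may end in different local optima, which is why the resulting walker positions can be spread over the parameter space instead of being concentrated around a single point.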

We show below the initial guesses and the initial positions of the walkers in a scatter plot.


In [15]:
df_init_guesses = pd.DataFrame(init_guesses, columns=param_names)
df_init_walkers = pd.DataFrame(init_walkers, columns=param_names)

def scatter_pos(xcol, ycol, ax):
    df_init_guesses.plot(
        kind='scatter', x=xcol, y=ycol,
        alpha=0.5, ax=ax, color=clr_plt[0], label='init guesses'
    )
    df_init_walkers.plot(
        kind='scatter', x=xcol, y=ycol,
        alpha=0.8, ax=ax, color=clr_plt[1], label='init walkers'
    )
    legend = ax.legend(frameon=True, loc='lower right')
    legend.get_frame().set_facecolor('w')
    plt.setp(ax, xlim=param_bounds.loc[xcol],
             ylim=param_bounds.loc[ycol])

fig, ax = plt.subplots(2, 2, figsize=(12,12))
scatter_pos('erosion rate', 'time exposure', ax[0][0])
scatter_pos('inheritance', 'time exposure', ax[0][1])
scatter_pos('erosion rate', 'inheritance', ax[1][0])


We can then set up the emcee sampler and run the MCMC for n_steps iterations starting from the initial positions defined above.


In [16]:
sampler = emcee.EnsembleSampler(n_walkers, n_params, lnprob)

n_steps = 500
sampler.run_mcmc(init_walkers, n_steps)

mcmc_samples = pd.DataFrame(sampler.flatchain,
                            columns=param_names)

Let's plot the trace of the MCMC iterations. The red lines show the true values.


In [17]:
sample_plot_range = slice(None)

axes = mcmc_samples[sample_plot_range].plot(
    kind='line', subplots=True,
    figsize=(10, 8), color=clr_plt[0]
)

for i, ax in enumerate(axes):
    ax.axhline(param_true.iloc[i], color='r')


Try plotting only the first samples (e.g., sample_plot_range = slice(0, 1000)). We see that thanks to the initial positions of the walkers, the emcee sampler quickly starts exploring the full posterior distribution. The “burn-in” period is short and we can therefore set a small value for nburn below.


In [18]:
nburn = 100

mcmc_kept_samples = pd.DataFrame(
    sampler.chain[:, nburn:, :].reshape((-1, n_params)),
    columns=param_names
)
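The slicing and reshaping used to discard the burn-in can be checked on a small hypothetical array. The sketch assumes the (n_walkers, n_steps, n_params) layout implied by the slicing above; the array contents and sizes are made up for illustration.

```python
import numpy as np

# Hypothetical chain with the layout (n_walkers, n_steps, n_params).
n_walkers, n_steps, n_params = 4, 10, 3
chain = np.arange(n_walkers * n_steps * n_params, dtype=float).reshape(
    n_walkers, n_steps, n_params
)

# Drop the first nburn steps of every walker, then stack the walkers.
nburn = 3
kept = chain[:, nburn:, :].reshape((-1, n_params))

# Rows are ordered walker by walker: the first (n_steps - nburn) rows
# belong to walker 0, the next (n_steps - nburn) rows to walker 1, etc.
print(kept.shape)
```

Flattening the log-probabilities with the same slice and the same ordering keeps each row of the sample table aligned with its log-probability value.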

We can visualize the sampled posterior probability density by joint plots of the MCMC samples. The red lines show the true values.


In [19]:
def jointplot_density(xcol, ycol):
    p = sns.jointplot(
        x=xcol, y=ycol,
        data=mcmc_kept_samples,
        xlim=(mcmc_kept_samples[xcol].min(),
              mcmc_kept_samples[xcol].max()),
        ylim=(mcmc_kept_samples[ycol].min(),
              mcmc_kept_samples[ycol].max()),
        joint_kws={'alpha': 0.02}
    )
    p.ax_joint.axhline(param_true.loc[ycol], color='r')
    p.ax_joint.axvline(param_true.loc[xcol], color='r')

jointplot_density('erosion rate', 'time exposure')
jointplot_density('inheritance', 'time exposure')
jointplot_density('erosion rate', 'inheritance')


Given the samples, it is straightforward to characterize the posterior probability density and estimate its moments.

  • the PPD mean (if the PPD is strictly Gaussian, it also corresponds to the MAP (Maximum A Posteriori) estimate and therefore to the most probable model)

In [20]:
mcmc_kept_samples.mean()


Out[20]:
erosion rate          0.000501
time exposure    473647.518335
inheritance       42126.413518
dtype: float64
  • the sample that has the maximum PPD value (i.e., the most probable sampled model)

In [21]:
max_ppd = sampler.lnprobability[:, nburn:].reshape((-1)).argmax()
mcmc_kept_samples.iloc[max_ppd]


Out[21]:
erosion rate          0.000583
time exposure    709824.280697
inheritance       34142.977543
Name: 17722, dtype: float64
  • the PPD quantiles (useful for delineating the Bayesian confidence intervals or credible intervals for each free parameter)

In [22]:
percentiles = np.array([2.5, 5, 25, 50, 75, 95, 97.5])
mcmc_kept_samples.quantile(percentiles * 0.01)


Out[22]:
       erosion rate  time exposure   inheritance
0.025      0.000100  149625.445560  28447.767450
0.050      0.000166  161825.305593  30267.274391
0.250      0.000482  296840.392443  36311.079530
0.500      0.000558  480580.654183  41826.092822
0.750      0.000579  646676.229006  48212.655540
0.950      0.000599  772858.397411  54249.254262
0.975      0.000606  787380.588030  55854.865212
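The three summaries listed above can be reproduced on synthetic data. In the sketch below, samples and log_post are hypothetical stand-ins for the kept MCMC samples and their log-posterior values; they are drawn from a known two-parameter Gaussian so that the results can be checked.

```python
import numpy as np

# Hypothetical posterior samples and their log-posterior values
# (up to an additive constant).
rng = np.random.default_rng(1)
center = np.array([1.0, -2.0])
width = np.array([0.5, 2.0])
samples = rng.normal(loc=center, scale=width, size=(20000, 2))
log_post = -0.5 * np.sum(((samples - center) / width)**2, axis=1)

# Posterior mean of each parameter.
post_mean = samples.mean(axis=0)

# Most probable sampled model.
best_sample = samples[np.argmax(log_post)]

# 95% credible interval from the 2.5% and 97.5% quantiles.
ci_low, ci_high = np.percentile(samples, [2.5, 97.5], axis=0)

# Fraction of samples falling inside the interval, per parameter.
inside = np.mean((samples >= ci_low) & (samples <= ci_high), axis=0)
print(post_mean, best_sample, ci_low, ci_high, inside)
```

For a skewed or multi-modal posterior, the mean, the most probable sample and the median can differ noticeably, which is why it is useful to report all of them together with the intervals.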

Finally, we plot the nuclide concentration profiles (blue dots: data with error bars, red line: true profile, grey lines: randomly chosen profiles from MCMC samples).


In [23]:
fig, ax = plt.subplots()

# plot the profile data with error bars
profile_data.plot(
    y='depth', x='C', xerr='std',
    kind="scatter", ax=ax, rot=45
)

# plot 100 randomly chosen profiles from MCMC samples
depths = np.linspace(profile_data['depth'].min(),
                     profile_data['depth'].max(),
                     100)

for i in np.random.randint(len(mcmc_kept_samples), size=100):
    eps, t, inh = mcmc_kept_samples.iloc[i]
    c = models.C_10Be(depths, eps, t, rho_true, inh)
    ax.plot(c, depths, color='grey', alpha=0.1)

# plot the true profile
c_true = models.C_10Be(depths, eps_true, t_true,
                       rho_true, inh_true)
ax.plot(c_true, depths, color='r', label='true model')

ax.invert_yaxis()


The plot shows that the uncertainty on the fitted model parameters has only a small influence on the shape of the profile of nuclide concentration vs. depth. This illustrates the non-linearity of that dependence.

Information about this notebook

Author: B. Bovy, Ulg


This work is licensed under a Creative Commons Attribution 4.0 International License.