emcee
- Wintrich - 4 free parameters (gaussian prior for density)Application of the Bayesian approach with 4 free parameters (erosion rate, time exposure, density and inheritance) to the Wintrich site, using the emcee package.
Instead of using a uniform prior for density, we set a gaussian.
For more info about the method used, see the notebook Inference_Notes.
This notebook has the following external dependencies:
In [1]:
import math
import csv
import numpy as np
import pandas as pd
from scipy import stats
from scipy import optimize
import emcee
import yaml
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
clr_plt = sns.color_palette()
An implementation of the mathematical model used for predicting profiles of 10Be concentrations is available in the models
Python module (see the notebook Models). The 10Be model assumes that the soil density is constant along the depth profile and that the inheritance is the same for the whole sample of 10Be concentration vs. depth.
In [2]:
import models
In [3]:
profile_data = pd.read_csv('profiles_data/wintrich_10Be_profile_data.csv',
index_col='sample',
delim_whitespace=True,
comment='#',
quoting=csv.QUOTE_NONNUMERIC, quotechar='\"',
na_values=[-9999],
dtype={'depth': 'f', 'depth_g-cm-2': 'f',
'C': 'f', 'std': 'f'})
profile_data
Out[3]:
Import other parameters
In [4]:
with open('profiles_data/wintrich_10Be_settings.yaml') as f:
wintrich_settings = yaml.load(f)
wintrich_settings
Out[4]:
Make a plot of the dataset
In [5]:
sns.set_context('notebook')
fig, ax = plt.subplots()
profile_data.plot(
y='depth', x='C', xerr='std',
kind="scatter", ax=ax, rot=45
)
ax.invert_yaxis()
priors
and bounds
tuples!
In [6]:
param_names = ('erosion rate', 'time exposure',
'soil density', 'inheritance')
loc
and scale
arguments of scipy.stats.uniform
are the lower bound and the range, respectively), excepted for the soil density (normal distribution).
In [7]:
eps_prior = stats.uniform(loc=0., scale=3e-3)
t_prior = stats.uniform(loc=0., scale=1.5e6)
rho_prior = stats.norm(loc=2.2, scale=0.2)
inh_prior = stats.uniform(loc=0., scale=1.5e5)
priors = eps_prior, t_prior, rho_prior, inh_prior
param_priors = pd.Series(priors, index=param_names)
lower_qtl
, upper_qtl
) of the prior distribution. Choose the extreme quantiles (0, 1) if the distribution is uniform. It will be used for plotting purpose and also for constrained optimization (see below).
In [8]:
def get_bounds(f, lower_qtl=0., upper_qtl=1.):
return f.ppf(lower_qtl), f.ppf(upper_qtl)
eps_bounds = get_bounds(eps_prior, 0, 1)
t_bounds = get_bounds(t_prior, 0, 1)
rho_bounds = get_bounds(rho_prior, 0.01, 0.99)
inh_bounds = get_bounds(inh_prior, 0, 1)
bounds = eps_bounds, t_bounds, rho_bounds, inh_bounds
param_bounds = pd.DataFrame(
np.array(bounds), columns=('min', 'max'), index=param_names
)
param_bounds
Out[8]:
In [9]:
fig, axes = plt.subplots(1, 4, figsize=(13, 3))
for ax, p, b, name in zip(axes.flatten(),
param_priors.values,
param_bounds.values,
param_names):
xmin, xmax = b
eps = 0.1 * (xmax - xmin)
x = np.linspace(xmin - eps, xmax + eps, 200)
d = p.pdf(x)
ax.plot(x, d)
ax.fill(x, d, alpha=0.4)
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)
plt.setp(ax, ylim=(0, None), yticklabels=[],
xlabel=name)
plt.subplots_adjust()
m
.
In [10]:
def lnprior(m):
lps = [p.logpdf(v) for (p, v) in zip(priors, m)]
if not np.all(np.isfinite(lps)):
return -np.inf
return np.sum(lps)
In [11]:
def lnlike(m):
eps, t, rho, inh = m
mean = models.C_10Be(profile_data['depth'].values,
eps, t, rho, inh,
wintrich_settings['P_0'])
var = profile_data['std']**2
lngauss = -0.5 * np.sum(
np.log(2. * np.pi * var) +
(profile_data['C'] - mean)**2 / var
)
return lngauss
In [12]:
def lnprob(m):
lp = lnprior(m)
if not np.isfinite(lp):
return -np.inf
return lp + lnlike(m)
In our case, the from of the PPD may be highly anisotropic ; it may present high (negative or positive) correlations between its parameters (erosion rate, exposure time, soil density, inheritance). Usually, these relationships are even non-linear.
It is therefore important to use a robust algorithm to sample this complex PPD. The Affine Invariant Markov chain Monte Carlo (MCMC) Ensemble sampler implemented in the emcee package will be more efficient in our case than the standard MCMC algorithms such as the Metropolis-Hasting method.
The emcee
sampler allows to define multiple, independent walkers. This requires to first set the initial position of each walker in the parameter space. As shown in the emcee
documentation, the author suggests initializing the walkers in a tiny Gaussian ball around the maximum likelihood result. We can obtain the maximum likelihood estimate by applying an optimization algorithm such as one of those implemented in the scipy.optimize module. Note that non-linear optimization usually requires to provide an initial guess.
Given our complex, non-linear, and potentially flat form of the PDD in some areas of the parameter space, we prefer to set the initial positions of the walkers as the maximum likelihood estimates resulting from randomly chosing initial guesses in the parameter space according to the prior probability density. Note that we use a constrained optimization algorithm to ensure that the initial positions are within the bounds defined above.
In [13]:
n_params, n_walkers = len(param_names), 100
# randomly choose initial guesses according to the prior
init_guesses = np.array(
[p.rvs(size=n_walkers) for p in priors]
).T
# perform non-linear optimization from each initial guess
op_lnlike = lambda *args: -lnlike(*args)
init_walkers = np.empty_like(init_guesses)
for i, g in enumerate(init_guesses):
res = optimize.minimize(op_lnlike, g,
method='L-BFGS-B',
bounds=bounds)
init_walkers[i] = res['x']
We show below the initial guesses and the initial positions of the walkers in a scatter plot.
In [14]:
df_init_guesses = pd.DataFrame(init_guesses, columns=param_names)
df_init_walkers = pd.DataFrame(init_walkers, columns=param_names)
def scatter_pos(xcol, ycol, ax):
df_init_guesses.plot(
kind='scatter', x=xcol, y=ycol,
alpha=0.5, ax=ax, color=clr_plt[0], label='init guesses'
)
df_init_walkers.plot(
kind='scatter', x=xcol, y=ycol,
alpha=0.8, ax=ax, color=clr_plt[2], label='init walkers'
)
legend = ax.legend(frameon=True, loc='lower right')
legend.get_frame().set_facecolor('w')
plt.setp(ax, xlim=param_bounds.loc[xcol],
ylim=param_bounds.loc[ycol])
fig, ax = plt.subplots(3, 2, figsize=(12,12))
scatter_pos('erosion rate', 'time exposure', ax[0][0])
scatter_pos('soil density', 'time exposure', ax[0][1])
scatter_pos('erosion rate', 'soil density', ax[1][0])
scatter_pos('inheritance', 'time exposure', ax[1][1])
scatter_pos('erosion rate', 'inheritance', ax[2][0])
scatter_pos('soil density', 'inheritance', ax[2][1])
We can then setup the emcee
sampler and run the MCMC for n_steps
iterations starting from the initial positions defined above.
In [15]:
sampler = emcee.EnsembleSampler(n_walkers, n_params, lnprob)
n_steps = 500
sampler.run_mcmc(init_walkers, n_steps)
mcmc_samples = pd.DataFrame(sampler.flatchain,
columns=param_names)
Let's plot the trace of the MCMC iterations.
In [16]:
sample_plot_range = slice(None)
axes = mcmc_samples[sample_plot_range].plot(
kind='line', subplots=True,
figsize=(10, 8), color=clr_plt[0]
)
Try plotting only the firsts samples (e.g., sample_plot_range = slice(0, 1000)
). We see that thanks to the initial positions of the walkers, the emcee
sampler quickly starts exploring the full posterior distribution. The “burn-in” period is small and we can therefore set a small value for nburn
below.
In [17]:
nburn = 100
mcmc_kept_samples = pd.DataFrame(
sampler.chain[:, nburn:, :].reshape((-1, n_params)),
columns=param_names
)
We can visualize the sampled posterior propbability density by joint plots of the MCMC samples.
In [18]:
def jointplot_density(xcol, ycol, lim_qtl=(0.01, 0.99)):
lower_qtl, upper_qtl = lim_qtl
p = sns.jointplot(
xcol, ycol,
data=mcmc_kept_samples,
xlim=(mcmc_kept_samples[xcol].quantile(lower_qtl),
mcmc_kept_samples[xcol].quantile(upper_qtl)),
ylim=(mcmc_kept_samples[ycol].quantile(lower_qtl),
mcmc_kept_samples[ycol].quantile(upper_qtl)),
joint_kws={'alpha': 0.02}
)
jointplot_density('erosion rate', 'time exposure')
jointplot_density('soil density', 'time exposure')
jointplot_density('erosion rate', 'soil density')
jointplot_density('erosion rate', 'inheritance')
jointplot_density('inheritance', 'time exposure')
jointplot_density('soil density', 'inheritance')
Given the samples, it is straightforward to characterize the posterior porbability density and estimate its moments.
In [19]:
mcmc_kept_samples.mean()
Out[19]:
In [20]:
max_ppd = sampler.lnprobability[:, nburn:].reshape((-1)).argmax()
mcmc_kept_samples.iloc[max_ppd]
Out[20]:
In [21]:
percentiles = np.array([2.5, 5, 25, 50, 75, 95, 97.5])
mcmc_kept_samples.quantile(percentiles * 0.01)
Out[21]:
We finally plot the nucleide concentration profiles (blue dots: data w/ error bars, red area: filled interval between the 1% and 99% percentiles of the MCMC samples).
In [22]:
fig, ax = plt.subplots()
# plot MCMC samples 1-99 percentiles
depths = np.linspace(profile_data['depth'].min(),
profile_data['depth'].max(),
50)
c_mcmc_samples = np.array(
[models.C_10Be(depths, eps, t, rho, inh,
P_0=wintrich_settings['P_0'])
for (eps, t, rho, inh) in mcmc_kept_samples.values]
)
c01, c99 = np.percentile(c_mcmc_samples, (1, 99), axis=0)
ax.fill_betweenx(depths, c01, c99, color='r', alpha=0.3)
ax.plot(c01, depths, color='r')
ax.plot(c99, depths, color='r')
# plot the profile data with error bars
profile_data.plot(
y='depth', x='C', xerr='std',
kind="scatter", ax=ax, rot=45, zorder=5
)
ax.invert_yaxis()
Author: B. Bovy, Ulg
This work is licensed under a Creative Commons Attribution 4.0 International License.