An example of applying the Bayesian approach with 2 free parameters only (erosion rate and exposure time), using the PyMC package.
For more info about the method used, see the notebook Inference_Notes.
This example (a test case) is based on a generic dataset of 10Be concentration vs. depth, which is drawn from a distribution with given "true" parameters.
This notebook has the following external dependencies:
In [1]:
import math
import numpy as np
import pandas as pd
import pymc
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The mathematical model is available in the models
Python module (see the notebook Models).
In [2]:
import models
The dataset is generated using the following parameter values
In [3]:
# the true parameters
true_erosion = 1e-3
true_exposure = 1e5
# fixed parameters
fixed_density = 2.0
fixed_inheritance = 0.
# depths and sample size
depth_minmax = [50, 500]
N = 30
# perturbations
err_magnitude = 20.
err_variability = 5.
The gendata
Python module is used to generate the dataset (see the notebook Datasets).
In [4]:
import gendata
profile_data = gendata.generate_dataset(
models.C_10Be,
[true_erosion, true_exposure, fixed_density, fixed_inheritance],
zlimits=depth_minmax,
n=N,
err=[err_magnitude, err_variability]
)
profile_data
Out[4]:
PyMC allows to define the statistical model in a very nice way. The "nodes" of the model - i.e., the elements that contribute to the Posterior probability distribution - must be defined first.
In [5]:
erosion_rate = pymc.Uniform('erosion_rate', 0., 4e-3)
exposure_time = pymc.Uniform('exposure_time', 0., 4e5)
In [6]:
@pymc.deterministic(plot=False)
def mprofile(erosion=erosion_rate, exposure=exposure_time):
"""
Modelled concentration profile.
"""
return models.C_10Be(profile_data['depth'].values,
erosion, exposure,
fixed_density, fixed_inheritance)
In [7]:
d = pymc.Normal('data_distribution', mprofile, 1./profile_data['std']**2,
value=profile_data['C'].values, observed=True)
PyMC builds the model automatically from all nodes.
In [8]:
smodel = pymc.Model({'erosion_rate': erosion_rate,
'exposure_time': exposure_time,
'mprofile': mprofile,
'data_distribution': d})
PyMC use the prior distributions to generate a random initial value for each parameter (further needed)
In [9]:
print erosion_rate.value
print exposure_time.value
The log-PPD value that correspond to the initial values is given by the logp
attribute
In [10]:
smodel.logp
Out[10]:
:class:pymc.MAP
allows to "clone" a model and set all parameters to their maximum a posteriori values. It uses the scipy's optimize package to maximize the log-PPD over the parameter space.
In [11]:
maxppd_smodel = pymc.MAP(smodel)
maxppd_smodel.fit(method='fmin', iterlim=15000, tol=1e-6, verbose=0)
See the resulting optimized values assigned to each parameter (and compare it to the true values defined above, it should be closer than the initial values)
In [12]:
print erosion_rate.value
print exposure_time.value
The log-PPD value at the maximum
In [13]:
maxppd_smodel.logp_at_max
Out[13]:
:class:pymc.MCMC
allows to sample the parameter space following the PPD with MCMC. It is a good idea to use the results of the log-PPD maximization as initial values for MCMC. Metropolis-Hastings is the default samplerused by MCMC.
In [14]:
mcmc_smodel = pymc.MCMC(smodel)
Generate samples, discard the first $n$ samples (burn-in, burn
argument) and keep only one sample every $t$ samples (thinning, thin
argument) in order to obtain idependent samples from the PPD.
In [15]:
mcmc_smodel.sample(50000, burn=5000, thin=40)
Plot some characteristics of the samples for each parameter in order to evaluate the quality of sampling.
In [16]:
pymc.Matplot.plot(mcmc_smodel)
Print a summary about the distribution of the samples for each parameter. It shows the values of various moments and percentiles of the PPD, the 95% Highest Posterior Density (HPD) interval (see here for an explanation) and an estimation of the Monte-Carlo integration error.
In [17]:
mcmc_smodel.summary()
Show the marginal distributions of the generated samples for each parameter. Continuous lines are obtained from the samples by Kernel Density Estimation (KDE). The result is an estimation of the marginal PPD of each parameter. The red lines indicate the true values.
In [18]:
sns.set(style="darkgrid")
sns.set_context('paper')
fig, axes = plt.subplots(ncols=2, figsize=(8, 2))
d = sns.distplot(mcmc_smodel.trace('exposure_time')[:],
hist=False, kde_kws={"shade": True},
ax=axes[0])
axes[0].axvline(true_exposure, color='r')
axes[0].set_xlabel('exposure time [yr]')
sns.distplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
hist=False, kde_kws={"shade": True},
ax=axes[1])
axes[1].axvline(true_erosion * 1e4, color='r')
axes[1].set_xlabel('erosion rate [m Myr-1]')
plt.setp(axes, yticks=[])
Out[18]:
Join plot of the generated samples with marginals, obtained by KDE. This represents an approximation of the PPD. The red lines indicate the true values.
In [19]:
sns.set(style="white")
sns.set_context('paper')
p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
mcmc_smodel.trace('exposure_time')[:],
xlim=[5, 15], ylim=[6e4, 1.4e5],
kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
xlabel='erosion rate [m Myr-1]',
ylabel='exposure time [yr]')
p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')
Out[19]:
Plot the measured concentrations vs. depth and the predicted concentration profiles corresponding to the mean and specific quantiles of the PPD. The shaded grey area indicates the 95% HPD interval.
In [20]:
sns.set(style="darkgrid")
sns.set_context('paper')
depths = np.linspace(profile_data['depth'].min(),
profile_data['depth'].max(),
100)
Cm_true = models.C_10Be(depths, true_erosion, true_exposure,
fixed_density, fixed_inheritance)
res_erosion_rate = erosion_rate.stats()
res_exposure_time = exposure_time.stats()
res_mprofile = mprofile.stats()
Cm_mean = models.C_10Be(depths,
res_erosion_rate['mean'],
res_exposure_time['mean'],
fixed_density,
fixed_inheritance)
Cm_median = models.C_10Be(depths,
res_erosion_rate['quantiles'][50],
res_exposure_time['quantiles'][50],
fixed_density,
fixed_inheritance)
plt.figure()
plt.errorbar(profile_data['C'],
-profile_data['depth'],
xerr=profile_data['std'],
fmt='o', markersize=4,
label='data')
plt.plot(Cm_mean, -depths, label='PPD mean')
plt.plot(Cm_median, -depths, label='PPD median')
plt.plot(Cm_true, -depths, color='r', label='true model')
plt.fill_betweenx(-profile_data['depth'],
res_mprofile['95% HPD interval'][:,0],
res_mprofile['95% HPD interval'][:,1],
color='k', alpha=0.3)
plt.setp(plt.gca(),
xlabel='10Be concentration [atoms g-1]',
ylabel='-1 * depth [cm]',
xlim=[0, None], ylim=[None, 0])
plt.legend(loc='lower right')
Out[20]:
Show the correlation matrix of the free parameters (computed from the samples generated by MCMC).
In [21]:
traces = np.column_stack((mcmc_smodel.trace('erosion_rate')[:],
mcmc_smodel.trace('exposure_time')[:]))
traces_df = pd.DataFrame(data=traces,
columns=['erosion rate', 'exposure time'])
f, ax = plt.subplots()
cmap = sns.blend_palette(["#00008B", "#6A5ACD", "#F0F8FF",
"#FFE6F8", "#C71585", "#8B0000"],
as_cmap=True)
sns.corrplot(traces_df, annot=True, sig_stars=False,
diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()
2 free parameters
The results obtained here seem to be consistent with MLE with 2 parameters (see the notebooks GS_test_2params and MLE_test_2params).
In [ ]: