Bayesian approach - Test case - 4 free parameters

An example of applying the Bayesian approach with 4 free parameters, using the PyMC package.

For more info about the method used, see the notebook Inference_Notes.

This example (a test case) is based on a generic dataset of 10Be concentration vs. depth, which is drawn from a distribution with given "true" parameters.

This notebook has the following external dependencies:


In [1]:
import math

import numpy as np
import pandas as pd
import pymc
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

The mathematical (deterministic) model

The mathematical model is available in the models Python module (see the notebook Models).


In [2]:
import models

The data

The dataset is generated using the following parameter values


In [3]:
# the true parameters 
true_erosion = 1e-3
true_exposure = 1e5
true_density = 2.0
true_inheritance = 0.

# depths and sample size
depth_minmax = [50, 500]
N = 30

# perturbations
err_magnitude = 20.
err_variability = 5.

The gendata Python module is used to generate the dataset (see the notebook Datasets).


In [4]:
import gendata

profile_data = gendata.generate_dataset(
    models.C_10Be,
    [true_erosion, true_exposure, true_density, true_inheritance],
    zlimits=depth_minmax,
    n=N,
    err=[err_magnitude, err_variability]
)

profile_data


Out[4]:
depth C std
0 50.000000 166682.979192 7332.451747
1 65.517241 139954.308810 6928.552484
2 81.034483 112649.875060 7144.750119
3 96.551724 95360.056162 7344.193919
4 112.068966 73394.677861 3603.947061
5 127.586207 68189.139499 5266.346626
6 143.103448 52111.003997 4900.098383
7 158.620690 40794.116289 5257.312055
8 174.137931 41211.026928 3463.789563
9 189.655172 33731.429311 3335.344529
10 205.172414 26794.456376 3842.675669
11 220.689655 23955.274874 2932.714270
12 236.206897 20309.847725 2189.194900
13 251.724138 20835.423055 3424.486949
14 267.241379 18378.628699 2710.987471
15 282.758621 12510.372314 4114.602094
16 298.275862 14199.873575 155.907984
17 313.793103 10335.014696 3135.128259
18 329.310345 9221.310094 2532.018754
19 344.827586 9604.670596 2449.004427
20 360.344828 8817.141726 1942.921509
21 375.862069 7517.649592 1285.869625
22 391.379310 8665.241294 1526.833665
23 406.896552 7559.313667 1624.562701
24 422.413793 5831.077501 946.380210
25 437.931034 6244.165276 2139.771774
26 453.448276 4953.543618 1630.764157
27 468.965517 8024.093601 1468.810926
28 484.482759 9078.341457 1628.295985
29 500.000000 5714.422244 1880.041881

30 rows × 3 columns

The statistical model (used for computing the PPD values)

PyMC allows to define the statistical model in a very nice way. The "nodes" of the model - i.e., the elements that contribute to the Posterior probability distribution - must be defined first.

  • Define the prior probability distribution for each free parameter (here the uniform distribution is used, with given bounds)

In [5]:
erosion_rate = pymc.Uniform('erosion_rate', 0., 4e-3)
exposure_time = pymc.Uniform('exposure_time', 0., 4e5)
soil_density = pymc.Uniform('soil_density', 1.8, 2.3)
inheritance = pymc.Uniform('inheritance', 0, 3e5)
  • Define the deterministic part, i.e., the model used to predict nucleide concentration vs. depth

In [6]:
@pymc.deterministic
def mprofile(erosion=erosion_rate, exposure=exposure_time,
             density=soil_density, inheritance=inheritance):
    """
    Modelled concentration profile.
    """
    return models.C_10Be(profile_data['depth'].values,
                         erosion, exposure,
                         density, inheritance)
  • Define the distribution of measured nucleide concentrations, which is here assumed to be a Gaussian with $\mu$ is predicted by the model above and $\sigma$ the known measurement errors. The likelihood will be derived from this distribution.

In [7]:
d = pymc.Normal('data_distribution', mprofile, 1./profile_data['std']**2,
                value=profile_data['C'].values, observed=True)

PyMC builds the model automatically from all nodes.


In [8]:
smodel = pymc.Model({'erosion_rate': erosion_rate,
                     'exposure_time': exposure_time,
                     'soil_density': soil_density,
                     'inheritance': inheritance,
                     'mprofile': mprofile,
                     'data_distribution': d})

PyMC use the prior distributions to generate a random initial value for each parameter (further needed)


In [9]:
print erosion_rate.value
print exposure_time.value
print soil_density.value
print inheritance.value


0.0030383483373
353740.615011
1.92284833369
58714.6972633

The log-PPD value that correspond to the initial values is given by the logp attribute


In [10]:
smodel.logp


Out[10]:
-73347.1685479176

Fitting the statistical model

:class:pymc.MAP allows to "clone" a model and set all parameters to their maximum a posteriori values. It uses the scipy's optimize package to maximize the log-PPD over the parameter space.


In [11]:
maxppd_smodel = pymc.MAP(smodel)
maxppd_smodel.fit(method='fmin', iterlim=15000, tol=1e-6, verbose=0)

See the resulting optimized values assigned to each parameter (and compare it to the true values defined above, it should be closer than the initial values)


In [12]:
print erosion_rate.value
print exposure_time.value
print soil_density.value
print inheritance.value


0.00399999999843
398135.161274
2.2999872895
0.00526119438799

The log-PPD value at the maximum


In [13]:
maxppd_smodel.logp_at_max


Out[13]:
-787.696853495951

Markov-Chain Monte-Carlo sampling

:class:pymc.MCMC allows to sample the parameter space following the PPD with MCMC. It is a good idea to use the results of the log-PPD maximization as initial values for MCMC. Metropolis-Hastings is the default samplerused by MCMC.


In [14]:
mcmc_smodel = pymc.MCMC(smodel)

Generate samples, discard the first $n$ samples (burn-in, burn argument) and keep only one sample every $t$ samples (thinning, thin argument) in order to obtain idependent samples from the PPD.


In [15]:
mcmc_smodel.sample(50000, burn=5000, thin=40)


 [-----------------100%-----------------] 50000 of 50000 complete in 84.6 sec

Plot some characteristics of the samples for each parameter in order to evaluate the quality of sampling.

  • trace: plot of the values of the - burnt and thinned - samples (in the order of which it has been generated), useful to check if the burn-in setting was chosen appropriately (no trend visible for the firsts samples)
  • acorr: autocorrelation of the values of the - burnt and thinned - samples, useful to check if the thinning setting was chosen appropriately (rapid decrease of autocorrelation)
  • histogram of the samples

In [16]:
pymc.Matplot.plot(mcmc_smodel)


Plotting exposure_time
Plotting soil_density
Plotting erosion_rate
Plotting inheritance

Print a summary about the distribution of the samples for each parameter. It shows the values of various moments and percentiles of the PPD, the 95% Highest Posterior Density (HPD) interval (see here for an explanation) and an estimation of the Monte-Carlo integration error.


In [17]:
mcmc_smodel.summary()


exposure_time:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	78405.582        12864.894        1195.936[  57536.206  102054.876]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	59099.561        67750.8         76585.822      88069.717     105402.98
	

soil_density:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	1.983            0.07             0.005            [ 1.832  2.108]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	1.852            1.933           1.983          2.029         2.129
	

erosion_rate:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	0.001            0.0              0.0              [ 0.     0.001]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	0.0              0.0             0.001          0.001         0.001
	

mprofile:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	161097.163       4841.995         273.763[ 150464.051  169434.037]
	134303.165       3504.403         177.173[ 127099.527  140661.136]
	112184.188       2570.096         111.617[ 107353.707  117283.947]
	93920.751        1940.06          71.793   [ 90442.546  97878.072]
	78837.223        1529.228         53.596   [ 75779.774  81642.228]
	66376.541        1263.203         49.416   [ 63923.183  68789.626]
	56079.387        1081.796         49.697   [ 53964.501  58182.717]
	47567.01         943.493          49.384   [ 45639.098  49373.221]
	40527.084        824.539          47.21    [ 38969.379  42250.011]
	34702.044        713.787          43.304   [ 33347.493  36183.376]
	29879.473        607.352          38.146   [ 28535.931  30939.754]
	25884.181        504.964          32.227   [ 24876.249  26885.817]
	22571.669        408.0            25.965   [ 21746.917  23382.657]
	19822.751        318.723          19.69    [ 19228.055  20486.772]
	17539.111        240.495          13.678   [ 17078.378  18035.839]
	15639.644        179.171          8.262    [ 15277.871  15987.639]
	14057.439        144.894          4.498    [ 13751.629  14325.627]
	12737.296        145.563          5.381    [ 12459.756  13034.948]
	11633.666        172.175          8.945    [ 11270.887  11948.407]
	10708.966        209.124          12.651   [ 10293.48   11102.576]
	9932.183         247.564          16.054   [  9450.323  10406.813]
	9277.722         283.901          19.074     [ 8690.997  9790.911]
	8724.46          316.821          21.71      [ 8106.284  9324.932]
	8254.961         345.952          23.982     [ 7591.751  8921.603]
	7854.832         371.331          25.921     [ 7161.757  8582.119]
	7512.19          393.179          27.558     [ 6766.956  8269.746]
	7217.224         411.797          28.925     [ 6417.13   8006.659]
	6961.834         427.513          30.055     [ 6123.025  7771.968]
	6739.326         440.657          30.977     [ 5867.765  7572.316]
	6544.176         451.544          31.719     [ 5638.666  7398.811]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	151079.29        157797.281      161221.448     164528.17     170557.037
	127158.804       132053.048      134442.123     136747.321    140722.164
	106974.802       110439.438      112225.299     113981.735    117004.645
	90054.54         92619.524       93922.732      95298.37      97512.416
	75779.774        77817.402       78819.884      79912.11      81642.228
	63809.525        65543.134       66420.055      67254.05      68742.885
	53830.127        55375.751       56093.555      56838.237     58135.001
	45579.151        46944.968       47607.251      48236.077     49324.318
	38761.445        39999.84        40558.312      41099.481     42050.003
	33143.655        34249.965       34711.323      35186.787     35996.964
	28562.871        29500.831       29890.942      30288.795     31023.073
	24795.641        25571.874       25884.076      26222.814     26839.306
	21676.143        22321.316       22569.191      22846.028     23352.447
	19128.017        19619.273       19825.629      20037.991     20432.035
	17045.502        17383.067       17542.39       17691.913     18009.599
	15281.963        15522.827       15637.318      15753.109     16014.67
	13769.8          13965.983       14054.102      14151.224     14352.329
	12456.651        12642.021       12736.971      12827.984     13034.308
	11296.905        11518.384       11626.419      11746.833     11975.598
	10306.391        10571.091       10706.476      10851.287     11121.005
	9463.817         9768.184        9932.123       10097.556     10428.611
	8729.848         9087.763        9277.44        9469.801      9837.762
	8108.889         8511.817        8720.778       8937.062      9341.344
	7591.751         8023.263        8258.07        8485.757      8921.603
	7140.901         7606.624        7861.473       8102.843      8571.285
	6751.779         7243.748        7520.407       7771.589      8267.745
	6418.336         6941.434        7226.627       7492.977      8015.845
	6129.287         6679.369        6971.753       7244.043      7796.684
	5880.59          6446.015        6752.121       7027.785      7606.127
	5664.753         6243.961        6556.445       6841.376      7430.898
	

inheritance:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	1073.62          723.94           57.84[  2.28900000e+00   2.39884100e+03]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	46.525           465.084         984.722        1577.127      2627.05
	

Show the marginal distributions of the generated samples for each parameter. Continuous lines are obtained from the samples by Kernel Density Estimation (KDE). The result is an estimation of the marginal PPD of each parameter. The red lines indicate the true values.


In [18]:
sns.set(style="darkgrid")
sns.set_context('paper')

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

sns.distplot(mcmc_smodel.trace('exposure_time')[:], 
             hist=False, kde_kws={"shade": True},
             ax=axes[0][0])
axes[0][0].axvline(true_exposure, color='r')
axes[0][0].set_xlabel('exposure time [yr]')

sns.distplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4, 
             hist=False, kde_kws={"shade": True},
             ax=axes[0][1])
axes[0][1].axvline(true_erosion * 1e4, color='r')
axes[0][1].set_xlabel('erosion rate [m Myr-1]')

sns.distplot(mcmc_smodel.trace('soil_density')[:], 
             hist=False, kde_kws={"shade": True},
             ax=axes[1][0])
axes[1][0].axvline(true_density, color='r')
axes[1][0].set_xlabel('soil_density [g cm-3]')

sns.distplot(mcmc_smodel.trace('inheritance')[:], 
             hist=False, kde_kws={"shade": True},
             ax=axes[1][1])
axes[1][1].axvline(true_inheritance, color='r')
axes[1][1].set_xlabel('inheritance [atoms g-1]')

plt.setp(axes, yticks=[])


Out[18]:
[]

Join plot of the generated samples with marginals, obtained by KDE. This represents an approximation of the 2D marginal PPDs. The red lines indicate the true values.


In [19]:
sns.set(style="white")
sns.set_context('paper')

# exposure vs. erosion
p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
                  mcmc_smodel.trace('exposure_time')[:],
                  xlim=[0, 15], ylim=[4e4, 1.2e5],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='erosion rate [m Myr-1]',
         ylabel='exposure time [yr]')
p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')

# exposure vs. inheritance
p = sns.jointplot(mcmc_smodel.trace('inheritance')[:],
                  mcmc_smodel.trace('exposure_time')[:],
                  xlim=[0, 4000], ylim=[4e4, 1.2e5],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='inheritance [atoms g-1]',
         ylabel='exposure time [yr]')
p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_inheritance, color='r')

# exposure vs. density
p = sns.jointplot(mcmc_smodel.trace('soil_density')[:],
                  mcmc_smodel.trace('exposure_time')[:],
                  xlim=[1.8, 2.3], ylim=[4e4, 1.2e5],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='soil density [g cm-3]',
         ylabel='exposure time [yr]')
p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_density, color='r')

# density vs. erosion
p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
                  mcmc_smodel.trace('soil_density')[:],
                  xlim=[0, 15], ylim=[1.8, 2.3],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='erosion rate [m Myr-1]',
         ylabel='soil density [g cm-3]')
p.ax_joint.axhline(true_density, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')

# inheritance vs. erosion
p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
                  mcmc_smodel.trace('inheritance')[:],
                  xlim=[0, 15], ylim=[0, 4000],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='erosion rate [m Myr-1]',
         ylabel='inheritance [atoms g-1]')
p.ax_joint.axhline(true_inheritance, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')


Out[19]:
<matplotlib.lines.Line2D at 0x7f5c1ef14a10>

Plot the measured concentrations vs. depth and the predicted concentration profiles corresponding to the mean and specific quantiles of the PPD. The shaded grey area indicates the 95% HPD interval.


In [20]:
sns.set(style="darkgrid")
sns.set_context('paper')

depths = np.linspace(profile_data['depth'].min(),
                     profile_data['depth'].max(),
                     100)

Cm_true = models.C_10Be(depths, true_erosion, true_exposure,
                        true_density, true_inheritance)

res_erosion_rate = erosion_rate.stats()
res_exposure_time = exposure_time.stats()
res_soil_density = soil_density.stats()
res_inheritance = inheritance.stats()
res_mprofile = mprofile.stats()

Cm_mean = models.C_10Be(depths,
                        res_erosion_rate['mean'],
                        res_exposure_time['mean'],
                        res_soil_density['mean'],
                        res_inheritance['mean'])

Cm_median = models.C_10Be(depths,
                          res_erosion_rate['quantiles'][50],
                          res_exposure_time['quantiles'][50],
                          res_soil_density['quantiles'][50],
                          res_inheritance['quantiles'][50])

plt.figure()
plt.errorbar(profile_data['C'],
             -profile_data['depth'],
             xerr=profile_data['std'],
             fmt='o', markersize=4,
             label='data')
plt.plot(Cm_mean, -depths, label='PPD mean')
plt.plot(Cm_median, -depths, label='PPD median')
plt.plot(Cm_true, -depths, color='r', label='true model')
plt.fill_betweenx(-profile_data['depth'],
                  res_mprofile['95% HPD interval'][:,0],
                  res_mprofile['95% HPD interval'][:,1],
                  color='k', alpha=0.3)
plt.setp(plt.gca(),
         xlabel='10Be concentration [atoms g-1]',
         ylabel='-1 * depth [cm]',
         xlim=[0, None], ylim=[None, 0])

plt.legend(loc='lower right')


Out[20]:
<matplotlib.legend.Legend at 0x7f5c1ef76910>

Show the correlation matrix of the free parameters (computed from the samples generated by MCMC).


In [21]:
traces = np.column_stack((mcmc_smodel.trace('erosion_rate')[:],
                          mcmc_smodel.trace('exposure_time')[:],
                          mcmc_smodel.trace('soil_density')[:],
                          mcmc_smodel.trace('inheritance')[:]))

traces_df = pd.DataFrame(data=traces,
                         columns=['erosion rate', 'exposure time',
                                  'soil density', 'inheritance'])

f, ax = plt.subplots()
cmap = sns.blend_palette(["#00008B", "#6A5ACD", "#F0F8FF",
                          "#FFE6F8", "#C71585", "#8B0000"],
                         as_cmap=True)
sns.corrplot(traces_df, annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()


Observations

4 free parameters

The results obtained here have similarities with the results obtained by MLE with 4 parameters (see the notebooks GS_test_4params and MLE_test_4params), though the shapes of the marginal PPDs may differ significantly from those of the profile like-likelihoods.

We can aslo see that the traces and autocorrelations of erosion rate and exposure time are not as good as for the case of 2 free parameters.

This shows the limit of the MCMC sampler used here (Metropolis-Hastings algorithm) in covering extended, diagonal regions of high PPD in the parameter space (this is the case here as the parameters are often strongly correlated with each other).

A solution would be to use a more robust sampler, e.g., the one implemented in the emcee Python package or the one available in PyMC version 3 (alpha). The advantage of the affine-invariant ensemble sampler implemented in emcee is that it is based on multiple, independent random walkers. Combining this with initial samples taken from the PPD evaluated on a grid search with a broad resolution would give a more robust method.


In [ ]: