Bayesian approach - Test case - 2 free parameters

An example of applying the Bayesian approach with 2 free parameters only (erosion rate and exposure time), using the PyMC package.

For more info about the method used, see the notebook Inference_Notes.

This example (a test case) is based on a generic dataset of 10Be concentration vs. depth, which is drawn from a distribution with given "true" parameters.

This notebook has the following external dependencies: NumPy, pandas, PyMC, matplotlib and seaborn.


In [1]:
import math

import numpy as np
import pandas as pd
import pymc
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

The mathematical (deterministic) model

The mathematical model is available in the models Python module (see the notebook Models).


In [2]:
import models
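
The models module itself is not reproduced here. For orientation only, the sketch below shows the kind of spallation-only 10Be profile model that is typically used for this problem (exponential attenuation of production with depth, radioactive decay, steady erosion and inheritance). The production rate, attenuation length and decay constant are assumed, illustrative values, and this formulation is not necessarily identical to models.C_10Be.


In [ ]:
# illustrative sketch only -- the actual implementation lives in models.C_10Be
def C_10Be_sketch(depth, erosion, exposure, density, inheritance,
                  P_0=5.0,                              # assumed surface production rate [atoms g-1 yr-1]
                  Lambda=160.,                          # assumed attenuation length [g cm-2]
                  lambda_decay=math.log(2) / 1.387e6):  # assumed 10Be decay constant [yr-1]
    """
    Nuclide concentration vs. depth (cm) for a given erosion rate (cm yr-1),
    exposure time (yr), constant density (g cm-3) and inheritance (atoms g-1).
    """
    mu = lambda_decay + density * erosion / Lambda
    return (inheritance * np.exp(-lambda_decay * exposure)
            + P_0 / mu * np.exp(-density * depth / Lambda)
            * (1. - np.exp(-mu * exposure)))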

The data

The dataset is generated using the following parameter values


In [3]:
# the true parameters 
true_erosion = 1e-3
true_exposure = 1e5

# fixed parameters
fixed_density = 2.0
fixed_inheritance = 0.

# depths and sample size
depth_minmax = [50, 500]
N = 30

# perturbations
err_magnitude = 20.
err_variability = 5.

The gendata Python module is used to generate the dataset (see the notebook Datasets).


In [4]:
import gendata

profile_data = gendata.generate_dataset(
    models.C_10Be,
    [true_erosion, true_exposure, fixed_density, fixed_inheritance],
    zlimits=depth_minmax,
    n=N,
    err=[err_magnitude, err_variability]
)

profile_data


Out[4]:
depth C std
0 50.000000 164646.804971 7246.214797
1 65.517241 139945.656423 4694.358876
2 81.034483 104804.416241 4515.849236
3 96.551724 83616.370489 5328.510663
4 112.068966 69070.664256 5755.018312
5 127.586207 68407.271936 5886.214743
6 143.103448 50059.908676 5828.091753
7 158.620690 47205.539203 3759.697055
8 174.137931 42710.970466 4924.361067
9 189.655172 29562.877929 3852.225865
10 205.172414 33368.663037 2392.127825
11 220.689655 28028.704149 3731.505506
12 236.206897 24991.531196 2436.588896
13 251.724138 15786.337554 2159.890170
14 267.241379 20613.006215 3403.223099
15 282.758621 15883.045725 2120.986663
16 298.275862 15371.425741 2103.435946
17 313.793103 12807.532690 1974.772720
18 329.310345 14097.191111 1609.028868
19 344.827586 12128.263339 2080.154203
20 360.344828 9357.142721 1932.778390
21 375.862069 9223.208481 1070.071844
22 391.379310 6646.650690 1802.140092
23 406.896552 8584.329077 1651.810494
24 422.413793 10533.685591 2216.628289
25 437.931034 8995.316606 1613.037409
26 453.448276 7611.556721 1878.782161
27 468.965517 8311.046879 1441.655169
28 484.482759 2824.392297 2370.549270
29 500.000000 4625.146006 1379.133538

30 rows × 3 columns

The statistical model (used for computing the PPD values)

PyMC provides a convenient way to define the statistical model. The "nodes" of the model - i.e., the elements that contribute to the posterior probability density (PPD) - must be defined first.

  • Define the prior probability distribution for each free parameter (here uniform distributions with given bounds are used)

In [5]:
erosion_rate = pymc.Uniform('erosion_rate', 0., 4e-3)
exposure_time = pymc.Uniform('exposure_time', 0., 4e5)
  • Define the deterministic part, i.e., the model used to predict the nuclide concentration vs. depth

In [6]:
@pymc.deterministic(plot=False)
def mprofile(erosion=erosion_rate, exposure=exposure_time):
    """
    Modelled concentration profile.
    """
    return models.C_10Be(profile_data['depth'].values,
                         erosion, exposure,
                         fixed_density, fixed_inheritance)
  • Define the distribution of the measured nuclide concentrations, which is here assumed to be Gaussian with $\mu$ given by the model above and $\sigma$ given by the known measurement errors (note that pymc.Normal is parameterized by the precision $\tau = 1/\sigma^2$). The likelihood is derived from this distribution; a quick check is shown after the next cell.

In [7]:
d = pymc.Normal('data_distribution', mprofile, 1./profile_data['std']**2,
                value=profile_data['C'].values, observed=True)
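
As a quick sanity check (not part of the original workflow), the current value of the deterministic node and the log-likelihood of the observed node can be inspected directly. The manual computation below assumes the PyMC 2.x convention that pymc.Normal is parameterized by the precision $\tau = 1/\sigma^2$.


In [ ]:
# current model-predicted profile for the randomly initialized parameters
print(mprofile.value[:3])

# manual Gaussian log-likelihood, to be compared with the logp of the observed node
tau = 1. / profile_data['std'].values**2
resid = profile_data['C'].values - mprofile.value
manual_logp = np.sum(0.5 * np.log(tau / (2. * np.pi)) - 0.5 * tau * resid**2)
print(manual_logp)
print(d.logp)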

PyMC builds the model automatically from all nodes.


In [8]:
smodel = pymc.Model({'erosion_rate': erosion_rate,
                     'exposure_time': exposure_time,
                     'mprofile': mprofile,
                     'data_distribution': d})

PyMC uses the prior distributions to generate a random initial value for each parameter (these values are needed further below)


In [9]:
print(erosion_rate.value)
print(exposure_time.value)


0.00129259429611
212815.732925
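
For reference, fresh values can be re-drawn from the priors with the random() method of each stochastic node (a small illustration; it was not executed here since it overwrites the current values):


In [ ]:
# optional: re-draw values from the priors.
# Note that random() overwrites the current value of the node, so running this
# cell would change the initial state (and the logp shown in the next cell).
print(erosion_rate.random())
print(exposure_time.random())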

The log-PPD value corresponding to these initial values is given by the logp attribute


In [10]:
smodel.logp


Out[10]:
-422.2760683441503
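
This joint log-probability is simply the sum of the log-priors of the free parameters and the log-likelihood of the observed node (there are no potentials in this model). A quick check, assuming the PyMC 2.x behaviour where Model.logp sums the log-probabilities of all stochastic nodes, observed or not:


In [ ]:
# log-priors + log-likelihood should match the model's logp
print(erosion_rate.logp + exposure_time.logp + d.logp)
print(smodel.logp)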

Fitting the statistical model

pymc.MAP allows one to "clone" a model and set all of its parameters to their maximum a posteriori values. It uses SciPy's optimize package to maximize the log-PPD over the parameter space.


In [11]:
maxppd_smodel = pymc.MAP(smodel)
maxppd_smodel.fit(method='fmin', iterlim=15000, tol=1e-6, verbose=0)

Inspect the resulting optimized values assigned to each parameter (and compare them to the true values defined above; they should be closer to the true values than the initial values were)


In [12]:
print(erosion_rate.value)
print(exposure_time.value)


0.00098553595361
102125.233535

The log-PPD value at the maximum


In [13]:
maxppd_smodel.logp_at_max


Out[13]:
-288.81231529022523
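
The fitted pymc.MAP object also exposes information criteria evaluated at the maximum (AIC and BIC in PyMC 2.x), which can be handy for comparing alternative models:


In [ ]:
# information criteria at the maximum a posteriori values
print(maxppd_smodel.AIC)
print(maxppd_smodel.BIC)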

Markov-Chain Monte-Carlo sampling

pymc.MCMC allows one to sample the parameter space according to the PPD using MCMC. It is a good idea to use the results of the log-PPD maximization as initial values for the MCMC run. Metropolis-Hastings is the default sampler used by MCMC.


In [14]:
mcmc_smodel = pymc.MCMC(smodel)
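
By default MCMC assigns an adaptive Metropolis step method to each free parameter. The proposal can be customized via use_step_method if needed; the cell below only illustrates the API (the proposal_sd values are arbitrary examples and this was not used to produce the results shown below).


In [ ]:
# optional: explicitly assign Metropolis step methods with custom proposals
# (arbitrary example values; not used for the results shown below)
mcmc_smodel.use_step_method(pymc.Metropolis, erosion_rate,
                            proposal_sd=1e-4, proposal_distribution='Normal')
mcmc_smodel.use_step_method(pymc.Metropolis, exposure_time,
                            proposal_sd=1e4, proposal_distribution='Normal')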

Generate samples, discard the first $n$ samples (burn-in, burn argument) and keep only one sample out of every $t$ samples (thinning, thin argument) in order to obtain independent samples from the PPD.


In [15]:
mcmc_smodel.sample(50000, burn=5000, thin=40)


 [-----------------100%-----------------] 50000 of 50000 complete in 44.0 sec

Plot some characteristics of the samples for each parameter in order to evaluate the quality of the sampling:

  • trace: the values of the - burnt and thinned - samples, in the order in which they were generated; useful to check whether the burn-in setting was chosen appropriately (no trend should be visible in the first samples)
  • acorr: the autocorrelation of the - burnt and thinned - samples; useful to check whether the thinning setting was chosen appropriately (the autocorrelation should decrease rapidly)
  • histogram of the samples

In [16]:
pymc.Matplot.plot(mcmc_smodel)


Plotting exposure_time
Plotting erosion_rate
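
Numerical checks can complement these plots: the lag-1 autocorrelation of a thinned trace should be close to zero, and the Geweke diagnostic compares the means of early and late segments of the chain (|z| < 2 is usually considered acceptable). A small sketch, assuming the PyMC 2.x pymc.geweke function:


In [ ]:
# lag-1 autocorrelation of the (burnt and thinned) exposure_time samples
trace = mcmc_smodel.trace('exposure_time')[:]
print(np.corrcoef(trace[:-1], trace[1:])[0, 1])

# Geweke z-scores (pairs of segment start index and z-score)
scores = np.asarray(pymc.geweke(trace))
print(np.abs(scores[:, 1]).max())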

Print a summary of the distribution of the samples for each parameter. It shows various moments and percentiles of the PPD, the 95% Highest Posterior Density (HPD) interval and an estimate of the Monte-Carlo integration error.


In [17]:
mcmc_smodel.summary()


exposure_time:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	102170.348       7687.582         360.837          [  87194.802  118120.02 ]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	87194.802        97141.709       101974.573     107246.395    118120.02
	

erosion_rate:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	0.001            0.0              0.0              [ 0.001  0.001]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	0.001            0.001           0.001          0.001         0.001
	

mprofile:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	158674.767       3149.538         108.706          [ 152695.602  164525.008]
	132247.598       2559.881         86.67  [ 127412.963  137029.153]
	110455.947       2078.513         68.744 [ 106633.126  114433.976]
	92482.955        1687.15          54.257   [ 89484.599  95823.882]
	77655.776        1370.926         42.68    [ 75299.576  80476.233]
	65420.187        1117.804         33.608   [ 63463.096  67646.947]
	55319.675        918.053          26.74    [ 53700.142  57176.258]
	46978.207        763.725          21.852   [ 45715.328  48646.341]
	40086.042        648.082          18.731   [ 38949.612  41481.415]
	34388.043        564.992          17.085   [ 33364.675  35610.353]
	29674.049        508.434          16.52    [ 28685.741  30687.305]
	25770.946        472.375          16.618   [ 24823.885  26669.851]
	22536.133        451.082          17.047   [ 21620.231  23378.118]
	19852.147        439.626          17.592   [ 19040.187  20739.218]
	17622.227        434.196          18.134   [ 16830.289  18507.135]
	15766.663        432.109          18.616   [ 14946.92   16622.721]
	14219.796        431.617          19.012   [ 13422.476  15106.864]
	12927.534        431.65           19.32    [ 12103.506  13807.428]
	11845.316        431.596          19.541   [ 10973.901  12682.794]
	10936.432        431.136          19.685   [ 10157.628  11870.941]
	10170.638        430.127          19.762   [  9401.176  11113.145]
	9523.014         428.531          19.78    [  8759.583  10469.692]
	8973.032         426.371          19.75      [ 8151.496  9853.571]
	8503.773         423.698          19.68      [ 7684.719  9380.107]
	8101.297         420.576          19.576     [ 7286.272  8969.573]
	7754.117         417.072          19.444     [ 6950.944  8619.247]
	7452.764         413.251          19.291     [ 6649.256  8305.451]
	7189.433         409.173          19.121     [ 6416.127  8052.338]
	6957.692         404.892          18.936     [ 6192.953  7808.217]
	6752.235         400.453          18.741     [ 5993.377  7590.674]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	152590.909       156524.942      158706.433     160926.295    164441.363
	127297.513       130504.18       132276.063     134091.471    136978.916
	106469.152       109015.683      110451.92      111957.015    114381.782
	89332.905        91334.651       92476.825      93660.124     95731.504
	75105.278        76712.253       77632.884      78605.958     80333.318
	63231.409        64656.001       65395.85       66196.707     67614.978
	53561.555        54692.864       55302.376      55950.755     57129.665
	45479.868        46463.753       46961.625      47499.224     48503.371
	38793.554        39652.771       40081.452      40532.786     41391.188
	33234.721        34016.546       34371.298      34762.628     35525.406
	28651.498        29332.811       29662.262      30007.615     30675.831
	24832.001        25449.531       25770.515      26084.942     26711.049
	21649.617        22225.341       22527.414      22829.399     23449.405
	18996.358        19560.148       19831.014      20139.067     20715.613
	16784.036        17332.203       17603.498      17908.986     18479.277
	14947.397        15478.162       15750.02       16067.691     16639.108
	13405.07         13929.714       14201.837      14524.408     15097.645
	12081.849        12640.436       12910.836      13234.302     13793.411
	10997.763        11556.91        11835.902      12148.259     12734.71
	10079.229        10652.863       10920.896      11232.659     11824.066
	9301.026         9890.846        10151.477      10465.06      11057.293
	8657.006         9246.325        9505.583       9816.488      10404.579
	8114.118         8698.461        8959.243       9263.306      9845.553
	7652.743         8229.507        8490.758       8790.652      9360.529
	7257.256         7828.024        8089.479       8379.17       8955.348
	6920.556         7481.356        7741.551       8028.665      8607.933
	6628.525         7183.648        7441.873       7725.705      8298.484
	6374.773         6923.115        7178.543       7458.964      8028.238
	6152.788         6694.167        6947.54        7225.345      7786.835
	5957.2           6492.362        6741.356       7018.593      7569.324
	
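
The same kind of statistics can also be computed directly from the traces, e.g., with the stats() method of each node or with the HPD utility (a sketch, assuming the PyMC 2.x pymc.utils.hpd function):


In [ ]:
# posterior mean, standard deviation and 95% HPD interval of the exposure time,
# computed directly from the trace
trace = mcmc_smodel.trace('exposure_time')[:]
print(trace.mean(), trace.std())
print(pymc.utils.hpd(trace, 0.05))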

Show the marginal distributions of the generated samples for each parameter. The continuous lines are obtained from the samples by Kernel Density Estimation (KDE); the result is an estimate of the marginal PPD of each parameter. The red lines indicate the true values.


In [18]:
sns.set(style="darkgrid")
sns.set_context('paper')

fig, axes = plt.subplots(ncols=2, figsize=(8, 2))

sns.distplot(mcmc_smodel.trace('exposure_time')[:],
             hist=False, kde_kws={"shade": True},
             ax=axes[0])
axes[0].axvline(true_exposure, color='r')
axes[0].set_xlabel('exposure time [yr]')

sns.distplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4, 
             hist=False, kde_kws={"shade": True},
             ax=axes[1])
axes[1].axvline(true_erosion * 1e4, color='r')
axes[1].set_xlabel('erosion rate [m Myr-1]')

plt.setp(axes, yticks=[])


Out[18]:
[]
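
A quick numerical comparison of the posterior means with the true values (a small check, complementing the red lines above):


In [ ]:
# posterior means vs. the true parameter values
print(erosion_rate.stats()['mean'], true_erosion)
print(exposure_time.stats()['mean'], true_exposure)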

Joint plot of the generated samples with the marginals, obtained by KDE. This represents an approximation of the PPD. The red lines indicate the true values.


In [19]:
sns.set(style="white")
sns.set_context('paper')

p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
                  mcmc_smodel.trace('exposure_time')[:],
                  xlim=[5, 15], ylim=[6e4, 1.4e5],
                  kind="kde", size=5, space=0)

plt.setp(p.ax_joint,
         xlabel='erosion rate [m Myr-1]',
         ylabel='exposure time [yr]')

p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')


Out[19]:
<matplotlib.lines.Line2D at 0x7fac69cc0950>

Plot the measured concentrations vs. depth together with the predicted concentration profiles corresponding to the mean and the median of the PPD, as well as the true model. The shaded grey area indicates the 95% HPD interval.


In [20]:
sns.set(style="darkgrid")
sns.set_context('paper')

depths = np.linspace(profile_data['depth'].min(),
                     profile_data['depth'].max(),
                     100)

Cm_true = models.C_10Be(depths, true_erosion, true_exposure,
                        fixed_density, fixed_inheritance)

res_erosion_rate = erosion_rate.stats()
res_exposure_time = exposure_time.stats()
res_mprofile = mprofile.stats()

Cm_mean = models.C_10Be(depths,
                        res_erosion_rate['mean'],
                        res_exposure_time['mean'],
                        fixed_density,
                        fixed_inheritance)

Cm_median = models.C_10Be(depths,
                          res_erosion_rate['quantiles'][50],
                          res_exposure_time['quantiles'][50],
                          fixed_density,
                          fixed_inheritance)

plt.figure()
plt.errorbar(profile_data['C'],
             -profile_data['depth'],
             xerr=profile_data['std'],
             fmt='o', markersize=4,
             label='data')
plt.plot(Cm_mean, -depths, label='PPD mean')
plt.plot(Cm_median, -depths, label='PPD median')
plt.plot(Cm_true, -depths, color='r', label='true model')
plt.fill_betweenx(-profile_data['depth'],
                  res_mprofile['95% HPD interval'][:,0],
                  res_mprofile['95% HPD interval'][:,1],
                  color='k', alpha=0.3)
plt.setp(plt.gca(),
         xlabel='10Be concentration [atoms g-1]',
         ylabel='-1 * depth [cm]',
         xlim=[0, None], ylim=[None, 0])

plt.legend(loc='lower right')


Out[20]:
<matplotlib.legend.Legend at 0x7fac69c01550>

Show the correlation matrix of the free parameters (computed from the samples generated by MCMC).


In [21]:
traces = np.column_stack((mcmc_smodel.trace('erosion_rate')[:],
                          mcmc_smodel.trace('exposure_time')[:]))

traces_df = pd.DataFrame(data=traces,
                         columns=['erosion rate', 'exposure time'])

f, ax = plt.subplots()
cmap = sns.blend_palette(["#00008B", "#6A5ACD", "#F0F8FF",
                          "#FFE6F8", "#C71585", "#8B0000"],
                         as_cmap=True)
sns.corrplot(traces_df, annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()
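
Note that sns.corrplot has been removed from recent seaborn releases. With a recent version, an equivalent figure can be obtained with pandas and sns.heatmap; a minimal sketch:


In [ ]:
# alternative for recent seaborn versions, where corrplot is no longer available:
# compute the correlation matrix with pandas and plot it as an annotated heatmap
corr = traces_df.corr()
f2, ax2 = plt.subplots()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap=cmap, ax=ax2)
f2.tight_layout()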


Observations

2 free parameters

The results obtained here seem to be consistent with those obtained by maximum likelihood estimation with the same 2 free parameters (see the notebooks GS_test_2params and MLE_test_2params).


In [ ]: