Bayesian approach - Test case - 2 free parameters

An example of applying the Bayesian approach with 2 free parameters only (erosion rate and exposure time), using the PyMC package.

For more info about the method used, see the notebook Inference_Notes.

This example (a test case) is based on a generic dataset of 10Be concentration vs. depth, which is drawn from a distribution with given "true" parameters.

This notebook has the following external dependencies: NumPy, pandas, PyMC, matplotlib and seaborn.


In [1]:
import math

import numpy as np
import pandas as pd
import pymc
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

The mathematical (deterministic) model

The mathematical model is available in the models Python module (see the notebook Models).


In [2]:
import models
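
The models module itself is not reproduced here. For orientation only, the sketch below shows the kind of spallation-only 10Be profile model that is typically used for this problem (exponential attenuation of production with depth, radioactive decay, steady erosion and inheritance). The production rate, attenuation length and decay constant are assumed, illustrative values, and this formulation is not necessarily identical to models.C_10Be.


In [ ]:
# illustrative sketch only -- the actual implementation lives in models.C_10Be
def C_10Be_sketch(depth, erosion, exposure, density, inheritance,
                  P_0=5.0,                              # assumed surface production rate [atoms g-1 yr-1]
                  Lambda=160.,                          # assumed attenuation length [g cm-2]
                  lambda_decay=math.log(2) / 1.387e6):  # assumed 10Be decay constant [yr-1]
    """
    Nuclide concentration vs. depth (cm) for a given erosion rate (cm yr-1),
    exposure time (yr), constant density (g cm-3) and inheritance (atoms g-1).
    """
    mu = lambda_decay + density * erosion / Lambda
    return (inheritance * np.exp(-lambda_decay * exposure)
            + P_0 / mu * np.exp(-density * depth / Lambda)
            * (1. - np.exp(-mu * exposure)))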

The data

The dataset is generated using the following parameter values


In [3]:
# the true parameters 
true_erosion = 1e-3
true_exposure = 1e5

# fixed parameters
fixed_density = 2.0
fixed_inheritance = 0.

# depths and sample size
depth_minmax = [50, 500]
N = 30

# perturbations
err_magnitude = 20.
err_variability = 5.

The gendata Python module is used to generate the dataset (see the notebook Datasets).


In [4]:
import gendata

profile_data = gendata.generate_dataset(
    models.C_10Be,
    [true_erosion, true_exposure, fixed_density, fixed_inheritance],
    zlimits=depth_minmax,
    n=N,
    err=[err_magnitude, err_variability]
)

profile_data


Out[4]:
depth C std
0 50.000000 164646.804971 7246.214797
1 65.517241 139945.656423 4694.358876
2 81.034483 104804.416241 4515.849236
3 96.551724 83616.370489 5328.510663
4 112.068966 69070.664256 5755.018312
5 127.586207 68407.271936 5886.214743
6 143.103448 50059.908676 5828.091753
7 158.620690 47205.539203 3759.697055
8 174.137931 42710.970466 4924.361067
9 189.655172 29562.877929 3852.225865
10 205.172414 33368.663037 2392.127825
11 220.689655 28028.704149 3731.505506
12 236.206897 24991.531196 2436.588896
13 251.724138 15786.337554 2159.890170
14 267.241379 20613.006215 3403.223099
15 282.758621 15883.045725 2120.986663
16 298.275862 15371.425741 2103.435946
17 313.793103 12807.532690 1974.772720
18 329.310345 14097.191111 1609.028868
19 344.827586 12128.263339 2080.154203
20 360.344828 9357.142721 1932.778390
21 375.862069 9223.208481 1070.071844
22 391.379310 6646.650690 1802.140092
23 406.896552 8584.329077 1651.810494
24 422.413793 10533.685591 2216.628289
25 437.931034 8995.316606 1613.037409
26 453.448276 7611.556721 1878.782161
27 468.965517 8311.046879 1441.655169
28 484.482759 2824.392297 2370.549270
29 500.000000 4625.146006 1379.133538

30 rows × 3 columns

The statistical model (used for computing the PPD values)

PyMC provides a convenient way to define the statistical model. The "nodes" of the model - i.e., the elements that contribute to the posterior probability density (PPD) - must be defined first.

  • Define the prior probability distribution for each free parameter (here uniform distributions with given bounds are used)

In [5]:
erosion_rate = pymc.Uniform('erosion_rate', 0., 4e-3)
exposure_time = pymc.Uniform('exposure_time', 0., 4e5)
  • Define the deterministic part, i.e., the model used to predict the nuclide concentration vs. depth

In [6]:
@pymc.deterministic(plot=False)
def mprofile(erosion=erosion_rate, exposure=exposure_time):
    """
    Modelled concentration profile.
    """
    return models.C_10Be(profile_data['depth'].values,
                         erosion, exposure,
                         fixed_density, fixed_inheritance)
  • Define the distribution of the measured nuclide concentrations, which is here assumed to be Gaussian with $\mu$ given by the model above and $\sigma$ given by the known measurement errors (note that pymc.Normal is parameterized by the precision $\tau = 1/\sigma^2$). The likelihood is derived from this distribution; a quick check is shown after the next cell.

In [7]:
d = pymc.Normal('data_distribution', mprofile, 1./profile_data['std']**2,
                value=profile_data['C'].values, observed=True)
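
As a quick sanity check (not part of the original workflow), the current value of the deterministic node and the log-likelihood of the observed node can be inspected directly. The manual computation below assumes the PyMC 2.x convention that pymc.Normal is parameterized by the precision $\tau = 1/\sigma^2$.


In [ ]:
# current model-predicted profile for the randomly initialized parameters
print(mprofile.value[:3])

# manual Gaussian log-likelihood, to be compared with the logp of the observed node
tau = 1. / profile_data['std'].values**2
resid = profile_data['C'].values - mprofile.value
manual_logp = np.sum(0.5 * np.log(tau / (2. * np.pi)) - 0.5 * tau * resid**2)
print(manual_logp)
print(d.logp)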

PyMC builds the model automatically from all nodes.


In [8]:
smodel = pymc.Model({'erosion_rate': erosion_rate,
                     'exposure_time': exposure_time,
                     'mprofile': mprofile,
                     'data_distribution': d})

PyMC uses the prior distributions to generate a random initial value for each parameter (these values are needed further below)


In [9]:
print(erosion_rate.value)
print(exposure_time.value)


0.00129259429611
212815.732925
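
For reference, fresh values can be re-drawn from the priors with the random() method of each stochastic node (a small illustration; it was not executed here since it overwrites the current values):


In [ ]:
# optional: re-draw values from the priors.
# Note that random() overwrites the current value of the node, so running this
# cell would change the initial state (and the logp shown in the next cell).
print(erosion_rate.random())
print(exposure_time.random())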

The log-PPD value corresponding to these initial values is given by the logp attribute


In [10]:
smodel.logp


Out[10]:
-422.2760683441503
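
This joint log-probability is simply the sum of the log-priors of the free parameters and the log-likelihood of the observed node (there are no potentials in this model). A quick check, assuming the PyMC 2.x behaviour where Model.logp sums the log-probabilities of all stochastic nodes, observed or not:


In [ ]:
# log-priors + log-likelihood should match the model's logp
print(erosion_rate.logp + exposure_time.logp + d.logp)
print(smodel.logp)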

Fitting the statistical model

pymc.MAP allows one to "clone" a model and set all of its parameters to their maximum a posteriori values. It uses SciPy's optimize package to maximize the log-PPD over the parameter space.


In [11]:
maxppd_smodel = pymc.MAP(smodel)
maxppd_smodel.fit(method='fmin', iterlim=15000, tol=1e-6, verbose=0)

Inspect the resulting optimized values assigned to each parameter (and compare them to the true values defined above; they should be closer to the true values than the initial values were)


In [12]:
print(erosion_rate.value)
print(exposure_time.value)


0.00098553595361
102125.233535

The log-PPD value at the maximum


In [13]:
maxppd_smodel.logp_at_max


Out[13]:
-288.81231529022523
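
The fitted pymc.MAP object also exposes information criteria evaluated at the maximum (AIC and BIC in PyMC 2.x), which can be handy for comparing alternative models:


In [ ]:
# information criteria at the maximum a posteriori values
print(maxppd_smodel.AIC)
print(maxppd_smodel.BIC)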

Markov-Chain Monte-Carlo sampling

pymc.MCMC allows one to sample the parameter space according to the PPD using MCMC. It is a good idea to use the results of the log-PPD maximization as initial values for the MCMC run. Metropolis-Hastings is the default sampler used by MCMC.


In [14]:
mcmc_smodel = pymc.MCMC(smodel)
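
By default MCMC assigns an adaptive Metropolis step method to each free parameter. The proposal can be customized via use_step_method if needed; the cell below only illustrates the API (the proposal_sd values are arbitrary examples and this was not used to produce the results shown below).


In [ ]:
# optional: explicitly assign Metropolis step methods with custom proposals
# (arbitrary example values; not used for the results shown below)
mcmc_smodel.use_step_method(pymc.Metropolis, erosion_rate,
                            proposal_sd=1e-4, proposal_distribution='Normal')
mcmc_smodel.use_step_method(pymc.Metropolis, exposure_time,
                            proposal_sd=1e4, proposal_distribution='Normal')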

Generate samples, discard the first $n$ samples (burn-in, burn argument) and keep only one sample out of every $t$ samples (thinning, thin argument) in order to obtain independent samples from the PPD.


In [15]:
mcmc_smodel.sample(50000, burn=5000, thin=40)


 [-----------------100%-----------------] 50000 of 50000 complete in 44.0 sec

Plot some characteristics of the samples for each parameter in order to evaluate the quality of the sampling:

  • trace: the values of the - burnt and thinned - samples, in the order in which they were generated; useful to check whether the burn-in setting was chosen appropriately (no trend should be visible in the first samples)
  • acorr: the autocorrelation of the - burnt and thinned - samples; useful to check whether the thinning setting was chosen appropriately (the autocorrelation should decrease rapidly)
  • histogram of the samples

In [16]:
pymc.Matplot.plot(mcmc_smodel)


Plotting exposure_time
Plotting erosion_rate
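
Numerical checks can complement these plots: the lag-1 autocorrelation of a thinned trace should be close to zero, and the Geweke diagnostic compares the means of early and late segments of the chain (|z| < 2 is usually considered acceptable). A small sketch, assuming the PyMC 2.x pymc.geweke function:


In [ ]:
# lag-1 autocorrelation of the (burnt and thinned) exposure_time samples
trace = mcmc_smodel.trace('exposure_time')[:]
print(np.corrcoef(trace[:-1], trace[1:])[0, 1])

# Geweke z-scores (pairs of segment start index and z-score)
scores = np.asarray(pymc.geweke(trace))
print(np.abs(scores[:, 1]).max())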

Print a summary of the distribution of the samples for each parameter. It shows various moments and percentiles of the PPD, the 95% Highest Posterior Density (HPD) interval and an estimate of the Monte-Carlo integration error.


In [17]:
mcmc_smodel.summary()


exposure_time:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	102170.348       7687.582         360.837          [  87194.802  118120.02 ]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	87194.802        97141.709       101974.573     107246.395    118120.02
	

erosion_rate:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	0.001            0.0              0.0              [ 0.001  0.001]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	0.001            0.001           0.001          0.001         0.001
	

mprofile:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	158674.767       3149.538         108.706          [ 152695.602  164525.008]
	132247.598       2559.881         86.67  [ 127412.963  137029.153]
	110455.947       2078.513         68.744 [ 106633.126  114433.976]
	92482.955        1687.15          54.257   [ 89484.599  95823.882]
	77655.776        1370.926         42.68    [ 75299.576  80476.233]
	65420.187        1117.804         33.608   [ 63463.096  67646.947]
	55319.675        918.053          26.74    [ 53700.142  57176.258]
	46978.207        763.725          21.852   [ 45715.328  48646.341]
	40086.042        648.082          18.731   [ 38949.612  41481.415]
	34388.043        564.992          17.085   [ 33364.675  35610.353]
	29674.049        508.434          16.52    [ 28685.741  30687.305]
	25770.946        472.375          16.618   [ 24823.885  26669.851]
	22536.133        451.082          17.047   [ 21620.231  23378.118]
	19852.147        439.626          17.592   [ 19040.187  20739.218]
	17622.227        434.196          18.134   [ 16830.289  18507.135]
	15766.663        432.109          18.616   [ 14946.92   16622.721]
	14219.796        431.617          19.012   [ 13422.476  15106.864]
	12927.534        431.65           19.32    [ 12103.506  13807.428]
	11845.316        431.596          19.541   [ 10973.901  12682.794]
	10936.432        431.136          19.685   [ 10157.628  11870.941]
	10170.638        430.127          19.762   [  9401.176  11113.145]
	9523.014         428.531          19.78    [  8759.583  10469.692]
	8973.032         426.371          19.75      [ 8151.496  9853.571]
	8503.773         423.698          19.68      [ 7684.719  9380.107]
	8101.297         420.576          19.576     [ 7286.272  8969.573]
	7754.117         417.072          19.444     [ 6950.944  8619.247]
	7452.764         413.251          19.291     [ 6649.256  8305.451]
	7189.433         409.173          19.121     [ 6416.127  8052.338]
	6957.692         404.892          18.936     [ 6192.953  7808.217]
	6752.235         400.453          18.741     [ 5993.377  7590.674]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	152590.909       156524.942      158706.433     160926.295    164441.363
	127297.513       130504.18       132276.063     134091.471    136978.916
	106469.152       109015.683      110451.92      111957.015    114381.782
	89332.905        91334.651       92476.825      93660.124     95731.504
	75105.278        76712.253       77632.884      78605.958     80333.318
	63231.409        64656.001       65395.85       66196.707     67614.978
	53561.555        54692.864       55302.376      55950.755     57129.665
	45479.868        46463.753       46961.625      47499.224     48503.371
	38793.554        39652.771       40081.452      40532.786     41391.188
	33234.721        34016.546       34371.298      34762.628     35525.406
	28651.498        29332.811       29662.262      30007.615     30675.831
	24832.001        25449.531       25770.515      26084.942     26711.049
	21649.617        22225.341       22527.414      22829.399     23449.405
	18996.358        19560.148       19831.014      20139.067     20715.613
	16784.036        17332.203       17603.498      17908.986     18479.277
	14947.397        15478.162       15750.02       16067.691     16639.108
	13405.07         13929.714       14201.837      14524.408     15097.645
	12081.849        12640.436       12910.836      13234.302     13793.411
	10997.763        11556.91        11835.902      12148.259     12734.71
	10079.229        10652.863       10920.896      11232.659     11824.066
	9301.026         9890.846        10151.477      10465.06      11057.293
	8657.006         9246.325        9505.583       9816.488      10404.579
	8114.118         8698.461        8959.243       9263.306      9845.553
	7652.743         8229.507        8490.758       8790.652      9360.529
	7257.256         7828.024        8089.479       8379.17       8955.348
	6920.556         7481.356        7741.551       8028.665      8607.933
	6628.525         7183.648        7441.873       7725.705      8298.484
	6374.773         6923.115        7178.543       7458.964      8028.238
	6152.788         6694.167        6947.54        7225.345      7786.835
	5957.2           6492.362        6741.356       7018.593      7569.324
	
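
The same kind of statistics can also be computed directly from the traces, e.g., with the stats() method of each node or with the HPD utility (a sketch, assuming the PyMC 2.x pymc.utils.hpd function):


In [ ]:
# posterior mean, standard deviation and 95% HPD interval of the exposure time,
# computed directly from the trace
trace = mcmc_smodel.trace('exposure_time')[:]
print(trace.mean(), trace.std())
print(pymc.utils.hpd(trace, 0.05))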

Show the marginal distributions of the generated samples for each parameter. The continuous lines are obtained from the samples by Kernel Density Estimation (KDE); the result is an estimate of the marginal PPD of each parameter. The red lines indicate the true values.


In [18]:
sns.set(style="darkgrid")
sns.set_context('paper')

fig, axes = plt.subplots(ncols=2, figsize=(8, 2))

sns.distplot(mcmc_smodel.trace('exposure_time')[:],
             hist=False, kde_kws={"shade": True},
             ax=axes[0])
axes[0].axvline(true_exposure, color='r')
axes[0].set_xlabel('exposure time [yr]')

sns.distplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4, 
             hist=False, kde_kws={"shade": True},
             ax=axes[1])
axes[1].axvline(true_erosion * 1e4, color='r')
axes[1].set_xlabel('erosion rate [m Myr-1]')

plt.setp(axes, yticks=[])


Out[18]:
[]
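
A quick numerical comparison of the posterior means with the true values (a small check, complementing the red lines above):


In [ ]:
# posterior means vs. the true parameter values
print(erosion_rate.stats()['mean'], true_erosion)
print(exposure_time.stats()['mean'], true_exposure)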

Joint plot of the generated samples with the marginals, obtained by KDE. This represents an approximation of the PPD. The red lines indicate the true values.


In [19]:
sns.set(style="white")
sns.set_context('paper')

p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
                  mcmc_smodel.trace('exposure_time')[:],
                  xlim=[5, 15], ylim=[6e4, 1.4e5],
                  kind="kde", size=5, space=0)

plt.setp(p.ax_joint,
         xlabel='erosion rate [m Myr-1]',
         ylabel='exposure time [yr]')

p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')


Out[19]:
<matplotlib.lines.Line2D at 0x7fac69cc0950>

Plot the measured concentrations vs. depth together with the predicted concentration profiles corresponding to the mean and the median of the PPD, as well as the true model. The shaded grey area indicates the 95% HPD interval.


In [20]:
sns.set(style="darkgrid")
sns.set_context('paper')

depths = np.linspace(profile_data['depth'].min(),
                     profile_data['depth'].max(),
                     100)

Cm_true = models.C_10Be(depths, true_erosion, true_exposure,
                        fixed_density, fixed_inheritance)

res_erosion_rate = erosion_rate.stats()
res_exposure_time = exposure_time.stats()
res_mprofile = mprofile.stats()

Cm_mean = models.C_10Be(depths,
                        res_erosion_rate['mean'],
                        res_exposure_time['mean'],
                        fixed_density,
                        fixed_inheritance)

Cm_median = models.C_10Be(depths,
                          res_erosion_rate['quantiles'][50],
                          res_exposure_time['quantiles'][50],
                          fixed_density,
                          fixed_inheritance)

plt.figure()
plt.errorbar(profile_data['C'],
             -profile_data['depth'],
             xerr=profile_data['std'],
             fmt='o', markersize=4,
             label='data')
plt.plot(Cm_mean, -depths, label='PPD mean')
plt.plot(Cm_median, -depths, label='PPD median')
plt.plot(Cm_true, -depths, color='r', label='true model')
plt.fill_betweenx(-profile_data['depth'],
                  res_mprofile['95% HPD interval'][:,0],
                  res_mprofile['95% HPD interval'][:,1],
                  color='k', alpha=0.3)
plt.setp(plt.gca(),
         xlabel='10Be concentration [atoms g-1]',
         ylabel='-1 * depth [cm]',
         xlim=[0, None], ylim=[None, 0])

plt.legend(loc='lower right')


Out[20]:
<matplotlib.legend.Legend at 0x7fac69c01550>

Show the correlation matrix of the free parameters (computed from the samples generated by MCMC).


In [21]:
traces = np.column_stack((mcmc_smodel.trace('erosion_rate')[:],
                          mcmc_smodel.trace('exposure_time')[:]))

traces_df = pd.DataFrame(data=traces,
                         columns=['erosion rate', 'exposure time'])

f, ax = plt.subplots()
cmap = sns.blend_palette(["#00008B", "#6A5ACD", "#F0F8FF",
                          "#FFE6F8", "#C71585", "#8B0000"],
                         as_cmap=True)
sns.corrplot(traces_df, annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()
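
Note that sns.corrplot has been removed from recent seaborn releases. With a recent version, an equivalent figure can be obtained with pandas and sns.heatmap; a minimal sketch:


In [ ]:
# alternative for recent seaborn versions, where corrplot is no longer available:
# compute the correlation matrix with pandas and plot it as an annotated heatmap
corr = traces_df.corr()
f2, ax2 = plt.subplots()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap=cmap, ax=ax2)
f2.tight_layout()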


Observations

2 free parameters

The results obtained here seem to be consistent with those obtained by maximum likelihood estimation with the same 2 free parameters (see the notebooks GS_test_2params and MLE_test_2params).


In [ ]: