Bayesian approach - Test case - 4 free parameters

An example of applying the Bayesian approach with 4 free parameters, using the PyMC package.

For more info about the method used, see the notebook Inference_Notes.

This example (a test case) is based on a generic dataset of 10Be concentration vs. depth, which is drawn from a distribution with given "true" parameters.

This notebook has the following external dependencies:



In [1]:

    
import math

import numpy as np
import pandas as pd
import pymc
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

The mathematical (deterministic) model

The mathematical model is available in the models Python module (see the notebook Models).



In [2]:

    
import models

The data

The dataset is generated using the following parameter values



In [3]:

    
# the true parameters 
true_erosion = 1e-3
true_exposure = 1e5
true_density = 2.0
true_inheritance = 0.

# depths and sample size
depth_minmax = [50, 500]
N = 30

# perturbations
err_magnitude = 20.
err_variability = 5.

The gendata Python module is used to generate the dataset (see the notebook Datasets).



In [4]:

    
import gendata

profile_data = gendata.generate_dataset(
    models.C_10Be,
    [true_erosion, true_exposure, true_density, true_inheritance],
    zlimits=depth_minmax,
    n=N,
    err=[err_magnitude, err_variability]
)

profile_data









    Out[4]:






  
    
      
      depth
      C
      std
    
  
  
    
      0 
        50.000000
       166682.979192
       7332.451747
    
    
      1 
        65.517241
       139954.308810
       6928.552484
    
    
      2 
        81.034483
       112649.875060
       7144.750119
    
    
      3 
        96.551724
        95360.056162
       7344.193919
    
    
      4 
       112.068966
        73394.677861
       3603.947061
    
    
      5 
       127.586207
        68189.139499
       5266.346626
    
    
      6 
       143.103448
        52111.003997
       4900.098383
    
    
      7 
       158.620690
        40794.116289
       5257.312055
    
    
      8 
       174.137931
        41211.026928
       3463.789563
    
    
      9 
       189.655172
        33731.429311
       3335.344529
    
    
      10
       205.172414
        26794.456376
       3842.675669
    
    
      11
       220.689655
        23955.274874
       2932.714270
    
    
      12
       236.206897
        20309.847725
       2189.194900
    
    
      13
       251.724138
        20835.423055
       3424.486949
    
    
      14
       267.241379
        18378.628699
       2710.987471
    
    
      15
       282.758621
        12510.372314
       4114.602094
    
    
      16
       298.275862
        14199.873575
        155.907984
    
    
      17
       313.793103
        10335.014696
       3135.128259
    
    
      18
       329.310345
         9221.310094
       2532.018754
    
    
      19
       344.827586
         9604.670596
       2449.004427
    
    
      20
       360.344828
         8817.141726
       1942.921509
    
    
      21
       375.862069
         7517.649592
       1285.869625
    
    
      22
       391.379310
         8665.241294
       1526.833665
    
    
      23
       406.896552
         7559.313667
       1624.562701
    
    
      24
       422.413793
         5831.077501
        946.380210
    
    
      25
       437.931034
         6244.165276
       2139.771774
    
    
      26
       453.448276
         4953.543618
       1630.764157
    
    
      27
       468.965517
         8024.093601
       1468.810926
    
    
      28
       484.482759
         9078.341457
       1628.295985
    
    
      29
       500.000000
         5714.422244
       1880.041881
    
  

30 rows × 3 columns

The statistical model (used for computing the PPD values)

PyMC allows to define the statistical model in a very nice way. The "nodes" of the model - i.e., the elements that contribute to the Posterior probability distribution - must be defined first.

Define the prior probability distribution for each free parameter (here the uniform distribution is used, with given bounds)



In [5]:

    
erosion_rate = pymc.Uniform('erosion_rate', 0., 4e-3)
exposure_time = pymc.Uniform('exposure_time', 0., 4e5)
soil_density = pymc.Uniform('soil_density', 1.8, 2.3)
inheritance = pymc.Uniform('inheritance', 0, 3e5)

Define the deterministic part, i.e., the model used to predict nucleide concentration vs. depth



In [6]:

    
@pymc.deterministic
def mprofile(erosion=erosion_rate, exposure=exposure_time,
             density=soil_density, inheritance=inheritance):
    """
    Modelled concentration profile.
    """
    return models.C_10Be(profile_data['depth'].values,
                         erosion, exposure,
                         density, inheritance)

Define the distribution of measured nucleide concentrations, which is here assumed to be a Gaussian with $\mu$ is predicted by the model above and $\sigma$ the known measurement errors. The likelihood will be derived from this distribution.



In [7]:

    
d = pymc.Normal('data_distribution', mprofile, 1./profile_data['std']**2,
                value=profile_data['C'].values, observed=True)

PyMC builds the model automatically from all nodes.



In [8]:

    
smodel = pymc.Model({'erosion_rate': erosion_rate,
                     'exposure_time': exposure_time,
                     'soil_density': soil_density,
                     'inheritance': inheritance,
                     'mprofile': mprofile,
                     'data_distribution': d})

PyMC use the prior distributions to generate a random initial value for each parameter (further needed)



In [9]:

    
print erosion_rate.value
print exposure_time.value
print soil_density.value
print inheritance.value









    



0.0030383483373
353740.615011
1.92284833369
58714.6972633

The log-PPD value that correspond to the initial values is given by the logp attribute



In [10]:

    
smodel.logp









    Out[10]:





-73347.1685479176

Fitting the statistical model

:class:pymc.MAP allows to "clone" a model and set all parameters to their maximum a posteriori values. It uses the scipy's optimize package to maximize the log-PPD over the parameter space.



In [11]:

    
maxppd_smodel = pymc.MAP(smodel)
maxppd_smodel.fit(method='fmin', iterlim=15000, tol=1e-6, verbose=0)

See the resulting optimized values assigned to each parameter (and compare it to the true values defined above, it should be closer than the initial values)



In [12]:

    
print erosion_rate.value
print exposure_time.value
print soil_density.value
print inheritance.value









    



0.00399999999843
398135.161274
2.2999872895
0.00526119438799

The log-PPD value at the maximum



In [13]:

    
maxppd_smodel.logp_at_max









    Out[13]:





-787.696853495951

Markov-Chain Monte-Carlo sampling

:class:pymc.MCMC allows to sample the parameter space following the PPD with MCMC. It is a good idea to use the results of the log-PPD maximization as initial values for MCMC. Metropolis-Hastings is the default samplerused by MCMC.



In [14]:

    
mcmc_smodel = pymc.MCMC(smodel)

Generate samples, discard the first $n$ samples (burn-in, burn argument) and keep only one sample every $t$ samples (thinning, thin argument) in order to obtain idependent samples from the PPD.



In [15]:

    
mcmc_smodel.sample(50000, burn=5000, thin=40)









    



 [-----------------100%-----------------] 50000 of 50000 complete in 84.6 sec

Plot some characteristics of the samples for each parameter in order to evaluate the quality of sampling.

trace: plot of the values of the - burnt and thinned - samples (in the order of which it has been generated), useful to check if the burn-in setting was chosen appropriately (no trend visible for the firsts samples)
acorr: autocorrelation of the values of the - burnt and thinned - samples, useful to check if the thinning setting was chosen appropriately (rapid decrease of autocorrelation)
histogram of the samples



In [16]:

    
pymc.Matplot.plot(mcmc_smodel)









    



Plotting exposure_time
Plotting soil_density
Plotting erosion_rate
Plotting inheritance

Print a summary about the distribution of the samples for each parameter. It shows the values of various moments and percentiles of the PPD, the 95% Highest Posterior Density (HPD) interval (see here for an explanation) and an estimation of the Monte-Carlo integration error.



In [17]:

    
mcmc_smodel.summary()









    



exposure_time:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	78405.582        12864.894        1195.936[  57536.206  102054.876]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	59099.561        67750.8         76585.822      88069.717     105402.98
	

soil_density:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	1.983            0.07             0.005            [ 1.832  2.108]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	1.852            1.933           1.983          2.029         2.129
	

erosion_rate:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	0.001            0.0              0.0              [ 0.     0.001]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	0.0              0.0             0.001          0.001         0.001
	

mprofile:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	161097.163       4841.995         273.763[ 150464.051  169434.037]
	134303.165       3504.403         177.173[ 127099.527  140661.136]
	112184.188       2570.096         111.617[ 107353.707  117283.947]
	93920.751        1940.06          71.793   [ 90442.546  97878.072]
	78837.223        1529.228         53.596   [ 75779.774  81642.228]
	66376.541        1263.203         49.416   [ 63923.183  68789.626]
	56079.387        1081.796         49.697   [ 53964.501  58182.717]
	47567.01         943.493          49.384   [ 45639.098  49373.221]
	40527.084        824.539          47.21    [ 38969.379  42250.011]
	34702.044        713.787          43.304   [ 33347.493  36183.376]
	29879.473        607.352          38.146   [ 28535.931  30939.754]
	25884.181        504.964          32.227   [ 24876.249  26885.817]
	22571.669        408.0            25.965   [ 21746.917  23382.657]
	19822.751        318.723          19.69    [ 19228.055  20486.772]
	17539.111        240.495          13.678   [ 17078.378  18035.839]
	15639.644        179.171          8.262    [ 15277.871  15987.639]
	14057.439        144.894          4.498    [ 13751.629  14325.627]
	12737.296        145.563          5.381    [ 12459.756  13034.948]
	11633.666        172.175          8.945    [ 11270.887  11948.407]
	10708.966        209.124          12.651   [ 10293.48   11102.576]
	9932.183         247.564          16.054   [  9450.323  10406.813]
	9277.722         283.901          19.074     [ 8690.997  9790.911]
	8724.46          316.821          21.71      [ 8106.284  9324.932]
	8254.961         345.952          23.982     [ 7591.751  8921.603]
	7854.832         371.331          25.921     [ 7161.757  8582.119]
	7512.19          393.179          27.558     [ 6766.956  8269.746]
	7217.224         411.797          28.925     [ 6417.13   8006.659]
	6961.834         427.513          30.055     [ 6123.025  7771.968]
	6739.326         440.657          30.977     [ 5867.765  7572.316]
	6544.176         451.544          31.719     [ 5638.666  7398.811]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	151079.29        157797.281      161221.448     164528.17     170557.037
	127158.804       132053.048      134442.123     136747.321    140722.164
	106974.802       110439.438      112225.299     113981.735    117004.645
	90054.54         92619.524       93922.732      95298.37      97512.416
	75779.774        77817.402       78819.884      79912.11      81642.228
	63809.525        65543.134       66420.055      67254.05      68742.885
	53830.127        55375.751       56093.555      56838.237     58135.001
	45579.151        46944.968       47607.251      48236.077     49324.318
	38761.445        39999.84        40558.312      41099.481     42050.003
	33143.655        34249.965       34711.323      35186.787     35996.964
	28562.871        29500.831       29890.942      30288.795     31023.073
	24795.641        25571.874       25884.076      26222.814     26839.306
	21676.143        22321.316       22569.191      22846.028     23352.447
	19128.017        19619.273       19825.629      20037.991     20432.035
	17045.502        17383.067       17542.39       17691.913     18009.599
	15281.963        15522.827       15637.318      15753.109     16014.67
	13769.8          13965.983       14054.102      14151.224     14352.329
	12456.651        12642.021       12736.971      12827.984     13034.308
	11296.905        11518.384       11626.419      11746.833     11975.598
	10306.391        10571.091       10706.476      10851.287     11121.005
	9463.817         9768.184        9932.123       10097.556     10428.611
	8729.848         9087.763        9277.44        9469.801      9837.762
	8108.889         8511.817        8720.778       8937.062      9341.344
	7591.751         8023.263        8258.07        8485.757      8921.603
	7140.901         7606.624        7861.473       8102.843      8571.285
	6751.779         7243.748        7520.407       7771.589      8267.745
	6418.336         6941.434        7226.627       7492.977      8015.845
	6129.287         6679.369        6971.753       7244.043      7796.684
	5880.59          6446.015        6752.121       7027.785      7606.127
	5664.753         6243.961        6556.445       6841.376      7430.898
	

inheritance:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	1073.62          723.94           57.84[  2.28900000e+00   2.39884100e+03]
	
	
	Posterior quantiles:
	
	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	46.525           465.084         984.722        1577.127      2627.05

Show the marginal distributions of the generated samples for each parameter. Continuous lines are obtained from the samples by Kernel Density Estimation (KDE). The result is an estimation of the marginal PPD of each parameter. The red lines indicate the true values.



In [18]:

    
sns.set(style="darkgrid")
sns.set_context('paper')

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

sns.distplot(mcmc_smodel.trace('exposure_time')[:], 
             hist=False, kde_kws={"shade": True},
             ax=axes[0][0])
axes[0][0].axvline(true_exposure, color='r')
axes[0][0].set_xlabel('exposure time [yr]')

sns.distplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4, 
             hist=False, kde_kws={"shade": True},
             ax=axes[0][1])
axes[0][1].axvline(true_erosion * 1e4, color='r')
axes[0][1].set_xlabel('erosion rate [m Myr-1]')

sns.distplot(mcmc_smodel.trace('soil_density')[:], 
             hist=False, kde_kws={"shade": True},
             ax=axes[1][0])
axes[1][0].axvline(true_density, color='r')
axes[1][0].set_xlabel('soil_density [g cm-3]')

sns.distplot(mcmc_smodel.trace('inheritance')[:], 
             hist=False, kde_kws={"shade": True},
             ax=axes[1][1])
axes[1][1].axvline(true_inheritance, color='r')
axes[1][1].set_xlabel('inheritance [atoms g-1]')

plt.setp(axes, yticks=[])









    Out[18]:





[]

Join plot of the generated samples with marginals, obtained by KDE. This represents an approximation of the 2D marginal PPDs. The red lines indicate the true values.



In [19]:

    
sns.set(style="white")
sns.set_context('paper')

# exposure vs. erosion
p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
                  mcmc_smodel.trace('exposure_time')[:],
                  xlim=[0, 15], ylim=[4e4, 1.2e5],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='erosion rate [m Myr-1]',
         ylabel='exposure time [yr]')
p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')

# exposure vs. inheritance
p = sns.jointplot(mcmc_smodel.trace('inheritance')[:],
                  mcmc_smodel.trace('exposure_time')[:],
                  xlim=[0, 4000], ylim=[4e4, 1.2e5],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='inheritance [atoms g-1]',
         ylabel='exposure time [yr]')
p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_inheritance, color='r')

# exposure vs. density
p = sns.jointplot(mcmc_smodel.trace('soil_density')[:],
                  mcmc_smodel.trace('exposure_time')[:],
                  xlim=[1.8, 2.3], ylim=[4e4, 1.2e5],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='soil density [g cm-3]',
         ylabel='exposure time [yr]')
p.ax_joint.axhline(true_exposure, color='r')
p.ax_joint.axvline(true_density, color='r')

# density vs. erosion
p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
                  mcmc_smodel.trace('soil_density')[:],
                  xlim=[0, 15], ylim=[1.8, 2.3],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='erosion rate [m Myr-1]',
         ylabel='soil density [g cm-3]')
p.ax_joint.axhline(true_density, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')

# inheritance vs. erosion
p = sns.jointplot(mcmc_smodel.trace('erosion_rate')[:] * 1e4,
                  mcmc_smodel.trace('inheritance')[:],
                  xlim=[0, 15], ylim=[0, 4000],
                  kind="kde", size=5, space=0)
plt.setp(p.ax_joint,
         xlabel='erosion rate [m Myr-1]',
         ylabel='inheritance [atoms g-1]')
p.ax_joint.axhline(true_inheritance, color='r')
p.ax_joint.axvline(true_erosion * 1e4, color='r')









    Out[19]:





<matplotlib.lines.Line2D at 0x7f5c1ef14a10>

Plot the measured concentrations vs. depth and the predicted concentration profiles corresponding to the mean and specific quantiles of the PPD. The shaded grey area indicates the 95% HPD interval.



In [20]:

    
sns.set(style="darkgrid")
sns.set_context('paper')

depths = np.linspace(profile_data['depth'].min(),
                     profile_data['depth'].max(),
                     100)

Cm_true = models.C_10Be(depths, true_erosion, true_exposure,
                        true_density, true_inheritance)

res_erosion_rate = erosion_rate.stats()
res_exposure_time = exposure_time.stats()
res_soil_density = soil_density.stats()
res_inheritance = inheritance.stats()
res_mprofile = mprofile.stats()

Cm_mean = models.C_10Be(depths,
                        res_erosion_rate['mean'],
                        res_exposure_time['mean'],
                        res_soil_density['mean'],
                        res_inheritance['mean'])

Cm_median = models.C_10Be(depths,
                          res_erosion_rate['quantiles'][50],
                          res_exposure_time['quantiles'][50],
                          res_soil_density['quantiles'][50],
                          res_inheritance['quantiles'][50])

plt.figure()
plt.errorbar(profile_data['C'],
             -profile_data['depth'],
             xerr=profile_data['std'],
             fmt='o', markersize=4,
             label='data')
plt.plot(Cm_mean, -depths, label='PPD mean')
plt.plot(Cm_median, -depths, label='PPD median')
plt.plot(Cm_true, -depths, color='r', label='true model')
plt.fill_betweenx(-profile_data['depth'],
                  res_mprofile['95% HPD interval'][:,0],
                  res_mprofile['95% HPD interval'][:,1],
                  color='k', alpha=0.3)
plt.setp(plt.gca(),
         xlabel='10Be concentration [atoms g-1]',
         ylabel='-1 * depth [cm]',
         xlim=[0, None], ylim=[None, 0])

plt.legend(loc='lower right')









    Out[20]:





<matplotlib.legend.Legend at 0x7f5c1ef76910>

Show the correlation matrix of the free parameters (computed from the samples generated by MCMC).



In [21]:

    
traces = np.column_stack((mcmc_smodel.trace('erosion_rate')[:],
                          mcmc_smodel.trace('exposure_time')[:],
                          mcmc_smodel.trace('soil_density')[:],
                          mcmc_smodel.trace('inheritance')[:]))

traces_df = pd.DataFrame(data=traces,
                         columns=['erosion rate', 'exposure time',
                                  'soil density', 'inheritance'])

f, ax = plt.subplots()
cmap = sns.blend_palette(["#00008B", "#6A5ACD", "#F0F8FF",
                          "#FFE6F8", "#C71585", "#8B0000"],
                         as_cmap=True)
sns.corrplot(traces_df, annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()

Observations

4 free parameters

The results obtained here have similarities with the results obtained by MLE with 4 parameters (see the notebooks GS_test_4params and MLE_test_4params), though the shapes of the marginal PPDs may differ significantly from those of the profile like-likelihoods.

We can aslo see that the traces and autocorrelations of erosion rate and exposure time are not as good as for the case of 2 free parameters.

This shows the limit of the MCMC sampler used here (Metropolis-Hastings algorithm) in covering extended, diagonal regions of high PPD in the parameter space (this is the case here as the parameters are often strongly correlated with each other).

A solution would be to use a more robust sampler, e.g., the one implemented in the emcee Python package or the one available in PyMC version 3 (alpha). The advantage of the affine-invariant ensemble sampler implemented in emcee is that it is based on multiple, independent random walkers. Combining this with initial samples taken from the PPD evaluated on a grid search with a broad resolution would give a more robust method.



In [ ]:

	depth	C	std
0	50.000000	166682.979192	7332.451747
1	65.517241	139954.308810	6928.552484
2	81.034483	112649.875060	7144.750119
3	96.551724	95360.056162	7344.193919
4	112.068966	73394.677861	3603.947061
5	127.586207	68189.139499	5266.346626
6	143.103448	52111.003997	4900.098383
7	158.620690	40794.116289	5257.312055
8	174.137931	41211.026928	3463.789563
9	189.655172	33731.429311	3335.344529
10	205.172414	26794.456376	3842.675669
11	220.689655	23955.274874	2932.714270
12	236.206897	20309.847725	2189.194900
13	251.724138	20835.423055	3424.486949
14	267.241379	18378.628699	2710.987471
15	282.758621	12510.372314	4114.602094
16	298.275862	14199.873575	155.907984
17	313.793103	10335.014696	3135.128259
18	329.310345	9221.310094	2532.018754
19	344.827586	9604.670596	2449.004427
20	360.344828	8817.141726	1942.921509
21	375.862069	7517.649592	1285.869625
22	391.379310	8665.241294	1526.833665
23	406.896552	7559.313667	1624.562701
24	422.413793	5831.077501	946.380210
25	437.931034	6244.165276	2139.771774
26	453.448276	4953.543618	1630.764157
27	468.965517	8024.093601	1468.810926
28	484.482759	9078.341457	1628.295985
29	500.000000	5714.422244	1880.041881