Grid Search - Test case - 2 free parameters

An example of applying maximum likelihood estimation (MLE) with only 2 free parameters (erosion rate and exposure time), using the grid search method.

For more info about the method used, see the notebook Inference_Notes.

This example (a test case) is based on a synthetic dataset of 10Be concentration vs. depth, drawn from a distribution with known "true" parameter values.

This notebook has the following external dependencies: numpy, pandas, matplotlib, scipy, and seaborn.


In [1]:
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

%matplotlib inline

The mathematical (Predictive) Model

The mathematical model is available in the models Python module (see the notebook Models)


In [2]:
import models
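
The exact implementation is in the models module; for reference, a commonly used form of the 10Be depth-profile equation (steady erosion, finite exposure, neutron spallation only) looks like the sketch below. The production and decay constants here are typical literature values and are only assumptions, not necessarily what models.C_10Be uses.

import numpy as np

# A minimal sketch of the kind of model implemented in models.C_10Be
# (assumed form). The constants below are typical literature values,
# NOT necessarily those used in the models module.
P_0 = 5.0             # surface production rate [atoms g-1 yr-1] (assumed)
DECAY_10BE = 4.99e-7  # 10Be decay constant [yr-1] (assumed)
ATT_LENGTH = 160.     # neutron attenuation length [g cm-2] (assumed)

def C_10Be_sketch(depth, erosion, exposure, density, inheritance):
    """10Be concentration [atoms g-1] vs. depth [cm]."""
    mu = DECAY_10BE + density * erosion / ATT_LENGTH
    surf_prod = (P_0 / mu) * np.exp(-density * depth / ATT_LENGTH)
    return (inheritance * np.exp(-DECAY_10BE * exposure) +
            surf_prod * (1. - np.exp(-mu * exposure)))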

The Data

The dataset is generated using the following true parameter values


In [3]:
# the true parameters 
true_erosion = 1e-3
true_exposure = 1e5

# fixed parameters
fixed_density = 2.0
fixed_inheritance = 0.

# depths and sample size
depth_minmax = [50, 500]
N = 30

# perturbations
err_magnitude = 20.
err_variability = 5.

The gendata Python module is used to generate the dataset (see the notebook Datasets).


In [4]:
import gendata

profile_data = gendata.generate_dataset(
    models.C_10Be,
    [true_erosion, true_exposure, fixed_density, fixed_inheritance],
    zlimits=depth_minmax,
    n=N,
    err=[err_magnitude, err_variability]
)

profile_data


Out[4]:
depth C std
0 50.000000 155977.908401 7089.943059
1 65.517241 124417.222527 7556.556188
2 81.034483 103784.107834 5947.080217
3 96.551724 90589.550440 5600.494598
4 112.068966 84495.110605 6459.338508
5 127.586207 68094.675136 3935.587768
6 143.103448 56243.581776 5512.315485
7 158.620690 41172.090732 3472.183237
8 174.137931 40208.032148 3819.413371
9 189.655172 32389.742749 1828.566268
10 205.172414 28302.172991 3306.757831
11 220.689655 20248.003436 3683.114280
12 236.206897 21371.717110 2610.280918
13 251.724138 18847.053588 2341.029296
14 267.241379 22287.325468 2947.374344
15 282.758621 17714.355369 1453.071392
16 298.275862 15452.685560 1388.449329
17 313.793103 11105.194525 2612.945488
18 329.310345 6311.091736 2867.388163
19 344.827586 7791.938453 1880.184986
20 360.344828 9908.565314 1850.770860
21 375.862069 10438.750690 2363.221780
22 391.379310 9699.409217 1338.286404
23 406.896552 7693.693952 1991.855197
24 422.413793 6364.890931 2661.509273
25 437.931034 6573.955062 1925.582892
26 453.448276 8675.610268 2023.326955
27 468.965517 6741.533742 1873.011451
28 484.482759 3132.369619 1658.034144
29 500.000000 7269.100898 2019.150967

30 rows × 3 columns

The dataset is stored as a pandas.DataFrame object.
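
For reference, the generation step amounts to something like the sketch below: evenly spaced depths between the two limits, model-predicted concentrations, and a random Gaussian perturbation controlled by err_magnitude and err_variability. The noise model written here is only an assumption; the actual one is defined in the gendata module.

# A rough sketch of gendata.generate_dataset (hypothetical noise model;
# see the gendata module for the real one).
def generate_dataset_sketch(model, params, zlimits=(50, 500), n=30, err=(20., 5.)):
    depths = np.linspace(zlimits[0], zlimits[1], n)
    c_true = model(depths, *params)
    # hypothetical: measurement std scales with err[0] and varies with err[1]
    std = err[0] * np.sqrt(c_true) + err[1] * np.random.random(n)
    c_obs = c_true + std * np.random.randn(n)
    return pd.DataFrame({'depth': depths, 'C': c_obs, 'std': std})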

Fitting the model

The grid search method is implemented in the gridsearch module (see the notebook Grid-Search for more info).


In [5]:
import gridsearch
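
Conceptually, the grid search evaluates the data log-likelihood at every node of a regular grid over the free parameters and keeps the node with the highest value. A minimal sketch of that logic, assuming independent Gaussian measurement errors (the actual implementation lives in the gridsearch module):

import numpy as np

def gaussian_loglike(depth, c_obs, c_std, model, erosion, exposure):
    """Log-likelihood of the data, assuming independent Gaussian errors."""
    c_pred = model(depth, erosion, exposure)
    return -0.5 * np.sum(((c_obs - c_pred) / c_std) ** 2 +
                         np.log(2. * np.pi * c_std ** 2))

def grid_search_mle(depth, c_obs, c_std, model, e_range, t_range):
    """Evaluate the log-likelihood on a regular grid (numpy.mgrid syntax,
    complex steps = number of nodes) and return the maximizing node."""
    e_grid, t_grid = np.mgrid[e_range[0]:e_range[1]:e_range[2],
                              t_range[0]:t_range[1]:t_range[2]]
    loglike = np.array([gaussian_loglike(depth, c_obs, c_std, model, e, t)
                        for e, t in zip(e_grid.ravel(), t_grid.ravel())])
    loglike = loglike.reshape(e_grid.shape)
    imax = np.unravel_index(np.argmax(loglike), loglike.shape)
    return e_grid[imax], t_grid[imax], loglike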

Create a new object for setup and results


In [6]:
gstest = gridsearch.CosmogenicInferenceGC(description='test case')

Set the data


In [7]:
gstest.set_profile_measured(
    profile_data['depth'].values,
    profile_data['C'].values,
    profile_data['std'].values,
    None,
)

Set the model


In [8]:
def C_10Be_2p(depth, erosion, exposure):
    """
    A model for 10Be with 2 parameters.
    """
    return models.C_10Be(depth, erosion, exposure,
                         fixed_density, fixed_inheritance)

gstest.set_profile_model(C_10Be_2p)

Define the parameters to fit and their search ranges / steps. The order must be the same as the order of the arguments of the model function!


In [9]:
gstest.set_parameter(
    'erosion_rate',
    [0., 3e-3, 200j],
    stats.uniform(loc=0, scale=3e-3).pdf
)

gstest.set_parameter(
    'exposure_time',
    [0., 2e5, 200j],
    stats.uniform(loc=0, scale=2e5).pdf
)
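
The complex third element (200j) presumably follows the numpy.mgrid convention, where a complex step means a number of nodes rather than a step size (the setup summary below indeed reports 200 nodes per parameter):

# With numpy's mgrid slice syntax, a complex "step" of 200j gives 200
# evenly spaced nodes including both endpoints:
nodes = np.mgrid[0.:3e-3:200j]
nodes.size             # 200
nodes[0], nodes[-1]    # (0.0, 0.003)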

Grid search setup summary


In [10]:
print(gstest.setup_summary())


Modelling C profile (Bayes, Grid-Search)

DESCRIPTION:
test case

MEASURED PROFILE (30 samples):
                C       depth nucleide          std
0   155977.908401   50.000000     None  7089.943059
1   124417.222527   65.517241     None  7556.556188
2   103784.107834   81.034483     None  5947.080217
3    90589.550440   96.551724     None  5600.494598
4    84495.110605  112.068966     None  6459.338508
5    68094.675136  127.586207     None  3935.587768
6    56243.581776  143.103448     None  5512.315485
7    41172.090732  158.620690     None  3472.183237
8    40208.032148  174.137931     None  3819.413371
9    32389.742749  189.655172     None  1828.566268
10   28302.172991  205.172414     None  3306.757831
11   20248.003436  220.689655     None  3683.114280
12   21371.717110  236.206897     None  2610.280918
13   18847.053588  251.724138     None  2341.029296
14   22287.325468  267.241379     None  2947.374344
15   17714.355369  282.758621     None  1453.071392
16   15452.685560  298.275862     None  1388.449329
17   11105.194525  313.793103     None  2612.945488
18    6311.091736  329.310345     None  2867.388163
19    7791.938453  344.827586     None  1880.184986
20    9908.565314  360.344828     None  1850.770860
21   10438.750690  375.862069     None  2363.221780
22    9699.409217  391.379310     None  1338.286404
23    7693.693952  406.896552     None  1991.855197
24    6364.890931  422.413793     None  2661.509273
25    6573.955062  437.931034     None  1925.582892
26    8675.610268  453.448276     None  2023.326955
27    6741.533742  468.965517     None  1873.011451
28    3132.369619  484.482759     None  1658.034144
29    7269.100898  500.000000     None  2019.150967

[30 rows x 4 columns]

PROFILE MODEL:
C_10Be_2p
A model for 10Be with 2 parameters.

'UNKNOWN' PARAMETERS (2):
erosion_rate:
	prior: <bound method rv_frozen.pdf of <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fc3cd329c90>>
	range: [0.0, 0.003, 200j]
exposure_time:
	prior: <bound method rv_frozen.pdf of <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fc3cd329c50>>
	range: [0.0, 200000.0, 200j]

degrees of freedom: 28

GRID SEARCH:
nb. of nodes per parameter: [200, 200]
total nb. of nodes: 40000


Perform maximum likelihood estimation on the search grid


In [11]:
gstest.compute_mle()

Get the MLE (i.e., the parameter values at the maximum likelihood), in the same order as the parameters were defined


In [12]:
gstest.mle


Out[12]:
[array([ 0.0009799]), array([ 97487.43718593])]
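
The estimates can be compared directly with the true values used to generate the dataset, for instance:

erosion_mle, exposure_mle = [float(v) for v in gstest.mle]

print("erosion rate:  MLE = {:.3e}  (true: {:.3e})".format(erosion_mle, true_erosion))
print("exposure time: MLE = {:.3e}  (true: {:.3e})".format(exposure_mle, true_exposure))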

Plot the profile log-likelihood for each parameter. The blue line represents the difference between the profile log-likelihood and the maximum log-likelihood. The intersections between the blue line and the black lines define the confidence intervals at the given confidence levels (based on the likelihood ratio test). The red lines indicate the true values.
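
For reference, the cut-off values follow from the likelihood ratio test: with one profiled parameter, the profile log-likelihood is cut at half a chi-square quantile (1 degree of freedom) below the maximum, which is presumably what gridsearch.profile_likelihood_crit computes:

# log-likelihood drop corresponding to each confidence level
# (1 profiled parameter => chi-square with 1 degree of freedom)
for conf in [0.68, 0.95, 0.997]:
    print("{:.1%}: {:.2f}".format(conf, -0.5 * stats.chi2.ppf(conf, df=1)))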


In [13]:
%matplotlib inline

def plot_proflike1d(cobj, pname, clevels=[0.68, 0.95, 0.997],
                    true_val=None, ax=None):
    
    p = cobj.parameters[pname]
    pindex = cobj.parameters.keys().index(pname)
    
    x = cobj.grid[pindex].flatten()
    proflike = cobj.proflike1d[pindex]

    if ax is None:
        ax = plt.subplot(111)
    
    difflike = proflike - cobj.maxlike
    
    ax.plot(x, difflike, label='profile loglike')
    
    ccrit = gridsearch.profile_likelihood_crit(
        cobj.proflike1d[pindex],
        cobj.maxlike,
        clevels=clevels
    )
    ccrit -= cobj.maxlike
    
    for lev, cc in zip(clevels, ccrit):
        l = ax.axhline(cc, color='k')
        hpos = x.min() + (x.max() - x.min()) * 0.05
        ax.text(hpos, cc, str(lev * 100),
                size=9, color=l.get_color(),
                ha="center", va="center",
                bbox=dict(ec='1', fc='1'))
    
    if true_val is not None:
        ax.axvline(true_val, color='r')
    
    plt.setp(ax, xlabel=pname,
             ylabel='profile log-like - max log-like',
             xlim=p['range'][0:2],
             ylim=[ccrit[-1], 0.])
    

def plot_proflike1d_all(cobj, n_subplot_cols=2, **kwargs):
    
    true_vals = [true_erosion, true_exposure]
    
    n_subplots = len(cobj.parameters)
    n_subplot_rows = int(math.ceil(1. * 
                                   n_subplots /
                                   n_subplot_cols))
    
    fig, aax = plt.subplots(nrows=n_subplot_rows,
                            ncols=n_subplot_cols,
                            **kwargs)
    axes = aax.flatten()
    fig.text(0.5, 0.975,
             "Profile log-like: " + cobj.description,
             horizontalalignment='center',
             verticalalignment='top')
    
    for i, pname in enumerate(cobj.parameters.keys()):
        ax = axes[i]
        plot_proflike1d(cobj, pname,
                        true_val=true_vals[i],
                        ax=ax)
    
    plt.tight_layout()
    plt.subplots_adjust(top=0.93)


plot_proflike1d_all(gstest, figsize=(8, 4))


Show the log-likelihood vs. erosion rate and exposure time. Confidence regions (also based on the likelihood ratio test) are shown as well.
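
For the joint confidence regions the same likelihood-ratio argument applies, but with two profiled parameters (chi-square with 2 degrees of freedom):

# log-likelihood drop for the joint 68% and 95% regions
-0.5 * stats.chi2.ppf([0.68, 0.95], df=2)   # approx. [-1.14, -3.00]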


In [14]:
def plot_proflike2d(cobj, p1p2, ax=None,
                    cmap='Blues', show_colorbar=True):
    
    pname1, pname2 = p1p2
    idim = cobj.parameters.keys().index(pname2)
    jdim = cobj.parameters.keys().index(pname1)
    
    if ax is None:
        ax = plt.subplot(111)
    
    X, Y = np.meshgrid(cobj.grid[idim].flatten(),
                       cobj.grid[jdim].flatten())
    
    difflike = cobj.proflike2d[idim][jdim] - cobj.maxlike
    
    ccrit = gridsearch.profile_likelihood_crit(
        cobj.proflike2d[idim][jdim],
        cobj.maxlike,
        clevels=[0.68, 0.95]
    )
    ccrit -= cobj.maxlike
    
    contours = np.linspace(np.median(difflike),
                           0,
                           10)
    
    P2D = ax.contourf(Y, X, difflike,
                      contours,
                      cmap=plt.get_cmap(cmap))
    
    ci68 = ax.contour(Y, X, difflike,
                      [ccrit[0]], colors='w',
                      linestyles='solid')
    plt.clabel(ci68, fontsize=8, inline=True,
               fmt='68')
    ci95 = ax.contour(Y, X, difflike,
                      [ccrit[1]], colors=['k'],
                      linestyles='solid')
    plt.clabel(ci95, fontsize=8, inline=True,
               fmt='95')
    
    plt.setp(ax, xlabel=pname1, ylabel=pname2)
    
    if show_colorbar:
        plt.colorbar(P2D, ax=ax)
    
    ax.axhline(true_exposure, color='r')
    ax.axvline(true_erosion, color='r')
    

fig = plt.figure(figsize=(6, 6))
ax = plt.subplot(111)
plot_proflike2d(gstest, ('erosion_rate', 'exposure_time'), ax=ax)


Plot the measured concentrations and the predicted profile corresponding to the best-fit model


In [15]:
sns.set_context('notebook')

depths = np.linspace(profile_data['depth'].min(),
                     profile_data['depth'].max(),
                     100)
Cm_fitted = models.C_10Be(depths,
                          gstest.mle[0], gstest.mle[1],
                          fixed_density,
                          fixed_inheritance)

Cm_true = models.C_10Be(depths, true_erosion, true_exposure,
                        fixed_density, fixed_inheritance)

plt.figure()
plt.plot(Cm_fitted, -depths, label='best-fitted model')
plt.plot(Cm_true, -depths, color='r', label='true model')
plt.errorbar(profile_data['C'],
             -profile_data['depth'],
             xerr=profile_data['std'],
             fmt='o', markersize=4,
             label='data')
plt.setp(plt.gca(),
         xlabel='10Be concentration [atoms g-1]',
         ylabel='-1 * depth [cm]',
         xlim=[0, None], ylim=[None, 0])

plt.legend(loc='lower right')


Out[15]:
<matplotlib.legend.Legend at 0x7fc3cd01e6d0>

Observations

2 free parameters

As the sample size increases while keeping the error magnitude fixed, we can see that the shapes of the profile log-likelihoods tend to get closer to a parabola (asymptotic normality) and the confidence intervals narrow. For very small sample sizes (< 10), the confidence intervals are larger and the normality assumption is not reasonable.
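
Asymptotically, the profile log-likelihood near the MLE is approximately quadratic, $\ell_p(\theta) - \ell_{\max} \approx -\frac{1}{2}\left((\theta - \hat{\theta}) / \mathrm{se}(\hat{\theta})\right)^2$, which is why the curves approach a parabola for large sample sizes and why normal-approximation (Wald) intervals then agree with the likelihood-ratio intervals.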

For small sample sizes, we can also see that the maximum likelihood estimates are not always close to the true parameter values (sometimes the true values lie outside the 68% - or even 95% - confidence intervals). The results change from one generated dataset to another, even with the same perturbation parameters. This shows that the maximum likelihood estimator is biased for finite sample sizes; the bias is still visible for sample sizes of $\sim$ 100 to 500. Only with very large sample sizes do the estimates converge towards the true values.

The shape of the likelihood surface and the confidence regions show a positive, non-linear correlation between the erosion rate and the exposure time.

