Notes on the general problem of determining exposure times and/or erosion rates from measured profiles of cosmogenic nuclide concentration vs. depth, and on the methods used to solve this problem.
DISCLAIMER: I'm not a statistician. In the text below, I try to detail as deeply as possible my understanding of the common statistical methods used for data fitting and inference - taking cosmogenic nuclide concentration profiles as an example - which will hopefully help in the interpretation of the results. However, there may be some points that I still don't correctly understand.
Author: B. Bovy
Our problem is an inverse problem: we have measured concentrations of cosmogenic nuclides along a depth profile and there is a lack of knowledge about some components of the physical system that caused these concentrations, i.e., exposure time, erosion rate, inheritance... In addition, we can rely on a model of this physical system to predict its behavior, i.e., a mathematical model that predicts the concentrations of a nuclide at given depths, given some parameters (see the notebook Models). Using the observations and the mathematical model, our goal is to improve our knowledge about the phenomena - the unknown parameters of the mathematical model - which caused the observed nuclide concentrations.
In our problem, we have (1) a vector formed by the $p$ parameters of the mathematical model that we want to fit, noted $\mathbf{m} = (\epsilon, t, \rho, C_{t0})$, which represents any data model in the parameter space ${\cal M}$, and (2) a vector formed by the $N$ nuclide concentrations measured along a depth profile, $\mathbf{C^o} = (C^o_i), \, i \in (1, 2 ... N)$, where $C^o_i$ is the concentration measured at depth $z_i$. Note that we defined $\mathbf{m}$ with 4 parameters to fit (erosion rate, exposure time, soil density and inheritance) as an example. Other combinations can be defined, e.g., 2 parameters only $\rightarrow \mathbf{m} = (\epsilon, t)$, modelling the concentrations of both 10Be and 26Al $\rightarrow \mathbf{m} = (\epsilon, t, \rho, C_{t0}^{10Be}, C_{t0}^{26Al})$...
Our problem involves both optimization, i.e., searching for the best data-fitted model in the parameter space, and statistical inference, i.e., taking into account that we have only access to a finite sample of nuclide concentrations vs. depth, for which each measurement is subject to error.
Two statistical approaches can be used to solve our problem: Frequentist inference and Bayesian inference. These two methods are conceptually different, but may lead to the same results under certain conditions. This blog post proposes an interesting explanation and a comparison based on a simple example. Below, we apply each method to our problem.
NOTE 1: we will distinguish between:
NOTE 2: We assume that the measurements of nuclide concentration vs. depth are independent of each other and have each a known, random error. This assumption is important and is the basis of most of the statements that we will make below.
The Maximum Likelihood is the classical approach used by "frequentists". According to frequentists, probabilities are fundamentally related to frequencies of events, in the limit of a large number of repeated experiments. Since the "true" values of parameters of the mathematical model described in the previous section are fixed, non-random numeric values, these have no distribution of frequency and it is therefore meaningless to talk about their probability. It is, however, possible to make statistical inference of these parameters using the frequentist approach.
Concerning our problem, we assumed that the nuclide concentration measurements have a random error. Hence we can represent each observation as a normal distribution given by:
$$ P(C^o_{i} | \mathbf{m}_{\mathrm{true}}) = \frac{1}{\sqrt{2\pi(\sigma^o_i)^2}} \exp \left[- \frac{1}{2} \frac{[C^o_i - C^m(z_i, \mathbf{m}_{\mathrm{true}})]^2}{(\sigma^o_i)^2}\right] $$where $C^o_{i}$ is the measured nuclide concentration at depth $z_i$, $\sigma^o_i$ is the measurement error that is assumed to be known, $C^m(z_i, \mathbf{m}_{\mathrm{true}})$ is the "true" nuclide concentration, which is here assumed to be the concentration predicted by the mathematical model at the same depth $z_i$ with the "true" parameter values $\mathbf{m}_{\mathrm{true}}$, and $P(C^o_{i} | \mathbf{m}_{\mathrm{true}})$ is the conditional probability density of $C^o_{i}$, i.e., the probability of obtaining $C^o_{i}$, given $\mathbf{m}_{\mathrm{true}}$.
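As a quick numerical illustration of this density (with purely hypothetical values for the measurement, its error and the model prediction), it is just a Gaussian centered on the model prediction and can be evaluated with scipy.stats.norm:

```python
import numpy as np
from scipy import stats

# Hypothetical values for a single measurement
C_obs = 1.2e5        # measured concentration C^o_i at depth z_i [atoms/g]
sigma_obs = 8e3      # known measurement error sigma^o_i [atoms/g]
C_model = 1.1e5      # model prediction C^m(z_i, m_true) at the same depth [atoms/g]

# P(C^o_i | m_true): Gaussian density centered on the model prediction
p = stats.norm.pdf(C_obs, loc=C_model, scale=sigma_obs)
print(p)
```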
Extending this to a sample of $N$ independent concentration measurements taken along a depth profile, we can derive the expression of the likelihood function, which is equivalent to the joint probability distribution for all measurements:
$$ {\cal L}(\mathbf{m}_{\mathrm{true}} ; \mathbf{C^o}) \equiv P(\mathbf{C^o} | \mathbf{m}_{\mathrm{true}}) = \prod_{i=1}^N P(C^o_{i} | \mathbf{m}_{\mathrm{true}}) $$We prefer using the likelihood function ${\cal L}(\mathbf{m}_{\mathrm{true}} ; \mathbf{C^o})$ instead of the joint probability distribution $P(\mathbf{C^o} | \mathbf{m}_{\mathrm{true}})$ since in our problem $\mathbf{m}_{\mathrm{true}}$ is rather viewed as the unknown value of the continuous variable $\mathbf{m}$ (that's what we want to estimate), while the measured concentration profile sample $\mathbf{C^o}$ is a known, fixed parameter. Although the two are algebraically equivalent, $P(\mathbf{C^o} | \mathbf{m}_{\mathrm{true}})$ is the evaluation, at our measured concentrations, of a function given in the space defined by all possible concentration measurements, while ${\cal L}(\mathbf{m}_{\mathrm{true}} ; \mathbf{C^o})$ is the evaluation, at the "true" parameter values, of a function given in the parameter space ${\cal M}$. Note that the likelihood function cannot be considered as a conditional probability distribution! Note also that taking the likelihood function instead of the joint probability distribution is purely conventional and for ease of use.
The maximum-likelihood estimator of $\mathbf{m}_{\mathrm{true}}$, noted $\hat{\mathbf{m}}_{\mathrm{MLE}}$, is obtained by finding the value of $\mathbf{m}$ that maximizes ${\cal L}(\mathbf{m} ; \mathbf{C^o})$:
$$ \{ \hat{\mathbf{m}}_{\mathrm{MLE}} \} \subseteq \{ \underset{\mathbf{m} \, \in \, {\cal M}}{\operatorname{arg\,max}}\ {\cal L}(\mathbf{m} ; \mathbf{C^o}) \} $$Note that the value of ${\cal L}(\mathbf{m} ; \mathbf{C^o})$ can become very small, and it is often more convenient to use the log-likelihood function instead, noted $l(\mathbf{m} ; \mathbf{C^o})$, as it can greatly simplify the calculations. It has no impact on the found $\hat{\mathbf{m}}_{\mathrm{MLE}}$ value since the logarithm is a strictly monotonically increasing function. Combining the equations above, it gives:
$$ l(\mathbf{m} ; \mathbf{C^o}) = \ln {\cal L}(\mathbf{m} ; \mathbf{C^o}) = -\frac{1}{2} \sum_{i=1}^{N} \left[ \ln(2 \pi (\sigma^o_i)^2) + \frac{[C^o_i - C^m(z_i, \mathbf{m})]^2}{(\sigma^o_i)^2} \right] $$$\hat{\mathbf{m}}_{\mathrm{MLE}}$ has interesting asymptotic properties (i.e., properties that arise as the sample size increases):
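As a minimal sketch of this log-likelihood, the snippet below uses a deliberately simplified, spallation-only production model in place of the real $C^m(z, \mathbf{m})$ described in the Models notebook; the production rate, attenuation length, decay constant, depths and "true" parameter values are all hypothetical, and the measured concentrations are synthetic:

```python
import numpy as np

# Hypothetical constants for a simplified, spallation-only production model
P0 = 5.0          # surface production rate [atoms/g/yr]
LAMBDA = 160.0    # attenuation length [g/cm2]
DECAY = 4.99e-7   # 10Be radioactive decay constant [1/yr]

def conc_model(z, eps, t, rho, C_inh):
    """Simplified stand-in for C^m(z, m): depth z [cm], erosion rate eps [cm/yr],
    exposure time t [yr], density rho [g/cm3], inheritance C_inh [atoms/g]."""
    mu = rho / LAMBDA                  # absorption coefficient [1/cm]
    k = DECAY + mu * eps               # combined decay + erosion term [1/yr]
    return C_inh * np.exp(-DECAY * t) + P0 * np.exp(-mu * z) / k * (1 - np.exp(-k * t))

def log_likelihood(m, z, C_obs, sigma_obs):
    """l(m ; C^o) for independent Gaussian measurement errors with known sigma."""
    C_pred = conc_model(z, *m)
    return -0.5 * np.sum(np.log(2 * np.pi * sigma_obs**2)
                         + (C_obs - C_pred)**2 / sigma_obs**2)

# Synthetic depth profile generated with hypothetical "true" parameters + noise
rng = np.random.default_rng(0)
z_obs = np.array([50., 100., 150., 200., 300.])    # sample depths [cm]
m_true = (1e-3, 1e5, 2.0, 1e4)                     # (eps, t, rho, C_inh)
sigma_obs = 5e3 * np.ones_like(z_obs)              # known measurement errors
C_obs = conc_model(z_obs, *m_true) + rng.normal(0., sigma_obs)

print(log_likelihood(m_true, z_obs, C_obs, sigma_obs))
```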
From the equation above we see that, when the measurement errors are random, Maximum Likelihood Estimation is equivalent to Minimum Chi-square Estimation, as the two functions are related, up to a constant, by:
$$ {\cal L}(\mathbf{m} ; \mathbf{C^o}) \propto \exp\left(-\frac{1}{2} \chi^2(\mathbf{m} ; \mathbf{C^o}) \right) $$where
$$ \chi^2(\mathbf{m} ; \mathbf{C^o}) = \sum_{i=1}^N \frac{[C^o_i - C^m(z_i, \mathbf{m})]^2}{(\sigma^o_i)^2} $$The minimum chi-square estimator is given by:
$$ \{ \hat{\mathbf{m}}_{\mathrm{MCE}} \} \subseteq \{ \underset{\mathbf{m} \, \in \, {\cal M}}{\operatorname{arg\,min}}\ \chi^2(\mathbf{m} ; \mathbf{C^o}) \} $$with $\hat{\mathbf{m}}_{\mathrm{MCE}} = \hat{\mathbf{m}}_{\mathrm{MLE}}$ (only if measurement errors are random!).
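A possible numerical sketch of the minimum chi-square estimation, reusing the hypothetical conc_model and the synthetic z_obs, C_obs and sigma_obs from the previous snippet; the Nelder-Mead simplex method is just one choice among many, and the starting point m0 is arbitrary (in practice the parameters should also be rescaled to comparable magnitudes):

```python
import numpy as np
from scipy import optimize

def chi2(m, z, C_obs, sigma_obs):
    """chi^2(m ; C^o): sum of squared normalized residuals."""
    C_pred = conc_model(z, *m)
    return np.sum((C_obs - C_pred)**2 / sigma_obs**2)

m0 = (5e-4, 5e4, 1.8, 5e3)   # arbitrary starting point in the parameter space
res = optimize.minimize(chi2, m0, args=(z_obs, C_obs, sigma_obs),
                        method='Nelder-Mead',
                        options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 20000})
m_mce = res.x                 # minimum chi-square estimate of m
print(m_mce, res.fun)
```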
Finding the minimum of $\chi^2(\mathbf{m} ; \mathbf{C^o})$ or the maximum of ${\cal L}(\mathbf{m} ; \mathbf{C^o})$ - or equivalently the minimum of $-{\cal L}(\mathbf{m} ; \mathbf{C^o})$ - is an optimization problem. The goal is to find where the first derivatives vanish, i.e., $d\chi^2 = 0$ or $d{\cal L} = 0$. It can be solved either analytically for simple cases or numerically for more complex cases.
Solving an optimization problem depends on whether the mathematical model is linear or non-linear, i.e., whether there is a linear or non-linear dependence between the predicted variable and the model parameters (our model of nuclide concentration vs. depth is non-linear). Non-linear models have additional optimization issues that linear models don't have:
Note that these issues are not very relevant if, alternatively, one uses the grid-search method (also called brute force) to find the minimum. This method simply consists of calculating ${\cal L}(\mathbf{m} ; \mathbf{C^o})$ or $\chi^2(\mathbf{m} ; \mathbf{C^o})$ at a great number of points distributed uniformly in the parameter space, and then taking the minimum / maximum value. However, it can be computationally expensive, and it is based on an arbitrary subset of the parameter space (finite bounds and resolution).
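A brute-force sketch for a reduced, 2-parameter version of the problem, $\mathbf{m} = (\epsilon, t)$, with the other parameters held fixed at hypothetical values; the grid bounds and resolution are arbitrary, and the snippet reuses conc_model and the synthetic data defined above:

```python
import numpy as np

# Fix rho and C_inh (hypothetical values) and grid-search over (eps, t) only
rho_fix, C_inh_fix = 2.0, 1e4
eps_grid = np.linspace(1e-4, 3e-3, 200)   # erosion rate [cm/yr]
t_grid = np.linspace(1e4, 3e5, 200)       # exposure time [yr]

chi2_grid = np.empty((eps_grid.size, t_grid.size))
for i, eps in enumerate(eps_grid):
    for j, t in enumerate(t_grid):
        C_pred = conc_model(z_obs, eps, t, rho_fix, C_inh_fix)
        chi2_grid[i, j] = np.sum((C_obs - C_pred)**2 / sigma_obs**2)

# Location of the minimum chi-square on the (coarse, bounded) grid
i_min, j_min = np.unravel_index(np.argmin(chi2_grid), chi2_grid.shape)
print("grid minimum:", eps_grid[i_min], t_grid[j_min], chi2_grid[i_min, j_min])
```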
We can only measure a finite number of nuclide concentrations along a depth profile (moreover, this number is often very limited as collecting this kind of data is expensive). If we take other samples along the profile, it is very likely that we will obtain different values of ${\cal L}(\mathbf{m} ; \mathbf{C^o})$, and therefore different values of $\hat{\mathbf{m}}_{\mathrm{MLE}}$. There is thus some degree of uncertainty, due to sampling, about how close $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is to $\mathbf{m}_{\mathrm{true}}$. While $\mathbf{m}_{\mathrm{true}}$ is fixed, $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is a random variable.
It is possible to quantify this uncertainty. Of course we can repeat the MLE process on a great number of independent samples of measured nuclide concentrations vs. depth, so that we will obtain a sample of the distribution of $\hat{\mathbf{m}}_{\mathrm{MLE}}$ and hence characterize this distribution. However, collecting many independent data samples is most of the time too expensive. One can "generate" other independent samples from the collected data sample using Monte-Carlo or bootstrapping methods, but it is not always possible in practice. Other methods are based on the asymptotic properties of the distribution of $\hat{\mathbf{m}}_{\mathrm{MLE}}$ (Gaussian) or the distribution of statistics based on $\hat{\mathbf{m}}_{\mathrm{MLE}}$.
Most useful is to calculate confidence regions in the parameter space around $\hat{\mathbf{m}}_{\mathrm{MLE}}$. Note that these regions don't represent the probability that $\mathbf{m}_{\mathrm{true}}$ falls within the region (as said before, for frequentists it is meaningless to define probabilities for the fixed, non-random "true" values of the model parameters). The 100(1-$\alpha$)% confidence region should rather be interpreted as the region that would encompass $\mathbf{m}_{\mathrm{true}}$ 100(1-$\alpha$)% of the time if we repeated the process of collecting a sample of nuclide concentrations vs. depth. More precisely, according to its calculation detailed below, the 100(1-$\alpha$)% confidence region is defined by all possible values of $\mathbf{m}$ that we would assume to be $\mathbf{m}_{\mathrm{true}}$ and for which the difference with $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is not statistically significant at the $\alpha$ level, given our data sample.
Even more useful than the confidence region are the confidence intervals (CIs), which correspond to the projection of a given confidence region on each parameter $m_i$. There are two ways of calculating these intervals; each is reviewed below.
These intervals are obtained from the inversion of the Wald test, which is based on the asymptotic Gaussian distribution of $\hat{\mathbf{m}}_{\mathrm{MLE}}$. Because $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is asymptotically Gaussian, the log-likelihood around $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is approximately quadratic and can be used as a surrogate: we can estimate the standard error of each parameter $m_i$ from the partial second derivative of $l(\hat{\mathbf{m}}_{\mathrm{MLE}} ; \mathbf{C^o})$ with respect to $m_i$. The partial second derivatives of $l(\hat{\mathbf{m}}_{\mathrm{MLE}} ; \mathbf{C^o})$ define the curvature of the hyper-paraboloid centered on $\hat{\mathbf{m}}_{\mathrm{MLE}}$, which is meaningful here because a hyper-paraboloid is precisely the shape of the log-likelihood of a Gaussian distribution. This information is contained in the so-called Fisher information matrix, which is often provided (or can at least be derived) by the optimization algorithms. The 100(1-$\alpha$)% Wald CI for the parameter $m_i$ is then given by:
$$ \hat{m}_{i, \mathrm{MLE}} \pm Z_{1-\alpha/2} \, \hat\sigma_i $$where $\hat{m}_{i, \mathrm{MLE}}$ is the value of the $i$th parameter of $\hat{\mathbf{m}}_{\mathrm{MLE}}$, $Z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the standard Gaussian distribution and $\hat\sigma_i$ is the estimated standard error of $\hat{m}_{i, \mathrm{MLE}}$, which can be derived from the Fisher information matrix.
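A rough sketch of the Wald CI calculation, reusing log_likelihood and the minimum chi-square solution m_mce from the snippets above as a starting point. With the BFGS method, scipy returns hess_inv, which can be used as a crude approximation of the inverse observed information matrix; in practice the parameters should be rescaled to comparable magnitudes, or the Hessian computed more carefully (e.g., by finite differences), before trusting these standard errors:

```python
import numpy as np
from scipy import optimize, stats

neg_log_lik = lambda m: -log_likelihood(m, z_obs, C_obs, sigma_obs)

# Minimize -l(m ; C^o); with BFGS, res.hess_inv approximates the inverse observed
# information matrix, i.e. the estimated covariance matrix of m_MLE
res = optimize.minimize(neg_log_lik, m_mce, method='BFGS')
m_mle = res.x
cov = res.hess_inv                      # approximate covariance matrix
se = np.sqrt(np.diag(cov))              # estimated standard errors sigma_i

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # quantile Z_{1-alpha/2}
for name, m_i, s_i in zip(('eps', 't', 'rho', 'C_inh'), m_mle, se):
    print(f"{name}: {m_i:.4g} +/- {z_crit * s_i:.4g}  (95% Wald CI)")
```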
The advantage of Wald-type CIs is that they can be calculated without much effort. The main disadvantage is that they rely heavily on the assumed normality of $\hat{\mathbf{m}}_{\mathrm{MLE}}$. This assumption holds for very large sample sizes, but it may be incorrect for small to moderate sample sizes. It may also work poorly if the distribution of $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is markedly skewed (for example when $\hat{m}_{i, \mathrm{MLE}}$ is close to the boundary of the parameter space) or if the standard error is a poor estimate of the standard deviation of the estimator.
When the observed distribution of $\hat{m}_{i, \mathrm{MLE}}$ is not Gaussian, a workaround is to first transform the variable $m_i$ into $\phi_i$ so that the distribution of $\hat{\phi}_{i, \mathrm{MLE}}$ is closer to Gaussian. After having replaced $m_i$ by $\phi_i$, we then perform MLE and calculate the Wald CI. However, the obtained CI limits, transformed back into the original scale, may not be symmetric about $\hat{m}_{i, \mathrm{MLE}}$ (which is not really a problem). Finding a good transformation is not straightforward. Generally, other, more robust methods are preferred to calculate the CIs.
These intervals are based on the computation of the profile likelihood for one or more parameters and the use of the likelihood ratio test.
The profile likelihood is defined by fixing the value of one or more parameters of $\mathbf{m}$ and by maximizing the likelihood function over the other parameters. The profile likelihood is thus not a likelihood function, it is a function formed by maximum values of a likelihood function! The profile likelihood for the $i$th parameter, $m_i$, is given by:
$$ {\cal L_1}(m_i ; \mathbf{m}_{\mathrm{J}}, \mathbf{C^o}) = \underset{\mathbf{m}_{\mathrm{J}} \, \in \, {\cal M_J}}{\operatorname{max}} {\cal L}(\mathbf{m}_{\mathrm{J}} ; m_i, \mathbf{C^o})$$where $\mathbf{m}_{\mathrm{J}} = (m_j); \, \forall \, m_j \in \mathbf{m}, \, m_j \neq m_i$, and ${\cal M_J}$ is a subset of the parameter space ${\cal M}$ that doesn't contain the $i$th dimension.
The likelihood ratio test is used to compare the likelihoods of two data models - one being a special case of the other - as two hypotheses, and then to conclude at an $\alpha$ level of significance which hypothesis is better supported by the data.
For example, let's choose a fixed value for $m_i$, noted $m_{i, \mathrm{null}}$, for which we make the null hypothesis that this value is the true parameter value. The alternative hypothesis is that the true parameter value differs from $m_{i, \mathrm{null}}$ (and is better approximated by the maximum likelihood estimate). Thus we have:
$$ H_0: \, m_{i, \mathrm{null}} = m_{i, \mathrm{true}} $$$$ H_1: \, m_{i, \mathrm{null}} \neq m_{i, \mathrm{true}} $$For our set of hypotheses, the likelihood ratio statistic, noted $\Lambda$, is minus twice the logarithm of the ratio of the value of the profile likelihood at $m_{i, \mathrm{null}}$ to the value of the likelihood at $\hat{\mathbf{m}}_{\mathrm{MLE}}$:
$$ \Lambda = - 2 \ln \left( \frac{{\cal L_1}(m_{i, \mathrm{null}} ; \mathbf{m}_{\mathrm{J}}, \mathbf{C^o})}{{\cal L}(\hat{\mathbf{m}}_{\mathrm{MLE}} ; \mathbf{C^o})} \right) $$We can also use the profile log-likelihood and the log-likelihood, and then:
$$ \Lambda = 2 \, [l(\hat{\mathbf{m}}_{\mathrm{MLE}} ; \mathbf{C^o}) - l_1(m_{i, \mathrm{null}} ; \mathbf{m}_{\mathrm{J}}, \mathbf{C^o})] $$From statistical theory one can show that $\Lambda$ asymptotically follows a $\chi^2(\nu_2 - \nu_1)$ distribution with $\nu_2 - \nu_1$ degrees of freedom, $\nu_2$ being the number of parameters in $\mathbf{m}$ and $\nu_1$ being the number of parameters in $\mathbf{m}_{\mathrm{J}}$ (in our example, $\nu_2 - \nu_1 = 1$). Therefore, we cannot reject $H_0$ at the significance level $\alpha$ if
$$ \Lambda < \chi^2_{1-\alpha}(\nu_2 - \nu_1) $$where $\chi^2_{1-\alpha}(\nu_2 - \nu_1)$ is the $(1-\alpha)$th quantile of $\chi^2(\nu_2 - \nu_1)$.
Hence, we can invert our likelihood ratio test in order to construct the CI for the parameter $m_i$. The 100(1-$\alpha$)% CI is defined by all values of $m_i$ for which we cannot reject the null hypothesis $H_0$ at the $\alpha$ level of significance, i.e., the values that verify:
$$ \Lambda - \chi^2_{1-\alpha}(\nu_2 - \nu_1) < 0 $$The limits of the CI can be calculated by finding the roots of the function defined by the left-hand side of this inequality, searching in both directions from $\hat{m}_{i, \mathrm{MLE}}$.
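A sketch of the profile likelihood CI calculation for a single parameter, reusing log_likelihood and m_mle from the snippets above; the bracketing bounds passed to the root finder are arbitrary and may need to be adjusted so that they actually bracket a sign change:

```python
import numpy as np
from scipy import optimize, stats

def profile_log_lik(m_i_value, i_fixed):
    """l_1(m_i ; C^o): maximize the log-likelihood over all other parameters,
    with the i-th parameter held fixed."""
    def neg_ll_reduced(m_other):
        m_full = np.insert(m_other, i_fixed, m_i_value)
        return -log_likelihood(m_full, z_obs, C_obs, sigma_obs)
    m0 = np.delete(m_mle, i_fixed)
    res = optimize.minimize(neg_ll_reduced, m0, method='Nelder-Mead')
    return -res.fun

l_max = log_likelihood(m_mle, z_obs, C_obs, sigma_obs)
alpha = 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, df=1)   # chi^2_{1-alpha}(1)

def lr_root(m_i_value, i_fixed):
    """Lambda(m_i) - chi^2_{1-alpha}(1): zero at the CI limits."""
    return 2 * (l_max - profile_log_lik(m_i_value, i_fixed)) - chi2_crit

# 95% profile likelihood CI for the exposure time t (index 1 in m);
# the bracketing bounds below are arbitrary and problem-dependent
i = 1
lower = optimize.brentq(lr_root, 0.2 * m_mle[i], m_mle[i], args=(i,))
upper = optimize.brentq(lr_root, m_mle[i], 5.0 * m_mle[i], args=(i,))
print(f"t: {m_mle[i]:.4g}  95% CI = [{lower:.4g}, {upper:.4g}]")
```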
Compared to the Wald CI method, the profile likelihood CI method has the advantage that it doesn't assume the normality of $\hat{m}_{i, \mathrm{MLE}}$. Unlike Wald CIs, the profile likelihood CIs don't change if we transform the parameters. However, since the likelihood ratio test relies on the asymptotic $\chi^2$ distribution of $\Lambda$, the calculation of the CIs is still an approximation for finite samples.
The calculation of CIs from Minimum $\chi^2$ Estimation is similar to the two methods presented above for MLE. One can show that the chi-square value at the minimum, $\chi^2(\hat{\mathbf{m}}_{\mathrm{MCE}} ; \mathbf{C^o})$, follows a $\chi^2(N-p)$ distribution, where $N-p$ is the number of degrees of freedom (use with caution with non-linear models and/or with additional constraints, see below).
Wald CIs can be calculated if Minimum $\chi^2$ Estimation is equivalent to MLE (e.g., if measurement errors are random), and for large sample sizes (i.e., high degrees of freedom), as the $\chi^2$ distribution tends to a Gaussian distribution as the degrees of freedom increase.
Similarly to the profile likelihood method for MLE, it is possible to profile the chi-square, i.e., compute the minimum $\chi^2$ for fixed values of one or more parameters, and then use the F-test with the ratio of two $\chi^2$ values.
Because of the random measurement errors, the predicted nuclide concentrations should never perfectly fit the measured concentrations. In fact, as for any data fitting problem, our problem consists in estimating a parent distribution, which is actually a probability distribution. This distribution is unknown: it has unknown mean, variance and form. In our case:
Among these assumptions, the latter is the most interesting and we can test whether it is statistically acceptable or not, given our sample. There are several methods to quantify how well a data model describes the data, i.e., how accurately the fitting function + a combination of parameter values predicts the mean of the normal distributions.
We can look at the chi-square value, defined above (repeated here for clarity):
$$ \chi^2(\mathbf{m} ; \mathbf{C^o}) = \sum_{i=1}^N \frac{[C^o_i - C^m(z_i, \mathbf{m})]^2}{(\sigma^o_i)^2} $$Intuitively, we see that the value of $\chi^2(\mathbf{m} ; \mathbf{C^o})$ is high when the differences between the measured and the predicted concentrations (the residuals) are high, compared to the corresponding measurement errors. Ideally, one should expect that, if we repeat any measurement a great number of times, (1) the set of the resulting residuals will have a standard deviation that is equal to the error of the measurement (variance of the fit = variance of the data), and (2) that the mean of the set of measured concentrations is equal to the concentration predicted by the model. This would mean that the model is the "true" model in the sense that it effectively predicts the mean of the normal distribution assumed for that measurement.
In other words, the normalized residuals,
$$ R_i = \frac{C^o_i - C^m(z_i, \mathbf{m})}{\sigma^o_i}, \, \forall i $$would be each distributed according to the standard Gaussian ($\mu = 0$, $\sigma = 1$) only if $C^m(z, \mathbf{m})$ is the "true" data model, i.e., the "true" mathematical model with the "true" parameters ($\mathbf{m} = \mathbf{m}_{\mathrm{true}}$).
The $\chi^2$ distribution is precisely defined as the sum of the squares of $\nu$ independent standard Gaussians! It follows that, for our problem, $\chi^2(\mathbf{m} ; \mathbf{C^o})$ can be a measure of the "goodness of fit" of the model to the collected data. A common practice is to calculate the reduced chi-square,
$$ \chi^2_{\nu}(\mathbf{m} ; \mathbf{C^o}) = \frac{\chi^2(\mathbf{m} ; \mathbf{C^o})}{\nu} $$which has the advantage that it normalizes for the size of the data sample and model complexity ($\nu = N - p$) and which can be interpreted as follows:
We see that, in fact, the reduced chi-square can also tell us about whether our assumption of known measurement errors is correct or not, but it can't distinguish between this assumption and the goodness of fit of the model.
Moreover, $\chi^2$ has a well-defined probability distribution, which means that one can test the goodness of fit of the model against the case of any random sample of size $N$ (and not only the $N$-sized data sample that we have collected).
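A short sketch of this goodness-of-fit test, reusing the chi2 function and the best-fit parameters from the snippets above; the survival function of the $\chi^2(N-p)$ distribution gives the probability of obtaining a chi-square value at least as large as the observed one if the fitted model were the "true" model:

```python
from scipy import stats

N = z_obs.size                                 # number of measurements
n_params = len(m_mle)                          # number of fitted parameters p
nu = N - n_params                              # degrees of freedom

chi2_min = chi2(m_mle, z_obs, C_obs, sigma_obs)
chi2_red = chi2_min / nu                       # reduced chi-square
p_value = stats.chi2.sf(chi2_min, df=nu)       # P(chi^2 >= chi2_min | "true" model)

print(f"chi2 = {chi2_min:.2f}, reduced chi2 = {chi2_red:.2f}, p-value = {p_value:.3f}")
```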
It seems that there are, however, several issues related to the use of the $\chi^2$ distribution (detailed in Andrae, 2010 (non-reviewed)):
Unlike the chi-square, the likelihood function cannot be considered as a measure of goodness of fit (e.g., Heinrich, 2003).
Andrae, 2010 (non-reviewed) suggests analysing the distribution of residuals, cross-validation or bootstrapping. However, these methods are not easily applicable if the sample size is very small, as is often the case for datasets of measured nuclide concentrations vs. depth.
It is important to note that we consider the measurement errors as the only source of noise in the collected nuclide concentration data. Field processes such as bioturbation, cryoturbation or compaction may also result in "random" noise in the data, but this is not taken into account here. Including those processes in our modelling does not necessarily imply reconsidering the Gaussian distribution of each observation - still valid if the processes are effectively random -, but the assumption of a known variance for each Gaussian would become more questionable.
If those field processes effectively add non-negligible noise to the data on top of the measurement errors, then it is likely that we underestimate the variances of the Gaussian distributions by taking only the latter into account. Therefore, we must be careful in the interpretation of the results. A good model fit may correspond to $\chi^2_{\nu}(\mathbf{m} ; \mathbf{C^o}) > 1$.
The fundamental difference between the Bayesian approach and the Frequentist approach is that, for Bayesians, it is allowed to assign a probability to the values of the parameters of the mathematical model. This doesn't mean that the parameters don't have fixed, non-random "true" values (they must have them!); the probability rather defines our degree of belief about whether the parameter values are the "true" values or not. Frequentists don't agree with this definition of probability because a degree of belief is not related to any frequency of events.
One advantage of the Bayesian approach:
One of its limits:
The Bayesian approach allows us to calculate a posterior probability distribution (PPD), i.e., the probability distribution of the data models given the collected nuclide concentrations vs. depth.
Applying the Bayes theorem gives:
$$ P(\mathbf{m} | \mathbf{C^o}) = \frac{P(\mathbf{m}) P(\mathbf{C^o} | \mathbf{m})}{P(\mathbf{C^o})}$$where $P(\mathbf{m} | \mathbf{C^o})$ is the posterior probability distribution, $P(\mathbf{m})$ is the prior probability distribution (i.e., our prior knowledge about the parameters of the mathematical model that we want to fit), $P(\mathbf{C^o} | \mathbf{m})$ is the probability distribution of the observed concentration profile given the data models, and $P(\mathbf{C^o})$ can be treated as a normalizing constant.
$P(\mathbf{C^o})$ is in fact the non-conditional probability of obtaining the measured nuclide concentration profile, but it is not easy to calculate. We can still derive it by considering that the integral of $P(\mathbf{m} | \mathbf{C^o})$ over ${\cal M}$ should equal 1.
For calculating $P(\mathbf{C^o} | \mathbf{m})$ we can re-use the likelihood function defined for MLE!
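A minimal sketch of the (unnormalized) log-posterior, reusing log_likelihood from the MLE snippets above and assuming uniform priors within hypothetical bounds; any other prior could be plugged into log_prior:

```python
import numpy as np

# Hypothetical uniform (flat) priors on each parameter: (lower, upper) bounds
bounds = [(0.0, 1e-2),     # eps [cm/yr]
          (0.0, 1e6),      # t [yr]
          (1.5, 2.5),      # rho [g/cm3]
          (0.0, 1e5)]      # C_inh [atoms/g]

def log_prior(m):
    """ln P(m): 0 inside the bounds, -inf outside (unnormalized uniform prior)."""
    for value, (lo, hi) in zip(m, bounds):
        if not lo <= value <= hi:
            return -np.inf
    return 0.0

def log_posterior(m):
    """ln P(m | C^o) up to the normalizing constant ln P(C^o)."""
    lp = log_prior(m)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(m, z_obs, C_obs, sigma_obs)
```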
Our goal is to characterize $P(\mathbf{m} | \mathbf{C^o})$ in the multidimensional space ${\cal M}$. The region which maximizes $P(\mathbf{m} | \mathbf{C^o})$ is one property of interest. With a uniform prior probability distribution, the data model that maximizes $P(\mathbf{m} | \mathbf{C^o})$ would correspond to the best data fit model as it would be obtained using MLE (not true in all cases).
The PPD, $P(\mathbf{m} | \mathbf{C^o})$, is a multidimensional function that is not easy to characterize. It has, however, a few properties that are very useful for interpreting the results:
The mean of the PPD for the $i$th parameter is given by:
$$ \bar{m_i} = \int_{\cal M} m_i \, P(\mathbf{m} | \mathbf{C^o}) \, d\mathbf{m} $$The element of the covariance matrix corresponding to the $i$th and $j$th parameters is given by:
$$ C^M_{i,j} = \int_{\cal M} m_i m_j \, P(\mathbf{m} | \mathbf{C^o}) \, d\mathbf{m} - \bar{m_i} \bar{m_j} $$The 1D marginal PPD for the $i$th parameter is given by:
$$ M(m_i) = \int ... \int P(\mathbf{m} | \mathbf{C^o}) \, \prod_{d=1, d \neq i}^p dm_d $$The 2D joint marginal PPD for the $i$th and $j$th parameters is given by:
$$ M(m_i, m_j) = \int ... \int P(\mathbf{m} | \mathbf{C^o}) \, \prod_{d=1, d \neq i, d \neq j}^p dm_d $$$P(\mathbf{m} | \mathbf{C^o})$ is a continuous function in the multidimensional parameter space ${\cal M}$; it is obviously impossible to calculate its value at every point (i.e., for every data model). In most cases the function is also too complex to be characterized analytically; calculating its integrals is not a simple task. Therefore, we have to sample the parameter space.
The Bayesian integrals defined above can be estimated numerically using multidimensional Monte Carlo integration, which is itself based on sampling the parameter space. The Bayesian integrals can be expressed in the following generic form:
$$ J = \int_{\cal M} g(\mathbf{m}) \, P(\mathbf{m} | \mathbf{C^o}) \, d\mathbf{m} $$where $g(\mathbf{m})$ is any function of the parameter space used to define each integrand. Monte Carlo estimation is given by:
$$ \hat{J} = \frac{1}{S} \sum_{k=1}^S \frac{g(\mathbf{m}_k) \, P(\mathbf{m}_k | \mathbf{C^o})}{h(\mathbf{m}_k)} $$where $S$ is the total number of discrete samples, $\mathbf{m}_k$ is the $k$th data model sample, and $h(\mathbf{m}_k)$ is the density distribution of the samples, which is assumed to be normalized, i.e., $\int h(\mathbf{m}) d\mathbf{m} = 1$.
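A naive sketch of this Monte Carlo estimator, drawing the samples uniformly within the hypothetical prior bounds defined above (so that $h$ is constant) and estimating the posterior mean of each parameter; this brute-force sampling quickly becomes very inefficient as the number of parameters grows, which motivates the MCMC approach described below:

```python
import numpy as np

rng = np.random.default_rng(42)
S = 100_000

# Sample uniformly inside the (hypothetical) prior bounds: h(m) = 1 / volume
lo = np.array([b[0] for b in bounds])
hi = np.array([b[1] for b in bounds])
samples = rng.uniform(lo, hi, size=(S, len(bounds)))
h = 1.0 / np.prod(hi - lo)                      # constant, normalized sampling density

# Unnormalized posterior P(m) L(m ; C^o) at each sample (rescaled for stability)
log_post = np.array([log_posterior(m) for m in samples])
w = np.exp(log_post - log_post.max()) / h

# Estimate of P(C^o) (integral with g(m) = 1), up to the rescaling factor
evidence = w.mean()
# Posterior mean of each parameter: integral with g(m) = m_i, divided by P(C^o)
post_mean = (samples * w[:, None]).mean(axis=0) / evidence
print("posterior means:", post_mean)
```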
Without going into details about how they work, Markov Chain Monte Carlo (MCMC) algorithms basically allow one to generate samples in the parameter space whose distribution asymptotically tends towards any given target distribution. Using one of these algorithms, we can therefore sample the parameter space so that, after a number of iterations:
$$ h_{\mathrm{MCMC}}(\mathbf{m}) \simeq P(\mathbf{m} | \mathbf{C^o}) $$Therefore, Bayesian integrals can be estimated simply by computing averages of the generated samples:
$$ \hat{J}_{\mathrm{MCMC}} = \frac{1}{S} \sum_{k=1}^S g(\mathbf{m}_k) $$In other words, the MCMC methods can be used to generate quasi-independent samples in the parameter space as if they were drawn from $P(\mathbf{m} | \mathbf{C^o})$, and it is therefore trivial to calculate the characteristics of $P(\mathbf{m} | \mathbf{C^o})$ (mean, quantiles, covariances, marginal distributions...) from these samples.
Because MCMC is by definition a Markov process (i.e., a process where the next state only depends on the current state), it requires the definition of an arbitrary initial state (initial samples must be provided) and a minimum number of iterations (often called burn-in) before it accurately samples the parameter space following $P(\mathbf{m} | \mathbf{C^o})$. Because adjacent generated samples are also auto-correlated, a common workaround is to keep only one sample every $n$ iterations in order to obtain independent samples (thinning). As the evolution of the MCMC algorithms depends on computed values of the prior, the likelihood and the PPD, which can be very small, it is common to use the logarithms, $\ln P(\mathbf{m}_k)$, $\ln {\cal L}(\mathbf{m}_k ; \mathbf{C^o})$ and $\ln P(\mathbf{m}_k | \mathbf{C^o})$, instead of the original values in order to avoid numerical issues.
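A sketch of the whole MCMC workflow using the third-party emcee package (one possible sampler among many, assumed to be installed); the number of walkers, iterations, burn-in and thinning values below are arbitrary, and log_posterior and m_mle come from the snippets above:

```python
import numpy as np
import emcee   # third-party MCMC package (pip install emcee), assumed available

ndim = 4                                  # number of parameters in m
nwalkers = 32                             # emcee uses an ensemble of walkers

# Start the walkers in a small (arbitrary) ball around the MLE solution
p0 = m_mle + 1e-4 * np.abs(m_mle) * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, 5000, progress=False)

# Discard the burn-in iterations and thin the auto-correlated chain
chain = sampler.get_chain(discard=1000, thin=10, flat=True)   # shape (S, ndim)

# PPD characteristics estimated from the samples: means, covariances, intervals
post_mean = chain.mean(axis=0)
post_cov = np.cov(chain, rowvar=False)
ci_95 = np.percentile(chain, [2.5, 97.5], axis=0)
print(post_mean, ci_95, sep="\n")
```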