Notes on the general problem of determining exposure times and/or erosion rates from measured profiles of cosmogenic nuclide concentration vs. depth, and on the methods used to solve this problem.
DISCLAIMER: I'm not a statistician. In the text below, I try to detail as deeply as possible my understanding of the common statistical methods used for data fitting and inference - taking cosmogenic nuclide concentration profiles as an example - which will hopefully help in the interpretation of the results. However, there may be some points that I still don't correctly understand.
Author: B. Bovy
Our problem is an inverse problem: we have measured concentrations of cosmogenic nuclides along a depth profile and there is a lack of knowledge about some components of the physical system that caused these concentrations, i.e., exposure time, erosion rate, inheritance... In addition, we can rely on a model of this physical system to predict its behavior, i.e., a mathematical model that predicts the concentrations of a nuclide at given depths, given some parameters (see the notebook Models). Using the observations and the mathematical model, our goal is to improve our knowledge about the phenomena - the unknown parameters of the mathematical model - which caused the observed nuclide concentrations.
In our problem, we have (1) a vector formed by the $p$ parameters of the mathematical model that we want to fit, noted $\mathbf{m} = (\epsilon, t, \rho, C_{t0})$, which represents any data model in the parameter space ${\cal M}$, and (2) a vector formed by the $N$ nuclide concentrations measured along a depth profile, $\mathbf{C^o} = (C^o_i), \, i \in (1, 2 ... N)$, where $C^o_i$ is the concentration measured at depth $z_i$. Note that we defined $\mathbf{m}$ with 4 parameters to fit (erosion rate, exposure time, soil density and inheritance) as an example. Other combinations can be defined, e.g., 2 parameters only $\rightarrow \mathbf{m} = (\epsilon, t)$, modelling the concentrations of both 10Be and 26Al $\rightarrow \mathbf{m} = (\epsilon, t, \rho, C_{t0}^{10Be}, C_{t0}^{26Al})$...
Our problem involves both optimization, i.e., searching for the best data-fitted model in the parameter space, and statistical inference, i.e., taking into account that we have only access to a finite sample of nuclide concentrations vs. depth, for which each measurement is subject to error.
Two statistical approaches can be used to solve our problem: Frequentist inference and Bayesian inference. These two methods are conceptually different, but may lead to the same results under certain conditions. This blog post proposes an interesting explanation and a comparison based on a simple example. Below, we apply each method to our problem.
NOTE 1: we will distinguish between:
NOTE 2: We assume that the measurements of nuclide concentration vs. depth are independent of each other and have each a known, random error. This assumption is important and is the basis of most of the statements that we will make below.
The Maximum Likelihood is the classical approach used by "frequentists". According to frequentists, probabilities are fundamentally related to frequencies of events, in the limit of a large number of repeated experiments. Since the "true" values of parameters of the mathematical model described in the previous section are fixed, non-random numeric values, these have no distribution of frequency and it is therefore meaningless to talk about their probability. It is, however, possible to make statistical inference of these parameters using the frequentist approach.
Concerning our problem, we assumed that the nuclide concentration measurements have a random error. Hence we can represent each observation as a normal distribution given by:
$$ P(C^o_{i} | \mathbf{m}_{\mathrm{true}}) = \frac{1}{\sqrt{2\pi(\sigma^o_i)^2}} \exp \left[- \frac{1}{2} \frac{[C^o_i - C^m(z_i, \mathbf{m}_{\mathrm{true}})]^2}{(\sigma^o_i)^2}\right] $$where $C^o_{i}$ is the measured nuclide concentration at depth $z_i$, $\sigma^o_i$ is the measurement error that is assumed to be known, $C^m(z_i, \mathbf{m}_{\mathrm{true}})$ is the "true" nuclide concentration, which is here assumed to be the concentration predicted by the mathematical model at the same depth $z_i$ with the "true" parameter values $\mathbf{m}_{\mathrm{true}}$, and $P(C^o_{i} | \mathbf{m}_{\mathrm{true}})$ is the conditional probability density of $C^o_{i}$, i.e., the probability of obtaining $C^o_{i}$, given $\mathbf{m}_{\mathrm{true}}$.
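As a quick numerical illustration of this density (with purely hypothetical values for the measurement, its error and the model prediction), it is just a Gaussian centered on the model prediction and can be evaluated with scipy.stats.norm:

```python
import numpy as np
from scipy import stats

# Hypothetical values for a single measurement
C_obs = 1.2e5        # measured concentration C^o_i at depth z_i [atoms/g]
sigma_obs = 8e3      # known measurement error sigma^o_i [atoms/g]
C_model = 1.1e5      # model prediction C^m(z_i, m_true) at the same depth [atoms/g]

# P(C^o_i | m_true): Gaussian density centered on the model prediction
p = stats.norm.pdf(C_obs, loc=C_model, scale=sigma_obs)
print(p)
```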
Extending this to a sample of $N$ independent concentration measurements taken along a depth profile, we can derive the expression of the likelihood function, which is equivalent to the joint probability distribution for all measurements:
$$ {\cal L}(\mathbf{m}_{\mathrm{true}} ; \mathbf{C^o}) \equiv P(\mathbf{C^o} | \mathbf{m}_{\mathrm{true}}) = \prod_{i=1}^N P(C^o_{i} | \mathbf{m}_{\mathrm{true}}) $$We prefer using the likelihood function ${\cal L}(\mathbf{m}_{\mathrm{true}} ; \mathbf{C^o})$ instead of the joint probability distribution $P(\mathbf{C^o} | \mathbf{m}_{\mathrm{true}})$ since in our problem $\mathbf{m}_{\mathrm{true}}$ is rather viewed as the unknown value of the continuous variable $\mathbf{m}$ (that's what we want to estimate), while the measured concentration profile sample $\mathbf{C^o}$ is a known, fixed parameter. Although the two are algebraically equivalent, $P(\mathbf{C^o} | \mathbf{m}_{\mathrm{true}})$ is the evaluation, at our measured concentrations, of a function given in the space defined by all possible concentration measurements, while ${\cal L}(\mathbf{m}_{\mathrm{true}} ; \mathbf{C^o})$ is the evaluation, at the "true" parameter values, of a function given in the parameter space ${\cal M}$. Note that the likelihood function cannot be considered as a conditional probability distribution! Note also that taking the likelihood function instead of the joint probability distribution is purely conventional and for ease of use.
The maximum-likelihood estimator of $\mathbf{m}_{\mathrm{true}}$, noted $\hat{\mathbf{m}}_{\mathrm{MLE}}$, is obtained by finding the value of $\mathbf{m}$ that maximizes ${\cal L}(\mathbf{m} ; \mathbf{C^o})$:
$$ \{ \hat{\mathbf{m}}_{\mathrm{MLE}} \} \subseteq \{ \underset{\mathbf{m} \, \in \, {\cal M}}{\operatorname{arg\,max}}\ {\cal L}(\mathbf{m} ; \mathbf{C^o}) \} $$Note that the value of ${\cal L}(\mathbf{m} ; \mathbf{C^o})$ can become very small, and it is often more convenient to use the log-likelihood function instead, noted $l(\mathbf{m} ; \mathbf{C^o})$, as it can greatly simplify the calculations. It has no impact on the found $\hat{\mathbf{m}}_{\mathrm{MLE}}$ value since the logarithm is a strictly monotonically increasing function. Combining the equations above, it gives:
$$ l(\mathbf{m} ; \mathbf{C^o}) = \ln {\cal L}(\mathbf{m} ; \mathbf{C^o}) = -\frac{1}{2} \sum_{i=1}^{N} \left[ \ln(2 \pi (\sigma^o_i)^2) + \frac{[C^o_i - C^m(z_i, \mathbf{m})]^2}{(\sigma^o_i)^2} \right] $$$\hat{\mathbf{m}}_{\mathrm{MLE}}$ has interesting asymptotic properties (i.e., properties that arise as the sample size increases):
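As a minimal sketch of this log-likelihood, the snippet below uses a deliberately simplified, spallation-only production model in place of the real $C^m(z, \mathbf{m})$ described in the Models notebook; the production rate, attenuation length, decay constant, depths and "true" parameter values are all hypothetical, and the measured concentrations are synthetic:

```python
import numpy as np

# Hypothetical constants for a simplified, spallation-only production model
P0 = 5.0          # surface production rate [atoms/g/yr]
LAMBDA = 160.0    # attenuation length [g/cm2]
DECAY = 4.99e-7   # 10Be radioactive decay constant [1/yr]

def conc_model(z, eps, t, rho, C_inh):
    """Simplified stand-in for C^m(z, m): depth z [cm], erosion rate eps [cm/yr],
    exposure time t [yr], density rho [g/cm3], inheritance C_inh [atoms/g]."""
    mu = rho / LAMBDA                  # absorption coefficient [1/cm]
    k = DECAY + mu * eps               # combined decay + erosion term [1/yr]
    return C_inh * np.exp(-DECAY * t) + P0 * np.exp(-mu * z) / k * (1 - np.exp(-k * t))

def log_likelihood(m, z, C_obs, sigma_obs):
    """l(m ; C^o) for independent Gaussian measurement errors with known sigma."""
    C_pred = conc_model(z, *m)
    return -0.5 * np.sum(np.log(2 * np.pi * sigma_obs**2)
                         + (C_obs - C_pred)**2 / sigma_obs**2)

# Synthetic depth profile generated with hypothetical "true" parameters + noise
rng = np.random.default_rng(0)
z_obs = np.array([50., 100., 150., 200., 300.])    # sample depths [cm]
m_true = (1e-3, 1e5, 2.0, 1e4)                     # (eps, t, rho, C_inh)
sigma_obs = 5e3 * np.ones_like(z_obs)              # known measurement errors
C_obs = conc_model(z_obs, *m_true) + rng.normal(0., sigma_obs)

print(log_likelihood(m_true, z_obs, C_obs, sigma_obs))
```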
From the equation above we see that, when the measurement errors are random, Maximum Likelihood Estimation is equivalent to Minimum Chi-square Estimation, as the two functions are related, up to a constant, by:
$$ {\cal L}(\mathbf{m} ; \mathbf{C^o}) \propto \exp\left(-\frac{1}{2} \chi^2(\mathbf{m} ; \mathbf{C^o}) \right) $$where
$$ \chi^2(\mathbf{m} ; \mathbf{C^o}) = \sum_{i=1}^N \frac{[C^o_i - C^m(z_i, \mathbf{m})]^2}{(\sigma^o_i)^2} $$The minimum chi-square estimator is given by:
$$ \{ \hat{\mathbf{m}}_{\mathrm{MCE}} \} \subseteq \{ \underset{\mathbf{m} \, \in \, {\cal M}}{\operatorname{arg\,min}}\ \chi^2(\mathbf{m} ; \mathbf{C^o}) \} $$with $\hat{\mathbf{m}}_{\mathrm{MCE}} = \hat{\mathbf{m}}_{\mathrm{MLE}}$ (only if measurement errors are random!).
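A possible numerical sketch of the minimum chi-square estimation, reusing the hypothetical conc_model and the synthetic z_obs, C_obs and sigma_obs from the previous snippet; the Nelder-Mead simplex method is just one choice among many, and the starting point m0 is arbitrary (in practice the parameters should also be rescaled to comparable magnitudes):

```python
import numpy as np
from scipy import optimize

def chi2(m, z, C_obs, sigma_obs):
    """chi^2(m ; C^o): sum of squared normalized residuals."""
    C_pred = conc_model(z, *m)
    return np.sum((C_obs - C_pred)**2 / sigma_obs**2)

m0 = (5e-4, 5e4, 1.8, 5e3)   # arbitrary starting point in the parameter space
res = optimize.minimize(chi2, m0, args=(z_obs, C_obs, sigma_obs),
                        method='Nelder-Mead',
                        options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 20000})
m_mce = res.x                 # minimum chi-square estimate of m
print(m_mce, res.fun)
```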
Finding the minimum of $\chi^2(\mathbf{m} ; \mathbf{C^o})$ or the maximum of ${\cal L}(\mathbf{m} ; \mathbf{C^o})$ - or equivalently the minimum of $-{\cal L}(\mathbf{m} ; \mathbf{C^o})$ - is an optimization problem. The goal is to find where the first derivatives vanish, i.e., $d\chi^2 = 0$ or $d{\cal L} = 0$. It can be solved either analytically for simple cases or numerically for more complex cases.
Solving an optimization problem depends on whether the mathematical model is linear or non-linear, i.e., whether there is a linear or non-linear dependence between the predicted variable and the model parameters (our model of nuclide concentration vs. depth is non-linear). Non-linear models have additional optimization issues that linear models don't have:
Note that these issues are not very relevant if, alternatively, one uses the grid-search method (also called brute force) to find the minimum. This method simply consists of calculating ${\cal L}(\mathbf{m} ; \mathbf{C^o})$ or $\chi^2(\mathbf{m} ; \mathbf{C^o})$ at a great number of points distributed uniformly in the parameter space, and then taking the minimum / maximum value. However, it can be computationally expensive, and it is based on an arbitrary subset of the parameter space (finite bounds and resolution).
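A brute-force sketch for a reduced, 2-parameter version of the problem, $\mathbf{m} = (\epsilon, t)$, with the other parameters held fixed at hypothetical values; the grid bounds and resolution are arbitrary, and the snippet reuses conc_model and the synthetic data defined above:

```python
import numpy as np

# Fix rho and C_inh (hypothetical values) and grid-search over (eps, t) only
rho_fix, C_inh_fix = 2.0, 1e4
eps_grid = np.linspace(1e-4, 3e-3, 200)   # erosion rate [cm/yr]
t_grid = np.linspace(1e4, 3e5, 200)       # exposure time [yr]

chi2_grid = np.empty((eps_grid.size, t_grid.size))
for i, eps in enumerate(eps_grid):
    for j, t in enumerate(t_grid):
        C_pred = conc_model(z_obs, eps, t, rho_fix, C_inh_fix)
        chi2_grid[i, j] = np.sum((C_obs - C_pred)**2 / sigma_obs**2)

# Location of the minimum chi-square on the (coarse, bounded) grid
i_min, j_min = np.unravel_index(np.argmin(chi2_grid), chi2_grid.shape)
print("grid minimum:", eps_grid[i_min], t_grid[j_min], chi2_grid[i_min, j_min])
```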
We can only measure a finite number of nuclide concentrations along a depth profile (moreover, this number is often very limited as collecting this kind of data is expensive). If we take other samples along the profile, it is very likely that we will obtain different values of ${\cal L}(\mathbf{m} ; \mathbf{C^o})$, and therefore different values of $\hat{\mathbf{m}}_{\mathrm{MLE}}$. There is thus some degree of uncertainty, due to sampling, about how close $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is to $\mathbf{m}_{\mathrm{true}}$. While $\mathbf{m}_{\mathrm{true}}$ is fixed, $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is a random variable.
It is possible to quantify this uncertainty. Of course we can repeat the MLE process on a great number of independent samples of measured nuclide concentrations vs. depth, so that we will obtain a sample of the distribution of $\hat{\mathbf{m}}_{\mathrm{MLE}}$ and hence characterize this distribution. However, collecting many independent data samples is most of the time too expensive. One can "generate" other independent samples from the collected data sample using Monte-Carlo or bootstrapping methods, but it is not always possible in practice. Other methods are based on the asymptotic properties of the distribution of $\hat{\mathbf{m}}_{\mathrm{MLE}}$ (Gaussian) or the distribution of statistics based on $\hat{\mathbf{m}}_{\mathrm{MLE}}$.
Most useful is to calculate confidence regions in the parameter space around $\hat{\mathbf{m}}_{\mathrm{MLE}}$. Note that these regions don't represent the probability that $\mathbf{m}_{\mathrm{true}}$ falls within the region (as said before, for frequentists it is meaningless to define probabilities for the fixed, non-random "true" values of the model parameters). The 100(1-$\alpha$)% confidence region should rather be interpreted as the region that would encompass $\mathbf{m}_{\mathrm{true}}$ 100(1-$\alpha$)% of the time if we repeated the process of collecting a sample of nuclide concentrations vs. depth. More precisely, according to its calculation detailed below, the 100(1-$\alpha$)% confidence region is defined by all possible values of $\mathbf{m}$ that we would assume to be $\mathbf{m}_{\mathrm{true}}$ and for which the difference with $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is not statistically significant at the $\alpha$ level, given our data sample.
Even more useful than the confidence region are the confidence intervals (CIs), which correspond to the projection of a given confidence region on each parameter $m_i$. There are two ways of calculating these intervals; each is reviewed below.
These intervals are obtained from the inversion of the Wald test, which is based on the asymptotic Gaussian distribution of $\hat{\mathbf{m}}_{\mathrm{MLE}}$. Because $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is asymptotically Gaussian, the log-likelihood around $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is approximately quadratic and can be used as a surrogate: we can estimate the standard error of each parameter $m_i$ from the partial second derivative of $l(\hat{\mathbf{m}}_{\mathrm{MLE}} ; \mathbf{C^o})$ with respect to $m_i$. The partial second derivatives of $l(\hat{\mathbf{m}}_{\mathrm{MLE}} ; \mathbf{C^o})$ define the curvature of the hyper-paraboloid centered on $\hat{\mathbf{m}}_{\mathrm{MLE}}$, which is meaningful here because a hyper-paraboloid is precisely the shape of the log-likelihood of a Gaussian distribution. This information is contained in the so-called Fisher information matrix, which is often provided (or can at least be derived) by the optimization algorithms. The 100(1-$\alpha$)% Wald CI for the parameter $m_i$ is then given by:
$$ \hat{m}_{i, \mathrm{MLE}} \pm Z_{1-\alpha/2} \, \hat\sigma_i $$where $\hat{m}_{i, \mathrm{MLE}}$ is the value of the $i$th parameter of $\hat{\mathbf{m}}_{\mathrm{MLE}}$, $Z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the standard Gaussian distribution and $\hat\sigma_i$ is the estimated standard error of $\hat{m}_{i, \mathrm{MLE}}$, which can be derived from the Fisher information matrix.
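A rough sketch of the Wald CI calculation, reusing log_likelihood and the minimum chi-square solution m_mce from the snippets above as a starting point. With the BFGS method, scipy returns hess_inv, which can be used as a crude approximation of the inverse observed information matrix; in practice the parameters should be rescaled to comparable magnitudes, or the Hessian computed more carefully (e.g., by finite differences), before trusting these standard errors:

```python
import numpy as np
from scipy import optimize, stats

neg_log_lik = lambda m: -log_likelihood(m, z_obs, C_obs, sigma_obs)

# Minimize -l(m ; C^o); with BFGS, res.hess_inv approximates the inverse observed
# information matrix, i.e. the estimated covariance matrix of m_MLE
res = optimize.minimize(neg_log_lik, m_mce, method='BFGS')
m_mle = res.x
cov = res.hess_inv                      # approximate covariance matrix
se = np.sqrt(np.diag(cov))              # estimated standard errors sigma_i

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # quantile Z_{1-alpha/2}
for name, m_i, s_i in zip(('eps', 't', 'rho', 'C_inh'), m_mle, se):
    print(f"{name}: {m_i:.4g} +/- {z_crit * s_i:.4g}  (95% Wald CI)")
```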
The advantage of Wald-type CIs is that they can be calculated without much effort. The main disadvantage is that they rely heavily on the assumed normality of $\hat{\mathbf{m}}_{\mathrm{MLE}}$. This assumption holds for very large sample sizes, but it may be incorrect for small to moderate sample sizes. It may also work poorly if the distribution of $\hat{\mathbf{m}}_{\mathrm{MLE}}$ is markedly skewed (for example when $\hat{m}_{i, \mathrm{MLE}}$ is close to the boundary of the parameter space) or if the standard error is a poor estimate of the standard deviation of the estimator.
When the observed distribution of $\hat{m}_{i, \mathrm{MLE}}$ is not Gaussian, a workaround is to first transform the variable $m_i$ into $\phi_i$ so that the distribution of $\hat{\phi}_{i, \mathrm{MLE}}$ is closer to Gaussian. After having replaced $m_i$ by $\phi_i$, we then perform MLE and calculate the Wald CI. However, the obtained CI limits, transformed back into the original scale, may not be symmetric about $\hat{m}_{i, \mathrm{MLE}}$ (which is not really a problem). Finding a good transformation is not straightforward. Generally, other, more robust methods are preferred to calculate the CIs.
These intervals are based on the computation of the profile likelihood for one or more parameters and the use of the likelihood ratio test.
The profile likelihood is defined by fixing the value of one or more parameters of $\mathbf{m}$ and by maximizing the likelihood function over the other parameters. The profile likelihood is thus not a likelihood function, it is a function formed by maximum values of a likelihood function! The profile likelihood for the $i$th parameter, $m_i$, is given by:
$$ {\cal L_1}(m_i ; \mathbf{m}_{\mathrm{J}}, \mathbf{C^o}) = \underset{\mathbf{m}_{\mathrm{J}} \, \in \, {\cal M_J}}{\operatorname{max}} {\cal L}(\mathbf{m}_{\mathrm{J}} ; m_i, \mathbf{C^o})$$where $\mathbf{m}_{\mathrm{J}} = (m_j); \, \forall \, m_j \in \mathbf{m}, \, m_j \neq m_i$, and ${\cal M_J}$ is a subset of the parameter space ${\cal M}$ that doesn't contain the $i$th dimension.
The likelihood ratio test is used to compare the likelihoods of two data models - one being a special case of the other - as two hypotheses, and then to conclude at an $\alpha$ level of significance which hypothesis is better supported by the data.
For example, let's choose a fixed value for $m_i$, noted $m_{i, \mathrm{null}}$, for which we make the null hypothesis that this value is the true parameter value. The alternative hypothesis is that the true parameter value differs from $m_{i, \mathrm{null}}$ (and is better approximated by the maximum likelihood estimate). Thus we have:
$$ H_0: \, m_{i, \mathrm{null}} = m_{i, \mathrm{true}} $$$$ H_1: \, m_{i, \mathrm{null}} \neq m_{i, \mathrm{true}} $$For our set of hypotheses, the likelihood ratio statistic, noted $\Lambda$, is minus twice the logarithm of the ratio of the value of the profile likelihood at $m_{i, \mathrm{null}}$ to the value of the likelihood at $\hat{\mathbf{m}}_{\mathrm{MLE}}$:
$$ \Lambda = - 2 \ln \left( \frac{{\cal L_1}(m_{i, \mathrm{null}} ; \mathbf{m}_{\mathrm{J}}, \mathbf{C^o})}{{\cal L}(\hat{\mathbf{m}}_{\mathrm{MLE}} ; \mathbf{C^o})} \right) $$We can also use the profile log-likelihood and the log-likelihood, and then:
$$ \Lambda = 2 \, [l(\hat{\mathbf{m}}_{\mathrm{MLE}} ; \mathbf{C^o}) - l_1(m_{i, \mathrm{null}} ; \mathbf{m}_{\mathrm{J}}, \mathbf{C^o})] $$From statistical theory one can show that $\Lambda$ asymptotically follows a $\chi^2(\nu_2 - \nu_1)$ distribution with $\nu_2 - \nu_1$ degrees of freedom, $\nu_2$ being the number of parameters in $\mathbf{m}$ and $\nu_1$ being the number of parameters in $\mathbf{m}_{\mathrm{J}}$ (in our example, $\nu_2 - \nu_1 = 1$). Therefore, we cannot reject $H_0$ at the significance level $\alpha$ if
$$ \Lambda < \chi^2_{1-\alpha}(\nu_2 - \nu_1) $$where $\chi^2_{1-\alpha}(\nu_2 - \nu_1)$ is the $(1-\alpha)$th quantile of $\chi^2(\nu_2 - \nu_1)$.
Hence, we can invert our likelihood ratio test in order to construct the CI for the parameter $m_i$. The 100(1-$\alpha$)% CI is defined by all values of $m_i$ for which we cannot reject the null hypothesis $H_0$ at the $\alpha$ level of significance, i.e., the values that verify:
$$ \Lambda - \chi^2_{1-\alpha}(\nu_2 - \nu_1) < 0 $$The limits of the CI can be calculated by finding the roots of the function defined by the left-hand side of this inequality, searching in both directions from $\hat{m}_{i, \mathrm{MLE}}$.
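A sketch of the profile likelihood CI calculation for a single parameter, reusing log_likelihood and m_mle from the snippets above; the bracketing bounds passed to the root finder are arbitrary and may need to be adjusted so that they actually bracket a sign change:

```python
import numpy as np
from scipy import optimize, stats

def profile_log_lik(m_i_value, i_fixed):
    """l_1(m_i ; C^o): maximize the log-likelihood over all other parameters,
    with the i-th parameter held fixed."""
    def neg_ll_reduced(m_other):
        m_full = np.insert(m_other, i_fixed, m_i_value)
        return -log_likelihood(m_full, z_obs, C_obs, sigma_obs)
    m0 = np.delete(m_mle, i_fixed)
    res = optimize.minimize(neg_ll_reduced, m0, method='Nelder-Mead')
    return -res.fun

l_max = log_likelihood(m_mle, z_obs, C_obs, sigma_obs)
alpha = 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, df=1)   # chi^2_{1-alpha}(1)

def lr_root(m_i_value, i_fixed):
    """Lambda(m_i) - chi^2_{1-alpha}(1): zero at the CI limits."""
    return 2 * (l_max - profile_log_lik(m_i_value, i_fixed)) - chi2_crit

# 95% profile likelihood CI for the exposure time t (index 1 in m);
# the bracketing bounds below are arbitrary and problem-dependent
i = 1
lower = optimize.brentq(lr_root, 0.2 * m_mle[i], m_mle[i], args=(i,))
upper = optimize.brentq(lr_root, m_mle[i], 5.0 * m_mle[i], args=(i,))
print(f"t: {m_mle[i]:.4g}  95% CI = [{lower:.4g}, {upper:.4g}]")
```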
Compared to the Wald CI method, the profile likelihood CI method has the advantage that it doesn't assume the normality of $\hat{m}_{i, \mathrm{MLE}}$. Unlike Wald CIs, the profile likelihood CIs don't change if we transform the parameters. However, since the likelihood ratio test relies on the asymptotic $\chi^2$ distribution of $\Lambda$, the calculation of the CIs is still an approximation for finite samples.
The calculation of CIs from Minimum $\chi^2$ Estimation is similar to the two methods presented above for MLE. One can show that the chi-square value at the minimum, $\chi^2(\hat{\mathbf{m}}_{\mathrm{MCE}} ; \mathbf{C^o})$, follows a $\chi^2(N-p)$ distribution, where $N-p$ is the number of degrees of freedom (use with caution with non-linear models and/or with additional constraints, see below).
Wald CIs can be calculated if Minimum $\chi^2$ Estimation is equivalent to MLE (e.g., if measurement errors are random), and for large sample sizes (i.e., high degrees of freedom), as the $\chi^2$ distribution tends to a Gaussian distribution as the degrees of freedom increase.
Similarly to the profile likelihood method for MLE, it is possible to profile the chi-square, i.e., compute the minimum $\chi^2$ for fixed values of one or more parameters, and then use the F-test with the ratio of two $\chi^2$ values.
Because of the random measurement errors, the predicted nuclide concentrations should never perfectly fit the measured concentrations. In fact, as for any data fitting problem, our problem consists in estimating a parent distribution, which is actually a probability distribution. This distribution is unknown: it has unknown mean, variance and form. In our case:
Among these assumptions, the latter is the most interesting and we can test whether it is statistically acceptable or not, given our sample. There are several methods to quantify how well a data model describes the data, i.e., how accurately the fitting function + a combination of parameter values predicts the mean of the normal distributions.
We can look at the chi-square value, defined above (repeated here for clarity):
$$ \chi^2(\mathbf{m} ; \mathbf{C^o}) = \sum_{i=1}^N \frac{[C^o_i - C^m(z_i, \mathbf{m})]^2}{(\sigma^o_i)^2} $$Intuitively, we see that the value of $\chi^2(\mathbf{m} ; \mathbf{C^o})$ is high when the differences between the measured and the predicted concentrations (the residuals) are high, compared to the corresponding measurement errors. Ideally, one should expect that, if we repeat any measurement a great number of times, (1) the set of the resulting residuals will have a standard deviation that is equal to the error of the measurement (variance of the fit = variance of the data), and (2) that the mean of the set of measured concentrations is equal to the concentration predicted by the model. This would mean that the model is the "true" model in the sense that it effectively predicts the mean of the normal distribution assumed for that measurement.
In other words, the normalized residuals,
$$ R_i = \frac{C^o_i - C^m(z_i, \mathbf{m})}{\sigma^o_i}, \, \forall i $$would be each distributed according to the standard Gaussian ($\mu = 0$, $\sigma = 1$) only if $C^m(z, \mathbf{m})$ is the "true" data model, i.e., the "true" mathematical model with the "true" parameters ($\mathbf{m} = \mathbf{m}_{\mathrm{true}}$).
The $\chi^2$ distribution is precisely defined as the sum of the squares of $\nu$ independent standard Gaussians! It follows that, for our problem, $\chi^2(\mathbf{m} ; \mathbf{C^o})$ can be a measure of the "goodness of fit" of the model to the collected data. A common practice is to calculate the reduced chi-square,
$$ \chi^2_{\nu}(\mathbf{m} ; \mathbf{C^o}) = \frac{\chi^2(\mathbf{m} ; \mathbf{C^o})}{\nu} $$which has the advantage that it normalizes for the size of the data sample and model complexity ($\nu = N - p$) and which can be interpreted as follows:
We see that, in fact, the reduced chi-square can also tell us about whether our assumption of known measurement errors is correct or not, but it can't distinguish between this assumption and the goodness of fit of the model.
Moreover, $\chi^2$ has a well-defined probability distribution, which means that one can test the goodness of fit of the model against the case of any random sample of size $N$ (and not only the $N$-sized data sample that we have collected).
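A short sketch of this goodness-of-fit test, reusing the chi2 function and the best-fit parameters from the snippets above; the survival function of the $\chi^2(N-p)$ distribution gives the probability of obtaining a chi-square value at least as large as the observed one if the fitted model were the "true" model:

```python
from scipy import stats

N = z_obs.size                                 # number of measurements
n_params = len(m_mle)                          # number of fitted parameters p
nu = N - n_params                              # degrees of freedom

chi2_min = chi2(m_mle, z_obs, C_obs, sigma_obs)
chi2_red = chi2_min / nu                       # reduced chi-square
p_value = stats.chi2.sf(chi2_min, df=nu)       # P(chi^2 >= chi2_min | "true" model)

print(f"chi2 = {chi2_min:.2f}, reduced chi2 = {chi2_red:.2f}, p-value = {p_value:.3f}")
```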
It seems that there are, however, several issues related to the use of the $\chi^2$ distribution (detailed in Andrae, 2010 (non-reviewed)):
Unlike the chi-square, the likelihood function cannot be considered as a measure of goodness of fit (e.g., Heinrich, 2003).
Andrae, 2010 (non-reviewed) suggests analysing the distribution of residuals, cross-validation or bootstrapping. However, these methods are not easily applicable if the sample size is very small, as is often the case for datasets of measured nuclide concentrations vs. depth.
It is important to note that we consider the measurement errors as the only source of noise in the collected nuclide concentration data. Field processes such as bioturbation, cryoturbation or compaction may also result in "random" noise in the data, but this is not taken into account here. Including those processes in our modelling does not necessarily imply reconsidering the Gaussian distribution of each observation - still valid if the processes are effectively random -, but the assumption of a known variance for each Gaussian would become more questionable.
If those field processes effectively add non-negligible noise to the data on top of the measurement errors, then it is likely that we underestimate the variances of the Gaussian distributions by taking only the latter into account. Therefore, we must be careful in the interpretation of the results. A good model fit may correspond to $\chi^2_{\nu}(\mathbf{m} ; \mathbf{C^o}) > 1$.
The fundamental difference between the Bayesian approach and the Frequentist approach is that, for Bayesians, it is allowed to assign a probability to the values of the parameters of the mathematical model. This doesn't mean that the parameters don't have fixed, non-random "true" values (they must have them!); the probability rather defines our degree of belief about whether the parameter values are the "true" values or not. Frequentists don't agree with this definition of probability because a degree of belief is not related to any frequency of events.
One advantage of the Bayesian approach:
One of its limits:
The Bayesian approach allows us to calculate a posterior probability distribution (PPD), i.e., the probability distribution of the data models given the collected nuclide concentrations vs. depth.
Applying the Bayes theorem gives:
$$ P(\mathbf{m} | \mathbf{C^o}) = \frac{P(\mathbf{m}) P(\mathbf{C^o} | \mathbf{m})}{P(\mathbf{C^o})}$$where $P(\mathbf{m} | \mathbf{C^o})$ is the posterior probability distribution, $P(\mathbf{m})$ is the prior probability distribution (i.e., our prior knowledge about the parameters of the mathematical model that we want to fit), $P(\mathbf{C^o} | \mathbf{m})$ is the probability distribution of the observed concentration profile given the data models, and $P(\mathbf{C^o})$ can be treated as a normalizing constant.
$P(\mathbf{C^o})$ is in fact the non-conditional probability of obtaining the measured nuclide concentration profile, but it is not easy to calculate. We can still derive it by considering that the integral of $P(\mathbf{m} | \mathbf{C^o})$ over ${\cal M}$ should equal 1.
For calculating $P(\mathbf{C^o} | \mathbf{m})$ we can re-use the likelihood function defined for MLE!
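A minimal sketch of the (unnormalized) log-posterior, reusing log_likelihood from the MLE snippets above and assuming uniform priors within hypothetical bounds; any other prior could be plugged into log_prior:

```python
import numpy as np

# Hypothetical uniform (flat) priors on each parameter: (lower, upper) bounds
bounds = [(0.0, 1e-2),     # eps [cm/yr]
          (0.0, 1e6),      # t [yr]
          (1.5, 2.5),      # rho [g/cm3]
          (0.0, 1e5)]      # C_inh [atoms/g]

def log_prior(m):
    """ln P(m): 0 inside the bounds, -inf outside (unnormalized uniform prior)."""
    for value, (lo, hi) in zip(m, bounds):
        if not lo <= value <= hi:
            return -np.inf
    return 0.0

def log_posterior(m):
    """ln P(m | C^o) up to the normalizing constant ln P(C^o)."""
    lp = log_prior(m)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(m, z_obs, C_obs, sigma_obs)
```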
Our goal is to characterize $P(\mathbf{m} | \mathbf{C^o})$ in the multidimensional space ${\cal M}$. The region which maximizes $P(\mathbf{m} | \mathbf{C^o})$ is one property of interest. With a uniform prior probability distribution, the data model that maximizes $P(\mathbf{m} | \mathbf{C^o})$ would correspond to the best data fit model as it would be obtained using MLE (not true in all cases).
The PPD, $P(\mathbf{m} | \mathbf{C^o})$, is a multidimensional function that is not easy to characterize. It has, however, a few properties that are very useful for interpreting the results:
The mean of the PPD for the $i$th parameter is given by:
$$ \bar{m_i} = \int_{\cal M} m_i \, P(\mathbf{m} | \mathbf{C^o}) \, d\mathbf{m} $$The element of the covariance matrix corresponding to the $i$th and $j$th parameters is given by:
$$ C^M_{i,j} = \int_{\cal M} m_i m_j \, P(\mathbf{m} | \mathbf{C^o}) \, d\mathbf{m} - \bar{m_i} \bar{m_j} $$The 1D marginal PPD for the $i$th parameter is given by:
$$ M(m_i) = \int ... \int P(\mathbf{m} | \mathbf{C^o}) \, \prod_{d=1, d \neq i}^p dm_d $$The 2D joint marginal PPD for the $i$th and $j$th parameters is given by:
$$ M(m_i, m_j) = \int ... \int P(\mathbf{m} | \mathbf{C^o}) \, \prod_{d=1, d \neq i, d \neq j}^p dm_d $$$P(\mathbf{m} | \mathbf{C^o})$ is a continuous function in the multidimensional parameter space ${\cal M}$; it is obviously impossible to calculate its value at every point (i.e., for every data model). In most cases the function is also too complex to be characterized analytically; calculating its integrals is not a simple task. Therefore, we have to sample the parameter space.
The Bayesian integrals defined above can be estimated numerically using multidimensional Monte Carlo integration, which is itself based on sampling the parameter space. The Bayesian integrals can be expressed in the following generic form:
$$ J = \int_{\cal M} g(\mathbf{m}) \, P(\mathbf{m} | \mathbf{C^o}) \, d\mathbf{m} $$where $g(\mathbf{m})$ is any function of the parameter space used to define each integrand. Monte Carlo estimation is given by:
$$ \hat{J} = \frac{1}{S} \sum_{k=1}^S \frac{g(\mathbf{m}_k) \, P(\mathbf{m}_k | \mathbf{C^o})}{h(\mathbf{m}_k)} $$where $S$ is the total number of discrete samples, $\mathbf{m}_k$ is the $k$th data model sample, and $h(\mathbf{m}_k)$ is the density distribution of the samples, which is assumed to be normalized, i.e., $\int h(\mathbf{m}) d\mathbf{m} = 1$.
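A naive sketch of this Monte Carlo estimator, drawing the samples uniformly within the hypothetical prior bounds defined above (so that $h$ is constant) and estimating the posterior mean of each parameter; this brute-force sampling quickly becomes very inefficient as the number of parameters grows, which motivates the MCMC approach described below:

```python
import numpy as np

rng = np.random.default_rng(42)
S = 100_000

# Sample uniformly inside the (hypothetical) prior bounds: h(m) = 1 / volume
lo = np.array([b[0] for b in bounds])
hi = np.array([b[1] for b in bounds])
samples = rng.uniform(lo, hi, size=(S, len(bounds)))
h = 1.0 / np.prod(hi - lo)                      # constant, normalized sampling density

# Unnormalized posterior P(m) L(m ; C^o) at each sample (rescaled for stability)
log_post = np.array([log_posterior(m) for m in samples])
w = np.exp(log_post - log_post.max()) / h

# Estimate of P(C^o) (integral with g(m) = 1), up to the rescaling factor
evidence = w.mean()
# Posterior mean of each parameter: integral with g(m) = m_i, divided by P(C^o)
post_mean = (samples * w[:, None]).mean(axis=0) / evidence
print("posterior means:", post_mean)
```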
Without going into details about how they work, Markov Chain Monte Carlo (MCMC) algorithms basically allow one to generate samples in the parameter space whose distribution asymptotically tends towards any given target distribution. Using one of these algorithms, we can therefore sample the parameter space so that, after a number of iterations:
$$ h_{\mathrm{MCMC}}(\mathbf{m}) \simeq P(\mathbf{m} | \mathbf{C^o}) $$Therefore, Bayesian integrals can be estimated simply by computing averages of the generated samples:
$$ \hat{J}_{\mathrm{MCMC}} = \frac{1}{S} \sum_{k=1}^S g(\mathbf{m}_k) $$In other words, the MCMC methods can be used to generate quasi-independent samples in the parameter space as if they were drawn from $P(\mathbf{m} | \mathbf{C^o})$, and it is therefore trivial to calculate the characteristics of $P(\mathbf{m} | \mathbf{C^o})$ (mean, quantiles, covariances, marginal distributions...) from these samples.
Because MCMC is by definition a Markov process (i.e., a process where the next state only depends on the current state), it requires the definition of an arbitrary initial state (initial samples must be provided) and a minimum number of iterations (often called burn-in) before it accurately samples the parameter space following $P(\mathbf{m} | \mathbf{C^o})$. Because adjacent generated samples are also auto-correlated, a common workaround is to keep only one sample every $n$ iterations in order to obtain independent samples (thinning). As the evolution of the MCMC algorithms depends on computed values of the prior, the likelihood and the PPD, which can be very small, it is common to use the logarithms, $\ln P(\mathbf{m}_k)$, $\ln {\cal L}(\mathbf{m}_k ; \mathbf{C^o})$ and $\ln P(\mathbf{m}_k | \mathbf{C^o})$, instead of the original values in order to avoid numerical issues.
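A sketch of the whole MCMC workflow using the third-party emcee package (one possible sampler among many, assumed to be installed); the number of walkers, iterations, burn-in and thinning values below are arbitrary, and log_posterior and m_mle come from the snippets above:

```python
import numpy as np
import emcee   # third-party MCMC package (pip install emcee), assumed available

ndim = 4                                  # number of parameters in m
nwalkers = 32                             # emcee uses an ensemble of walkers

# Start the walkers in a small (arbitrary) ball around the MLE solution
p0 = m_mle + 1e-4 * np.abs(m_mle) * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, 5000, progress=False)

# Discard the burn-in iterations and thin the auto-correlated chain
chain = sampler.get_chain(discard=1000, thin=10, flat=True)   # shape (S, ndim)

# PPD characteristics estimated from the samples: means, covariances, intervals
post_mean = chain.mean(axis=0)
post_cov = np.cov(chain, rowvar=False)
ci_95 = np.percentile(chain, [2.5, 97.5], axis=0)
print(post_mean, ci_95, sep="\n")
```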