Given a set of data $\{d_{i}\}_{i\in [1, N]}$ and a model $M$, the posterior probability of the model (which quantifies the evidence for it) is given by Bayes' theorem:
$$ P(M| \; \{d_{i}\}_{i\in [1, N]}) = \frac{P(\{d_{i}\}_{i\in [1, N]}|\;M)P(M)}{P(\{d_{i}\}_{i\in [1, N]})} $$
In this post, we will discuss how each element of the data set contributes to this total.
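As a concrete illustration, here is a minimal sketch in Python (assuming a hypothetical coin-flip data set and two fully-specified models: a fair coin $M_0$ and a biased coin $M_1$) of computing each model's posterior probability via Bayes' theorem:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.random(20) < 0.5  # 20 flips from a fair coin (True = heads)

def likelihood(data, p_heads):
    # P({d_i} | M): product of per-flip Bernoulli probabilities
    return np.prod(np.where(data, p_heads, 1 - p_heads))

prior = {"M0": 0.5, "M1": 0.5}                               # P(M)
like = {"M0": likelihood(data, 0.5), "M1": likelihood(data, 0.8)}

# Normalisation P({d_i}) = sum over models of P({d_i}|M) P(M)
norm = sum(like[m] * prior[m] for m in prior)
posterior = {m: like[m] * prior[m] / norm for m in prior}
print(posterior)
```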
The joint likelihood can be rewritten as
$$P(\{d_{i}\}_{i\in [1, N]} | \; M) = P(d_1 | \; \{d_{i}\}_{i\in [2, N]}, M) P(\{d_{i}\}_{i\in [2, N]} | \; M)$$
Then, if all the data points are independent, such that the probability of $d_1$ does not change given our knowledge of $\{d_i\}_{i\in [2, N]}$, we can simplify this to
$$P(\{d_{i}\}_{i\in [1, N]} | \; M) = P(d_1 | \; M) P(\{d_{i}\}_{i\in [2, N]} | \; M)$$
and hence, more generally,
$$P(\{d_{i}\}_{i\in [1, N]} | \; M) = \prod_{i=1}^{N} P(d_i | \; M)$$
This provides the intuition that each data point contributes an individual piece of evidence. This is most clearly seen by writing the log-evidence as
$$\log(P(M| \; \{d_i\}_{i\in [1, N]})) = \sum_{i=1}^{N}\log(P(d_i|\; M)) + \log(P(M)) - \log(P(\{d_{i}\}_{i\in [1, N]}))$$
where the final two terms are the log-prior and a normalisation factor.
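To make this per-datum decomposition concrete, here is a short sketch (assuming a hypothetical model of pure Gaussian noise with known mean and variance) showing that the joint log-likelihood of independent data is simply the sum of the individual log-likelihood contributions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100)  # hypothetical data set

per_datum = norm.logpdf(data, loc=0.0, scale=1.0)  # log P(d_i | M) for each i
log_likelihood = per_datum.sum()                   # sum_i log P(d_i | M)

# Up to the data-independent prior and normalisation terms, each entry of
# `per_datum` is that point's additive contribution to the log-evidence.
print(log_likelihood, per_datum[:3])
```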
In the examples above, the probability distributions did not contain any additional parameters, sometimes called nuisance parameters, which may be part of the model. That name is misleading if one considers that such parameters make pertinent predictions, so here we will refer to them as model parameters. In formulating a model, one may have uncertainty about some aspect of it. For example, if the model is simply Gaussian noise, one may be uncertain about either or both of the mean and the variance. There are many more examples, such as the parameters of a signal model, or the hyper-parameters of a hierarchical model. In all cases, however, to compute the evidence one must marginalise over the model parameters. This amounts to using the following identity:
$$P(\{d_{i}\}_{i\in [1, N]} | \; M) = \int_{\theta} \underbrace{ P(\{d_{i}\}_{i\in [1, N]}, \theta | \; M)}_{ \textrm{joint probability of the data and model parameters}} d\theta = \int_{\theta} P(\{d_{i}\}_{i\in [1, N]}| \; \theta, M) P(\theta|\; M) d\theta $$
where $\theta$ is a vector of the model parameters. As such, the integral has a dimensionality equal to the number of model parameters, often making it analytically intractable and requiring the use of more sophisticated methods such as MCMC or nested sampling. Again, at this stage we can assume the data points are independent given $\theta$, such that
$$P(\{d_{i}\}_{i\in [1, N]} | \; M) = \int_{\theta} \prod_{i=1}^{N}P(d_{i}| \; \theta, M) P(\theta|\; M) d\theta $$
Note that the product must stay inside the integral: because the shared parameters $\theta$ correlate the data points, the marginalised likelihood no longer factorises into a product of per-datum terms.
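As an illustration of the marginalisation, the following sketch (assuming a hypothetical one-parameter model: Gaussian noise with unknown mean $\mu$, known unit variance, and a Gaussian prior on $\mu$) evaluates the evidence integral by simple quadrature; with more than a few parameters one would instead turn to MCMC or nested sampling, as noted above:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=0.3, scale=1.0, size=50)  # hypothetical data set

# Grid over the single model parameter theta = mu
mu_grid = np.linspace(-5.0, 5.0, 2001)

# log P({d_i} | mu, M): sum of per-datum log-likelihoods at each grid point
log_like = norm.logpdf(data[:, None], loc=mu_grid[None, :], scale=1.0).sum(axis=0)
log_prior = norm.logpdf(mu_grid, loc=0.0, scale=2.0)  # assumed prior P(mu | M)

# Evidence P({d_i} | M) = integral over mu of P({d_i}|mu,M) P(mu|M)
evidence = trapezoid(np.exp(log_like + log_prior), mu_grid)
print("log-evidence:", np.log(evidence))
```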