Leave-One-Out Cross-Validation vs Marginal Likelihood

Let $ \mathcal{D}=(X,y)$ be the data of a regression/classification problem.

The marginal likelihood (ML) is the probability of the data for a given model: $$ ML(\mathcal{D}) = p(y \mid X) $$
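As a concrete example, in Gaussian process regression with a zero-mean prior the ML is a multivariate Gaussian density, $p(y \mid X) = \mathcal{N}(y \mid 0, K + \sigma^2 I)$. A minimal sketch of how one could compute it (the RBF kernel and the noise level are illustrative assumptions, not something fixed by the text above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, z); an assumed, illustrative choice of model."""
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def log_marginal_likelihood(X, y, noise=0.1):
    """log ML(D) = log p(y | X) = log N(y; 0, K + noise^2 I) for zero-mean GP regression."""
    K = rbf_kernel(X, X) + noise ** 2 * np.eye(len(y))
    return multivariate_normal(mean=np.zeros(len(y)), cov=K).logpdf(y)
```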

The leave-one-out marginal likelihood (LOOML) is defined as: $$ \text{LOOML}(\mathcal{D}) = \prod_{i=1}^N p(y_i \mid \overbrace{X_{(i)},x_i}^{X},y_{(i)}) $$ where $y_{(i)}$ is the vector $y$ with its $i$-th entry removed, $y_{(i)} = (y_1,\dots,y_{i-1},y_{i+1},\dots,y_N)$, and $X_{(i)}$ is defined analogously for the inputs.
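For the same GP example, each factor $p(y_i \mid X, y_{(i)})$ is the posterior predictive density at $x_i$ after removing the $i$-th observation. A brute-force sketch, reusing the (assumed) `rbf_kernel` from the snippet above:

```python
from scipy.stats import norm

def log_looml(X, y, noise=0.1):
    """Brute-force log LOOML: sum of log p(y_i | X, y_(i)), refitting the GP predictive N times."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        K = rbf_kernel(X[keep], X[keep]) + noise ** 2 * np.eye(n - 1)
        k_star = rbf_kernel(X[keep], X[[i]])[:, 0]
        mean = k_star @ np.linalg.solve(K, y[keep])
        var = rbf_kernel(X[[i]], X[[i]])[0, 0] + noise ** 2 - k_star @ np.linalg.solve(K, k_star)
        total += norm(loc=mean, scale=np.sqrt(var)).logpdf(y[i])
    return total
```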

How is the ML related to the LOOML? We can decompose the ML so that it resembles the LOOML:

$$ \begin{aligned} ML(\mathcal{D}) = p(y \mid X) &= p(y_1 \mid y_2,\dots,y_N,X)\,p(y_2,\dots,y_N \mid X)\\ &= p(y_1 \mid y_2,\dots,y_N,X)\,p(y_2 \mid y_3,\dots,y_N, X)\, p(y_3,\dots,y_N \mid X) \\ &= p(y_1 \mid y_2,\dots,y_N,X)\,p(y_2 \mid y_3,\dots,y_N, X)\, p(y_3 \mid y_4,\dots,y_N, X)\,p(y_4,\dots,y_N \mid X)\\ &\;\;\vdots \\ &= \prod_{i=1}^N p(y_i \mid y_{i+1},\dots,y_N,X) \end{aligned} $$
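In the GP example each factor $p(y_i \mid y_{i+1},\dots,y_N, X)$ is a univariate Gaussian obtained by conditioning the joint $\mathcal{N}(0, K + \sigma^2 I)$ on the later observations, so this decomposition can be checked numerically. A sketch, again reusing the assumed helpers from the snippets above:

```python
def log_ml_via_chain_rule(X, y, noise=0.1):
    """Evaluate log p(y | X) as sum_i log p(y_i | y_{i+1}, ..., y_N, X) by Gaussian conditioning."""
    n = len(y)
    S = rbf_kernel(X, X) + noise ** 2 * np.eye(n)   # joint covariance of y
    total = 0.0
    for i in range(n):
        post = np.arange(i + 1, n)                   # indices of the later observations
        if len(post) == 0:
            mean, var = 0.0, S[i, i]
        else:
            sol = np.linalg.solve(S[np.ix_(post, post)], S[post, i])
            mean = sol @ y[post]
            var = S[i, i] - sol @ S[post, i]
        total += norm(loc=mean, scale=np.sqrt(var)).logpdf(y[i])
    return total
```

On random data this agrees with `log_marginal_likelihood(X, y)` up to numerical error, for any ordering of the points.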

Now, let us define the vectors $\text{prev}_i = (y_1,\dots,y_{i-1})$ and $\text{post}_i = (y_{i+1},\dots,y_{N})$ (with the convention that $\text{prev}_1$ and $\text{post}_N$ are empty). With this notation, the ML and the LOOML can be written as: $$ \begin{aligned} LOOML(\mathcal{D}) &= \prod_{i=1}^N p(y_i \mid \text{prev}_i, \text{post}_i, X) \\ ML(\mathcal{D}) &= \prod_{i=1}^N p(y_i \mid \text{post}_i, X) \end{aligned} $$

Therefore, to find a relation between the two quantities we just have to relate the probabilities $p(y_i \mid \text{prev}_i,\text{post}_i,X)$ and $p(y_i \mid \text{post}_i,X)$:

$$ \begin{aligned} p(y_i \mid \text{prev}_i,\text{post}_i,X) &=p(y_i \mid \text{prev}_i,\text{post}_i,X) \frac{p(\text{prev}_i \mid \text{post}_i,X)}{p(\text{prev}_i \mid \text{post}_i,X)} \\ &= \frac{p(y_i, \text{prev}_i \mid \text{post}_i,X)}{p(\text{prev}_i \mid \text{post}_i,X)} \\ &= \frac{p(\text{prev}_i \mid y_i, \text{post}_i, X) p(y_i \mid \text{post}_i, X)}{p(\text{prev}_i \mid \text{post}_i,X)} \end{aligned} $$

If we plug this into the LOOML expression we obtain: $$ \begin{aligned} LOOML(\mathcal{D}) &= \prod_{i=1}^N p(y_i \mid \text{prev}_i, \text{post}_i, X) \\ &= \prod_{i=1}^N p(y_i \mid \text{post}_i, X) \frac{p(\text{prev}_i \mid y_i, \text{post}_i, X)}{p(\text{prev}_i \mid \text{post}_i,X)} \\ &= ML(\mathcal{D}) \prod_{i=1}^N \frac{p(\text{prev}_i \mid y_i, \text{post}_i, X)}{p(\text{prev}_i \mid \text{post}_i,X)} \end{aligned} $$
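For the Gaussian example every factor in the correction term is again a Gaussian conditional of the joint $\mathcal{N}(0, K + \sigma^2 I)$, so the identity can be verified numerically. A sketch under the same illustrative assumptions as above:

```python
def gaussian_cond_logpdf(S, y, target, given):
    """log p(y[target] | y[given]) for a zero-mean Gaussian with joint covariance S."""
    if len(target) == 0:
        return 0.0                                    # an empty prev_i contributes a factor of 1
    if len(given) == 0:
        return multivariate_normal(np.zeros(len(target)), S[np.ix_(target, target)]).logpdf(y[target])
    sol = np.linalg.solve(S[np.ix_(given, given)], S[np.ix_(given, target)])
    mean = sol.T @ y[given]
    cov = S[np.ix_(target, target)] - S[np.ix_(target, given)] @ sol
    return multivariate_normal(mean, cov).logpdf(y[target])

def log_correction(X, y, noise=0.1):
    """sum_i [ log p(prev_i | y_i, post_i, X) - log p(prev_i | post_i, X) ] for the GP example."""
    n = len(y)
    S = rbf_kernel(X, X) + noise ** 2 * np.eye(n)
    total = 0.0
    for i in range(n):
        prev, post = np.arange(i), np.arange(i + 1, n)
        total += gaussian_cond_logpdf(S, y, prev, np.r_[i, post])
        total -= gaussian_cond_logpdf(S, y, prev, post)
    return total
```

One can then check that `log_looml(X, y) ≈ log_marginal_likelihood(X, y) + log_correction(X, y)`, which is the identity above in log space.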

The LOOML looks like a somewhat strange counterpart of the more frequentist leave-one-out error:

$$ LOO_{err}(\mathcal{D})= \sum_{i=1}^N loss(y_i,pred(m_{(i)},x_i)) $$

where $pred(m_{(i)},x_i)$ is the prediction of the model $m_{(i)}$, trained without the $i$-th datapoint, evaluated at $x_i$.
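A generic sketch of this quantity; `fit`, `predict` and `loss` are hypothetical placeholders for whatever model and loss one actually uses, and the ordinary-least-squares example below is just an illustrative choice:

```python
import numpy as np

def loo_error(X, y, loss, fit, predict):
    """Frequentist leave-one-out error: refit the model N times, each time holding out point i."""
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = fit(X[keep], y[keep])                 # m_(i): trained without datapoint i
        total += loss(y[i], predict(model, X[i]))     # evaluated at the held-out x_i
    return total

# Illustrative choices: ordinary least squares with squared loss.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, x: float(x @ w)
loss = lambda t, p: (t - p) ** 2
```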

We can reformulate this error using the predictive leave-one-out error (PLOOE), defined as:

$$ \begin{aligned} PLOOE(\mathcal{D}) &= \frac{1}{N}\sum_{i=1}^N \mathbb{E}_{p(t \mid X_{(i)},x_i,y_{(i)})}\big[loss(t,\hat{y}_{(i)}) \big] \\ &= \frac{1}{N}\sum_{i=1}^N \int loss(t,\hat{y}_{(i)}) p(t \mid X_{(i)},x_i, y_{(i)}) dt \end{aligned} $$

where $\hat{y}_{(i)}$ denotes the optimal prediction under the loss function $loss$, that is:

$$ \hat{y}_{(i)} = \arg \min_{a} \left[ \int loss(t,a) p(t \mid X_{(i)},x_i, y_{(i)}) dt \right] $$
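For instance, with the squared loss $loss(t,a)=(t-a)^2$ the optimal prediction is the leave-one-out predictive mean and the PLOOE reduces to the average leave-one-out predictive variance:

$$ \hat{y}_{(i)} = \mathbb{E}\big[t \mid X_{(i)},x_i,y_{(i)}\big], \qquad PLOOE(\mathcal{D}) = \frac{1}{N}\sum_{i=1}^N \mathrm{Var}\big[t \mid X_{(i)},x_i,y_{(i)}\big] $$

In the GP example above, this predictive mean and variance are exactly the `mean` and `var` computed inside `log_looml`.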