Variational free energy

Introduction

Let $f=\{f^*,\mathrm{f}\}$ be the noise-free latent variables of a $\mathcal{GP}$. We write $\mathrm{f}$ for the latent variables at which we have noisy observations $y$, $y_i = \mathrm{f}_i + \epsilon_i$, and $f^*$ for the points at which we want predictions. For the purpose of understanding the Variational Free Energy (VFE) approach to $\mathcal{GP}$s of [Titsias 2009] we will consider the joint posterior $p(f^*,\mathrm{f} \mid y)$, which for the full $\mathcal{GP}$ is:

$$ \begin{aligned} \require{cancel} p(f^*,\mathrm{f} \mid y) &= p(f^* \mid \mathrm{f},\cancel{y})p(\mathrm{f} \mid y) \\ &= \mathcal{N}\left(f^* \mid K_{*f}K_{ff}^{-1}\mathrm{f}, K_{**} - K_{*f}K_{ff}^{-1}K_{f*}\right)\frac{p(y \mid \mathrm{f}) p(\mathrm{f})}{p(y)} \\ &= \mathcal{N}\left(f^* \mid K_{*f}K_{ff}^{-1}\mathrm{f}, K_{**} - K_{*f}K_{ff}^{-1}K_{f*}\right)\frac{\mathcal{N}(y \mid \mathrm{f},\sigma^2 I) \mathcal{N}(\mathrm{f} \mid 0, K_{ff})}{\mathcal{N}(y \mid 0, K_{ff}+\sigma^2 I)} \end{aligned} $$

Note that, for the purpose of making predictions, what we are interested in is not the joint posterior but the predictive posterior $p(f^* \mid y)$. This means we have to integrate $\mathrm{f}$ out of the above equation. When we do this for the full $\mathcal{GP}$ we get the following well-known predictive distribution:

$$ \begin{aligned} p(f^* \mid y ) &= \int p(f^* , \mathrm{f} \mid y) d\mathrm{f} \\ &= \mathcal{N}(f^* \mid K_{*f}(K_{ff}+\sigma^2 I)^{-1}y, \\ &\quad \quad \quad \quad K_{**}- K_{*f}(K_{ff}+\sigma^2 I)^{-1}K_{f*}) \end{aligned} $$

(If you manage to do this integral without it driving you crazy, let me know.)
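To make these formulas concrete, here is a minimal numpy sketch of the exact predictive equations above; the RBF kernel, the toy data and all variable names are my own choices, not anything taken from [Titsias 2009].

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))                    # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)     # noisy observations
Xs = np.linspace(-3, 3, 50)[:, None]                    # prediction inputs
sigma2 = 0.1**2

Kff = rbf(X, X)
Ksf = rbf(Xs, X)
Kss = rbf(Xs, Xs)

# p(f* | y) = N( K_*f (K_ff + sigma^2 I)^-1 y ,
#                K_** - K_*f (K_ff + sigma^2 I)^-1 K_f* )
Ky = Kff + sigma2 * np.eye(len(X))
mean = Ksf @ np.linalg.solve(Ky, y)
cov = Kss - Ksf @ np.linalg.solve(Ky, Ksf.T)
```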

Variational sparse approximation

We want to approximate the true joint posterior: $q(f)\approx p(f \mid y)$. To do this we use the standard bound from variational inference:

$$ \begin{aligned} \log p(y) &= \overbrace{\mathbb{E}_{q(f)}\left[\log \frac{p(y,f)}{q(f)} \right]}^{\mathcal{L}(q)} \overbrace{-\mathbb{E}_{q(f)}\left[\log \frac{p(f \mid y)}{q(f)}\right]}^{KL\left[q \mid\mid p\right]}\\ &\geq \mathcal{L}(q) \end{aligned} $$

Thus, maximizing $\mathcal{L}(q)$ is equivalent to minimizing the Kullback-Leibler divergence between $q$ and the true joint posterior $p(f\mid y)$.
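As a quick sanity check of this decomposition (a toy example of my own, using a discrete latent variable so every expectation is a finite sum), we can verify $\log p(y) = \mathcal{L}(q) + KL[q \mid\mid p]$ numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy discrete model: a joint table p(y, f) with y in {0, 1} and f in {0, ..., 4}.
joint = rng.uniform(size=(2, 5))
joint /= joint.sum()

y_obs = 0
p_y = joint[y_obs].sum()               # evidence p(y = y_obs)
p_post = joint[y_obs] / p_y            # exact posterior p(f | y)

q = rng.uniform(size=5)
q /= q.sum()                           # an arbitrary approximate posterior q(f)

elbo = np.sum(q * np.log(joint[y_obs] / q))   # L(q) = E_q[ log p(y, f) / q(f) ]
kl = np.sum(q * np.log(q / p_post))           # KL[ q(f) || p(f | y) ]
print(np.log(p_y), elbo + kl)                 # the two numbers coincide
```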

Let $u$ be a subset of $f^*$ (the inducing variables): $u \subset f^*$. Slightly abusing notation, we will write $f$ as $f = \{f^*,\mathrm{f}, u\}$, where $f^*$ now denotes the prediction points excluding $u$. The bound $\mathcal{L}$ can be decomposed as:

$$ \require{cancel} \begin{aligned} \mathcal{L}(q) &= \mathbb{E}_{q(f)}\left[\log \frac{p(y\mid f)p(f)}{q(f)} \right] \\ &= \mathbb{E}_{q(f)}\left[\log \frac{p(y\mid \mathrm{f}, \cancel{f^*}, \cancel{u})p(f^*, \mathrm{f} \mid u) p(u)}{q(f)} \right] \end{aligned} $$

Notice that $p(f^*, \mathrm{f} \mid u)$ is the standard noise-free $\mathcal{GP}$ conditional given $u$, i.e. the posterior we would get if we observed the variables $u$ without noise:

$$ \begin{aligned} p(f^*, \mathrm{f} \mid u) = \mathcal{N}\Big(\begin{pmatrix} \mathrm{f} \\ f^* \end{pmatrix} \mid &\begin{pmatrix} K_{\mathrm{f}u} \\ K_{*u}\end{pmatrix}K_{uu}^{-1}u,\\ &\begin{pmatrix} K_{\mathrm{f}\mathrm{f}} & K_{\mathrm{f}*} \\ K_{*\mathrm{f}} & K_{**}\end{pmatrix} - \begin{pmatrix} Q_{\mathrm{f}\mathrm{f}} & Q_{\mathrm{f}*}\\ Q_{*\mathrm{f}} & Q_{**}\end{pmatrix} \Big) \end{aligned} $$

where $Q_{ab} := K_{au}K_{uu}^{-1}K_{ub}$, so for example $Q_{\mathrm{f}*} = K_{\mathrm{f}u}K_{uu}^{-1}K_{u*}$.
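In code these $Q$ matrices are typically formed through a Cholesky factorisation of $K_{uu}$ (plus a small jitter for numerical stability) rather than an explicit inverse. A minimal sketch, with illustrative names of my own:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def nystrom_Q(K_au, K_uu, K_ub, jitter=1e-6):
    """Q_ab = K_au K_uu^{-1} K_ub, via a Cholesky factorisation of K_uu."""
    chol = cho_factor(K_uu + jitter * np.eye(K_uu.shape[0]), lower=True)
    return K_au @ cho_solve(chol, K_ub)
```

so that, for instance, `Qff = nystrom_Q(Kfu, Kuu, Kfu.T)` and `Qss = nystrom_Q(Ksu, Kuu, Ksu.T)` for precomputed cross-covariance matrices (names illustrative).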

The Titsias VFE approximation to the posterior chooses $q$ as: $$ q(f) = q(f^*,\mathrm{f},u) = p(f^*,\mathrm{f} \mid u) q(u) $$

Notice that:

  1. We could also rewrite the true posterior $p(f\mid y)$ using the subset $u$ of $f$: $$ p(f\mid y) = p(f^*,\mathrm{f},u \mid y) = p(f^*, \mathrm{f} \mid u,y)p(u \mid y) $$

  2. If we compare these two equations we see that the Titsias approximation removes the dependency on the data ($y$) in the first term, while the second term $q(u)$ is left free.

  3. Again, in order to make predictions, we will need the approximate predictive distribution $q(f^*)$; this can be found by integrating out $\mathrm{f}$ and $u$ from $q$: $$ \begin{aligned} q(f^*) &= \int \int p(f^*,\mathrm{f} \mid u) q(u) d\mathrm{f}du \\ &= \int q(u) \underbrace{\int p(f^*,\mathrm{f} \mid u) d\mathrm{f}}_{p(f^* \mid u)}du \\ &= \int q(u) \mathcal{N}(f^* \mid K_{*u}K_{uu}^{-1}u, K_{**}-Q_{**})du \end{aligned} $$

  4. If we are "lucky" and $q$ is normal, $q(u)=\mathcal{N}(u \mid m, S)$, then the approximate predictive distribution can be computed with the Marginal and Conditional Gaussians identities from Bishop, Chapter 2 (page 93): $$ \begin{aligned} q(f^*) = \mathcal{N}(f^* &\mid K_{*u}K_{uu}^{-1}m, \\ &K_{**}-Q_{**}+ K_{*u}K_{uu}^{-1}S K_{uu}^{-1}K_{u*}) \end{aligned} $$ (A small numerical sketch of this predictive follows the list.)
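Here is a minimal numpy sketch of that approximate predictive distribution, given variational parameters $m$ and $S$ and precomputed kernel matrices; the function and argument names are mine.

```python
import numpy as np

def sparse_predict(Kss, Ksu, Kuu, m, S, jitter=1e-6):
    """q(f*) = N( K_*u Kuu^{-1} m ,
                  K_** - Q_** + K_*u Kuu^{-1} S Kuu^{-1} K_u* )."""
    Kuu_j = Kuu + jitter * np.eye(Kuu.shape[0])
    A = np.linalg.solve(Kuu_j, Ksu.T).T      # A = K_*u Kuu^{-1}
    mean = A @ m
    cov = Kss - A @ Ksu.T + A @ S @ A.T      # Q_** = A K_u*
    return mean, cov
```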

Substituting this $q$ in the bound $\mathcal{L}(q)$ leads to:

$$ \begin{aligned} \require{cancel} \mathcal{L}(q) &= \mathbb{E}_{q(f)}\left[\log \frac{p(y\mid \mathrm{f})\cancel{p(f_{\ne u} \mid u)} p(u)}{\cancel{p(f_{\ne u} \mid u)}q(u)}\right] \\ &= \mathbb{E}_{q(f)}\left[\log p(y\mid \mathrm{f})\right] + \mathbb{E}_{p(f_{\ne u} \mid u)q(u)}\left[\log \frac{p(u)}{q(u)}\right] \\ &= \mathbb{E}_{q(f)}\left[\log p(y\mid \mathrm{f})\right] + \int \int \left[\log \frac{p(u)}{q(u)}\right] p(f_{\ne u} \mid u)q(u) df_{\ne u} du \\ &= \mathbb{E}_{q(f)}\left[\log p(y\mid \mathrm{f})\right] + \int \left[\log \frac{p(u)}{q(u)}\right] q(u) \overbrace{\left(\int p(f_{\ne u} \mid u) df_{\ne u} \right)}^{=1} du \quad \text{Trick (1)}\\ &= \mathbb{E}_{p(f^*, \mathrm{f} \mid u)q(u)}\left[\log p(y\mid \mathrm{f})\right] + \mathbb{E}_{q(u)}\left[\log \frac{p(u)}{q(u)}\right] \\ &= \mathbb{E}_{p(f^*\mid \mathrm{f} , u)p(\mathrm{f} \mid u)q(u)}\left[\log p(y\mid \mathrm{f})\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \mathbb{E}_{p(\mathrm{f} \mid u)q(u)}\left[\mathbb{E}_{p(f^*\mid \mathrm{f} , u)}\left[\log p(y\mid \mathrm{f})\right]\right] - KL\left[q(u)\mid\mid p(u)\right]\quad \text{Applying trick (1) to the first expectation}\\ &= \mathbb{E}_{p(\mathrm{f} \mid u)q(u)}\left[\log p(y\mid \mathrm{f})\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log p(y\mid \mathrm{f})\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \end{aligned} $$

The integral $\mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log p(y \mid \mathrm{f})\right]$ plays a central role in the derivation. We will call it $\mathcal{L}_1$. Let's try to compute it:

$$ \begin{aligned} \mathcal{L}_1 &= \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log \mathcal{N}(y \mid \mathrm{f}, \sigma^2 I)\right] \\ &= \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \exp\left(\frac{-1}{2\sigma^{2}} \|y -\mathrm{f}\|^2\right)\right)\right] \\ &= \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}} \|y -\mathrm{f}\|^2\right] \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\frac{-1}{2\sigma^{2}} \|y -\mathrm{f}\|^2\right] \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\mathbb{E}_{p(\mathrm{f} \mid u)}\left[ \|y\|^2 + \|\mathrm{f}\|^2 -2 y^t \mathrm{f}\right] \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( \|y\|^2 + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\|\mathrm{f}\|^2\right] -2 y^t \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\mathrm{f}\right]\right) \\ \end{aligned} $$

Now the trick is to use that $p(\mathrm{f} \mid u)$ is normal. If we write it as $\mathcal{N}(\mathrm{f} \mid \mu_{\mathrm{f}\mid u}, \operatorname{Cov}_{\mathrm{f}\mid u})$: $$ \begin{aligned} \mathcal{L}_1 &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( y^t y + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\mathrm{f}^t \mathrm{f}\right] -2 y^t \mu_{\mathrm{f}\mid u}\right) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( y^t y + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[(\mathrm{f}-\mu_{\mathrm{f}\mid u})^t (\mathrm{f}-\mu_{\mathrm{f}\mid u}) +2\mu_{\mathrm{f}\mid u}^t \mathrm{f} - \mu_{\mathrm{f}\mid u}^t\mu_{\mathrm{f}\mid u}\right] -2 y^t \mu_{\mathrm{f}\mid u}\right) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( y^t y + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[(\mathrm{f}-\mu_{\mathrm{f}\mid u})^t (\mathrm{f}-\mu_{\mathrm{f}\mid u}) \right] +2\mu_{\mathrm{f}\mid u}^t \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\mathrm{f}\right] - \mu_{\mathrm{f}\mid u}^t\mu_{\mathrm{f}\mid u} -2 y^t \mu_{\mathrm{f}\mid u}\right) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( y^t y + \operatorname{trace}(\operatorname{Cov}_{\mathrm{f}\mid u}) + \mu_{\mathrm{f}\mid u}^t\mu_{\mathrm{f}\mid u} -2 y^t \mu_{\mathrm{f}\mid u}\right) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left\| y - \mu_{\mathrm{f}\mid u}\right\|^2 - \frac{1}{2\sigma^{2}}\operatorname{trace}(\operatorname{Cov}_{\mathrm{f}\mid u}) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \exp\left(\frac{-1}{2\sigma^{2}}\left\| y - \mu_{\mathrm{f}\mid u}\right\|^2\right)\right) - \frac{1}{2\sigma^{2}}\operatorname{trace}(\operatorname{Cov}_{\mathrm{f}\mid u}) \\ &= \log \mathcal{N}\left(y \mid \mu_{\mathrm{f}\mid u}, \sigma^2 I\right) - \frac{1}{2\sigma^{2}}\operatorname{trace}(\operatorname{Cov}_{\mathrm{f}\mid u}) \end{aligned} $$
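The identity we just derived, $\mathbb{E}_{\mathcal{N}(\mathrm{f}\mid \mu,\Sigma)}\left[\log\mathcal{N}(y\mid \mathrm{f},\sigma^2 I)\right] = \log\mathcal{N}(y\mid \mu,\sigma^2 I) - \operatorname{trace}(\Sigma)/(2\sigma^2)$, is easy to check by Monte Carlo; here is a small sketch of my own (arbitrary toy values):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
N = 4
mu = rng.standard_normal(N)
A = rng.standard_normal((N, N))
Sigma = A @ A.T + np.eye(N)            # an arbitrary SPD covariance
y = rng.standard_normal(N)
sigma2 = 0.3

# Monte Carlo estimate of E_{f ~ N(mu, Sigma)}[ log N(y | f, sigma^2 I) ]
f = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = multivariate_normal.logpdf(y - f, mean=np.zeros(N), cov=sigma2 * np.eye(N)).mean()

# Closed form: log N(y | mu, sigma^2 I) - trace(Sigma) / (2 sigma^2)
closed = (multivariate_normal.logpdf(y, mean=mu, cov=sigma2 * np.eye(N))
          - np.trace(Sigma) / (2 * sigma2))
print(mc, closed)                      # agree up to Monte Carlo error
```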

If we now substitute the true value of $p(\mathrm{f} \mid u)$: $$ \begin{aligned} p(\mathrm{f} \mid u) &= \mathcal{N}(\mathrm{f} \mid \mu_{\mathrm{f}\mid u}, \operatorname{Cov}_{\mathrm{f}\mid u}) \\ &= \mathcal{N}(\mathrm{f} \mid K_{fu}K_{uu}^{-1}u, K_{ff}-\overbrace{Q_{ff}}^{K_{fu}K_{uu}^{-1}K_{uf}}) \end{aligned} $$

We have: $$ \begin{aligned} \mathcal{L}_1 &=\log \mathcal{N}\left(y \mid K_{fu}K_{uu}^{-1}u, \sigma^2 I\right) - \tfrac{1}{2\sigma^{2}}\operatorname{trace}(K_{ff}-Q_{ff}) \\ &= \sum_{i=1}^N \left[\log\mathcal{N}\left(y_i \mid K_{f_iu}K_{uu}^{-1}u, \sigma^2\right) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right] \\ &= \sum_{i=1}^N \left[-\tfrac{1}{2\sigma^2}(y_i - K_{f_iu}K_{uu}^{-1}u)^2 -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right] \\ \end{aligned} $$
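A quick numerical check (mine) that this sum over data points matches the vectorised form of $\mathcal{L}_1$, for an arbitrary fixed $u$ and a valid joint kernel built from a random SPD matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(3)
N, M = 6, 3
B = rng.standard_normal((N + M, N + M))
K = B @ B.T + 1e-3 * np.eye(N + M)     # a valid joint kernel over [f, u]
Kff, Kfu, Kuu = K[:N, :N], K[:N, N:], K[N:, N:]

u = rng.standard_normal(M)
y = rng.standard_normal(N)
sigma2 = 0.25

A = Kfu @ np.linalg.inv(Kuu)           # K_fu Kuu^{-1}
mu = A @ u                             # E[f | u]
Qff = A @ Kfu.T                        # K_fu Kuu^{-1} K_uf

# Vectorised form: log N(y | mu, sigma^2 I) - trace(K_ff - Q_ff) / (2 sigma^2)
vec = (multivariate_normal.logpdf(y, mean=mu, cov=sigma2 * np.eye(N))
       - np.trace(Kff - Qff) / (2 * sigma2))

# Per-data-point form, as in the last line above
per_point = sum(norm.logpdf(y[i], loc=mu[i], scale=np.sqrt(sigma2))
                - (Kff[i, i] - Qff[i, i]) / (2 * sigma2) for i in range(N))
print(vec, per_point)                  # identical up to float error
```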

Great! So if we now go back to the $\mathcal{L}(q)$ bound, we see it is a sum over the training data!! This means it can be optimized by stochastic gradient descent, mini-batching over the data points:

$$ \begin{aligned} \mathcal{L}(q) &= \mathbb{E}_{q(u)}[\mathcal{L}_1] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \mathbb{E}_{q(u)}\left[\sum_{i=1}^N \left[\log\mathcal{N}\left(y_i \mid K_{f_iu}K_{uu}^{-1}u, \sigma^2\right) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{q(u)}\left[\log\mathcal{N}\left(y_i \mid K_{f_iu}K_{uu}^{-1}u, \sigma^2\right) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{q(u)}\left[-\tfrac{1}{2\sigma^2}(y_i - K_{f_iu}K_{uu}^{-1}u)^2 -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\mathbb{E}_{q(u)}\left[(y_i - K_{f_iu}K_{uu}^{-1}u)^2\right] -\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right] \end{aligned} $$

If we now assume that $q(u)$ is normal, $q(u)=\mathcal{N}(u \mid m, S)$, we can use the change of variables $u = m + R\epsilon$ (where $S=RR^t$), with $\epsilon \sim \mathcal{N}(0,I)$ an $M$-dimensional standard normal vector, to obtain:

$$ \begin{aligned} \mathcal{L}(q) &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\mathbb{E}_{\epsilon}\left[(y_i - K_{f_iu}K_{uu}^{-1}(m + R\epsilon))^2\right]-\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\left(\mathbb{E}_{\epsilon}\left[(y_i - K_{f_iu}K_{uu}^{-1}m)^2\right] + \mathbb{E}_{\epsilon}\left[(K_{f_iu}K_{uu}^{-1}R\epsilon)^2\right] -\cancel{\mathbb{E}_{\epsilon}\left[2(y_i - K_{f_iu}K_{uu}^{-1}m)K_{f_iu}K_{uu}^{-1}R\epsilon \right]}\right)-\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right]\\ &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\left((y_i - K_{f_iu}K_{uu}^{-1}m)^2 + \mathbb{E}_{\epsilon}\left[\epsilon^tR^tK_{uu}^{-1}K_{uf_i}K_{f_iu}K_{uu}^{-1}R\epsilon\right] \right)-\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right]\\ &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\left((y_i - K_{f_iu}K_{uu}^{-1}m)^2 + \operatorname{trace}(SK_{uu}^{-1}K_{uf_i}K_{f_iu}K_{uu}^{-1}) \right)-\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right]\\ \end{aligned} $$

This is the $\mathcal{L}_3$ bound that appears in [Hensman et al. 2013]. We can expand it fully by plugging in the Kullback-Leibler divergence between the approximating posterior and the prior over $u$; since both are multivariate normal distributions it has a closed form (as given, e.g., on Wikipedia):

$$ \begin{aligned} KL[q(u) \mid\mid p(u)] &= KL[\mathcal{N}(m,S)\mid\mid \mathcal{N}(0,K_{uu})] \\ &= \tfrac{1}{2}m^t K_{uu}^{-1}m - \tfrac{M}{2} + \tfrac{1}{2} \operatorname{trace}(K_{uu}^{-1}S) + \tfrac{1}{2}\log\left(\frac{\operatorname{det}K_{uu}}{\operatorname{det}S}\right) \end{aligned} $$
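Putting the pieces together, here is a minimal numpy sketch (my own, not code from either paper) of the whole bound: the per-data-point terms derived above minus this KL term. The kernel matrices are assumed precomputed and all names are illustrative.

```python
import numpy as np

def svgp_bound(y, Kff_diag, Kfu, Kuu, m, S, sigma2, jitter=1e-6):
    """L(q) = sum_i [ -0.5 log(2 pi s2)
                      - (y_i - a_i^T m)^2 / (2 s2)
                      - a_i^T S a_i / (2 s2)
                      - (K_ii - Q_ii) / (2 s2) ]
              - KL[ N(m, S) || N(0, K_uu) ],   with a_i^T = K_{f_i u} Kuu^{-1}."""
    N, M = len(y), len(m)
    Kuu_j = Kuu + jitter * np.eye(M)
    A = np.linalg.solve(Kuu_j, Kfu.T).T          # rows a_i^T = K_{f_i u} Kuu^{-1}
    mean = A @ m                                  # E_q[f_i]
    Qff_diag = np.sum(A * Kfu, axis=1)            # Q_ii = K_{f_i u} Kuu^{-1} K_{u f_i}
    var_term = np.sum((A @ S) * A, axis=1)        # a_i^T S a_i

    exp_loglik = np.sum(
        -0.5 * np.log(2 * np.pi * sigma2)
        - 0.5 * (y - mean) ** 2 / sigma2
        - 0.5 * var_term / sigma2
        - 0.5 * (Kff_diag - Qff_diag) / sigma2
    )

    # KL[N(m, S) || N(0, Kuu)] in closed form
    kl = 0.5 * (m @ np.linalg.solve(Kuu_j, m) - M
                + np.trace(np.linalg.solve(Kuu_j, S))
                + np.linalg.slogdet(Kuu_j)[1] - np.linalg.slogdet(S)[1])

    return exp_loglik - kl
```

Mini-batching, as suggested above, would simply replace the full sum over the $N$ data points with a suitably rescaled sum over a random batch (the KL term is not affected).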

Finally, it is worth having a look at the full expression for $\mathcal{L}(q)$ and its dependencies:

  1. $\mathcal{L}(q)$ has two variational parameters, $m$ and $S$.
  2. The mean of the $q(u)$ distribution is all we need to compute the mean of the approximate predictive distribution.
  3. The bound depends on $m$ only through the first term (the squared errors) and the last term (the KL).
  4. For the purpose of finding $m$ it therefore suffices to optimize the $m$-dependent part, i.e. to minimize $\sum_{i=1}^N \tfrac{1}{2\sigma^2}\left(y_i - K_{f_iu}K_{uu}^{-1}m\right)^2 + \tfrac{1}{2}m^t K_{uu}^{-1}m$, which is quadratic in $m$.
  5. The hyperparameters of the method are $\sigma^2$, the parameters of the kernel that we will call $\Theta$, and the pseudo-inputs $X_u$.

Derivation from [Hensman et al. 2013]

Let's go back and look at things again using the derivation of [Hensman et al. 2013]: $$ \begin{aligned} \log p(y) &= \log \mathbb{E}_{p(u)}\left[p(y \mid u)\right] \\ &= \log \mathbb{E}_{p(u)}\left[\mathbb{E}_{p(\mathrm{f} \mid u)}\left[p(y\mid \mathrm{f},\cancel{u})\right]\right] \\ \end{aligned} $$

We can also decompose $\log p(y)$ using the standard variational-inference trick (taking $q(u)=p(u)$): $$ \log p(y) = \mathbb{E}_{p(u)}\left[\log p(y\mid u)\right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] $$

Instead of bounding $\log p(y)$ directly, we can apply the same decomposition to $\log p(y \mid u)$:

$$ \log p(y \mid u) = \overbrace{\mathbb{E}_{q(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(y,f^*,\mathrm{f} \mid u)}{q(f^*,\mathrm{f} \mid u)}\right]}^{\mathcal{L}\left(q(f^*,\mathrm{f} \mid u)\right)} \overbrace{-\mathbb{E}_{q(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{q(f^*,\mathrm{f} \mid u)}\right]}^{KL\left(q(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right)} $$

If we now use the Titsias approximation here, $q(f^*,\mathrm{f} \mid u) = p(f^*,\mathrm{f} \mid u)$, we find that:

  1. The first term is our beloved $\mathcal{L}_1$ integral: $$ \mathcal{L}\left(p(f^*,\mathrm{f} \mid u)\right) = \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log p(y \mid \mathrm{f})\right]= \mathcal{L}_1 $$
  2. The second term is just the divergence between the conditional prior and the conditional posterior: $$ KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right) $$

We can apply this decomposition to $\log p(y)$: $$ \begin{aligned} \log p(y) &= \log \mathbb{E}_{p(u)}\left[p(y \mid u)\right] \\ &= \log \mathbb{E}_{p(u)}\left[\exp\left(\mathcal{L}_1 + KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right)\right)\right] \\ &= \log \mathbb{E}_{p(u)}\left[\exp\left(\mathcal{L}_1\right)\exp\left(KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right)\right)\right] \\ \end{aligned} $$

We can also plug this decomposition of $\log p(y \mid u)$ into the variational decomposition of $\log p(y)$: $$ \begin{aligned} \log p(y) &= \mathbb{E}_{p(u)}\left[\log p(y\mid u)\right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1 + KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right) \right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] + \mathbb{E}_{p(u)}\left[KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right) \right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{p(f^*,\mathrm{f} \mid u)}\right]\right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{p(f^*,\mathrm{f} \mid u)}\right] +\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{p(f^*,\mathrm{f} \mid u)} +\log \frac{p(u \mid y)}{p(u)}\right]\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{p(f^*,\mathrm{f} \mid u)} \frac{p(u \mid y)}{p(u)}\right]\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f},u \mid y)}{p(f^*,\mathrm{f} \mid u)\cancel{p(u \mid y)}} \frac{\cancel{p(u \mid y)}}{p(u)}\right]\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f},u \mid y)}{p(f^*,\mathrm{f}, u)}\right]\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(f^*,\mathrm{f}, u)}\left[\log \frac{p(f^*,\mathrm{f},u \mid y)}{p(f^*,\mathrm{f}, u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] + KL\left(p(f^*,\mathrm{f}, u) \mid\mid p(f^*,\mathrm{f},u \mid y)\right) \end{aligned} $$ So $\mathbb{E}_{p(u)}\left[\mathcal{L}_1\right]$ is itself a lower bound on $\log p(y)$, and the gap is the KL divergence between the full prior and the full posterior.

Things left to do: compute $\mathbb{E}_{p(u)}\left[\mathcal{L}_1\right]$, and revisit the idea of substituting $q(u)=\mathcal{N}(m,S)$ in the first expression.

OLD STUFF (an earlier per-data-point version of the same expansion, kept for reference): $$ \begin{aligned} \mathcal{L}(q) &= \mathbb{E}_{q(f)}\left[\log \prod_{i=1}^N p(y_i\mid \mathrm{f}_i)\right] - KL\left[q(u)\mid\mid p(u)\right] \quad \text{(step 6 above)} \\ &= \mathbb{E}_{q(\mathrm{f})}\left[\sum_{i=1}^N \log p(y_i\mid \mathrm{f}_i)\right] - KL\left[q(u)\mid\mid p(u)\right]\\ &= \sum_{i=1}^N \mathbb{E}_{q(\mathrm{f})}\left[\log p(y_i\mid \mathrm{f}_i)\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{q(\mathrm{f})}\left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \int \left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right]q(\mathrm{f}_i)\overbrace{\int q(\mathrm{f}_1,..,\mathrm{f}_{i-1},\mathrm{f}_{i+1},..,\mathrm{f}_N \mid \mathrm{f}_i) d(\mathrm{f}_1,..,\mathrm{f}_{i-1},\mathrm{f}_{i+1},..,\mathrm{f}_N)}^{=1}d\mathrm{f}_i - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{q(\mathrm{f}_i)}\left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right] - KL\left[q(u)\mid\mid p(u)\right]\\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(\mathrm{f}_i)}\left[ - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{q(\mathrm{f}_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(\mathrm{f}_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(\mathrm{f}_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i^2+ \mathrm{f}_i^2 -2y_i \mathrm{f}_i)\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[- \frac{1}{2\sigma^2}\left(y_i^2+ \mathbb{E}_{p(\mathrm{f}_i \mid u)}[\mathrm{f}_i^2] -2y_i \mathbb{E}_{p(\mathrm{f}_i \mid u)}[\mathrm{f}_i]\right)\right] - KL\left[q(u)\mid\mid p(u)\right] \quad \text{Using }p(\mathrm{f}_i \mid u ) = \mathcal{N}(K_{\mathrm{f}_iu}K_{uu}^{-1}u, K_{\mathrm{f}_i\mathrm{f}_i} - Q_{\mathrm{f}_i\mathrm{f}_i}) \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[- \frac{1}{2\sigma^2}\left(y_i^2+ K_{\mathrm{f}_i\mathrm{f}_i} - Q_{\mathrm{f}_i\mathrm{f}_i} + u^tK_{uu}^{-1}K_{u\mathrm{f}_i}K_{\mathrm{f}_iu}K_{uu}^{-1}u -2y_i K_{\mathrm{f}_iu}K_{uu}^{-1}u\right)\right] - KL\left[q(u)\mid\mid p(u)\right] \\ \end{aligned} $$