Let $f=\{f^*,\mathrm{f}\}$ be the noise-free latent variables of a $\mathcal{GP}$. We write $\mathrm{f}$ for the latent variables where we have noisy observations $y$, $y_i = \mathrm{f}_i + \epsilon_i$, and $f^*$ for the points where we want to make predictions. For the purpose of understanding the Variational Free Energy (VFE) approach to $\mathcal{GP}$s of [Titsias 2009] we will consider the joint posterior $p(f^*,\mathrm{f} \mid y)$, which for the full $\mathcal{GP}$ is given by Bayes' rule: $p(f^*,\mathrm{f} \mid y) = p(y \mid \mathrm{f})\,p(f^*,\mathrm{f})/p(y)$.
Note that, for the purpose of making predictions, what we are interested in is not the joint posterior but the predictive posterior $p(f^* \mid y)$. This means we have to integrate out $\mathrm{f}$ from the above equation. When we do this for the full $\mathcal{GP}$ we get the following well-known predictive distribution:
$$ \begin{aligned} p(f^* \mid y ) &= \int p(f^* , \mathrm{f} \mid y) d\mathrm{f} \\ &= \mathcal{N}(f^* \mid K_{*f}(K_{ff}+\sigma^2 I)^{-1}y, \\ &\quad \quad \quad \quad K_{**}- K_{*f}(K_{ff}+\sigma^2 I)^{-1}K_{f*}) \end{aligned} $$(If you manage to do this integral without it driving you crazy, let me know.)
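For reference, here is a minimal NumPy sketch of this predictive distribution. The RBF kernel, the toy data and every variable name are my own choices for illustration, not anything prescribed above:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))                  # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)   # noisy observations
X_star = np.linspace(-3, 3, 50)[:, None]              # prediction inputs
sigma2 = 0.1 ** 2                                     # noise variance

K_ff = rbf(X, X)             # K_ff
K_sf = rbf(X_star, X)        # K_{*f}
K_ss = rbf(X_star, X_star)   # K_{**}

Kn = K_ff + sigma2 * np.eye(len(X))                   # K_ff + sigma^2 I
pred_mean = K_sf @ np.linalg.solve(Kn, y)             # K_{*f}(K_ff + sigma^2 I)^{-1} y
pred_cov = K_ss - K_sf @ np.linalg.solve(Kn, K_sf.T)  # K_{**} - K_{*f}(K_ff + sigma^2 I)^{-1} K_{f*}
```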
We want to approximate the true joint posterior: $q(f)\approx p(f \mid y)$. To do this we use the standard bound used in variational inference:
$$ \begin{aligned} \log p(y) &= \overbrace{\mathbb{E}_{q(f)}\left[\log \frac{p(y,f)}{q(f)} \right]}^{\mathcal{L}(q)} \overbrace{-\mathbb{E}_{q(f)}\left[\log \frac{p(f \mid y)}{q(f)}\right]}^{KL\left[q \mid\mid p\right]}\\ &\geq \mathcal{L}(q) \end{aligned} $$Thus, maximizing $\mathcal{L}(q)$ is equivalent to minimizing the Kullback-Leibler divergence between $q$ and the true joint posterior $p(f\mid y)$.
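Before moving on, it can be reassuring to check this decomposition numerically on a toy model. The sketch below does so for a one-dimensional conjugate model (prior $f\sim\mathcal{N}(0,1)$, likelihood $y\mid f\sim\mathcal{N}(f,\sigma^2)$), where everything is available in closed form; all numbers and names are made up for the example:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma2, y = 0.5, 1.3            # noise variance and a single observation
mq, sq = 0.2, 0.4               # an arbitrary Gaussian q(f) = N(mq, sq)

# Exact quantities for the conjugate model f ~ N(0,1), y | f ~ N(f, sigma2)
log_py = norm.logpdf(y, loc=0.0, scale=np.sqrt(1.0 + sigma2))
mp, sp = y / (1.0 + sigma2), sigma2 / (1.0 + sigma2)   # posterior mean / variance

# Monte Carlo estimate of the ELBO  L(q) = E_q[log p(y, f) - log q(f)]
f = rng.normal(mq, np.sqrt(sq), size=200_000)
elbo = np.mean(norm.logpdf(y, f, np.sqrt(sigma2))
               + norm.logpdf(f, 0.0, 1.0)
               - norm.logpdf(f, mq, np.sqrt(sq)))

# Closed-form KL(q || p(f|y)) between two univariate Gaussians
kl = 0.5 * (np.log(sp / sq) + (sq + (mq - mp) ** 2) / sp - 1.0)

print(log_py, elbo + kl)        # the two numbers agree up to Monte Carlo error
```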
Let $u$ be a subset of $f^*$ (the so-called inducing variables): $u \subset f^*$. We will write $f$ as $f = \{f^*,\mathrm{f}, u\}$. The bound $\mathcal{L}$ can then be decomposed as:
$$ \require{cancel} \begin{aligned} \mathcal{L}(q) &= \mathbb{E}_{q(f)}\left[\log \frac{p(y\mid f)p(f)}{q(f)} \right] \\ &= \mathbb{E}_{q(f)}\left[\log \frac{p(y\mid \mathrm{f}, \cancel{f^*}, \cancel{u})p(f^*, \mathrm{f} \mid u) p(u)}{q(f)} \right] \end{aligned} $$Notice that $p(f^*, \mathrm{f} \mid u)$ is the standard posterior of the $\mathcal{GP}$ with noise free variables $u$:
$$ \begin{aligned} p(f^*, \mathrm{f} \mid u) = \mathcal{N}\Big(\begin{pmatrix} \mathrm{f} \\ f^* \end{pmatrix} \mid &\begin{pmatrix} K_{\mathrm{f}u} \\ K_{*u}\end{pmatrix}K_{uu}^{-1}u,\\ &\begin{pmatrix} K_{\mathrm{f}\mathrm{f}} & K_{\mathrm{f}*} \\ K_{*\mathrm{f}} & K_{**}\end{pmatrix} - \begin{pmatrix} Q_{\mathrm{f}\mathrm{f}} & Q_{\mathrm{f}*}\\ Q_{*\mathrm{f}} & Q_{**}\end{pmatrix} \Big) \end{aligned} $$where $Q_{ab} := K_{au}K_{uu}^{-1}K_{ub}$, so e.g. $Q_{\mathrm{f}*} = K_{\mathrm{f}u}K_{uu}^{-1}K_{u*}$.
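As an illustration, here is a small NumPy sketch of this conditional, stacking $\mathrm{f}$ and $f^*$ into a single vector. The inducing inputs `Z`, the kernel and the jitter term are assumptions made for the example:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (20, 1))            # inputs of the observed latent variables f
X_star = np.linspace(-3, 3, 15)[:, None]   # inputs of the prediction points f*
Z = np.linspace(-3, 3, 5)[:, None]         # inducing inputs (for u)
u = rng.standard_normal(len(Z))            # some value of the inducing variables

X_all = np.vstack([X, X_star])             # stack (f, f*)
K_au = rbf(X_all, Z)                       # (K_fu ; K_*u)
K_uu = rbf(Z, Z) + 1e-8 * np.eye(len(Z))   # jitter for numerical stability
K_aa = rbf(X_all, X_all)

A = np.linalg.solve(K_uu, K_au.T).T        # (K_fu ; K_*u) K_uu^{-1}
cond_mean = A @ u                          # (K_fu ; K_*u) K_uu^{-1} u
Q_aa = A @ K_au.T                          # the Q blocks: K_.u K_uu^{-1} K_u.
cond_cov = K_aa - Q_aa                     # joint conditional covariance
```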
The Titsias VFE approximation to the posterior chooses $q$ as: $$ q(f) = q(f^*,\mathrm{f},u) = p(f^*,\mathrm{f} \mid u) q(u) $$
Notice that we could also rewrite the true posterior $p(f\mid y)$ using the subset $u$ of $f$: $$ p(f\mid y) = p(f^*,\mathrm{f},u \mid y) = p(f^*, \mathrm{f} \mid u,y)p(u \mid y) $$
If we compare these two equations we see that the Titsias approximation removes the dependency on the data ($y$) in the first term, while the second term $q(u)$ is left free.
Again, in order to make predictions, we will need the approximate predictive distribution $q(f^*)$, which can be found by integrating out $\mathrm{f}$ and $u$ from $q$: $$ \begin{aligned} q(f^*) &= \int \int p(f^*,\mathrm{f} \mid u) q(u) d\mathrm{f}du \\ &= \int q(u) \underbrace{\int p(f^*,\mathrm{f} \mid u) d\mathrm{f}}_{p(f^* \mid u)}du \\ &= \int q(u) \mathcal{N}(f^* \mid K_{*u}K_{uu}^{-1}u, K_{**}-Q_{**})du \end{aligned} $$
If we are "lucky" and $q$ is normal, $q(u)=\mathcal{N}(u \mid m, S)$, then the approximate predictive distribution can be computed using the Bishop Chapter 2 pág 93 trick of Marginal and Conditional Gaussians: $$ \begin{aligned} q(f^*) = \mathcal{N}(f^* &\mid K_{*u}K_{uu}^{-1}m, \\ &K_{**}-Q_{**}+ K_{*u}K_{uu}^{-1}S K_{uu}^{-1}K_{u*}) \end{aligned} $$
Substituting this $q$ in the bound $\mathcal{L}(q)$, the conditional $p(f^*,\mathrm{f} \mid u)$ cancels between numerator and denominator, $f^*$ integrates out of the likelihood term, and we are left with: $$ \mathcal{L}(q) = \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log p(y \mid \mathrm{f})\right]\right] - KL\left[q(u)\mid\mid p(u)\right] $$The inner integral $\mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log p(y \mid \mathrm{f})\right]$ plays a central role in the derivation. We will call it $\mathcal{L}_1$. Let's try to compute it:
$$ \begin{aligned} \mathcal{L}_1 &= \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log \mathcal{N}(y \mid \mathrm{f}, \sigma^2 I)\right] \\ &= \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \exp\left(\frac{-1}{2\sigma^{2}} \|y -\mathrm{f}\|^2\right)\right)\right] \\ &= \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}} \|y -\mathrm{f}\|^2\right] \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\frac{-1}{2\sigma^{2}} \|y -\mathrm{f}\|^2\right] \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\mathbb{E}_{p(\mathrm{f} \mid u)}\left[ \left(\|y\|^2 + \|\mathrm{f}\|^2 -2 y^t \mathrm{f}\right)\right] \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( \|y\|^2 + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\|\mathrm{f}\|^2\right] -2 y^t \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\mathrm{f}\right]\right) \\ \end{aligned} $$Now the trick is to use that $p(\mathrm{f} \mid u)$ is normal; writing it as $\mathcal{N}(\mathrm{f} \mid \mu_{\mathrm{f}\mid u}, \operatorname{Cov}_{\mathrm{f}\mid u})$: $$ \begin{aligned} \mathcal{L}_1 &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( y^t y + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\mathrm{f}^t \mathrm{f}\right] -2 y^t \mu_{\mathrm{f}\mid u}\right) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( y^t y + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[(\mathrm{f}-\mu_{\mathrm{f}\mid u})^t (\mathrm{f}-\mu_{\mathrm{f}\mid u}) +2\mu_{\mathrm{f}\mid u}^t \mathrm{f} - \mu_{\mathrm{f}\mid u}^t\mu_{\mathrm{f}\mid u}\right] -2 y^t \mu_{\mathrm{f}\mid u}\right) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( y^t y + \mathbb{E}_{p(\mathrm{f} \mid u)}\left[(\mathrm{f}-\mu_{\mathrm{f}\mid u})^t (\mathrm{f}-\mu_{\mathrm{f}\mid u}) \right] +2\mu_{\mathrm{f}\mid u}^t \mathbb{E}_{p(\mathrm{f} \mid u)}\left[\mathrm{f}\right] - \mu_{\mathrm{f}\mid u}^t\mu_{\mathrm{f}\mid u} -2 y^t \mu_{\mathrm{f}\mid u}\right) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left( y^t y + \operatorname{trace}(\operatorname{Cov}_{\mathrm{f}\mid u}) + \mu_{\mathrm{f}\mid u}^t\mu_{\mathrm{f}\mid u} -2 y^t \mu_{\mathrm{f}\mid u}\right) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \right) + \frac{-1}{2\sigma^{2}}\left\| y - \mu_{\mathrm{f}\mid u}\right\|^2 + \frac{-1}{2\sigma^{2}}\operatorname{trace}(\operatorname{Cov}_{\mathrm{f}\mid u}) \\ &= \log \left( \left(2\pi \sigma^{2}\right)^{-N/2} \exp\left(\frac{-1}{2\sigma^{2}}\left\| y - \mu_{\mathrm{f}\mid u}\right\|^2\right)\right) + \frac{-1}{2\sigma^{2}}\operatorname{trace}(\operatorname{Cov}_{\mathrm{f}\mid u}) \\ &= \log \left(\mathcal{N}\left(y \mid \mu_{\mathrm{f}\mid u}, \sigma^2 I\right)\right) - \frac{1}{2\sigma^{2}}\operatorname{trace}(\operatorname{Cov}_{\mathrm{f}\mid u}) \end{aligned} $$
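This identity (the expected log-likelihood is the log-likelihood evaluated at the mean, minus a trace correction) is easy to check by Monte Carlo. A small sketch, with all dimensions and values chosen arbitrarily:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(3)
N, sigma2 = 4, 0.3
y = rng.standard_normal(N)
mu = rng.standard_normal(N)              # mu_{f|u}
A = rng.standard_normal((N, N))
Cov = A @ A.T + np.eye(N)                # Cov_{f|u}: any positive-definite matrix

# Closed form: log N(y | mu, sigma^2 I) - trace(Cov) / (2 sigma^2)
closed = mvn.logpdf(y, mu, sigma2 * np.eye(N)) - np.trace(Cov) / (2 * sigma2)

# Monte Carlo: average of log N(y | f, sigma^2 I) over f ~ N(mu, Cov)
f = rng.multivariate_normal(mu, Cov, size=200_000)
resid2 = ((y[None, :] - f) ** 2).sum(axis=1)
mc = np.mean(-0.5 * N * np.log(2 * np.pi * sigma2) - resid2 / (2 * sigma2))

print(closed, mc)                        # should agree up to Monte Carlo error
```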
If we now substitute the true value of $p(\mathrm{f} \mid u)$: $$ \begin{aligned} p(\mathrm{f} \mid u) &= \mathcal{N}(\mathrm{f} \mid \mu_{\mathrm{f}\mid u}, \operatorname{Cov}_{\mathrm{f}\mid u}) \\ &= \mathcal{N}(\mathrm{f} \mid K_{fu}K_{uu}^{-1}u, K_{ff}-\overbrace{Q_{ff}}^{K_{fu}K_{uu}^{-1}K_{uf}}) \end{aligned} $$
We have: $$ \begin{aligned} \mathcal{L}_1 &=\log \left(\mathcal{N}\left(y \mid K_{fu}K_{uu}^{-1}u, \sigma^2 I\right)\right) - \tfrac{1}{2\sigma^{2}}\left(\operatorname{trace}(K_{ff}-Q_{ff}) \right) \\ &= \sum_{i=1}^N \left[\log\mathcal{N}\left(y_i \mid K_{f_iu}K_{uu}^{-1}u, \sigma^2\right) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right] \\ &= \sum_{i=1}^N \left[-\tfrac{1}{2\sigma^2}(y_i - K_{f_iu}K_{uu}^{-1}u)^2 -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right] \\ \end{aligned} $$
Great! So now if we go back to the $\mathcal{L}(q)$ bound we will see it is a sum over the training data! Therefore it can be optimized by stochastic gradient descent:
$$ \begin{aligned} \mathcal{L}(q) &= \mathbb{E}_{q(u)}[\mathcal{L}_1] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \mathbb{E}_{q(u)}\left[\sum_{i=1}^N \left[\log\mathcal{N}\left(y_i \mid K_{f_iu}K_{uu}^{-1}u, \sigma^2\right) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{q(u)}\left[\log\mathcal{N}\left(y_i \mid K_{f_iu}K_{uu}^{-1}u, \sigma^2\right) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{q(u)}\left[-\tfrac{1}{2\sigma^2}(y_i - K_{f_iu}K_{uu}^{-1}u)^2 -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right)\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\mathbb{E}_{q(u)}\left[(y_i - K_{f_iu}K_{uu}^{-1}u)^2\right] -\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right] \end{aligned} $$If we now assume that $q(u)$ is normal, $q(u)=\mathcal{N}(u \mid m, S)$, we can use the change of variables $u = m + R\epsilon$ (where $S=RR^t$) with $\epsilon$ an $M$-dimensional $\mathcal{N}(0,I)$ vector to get:
$$ \begin{aligned} \mathcal{L}(q) &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\mathbb{E}_{\epsilon}\left[(y_i - K_{f_iu}K_{uu}^{-1}(m + R\epsilon))^2\right]-\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\left(\mathbb{E}_{\epsilon}\left[(y_i - K_{f_iu}K_{uu}^{-1}m)^2\right] + \mathbb{E}_{\epsilon}\left[(K_{f_iu}K_{uu}^{-1}R\epsilon)^2\right] -\cancel{\mathbb{E}_{\epsilon}\left[2(y_i - K_{f_iu}K_{uu}^{-1}m)K_{f_iu}K_{uu}^{-1}R\epsilon \right]}\right)-\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right]\\ &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\left((y_i - K_{f_iu}K_{uu}^{-1}m)^2 + \mathbb{E}_{\epsilon}\left[\epsilon^tR^tK_{uu}^{-1}K_{uf_i}K_{f_iu}K_{uu}^{-1}R\epsilon\right] \right)-\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right]\\ &= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\left((y_i - K_{f_iu}K_{uu}^{-1}m)^2 + \operatorname{trace}(SK_{uu}^{-1}K_{uf_i}K_{f_iu}K_{uu}^{-1}) \right)-\tfrac{N}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^{2}}\sum_{i=1}^N\left(K_{f_if_i}- K_{f_iu}K_{uu}^{-1}K_{uf_i}\right) - KL\left[q(u)\mid\mid p(u)\right]\\ \end{aligned} $$This is the $\mathcal{L}_3$ bound that appears in [Hensman et al. 2013]. We can fully expand it by plugging in the Kullback-Leibler divergence between the approximating posterior and the prior over $u$; since both of them are multivariate normal distributions, it has the well-known closed form (see e.g. Wikipedia):
$$ \begin{aligned} KL[q(u) \mid\mid p(u)] &= KL[\mathcal{N}(m,S)\mid\mid \mathcal{N}(0,K_{uu})] \\ &= \tfrac{1}{2}m^t K_{uu}^{-1}m - \tfrac{M}{2} + \tfrac{1}{2} \operatorname{trace}(K_{uu}^{-1}S) + \tfrac{1}{2}\log\left(\frac{\operatorname{det}K_{uu}}{\operatorname{det}S}\right) \end{aligned} $$Finally, it is worth having a look at the full equation $\mathcal{L}(q)$ and its dependencies: it depends on the variational parameters $m$ and $S$, on the noise variance $\sigma^2$, and on the kernel hyperparameters and the inducing inputs through the $K$ matrices.
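Putting the pieces together, here is a minimal NumPy sketch that evaluates the bound term by term, which makes those dependencies explicit as function arguments. The kernel, the helper name `svgp_bound` and all variable names are assumptions made for this example; a real implementation would work with Cholesky factors and mini-batches rather than explicit inverses. Note that each data point only contributes its own term to the sums, which is what makes stochastic optimization possible:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def svgp_bound(X, y, Z, m, S, sigma2):
    """Evaluate L(q) = sum of per-data-point terms minus KL(q(u) || p(u))."""
    N, M = len(X), len(Z)
    K_uu = rbf(Z, Z) + 1e-8 * np.eye(M)
    K_fu = rbf(X, Z)
    K_uu_inv = np.linalg.inv(K_uu)

    A = K_fu @ K_uu_inv                              # rows are K_{f_i u} K_uu^{-1}
    mu = A @ m                                       # per-point means K_{f_i u} K_uu^{-1} m
    k_ff_diag = np.diag(rbf(X, X))                   # K_{f_i f_i}
    q_ff_diag = np.einsum('nm,nm->n', A, K_fu)       # Q_{f_i f_i}
    trace_term = np.einsum('nm,mk,nk->n', A, S, A)   # trace(S K_uu^{-1} K_{u f_i} K_{f_i u} K_uu^{-1})

    expected_loglik = (-0.5 * N * np.log(2 * np.pi * sigma2)
                       - 0.5 * ((y - mu) ** 2 + trace_term).sum() / sigma2
                       - 0.5 * (k_ff_diag - q_ff_diag).sum() / sigma2)

    # KL(q(u) || p(u)) between N(m, S) and N(0, K_uu)
    kl = 0.5 * (m @ K_uu_inv @ m - M
                + np.trace(K_uu_inv @ S)
                + np.linalg.slogdet(K_uu)[1] - np.linalg.slogdet(S)[1])
    return expected_loglik - kl

# Toy usage: random data, inducing points on a grid, an arbitrary q(u)
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, (100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
Z = np.linspace(-3, 3, 10)[:, None]
print(svgp_bound(X, y, Z, m=np.zeros(10), S=np.eye(10), sigma2=0.01))
```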
Let's go back and look at things using the [Hensman et al. 2013] derivation: $$ \begin{aligned} \log p(y) &= \log \mathbb{E}_{p(u)}\left[p(y \mid u)\right] \\ &= \log \mathbb{E}_{p(u)}\left[\mathbb{E}_{p(\mathrm{f} \mid u)}\left[p(y\mid \mathrm{f},\cancel{u})\right]\right] \\ \end{aligned} $$
We can also decompose $\log p(y)$ using the standard trick of variational inference: $$ \log p(y) = \mathbb{E}_{p(u)}\left[\log p(y\mid u)\right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] $$
Instead of computing the bound on $p(y)$, we can compute the bound on $p(y \mid u)$ which is:
$$ \log p(y \mid u) = \overbrace{\mathbb{E}_{q(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(y,f^*,\mathrm{f} \mid u)}{q(f^*,\mathrm{f} \mid u)}\right]}^{\mathcal{L}\left(q(f^*,\mathrm{f} \mid u)\right)} \overbrace{-\mathbb{E}_{q(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{q(f^*,\mathrm{f} \mid u)}\right]}^{KL\left(q(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right)} $$If we use here the Titsias approximation $q(f^*,\mathrm{f} \mid u) = p(f^*,\mathrm{f} \mid u)$, the first term reduces to $\mathcal{L}_1$ and we find that: $$ \log p(y \mid u) = \mathcal{L}_1 + KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right) $$
We can apply this decomposition to $\log p(y)$: $$ \begin{aligned} \log p(y) &= \log \mathbb{E}_{p(u)}\left[p(y \mid u)\right] \\ &= \log \mathbb{E}_{p(u)}\left[\exp\left(\mathcal{L}_1 + KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right)\right)\right] \\ &= \log \mathbb{E}_{p(u)}\left[\exp\left(\mathcal{L}_1\right)\exp\left(KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right)\right)\right] \\ \end{aligned} $$
We can also plug this into the variational inference decomposition of $\log p(y)$: $$ \begin{aligned} \log p(y) &= \mathbb{E}_{p(u)}\left[\log p(y\mid u)\right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1 + KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right) \right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] + \mathbb{E}_{p(u)}\left[KL\left(p(f^*,\mathrm{f} \mid u) \mid \mid p(f^*,\mathrm{f} \mid u,y)\right) \right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{p(f^*,\mathrm{f} \mid u)}\right]\right] - \mathbb{E}_{p(u)}\left[\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{p(f^*,\mathrm{f} \mid u)}\right] +\log \frac{p(u \mid y)}{p(u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{p(f^*,\mathrm{f} \mid u)} +\log \frac{p(u \mid y)}{p(u)}\right]\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f} \mid y,u)}{p(f^*,\mathrm{f} \mid u)} \frac{p(u \mid y)}{p(u)}\right]\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f},u \mid y)}{p(f^*,\mathrm{f} \mid u)\cancel{p(u \mid y)}} \frac{\cancel{p(u \mid y)}}{p(u)}\right]\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(u)}\left[\mathbb{E}_{p(f^*,\mathrm{f} \mid u)}\left[\log \frac{p(f^*,\mathrm{f},u \mid y)}{p(f^*,\mathrm{f}, u)}\right]\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] -\mathbb{E}_{p(f^*,\mathrm{f}, u)}\left[\log \frac{p(f^*,\mathrm{f},u \mid y)}{p(f^*,\mathrm{f}, u)}\right] \\ &= \mathbb{E}_{p(u)}\left[\mathcal{L}_1\right] +KL\left(p(f^*,\mathrm{f}, u) \mid\mid p(f^*,\mathrm{f},u \mid y)\right) \end{aligned} $$Since the KL term is non-negative, $\mathbb{E}_{p(u)}\left[\mathcal{L}_1\right]$ is itself a lower bound on $\log p(y)$.
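A quick Monte Carlo sanity check of this lower bound, with an RBF kernel and made-up data (the kernel, data and variable names are all illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
Z = np.linspace(-3, 3, 6)[:, None]          # inducing inputs
sigma2 = 0.1 ** 2
N, M = len(X), len(Z)

K_ff = rbf(X, X)
K_uu = rbf(Z, Z) + 1e-8 * np.eye(M)
K_fu = rbf(X, Z)
A = K_fu @ np.linalg.inv(K_uu)              # K_fu K_uu^{-1}

# Exact log marginal likelihood of the full GP
log_py = mvn.logpdf(y, np.zeros(N), K_ff + sigma2 * np.eye(N))

# Monte Carlo estimate of E_{p(u)}[L_1] with u ~ N(0, K_uu)
u = rng.multivariate_normal(np.zeros(M), K_uu, size=20_000)
mu = u @ A.T                                # each row: K_fu K_uu^{-1} u
trace_term = np.trace(K_ff - A @ K_fu.T)    # trace(K_ff - Q_ff)
loglik = (-0.5 * N * np.log(2 * np.pi * sigma2)
          - 0.5 * ((y[None, :] - mu) ** 2).sum(axis=1) / sigma2)
E_L1 = loglik.mean() - trace_term / (2 * sigma2)

print(log_py, E_L1)                         # E_{p(u)}[L_1] should be below log p(y)
```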
Things left to do: compute $\mathbb{E}_{p(u)}\left[\mathcal{L}_1\right]$, and get back to the idea of substituting $q(u)=\mathcal{N}(m,S)$ in the first expression.
OLD STUFF: $$ \begin{aligned} \mathcal{L}(q) &= \mathbb{E}_{q(f)}\left[\log \prod_{i=1}^N p(y_i\mid \mathrm{f}_i)\right] - KL\left[q(u)\mid\mid p(u)\right] \quad \text{(step 6)} \\ &= \mathbb{E}_{q(\mathrm{f})}\left[\sum_{i=1}^N \log p(y_i\mid \mathrm{f}_i)\right] - KL\left[q(u)\mid\mid p(u)\right]\\ &= \sum_{i=1}^N \mathbb{E}_{q(\mathrm{f})}\left[\log p(y_i\mid \mathrm{f}_i)\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{q(\mathrm{f})}\left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \int \left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right]q(\mathrm{f}_i)\overbrace{\int q(\mathrm{f}_1,..,\mathrm{f}_{i-1},\mathrm{f}_{i+1},..,\mathrm{f}_N \mid \mathrm{f}_i) d(\mathrm{f}_1,..,\mathrm{f}_{i-1},\mathrm{f}_{i+1},..,\mathrm{f}_N)}^{=1}d\mathrm{f}_i - KL\left[q(u)\mid\mid p(u)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{q(\mathrm{f}_i)}\left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right] - KL\left[q(u)\mid\mid p(u)\right]\\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(\mathrm{f}_i)}\left[ - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{q(\mathrm{f}_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(\mathrm{f}_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i- \mathrm{f}_i)^2\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(\mathrm{f}_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i^2+ \mathrm{f}_i^2 -2y_i \mathrm{f}_i)\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[- \frac{1}{2\sigma^2}\left(y_i^2+ \mathbb{E}_{p(\mathrm{f}_i \mid u)}[\mathrm{f}_i^2] -2y_i \mathbb{E}_{p(\mathrm{f}_i \mid u)}[\mathrm{f}_i]\right)\right] - KL\left[q(u)\mid\mid p(u)\right] \quad \text{Using }p(\mathrm{f}_i \mid u ) = \mathcal{N}(K_{\mathrm{f}_iu}K_{uu}^{-1}u, K_{\mathrm{f}_i\mathrm{f}_i} - Q_{\mathrm{f}_i\mathrm{f}_i}) \\ &= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[- \frac{1}{2\sigma^2}\left(y_i^2+ K_{\mathrm{f}_i\mathrm{f}_i} - Q_{\mathrm{f}_i\mathrm{f}_i} + u^tK_{uu}^{-1}K_{u\mathrm{f}_i}K_{\mathrm{f}_iu}K_{uu}^{-1}u -2y_i K_{\mathrm{f}_iu}K_{uu}^{-1}u\right)\right] - KL\left[q(u)\mid\mid p(u)\right] \\ \end{aligned} $$