Bayesian Neural Networks

In neural networks we want to learn a function $f^\omega(x)$, parametrized by weights $\omega$, that maps points $x$ from our input space to the output space. To do this we minimize a risk function $J(\omega)$ over our set of training pairs $\mathcal{D}=\{x_i,y_i\}_{i=1}^N$:

$$ J(\omega) = \sum_{i=1}^N err(y_i,f^\omega(x_i)) + \Omega(\omega) $$

where $\Omega(\omega)$ is a regularization term. To minimize $J$ we use stochastic gradient descent on $\omega$: we randomly initialize $\omega$ and then update it iteratively using stochastic estimates of the gradient. This eventually gives us some optimal weights $\omega^*$ that we use to make predictions at new points $x^*$: $f^{\omega^*}(x^*)$.
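For reference, here is a minimal sketch of this standard (non-Bayesian) training loop in PyTorch. The data, architecture and hyperparameters are illustrative placeholders, with `weight_decay` playing the role of $\Omega(\omega)$:

```python
import torch
import torch.nn as nn

# Toy 1-D regression data: N pairs (x_i, y_i).
X = torch.randn(256, 1)
y = torch.sin(3 * X) + 0.1 * torch.randn(256, 1)

# f^w(x): a small fully connected network.
f = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

# weight_decay plays the role of the regularizer Omega(w).
opt = torch.optim.SGD(f.parameters(), lr=1e-2, weight_decay=1e-4)

for step in range(1000):
    idx = torch.randint(0, 256, (32,))  # random minibatch -> "stochastic" gradient
    opt.zero_grad()
    loss = ((y[idx] - f(X[idx])) ** 2).mean()  # err(y_i, f^w(x_i)) on the minibatch
    loss.backward()
    opt.step()

# The result: a single point prediction for a new x*, with no uncertainty.
x_star = torch.tensor([[0.5]])
print(f(x_star))
```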

This is useful and often enough in many applications. However, sometimes we also want to attach some uncertainty information to our predictions. One of the most principled ways to do this is with Bayesian statistics. Bayesian statistics + neural networks = Bayesian Neural Networks!

In Bayesian Neural Networks we assume some form for the conditional distribution $p(y_i \mid f^\omega(x_i))$. In the case of regression we usually use the normal distribution: $y_i \mid f^\omega(x_i) \sim \mathcal{N}(f^\omega(x_i),\sigma^2)$, which is equivalent to saying that $y_i = f^\omega(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$.

In classification we could set, for example, $y_i \mid f^\omega(x_i) \sim \mathrm{Ber}(\text{sigmoid}(f^\omega(x_i)))$.
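The Gaussian choice also connects back to the squared-error term in $J(\omega)$: taking the negative logarithm of the normal density gives

$$ -\log p(y_i \mid f^\omega(x_i)) = \frac{1}{2\sigma^2}\left(y_i - f^\omega(x_i)\right)^2 + \frac{1}{2}\log(2\pi\sigma^2) $$

so, up to constants, minimizing the squared error is exactly maximum likelihood under this model.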

Bayesian inference promises us a whole probability distribution over the weights $\omega$ instead of a single point estimate. This distribution is called the posterior distribution, $p(\omega \mid y,X)$, and Bayes' theorem gives us the way to compute it:

$$ p(\omega \mid y,X) = \frac{p(y\mid X,\omega)p(\omega)}{p(y\mid X)}\quad \text{(Bayes Theorem)} $$

The terms in $\text{(Bayes Theorem)}$ have their own names:

  • The first term in the numerator is called the likelihood. We will assume that each term in this joint probability is independent of the others given the prediction value $f^\omega(x_i)$, which mathematically means: $$ p(y\mid \omega, X) = \prod_{i=1}^N p(y_i \mid f^\omega(x_i))\quad \text{(Likelihood)} $$
  • The second term in the numerator is called the prior over the weights $\omega$. It is the controversial one, since we have to choose it by hand and it affects the posterior we obtain. If we have enough data we can just use a non-informative prior like $\mathcal{N}(0,\beta)$ with $\beta$ large.
  • The term in the denominator is called the marginal likelihood. It is constant for a fixed dataset, and if we knew it we would have the posterior, since we can already evaluate the likelihood and the prior. We can compute it with the following integral (a toy sketch of this computation follows the list):
$$ p(y\mid X) = \int p(y\mid X, \omega)p(\omega)d\omega $$
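Why is this integral the hard part? A naive way to approximate it is Monte Carlo with samples from the prior. Here is a sketch for a made-up one-parameter model (the data, noise scale $\sigma$ and prior scale $\beta$ are illustrative assumptions); even in one dimension the estimator is noisy, and for the millions of weights of a real network it becomes hopeless:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i = w * x_i + eps with eps ~ N(0, sigma^2); prior w ~ N(0, beta).
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 0.9, 2.2])
sigma, beta = 0.3, 10.0

def log_likelihood(w):
    # log p(y | X, w) = sum_i log N(y_i | w * x_i, sigma^2)
    return np.sum(-0.5 * ((y - w * x) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma**2))

# p(y | X) = E_{p(w)}[ p(y | X, w) ], estimated with samples from the prior.
ws = rng.normal(0.0, np.sqrt(beta), size=100_000)
marginal = np.mean([np.exp(log_likelihood(w)) for w in ws])
print(marginal)
```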

Variational Inference Approach

The Variational Inference (VI) approach arises because the marginal likelihood term is difficult to obtain. In VI we look for a distribution over $\omega$, parametrized by $\theta$, that approximates the posterior: $q_\theta(\omega)\approx p(\omega \mid y,X)$.

Here, in VI, "approximates" means minimizing the Kullback-Leibler divergence between the two distributions: $$ \arg\min_\theta KL[q_\theta(\omega) || p(\omega \mid y,X)] = \arg\min_\theta -\mathbb{E}_{q_\theta(\omega)}\left[\log \frac{p(\omega \mid y, X)}{q_\theta(\omega)}\right] $$

For example, we can try to find the normal distribution that best approximates the posterior: in that case we set $q_\theta(\omega) = \mathcal{N}(\omega \mid \mu, \Sigma)$ and $\theta=\{\mu,\Sigma\}$. Then we minimize this divergence w.r.t. $\theta$ and we are done!
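As a concrete sketch, here is the usual mean-field simplification of that family (diagonal $\Sigma$) in PyTorch; parametrizing the standard deviation through a softplus of an unconstrained `rho` is one standard way to keep it positive (the names here are illustrative):

```python
import torch
import torch.nn.functional as F

# Mean-field Gaussian q_theta(w) = N(w | mu, diag(s^2)) over a flattened
# weight vector; theta = {mu, rho}, with s = softplus(rho) > 0.
n_weights = 100
mu = torch.zeros(n_weights, requires_grad=True)
rho = torch.full((n_weights,), -3.0, requires_grad=True)

def q_theta():
    return torch.distributions.Normal(mu, F.softplus(rho))

w = q_theta().rsample()              # one weight sample w ~ q_theta(w)
log_q = q_theta().log_prob(w).sum()  # log q_theta(w) under the mean-field q
```

The `rsample` call is already the reparametrization trick that will appear later in this post.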

Nice, but if we look at the KL divergence expression, it depends on the posterior, which is exactly what we are looking for... so by itself it is not very useful. Fortunately, we have the following expression, which is central in the VI literature.

$$ \log p(y \mid X) = \overbrace{\mathbb{E}_{q_\theta(\omega)}\left[\log \frac{p(y\mid\omega,X)p(\omega)}{q_\theta(\omega)} \right]}^{-\mathcal{L}(q_\theta)} + \overbrace{\mathbb{E}_{q_\theta(\omega)}\left[\log \frac{q_\theta(\omega)}{p(\omega \mid y,X)}\right]}^{KL[q_\theta(\omega) || p(\omega \mid y,X)]} $$

This expression says that the $-\mathcal{L}$ term plus the KL divergence equals $\log p(y \mid X)$, which is constant whatever $q_\theta$ distribution we choose!
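For completeness, the identity follows in one line: adding the two terms, the $q_\theta(\omega)$ factors cancel and Bayes' theorem makes the result independent of $\omega$:

$$ -\mathcal{L}(q_\theta) + KL[q_\theta(\omega) || p(\omega \mid y,X)] = \mathbb{E}_{q_\theta(\omega)}\left[\log \frac{p(y\mid\omega,X)p(\omega)}{p(\omega \mid y,X)}\right] = \log p(y \mid X) $$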

The term $-\mathcal{L}$ is called the ELBO, which stands for evidence lower bound, since it is a lower bound on the log marginal likelihood $\log p(y\mid X)$ (the KL divergence is always non-negative). So we see that minimizing the KL divergence is equivalent to maximizing the ELBO, which is equivalent to minimizing $\mathcal{L}$:

$$ \arg \min_\theta KL[q_\theta(\omega) || p(\omega \mid y,X)] = \arg \min_\theta \mathcal{L}(q_\theta) $$

Therefore our approach will be to minimize the $\mathcal{L}$ term w.r.t. $\theta$. We can write $\mathcal{L}$ in the following handy way:

$$ \begin{aligned} \mathcal{L}(q_\theta) &= \mathbb{E}_{q_\theta(\omega)}\left[-\log p(y\mid\omega,X) - \log \frac{p(\omega)}{q_\theta(\omega)} \right] \\ &= -\mathbb{E}_{q_\theta(\omega)}\left[\log p(y\mid\omega,X)\right] -\mathbb{E}_{q_\theta(\omega)}\left[ \log \frac{p(\omega)}{q_\theta(\omega)} \right]\\ &= -\mathbb{E}_{q_\theta(\omega)}\left[\log p(y\mid\omega,X)\right] + KL[q_\theta(\omega) || p(\omega)] \end{aligned} $$

Watch out: the $KL$ divergence in this equation is between our approximate distribution $q_\theta$ and the prior, whereas before we had the divergence between $q_\theta$ and the posterior.

This expression is a trade-off between the prior and the likelihood: read it as "do not diverge too much from the prior unless you can significantly reduce the first expectation (the expected negative log-likelihood)"!
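One nice consequence: when $q_\theta$ is the mean-field Gaussian from before and the prior is $p(\omega)=\mathcal{N}(0,\beta I)$, this new KL term has a closed form. A sketch (the function name is mine):

```python
import torch

def kl_q_prior(mu, s, beta):
    """KL[ N(mu, diag(s^2)) || N(0, beta * I) ], summed over dimensions."""
    var = s ** 2
    return 0.5 * torch.sum(var / beta + mu ** 2 / beta - 1.0
                           + torch.log(beta / var))
```

For prior/posterior pairs without a closed form, this term can instead be estimated by Monte Carlo with samples from $q_\theta$.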

If we now plug in the likelihood definition $(\text{Likelihood})$ we get:

$$ \begin{aligned} \mathcal{L}(q_\theta) &=-\mathbb{E}_{q_\theta(\omega)}\left[ \log \left( \prod_{i=1}^N p(y_i \mid f^\omega(x_i))\right)\right] + KL[q_\theta(\omega) || p(\omega)] \\ &= -\sum_{i=1}^N \mathbb{E}_{q_\theta(\omega)}\left[ \log p(y_i \mid f^\omega(x_i))\right] + KL[q_\theta(\omega) || p(\omega)] \end{aligned} $$

This expression has the nice property of being a sum over all the training data $\mathcal{D}$. This is cool since it means we can optimize it by stochastic gradient descent. Formally, if we consider a random subset $S$ of $\mathcal{D}$ of size $M$ and the estimator of $\mathcal{L}$: $$ \hat{\mathcal{L}}(q_\theta) = -\frac{N}{M} \sum_{i \in S} \mathbb{E}_{q_\theta(\omega)}\left[ \log p(y_i \mid f^\omega(x_i))\right] + KL[q_\theta(\omega) || p(\omega)] $$ we have that $\nabla_\theta \mathcal{L}(q_\theta) = \mathbb{E}_S[\nabla_\theta \hat{\mathcal{L}}(q_\theta)]$, i.e. the minibatch gradient is unbiased. (The expectation here is taken over all the subsets of $\mathcal{D}$ of size $M$!)
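A sketch of $\hat{\mathcal{L}}$ as a minibatch loss. The `model` and `kl_fn` callables are assumptions, not fixed by the text: `model(x)` should draw a fresh $\omega \sim q_\theta(\omega)$ on each forward pass (one Monte Carlo sample of the inner expectation), and `kl_fn()` should return $KL[q_\theta(\omega) || p(\omega)]$, e.g. the closed form above:

```python
def elbo_loss(model, x_batch, y_batch, N, kl_fn, sigma=1.0):
    """Minibatch estimator L_hat: rescaled NLL plus the (full) KL term."""
    M = x_batch.shape[0]
    pred = model(x_batch)  # uses one weight sample w ~ q_theta internally
    # Gaussian negative log-likelihood, dropping the additive constant.
    nll = 0.5 * ((y_batch - pred) ** 2 / sigma**2).sum()
    return (N / M) * nll + kl_fn()
```

Averaging several forward passes per batch gives a lower-variance estimate of the inner expectation at the cost of extra computation.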

So now we can optimize $\mathcal{L}$ by stochastic gradient descent to find the optimal $\theta$ and we will be done! But... wait... how do we compute the gradient of an expectation with respect to the parameters of the density we are averaging over?

$$ \nabla_\theta \hat{\mathcal{L}}(q_\theta) = -\frac{N}{M} \sum_{i \in S}\nabla_{\color{red}{\theta}}\mathbb{E}_{q_\color{red}{\theta}(\omega)}\left[ \log p(y_i \mid f^\omega(x_i))\right] + \nabla_\color{red}{\theta} KL[q_\color{red}{\theta}(\omega) || p(\omega)] $$

This is where we can use Shakir Mohamed's gradient-estimation tricks:

In the case of dropout networks we will use the reparametrization trick. The $q_\theta$ distribution in this case is not set by hand; rather, it is defined through its reparametrization:

$$ \begin{aligned} \omega_\theta &= \{\mathrm{diag}(\epsilon_i)M_i, b_i\}_{i=1}^L \quad \epsilon_i \sim \mathrm{Ber}(p_i)\\ \theta &= \{M_i,b_i\}_{i=1}^L \end{aligned} $$

where the entries of each $\epsilon_i$ are independent Bernoulli variables that randomly keep or drop rows of the weight matrix $M_i$.

Since a sample $\omega \sim q_\theta(\omega)$ is obtained deterministically from $\theta$ and a sample $\epsilon \sim p(\epsilon)$, the expectation can be written as:

$$ \mathbb{E}_{q_\color{red}{\theta}(\omega)}\left[ \log p(y_i \mid f^\omega(x_i))\right] = \mathbb{E}_{p(\epsilon)} \left[ \log p(y_i \mid f^{\omega_\theta(\epsilon)}(x_i))\right] $$

And since $p(\epsilon)$ does not depend on $\theta$, we can interchange the $\nabla_\theta$ operator with the expectation to get:

$$ \nabla_\theta \hat{\mathcal{L}}(q_\theta) = -\frac{N}{M} \sum_{i \in S} \mathbb{E}_{p(\epsilon)}\left[\nabla_{\color{green}{\theta}} \log p(y_i \mid f^{\omega_{\color{green}{\theta}}(\epsilon)}(x_i))\right] + \nabla_\color{red}{\theta} KL[q_\color{red}{\theta}(\omega) || p(\omega)] $$
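In practice this yields the well-known "MC dropout" recipe: train the dropout network as usual (the KL term can be approximated by a standard weight-decay penalty), and at prediction time keep dropout switched on and average several stochastic forward passes, each of which draws a fresh $\epsilon$ and hence a fresh $\omega$. A minimal PyTorch sketch (architecture, dropout rates and the number of passes are illustrative):

```python
import torch
import torch.nn as nn

# A dropout network: q_theta is defined implicitly by the Bernoulli masks
# eps_i that the Dropout layers apply between the weight matrices M_i.
net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)

# ... train `net` as usual (dropout on, weight decay standing in for the KL) ...

# Prediction: keep the Dropout layers stochastic and average T passes.
net.train()  # train() mode keeps dropout active at "test" time
x_star = torch.tensor([[0.5]])
with torch.no_grad():
    samples = torch.stack([net(x_star) for _ in range(100)])
print(samples.mean(dim=0), samples.std(dim=0))  # predictive mean and spread
```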