Variational Bayes Notes

What is variational Bayes?

A technique that allows us to compute approximations to non-closed form Bayes update equations.

Why would we use it?

In Bayesian parameter estimation, to compute the intractable integration of the denominator in the parameter posterior probability:

\begin{equation} p(\Theta \vert \Omega, Y) = \frac{p(Y, \Theta \vert \Omega)}{\int p(Y, \Theta \vert \Omega) d\Omega} \end{equation}

and in Bayesian model fitting, to find the denominator of the model posterior probability (for a fixed $\Omega$):

\begin{equation} p(\Omega \vert Y) = \frac{p(Y \vert \Omega) p( \Omega )}{\sum_{\Omega \in \mathcal{M}} p(Y\vert \Omega)p(\Omega)} \end{equation}

Where $\Omega$ is the candidate model, $Y = \{(x_1,d_1),\dots,(x_N,d_N)\}$ is the training data, $\Theta$ is the set of unknown model parameters (and prior hyperparameters) under $\Omega$.

Also for Bayesian data fusion, if we're using GMM priors with MMS likelihoods, we're trying to evaluate

\begin{equation}\label{eq:bdf} p(X_k \vert D_k) = \frac{P(D_k \vert X_k) p( X_k )}{\int P(D_k \vert X_k) p( X_k )dX_k} = \frac{p(X_k,D_k)}{P(D_k)} \end{equation}

Where $p(D_k \vert X_k)$ is the MMS model and

$$ p( X_k ) = p(X_{k} \vert D_{1:k-1}) = \int p(X_k \vert X_{k-1}) p(X_{k} \vert D_{1:k-1}) dX_{k-1} $$

where $p(X_k \vert X_{k-1})$ is the state transition pdf and we are only fusing $D_k$ sensor data.

Alternatives & Extensions

  • Grid-based approaches
  • MCMC (and other Monte Carlo/Particle Filter techniques)
  • Laplace Approximation
  • VBIS: Use variational bayes outputs as the parameters for imporance distribution $q(X_k)$
  • LWIS: Use prior $p(X_k)$ as imporance distribution $q(X_k)$

How does it work?

VB minimizes the KLD between the integrable, closed-form variational posterior parameter distribution $q(\Theta \vert Y, \Omega)$ and the true posterior $p(\Theta \vert \Omega, Y)$ for $\Omega \in \mathcal{M}$:

$$ KL(q\vert\vert p) = - \int q(\Theta \vert Y, \Omega) log{\frac{p(\Theta \vert Y, \Omega)}{q(\Theta \vert Y, \Omega)}}d\Theta $$

But, since $p(\Theta \vert Y, \Omega)$ is unavailable, we can instead minimize the KLD by maximizing a lower bound $\mathcal{L}$ to $\log{p(Y\vert\Omega)}$, where:

$$ \mathcal{L} = \int q(\Theta \vert Y, \Omega) log{\frac{p(Y, \Omega \vert \Theta)}{q(\Theta \vert Y, \Omega)}}d\Theta $$

and

\begin{align*} \log{p(Y\vert\Omega)} &= \mathcal{L} + KL(q \vert\vert p) \\ &= \int q(\Theta \vert Y, \Omega) log{\frac{p(Y, \Omega \vert \Theta)}{q(\Theta \vert Y, \Omega)}}d\Theta - \int q(\Theta \vert Y, \Omega) log{\frac{p(\Theta \vert Y, \Omega)}{q(\Theta \vert Y, \Omega)}}d\Theta \\ &= \int q(\Theta \vert Y, \Omega) log{(p(Y, \Omega \vert \Theta))} - q(\Theta \vert Y, \Omega) log{(p(\Theta \vert Y, \Omega))}d\Theta \\ &= \int q(\Theta \vert Y, \Omega) log{\frac{p(Y, \Omega \vert \Theta)}{p(\Theta \vert Y, \Omega)}} d\Theta \end{align*}

How do we use it?

Take equation \ref{eq:bdf}. We want to approximate $p(X_k,D_k)$ (analytically intractable) with an unnormalized Gaussian lower bound pdf, which leads to a variational Bayesian Gaussian posterior approximation $\hat{p}(X_k\vert D_k)$.

If $f(D_k,X_k)$ is an unnormalized Gaussian function that approximates the softmax likelihood $P(D_k \vert X_k)$, then

\begin{align}\label{eq:approx_joint} p(X_k,D_k) &\approx \hat{p}(X_k,D_k) = p(X_k)f(D_k,X_k) \\ P(D_k) = C &\approx \hat{C} = \int_{-\infty}^{\infty} \hat{p}(X_k,D_k) \end{align}

Since $p(X_k)$ is Gaussian, $\hat{p}(X_k,D_k)$ is as well (as the product of two gaussians).

How do we derive $f(D_k,X_k)$?

[2] derives an upper bound to the softmax denominator:

\begin{equation}\label{eq:upper} \log\left(\sum_{c=1}^m e^{y_c}\right) \leq \alpha + \sum_{c=1}^m \frac{y_c - \alpha - \xi_c}{2} + \lambda(\xi_c)[(y_c - \alpha)^2 - \xi_c^2] + log(1 + e^{\xi_c}) \end{equation}

where $\lambda(\xi_c) = \frac{1}{2\xi_c}\left[\frac{1}{1 + e^{-\xi_c}}\right] - \frac{1}{2}$ and $y_c = w^T_cx + b_c$. $\alpha$ and $\xi_c$ are free variational parameters; given $y_c$, $\alpha$ and $\xi_c$ can be selected to minimize the upper bound in \ref{eq:upper}.

Assuming known $\alpha$ and $\xi_c$, we take the log of the softmax likelihood to get:

\begin{align*} \log{P(D_k=j\vert X_k)} &= w^T_jx + b_j - \log{\left(\sum_{c=1}^m e^{w^T_cx + v_c}\right)} \\ &\leq \log{f(D_j=j,X_k)} = g_j + h^T_jx - \frac{1}{2}x^TK_jx \end{align*}

Where

\begin{equation}\label{eq:approx_likelihood} f(D_j=j,X_k) = \exp\{g_j + h^T_jx - \frac{1}{2}x^TK_jx\} \end{equation}

The prior $p(X_k)$ can be expressed similarly:

\begin{equation}\label{eq:gaussian_prior} p(X_k) = \exp\{g_p + h^T_px - \frac{1}{2}x^TK_px\} \end{equation}

where $g_p = -\frac{1}{2}(log{\lvert2\pi\Sigma\rvert} + \mu^TK_p\mu)$, $h_p = K_p\mu$ $K_p = \Sigma^{-1}$. This is simply a reformulation of the equation of a gaussian:

\begin{align*} p(X_k) &= \frac{1}{\sqrt{\lvert 2 \pi \Sigma \rvert}} exp{\{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) \}} \\ &=exp{\{-\frac{1}{2}(x^T - \mu^T)K_p(x - \mu) -\frac{1}{2}\log{\lvert 2 \pi \Sigma \rvert} \}}\\ &=exp{\{-\frac{1}{2}x^TK_px +\frac{1}{2}(x^TK_p\mu +\mu^TK_px) -\frac{1}{2}\mu^TK_p\mu -\frac{1}{2}\log{\lvert 2 \pi \Sigma \rvert} \}}\\ &=exp{\{-\frac{1}{2}(\log{\lvert 2 \pi \Sigma \rvert} + \mu^TK_p\mu ) +\frac{1}{2}(K_p\mu x +K_p \mu x) -\frac{1}{2}x^TK_px \}}\\ &= \exp\{g_p + h^T_px - \frac{1}{2}x^TK_px\} \end{align*}

Since equation \ref{eq:approx_joint} is simply the product of two Gaussians, it becomes:

\begin{equation}\label{eq:approx_joint_product} \hat{p}(X_k,D_k) = p(X_k)f(D_k,X_k) = \exp\{g_l + h^T_lx - \frac{1}{2}x^TK_lx\} = \mathcal{N}(\hat{\mu}_{VB},\hat{\Sigma}_{VB}) \end{equation}

But how do we optimize $\alpha$ and $\xi_c$ ?

Minimizing the RHS of \ref{eq:upper} gives us:

\begin{align} \xi^2_c &= y^2_c + \alpha^2 - 2\alpha y_c \label{eq:xi} \\ \alpha &= \frac{\left(\frac{m-2}{4}\right) \sum_{c=1}^m\lambda(\xi_c)y_c}{\sum_{c=1}^m\lambda(\xi_c)} \label{eq:alpha} \end{align}

But, both depend on $X_k$, which is unobserved. Instead, we minimize the expected value of the RHS of \ref{eq:upper} with respect to the posterior.

Apparently (?) this is equivalent to maximizing $\log{\hat{P}(D_k)}$, the approximate the marginal log-likelihood of the observation $D_k = j$:

$$ \log{\hat{P}(D_k)} = \log{\hat{C}} = \log \int_{-\infty}^{\infty} \hat{p}(X_k,D_k)dX_k $$

We can now use expectation-maximization (EM) to iteratively optimize $\alpha$ and $\xi_c$. We take the expectations of \ref{eq:xi} and \ref{eq:alpha} under the current $\hat{p}(X_k \vert D_k)$ estimate. Additionally, we'll need the following:

\begin{align} \langle y_c\rangle &= w^T_c\hat{\mu}_{VB} + b_c \label{eq:y_expected} \\ \langle y^2_c\rangle &= w^T_c(\hat{\Sigma}_{VB} + \hat{\mu}_{VB}\hat{\mu}_{VB}^T)w_c + 2w^T_c\hat{\mu}_{VB}b_c + b^2_c \label{eq:y2_expected} \end{align}

The code below demonstrates the expectation maximization algorithm.

Issues

The approximate posterior is optimistic relative to the true posterior, and will be biased due to this optimism as well.

Examples

Variational Bayes (VB) with Softmax and Gaussian Prior

Let's restate the bayesian data fusion problem. We want to compute the following:

\begin{equation}\label{eq:vb-bdf} p(X_k \vert D_k) = \frac{P(D_k = j \vert X_k) p( X_k )}{\int P(D_k = j \vert X_k) p( X_k )dX_k} = \frac{p(X_k,D_k)}{P(D_k)} \end{equation}

Where, for a softmax likelihood,

\begin{equation}\label{eq:softmax} P(D_k = j \vert X_k) = \frac{e^{w^T_jx + b_j}}{\sum_{c=1}^m e^{w^T_cx + b_c}} \end{equation}

and, for a gaussian prior,

\begin{equation} P(X_k) = \frac{1}{\sqrt{\lvert 2 \pi \Sigma}} \exp{\{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\}} \end{equation}

We show these two below for a one-dimensional problem of estimating the speed of a target.


In [1]:
from cops_and_robots.robo_tools.fusion.softmax import speed_model
%matplotlib inline

sm = speed_model()
sm.plot(plot_classes=False)



In [8]:
from __future__ import division
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

mu = 0.3
sigma = 0.1

gaussian = norm(loc=mu, scale=sigma)
x_space = np.linspace(-5, 5, 500)

# Plot the frozen distribution
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(8, 8)
ax.plot(x_space, gaussian.pdf(x_space), lw=2, label='frozen pdf', c='g')
ax.fill_between(x_space, 0, gaussian.pdf(x_space), alpha=0.2, facecolor='g')
ax.set_xlim([0,0.4])
ax.set_ylim([0,10])
ax.set_title('Gaussian prior')


Out[8]:
<matplotlib.text.Text at 0x111b5e490>

While the numerator of \ref{eq:vb-bdf} can be computed easily, there is no closed form solution for its denominator.

We'll follow the following algorithm:

Inputs

  • prior $\mu$ and $\Sigma$;
  • $D_k = j$ with likelihood in eq.(7);
  • initial $\alpha$ and $\xi_c$, for j, c ∈ {1, ...,m}

Outputs

  • posterior mean $\hat{\mu}_{VB}$
  • Posterior covariance $\hat{\Sigma}_{VB}$

Steps

  1. E-step: for all fixed $\xi_c$ and $\alpha$,
    1. compute $\hat{\alpha}_{VB}$ and $\hat{\Sigma}_{VB}$ via eq. (19);
    2. compute $\langle y_c\rangle$ and $\langle y_c^2\rangle$ via eqs. (23)-(24);
  2. M-step: for all fixed $\langle y_c\rangle$ and $\langle y_c^2\rangle$,
           for $i = 1 : n_{lc}$ do
    1. compute all $\xi_c$ for fixed $\alpha$ via eq. (20)
    2. compute $\alpha$ for all fixed $\xi_c$ via eq. (21)
      end for
  3. If converged, return $\hat{C}$ via eq. (25) and stop; otherwise, return to step 1

In [ ]:
%debug
from numpy.linalg import inv
import pandas as pd

# SETTINGS:
n_lc = 15  # number of convergence loops
measurement = 'Slow'
tolerance = 10 ** -3  # for convergence
max_EM_steps = 500

# INPUT: Define input priors and initial values
prior_mu = np.zeros(1)
prior_sigma = np.array([[1]])
initial_alpha = 0
initial_xi = np.ones(4)

# Softmax values
m = 4
w = sm.weights
b = sm.biases
j = sm.class_labels.index(measurement)

# Preparation
xi_cs = initial_xi
alpha = initial_alpha
mu_hat = prior_mu
sigma_hat = prior_sigma


# dataframe
df = pd.DataFrame([[alpha, mu_hat[0], sigma_hat[0][0], xi_cs[0],xi_cs[1],xi_cs[2],xi_cs[3]]],
                  columns=('alpha','mu_hat','sigma_hat','xi_1','xi_2','xi_3','xi_4',))

def lambda_(xi_c):
    return 1 / (2 * xi_c) * ( (1 / (1 + np.exp(-xi_c))) - 0.5)

converged = False
EM_step = 0

import pdb; pdb.set_trace()  # breakpoint 5a7a3236 //

while not converged and EM_step < max_EM_steps:
    ################################################################
    # STEP 1 - EXPECTATION
    ################################################################
    for xi_c in xi_cs:
        # PART A #######################################################

        # find g_j
        sum1 = 0
        for c in range(m):
            if c != j:
                sum1 += b[c]
        sum2 = 0
        for c in range(m):
            sum2 = xi_c / 2 \
                + lambda_(xi_c) * (xi_c ** 2 - (b[c] - alpha) ** 2) \
                - np.log(1 + np.exp(xi_c))
        g_j = 0.5 *(b[j] - sum1) + alpha * (m / 2 - 1) + sum2

        # find h_j
        sum1 = 0
        for c in range(m):
            if c != j:
                sum1 += w[c]
        sum2 = 0
        for c in range(m):
            sum2 += lambda_(xi_c) * (alpha - b[c]) * w[c]
        h_j = 0.5 * (w[j] - sum1) + 2 * sum2

        # find K_j
        sum1 = 0
        for c in range(m):
            sum1 += lambda_(xi_c) * w[c].T .dot (w[c])
        K_j = 2 * sum1
        
        K_p = inv(prior_sigma)
        g_p = -0.5 * (np.log( np.linalg.det(2 * np.pi * prior_sigma))) \
            + prior_mu.T .dot (K_p) .dot (prior_sigma)
        h_p = K_p .dot (prior_mu)

        g_l = g_p + g_j
        h_l = h_p + h_j
        K_l = K_p + K_j

        mu_hat = inv(K_l) .dot (h_l)
        sigma_hat = inv(K_l)

        # PART B #######################################################
        y_cs = np.zeros(m)
        y_cs_squared = np.zeros(m)
        for c in range(m):
            y_cs[c] = w[c].T .dot (mu_hat) + b[c]
            y_cs_squared[c] = w[c].T .dot (sigma_hat + mu_hat .dot (mu_hat.T)) .dot (w[c]) \
                + 2 * w[c].T .dot (mu_hat) * b[c] + b[c] ** 2

    ################################################################
    # STEP 2 - MAXIMIZATION
    ################################################################
    for i, y_c in enumerate(y_cs):
        y_c_squared = y_cs_squared[i]

        for i in range(n_lc):

            # PART A #######################################################
            # Find xi_cs
            for c in range(m):
                xi_cs[c] = np.sqrt(y_c_squared + alpha ** 2 - 2 * alpha * y_c)


            # PART B #######################################################
            # Find alpha
            num_sum = 0
            den_sum = 0
            for xi_c in xi_cs:
                num_sum = lambda_(xi_c) * y_c
                den_sum = lambda_(xi_c)
            alpha = ((m - 2) / 4 + num_sum) / den_sum

    ################################################################
    # STEP 3 - CONVERGENCE CHECK
    ################################################################    
    
    new_df = pd.DataFrame([[alpha, mu_hat[0], sigma_hat[0][0], xi_cs[0],xi_cs[1],xi_cs[2],xi_cs[3]]],
                          columns=('alpha','mu_hat','sigma_hat','xi_1','xi_2','xi_3','xi_4',))
    df = df.append(new_df, ignore_index=True)
    EM_step += 1
df

In [27]:
inv(a)


Out[27]:
array([[ 1.]])

Comparison: discretized state space

Using a discrete environment, we get the following:


In [7]:
measurement = 'Slow'
measurement_i = sm.class_labels.index(measurement)

dx = 10/500

normalizer = 0
for x in x_space:
    lh = sm.probs_at_state(x, measurement)
    if np.isnan(lh):
        lh = 0.00
    normalizer += lh  * (gaussian.cdf(x + dx/2) - gaussian.cdf(x - dx/2))

posterior = np.zeros_like(x_space)
for i, x in enumerate(x_space):
    lh = sm.probs_at_state(x, measurement)
    if np.isnan(lh):
        lh = 0.00
    posterior[i] = lh * (gaussian.cdf(x + dx/2) - gaussian.cdf(x - dx/2)) / normalizer


# sum_of_gauss = 0
# for x in x_space:
#     sum_of_gauss += gaussian.cdf(x + dx/2) - gaussian.cdf(x - dx/2)
# print sum_of_gauss
    
ax = sm.plot_class(measurement_i, fill_between=False)
ax.plot(x_space, posterior, lw=3, label='posterior pdf', c='b')
ax.fill_between(x_space, 0, posterior, alpha=0.2, facecolor='b')
ax.plot(x_space, gaussian.pdf(x_space), lw=1, label='prior pdf', c='g')


ax.set_title('Posterior distribtuion')
ax.legend()
ax.set_ylim([0, 1.0])
plt.show()



In [ ]:

References

[1] N. Ahmed and M. Campbell, “Variational Bayesian learning of probabilistic discriminative models with latent softmax variables,” Signal Process. IEEE Trans. …, vol. XX, no. c, pp. 1–27, 2011.

[2] G. Bouchard, “Efficient bounds for the softmax function and applications to approximate inference in hybrid models,” in NIPS 2007 Workshop for Approximate Bayesian Inference in Continuous/Hybrid Systems, Whistler, BC, Canada, 2007.

[3] N. Ahmed, E. Sample, and M. Campbell, “Bayesian Multicategorical Soft Data Fusion for Human--Robot Collaboration,” IEEE Trans. Robot., vol. 29, no. 1, pp. 189–206, 2013.


In [4]:
from IPython.core.display import HTML

# Borrowed style from Probabilistic Programming and Bayesian Methods for Hackers
def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[4]: