Let's start with the problem of fitting a probability distribution $p(x)$ to a dataset of points. This is the problem of generative modeling, where one aims to learn a joint distribution over a set of variables. This is the main difference between discriminative and generative modeling: while in discriminative modeling one aims to learn a predictor from a set of observations, in generative modeling one models how the data is generated in the real world. Generative modeling gives one an understanding of the generative process behind the data and can naturally express causal relations of the world.
Depending on the complexity of the data being fit, we could consider models like a Gaussian, a Gaussian Mixture Model, or Probabilistic PCA. But as a matter of fact, none of these methods can handle the complexity of a complicated dataset like natural images. In this document we will explore various options for modeling $p(x)$ until we find one that can handle such a dataset.
This document grew out of the notes I took while doing the excellent Coursera MOOC on Bayesian Methods for Machine Learning, subsequently enhanced with my own reading on the subject.
We may want to fit a dataset of natural images to a probability distribution for the following use cases:
The main focus of this document is to find ways to model $p(x)$ where $x$ is complicated data such as a natural image. In this section we will look at why some of the simpler models don't work for our use case. In the following sections we will work our way into the world of stochastic variational inference and see how it gives us a suitable generative model for natural images.
One way to think of modeling ${p}(x)$ would be to use a convolutional neural network (CNN) that accepts an image and returns the probability of that image.
This looks like the simplest possible parametric model that should work, given that CNNs work very well with images.
The problem with this approach is that you have to normalize the distribution: it has to sum to one over all possible images in the world, and there are astronomically many of them (a 100 x 100 grayscale image alone has $256^{10000}$ possible configurations). This normalization constant is far too expensive to compute, and you need it to do training or inference properly.
So, this is infeasible. You can't do that because of normalization.
Think of modeling ${p}(x)$ for an image as a joint distribution over the individual pixels, ${p\left(x_{1}, \ldots, x_{d}\right)}$, which, by the chain rule, can be decomposed into a product of conditional distributions, ${p\left(x_{1}\right) p\left(x_{2} | x_{1}\right) \ldots p\left(x_{d} | x_{1}, \ldots, x_{d-1}\right)}$. Now you can try to build these conditional probability models to capture the overall joint probability.
The most natural way to model this conditional probability distribution is with a Recurrent Neural Network (RNN), $p\left(x_{k} | x_{1}, \ldots, x_{k-1}\right)=\operatorname{RNN}\left(x_{1}, \ldots, x_{k-1}\right)$. The RNN reads the image pixel by pixel and tries to predict the next pixel. With this approach we don't have to deal with a large normalization constant, because the RNN only has to model a one-dimensional distribution at each step. If, for example, your image is grayscale, each pixel is encoded as a number from 0 to 255 (its brightness level), and the normalization constant can be computed by summing over just these 256 values, which is easy. The problem with this approach is that image generation is pixel by pixel and hence very slow, as the sketch below illustrates.
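Here is a minimal sketch of the chain-rule generation loop (plain Python/NumPy; `next_pixel_probs` is a hypothetical stand-in for the trained RNN, not an implementation of one). The point is structural: every pixel must wait for all previous pixels, so sampling a single 100 x 100 image takes 10,000 sequential steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_pixel_probs(pixels_so_far):
    """Stand-in for p(x_k | x_1, ..., x_{k-1}); a real model would run an RNN here."""
    logits = rng.normal(size=256)          # pretend these came from the RNN state
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()             # normalizing over 256 values is cheap

d = 100 * 100                              # number of pixels in a 100 x 100 image
pixels = []
for k in range(d):                         # strictly sequential -- the bottleneck
    probs = next_pixel_probs(pixels)
    pixels.append(rng.choice(256, p=probs))
image = np.array(pixels).reshape(100, 100)
```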
Can we assume that the pixels are independent? In that case we have the factorization $p\left(x_{1}, \ldots, x_{d}\right)=p\left(x_{1}\right) \ldots p\left(x_{d}\right)$, which makes life much simpler. But, as it turns out, this is a very naive assumption and too restrictive a model, since the pixels of an image are almost always correlated.
Theoretically you could use a mixture of Gaussians here, since it can represent any probability distribution. But in practice, for a complicated dataset like natural images, this is really inefficient: you may need thousands of Gaussians, which makes the model too hard to train.
Here we can use a latent variable model, where each data item $x_{i}$ is caused by a latent variable ${t}$, and we marginalize ${t}$ out: $p(x)=\int p(x | t) p(t) d t$. The conditional here can be a Gaussian, so we effectively have a mixture of infinitely many Gaussians: for each value of ${t}$ there is one Gaussian, and we mix them with weights $p(t)$. Note that even if the Gaussians are factorized, i.e. have independent components for each dimension, the mixture is not. So this is a somewhat more powerful model than the Gaussian Mixture Model. We will explore this model more in the next section; the toy sketch below shows the Monte Carlo view of the marginalization.
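To make the "mixture of infinitely many Gaussians" view concrete, here is a toy Monte Carlo sketch of the marginalization (one-dimensional $x$ and $t$ and a made-up mean function; nothing here is the model we build later): sample many values of $t$ from the prior and average the Gaussian likelihoods $p(x | t)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def mu(t):
    # hypothetical non-linear mean function of the latent variable
    return np.tanh(t)

sigma = 0.5  # fixed likelihood scale for this toy example

def marginal_likelihood(x, n_samples=100_000):
    t = rng.standard_normal(n_samples)                 # t ~ p(t) = N(0, 1)
    return norm.pdf(x, loc=mu(t), scale=sigma).mean()  # E_{p(t)}[ p(x | t) ]

print(marginal_likelihood(0.3))   # Monte Carlo estimate of p(x = 0.3)
```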
Let's start with a model based on the continuous mixture of Gaussians: $p(x)=\int p(x | t) p(t) d t$. It's a latent variable model that will serve as the baseline on which we will apply successive refinements to add more generative power.
To complete the model we need to define the prior and the likelihood:
We use a standard Normal prior: ${p(t)=\mathcal{N}(0, I)}$. It just forces the latent variables $t$ to be centered around zero with unit variance.
We use Gaussian likelihood with the parameters of the Gaussian determined by the latent variable ${t}$: ${p(x | t)=\mathcal{N}(\mu(t), \Sigma(t))}$
Now we need to find a way to convert ${t}$ to the parameters of the Gaussian in the likelihood. One option is to use a linear function for this transformation, somewhat similar to what we do for Probabilistic PCA:
But as it turns out, this is not powerful enough to model complicated structures like natural images. If linearity isn't powerful enough, let's think in terms of non-linear transformations.
A neural network is a universal function approximator, so let's use the non-linearity that a Convolutional Neural Network (CNN) offers. We feed $t$ into two CNNs and have them generate the sufficient statistics needed to define the Gaussian distribution.
We have one CNN that takes as input the latent variable ${t}$ and outputs the mean vector for the image ${\mu}$. Another CNN takes the latent variable ${t}$ and outputs the covariance matrix ${\Sigma}$. This will define our model in some parametric form.
Also we need some weights ${w}$ for our neural networks which will be set during the training of the CNNs. Here's what we have for the model so far:
The latent variable model for the density estimation, which is a mixture of Gaussians:
Prior and likelihood as Gaussians:
But here we have a computational problem. If our images are 100 x 100, each image has 10,000 pixels, which is fairly low resolution. Even so, the covariance matrix $\Sigma$ would be 10,000 x 10,000, which is far too big for a CNN to output. One way to simplify this is to not use the full covariance matrix and instead assume that $\Sigma$ is diagonal, so the CNN only has to output the 10,000 diagonal entries. Here's how the likelihood of the model changes:
and we have the following model:
Now that we have the model, we need to train it somehow, and we will use Maximum Likelihood Estimation for this: maximize the likelihood of our dataset with respect to the parameters $w$ (the weights of the convolutional neural networks). Here's the integral that does this by marginalizing out the latent variable ${t}$:
Since we have a latent variable model, let's try and use the Expectation Maximization (EM) algorithm. It is specifically invented for these kinds of models.
In the EM algorithm:
For more details on the Expectation Maximization Algorithm and Latent Variable Models, here's an excellent post by Martin Krasser.
But there is a problem. In the E-step of EM we need to find the posterior of the latent variables, $p(T | X, w)$, and this is intractable in our case because the integrals involved have CNNs inside them; computing them analytically is just too hard.
So plain EM is not the way to model our use case.
Brief Recap: EM Algorithm
We find a lower bound on the log-likelihood, $\log p(X | \theta) \geq \mathcal{L}(\theta, q) \text { for any } q$, called the Variational Lower Bound (VLB). Here ${q}$ is the variational distribution.
In the E-step, we maximize the VLB with respect to ${q}$, keeping ${\theta}$ fixed
In the M-step, we maximize the VLB with respect to ${\theta}$ using the ${q}$ we got from the E-step
We repeat E-step and M-step till convergence
We could also approximate the expectation (with respect to ${q}$) in the M-step with samples drawn using MCMC:
and then we can maximize this average instead of the expected value. But this will be very slow: in each step of the EM algorithm we need to run a Monte Carlo sampler until convergence. So we have nested loops here - an outer loop for the EM algorithm and an inner loop for MCMC sampling. There are better alternatives that we can explore.
When plain EM algorithm doesn't work, we need to have approximations.
Let's try to approximate further. How about Variational Inference, where we maximize the same lower bound as above, $\log p(X | w) \geq \mathcal{L}(w, q)$ (also known as the Variational Lower Bound or Evidence Lower Bound, ELBO), i.e. $\underset{w, q}{\operatorname{maximize}}\ \mathcal{L}(w, q)$, but restrict the variational distribution to one that factorizes, $q_{i}\left(t_{i}\right)=\widetilde{q}_{i}\left(t_{i 1}\right) \ldots \widetilde{q}_{i}\left(t_{i m}\right)$? As it turns out, even this is intractable here. We need methods that scale the idea of variational inference to our latent variable model.
Besides being factorizable, let's also assume that the variational distribution of each individual object is Gaussian. So we have the following sequence of approximations:
$\underset{w, q}{\operatorname{maximize}} \mathcal{L}(w, q)$ (Maximize lower bound as in Plain EM)
$\Rightarrow$
$\underset{w, q_{1}, \ldots, q_{N}}{\operatorname{maximize}} \mathcal{L}\left(w, q_{1}, \ldots, q_{N}\right)$ subject to $q_{i}\left(t_{i}\right)=\widetilde{q}_{i}\left(t_{i 1}\right) \ldots \widetilde{q}_{i}\left(t_{i m}\right)$ (Factorized variational distribution)
$\Rightarrow$
$\underset{w, m_{1}, \ldots, m_{N}, s_{1}, \ldots, s_{N}}{\operatorname{maximize}} \mathcal{L}\left(w, q_{1}, \ldots, q_{N}\right)$ subject to $q_{i}\left(t_{i}\right)=\mathcal{N}\left(m_{i}, \operatorname{diag}\left(s_{i}^{2}\right)\right)$ (the variational distribution for each object is a Gaussian). Each object has a latent variable $t_{i}$ with its own variational distribution $q_{i}$, parameterized by $m_{i}$ and $s_{i}$; these become additional parameters of our model, and we maximize the lower bound with respect to them. The sketch below makes this parameter bookkeeping concrete.
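Here is a hypothetical sketch (PyTorch; the names and sizes are mine) of what storing a free $m_{i}$ and $s_{i}$ for every training object would look like:

```python
import torch

N, latent_dim = 10_000, 50          # imagine N = 1,000,000 for a real image dataset

# one free mean vector and one free (log-)standard-deviation vector per object:
# 2 * N * latent_dim parameters in total -- 100 million when N = 1e6
m = torch.nn.Parameter(torch.zeros(N, latent_dim))
log_s = torch.nn.Parameter(torch.zeros(N, latent_dim))

# the variational distribution of object i: q_i(t_i) = N(m_i, diag(s_i^2))
q_0 = torch.distributions.Normal(m[0], log_s[0].exp())
```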
It looks like we are not yet out of the woods; we just added a large number of parameters to the model. For example, if our latent variables have 50 dimensions, $m_{i}$ and $s_{i}$ together contribute 100 parameters per object. With a million training objects, that is 100 million extra parameters just to implement the approximation. The model becomes very complex and will likely overfit. It is also not clear how to obtain these parameters for new objects at inference time.
We cannot make all the $q_{i}$'s the same, yet we cannot afford to have each of them contribute so many parameters to the model. One way to alleviate this is to keep a separate distribution for each object but give all of them a common form: every $q_{i}$ is a Gaussian whose parameters depend on $x_{i}$ and some shared weights. Here's the new optimization objective:
$\underset{w, \phi}{\operatorname{maximize}} \mathcal{L}\left(w, q_{1}, \ldots, q_{N}\right)$ subject to $q_{i}\left(t_{i}\right)=\mathcal{N}\left(m\left(x_{i}, \phi\right), \operatorname{diag}\left(s^{2}\left(x_{i}, \phi\right)\right)\right)$
Note that all the $q_{i}$'s are still different, but they all have the same form, determined by the two functions $m$ and $s$, which are shared across objects and parameterized by $\phi$.
And for a new object at inference time we can easily find its variational approximation $q$: we just pass the object through the functions $m$ and $s$ to compute the parameters of its Gaussian.
Now we need to maximize this lower bound with respect to $w$ and $\phi$, the latter being the parameter that enables us to convert the $x_{i}$'s to the parameters of the distribution.
We use a convolutional neural network (CNN) for this. This CNN will take as input the data object and will have its own parameter $\phi$ and output the values of $m$ and $s$:
Earlier, using the plain EM algorithm, we needed to maximize the lower bound, which was defined as follows:
Here $q$ is the actual posterior of the latent variables, which is complicated and known only up to the normalization constant. Sampling from it with MCMC was also found to be expensive and slow.
But now we approximate $q$ with a Gaussian, and we know how to compute the parameters of this Gaussian: we pass an object $x$ through a CNN with parameters $\phi$ and obtain $m$ and $s$. We can then easily sample from this Gaussian to approximate the expected value. Thus we have converted the intractable expectation into something we can estimate by sampling, and sampling is now cheap because we are sampling from a Gaussian:
Recall from Fig 2 how we defined our model ${p(x | t, w)}$ using a CNN: we feed the output of the current CNN (see Fig 3) into the CNN from Fig 2. This is what the overall workflow looks like:
Here's the overall schematic diagram of the full model architecture:
This architecture is called a Variational Autoencoder (VAE), as opposed to a plain autoencoder, because it has sampling and variational approximations built into the model.
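To make the two halves concrete, here is a condensed PyTorch sketch of such an architecture. The 28 x 28 single-channel input, the 50-dimensional latent code, and all layer sizes are illustrative choices of mine, not taken from the text.

```python
import torch
import torch.nn as nn

latent_dim = 50

class Encoder(nn.Module):                      # q(t | x, phi): x -> (m, s)
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
            nn.Flatten(),
        )
        self.m = nn.Linear(64 * 7 * 7, latent_dim)       # mean head
        self.log_s = nn.Linear(64 * 7 * 7, latent_dim)   # log standard deviation head

    def forward(self, x):
        h = self.conv(x)
        return self.m(h), self.log_s(h).exp()

class Decoder(nn.Module):                      # p(x | t, w): t -> mu(t)
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 7 -> 14
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),              # 14 -> 28
        )

    def forward(self, t):
        return self.deconv(self.fc(t).view(-1, 64, 7, 7))   # per-pixel mean mu(t)

# one pass through the whole model
x = torch.randn(8, 1, 28, 28)                  # a dummy batch of images
m, s = Encoder()(x)                            # parameters of q(t | x, phi)
t = torch.distributions.Normal(m, s).sample()  # one latent code per image
x_mean = Decoder()(t)                          # mean of p(x | t, w)
```

The encoder plays the role of $q(t | x, \phi)$ and the decoder the role of $p(x | t, w)$; a second decoder head for a per-pixel variance could be added in the same way.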
The VAE can be viewed as two coupled, but independently parameterized models:
In the model that we have built so far, if we take away the variance part $s$, i.e. set $s(x)=0$, then $\widehat{t}_{i}=m\left(x_{i}, \phi\right)$, which means the variational distribution becomes deterministic: it always outputs the mean value, with no stochasticity. That is equivalent to passing $m$ directly to the second CNN, the decoder, and it turns the variational autoencoder into a plain autoencoder in which the input simply gets reconstructed at the output. So it is the variance of the variational distribution that makes this model a variational autoencoder.
In the monograph An Introduction to Variational Autoencoders, Diederik P. Kingma and Max Welling sum this up beautifully:
One advantage of the VAE framework, relative to ordinary Variational Inference (VI), is that the recognition model (also called inference model) is now a (stochastic) function of the input variables. This in contrast to VI where each data-case has a separate variational distribution, which is inefficient for large data-sets. The recognition model uses one set of parameters to model the relation between input and latent variables and as such is called “amortized inference”. This recognition model can be arbitrary complex but is still reasonably fast because by construction it can be done using a single feedforward pass from input to latent variables.
Assumptions Made by a VAE
VAEs make strong assumptions about the posterior distribution. Typically VAE models assume that the posterior is approximately factorial, and that its parameters can be predicted from the observables through a nonlinear regression.
Let's take a deeper look at the optimization objective (the variational lower bound) of a VAE. For details on the derivation of the lower bound for a general form of the EM algorithm, take a look at Martin Krasser's post on Expectation Maximization. This VLB can be expressed as the sum of two terms as follows:
$\max . \sum_{i} \mathbb{E}_{q_{i}} \log \frac{p\left(x_{i} | t_{i}, w\right) p\left(t_{i}\right)}{q_{i}\left(t_{i}\right)}$ $=\sum_{i} \mathbb{E}_{q_{i}} \log p\left(x_{i} | t_{i}, w\right)+\mathbb{E}_{q_{i}} \log \frac{p\left(t_{i}\right)}{q_{i}\left(t_{i}\right)}$
$\max . \sum_{i} \mathbb{E}_{q_{i}} \log \frac{p\left(x_{i} | t_{i}, w\right) p\left(t_{i}\right)}{q_{i}\left(t_{i}\right)}$ $=\sum_{i} \mathbb{E}_{q_{i}} \underbrace{\log p\left(x_{i} | t_{i}, w\right)}_{-\left\|x_{i}-\mu\left(t_{i}\right)\right\|^{2}+\text { const }}-\mathcal{K} \mathcal{L}\left(q_{i}\left(t_{i}\right) \| p\left(t_{i}\right)\right)$
The optimization objective of the VAE has a KL divergence as one of its components; maximizing the objective minimizes this KL term and pulls the variational distribution $q_{i}$ towards the prior, which we chose to be a standard Gaussian. So when our VAE receives an object similar to what it has seen during training and passes it through the encoder, the encoder outputs a distribution $q_{i}$ that is close to the standard Gaussian.
But when the model receives an image it has never seen, or one completely different from what it was trained on, the encoder outputs a distribution that is far from the standard Gaussian. Such data can be flagged as outliers or anomalies.
Note that since the VAE model is defined as the integral over $t$, $p(x | w)=\int p(x | t, w) p(t) d t$, we can generate new samples in two steps: first sample a latent code $t$ from the prior $p(t)=\mathcal{N}(0, I)$, then pass it through the decoder and sample $x$ from $p(x | t, w)$.
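A minimal sketch of this two-step procedure (the decoder here is an untrained stand-in MLP with made-up sizes; in the real model it would be the trained decoder CNN with weights $w$):

```python
import torch
import torch.nn as nn

latent_dim, n_pixels = 50, 28 * 28
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, n_pixels))          # stand-in for t -> mu(t)

t_new = torch.randn(16, latent_dim)        # step 1: sample t ~ p(t) = N(0, I)
with torch.no_grad():
    mu = decoder(t_new)                    # step 2: decode to the mean of p(x | t, w)
x_new = mu + 0.1 * torch.randn_like(mu)    # optionally add Gaussian noise to sample x
```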
Let's now take a detailed look at how to maximize the objective of the VAE, which is given by $\max _{w, \phi} \sum_{i} \mathbb{E}_{q_{i}} \log p\left(x_{i} | t_{i}, w\right)-\mathcal{K} \mathcal{L}\left(q_{i}\left(t_{i}\right) \| p\left(t_{i}\right)\right)$. We need to maximize this objective with respect to the parameters $w$ and $\phi$ of the two neural networks, and since there is an expectation inside, we will have to make use of Monte Carlo sampling.
The second part of the expression, the KL divergence, can be computed analytically. For our Gaussians it is $\mathcal{K} \mathcal{L}\left(q_{i}\left(t_{i}\right) \| p\left(t_{i}\right)\right) = \sum_{j}\left(-\log s_{j}\left(x_{i}\right)+\frac{s_{j}^{2}\left(x_{i}\right)+m_{j}^{2}\left(x_{i}\right)-1}{2}\right)$, which is easy to evaluate, and its gradients are easy to compute as well - TensorFlow will happily do this.
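Written out in code, the closed-form KL term for a batch of diagonal Gaussians $q_{i}=\mathcal{N}(m_{i}, \operatorname{diag}(s_{i}^{2}))$ against the standard Normal prior looks like this (NumPy; the batch and latent sizes are illustrative):

```python
import numpy as np

def kl_to_standard_normal(m, s):
    """KL( N(m, diag(s^2)) || N(0, I) ), summed over the latent dimensions."""
    return np.sum(-np.log(s) + (s**2 + m**2 - 1.0) / 2.0, axis=-1)

m = np.zeros((4, 50))                 # a batch of 4 objects, 50-dim latents
s = np.ones((4, 50))
print(kl_to_standard_normal(m, s))    # all zeros: q already equals the prior
```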
The first part is more challenging. Let's call this expression $f(w, \phi)=\sum_{i} \mathbb{E}_{q_{i}} \log p\left(x_{i} | t_{i}, w\right)$, a sum over all objects of the expectation of the logarithm of a conditional probability. In order to maximize it, we need to find the gradients of this expression with respect to $w$ and $\phi$.
Recall that we decided that each per-object $q_{i}$ is a distribution over $t_{i}$ given $x_{i}$ and $\phi$, with parameters produced by a CNN with weights $\phi$: $q_{i}\left(t_{i}\right)=q\left(t_{i} | x_{i}, \phi\right)=\mathcal{N}\left(m_{i}, \operatorname{diag}\left(s_{i}^{2}\right)\right)$. Then we have $f(w, \phi)=\sum_{i} \mathbb{E}_{q\left(t_{i} | x_{i}, \phi\right)} \log p\left(x_{i} | t_{i}, w\right)$.
Let's first look at the gradient of this function with respect to $w$, which looks as follows:
$\nabla_{w} f(w, \phi)=\nabla_{w} \sum_{i=1}^{N} \mathbb{E}_{q\left(t_{i} | x_{i}, \phi\right)} \log p\left(x_{i} | t_{i}, w\right)$
(replacing expectation by its definition)
$=\nabla_{w} \sum_{i=1}^{N} \int q\left(t_{i} | x_{i}, \phi\right) \log p\left(x_{i} | t_{i}, w\right) d t_{i}$
(move gradient inside the summation)
$=\sum_{i} \nabla_{w} \int q\left(t_{i} | x_{i}, \phi\right) \log p\left(x_{i} | t_{i}, w\right) d t_{i}$
(swap gradient and the integral - see Leibniz Integral Rule)
$=\sum_{i} \int \nabla_{w} q\left(t_{i} | x_{i}, \phi\right) \log p\left(x_{i} | t_{i}, w\right) d t_{i}$
($q$ is independent of $w$)
$=\sum_{i} \int q\left(t_{i} | x_{i}, \phi\right) \nabla_{w} \log p\left(x_{i} | t_{i}, w\right) d t_{i}$
(by definition of expectation)
$=\sum_{i} \mathbb{E}_{q\left(t_{i} | x_{i}, \phi\right)} \nabla_{w} \log p\left(x_{i} | t_{i}, w\right)$
(we can approximate the expected value by sampling, where $\widehat{t_{i}} \sim q\left(t_{i} | x_{i}, \phi\right)$)
$\approx \sum_{i} \nabla_{w} \log p\left(x_{i} | \widehat{t}_{i}, w\right)$
So we sample one point $\widehat{t_{i}}$ from the variational distribution $q\left(t_{i} | x_{i}, \phi\right)$, and use that within the logarithm and then compute the gradient with respect to $w$.
Basically what we are doing here is the following (see Fig 4 for the complete workflow):
Note that the above computation depends on the whole dataset, but we can easily turn it into a minibatch-based computation, as the sketch after the equations illustrates:
$\begin{aligned} \nabla_{w} f(w, \phi) & \approx \sum_{i=1}^{N} \nabla_{w} \log p\left(x_{i} | \widehat{t}_{i}, w\right) \\ & \approx \frac{N}{n} \sum_{s=1}^{n} \nabla_{w} \log p\left(x_{i_{s}} | \widehat{t}_{i_{s}}, w\right) \\ & \text { Stochastic gradient of standard NN } \end{aligned}$
$\begin{aligned} \widehat{t}_{i} & \sim q\left(t_{i} | x_{i}, \phi\right) \\ i_{s} & \sim \mathcal{U}\{1, \ldots, N\} \end{aligned}$
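Here is a sketch of that minibatch estimator (PyTorch; the encoder and decoder are tiny linear stand-ins with made-up sizes, and $\sigma$ is a fixed likelihood scale): draw one $\widehat{t}_{i}$ per object from $q(t_{i} | x_{i}, \phi)$, evaluate $\log p(x_{i} | \widehat{t}_{i}, w)$, rescale by $N / n$, and let autograd produce the gradient with respect to $w$.

```python
import torch
import torch.nn as nn

N, n, d, latent_dim, sigma = 60_000, 64, 784, 50, 1.0
x = torch.randn(n, d)                                   # stand-in minibatch x_{i_s}

enc = nn.Linear(d, 2 * latent_dim)                      # phi: x -> (m, log s)
dec = nn.Linear(latent_dim, d)                          # w:   t -> mu(t)

m, log_s = enc(x).chunk(2, dim=-1)
t_hat = torch.distributions.Normal(m, log_s.exp()).sample()    # t_hat_i ~ q(. | x_i, phi)
log_p = -((x - dec(t_hat)) ** 2).sum(dim=-1) / (2 * sigma**2)  # log p(x_i | t_hat_i, w) + const
loss = -(N / n) * log_p.sum()                           # negate: optimizers minimize
loss.backward()                                         # gradient w.r.t. w lands in dec's .grad
```

Note that `.sample()` deliberately blocks gradients from flowing back into the encoder, which matches the situation here: this estimator only gives us the gradient with respect to $w$; the gradient with respect to $\phi$ is the subject of the next part.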
Now let's look at the gradient of the same objective with respect to $\phi$:
$\nabla_{\phi} f(w, \phi)=\nabla_{\phi} \sum_{i} \mathbb{E}_{q\left(t_{i} | x_{i}, \phi\right)} \log p\left(x_{i} | t_{i}, w\right)$
(replacing expectation by its definition)
$=\nabla_{\phi} \sum_{i} \int q\left(t_{i} | x_{i}, \phi\right) \log p\left(x_{i} | t_{i}, w\right) d t_{i}$
(move the gradient inside the summation and swap the gradient and the integral, as before)
$=\sum_{i} \int \nabla_{\phi} q\left(t_{i} | x_{i}, \phi\right) \log p\left(x_{i} | t_{i}, w\right) d t_{i}$
Now we have a problem. Unlike the case of the gradient with respect to $w$, we cannot move $\nabla_{\phi}$ past $q$ to act only on the logarithm, because $q$ itself depends on $\phi$. And once we differentiate $q$ with respect to $\phi$, the expression no longer has the form of an expectation from which we can sample using Monte Carlo.
We can, however, do a trick: artificially introduce a distribution by multiplying and dividing by $q$:
$=\sum_{i} \int \frac{q\left(t_{i} | x_{i}, \phi\right)}{q\left(t_{i} | x_{i}, \phi\right)} \nabla_{\phi} q\left(t_{i} | x_{i}, \phi\right) \log p\left(x_{i} | t_{i}, w\right) d t_{i}$
(using the fact $\nabla \log g(\phi)=\frac{\nabla g(\phi)}{g(\phi)}$)
$=\sum_{i} \int q\left(t_{i} | x_{i}, \phi\right) \nabla_{\phi} \log q\left(t_{i} | x_{i}, \phi\right) \log p\left(x_{i} | t_{i}, w\right) d t_{i}$
(now we have the form of an expectation)
$=\sum_{i} \mathbb{E}_{q\left(t_{i} | x_{i}, \phi\right)} \nabla_{\phi} \log q\left(t_{i} | x_{i}, \phi\right) \log p\left(x_{i} | t_{i}, w\right)$
This is often referred to as the Log-Derivative Trick (it is also the basis of the score-function, or REINFORCE, estimator). Now you can sample from $q$ and approximate the expectation using Monte Carlo.
This is a valid approach and can be used to build Monte Carlo approximations. However, it can be shown that this estimator, while unbiased, has very high variance. For more details on why the variance is high, have a look at the paper Neural Variational Inference and Learning in Belief Networks. This blog post gives a good overview of some of the tricks for computing expectations of the form $\nabla_{\phi} \mathbf{E}_{x \sim p_{\phi}(x)}\left[f_{\theta}(x)\right]$. The toy example below illustrates the estimator.
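Here is a toy illustration of the log-derivative estimator on a problem small enough to check by hand (nothing to do with images): $q_{\phi}=\mathcal{N}(\phi, 1)$ and $f(t)=t^{2}$, so $\mathbb{E}_{q_{\phi}}[f(t)]=\phi^{2}+1$ and the true gradient is $2\phi$.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n_samples = 1.5, 10_000

t = rng.normal(loc=phi, scale=1.0, size=n_samples)   # t ~ q_phi = N(phi, 1)
f = t**2
score = t - phi                                      # d/dphi log N(t | phi, 1)
grad_samples = f * score                             # the log-derivative estimator

print(grad_samples.mean(), 2 * phi)                  # noisy estimate vs. exact 3.0
print(grad_samples.std())                            # per-sample spread is large
```

The estimate is correct on average, but the per-sample spread is large; the reparameterization trick below attacks exactly this variance.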
Let's explore an idea to reduce the variance of the stochastic approximation that we saw in the last section.
Here's what we have as the objective:
$\nabla_{\phi} f(w, \phi)=\sum_{i} \nabla_{\phi} \mathbb{E}_{q\left(t_{i} | x_{i}, \phi\right)} \log p\left(x_{i} | t_{i}, w\right)$, where $t_{i}$ is sampled from $q$ as $t_{i} \sim q\left(t_{i} | x_{i}, \phi\right)=\mathcal{N}\left(m_{i}, \operatorname{diag}\left(s_{i}^{2}\right)\right)$
Let's do a change of variables. Instead of sampling $t_{i}$ directly, let's sample a new variable $\varepsilon_{i}$ from the standard Normal, $\varepsilon_{i} \sim p\left(\varepsilon_{i}\right)=\mathcal{N}(0, I)$, and build $t_{i}$ from it by elementwise multiplication with the standard deviation and addition of the mean vector: $t_{i}=\varepsilon_{i} \odot s_{i}+m_{i}$.
So we have $t_{i}=\varepsilon_{i} \odot s_{i}+m_{i}=g\left(\varepsilon_{i}, x_{i}, \phi\right)$, and instead of sampling from $q$ we sample $\varepsilon_{i}$ from $p\left(\varepsilon_{i}\right)$:
$\nabla_{\phi} f(w, \phi)=\sum_{i} \nabla_{\phi} \mathbb{E}_{q\left(t_{i} | x_{i}, \phi\right)} \log p\left(x_{i} | t_{i}, w\right)$
(change of variable)
$=\sum_{i} \nabla_{\phi} \mathbb{E}_{p\left(\varepsilon_{i}\right)} \log p\left(x_{i} | g\left(\varepsilon_{i}, x_{i}, \phi\right), w\right)$
(now we can push the gradient inside the expectation, because the distribution $p\left(\varepsilon_{i}\right)$ does not depend on $\phi$)
$=\sum_{i} \int p\left(\varepsilon_{i}\right) \nabla_{\phi} \log p\left(x_{i} | g\left(\varepsilon_{i}, x_{i}, \phi\right), w\right) d \varepsilon_{i}$
(now we have an expectation without any additional distribution being introduced)
$=\sum_{i} \mathbb{E}_{p\left(\varepsilon_{i}\right)} \nabla_{\phi} \log p\left(x_{i} | g\left(\varepsilon_{i}, x_{i}, \phi\right), w\right)$
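For comparison, here is the same toy problem as in the score-function sketch above, now with the reparameterization trick: write $t=\phi+\varepsilon$ with $\varepsilon \sim \mathcal{N}(0,1)$ and push the gradient inside the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n_samples = 1.5, 10_000

eps = rng.standard_normal(n_samples)       # eps ~ N(0, 1)
t = phi + eps                              # t = g(eps, phi): here m = phi, s = 1
grad_samples = 2 * t                       # d/dphi f(g(eps, phi)) = 2 * (phi + eps)

print(grad_samples.mean(), 2 * phi)        # again estimates the true gradient 2*phi
print(grad_samples.std())                  # spread is much smaller than before
```

Both estimators target the same gradient $2\phi$, but the per-sample spread of the reparameterized one is noticeably smaller, which is why VAEs use it.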
Here's the modified workflow in the model:
Have a look at this blog post for a good discussion of the differences between REINFORCE and Reparameterization trick in computing expectations.