Developed by Stephan Gouws, Avishkar Bhoopchand & Ulrich Paquet.
Background
In this practical, our only aim is for you to understand what a variational auto-encoder (VAE) is. Some people prefer to call it "auto-encoding variational Bayes", which is technically more correct.
If you are mathematically inclined, you are welcome to look at https://arxiv.org/pdf/1312.6114.pdf, but it is not necessary for the scope of this practical.
Learning objectives
Our only aim for this practical is that you understand some key concepts:
What is expected of you?
In [1]:
import itertools
import matplotlib as mpl
import numpy as np
import os
import tensorflow as tf
import time
from matplotlib import pyplot as plt
import tensorflow.contrib.slim as slim
from tensorflow.examples.tutorials.mnist import input_data
sg = tf.contrib.bayesflow.stochastic_graph
st = tf.contrib.bayesflow.stochastic_tensor
distributions = tf.contrib.distributions
In [2]:
tf.reset_default_graph()
try:
flags = tf.app.flags
flags.DEFINE_string('data_dir', '/tmp/dat/', 'Directory for data')
flags.DEFINE_string('logdir', '/tmp/log/', 'Directory for logs')
flags.DEFINE_integer('latent_dim', 100, 'Latent dimensionality of model')
flags.DEFINE_integer('batch_size', 64, 'Minibatch size')
flags.DEFINE_integer('n_samples', 1, 'Number of samples to save')
flags.DEFINE_integer('print_every', 1000, 'Print every n iterations')
flags.DEFINE_integer('hidden_size', 200, 'Hidden size for neural networks')
# flags.DEFINE_integer('n_iterations', 1, 'number of iterations')
except:
pass
FLAGS = flags.FLAGS
# Flags are immutable once defined. If you were ever to run big experiments,
# you'd use something like FLAGS.n_iterations, but since you'll want to try
# things out in the cells below, we just make N_ITERATIONS a plain variable
# that you can change and easily re-run.
N_ITERATIONS = 10000
Before continuing, you'll have to sharpen your mental mathematical pencils, and step through a derivation that you've encountered in the Deep Generative Models lecture.
"Evidence? Of what?," you might ask. The term is short for "model evidence" $p_{\theta}(x)$ and specifies how well the model supports the data point $x$. A more common term is "marginal likelihood", as we obtain it by marginalizing (or integrating or summing) out $z$ from $p_{\theta}(x, z)$.
Bayes's theorem states that
\begin{align} p_{\theta}(z|x) & = \frac{p_{\theta}(x|z) \, p(z)}{p_{\theta}(x)} \\ \textrm{posterior} & = \frac{\textrm{likelihood} \times \textrm{prior}}{\textrm{evidence (or marginal likelihood)}} \end{align}and the denominator is an average of the likelihood over all prior assignments of $z$. In most cases we cannot compute it exactly, so we will find a way to approximate it as best we can.
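To make the "average of the likelihood over the prior" concrete, here is a minimal numpy sketch with a made-up three-valued latent variable (the numbers are purely illustrative and not part of the practical's model):

import numpy as np

# A made-up model with a discrete latent z taking one of three values.
prior = np.array([0.5, 0.3, 0.2])        # p(z)
likelihood = np.array([0.1, 0.7, 0.4])   # p(x | z) for one fixed observation x

# The evidence (marginal likelihood) averages the likelihood over the prior.
evidence = np.sum(likelihood * prior)    # p(x) = sum_z p(x | z) p(z)

# Bayes' theorem then gives the posterior.
posterior = likelihood * prior / evidence
print('p(x) =', evidence)
print('p(z|x) =', posterior, '(sums to', posterior.sum(), ')')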
What would we like to optimize?
We would like to optimize $\log p_{\theta}(x)$ over $\theta$. It's the best description of the data, averaged over all prior settings of $z$. But we cannot compute it. Oh dear!
If we were able to compute $p_{\theta}(x)$, we would normally also be able to compute the true posterior $p_{\theta}(z | x)$. These two are intimately tied. Let's say we want to know which settings of $z$ were responsible for generating the $x$ that we see. Instead of doing exact inference with $p_{\theta}(z | x)$ (which is hard as we have to invert a complicated generative process) we will approximate it with another distribution $q_{\phi}(z | x) \approx p_{\theta}(z | x)$.
A lower bound, a.k.a. "sensible approximations rock!"
For every data point (or MNIST image) $x$, we define the bound on the log marginal likelihood -- or ELBO -- as:
\begin{align} \log p_{\theta}(x) & = \log \int p_{\theta}(x, z) \, \mathrm{d} z \\ & = \log \int p_{\theta}(x | z) \, p(z) \, \mathrm{d} z \\ & = \log \int q_{\phi}(z | x) \left[ \frac{ p_{\theta}(x | z) \, p(z) }{q_{\phi}(z | x)} \right] \mathrm{d} z \\ & \ge \int q_{\phi}(z | x) \log \left[ \frac{ p_{\theta}(x | z) \, p(z) }{q_{\phi}(z | x)} \right] \mathrm{d} z && \triangleright \textrm{Jensen's Inequality.}\\ & = \mathbb{E}_{q_{\phi}(z|x)}\Big[ \log p_{\theta}(x | z) + \log p(z) - \log q_{\phi}(z | x)\Big] \\ & \equiv \mathcal{L}(\theta, \phi) \end{align}The inequality arises from Jensen's inequality, since the logarithm is a concave function. So, instead of maximizing $\log p_{\theta}(x)$ as a function of $\theta$, we will maximize $\mathcal{L}(\theta, \phi)$ as a function of both $\theta$ and $\phi$.
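If you want to convince yourself numerically that $\mathcal{L}(\theta, \phi) \le \log p_{\theta}(x)$, here is a small numpy sketch on a toy one-dimensional model (chosen only because its marginal likelihood has a closed form; it is not the MNIST model used later):

import numpy as np

def log_normal_pdf(v, mean, std):
  # Log-density of a univariate Gaussian N(mean, std^2).
  return -0.5 * np.log(2.0 * np.pi * std ** 2) - (v - mean) ** 2 / (2.0 * std ** 2)

rng = np.random.RandomState(0)
x = 1.5                                    # a single observed data point

# Toy model: p(z) = N(0, 1) and p(x|z) = N(z, 1), so p(x) = N(0, 2) exactly.
log_px_exact = log_normal_pdf(x, 0.0, np.sqrt(2.0))

# A (deliberately imperfect) Gaussian approximation q(z|x) = N(m, s^2).
m, s = 0.5, 1.0
z = rng.normal(m, s, size=100000)          # samples from q(z|x)

# Monte Carlo estimate of E_q[log p(x|z) + log p(z) - log q(z|x)].
elbo = np.mean(log_normal_pdf(x, z, 1.0)
               + log_normal_pdf(z, 0.0, 1.0)
               - log_normal_pdf(z, m, s))

print('ELBO estimate:', elbo)              # should lie below log p(x)
print('log p(x):     ', log_px_exact)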
We have a few pieces:
In the code below, we are going to define $p_{\theta}(x | z)$ and $q_{\phi}(z | x)$.
Take a breather
Explain to your neighbour what is going on here. Is there anything that you don't understand?
Deep dive 101
Your challenge is to find how $\theta$ and $\phi$ are represented in the code.
Note that in the code, many images $x$ are stacked into a batch x. The bound above refers to a single image; since we assume the images are generated i.i.d., we sum the bounds over the batch.
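In symbols, for a batch of $N$ images $x^{(1)}, \ldots, x^{(N)}$, the objective the code constructs is
\begin{align} \mathcal{L}_{\textrm{batch}}(\theta, \phi) = \sum_{n=1}^{N} \mathcal{L}\big(\theta, \phi \, ; x^{(n)}\big) \ , \end{align}where $\mathcal{L}(\theta, \phi \, ; x^{(n)})$ is the per-image bound derived above and $N$ is the batch size (FLAGS.batch_size).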
In [3]:
def inference_network(x, latent_dim, hidden_size):
"""Construct an inference network q(z|x).
Args:
x: A batch of MNIST digits.
latent_dim: The latent dimensionality of z.
hidden_size: The size of the neural net hidden layers.
Returns:
mu: The mean of q(z|x)
sigma: The component-wise standard deviation of q(z|x)
"""
with slim.arg_scope([slim.fully_connected], activation_fn=tf.nn.relu):
# Question: What is the shape of x as you run the code with MNIST images?
# Hint: You might have to put a `print(net.get_shape())` somewhere around
# here, run the `train()` cell to construct the TF graph, and see what
# you get. Debugging is part of life!
# Answer: ....
net = slim.flatten(x)
# Question: Now that we've flattened x, what is its shape now, as `net`?
# Answer: ...
# At this point, `net` contains the input to the network. We will now
# transform it layer-by-layer:
net = slim.fully_connected(net, hidden_size)
# Question: What is the shape of the first layer, as `net`?
# Answer: ...
# Question: Write down the mathematical expression for
# net = slim.fully_connected(net, hidden_size). If it were not wrapped in
# "with slim.arg_scope(..., activation_fn=tf.nn.relu)", would the
# mathematical expression look different? Why?
#
# Hint: If "l" indicates this layer, then we can either say
# output = weights_l * input + biases_l or
# output = relu(weights_l * input + biases_l)
# Answer: ...
# Question: How does it differ from the network that you devised in the
# Deep Generative Models lecture?
# Answer: ...
net = slim.fully_connected(net, hidden_size)
# Question: What is the shape of the second layer, as `net`?
# Answer: ...
gaussian_params = slim.fully_connected(
net, latent_dim * 2, activation_fn=None)
# Question: What is the shape of the third (output) layer,
# `gaussian_params`?
# Answer: ...
# We take the output of the MLP (multi-layer perceptron) and split it
# into two parts: the first latent_dim outputs become the means, while the
# second latent_dim outputs become the standard deviations.
# Challenge 1. Can you write the entire network as a mathematical function?
# Challenge 2. Can you draw it?
# Challenge 3. Can you say which parameters constitute \phi?
#
# Challenge 4. (You can skip this challenge and do it at home).
# Can you see it in the actual computation graph?
# Hint: You can write the graph to file with
#
# tf.train.write_graph(tf.get_default_graph().as_graph_def(),
# directory, 'mygraph.pbtxt')
# The mean parameter is unconstrained
mu = gaussian_params[:, :latent_dim]
# The standard deviation must be positive. Parametrize with a softplus and
# add a small epsilon for numerical stability
sigma = 1e-6 + tf.nn.softplus(gaussian_params[:, latent_dim:])
# Question: Can you draw the softplus function, log(exp(inputs) + 1)?
# Answer: ...
return mu, sigma
def generative_network(z, hidden_size):
"""Build a generative network to return the logits of p(x|z).
Args:
z: Samples of latent variables.
hidden_size: Size of the hidden state of the neural net.
Returns:
bernoulli_logits: logits for the Bernoulli likelihood of the data.
"""
with slim.arg_scope([slim.fully_connected], activation_fn=tf.nn.relu):
net = slim.fully_connected(z, hidden_size)
net = slim.fully_connected(net, hidden_size)
bernoulli_logits = slim.fully_connected(net, 784, activation_fn=None)
bernoulli_logits = tf.reshape(bernoulli_logits, [-1, 28, 28, 1])
# Challenge 5. Can you again write the entire network as a mathematical
# function, draw it, and say which parameters constitute
# \theta?
# Hold that thought... Can you (later) say why "logits" are returned, and
# why they are not transformed into a probability here?
return bernoulli_logits
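If you would like to check your sketch of the softplus function asked about in inference_network above, here is a small, standalone matplotlib snippet (it is not part of the networks in the cell):

import numpy as np
from matplotlib import pyplot as plt

inputs = np.linspace(-6.0, 6.0, 200)
softplus = np.log(np.exp(inputs) + 1.0)    # log(exp(inputs) + 1), always positive

plt.plot(inputs, softplus, label='softplus')
plt.plot(inputs, np.maximum(inputs, 0.0), '--', label='relu, for comparison')
plt.legend()
plt.show()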
Now, look for the line q_z = st.StochasticTensor(distributions.Normal(loc=q_mu, scale=q_sigma)) in train().
Your challenge will be to rewrite that line, so that st.StochasticTensor(...) is not used.
First of all, you need to know what it does. Do you remember the reparameterization trick, from the Deep Generative Models lecture? As a recap, this rewrites:
\begin{align} \mathcal{L} & = \mathbb{E}_{q_{\phi}(z|x)}\Big[ \log p_{\theta}(x | z) + \log p(z) - \log q_{\phi}(z | x)\Big] \\ & = \mathbb{E}_{z \sim \mathcal{N}(\mu_{\theta}(x), \sigma^2_{\theta}(x))}\Big[ \log p_{\theta}(x | z) \Big] - \mathbb{E}_{z \sim \mathcal{N}(\mu_{\theta}(x), \sigma^2_{\theta}(x))}\left[ \log \frac{ q_{\phi}(z | x) }{p(z)} \right] \\ & = \mathbb{E}_{z \sim \mathcal{N}(\mu_{\theta}(x), \sigma^2_{\theta}(x))}\Big[ \log p_{\theta}(x | z) \Big] - \textrm{Kullback Leibler divergence} \ . \end{align}You can turn to your favourite souce, and find an expression for the KL divergence. It is function of a mean $\mu_{\theta}(x)$ and variance $\sigma^2_{\theta}(x)$, and we can look in Wikipedia on how to compute it.
Note that the mean $\mu_{\phi}(x)$ and variance $\sigma^2_{\phi}(x)$ are parameterized as multi-layer perceptrons (MLPs) in def inference_network(x, latent_dim, hidden_size).
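For reference, the expression you will find is the standard closed form for the KL divergence between a diagonal Gaussian and the standard normal prior $p(z) = \mathcal{N}(0, I)$ used in this practical, writing $\mu_j$ and $\sigma_j$ for the $j$-th components produced by the inference network:
\begin{align} \textrm{KL}\Big( \mathcal{N}\big(\mu, \textrm{diag}(\sigma^2)\big) \, \Big\| \, \mathcal{N}(0, I) \Big) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right) \ , \end{align}where $d$ is the latent dimensionality (FLAGS.latent_dim). In the code below, distributions.kl_divergence computes this quantity for you.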
Easy there, what's the catch?
Well, we can write the KL divergence part as a function that we can differentiate (thank you, TensorFlow! Back in the day people had to do this by hand!). But what about the first expectation?
For the first expectation, we'd have to do something called a "Monte Carlo estimate", because there is no nice analytic form that we can simply look up on Wikipedia.
Monte Carlo estimate
Take $L$ samples $z^{(l)}$ from $z \sim \mathcal{N}(\mu_{\phi}(x), \sigma^2_{\phi}(x))$, and call them $z^{(1)}, z^{(2)}, \ldots, z^{(L)}$. The "sampling process" essentially maps $x \to z$ stochastically. We now approximate
\begin{align} \mathbb{E}_{z \sim \mathcal{N}(\mu_{\phi}(x), \sigma^2_{\phi}(x))}\Big[ \log p_{\theta}(x | z) \Big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_{\theta}(x | z^{(l)}) \ . \end{align}This is really awesome, as the expression above maps back $z \to x$. Except for one small detail: if we were to write $x \to z \to x$ as a function, TensorFlow can no longer differentiate it for us. We cannot take gradients through the samples :-( :-(
Reparameterization trick
There is a neat trick, where we move the samples (and the stochasticity) outside the $x \to z \to x$ path. The trick is to sample $L$ standard Gaussian random variables $\epsilon^{(l)} \sim \mathcal{N}(0, 1)$. Can you see that there are no fancy parameters (that we have to adjust) in $\mathcal{N}(0, 1)$?
We can get the same kind of samples of $z$ as before by transforming the $\mathcal{N}(0, 1)$ random variables: $z^{(l)} = \mu_{\phi}(x) + \epsilon^{(l)} * \sigma_{\phi}(x)$.
\begin{align} \mathbb{E}_{z \sim \mathcal{N}(\mu_{\phi}(x), \sigma^2_{\phi}(x))}\Big[ \log p_{\theta}(x | z) \Big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}\Big[ \log p_{\theta}\Big(x \, \Big| \, z = \mu_{\phi}(x) + \epsilon * \sigma_{\phi}(x)\Big) \Big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_{\theta}\Big(x \, \Big| \, z = \mu_{\phi}(x) + \epsilon^{(l)} * \sigma_{\phi}(x)\Big) \ . \end{align}
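As a generic TensorFlow sketch of this transformation (using illustrative names mu and sigma for the outputs of the inference network; wiring it into train() is exactly the challenge flagged in the next cell):

# Reparameterization trick, in general form.
# epsilon is drawn outside the mu/sigma -> z path, so gradients can flow
# through mu and sigma even though z is random.
epsilon = tf.random_normal(tf.shape(mu))   # epsilon ~ N(0, 1), same shape as mu
z = mu + sigma * epsilon                   # a sample from N(mu, sigma^2)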
In [4]:
def train():
# Train a Variational Autoencoder on MNIST
# We need to reset the graph between multiple runs of the cell to avoid
# TF complaining about creating duplicate variables.
tf.reset_default_graph()
# Input placeholders
with tf.name_scope('data'):
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
with tf.variable_scope('variational'):
# Given x, get the mean and variance of q(z|x).
q_mu, q_sigma = inference_network(x=x,
latent_dim=FLAGS.latent_dim,
hidden_size=FLAGS.hidden_size)
with st.value_type(st.SampleValue()):
# We now draw a sample from z ~ q(z|x).
#
# This is the key to this practical. We effectively transform Normal(0,1)
# noise that is *outside* the computation graph, to represent a sample
# from Normal(q_mu, q_sigma).
#
# This is the REPARAMETERIZATION TRICK.
q_z = st.StochasticTensor(distributions.Normal(loc=q_mu, scale=q_sigma))
# Challenge: Can you write
#
# q_z = ...
#
# by using tf.random_normal(...) and then transforming it?
with tf.variable_scope('model'):
# The likelihood is Bernoulli-distributed with logits given by the
# generative network
p_x_given_z_logits = generative_network(z=q_z,
hidden_size=FLAGS.hidden_size)
p_x_given_z = distributions.Bernoulli(logits=p_x_given_z_logits)
posterior_predictive_samples = p_x_given_z.sample()
# You can ignore the "Take samples" bits for now, until the end of the prac.
# Take samples from the prior
with tf.variable_scope('model', reuse=True):
p_z = distributions.Normal(loc=np.zeros(FLAGS.latent_dim, dtype=np.float32),
scale=np.ones(FLAGS.latent_dim, dtype=np.float32))
p_z_sample = p_z.sample(FLAGS.n_samples)
p_x_given_z_logits = generative_network(z=p_z_sample,
hidden_size=FLAGS.hidden_size)
prior_predictive = distributions.Bernoulli(logits=p_x_given_z_logits)
prior_predictive_samples = prior_predictive.sample()
# You can ignore the "Take samples" bits for now, until the end of the prac.
# Take samples from the prior with a placeholder
with tf.variable_scope('model', reuse=True):
z_input = tf.placeholder(tf.float32, [None, FLAGS.latent_dim])
p_x_given_z_logits = generative_network(z=z_input,
hidden_size=FLAGS.hidden_size)
prior_predictive_inp = distributions.Bernoulli(logits=p_x_given_z_logits)
prior_predictive_inp_sample = prior_predictive_inp.sample()
# The juice! Convince yourself that you understand the lines below.
# Build the evidence lower bound (ELBO) or the negative loss
kl = tf.reduce_sum(distributions.kl_divergence(q_z.distribution, p_z), 1)
expected_log_likelihood = tf.reduce_sum(p_x_given_z.log_prob(x),
[1, 2, 3])
elbo = tf.reduce_sum(expected_log_likelihood - kl, 0)
# Challenge. Can you see how the four lines above relate to the mathematical
# expressions you've encountered?
# Answer: ...
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(-elbo)
init_op = tf.global_variables_initializer()
# Run training
sess = tf.InteractiveSession()
sess.run(init_op)
#mnist = read_data_sets(FLAGS.data_dir, one_hot=True)
mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
# Get fixed MNIST digits for plotting posterior means during training
np_x_fixed, np_y = mnist.test.next_batch(5000)
np_x_fixed = np_x_fixed.reshape(5000, 28, 28, 1)
np_x_fixed = (np_x_fixed > 0.5).astype(np.float32)
for i in range(N_ITERATIONS):
# Re-binarize the data at every batch; this improves results
np_x, _ = mnist.train.next_batch(FLAGS.batch_size)
np_x = np_x.reshape(FLAGS.batch_size, 28, 28, 1)
np_x = (np_x > 0.5).astype(np.float32)
sess.run(train_op, {x: np_x})
t0 = time.time()
if i % 100 == 0:
np_elbo = sess.run(elbo, feed_dict={x: np_x})
print('Iteration: {0:d} ELBO: {1:.3f} Examples/s: {2:.3e}'.format(
i,
np_elbo / FLAGS.batch_size,
FLAGS.batch_size * FLAGS.print_every / (time.time() - t0)))
t0 = time.time()
if i % 1000 == 0 and i > 0:
# Get samples
np_posterior_samples, np_prior_samples = sess.run(
[posterior_predictive_samples, prior_predictive_samples],
feed_dict={x: np_x})
for k in range(1): # range(FLAGS.n_samples):
# Plot the data using Matplotlib
fig = plt.figure()
plt.subplot(1, 2, 1)
plt.imshow(np_posterior_samples[k, :, :, 0])
plt.title('Posterior sample')
plt.subplot(1, 2, 2)
plt.imshow(np_prior_samples[k, :, :, 0])
plt.title('Prior sample')
plt.show()
train()
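As an aside, if you later want to play with the placeholder-based prior sampler defined in train(), one option is a sketch like the following. It assumes you modify train() to return sess, z_input and prior_predictive_inp_sample, which the version above does not do:

# Hypothetical usage sketch; assumes train() were changed to end with
#   return sess, z_input, prior_predictive_inp_sample
sess, z_input, prior_predictive_inp_sample = train()

np_z = np.random.normal(size=(1, FLAGS.latent_dim)).astype(np.float32)  # z ~ p(z)
np_digit = sess.run(prior_predictive_inp_sample, feed_dict={z_input: np_z})

plt.imshow(np_digit[0, :, :, 0])
plt.title('Prior sample from a hand-chosen z')
plt.show()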
Doing better
At this point, you might realize that the prior samples don't all look like perfectly formed digits.
Now, think about the discussion about the "data manifold" from the Deep Generative Models lecture. Do you think $p(z)$ actually captured the data manifold, using $p_{\theta}(x | z)$ as a generative model?
What will you do to make the randomly generated digits (prior samples) look more realistic? Train for longer? Make the network deeper?
Objective function
Do you think the ELBO (evidence lower bound) is the right measure of success? Why? Why not?
Do you think the ELBO has pathological flaws? Or is it a convenient tool?
Do you think a data manifold should be described by a Gaussian distribution?
In which other ways will you solve the inference problem, finding $p_{\theta}(z | x)$?
In which other ways will you solve the learning problem, finding the $\theta$ that maximizes $p_{\theta}(x)$?