In this assignment, you will build a Variational Autoencoder (VAE), train it on the MNIST dataset, and experiment with its architecture and hyperparameters.
In [1]:
%tensorflow_version 1.x
In [2]:
try:
    import google.colab
    IN_COLAB = True
except:
    IN_COLAB = False
if IN_COLAB:
    print("Downloading Colab files")
    ! shred -u setup_google_colab.py
    ! wget https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/setup_google_colab.py -O setup_google_colab.py
    import setup_google_colab
    setup_google_colab.load_data_week5()
In [3]:
import tensorflow as tf
import keras
import numpy as np
import matplotlib.pyplot as plt
from keras.layers import Input, Dense, Lambda, InputLayer, concatenate
from keras.models import Model, Sequential
from keras import backend as K
from keras import metrics
from keras.datasets import mnist
from keras.utils import np_utils
from w5_grader import VAEGrader
We will create a grader instance below and use it to collect your answers. Note that these outputs will be stored locally inside the grader and will be uploaded to the platform only after running the submit function in the last part of this assignment. If you want to make a partial submission, you can run that cell at any time.
In [0]:
grader = VAEGrader()
Recall that a Variational Autoencoder is a probabilistic model of data based on a continuous mixture of distributions. In the lecture we covered the mixture of Gaussians case, but here we will apply a VAE to binary MNIST images (each pixel is either black or white). To better model binary data we will use a continuous mixture of binomial distributions: $p(x \mid w) = \int p(x \mid t, w) p(t) dt$, where the prior distribution on the latent code $t$ is standard normal, $p(t) = \mathcal{N}(0, I)$, and the probability that the $(i, j)$-th pixel is black equals the $(i, j)$-th output of the decoder neural network: $p(x_{i, j} \mid t, w) = \text{decoder}(t, w)_{i, j}$.
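Written out per image, this likelihood factorizes over pixels, with each pixel an independent Bernoulli variable given the latent code:
$$p(x \mid t, w) = \prod_{i, j} \text{decoder}(t, w)_{i, j}^{\,x_{i, j}} \left(1 - \text{decoder}(t, w)_{i, j}\right)^{1 - x_{i, j}}.$$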
To train this model we would like to maximize the marginal log-likelihood of our dataset, $\max_w \log p(X \mid w)$, but this is computationally intractable. Instead we maximize the Variational Lower Bound w.r.t. both the original parameters $w$ and a variational distribution $q$, which we define as an encoder neural network with parameters $\phi$ that takes an input image $x$ and outputs the parameters of the Gaussian distribution $q(t \mid x, \phi)$: $\log p(X \mid w) \geq \mathcal{L}(w, \phi) \rightarrow \max_{w, \phi}$.
So overall our model looks as follows: the encoder takes an image $x$ and produces a distribution over latent codes $q(t \mid x)$, which should approximate the posterior distribution $p(t \mid x)$ (at least after training); a point is sampled from this distribution, $\widehat{t} \sim q(t \mid x, \phi)$; and finally this sample is fed into a decoder that outputs a distribution over images.
In the lecture, we also discussed that the variational lower bound contains an expected value, which we are going to approximate with sampling. This is not trivial, since we need to differentiate through the approximation. However, the reparametrization trick suggests that instead of sampling from the distribution $\widehat{t} \sim q(t \mid x, \phi)$, we sample from a distribution that doesn't depend on any parameters, e.g. the standard normal, and then deterministically transform this sample into the desired one: $\varepsilon \sim \mathcal{N}(0, I); ~~\widehat{t} = m(x, \phi) + \varepsilon \sigma(x, \phi)$. This way our stochastic gradient stays unbiased and we can straightforwardly differentiate the loss w.r.t. all the parameters while treating the current sample $\varepsilon$ as a constant.
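To see the trick in isolation, here is a minimal NumPy sketch (separate from the model code below; the names m and sigma are illustrative): it draws parameter-free standard normal noise and transforms it deterministically, and the empirical statistics match the target distribution.
In [0]:
import numpy as np

np.random.seed(0)
m, sigma = 2.0, 0.5            # target mean and standard deviation of q
eps = np.random.randn(100000)  # eps ~ N(0, I): sampling involves no parameters
t_hat = m + sigma * eps        # deterministic, differentiable transform of the fixed sample
print(t_hat.mean(), t_hat.std())  # approximately 2.0 and 0.5, i.e. t_hat ~ N(m, sigma^2)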
Task 1 Derive and implement the Variational Lower Bound for the continuous mixture of binomial distributions.
Note that in the lectures we discussed maximizing the VLB (which is typically a negative number), but in this assignment, for convenience, we will instead minimize the negated VLB (which will be a positive number). In what follows we always talk about the negated VLB, even when we use the term VLB for short.
Also note that to pass the test, your code should work with any mini-batch size.
To do that, we need a stochastic estimate of the VLB: $$\text{VLB} = \sum_{i=1}^N \text{VLB}_i \approx \frac{N}{M}\sum_{s=1}^M \text{VLB}_{i_s},$$ where $N$ is the dataset size, $\text{VLB}_i$ is the term of the VLB corresponding to the $i$-th object, $M$ is the mini-batch size, and $i_1, \dots, i_M$ are the indices of the objects in the mini-batch. But instead of this stochastic estimate of the full VLB we will use an estimate of the negated VLB normalized by the dataset size, i.e. in the function below you need to return the average across the mini-batch: $-\frac{1}{M}\sum_{s=1}^M \text{VLB}_{i_s}$. People usually optimize this normalized version of the VLB since it doesn't depend on the dataset size: you can write the VLB function once and use it for different datasets, and the dataset size won't affect the learning rate too much. The correct value for this normalized negated VLB should be around 100–170 in the example below.
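For reference, each $\text{VLB}_i$ splits into a reconstruction term and a KL term, and for a diagonal Gaussian encoder the KL divergence from the standard normal prior is available in closed form (the expectation is approximated with the single reparametrized sample $\widehat{t}$):
$$-\text{VLB}_i = -\mathbb{E}_{q(t \mid x_i)} \log p(x_i \mid t, w) + KL\big(q(t \mid x_i) \,\|\, p(t)\big),$$
$$\log p(x_i \mid \widehat{t}, w) = \sum_{j} \Big[ x_{ij} \log d_{ij} + (1 - x_{ij}) \log (1 - d_{ij}) \Big], \qquad d = \text{decoder}(\widehat{t}, w),$$
$$KL\big(q(t \mid x_i) \,\|\, p(t)\big) = \frac{1}{2} \sum_{k} \big( \exp(s_{ik}) + m_{ik}^2 - 1 - s_{ik} \big),$$
where $m_i$ and $s_i$ are the mean and log-variance vectors produced by the encoder for the $i$-th object, and $k$ runs over the latent dimensions.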
In [0]:
def vlb_binomial(x, x_decoded_mean, t_mean, t_log_var):
    """Returns the value of the negated Variational Lower Bound.

    The inputs are tf.Tensor
        x: (batch_size x number_of_pixels) matrix with one image per row with zeros and ones
        x_decoded_mean: (batch_size x number_of_pixels) mean of the distribution p(x | t), real numbers from 0 to 1
        t_mean: (batch_size x latent_dim) mean vector of the (normal) distribution q(t | x)
        t_log_var: (batch_size x latent_dim) logarithm of the variance vector of the (normal) distribution q(t | x)
    Returns:
        A tf.Tensor with one element (averaged across the batch), the negated VLB
    """
    ### YOUR CODE HERE
    # Closed-form KL(q(t | x) || N(0, I)) for a diagonal Gaussian, summed over latent dimensions.
    kl = K.mean(0.5 * K.sum(-t_log_var + K.exp(t_log_var) + K.square(t_mean) - 1, axis=1))
    # Negated expected reconstruction log-likelihood (binary cross-entropy),
    # with a small epsilon inside the logs for numerical stability.
    eq = K.mean(-K.sum(x * K.log(x_decoded_mean + 1e-6)
                       + (1 - x) * K.log(1 - x_decoded_mean + 1e-6), axis=1))
    return eq + kl
In [0]:
# Start a TF session so we can run code.
# (On TF 2.x one would instead use the tf.compat.v1 equivalents.)
sess = tf.InteractiveSession()
# Connect Keras to the created session.
K.set_session(sess)
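As an optional sanity check before submitting, you can evaluate vlb_binomial on a tiny hand-made batch (the values below are illustrative only): with zero mean and zero log-variance the KL term vanishes, so the result is just the average negated reconstruction log-likelihood.
In [0]:
# Optional sanity check on a toy batch of two 4-pixel "images".
x_check = tf.constant([[0., 1., 1., 0.], [1., 0., 0., 1.]])
x_dec_check = tf.constant([[0.1, 0.9, 0.8, 0.2], [0.7, 0.3, 0.4, 0.6]])
t_mean_check = tf.zeros((2, 3))     # q = N(0, I), so KL(q || p) = 0
t_log_var_check = tf.zeros((2, 3))
print(sess.run(vlb_binomial(x_check, x_dec_check, t_mean_check, t_log_var_check)))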
In [7]:
grader.submit_vlb(sess, vlb_binomial)
Task 2 Read the code below that defines the encoder and decoder networks and implement sampling with the reparametrization trick in the provided space.
In [8]:
batch_size = 100
original_dim = 784      # Number of pixels in MNIST images.
latent_dim = 100        # d, dimensionality of the latent code t.
intermediate_dim = 256  # Size of the hidden layer.
epochs = 50

x = Input(batch_shape=(batch_size, original_dim))

def create_encoder(input_dim):
    # Encoder network.
    # We instantiate these layers separately so as to reuse them later.
    encoder = Sequential(name='encoder')
    encoder.add(InputLayer([input_dim]))
    encoder.add(Dense(intermediate_dim, activation='relu'))
    # The output holds both the means and the log-variances, hence 2 * latent_dim.
    encoder.add(Dense(2 * latent_dim))
    return encoder

encoder = create_encoder(original_dim)

get_t_mean = Lambda(lambda h: h[:, :latent_dim])
get_t_log_var = Lambda(lambda h: h[:, latent_dim:])
h = encoder(x)
t_mean = get_t_mean(h)
t_log_var = get_t_log_var(h)

# Sampling from the distribution
#     q(t | x) = N(t_mean, exp(t_log_var))
# with the reparametrization trick.
def sampling(args):
    """Returns sample from a distribution N(args[0], diag(args[1]))

    The sample should be computed with the reparametrization trick.

    The inputs are tf.Tensor
        args[0]: (batch_size x latent_dim) mean of the desired distribution
        args[1]: (batch_size x latent_dim) logarithm of the variance vector of the desired distribution

    Returns:
        A tf.Tensor of size (batch_size x latent_dim), the samples.
    """
    t_mean, t_log_var = args
    # YOUR CODE HERE
    # t_log_var is the log of the variance, so exp(0.5 * t_log_var) is the standard deviation.
    return t_mean + K.exp(0.5 * t_log_var) * K.random_normal(t_mean.shape)

t = Lambda(sampling)([t_mean, t_log_var])

def create_decoder(input_dim):
    # Decoder network.
    # We instantiate these layers separately so as to reuse them later.
    decoder = Sequential(name='decoder')
    decoder.add(InputLayer([input_dim]))
    decoder.add(Dense(intermediate_dim, activation='relu'))
    # Sigmoid activation so each output pixel is a valid probability.
    decoder.add(Dense(original_dim, activation='sigmoid'))
    return decoder

decoder = create_decoder(latent_dim)
x_decoded_mean = decoder(t)
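Optionally, you can check the sampler empirically before submitting: with a constant mean and log-variance, the statistics of many reparametrized samples should match the target distribution. The check below is a sketch, not part of the assignment.
In [0]:
# Optional: empirical check of the reparametrized sampler.
check_mean = 3.0 * tf.ones((batch_size, latent_dim))
check_log_var = float(np.log(0.25)) * tf.ones((batch_size, latent_dim))  # variance 0.25, std 0.5
check_t = sampling([check_mean, check_log_var])
samples = sess.run(check_t)
print(samples.shape)                  # (100, 100)
print(samples.mean(), samples.std())  # roughly 3.0 and 0.5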
In [9]:
grader.submit_samples(sess, sampling)
Task 3 Run the cells below to train the model with the default settings, then modify the parameters to get better results. Pay special attention to the encoder/decoder architectures (e.g. using more layers, maybe making them convolutional), the learning rate, and the number of epochs.
In [0]:
loss = vlb_binomial(x, x_decoded_mean, t_mean, t_log_var)
vae = Model(x, x_decoded_mean)
# Keras will provide the input (x) and output (x_decoded_mean) to the function that
# should construct the loss, but since our loss also depends on other
# tensors (e.g. t_mean), it is easier to build the loss tensor in advance and pass
# a function that always returns it.
vae.compile(optimizer=keras.optimizers.RMSprop(lr=0.001), loss=lambda x, y: loss)
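An equivalent wiring, used in the official Keras VAE example, is to register the precomputed loss tensor with add_loss and compile without a loss argument. The sketch below is an alternative, not a required step; vae_alt is an illustrative name.
In [0]:
# Alternative (sketch): attach the symbolic loss tensor directly to the model.
vae_alt = Model(x, x_decoded_mean)
vae_alt.add_loss(loss)  # loss already depends on x and x_decoded_mean
vae_alt.compile(optimizer=keras.optimizers.RMSprop(lr=0.001))  # no loss argument needed
# vae_alt.fit(x_train, None, epochs=epochs, batch_size=batch_size)  # targets are None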
In [0]:
# train the VAE on MNIST digits
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# One hot encoding.
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
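Since the likelihood above models binary images, you may optionally threshold the grayscale intensities to exact zeros and ones; the [0, 1] intensities also work well in practice. A sketch (x_train_bin and x_test_bin are illustrative names):
In [0]:
# Optional preprocessing: binarize pixels so the data matches the binomial likelihood exactly.
x_train_bin = (x_train > 0.5).astype('float32')
x_test_bin = (x_test > 0.5).astype('float32')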
In [12]:
hist = vae.fit(x=x_train, y=x_train,
               shuffle=True,
               epochs=epochs,
               batch_size=batch_size,
               validation_data=(x_test, x_test),
               verbose=2)
In the picture below you can see the reconstruction ability of your network on training and validation data. In each of the two images, the left column contains MNIST images and the right column contains the corresponding images after passing through the autoencoder (or, more precisely, the means of the binomial distributions over the output images).
Note that getting the best possible reconstruction is not the point of a VAE: the KL term of the objective specifically hurts reconstruction performance. But the reconstructions should still be reasonable, and they provide a useful visual debugging tool.
In [13]:
fig = plt.figure(figsize=(10, 10))
for fid_idx, (data, title) in enumerate(
        zip([x_train, x_test], ['Train', 'Validation'])):
    n = 10  # figure with 10 x 2 digits
    digit_size = 28
    figure = np.zeros((digit_size * n, digit_size * 2))
    decoded = sess.run(x_decoded_mean, feed_dict={x: data[:batch_size, :]})
    for i in range(10):
        figure[i * digit_size: (i + 1) * digit_size,
               :digit_size] = data[i, :].reshape(digit_size, digit_size)
        figure[i * digit_size: (i + 1) * digit_size,
               digit_size:] = decoded[i, :].reshape(digit_size, digit_size)
    ax = fig.add_subplot(1, 2, fid_idx + 1)
    ax.imshow(figure, cmap='Greys_r')
    ax.set_title(title)
    ax.axis('off')
plt.show()
In [14]:
grader.submit_best_val_loss(hist)
Task 4 Write code to generate new samples of images from your trained VAE. To do that you have to sample from the prior distribution $p(t)$ and then from the likelihood $p(x \mid t)$.
Note that the sampling you've written in Task 2 was for the variational distribution $q(t \mid x)$, while here you need to sample from the prior.
In [0]:
n_samples = 10  # To pass automatic grading please use at least 2 samples here.
# YOUR CODE HERE.
# sampled_im_mean is a tf.Tensor of size 10 x 784 with 10 random
# images sampled from the vae model.
# Sample latent codes from the prior p(t) = N(0, I), then decode them.
z = tf.random_normal((n_samples, latent_dim))
sampled_im_mean = decoder(z)
In [16]:
sampled_im_mean_np = sess.run(sampled_im_mean)
# Show the sampled images.
plt.figure()
for i in range(n_samples):
    ax = plt.subplot(n_samples // 5 + 1, 5, i + 1)
    plt.imshow(sampled_im_mean_np[i, :].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
In [17]:
grader.submit_hallucinating(sess, sampled_im_mean)
In the final task, you will modify your code to obtain a Conditional Variational Autoencoder (CVAE) [1]. The idea is very simple: to be able to control the samples you generate, we condition all the distributions on some additional information. In our case, this additional information will be the class label (the digit on the image, from 0 to 9).
So now both the likelihood and the variational distributions are conditioned on the class label: $p(x \mid t, \text{label}, w)$, $q(t \mid x, \text{label}, \phi)$.
The only thing you have to change in your code is to concatenate the input image $x$ with the (one-hot) label of this image before passing it into the encoder $q$, and to concatenate the latent code $t$ with the same label before passing it into the decoder $p$. Note that this is slightly harder to do with a convolutional encoder/decoder model.
[1] Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. “Learning Structured Output Representation using Deep Conditional Generative Models.” Advances in Neural Information Processing Systems. 2015.
Task 5.1 Implement the CVAE model. You may reuse the create_encoder and create_decoder functions defined previously (now you can see why they accept the input size as an argument ;) ). You may also need the concatenate Keras layer to concatenate the labels with the input data and the latent code.
To finish this task, you should go to the Conditionally hallucinate data section and find Task 5.2 there.
In [0]:
x = Input(batch_shape=(batch_size, original_dim))
# One-hot labels placeholder.
label = Input(batch_shape=(batch_size, 10))

# YOUR CODE HERE.
# The encoder takes the image concatenated with its one-hot label.
cencoder = create_encoder(original_dim + 10)
stacked_x = concatenate([x, label])
h = cencoder(stacked_x)
cond_t_mean = get_t_mean(h)        # Mean of the latent code (without label) for the cvae model.
cond_t_log_var = get_t_log_var(h)  # Logarithm of the variance of the latent code (without label) for the cvae model.
t = Lambda(sampling)([cond_t_mean, cond_t_log_var])
# The decoder takes the latent code concatenated with the same label.
stacked_t = concatenate([t, label])
cdecoder = create_decoder(latent_dim + 10)
cond_x_decoded_mean = cdecoder(stacked_t)  # Final output of the cvae model.
In [0]:
conditional_loss = vlb_binomial(x, cond_x_decoded_mean, cond_t_mean, cond_t_log_var)
cvae = Model([x, label], cond_x_decoded_mean)
cvae.compile(optimizer=keras.optimizers.RMSprop(lr=0.001), loss=lambda x, y: conditional_loss)
In [20]:
hist = cvae.fit(x=[x_train, y_train],
                y=x_train,
                shuffle=True,
                epochs=epochs,
                batch_size=batch_size,
                validation_data=([x_test, y_test], x_test),
                verbose=2)
In [21]:
fig = plt.figure(figsize=(10, 10))
for fid_idx, (x_data, y_data, title) in enumerate(
        zip([x_train, x_test], [y_train, y_test], ['Train', 'Validation'])):
    n = 10  # figure with 10 x 2 digits
    digit_size = 28
    figure = np.zeros((digit_size * n, digit_size * 2))
    decoded = sess.run(cond_x_decoded_mean,
                       feed_dict={x: x_data[:batch_size, :],
                                  label: y_data[:batch_size, :]})
    for i in range(10):
        figure[i * digit_size: (i + 1) * digit_size,
               :digit_size] = x_data[i, :].reshape(digit_size, digit_size)
        figure[i * digit_size: (i + 1) * digit_size,
               digit_size:] = decoded[i, :].reshape(digit_size, digit_size)
    ax = fig.add_subplot(1, 2, fid_idx + 1)
    ax.imshow(figure, cmap='Greys_r')
    ax.set_title(title)
    ax.axis('off')
plt.show()
In [0]:
# Prepare one-hot labels of the form
#   0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 ...
# to sample five zeros, five ones, etc.
curr_labels = np.eye(10)
curr_labels = np.repeat(curr_labels, 5, axis=0)  # Its shape is 50 x 10.
# YOUR CODE HERE.
# cond_sampled_im_mean is a tf.Tensor of size 50 x 784 with 5 random zeros,
# then 5 random ones, etc sampled from the cvae model.
# Sample latent codes from the prior and decode them together with the labels.
z = tf.random_normal((50, latent_dim))
labels = tf.convert_to_tensor(curr_labels, dtype=tf.float32)
stacked_z = concatenate([z, labels])
cond_sampled_im_mean = cdecoder(stacked_z)
In [23]:
cond_sampled_im_mean_np = sess.run(cond_sampled_im_mean)
# Show the sampled images.
plt.figure(figsize=(10, 10))
global_idx = 0
for digit in range(10):
    for _ in range(5):
        ax = plt.subplot(10, 5, global_idx + 1)
        plt.imshow(cond_sampled_im_mean_np[global_idx, :].reshape(28, 28), cmap='gray')
        ax.axis('off')
        global_idx += 1
plt.show()
In [24]:
# Submit Task 5 (both 5.1 and 5.2).
grader.submit_conditional_hallucinating(sess, cond_sampled_im_mean)
In [25]:
STUDENT_EMAIL = ""  # EMAIL HERE
STUDENT_TOKEN = ""  # TOKEN HERE
grader.status()
In [26]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)