Variational Autoencoder (VAE)

In this seminar we will train an autoencoder to model images of faces. For this we use the "Labeled Faces in the Wild" (LFW) dataset (http://vis-www.cs.umass.edu/lfw/), namely its deep-funneled version (frontal views of all faces).

Prepare the data


In [1]:
#The following line fetches you two datasets: images, usable for autoencoder training and attributes.
#Those attributes will be required for the final part of the assignment (applying smiles), so please keep them in mind
from lfw_dataset import fetch_lfw_dataset
data,attrs = fetch_lfw_dataset()

In [2]:
import numpy as np
X_train = data[:10000].reshape((10000,-1))
print(X_train.shape)
X_val = data[10000:].reshape((-1,X_train.shape[1]))
print(X_val.shape)

image_h = data.shape[1]
image_w = data.shape[2]


(10000, 6075)
(3143, 6075)

For simplicity we want all values of the data to lie in the interval $[0,1]$:


In [3]:
X_train = np.float32(X_train)
X_train = X_train/255
X_val = np.float32(X_val)
X_val = X_val/255

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt

def plot_gallery(images, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.5 * n_col, 1.7 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w, 3)), cmap=plt.cm.gray, vmin=-1, vmax=1, interpolation='nearest')
        plt.xticks(())
        plt.yticks(())

In [5]:
plot_gallery(X_train, image_h, image_w)



In [6]:
import theano
import theano.tensor as T


WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K40m (CNMeM is disabled, cuDNN 5105)

Autoencoder

Why use all these complicated formulas and regularizations? What is the need for variational inference? To see the difference, let's first train a plain autoencoder on the data:


In [7]:
import lasagne

input_X = T.matrix("X")

input_shape = [None,image_h*image_w*3]

In [8]:
HU_encoder = 2000 #you can play with these values
HU_decoder = 2000
dimZ = 100 #considering the face reconstruction task, what size of representation seems reasonable?

# define the network
# use ReLU for hidden layers' activations
# GlorotUniform initialization for W
# zero initialization for biases
# it's also convenient to put sigmoid activation on output layer to get nice normalized pics

l_input = lasagne.layers.InputLayer(input_shape, input_X)
l_enc = lasagne.layers.DenseLayer(l_input, HU_encoder, b=lasagne.init.Constant(0))
l_z = lasagne.layers.DenseLayer(l_enc, dimZ, b=lasagne.init.Constant(0))
l_dec = lasagne.layers.DenseLayer(l_z, HU_decoder, b=lasagne.init.Constant(0))
l_out = lasagne.layers.DenseLayer(l_dec, input_shape[1], b=lasagne.init.Constant(0), 
                                  nonlinearity=lasagne.nonlinearities.sigmoid)

In [9]:
# create prediction variable
prediction = lasagne.layers.get_output(l_out)

# create loss function
loss = lasagne.objectives.squared_error(prediction, input_X).mean()

# create parameter update expressions
params = lasagne.layers.get_all_params(l_out, trainable=True)
updates = lasagne.updates.adam(loss, params, learning_rate=0.001)

# compile training function that updates parameters and returns training loss
# this will take a while
train_fn = theano.function([input_X], loss, updates=updates)
test_fn = theano.function([input_X], prediction)
test_loss = theano.function([input_X], loss)

In [10]:
def iterate_minibatches(inputs, batchsize, shuffle=True):
    if shuffle:
        indices = np.arange(len(inputs))
        np.random.shuffle(indices)
    for start_idx in range(0, len(inputs) - batchsize + 1, batchsize):
        if shuffle:
            excerpt = indices[start_idx:start_idx + batchsize]
        else:
            excerpt = slice(start_idx, start_idx + batchsize)
        yield inputs[excerpt]

In [11]:
from IPython.display import clear_output

In [12]:
val_log = []
train_log = []

In [13]:
# train your autoencoder
# visualize progress in reconstruction and loss decay
for ep in range(50):
    n_batches = 0
    err = 0
    for X_batch in iterate_minibatches(X_train, 200):
        err += train_fn(X_batch)
        n_batches += 1
    train_log.append(err / n_batches)
    n_batches = 0
    err = 0
    for X_batch in iterate_minibatches(X_val, 200):
        err += test_loss(X_batch)
        n_batches += 1
    val_log.append(err / n_batches)
    clear_output(True)
    
    plt.figure(figsize=(15,10))
    ax1 = plt.subplot2grid((3, 4), (0, 0), colspan=2)
    ax2 = plt.subplot2grid((3, 4), (0, 2), colspan=2)
    ax3 = plt.subplot2grid((3, 4), (1, 0), colspan=4, rowspan=2)
    
    ind = np.random.randint(X_val.shape[0])
    ax1.imshow(X_val[ind].reshape((image_h, image_w, 3)), 
               cmap=plt.cm.gray, vmin=-1, vmax=1, interpolation='nearest')
    ax2.imshow(test_fn([X_val[ind]]).reshape((image_h, image_w, 3)), 
               cmap=plt.cm.gray, vmin=-1, vmax=1, interpolation='nearest')
    
    ax3.plot(train_log, 'r')
    ax3.plot(val_log, 'b')
    ax3.grid(True)
    ax3.legend(('Train loss', 'Val loss'), loc=0, fontsize=16);
    plt.show()



In [14]:
plot_gallery(np.vstack((X_val[:10], test_fn(X_val[:10]))), image_h, image_w, n_row=2, n_col=10)


Sampling

This task requires deeper Lasagne knowledge: you need to run the decoder on its own, reconstructing an image from some random $z$ representations.


In [15]:
z_sample = T.matrix()

# Your code goes here:
# generated_x = 

hidden_state = lasagne.layers.InputLayer([None, dimZ], z_sample)
decoder = lasagne.layers.DenseLayer(hidden_state, HU_decoder, W=l_dec.W, b=l_dec.b)
output_layer = lasagne.layers.DenseLayer(decoder, input_shape[1], W=l_out.W, b=l_out.b, 
                                        nonlinearity=lasagne.nonlinearities.sigmoid)

generated_x = lasagne.layers.get_output(output_layer)

gen_fn = theano.function([z_sample], generated_x)

In [16]:
z = np.random.randn(25, dimZ)*0.5
output = gen_fn(np.asarray(z, dtype=theano.config.floatX))
plot_gallery(output, image_h, image_w, n_row=5, n_col=5)


Can you visualize what the distribution of $z$ looks like? Is it dense? What properties would we expect from it? Can we perform interpolation in $z$-space?
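One of the questions above, interpolation in $z$-space, can be sketched directly in numpy (a standalone sketch; `interpolate_z` is a hypothetical helper, and each resulting row would be fed to the `gen_fn` defined above to render the morphing sequence):

```python
import numpy as np

def interpolate_z(z_a, z_b, n_steps=8):
    """Linearly interpolate between two latent codes.

    Returns an array of shape (n_steps, dimZ); decoding each row
    yields a gradual morph between the two source faces.
    """
    alphas = np.linspace(0.0, 1.0, n_steps)[:, None]   # (n_steps, 1)
    return (1.0 - alphas) * z_a[None, :] + alphas * z_b[None, :]

dimZ = 100
z_a = np.random.randn(dimZ).astype(np.float32)
z_b = np.random.randn(dimZ).astype(np.float32)
path = interpolate_z(z_a, z_b, n_steps=8)
```

With a plain autoencoder the midpoints often decode to blurry non-faces, because nothing forces the codes to fill the space densely; the VAE's KL term, introduced below, pulls $q(z|x)$ toward $\mathcal{N}(0, I)$ and tends to make such interpolations smoother.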

Variational Autoencoder

The Bayesian approach in deep learning considers everything in terms of distributions. Now our encoder generates not just a vector $z$ but a posterior distribution $q(z|x)$. Technically, the first difference is that you need to split the bottleneck layer in two: one dense layer will generate the vector $\mu$, and another will generate the vector $\sigma$. The reparametrization trick is implemented via the GaussianSampler layer, which generates a random vector $\epsilon$ and returns $\mu+\sigma\epsilon$.
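The trick itself can be illustrated in plain numpy (a minimal standalone sketch, independent of the GaussianSampler layer):

```python
import numpy as np

def gaussian_sample(mu, logsigma, rng):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I).

    All randomness sits in eps, so gradients w.r.t. mu and logsigma can
    flow through the sample -- this is what the sampler layer provides.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(logsigma) * eps

rng = np.random.RandomState(0)
mu = np.zeros((10000, 2))
logsigma = np.log(2.0) * np.ones((10000, 2))   # i.e. sigma = 2
z = gaussian_sample(mu, logsigma, rng)
```

Averaged over many samples, `z` has mean close to `mu` and standard deviation close to `exp(logsigma)`, while remaining differentiable with respect to both.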

The code for this layer is taken from the "recipes" folder of the Lasagne GitHub repo:


In [17]:
import GS

Since our decoder is also a function that generates a distribution, we need the same splitting for the output layer. When testing the model we will look only at the mean values, so one of the outputs will be the actual autoencoder output.

In this homework we only ask for the implementation of the simplest version of VAE, with one $z$ sample per input. You may consider sampling several outputs per input and averaging them, as is done in the Lasagne recipes.


In [95]:
# to compare with conventional AE, keep these hyperparameters
# or change them for the values that you used before

HU_encoder = 2000
HU_decoder = 2000
dimZ = 200

# define the network
# you can start from https://github.com/Lasagne/Recipes/blob/master/examples/variational_autoencoder/variational_autoencoder.py
# or another example https://github.com/y0ast/Variational-Autoencoder/blob/master/VAE.py
# but remember that this is not your ground truth since the data is not MNIST

In [96]:
input_X = T.matrix("X")

input_shape = [None,image_h*image_w*3]

# relu_shift is for numerical stability - if training data has any
# dimensions where stdev=0, allowing logsigma to approach -inf
# will cause the loss function to become NAN. So we set the limit
# stdev >= exp(-1 * relu_shift)
relu_shift = 10

l_input = lasagne.layers.InputLayer(input_shape, input_X)
l_enc_hid = lasagne.layers.DenseLayer(l_input, HU_encoder, nonlinearity=T.nnet.softplus)
l_enc_mu = lasagne.layers.DenseLayer(l_enc_hid, dimZ, nonlinearity=None)
l_enc_logsigma = lasagne.layers.DenseLayer(l_enc_hid, dimZ, nonlinearity=None)

l_Z = GS.GaussianSampleLayer(l_enc_mu, l_enc_logsigma)

l_dec_hid = lasagne.layers.DenseLayer(l_Z, HU_decoder, nonlinearity=T.nnet.softplus)
l_dec_mu = lasagne.layers.DenseLayer(l_dec_hid, input_shape[1], nonlinearity=None)
l_dec_logsigma = lasagne.layers.DenseLayer(l_dec_hid, input_shape[1],
        nonlinearity=lambda a: T.nnet.relu(a + relu_shift) - relu_shift)
l_output = GS.GaussianSampleLayer(l_dec_mu, l_dec_logsigma)

And last, but not least: the place in the code where most of the formulas go, the optimization objective. The objective for a VAE has its own name, the variational lower bound. As with any lower bound, our intention is to maximize it. Here it is (for one sample $z$ per input $x$):

$$\mathcal{L} = -D_{KL}(q_{\phi}(z|x)||p_{\theta}(z)) + \log p_{\theta}(x|z)$$

Your next task is to implement two functions that compute KL-divergence and the second term - log-likelihood of an output. Here is some necessary math for your convenience:

$$D_{KL} = -\frac{1}{2}\sum_{i=1}^{dimZ}\left(1+\log(\sigma_i^2)-\mu_i^2-\sigma_i^2\right)$$

$$\log p_{\theta}(x|z) = \sum_{i=1}^{dimX}\log p_{\theta}(x_i|z)=\sum_{i=1}^{dimX} \log \Big( \frac{1}{\sigma_i\sqrt{2\pi}}e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}} \Big)=\ldots$$

Don't forget that in the code you are using $\log\sigma$ as the variable. Explain why we use $\log\sigma$ rather than $\sigma$.
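As a sanity check on the closed-form KL above, it can be compared against a Monte Carlo estimate of $E_{q}[\log q(z) - \log p(z)]$ in plain numpy (a standalone sketch, parameterized by $\log\sigma$ just as the network is):

```python
import numpy as np

def kl_closed_form(mu, logsigma):
    # D_KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions;
    # note sigma_i^2 = exp(2 * logsigma_i)
    return -0.5 * np.sum(1 + 2 * logsigma - mu**2 - np.exp(2 * logsigma))

def kl_monte_carlo(mu, logsigma, n=200000, seed=0):
    # draw z ~ q, then average log q(z) - log p(z) over the samples
    rng = np.random.RandomState(seed)
    sigma = np.exp(logsigma)
    z = mu + sigma * rng.standard_normal((n, mu.size))
    log_q = np.sum(-logsigma - 0.5 * np.log(2 * np.pi)
                   - 0.5 * (z - mu) ** 2 / sigma ** 2, axis=1)
    log_p = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * z ** 2, axis=1)
    return np.mean(log_q - log_p)

mu = np.array([0.5, -1.0])
logsigma = np.array([0.2, -0.3])
analytic = kl_closed_form(mu, logsigma)
estimate = kl_monte_carlo(mu, logsigma)
```

The two values agree up to Monte Carlo noise, which is a quick way to catch sign or factor-of-two mistakes in your Theano implementation below.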


In [97]:
lasagne.layers.get_all_layers(l_output) # should be ~9 layers total


Out[97]:
[<lasagne.layers.input.InputLayer at 0x7f998f47f2e8>,
 <lasagne.layers.dense.DenseLayer at 0x7f9995b23780>,
 <lasagne.layers.dense.DenseLayer at 0x7f9996ac3f98>,
 <lasagne.layers.dense.DenseLayer at 0x7f998f151080>,
 <GS.GaussianSampleLayer at 0x7f99fd52a5f8>,
 <lasagne.layers.dense.DenseLayer at 0x7f99965d7eb8>,
 <lasagne.layers.dense.DenseLayer at 0x7f998f091dd8>,
 <lasagne.layers.dense.DenseLayer at 0x7f998f0919e8>,
 <GS.GaussianSampleLayer at 0x7f99fe123668>]

In [98]:
def KL_divergence(mu, logsigma):
    # closed-form KL( q(z|x) || N(0, I) ); note sigma^2 = exp(2 * logsigma)
    return -0.5 * T.sum(1 + 2 * logsigma - T.sqr(mu) - T.exp(2 * logsigma))

def log_likelihood(x, mu, logsigma):
    # Gaussian log-density; the constant -0.5*log(2*pi) per dimension
    # is dropped, as it does not affect the gradients
    return T.sum(-logsigma - 0.5 * T.sqr(x - mu) / T.exp(2 * logsigma))

In [99]:
hid_mu, hid_ls, dec_mu, dec_ls = lasagne.layers.get_output([l_enc_mu, l_enc_logsigma, l_dec_mu, l_dec_logsigma])

Now build the loss and training function:


In [100]:
loss = KL_divergence(hid_mu, hid_ls) - log_likelihood(input_X, dec_mu, dec_ls)
pred = lasagne.layers.get_output(l_output, deterministic=True)

params = lasagne.layers.get_all_params(l_output, trainable=True)
updates = lasagne.updates.adam(loss, params, learning_rate=1e-4)

train_fn = theano.function([input_X], loss, updates=updates)
test_fn = theano.function([input_X], loss)
prediction = theano.function([input_X], pred)

And train the model:


In [101]:
val_log = []
train_log = []

In [348]:
# train your autoencoder
# visualize progress in reconstruction and loss decay
for ep in range(100):
    n_batches = 0
    err = 0
    for X_batch in iterate_minibatches(X_train, 500):
        err += train_fn(X_batch)
        n_batches += 1
    train_log.append(err / n_batches)
    n_batches = 0
    err = 0
    for X_batch in iterate_minibatches(X_val, 500):
        err += test_fn(X_batch)
        n_batches += 1
    val_log.append(err / n_batches)
    clear_output(True)
    
    plt.figure(figsize=(15,10))
    ax1 = plt.subplot2grid((3, 4), (0, 0), colspan=2)
    ax2 = plt.subplot2grid((3, 4), (0, 2), colspan=2)
    ax3 = plt.subplot2grid((3, 4), (1, 0), colspan=4, rowspan=2)
    
    ind = np.random.randint(X_val.shape[0])
    ax1.imshow(X_val[ind].reshape((image_h, image_w, 3)), 
               cmap=plt.cm.gray, vmin=-1, vmax=1, interpolation='nearest')
    ax2.imshow(prediction([X_val[ind]]).reshape((image_h, image_w, 3)), 
               cmap=plt.cm.gray, vmin=-1, vmax=1, interpolation='nearest')
    
    ax3.plot(train_log, 'r')
    ax3.plot(val_log, 'b')
    ax3.grid(True)
    ax3.legend(('Train loss', 'Val loss'), loc=0, fontsize=16);
    plt.show()



Congrats!

If you managed to tune your autoencoders to converge and learn something about the world, now it's time to have some fun with them. As you may have noticed, there are face attributes in the dataset. We're interested in the "Smiling" column, but feel free to try others as well! Here is the first task:

1) Extract the "Smiling" column as a separate numpy vector and sort this vector.


In [105]:
attrs[:10]


Out[105]:
(first 10 rows of the attributes DataFrame: 10 rows × 73 columns of real-valued attribute scores; columns include Male, Asian, White, Black, Baby, Child, Youth, Middle Aged, Senior, Black Hair, ..., Wearing Necktie, Wearing Necklace)


In [358]:
smiling = attrs.Smiling.values

In [359]:
ind = np.argsort(smiling)

In [360]:
plot_gallery(np.vstack((data[ind[:10]], data[ind[-10:]])), image_h, image_w, n_row=2, n_col=10)


2) Take z-representations of those top images (you can do it only for positive or for both) and average them to find "vector representation" of the attribute.

3) Show how "feature arithmetic" works with representations of both the VAE and the conventional autoencoder. Show how to generate an image with a preconditioned attribute. Take some sad faces and make them smile.

4) (If you didn't manage to tune the VAE, just show whether it works for the plain AE.) Discuss the results.
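The arithmetic of steps 2) and 3) can be sketched in plain numpy on toy latent codes (everything here is synthetic and hypothetical; with the real model the codes would come from the encoder, `z_repr`, and the final code would be decoded with `gen_fn`):

```python
import numpy as np

rng = np.random.RandomState(0)
dimZ = 200

# toy stand-ins for the codes of the 50 most / least smiling faces;
# dimension 0 pretends to carry the "smile" signal
z_smiling = rng.randn(50, dimZ)
z_smiling[:, 0] += 2.0
z_neutral = rng.randn(50, dimZ)

# step 2: attribute vector = difference of the mean codes
smile_vec = z_smiling.mean(axis=0) - z_neutral.mean(axis=0)

# step 3: feature arithmetic -- push a sad face along the smile direction
z_sad = z_neutral[0]
z_happy = z_sad + 1.0 * smile_vec   # decode this code to get a smiling face
```

The recovered direction is concentrated on the dimension that actually differs between the two groups, which is why averaging over many examples (rather than subtracting a single pair) gives a usable attribute vector.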


In [361]:
X = np.float32(data / 255)

In [362]:
z_repr = theano.function([input_X], lasagne.layers.get_output(l_enc_mu))

In [368]:
mean_smile = z_repr(X[ind[-50:]].reshape(50, -1)).mean(axis=0)
mean_smile -= z_repr(X[ind[:50]].reshape(50, -1)).mean(axis=0)

In [369]:
z_sample = T.matrix()

inp_layer = lasagne.layers.InputLayer([None, dimZ], z_sample)
hid_state = lasagne.layers.DenseLayer(inp_layer, HU_decoder, nonlinearity=T.nnet.softplus,
                                      W=l_dec_hid.W, b=l_dec_hid.b)
decoder_mu = lasagne.layers.DenseLayer(hid_state, input_shape[1], nonlinearity=None, W=l_dec_mu.W, b=l_dec_mu.b)

In [370]:
gen_fn = theano.function([z_sample], lasagne.layers.get_output(decoder_mu))

In [382]:
img = X[ind[10]]
z0 = z_repr(img.reshape(1, -1))  # latent code of a non-smiling face
generated = []
for c in range(6):
    # shift the code along the "smile" direction with increasing strength
    generated.append(gen_fn(z0 + mean_smile * np.float32(0.1 * c)))

In [383]:
plot_gallery(generated, image_h, image_w, n_row=1, n_col=6)