Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision. These codings typically have a much lower dimensionality than the input data, making autoencoders useful for dimensionality reduction. More importantly, autoencoders act as powerful feature detectors, and they can be used for unsupervised pretraining of deep neural networks. Lastly, they are capable of randomly generating new data that looks very similar to the training data; this is called a generative model. For example, you could train an autoencoder on pictures of faces, and it would then be able to generate new faces.
You can limit the size of the internal representation, or you can add noise to the inputs and train the network to recover the original inputs. These constraints prevent the autoencoder from trivially copying the inputs directly to the outputs, which forces it to learn efficient ways of representing the data. In short, the codings are byproducts of the autoencoder’s attempt to learn the identity function under some constraints.
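Formally, if $f$ denotes the encoder and $g$ the decoder, the autoencoders in this chapter minimize a reconstruction cost of the form (a standard formulation, stated here for reference):

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left\| \mathbf{x}^{(i)} - g\big(f(\mathbf{x}^{(i)})\big) \right\|^2 $$

subject to the chosen constraint (an undercomplete coding layer, input noise, a sparsity penalty, and so on).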
The relationship between memory, perception, and pattern matching was famously studied by Chase and Simon in the early 1970s (Chase-Simon72). They observed that expert chess players could memorize the positions of all the pieces in a game by looking at the board for just 5 seconds, a task that most people would find impossible. However, this was only the case when the pieces were placed in realistic positions (from actual games), not when they were placed randomly. Chess experts don't have a much better memory than normal people; they just see chess patterns more easily thanks to their experience with the game. Noticing patterns helps them store information efficiently.
An autoencoder typically has the same architecture as a Multi-Layer Perceptron (MLP), except that the number of neurons in the output layer must be equal to the number of inputs.
In typical encoder/decoder architectures, the outputs are often called the reconstructions, since the autoencoder tries to reconstruct the inputs, and the cost function contains a reconstruction loss that penalizes the model when the reconstructions differ from the inputs.
If the autoencoder uses only linear activations and the cost function is the Mean Squared Error (MSE), then it can be shown that it ends up performing Principal Component Analysis.
In [8]:
# Tensorflow
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected
# Common imports
import numpy as np
import numpy.random as rnd
import os
import sys
# to make this notebook's output stable across runs
rnd.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
In [14]:
rnd.seed(4)
m = 100
w1, w2 = 0.1, 0.3
noise = 0.1
angles = rnd.rand(m) * 3 * np.pi / 2 - 0.5
X_train = np.empty((m, 3))
X_train[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * rnd.randn(m) / 2
X_train[:, 1] = np.sin(angles) * 0.7 + noise * rnd.randn(m) / 2
X_train[:, 2] = X_train[:, 0] * w1 + X_train[:, 1] * w2 + noise * rnd.randn(m)
# Normalize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
In [20]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter3D(X_train[:, 0], X_train[:, 1], X_train[:, 2])
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18, rotation=0)
plt.show()
Autoencoder
In [10]:
## Build phase
tf.reset_default_graph()
n_inputs = 3
n_hidden = 2 # codings
n_outputs = n_inputs
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = fully_connected(X, n_hidden, activation_fn=None)
outputs = fully_connected(hidden, n_outputs, activation_fn=None)
mse = tf.reduce_mean(tf.square(outputs - X))
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(mse)
init = tf.global_variables_initializer()
## Execution phase
n_iterations = 10000
codings = hidden
with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        training_op.run(feed_dict={X: X_train})
    codings_val = codings.eval(feed_dict={X: X_train})
In [11]:
fig = plt.figure(figsize=(4,3))
plt.plot(codings_val[:,0], codings_val[:, 1], "b.")
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18, rotation=0)
plt.show()
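Since this autoencoder uses linear activations and the MSE cost, its 2D codings should span the same plane as the first two principal components. As a quick sanity check (not in the original notebook; a minimal sketch assuming scikit-learn is available), we can compare against PCA directly:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
codings_pca = pca.fit_transform(X_train)        # project onto the top-2 principal components
X_rec_pca = pca.inverse_transform(codings_pca)  # reconstruct from the 2D projection
print("PCA reconstruction MSE:", np.mean(np.square(X_rec_pca - X_train)))
# This should be very close to the autoencoder's final MSE: the codings may differ
# by a rotation/scaling, but the subspace they span is the same.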
In [21]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")
In [23]:
tf.reset_default_graph()
from tensorflow.contrib.layers import fully_connected
n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 150 # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.01
l2_reg = 0.0001
initializer = tf.contrib.layers.variance_scaling_initializer() # He initialization
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
with tf.contrib.framework.arg_scope([fully_connected],
                                    activation_fn=tf.nn.elu,
                                    weights_initializer=initializer,
                                    weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg)):
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2)
    hidden3 = fully_connected(hidden2, n_hidden3)
    outputs = fully_connected(hidden3, n_outputs, activation_fn=None)
mse = tf.reduce_mean(tf.square(outputs - X))
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([mse] + reg_losses)
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
Now let's train it! Note that we don't feed target values (y_batch is not used). This is unsupervised training.
In [24]:
n_epochs = 4
batch_size = 150
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})
        mse_train = mse.eval(feed_dict={X: X_batch})
        print("\r{}".format(epoch), "Train MSE:", mse_train)
        saver.save(sess, "./my_model_all_layers.ckpt")
This function loads the model, evaluates it on the test set (it measures the reconstruction error), then it displays the original image and its reconstruction:
In [28]:
def plot_image(image, shape=[28, 28]):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")

def show_reconstructed_digits(X, outputs, model_path=None, n_test_digits=2):
    with tf.Session() as sess:
        if model_path:
            saver.restore(sess, model_path)
        X_test = mnist.test.images[:n_test_digits]
        outputs_val = outputs.eval(feed_dict={X: X_test})

    fig = plt.figure(figsize=(8, 3 * n_test_digits))
    for digit_index in range(n_test_digits):
        plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
        plot_image(X_test[digit_index])
        plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
        plot_image(outputs_val[digit_index])
In [31]:
show_reconstructed_digits(X, outputs, "./my_model_all_layers.ckpt", n_test_digits=2)
In [52]:
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 150 # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.01
l2_reg = 0.0001
activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3") # tied weights
weights4 = tf.transpose(weights1, name="weights4") # tied weights
biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")
hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)
loss = reconstruction_loss + reg_loss
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
In [53]:
n_epochs = 4
batch_size = 150
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})
        mse_train = reconstruction_loss.eval(feed_dict={X: X_batch})
        print("\r{}".format(epoch), "Train MSE:", mse_train)
        saver.save(sess, "./my_model_tied.ckpt")
In [54]:
show_reconstructed_digits(X, outputs, "./my_model_tied.ckpt")
Highlights:
Rather than training the whole stacked autoencoder in one go like we just did, it is often much faster to train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder. This is especially useful for very deep autoencoders.
There are many ways to train one Autoencoder at a time. The first approach is to train each Autoencoder in its own graph, then create the Stacked Autoencoder by simply initializing it with the weights and biases copied from these Autoencoders. Let's create a function that will train one autoencoder and return the transformed training set (i.e., the output of the hidden layer) and the model parameters.
In [55]:
def train_autoencoder(X_train, n_neurons, n_epochs, batch_size,
                      learning_rate=0.01, l2_reg=0.0005, activation_fn=tf.nn.elu):
    graph = tf.Graph()
    with graph.as_default():
        n_inputs = X_train.shape[1]
        X = tf.placeholder(tf.float32, shape=[None, n_inputs])
        with tf.contrib.framework.arg_scope(
                [fully_connected],
                activation_fn=activation_fn,
                weights_initializer=tf.contrib.layers.variance_scaling_initializer(),
                weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg)):
            hidden = fully_connected(X, n_neurons, scope="hidden")
            outputs = fully_connected(hidden, n_inputs, activation_fn=None, scope="outputs")
        mse = tf.reduce_mean(tf.square(outputs - X))
        reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
        loss = tf.add_n([mse] + reg_losses)
        optimizer = tf.train.AdamOptimizer(learning_rate)
        training_op = optimizer.minimize(loss)
        init = tf.global_variables_initializer()

    with tf.Session(graph=graph) as sess:
        init.run()
        for epoch in range(n_epochs):
            n_batches = len(X_train) // batch_size
            for iteration in range(n_batches):
                print("\r{}%".format(100 * iteration // n_batches), end="")
                sys.stdout.flush()
                indices = rnd.permutation(len(X_train))[:batch_size]
                X_batch = X_train[indices]
                sess.run(training_op, feed_dict={X: X_batch})
            mse_train = mse.eval(feed_dict={X: X_batch})
            print("\r{}".format(epoch), "Train MSE:", mse_train)
        params = dict([(var.name, var.eval()) for var in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)])
        hidden_val = hidden.eval(feed_dict={X: X_train})
        return hidden_val, params["hidden/weights:0"], params["hidden/biases:0"], params["outputs/weights:0"], params["outputs/biases:0"]
Now let's train two Autoencoders. The first one is trained on the training data, and the second is trained on the previous Autoencoder's hidden layer output:
In [56]:
hidden_output, W1, b1, W4, b4 = train_autoencoder(mnist.train.images, n_neurons=300, n_epochs=4, batch_size=150)
_, W2, b2, W3, b3 = train_autoencoder(hidden_output, n_neurons=150, n_epochs=4, batch_size=150)
Finally, we can create a Stacked Autoencoder by simply reusing the weights and biases from the Autoencoders we just trained:
In [57]:
tf.reset_default_graph()
n_inputs = 28*28
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden1 = tf.nn.elu(tf.matmul(X, W1) + b1)
hidden2 = tf.nn.elu(tf.matmul(hidden1, W2) + b2)
hidden3 = tf.nn.elu(tf.matmul(hidden2, W3) + b3)
outputs = tf.matmul(hidden3, W4) + b4
In [58]:
show_reconstructed_digits(X, outputs)
Another approach is to use a single graph. To do this, we create the graph for the full Stacked Autoencoder, but we also add operations to train each Autoencoder independently: phase 1 trains the bottom and top layers (i.e., the first Autoencoder) and phase 2 trains the two middle layers (i.e., the second Autoencoder).
In [60]:
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 150 # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.01
l2_reg = 0.0001
activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights3_init = initializer([n_hidden2, n_hidden3])
weights4_init = initializer([n_hidden3, n_outputs])
weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.Variable(weights3_init, dtype=tf.float32, name="weights3")
weights4 = tf.Variable(weights4_init, dtype=tf.float32, name="weights4")
biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")
hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4
with tf.name_scope("phase1"):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4 # bypass hidden2 and hidden3
    phase1_mse = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_mse + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)

with tf.name_scope("phase2"):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    phase2_mse = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_mse + phase2_reg_loss
    phase2_training_op = optimizer.minimize(phase2_loss,
                                            var_list=[weights2, biases2, weights3, biases3]) # freeze hidden1
init = tf.global_variables_initializer()
saver = tf.train.Saver()
In [61]:
training_ops = [phase1_training_op, phase2_training_op]
mses = [phase1_mse, phase2_mse]
n_epochs = [4, 4]
batch_sizes = [150, 150]
with tf.Session() as sess:
    init.run()
    for phase in range(2):
        print("Training phase #{}".format(phase + 1))
        for epoch in range(n_epochs[phase]):
            n_batches = mnist.train.num_examples // batch_sizes[phase]
            for iteration in range(n_batches):
                print("\r{}%".format(100 * iteration // n_batches), end="")
                sys.stdout.flush()
                X_batch, y_batch = mnist.train.next_batch(batch_sizes[phase])
                sess.run(training_ops[phase], feed_dict={X: X_batch})
            mse_train = mses[phase].eval(feed_dict={X: X_batch})
            print("\r{}".format(epoch), "Train MSE:", mse_train)
            saver.save(sess, "./my_model_one_at_a_time.ckpt")
    mse_test = mses[phase].eval(feed_dict={X: mnist.test.images})
    print("Test MSE:", mse_test)
In [62]:
show_reconstructed_digits(X, outputs, "./my_model_one_at_a_time.ckpt")
In [63]:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_one_at_a_time.ckpt")
    weights1_val = weights1.eval()
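Let's visualize some of the learned features: each column of weights1 holds the incoming weights of one neuron in the first hidden layer, so plotting row i of its transpose (reshaped to 28×28) shows which input pixels that neuron responds to most strongly.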
In [69]:
for i in range(5):
    plt.subplot(1, 5, i + 1)
    plot_image(weights1_val.T[i])
plt.show()
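Since weights1 and biases1 are frozen during phase 2, the output of hidden1 is the same for any given training instance. So instead of recomputing it at every iteration, we can compute it once for the whole training set and feed these cached activations directly to the phase 2 training op, which speeds up training considerably: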
In [70]:
training_ops = [phase1_training_op, phase2_training_op]
mses = [phase1_mse, phase2_mse]
n_epochs = [4, 4]
batch_sizes = [150, 150]
with tf.Session() as sess:
    init.run()
    for phase in range(2):
        print("Training phase #{}".format(phase + 1))
        if phase == 1:
            mnist_hidden1 = hidden1.eval(feed_dict={X: mnist.train.images})
        for epoch in range(n_epochs[phase]):
            n_batches = mnist.train.num_examples // batch_sizes[phase]
            for iteration in range(n_batches):
                print("\r{}%".format(100 * iteration // n_batches), end="")
                sys.stdout.flush()
                if phase == 1:
                    indices = rnd.permutation(len(mnist_hidden1))
                    hidden1_batch = mnist_hidden1[indices[:batch_sizes[phase]]]
                    feed_dict = {hidden1: hidden1_batch}
                    sess.run(training_ops[phase], feed_dict=feed_dict)
                else:
                    X_batch, y_batch = mnist.train.next_batch(batch_sizes[phase])
                    feed_dict = {X: X_batch}
                    sess.run(training_ops[phase], feed_dict=feed_dict)
            mse_train = mses[phase].eval(feed_dict=feed_dict)
            print("\r{}".format(epoch), "Train MSE:", mse_train)
            saver.save(sess, "./my_model_cache_frozen.ckpt")
    mse_test = mses[phase].eval(feed_dict={X: mnist.test.images})
    print("Test MSE:", mse_test)
In [71]:
show_reconstructed_digits(X, outputs, "./my_model_cache_frozen.ckpt")
If you have a large dataset but most of it is unlabeled, you can first train a stacked autoencoder using all the data, then reuse the lower layers to create a neural network for your actual task, and train it using the labeled data. The stacked autoencoder itself is typically trained one autoencoder at a time, as discussed earlier. When training the classifier, if you really don’t have much labeled training data, you may want to freeze the pretrained layers (at least the lower ones).
In [72]:
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 150
n_outputs = 10
learning_rate = 0.01
l2_reg = 0.0005
activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
y = tf.placeholder(tf.int32, shape=[None])
weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights3_init = initializer([n_hidden2, n_outputs])
weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.Variable(weights3_init, dtype=tf.float32, name="weights3")
biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_outputs), name="biases3")
hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
logits = tf.matmul(hidden2, weights3) + biases3
cross_entropy = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
reg_loss = regularizer(weights1) + regularizer(weights2) + regularizer(weights3)
loss = cross_entropy + reg_loss
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
pretrain_saver = tf.train.Saver([weights1, weights2, biases1, biases2])
saver = tf.train.Saver()
Regular training (without pretraining):
In [74]:
n_epochs = 4
batch_size = 150
n_labeled_instances = 20000
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = n_labeled_instances // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            indices = rnd.permutation(n_labeled_instances)[:batch_size]
            X_batch, y_batch = mnist.train.images[indices], mnist.train.labels[indices]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        print("\r{}".format(epoch), "Train accuracy:", accuracy_val, end=" ")
        saver.save(sess, "./my_model_supervised.ckpt")
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print("Test accuracy:", accuracy_val)
Now reusing the first two layers of the autoencoder we pretrained:
In [76]:
n_epochs = 4
batch_size = 150
n_labeled_instances = 20000
#training_op = optimizer.minimize(loss, var_list=[weights3, biases3]) # Freeze layers 1 and 2 (optional)
with tf.Session() as sess:
    init.run()
    pretrain_saver.restore(sess, "./my_model_cache_frozen.ckpt")
    for epoch in range(n_epochs):
        n_batches = n_labeled_instances // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            indices = rnd.permutation(n_labeled_instances)[:batch_size]
            X_batch, y_batch = mnist.train.images[indices], mnist.train.labels[indices]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        print("\r{}".format(epoch), "Train accuracy:", accuracy_val, end="\t")
        saver.save(sess, "./my_model_supervised_pretrained.ckpt")
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print("Test accuracy:", accuracy_val)
One of the triggers of the current Deep Learning tsunami was the discovery by Hinton et al. (2006) that deep neural networks can be pretrained in an unsupervised fashion. They used restricted Boltzmann machines for that, but Bengio et al. (2006) showed that autoencoders work just as well. There is nothing special about the TensorFlow implementation: just train an autoencoder using all the training data, then reuse its encoder layers to create a new neural network. Up to now, in order to force the autoencoder to learn interesting features, we have limited the size of the coding layer, making it undercomplete.
Another way to force the autoencoder to learn useful features is to add noise to its inputs, training it to recover the original, noise-free inputs. This prevents the autoencoder from trivially copying its inputs to its outputs, so it ends up having to find patterns in the data. The idea of using autoencoders to remove noise has been around since the 1980s (e.g., it is mentioned in Yann LeCun's 1987 master's thesis). Vincent et al. (2008) showed that autoencoders could also be used for feature extraction, and Vincent et al. (2010) introduced stacked denoising autoencoders.
In [98]:
import math
tf.reset_default_graph()
from tensorflow.contrib.layers import dropout
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 150 # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.01
l2_reg = 0.00001
keep_prob = 0.7
activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
is_training = tf.placeholder_with_default(False, shape=(), name='is_training')
X_noisy = tf.cond(is_training,
                  lambda: X + tf.random_normal(shape=tf.shape(X), mean=0.0,
                                               stddev=1 / math.sqrt(n_hidden1)), # Gaussian noise
                  lambda: X)
weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3") # tied weights
weights4 = tf.transpose(weights1, name="weights4") # tied weights
biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")
hidden1 = activation(tf.matmul(X_noisy, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4
optimizer = tf.train.AdamOptimizer(learning_rate)
mse = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)
loss = mse + reg_loss
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
In [99]:
n_epochs = 10
batch_size = 150
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, is_training: True})
        mse_train = mse.eval(feed_dict={X: X_batch, is_training: False})
        print("\r{}".format(epoch), "Train MSE:", mse_train)
        saver.save(sess, "./my_model_stacked_denoising.ckpt")
In [100]:
show_reconstructed_digits(X, outputs, "./my_model_stacked_denoising.ckpt")
Note: since the shape of X is only partially defined during the construction phase, we cannot know in advance the shape of the noise that we must add to X. We cannot call X.get_shape() because this would just return the partially defined shape of X ([None, n_inputs]), and random_normal() expects a fully defined shape, so it would raise an exception. Instead, we call tf.shape(X), which creates an operation that will return the shape of X at runtime, fully defined at that point.
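A minimal sketch of the difference (using a hypothetical X_demo placeholder, assuming the same TF 1.x API):

# Static shape: known at graph construction time, may be partially defined.
X_demo = tf.placeholder(tf.float32, shape=[None, 28 * 28])
print(X_demo.get_shape())            # (?, 784) -- the batch dimension is unknown

# Dynamic shape: an op that returns the fully defined shape at runtime,
# so it can safely be used to size the noise tensor.
noise_demo = tf.random_normal(tf.shape(X_demo))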
In [80]:
tf.reset_default_graph()
from tensorflow.contrib.layers import dropout
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 150 # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.01
l2_reg = 0.00001
keep_prob = 0.7
activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
is_training = tf.placeholder_with_default(False, shape=(), name='is_training')
X_drop = dropout(X, keep_prob, is_training=is_training)
weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3") # tied weights
weights4 = tf.transpose(weights1, name="weights4") # tied weights
biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")
hidden1 = activation(tf.matmul(X_drop, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4
optimizer = tf.train.AdamOptimizer(learning_rate)
mse = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)
loss = mse + reg_loss
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
In [78]:
n_epochs = 10
batch_size = 150
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, is_training: True})
        mse_train = mse.eval(feed_dict={X: X_batch, is_training: False})
        print("\r{}".format(epoch), "Train MSE:", mse_train)
        saver.save(sess, "./my_model_stacked_denoising.ckpt")
In [79]:
show_reconstructed_digits(X, outputs, "./my_model_stacked_denoising.ckpt")
Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to).

In order to favor sparse models, we must first measure the actual sparsity of the coding layer at each training iteration. We do so by computing the average activation of each neuron in the coding layer over the whole training batch. The batch size must not be too small, or else the mean will not be accurate.

Once we have the mean activation per neuron, we want to penalize the neurons that are too active by adding a sparsity loss to the cost function. For example, if we measure that a neuron has an average activation of 0.3 but the target sparsity is 0.1, it must be penalized to activate less. One approach would be to simply add the squared error (0.3 – 0.1)² to the cost function, but in practice a better approach is to use the Kullback–Leibler divergence, which has much stronger gradients than the Mean Squared Error.
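Concretely, for a target sparsity $p$ and a measured mean activation $q$, the sparsity loss sums, over all coding neurons, the KL divergence between two Bernoulli distributions with those means (this is exactly what the kl_divergence() function below computes):

$$ D_{\mathrm{KL}}(p \,\|\, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} $$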
In [101]:
p = 0.1
q = np.linspace(0, 1, 500)
kl_div = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
mse = (p - q)**2
plt.plot([p, p], [0, 0.3], "k:")
plt.text(0.05, 0.32, "Target\nsparsity", fontsize=14)
plt.plot(q, kl_div, "b-", label="KL divergence")
plt.plot(q, mse, "r--", label="MSE")
plt.legend(loc="upper left")
plt.xlabel("Actual sparsity")
plt.ylabel("Cost", rotation=0)
plt.axis([0, 1, 0, 0.95])
In [102]:
def kl_divergence(p, q):
    """Kullback-Leibler divergence"""
    return p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))
In [103]:
tf.reset_default_graph()
n_inputs = 28 * 28
n_hidden1 = 1000 # sparse codings
n_outputs = n_inputs
learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2
#activation = tf.nn.softplus # soft variant of ReLU
activation = tf.nn.sigmoid
initializer = tf.contrib.layers.variance_scaling_initializer()
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_outputs])
weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_outputs), name="biases2")
hidden1 = activation(tf.matmul(X, weights1) + biases1)
outputs = tf.matmul(hidden1, weights2) + biases2
optimizer = tf.train.AdamOptimizer(learning_rate)
mse = tf.reduce_mean(tf.square(outputs - X))
hidden1_mean = tf.reduce_mean(hidden1, axis=0) # batch mean
sparsity_loss = tf.reduce_sum(kl_divergence(sparsity_target, hidden1_mean))
loss = mse + sparsity_weight * sparsity_loss
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
In [104]:
n_epochs = 100
batch_size = 1000
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})
        mse_val, sparsity_loss_val, loss_val = sess.run([mse, sparsity_loss, loss], feed_dict={X: X_batch})
        print("\r{}".format(epoch), "Train MSE:", mse_val, "\tSparsity loss:", sparsity_loss_val, "\tTotal loss:", loss_val)
        saver.save(sess, "./my_model_sparse.ckpt")
In [105]:
show_reconstructed_digits(X, outputs, "./my_model_sparse.ckpt")
Another important category of autoencoders was introduced by Kingma and Welling (2014) and has quickly become one of the most popular types of autoencoders: variational autoencoders. They are quite different from all the autoencoders we have discussed so far. They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training (as opposed to denoising autoencoders, which use randomness only during training). Most importantly, they are generative autoencoders, meaning that they can generate new instances that look like they were sampled from the training set. Both of these properties make them rather similar to Restricted Boltzmann Machines (RBMs), but they are easier to train and the sampling process is much faster.
Let's take a look at how they work. You can recognize, of course, the basic structure of all autoencoders, with an encoder followed by a decoder, but there is a twist: instead of directly producing a coding for a given input, the encoder produces a mean coding μ and a standard deviation σ. The actual coding is then sampled randomly from a Gaussian distribution with mean μ and standard deviation σ. After that, the decoder just decodes the sampled coding normally.
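In the implementations below, the encoder actually outputs $\gamma = \log(\sigma^2)$ rather than $\sigma$ itself, and the cost function adds a latent loss: the KL divergence between the coding distribution $\mathcal{N}(\mu, \sigma^2)$ and the standard Gaussian. With this parametrization (matching the latent_loss lines in the code):

$$ \text{latent loss} = \frac{1}{2} \sum_{i} \left( e^{\gamma_i} + \mu_i^2 - 1 - \gamma_i \right) $$

Minimizing this term keeps the codings close to a standard Gaussian, which is what makes it possible to generate new digits later simply by sampling codings from $\mathcal{N}(0, I)$.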
In [106]:
tf.reset_default_graph()
n_inputs = 28*28
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20 # codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.001
activation = tf.nn.elu
initializer = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG", uniform=True)
X = tf.placeholder(tf.float32, [None, n_inputs])
weights1 = tf.Variable(initializer([n_inputs, n_hidden1]))
weights2 = tf.Variable(initializer([n_hidden1, n_hidden2]))
weights3_mean = tf.Variable(initializer([n_hidden2, n_hidden3]))
weights3_log_sigma = tf.Variable(initializer([n_hidden2, n_hidden3]))
weights4 = tf.Variable(initializer([n_hidden3, n_hidden4]))
weights5 = tf.Variable(initializer([n_hidden4, n_hidden5]))
weights6 = tf.Variable(initializer([n_hidden5, n_inputs]))
biases1 = tf.Variable(tf.zeros([n_hidden1], dtype=tf.float32))
biases2 = tf.Variable(tf.zeros([n_hidden2], dtype=tf.float32))
biases3_mean = tf.Variable(tf.zeros([n_hidden3], dtype=tf.float32))
biases3_log_sigma = tf.Variable(tf.zeros([n_hidden3], dtype=tf.float32))
biases4 = tf.Variable(tf.zeros([n_hidden4], dtype=tf.float32))
biases5 = tf.Variable(tf.zeros([n_hidden5], dtype=tf.float32))
biases6 = tf.Variable(tf.zeros([n_inputs], dtype=tf.float32))
hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3_mean = tf.matmul(hidden2, weights3_mean) + biases3_mean
hidden3_log_sigma = tf.matmul(hidden2, weights3_log_sigma) + biases3_log_sigma
noise = tf.random_normal(tf.shape(hidden3_log_sigma), dtype=tf.float32)
hidden3 = hidden3_mean + tf.sqrt(tf.exp(hidden3_log_sigma)) * noise
hidden4 = activation(tf.matmul(hidden3, weights4) + biases4)
hidden5 = activation(tf.matmul(hidden4, weights5) + biases5)
logits = tf.matmul(hidden5, weights6) + biases6
outputs = tf.sigmoid(logits)
reconstruction_loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))
latent_loss = 0.5 * tf.reduce_sum(tf.exp(hidden3_log_sigma) + tf.square(hidden3_mean) - 1 - hidden3_log_sigma)
cost = reconstruction_loss + latent_loss
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(cost)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
In [107]:
tf.reset_default_graph()
n_inputs = 28*28
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20 # codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.001
initializer = tf.contrib.layers.variance_scaling_initializer()
with tf.contrib.framework.arg_scope([fully_connected],
                                    activation_fn=tf.nn.elu,
                                    weights_initializer=initializer):
    X = tf.placeholder(tf.float32, [None, n_inputs])
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2)
    hidden3_mean = fully_connected(hidden2, n_hidden3, activation_fn=None)
    hidden3_gamma = fully_connected(hidden2, n_hidden3, activation_fn=None)
    noise = tf.random_normal(tf.shape(hidden3_gamma), dtype=tf.float32)
    hidden3 = hidden3_mean + tf.exp(0.5 * hidden3_gamma) * noise
    hidden4 = fully_connected(hidden3, n_hidden4)
    hidden5 = fully_connected(hidden4, n_hidden5)
    logits = fully_connected(hidden5, n_outputs, activation_fn=None)
    outputs = tf.sigmoid(logits)
reconstruction_loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))
latent_loss = 0.5 * tf.reduce_sum(tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)
cost = reconstruction_loss + latent_loss
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(cost)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
In [108]:
n_epochs = 50
batch_size = 150
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})
        cost_val, reconstruction_loss_val, latent_loss_val = sess.run([cost, reconstruction_loss, latent_loss], feed_dict={X: X_batch})
        print("\r{}".format(epoch), "Train cost:", cost_val, "\tReconstruction loss:", reconstruction_loss_val, "\tLatent loss:", latent_loss_val)
        saver.save(sess, "./my_model_variational.ckpt")
Encode:
In [109]:
n_digits = 3
X_test, y_test = mnist.test.next_batch(batch_size)
codings = hidden3
with tf.Session() as sess:
    saver.restore(sess, "./my_model_variational.ckpt")
    codings_val = codings.eval(feed_dict={X: X_test})
Decode:
In [110]:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_variational.ckpt")
    outputs_val = outputs.eval(feed_dict={codings: codings_val})
Let's plot the reconstructions:
In [111]:
fig = plt.figure(figsize=(8, 2.5 * n_digits))
for iteration in range(n_digits):
    plt.subplot(n_digits, 2, 1 + 2 * iteration)
    plot_image(X_test[iteration])
    plt.subplot(n_digits, 2, 2 + 2 * iteration)
    plot_image(outputs_val[iteration])
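Now let's generate new digits: we sample random codings from a standard Gaussian distribution and feed them to the decoder: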
In [112]:
n_rows = 6
n_cols = 10
n_digits = n_rows * n_cols
codings_rnd = np.random.normal(size=[n_digits, n_hidden3])
with tf.Session() as sess:
    saver.restore(sess, "./my_model_variational.ckpt")
    outputs_val = outputs.eval(feed_dict={codings: codings_rnd})
In [115]:
def plot_multiple_images(images, n_rows, n_cols, pad=2):
    images = images - images.min() # make the minimum == 0, so the padding looks white
    w, h = images.shape[1:]
    image = np.zeros(((w + pad) * n_rows + pad, (h + pad) * n_cols + pad))
    for y in range(n_rows):
        for x in range(n_cols):
            image[(y * (h + pad) + pad):(y * (h + pad) + pad + h), (x * (w + pad) + pad):(x * (w + pad) + pad + w)] = images[y * n_cols + x]
    plt.imshow(image, cmap="Greys", interpolation="nearest")
    plt.axis("off")

plot_multiple_images(outputs_val.reshape(-1, 28, 28), n_rows, n_cols)
plt.show()
plot_multiple_images(outputs_val.reshape(-1, 28, 28), n_rows, n_cols)
plt.show()
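Finally, let's interpolate between consecutive random codings, decoding and plotting each intermediate step along the way: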
In [118]:
n_iterations = 3
n_digits = 6
codings_rnd = np.random.normal(size=[n_digits, n_hidden3])
with tf.Session() as sess:
    saver.restore(sess, "./my_model_variational.ckpt")
    target_codings = np.roll(codings_rnd, -1, axis=0)
    for iteration in range(n_iterations + 1):
        codings_interpolate = codings_rnd + (target_codings - codings_rnd) * iteration / n_iterations
        outputs_val = outputs.eval(feed_dict={codings: codings_interpolate})
        plt.figure(figsize=(11, 1.5 * n_iterations))
        for digit_index in range(n_digits):
            plt.subplot(1, n_digits, digit_index + 1)
            plot_image(outputs_val[digit_index])
        plt.show()
plt.show()