Session 3: Unsupervised and Supervised Learning

Assignment: Build Unsupervised and Supervised Networks

Parag K. Mital
Creative Applications of Deep Learning w/ Tensorflow
Kadenze Academy
#CADL

Learning Goals

  • Learn how to build an autoencoder
  • Learn how to explore latent/hidden representations of an autoencoder.
  • Learn how to build a classification network using softmax and onehot encoding

Outline

This next section will just make sure you have the right version of python and the libraries that we'll be using. Don't change the code here but make sure you "run" it (use "shift+enter")!


In [31]:
# First check the Python version
import sys
if sys.version_info < (3,4):
    print('You are running an older version of Python!\n\n' \
          'You should consider updating to Python 3.4.0 or ' \
          'higher as the libraries built for this course ' \
          'have only been tested in Python 3.4 and higher.\n')
    print('Try installing the Python 3.5 version of anaconda '
          'and then restart `jupyter notebook`:\n' \
          'https://www.continuum.io/downloads\n\n')

# Now get necessary libraries
try:
    import os
    import numpy as np
    import matplotlib.pyplot as plt
    from skimage.transform import resize
    from skimage import data
    from scipy.misc import imresize
    import IPython.display as ipyd
except ImportError:
    print('You are missing some packages! ' \
          'We will try installing them before continuing!')
    !pip install "numpy>=1.11.0" "matplotlib>=1.5.1" "scikit-image>=0.11.3" "scikit-learn>=0.17" "scipy>=0.17.0"
    import os
    import numpy as np
    import matplotlib.pyplot as plt
    from skimage.transform import resize
    from skimage import data
    from scipy.misc import imresize
    import IPython.display as ipyd
    print('Done!')

# Import Tensorflow
try:
    import tensorflow as tf
except ImportError:
    print("You do not have tensorflow installed!")
    print("Follow the instructions on the following link")
    print("to install tensorflow before continuing:")
    print("")
    print("https://github.com/pkmital/CADL#installation-preliminaries")

# This cell includes the provided libraries from the zip file
# and a library for displaying images from ipython, which
# we will use to display the gif
try:
    from libs import utils, gif, datasets, dataset_utils, vae, dft
except ImportError:
    print("Make sure you have started notebook in the same directory" +
          " as the provided zip file which includes the 'libs' folder" +
          " and the file 'utils.py' inside of it.  You will NOT be able"
          " to complete this assignment unless you restart jupyter"
          " notebook inside the directory created by extracting"
          " the zip file or cloning the github repo.")

# We'll tell matplotlib to inline any drawn figures like so:
%matplotlib inline
plt.style.use('ggplot')

In [2]:
# Bit of formatting because I don't like the default inline code style:
from IPython.core.display import HTML
HTML("""<style> .rendered_html code { 
    padding: 2px 4px;
    color: #c7254e;
    background-color: #f9f2f4;
    border-radius: 4px;
} </style>""")


Out[2]:

Assignment Synopsis

In the last session we created our first neural network. We saw that in order to create a neural network, we needed to define a cost function which would allow gradient descent to optimize all the parameters in our network. We also saw how neural networks become much more expressive by introducing series of linearities followed by non-linearities, or activation functions. We then explored a fun application of neural networks using regression to learn to paint color values given x, y positions. This allowed us to build up a sort of painterly like version of an image.

In this session, we'll see how to construct a few more types of neural networks. First, we'll explore a generative network called autoencoders. This network can be extended in a variety of ways to include convolution, denoising, or a variational layer. In Part Two, you'll then use a general autoencoder framework to encode your own list of images. In Part three, we'll then explore a discriminative network used for classification, and see how this can be used for audio classification of music or speech.

One main difference between these two networks are the data that we'll use to train them. In the first case, we will only work with "unlabeled" data and perform unsupervised learning. An example would be a collection of images, just like the one you created for assignment 1. Contrast this with "labeled" data which allows us to make use of supervised learning. For instance, we're given both images, and some other data about those images such as some text describing what object is in the image. This allows us to optimize a network where we model a distribution over the images given that it should be labeled as something. This is often a much simpler distribution to train, but with the expense of it being much harder to collect.

One of the major directions of future research will be in how to better make use of unlabeled data and unsupervised learning methods.

Part One - Autoencoders

Instructions

Work with a dataset of images and train an autoencoder. You can work with the same dataset from assignment 1, or try a larger dataset. But be careful with the image sizes, and make sure to keep it relatively small (e.g. < 200 x 200 px).

Recall from the lecture that autoencoders are great at "compressing" information. The network's construction and cost function are just like what we've done in the last session. The network is composed of a series of matrix multiplications and nonlinearities. The only difference is the output of the network has exactly the same shape as what is input. This allows us to train the network by saying that the output of the network needs to be just like the input to it, so that it tries to "compress" all the information in that video.

Autoencoders have some great potential for creative applications, as they allow us to compress a dataset of information and even generate new data from that encoding. We'll see exactly how to do this with a basic autoencoder, and then you'll be asked to explore some of the extensions to produce your own encodings.

Code

We'll now go through the process of building an autoencoder just like in the lecture. First, let's load some data. You can use the first 100 images of the Celeb Net, your own dataset, or anything else approximately under 1,000 images. Make sure you resize the images so that they are <= 200x200 pixels, otherwise the training will be very slow, and the montages we create will be too large.


In [3]:
def crop_edge(img, cropped_rate):
    """Crop arbitrary amount of pixel.
    """
    row_i = int(img.shape[0] * cropped_rate) // 2
    col_i = int(img.shape[1] * cropped_rate) // 2
    return img[row_i:-row_i, col_i:-col_i]

In [4]:
# See how this works w/ Celeb Images or try your own dataset instead:
bird_files = [os.path.join('../data/pokemon/jpeg/', file_i)
              for file_i in os.listdir('../data/pokemon/jpeg/')
              if '.jpg' in file_i]

nb_clip = 100
bird_files = bird_files[:nb_clip]
imgs = [imresize(crop_edge(plt.imread(f), 0.4), (100, 100)) for f in bird_files]

# Then convert the list of images to a 4d array (e.g. use np.array to convert a list to a 4d array):
Xs = np.array(imgs)

print(Xs.shape)
assert(Xs.ndim == 4 and Xs.shape[1] <= 250 and Xs.shape[2] <= 250)

plt.figure(figsize=(10, 10))
plt.imshow(utils.montage(imgs).astype(np.uint8))


(100, 100, 100, 3)
Out[4]:
<matplotlib.image.AxesImage at 0x7f86151351d0>

We'll now make use of something I've written to help us store this data. It provides some interfaces for generating "batches" of data, as well as splitting the data into training, validation, and testing sets. To use it, we pass in the data and optionally its labels. If we don't have labels, we just pass in the data. In the second half of this notebook, we'll explore using a dataset's labels as well.


In [5]:
ds = datasets.Dataset(Xs)
# ds = datasets.CIFAR10(flatten=False)

It allows us to easily find the mean:


In [6]:
mean_img = ds.mean().astype(np.uint8)
plt.imshow(mean_img)
print(ds.mean().shape)


(100, 100, 3)

Or the deviation:


In [7]:
std_img = ds.std() #.astype(np.uint8)
plt.imshow(std_img)
print(std_img.shape)


(100, 100, 3)

Recall we can calculate the mean of the standard deviation across each color channel:


In [8]:
std_img = np.mean(std_img, axis=2)
plt.imshow(std_img)


Out[8]:
<matplotlib.image.AxesImage at 0x7f8611fa9358>

All the input data we gave as input to our Datasets object, previously stored in Xs is now stored in a variable as part of our ds Datasets object, X:


In [9]:
plt.imshow(ds.X[0])
print(ds.X[0].shape)
print(ds.X.shape)


(100, 100, 3)
(100, 100, 100, 3)

It takes a parameter, split at the time of creation, which allows us to create train/valid/test sets. By default, this is set to [1.0, 0.0, 0.0], which means to take all the data in the train set, and nothing in the validation and testing sets. We can access "batch generators" of each of these sets by saying: ds.train.next_batch. A generator is a really powerful way of handling iteration in Python. If you are unfamiliar with the idea of generators, I recommend reading up a little bit on it, e.g. here: http://intermediatepythonista.com/python-generators - think of it as a for loop, but as a function. It returns one iteration of the loop each time you call it.

This generator will automatically handle the randomization of the dataset. Let's try looping over the dataset using the batch generator:


In [10]:
for (X, y) in ds.train.next_batch(batch_size=10):
    print(X.shape)


(10, 100, 100, 3)
(10, 100, 100, 3)
(10, 100, 100, 3)
(10, 100, 100, 3)
(10, 100, 100, 3)
(10, 100, 100, 3)
(10, 100, 100, 3)
(10, 100, 100, 3)
(10, 100, 100, 3)
(10, 100, 100, 3)

This returns X and y as a tuple. Since we're not using labels, we'll just ignore this. The next_batch method takes a parameter, batch_size, which we'll set appropriately to our batch size. Notice it runs for exactly 10 iterations to iterate over our 100 examples, then the loop exits. The order in which it iterates over the 100 examples is randomized each time you iterate.

Write two functions to preprocess (normalize) any given image, and to unprocess it, i.e. unnormalize it by removing the normalization. The preprocess function should perform exactly the task you learned to do in assignment 1: subtract the mean, then divide by the standard deviation. The deprocess function should take the preprocessed image and undo the preprocessing steps. Recall that the ds object contains the mean and std functions for access the mean and standarad deviation. We'll be using the preprocess and deprocess functions on the input and outputs of the network. Note, we could use Tensorflow to do this instead of numpy, but for sake of clarity, I'm keeping this separate from the Tensorflow graph.


In [11]:
# Write a function to preprocess/normalize an image, given its dataset object
# (which stores the mean and standard deviation!)
def preprocess(img, ds):
    norm_img = (img - ds.mean()) / ds.std()
    return norm_img

# Write a function to undo the normalization of an image, given its dataset object
# (which stores the mean and standard deviation!)
def deprocess(norm_img, ds):
    img = norm_img * ds.std() + ds.mean()
    return img

# Just to make sure that you've coded the previous two functions correctly:
assert(np.allclose(deprocess(preprocess(ds.X[0], ds), ds), ds.X[0]))
plt.imshow(deprocess(preprocess(ds.X[0], ds), ds).astype(np.uint8))


Out[11]:
<matplotlib.image.AxesImage at 0x7f8611efe7b8>

We're going to now work on creating an autoencoder. To start, we'll only use linear connections, like in the last assignment. This means, we need a 2-dimensional input: Batch Size x Number of Features. We currently have a 4-dimensional input: Batch Size x Height x Width x Channels. We'll have to calculate the number of features we have to help construct the Tensorflow Graph for our autoencoder neural network. Then, when we are ready to train the network, we'll reshape our 4-dimensional dataset into a 2-dimensional one when feeding the input of the network. Optionally, we could create a tf.reshape as the first operation of the network, so that we can still pass in our 4-dimensional array, and the Tensorflow graph would reshape it for us. We'll try the former method, by reshaping manually, and then you can explore the latter method, of handling 4-dimensional inputs on your own.


In [12]:
# Calculate the number of features in your image.
# This is the total number of pixels, or (height x width x channels).
height = ds.X[0].shape[0]
width = ds.X[0].shape[1]
channels = ds.X[0].shape[2]

n_features = height * width * channels
print(n_features)


30000

Let's create a list of how many neurons we want in each layer. This should be for just one half of the network, the encoder only. It should start large, then get smaller and smaller. We're also going to try an encode our dataset to an inner layer of just 2 values. So from our number of features, we'll go all the way down to expressing that image by just 2 values. Try the values I've put here for the celeb dataset, then explore your own values:


In [13]:
# encoder_dimensions = [1024, 256, 64, 2]

# encoder_dimensions = [1024, 64, 4, 2]
encoder_dimensions = [1024, 64, 4]
# encoder_dimensions = [1024, 512, 256, 128, 64, 32, 16, 8]

Now create a placeholder just like in the last session in the tensorflow graph that will be able to get any number (None) of n_features inputs.


In [14]:
tf.reset_default_graph()

In [15]:
X = tf.placeholder(tf.float32, shape = (None, n_features), name = "X")
                   
assert(X.get_shape().as_list() == [None, n_features])

Now complete the function encode below. This takes as input our input placeholder, X, our list of dimensions, and an activation function, e.g. tf.nn.relu or tf.nn.tanh, to apply to each layer's output, and creates a series of fully connected layers. This works just like in the last session! We multiply our input, add a bias, then apply a non-linearity. Instead of having 20 neurons in each layer, we're going to use our dimensions list to tell us how many neurons we want in each layer.

One important difference is that we're going to also store every weight matrix we create! This is so that we can use the same weight matrices when we go to build our decoder. This is a very powerful concept that creeps up in a few different neural network architectures called weight sharing. Weight sharing isn't necessary to do of course, but can speed up training and offer a different set of features depending on your dataset. Explore trying both. We'll also see how another form of weight sharing works in convolutional networks.


In [16]:
def encode(X, dimensions, activation=tf.nn.tanh):
    # We're going to keep every matrix we create so let's create a list to hold them all
    Ws = []

    # We'll create a for loop to create each layer:
    for layer_i, n_output in enumerate(dimensions):

        # This will simply prefix all the variables made in this scope
        # with the name we give it.  Make sure it is a unique name
        # for each layer, e.g., 'encoder/layer1', 'encoder/layer2', or
        # 'encoder/1', 'encoder/2',... 
        with tf.variable_scope("encode/layer" + str(layer_i + 1)):

            # Create a weight matrix which will increasingly reduce
            # down the amount of information in the input by performing
            # a matrix multiplication.  You can use the utils.linear function.
            h, W = utils.linear(X, dimensions[layer_i])

            # Finally we'll store the weight matrix.
            # We need to keep track of all
            # the weight matrices we've used in our encoder
            # so that we can build the decoder using the
            # same weight matrices.
            Ws.append(W)
            
            # Replace X with the current layer's output, so we can
            # use it in the next layer.
            X = h
    
    z = X
    return Ws, z

In [17]:
# Then call the function
Ws, z = encode(X, encoder_dimensions)

# And just some checks to make sure you've done it right.
# assert(z.get_shape().as_list() == [None, 2])
# assert(len(Ws) == len(encoder_dimensions))

Let's take a look at the graph:


In [18]:
[op.name for op in tf.get_default_graph().get_operations()]


Out[18]:
['X',
 'encode/layer1/fc/W/Initializer/random_uniform/shape',
 'encode/layer1/fc/W/Initializer/random_uniform/min',
 'encode/layer1/fc/W/Initializer/random_uniform/max',
 'encode/layer1/fc/W/Initializer/random_uniform/RandomUniform',
 'encode/layer1/fc/W/Initializer/random_uniform/sub',
 'encode/layer1/fc/W/Initializer/random_uniform/mul',
 'encode/layer1/fc/W/Initializer/random_uniform',
 'encode/layer1/fc/W',
 'encode/layer1/fc/W/Assign',
 'encode/layer1/fc/W/read',
 'encode/layer1/fc/b/Initializer/Const',
 'encode/layer1/fc/b',
 'encode/layer1/fc/b/Assign',
 'encode/layer1/fc/b/read',
 'encode/layer1/fc/MatMul',
 'encode/layer1/fc/h',
 'encode/layer2/fc/W/Initializer/random_uniform/shape',
 'encode/layer2/fc/W/Initializer/random_uniform/min',
 'encode/layer2/fc/W/Initializer/random_uniform/max',
 'encode/layer2/fc/W/Initializer/random_uniform/RandomUniform',
 'encode/layer2/fc/W/Initializer/random_uniform/sub',
 'encode/layer2/fc/W/Initializer/random_uniform/mul',
 'encode/layer2/fc/W/Initializer/random_uniform',
 'encode/layer2/fc/W',
 'encode/layer2/fc/W/Assign',
 'encode/layer2/fc/W/read',
 'encode/layer2/fc/b/Initializer/Const',
 'encode/layer2/fc/b',
 'encode/layer2/fc/b/Assign',
 'encode/layer2/fc/b/read',
 'encode/layer2/fc/MatMul',
 'encode/layer2/fc/h',
 'encode/layer3/fc/W/Initializer/random_uniform/shape',
 'encode/layer3/fc/W/Initializer/random_uniform/min',
 'encode/layer3/fc/W/Initializer/random_uniform/max',
 'encode/layer3/fc/W/Initializer/random_uniform/RandomUniform',
 'encode/layer3/fc/W/Initializer/random_uniform/sub',
 'encode/layer3/fc/W/Initializer/random_uniform/mul',
 'encode/layer3/fc/W/Initializer/random_uniform',
 'encode/layer3/fc/W',
 'encode/layer3/fc/W/Assign',
 'encode/layer3/fc/W/read',
 'encode/layer3/fc/b/Initializer/Const',
 'encode/layer3/fc/b',
 'encode/layer3/fc/b/Assign',
 'encode/layer3/fc/b/read',
 'encode/layer3/fc/MatMul',
 'encode/layer3/fc/h']

So we've created a few layers, encoding our input X all the way down to 2 values in the tensor z. We do this by multiplying our input X by a set of matrices shaped as:


In [19]:
[W_i.get_shape().as_list() for W_i in Ws]


Out[19]:
[[30000, 1024], [1024, 64], [64, 4]]

Resulting in a layer which is shaped as:


In [20]:
z.get_shape().as_list()


Out[20]:
[None, 4]

Building the Decoder

Here is a helpful animation on what the matrix "transpose" operation does:

Basically what is happening is rows becomes columns, and vice-versa. We're going to use our existing weight matrices but transpose them so that we can go in the opposite direction. In order to build our decoder, we'll have to do the opposite of what we've just done, multiplying z by the transpose of our weight matrices, to get back to a reconstructed version of X. First, we'll reverse the order of our weight matrics, and then append to the list of dimensions the final output layer's shape to match our input:


In [21]:
# We'll first reverse the order of our weight matrices
decoder_Ws = Ws[::-1]

# then reverse the order of our dimensions
# appending the last layers number of inputs.
decoder_dimensions = encoder_dimensions[::-1][1:] + [n_features]
print(decoder_dimensions)

assert(decoder_dimensions[-1] == n_features)


[64, 1024, 30000]

Now we'll build the decoder. I've shown you how to do this. Read through the code to fully understand what it is doing:


In [22]:
def decode(z, dimensions, Ws, activation=tf.nn.tanh):
    current_input = z
    for layer_i, n_output in enumerate(dimensions):
        # we'll use a variable scope again to help encapsulate our variables
        # This will simply prefix all the variables made in this scope
        # with the name we give it.
        with tf.variable_scope("decoder/layer/{}".format(layer_i)):

            # Now we'll grab the weight matrix we created before and transpose it
            # So a 3072 x 784 matrix would become 784 x 3072
            # or a 256 x 64 matrix, would become 64 x 256
            W = tf.transpose(Ws[layer_i])

            # Now we'll multiply our input by our transposed W matrix
            h = tf.matmul(current_input, W)

            # And then use a relu activation function on its output
            current_input = activation(h)

            # We'll also replace n_input with the current n_output, so that on the
            # next iteration, our new number inputs will be correct.
            n_input = n_output
    Y = current_input
    return Y

In [23]:
Y = decode(z, decoder_dimensions, decoder_Ws)

Let's take a look at the new operations we've just added. They will all be prefixed by "decoder" so we can use list comprehension to help us with this:


In [24]:
[op.name for op in tf.get_default_graph().get_operations()
 if op.name.startswith('decoder')]


Out[24]:
['decoder/layer/0/transpose/Rank',
 'decoder/layer/0/transpose/sub/y',
 'decoder/layer/0/transpose/sub',
 'decoder/layer/0/transpose/Range/start',
 'decoder/layer/0/transpose/Range/delta',
 'decoder/layer/0/transpose/Range',
 'decoder/layer/0/transpose/sub_1',
 'decoder/layer/0/transpose',
 'decoder/layer/0/MatMul',
 'decoder/layer/0/Tanh',
 'decoder/layer/1/transpose/Rank',
 'decoder/layer/1/transpose/sub/y',
 'decoder/layer/1/transpose/sub',
 'decoder/layer/1/transpose/Range/start',
 'decoder/layer/1/transpose/Range/delta',
 'decoder/layer/1/transpose/Range',
 'decoder/layer/1/transpose/sub_1',
 'decoder/layer/1/transpose',
 'decoder/layer/1/MatMul',
 'decoder/layer/1/Tanh',
 'decoder/layer/2/transpose/Rank',
 'decoder/layer/2/transpose/sub/y',
 'decoder/layer/2/transpose/sub',
 'decoder/layer/2/transpose/Range/start',
 'decoder/layer/2/transpose/Range/delta',
 'decoder/layer/2/transpose/Range',
 'decoder/layer/2/transpose/sub_1',
 'decoder/layer/2/transpose',
 'decoder/layer/2/MatMul',
 'decoder/layer/2/Tanh']

And let's take a look at the output of the autoencoder:


In [25]:
Y.get_shape().as_list()


Out[25]:
[None, 30000]

Great! So we should have a synthesized version of our input placeholder, X, inside of Y. This Y is the result of many matrix multiplications, first a series of multiplications in our encoder all the way down to 2 dimensions, and then back to the original dimensions through our decoder. Let's now create a pixel-to-pixel measure of error. This should measure the difference in our synthesized output, Y, and our input, X. You can use the $l_1$ or $l_2$ norm, just like in assignment 2. If you don't remember, go back to homework 2 where we calculated the cost function and try the same idea here.


In [26]:
# Calculate some measure of loss, e.g. the pixel to pixel absolute difference or squared difference
loss = tf.squared_difference(X, Y)

# Now sum over every pixel and then calculate the mean over the batch dimension (just like session 2!)
# hint, use tf.reduce_mean and tf.reduce_sum
cost = tf.reduce_sum(loss)

Now for the standard training code. We'll pass our cost to an optimizer, and then use mini batch gradient descent to optimize our network's parameters. We just have to be careful to make sure we're preprocessing our input and feed it in the right shape, a 2-dimensional matrix of [batch_size, n_features] in dimensions.


In [27]:
learning_rate = 0.001
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

Below is the training code for our autoencoder. Please go through each line of code to make sure you understand what is happening, and fill in the missing pieces. This will take awhile. On my machine, it takes about 15 minutes. If you're impatient, you can "Interrupt" the kernel by going to the Kernel menu above, and continue with the notebook. Though, the longer you leave this to train, the better the result will be.

What I really want you to notice is what the network learns to encode first, based on what it is able to reconstruct. It won't able to reconstruct everything. At first, it will just be the mean image. Then, other major changes in the dataset. For the first 100 images of celeb net, this seems to be the background: white, blue, black backgrounds. From this basic interpretation, you can reason that the autoencoder has learned a representation of the backgrounds, and is able to encode that knowledge of the background in its inner most layer of just two values. It then goes on to represent the major variations in skin tone and hair. Then perhaps some facial features such as lips. So the features it is able to encode tend to be the major things at first, then the smaller things.


In [30]:
from libs import tboard
tboard.show_graph(tf.get_default_graph().as_graph_def())



In [28]:
# Create a tensorflow session and initialize all of our weights:
sess = tf.Session()
sess.run(tf.initialize_all_variables())

# Some parameters for training
batch_size = 100
n_epochs = 401
step = 10

# We'll try to reconstruct the same first 100 images and show how
# The network does over the course of training.
examples = ds.X[:100]

# We have to preprocess the images before feeding them to the network.
# I'll do this once here, so we don't have to do it every iteration.
test_examples = preprocess(examples, ds).reshape(-1, n_features)

# If we want to just visualize them, we can create a montage.
test_images = utils.montage(examples).astype(np.uint8)

# Store images so we can make a gif
gifs = []

# Now for our training:
for epoch_i in range(n_epochs):
    
    # Keep track of the cost
    this_cost = 0
    
    # Iterate over the entire dataset in batches
    for batch_X, _ in ds.train.next_batch(batch_size = batch_size):
        
        # Preprocess and reshape our current batch, batch_X:
        this_batch = preprocess(batch_X, ds).reshape(-1, n_features)
        
        # Compute the cost, and run the optimizer.
        this_cost += sess.run([cost, optimizer], feed_dict = {X: this_batch})[0]
    
    # Average cost of this epoch
    avg_cost = this_cost / ds.X.shape[0] / batch_size
    print(epoch_i, avg_cost)
    
    # Let's also try to see how the network currently reconstructs the input.
    # We'll draw the reconstruction every `step` iterations.
    if epoch_i % step == 0:
        
        # Ask for the output of the network, Y, and give it our test examples
        recon = sess.run(Y, feed_dict = {X: test_examples})
                         
        # Resize the 2d to the 4d representation:
        rsz = recon.reshape(examples.shape)

        # We have to unprocess the image now, removing the normalization
        unnorm_img = deprocess(rsz, ds)
                         
        # Clip to avoid saturation
        clipped = np.clip(unnorm_img, 0, 255)

        # And we can create a montage of the reconstruction
        recon = utils.montage(clipped).astype(np.uint8)
        
        # Store for gif
        gifs.append(recon)

        fig, axs = plt.subplots(1, 2, figsize=(10, 10))
        axs[0].imshow(test_images)
        axs[0].set_title('Original')
        axs[1].imshow(recon)
        axs[1].set_title('Synthesis')
        fig.canvas.draw()
        plt.show()


WARNING:tensorflow:From <ipython-input-28-e51eb231937b>:3: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1021     try:
-> 1022       return fn(*args)
   1023     except errors.OpError as e:

/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1003                                  feed_dict, fetch_list, target_list,
-> 1004                                  status, run_metadata)
   1005 

/usr/lib/python3.5/contextlib.py in __exit__(self, type, value, traceback)
     65             try:
---> 66                 next(self.gen)
     67             except StopIteration:

/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py in raise_exception_on_not_ok_status()
    465           compat.as_text(pywrap_tensorflow.TF_Message(status)),
--> 466           pywrap_tensorflow.TF_GetCode(status))
    467   finally:

InternalError: Dst tensor is not initialized.
	 [[Node: zeros = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [30000,1024] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

During handling of the above exception, another exception occurred:

InternalError                             Traceback (most recent call last)
<ipython-input-28-e51eb231937b> in <module>()
      1 # Create a tensorflow session and initialize all of our weights:
      2 sess = tf.Session()
----> 3 sess.run(tf.initialize_all_variables())
      4 
      5 # Some parameters for training

/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    765     try:
    766       result = self._run(None, fetches, feed_dict, options_ptr,
--> 767                          run_metadata_ptr)
    768       if run_metadata:
    769         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
    963     if final_fetches or final_targets:
    964       results = self._do_run(handle, final_targets, final_fetches,
--> 965                              feed_dict_string, options, run_metadata)
    966     else:
    967       results = []

/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1013     if handle is None:
   1014       return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
-> 1015                            target_list, options, run_metadata)
   1016     else:
   1017       return self._do_call(_prun_fn, self._session, handle, feed_dict,

/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1033         except KeyError:
   1034           pass
-> 1035       raise type(e)(node_def, op, message)
   1036 
   1037   def _extend_graph(self):

InternalError: Dst tensor is not initialized.
	 [[Node: zeros = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [30000,1024] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op 'zeros', defined at:
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/__main__.py", line 3, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 474, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 887, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 276, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 228, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 390, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/zmqshell.py", line 501, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-27-8314c5a4dabc>", line 2, in <module>
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/training/optimizer.py", line 289, in minimize
    name=name)
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/training/optimizer.py", line 403, in apply_gradients
    self._create_slots(var_list)
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/training/adam.py", line 117, in _create_slots
    self._zeros_slot(v, "m", self._name)
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/training/optimizer.py", line 647, in _zeros_slot
    named_slots[var] = slot_creator.create_zeros_slot(var, op_name)
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/training/slot_creator.py", line 121, in create_zeros_slot
    val = array_ops.zeros(primary.get_shape().as_list(), dtype=dtype)
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 1352, in zeros
    output = constant(zero, shape=shape, dtype=dtype, name=name)
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 103, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Dst tensor is not initialized.
	 [[Node: zeros = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [30000,1024] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Let's take a look a the final reconstruction:


In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(10, 10))
axs[0].imshow(test_images)
axs[0].set_title('Original')
axs[1].imshow(recon)
axs[1].set_title('Synthesis')
fig.canvas.draw()
plt.show()
plt.imsave(arr=test_images, fname='test.png')
plt.imsave(arr=recon, fname='recon.png')

Visualize the Embedding

Let's now try visualizing our dataset's inner most layer's activations. Since these are already 2-dimensional, we can use the values of this layer to position any input image in a 2-dimensional space. We hope to find similar looking images closer together.

We'll first ask for the inner most layer's activations when given our example images. This will run our images through the network, half way, stopping at the end of the encoder part of the network.


In [ ]:
zs = sess.run(z, feed_dict={X:test_examples})

Recall that this layer has 2 neurons:


In [129]:
zs.shape


Out[129]:
(100, 4)

Let's see what the activations look like for our 100 images as a scatter plot.


In [130]:
plt.scatter(zs[:, 0], zs[:, 1])


Out[130]:
<matplotlib.collections.PathCollection at 0x7f821c645d68>

If you view this plot over time, and let the process train longer, you will see something similar to the visualization here on the right: https://vimeo.com/155061675 - the manifold is able to express more and more possible ideas, or put another way, it is able to encode more data. As it grows more expressive, with more data, and longer training, or deeper networks, it will fill in more of the space, and have different modes expressing different clusters of the data. With just 100 examples of our dataset, this is very small to try to model with such a deep network. In any case, the techniques we've learned up to now apply in exactly the same way, even if we had 1k, 100k, or even many millions of images.

Let's try to see how this minimal example, with just 100 images, and just 100 epochs looks when we use this embedding to sort our dataset, just like we tried to do in the 1st assignment, but now with our autoencoders embedding.

Reorganize to Grid

We'll use these points to try to find an assignment to a grid. This is a well-known problem known as the "assignment problem": https://en.wikipedia.org/wiki/Assignment_problem - This is unrelated to the applications we're investigating in this course, but I thought it would be a fun extra to show you how to do. What we're going to do is take our scatter plot above, and find the best way to stretch and scale it so that each point is placed in a grid. We try to do this in a way that keeps nearby points close together when they are reassigned in their grid.


In [42]:
n_images = 100
idxs = np.linspace(np.min(zs) * 2.0, np.max(zs) * 2.0,
                   int(np.ceil(np.sqrt(n_images))))
print("idxs: ", idxs)
xs, ys = np.meshgrid(idxs, idxs)
print("shape of xs: ", xs.shape)
print("shape of ys: ", ys.shape)
grid = np.dstack((ys, xs)).reshape(-1, 2)[:n_images,:]
print("np.dstack((ys, xs)): ", np.dstack((ys, xs)).shape)
print("np.dstack((ys, xs)).reshape(-1, 2): ", np.dstack((ys, xs)).reshape(-1, 2).shape)
print("shape of grid: ", grid.shape)


idxs:  [-18342.58007812 -14364.85481771 -10387.12955729  -6409.40429688
  -2431.67903646   1546.04622396   5523.77148438   9501.49674479
  13479.22200521  17456.94726562]
shape of xs:  (10, 10)
shape of ys:  (10, 10)
np.dstack((ys, xs)):  (10, 10, 2)
np.dstack((ys, xs)).reshape(-1, 2):  (100, 2)
shape of grid:  (100, 2)

In [36]:
fig, axs = plt.subplots(1,2,figsize=(8,3))
axs[0].scatter(zs[:, 0], zs[:, 1],
               edgecolors='none', marker='o', s=2)
axs[0].set_title('Autoencoder Embedding')
axs[1].scatter(grid[:,0], grid[:,1],
               edgecolors='none', marker='o', s=2)
axs[1].set_title('Ideal Grid')


Out[36]:
<matplotlib.text.Text at 0x7fb27c5e9a20>

To do this, we can use scipy and an algorithm for solving this assignment problem known as the hungarian algorithm. With a few points, this algorithm runs pretty fast. But be careful if you have many more points, e.g. > 1000, as it is not a very efficient algorithm!


In [39]:
from scipy.spatial.distance import cdist
cost = cdist(grid[:, :], zs[:, :], 'sqeuclidean')
from scipy.optimize._hungarian import linear_sum_assignment
indexes = linear_sum_assignment(cost)

The result tells us the matching indexes from our autoencoder embedding of 2 dimensions, to our idealized grid:


In [40]:
indexes


Out[40]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
        51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
        68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
        85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
 array([15, 85, 98, 64, 96, 94,  1, 24, 28, 58, 21, 33, 23,  5, 42, 62, 22,
        13, 25, 52, 41, 63, 39, 47, 50, 30, 37, 67, 80, 34, 35, 74, 11, 84,
        73, 40, 69, 43, 72, 82, 91, 48, 27, 36, 56, 81,  6, 29, 68, 70,  4,
        17, 76, 38, 61, 51, 53, 95,  2, 99, 92, 78, 20, 18, 86, 59, 31, 90,
         3, 19, 83, 32, 49, 79, 89, 93, 45, 44, 10, 66, 97, 71, 87, 77, 46,
         0, 88, 16, 14,  9, 54, 75,  8, 12, 26, 60, 65, 55,  7, 57]))

In [43]:
plt.figure(figsize=(5, 5))
for i in range(len(zs)):
    plt.plot([zs[indexes[1][i], 0], grid[i, 0]],
             [zs[indexes[1][i], 1], grid[i, 1]], 'r')
plt.xlim([-3, 3])
plt.ylim([-3, 3])


Out[43]:
(-3, 3)

In other words, this algorithm has just found the best arrangement of our previous zs as a grid. We can now plot our images using the order of our assignment problem to see what it looks like:


In [44]:
examples_sorted = []
for i in indexes[1]:
    examples_sorted.append(examples[i])
plt.figure(figsize=(15, 15))
img = utils.montage(np.array(examples_sorted)).astype(np.uint8)
plt.imshow(img,
           interpolation='nearest')
plt.imsave(arr=img, fname='sorted.png')


2D Latent Manifold

We'll now explore the inner most layer of the network. Recall we go from the number of image features (the number of pixels), down to 2 values using successive matrix multiplications, back to the number of image features through more matrix multiplications. These inner 2 values are enough to represent our entire dataset (+ some loss, depending on how well we did). Let's explore how the decoder, the second half of the network, operates, from just these two values. We'll bypass the input placeholder, X, and the entire encoder network, and start from Z. Let's first get some data which will sample Z in 2 dimensions from -1 to 1. Then we'll feed these values through the decoder network to have our synthesized images.


In [60]:
# This is a quick way to do what we could have done as
# a nested for loop:
zs = np.meshgrid(np.linspace(-1, 1, 10),
                 np.linspace(-1, 1, 10))
print(np.linspace(-1, 1, 10))
print(len(zs))
print(zs[0].shape)
print(zs[1].shape)
# Now we have 100 x 2 values of every possible position
# in a 2D grid from -1 to 1:
zs = np.c_[zs[0].ravel(), zs[1].ravel()]
print(zs.shape)
#print(zs)


[-1.         -0.77777778 -0.55555556 -0.33333333 -0.11111111  0.11111111
  0.33333333  0.55555556  0.77777778  1.        ]
2
(10, 10)
(10, 10)
(100, 2)

Now calculate the reconstructed images using our new zs. You'll want to start from the beginning of the decoder! That is the z variable! Then calculate the Y given our synthetic values for z stored in zs.


In [46]:
recon = sess.run(Y, feed_dict={z: zs})

# reshape the result to an image:
rsz = recon.reshape(examples.shape)

# Deprocess the result, unnormalizing it
unnorm_img = deprocess(rsz, ds)

# clip to avoid saturation
clipped = np.clip(unnorm_img, 0, 255)

# Create a montage
img_i = utils.montage(clipped).astype(np.uint8)

And now we can plot the reconstructed montage representing our latent space:


In [47]:
plt.figure(figsize=(15, 15))
plt.imshow(img_i)
plt.imsave(arr=img_i, fname='manifold.png')


Part Two - General Autoencoder Framework

There are a number of extensions we can explore w/ an autoencoder. I've provided a module under the libs folder, vae.py, which you will need to explore for Part Two. It has a function, VAE, to create an autoencoder, optionally with Convolution, Denoising, and/or Variational Layers. Please read through the documentation and try to understand the different parameters.


In [61]:
help(vae.VAE)


Help on function VAE in module libs.vae:

VAE(input_shape=[None, 784], n_filters=[64, 64, 64], filter_sizes=[4, 4, 4], n_hidden=32, n_code=2, activation=<function tanh at 0x7fb286ddb8c8>, dropout=False, denoising=False, convolutional=False, variational=False)
    (Variational) (Convolutional) (Denoising) Autoencoder.
    
    Uses tied weights.
    
    Parameters
    ----------
    input_shape : list, optional
        Shape of the input to the network. e.g. for MNIST: [None, 784].
    n_filters : list, optional
        Number of filters for each layer.
        If convolutional=True, this refers to the total number of output
        filters to create for each layer, with each layer's number of output
        filters as a list.
        If convolutional=False, then this refers to the total number of neurons
        for each layer in a fully connected network.
    filter_sizes : list, optional
        Only applied when convolutional=True.  This refers to the ksize (height
        and width) of each convolutional layer.
    n_hidden : int, optional
        Only applied when variational=True.  This refers to the first fully
        connected layer prior to the variational embedding, directly after
        the encoding.  After the variational embedding, another fully connected
        layer is created with the same size prior to decoding.  Set to 0 to
        not use an additional hidden layer.
    n_code : int, optional
        Only applied when variational=True.  This refers to the number of
        latent Gaussians to sample for creating the inner most encoding.
    activation : function, optional
        Activation function to apply to each layer, e.g. tf.nn.relu
    dropout : bool, optional
        Whether or not to apply dropout.  If using dropout, you must feed a
        value for 'keep_prob', as returned in the dictionary.  1.0 means no
        dropout is used.  0.0 means every connection is dropped.  Sensible
        values are between 0.5-0.8.
    denoising : bool, optional
        Whether or not to apply denoising.  If using denoising, you must feed a
        value for 'corrupt_prob', as returned in the dictionary.  1.0 means no
        corruption is used.  0.0 means every feature is corrupted.  Sensible
        values are between 0.5-0.8.
    convolutional : bool, optional
        Whether or not to use a convolutional network or else a fully connected
        network will be created.  This effects the n_filters parameter's
        meaning.
    variational : bool, optional
        Whether or not to create a variational embedding layer.  This will
        create a fully connected layer after the encoding, if `n_hidden` is
        greater than 0, then will create a multivariate gaussian sampling
        layer, then another fully connected layer.  The size of the fully
        connected layers are determined by `n_hidden`, and the size of the
        sampling layer is determined by `n_code`.
    
    Returns
    -------
    model : dict
        {
            'cost': Tensor to optimize.
            'Ws': All weights of the encoder.
            'x': Input Placeholder
            'z': Inner most encoding Tensor (latent features)
            'y': Reconstruction of the Decoder
            'keep_prob': Amount to keep when using Dropout
            'corrupt_prob': Amount to corrupt when using Denoising
            'train': Set to True when training/Applies to Batch Normalization.
        }

Included in the vae.py module is the train_vae function. This will take a list of file paths, and train an autoencoder with the provided options. This will spit out a bunch of images of the reconstruction and latent manifold created by the encoder/variational encoder. Feel free to read through the code, as it is documented.


In [62]:
help(vae.train_vae)


Help on function train_vae in module libs.vae:

train_vae(files, input_shape, learning_rate=0.0001, batch_size=100, n_epochs=50, n_examples=10, crop_shape=[64, 64, 3], crop_factor=0.8, n_filters=[100, 100, 100, 100], n_hidden=256, n_code=50, convolutional=True, variational=True, filter_sizes=[3, 3, 3, 3], dropout=True, keep_prob=0.8, activation=<function relu at 0x7fb286dc1730>, img_step=100, save_step=100, ckpt_name='vae.ckpt')
    General purpose training of a (Variational) (Convolutional) Autoencoder.
    
    Supply a list of file paths to images, and this will do everything else.
    
    Parameters
    ----------
    files : list of strings
        List of paths to images.
    input_shape : list
        Must define what the input image's shape is.
    learning_rate : float, optional
        Learning rate.
    batch_size : int, optional
        Batch size.
    n_epochs : int, optional
        Number of epochs.
    n_examples : int, optional
        Number of example to use while demonstrating the current training
        iteration's reconstruction.  Creates a square montage, so make
        sure int(sqrt(n_examples))**2 = n_examples, e.g. 16, 25, 36, ... 100.
    crop_shape : list, optional
        Size to centrally crop the image to.
    crop_factor : float, optional
        Resize factor to apply before cropping.
    n_filters : list, optional
        Same as VAE's n_filters.
    n_hidden : int, optional
        Same as VAE's n_hidden.
    n_code : int, optional
        Same as VAE's n_code.
    convolutional : bool, optional
        Use convolution or not.
    variational : bool, optional
        Use variational layer or not.
    filter_sizes : list, optional
        Same as VAE's filter_sizes.
    dropout : bool, optional
        Use dropout or not
    keep_prob : float, optional
        Percent of keep for dropout.
    activation : function, optional
        Which activation function to use.
    img_step : int, optional
        How often to save training images showing the manifold and
        reconstruction.
    save_step : int, optional
        How often to save checkpoints.
    ckpt_name : str, optional
        Checkpoints will be named as this, e.g. 'model.ckpt'

I've also included three examples of how to use the VAE(...) and train_vae(...) functions. First look at the one using MNIST. Then look at the other two: one using the Celeb Dataset; and lastly one which will download Sita Sings the Blues, rip the frames, and train a Variational Autoencoder on it. This last one requires ffmpeg be installed (e.g. for OSX users, brew install ffmpeg, Linux users, sudo apt-get ffmpeg-dev, or else: https://ffmpeg.org/download.html). The Celeb and Sita Sings the Blues training require us to use an image pipeline, which I've mentioned briefly during the lecture. This does many things for us: it loads data from disk in batches, decodes the data as an image, resizes/crops the image, and uses a multithreaded graph to handle it all. It is very efficient and is the way to go when handling large image datasets.

The MNIST training does not use this. Instead, the entire dataset is loaded into the CPU memory, and then fed in minibatches to the graph using Python/Numpy. This is far less efficient, but will not be an issue for such a small dataset, e.g. 70k examples of 28x28 pixels = ~1.6 MB of data, easily fits into memory (in fact, it would really be better to use a Tensorflow variable with this entire dataset defined). When you consider the Celeb Net, you have 200k examples of 218x178x3 pixels = ~700 MB of data. That's just for the dataset. When you factor in everything required for the network and its weights, then you are pushing it. Basically this image pipeline will handle loading the data from disk, rather than storing it in memory.

Instructions

You'll now try to train your own autoencoder using this framework. You'll need to get a directory full of 'jpg' files. You'll then use the VAE framework and the vae.train_vae function to train a variational autoencoder on your own dataset. This accepts a list of files, and will output images of the training in the same directory. These are named "test_xs.png" as well as many images named prefixed by "manifold" and "reconstruction" for each iteration of the training. After you are happy with your training, you will need to create a forum post with the "test_xs.png" and the very last manifold and reconstruction image created to demonstrate how the variational autoencoder worked for your dataset. You'll likely need a lot more than 100 images for this to be successful.

Note that this will also create "checkpoints" which save the model! If you change the model, and already have a checkpoint by the same name, it will try to load the previous model and will fail. Be sure to remove the old checkpoint or specify a new name for ckpt_name! The default parameters shown below are what I have used for the celeb net dataset which has over 200k images. You will definitely want to use a smaller model if you do not have this many images! Explore!


In [ ]:
# Get a list of jpg file (Only JPG works!)
files = [os.path.join("../session-2/img_align_celeba", file_i)
         for file_i in os.listdir("../session-2/img_align_celeba")
         if file_i.endswith('.jpg')]
print(plt.imread(files[0]).shape)
print(plt.imread(files[1]).shape)

# Train it!  Change these parameters!
vae.train_vae(files,
              input_shape = [218, 178, 3],
              learning_rate=0.0001,
              batch_size=100,
              n_epochs=50,
              n_examples=10,
              crop_shape=[64, 64, 3],
              crop_factor=0.8,
              n_filters=[100, 100, 100, 100],
              n_hidden=256,
              n_code=50,
              convolutional=True,
              variational=True,
              filter_sizes=[3, 3, 3, 3],
              dropout=True,
              keep_prob=0.8,
              activation=tf.nn.relu,
              img_step=100,
              save_step=100,
              ckpt_name="vae.ckpt")


(218, 178, 3)
(218, 178, 3)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-3-cc38312ad6c7> in <module>()
     26               img_step=100,
     27               save_step=100,
---> 28               ckpt_name="vae.ckpt")

/notebooks/session-3/libs/vae.py in train_vae(files, input_shape, learning_rate, batch_size, n_epochs, n_examples, crop_shape, crop_factor, n_filters, n_hidden, n_code, convolutional, variational, filter_sizes, dropout, keep_prob, activation, img_step, save_step, ckpt_name)
    357             train_cost = sess.run([ae['cost'], optimizer], feed_dict={
    358                 ae['x']: batch_xs, ae['train']: True,
--> 359                 ae['keep_prob']: keep_prob})[0]
    360             cost += train_cost
    361             if batch_i % n_files == 0:

/usr/local/lib/python3.4/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    370     try:
    371       result = self._run(None, fetches, feed_dict, options_ptr,
--> 372                          run_metadata_ptr)
    373       if run_metadata:
    374         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.4/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
    634     try:
    635       results = self._do_run(handle, target_list, unique_fetches,
--> 636                              feed_dict_string, options, run_metadata)
    637     finally:
    638       # The movers are no longer used. Delete them.

/usr/local/lib/python3.4/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
    706     if handle is None:
    707       return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
--> 708                            target_list, options, run_metadata)
    709     else:
    710       return self._do_call(_prun_fn, self._session, handle, feed_dict,

/usr/local/lib/python3.4/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
    713   def _do_call(self, fn, *args):
    714     try:
--> 715       return fn(*args)
    716     except errors.OpError as e:
    717       message = compat.as_text(e.message)

/usr/local/lib/python3.4/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
    695         return tf_session.TF_Run(session, options,
    696                                  feed_dict, fetch_list, target_list,
--> 697                                  status, run_metadata)
    698 
    699     def _prun_fn(session, handle, feed_dict, fetch_list):

KeyboardInterrupt: 

Part Three - Deep Audio Classification Network

Instructions

In this last section, we'll explore using a regression network, one that predicts continuous outputs, to perform classification, a model capable of predicting discrete outputs. We'll explore the use of one-hot encodings and using a softmax layer to convert our regression outputs to a probability which we can use for classification. In the lecture, we saw how this works for the MNIST dataset, a dataset of 28 x 28 pixel handwritten digits labeled from 0 - 9. We converted our 28 x 28 pixels into a vector of 784 values, and used a fully connected network to output 10 values, the one hot encoding of our 0 - 9 labels.

In addition to the lecture material, I find these two links very helpful to try to understand classification w/ neural networks:

https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
https://cs.stanford.edu/people/karpathy/convnetjs//demo/classify2d.html

The GTZAN Music and Speech dataset has 64 music and 64 speech files, each 30 seconds long, and each at a sample rate of 22050 Hz, meaning there are 22050 samplings of the audio signal per second. What we're going to do is use all of this data to build a classification network capable of knowing whether something is music or speech. So we will have audio as input, and a probability of 2 possible values, music and speech, as output. This is very similar to the MNIST network. We just have to decide on how to represent our input data, prepare the data and its labels, build batch generators for our data, create the network, and train it. We'll make use of the libs/datasets.py module to help with some of this.

Preparing the Data

Let's first download the GTZAN music and speech dataset. I've included a helper function to do this.


In [3]:
dst = 'gtzan_music_speech'
if not os.path.exists(dst):
    dataset_utils.gtzan_music_speech_download(dst)

Inside the dst directory, we now have folders for music and speech. Let's get the list of all the wav files for music and speech:


In [4]:
# Get the full path to the directory
music_dir = os.path.join(os.path.join(dst, 'music_speech'), 'music_wav')

# Now use list comprehension to combine the path of the directory with any wave files
music = [os.path.join(music_dir, file_i)
         for file_i in os.listdir(music_dir)
         if file_i.endswith('.wav')]

# Similarly, for the speech folder:
speech_dir = os.path.join(os.path.join(dst, 'music_speech'), 'speech_wav')
speech = [os.path.join(speech_dir, file_i)
          for file_i in os.listdir(speech_dir)
          if file_i.endswith('.wav')]

# Let's see all the file names
print(len(music), len(speech))
print(music, speech)


64 64
['gtzan_music_speech/music_speech/music_wav/bagpipe.wav', 'gtzan_music_speech/music_speech/music_wav/ballad.wav', 'gtzan_music_speech/music_speech/music_wav/bartok.wav', 'gtzan_music_speech/music_speech/music_wav/beat.wav', 'gtzan_music_speech/music_speech/music_wav/beatles.wav', 'gtzan_music_speech/music_speech/music_wav/bigband.wav', 'gtzan_music_speech/music_speech/music_wav/birdland.wav', 'gtzan_music_speech/music_speech/music_wav/blues.wav', 'gtzan_music_speech/music_speech/music_wav/bmarsalis.wav', 'gtzan_music_speech/music_speech/music_wav/brahms.wav', 'gtzan_music_speech/music_speech/music_wav/canonaki.wav', 'gtzan_music_speech/music_speech/music_wav/caravan.wav', 'gtzan_music_speech/music_speech/music_wav/chaka.wav', 'gtzan_music_speech/music_speech/music_wav/classical.wav', 'gtzan_music_speech/music_speech/music_wav/classical1.wav', 'gtzan_music_speech/music_speech/music_wav/classical2.wav', 'gtzan_music_speech/music_speech/music_wav/copland.wav', 'gtzan_music_speech/music_speech/music_wav/copland2.wav', 'gtzan_music_speech/music_speech/music_wav/corea.wav', 'gtzan_music_speech/music_speech/music_wav/corea1.wav', 'gtzan_music_speech/music_speech/music_wav/cure.wav', 'gtzan_music_speech/music_speech/music_wav/debussy.wav', 'gtzan_music_speech/music_speech/music_wav/deedee.wav', 'gtzan_music_speech/music_speech/music_wav/deedee1.wav', 'gtzan_music_speech/music_speech/music_wav/duke.wav', 'gtzan_music_speech/music_speech/music_wav/echoes.wav', 'gtzan_music_speech/music_speech/music_wav/eguitar.wav', 'gtzan_music_speech/music_speech/music_wav/georose.wav', 'gtzan_music_speech/music_speech/music_wav/gismonti.wav', 'gtzan_music_speech/music_speech/music_wav/glass.wav', 'gtzan_music_speech/music_speech/music_wav/glass1.wav', 'gtzan_music_speech/music_speech/music_wav/gravity.wav', 'gtzan_music_speech/music_speech/music_wav/gravity2.wav', 'gtzan_music_speech/music_speech/music_wav/guitar.wav', 'gtzan_music_speech/music_speech/music_wav/hendrix.wav', 'gtzan_music_speech/music_speech/music_wav/ipanema.wav', 'gtzan_music_speech/music_speech/music_wav/jazz.wav', 'gtzan_music_speech/music_speech/music_wav/jazz1.wav', 'gtzan_music_speech/music_speech/music_wav/led.wav', 'gtzan_music_speech/music_speech/music_wav/loreena.wav', 'gtzan_music_speech/music_speech/music_wav/madradeus.wav', 'gtzan_music_speech/music_speech/music_wav/magkas.wav', 'gtzan_music_speech/music_speech/music_wav/march.wav', 'gtzan_music_speech/music_speech/music_wav/marlene.wav', 'gtzan_music_speech/music_speech/music_wav/mingus.wav', 'gtzan_music_speech/music_speech/music_wav/mingus1.wav', 'gtzan_music_speech/music_speech/music_wav/misirlou.wav', 'gtzan_music_speech/music_speech/music_wav/moanin.wav', 'gtzan_music_speech/music_speech/music_wav/narch.wav', 'gtzan_music_speech/music_speech/music_wav/ncherry.wav', 'gtzan_music_speech/music_speech/music_wav/nearhou.wav', 'gtzan_music_speech/music_speech/music_wav/opera.wav', 'gtzan_music_speech/music_speech/music_wav/opera1.wav', 'gtzan_music_speech/music_speech/music_wav/pop.wav', 'gtzan_music_speech/music_speech/music_wav/prodigy.wav', 'gtzan_music_speech/music_speech/music_wav/redhot.wav', 'gtzan_music_speech/music_speech/music_wav/rock.wav', 'gtzan_music_speech/music_speech/music_wav/rock2.wav', 'gtzan_music_speech/music_speech/music_wav/russo.wav', 'gtzan_music_speech/music_speech/music_wav/tony.wav', 'gtzan_music_speech/music_speech/music_wav/u2.wav', 'gtzan_music_speech/music_speech/music_wav/unpoco.wav', 'gtzan_music_speech/music_speech/music_wav/vlobos.wav', 'gtzan_music_speech/music_speech/music_wav/winds.wav'] ['gtzan_music_speech/music_speech/speech_wav/acomic.wav', 'gtzan_music_speech/music_speech/speech_wav/acomic2.wav', 'gtzan_music_speech/music_speech/speech_wav/allison.wav', 'gtzan_music_speech/music_speech/speech_wav/amal.wav', 'gtzan_music_speech/music_speech/speech_wav/austria.wav', 'gtzan_music_speech/music_speech/speech_wav/bathroom1.wav', 'gtzan_music_speech/music_speech/speech_wav/chant.wav', 'gtzan_music_speech/music_speech/speech_wav/charles.wav', 'gtzan_music_speech/music_speech/speech_wav/china.wav', 'gtzan_music_speech/music_speech/speech_wav/comedy.wav', 'gtzan_music_speech/music_speech/speech_wav/comedy1.wav', 'gtzan_music_speech/music_speech/speech_wav/conversion.wav', 'gtzan_music_speech/music_speech/speech_wav/danie.wav', 'gtzan_music_speech/music_speech/speech_wav/danie1.wav', 'gtzan_music_speech/music_speech/speech_wav/dialogue.wav', 'gtzan_music_speech/music_speech/speech_wav/dialogue1.wav', 'gtzan_music_speech/music_speech/speech_wav/dialogue2.wav', 'gtzan_music_speech/music_speech/speech_wav/diamond.wav', 'gtzan_music_speech/music_speech/speech_wav/ellhnika.wav', 'gtzan_music_speech/music_speech/speech_wav/emil.wav', 'gtzan_music_speech/music_speech/speech_wav/female.wav', 'gtzan_music_speech/music_speech/speech_wav/fem_rock.wav', 'gtzan_music_speech/music_speech/speech_wav/fire.wav', 'gtzan_music_speech/music_speech/speech_wav/geography.wav', 'gtzan_music_speech/music_speech/speech_wav/geography1.wav', 'gtzan_music_speech/music_speech/speech_wav/georg.wav', 'gtzan_music_speech/music_speech/speech_wav/god.wav', 'gtzan_music_speech/music_speech/speech_wav/greek.wav', 'gtzan_music_speech/music_speech/speech_wav/greek1.wav', 'gtzan_music_speech/music_speech/speech_wav/india.wav', 'gtzan_music_speech/music_speech/speech_wav/jony.wav', 'gtzan_music_speech/music_speech/speech_wav/jvoice.wav', 'gtzan_music_speech/music_speech/speech_wav/kedar.wav', 'gtzan_music_speech/music_speech/speech_wav/kid.wav', 'gtzan_music_speech/music_speech/speech_wav/lena.wav', 'gtzan_music_speech/music_speech/speech_wav/male.wav', 'gtzan_music_speech/music_speech/speech_wav/my_voice.wav', 'gtzan_music_speech/music_speech/speech_wav/nether.wav', 'gtzan_music_speech/music_speech/speech_wav/news1.wav', 'gtzan_music_speech/music_speech/speech_wav/news2.wav', 'gtzan_music_speech/music_speech/speech_wav/nj105.wav', 'gtzan_music_speech/music_speech/speech_wav/nj105a.wav', 'gtzan_music_speech/music_speech/speech_wav/oneday.wav', 'gtzan_music_speech/music_speech/speech_wav/psychic.wav', 'gtzan_music_speech/music_speech/speech_wav/pulp.wav', 'gtzan_music_speech/music_speech/speech_wav/pulp1.wav', 'gtzan_music_speech/music_speech/speech_wav/pulp2.wav', 'gtzan_music_speech/music_speech/speech_wav/relation.wav', 'gtzan_music_speech/music_speech/speech_wav/serbian.wav', 'gtzan_music_speech/music_speech/speech_wav/shannon.wav', 'gtzan_music_speech/music_speech/speech_wav/sleep.wav', 'gtzan_music_speech/music_speech/speech_wav/smoke1.wav', 'gtzan_music_speech/music_speech/speech_wav/smoking.wav', 'gtzan_music_speech/music_speech/speech_wav/stupid.wav', 'gtzan_music_speech/music_speech/speech_wav/teachers.wav', 'gtzan_music_speech/music_speech/speech_wav/teachers1.wav', 'gtzan_music_speech/music_speech/speech_wav/teachers2.wav', 'gtzan_music_speech/music_speech/speech_wav/thlui.wav', 'gtzan_music_speech/music_speech/speech_wav/undergrad.wav', 'gtzan_music_speech/music_speech/speech_wav/vegetables.wav', 'gtzan_music_speech/music_speech/speech_wav/vegetables1.wav', 'gtzan_music_speech/music_speech/speech_wav/vegetables2.wav', 'gtzan_music_speech/music_speech/speech_wav/voice.wav', 'gtzan_music_speech/music_speech/speech_wav/voices.wav']

We now need to load each file. We can use the scipy.io.wavefile module to load the audio as a signal.

Audio can be represented in a few ways, including as floating point or short byte data (16-bit data). This dataset is the latter and so can range from -32768 to +32767. We'll use the function I've provided in the utils module to load and convert an audio signal to a -1.0 to 1.0 floating point datatype by dividing by the maximum absolute value. Let's try this with just one of the files we have:


In [5]:
s = utils.load_audio(music[0])
plt.plot(s)


Out[5]:
[<matplotlib.lines.Line2D at 0x7facd2918978>]

Now, instead of using the raw audio signal, we're going to use the Discrete Fourier Transform to represent our audio as matched filters of different sinuoids. Unfortunately, this is a class on Tensorflow and I can't get into Digital Signal Processing basics. If you want to know more about this topic, I highly encourage you to take this course taught by the legendary Perry Cook and Julius Smith: https://www.kadenze.com/courses/physics-based-sound-synthesis-for-games-and-interactive-systems/info - there is no one better to teach this content, and in fact, I myself learned DSP from Perry Cook almost 10 years ago.

After taking the DFT, this will return our signal as real and imaginary components, a polar complex value representation which we will convert to a cartesian representation capable of saying what magnitudes and phases are in our signal.


In [6]:
# Parameters for our dft transform.  Sorry we can't go into the
# details of this in this course.  Please look into DSP texts or the
# course by Perry Cook linked above if you are unfamiliar with this.
fft_size = 512
hop_size = 256

re, im = dft.dft_np(s, hop_size=256, fft_size=512)
mag, phs = dft.ztoc(re, im)
mag_len = len(mag)   # this length will be used later on.

In [7]:
print(mag.shape)
plt.imshow(mag)


(2583, 256)
Out[7]:
<matplotlib.image.AxesImage at 0x7f3db1d85d68>

What we're seeing are the features of the audio (in columns) over time (in rows). We can see this a bit better by taking the logarithm of the magnitudes converting it to a psuedo-decibel scale. This is more similar to the logarithmic perception of loudness we have. Let's visualize this below, and I'll transpose the matrix just for display purposes:


In [8]:
plt.figure(figsize=(10, 4))
plt.imshow(np.log(mag.T))
plt.xlabel('Time')
plt.ylabel('Frequency Bin')


Out[8]:
<matplotlib.text.Text at 0x7fa1287ff860>

We could just take just a single row (or column in the second plot of the magnitudes just above, as we transposed it in that plot) as an input to a neural network. However, that just represents about an 80th of a second of audio data, and is not nearly enough data to say whether something is music or speech. We'll need to use more than a single row to get a decent length of time. One way to do this is to use a sliding 2D window from the top of the image down to the bottom of the image (or left to right). Let's start by specifying how large our sliding window is.


In [7]:
# The sample rate from our audio is 22050 Hz.
sr = 22050

# We can calculate how many hops there are in a second
# which will tell us how many frames of magnitudes
# we have per second
n_frames_per_second = sr // hop_size
print("n_frames_per_second: ", n_frames_per_second)

# We want 500 milliseconds of audio in our window
n_frames = n_frames_per_second // 2
print("n_frames: ", n_frames)

# And we'll move our window by 250 ms at a time
frame_hops = n_frames_per_second // 4
print("frame_hops: ", frame_hops)

# We'll therefore have this many sliding windows:
n_hops = (mag_len - n_frames) // frame_hops
print("n_hops: ", n_hops)


n_frames_per_second:  86
n_frames:  43
frame_hops:  21
n_hops:  120

Now we can collect all the sliding windows into a list of Xs and label them based on being music as 0 or speech as 1 into a collection of ys.


In [10]:
Xs = []
ys = []
for hop_i in range(n_hops):
    # Creating our sliding window
    frames = mag[(hop_i * frame_hops):(hop_i * frame_hops + n_frames)]
    
    # Store them with a new 3rd axis and as a logarithmic scale
    # We'll ensure that we aren't taking a log of 0 just by adding
    # a small value, also known as epsilon.
    Xs.append(np.log(np.abs(frames[..., np.newaxis]) + 1e-10))
    
    # And then store the label 
    ys.append(0)

The code below will perform this for us, as well as create the inputs and outputs to our classification network by specifying 0s for the music dataset and 1s for the speech dataset. Let's just take a look at the first sliding window, and see it's label:


In [11]:
plt.imshow(Xs[0][..., 0])
plt.title('label:{}'.format(ys[0]))


Out[11]:
<matplotlib.text.Text at 0x7fc349ac8898>

Since this was the first audio file of the music dataset, we've set it to a label of 0. And now the second one, which should have 50% overlap with the previous one, and still a label of 0:


In [12]:
plt.imshow(Xs[1][..., 0])
plt.title('label:{}'.format(ys[1]))


Out[12]:
<matplotlib.text.Text at 0x7fc349a2a0f0>

So hopefully you can see that the window is sliding down 250 milliseconds at a time, and since our window is 500 ms long, or half a second, it has 50% new content at the bottom. Let's do this for every audio file now:


In [8]:
# Store every magnitude frame and its label of being music: 0 or speech: 1
Xs, ys = [], []

# Let's start with the music files
# [Changed] Use less data to avoid memory error when it's converted to nd.array
for i in music[0:18]:
    # Load the ith file:
    s = utils.load_audio(i)
    
    # Now take the dft of it (take a DSP course!):
    re, im = dft.dft_np(s, fft_size=fft_size, hop_size=hop_size)
    
    # And convert the complex representation to magnitudes/phases (take a DSP course!):
    mag, phs = dft.ztoc(re, im)
    
    # This is how many sliding windows we have:
    n_hops = (len(mag) - n_frames) // frame_hops
    
    # Let's extract them all:
    for hop_i in range(n_hops):
        
        # Get the current sliding window
        frames = mag[(hop_i * frame_hops):(hop_i * frame_hops + n_frames)]
        
        # We'll take the log magnitudes, as this is a nicer representation:
        this_X = np.log(np.abs(frames[..., np.newaxis]) + 1e-10)
        
        # And store it:
        Xs.append(this_X)
        
        # And be sure that we store the correct label of this observation:
        ys.append(0)

In [9]:
# Now do the same thing with speech!
# [Changed] Use less data to avoid memory error when it's converted to nd.array
for i in speech[0:18]:
    
    # Load the ith file:
    s = utils.load_audio(i)
    
    # Now take the dft of it
    re, im = dft.dft_np(s, fft_size=fft_size, hop_size=hop_size)
    
    # And convert the complex representation to magnitudes/phases
    mag, phs = dft.ztoc(re, im)
    
    # This is how many sliding windows we have:
    n_hops = (len(mag) - n_frames) // frame_hops

    # Let's extract them all:
    for hop_i in range(n_hops):
        
        # Get the current sliding window
        frames = mag[(hop_i * frame_hops):(hop_i * frame_hops + n_frames)]
        
        # We'll take the log magnitudes, as this is a nicer representation:
        this_X = np.log(np.abs(frames[..., np.newaxis]) + 1e-10)
        
        # And store it:
        Xs.append(this_X)
        
        # Make sure we use the right labe
        ys.append(1)

In [10]:
# Convert them to an array:
Xs = np.asarray(Xs, dtype=np.int32)
ys = np.asarray(ys, dtype=np.int32)

print(Xs.shape, ys.shape)

# Just to make sure you've done it right.  If you've changed any of the
# parameters of the dft/hop size, then this will fail.  If that's what you
# wanted to do, then don't worry about this assertion.
#assert(Xs.shape == (15360, 43, 256, 1) and ys.shape == (15360,))


(4320, 43, 256, 1) (4320,)

Just to confirm it's doing the same as above, let's plot the first magnitude matrix:


In [12]:
plt.imshow(Xs[0][..., 0])
plt.title('label:{}'.format(ys[0]))


Out[12]:
<matplotlib.text.Text at 0x7f3db1cd0cf8>

Let's describe the shape of our input to the network:


In [11]:
n_observations, n_height, n_width, n_channels = Xs.shape

We'll now use the Dataset object I've provided for you under libs/datasets.py. This will accept the Xs, ys, a list defining our dataset split into training, validation, and testing proportions, and a parameter one_hot stating whether we want our ys to be converted to a one hot vector or not.


In [12]:
ds = datasets.Dataset(Xs = Xs, ys = ys, split = [0.8, 0.1, 0.1], one_hot = True)

Let's take a look at the batch generator this object provides. We can all any of the splits, the train, valid, or test splits as properties of the object. And each split provides a next_batch method which gives us a batch generator. We should have specified that we wanted one_hot=True to have our batch generator return our ys with 2 features, one for each possible class.


In [13]:
Xs_i, ys_i = next(ds.train.next_batch())

# Notice the shape this returns.  This will become the shape of our input and output of the network:
print(Xs_i.shape, ys_i.shape)

assert(ys_i.shape == (100, 2))


(100, 43, 256, 1) (100, 2)

Let's take a look at the first element of the randomized batch:


In [14]:
plt.imshow(Xs_i[0, :, :, 0])
plt.title('label:{}'.format(ys_i[0]))


Out[14]:
<matplotlib.text.Text at 0x7f3bf8d75438>

And the second one:


In [15]:
plt.imshow(Xs_i[1, :, :, 0])
plt.title('label:{}'.format(ys_i[1]))


Out[15]:
<matplotlib.text.Text at 0x7f3bf8ce5780>

So we have a randomized order in minibatches generated for us, and the ys are represented as a one-hot vector with each class, music and speech, encoded as a 0 or 1. Since the next_batch method is a generator, we can use it in a loop until it is exhausted to run through our entire dataset in mini-batches.

Creating the Network

Let's now create the neural network. Recall our input X is 4-dimensional, with the same shape that we've just seen as returned from our batch generator above. We're going to create a deep convolutional neural network with a few layers of convolution and 2 finals layers which are fully connected. The very last layer must have only 2 neurons corresponding to our one-hot vector of ys, so that we can properly measure the cross-entropy (just like we did with MNIST and our 10 element one-hot encoding of the digit label). First let's create our placeholders:


In [46]:
tf.reset_default_graph()

# Create the input to the network.  This is a 4-dimensional tensor.
# Recall that we are using sliding windows of our magnitudes:
X = tf.placeholder(name='X', shape=[None, 43, 256, 1], dtype=tf.float32)

# Create the output to the network.  This is our one hot encoding of 2 possible values.
Y = tf.placeholder(name='Y', shape=[None, 2], dtype=tf.float32)

Let's now create our deep convolutional network. Start by first creating the convolutional layers. Try different numbers of layers, different numbers of filters per layer, different activation functions, and varying the parameters to get the best training/validation score when training below. Try first using a kernel size of 3 and a stride of 1. You can use the utils.conv2d function to help you create the convolution.


In [47]:
# Explore different numbers of layers, and sizes of the network
n_filters = [32, 24, 16, 16]

# Now let's loop over our n_filters and create the deep convolutional neural network
H = X
for layer_i, n_filters_i in enumerate(n_filters):
    
    # Let's use the helper function to create our connection to the next layer:
    H, W = utils.conv2d(H, 
                        n_filters_i, 
                        k_h = 4, 
                        k_w = 4, 
                        d_h = 2, 
                        d_w = 2, 
                        name = str(layer_i))
    
    # And use a nonlinearity
    #  - tf.nn.relu
    #  - tf.nn.relu6
    #  - tf.nn.elu
    #  - tf.nn.softplus
    #  - tf.nn.softsign
    #  - tf.sigmoid
    #  - tf.tahn
    H = tf.nn.relu(H)
    
    # Just to check what's happening:
    print(H.get_shape().as_list())


[None, 22, 128, 32]
[None, 11, 64, 24]
[None, 6, 32, 16]
[None, 3, 16, 16]

We'll now connect our last convolutional layer to a fully connected layer of 100 neurons. This is essentially combining the spatial information, thus losing the spatial information. You can use the utils.linear function to do this, which will internally also reshape the 4-d tensor to a 2-d tensor so that it can be connected to a fully-connected layer (i.e. perform a matrix multiplication).


In [48]:
# Connect the last convolutional layer to a fully connected network!
fc, W = utils.linear(H, n_output = 32, name = "full-connected-layer", activation = tf.nn.relu) 

# And another fully connceted network, now with just 2 outputs, the number of outputs that our one hot encoding has.
Y_pred, W = utils.linear(fc, n_output = 2, name = "final-layer", activation = tf.sigmoid)

In [49]:
print(Y_pred.get_shape().as_list())
print(W.get_shape().as_list())


[None, 2]
[32, 2]

We'll now create our cost. Unlike the MNIST network, we're going to use a binary cross entropy as we only have 2 possible classes. You can use the utils.binary_cross_entropy function to help you with this. Remember, the final cost measure the average loss of your batches.


In [50]:
loss = utils.binary_cross_entropy(Y_pred, Y)
cost = tf.reduce_mean(tf.reduce_sum(loss, 1))

Just like in MNIST, we'll now also create a measure of accuracy by finding the prediction of our network. This is just for us to monitor the training and is not used to optimize the weights of the network! Look back to the MNIST network in the lecture if you are unsure of how this works (it is exactly the same):


In [51]:
predicted_y = tf.argmax(Y_pred, 1)
actual_y = tf.argmax(Y, 1)
correct_prediction = tf.equal(predicted_y, actual_y)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

We'll now create an optimizer and train our network:


In [52]:
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(cost)

Now we're ready to train. This is a pretty simple dataset for a deep convolutional network. As a result, I've included code which demonstrates how to monitor validation performance. A validation set is data that the network has never seen, and is not used for optimizing the weights of the network. We use validation to better understand how well the performance of a network "generalizes" to unseen data.

You can easily run the risk of overfitting to the training set of this problem. Overfitting simply means that the number of parameters in our model are so high that we are not generalizing our model, and instead trying to model each individual point, rather than the general cause of the data. This is a very common problem that can be addressed by using less parameters, or enforcing regularization techniques which we didn't have a chance to cover (dropout, batch norm, l2, augmenting the dataset, and others).

For this dataset, if you notice that your validation set is performing worse than your training set, then you know you have overfit! You should be able to easily get 97+% on the validation set within < 10 epochs. If you've got great training performance, but poor validation performance, then you likely have "overfit" to the training dataset, and are unable to generalize to the validation set. Try varying the network definition, number of filters/layers until you get 97+% on your validation set!


In [55]:
# Explore these parameters
n_epochs = 25
batch_size = 200

# Create a session and init!
sess = tf.Session()
sess.run(tf.initialize_all_variables())

# Now iterate over our dataset n_epoch times
for epoch_i in range(n_epochs):
    print('Epoch: ', epoch_i)
    
    # Train
    this_accuracy = 0
    its = 0
    
    # Do our mini batches:
    for Xs_i, ys_i in ds.train.next_batch(batch_size):
        # Note here: we are running the optimizer so that the network parameters train!
        this_accuracy += sess.run([accuracy, optimizer], feed_dict={
                X:Xs_i, Y:ys_i})[0]
        its += 1
        print(this_accuracy / its)
    print('Training accuracy: ', this_accuracy / its)
    
    # Validation (see how the network does on unseen data).
    this_accuracy = 0
    its = 0
    
    # Do our mini batches:
    for Xs_i, ys_i in ds.valid.next_batch(batch_size):
        # Note here: we are NOT running the optimizer! 
        # we only measure the accuracy!
        this_accuracy += sess.run(accuracy, feed_dict={
                X:Xs_i, Y:ys_i})
        its += 1
    print('Validation accuracy: ', this_accuracy / its)


Epoch:  0
0.535000026226
0.547500014305
0.551666676998
0.535000011325
0.532000005245
0.559166669846
0.562142857483
0.55624999851
0.548333333598
0.543000000715
0.549090911042
0.564166670044
0.572692311727
0.573571430785
0.578000001113
0.58437500149
0.597647060366
0.606111112568
Training accuracy:  0.606111112568
Validation accuracy:  0.64458334446
Epoch:  1
0.579999983311
0.642499983311
0.661666651567
0.697499990463
0.683999991417
0.698333323002
0.699999988079
0.684999987483
0.684444434113
0.691999989748
0.693636352366
0.703333323201
0.708076912623
0.710714276348
0.713333324591
0.710624992847
0.7105882273
0.707817451821
Training accuracy:  0.707817451821
Validation accuracy:  0.796666681767
Epoch:  2
0.725000023842
0.747500002384
0.743333339691
0.745000004768
0.759000003338
0.754166672627
0.75571428878
0.755625002086
0.758888893657
0.762000006437
0.764545462348
0.762500007947
0.76500000862
0.771428580795
0.774333341916
0.771875008941
0.769705891609
0.769603182872
Training accuracy:  0.769603182872
Validation accuracy:  0.799166659514
Epoch:  3
0.75
0.784999996424
0.810000002384
0.808750003576
0.799000000954
0.787500003974
0.794999999659
0.798124998808
0.802222218778
0.79849999547
0.798181815581
0.79583332936
0.798076918492
0.802142854248
0.805333332221
0.810624998063
0.811176468344
0.811746031046
Training accuracy:  0.811746031046
Validation accuracy:  0.805416663488
Epoch:  4
0.829999983311
0.83750000596
0.81833332777
0.821249991655
0.831999993324
0.834166665872
0.834999995572
0.843124993145
0.844444440471
0.841999995708
0.847727266225
0.849166661501
0.847307686622
0.847499996424
0.846999994914
0.845312494785
0.846764701254
0.846349202924
Training accuracy:  0.846349202924
Validation accuracy:  0.862916668256
Epoch:  5
0.865000009537
0.867500007153
0.860000014305
0.855000004172
0.862999999523
0.86083333691
0.863571430956
0.866875000298
0.862777776188
0.864499998093
0.865909088742
0.868333329757
0.869230765563
0.87035713877
0.870666662852
0.87187499553
0.873235288788
0.8733333283
Training accuracy:  0.8733333283
Validation accuracy:  0.903750002384
Epoch:  6
0.910000026226
0.897500008345
0.898333330949
0.899999991059
0.902999997139
0.907499998808
0.897142853056
0.895624995232
0.896111104223
0.895999991894
0.89818181233
0.896249994636
0.896538454753
0.896785706282
0.899333326022
0.895312491804
0.897058816517
0.898809515768
Training accuracy:  0.898809515768
Validation accuracy:  0.953333338102
Epoch:  7
0.910000026226
0.915000021458
0.921666681767
0.926250010729
0.926000010967
0.92583334446
0.917142868042
0.917500011623
0.912777788109
0.910000008345
0.912727280097
0.915000006557
0.918076927845
0.913571434362
0.914333339532
0.912187505513
0.912352947628
0.912261909909
Training accuracy:  0.912261909909
Validation accuracy:  0.890416661898
Epoch:  8
0.884999990463
0.897500008345
0.896666665872
0.894999995828
0.901999998093
0.907499998808
0.908571430615
0.908750005066
0.911111116409
0.908000004292
0.90454545888
0.905833338698
0.907692313194
0.910000004939
0.910333339373
0.911562506109
0.910588239922
0.909603178501
Training accuracy:  0.909603178501
Validation accuracy:  0.963333328565
Epoch:  9
0.939999997616
0.897500008345
0.916666666667
0.913749992847
0.91099998951
0.915833324194
0.917857136045
0.916249990463
0.917777770095
0.918999993801
0.916363629428
0.9166666617
0.914230763912
0.913928568363
0.913666665554
0.912187498063
0.911176467643
0.911150789923
Training accuracy:  0.911150789923
Validation accuracy:  0.918333331744
Epoch:  10
0.964999973774
0.914999991655
0.923333326976
0.918749988079
0.911999988556
0.919999986887
0.916428557464
0.915624991059
0.919999990198
0.918999993801
0.920909085057
0.923749993245
0.924615378563
0.925714279924
0.925333329042
0.92656249553
0.928235288929
0.929246028264
Training accuracy:  0.929246028264
Validation accuracy:  0.963333328565
Epoch:  11
0.954999983311
0.942499995232
0.929999987284
0.929999992251
0.928999996185
0.930833329757
0.931428568704
0.930624999106
0.934444440736
0.933499997854
0.93272727186
0.935833334923
0.933461537728
0.93642857245
0.936000001431
0.93593750149
0.936764706584
0.935317460034
Training accuracy:  0.935317460034
Validation accuracy:  0.940416673819
Epoch:  12
0.910000026226
0.932500004768
0.93166667223
0.9375
0.939999997616
0.939166665077
0.937857142517
0.939374998212
0.939999997616
0.943500000238
0.946363638748
0.947916666667
0.948846152196
0.948928569044
0.947666664918
0.949062500149
0.949411763864
0.951230158408
Training accuracy:  0.951230158408
Validation accuracy:  0.95791665713
Epoch:  13
0.964999973774
0.954999983311
0.951666653156
0.951249986887
0.947999989986
0.949999988079
0.946428562914
0.944374993443
0.939999990993
0.941999989748
0.944545447826
0.945416659117
0.947307687539
0.949285711561
0.951333332062
0.951249998063
0.951764702797
0.950476186143
Training accuracy:  0.950476186143
Validation accuracy:  0.942916671435
Epoch:  14
0.930000007153
0.944999992847
0.955000003179
0.951250001788
0.952999997139
0.954999993245
0.956428561892
0.956249989569
0.956111099985
0.957499992847
0.958181809295
0.959166660905
0.9603846119
0.960714280605
0.960666660468
0.959999993443
0.960882348173
0.958095232646
Training accuracy:  0.958095232646
Validation accuracy:  0.973333338896
Epoch:  15
0.964999973774
0.954999983311
0.959999998411
0.95624999702
0.955999994278
0.960833330949
0.958571425506
0.960625000298
0.958333333333
0.959500002861
0.960909095677
0.959583337108
0.959230771432
0.955000000341
0.95466666619
0.955625001341
0.952647058403
0.952301588323
Training accuracy:  0.952301588323
Validation accuracy:  0.959583322207
Epoch:  16
0.975000023842
0.967500001192
0.958333333333
0.959999993443
0.962999999523
0.964166671038
0.962142859186
0.964375004172
0.965555561913
0.965000003576
0.963636365804
0.962083334724
0.961538461538
0.961785712412
0.960333331426
0.959687497467
0.95852940924
0.957857141892
Training accuracy:  0.957857141892
Validation accuracy:  0.947083334128
Epoch:  17
0.959999978542
0.959999978542
0.963333328565
0.967500001192
0.965999996662
0.967500001192
0.964999999319
0.96562500298
0.967222226991
0.967500007153
0.966363641349
0.963750004768
0.963076926195
0.963928576027
0.963666669528
0.96406250447
0.965294122696
0.967222226991
Training accuracy:  0.967222226991
Validation accuracy:  0.971666673819
Epoch:  18
0.949999988079
0.957499980927
0.959999978542
0.964999988675
0.964999985695
0.970833321412
0.969999986035
0.968124985695
0.9705555439
0.970499992371
0.972272721204
0.971249992649
0.96999999193
0.96928570526
0.969333326817
0.969374995679
0.967941171983
0.968730154965
Training accuracy:  0.968730154965
Validation accuracy:  0.955833335718
Epoch:  19
0.964999973774
0.964999973774
0.973333319028
0.976249992847
0.972999989986
0.970833321412
0.96642856087
0.964999988675
0.963888876968
0.963999986649
0.965909080072
0.967916657527
0.968076916841
0.968928567001
0.970666662852
0.970937497914
0.970588231788
0.968253963523
Training accuracy:  0.968253963523
Validation accuracy:  0.976666649183
Epoch:  20
0.944999992847
0.96250000596
0.966666678588
0.970000013709
0.971000015736
0.971666683753
0.972142875195
0.970000013709
0.971666680442
0.971000009775
0.969090916894
0.970833341281
0.970000005685
0.970357149839
0.971000007788
0.970000006258
0.969411769334
0.971111115482
Training accuracy:  0.971111115482
Validation accuracy:  0.976666669051
Epoch:  21
0.980000019073
0.967500001192
0.964999993642
0.967500001192
0.968000006676
0.970833341281
0.973571436746
0.975000008941
0.976666675674
0.97800000906
0.97727273811
0.977500011524
0.977307704779
0.977500012943
0.977333347003
0.977187514305
0.975882365423
0.976230170992
Training accuracy:  0.976230170992
Validation accuracy:  0.969583332539
Epoch:  22
0.959999978542
0.959999978542
0.973333319028
0.977499991655
0.978999996185
0.979166666667
0.97928571701
0.97812500596
0.979444450802
0.980500006676
0.980454553257
0.980833341678
0.981538469975
0.981428580625
0.981666676203
0.982187509537
0.981764716261
0.980793661541
Training accuracy:  0.980793661541
Validation accuracy:  0.971249997616
Epoch:  23
0.980000019073
0.990000009537
0.990000009537
0.985000014305
0.987000012398
0.986666679382
0.985714299338
0.984375014901
0.982222232554
0.982500010729
0.981818193739
0.981666679184
0.98153847456
0.981071442366
0.981333347162
0.980937514454
0.980294132934
0.980396840307
Training accuracy:  0.980396840307
Validation accuracy:  0.966250002384
Epoch:  24
0.975000023842
0.982500016689
0.985000014305
0.977500006557
0.980000007153
0.980000009139
0.977142861911
0.976875007153
0.978888895777
0.978500008583
0.979545463215
0.978333339095
0.978461545247
0.978928578751
0.978666675091
0.97875000909
0.979411773822
0.977579375108
Training accuracy:  0.977579375108
Validation accuracy:  0.96291667223

Let's try to inspect how the network is accomplishing this task, just like we did with the MNIST network. First, let's see what the names of our operations in our network are.


In [56]:
g = tf.get_default_graph()
[op.name for op in g.get_operations()]


Out[56]:
['X',
 'Y',
 '0/W',
 '0/W/Initializer/random_uniform/shape',
 '0/W/Initializer/random_uniform/min',
 '0/W/Initializer/random_uniform/max',
 '0/W/Initializer/random_uniform/RandomUniform',
 '0/W/Initializer/random_uniform/sub',
 '0/W/Initializer/random_uniform/mul',
 '0/W/Initializer/random_uniform',
 '0/W/Assign',
 '0/W/read',
 '0/conv',
 '0/b',
 '0/b/Initializer/Const',
 '0/b/Assign',
 '0/b/read',
 '0/h',
 'Relu',
 '1/W',
 '1/W/Initializer/random_uniform/shape',
 '1/W/Initializer/random_uniform/min',
 '1/W/Initializer/random_uniform/max',
 '1/W/Initializer/random_uniform/RandomUniform',
 '1/W/Initializer/random_uniform/sub',
 '1/W/Initializer/random_uniform/mul',
 '1/W/Initializer/random_uniform',
 '1/W/Assign',
 '1/W/read',
 '1/conv',
 '1/b',
 '1/b/Initializer/Const',
 '1/b/Assign',
 '1/b/read',
 '1/h',
 'Relu_1',
 '2/W',
 '2/W/Initializer/random_uniform/shape',
 '2/W/Initializer/random_uniform/min',
 '2/W/Initializer/random_uniform/max',
 '2/W/Initializer/random_uniform/RandomUniform',
 '2/W/Initializer/random_uniform/sub',
 '2/W/Initializer/random_uniform/mul',
 '2/W/Initializer/random_uniform',
 '2/W/Assign',
 '2/W/read',
 '2/conv',
 '2/b',
 '2/b/Initializer/Const',
 '2/b/Assign',
 '2/b/read',
 '2/h',
 'Relu_2',
 '3/W',
 '3/W/Initializer/random_uniform/shape',
 '3/W/Initializer/random_uniform/min',
 '3/W/Initializer/random_uniform/max',
 '3/W/Initializer/random_uniform/RandomUniform',
 '3/W/Initializer/random_uniform/sub',
 '3/W/Initializer/random_uniform/mul',
 '3/W/Initializer/random_uniform',
 '3/W/Assign',
 '3/W/read',
 '3/conv',
 '3/b',
 '3/b/Initializer/Const',
 '3/b/Assign',
 '3/b/read',
 '3/h',
 'Relu_3',
 'flatten/Reshape/shape',
 'flatten/Reshape',
 'full-connected-layer/W',
 'full-connected-layer/W/Initializer/random_uniform/shape',
 'full-connected-layer/W/Initializer/random_uniform/min',
 'full-connected-layer/W/Initializer/random_uniform/max',
 'full-connected-layer/W/Initializer/random_uniform/RandomUniform',
 'full-connected-layer/W/Initializer/random_uniform/sub',
 'full-connected-layer/W/Initializer/random_uniform/mul',
 'full-connected-layer/W/Initializer/random_uniform',
 'full-connected-layer/W/Assign',
 'full-connected-layer/W/read',
 'full-connected-layer/b',
 'full-connected-layer/b/Initializer/Const',
 'full-connected-layer/b/Assign',
 'full-connected-layer/b/read',
 'full-connected-layer/MatMul',
 'full-connected-layer/h',
 'full-connected-layer/Relu',
 'final-layer/W',
 'final-layer/W/Initializer/random_uniform/shape',
 'final-layer/W/Initializer/random_uniform/min',
 'final-layer/W/Initializer/random_uniform/max',
 'final-layer/W/Initializer/random_uniform/RandomUniform',
 'final-layer/W/Initializer/random_uniform/sub',
 'final-layer/W/Initializer/random_uniform/mul',
 'final-layer/W/Initializer/random_uniform',
 'final-layer/W/Assign',
 'final-layer/W/read',
 'final-layer/b',
 'final-layer/b/Initializer/Const',
 'final-layer/b/Assign',
 'final-layer/b/read',
 'final-layer/MatMul',
 'final-layer/h',
 'final-layer/Sigmoid',
 'add/y',
 'add',
 'Log',
 'mul',
 'sub/x',
 'sub',
 'sub_1/x',
 'sub_1',
 'add_1/y',
 'add_1',
 'Log_1',
 'mul_1',
 'add_2',
 'Neg',
 'Sum/reduction_indices',
 'Sum',
 'Rank',
 'range/start',
 'range/delta',
 'range',
 'Mean',
 'ArgMax/dimension',
 'ArgMax',
 'ArgMax_1/dimension',
 'ArgMax_1',
 'Equal',
 'Cast',
 'Rank_1',
 'range_1/start',
 'range_1/delta',
 'range_1',
 'Mean_1',
 'gradients/Shape',
 'gradients/Const',
 'gradients/Fill',
 'gradients/Mean_grad/Shape',
 'gradients/Mean_grad/Size',
 'gradients/Mean_grad/add',
 'gradients/Mean_grad/mod',
 'gradients/Mean_grad/Shape_1',
 'gradients/Mean_grad/range/start',
 'gradients/Mean_grad/range/delta',
 'gradients/Mean_grad/range',
 'gradients/Mean_grad/Fill/value',
 'gradients/Mean_grad/Fill',
 'gradients/Mean_grad/DynamicStitch',
 'gradients/Mean_grad/Maximum/y',
 'gradients/Mean_grad/Maximum',
 'gradients/Mean_grad/floordiv',
 'gradients/Mean_grad/Reshape',
 'gradients/Mean_grad/Tile',
 'gradients/Mean_grad/Shape_2',
 'gradients/Mean_grad/Shape_3',
 'gradients/Mean_grad/Rank',
 'gradients/Mean_grad/range_1/start',
 'gradients/Mean_grad/range_1/delta',
 'gradients/Mean_grad/range_1',
 'gradients/Mean_grad/Prod',
 'gradients/Mean_grad/Rank_1',
 'gradients/Mean_grad/range_2/start',
 'gradients/Mean_grad/range_2/delta',
 'gradients/Mean_grad/range_2',
 'gradients/Mean_grad/Prod_1',
 'gradients/Mean_grad/Maximum_1/y',
 'gradients/Mean_grad/Maximum_1',
 'gradients/Mean_grad/floordiv_1',
 'gradients/Mean_grad/Cast',
 'gradients/Mean_grad/truediv',
 'gradients/Sum_grad/Shape',
 'gradients/Sum_grad/Size',
 'gradients/Sum_grad/add',
 'gradients/Sum_grad/mod',
 'gradients/Sum_grad/Shape_1',
 'gradients/Sum_grad/range/start',
 'gradients/Sum_grad/range/delta',
 'gradients/Sum_grad/range',
 'gradients/Sum_grad/Fill/value',
 'gradients/Sum_grad/Fill',
 'gradients/Sum_grad/DynamicStitch',
 'gradients/Sum_grad/Maximum/y',
 'gradients/Sum_grad/Maximum',
 'gradients/Sum_grad/floordiv',
 'gradients/Sum_grad/Reshape',
 'gradients/Sum_grad/Tile',
 'gradients/Neg_grad/Neg',
 'gradients/add_2_grad/Shape',
 'gradients/add_2_grad/Shape_1',
 'gradients/add_2_grad/BroadcastGradientArgs',
 'gradients/add_2_grad/Sum',
 'gradients/add_2_grad/Reshape',
 'gradients/add_2_grad/Sum_1',
 'gradients/add_2_grad/Reshape_1',
 'gradients/add_2_grad/tuple/group_deps',
 'gradients/add_2_grad/tuple/control_dependency',
 'gradients/add_2_grad/tuple/control_dependency_1',
 'gradients/mul_grad/Shape',
 'gradients/mul_grad/Shape_1',
 'gradients/mul_grad/BroadcastGradientArgs',
 'gradients/mul_grad/mul',
 'gradients/mul_grad/Sum',
 'gradients/mul_grad/Reshape',
 'gradients/mul_grad/mul_1',
 'gradients/mul_grad/Sum_1',
 'gradients/mul_grad/Reshape_1',
 'gradients/mul_grad/tuple/group_deps',
 'gradients/mul_grad/tuple/control_dependency',
 'gradients/mul_grad/tuple/control_dependency_1',
 'gradients/mul_1_grad/Shape',
 'gradients/mul_1_grad/Shape_1',
 'gradients/mul_1_grad/BroadcastGradientArgs',
 'gradients/mul_1_grad/mul',
 'gradients/mul_1_grad/Sum',
 'gradients/mul_1_grad/Reshape',
 'gradients/mul_1_grad/mul_1',
 'gradients/mul_1_grad/Sum_1',
 'gradients/mul_1_grad/Reshape_1',
 'gradients/mul_1_grad/tuple/group_deps',
 'gradients/mul_1_grad/tuple/control_dependency',
 'gradients/mul_1_grad/tuple/control_dependency_1',
 'gradients/Log_grad/Inv',
 'gradients/Log_grad/mul',
 'gradients/Log_1_grad/Inv',
 'gradients/Log_1_grad/mul',
 'gradients/add_grad/Shape',
 'gradients/add_grad/Shape_1',
 'gradients/add_grad/BroadcastGradientArgs',
 'gradients/add_grad/Sum',
 'gradients/add_grad/Reshape',
 'gradients/add_grad/Sum_1',
 'gradients/add_grad/Reshape_1',
 'gradients/add_grad/tuple/group_deps',
 'gradients/add_grad/tuple/control_dependency',
 'gradients/add_grad/tuple/control_dependency_1',
 'gradients/add_1_grad/Shape',
 'gradients/add_1_grad/Shape_1',
 'gradients/add_1_grad/BroadcastGradientArgs',
 'gradients/add_1_grad/Sum',
 'gradients/add_1_grad/Reshape',
 'gradients/add_1_grad/Sum_1',
 'gradients/add_1_grad/Reshape_1',
 'gradients/add_1_grad/tuple/group_deps',
 'gradients/add_1_grad/tuple/control_dependency',
 'gradients/add_1_grad/tuple/control_dependency_1',
 'gradients/sub_1_grad/Shape',
 'gradients/sub_1_grad/Shape_1',
 'gradients/sub_1_grad/BroadcastGradientArgs',
 'gradients/sub_1_grad/Sum',
 'gradients/sub_1_grad/Reshape',
 'gradients/sub_1_grad/Sum_1',
 'gradients/sub_1_grad/Neg',
 'gradients/sub_1_grad/Reshape_1',
 'gradients/sub_1_grad/tuple/group_deps',
 'gradients/sub_1_grad/tuple/control_dependency',
 'gradients/sub_1_grad/tuple/control_dependency_1',
 'gradients/AddN',
 'gradients/final-layer/Sigmoid_grad/sub/x',
 'gradients/final-layer/Sigmoid_grad/sub',
 'gradients/final-layer/Sigmoid_grad/mul',
 'gradients/final-layer/Sigmoid_grad/mul_1',
 'gradients/final-layer/h_grad/BiasAddGrad',
 'gradients/final-layer/h_grad/tuple/group_deps',
 'gradients/final-layer/h_grad/tuple/control_dependency',
 'gradients/final-layer/h_grad/tuple/control_dependency_1',
 'gradients/final-layer/MatMul_grad/MatMul',
 'gradients/final-layer/MatMul_grad/MatMul_1',
 'gradients/final-layer/MatMul_grad/tuple/group_deps',
 'gradients/final-layer/MatMul_grad/tuple/control_dependency',
 'gradients/final-layer/MatMul_grad/tuple/control_dependency_1',
 'gradients/full-connected-layer/Relu_grad/ReluGrad',
 'gradients/full-connected-layer/h_grad/BiasAddGrad',
 'gradients/full-connected-layer/h_grad/tuple/group_deps',
 'gradients/full-connected-layer/h_grad/tuple/control_dependency',
 'gradients/full-connected-layer/h_grad/tuple/control_dependency_1',
 'gradients/full-connected-layer/MatMul_grad/MatMul',
 'gradients/full-connected-layer/MatMul_grad/MatMul_1',
 'gradients/full-connected-layer/MatMul_grad/tuple/group_deps',
 'gradients/full-connected-layer/MatMul_grad/tuple/control_dependency',
 'gradients/full-connected-layer/MatMul_grad/tuple/control_dependency_1',
 'gradients/flatten/Reshape_grad/Shape',
 'gradients/flatten/Reshape_grad/Reshape',
 'gradients/Relu_3_grad/ReluGrad',
 'gradients/3/h_grad/BiasAddGrad',
 'gradients/3/h_grad/tuple/group_deps',
 'gradients/3/h_grad/tuple/control_dependency',
 'gradients/3/h_grad/tuple/control_dependency_1',
 'gradients/3/conv_grad/Shape',
 'gradients/3/conv_grad/Conv2DBackpropInput',
 'gradients/3/conv_grad/Shape_1',
 'gradients/3/conv_grad/Conv2DBackpropFilter',
 'gradients/3/conv_grad/tuple/group_deps',
 'gradients/3/conv_grad/tuple/control_dependency',
 'gradients/3/conv_grad/tuple/control_dependency_1',
 'gradients/Relu_2_grad/ReluGrad',
 'gradients/2/h_grad/BiasAddGrad',
 'gradients/2/h_grad/tuple/group_deps',
 'gradients/2/h_grad/tuple/control_dependency',
 'gradients/2/h_grad/tuple/control_dependency_1',
 'gradients/2/conv_grad/Shape',
 'gradients/2/conv_grad/Conv2DBackpropInput',
 'gradients/2/conv_grad/Shape_1',
 'gradients/2/conv_grad/Conv2DBackpropFilter',
 'gradients/2/conv_grad/tuple/group_deps',
 'gradients/2/conv_grad/tuple/control_dependency',
 'gradients/2/conv_grad/tuple/control_dependency_1',
 'gradients/Relu_1_grad/ReluGrad',
 'gradients/1/h_grad/BiasAddGrad',
 'gradients/1/h_grad/tuple/group_deps',
 'gradients/1/h_grad/tuple/control_dependency',
 'gradients/1/h_grad/tuple/control_dependency_1',
 'gradients/1/conv_grad/Shape',
 'gradients/1/conv_grad/Conv2DBackpropInput',
 'gradients/1/conv_grad/Shape_1',
 'gradients/1/conv_grad/Conv2DBackpropFilter',
 'gradients/1/conv_grad/tuple/group_deps',
 'gradients/1/conv_grad/tuple/control_dependency',
 'gradients/1/conv_grad/tuple/control_dependency_1',
 'gradients/Relu_grad/ReluGrad',
 'gradients/0/h_grad/BiasAddGrad',
 'gradients/0/h_grad/tuple/group_deps',
 'gradients/0/h_grad/tuple/control_dependency',
 'gradients/0/h_grad/tuple/control_dependency_1',
 'gradients/0/conv_grad/Shape',
 'gradients/0/conv_grad/Conv2DBackpropInput',
 'gradients/0/conv_grad/Shape_1',
 'gradients/0/conv_grad/Conv2DBackpropFilter',
 'gradients/0/conv_grad/tuple/group_deps',
 'gradients/0/conv_grad/tuple/control_dependency',
 'gradients/0/conv_grad/tuple/control_dependency_1',
 'beta1_power/initial_value',
 'beta1_power',
 'beta1_power/Assign',
 'beta1_power/read',
 'beta2_power/initial_value',
 'beta2_power',
 'beta2_power/Assign',
 'beta2_power/read',
 'zeros',
 '0/W/Adam',
 '0/W/Adam/Assign',
 '0/W/Adam/read',
 'zeros_1',
 '0/W/Adam_1',
 '0/W/Adam_1/Assign',
 '0/W/Adam_1/read',
 'zeros_2',
 '0/b/Adam',
 '0/b/Adam/Assign',
 '0/b/Adam/read',
 'zeros_3',
 '0/b/Adam_1',
 '0/b/Adam_1/Assign',
 '0/b/Adam_1/read',
 'zeros_4',
 '1/W/Adam',
 '1/W/Adam/Assign',
 '1/W/Adam/read',
 'zeros_5',
 '1/W/Adam_1',
 '1/W/Adam_1/Assign',
 '1/W/Adam_1/read',
 'zeros_6',
 '1/b/Adam',
 '1/b/Adam/Assign',
 '1/b/Adam/read',
 'zeros_7',
 '1/b/Adam_1',
 '1/b/Adam_1/Assign',
 '1/b/Adam_1/read',
 'zeros_8',
 '2/W/Adam',
 '2/W/Adam/Assign',
 '2/W/Adam/read',
 'zeros_9',
 '2/W/Adam_1',
 '2/W/Adam_1/Assign',
 '2/W/Adam_1/read',
 'zeros_10',
 '2/b/Adam',
 '2/b/Adam/Assign',
 '2/b/Adam/read',
 'zeros_11',
 '2/b/Adam_1',
 '2/b/Adam_1/Assign',
 '2/b/Adam_1/read',
 'zeros_12',
 '3/W/Adam',
 '3/W/Adam/Assign',
 '3/W/Adam/read',
 'zeros_13',
 '3/W/Adam_1',
 '3/W/Adam_1/Assign',
 '3/W/Adam_1/read',
 'zeros_14',
 '3/b/Adam',
 '3/b/Adam/Assign',
 '3/b/Adam/read',
 'zeros_15',
 '3/b/Adam_1',
 '3/b/Adam_1/Assign',
 '3/b/Adam_1/read',
 'zeros_16',
 'full-connected-layer/W/Adam',
 'full-connected-layer/W/Adam/Assign',
 'full-connected-layer/W/Adam/read',
 'zeros_17',
 'full-connected-layer/W/Adam_1',
 'full-connected-layer/W/Adam_1/Assign',
 'full-connected-layer/W/Adam_1/read',
 'zeros_18',
 'full-connected-layer/b/Adam',
 'full-connected-layer/b/Adam/Assign',
 'full-connected-layer/b/Adam/read',
 'zeros_19',
 'full-connected-layer/b/Adam_1',
 'full-connected-layer/b/Adam_1/Assign',
 'full-connected-layer/b/Adam_1/read',
 'zeros_20',
 'final-layer/W/Adam',
 'final-layer/W/Adam/Assign',
 'final-layer/W/Adam/read',
 'zeros_21',
 'final-layer/W/Adam_1',
 'final-layer/W/Adam_1/Assign',
 'final-layer/W/Adam_1/read',
 'zeros_22',
 'final-layer/b/Adam',
 'final-layer/b/Adam/Assign',
 'final-layer/b/Adam/read',
 'zeros_23',
 'final-layer/b/Adam_1',
 'final-layer/b/Adam_1/Assign',
 'final-layer/b/Adam_1/read',
 'Adam/learning_rate',
 'Adam/beta1',
 'Adam/beta2',
 'Adam/epsilon',
 'Adam/update_0/W/ApplyAdam',
 'Adam/update_0/b/ApplyAdam',
 'Adam/update_1/W/ApplyAdam',
 'Adam/update_1/b/ApplyAdam',
 'Adam/update_2/W/ApplyAdam',
 'Adam/update_2/b/ApplyAdam',
 'Adam/update_3/W/ApplyAdam',
 'Adam/update_3/b/ApplyAdam',
 'Adam/update_full-connected-layer/W/ApplyAdam',
 'Adam/update_full-connected-layer/b/ApplyAdam',
 'Adam/update_final-layer/W/ApplyAdam',
 'Adam/update_final-layer/b/ApplyAdam',
 'Adam/mul',
 'Adam/Assign',
 'Adam/mul_1',
 'Adam/Assign_1',
 'Adam',
 'init',
 'init_1',
 'init_2']

Now let's visualize the W tensor's weights for the first layer using the utils function montage_filters, just like we did for the MNIST dataset during the lecture. Recall from the lecture that this is another great way to inspect the performance of your network. If many of the filters look uniform, then you know the network is either under or overperforming. What you want to see are filters that look like they are responding to information such as edges or corners.


In [57]:
g = tf.get_default_graph()
W_ = sess.run(g.get_tensor_by_name('0/W:0'))
print(W_.shape)
assert(W_.dtype == np.float32)
m = utils.montage_filters(W_)
plt.figure(figsize=(5, 5))
plt.imshow(m)
plt.imsave(arr=m, fname='audio.png')


(4, 4, 1, 32)

We can also look at every layer's filters using a loop:


In [58]:
g = tf.get_default_graph()
for layer_i in range(len(n_filters)):
    W = sess.run(g.get_tensor_by_name('{}/W:0'.format(layer_i)))
    plt.figure(figsize=(5, 5))
    plt.imshow(utils.montage_filters(W))
    plt.title('Layer {}\'s Learned Convolution Kernels'.format(layer_i))


In the next session, we'll learn some much more powerful methods of inspecting such networks.

Assignment Submission

After you've completed the notebook, create a zip file of the current directory using the code below. This code will make sure you have included this completed ipython notebook and the following files named exactly as:

    session-3/
      session-3.ipynb
      test.png
      recon.png
      sorted.png
      manifold.png
      test_xs.png
      audio.png

You'll then submit this zip file for your second assignment on Kadenze for "Assignment 3: Build Unsupervised and Supervised Networks"! Remember to post Part Two to the Forum to receive full credit! If you have any questions, remember to reach out on the forums and connect with your peers or with me.

To get assessed, you'll need to be a premium student! This will allow you to build an online portfolio of all of your work and receive grades. If you aren't already enrolled as a student, register now at http://www.kadenze.com/ and join the #CADL community to see what your peers are doing! https://www.kadenze.com/courses/creative-applications-of-deep-learning-with-tensorflow/info

Also, if you share any of the GIFs on Facebook/Twitter/Instagram/etc..., be sure to use the #CADL hashtag so that other students can find your work!


In [59]:
utils.build_submission('session-3.zip',
                       ('test.png',
                        'recon.png',
                        'sorted.png',
                        'manifold.png',
                        'test_xs.png',
                        'audio.png',
                        'session-3.ipynb'))


Your assignment zip file has been created!
Now submit the file:
/notebooks/session-3/session-3.zip
to Kadenze for grading!

Coming Up

In session 4, we'll start to interrogate pre-trained Deep Convolutional Networks trained to recognize 1000 possible object labels. Along the way, we'll see how by inspecting the network, we can perform some very interesting image synthesis techniques which led to the Deep Dream viral craze. We'll also see how to separate the content and style of an image and use this for generative artistic stylization! In Session 5, we'll explore a few other powerful methods of generative synthesis, including Generative Adversarial Networks, Variational Autoencoding Generative Adversarial Networks, and Recurrent Neural Networks.