Lab 14


In [ ]:
!pip install -U tensorflow

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import math

from mnist_viz import *

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab14.ok')

Today's lab is a reprise of TensorFlow and a brief foray into a more advanced topic in machine learning: deep neural networks.

To keep things simple, we will work on the same problem and dataset as in lab 12: MNIST. (This was a standard dataset for testing classifiers until just a few years ago; today it is considered too small and easy.) Our inputs are grayscale images of handwritten numerical digits. Our task is to determine the digit the writer intended to write.

Run the next cell to download the dataset again.


In [ ]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
print("Training set shape: {}\nValidation set shape: {}\nTest set shape: {}"
      .format(mnist.train.images.shape, mnist.validation.images.shape, mnist.test.images.shape))
image_size = 28
num_features = image_size**2
num_labels = mnist.train.labels.shape[1]

Run the next cell to display some of the digit images.


In [ ]:
# Indices of some example images from the training set:
examples_to_show = np.array([0,  5100, 10200, 15300, 20400, 25500, 30600, 35700, 40800, 45900])

show_flat_images(mnist.train.images[examples_to_show], ncols=5,
                 title="Some examples from the training set",
                 image_titles=examples_to_show)

print("Labels for printed examples:\n{}".format(mnist.train.labels[examples_to_show]))

Now we can define a pipeline to classify these images.

Recall that TensorFlow allows us to think of models as pipelines of variables, or tensors. (A 0D tensor is a number; a 1D tensor is an array; and a 2D tensor is a matrix.) The workflow for creating a model is:

  1. We first define placeholder variables for the inputs used to train the model.
  2. We define parameter variables for the parameters of our model, which we would like to learn.
  3. We apply operations (like multiplication) to the inputs and parameters to compute the output of our pipeline.
  4. We apply operations to the output and the true output to define a loss.
  5. We define an operator to perform gradient descent. When run, it will attempt to choose parameters to minimize the given loss.
  6. We create a session object and use it to run the gradient descent operator.

After training a model in this way, we can use the same session object to:

  1. Retrieve the values of parameters.
  2. Apply our model pipeline to compute its output on new inputs. (We can also compute the values of any intermediate variables we defined.)
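To make this workflow concrete, here is a minimal toy sketch of all six steps, fitting a one-parameter least-squares model to made-up data. Every name starting with toy_ (and the parameter w) is invented just for this illustration and is not part of the lab.

# 1. Placeholders for the inputs to training.
toy_x = tf.placeholder(tf.float32, [None])
toy_y = tf.placeholder(tf.float32, [None])
# 2. A parameter variable we would like to learn.
w = tf.Variable(0.0)
# 3. The output of the pipeline.
toy_predictions = w * toy_x
# 4. A loss (here, mean squared error).
toy_loss = tf.reduce_mean(tf.square(toy_predictions - toy_y))
# 5. An operator that performs one step of gradient descent on the loss.
toy_step = tf.train.GradientDescentOptimizer(0.1).minimize(toy_loss, var_list=[w])
# 6. A session to run the gradient descent operator and inspect w afterward.
toy_sess = tf.Session()
toy_sess.run(tf.global_variables_initializer())
for _ in range(100):
    toy_sess.run(toy_step, feed_dict={toy_x: [1., 2., 3.], toy_y: [2., 4., 6.]})
print(toy_sess.run(w))   # should be close to 2
toy_sess.close()         # close this session so it doesn't interfere with the lab's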

Question 1

Fill in the unfinished parts of the cell below to define a pipeline for training logistic regression.


In [ ]:
# Variables for our images and labels:
x = tf.placeholder(tf.float32, [None, num_features])
y_ = tf.placeholder(tf.float32, [None, ...])

# Variables for parameters:
theta = tf.Variable(tf.zeros([..., num_labels]))
b = tf.Variable(tf.zeros([num_labels]))

# Variable for the output of our classifier (not a hard
# classification, but a number between 0 and 1, the result
# of the softmax function):
y = tf.nn.softmax(tf.matmul(..., ...) + ...)

# Define the regularization penalty.  We didn't do this
# last time, but it's important.
alpha = 4e-4
l2_regularizer = tf.contrib.layers.l2_regularizer(scale=alpha, scope=None)
regularization_penalty = tf.contrib.layers.apply_regularization(l2_regularizer, [theta])

# Variable for the loss suffered for each training example.
# This is the cross-entropy loss, which is the negative log
# likelihood our model assigns to the true labels, if we
# think of the output of the softmax function as a
# probability.
loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))\
       + regularization_penalty

# Operator to perform a step of gradient descent on our loss:
step_size = 1.0
train_step = tf.train.GradientDescentOptimizer(step_size).minimize(...)

Run the next cell to perform several steps of gradient descent (actually, mini-batch stochastic gradient descent) on the training set.


In [ ]:
def create_session_and_optimize(train_step, num_iterations=2000, batch_size=100):
    sess = tf.InteractiveSession()
    tf.global_variables_initializer().run()

    for _ in range(num_iterations):
        batch_xs, batch_ys = mnist.train.next_batch(batch_size)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    
    return sess

sess = create_session_and_optimize(train_step)

Now run the next cell to apply your model to the validation images.


In [ ]:
model_classifications = tf.argmax(y,1)
true_classes = tf.argmax(y_,1)
correct_prediction = tf.equal(model_classifications, true_classes)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Accuracy on the validation set:")
print(sess.run(accuracy, feed_dict={x: mnist.validation.images, y_: mnist.validation.labels}))
print("Model classifications and true classes:")
print(sess.run([model_classifications, true_classes], feed_dict={x: mnist.validation.images, y_: mnist.validation.labels}))

To compute the outputs of a model, we pass the output variable(s) as arguments to the method sess.run. The cell above does this twice, so check it for the syntax. The code cell before that did it thousands of times, and each run of train_step in that cell had the side effect of updating theta and b. (The state of each variable is held in sess.)
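For example, here is a small sketch (purely illustrative; some_images and softmax_outputs are names invented here) that fetches the model's softmax outputs for the first five validation images. Since y depends only on x, we only need to feed x:

# Fetch the trained model's softmax outputs for five validation images.
# y, x, sess, and mnist are all defined in earlier cells.
some_images = mnist.validation.images[:5]
softmax_outputs = sess.run(y, feed_dict={x: some_images})
print(softmax_outputs.shape)        # (5, 10): one number per digit type, per image
print(softmax_outputs.sum(axis=1))  # each row sums to (approximately) 1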

Examining the classifier

Question 2

Use sess.run to find the values of theta and b.

Note: Since theta and b are standalone variables that don't depend on the model's inputs, you don't need to use the argument feed_dict here.

Note 2: You will probably find that all the printed values of theta_2 are 0. Those entries correspond to pixels at the edges of the image, which are blank in essentially every training image, so their weights never move away from their initial value of 0. Don't worry, that's correct.


In [ ]:
theta_2 = ...
b_2 = ...
theta_2, b_2

Question 3

theta_2 is now a 2D NumPy array that contains one column of our model's parameters for each of the 10 digit types. Each column has length 784, one entry per pixel. So we could, if we wanted to, display each column as though it were an image. Use show_flat_images to do that for each of the 10 columns.

show_flat_images (defined in mnist_viz.py) has the following signature:

def show_flat_images(images, ncols=2, image_titles=None, **kwargs):
    """
    Shows one or more images, represented as flat length-784 arrays.

    images: Image, list of images, or array of images.  Each image
      is a 1D array of length 784.  The images can also be (title,
      array) pairs, where the title is any string you want to
      display as the image's title.
    """

Hint: You may need to use the array method transpose.

Important: Unlike an image, theta_2 can have negative values. show_flat_images will display negative values in red and positive values in black.


In [ ]:
...

You should find that the black and grey parts of the first image look a little bit like a 0 and the second a little bit like a 1! The resemblance is far from perfect, especially for other numbers; we'll worry about that next.

Why should there be a resemblance at all?

When we compute the classification of an example $X$, we first compute $X \theta$. (Of course, we then add an element of $b$ and take the softmax, but the product $X \theta$ is really the important part.)

That means we take the dot product of each column of $\theta$ with the image data. Each element of the column lines up with one feature, which is one pixel.

This produces 10 scores, one for each of the 10 digit types. A higher score (a higher dot product) implies that the image is more like that digit type.
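As a quick sketch of that scoring (assuming you have already computed theta_2 and b_2 in Question 2, as NumPy arrays of shapes (784, 10) and (10,)):

# Compute the 10 scores for a single training image using the learned
# parameters from Question 2.  Taking the softmax would not change which
# digit type scores highest, so argmax(scores) is the model's classification.
one_image = mnist.train.images[0]        # a flat, length-784 image
scores = one_image.dot(theta_2) + b_2    # one score per digit type
print(scores)
print("Highest-scoring digit type:", np.argmax(scores))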

In image and signal processing, it is common to call each column of $\theta$ a filter. A filter is a 784-long array of numbers, each lined up with one element of $X$, which is one pixel.

If you were designing a filter for the number 7, you would want your filter to have a high dot product with images that are 7s, and a low dot product with other images (so that other images aren't misclassified as 7s). That means you'd want it to have big positive values where pixels are often black in pictures of 7s, and small or negative values elsewhere.

Our gradient descent optimization found roughly the best such filters for our training set. But we can still use the filter idea to build a dramatically better classifier.

Run the cell below to visualize how some of your classifications went wrong. In each row:

  1. The left image is the actual image.
  2. The second image is the filter that (incorrectly) matched the image best.
  3. The third image is the elementwise product of that filter and the image. The dot product of the filter and the image is the sum of that elementwise product. So this tells us how the filter went wrong.
  4. The fourth image is the filter for the image's true label.
  5. The fifth image is the elementwise product of that filter and the image.

In [ ]:
display_mistakes(mnist, sess, x, y_, theta, model_classifications, true_classes, correct_prediction)

A big problem with the linear classifier is that it only allows a single filter for each number. That means the single filter for 7 has to match a 7 at the left or right side of the image, for example. And it also has to match a 7 that's very skinny or a 7 that's tilted to the right a bit. The result is that the filters match some common features of each number, but they look generally fuzzy.

Suppose we could allow 2 filters for each number. We could say that if an image matches either of the 2 filters for a number, then it is probably that number. Having 2 filters would let us design filters that target different ways of writing the number, producing less "mushy" filters. For example, one filter could match a 7 in the center of the image, and the other could match a 7 shifted toward the left.

Mathematically, we can implement that "match either filter" logic by computing the scores for each number for both filters and then taking the maximum of the two scores.
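Concretely, writing $\theta^{(k,1)}$ and $\theta^{(k,2)}$ for the two filters for digit type $k$, and $b_{k,1}$, $b_{k,2}$ for their bias terms (notation introduced just for this explanation), the score for digit type $k$ on an image $X$ is

$$\text{score}_k(X) = \max\left(X \theta^{(k,1)} + b_{k,1},\ X \theta^{(k,2)} + b_{k,2}\right),$$

and we apply the softmax to these 10 scores just as before. This is what the tf.reduce_max call in the next cell computes.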

Question 4

Fill in the next cell to complete a prediction pipeline that uses this idea. (There are three "..."s to fill in.)


In [ ]:
def create_model_4(alpha=1e-6, num_filters_per_label=2, step_size=1.0):
    # Variables for parameters:
    filters = tf.Variable(tf.random_normal([num_features, num_labels, num_filters_per_label], stddev=0.4))
    b = tf.Variable(tf.zeros([num_labels, num_filters_per_label]))

    # This time the classifier has several intermediate steps.
    filters_out = tf.matmul(x, tf.reshape(filters, [-1, num_labels*num_filters_per_label]))
    # The score for each class, a matrix with dimensions [N, 10].
    combined_score = tf.reduce_max(tf.reshape(filters_out, [-1, num_labels, num_filters_per_label]) + b, reduction_indices=[2])
    # The output of the classifier, a matrix with dimensions [N, 10].
    # The result of applying the softmax function to the score for
    # each class.
    y = ...

    # The regularization penalty.  We regularize all the filters,
    # but not the bias term b.
    l2_regularizer = tf.contrib.layers.l2_regularizer(scale=alpha, scope=None)
    regularization_penalty = tf.contrib.layers.apply_regularization(l2_regularizer, [filters])

    # The same cross-entropy loss as in our first model.  Be sure to
    # include the regularization penalty.  Check the code for the
    # first model, but think about whether there should be any
    # differences.
    loss = ...

    # Operator to perform a step of gradient descent on our loss.
    # Be sure to use the variable step_size passed as an argument
    # to this function.
    train_step = ...
    
    return (filters, y, train_step)

Run the next cell to train this new model.

Note: Finding the best linear classifier with logistic regression is easy, but training this new model is not: its loss is no longer convex, and we need many more iterations of stochastic gradient descent to find a good model. If this takes too long on your computer, please set num_iterations to a smaller number. The interpretability of the resulting filters may suffer.


In [ ]:
filters_4, output_4, train_step_4 = create_model_4(alpha=4e-4, step_size=2)
sess_4 = train_and_display(mnist, x, y_, filters_4, output_4, train_step_4, num_iterations=20000)

Another way to look at errors: Confusion matrices

You should find that the new pipeline has moderately better accuracy (somewhere between 92.5% and 95.5% correct). This still leaves a high error rate for MNIST. "Deep" neural networks with many more filters and several "layers" can achieve accuracy above 99%.

Another useful tool for examining the remaining errors is a confusion matrix. This is just a way to visualize the answers to the following questions:

"Among all the 0s, what proportion did we classify as 0s, 1s, 2s, and so on? How about the other numbers?"

Traditionally, we arrange the results in a grid and shade the grid according to the proportions.
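As a quick illustration of what the matrix contains (this is not the implementation Question 5 asks for, and it assumes scikit-learn is installed in your environment), consider a toy three-class example:

# A toy three-class confusion matrix.  Element [i, j] is the proportion
# of all examples that had true class i and were classified as j.
# (scikit-learn is used here only for illustration; Question 5 asks you
# to compute the matrix yourself.)
from sklearn.metrics import confusion_matrix

toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_classified = np.array([0, 1, 1, 1, 2, 0])
toy_matrix = confusion_matrix(toy_true, toy_classified, labels=[0, 1, 2]) / float(len(toy_true))
print(toy_matrix)
# For example, element [0, 1] is 1/6, because exactly one of the six
# examples was a 0 that the classifier labeled as a 1.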

Question 5

Complete the function compute_confusion_matrix in the cell below. Then run the next cell to see the confusion matrix for the new prediction pipeline.


In [ ]:
def compute_confusion_matrix(classifications, true_classes, num_classes):
    """Compute the confusion matrix for a given list of classifications
    that were computed by a classifier on some dataset.
    
    A confusion matrix tells us how often the classifier "confused" each
    class for each other class.  So it contains one number for each
    ordered pair of classes.  If the documentation in this function isn't
    clear enough, see here for a tutorial on confusion matrices:
    
      http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    
    Args:
      classifications (ndarray): A 1D array of integers, each between
        0 and num_classes-1.  These are the classifications produced by a
        model on some dataset.  Element i is the classification for example
        i in the dataset.
      true_classes (ndarray): A 1D array of integers, each between
        0 and num_classes-1.  These are the true classes of the examples in
        some dataset.  Element i is the true class for example i in the
        dataset.
      num_classes (int): The number of classes.  Each class is a number
        between 0 and num_classes-1.
    
    Returns:
      (ndarray): A 2-dimensional array of numbers.  Element [i,j] is the
        proportion of examples in the dataset that had true class i and
        were assigned class j by the classifier."""
    # This is just a recommended skeleton; you can delete it.
    ...
    counts = ...
    ...

In [ ]:
# Run this cell after you've written compute_confusion_matrix.
# You may find that there are so few errors that it's hard to
# see them.
display_confusion_matrix(compute_confusion_matrix, mnist, sess_4, x, y_, output_4, num_labels)

Question 6

Based on the confusion matrix, what was the most common confusion?


In [ ]:
this_was_the_true_class = ...
but_this_was_the_classification = ...

Submitting your assignment

If you made a good-faith effort to complete the lab, change i_finished_the_lab to True in the cell below. In any case, run the cells below to submit the lab.


In [ ]:
i_finished_the_lab = False

In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()

In [ ]:
_ = ok.submit()

Now, run these commands in your terminal to make a git commit that saves a snapshot of your changes. The last command, git push, will send your work to your personal GitHub repo.

# Tell git to commit your changes to this notebook
git add -A

# Tell git to make the commit
git commit -m "lab14 finished"

# Send your updates to your personal private repo
git push origin master