T81-558: Applications of Deep Neural Networks

Class 8: Convolutional Neural Networks.

This class will focus on computer vision. There are some important differences and similarities with previous neural networks.

  • We will usually use classification, though regression is still an option.
  • The input to the neural network is now 3D (height, width, color).
  • Data are not transformed: no z-scores or dummy variables.
  • Processing time is much longer.
  • We now have different layer types: dense layers (just like before), convolution layers and max pooling layers.
  • Data will no longer arrive as CSV files. TensorFlow provides some utilities for going directly from image to the input of a neural network.
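
To make the 3D input shape concrete, here is a minimal sketch that uses NumPy arrays as stand-ins for images. The variable names and the random pixel values are illustrative assumptions; only the shapes matter. The 28x28 grayscale and 32x32 RGB sizes correspond to MNIST and CIFAR images.

```python
import numpy as np

rng = np.random.default_rng(0)

# A grayscale image has a single color channel: (height, width, 1).
gray_image = rng.integers(0, 256, size=(28, 28, 1), dtype=np.uint8)

# An RGB image has three color channels: (height, width, 3).
rgb_image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# A batch of images adds a leading dimension: (batch, height, width, channels).
batch = np.stack([gray_image] * 4)

print(gray_image.shape, rgb_image.shape, batch.shape)
```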

Computer Vision Data Sets

There are many data sets for computer vision. Two of the most popular are the MNIST digits data set and the CIFAR image data sets.

MNIST Digits Data Set

The MNIST Digits Data Set is very popular in the neural network research community. A sample of it can be seen here:

This data set was generated from scanned forms.

CIFAR Data Set

The CIFAR-10 and CIFAR-100 datasets are also frequently used by the neural network research community.

The CIFAR-10 data set contains low-resolution images that are divided into 10 classes. The CIFAR-100 data set contains 100 classes in a hierarchy.

Convolutional Neural Networks (CNNs)

The convolutional neural network (CNN) is a neural network technology that has profoundly impacted the area of computer vision (CV). Fukushima (1980) introduced the original concept of a convolutional neural network, and LeCun, Bottou, Bengio & Haffner (1998) greatly improved this work. From this research, Yann LeCun introduced the famous LeNet-5 neural network architecture. This class follows the LeNet-5 style of convolutional neural network.

A LeNet-5 Network (LeCun, 1998)

So far we have only seen one layer type (dense layers). By the end of this course we will have seen:

  • Dense Layers - Fully connected layers. (introduced previously)
  • Convolution Layers - Used to scan across images. (introduced this class)
  • Max Pooling Layers - Used to downsample images. (introduced this class)
  • Dropout Layer - Used to add regularization. (introduced next class)

Convolution Layers

The first layer that we will examine is the convolutional layer. We will begin by looking at the hyper-parameters that you must specify for a convolutional layer in most neural network frameworks that support the CNN:

  • Number of filters
  • Filter Size
  • Stride
  • Padding
  • Activation Function/Non-Linearity

The primary purpose for a convolutional layer is to detect features such as edges, lines, blobs of color, and other visual elements. The filters can detect these features. The more filters that we give to a convolutional layer, the more features it can detect.

A filter is a square-shaped object that scans over the image. A grid can represent the individual pixels of an image. You can think of each filter of the convolutional layer as a smaller grid that sweeps left to right over each row of the image. A hyper-parameter specifies both the width and height of the square-shaped filter. Figure 10.1 shows this configuration in which you see the six convolutional filters sweeping over the image grid:

A convolutional layer has weights between it and the previous layer or image grid. Each cell of each convolutional filter is a weight. Therefore, for a single-channel input and ignoring bias terms, the number of weights between a convolutional layer and its predecessor layer or image field is the following:

[FilterSize] * [FilterSize] * [# of Filters]

For example, if the filter size were 5 (5x5) for 10 filters, there would be 250 weights.
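
The weight count can be checked with a short function. This is a minimal sketch; the function name and the `in_channels` parameter are illustrative additions that are not part of the formula above. `in_channels` defaults to 1 so that the result matches the formula for a single-channel input, and bias terms are ignored.

```python
def conv_weight_count(filter_size, n_filters, in_channels=1):
    """Weights in a convolutional layer with square filters, ignoring bias terms."""
    return filter_size * filter_size * in_channels * n_filters


# Filter size 5 with 10 filters on a single-channel input.
print(conv_weight_count(filter_size=5, n_filters=10))  # 250
```

When the previous layer has more than one channel, each filter needs one weight per cell per input channel, which is what `in_channels` accounts for.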

You need to understand how the convolutional filters sweep across the previous layer’s output or image grid. Figure 10.2 illustrates the sweep:

The above figure shows a convolutional filter with a size of 4 and a padding size of 1. The padding size is responsible for the border of zeros in the area that the filter sweeps. Even though the image is actually 8x7, the extra padding provides a virtual image size of 9x8 for the filter to sweep across. The stride specifies the number of positions at which the convolutional filters will stop. The convolutional filters move to the right, advancing by the number of cells specified in the stride. Once the far right is reached, the convolutional filter moves back to the far left, then it moves down by the stride amount and continues to the right again.

Some constraints exist in relation to the size of the stride. Obviously, the stride cannot be 0. The convolutional filter would never move if the stride were set to 0. Furthermore, neither the stride nor the convolutional filter size can be larger than the previous grid. There are additional constraints on the stride (s), padding (p) and the filter width (f) for an image of width (w). Specifically, the convolutional filter must be able to start at the far left or top border, move a certain number of strides, and land on the far right or bottom border. The following equation shows the number of steps a convolutional operator must take to cross the image:
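
The sweep described above can be sketched in NumPy. This is an illustrative, unoptimized implementation rather than what TensorFlow does internally; the function name, the toy image, and the hand-made vertical-edge filter are all assumptions made for the example. In a real CNN the filter weights are learned during training.

```python
import numpy as np


def convolve2d(image, kernel, stride=1, padding=0):
    """Naive 2D filter sweep (cross-correlation, as used in CNN layers)."""
    # Surround the image with a border of zeros.
    padded = np.pad(image, padding, mode="constant", constant_values=0)
    f = kernel.shape[0]
    out_h = (padded.shape[0] - f) // stride + 1
    out_w = (padded.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for row in range(out_h):        # move down by the stride
        for col in range(out_w):    # move right by the stride
            top, left = row * stride, col * stride
            patch = padded[top:top + f, left:left + f]
            out[row, col] = np.sum(patch * kernel)
    return out


# Toy image with a vertical edge: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-made vertical-edge filter.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = convolve2d(image, kernel, stride=1, padding=0)
print(response)
```

The response is largest where the filter straddles the edge and zero over the flat regions, which is the sense in which a filter "detects" a feature.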

$ steps = \frac{w - f + 2p}{s} + 1 $

The number of steps must be an integer. In other words, it cannot have decimal places. The padding (p) can be adjusted so that this equation yields an integer value.
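
As a quick check of the integer constraint, the sketch below computes the number of filter positions using the standard relation (w - f + 2p) / s + 1 and rejects settings where the filter would not land exactly on the far border. The function name and error messages are illustrative assumptions.

```python
def conv_steps(w, f, p, s):
    """Number of positions a filter of width f visits across a width-w input."""
    if s < 1:
        raise ValueError("stride must be at least 1")
    span = w - f + 2 * p
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % s != 0:
        raise ValueError("filter does not land on the far border; adjust the padding")
    return span // s + 1


print(conv_steps(w=28, f=5, p=2, s=1))  # 28
print(conv_steps(w=7, f=3, p=0, s=2))   # 3
```

With w=8, f=3, p=0 and s=2 the function raises an error, because (8 - 3) is not divisible by 2.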

Max Pooling Layers

Max-pool layers downsample a 3D box to a new one with smaller dimensions. Typically, you place a max-pool layer immediately following a convolutional layer. The LeNet-5 network shows a max-pool layer immediately after layers C1 and C3. These max-pool layers progressively decrease the dimensions of the 3D boxes passing through them. This technique can help avoid overfitting (Krizhevsky, Sutskever & Hinton, 2012).

A pooling layer has the following hyper-parameters:

  • Spatial Extent (f )
  • Stride (s)

Unlike convolutional layers, max-pool layers do not use padding. Additionally, max-pool layers have no weights, so training does not affect them. These layers simply downsample their 3D box input. The 3D box output by a max-pool layer will have a width equal to this equation:

$ w_2 = \frac{w_1 - f}{s} + 1 $

The height of the 3D box produced by the max-pool layer is calculated similarly with this equation:

$ h_2 = \frac{h_1 - f}{s} + 1 $

The depth of the 3D box produced by the max-pool layer is equal to the depth of the 3D box received as input. The most common settings for the hyper-parameters of a max-pool layer are f=2 and s=2. The spatial extent (f) specifies that boxes of 2x2 will be scaled down to single pixels. Of these four pixels, the pixel with the maximum value will represent the 2x2 block in the new grid. Because each block of four pixels is replaced by a single pixel, 75% of the pixel information is lost. The following figure shows this transformation as a 6x6 grid becomes a 3x3:

Of course, the above diagram shows each pixel as a single number. A grayscale image would have this characteristic. For an RGB image, most frameworks, including TensorFlow's tf.nn.max_pool, apply max pooling to each color channel independently, so the depth of the box is unchanged.
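
The 6x6 to 3x3 transformation can be reproduced with a few lines of NumPy. This is a minimal sketch for a single-channel grid; the function name and the sample grid (the numbers 0 through 35) are illustrative assumptions and are not the values from the figure.

```python
import numpy as np


def max_pool(grid, f=2, s=2):
    """Max pooling over a 2D grid with spatial extent f and stride s, no padding."""
    out_h = (grid.shape[0] - f) // s + 1
    out_w = (grid.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w), dtype=grid.dtype)
    for r in range(out_h):
        for c in range(out_w):
            # Keep only the largest value in each f x f block.
            out[r, c] = grid[r * s:r * s + f, c * s:c * s + f].max()
    return out


grid = np.arange(36).reshape(6, 6)
pooled = max_pool(grid)
print(pooled)
# [[ 7  9 11]
#  [19 21 23]
#  [31 33 35]]
```

Note that the layer has nothing to train: the output depends only on the input values and on f and s.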

TensorFlow with CNNs

Access to Data Sets

TensorFlow provides built-in access classes for MNIST. It is important to note that MNIST data arrives already separated into three sets:

  • train - Neural network will be trained with this.
  • validation - Used for early stopping.
  • test - Used to evaluate the final network.

In [1]:
"""Functions for downloading and reading MNIST data."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets

# Loading MNIST data
mnist = read_data_sets('MNIST_data')

Define CNN


In [2]:
import tensorflow.contrib.learn as skflow
from sklearn import datasets, metrics

def max_pool_2x2(tensor_in):
    return tf.nn.max_pool(tensor_in, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
        padding='SAME')

def conv_model(X, y):
    # Reshape X to a 4D tensor: [batch, height, width, color channels].
    # MNIST images are 28x28 with a single (grayscale) channel.
    X = tf.reshape(X, [-1, 28, 28, 1])
    # Conv Layer #1: 32 channels/neurons for each 5x5 patch
    with tf.variable_scope('conv_layer1'):
        h_conv1 = skflow.ops.conv2d(X, n_filters=32, filter_shape=[5, 5], 
                                    bias=True, activation=tf.nn.relu)
        h_pool1 = max_pool_2x2(h_conv1)
    # second conv layer will compute 64 channels for each 5x5 patch
    with tf.variable_scope('conv_layer2'):
        h_conv2 = skflow.ops.conv2d(h_pool1, n_filters=64, filter_shape=[5, 5], 
                                    bias=True, activation=tf.nn.relu)
        h_pool2 = max_pool_2x2(h_conv2)
        # Reshape tensor into a batch of vectors
        h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    # densely connected layer with 256 neurons
    h_fc1 = skflow.ops.dnn(h_pool2_flat, [256], activation=tf.nn.relu)
    return skflow.models.logistic_regression(h_fc1, y)


Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
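
The `7 * 7 * 64` in the reshape above follows from tracing the image size through the network. The sketch below does that arithmetic in plain Python. It assumes that both convolution layers preserve the spatial size ('SAME' padding with a stride of 1), which the reshape in `conv_model` implies but the code does not state explicitly. The helper names are illustrative.

```python
def same_conv(size):
    # Assumption: 'SAME' padding with stride 1 leaves the spatial size unchanged.
    return size


def pool_2x2(size):
    # 2x2 max pooling with stride 2 and 'SAME' padding: ceil(size / 2).
    return -(-size // 2)


size = 28                          # MNIST images are 28x28
size = pool_2x2(same_conv(size))   # after conv_layer1 and pooling: 14
size = pool_2x2(same_conv(size))   # after conv_layer2 and pooling: 7
flat_length = size * size * 64     # 64 filters in the second convolution layer

print(size, flat_length)  # 7 3136
```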

Training/Fitting CNN

Two training methods are provided. Run only one of them; both are not needed. Once training is complete, go to the next section to test accuracy.


In [4]:
# To fit/train use either the simple train (this box) or the early stop (next box)
# Do not use both.  This box is faster, but no early stop
classifier = skflow.TensorFlowEstimator(
    model_fn=conv_model, n_classes=10, batch_size=100, steps=3000,
    learning_rate=0.001)

classifier.fit(mnist.train.images, mnist.train.labels)


Step #99, avg. train loss: 2.25026
Step #199, avg. train loss: 2.01235
Step #299, avg. train loss: 1.71387
Step #399, avg. train loss: 1.35640
Step #499, avg. train loss: 1.02610
Step #600, epoch #1, avg. train loss: 0.80594
Step #700, epoch #1, avg. train loss: 0.66509
Step #800, epoch #1, avg. train loss: 0.58332
Step #900, epoch #1, avg. train loss: 0.51632
Step #1000, epoch #1, avg. train loss: 0.48027
Step #1100, epoch #2, avg. train loss: 0.44487
Step #1200, epoch #2, avg. train loss: 0.42161
Step #1300, epoch #2, avg. train loss: 0.39758
Step #1400, epoch #2, avg. train loss: 0.36532
Step #1500, epoch #2, avg. train loss: 0.35838
Step #1600, epoch #2, avg. train loss: 0.34282
Step #1700, epoch #3, avg. train loss: 0.34383
Step #1800, epoch #3, avg. train loss: 0.32769
Step #1900, epoch #3, avg. train loss: 0.30998
Step #2000, epoch #3, avg. train loss: 0.31082
Step #2100, epoch #3, avg. train loss: 0.29584
Step #2200, epoch #4, avg. train loss: 0.29659
Step #2300, epoch #4, avg. train loss: 0.28404
Step #2400, epoch #4, avg. train loss: 0.26940
Step #2500, epoch #4, avg. train loss: 0.28359
Step #2600, epoch #4, avg. train loss: 0.27643
Step #2700, epoch #4, avg. train loss: 0.26371
Step #2800, epoch #5, avg. train loss: 0.26581
Step #2900, epoch #5, avg. train loss: 0.25664
Step #3000, epoch #5, avg. train loss: 0.25396
Out[4]:
TensorFlowEstimator(batch_size=100, class_weight=None, clip_gradients=5.0,
          config=None, continue_training=False, learning_rate=0.001,
          model_fn=<function conv_model at 0x7f472f41d1e0>, n_classes=10,
          optimizer='Adagrad', steps=3000, verbose=1)
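
The epoch numbers in the log above can be related to the step count. The sketch below assumes a training set of 55,000 images, which is not stated in this notebook, and a batch size of 100 as passed to the estimator. How the library computes the epoch internally is not shown here, so treat this as an approximation that happens to agree with the log.

```python
def epoch_for_step(step, batch_size=100, n_train=55000):
    """Completed passes over the training data after `step` batches.

    n_train=55000 is an assumption about the size of the MNIST training split.
    """
    return (step * batch_size) // n_train


for step in (499, 600, 1100, 1700, 2200, 2800, 3000):
    print(step, epoch_for_step(step))
```

Under these assumptions one epoch is 550 steps, so 3000 steps cover a little more than five passes over the training data.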

In [ ]:
# Early stopping - WARNING, this is slow on Data Scientist Workbench
# Do not run both this box and the previous one. Choose one or the other.

# Training and predicting
classifier = skflow.TensorFlowEstimator(
    model_fn=conv_model, n_classes=10, batch_size=100, steps=500,
    learning_rate=0.001)


early_stop = skflow.monitors.ValidationMonitor(mnist.validation.images, 
    mnist.validation.labels, n_classes=10,
    early_stopping_rounds=200, print_steps=50)

# Fit/train neural network
classifier.fit(mnist.train.images, mnist.train.labels, monitor=early_stop)
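
The idea behind early stopping can be shown without TensorFlow. The sketch below is a generic patience-based rule, not the implementation of `ValidationMonitor`; the function name, the `patience` parameter (which presumably plays the role of `early_stopping_rounds`), and the sample losses are all assumptions made for illustration.

```python
def train_with_early_stopping(validation_losses, patience):
    """Return (stop_step, best_step) for a sequence of validation losses.

    Training stops once `patience` steps pass without a new best loss.
    """
    best_loss = float("inf")
    best_step = -1
    for step, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_step = loss, step
        elif step - best_step >= patience:
            return step, best_step
    return len(validation_losses) - 1, best_step


# Hypothetical validation losses: improvement stops after step 2.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
stop_step, best_step = train_with_early_stopping(losses, patience=3)
print(stop_step, best_step)  # 5 2
```

Validation data is used for this decision so that the test set remains untouched until the final evaluation.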

Evaluate Accuracy


In [5]:
from sklearn import metrics

# Evaluate success using accuracy
pred = classifier.predict(mnist.test.images)
score = metrics.accuracy_score(mnist.test.labels, pred)
print("Accuracy score: {}".format(score))


Accuracy score: 0.9355
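
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand with NumPy on a tiny made-up example; the function name and the sample labels are illustrative assumptions and are unrelated to the MNIST result above.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


labels = [7, 2, 1, 0, 4]
preds = [7, 2, 1, 0, 9]
print(accuracy(labels, preds))  # 0.8
```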
