Class 8: Convolutional Neural Networks.
This class focuses on computer vision. Convolutional neural networks share many similarities with the networks we have already seen, but they also introduce some important differences.
There are many data sets for computer vision. Two of the most popular are the MNIST digits data set and the CIFAR image data sets.
The MNIST Digits Data Set is very popular in the neural network research community. A sample of it can be seen here:
This data set was generated from scanned forms.
The CIFAR-10 and CIFAR-100 data sets are also frequently used by the neural network research community.
The CIFAR-10 data set contains low-resolution images that are divided into 10 classes. The CIFAR-100 data set contains 100 classes arranged in a hierarchy.
The convolutional neural network (CNN) is a neural network technology that has profoundly impacted the area of computer vision (CV). Fukushima (1980) introduced the original concept of a convolutional neural network, and LeCun, Bottou, Bengio & Haffner (1998) greatly improved this work. From this research, Yann LeCun introduced the famous LeNet-5 neural network architecture. This class follows the LeNet-5 style of convolutional neural network.
A LeNet-5 Network (LeCun, 1998)
So far we have only seen one layer type (dense layers). By the end of this course we will have seen several layer types, including dense layers, convolutional layers, and max-pooling layers.
The first layer that we will examine is the convolutional layer. We will begin by looking at the hyper-parameters that you must specify for a convolutional layer in most neural network frameworks that support the CNN: the number of filters, the filter size, the stride, and the padding.
The primary purpose of a convolutional layer is to detect features such as edges, lines, blobs of color, and other visual elements; the filters detect these features. The more filters that we give to a convolutional layer, the more features it can detect.
A filter is a square-shaped object that scans over the image. A grid can represent the individual pixels of the image. You can think of the filter as a smaller grid that sweeps left to right over each row of the image. A hyper-parameter specifies both the width and height of the square-shaped filter. Figure 10.1 shows this configuration, in which you see six convolutional filters sweeping over the image grid:
A convolutional layer has weights between it and the previous layer or image grid. Each pixel of each filter is a weight. Therefore, the number of weights between a convolutional layer and its predecessor layer or image field is the following:
[FilterSize] * [FilterSize] * [# of Filters]
For example, if the filter size were 5 (5x5) for 10 filters, there would be 5 * 5 * 10 = 250 weights.
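As a quick sanity check, this arithmetic is easy to verify in plain Python (the variable names below are only for illustration):

filter_size = 5
num_filters = 10
# [FilterSize] * [FilterSize] * [# of Filters]
weights = filter_size * filter_size * num_filters
print(weights)  # 250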
You need to understand how the convolutional filters sweep across the previous layer’s output or image grid. Figure 10.2 illustrates the sweep:
The above figure shows a convolutional filter with a size of 4 and a padding size of 1. The padding is responsible for the border of zeros in the area that the filter sweeps. Even though the image is actually 8x7, the extra padding provides a virtual image size of 9x8 for the filter to sweep across. The stride specifies how many cells the convolutional filter advances at each step. The filter moves to the right, advancing by the number of cells specified in the stride. Once the far right is reached, the filter moves back to the far left, then moves down by the stride amount, and continues to the right again.
Some constraints exist in relation to the size of the stride. Obviously, the stride cannot be 0; the convolutional filter would never move. Furthermore, neither the stride nor the convolutional filter size can be larger than the previous grid. There are additional constraints among the stride (s), padding (p) and the filter width (f) for an image of width (w). Specifically, the convolutional filter must be able to start at the far-left or top border, move a whole number of strides, and land exactly on the far-right or bottom border. The following equation shows the number of steps a convolutional filter must take to cross the image:
$ steps = \frac{w - f + 2p}{s} + 1 $
The number of steps must be an integer; in other words, it cannot have decimal places. The padding (p) can be adjusted to make this equation yield an integer value.
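The following short Python sketch (the function name conv_steps is just for illustration) evaluates this equation and shows how adjusting the padding can make the step count an integer:

def conv_steps(w, f, p, s):
    # steps = (w - f + 2p) / s + 1
    return (w - f + 2 * p) / s + 1

# A 28-pixel-wide image, 5x5 filter, stride 3:
print(conv_steps(28, 5, 0, 3))  # 8.67: invalid, steps must be an integer
print(conv_steps(28, 5, 2, 3))  # 10.0: a padding of 2 makes the sweep valid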
Max-pool layers downsample a 3D box to a new one with smaller dimensions. You can typically place a max-pool layer immediately following a convolutional layer. The LeNet-5 diagram shows max-pool layers immediately after layers C1 and C3. These max-pool layers progressively decrease the dimensions of the 3D boxes passing through them. This technique can help avoid overfitting (Krizhevsky, Sutskever & Hinton, 2012).
A pooling layer has the following hyper-parameters: the spatial extent (f) and the stride (s).
Unlike convolutional layers, max-pool layers do not use padding. Additionally, max-pool layers have no weights, so training does not affect them; they simply downsample their 3D box input. The 3D box output by a max-pool layer has a width given by this equation:
$ w_2 = \frac{w_1 - f}{s} + 1 $
The height of the 3D box produced by the max-pool layer is calculated similarly with this equation:
$ h_2 = \frac{h_1 - f}{s} + 1 $
The depth of the 3D box produced by the max-pool layer is equal to the depth of the 3D box it received as input. The most common settings for the hyper-parameters of a max-pool layer are f=2 and s=2. The spatial extent (f) specifies that 2x2 boxes will be scaled down to single pixels. Of these four pixels, the pixel with the maximum value represents the 2x2 region in the new grid. Because four pixels are replaced with a single pixel, 75% of the pixel information is lost. The following figure shows this transformation as a 6x6 grid becomes a 3x3:
Of course, the above diagram shows each pixel as a single number, as a grayscale image would. For an RGB image, the max-pool operation is usually applied to each of the three color channels independently.
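A minimal NumPy sketch of this downsampling (assuming a single-channel grid and the common f=2, s=2 settings; max_pool here is an illustrative helper, not a framework function):

import numpy as np

def max_pool(grid, f=2, s=2):
    # Output size follows w2 = (w1 - f)/s + 1 (and likewise for height)
    h2 = (grid.shape[0] - f) // s + 1
    w2 = (grid.shape[1] - f) // s + 1
    out = np.zeros((h2, w2))
    for i in range(h2):
        for j in range(w2):
            # Keep only the maximum pixel of each fxf region
            out[i, j] = grid[i*s:i*s+f, j*s:j*s+f].max()
    return out

grid = np.arange(36.0).reshape(6, 6)  # a 6x6 grid of single-number pixels
print(max_pool(grid).shape)  # (3, 3): three quarters of the pixels are gone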
The following sections describe how to use TensorFlow/SKFlow with CNNs.
In [1]:
"""Functions for downloading and reading MNIST data."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import gzip
import os
import tempfile
import numpy
from six.moves import urllib
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets
# Loading MNIST data
mnist = read_data_sets('MNIST_data')
In [2]:
import tensorflow.contrib.learn as skflow
from sklearn import metrics

def max_pool_2x2(tensor_in):
    # Downsample with a 2x2 max-pool (f=2, s=2)
    return tf.nn.max_pool(tensor_in, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')

def conv_model(X, y):
    # Reshape X to a 4D tensor: [batch, height, width, color channels]
    X = tf.reshape(X, [-1, 28, 28, 1])
    # First conv layer: 32 filters, each scanning a 5x5 patch
    with tf.variable_scope('conv_layer1'):
        h_conv1 = skflow.ops.conv2d(X, n_filters=32, filter_shape=[5, 5],
                                    bias=True, activation=tf.nn.relu)
        h_pool1 = max_pool_2x2(h_conv1)  # 28x28 -> 14x14
    # Second conv layer: 64 filters for each 5x5 patch
    with tf.variable_scope('conv_layer2'):
        h_conv2 = skflow.ops.conv2d(h_pool1, n_filters=64, filter_shape=[5, 5],
                                    bias=True, activation=tf.nn.relu)
        h_pool2 = max_pool_2x2(h_conv2)  # 14x14 -> 7x7
    # Reshape the 7x7x64 box into a batch of vectors
    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    # Densely connected layer with 256 neurons
    h_fc1 = skflow.ops.dnn(h_pool2_flat, [256], activation=tf.nn.relu)
    return skflow.models.logistic_regression(h_fc1, y)
In [4]:
# To fit/train, use either the simple train (this box) or the early
# stop (next box), not both. This box is faster, but has no early stop.
classifier = skflow.TensorFlowEstimator(
    model_fn=conv_model, n_classes=10, batch_size=100, steps=3000,
    learning_rate=0.001)
classifier.fit(mnist.train.images, mnist.train.labels)
In [ ]:
# Early stopping - WARNING, this is slow on Data Scientist Workbench.
# Do not run both this box and the previous one; choose one or the other.
classifier = skflow.TensorFlowEstimator(
    model_fn=conv_model, n_classes=10, batch_size=100, steps=500,
    learning_rate=0.001)
early_stop = skflow.monitors.ValidationMonitor(
    mnist.validation.images, mnist.validation.labels, n_classes=10,
    early_stopping_rounds=200, print_steps=50)
# Fit/train the neural network, stopping early if the validation
# score stops improving
classifier.fit(mnist.train.images, mnist.train.labels, monitor=early_stop)
In [5]:
from sklearn import metrics
# Evaluate success using accuracy
pred = classifier.predict(mnist.test.images)
score = metrics.accuracy_score(mnist.test.labels, pred)
print("Accuracy score: {}".format(score))