Self-Driving Car Engineer Nanodegree

Deep Learning

Project: Build a Traffic Sign Recognition Classifier

In this notebook, a template is provided for you to implement, in stages, the functionality required to successfully complete this project. If additional code is required that cannot be included in the notebook, be sure that the Python code is successfully imported and included in your submission. Sections that begin with 'Implementation' in the header indicate where you should begin your implementation for your project. Note that some sections of the implementation are optional and will be marked with 'Optional' in the header.

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.


Step 0: Load The Data


In [1]:
from keras.datasets import cifar10
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
# y_train.shape is 2D, (50000, 1). While Keras is smart enough to handle this,
# it's a good idea to flatten the array.
y_train = y_train.reshape(-1)
y_test = y_test.reshape(-1)


Using TensorFlow backend.
Downloading data from http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Untaring file...

Step 1: Dataset Summary & Exploration

The pickled data is a dictionary with 4 key/value pairs:

  • 'features' is a 4D array containing raw pixel data of the traffic sign images, (num examples, width, height, channels).
  • 'labels' is a 1D array containing the label/class id of the traffic sign. The file signnames.csv contains id -> name mappings for each id.
  • 'sizes' is a list containing tuples, (width, height), representing the original width and height of the image.
  • 'coords' is a list containing tuples, (x1, y1, x2, y2) representing coordinates of a bounding box around the sign in the image. THESE COORDINATES ASSUME THE ORIGINAL IMAGE. THE PICKLED DATA CONTAINS RESIZED VERSIONS (32 by 32) OF THESE IMAGES
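
For reference, loading that pickled data would look roughly like the sketch below; the file names train.p and test.p are assumptions (this notebook loads CIFAR-10 through Keras instead, so the 'sizes' and 'coords' entries are not available here):

import pickle

with open('train.p', 'rb') as f:
    train = pickle.load(f)
with open('test.p', 'rb') as f:
    test = pickle.load(f)

X_train, y_train = train['features'], train['labels']
X_sizes, X_coords = train['sizes'], train['coords']
X_test, y_test = test['features'], test['labels']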

Complete the basic data summary below.


In [8]:
import csv

# Number of training examples
n_train = len(X_train)

# Number of testing examples.
n_test = len(X_test)

# What's the shape of a traffic sign image?
image_shape = X_train[0].shape

# How many unique classes/labels are there in the dataset?
labelmap = {}
with open('./cifar10.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        labelmap[int(row['ClassId'])] = row['SignName']
n_classes = len(labelmap)
        
print("Number of training examples =", n_train)
print("Number of testing examples =", n_test)
print("Image data shape =", image_shape)
print("Number of classes =", n_classes)


Number of training examples = 50000
Number of testing examples = 10000
Image data shape = (32, 32, 3)
Number of classes = 10

Visualize the German Traffic Signs Dataset using the pickled file(s). This is open ended, suggestions include: plotting traffic sign images, plotting the count of each sign, etc.

The Matplotlib examples and gallery pages are a great resource for doing visualizations in Python.

NOTE: It's recommended you start with something simple first. If you wish to do more, come back to it after you've completed the rest of the sections.
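
As a simple starting point, a sketch of a labeled per-class count (using labelmap, y_train, and n_classes defined above) could look like this:

import numpy as np
import matplotlib.pyplot as plt

counts = np.bincount(y_train, minlength=n_classes)
plt.figure(figsize=(8, 3))
plt.bar(range(n_classes), counts)
plt.xticks(range(n_classes), [labelmap[i] for i in range(n_classes)], rotation=45)
plt.ylabel("examples")
plt.show()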


In [12]:
import matplotlib.pyplot as plt
# Visualizations will be shown in the notebook.
%matplotlib inline

import random

index = random.randint(0, n_train - 1)
image = X_train[index]

plt.figure(figsize=(1,1))
plt.imshow(image)
print("Description: " + labelmap[y_train[index]])


Description: bird

Step 2: Design and Test a Model Architecture

Design and implement a deep learning model that learns to recognize traffic signs. Train and test your model on the German Traffic Sign Dataset.

There are various aspects to consider when thinking about this problem:

  • Neural network architecture
  • Play around with preprocessing techniques (normalization, RGB to grayscale, etc.)
  • Number of examples per label (some have more than others).
  • Generate fake data.

Here is an example of a published baseline model on this problem. It's not required to be familiar with the approach used in the paper, but it's good practice to try to read papers like these.

NOTE: The LeNet-5 implementation shown in the classroom at the end of the CNN lesson is a solid starting point. You'll have to change the number of classes and possibly the preprocessing, but aside from that it's plug and play!

Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.


In [13]:
# Pre-processing of the data

import numpy as np
import cv2

# First, let's see how many items of each category we have:
class_index = []
# initialize class_index with one empty list per class
for index in range(n_classes):
    class_index.append([])

for index in range(len(y_train)):
    item_class = y_train[index]
    class_index[item_class].append(index)

# show a histogram for human readability
plt.hist(y_train, bins='auto')
plt.title("Initial count of images (y) per class (x)")
plt.show()

def modify_image(image):
    """
    Expects the parameter to be a 3-dimensional array with the RGB information
    for the image; will create a new image by modifying the original with random
    modifications, particularly: tilting.
    
    http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_geometric_transformations/py_geometric_transformations.html
    
    """
    rows,cols = len(image), len(image[0])
    # tilt [-20,20]
    tilt_angle = -20 + int(random.random() * 40)
    M = cv2.getRotationMatrix2D((cols/2,rows/2),tilt_angle,1)
    out = cv2.warpAffine(image,M,(cols,rows))
    # scale [-2,2] on each point in x, [-2,2] on each point in y
    scale_x_left = -2 + int(random.random() * 4)
    scale_x_right = -2 + int(random.random() * 4)
    scale_y_top = -2 + int(random.random() * 4)
    scale_y_bottom = -2 + int(random.random() * 4)
    pts1 = np.float32([[0,0],[rows,0],[0,cols],[rows,cols]])
    pts2 = np.float32([[scale_y_top,scale_x_left],
                       [rows+scale_y_bottom,scale_x_left],
                       [scale_y_top,cols+scale_x_right],
                       [rows+scale_y_bottom,cols+scale_x_right]])
    M = cv2.getPerspectiveTransform(pts1,pts2)
    out = cv2.warpPerspective(out,M,(cols,rows))
    return np.asarray(out)



# The disparity in image count is striking; we should attempt to level the field at least a little
# so the difference in example count between classes isn't as wide.
# We will ensure that each class has at least N items, so we'll iterate over the class index and
# add the required number of images; these new images will be created by applying modifications to
# the initial set of images.
# I arbitrarily chose 2000 as the minimum number of images, given that some classes have as few as 250
# items while others have as many as 2500. Lowering this disparity is what I intend.
min_item_count = 2000
extension_to_X_train = []
extension_to_y_train = []
extension_to_X_coords = []
extension_to_X_sizes = []
for current_class in range(n_classes):
    item_index = class_index[current_class]
    item_count = len(item_index)
    if item_count < min_item_count:
        #print("Class: {0}".format(current_class))
        #print(item_count)
        #print(item_index)
        # let's add the new data
        N = min_item_count - item_count
        for i in range(N):
            #print("Index {0}".format(i))
            image_selector = item_index[i % item_count]
            #print("Image selector {0}".format(image_selector))
            selected_image = X_train[image_selector]
            #print(selected_image)
            #plt.figure(figsize=(1,1))
            #plt.imshow(selected_image)
            extension_to_X_train.append(modify_image(selected_image))
            extension_to_y_train.append(y_train[image_selector])
            extension_to_X_coords.append(X_coords[image_selector])
            extension_to_X_sizes.append(X_sizes[image_selector])

# Append extensions
if (len(extension_to_X_train) > 0):
    X_train = np.append(X_train, extension_to_X_train, axis=0)
    y_train = np.append(y_train, extension_to_y_train, axis=0)
    X_coords = np.append(X_coords, extension_to_X_coords, axis=0)
    X_sizes = np.append(X_sizes, extension_to_X_sizes, axis=0)
    n_train = len(X_train)



In [14]:
# Give a count of the modified training data
# show a histogram for human readability
print("Total item count =", n_train)
plt.hist(y_train, bins='auto')
plt.title("Count of images (y) per class (x) after augmentation")
plt.show()

if (len(extension_to_X_train) > 0):
  index = random.randint(0, len(extension_to_X_train) - 1)
  image = extension_to_X_train[index]
  plt.figure(figsize=(1,1))
  plt.imshow(image)
  print("Description: " + labelmap[extension_to_y_train[index]] + " - ", index)


Total item count = 50000

Question 1

Describe how you preprocessed the data. Why did you choose that technique?

Answer:

When I plotted the histogram to visually count how many items we had per class (label), it was clear that the disparity was very large. The suggestions given in the Step 2 header hinted that this could be an issue. When I first ran the LeNet pass on the original data, my validation and test scores were low and a bit random. I decided to augment the examples to get a more even count per class, and settled on having at least 2000 examples in each class.

The way I did this was by making an index of all the images per class. I then counted how many images there were in a given class, and if the count was less than 2000 I'd proceed with augmentation. These are the augmentation steps:

  1. Obtain the list of images for this class
  2. Iterate 2000-count(images for this class) times
  3. Take the next image from the list of images for this class (so we use them all to augment the data set)
  4. Modify the image
  5. Add it to the extension list
  6. Add metadata to the extension metadata lists
  7. After done iterating, merge the extension lists with the original data

I then produced a new histogram of the resulting set of images. Since I augmented all the metadata sets as well, the length and order of the data are preserved, which allows the rest of this exercise to proceed regardless of the augmentation step.

As for the image modification, I decided to make random tweaks along two different operations:

  1. Tilt the image anywhere from -20 degrees to 20 degrees
  2. Change the shape of the image anywhere from -2 pixels to 2 pixels on each corner, on each dimension

I based the transformations on the examples given at http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_geometric_transformations/py_geometric_transformations.html

The idea behind this was to ensure that the extended data was different (even if just slightly) to the original data so as to actually help the model see new things.

Note

I am explicitly NOT normalizing the data in this step as I discovered that I had issues with doing so given what I wanted to do next: create 3-channel grayscale representations of the same images, effectively doubling the data set.

In the following code blocks I generate the grayscale set, and after that is done I proceed to normalize the data as described in the online class. We want the data to be centered around zero, ideally with a roughly normal distribution around zero.

Then I proceed to split the augmented data into a training set and a validation set.


In [15]:
import cv2

#X_train_grayscale = []
#for index in range(n_train):
#    grayscale = cv2.cvtColor(X_train_scaled[index], cv2.COLOR_BGR2GRAY)
#    X_train_grayscale.append(grayscale)
X_train_grayscale = []
for index in range(n_train):
    grayscale = np.zeros_like(X_train[index], dtype=np.float32)
    rgb = X_train[index]
    y = 0.2989 * rgb[:,:,0] + 0.5870 * rgb[:,:,1] + 0.1140 * rgb[:,:,2]  # channels are in RGB order
    grayscale[:,:,0] = grayscale[:,:,1] = grayscale[:,:,2] = y.squeeze()
    X_train_grayscale.append(grayscale)


print("Done grayscaling, rendering an example output vs the original:")

index = random.randint(0, n_train - 1)
print("Index: ", index)
image = X_train_grayscale[index]

plt.figure(figsize=(1,1))
plt.imshow(image)
plt.figure(figsize=(1,1))
plt.imshow(X_train[index])
print("Description: " + labelmap[y_train[index]])

# we now have a second set of images which matches the original images we pre-processed before,
# but in a human-balanced grayscale, still 3 channels.


Done grayscaling, rendering an example output vs the original:
Index:  42120
Description: bird

In [16]:
# Normalize the data first (centered at 0, with values roughly in [-1, 1))
# We know the RGB data has values 0 to 255

def normalize_image(image):
    return (image - 128) / 128

# Normalize the data ensuring the values are float32
X_train_grayscale_normalized = np.array([normalize_image(i) for i in X_train_grayscale])
X_train_normalized = np.array([normalize_image(i) for i in X_train.astype(np.float32)])
X_test_normalized = np.array([normalize_image(i) for i in X_test.astype(np.float32)])

In [17]:
#index = random.randint(0, n_train)
print("Index: ", index)
plt.figure(figsize=(1,1))
plt.imshow(X_train_normalized[index])
plt.figure(figsize=(1,1))
plt.imshow(X_train_grayscale_normalized[index])
print("Example images from normalized data")


Index:  42120
Example images from normalized data

In [18]:
# Let's say we want 80% of the data to be training data, and 20% to be validation data
# We also want to shuffle the items in the testing set and the validation set
from sklearn.utils import shuffle

train_split_prob = 0.8
X_train_split = []
y_train_split = []
X_validation_split = []
y_validation_split = []
def split_and_append(Xinput, yinput, split_prob, X1, y1, X2, y2):
    """
    Takes the input X,y and appends them to X1,y1 or X2,y2 depending
    on the split_prob value.
    """
    for index in range(len(Xinput)):
        image = Xinput[index]
        classification = yinput[index]
        is_second_split = random.random() >= split_prob
        if is_second_split:
            X2.append(image)
            y2.append(classification)
        else:
            X1.append(image)
            y1.append(classification)

split_and_append(X_train_normalized,
                 y_train,
                 train_split_prob,
                 X_train_split,
                 y_train_split,
                 X_validation_split,
                 y_validation_split)
split_and_append(X_train_grayscale_normalized,
                 y_train,
                 train_split_prob,
                 X_train_split,
                 y_train_split,
                 X_validation_split,
                 y_validation_split)

X_train_split, y_train_split = shuffle(X_train_split, y_train_split)
X_validation_split, y_validation_split = shuffle(X_validation_split, y_validation_split)


print("Done splitting and shuffling items")
expected_len_X = len(X_train_normalized) + len(X_train_grayscale_normalized)
expected_len_y = len(y_train) * 2
len_train_X = len(X_train_split)
len_validation_X = len(X_validation_split)
len_train_y = len(y_train_split)
len_validation_y = len(y_validation_split)
assert expected_len_X == len_train_X + len_validation_X
assert expected_len_y == len_train_y + len_validation_y
assert len_train_X == len_train_y
assert len_validation_X == len_validation_y
print(len_train_X)
print(len_validation_X)

index = random.randint(0, len(X_train_split) - 1)
image = X_train_split[index]

plt.figure(figsize=(1,1))
plt.imshow(image)
print("Description: " + labelmap[y_train_split[index]])


Done splitting and shuffling items
80083
19917
Description: automobile

Question 2

Describe how you set up the training, validation and testing data for your model. Optional: If you generated additional data, how did you generate the data? Why did you generate the data? What are the differences in the new dataset (with generated data) from the original dataset?

Answer:

Splitting the training set

We already had one training set and one test set to begin with. I'm leaving the test set as is and intend to use it only once after I'm fully confident of the accuracy of my model based on the test data.

The training set, however, is going to be split in 2 parts:

  • 80% of the data into the actual training set
  • 20% of the data into a new validation set

This partition is achieved by iterating on the input data set (X,y) and splitting it into the 2 buckets (training(X,y) and validation(X,y)). Each data pair (Xi,yi) is assigned to a set by obtaining a random number between 0.0 and 1.0 and choosing one bucket if the random number is below the threshold, or the other bucket if it's above the threshold. This makes it easy to think of partitioning the data as percentages of the input. Note that here we assume that the random number generator gives a relatively uniform distribution of results over the [0.0, 1.0] continuum.

After the training set and validation set have been created, we proceed to shuffle their contents and perform various assertions.
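
For comparison, a minimal sketch of the same split using scikit-learn (not what the cell above does) would let train_test_split handle the shuffling and keep the class proportions via stratify:

from sklearn.model_selection import train_test_split
import numpy as np

X_all = np.concatenate([X_train_normalized, X_train_grayscale_normalized], axis=0)
y_all = np.concatenate([y_train, y_train], axis=0)
X_train_split, X_validation_split, y_train_split, y_validation_split = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0)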

Newly generated data

I decided to take this step to generate grayscale images corresponding to the original+augmented images. We really have 2 data sets at this point:

  • The original color images
  • The grayscale images created from the color images

I'm thinking that training on both the color and the grayscale images might help me improve the accuracy of the predictions going forward. I'll try to ensure that the network can decide "this is a yield sign" based on the unique characteristics found in the color image, or the more general strokes seen in the grayscale image.

As for the grayscale, I decided to compute it myself with a human-eye based formula I found online:

y = 0.2989 * red + 0.5870 * green + 0.1140 * blue

There was no big reason for this, I just liked the way it looked better than the built-in algorithms I had at hand.
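
As a sketch, the same formula can be applied to a whole image with a single dot product over the channel axis (this assumes the images are in RGB order, as Keras delivers them):

import numpy as np

def to_grayscale(rgb):
    # rgb: HxWx3 array in RGB channel order
    y = np.dot(rgb[..., :3].astype(np.float32), [0.2989, 0.5870, 0.1140])
    return np.stack([y, y, y], axis=-1)  # keep 3 identical channels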

Ideas that didn't work

I read that the pickled data contained cropping information, encoded as the 4 corners of a rectangle in the original image's coordinate system. I transformed the cropping information to the coordinate system used by the scaled image (32 x 32) and used a PIL image to crop away the area outside the pixels of interest. Since our algorithm assumes a 32 x 32 image, I then scaled the image back to the specified dimensions.

I chose this in order to increase the signal-to-noise ratio of the data we're sending to the network. I'm assuming that the pixels outside of the area of interest are noise and might lower the overall accuracy of our predictions. For example, if all the training images had green or brown from trees in the background, the network might fail to understand that a sign with blue sky in the background is equally valid. Furthermore, if we move this to grayscale, the elimination of the surrounding pixels is just as useful. Imagine that for a specific type of traffic sign, such as "yield", we didn't have enough training data. In that case our network would be very sensitive to all the pixels available in the few images we have. This might mean that our network could only recognize "yield" signs when the signal and noise levels of the test data are similar to those of the training data. I'm assuming that by reducing the noise in the training data we render the noise in the test data less detrimental overall.

My assumptions might be wrong, so not only will I be training my network, I'll be training my own personal experience and the heuristics I'll use going forward.



In [19]:
import tensorflow as tf

def LeNet(x, keep_probability):
    # Arguments used for tf.truncated_normal, randomly defines variables for the weights and biases for each layer
    mu = 0
    sigma = 0.1
    
    # Layer 1: Convolutional. Input = 32x32x3. Output = 28x28x6.
    # new_height = (input_height - filter_height + 2 * P)/S + 1
    # 28 = ((32 - fh)/S) + 1
    #(27 * S) + fh= 32
    # S = 1, fh = 5    
    F_W = tf.Variable(tf.truncated_normal((5, 5, 3, 6), mean = mu, stddev = sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(6))
    strides = [1, 1, 1, 1]
    padding = 'VALID'
    layer1 = tf.nn.conv2d(x, F_W, strides, padding) + F_b

    # Activation 1.
    layer1 = tf.nn.relu(layer1)

    # Pooling 1. Input = 28x28x6. Output = 14x14x6.
    # new_height = (input_height - filter_height)/S + 1
    # 14 = ((28 - fh)/S) + 1
    # (13 * S) + fh = 28
    # S = 2, fh = 2
    ksize=[1, 2, 2, 1]
    strides=[1, 2, 2, 1]
    padding = 'VALID'
    pooling_layer1 = tf.nn.max_pool(layer1, ksize, strides, padding)
    
    # Dropout 1.
    pooling_layer1 = tf.nn.dropout(pooling_layer1, keep_probability)
    
    # Flatten 2a. Input = 14x14x6. Output = 1176.
    flatten_layer2a = tf.contrib.layers.flatten(pooling_layer1)

    # Layer 2b: Convolutional. Input = 14x14x6. Output = 10x10x16.
    # new_height = (input_height - filter_height + 2 * P)/S + 1
    # 10 = ((14 - fh)/S) + 1
    # (9 * S) + fh = 14
    # S = 1, fh = 5
    F_W = tf.Variable(tf.truncated_normal((5, 5, 6, 16), mean = mu, stddev = sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(16))
    strides = [1, 1, 1, 1]
    padding = 'VALID'
    layer2b = tf.nn.conv2d(pooling_layer1, F_W, strides, padding) + F_b
    
    # Activation 2b.
    layer2b = tf.nn.relu(layer2b)

    # Pooling 2b. Input = 10x10x16. Output = 5x5x16.
    # new_height = (input_height - filter_height)/S + 1
    # 5 = ((10 - fh)/S) + 1
    # (4 * S) + fh = 10
    # S = 2, fh = 2
    ksize=[1, 2, 2, 1]
    strides=[1, 2, 2, 1]
    padding = 'VALID'
    pooling_layer2b = tf.nn.max_pool(layer2b, ksize, strides, padding)

    # Dropout 2b.
    pooling_layer2b = tf.nn.dropout(pooling_layer2b, keep_probability)

    # Flatten 2b. Input = 5x5x16. Output = 400.
    flatten_layer2b = tf.contrib.layers.flatten(pooling_layer2b)
    
    # Layer 2c: Convolutional. Input = 5x5x16. Output = 3x3x32.
    # new_height = (input_height - filter_height + 2 * P)/S + 1
    # 3 = ((5 - fh)/S) + 1
    # (2 * S) + fh = 5
    # S = 2, fh = 1
    F_W = tf.Variable(tf.truncated_normal((1, 1, 16, 32), mean = mu, stddev = sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(32))
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    layer2c = tf.nn.conv2d(pooling_layer2b, F_W, strides, padding) + F_b
    
    # Activation 2c.
    layer2c = tf.nn.relu(layer2c)

    # Pooling 2c. Input = 3x3x32. Output = 2x2x32.
    # new_height = (input_height - filter_height)/S + 1
    # 2 = ((3 - fh)/S) + 1
    # (1 * S) + fh = 3
    # S = 2, fh = 1
    ksize=[1, 1, 1, 1]
    strides=[1, 2, 2, 1]
    padding = 'VALID'
    pooling_layer2c = tf.nn.max_pool(layer2c, ksize, strides, padding)

    # Dropout 2c.
    pooling_layer2c = tf.nn.dropout(pooling_layer2c, keep_probability)

    # Flatten 2c. Input = 2x2x32. Output = 128.
    flatten_layer2c = tf.contrib.layers.flatten(pooling_layer2c)
    
    # Concat layers 2a, 2b, 2c. Input = 1176 + 400 + 128. Output = 1704.
    flat_layer2 = tf.concat_v2([tf.concat_v2([flatten_layer2b, flatten_layer2a], 1), flatten_layer2c], 1)
    
    # Layer 3: Fully Connected. Input = 1704. Output = 120.
    F_W = tf.Variable(tf.truncated_normal((1704, 120), mean = mu, stddev = sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(120))
    fully_connected = tf.matmul(flat_layer2, F_W) + F_b
    
    # Activation 3.
    fully_connected = tf.nn.relu(fully_connected)
    
    # Dropout 3.
    fully_connected = tf.nn.dropout(fully_connected, keep_probability)

    # Layer 4: Fully Connected. Input = 120. Output = 84.
    F_W = tf.Variable(tf.truncated_normal((120, 84), mean = mu, stddev = sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(84))
    fully_connected = tf.matmul(fully_connected, F_W) + F_b
    
    # Activation 4.
    fully_connected = tf.nn.relu(fully_connected)

    # Dropout 4.
    fully_connected = tf.nn.dropout(fully_connected, keep_probability)

    # Layer 5: Fully Connected. Input = 84. Output = n_classes.
    F_W = tf.Variable(tf.truncated_normal((84, n_classes), mean = mu, stddev = sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(n_classes))
    logits = tf.matmul(fully_connected, F_W) + F_b
    
    # Dropout 5.
    logits = tf.nn.dropout(logits, keep_probability)

    return logits

Question 3

What does your final architecture look like? (Type of model, layers, sizes, connectivity, etc.) For reference on how to build a deep neural network using TensorFlow, see Deep Neural Network in TensorFlow from the classroom.

Answer:

The code above is based on the published baseline model on this problem that was referenced above in this document.

I originally took my LeNet implementation with dropout operations and ran it as it was. It worked pretty well on the training and validation data but not so well on the test data or the real-world data. I then decided to revisit the Sermanet-LeCun paper and learn what they did differently. I noticed that the stage-1 data, after sub-sampling, was being fed directly to the classifier. I had to look this up online because it wasn't clear to me how you could do that while at the same time feeding the stage-2 data into the classifier.

Upon closer inspection of the graphics and the description in the whitepaper, I saw that the data flowing from the stage-1 and stage-2 to the classifier was a convolution, and what appeared to be a simple concatenation so I gave it a try.

In order to do that I split the stage-2 into 3 parts:

  • A. The stage-1 output, flattened.
  • B. The stage-1 output with convolution and pooling applied, then flattened.
  • C. The step-B output (before flattening), with another convolution and pooling applied, then flattened.

I then took the 3 flat sets and called tf.concat_v2 on them to produce a single set of features for the classifier. I originally tried calling tf.concat but I never managed to make it work the way I wanted it to. The resulting set was a 1704-long feature set, which I then continued processing as in the previous implementation of LeNet, only changing the input size of the classifier from 400 to 1704.
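
For reference, the older tf.concat took the axis as its first argument (tf.concat(concat_dim, values)), which may be why it behaved unexpectedly for me; in TensorFlow 1.0 and later the equivalent of the concat_v2 call above is simply:

flat_layer2 = tf.concat([flatten_layer2a, flatten_layer2b, flatten_layer2c], 1)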

Summary of steps

  • Layer 1: Convolutional. Input = 32x32x3. Output = 28x28x6.
  • Activation 1.
  • Pooling 1. Input = 28x28x6. Output = 14x14x6.
  • Dropout 1.
  • Flatten 2a. Input = 14x14x6. Output = 1176.
  • Layer 2b: Convolutional. Input = 14x14x6. Output = 10x10x16.
  • Activation 2b.
  • Pooling 2b. Input = 10x10x16. Output = 5x5x16.
  • Dropout 2b.
  • Flatten 2b. Input = 5x5x16. Output = 400.
  • Layer 2c: Convolutional. Input = 5x5x16. Output = 3x3x32.
  • Activation 2c.
  • Pooling 2c. Input = 3x3x32. Output = 2x2x32.
  • Dropout 2c.
  • Flatten 2c. Input = 2x2x32. Output = 128.
  • Concat layers 2a, 2b, 2c. Input = 1176 + 400 + 128. Output = 1704.
  • Layer 3: Fully Connected. Input = 1704. Output = 120.
  • Activation 3.
  • Dropout 3.
  • Layer 4: Fully Connected. Input = 120. Output = 84.
  • Activation 4.
  • Dropout 4.
  • Layer 5: Fully Connected. Input = 84. Output = n_classes.
  • Dropout 5.


In [20]:
import tensorflow as tf

# Let's initialize the model
x = tf.placeholder(tf.float32, (None, 32, 32, 3))
y = tf.placeholder(tf.int32, (None))
keep_probability = tf.placeholder(tf.float32)
one_hot_y = tf.one_hot(y, n_classes)

adam_learning_rate = 0.0001
#gradient_descent_learning_rate = 0.1

logits = LeNet(x, keep_probability)
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=one_hot_y)
#prediction = tf.nn.softmax(logits)
#cross_entropy = -tf.reduce_sum(y * tf.log(prediction), reduction_indices=1)
loss_operation = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(adam_learning_rate)
#optimizer = tf.train.GradientDescentOptimizer(gradient_descent_learning_rate)
training_operation = optimizer.minimize(loss_operation)

In [21]:
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(one_hot_y, 1))
accuracy_operation = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
saver = tf.train.Saver()

In [22]:
EPOCHS = 100
BATCH_SIZE = 2048


def evaluate(X_data, y_data):
    num_examples = len(X_data)
    total_accuracy = 0
    sess = tf.get_default_session()
    for offset in range(0, num_examples, BATCH_SIZE):
        batch_x, batch_y = X_data[offset:offset+BATCH_SIZE], y_data[offset:offset+BATCH_SIZE]
        accuracy = sess.run(accuracy_operation, feed_dict={x: batch_x, y: batch_y, keep_probability: 1.0})
        total_accuracy += (accuracy * len(batch_x))
    return total_accuracy / num_examples

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    num_examples = len(X_train_split)
    
    print("Training...")
    print()
    for i in range(EPOCHS):
        X_train_split, y_train_split = shuffle(X_train_split, y_train_split)
        for offset in range(0, num_examples, BATCH_SIZE):
            end = offset + BATCH_SIZE
            batch_x, batch_y = X_train_split[offset:end], y_train_split[offset:end]
            sess.run(training_operation, feed_dict={x: batch_x, y: batch_y, keep_probability: 0.5})
            
        validation_accuracy = evaluate(X_validation_split, y_validation_split)
        print("EPOCH {} ...".format(i+1))
        print("Validation Accuracy = {:.3f}".format(validation_accuracy))
        print()
        
    saver.save(sess, './lenet')
    print("Model saved")


Training...

EPOCH 1 ...
Validation Accuracy = 0.102

EPOCH 2 ...
Validation Accuracy = 0.117

EPOCH 3 ...
Validation Accuracy = 0.127

EPOCH 4 ...
Validation Accuracy = 0.133

EPOCH 5 ...
Validation Accuracy = 0.134

EPOCH 6 ...
Validation Accuracy = 0.134

EPOCH 7 ...
Validation Accuracy = 0.138

EPOCH 8 ...
Validation Accuracy = 0.140

EPOCH 9 ...
Validation Accuracy = 0.142

EPOCH 10 ...
Validation Accuracy = 0.143

EPOCH 11 ...
Validation Accuracy = 0.146

EPOCH 12 ...
Validation Accuracy = 0.153

EPOCH 13 ...
Validation Accuracy = 0.159

EPOCH 14 ...
Validation Accuracy = 0.166

EPOCH 15 ...
Validation Accuracy = 0.172

EPOCH 16 ...
Validation Accuracy = 0.178

EPOCH 17 ...
Validation Accuracy = 0.183

EPOCH 18 ...
Validation Accuracy = 0.187

EPOCH 19 ...
Validation Accuracy = 0.195

EPOCH 20 ...
Validation Accuracy = 0.204

EPOCH 21 ...
Validation Accuracy = 0.203

EPOCH 22 ...
Validation Accuracy = 0.208

EPOCH 23 ...
Validation Accuracy = 0.213

EPOCH 24 ...
Validation Accuracy = 0.215

EPOCH 25 ...
Validation Accuracy = 0.217

EPOCH 26 ...
Validation Accuracy = 0.217

EPOCH 27 ...
Validation Accuracy = 0.223

EPOCH 28 ...
Validation Accuracy = 0.225

EPOCH 29 ...
Validation Accuracy = 0.229

EPOCH 30 ...
Validation Accuracy = 0.231

EPOCH 31 ...
Validation Accuracy = 0.238

EPOCH 32 ...
Validation Accuracy = 0.241

EPOCH 33 ...
Validation Accuracy = 0.246

EPOCH 34 ...
Validation Accuracy = 0.245

EPOCH 35 ...
Validation Accuracy = 0.252

EPOCH 36 ...
Validation Accuracy = 0.256

EPOCH 37 ...
Validation Accuracy = 0.256

EPOCH 38 ...
Validation Accuracy = 0.261

EPOCH 39 ...
Validation Accuracy = 0.262

EPOCH 40 ...
Validation Accuracy = 0.265

EPOCH 41 ...
Validation Accuracy = 0.269

EPOCH 42 ...
Validation Accuracy = 0.271

EPOCH 43 ...
Validation Accuracy = 0.278

EPOCH 44 ...
Validation Accuracy = 0.278

EPOCH 45 ...
Validation Accuracy = 0.276

EPOCH 46 ...
Validation Accuracy = 0.283

EPOCH 47 ...
Validation Accuracy = 0.285

EPOCH 48 ...
Validation Accuracy = 0.284

EPOCH 49 ...
Validation Accuracy = 0.288

EPOCH 50 ...
Validation Accuracy = 0.291

EPOCH 51 ...
Validation Accuracy = 0.290

EPOCH 52 ...
Validation Accuracy = 0.297

EPOCH 53 ...
Validation Accuracy = 0.298

EPOCH 54 ...
Validation Accuracy = 0.299

EPOCH 55 ...
Validation Accuracy = 0.301

EPOCH 56 ...
Validation Accuracy = 0.303

EPOCH 57 ...
Validation Accuracy = 0.303

EPOCH 58 ...
Validation Accuracy = 0.306

EPOCH 59 ...
Validation Accuracy = 0.309

EPOCH 60 ...
Validation Accuracy = 0.312

EPOCH 61 ...
Validation Accuracy = 0.317

EPOCH 62 ...
Validation Accuracy = 0.317

EPOCH 63 ...
Validation Accuracy = 0.320

EPOCH 64 ...
Validation Accuracy = 0.318

EPOCH 65 ...
Validation Accuracy = 0.320

EPOCH 66 ...
Validation Accuracy = 0.321

EPOCH 67 ...
Validation Accuracy = 0.320

EPOCH 68 ...
Validation Accuracy = 0.327

EPOCH 69 ...
Validation Accuracy = 0.325

EPOCH 70 ...
Validation Accuracy = 0.327

EPOCH 71 ...
Validation Accuracy = 0.329

EPOCH 72 ...
Validation Accuracy = 0.334

EPOCH 73 ...
Validation Accuracy = 0.335

EPOCH 74 ...
Validation Accuracy = 0.336

EPOCH 75 ...
Validation Accuracy = 0.336

EPOCH 76 ...
Validation Accuracy = 0.341

EPOCH 77 ...
Validation Accuracy = 0.343

EPOCH 78 ...
Validation Accuracy = 0.346

EPOCH 79 ...
Validation Accuracy = 0.346

EPOCH 80 ...
Validation Accuracy = 0.345

EPOCH 81 ...
Validation Accuracy = 0.350

EPOCH 82 ...
Validation Accuracy = 0.351

EPOCH 83 ...
Validation Accuracy = 0.352

EPOCH 84 ...
Validation Accuracy = 0.356

EPOCH 85 ...
Validation Accuracy = 0.357

EPOCH 86 ...
Validation Accuracy = 0.356

EPOCH 87 ...
Validation Accuracy = 0.358

EPOCH 88 ...
Validation Accuracy = 0.360

EPOCH 89 ...
Validation Accuracy = 0.363

EPOCH 90 ...
Validation Accuracy = 0.363

EPOCH 91 ...
Validation Accuracy = 0.367

EPOCH 92 ...
Validation Accuracy = 0.370

EPOCH 93 ...
Validation Accuracy = 0.371

EPOCH 94 ...
Validation Accuracy = 0.375

EPOCH 95 ...
Validation Accuracy = 0.374

EPOCH 96 ...
Validation Accuracy = 0.374

EPOCH 97 ...
Validation Accuracy = 0.373

EPOCH 98 ...
Validation Accuracy = 0.378

EPOCH 99 ...
Validation Accuracy = 0.379

EPOCH 100 ...
Validation Accuracy = 0.380

Model saved

In [23]:
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    test_accuracy = evaluate(X_test_normalized, y_test)
    print("Test Accuracy = {:.3f}".format(test_accuracy))


Test Accuracy = 0.387

Question 4

How did you train your model? (Type of optimizer, batch size, epochs, hyperparameters, etc.)

Answer:

This is the part of the project that took the longest. I spent several days tweaking it and trying to obtain the best results I could.

I tried using the tf.train.GradientDescentOptimizer optimizer with learning rates around [0.1, 0.01] and saw that the model wasn't really learning well. I have yet to figure out why, as I dropped it in favor of the tf.train.AdamOptimizer optimizer with learning rates around [0.001, 0.0001], which yielded good enough results that I just ran with it.

I was able to grow the batch size to 2048 for better performance; I didn't try anything above this, as it was satisfactory on the hardware I used.

As for the epochs, I noticed there was a point, around 750 iterations, after which my model wasn't learning much more. I had to run up to 2000 iterations to realize this.

Another parameter I used was keep_probability, given that I'm using dropout operations in my LeNet model. For training I set the keep probability to 0.5, and for validation and testing I use 1.0.
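
A sketch of how I could act on that plateau observation (not what the training cell above does) is simple early stopping on the validation accuracy, keeping only the best checkpoint:

best_accuracy = 0.0
epochs_without_improvement = 0
PATIENCE = 10  # assumed value; stop after this many epochs without improvement

for i in range(EPOCHS):
    # ... run the training batches exactly as in the training cell above ...
    validation_accuracy = evaluate(X_validation_split, y_validation_split)
    if validation_accuracy > best_accuracy:
        best_accuracy = validation_accuracy
        epochs_without_improvement = 0
        saver.save(sess, './lenet')  # keep only the best-performing model
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= PATIENCE:
            print("Stopping early at epoch {}".format(i + 1))
            break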

Question 5

What approach did you take in coming up with a solution to this problem? It may have been a process of trial and error, in which case, outline the steps you took to get to the final solution and why you chose those steps. Perhaps your solution involved an already well known implementation or architecture. In this case, discuss why you think this is suitable for the current problem.

Answer:

It was definitely trial and error. I used the LeNet code I had from a previous assignment and tried expanding it with what I learned from the Sermanet-LeCun paper, but that was very painful, as my lack of experience with TensorFlow, Python, and neural networks got in my way. The first attempts at running this didn't yield the results I was hoping for, with a test accuracy of 50%. After upgrading the LeNet based on the whitepaper, and realizing that I needed to normalize the data, I found that accuracy went up to 80%. After that, I also removed the code I had that was cropping the images based on the train['coords'] and train['sizes'] data, and got the test to score 90% accuracy.

I'm not sure why cropping the data didn't yield the results I expected; I plan on investigating this in my spare time. It hurt my test score by a whole 10 percentage points!

After that, I only played with changing the optimizer algorithm, the learning rate, the batch size, and the number of epochs, and I got to satisfactory levels. Not perfect, but satisfactory.


Step 3: Test a Model on New Images

Take several pictures of traffic signs that you find on the web or around you (at least five), and run them through your classifier on your computer to produce example results. The classifier might not recognize some local signs but it could prove interesting nonetheless.

You may find signnames.csv useful as it contains mappings from the class id (integer) to the actual sign name.

Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.


In [15]:
### Load the images and plot them here.
### Feel free to use as many code cells as needed.
import matplotlib.image as mpimg

print("Left: web image, Center: web image scaled to 32x32, Right: example image from original data set in the same class")

web_images = []
web_classes = [23,40,14,28,11]
n_web_classes = len(web_classes)

fig, axes = plt.subplots(5, 3, figsize=(20,20))
for i in range(n_web_classes):
    image = mpimg.imread("German Traffic Signs/{0}.jpg".format(i + 1))
    web_images.append(cv2.resize(image,(32, 32), interpolation = cv2.INTER_AREA))
    axes[i,0].imshow(image)
    axes[i,0].set_title("Image {0}".format(i))
    axes[i,0].axis("off")
    axes[i,1].imshow(web_images[i])
    axes[i,1].set_title("Resized image {0}".format(i))
    axes[i,1].axis("off")
    prototype = X_train[class_index[web_classes[i]][100]].squeeze()
    axes[i,2].imshow(prototype)
    axes[i,2].set_title(labelmap[y_train[class_index[web_classes[i]][100]]])
    axes[i,2].axis("off")
plt.show()

# Finally, normalize the images:
web_images = np.array([normalize_image(i) for i in np.array(web_images).astype(np.float32)])


Left: web image, Center: web image scaled to 32x32, Right: example image from original data set in the same class

Question 6

Choose five candidate images of traffic signs and provide them in the report. Are there any particular qualities of the image(s) that might make classification difficult? It could be helpful to plot the images in the notebook.

Answer:

It's able to perform relatively well, but the model doesn't seem to be tolerant of size and location changes for the detected objects. For example, the roundabout mandatory image is pretty close to the training set data, but the angle of the picture and other characteristics appear to have affected the outcome.

I blame this on limited input data, and on the fact that I still don't understand how to make my model more resilient to image deformation without necessarily feeding a huge number of examples into it. That's a question I'll throw into the Slack channel or the forums.


In [16]:
# Test the model against these new images (normalized data)
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    test_accuracy = evaluate(web_images, web_classes)
    print("Test Accuracy = {:.3f}".format(test_accuracy))


Test Accuracy = 0.800

Question 7

Is your model able to perform equally well on captured pictures when compared to testing on the dataset? The simplest way to do this is to check the accuracy of the predictions. For example, if the model predicted 1 out of 5 signs correctly, it's 20% accurate.

NOTE: You could check the accuracy manually by using signnames.csv (same directory). This file has a mapping from the class id (0-42) to the corresponding sign name. So, you could take the class id the model outputs, lookup the name in signnames.csv and see if it matches the sign from the image.

Answer:

Model accuracy:

  • Validation: 99.6%
  • Test data: 89%
  • Real-world data: 80%

The test data got 89%, while the images I took from the Internet got 80%. Since my web images data set was only 5 images, that means 1 of them didn't get classified correctly. That said, the accuracy achieved is satisfactory.

I delve deeper into this further down in the document.
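
The manual check suggested in the question can also be done per image; a sketch using the trained graph and the labelmap defined earlier:

with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    predictions = sess.run(tf.argmax(logits, 1),
                           feed_dict={x: web_images, keep_probability: 1.0})
    for predicted, expected in zip(predictions, web_classes):
        print("predicted:", labelmap[int(predicted)], "| expected:", labelmap[expected])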


In [17]:
### Visualize the softmax probabilities here.
### Feel free to use as many code cells as needed.

zero_to_n = [i for i in range(n_classes)]

with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    softmax = sess.run(tf.nn.softmax(logits), feed_dict={x: web_images, y: web_classes, keep_probability: 1.0})
    fig, axes = plt.subplots(5, 2, figsize=(20,20))
    for i in range(n_web_classes):
        axes[i,0].imshow(mpimg.imread("German Traffic Signs/{0}.jpg".format(i + 1)))
        axes[i,0].set_title(labelmap[web_classes[i]])
        axes[i,0].axis("off")
        axes[i,1].plot(zero_to_n, softmax[i])
        axes[i,1].set_title("Softmax for " + labelmap[y_train[class_index[web_classes[i]][100]]])
        axes[i,1].axis("on")
    plt.show()


Question 8

Use the model's softmax probabilities to visualize the certainty of its predictions, tf.nn.top_k could prove helpful here. Which predictions is the model certain of? Uncertain? If the model was incorrect in its initial prediction, does the correct prediction appear in the top k? (k should be 5 at most)

tf.nn.top_k will return the values and indices (class ids) of the top k predictions. So if k=3, for each sign, it'll return the 3 largest probabilities (out of a possible 43) and the corresponding class ids.

Take this numpy array as an example:

# (5, 6) array
a = np.array([[ 0.24879643,  0.07032244,  0.12641572,  0.34763842,  0.07893497,
         0.12789202],
       [ 0.28086119,  0.27569815,  0.08594638,  0.0178669 ,  0.18063401,
         0.15899337],
       [ 0.26076848,  0.23664738,  0.08020603,  0.07001922,  0.1134371 ,
         0.23892179],
       [ 0.11943333,  0.29198961,  0.02605103,  0.26234032,  0.1351348 ,
         0.16505091],
       [ 0.09561176,  0.34396535,  0.0643941 ,  0.16240774,  0.24206137,
         0.09155967]])

Running it through sess.run(tf.nn.top_k(tf.constant(a), k=3)) produces:

TopKV2(values=array([[ 0.34763842,  0.24879643,  0.12789202],
       [ 0.28086119,  0.27569815,  0.18063401],
       [ 0.26076848,  0.23892179,  0.23664738],
       [ 0.29198961,  0.26234032,  0.16505091],
       [ 0.34396535,  0.24206137,  0.16240774]]), indices=array([[3, 0, 5],
       [0, 1, 4],
       [0, 5, 1],
       [1, 3, 5],
       [1, 4, 3]], dtype=int32))

Looking just at the first row, we get [ 0.34763842, 0.24879643, 0.12789202]; you can confirm these are the 3 largest probabilities in a. You'll also notice [3, 0, 5] are the corresponding indices.

Answer:

This is so cool! Visualizing the softmax and graphing examples of the top 3 hits for an image really makes this a very interesting tool for classification and for fine-tuning of the model.

The image that failed to be classified correctly was the Slippery Road image, which got classified as Speed Limit (20 km/h). That said, the second-highest hit was the correct classification, as seen further down below in the graphic where we analyze the top-k hits.

I'm thinking that my data augmentation is to blame for this and I could have done a better job. I could also have picked images that were even more similar to the ones in the training set, but then that wouldn't have been interesting, would it?

Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.


In [18]:
print("Top 3 hits")
print("Left: web image")

with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    top_k = sess.run(tf.nn.top_k(tf.nn.softmax(logits), k=3),
                     feed_dict={x: web_images, y: web_classes, keep_probability: 1.0})

    fig, axes = plt.subplots(5, 4, figsize=(20,20))
    for i in range(n_web_classes):
        axes[i,0].imshow(mpimg.imread("German Traffic Signs/{0}.jpg".format(i + 1)))
        axes[i,0].set_title("Web image: " + labelmap[web_classes[i]])
        axes[i,0].axis("off")
        axes[i,1].imshow(X_train[class_index[top_k[1][i][0]]][100].squeeze())
        axes[i,1].set_title("1. {:} ({:.1f}%)".format(labelmap[top_k[1][i][0]],top_k[0][i][0]*100))
        axes[i,1].axis("off")
        axes[i,2].imshow(X_train[class_index[top_k[1][i][1]]][100].squeeze())
        axes[i,2].set_title("2. {:} ({:.1f}%)".format(labelmap[top_k[1][i][1]],top_k[0][i][1]*100))
        axes[i,2].axis("off")
        axes[i,3].imshow(X_train[class_index[top_k[1][i][2]]][100].squeeze())
        axes[i,3].set_title("3. {:} ({:.1f}%)".format(labelmap[top_k[1][i][2]],top_k[0][i][2]*100))
        axes[i,3].axis("off")
    plt.show()


Top 3 hits
Left: web image