In this notebook, a template is provided for you to implement, in stages, the functionality required to successfully complete this project. If additional code is required that cannot be included in the notebook, be sure that the Python code is successfully imported and included in your submission. Sections that begin with 'Implementation' in the header indicate where you should begin your implementation for your project. Note that some sections of the implementation are optional, and will be marked with 'Optional' in the header.
In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.
Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.
In [1]:
# Load pickled data
import pickle
training_file = './train.p'
testing_file = './test.p'
with open(training_file, mode='rb') as f:
    train = pickle.load(f)
with open(testing_file, mode='rb') as f:
    test = pickle.load(f)
X_train, y_train = train['features'], train['labels']
X_test, y_test = test['features'], test['labels']
X_coords, X_sizes = train['coords'], train['sizes']
The pickled data is a dictionary with 4 key/value pairs:
'features' is a 4D array containing raw pixel data of the traffic sign images, (num examples, width, height, channels).
'labels' is a 1D array containing the label/class id of the traffic sign. The file signnames.csv contains id -> name mappings for each id.
'sizes' is a list containing tuples, (width, height), representing the original width and height of the image.
'coords' is a list containing tuples, (x1, y1, x2, y2), representing the coordinates of a bounding box around the sign in the image. THESE COORDINATES ASSUME THE ORIGINAL IMAGE. THE PICKLED DATA CONTAINS RESIZED VERSIONS (32 by 32) OF THESE IMAGES.
Complete the basic data summary below.
In [2]:
import csv

# Number of training examples
n_train = len(train['labels'])

# Number of testing examples
n_test = len(test['labels'])

# What's the shape of a traffic sign image?
image_shape = X_train[0].shape

# How many unique classes/labels are there in the dataset?
labelmap = {}
with open('./signnames.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        labelmap[int(row['ClassId'])] = row['SignName']
n_classes = len(labelmap)

print("Number of training examples =", n_train)
print("Number of testing examples =", n_test)
print("Image data shape =", image_shape)
print("Number of classes =", n_classes)
Visualize the German Traffic Signs Dataset using the pickled file(s). This is open ended, suggestions include: plotting traffic sign images, plotting the count of each sign, etc.
The Matplotlib examples and gallery pages are a great resource for doing visualizations in Python.
NOTE: It's recommended you start with something simple first. If you wish to do more, come back to it after you've completed the rest of the sections.
In [3]:
import matplotlib.pyplot as plt
# Visualizations will be shown in the notebook.
%matplotlib inline
import random
index = random.randint(0, n_train - 1)  # randint is inclusive on both ends
image = X_train[index]
plt.figure(figsize=(1,1))
plt.imshow(image)
print("Description: " + labelmap[y_train[index]])
print("Original image size:")
print(X_sizes[index])
print("Original image bounding box (x1, y1, x2, y2):")
print(X_coords[index])
Design and implement a deep learning model that learns to recognize traffic signs. Train and test your model on the German Traffic Sign Dataset.
There are various aspects to consider when thinking about this problem: the neural network architecture, the preprocessing techniques (normalization, RGB to grayscale, etc.), the number of examples per label (some classes have many more than others), and whether to generate additional data.
Here is an example of a published baseline model on this problem. You're not required to be familiar with the approach used in the paper, but it's good practice to try to read papers like these.
NOTE: The LeNet-5 implementation shown in the classroom at the end of the CNN lesson is a solid starting point. You'll have to change the number of classes and possibly the preprocessing, but aside from that it's plug and play!
In [4]:
# Pre-processing of the data
import numpy as np
import cv2

# First, let's see how many items of each category we have:
class_index = []

# let's initialize class_index (one list of image indices per class)
for index in range(n_classes):
    class_index.append([])
for index in range(len(y_train)):
    item_class = y_train[index]
    class_index[item_class].append(index)

# show a histogram for human readability
plt.hist(y_train, bins='auto')
plt.title("Initial count of images (y) per class (x)")
plt.show()

def modify_image(image):
    """
    Expects the parameter to be a 3-dimensional array with the RGB information
    for the image; will create a new image by modifying the original with random
    modifications, particularly: tilting.
    http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_geometric_transformations/py_geometric_transformations.html
    """
    rows, cols = len(image), len(image[0])
    # tilt by a random angle in [-20, 20] degrees
    tilt_angle = -20 + int(random.random() * 40)
    M = cv2.getRotationMatrix2D((cols/2, rows/2), tilt_angle, 1)
    out = cv2.warpAffine(image, M, (cols, rows))
    # shift each corner by a random amount in [-2, 2] in x and [-2, 2] in y
    scale_x_left = -2 + int(random.random() * 4)
    scale_x_right = -2 + int(random.random() * 4)
    scale_y_top = -2 + int(random.random() * 4)
    scale_y_bottom = -2 + int(random.random() * 4)
    pts1 = np.float32([[0, 0], [rows, 0], [0, cols], [rows, cols]])
    pts2 = np.float32([[scale_y_top, scale_x_left],
                       [rows + scale_y_bottom, scale_x_left],
                       [scale_y_top, cols + scale_x_right],
                       [rows + scale_y_bottom, cols + scale_x_right]])
    M = cv2.getPerspectiveTransform(pts1, pts2)
    out = cv2.warpPerspective(out, M, (cols, rows))
    return np.asarray(out)

# The disparity in image count is striking; we should attempt to level the field at least a little bit
# so the difference in example count between classes isn't as wide.
# We will ensure that each class has at least N items, so we'll iterate over the class index and
# add the required number of images; these new images will be created by applying modifications to
# the initial set of images.
# I arbitrarily chose 2000 as the minimum number of images, given that some classes have as few as 250
# items while others have as many as 2500. Lowering this disparity is what I intend.
min_item_count = 2000
extension_to_X_train = []
extension_to_y_train = []
extension_to_X_coords = []
extension_to_X_sizes = []
for current_class in range(n_classes):
    item_index = class_index[current_class]
    item_count = len(item_index)
    if item_count < min_item_count:
        # let's add the new data, cycling over the class's existing images
        N = min_item_count - item_count
        for i in range(N):
            image_selector = item_index[i % item_count]
            selected_image = X_train[image_selector]
            extension_to_X_train.append(modify_image(selected_image))
            extension_to_y_train.append(y_train[image_selector])
            extension_to_X_coords.append(X_coords[image_selector])
            extension_to_X_sizes.append(X_sizes[image_selector])

# Append extensions
if len(extension_to_X_train) > 0:
    X_train = np.append(X_train, extension_to_X_train, axis=0)
    y_train = np.append(y_train, extension_to_y_train, axis=0)
    X_coords = np.append(X_coords, extension_to_X_coords, axis=0)
    X_sizes = np.append(X_sizes, extension_to_X_sizes, axis=0)
n_train = len(X_train)
In [5]:
# Give a count of the modified training data
# show a histogram for human readability
print("Total item count =", n_train)
plt.hist(y_train, bins='auto')
plt.title("Count of images (y) per class (x) after augmentation")
plt.show()

if len(extension_to_X_train) > 0:
    index = random.randint(0, len(extension_to_X_train) - 1)  # randint is inclusive on both ends
    image = extension_to_X_train[index]
    plt.figure(figsize=(1,1))
    plt.imshow(image)
    print("Description: " + labelmap[extension_to_y_train[index]] + " - ", index)
Answer:
When I plotted the histogram to visually count how many items we had per class (label), it was clear that the disparity was very large. The suggestions given in the Step 2 header were a hint towards this being an issue. When I first did the LeNet pass on the original data, I realized my validation and test scores were low, and a bit random. I decided to augment the examples towards a more even count per class, and settled on having at least 2000 examples in each class.
The way I did this was by building an index of all the images per class. I then counted how many images there were in a given class, and if the count was less than 2000 I'd proceed with augmentation. These are the augmentation steps: cycle over the existing images of the class, create a modified copy of each selected image with modify_image, and append the copy (along with its label, coordinates, and size metadata) to the extension arrays until the class reaches the minimum count.
I then produced a new histogram of the resulting set of images. Since I augmented all the metadata sets as well, the length and order of the data is preserved and would allow for the rest of this exercise to proceed regardless of the augmentation step.
As for the image modification, I decided to make random tweaks using two different operations: a rotation by a random angle in [-20, 20] degrees, and a perspective transform that shifts each corner of the image by a random amount in [-2, 2] pixels.
I based the transformations on the examples given at http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_geometric_transformations/py_geometric_transformations.html
The idea behind this was to ensure that the extended data was different (even if just slightly) from the original data, so as to actually help the model see new things.
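As a quick visual sanity check (my addition, a minimal sketch assuming the modify_image function and arrays defined above), one can render an original image next to a few randomly modified copies:

# Minimal sketch: preview the augmentation output (assumes modify_image,
# X_train, random, and matplotlib are already available in this notebook).
sample = X_train[random.randint(0, len(X_train) - 1)]
fig, axes = plt.subplots(1, 4, figsize=(6, 2))
axes[0].imshow(sample)
axes[0].set_title("original")
for k in range(1, 4):
    # each call applies a fresh random tilt and corner shift
    axes[k].imshow(modify_image(sample))
    axes[k].set_title("modified")
for ax in axes:
    ax.axis("off")
plt.show()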
Note
I am explicitly NOT normalizing the data in this step, as I discovered that doing so caused issues with what I wanted to do next: create 3-channel grayscale representations of the same images, effectively doubling the data set.
In the following code blocks I generate the grayscale set, and after that is done I proceed to normalize the data as taught in the online class. We want the data to be centered around zero, ideally with a normal distribution around zero.
Then I proceed to split the augmented data into a training set and a validation set.
In [6]:
import cv2

# An earlier attempt used OpenCV's converter:
#X_train_grayscale = []
#for index in range(n_train):
#    grayscale = cv2.cvtColor(X_train_scaled[index], cv2.COLOR_BGR2GRAY)
#    X_train_grayscale.append(grayscale)

X_train_grayscale = []
for index in range(n_train):
    grayscale = np.zeros_like(X_train[index], dtype=np.float32)
    rgb = X_train[index]
    # luminance formula; channel 0 is red, 1 is green, 2 is blue in this data set
    y = 0.2989 * rgb[:,:,0] + 0.5870 * rgb[:,:,1] + 0.1140 * rgb[:,:,2]
    grayscale[:,:,0] = grayscale[:,:,1] = grayscale[:,:,2] = y.squeeze()
    X_train_grayscale.append(grayscale)

print("Done grayscaling, rendering an example output vs the original:")
index = random.randint(0, n_train - 1)
print("Index: ", index)
image = X_train_grayscale[index]
plt.figure(figsize=(1,1))
plt.imshow(image)
plt.figure(figsize=(1,1))
plt.imshow(X_train[index])
print("Description: " + labelmap[y_train[index]])
# we now have a second set of images which is the same as the cropped-resized images we pre-processed before
# but in a human-balanced grayscale, still 3 channels.
In [7]:
# Normalize the data first (centered at 0, with a span of 1)
# We know the RGB data has values 0 to 255
def normalize_image(image):
    return (image - 128) / 128

# Normalize the data, ensuring the values are float32
X_train_grayscale_normalized = np.array([normalize_image(i) for i in X_train_grayscale])
X_train_normalized = np.array([normalize_image(i) for i in X_train.astype(np.float32)])
X_test_normalized = np.array([normalize_image(i) for i in X_test.astype(np.float32)])
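As a quick sanity check (my addition, not part of the original pipeline), the normalized arrays should now hold values in [-1, 1) with a mean near zero:

# Minimal sketch: verify the normalization landed where we expect
# (assumes the arrays computed in the cell above).
print("min:", X_train_normalized.min(), "max:", X_train_normalized.max())
print("mean:", X_train_normalized.mean(), "std:", X_train_normalized.std())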
In [8]:
#index = random.randint(0, n_train)
print("Index: ", index)
plt.figure(figsize=(1,1))
plt.imshow(X_train_normalized[index])
plt.figure(figsize=(1,1))
plt.imshow(X_train_grayscale_normalized[index])
print("Example images from normalized data")
In [9]:
# Let's say we want 80% of the data to be training data, and 20% to be validation data
# We also want to shuffle the items in the training set and the validation set
from sklearn.utils import shuffle

train_split_prob = 0.8
X_train_split = []
y_train_split = []
X_validation_split = []
y_validation_split = []

def split_and_append(Xinput, yinput, split_prob, X1, y1, X2, y2):
    """
    Takes the input X,y and appends them to X1,y1 or X2,y2 depending
    on the split_prob value.
    """
    for index in range(len(Xinput)):
        image = Xinput[index]
        classification = yinput[index]
        is_second_split = random.random() >= split_prob
        if is_second_split:
            X2.append(image)
            y2.append(classification)
        else:
            X1.append(image)
            y1.append(classification)

split_and_append(X_train_normalized,
                 y_train,
                 train_split_prob,
                 X_train_split,
                 y_train_split,
                 X_validation_split,
                 y_validation_split)
split_and_append(X_train_grayscale_normalized,
                 y_train,
                 train_split_prob,
                 X_train_split,
                 y_train_split,
                 X_validation_split,
                 y_validation_split)
X_train_split, y_train_split = shuffle(X_train_split, y_train_split)
X_validation_split, y_validation_split = shuffle(X_validation_split, y_validation_split)
print("Done splitting and shuffling items")

expected_len_X = len(X_train_normalized) + len(X_train_grayscale_normalized)
expected_len_y = len(y_train) * 2
len_train_X = len(X_train_split)
len_validation_X = len(X_validation_split)
len_train_y = len(y_train_split)
len_validation_y = len(y_validation_split)
assert expected_len_X == len_train_X + len_validation_X
assert expected_len_y == len_train_y + len_validation_y
assert len_train_X == len_train_y
assert len_validation_X == len_validation_y
print(len_train_X)
print(len_validation_X)

index = random.randint(0, len(X_train_split) - 1)  # randint is inclusive on both ends
image = X_train_split[index]
plt.figure(figsize=(1,1))
plt.imshow(image)
print("Description: " + labelmap[y_train_split[index]])
Describe how you set up the training, validation and testing data for your model. Optional: If you generated additional data, how did you generate the data? Why did you generate the data? What are the differences in the new dataset (with generated data) from the original dataset?
Answer:
Splitting the training set
We already had one training set and one test set to begin with. I'm leaving the test set as-is and intend to use it only once, after I'm fully confident of the accuracy of my model based on the validation data.
The training set, however, is going to be split into 2 parts: a training split (roughly 80% of the items) and a validation split (the remaining ~20%).
This partition is achieved by iterating over the input data set (X,y) and splitting it into the 2 buckets (training(X,y) and validation(X,y)). Each data pair (Xi,yi) is assigned to a set by obtaining a random number between 0.0 and 1.0 and choosing one bucket if the number is below the threshold, or the other bucket if it's above. This makes it easy to think of the partition as percentages of the input. Note that we assume here that the random number generator gives a relatively uniform distribution of results over the [0.0, 1.0] interval.
After the training set and validation set have been created, we proceed to shuffle their contents and perform various assertions.
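For reference, and not what the code above does: scikit-learn ships a helper that produces an exact split and can also stratify by class, which avoids the slight randomness in split sizes. A minimal sketch under those assumptions:

# Minimal sketch of an alternative split using scikit-learn (not used above).
# train_test_split yields an exact 80/20 split; stratify keeps the class
# proportions equal in both buckets.
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_normalized, y_train,
    test_size=0.2,      # exact 20% validation bucket
    stratify=y_train,   # preserve per-class proportions
    random_state=42)    # reproducible shuffling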
Newly generated data
I decided to take this step to generate grayscale images corresponding to the original+augmented images. We really have 2 data sets at this point: the color images (original plus augmented) and their grayscale counterparts.
I'm thinking perhaps training the data on both the color and the grayscale images might help me improve the accuracy of the predictions going forward. I'll try to ensure that the network can decide "this is a yield sign" based on the unique characteristics found in the color image, or the more general strokes seen in the grayscale image.
As for the grayscale, I decided to compute it myself with a human-eye based formula I found online:
y = 0.2989 * red + 0.5870 * green + 0.1140 * blue
There was no big reason for this; I just liked the way it looked better than the built-in algorithms I had at hand.
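The per-image loop in the grayscale cell above can also be expressed as one vectorized operation; a minimal sketch (my addition), assuming X_train is an (N, 32, 32, 3) RGB array:

# Minimal sketch: vectorized luminance conversion for the whole data set,
# equivalent in spirit to the loop above (assumes RGB channel order).
weights = np.array([0.2989, 0.5870, 0.1140], dtype=np.float32)
luma = X_train.astype(np.float32) @ weights             # shape (N, 32, 32)
X_gray3 = np.repeat(luma[..., np.newaxis], 3, axis=3)   # back to 3 identical channels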
Ideas that didn't work
I read that the pickled data contained cropping information, encoded as the corners of a bounding box in the original image's coordinate system. I transformed the cropping information into the coordinate system used by the scaled (32 x 32) images, then used a PIL image to crop away the area outside the pixels of interest. Since our algorithm assumes a 32 x 32 image, I scaled the cropped image back to those dimensions.
I chose this in order to increase the signal-to-noise ratio of the data we're sending to the network. I'm assuming that the pixels outside of the area of interest are noise and might lower the overall accuracy of our predictions. For example, if all the training images had green or brown from trees in the background, the network might fail to understand that a sign with a blue sky in the background is equally valid. Furthermore, if we move to grayscale, eliminating the surrounding pixels is just as useful. Imagine that for a specific type of traffic sign, such as "yield", we didn't have enough training data. In that case our network would be very sensitive to all the pixels available in the few images we have, which might mean it can only recognize "yield" signs when the signal and noise levels of the test data are similar to those of the training data. I'm assuming that by reducing the noise in the training data we render the noise in the test data less detrimental overall.
My assumptions might be wrong, so not only will I be training my network, I'll be training my own personal experience and the heuristics I'll use going forward.
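For the record, the discarded experiment looked roughly like the sketch below. This is a reconstruction from my description, not the original code; it assumes coords rows are (x1, y1, x2, y2) in the original image's pixel space and sizes rows are (width, height).

# Minimal sketch of the discarded crop-and-resize idea (a reconstruction,
# not the original code). Assumes the coordinate conventions stated above.
from PIL import Image

def crop_to_sign(image32, coords, size):
    w, h = size
    x1, y1, x2, y2 = coords
    # map the bounding box from original coordinates into the 32x32 image
    box = (int(x1 * 32 / w), int(y1 * 32 / h),
           int(x2 * 32 / w), int(y2 * 32 / h))
    pil = Image.fromarray(image32)
    # crop away everything outside the sign, then scale back to 32x32
    return np.asarray(pil.crop(box).resize((32, 32)))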
In [10]:
import tensorflow as tf

def LeNet(x, keep_probability):
    # Arguments used for tf.truncated_normal; randomly defines variables for the weights and biases for each layer
    mu = 0
    sigma = 0.1

    # Layer 1: Convolutional. Input = 32x32x3. Output = 28x28x6.
    # new_height = (input_height - filter_height + 2 * P)/S + 1
    # 28 = ((32 - fh)/S) + 1
    # (27 * S) + fh = 32
    # S = 1, fh = 5
    F_W = tf.Variable(tf.truncated_normal((5, 5, 3, 6), mean=mu, stddev=sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(6))
    strides = [1, 1, 1, 1]
    padding = 'VALID'
    layer1 = tf.nn.conv2d(x, F_W, strides, padding) + F_b

    # Activation 1.
    layer1 = tf.nn.relu(layer1)

    # Pooling 1. Input = 28x28x6. Output = 14x14x6.
    # new_height = (input_height - filter_height)/S + 1
    # 14 = ((28 - fh)/S) + 1
    # (13 * S) + fh = 28
    # S = 2, fh = 2
    ksize = [1, 2, 2, 1]
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    pooling_layer1 = tf.nn.max_pool(layer1, ksize, strides, padding)

    # Dropout 1.
    pooling_layer1 = tf.nn.dropout(pooling_layer1, keep_probability)

    # Flatten 2a. Input = 14x14x6. Output = 1176.
    flatten_layer2a = tf.contrib.layers.flatten(pooling_layer1)

    # Layer 2b: Convolutional. Input = 14x14x6. Output = 10x10x16.
    # new_height = (input_height - filter_height + 2 * P)/S + 1
    # 10 = ((14 - fh)/S) + 1
    # (9 * S) + fh = 14
    # S = 1, fh = 5
    F_W = tf.Variable(tf.truncated_normal((5, 5, 6, 16), mean=mu, stddev=sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(16))
    strides = [1, 1, 1, 1]
    padding = 'VALID'
    layer2b = tf.nn.conv2d(pooling_layer1, F_W, strides, padding) + F_b

    # Activation 2b.
    layer2b = tf.nn.relu(layer2b)

    # Pooling 2b. Input = 10x10x16. Output = 5x5x16.
    # new_height = (input_height - filter_height)/S + 1
    # 5 = ((10 - fh)/S) + 1
    # (4 * S) + fh = 10
    # S = 2, fh = 2
    ksize = [1, 2, 2, 1]
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    pooling_layer2b = tf.nn.max_pool(layer2b, ksize, strides, padding)

    # Dropout 2b.
    pooling_layer2b = tf.nn.dropout(pooling_layer2b, keep_probability)

    # Flatten 2b. Input = 5x5x16. Output = 400.
    flatten_layer2b = tf.contrib.layers.flatten(pooling_layer2b)

    # Layer 2c: Convolutional. Input = 5x5x16. Output = 3x3x32.
    # new_height = (input_height - filter_height + 2 * P)/S + 1
    # 3 = ((5 - fh)/S) + 1
    # (2 * S) + fh = 5
    # S = 2, fh = 1
    F_W = tf.Variable(tf.truncated_normal((1, 1, 16, 32), mean=mu, stddev=sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(32))
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    layer2c = tf.nn.conv2d(pooling_layer2b, F_W, strides, padding) + F_b

    # Activation 2c.
    layer2c = tf.nn.relu(layer2c)

    # Pooling 2c. Input = 3x3x32. Output = 2x2x32.
    # new_height = (input_height - filter_height)/S + 1
    # 2 = ((3 - fh)/S) + 1
    # (1 * S) + fh = 3
    # S = 2, fh = 1
    ksize = [1, 1, 1, 1]
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    pooling_layer2c = tf.nn.max_pool(layer2c, ksize, strides, padding)

    # Dropout 2c.
    pooling_layer2c = tf.nn.dropout(pooling_layer2c, keep_probability)

    # Flatten 2c. Input = 2x2x32. Output = 128.
    flatten_layer2c = tf.contrib.layers.flatten(pooling_layer2c)

    # Concat layers 2a, 2b, 2c. Input = 1176 + 400 + 128. Output = 1704.
    flat_layer2 = tf.concat_v2([tf.concat_v2([flatten_layer2b, flatten_layer2a], 1), flatten_layer2c], 1)

    # Layer 3: Fully Connected. Input = 1704. Output = 120.
    F_W = tf.Variable(tf.truncated_normal((1704, 120), mean=mu, stddev=sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(120))
    fully_connected = tf.matmul(flat_layer2, F_W) + F_b

    # Activation 3.
    fully_connected = tf.nn.relu(fully_connected)

    # Dropout 3.
    fully_connected = tf.nn.dropout(fully_connected, keep_probability)

    # Layer 4: Fully Connected. Input = 120. Output = 84.
    F_W = tf.Variable(tf.truncated_normal((120, 84), mean=mu, stddev=sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(84))
    fully_connected = tf.matmul(fully_connected, F_W) + F_b

    # Activation 4.
    fully_connected = tf.nn.relu(fully_connected)

    # Dropout 4.
    fully_connected = tf.nn.dropout(fully_connected, keep_probability)

    # Layer 5: Fully Connected. Input = 84. Output = n_classes.
    F_W = tf.Variable(tf.truncated_normal((84, n_classes), mean=mu, stddev=sigma, dtype=tf.float32))
    F_b = tf.Variable(tf.zeros(n_classes))
    logits = tf.matmul(fully_connected, F_W) + F_b

    # Dropout 5.
    logits = tf.nn.dropout(logits, keep_probability)

    return logits
What does your final architecture look like? (Type of model, layers, sizes, connectivity, etc.) For reference on how to build a deep neural network using TensorFlow, see Deep Neural Network in TensorFlow from the classroom.
Answer:
The code above is based on the published baseline model on this problem that was referenced above in this document.
I originally took my LeNet implementation with dropout operations and ran it as it was. It worked pretty well on the training and validation sets, but not so well on the test data or the real-world data. I then decided to revisit the Sermanet-LeCun paper and learn what they did differently. I noticed that the stage-1 data, after sub-sampling, was being fed directly to the classifier. I had to look this up online because it wasn't clear to me how you could do that while at the same time feeding the stage-2 data into the classifier.
Upon closer inspection of the figures and the description in the whitepaper, I saw that the data flowing from stage-1 and stage-2 to the classifier was a convolution followed by what appeared to be a simple concatenation, so I gave it a try.
In order to do that I split stage-2 into 3 parts: 2a, a flattened copy of the pooled stage-1 output (1176 features); 2b, a second convolution plus pooling stage flattened to 400 features; and 2c, a third convolution plus pooling stage flattened to 128 features.
I then took the 3 flat sets and called tf.concat_v2 on them to produce a single set of features for the classifier. I originally tried calling tf.concat but never managed to make it work the way I wanted. The resulting set was a 1704-long feature vector, which I then continued processing as in the previous implementation of LeNet, only changing the input size of the classifier from 400 to 1704.
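For context, the confusion likely comes from the argument order: in the TensorFlow version this notebook targets, the original tf.concat took the axis first, while tf.concat_v2 takes the values first (the order that later became tf.concat in TF 1.0). A minimal illustration, using the flattened tensors from inside LeNet:

# Illustration only; these tensors live inside the LeNet function above.
# pre-1.0:   tf.concat(concat_dim, values)  -> axis argument first
# concat_v2: tf.concat_v2(values, axis)     -> values first (TF 1.0's tf.concat)
flat_layer2 = tf.concat_v2([flatten_layer2a, flatten_layer2b, flatten_layer2c], 1)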
Summary of steps
1. Augment the training data so every class has at least 2000 examples.
2. Generate a grayscale copy of every training image, doubling the set.
3. Normalize all images to values centered around zero.
4. Split the combined set into training (~80%) and validation (~20%) sets, and shuffle both.
5. Train the modified LeNet (multi-scale features plus dropout), evaluating on the validation set each epoch.
6. Run the test set once at the end.
In [11]:
import tensorflow as tf
# Let's initialize the model
x = tf.placeholder(tf.float32, (None, 32, 32, 3))
y = tf.placeholder(tf.int32, (None))
keep_probability = tf.placeholder(tf.float32)
one_hot_y = tf.one_hot(y, n_classes)
adam_learning_rate = 0.0001
#gradient_descent_learning_rate = 0.1
logits = LeNet(x, keep_probability)
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=one_hot_y)
#prediction = tf.nn.softmax(logits)
#cross_entropy = -tf.reduce_sum(y * tf.log(prediction), reduction_indices=1)
loss_operation = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(adam_learning_rate)
#optimizer = tf.train.GradientDescentOptimizer(gradient_descent_learning_rate)
training_operation = optimizer.minimize(loss_operation)
In [12]:
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(one_hot_y, 1))
accuracy_operation = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
saver = tf.train.Saver()
In [13]:
EPOCHS = 1000
BATCH_SIZE = 2048

def evaluate(X_data, y_data):
    num_examples = len(X_data)
    total_accuracy = 0
    sess = tf.get_default_session()
    for offset in range(0, num_examples, BATCH_SIZE):
        batch_x, batch_y = X_data[offset:offset+BATCH_SIZE], y_data[offset:offset+BATCH_SIZE]
        accuracy = sess.run(accuracy_operation, feed_dict={x: batch_x, y: batch_y, keep_probability: 1.0})
        total_accuracy += (accuracy * len(batch_x))
    return total_accuracy / num_examples

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    num_examples = len(X_train_split)

    print("Training...")
    print()
    for i in range(EPOCHS):
        X_train_split, y_train_split = shuffle(X_train_split, y_train_split)
        for offset in range(0, num_examples, BATCH_SIZE):
            end = offset + BATCH_SIZE
            batch_x, batch_y = X_train_split[offset:end], y_train_split[offset:end]
            # dropout is active during training only (keep probability 0.5);
            # evaluate() feeds 1.0 so validation/testing run without dropout
            sess.run(training_operation, feed_dict={x: batch_x, y: batch_y, keep_probability: 0.5})

        validation_accuracy = evaluate(X_validation_split, y_validation_split)
        print("EPOCH {} ...".format(i+1))
        print("Validation Accuracy = {:.3f}".format(validation_accuracy))
        print()

    saver.save(sess, './lenet')
    print("Model saved")
In [14]:
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    test_accuracy = evaluate(X_test_normalized, y_test)
    print("Test Accuracy = {:.3f}".format(test_accuracy))
Answer:
This is the part of this homework that took the longest. I spent several days tweaking it, trying to obtain the best results I could.
I tried using the tf.train.GradientDescentOptimizer with learning rates around [0.1, 0.01] and saw that the model wasn't really learning well. I've yet to figure out why, as I dropped it in favor of the tf.train.AdamOptimizer with learning rates around [0.001, 0.0001], which yielded such good results that I just ran with it.
I was able to grow the batch size to 2048 for better performance; I didn't try anything above this, as it was satisfactory on the hardware I used.
I noticed that there was a point, around 750 epochs, after which my model wasn't learning much more; I ran up to 2000 iterations to confirm this.
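Given that plateau, simple early stopping would save most of the training time. Here's a minimal sketch (my addition, not what the training cell above does) that tracks the best validation accuracy and stops once it stalls; it assumes it runs inside the same tf.Session block, with a hypothetical patience parameter:

# Minimal sketch: early stopping on validation accuracy (hypothetical
# variant of the loop above; `patience` is an assumed parameter).
best_accuracy = 0.0
stale_epochs = 0
patience = 50  # stop after this many epochs without improvement

for i in range(EPOCHS):
    X_train_split, y_train_split = shuffle(X_train_split, y_train_split)
    for offset in range(0, num_examples, BATCH_SIZE):
        end = offset + BATCH_SIZE
        batch_x, batch_y = X_train_split[offset:end], y_train_split[offset:end]
        sess.run(training_operation, feed_dict={x: batch_x, y: batch_y, keep_probability: 0.5})
    validation_accuracy = evaluate(X_validation_split, y_validation_split)
    if validation_accuracy > best_accuracy:
        best_accuracy = validation_accuracy
        stale_epochs = 0
        saver.save(sess, './lenet')  # keep only the best checkpoint
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            print("Stopping early at epoch {0}".format(i + 1))
            break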
Another parameter I used was keep_probability, given that I'm using dropout operations in my LeNet model. For training I set the keep probability to 0.5; for validation and testing I use 1.0.
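As a side note on why 1.0 is the right value at evaluation time: tf.nn.dropout zeroes each activation with probability 1 - keep_prob and scales the survivors by 1/keep_prob, so feeding 1.0 turns it into a no-op. A tiny demonstration:

# Minimal sketch: tf.nn.dropout at the two keep_prob settings. With 0.5,
# roughly half the values are zeroed and the survivors are doubled; with
# 1.0 the output equals the input, which is why evaluation feeds 1.0.
a = tf.constant([1.0, 2.0, 3.0, 4.0])
with tf.Session() as sess:
    print(sess.run(tf.nn.dropout(a, keep_prob=0.5)))  # e.g. [2. 0. 6. 0.]
    print(sess.run(tf.nn.dropout(a, keep_prob=1.0)))  # [1. 2. 3. 4.]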
What approach did you take in coming up with a solution to this problem? It may have been a process of trial and error, in which case, outline the steps you took to get to the final solution and why you chose those steps. Perhaps your solution involved an already well known implementation or architecture. In this case, discuss why you think this is suitable for the current problem.
Answer:
It was definitely trial-and-error. I reused the LeNet code I had from a previous assignment and tried expanding it with information from the Sermanet-LeCun paper, but that was very painful, as my lack of experience with TensorFlow, Python, and neural networks was in my way. The first attempts at running this didn't yield the results I was hoping for, with a test accuracy of 50%. After upgrading the LeNet based on the whitepaper, and realizing that I needed to normalize the data, I found that accuracy went up to 80%. After that, I also removed the code I had that was cropping the images based on the train['coords'] and train['sizes'] data and got the test to score 90% accuracy.
I'm not sure why cropping the data didn't yield the results I expected; I plan on investigating this in my spare time. It hurt my test score by a whole 10 percentage points!
After that, I only played with changing the optimizer algorithm, the learning rate, the batch size, and the number of epochs, and I got to satisfactory levels. Not perfect, but satisfactory.
Take several pictures of traffic signs that you find on the web or around you (at least five), and run them through your classifier on your computer to produce example results. The classifier might not recognize some local signs but it could prove interesting nonetheless.
You may find signnames.csv
useful as it contains mappings from the class id (integer) to the actual sign name.
In [15]:
### Load the images and plot them here.
### Feel free to use as many code cells as needed.
import matplotlib.image as mpimg

print("Left: web image, Center: web image scaled to 32x32, Right: example image from original data set in the same class")
web_images = []
web_classes = [23, 40, 14, 28, 11]
n_web_classes = len(web_classes)
fig, axes = plt.subplots(5, 3, figsize=(20,20))
for i in range(n_web_classes):
    image = mpimg.imread("German Traffic Signs/{0}.jpg".format(i + 1))
    web_images.append(cv2.resize(image, (32, 32), interpolation=cv2.INTER_AREA))
    axes[i,0].imshow(image)
    axes[i,0].set_title("Image {0}".format(i))
    axes[i,0].axis("off")
    axes[i,1].imshow(web_images[i])
    axes[i,1].set_title("Resized image {0}".format(i))
    axes[i,1].axis("off")
    prototype = X_train[class_index[web_classes[i]][100]].squeeze()
    axes[i,2].imshow(prototype)
    axes[i,2].set_title(labelmap[y_train[class_index[web_classes[i]][100]]])
    axes[i,2].axis("off")
plt.show()

# Finally, normalize the images:
web_images = np.array([normalize_image(i) for i in np.array(web_images).astype(np.float32)])
Answer:
It's able to perform relatively well, but the model doesn't seem to be tolerant of size and location changes in the detected objects. For example, the roundabout mandatory image is pretty close to the training set data, but the angle of the picture and other characteristics appear to have affected the outcome.
I blame this on limited input data, and on the fact that I still don't understand how to make my model more resilient to image deformation without feeding a huge number of examples into it. That's a question I'll throw into the Slack channel or the forums.
In [16]:
# Test the model against these new images (normalized data)
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    test_accuracy = evaluate(web_images, web_classes)
    print("Test Accuracy = {:.3f}".format(test_accuracy))
Is your model able to perform equally well on captured pictures when compared to testing on the dataset? The simplest way to do this is to check the accuracy of the predictions. For example, if the model predicted 1 out of 5 signs correctly, it's 20% accurate.
NOTE: You could check the accuracy manually by using signnames.csv
(same directory). This file has a mapping from the class id (0-42) to the corresponding sign name. So, you could take the class id the model outputs, lookup the name in signnames.csv
and see if it matches the sign from the image.
Answer:
Model | Accuracy |
---|---|
Validation | 99.6% |
Test data | 89% |
Real-world data | 80% |
The test data got 89%, while the images I took from the Internet got 80%. Since my web image data set contains only 5 images, that means 1 of them wasn't classified correctly. That said, the accuracy achieved is satisfactory.
I delve deeper into this further down in the document.
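To pinpoint exactly which image missed, one can compare per-image predictions against the expected classes; a minimal sketch (my addition, reusing the graph, saver, web_images, web_classes, and labelmap defined above):

# Minimal sketch: predicted vs. expected class per web image.
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    predictions = sess.run(tf.argmax(logits, 1),
                           feed_dict={x: web_images, keep_probability: 1.0})
for expected, predicted in zip(web_classes, predictions):
    marker = "OK  " if expected == predicted else "MISS"
    print("{0} expected: {1} / predicted: {2}".format(
        marker, labelmap[expected], labelmap[predicted]))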
In [17]:
### Visualize the softmax probabilities here.
### Feel free to use as many code cells as needed.
zero_to_n = [i for i in range(n_classes)]
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    softmax = sess.run(tf.nn.softmax(logits), feed_dict={x: web_images, y: web_classes, keep_probability: 1.0})
    fig, axes = plt.subplots(5, 2, figsize=(20,20))
    for i in range(n_web_classes):
        axes[i,0].imshow(mpimg.imread("German Traffic Signs/{0}.jpg".format(i + 1)))
        axes[i,0].set_title(labelmap[web_classes[i]])
        axes[i,0].axis("off")
        axes[i,1].plot(zero_to_n, softmax[i])
        axes[i,1].set_title("Softmax for " + labelmap[y_train[class_index[web_classes[i]][100]]])
        axes[i,1].axis("on")
    plt.show()
Use the model's softmax probabilities to visualize the certainty of its predictions; tf.nn.top_k could prove helpful here. Which predictions is the model certain of? Uncertain? If the model was incorrect in its initial prediction, does the correct prediction appear in the top k? (k should be 5 at most.)
tf.nn.top_k will return the values and indices (class ids) of the top k predictions. So if k=3, for each sign, it'll return the 3 largest probabilities (out of a possible 43) and the corresponding class ids.
Take this numpy array as an example:
# (5, 6) array
a = np.array([[0.24879643, 0.07032244, 0.12641572, 0.34763842, 0.07893497, 0.12789202],
              [0.28086119, 0.27569815, 0.08594638, 0.0178669 , 0.18063401, 0.15899337],
              [0.26076848, 0.23664738, 0.08020603, 0.07001922, 0.1134371 , 0.23892179],
              [0.11943333, 0.29198961, 0.02605103, 0.26234032, 0.1351348 , 0.16505091],
              [0.09561176, 0.34396535, 0.0643941 , 0.16240774, 0.24206137, 0.09155967]])
Running it through sess.run(tf.nn.top_k(tf.constant(a), k=3))
produces:
TopKV2(values=array([[ 0.34763842,  0.24879643,  0.12789202],
                     [ 0.28086119,  0.27569815,  0.18063401],
                     [ 0.26076848,  0.23892179,  0.23664738],
                     [ 0.29198961,  0.26234032,  0.16505091],
                     [ 0.34396535,  0.24206137,  0.16240774]]),
       indices=array([[3, 0, 5],
                      [0, 1, 4],
                      [0, 5, 1],
                      [1, 3, 5],
                      [1, 4, 3]], dtype=int32))
Looking just at the first row, we get [ 0.34763842, 0.24879643, 0.12789202]; you can confirm these are the 3 largest probabilities in a. You'll also notice that [3, 0, 5] are the corresponding indices.
Answer:
This is so cool! Visualizing the softmax and graphing examples of the top 3 hits for an image makes this a very interesting tool for classification and for fine-tuning the model.
The image that failed to be classified correctly was the Slippery Road image, which got classified as Speed Limit (20 km/h). That said, the second-highest hit was the correct classification, as seen further down in the graphic where we analyze the top-k hits.
I'm thinking my data augmentation is to blame for this and I could have done a better job. I could also have picked images that were even more similar to the ones in the training set, but then that wouldn't have been interesting, would it?
Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.
In [18]:
print("Top 3 hits")
print("Left: web image")
with tf.Session() as sess:
saver.restore(sess, tf.train.latest_checkpoint('.'))
top_k = sess.run(tf.nn.top_k(tf.nn.softmax(logits), k=3),
feed_dict={x: web_images, y: web_classes, keep_probability: 1.0})
fig, axes = plt.subplots(5, 4, figsize=(20,20))
for i in range(n_web_classes):
axes[i,0].imshow(mpimg.imread("German Traffic Signs/{0}.jpg".format(i + 1)))
axes[i,0].set_title("Web image: " + labelmap[web_classes[i]])
axes[i,0].axis("off")
axes[i,1].imshow(X_train[class_index[top_k[1][i][0]]][100].squeeze())
axes[i,1].set_title("1. {:} ({:.1f}%)".format(labelmap[top_k[1][i][0]],top_k[0][i][0]*100))
axes[i,1].axis("off")
axes[i,2].imshow(X_train[class_index[top_k[1][i][1]]][100].squeeze())
axes[i,2].set_title("2. {:} ({:.1f}%)".format(labelmap[top_k[1][i][1]],top_k[0][i][1]*100))
axes[i,2].axis("off")
axes[i,3].imshow(X_train[class_index[top_k[1][i][2]]][100].squeeze())
axes[i,3].set_title("3. {:} ({:.1f}%)".format(labelmap[top_k[1][i][2]],top_k[0][i][2]*100))
axes[i,3].axis("off")
plt.show()