In this notebook, a template is provided for you to implement your functionality in stages which is required to successfully complete this project. If additional code is required that cannot be included in the notebook, be sure that the Python code is successfully imported and included in your submission, if necessary. Sections that begin with 'Implementation' in the header indicate where you should begin your implementation for your project. Note that some sections of implementation are optional, and will be marked with 'Optional' in the header.
In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.
Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can be edited by typically double-clicking the cell to enter edit mode.
Design and implement a deep learning model that learns to recognize sequences of digits. Train the model using synthetic data generated by concatenating character images from notMNIST or MNIST. To produce a synthetic sequence of digits for testing, you can for example limit yourself to sequences up to five digits, and use five classifiers on top of your deep network. You would have to incorporate an additional ‘blank’ character to account for shorter number sequences.
There are various aspects to consider when thinking about this problem:
You can use Keras to implement your model. Read more at keras.io.
Here is an example of a published baseline model on this problem. (video). You are not expected to model your architecture precisely using this model nor get the same performance levels, but this is more to show an exampe of an approach used to solve this particular problem. We encourage you to try out different architectures for yourself and see what works best for you. Here is a useful forum post discussing the architecture as described in the paper and here is another one discussing the loss function.
In [4]:
#Importing Modules needed
import idx2numpy
import numpy as np
import matplotlib.pyplot as plt
import random
import tensorflow as tf
In [2]:
#Loading the datafiles from MNIST Dataset into numpy ndarrays
raw_train_dataset = idx2numpy.convert_from_file('train-images-idx3-ubyte')
raw_train_labels = idx2numpy.convert_from_file('train-labels-idx1-ubyte')
raw_test_dataset = idx2numpy.convert_from_file('t10k-images-idx3-ubyte')
raw_test_labels = idx2numpy.convert_from_file('t10k-labels-idx1-ubyte')
In [3]:
#DEFINING FUNCTIONS
#Function that displays random image samples from ndarray with images and labels
def random_sequence(nr_images, ndarr_images, ndarr_labels):
for i in range (0,nr_images):
image=random.randrange(0, len(ndarr_images))
print('Label is: ' + str(ndarr_labels[image,:]) )
plt.imshow(ndarr_images[image], cmap='gray')
plt.show()
#Function to concatenating a random number of digit images to a sequence of digits
#An empty number space is labelled as 10 in labels
def concatenating_random(images, labels, data_size, image_size, min_string, max_string):
#initalizing output ndarrays
conc_images = np.zeros((data_size, image_size, image_size*max_string))
conc_labels = np.empty((data_size, max_string), dtype=int)
#loop to create whole dataset with data_size number of entries
for i in range(0, data_size):
#Random choice of number length
number_length=random.randrange(min_string,(max_string +1))
#initalizing
conc_numbers=np.zeros((image_size, image_size*max_string))
conc_label = np.empty((1,max_string))
conc_label.fill(10)
#position digits randomly in image
order=random.sample(list(range(max_string)), max_string)
#loop to select random digits for making number
for j in range(0, number_length):
#selecting random digit and label
number=random.randrange(0, (len(images)-1))
conc_label[0,order[j]]= labels[number]
conc_numbers[:, (order[j]*image_size) : ((order[j]+1)*image_size)] = images[number]
#Sorting labels to be in the same order as shuffled images, all empty char=99 rightfill
temp=[]
for k in range(0, max_string):
if conc_label[0,k]!=10:
temp.append(conc_label[0,k])
conc_label.fill(10)
for l in range(0,len(temp)):
conc_label[0,l]=temp[l]
#saving each number and labels to ndarray for ouput
conc_images[i]=conc_numbers
conc_labels[i,:]=conc_label
return conc_images, conc_labels
#Function to normalize the pixel depth to approx mean=0 and stdev=0.5 to make learning easier
def normalize_images(images):
pixel_depth = 255.0
images = (images - 0.5*pixel_depth)/pixel_depth
return images.astype(np.float32)
#Function to flatten dataset
def flatten(dataset, image_size, max_digits):
dataset = dataset.reshape((-1, image_size * image_size*max_digits)).astype(np.float32)
return dataset
def reformat(dataset, image_size, max_digits):
num_channels=1 #grayscale
dataset = dataset.reshape((-1, image_size, image_size*max_digits, num_channels)).astype(np.float32)
return dataset
#One hot encode labels
def one_hot_encode(labels):
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(dtype=np.float32)
enc.fit(labels)
labels=enc.transform(labels).toarray()
return labels
#Data is randomized, first 10% is taken for validation set
def validation_set_split(data, labels):
data_points=int(0.1*len(data))
valid_dataset= data[:data_points,:]
valid_labels= labels[:data_points,:]
train_dataset = data[data_points:,:]
train_labels = labels[data_points:,:]
return train_dataset, train_labels, valid_dataset, valid_labels
#
#def accuracy(predictions, labels):
# return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
# / predictions.shape[0])
#Calculated accuracy
def accuracy(predictions, labels):
return (100.0 * np.sum(np.argmax(predictions, 2).T == labels) / predictions.shape[1] / predictions.shape[0])
#Func to compare predictions agains pictures
def inspect_pred(nr_images, valid_dataset, valid_labels, valid_predictions):
valid_predictions=np.argmax(valid_predictions, 2).T
for i in range (0, nr_images):
image=random.randrange(0, len(valid_labels))
plt.imshow(valid_dataset[image, :,:,0], cmap='gray')
plt.show()
print('Correct Label: ' + str(valid_labels[image]))
print('Predicted label is: ' +str(valid_predictions[image]))
In [4]:
#PREPROCESSING TESTING AND TRAINING DATA
image_size=28 #assumed square image
max_digits=5
min_digits=0
#Concanating dataset from the training set, numbers of 0 to 5 digits. 60,000 combined numbers
train_dataset, train_labels = concatenating_random(raw_train_dataset,raw_train_labels, 60000, image_size, min_digits, max_digits)
#Concanating dataset from the testing set, numbers of 0 to 5 digits, 10,000 combined numbers
test_dataset, test_labels = concatenating_random(raw_test_dataset,raw_test_labels, 10000, image_size, min_digits, max_digits)
#Normalizing for closer to mean=0 and stdev=0.5
train_dataset=normalize_images(train_dataset)
test_dataset=normalize_images(test_dataset)
#Displaying 2 random concatenated images with label info, 10=blank character
print('Training set:')
random_sequence(2, train_dataset, train_labels)
#Displaying 2 random concatenated images with label info, 10=blank character
print('Testing set:')
random_sequence(2, test_dataset, test_labels)
#Reformatting arrays
train_dataset=reformat(train_dataset, image_size,max_digits)
test_dataset=reformat(test_dataset, image_size, max_digits)
#One hot encoding labels
#train_labels=one_hot_encode(train_labels)
#test_labels= one_hot_encode(test_labels)
#Split the training data into validation set and traing set
train_dataset, train_labels, valid_dataset, valid_labels = validation_set_split(train_dataset, train_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
In [5]:
#TENSOR FLOW CACULATION GRAPH
image_height=image_size
image_width=max_digits*image_size
batch_size = 64
patch_size = 5
depth1 = 16
depth2= 32
num_hidden = 64
num_digits=max_digits
num_labels=11
num_channels=1
graph = tf.Graph()
with graph.as_default():
# Input data.
tf_train_dataset = tf.placeholder(
tf.float32, shape=(batch_size, image_height, image_width, num_channels))
tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, num_digits))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
# Variables.
layer1_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, num_channels, depth1], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([depth1]))
layer2_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth1, depth2], stddev=0.1))
layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth2]))
layer3_weights = tf.Variable(tf.truncated_normal(
[image_height // 4 * image_width // 4 * depth2, num_hidden], stddev=0.1))
layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
soft1_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft1_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft2_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft2_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft3_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft3_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft4_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft5_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft5_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
# Model.
def model(data, drop_rate=1.0):
conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME')
hidden = tf.nn.relu(conv + layer1_biases)
pool = tf.nn.max_pool(hidden, [1,2,2,1], [1,2,2,1], 'SAME')
conv = tf.nn.conv2d(pool, layer2_weights, [1, 1, 1, 1], padding='SAME')
hidden = tf.nn.relu(conv + layer2_biases)
pool = tf.nn.max_pool(hidden, [1,2,2,1], [1,2,2,1], 'SAME')
shape = pool.get_shape().as_list()
reshape = tf.reshape(pool, [shape[0], shape[1] * shape[2] * shape[3]])
connected= tf.matmul(reshape, layer3_weights)
hidden = tf.nn.relu(connected + layer3_biases)
hidden = tf.nn.dropout(hidden, drop_rate)
logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
return logits1, logits2, logits3, logits4, logits5
# Training computation.
[logits1, logits2, logits3, logits4, logits5] = model(tf_train_dataset, 0.9)
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits1, labels=tf_train_labels[:,0])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits2, labels=tf_train_labels[:,1])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits3, labels=tf_train_labels[:,2])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits4, labels=tf_train_labels[:,3])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits5, labels=tf_train_labels[:,4]))
# Optimizer.
#optimizer = tf.train.AdagradOptimizer(0.05).minimize(loss)
global_step = tf.Variable(0)
learning_rate = tf.train.exponential_decay(0.05, global_step, 2000, 0.95)
optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(loss, global_step=global_step)
#Training Predictions
train_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
#Validation Predictions
[logits1, logits2, logits3, logits4, logits5] = model(tf_valid_dataset)
valid_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
#Testing Predictions
[logits1, logits2, logits3, logits4, logits5] = model(tf_test_dataset)
test_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
In [9]:
#Loading calculated result from CUDA session and running a few itterations
num_steps = 5001
with tf.Session(graph=graph) as session:
saver = tf.train.Saver()
saver.restore(session, "/Users/fhellander/Machine_learning/digit-recognition-SVHN/tmp/saved_model_100001.ckpt")
print("Model restored.")
#tf.global_variables_initializer().run()
#print('Initialized')
for step in range(num_steps):
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
batch_labels = train_labels[offset:(offset + batch_size), :]
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
_, l, predictions = session.run(
[optimizer, loss, train_prediction], feed_dict=feed_dict)
if (step % 1000 == 0):
print('Minibatch loss at step %d: %f' % (step, l))
print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
valid_predictions=valid_prediction.eval()
print('Validation accuracy: %.1f%%' % accuracy(valid_predictions, valid_labels))
inspect_pred(1, valid_dataset, valid_labels, valid_predictions)
print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
save_path = saver.save(session, "saved_model.ckpt")
print("Model saved in file: %s" % save_path)
Answer:
For step 1 I decided to create a model to read and predict a random sequence of digits concatenated from the MNIST dataset.
By constructing this dataset I can create a model that learns digit recognition for various number lenghts and with different horizontal separation. This should be a good stepping stone towards the final goal of learning number recognition for the Street View House Numbers (SVHN) datset. Since the dataset is artificially concatenated it is possible to start with a single number and increase the complexity of the model up to 5 numbers.
Digit recognition should also be easier than the SVHN dataset since each image only contains a number on a black and white format, no 'real world' features present alongside the digits.
This is the first time I used neural networks for a larger application and I decided to use the same type of architecture as LENET-5 (which was used for digit recognition on hanwritten numbers) together witht the classifiers and loss calculations from the model published in the 'Multi-digit Number Recognition from Street View' article.
The approach take is to use a combination of convolutions and max_pooling operations to create an image vector input to a fully connected layer and finally to 5 different classifiers. Each classifier is trained towards outputting 11 different labels; 10 labels for the digits 0-9 and one additional label 10 for an empty space (no digit).
Perhaps it would be useful to also have an additional classifier identifying the total number of digits in the image and share this information to the remaining classifiers but as I was unsure of how the share this information by e.g sharing weights in the model it has not been implememented.
Answer:
The final architecture is a 2-layer convolution-max_pooling convnet connected to a 2-layer neural network. The full configuration is as follows:
Convolution, stride= [1, 1, 1, 1], padding='SAME', depth=16)
Relu connection
Max_pool, stride= [1,2,2,1], 'SAME')
Convolution, stride=[1, 1, 1, 1], padding='SAME', depth=32)
Relu connection
Max_pool, stride= [1,2,2,1], 'SAME')
Fully connected Layer 1
Relu connection
Fully connected layer 2 for Classifier 1 to 5
The convolution-max_pooling network transforms the input images from a 28x140 pixels and depth 1 input to 7x35x32 input to the neural network and the final connected layer is of size 64x55
Answer:
The dataset is created by the following steps:
Below are 5 random examples from the dataset.
Two datasets are created, one training set with images created by random draw from the MNIST training set, a total of 60,000 images are created. A second testing dataset is created by random draw from the MNIST testing dataset, a total of 10,000 images are created.
The test dataset is kept hidden until the end of the project and for final testing. The training dataset is split into a training set and a validation set to follow the learning progress of the algorithm and use validation score for hyperparameter exploration. 10% of the training set is taken as the validation set. No shuffeling is done of either dataset since they are created by random draw.
The algorithm is trained used stochastic gradient descent and backward propagation with approximately 100,000 itterations before the model parameters are saved.
In [13]:
#Concanating dataset from the training set, numbers of 0 to 5 digits. 60,000 combined numbers
train_dataset, train_labels = concatenating_random(raw_train_dataset,raw_train_labels, 60000, image_size, min_digits, max_digits)
#Displaying 10 random concatenated images with label info, 10=blank character
print('Training set:')
random_sequence(10, train_dataset, train_labels)
The model is trained by maximizing the probability of seeing a certain sequence of numbers (labels) given the input.
Since each number is randomly drawn we can assume that they are independent and the probability of a sequence is the product of the probabilities of each label individual label.
The probablity of each label is given by calculating the softmax(logits[1..5]) functions
Too avoid numerical errors the probabilites are multiplied togeter in log space which means adding the individual log components.
Instead of maximixing the sum of log probabilite it is possible to do the equivalent by minimizing the summation of reduce_mean(sparse_softmax_cross_entropy_with_logits(...)) for all probabilites and the network is trained by minimizing this loss function.
Once you have settled on a good architecture, you can train your model on real data. In particular, the Street View House Numbers (SVHN) dataset is a good large-scale dataset collected from house numbers in Google Street View. Training on this more challenging dataset, where the digits are not neatly lined-up and have various skews, fonts and colors, likely means you have to do some hyperparameter exploration to perform well.
In [21]:
#DEFINING FUCTIONS
import numpy as np
import random
import matplotlib.pyplot as plt
import h5py
import cPickle as pickle
from scipy import misc
import os
import cv2
import tensorflow as tf
#Function to read digitstruct to picklefile
def digitStructto_pickle(filepath, picklepath):
f = h5py.File(filepath)
metadata= {}
metadata['height'] = []
metadata['label'] = []
metadata['left'] = []
metadata['top'] = []
metadata['width'] = []
def print_attrs(name, obj):
vals = []
if obj.shape[0] == 1:
vals.append(obj[0][0])
else:
for k in range(obj.shape[0]):
vals.append(f[obj[k][0]][0][0])
metadata[name].append(vals)
for item in f['/digitStruct/bbox']:
f[item[0]].visititems(print_attrs)
try:
pickleData = open(picklepath, 'wb')
pickle.dump(metadata, pickleData, pickle.HIGHEST_PROTOCOL)
pickleData.close()
except Exception as e:
print 'Unable to save data to', pickle_file, ':', e
raise
#Extract labels from picklefile
def pickleto_labels(picklepath, max_digits):
metadata = pickle.load(open(picklepath))
labels=np.zeros((len(metadata['label']), max_digits), dtype='int32')
labels.fill(99)#empty digit label
for j in range(0, len(metadata['label'])):
#for j in range(0, 100):
for i in range(0, len(metadata['label'][j])):
temp= metadata['label'][j][i]
labels[j, i]= temp
labels[labels==10]=0 #replacing all 10s with 0
labels[labels==99]=10 #replacing all empty labels with 10
return labels
#Normalize image based on maxrange [0,255]
def max_normalize_image(images):
pixel_depth = 255.0
images = (images - 0.5*pixel_depth)/pixel_depth
return images.astype(np.float32)
#Function to read images in convert to yuv space and resize to 64x128
#Global and local contrast applied (kernel 5x5)
#Return Tensor of all images
def image_yuv_preprocess(root, nr_images, height, width):
np_images=np.zeros((nr_images, height, width,3)).astype(np.float32)
for img_index in range(1, nr_images+1):
path=root+str(img_index)+'.png'
image=cv2.imread(path)
image=cv2.cvtColor(image, cv2.COLOR_BGR2YUV)
image=misc.imresize(image, (height, width))
illum=image[:,:,0]
illum=cv2.equalizeHist(illum)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(5,5))
illum=clahe.apply(illum)
image[:,:,0]=illum
image=max_normalize_image(image)
np_images[img_index-1, :, :] =image.astype(np.float32)
return np_images
#Function to read images in black and white, resize and return in Tensor
def image_preprocess(root, nr_images, height, width):
np_images=np.zeros((nr_images, height, width))
for img_index in range(1, nr_images+1):
path=root+str(img_index)+'.png'
image=misc.imread(path, flatten=True)
image=misc.imresize(image, (height, width))
image=normalize_image(image)
np_images[img_index-1, :, :] =image
return np_images
#Print random selection of processed black and white images with labes
def random_images(nr_images, ndarr_images, ndarr_labels):
for i in range (0,nr_images):
image=random.randrange(0, len(ndarr_images))
print('Image '+ str(image+1) +' Label is: ' + str(ndarr_labels[image]))
plt.imshow(ndarr_images[image, :,:,0], cmap='gray')
plt.show()
#Print random YUV image
def cv2_random_images(nr_images, ndarr_images, ndarr_labels):
for i in range (0,nr_images):
#image=random.randrange(0, len(ndarr_images))
image=i
print('Image '+ str(image+1) +' Label is: ' + str(ndarr_labels[image]))
plt.imshow(ndarr_images[image, :,:,0])
plt.show()
#reformt for tensorflow if only one channel is present
def reformat(dataset, image_height, image_width, max_digits):
num_channels=1 #grayscale
dataset = dataset.reshape((-1, image_height, image_width, num_channels)).astype(np.float32)
return dataset
#Split data into training and validation set
def validation_set_split(data, labels):
data_points=int(0.02*len(data))
valid_dataset= data[:data_points,:]
valid_labels= labels[:data_points,:]
train_dataset = data[data_points:,:]
train_labels = labels[data_points:,:]
return train_dataset, train_labels, valid_dataset, valid_labels
In [4]:
#EXAMPLE OF PREPROCESSING DATA AND PLOTTING IMAGE INTENSITY CHANNEL Y
#creating picle files from digitStructs
#digitStructto_pickle('train/digitStruct.mat', 'train_metadata.p' )
#digitStructto_pickle('test/digitStruct.mat', 'test_metadata.p' )
#digitStructto_pickle('/Users/fhellander/Machine_learning/digit-recognition-SVHN/extra/digitStruct.mat', 'extra_metadata.p' )
max_digits = 6
train_labels=pickleto_labels('train_metadata.p',max_digits )
test_labels=pickleto_labels('test_metadata.p',max_digits )
extra_labels=pickleto_labels('extra_metadata.p',max_digits )
image_height=64
image_width=128
train_data=image_yuv_preprocess( 'train/', 2, image_height, image_width)
test_data=image_yuv_preprocess( 'test/', 2, image_height, image_width)
extra_data=image_yuv_preprocess( 'extra/',2, image_height, image_width)
print('Training Data:')
cv2_random_images(2, train_data, train_labels)
print('Extra Data:')
cv2_random_images(2, extra_data, extra_labels)
print('Testing Data:')
cv2_random_images(2, test_data, test_labels)
print('Training set', train_data.shape, train_labels.shape)
print('Test set', test_data.shape, test_labels.shape)
print('Extra set', extra_data.shape, extra_labels.shape)
train_data, train_labels, valid_data, valid_labels = validation_set_split(extra_data, extra_labels)
#np.save('test_data', test_data)
#np.save('test_labels', test_labels)
#np.save('train_data', train_data)
#np.save('train_labels', train_labels)
#np.save('valid_data', valid_data)
#np.save('valid_labels', valid_labels)
In [19]:
#DEFINING TENSOR FLOW GRAPH
import cPickle as pickle
import numpy as np
import tensorflow as tf
import random
import matplotlib.pyplot as plt
def accuracy(predictions, labels):
return (100.0 * np.sum(np.argmax(predictions, 2).T == labels) / predictions.shape[1] / predictions.shape[0])
def inspect_pred(nr_images, valid_dataset, valid_labels, valid_predictions):
valid_predictions=np.argmax(valid_predictions, 2).T
for i in range (0, nr_images):
image=random.randrange(0, len(valid_labels))
plt.imshow(valid_dataset[image, :,:,0], cmap='gray')
plt.show()
print('Correct Label: ' + str(valid_labels[image]))
print('Predicted label is: ' +str(valid_predictions[image]))
test_dataset=np.load('test_data.npy').astype(np.float32)[:1000,:,:,:]
test_labels=np.load('test_labels.npy')[:1000,0:5]
valid_dataset=np.load('valid_data.npy').astype(np.float32)[:256,:,:,:]
valid_labels=np.load('valid_labels.npy')[:256,0:5]
train_dataset=np.load('train_data.npy').astype(np.float32)[:256,:,:,:]
train_labels=np.load('train_labels.npy')[:,0:5]
image_height=64
image_width=128
batch_size = 128
patch_size = 5
depth1 = 16
depth2= 32
depth3 = 64
depth4 = 128
#depth5 = 256
num_hidden = 128
num_digits=5
num_labels=11
num_channels=3
graph = tf.Graph()
with graph.as_default():
# Input data.
tf_train_dataset = tf.placeholder(
tf.float32, shape=(batch_size, image_height, image_width, num_channels))
tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, num_digits))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
# Convne Variables.
layer1_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, num_channels, depth1], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([depth1]))
layer2_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth1, depth2], stddev=0.1))
layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth2]))
layer3_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth2, depth3], stddev=0.1))
layer3_biases = tf.Variable(tf.constant(1.0, shape=[depth3]))
layer4_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth3, depth4], stddev=0.1))
layer4_biases = tf.Variable(tf.constant(1.0, shape=[depth4]))
#layer5_weights = tf.Variable(tf.truncated_normal(
# [patch_size, patch_size, depth4, depth5], stddev=0.1))
#layer5_biases = tf.Variable(tf.constant(1.0, shape=[depth5]))
dims=5*5*128
connect_weights = tf.Variable(tf.truncated_normal(
[dims, num_hidden], stddev=0.1))
connect_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
soft1_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft1_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft2_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft2_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft3_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft3_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft4_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft5_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft5_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
# Model.
def model(data, drop_rate=1.0):
# Construct a 5 layer Convenet
# Input size: batch x 64 x 128 x 3
# 1st Layer : convolutional layer stride=1 padding=SAME, convolution size: 5 x 5 x 1 x 16, output: batch x 64 x 128 x 16
# 1st Layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 32 x 64 x 16
# 2nd Layer: convolutional layer stride=1 padding=SAME, , convolution size: 5 x 5 x 16 x 32, output: batch x 32 x 64 x 32
# 2nd layer: sub-sampling layer stride=[1 1 2 1] padding=SAME, [1,2,2,1] output: batch_size x 32 x 32 x 32
# 3rd layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 32 x 64, output: batch x 28 x 28 x 64
# 3rd layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 14 x 14 x 64
# 4th layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 64 x 128, output: batch x 10 x 10 x 128
# 4th layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 5 x 5 x 128
# Fully connected layer, weight size: 3200x128
# Output layer, weight size: 128 x 11
conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME') #LAYER 1
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 1
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer1_biases)
conv = tf.nn.conv2d(relu, layer2_weights, [1, 1, 1, 1], padding='SAME') #LAYER 2
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,1,2,1], 'SAME') #LAYER 2
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer2_biases)
conv = tf.nn.conv2d(relu, layer3_weights, [1, 1, 1, 1], padding='VALID') #LAYER 3
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 3
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer3_biases)
conv = tf.nn.conv2d(relu, layer4_weights, [1, 1, 1, 1], padding='VALID') #LAYER 4
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 4
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer4_biases)
# conv = tf.nn.conv2d(pool, layer5_weights, [1, 1, 1, 1], padding='VALID') #LAYER 5
# relu = tf.nn.relu(conv + layer5_biases)
shape = relu.get_shape().as_list()
reshape = tf.reshape(relu, [shape[0], shape[1] * shape[2] * shape[3]])
connected= tf.matmul(reshape, connect_weights)
hidden = tf.nn.relu(connected + connect_biases)
hidden = tf.nn.dropout(hidden, drop_rate)
logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
return logits1, logits2, logits3, logits4, logits5
# Training computation.
[logits1, logits2, logits3, logits4, logits5] = model(tf_train_dataset, 0.975)
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits1, labels=tf_train_labels[:,0])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits2, labels=tf_train_labels[:,1])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits3, labels=tf_train_labels[:,2])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits4, labels=tf_train_labels[:,3])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits5, labels=tf_train_labels[:,4]))
# Optimizer.
optimizer = tf.train.AdagradOptimizer(0.02).minimize(loss)
#global_step = tf.Variable(0)
#learning_rate = tf.train.exponential_decay(0.05, global_step, 2000, 0.95)
#optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(loss, global_step=global_step)
#Training Predictions
train_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
#Validation Predictions
[logits1, logits2, logits3, logits4, logits5] = model(tf_valid_dataset)
valid_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
#Testing Predictions
[logits1, logits2, logits3, logits4, logits5] = model(tf_test_dataset)
test_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
In [20]:
num_steps = 1
import time
start = time.time()
with tf.device('/cpu:0'):
with tf.Session(graph=graph) as session:
saver = tf.train.Saver()
saver.restore(session, "./tmp/saved_model_final7.ckpt")
print("Model restored.")
#tf.global_variables_initializer().run()
#print('Initialized')
max_acc = 0
acc = 0
step = 0
for step in range(num_steps):
#while max_acc - acc < 1.0 :
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
batch_labels = train_labels[offset:(offset + batch_size), :]
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
_, l, predictions = session.run(
[optimizer, loss, train_prediction], feed_dict=feed_dict)
step += 1
if (step % 1 == 0):
print('Minibatch loss at step %d: %f' % (step, l))
print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
lap = time.time()
print('Elapsed time: ' + str(lap-start) + ' seconds')
valid_predictions=valid_prediction.eval()
acc = accuracy(valid_predictions, valid_labels)
if max_acc < acc:
max_acc = acc
print('Validation accuracy: %.1f%%' % acc)
inspect_pred(2, valid_dataset, valid_labels, valid_predictions)
test_predictions = test_prediction.eval()
print('Test accuracy: %.1f%%' % accuracy(test_predictions, test_labels))
inspect_pred(2, test_dataset, test_labels, test_predictions)
#save_path = saver.save(session, "tmp/saved_model_" + "final7" + ".ckpt")
#print("Model saved in file: %s" % save_path)
Answer:
The training and testing data was created using the following steps:
It was discovered early that a lot data is needed for the model to work well, therefore all datasamples from the SVHN training and extra dataset were combined. Since the training set contains harder examples than the extra dataset the combined dataset is scrambled to provide an even distribution of harder and easier examples. The combined dataset is split into a training and validation set, using about 2% of the samples for validation (4000 samples). The validation set is used to tune hyperparameters and the SVHN testing dataset is kept for testing and final evaluation.
The model is trained using an Adagradoptimzer and batches of 128 training samples are selected to train on for approximateley a total of 200,000 itterations before the model parameters are saved.
Since computations are getting very expensive with 64x128 size images all training is done on a linux server using CUDA and a GPU. Only illustrations with smaller dataset are shown in the iphyton notebook
I have missunderstood part of this question and did not use the same model as for the MNIST dataset here The model used to interpret handwritten digits from the MNIST dataset did not perform well on this dataset, the training and validation score only reached about 80% accuracy and testing accuracy was around 70%. Therefore this model was quickly discarded and all results shown are from a new deeper convnet model (the new architecture is presented below in question 6)
Training accuracy is generally around 96% (the batch illustrated above has only 90.5%), validation and testing accuracy is around 91%.
The model performance is acceptable but not great, one of the issues it struggles with is numbers that make up a small portion of the image. In preprocessing the images they are simply resized to 64x128 and to succesfully detect small numbers some type of number localization is probably needed before this rezising. Otherwise a number that constitutes a small proportion of the origninal image will end up being represented by a just a few pixels.
Answer:
First of all the input image resolution was increased from 28x140 to 64x128. The resolution of the original dataset seemed to be quite low and compressing it futher made distinguishing numbers hard for a human.
The model was made deeper by adding an additional convolution layer, this allowed for further downsizing of the images and also increasing the model space to learn more complicated patterns. However, extending the model beyond this size prevented any useful learning.
Due to problems with overfitting a dropout layers were added which significantly improved the results.
Initally only a smaller subset of the training data was used to increase computational efficiency (30k samples). However, using a limited number of training datapoint significantly reduced the validation and testing performance and the model required +100k samples to generalize well to unseen data.
Answer:
The initial results with the model where poor. The model quickly reached a very high training score (99%-100% accuracy) but the testing and validation score was in the low 50%. The neural network simply learned all training examples by heart and could not generalize to unseen data.
Several models where built but the final model has the following architecture:
# Input size: batch x 64 x 128 x 3
# 1st Layer : convolutional layer stride=1 padding=SAME, convolution size: 5 x 5 x 1 x 16, output: batch x 64 x 128 x 16
# 1st Layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 32 x 64 x 16
# 2nd Layer: convolutional layer stride=1 padding=SAME, , convolution size: 5 x 5 x 16 x 32, output: batch x 32 x 64 x 32
# 2nd layer: sub-sampling layer stride=[1 1 2 1] padding=SAME, [1,2,2,1] output: batch_size x 32 x 32 x 32
# 3rd layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 32 x 64, output: batch x 28 x 28 x 64
# 3rd layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 14 x 14 x 64
# 4th layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 64 x 128, output: batch x 10 x 10 x 128
# 4th layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 5 x 5 x 128
# Fully connected layer, weight size: 3200x128
# Output layer, weight size: 128 x 11
The final model has a training score of about 96% and validation and testing score of about 91%. The model is doing a good job at classifying large numbers but struggles with smaller images.
The problem probably lies with using a too small resolution for small images. However, lack of computer memory prevents me trying larger images. (Or a bounding box and cropping algortihm is required before resizing images)
Take several pictures of numbers that you find around you (at least five), and run them through your classifier on your computer to produce example results. Alternatively (optionally), you can try using OpenCV / SimpleCV / Pygame to capture live images from a webcam and run those through your classifier.
In [30]:
import numpy as np
from scipy import misc, ndimage
import matplotlib.pyplot as plt
import random
import tensorflow as tf
import matplotlib.image as mpimg
test_dataset=image_yuv_preprocess('captured_test/', 5 , 64, 128)
image_height=64
image_width=128
batch_size = 128
patch_size = 5
depth1 = 16
depth2= 32
depth3 = 64
depth4 = 128
#depth5 = 256
num_hidden = 128
num_digits=5
num_labels=11
num_channels=3
graph = tf.Graph()
with graph.as_default():
# Input data.
tf_test_dataset = tf.constant(test_dataset)
# Convne Variables.
layer1_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, num_channels, depth1], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([depth1]))
layer2_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth1, depth2], stddev=0.1))
layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth2]))
layer3_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth2, depth3], stddev=0.1))
layer3_biases = tf.Variable(tf.constant(1.0, shape=[depth3]))
layer4_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth3, depth4], stddev=0.1))
layer4_biases = tf.Variable(tf.constant(1.0, shape=[depth4]))
dims=5*5*128
connect_weights = tf.Variable(tf.truncated_normal(
[dims, num_hidden], stddev=0.1))
connect_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
soft1_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft1_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft2_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft2_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft3_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft3_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft4_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
soft5_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], stddev=0.1))
soft5_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
# Model.
def model(data, drop_rate=1.0):
conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME') #LAYER 1
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 1
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer1_biases)
conv = tf.nn.conv2d(relu, layer2_weights, [1, 1, 1, 1], padding='SAME') #LAYER 2
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,1,2,1], 'SAME') #LAYER 2
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer2_biases)
conv = tf.nn.conv2d(relu, layer3_weights, [1, 1, 1, 1], padding='VALID') #LAYER 3
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 3
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer3_biases)
conv = tf.nn.conv2d(relu, layer4_weights, [1, 1, 1, 1], padding='VALID') #LAYER 4
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 4
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer4_biases)
shape = relu.get_shape().as_list()
reshape = tf.reshape(relu, [shape[0], shape[1] * shape[2] * shape[3]])
connected= tf.matmul(reshape, connect_weights)
hidden = tf.nn.relu(connected + connect_biases)
hidden = tf.nn.dropout(hidden, drop_rate)
logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
return logits1, logits2, logits3, logits4, logits5
#Testing Predictions
[logits1, logits2, logits3, logits4, logits5] = model(tf_test_dataset)
test_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
num_steps = 1
with tf.Session(graph=graph) as session:
saver = tf.train.Saver()
saver.restore(session, "./tmp/saved_model_final7.ckpt")
print("Model restored.")
test_labels=test_prediction.eval()
test_labels=np.argmax(test_labels,2).T
for i in range (0,5):
plt.imshow(test_dataset[i, :,:,0], cmap='gray')
plt.show()
print('Prediction is: ' +str(test_labels[i,:]))
Answer:
The results are unfortunately very poor, no images are classified correcly.
The images are challenging in several ways:
Answer:
The model performance is suprisingly poor for these examples, they are challenging in many different ways but not a single digit is classified correctly. 0% of the numbers are classified correcly and on a label by label slot the accuracy is only 64%
Example 1: This picture is of a 3D digit, i,e a physical model of the digit 3. The model only sees the a single 2D picture of the digit and fails to classify it correctly and instead outputs the number 7. Perhaps if the picture was taken with at a right angle to the digit it could be identified correctly. No 3D shaped digits have been observed in the training dataset and it is thus not suprising that the classification does not work
Example 2: The picture is of a license plate taken in strong sunlight and the correct label is 961 whereas the model outputs 999. The first digit is correctly classified and the algorithm does not seem to be confused by the precense of letters. One would expect the second digit to also be correct but it has been observed that the model sometimes confuses 6 with 9, there seems to be a weakness in accounting for spatial location of the digit and that its the top to down shape that is deciding. The last digit has a poor visibilty due to strong sunlight.
Example 3: The correct label is 5 whereas the model ouput is 7. This should be a fairly easy example to classify correctly with a single clearly defined number but the model still fails for unknown reasons
Example 4: This example is a handwritten number on a horizontally lined paper, the correct output is 345 but the model output is 243. The first digit is intersected by the horizontal line and the model cannot distinguish that this feature is not part of the digits feature thus labelling it a 2. The third digit is a poorly drawn 5 and since very few handwritten images where present in the training set the features of a 5 is not recognised.
Example 5: The correct output is 24 but the model output is 111. A comma separator is present between the 2 and the 4 but it is a small feature and it is not clear why the model completely missclassifies the number.
Answer: Leave blank if you did not complete this part.
There are many things you can do once you have the basic classifier in place. One example would be to also localize where the numbers are on the image. The SVHN dataset provides bounding boxes that you can tune to train a localizer. Train a regression loss to the coordinates of the bounding box, and then test it.
A model which also localizes the image position was tested.
In addition to the logit ouput for 5 numbers a regression head is included outputting 20 bounding box coordinates (4 for each number). The input to the regression head is the same convnet shared with the logits for predicting numbers. The idea is that this is forcing the convnet to focus on the part of the image containing numbers.
The mean square error from the regression prediction is combined with the cross entropy error of the digit predictions and used as the loss function for the optimizer.
The following network architecture was used:
# Construct a 4 layer Convenet
# Input size: batch x 64 x 128 x 1
# 1st Layer : convolutional layer stride=1 padding=SAME, convolution size: 5 x 5 x 1 x 16, output: batch x 64 x 128 x 16
# 1st Layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 32 x 64 x 16
# 2nd Layer: convolutional layer stride=1 padding=SAME, , convolution size: 5 x 5 x 16 x 32, output: batch x 32 x 64 x 32
# 2nd layer: sub-sampling layer stride=[1 1 2 1] padding=SAME, [1,2,2,1] output: batch_size x 32 x 32 x 32
# 3rd layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 32 x 64, output: batch x 28 x 28 x 64
# 3rd layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 14 x 14 x 64
# 4th layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 64 x 128, output: batch x 10 x 10 x 128
# 4th layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 5 x 5 x 128
# Fully connect layer 1, weight size: 3200 x 128
# Logits Output layer, weight size: 128 x 11
# Fully connect layer 2, weight size: 3200 x 128
# Regression Output layer, weight size: 128 x 11
In [43]:
import cPickle as pickle
import numpy as np
import tensorflow as tf
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches
def batch_iou(a, b, epsilon=1e-5):
#reformatting from left, top, widht, height to x1,y2, x2,y2
ax1=a[:,0]
bx1=b[:,0]
ax2=a[:,1] - a[:,3]
bx2=b[:,1] - b[:,3]
ax3=ax1 + a[:,2]
bx3=bx1 + b[:,2]
ax4=a[:,1]
bx4=a[:,1]
a =np.stack((ax1,ax2,ax3,ax4), axis=1)
b = np.stack((bx1,bx2,bx3,bx4), axis=1)
""" Given two arrays `a` and `b` where each row contains a bounding
box defined as a list of four numbers:
[x1,y1,x2,y2]
where:
x1,y1 represent the upper left corner
x2,y2 represent the lower right corner
It returns the Intersect of Union scores for each corresponding
pair of boxes.
Args:
a: (numpy array) each row containing [x1,y1,x2,y2] coordinates
b: (numpy array) each row containing [x1,y1,x2,y2] coordinates
epsilon: (float) Small value to prevent division by zero
Returns:
(numpy array) The Intersect of Union scores for each pair of bounding
boxes.
"""
# COORDINATES OF THE INTERSECTION BOXES
x1 = np.array([a[:, 0], b[:, 0]]).max(axis=0)
y1 = np.array([a[:, 1], b[:, 1]]).max(axis=0)
x2 = np.array([a[:, 2], b[:, 2]]).min(axis=0)
y2 = np.array([a[:, 3], b[:, 3]]).min(axis=0)
# AREAS OF OVERLAP - Area where the boxes intersect
width = (x2 - x1)
height = (y2 - y1)
# handle case where there is NO overlap
width[width < 0] = 0
height[height < 0] = 0
area_overlap = width * height
# COMBINED AREAS
area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
area_combined = area_a + area_b - area_overlap
# RATIO OF AREA OF OVERLAP OVER COMBINED AREA
iou = area_overlap / (area_combined + epsilon)
return iou
def accuracy(predictions, labels):
return (100.0 * np.sum(np.argmax(predictions, 2).T == labels) / predictions.shape[1] / predictions.shape[0])
def inspect_pred(nr_images, valid_dataset, valid_labels, valid_predictions):
valid_predictions=np.argmax(valid_predictions, 2).T
for i in range (0, nr_images):
image=random.randrange(0, len(valid_labels))
plt.imshow(valid_dataset[image, :,:,0], cmap='gray')
plt.show()
print('Correct Label: ' + str(valid_labels[image]))
print('Predicted label is: ' +str(valid_predictions[image]))
#Print random YUV image
def cv2_random_images(nr_images, ndarr_images, ndarr_labels, boxes):
for i in range (0,nr_images):
#image=random.randrange(0, len(ndarr_images))
ndarr_images=ndarr_images.reshape((-1, 64, 128))
bboxes=np.copy(boxes)
#box cordinates as fraction of image, returning to nr of pixels
for j in range(0, len(bboxes[0])/4):
bboxes[i, 4*j] = bboxes[i, 4*j]*ndarr_images[0].shape[1]
bboxes[i, 4*j+1] = bboxes[i, 4*j+1]*ndarr_images[0].shape[0]
bboxes[i, 4*j+2] = bboxes[i, 4*j+2]*ndarr_images[0].shape[1]
bboxes[i, 4*j+3] = bboxes[i, 4*j+3]*ndarr_images[0].shape[0]
#plotting images with bounding boxes
print('Image '+ str(i+1) +' Label is: ' + str(ndarr_labels[i]))
fig,ax = plt.subplots(1)
ax.imshow(ndarr_images[i, :,:], cmap='gray')
#Create rectangle patch
rect1=patches.Rectangle((bboxes[i,0],bboxes[i,1]), bboxes[i,2],bboxes[i,3], linewidth=1, edgecolor='b', facecolor='none')
rect2=patches.Rectangle((bboxes[i,4],bboxes[i,5]), bboxes[i,6],bboxes[i,7], linewidth=1, edgecolor='b', facecolor='none')
rect3=patches.Rectangle((bboxes[i,8],bboxes[i,9]), bboxes[i,10],bboxes[i,11], linewidth=1, edgecolor='b', facecolor='none')
rect4=patches.Rectangle((bboxes[i,12],bboxes[i,13]), bboxes[i,14],bboxes[i,15], linewidth=1, edgecolor='b', facecolor='none')
rect5=patches.Rectangle((bboxes[i,16],bboxes[i,17]), bboxes[i,18],bboxes[i,19], linewidth=1, edgecolor='b', facecolor='none')
ax.add_patch(rect1);ax.add_patch(rect2);ax.add_patch(rect3);ax.add_patch(rect4);ax.add_patch(rect5)
plt.show()
num_digits=5
num_boxes=5
image_height=64
image_width=128
batch_size = 128
patch_size = 5
depth1 = 16
depth2= 32
depth3 = 64
depth4 = 128
num_hidden = 128
num_labels=11
num_channels=1
test_dataset=np.load('test_data.npy').astype(np.float32)[:]
test_dataset=test_dataset.reshape((-1, image_height, image_width, num_channels))
test_labels=np.load('test_labels.npy')[:,0:num_digits].astype(np.float32)
test_boxes=np.load('test_boxes.npy')[:,: 4*num_boxes ].astype(np.float32)
valid_dataset=np.load('valid_data.npy')[:100].astype(np.float32)
valid_dataset=valid_dataset.reshape((-1, image_height, image_width, num_channels))
valid_labels=np.load('valid_labels.npy')[:100,0:num_digits].astype(np.float32)
valid_boxes=np.load('valid_boxes.npy')[:100,: 4*num_boxes ].astype(np.float32)
train_dataset=np.load('train_data.npy').astype(np.float32)
train_dataset=train_dataset.reshape((-1, image_height, image_width, num_channels))
train_labels=np.load('train_labels.npy')[:,:num_digits]
train_boxes=np.load('train_boxes.npy')[:,: num_boxes*4]
graph = tf.Graph()
with graph.as_default():
# Input data.
tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_height, image_width, num_channels))
tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, num_digits))
tf_train_boxes = tf.placeholder(tf.float32, shape=(batch_size, num_boxes*4))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
# Convnet Variables.
layer1_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, num_channels, depth1], mean=0.0, stddev=0.1))
layer1_biases = tf.Variable(tf.constant(0.1, shape=[depth1]))
layer2_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth1, depth2], mean=0.0, stddev=0.1))
layer2_biases = tf.Variable(tf.constant(0.1, shape=[depth2]))
layer3_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth2, depth3], mean=0.0 , stddev=0.1))
layer3_biases = tf.Variable(tf.constant(0.1, shape=[depth3]))
layer4_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth3, depth4], mean=0.0, stddev=0.1))
layer4_biases = tf.Variable(tf.constant(0.1, shape=[depth4]))
#layer5_weights = tf.Variable(tf.truncated_normal(
# [patch_size, patch_size, depth4, depth5], stddev=0.1))
#layer5_biases = tf.Variable(tf.constant(1.0, shape=[depth5]))
dims=5*5*128
connect_weights = tf.Variable(tf.truncated_normal(
[dims, num_hidden], mean=0.0 , stddev=0.1))
connect_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))
soft1_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0 , stddev=0.1))
soft1_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
soft2_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0, stddev=0.1))
soft2_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
soft3_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0, stddev=0.1))
soft3_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
soft4_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0, stddev=0.1))
soft4_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
soft5_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0, stddev=0.1))
soft5_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
reg1_weights = tf.Variable(tf.truncated_normal(
[dims, num_hidden], mean=0.0, stddev=0.1))
reg1_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))
reg2_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_boxes*4], mean=0.0 , stddev=0.1))
reg2_biases = tf.Variable(tf.constant(0.1, shape=[num_boxes*4]))
# Model.
def model(data, drop_rate=1.0):
conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME') #LAYER 1
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 1
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer1_biases)
conv = tf.nn.conv2d(relu, layer2_weights, [1, 1, 1, 1], padding='SAME') #LAYER 2
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,1,2,1], 'SAME') #LAYER 2
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer2_biases)
conv = tf.nn.conv2d(relu, layer3_weights, [1, 1, 1, 1], padding='VALID') #LAYER 3
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 3
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer3_biases)
conv = tf.nn.conv2d(relu, layer4_weights, [1, 1, 1, 1], padding='VALID') #LAYER 4
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 4
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer4_biases)
shape = relu.get_shape().as_list()
reshape = tf.reshape(relu, [shape[0], shape[1] * shape[2] * shape[3]])
connected= tf.matmul(reshape, connect_weights)
hidden = tf.nn.relu(connected + connect_biases)
hidden = tf.nn.dropout(hidden, drop_rate)
logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
reg_boxes = tf.matmul(reshape, reg1_weights)
relu=tf.nn.relu(reg_boxes + reg1_biases)
reg_boxes = tf.matmul(relu, reg2_weights + reg2_biases)
return logits1, logits2, logits3, logits4, logits5, reg_boxes
# Training computation.
[logits1, logits2, logits3, logits4, logits5, reg_boxes] = model(tf_train_dataset, 0.985)
digit_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits1, labels=tf_train_labels[:,0])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits2, labels=tf_train_labels[:,1])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits3, labels=tf_train_labels[:,2])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits4, labels=tf_train_labels[:,3])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits5, labels=tf_train_labels[:,4]))
bbox_loss = tf.sqrt(tf.reduce_mean(tf.square(10*(reg_boxes - tf_train_boxes))), name="bbox_loss")
loss = tf.add(digit_loss, bbox_loss, name="loss")
# Optimizer.
optimizer = tf.train.AdagradOptimizer(0.025).minimize(loss)
#Training Predictions
train_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
train_reg_boxes = reg_boxes
#Validation Predictions
[logits1, logits2, logits3, logits4, logits5, reg_boxes] = model(tf_valid_dataset)
valid_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
valid_reg_boxes = reg_boxes
#Testing Predictions
[logits1, logits2, logits3, logits4, logits5, regboxes] = model(tf_test_dataset)
test_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
test_reg_boxes = reg_boxes
num_steps = 1
import time
start = time.time()
with tf.device('/gpu:0'):
with tf.Session(graph=graph) as session:
saver = tf.train.Saver()
saver.restore(session, "tmp/saved_model_all_data2.ckpt")
print("Model restored.")
#tf.global_variables_initializer().run()
#print('Initialized')
for step in range(num_steps):
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
batch_labels = train_labels[offset:(offset + batch_size), :]
batch_boxes = train_boxes[offset:(offset + batch_size), :]
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, tf_train_boxes : batch_boxes}
_, l, predictions = session.run(
[optimizer, loss, train_prediction], feed_dict=feed_dict)
step += 1
if (step % 1 == 0):
loss1=session.run(digit_loss, feed_dict=feed_dict)
loss2=session.run(bbox_loss, feed_dict=feed_dict)
print('------------------------STATUS---------------------------------')
print('Minibatch combined loss at step %d: %f' % (step, l))
print('Minibatch digit loss: %f, bbox loss: %f' % (loss1, loss2))
print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
reg_boxes=session.run(train_reg_boxes, feed_dict=feed_dict)
IOU=[]#batch_iou(reg_boxes, batch_boxes)
for i in range(0, num_boxes):
IOU.append(batch_iou(reg_boxes[:, i*4:(i+1)*4], batch_boxes[:, i*4:(i+1)*4]))
print("Average minibatch IOU score: %.1f%%" %(np.mean(IOU)*100) )
lap = time.time()
print('Elapsed time: ' + str(lap-start) + ' seconds')
valid_predictions=valid_prediction.eval()
reg_boxes = valid_reg_boxes.eval()
acc = accuracy(valid_predictions, valid_labels)
IOU=[]
for i in range(0, num_boxes):
IOU.append(batch_iou(reg_boxes[:, i*4:(i+1)*4], valid_boxes[:, i*4:(i+1)*4]))
print('Validation accuracy: %.1f%%' % acc)
print("Average validation IOU score: %.1f%%" %(np.mean(IOU)*100) )
#inspect_pred(1, valid_dataset, valid_labels, valid_predictions)
test_predictions= test_prediction.eval()
test_boxes = test_reg_boxes.eval()
print('Test accuracy: %.1f%%' % accuracy(test_predictions, test_labels))
test_predictions=np.argmax(test_predictions, 2).T
cv2_random_images(5, test_dataset, test_predictions, test_boxes)
Answer:
The image localization works resonable well for the algorithm. The average intersect of union score is calculated and is about 67% for testing and validation.
This score is a bit lower than expected when looking at the visual results, however it is probably penalized by the fact that an 'empy' digit is labelled with a box of starting in the lower right corner of the image and with a size corresponding to the image height and image width (normalize bounding box size (1.0, 1.0, 1.0, 1.0). When plotting most of these boxes are outside the image but if they do not have the correct size it will penalize the IOU score. For the score it would have been better to label the empty bounding boxes as (1.0, 1.0, 0.0, 0.0)
Unfortunately the digit recognition is not improved by the implementation of the bounding boxes, perhaps a better approach would be to have a two stage algorithm: Step 1 creates bounding boxes and Step 2 crops the images accordingly and identifies the correct numbers.
As it is the algorithm still struggles to identify small numbers in images.
Test the localization function on the images you captured in Step 3. Does the model accurately calculate a bounding box for the numbers in the images you found? If you did not use a graphical interface, you may need to investigate the bounding boxes by hand. Provide an example of the localization created on a captured image.
In [57]:
import cPickle as pickle
import numpy as np
import tensorflow as tf
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches
def batch_iou(a, b, epsilon=1e-5):
#reformatting from left, top, widht, height to x1,y2, x2,y2
ax1=a[:,0]
bx1=b[:,0]
ax2=a[:,1] - a[:,3]
bx2=b[:,1] - b[:,3]
ax3=ax1 + a[:,2]
bx3=bx1 + b[:,2]
ax4=a[:,1]
bx4=a[:,1]
a =np.stack((ax1,ax2,ax3,ax4), axis=1)
b = np.stack((bx1,bx2,bx3,bx4), axis=1)
""" Given two arrays `a` and `b` where each row contains a bounding
box defined as a list of four numbers:
[x1,y1,x2,y2]
where:
x1,y1 represent the upper left corner
x2,y2 represent the lower right corner
It returns the Intersect of Union scores for each corresponding
pair of boxes.
Args:
a: (numpy array) each row containing [x1,y1,x2,y2] coordinates
b: (numpy array) each row containing [x1,y1,x2,y2] coordinates
epsilon: (float) Small value to prevent division by zero
Returns:
(numpy array) The Intersect of Union scores for each pair of bounding
boxes.
"""
# COORDINATES OF THE INTERSECTION BOXES
x1 = np.array([a[:, 0], b[:, 0]]).max(axis=0)
y1 = np.array([a[:, 1], b[:, 1]]).max(axis=0)
x2 = np.array([a[:, 2], b[:, 2]]).min(axis=0)
y2 = np.array([a[:, 3], b[:, 3]]).min(axis=0)
# AREAS OF OVERLAP - Area where the boxes intersect
width = (x2 - x1)
height = (y2 - y1)
# handle case where there is NO overlap
width[width < 0] = 0
height[height < 0] = 0
area_overlap = width * height
# COMBINED AREAS
area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
area_combined = area_a + area_b - area_overlap
# RATIO OF AREA OF OVERLAP OVER COMBINED AREA
iou = area_overlap / (area_combined + epsilon)
return iou
def accuracy(predictions, labels):
return (100.0 * np.sum(np.argmax(predictions, 2).T == labels) / predictions.shape[1] / predictions.shape[0])
def inspect_pred(nr_images, valid_dataset, valid_labels, valid_predictions):
valid_predictions=np.argmax(valid_predictions, 2).T
for i in range (0, nr_images):
image=random.randrange(0, len(valid_labels))
plt.imshow(valid_dataset[image, :,:,0], cmap='gray')
plt.show()
print('Correct Label: ' + str(valid_labels[image]))
print('Predicted label is: ' +str(valid_predictions[image]))
#Print random YUV image
def cv2_random_images(nr_images, ndarr_images, ndarr_labels, boxes):
for i in range (0,nr_images):
#image=random.randrange(0, len(ndarr_images))
ndarr_images=ndarr_images.reshape((-1, 64, 128))
bboxes=np.copy(boxes)
#box cordinates as fraction of image, returning to nr of pixels
for j in range(0, len(bboxes[0])/4):
bboxes[i, 4*j] = bboxes[i, 4*j]*ndarr_images[0].shape[1]
bboxes[i, 4*j+1] = bboxes[i, 4*j+1]*ndarr_images[0].shape[0]
bboxes[i, 4*j+2] = bboxes[i, 4*j+2]*ndarr_images[0].shape[1]
bboxes[i, 4*j+3] = bboxes[i, 4*j+3]*ndarr_images[0].shape[0]
#plotting images with bounding boxes
print('Image '+ str(i+1) +' Label is: ' + str(ndarr_labels[i]))
fig,ax = plt.subplots(1)
ax.imshow(ndarr_images[i, :,:], cmap='gray')
#Create rectangle patch
rect1=patches.Rectangle((bboxes[i,0],bboxes[i,1]), bboxes[i,2],bboxes[i,3], linewidth=1, edgecolor='b', facecolor='none')
rect2=patches.Rectangle((bboxes[i,4],bboxes[i,5]), bboxes[i,6],bboxes[i,7], linewidth=1, edgecolor='b', facecolor='none')
rect3=patches.Rectangle((bboxes[i,8],bboxes[i,9]), bboxes[i,10],bboxes[i,11], linewidth=1, edgecolor='b', facecolor='none')
rect4=patches.Rectangle((bboxes[i,12],bboxes[i,13]), bboxes[i,14],bboxes[i,15], linewidth=1, edgecolor='b', facecolor='none')
rect5=patches.Rectangle((bboxes[i,16],bboxes[i,17]), bboxes[i,18],bboxes[i,19], linewidth=1, edgecolor='b', facecolor='none')
ax.add_patch(rect1);ax.add_patch(rect2);ax.add_patch(rect3);ax.add_patch(rect4);ax.add_patch(rect5)
plt.show()
num_digits=5
num_boxes=5
image_height=64
image_width=128
batch_size = 128
patch_size = 5
depth1 = 16
depth2= 32
depth3 = 64
depth4 = 128
num_hidden = 128
num_labels=11
num_channels=1
test_dataset=np.load('test_data.npy').astype(np.float32)[:]
test_dataset=test_dataset.reshape((-1, image_height, image_width, num_channels))
test_labels=np.load('test_labels.npy')[:,0:num_digits].astype(np.float32)
test_boxes=np.load('test_boxes.npy')[:,: 4*num_boxes ].astype(np.float32)
valid_dataset=np.load('valid_data.npy')[:100].astype(np.float32)
valid_dataset=valid_dataset.reshape((-1, image_height, image_width, num_channels))
valid_labels=np.load('valid_labels.npy')[:100,0:num_digits].astype(np.float32)
valid_boxes=np.load('valid_boxes.npy')[:100,: 4*num_boxes ].astype(np.float32)
train_dataset=np.load('train_data.npy').astype(np.float32)
train_dataset=train_dataset.reshape((-1, image_height, image_width, num_channels))
train_labels=np.load('train_labels.npy')[:,:num_digits]
train_boxes=np.load('train_boxes.npy')[:,: num_boxes*4]
graph = tf.Graph()
with graph.as_default():
# Input data.
tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_height, image_width, num_channels))
tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, num_digits))
tf_train_boxes = tf.placeholder(tf.float32, shape=(batch_size, num_boxes*4))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
# Convnet Variables.
layer1_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, num_channels, depth1], mean=0.0, stddev=0.1))
layer1_biases = tf.Variable(tf.constant(0.1, shape=[depth1]))
layer2_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth1, depth2], mean=0.0, stddev=0.1))
layer2_biases = tf.Variable(tf.constant(0.1, shape=[depth2]))
layer3_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth2, depth3], mean=0.0 , stddev=0.1))
layer3_biases = tf.Variable(tf.constant(0.1, shape=[depth3]))
layer4_weights = tf.Variable(tf.truncated_normal(
[patch_size, patch_size, depth3, depth4], mean=0.0, stddev=0.1))
layer4_biases = tf.Variable(tf.constant(0.1, shape=[depth4]))
#layer5_weights = tf.Variable(tf.truncated_normal(
# [patch_size, patch_size, depth4, depth5], stddev=0.1))
#layer5_biases = tf.Variable(tf.constant(1.0, shape=[depth5]))
dims=5*5*128
connect_weights = tf.Variable(tf.truncated_normal(
[dims, num_hidden], mean=0.0 , stddev=0.1))
connect_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))
soft1_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0 , stddev=0.1))
soft1_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
soft2_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0, stddev=0.1))
soft2_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
soft3_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0, stddev=0.1))
soft3_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
soft4_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0, stddev=0.1))
soft4_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
soft5_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_labels], mean=0.0, stddev=0.1))
soft5_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
reg1_weights = tf.Variable(tf.truncated_normal(
[dims, num_hidden], mean=0.0, stddev=0.1))
reg1_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))
reg2_weights = tf.Variable(tf.truncated_normal(
[num_hidden, num_boxes*4], mean=0.0 , stddev=0.1))
reg2_biases = tf.Variable(tf.constant(0.1, shape=[num_boxes*4]))
# Model.
def model(data, drop_rate=1.0):
conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME') #LAYER 1
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 1
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer1_biases)
conv = tf.nn.conv2d(relu, layer2_weights, [1, 1, 1, 1], padding='SAME') #LAYER 2
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,1,2,1], 'SAME') #LAYER 2
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer2_biases)
conv = tf.nn.conv2d(relu, layer3_weights, [1, 1, 1, 1], padding='VALID') #LAYER 3
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 3
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer3_biases)
conv = tf.nn.conv2d(relu, layer4_weights, [1, 1, 1, 1], padding='VALID') #LAYER 4
pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 4
drop = tf.nn.dropout(pool, drop_rate)
relu = tf.nn.relu(drop + layer4_biases)
shape = relu.get_shape().as_list()
reshape = tf.reshape(relu, [shape[0], shape[1] * shape[2] * shape[3]])
connected= tf.matmul(reshape, connect_weights)
hidden = tf.nn.relu(connected + connect_biases)
hidden = tf.nn.dropout(hidden, drop_rate)
logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
reg_boxes = tf.matmul(reshape, reg1_weights)
relu=tf.nn.relu(reg_boxes + reg1_biases)
reg_boxes = tf.matmul(relu, reg2_weights + reg2_biases)
return logits1, logits2, logits3, logits4, logits5, reg_boxes
# Training computation.
[logits1, logits2, logits3, logits4, logits5, reg_boxes] = model(tf_train_dataset, 0.985)
digit_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits1, labels=tf_train_labels[:,0])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits2, labels=tf_train_labels[:,1])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits3, labels=tf_train_labels[:,2])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits4, labels=tf_train_labels[:,3])) +\
tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits5, labels=tf_train_labels[:,4]))
bbox_loss = tf.sqrt(tf.reduce_mean(tf.square(10*(reg_boxes - tf_train_boxes))), name="bbox_loss")
loss = tf.add(digit_loss, bbox_loss, name="loss")
# Optimizer.
optimizer = tf.train.AdagradOptimizer(0.025).minimize(loss)
#Training Predictions
train_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
train_reg_boxes = reg_boxes
#Validation Predictions
[logits1, logits2, logits3, logits4, logits5, reg_boxes] = model(tf_valid_dataset)
valid_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
valid_reg_boxes = reg_boxes
#Testing Predictions
[logits1, logits2, logits3, logits4, logits5, regboxes] = model(tf_test_dataset)
test_prediction = tf.stack([tf.nn.softmax(logits1),\
tf.nn.softmax(logits2),\
tf.nn.softmax(logits3),\
tf.nn.softmax(logits4),\
tf.nn.softmax(logits5)])
test_reg_boxes = reg_boxes
num_steps = 1
import time
start = time.time()
with tf.device('/gpu:0'):
with tf.Session(graph=graph) as session:
saver = tf.train.Saver()
saver.restore(session, "tmp/saved_model_all_data2.ckpt")
print("Model restored.")
#tf.global_variables_initializer().run()
#print('Initialized')
for step in range(num_steps):
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
batch_labels = train_labels[offset:(offset + batch_size), :]
batch_boxes = train_boxes[offset:(offset + batch_size), :]
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, tf_train_boxes : batch_boxes}
_, l, predictions = session.run(
[optimizer, loss, train_prediction], feed_dict=feed_dict)
step += 1
test_predictions= test_prediction.eval()
test_boxes = test_reg_boxes.eval()
test_predictions=np.argmax(test_predictions, 2).T
cv2_random_images(5, test_dataset, test_predictions, test_boxes)
Answer:
The model doest not create correct bounding boxes for the captured image.
The 3D image example seems to confuse the algorithm into believing that there are 2 numbers present, allthough the digit prediction only returns 1 digit. THe same is the case for the picture of a house number 5.
The remaining pictures have a correct number of bounding boxes but the algorithm fails to accuractly locate them.
To be able to correctly identify these type of images a more diverse dataset would perhaps be needed.
Take your project one step further. If you're interested, look to build an Android application or even a more robust Python program that can interface with input images and display the classified numbers and even the bounding boxes. You can for example try to build an augmented reality app by overlaying your answer on the image like the Word Lens app does.
Loading a TensorFlow model into a camera app on Android is demonstrated in the TensorFlow Android demo app, which you can simply modify.
If you decide to explore this optional route, be sure to document your interface and implementation, along with significant results you find. You can see the additional rubric items that you could be evaluated on by following this link.
In [ ]:
### Your optional code implementation goes here.
### Feel free to use as many code cells as needed.
Write your documentation here.
Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to
File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.