Machine Learning Engineer Nanodegree

Deep Learning

Project: Build a Digit Recognition Program

In this notebook, a template is provided for you to implement your functionality in stages which is required to successfully complete this project. If additional code is required that cannot be included in the notebook, be sure that the Python code is successfully imported and included in your submission, if necessary. Sections that begin with 'Implementation' in the header indicate where you should begin your implementation for your project. Note that some sections of implementation are optional, and will be marked with 'Optional' in the header.

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can be edited by typically double-clicking the cell to enter edit mode.


Step 1: Design and Test a Model Architecture

Design and implement a deep learning model that learns to recognize sequences of digits. Train the model using synthetic data generated by concatenating character images from notMNIST or MNIST. To produce a synthetic sequence of digits for testing, you can for example limit yourself to sequences up to five digits, and use five classifiers on top of your deep network. You would have to incorporate an additional ‘blank’ character to account for shorter number sequences.

There are various aspects to consider when thinking about this problem:

  • Your model can be derived from a deep neural net or a convolutional network.
  • You could experiment sharing or not the weights between the softmax classifiers.
  • You can also use a recurrent network in your deep neural net to replace the classification layers and directly emit the sequence of digits one-at-a-time.

You can use Keras to implement your model. Read more at keras.io.

Here is an example of a published baseline model on this problem. (video). You are not expected to model your architecture precisely using this model nor get the same performance levels, but this is more to show an exampe of an approach used to solve this particular problem. We encourage you to try out different architectures for yourself and see what works best for you. Here is a useful forum post discussing the architecture as described in the paper and here is another one discussing the loss function.

Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.


In [4]:
#Importing Modules needed
import idx2numpy
import numpy as np
import matplotlib.pyplot as plt
import random
import tensorflow as tf

In [2]:
#Loading the datafiles from MNIST Dataset into numpy ndarrays
raw_train_dataset = idx2numpy.convert_from_file('train-images-idx3-ubyte')
raw_train_labels = idx2numpy.convert_from_file('train-labels-idx1-ubyte')
raw_test_dataset = idx2numpy.convert_from_file('t10k-images-idx3-ubyte')
raw_test_labels = idx2numpy.convert_from_file('t10k-labels-idx1-ubyte')

In [3]:
#DEFINING FUNCTIONS


#Function that displays random image samples from ndarray with images and labels        
def random_sequence(nr_images, ndarr_images, ndarr_labels):
    for i in range (0,nr_images):
        image=random.randrange(0, len(ndarr_images))
        print('Label is: '   + str(ndarr_labels[image,:]) )
        plt.imshow(ndarr_images[image], cmap='gray')
        plt.show()      
        
#Function to concatenating a random number of digit images to a sequence of digits 
#An empty number space is labelled as 10 in labels             
def concatenating_random(images, labels, data_size, image_size, min_string, max_string):
    #initalizing output ndarrays
    conc_images = np.zeros((data_size, image_size, image_size*max_string))  
    conc_labels = np.empty((data_size, max_string), dtype=int)  

    #loop to create whole dataset with data_size number of entries
    for i in range(0, data_size):
        #Random choice of number length
        number_length=random.randrange(min_string,(max_string +1))
        
        #initalizing
        conc_numbers=np.zeros((image_size, image_size*max_string))        
        conc_label = np.empty((1,max_string))
        conc_label.fill(10)
        
        #position digits randomly in image
        order=random.sample(list(range(max_string)), max_string)
        
        #loop to select random digits for making number
        for j in range(0, number_length):
            #selecting random digit and label
            number=random.randrange(0, (len(images)-1)) 
            conc_label[0,order[j]]= labels[number]
            conc_numbers[:, (order[j]*image_size) : ((order[j]+1)*image_size)] = images[number]
 
        #Sorting labels to be in the same order as shuffled images, all empty char=99 rightfill
        temp=[]
        for k in range(0, max_string):
            if conc_label[0,k]!=10:
                temp.append(conc_label[0,k])
        conc_label.fill(10)
       
        for l in range(0,len(temp)):
            conc_label[0,l]=temp[l]
        
        #saving each number and labels to ndarray for ouput
        conc_images[i]=conc_numbers  
        conc_labels[i,:]=conc_label        
    return conc_images, conc_labels

#Function to normalize the pixel depth to approx mean=0 and stdev=0.5 to make learning easier 
def normalize_images(images):
    pixel_depth = 255.0 
    images = (images - 0.5*pixel_depth)/pixel_depth
    return images.astype(np.float32)

#Function to flatten dataset 
def flatten(dataset, image_size, max_digits):
    dataset = dataset.reshape((-1, image_size * image_size*max_digits)).astype(np.float32)
    return dataset

def reformat(dataset, image_size, max_digits):
    num_channels=1 #grayscale
    dataset = dataset.reshape((-1, image_size, image_size*max_digits, num_channels)).astype(np.float32)
    return dataset

#One hot encode labels
def one_hot_encode(labels):
    from sklearn.preprocessing import OneHotEncoder
    enc = OneHotEncoder(dtype=np.float32)
    enc.fit(labels)
    labels=enc.transform(labels).toarray()
    return labels

#Data is randomized, first 10% is taken for validation set
def validation_set_split(data, labels):
    data_points=int(0.1*len(data))
    valid_dataset= data[:data_points,:]
    valid_labels= labels[:data_points,:]
    train_dataset = data[data_points:,:]
    train_labels = labels[data_points:,:]

    return train_dataset, train_labels, valid_dataset, valid_labels

#
#def accuracy(predictions, labels):
#  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
#          / predictions.shape[0])

#Calculated accuracy
def accuracy(predictions, labels):    
    return (100.0 * np.sum(np.argmax(predictions, 2).T == labels) / predictions.shape[1] / predictions.shape[0])

#Func to compare predictions agains pictures
def inspect_pred(nr_images, valid_dataset, valid_labels, valid_predictions):
    valid_predictions=np.argmax(valid_predictions, 2).T
    for i in range (0, nr_images):
        image=random.randrange(0, len(valid_labels))
        plt.imshow(valid_dataset[image, :,:,0], cmap='gray')
        plt.show()
        print('Correct Label: ' + str(valid_labels[image]))
        print('Predicted label is: ' +str(valid_predictions[image]))

In [4]:
#PREPROCESSING TESTING AND TRAINING DATA

image_size=28 #assumed square image
max_digits=5
min_digits=0

#Concanating dataset from the training set, numbers of 0 to 5 digits. 60,000 combined numbers
train_dataset, train_labels = concatenating_random(raw_train_dataset,raw_train_labels, 60000, image_size, min_digits, max_digits)

#Concanating dataset from the testing set, numbers of 0 to 5 digits, 10,000 combined numbers
test_dataset, test_labels = concatenating_random(raw_test_dataset,raw_test_labels, 10000, image_size, min_digits, max_digits)

#Normalizing for closer to mean=0 and stdev=0.5
train_dataset=normalize_images(train_dataset)
test_dataset=normalize_images(test_dataset)

#Displaying 2 random concatenated images with label info, 10=blank character
print('Training set:')
random_sequence(2, train_dataset, train_labels)

#Displaying 2 random concatenated images with label info, 10=blank character
print('Testing set:')
random_sequence(2, test_dataset, test_labels)

#Reformatting arrays
train_dataset=reformat(train_dataset, image_size,max_digits)
test_dataset=reformat(test_dataset, image_size, max_digits)

#One hot encoding labels
#train_labels=one_hot_encode(train_labels)
#test_labels= one_hot_encode(test_labels)

#Split the training data into validation set and traing set
train_dataset, train_labels, valid_dataset, valid_labels = validation_set_split(train_dataset, train_labels)

print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)


Training set:
Label is: [ 3 10 10 10 10]
Label is: [ 6  7 10 10 10]
Testing set:
Label is: [ 8  1  9 10 10]
Label is: [10 10 10 10 10]
('Training set', (54000, 28, 140, 1), (54000, 5))
('Validation set', (6000, 28, 140, 1), (6000, 5))
('Test set', (10000, 28, 140, 1), (10000, 5))

In [5]:
#TENSOR FLOW CACULATION GRAPH
image_height=image_size
image_width=max_digits*image_size
batch_size = 64
patch_size = 5
depth1 = 16
depth2= 32
num_hidden = 64
num_digits=max_digits
num_labels=11
num_channels=1


graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, image_height, image_width, num_channels))
  tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, num_digits))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, num_channels, depth1], stddev=0.1))
  layer1_biases = tf.Variable(tf.zeros([depth1]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth1, depth2], stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth2]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [image_height // 4 * image_width // 4 * depth2, num_hidden], stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  
  soft1_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft1_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft2_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft3_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft5_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft5_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  
   
  # Model.
  def model(data, drop_rate=1.0):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    pool = tf.nn.max_pool(hidden, [1,2,2,1], [1,2,2,1], 'SAME')
    conv = tf.nn.conv2d(pool, layer2_weights, [1, 1, 1, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    pool = tf.nn.max_pool(hidden, [1,2,2,1], [1,2,2,1], 'SAME')  
    shape = pool.get_shape().as_list()
    reshape = tf.reshape(pool, [shape[0], shape[1] * shape[2] * shape[3]])
    connected= tf.matmul(reshape, layer3_weights)
    hidden = tf.nn.relu(connected + layer3_biases)
    hidden = tf.nn.dropout(hidden, drop_rate)
    logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
    logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
    logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
    logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
    logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
    
    return logits1, logits2, logits3, logits4, logits5
  
  # Training computation.
  [logits1, logits2, logits3, logits4, logits5] = model(tf_train_dataset, 0.9)

  loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits1, labels=tf_train_labels[:,0])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits2, labels=tf_train_labels[:,1])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits3, labels=tf_train_labels[:,2])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits4, labels=tf_train_labels[:,3])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits5, labels=tf_train_labels[:,4]))

  # Optimizer.
  #optimizer = tf.train.AdagradOptimizer(0.05).minimize(loss)


  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(0.05, global_step, 2000, 0.95)
  optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(loss, global_step=global_step)
  

  
  #Training Predictions
  train_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])  
  
  #Validation Predictions
  [logits1, logits2, logits3, logits4, logits5] = model(tf_valid_dataset)
  valid_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])  
   
  #Testing Predictions  
  [logits1, logits2, logits3, logits4, logits5] = model(tf_test_dataset)
  test_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])

In [9]:
#Loading calculated result from CUDA session and running a few itterations


num_steps = 5001
  
with tf.Session(graph=graph) as session:
    
  saver = tf.train.Saver()
  saver.restore(session, "/Users/fhellander/Machine_learning/digit-recognition-SVHN/tmp/saved_model_100001.ckpt")
  print("Model restored.")  
  #tf.global_variables_initializer().run()
  #print('Initialized')

  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 1000 == 0):
      print('Minibatch loss at step %d: %f' % (step, l))
      print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
      valid_predictions=valid_prediction.eval() 
      print('Validation accuracy: %.1f%%' % accuracy(valid_predictions, valid_labels))
      inspect_pred(1, valid_dataset, valid_labels, valid_predictions)
        
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
  save_path = saver.save(session, "saved_model.ckpt") 
  print("Model saved in file: %s" % save_path)


Model restored.
Minibatch loss at step 0: 1.066339
Minibatch accuracy: 93.6%
Validation accuracy: 96.1%
Correct Label: [ 3 10 10 10 10]
Predicted label is: [ 3 10 10 10 10]
Minibatch loss at step 1000: 1.008476
Minibatch accuracy: 93.1%
Validation accuracy: 96.2%
Correct Label: [ 2 10 10 10 10]
Predicted label is: [ 2 10 10 10 10]
Minibatch loss at step 2000: 0.898436
Minibatch accuracy: 94.4%
Validation accuracy: 96.2%
Correct Label: [ 3 10 10 10 10]
Predicted label is: [ 3 10 10 10 10]
Minibatch loss at step 3000: 0.881012
Minibatch accuracy: 94.7%
Validation accuracy: 96.2%
Correct Label: [ 7  3  8  0 10]
Predicted label is: [ 7  3  8  8 10]
Minibatch loss at step 4000: 1.003875
Minibatch accuracy: 93.3%
Validation accuracy: 96.2%
Correct Label: [ 6  3  9  9 10]
Predicted label is: [ 6  3  3  9 10]
Minibatch loss at step 5000: 0.879885
Minibatch accuracy: 94.5%
Validation accuracy: 96.3%
Correct Label: [10 10 10 10 10]
Predicted label is: [10 10 10 10 10]
Test accuracy: 96.6%
Model saved in file: saved_model.ckpt

Question 1

What approach did you take in coming up with a solution to this problem?

Answer:

For step 1 I decided to create a model to read and predict a random sequence of digits concatenated from the MNIST dataset.

  • A sequence of random lenght (0 to 5 numbers) are sampled randomly from the dataset and given a random position on a picture space
  • If less than 5 numbers are choosen, blank spaces (black images) are used to fill the space.
  • The model needs to learn to read 100,000 different numbers with alternating positions on the image space.

By constructing this dataset I can create a model that learns digit recognition for various number lenghts and with different horizontal separation. This should be a good stepping stone towards the final goal of learning number recognition for the Street View House Numbers (SVHN) datset. Since the dataset is artificially concatenated it is possible to start with a single number and increase the complexity of the model up to 5 numbers.

Digit recognition should also be easier than the SVHN dataset since each image only contains a number on a black and white format, no 'real world' features present alongside the digits.

This is the first time I used neural networks for a larger application and I decided to use the same type of architecture as LENET-5 (which was used for digit recognition on hanwritten numbers) together witht the classifiers and loss calculations from the model published in the 'Multi-digit Number Recognition from Street View' article.

The approach take is to use a combination of convolutions and max_pooling operations to create an image vector input to a fully connected layer and finally to 5 different classifiers. Each classifier is trained towards outputting 11 different labels; 10 labels for the digits 0-9 and one additional label 10 for an empty space (no digit).

Perhaps it would be useful to also have an additional classifier identifying the total number of digits in the image and share this information to the remaining classifiers but as I was unsure of how the share this information by e.g sharing weights in the model it has not been implememented.

Question 2

What does your final architecture look like? (Type of model, layers, sizes, connectivity, etc.)

Answer:

The final architecture is a 2-layer convolution-max_pooling convnet connected to a 2-layer neural network. The full configuration is as follows:

Convolution, stride= [1, 1, 1, 1], padding='SAME', depth=16)
Relu connection
Max_pool, stride= [1,2,2,1], 'SAME')
Convolution, stride=[1, 1, 1, 1], padding='SAME', depth=32)
Relu connection
Max_pool, stride= [1,2,2,1], 'SAME')  
Fully connected Layer 1 
Relu connection 

Fully connected layer 2 for Classifier 1 to 5

The convolution-max_pooling network transforms the input images from a 28x140 pixels and depth 1 input to 7x35x32 input to the neural network and the final connected layer is of size 64x55

Question 3

How did you train your model? How did you generate your synthetic dataset? Include examples of images from the synthetic data you constructed.

Answer:

The dataset is created by the following steps:

  • a random number between 0 and 5 is drawn, this will be the number of digits use. Noted number_lenght
  • from the MNIST dataset a number of samples corresponding to number_lenght is drawn randomly
  • a blank image space is created where up to 5 numbers can fit
  • a list of integers between 0 and 4 is shuffled. This list is used to decide the position of numbers drawn from MNIST
  • the drawn numbers from MNIST are positoned on the blank image space according to the shuffled list
  • the procedure is repeated to produce as many datapoints as desired and the corresponding labels is recorded.
  • the labels are the correct numbers from left to right. With less than 5 numbers the label 10 is used to indicate missing numbers and fills up the remaining label spaces.

Below are 5 random examples from the dataset.

Two datasets are created, one training set with images created by random draw from the MNIST training set, a total of 60,000 images are created. A second testing dataset is created by random draw from the MNIST testing dataset, a total of 10,000 images are created.

The test dataset is kept hidden until the end of the project and for final testing. The training dataset is split into a training set and a validation set to follow the learning progress of the algorithm and use validation score for hyperparameter exploration. 10% of the training set is taken as the validation set. No shuffeling is done of either dataset since they are created by random draw.

The algorithm is trained used stochastic gradient descent and backward propagation with approximately 100,000 itterations before the model parameters are saved.


In [13]:
#Concanating dataset from the training set, numbers of 0 to 5 digits. 60,000 combined numbers
train_dataset, train_labels = concatenating_random(raw_train_dataset,raw_train_labels, 60000, image_size, min_digits, max_digits)


#Displaying 10 random concatenated images with label info, 10=blank character
print('Training set:')
random_sequence(10, train_dataset, train_labels)


Training set:
Label is: [ 3  5  8  7 10]
Label is: [ 3  4  0  7 10]
Label is: [10 10 10 10 10]
Label is: [3 7 2 6 4]
Label is: [ 4 10 10 10 10]
Label is: [10 10 10 10 10]
Label is: [ 3 10 10 10 10]
Label is: [3 5 9 4 2]
Label is: [10 10 10 10 10]
Label is: [5 8 0 2 9]

The model is trained by maximizing the probability of seeing a certain sequence of numbers (labels) given the input.

Since each number is randomly drawn we can assume that they are independent and the probability of a sequence is the product of the probabilities of each label individual label.

The probablity of each label is given by calculating the softmax(logits[1..5]) functions

Too avoid numerical errors the probabilites are multiplied togeter in log space which means adding the individual log components.

Instead of maximixing the sum of log probabilite it is possible to do the equivalent by minimizing the summation of reduce_mean(sparse_softmax_cross_entropy_with_logits(...)) for all probabilites and the network is trained by minimizing this loss function.


Step 2: Train a Model on a Realistic Dataset

Once you have settled on a good architecture, you can train your model on real data. In particular, the Street View House Numbers (SVHN) dataset is a good large-scale dataset collected from house numbers in Google Street View. Training on this more challenging dataset, where the digits are not neatly lined-up and have various skews, fonts and colors, likely means you have to do some hyperparameter exploration to perform well.

Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.


In [21]:
#DEFINING FUCTIONS

import numpy as np
import random
import matplotlib.pyplot as plt
import h5py
import cPickle as pickle
from scipy import misc
import os
import cv2
import tensorflow as tf

#Function to read digitstruct to picklefile
def digitStructto_pickle(filepath, picklepath):
    
 
    f = h5py.File(filepath)
    
    metadata= {}
    metadata['height'] = []
    metadata['label'] = []
    metadata['left'] = []
    metadata['top'] = []
    metadata['width'] = []
    
    def print_attrs(name, obj):
        vals = []
        if obj.shape[0] == 1:
            vals.append(obj[0][0])
        else:
            for k in range(obj.shape[0]):
                vals.append(f[obj[k][0]][0][0])
        metadata[name].append(vals)    
    
    for item in f['/digitStruct/bbox']:
        f[item[0]].visititems(print_attrs)
              
    try:
      pickleData = open(picklepath, 'wb')
      pickle.dump(metadata, pickleData, pickle.HIGHEST_PROTOCOL)
      pickleData.close()
    except Exception as e:
      print 'Unable to save data to', pickle_file, ':', e
      raise 


#Extract labels from picklefile
def pickleto_labels(picklepath, max_digits):

  metadata = pickle.load(open(picklepath))
  
  labels=np.zeros((len(metadata['label']), max_digits), dtype='int32')
  labels.fill(99)#empty digit label
  
  for j in range(0, len(metadata['label'])):
  #for j in range(0, 100):
      
      for i in range(0, len(metadata['label'][j])):
          temp= metadata['label'][j][i]
          labels[j, i]= temp
                
  labels[labels==10]=0 #replacing all 10s with 0
  labels[labels==99]=10 #replacing all empty labels with 10
  return labels


#Normalize image based on maxrange [0,255]
def max_normalize_image(images):
    pixel_depth = 255.0 
    images = (images - 0.5*pixel_depth)/pixel_depth
    return images.astype(np.float32)


#Function to read images in convert to yuv space and resize to 64x128
#Global and local contrast applied (kernel 5x5)
#Return Tensor of all images
def image_yuv_preprocess(root, nr_images, height, width):
    
    np_images=np.zeros((nr_images, height, width,3)).astype(np.float32)
    for img_index in range(1, nr_images+1):
        path=root+str(img_index)+'.png'
        image=cv2.imread(path)
        image=cv2.cvtColor(image, cv2.COLOR_BGR2YUV)
        image=misc.imresize(image, (height, width))   
        illum=image[:,:,0]
        illum=cv2.equalizeHist(illum)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(5,5))
        illum=clahe.apply(illum)      
        image[:,:,0]=illum
        image=max_normalize_image(image)
        np_images[img_index-1, :, :] =image.astype(np.float32)               
    return np_images

#Function to read images in black and white, resize and return in Tensor
def image_preprocess(root, nr_images, height, width):
    np_images=np.zeros((nr_images, height, width))
    for img_index in range(1, nr_images+1):
        path=root+str(img_index)+'.png'
        image=misc.imread(path, flatten=True)
        image=misc.imresize(image, (height, width))
        image=normalize_image(image)
        np_images[img_index-1, :, :] =image                
    return np_images

#Print random selection of processed black and white images with labes
def random_images(nr_images, ndarr_images, ndarr_labels):
    for i in range (0,nr_images):
        image=random.randrange(0, len(ndarr_images))
        print('Image '+ str(image+1) +' Label is: ' + str(ndarr_labels[image]))
        plt.imshow(ndarr_images[image, :,:,0], cmap='gray')
        plt.show()
        
#Print random YUV image         
def cv2_random_images(nr_images, ndarr_images, ndarr_labels):
    for i in range (0,nr_images):
        #image=random.randrange(0, len(ndarr_images))
        image=i
        print('Image '+ str(image+1) +' Label is: ' + str(ndarr_labels[image]))
        plt.imshow(ndarr_images[image, :,:,0])
        plt.show()
  
#reformt for tensorflow if only one channel is present      
def reformat(dataset, image_height, image_width, max_digits):
    num_channels=1 #grayscale
    dataset = dataset.reshape((-1, image_height, image_width, num_channels)).astype(np.float32)
    return dataset    
 
#Split data into training and validation set
def validation_set_split(data, labels):
    data_points=int(0.02*len(data))
    valid_dataset= data[:data_points,:]
    valid_labels= labels[:data_points,:]
    train_dataset = data[data_points:,:]
    train_labels = labels[data_points:,:]
    return train_dataset, train_labels, valid_dataset, valid_labels

In [4]:
#EXAMPLE OF PREPROCESSING DATA AND PLOTTING IMAGE INTENSITY CHANNEL Y

#creating picle files from digitStructs
#digitStructto_pickle('train/digitStruct.mat', 'train_metadata.p' )
#digitStructto_pickle('test/digitStruct.mat', 'test_metadata.p' )
#digitStructto_pickle('/Users/fhellander/Machine_learning/digit-recognition-SVHN/extra/digitStruct.mat', 'extra_metadata.p' )

max_digits = 6
train_labels=pickleto_labels('train_metadata.p',max_digits )
test_labels=pickleto_labels('test_metadata.p',max_digits )
extra_labels=pickleto_labels('extra_metadata.p',max_digits )

image_height=64
image_width=128

train_data=image_yuv_preprocess( 'train/', 2, image_height, image_width)
test_data=image_yuv_preprocess( 'test/', 2, image_height, image_width)
extra_data=image_yuv_preprocess( 'extra/',2, image_height, image_width)

print('Training Data:')
cv2_random_images(2, train_data, train_labels)

print('Extra Data:')
cv2_random_images(2, extra_data, extra_labels)

print('Testing Data:')
cv2_random_images(2, test_data, test_labels)

print('Training set', train_data.shape, train_labels.shape)
print('Test set', test_data.shape, test_labels.shape)
print('Extra set', extra_data.shape, extra_labels.shape)

train_data, train_labels, valid_data, valid_labels = validation_set_split(extra_data, extra_labels)

#np.save('test_data', test_data)
#np.save('test_labels', test_labels)

#np.save('train_data', train_data)
#np.save('train_labels', train_labels)

#np.save('valid_data', valid_data)
#np.save('valid_labels', valid_labels)


Training Data:
Image 1 Label is: [ 1  9 10 10 10 10]
Image 2 Label is: [ 2  3 10 10 10 10]
Extra Data:
Image 1 Label is: [ 4  7  8 10 10 10]
Image 2 Label is: [ 7  1 10 10 10 10]
Testing Data:
Image 1 Label is: [ 5 10 10 10 10 10]
Image 2 Label is: [ 2  1  0 10 10 10]
('Training set', (2, 64, 128, 3), (33402, 6))
('Test set', (2, 64, 128, 3), (13068, 6))
('Extra set', (2, 64, 128, 3), (202353, 6))

In [19]:
#DEFINING TENSOR FLOW GRAPH

import cPickle as pickle
import numpy as np
import tensorflow as tf
import random
import matplotlib.pyplot as plt


def accuracy(predictions, labels):    
    return (100.0 * np.sum(np.argmax(predictions, 2).T == labels) / predictions.shape[1] / predictions.shape[0])

def inspect_pred(nr_images, valid_dataset, valid_labels, valid_predictions):
    valid_predictions=np.argmax(valid_predictions, 2).T
    for i in range (0, nr_images):
        image=random.randrange(0, len(valid_labels))
        plt.imshow(valid_dataset[image, :,:,0], cmap='gray')
        plt.show()
        print('Correct Label: ' + str(valid_labels[image]))
        print('Predicted label is: ' +str(valid_predictions[image]))
 

test_dataset=np.load('test_data.npy').astype(np.float32)[:1000,:,:,:]
test_labels=np.load('test_labels.npy')[:1000,0:5]
valid_dataset=np.load('valid_data.npy').astype(np.float32)[:256,:,:,:]
valid_labels=np.load('valid_labels.npy')[:256,0:5]
train_dataset=np.load('train_data.npy').astype(np.float32)[:256,:,:,:]
train_labels=np.load('train_labels.npy')[:,0:5]


image_height=64
image_width=128
batch_size = 128
patch_size = 5
depth1 = 16
depth2= 32
depth3 = 64
depth4 = 128
#depth5 = 256
num_hidden = 128
num_digits=5
num_labels=11
num_channels=3


graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, image_height, image_width, num_channels))
  tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, num_digits))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  
  # Convne Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, num_channels, depth1], stddev=0.1))
  layer1_biases = tf.Variable(tf.zeros([depth1]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth1, depth2], stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth2]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth2, depth3], stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[depth3]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth3, depth4], stddev=0.1))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[depth4]))
  #layer5_weights = tf.Variable(tf.truncated_normal(
  #    [patch_size, patch_size, depth4, depth5], stddev=0.1))
  #layer5_biases = tf.Variable(tf.constant(1.0, shape=[depth5]))


  dims=5*5*128
  connect_weights = tf.Variable(tf.truncated_normal(
      [dims, num_hidden], stddev=0.1))
  connect_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  
  soft1_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft1_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft2_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft3_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft5_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft5_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
    
  # Model.
  def model(data, drop_rate=1.0):

    # Construct a 5 layer Convenet
    # Input size: batch x 64 x 128 x 3
    # 1st Layer : convolutional layer stride=1 padding=SAME, convolution size: 5 x 5 x 1 x 16, output: batch x 64 x 128 x 16
    # 1st Layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 32 x 64 x 16
    # 2nd Layer: convolutional layer stride=1 padding=SAME, , convolution size: 5 x 5 x 16 x 32, output: batch x 32 x 64 x 32
    # 2nd layer: sub-sampling layer stride=[1 1 2 1] padding=SAME, [1,2,2,1] output: batch_size x 32 x 32 x 32
    # 3rd layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 32 x 64, output: batch x 28 x 28 x 64
    # 3rd layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 14 x 14 x 64
    # 4th layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 64 x 128, output: batch x 10 x 10 x 128
    # 4th layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 5 x 5 x 128
    
    # Fully connected layer, weight size: 3200x128
    # Output layer, weight size: 128 x 11


    conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME') #LAYER 1
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 1
    drop = tf.nn.dropout(pool, drop_rate)
    relu = tf.nn.relu(drop + layer1_biases)
    

    conv = tf.nn.conv2d(relu, layer2_weights, [1, 1, 1, 1], padding='SAME') #LAYER 2
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,1,2,1], 'SAME') #LAYER 2
    drop = tf.nn.dropout(pool, drop_rate)    
    relu = tf.nn.relu(drop + layer2_biases)

    conv = tf.nn.conv2d(relu, layer3_weights, [1, 1, 1, 1], padding='VALID') #LAYER 3
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 3
    drop = tf.nn.dropout(pool, drop_rate) 
    relu = tf.nn.relu(drop + layer3_biases)

    conv = tf.nn.conv2d(relu, layer4_weights, [1, 1, 1, 1], padding='VALID') #LAYER 4
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 4
    drop = tf.nn.dropout(pool, drop_rate) 
    relu = tf.nn.relu(drop + layer4_biases)  
   # conv = tf.nn.conv2d(pool, layer5_weights, [1, 1, 1, 1], padding='VALID') #LAYER 5
   # relu = tf.nn.relu(conv + layer5_biases)


    shape = relu.get_shape().as_list()
    reshape = tf.reshape(relu, [shape[0], shape[1] * shape[2] * shape[3]])
    connected= tf.matmul(reshape, connect_weights)
    hidden = tf.nn.relu(connected + connect_biases)
    hidden = tf.nn.dropout(hidden, drop_rate) 


    logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
    logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
    logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
    logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
    logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
    
    return logits1, logits2, logits3, logits4, logits5
  
  # Training computation.
  [logits1, logits2, logits3, logits4, logits5] = model(tf_train_dataset, 0.975)

  loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits1, labels=tf_train_labels[:,0])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits2, labels=tf_train_labels[:,1])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits3, labels=tf_train_labels[:,2])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits4, labels=tf_train_labels[:,3])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits5, labels=tf_train_labels[:,4]))

  # Optimizer.
  optimizer = tf.train.AdagradOptimizer(0.02).minimize(loss)


  #global_step = tf.Variable(0)
  #learning_rate = tf.train.exponential_decay(0.05, global_step, 2000, 0.95)
  #optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(loss, global_step=global_step)
  

  
  #Training Predictions
  train_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])  
  
  #Validation Predictions
  [logits1, logits2, logits3, logits4, logits5] = model(tf_valid_dataset)
  valid_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])  
   
  #Testing Predictions  
  [logits1, logits2, logits3, logits4, logits5] = model(tf_test_dataset)
  test_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])

In [20]:
num_steps = 1

import time
start = time.time()  
with tf.device('/cpu:0'):
  with tf.Session(graph=graph) as session:
      
    saver = tf.train.Saver()
    saver.restore(session, "./tmp/saved_model_final7.ckpt")
    print("Model restored.")  
    #tf.global_variables_initializer().run()
    #print('Initialized')


    max_acc = 0
    acc = 0
    step = 0
    for step in range(num_steps):
    #while max_acc - acc < 1.0 :
      offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
      batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
      batch_labels = train_labels[offset:(offset + batch_size), :]
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
      _, l, predictions = session.run(
        [optimizer, loss, train_prediction], feed_dict=feed_dict)
      step += 1
      if (step % 1 == 0):
        print('Minibatch loss at step %d: %f' % (step, l))
        print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
        lap = time.time()
        print('Elapsed time: ' + str(lap-start) + ' seconds')
        valid_predictions=valid_prediction.eval()
        acc = accuracy(valid_predictions, valid_labels)
        if max_acc < acc:
          max_acc = acc
        print('Validation accuracy: %.1f%%' % acc)
        inspect_pred(2, valid_dataset, valid_labels, valid_predictions)
    
    test_predictions = test_prediction.eval()
    print('Test accuracy: %.1f%%' % accuracy(test_predictions, test_labels))
    inspect_pred(2, test_dataset, test_labels, test_predictions)
    #save_path = saver.save(session, "tmp/saved_model_" + "final7" + ".ckpt") 
    #print("Model saved in file: %s" % save_path)


Model restored.
Minibatch loss at step 1: 2.061867
Minibatch accuracy: 90.5%
Elapsed time: 12.0030050278 seconds
Validation accuracy: 90.6%
Correct Label: [2 5 5 1 7]
Predicted label is: [ 2  5  5  3 10]
Correct Label: [ 2  1  3 10 10]
Predicted label is: [ 1  1  5 10 10]
Test accuracy: 90.6%
Correct Label: [ 6  3  4  4 10]
Predicted label is: [ 6  3  4 10 10]
Correct Label: [ 2  8 10 10 10]
Predicted label is: [ 2  8 10 10 10]

Question 4

Describe how you set up the training and testing data for your model. How does the model perform on a realistic dataset?

Answer:

The training and testing data was created using the following steps:

  • Testing and Training labels are read from the information in the DigitStruct files and put in an ordered array
  • Testing and Training images are read in colour
  • Testing and Training images are resized to 64*128 pixels for a consistent input format. Since most images have a wide aspect ratio this a 1:2 ratio was choosen
  • Pictures are converted to a YUV colour scheme and a global and local contrast normalization is done
  • Pictures are converted to black and white to save memory space (Colour was deemed unimportant)
  • pixel values are normalized to a range of [-1, 1] and approximate StDev of 0.5

It was discovered early that a lot data is needed for the model to work well, therefore all datasamples from the SVHN training and extra dataset were combined. Since the training set contains harder examples than the extra dataset the combined dataset is scrambled to provide an even distribution of harder and easier examples. The combined dataset is split into a training and validation set, using about 2% of the samples for validation (4000 samples). The validation set is used to tune hyperparameters and the SVHN testing dataset is kept for testing and final evaluation.

The model is trained using an Adagradoptimzer and batches of 128 training samples are selected to train on for approximateley a total of 200,000 itterations before the model parameters are saved.

Since computations are getting very expensive with 64x128 size images all training is done on a linux server using CUDA and a GPU. Only illustrations with smaller dataset are shown in the iphyton notebook

I have missunderstood part of this question and did not use the same model as for the MNIST dataset here The model used to interpret handwritten digits from the MNIST dataset did not perform well on this dataset, the training and validation score only reached about 80% accuracy and testing accuracy was around 70%. Therefore this model was quickly discarded and all results shown are from a new deeper convnet model (the new architecture is presented below in question 6)

Training accuracy is generally around 96% (the batch illustrated above has only 90.5%), validation and testing accuracy is around 91%.

The model performance is acceptable but not great, one of the issues it struggles with is numbers that make up a small portion of the image. In preprocessing the images they are simply resized to 64x128 and to succesfully detect small numbers some type of number localization is probably needed before this rezising. Otherwise a number that constitutes a small proportion of the origninal image will end up being represented by a just a few pixels.

Question 5

What changes did you have to make, if any, to achieve "good" results? Were there any options you explored that made the results worse?

Answer:

First of all the input image resolution was increased from 28x140 to 64x128. The resolution of the original dataset seemed to be quite low and compressing it futher made distinguishing numbers hard for a human.

The model was made deeper by adding an additional convolution layer, this allowed for further downsizing of the images and also increasing the model space to learn more complicated patterns. However, extending the model beyond this size prevented any useful learning.

Due to problems with overfitting a dropout layers were added which significantly improved the results.

Initally only a smaller subset of the training data was used to increase computational efficiency (30k samples). However, using a limited number of training datapoint significantly reduced the validation and testing performance and the model required +100k samples to generalize well to unseen data.

Question 6

What were your initial and final results with testing on a realistic dataset? Do you believe your model is doing a good enough job at classifying numbers correctly?

Answer:

The initial results with the model where poor. The model quickly reached a very high training score (99%-100% accuracy) but the testing and validation score was in the low 50%. The neural network simply learned all training examples by heart and could not generalize to unseen data.

Several models where built but the final model has the following architecture:

# Input size: batch x 64 x 128 x 3
# 1st Layer : convolutional layer stride=1 padding=SAME, convolution size: 5 x 5 x 1 x 16, output: batch x 64 x 128 x 16
# 1st Layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 32 x 64 x 16
# 2nd Layer: convolutional layer stride=1 padding=SAME, , convolution size: 5 x 5 x 16 x 32, output: batch x 32 x 64 x 32
# 2nd layer: sub-sampling layer stride=[1 1 2 1] padding=SAME, [1,2,2,1] output: batch_size x 32 x 32 x 32
# 3rd layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 32 x 64, output: batch x 28 x 28 x 64
# 3rd layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 14 x 14 x 64
# 4th layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 64 x 128, output: batch x 10 x 10 x 128
# 4th layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 5 x 5 x 128

# Fully connected layer, weight size: 3200x128
# Output layer, weight size: 128 x 11

The final model has a training score of about 96% and validation and testing score of about 91%. The model is doing a good job at classifying large numbers but struggles with smaller images.

The problem probably lies with using a too small resolution for small images. However, lack of computer memory prevents me trying larger images. (Or a bounding box and cropping algortihm is required before resizing images)


Step 3: Test a Model on Newly-Captured Images

Take several pictures of numbers that you find around you (at least five), and run them through your classifier on your computer to produce example results. Alternatively (optionally), you can try using OpenCV / SimpleCV / Pygame to capture live images from a webcam and run those through your classifier.

Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.


In [30]:
import numpy as np
from scipy import misc, ndimage
import matplotlib.pyplot as plt
import random
import tensorflow as tf
import matplotlib.image as mpimg

test_dataset=image_yuv_preprocess('captured_test/', 5 , 64, 128) 
    
image_height=64
image_width=128
batch_size = 128
patch_size = 5
depth1 = 16
depth2= 32
depth3 = 64
depth4 = 128
#depth5 = 256
num_hidden = 128
num_digits=5
num_labels=11
num_channels=3


graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_test_dataset = tf.constant(test_dataset)

  
  # Convne Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, num_channels, depth1], stddev=0.1))
  layer1_biases = tf.Variable(tf.zeros([depth1]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth1, depth2], stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth2]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth2, depth3], stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[depth3]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth3, depth4], stddev=0.1))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[depth4]))

  dims=5*5*128
  connect_weights = tf.Variable(tf.truncated_normal(
      [dims, num_hidden], stddev=0.1))
  connect_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  soft1_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft1_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft2_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft3_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  soft5_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  soft5_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
    
  # Model.
  def model(data, drop_rate=1.0):

    conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME') #LAYER 1
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 1
    drop = tf.nn.dropout(pool, drop_rate)
    relu = tf.nn.relu(drop + layer1_biases)
    conv = tf.nn.conv2d(relu, layer2_weights, [1, 1, 1, 1], padding='SAME') #LAYER 2
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,1,2,1], 'SAME') #LAYER 2
    drop = tf.nn.dropout(pool, drop_rate)    
    relu = tf.nn.relu(drop + layer2_biases)
    conv = tf.nn.conv2d(relu, layer3_weights, [1, 1, 1, 1], padding='VALID') #LAYER 3
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 3
    drop = tf.nn.dropout(pool, drop_rate) 
    relu = tf.nn.relu(drop + layer3_biases)
    conv = tf.nn.conv2d(relu, layer4_weights, [1, 1, 1, 1], padding='VALID') #LAYER 4
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 4
    drop = tf.nn.dropout(pool, drop_rate) 
    relu = tf.nn.relu(drop + layer4_biases)  
    shape = relu.get_shape().as_list()
    reshape = tf.reshape(relu, [shape[0], shape[1] * shape[2] * shape[3]])
    connected= tf.matmul(reshape, connect_weights)
    hidden = tf.nn.relu(connected + connect_biases)
    hidden = tf.nn.dropout(hidden, drop_rate) 
    logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
    logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
    logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
    logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
    logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
    
    return logits1, logits2, logits3, logits4, logits5

  #Testing Predictions  
  [logits1, logits2, logits3, logits4, logits5] = model(tf_test_dataset)
  test_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])     
    
num_steps = 1
  
with tf.Session(graph=graph) as session:
    
  saver = tf.train.Saver()
  saver.restore(session, "./tmp/saved_model_final7.ckpt")
  print("Model restored.")  
   
  test_labels=test_prediction.eval()
  test_labels=np.argmax(test_labels,2).T
                       

for i in range (0,5):
  plt.imshow(test_dataset[i, :,:,0], cmap='gray')
  plt.show()  
  print('Prediction is: ' +str(test_labels[i,:]))


Model restored.
Prediction is: [ 7 10 10 10 10]
Prediction is: [ 9  9  9 10 10]
Prediction is: [ 7 10 10 10 10]
Prediction is: [ 2  4  3 10 10]
Prediction is: [ 1  1  1 10 10]

Question 7

Choose five candidate images of numbers you took from around you and provide them in the report. Are there any particular qualities of the image(s) that might make classification difficult?

Answer:

The results are unfortunately very poor, no images are classified correcly.

The images are challenging in several ways:

  • first image has a 3D shape with has not been seen in training
  • second image is partly shaded and partly in strong sunlight, contrast varies and it also contains letters
  • third image is not particularly challenging except from being small
  • fourth image is handwritten, perhaps numbers from the MNIST set could have been included in training for a better result on numbers like this. There are also vertical lines present which seems to confuse the algoritm
  • fifth image is small and has a comma separator

Question 8

Is your model able to perform equally well on captured pictures or a live camera stream when compared to testing on the realistic dataset?

Answer:

The model performance is suprisingly poor for these examples, they are challenging in many different ways but not a single digit is classified correctly. 0% of the numbers are classified correcly and on a label by label slot the accuracy is only 64%

Example 1: This picture is of a 3D digit, i,e a physical model of the digit 3. The model only sees the a single 2D picture of the digit and fails to classify it correctly and instead outputs the number 7. Perhaps if the picture was taken with at a right angle to the digit it could be identified correctly. No 3D shaped digits have been observed in the training dataset and it is thus not suprising that the classification does not work

Example 2: The picture is of a license plate taken in strong sunlight and the correct label is 961 whereas the model outputs 999. The first digit is correctly classified and the algorithm does not seem to be confused by the precense of letters. One would expect the second digit to also be correct but it has been observed that the model sometimes confuses 6 with 9, there seems to be a weakness in accounting for spatial location of the digit and that its the top to down shape that is deciding. The last digit has a poor visibilty due to strong sunlight.

Example 3: The correct label is 5 whereas the model ouput is 7. This should be a fairly easy example to classify correctly with a single clearly defined number but the model still fails for unknown reasons

Example 4: This example is a handwritten number on a horizontally lined paper, the correct output is 345 but the model output is 243. The first digit is intersected by the horizontal line and the model cannot distinguish that this feature is not part of the digits feature thus labelling it a 2. The third digit is a poorly drawn 5 and since very few handwritten images where present in the training set the features of a 5 is not recognised.

Example 5: The correct output is 24 but the model output is 111. A comma separator is present between the 2 and the 4 but it is a small feature and it is not clear why the model completely missclassifies the number.

Optional: Question 9

If necessary, provide documentation for how an interface was built for your model to load and classify newly-acquired images.

Answer: Leave blank if you did not complete this part.


Step 4: Explore an Improvement for a Model

There are many things you can do once you have the basic classifier in place. One example would be to also localize where the numbers are on the image. The SVHN dataset provides bounding boxes that you can tune to train a localizer. Train a regression loss to the coordinates of the bounding box, and then test it.

Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.

A model which also localizes the image position was tested.

In addition to the logit ouput for 5 numbers a regression head is included outputting 20 bounding box coordinates (4 for each number). The input to the regression head is the same convnet shared with the logits for predicting numbers. The idea is that this is forcing the convnet to focus on the part of the image containing numbers.

The mean square error from the regression prediction is combined with the cross entropy error of the digit predictions and used as the loss function for the optimizer.

The following network architecture was used:

# Construct a 4 layer Convenet
# Input size: batch x 64 x 128 x 1
# 1st Layer : convolutional layer stride=1 padding=SAME, convolution size: 5 x 5 x 1 x 16, output: batch x 64 x 128 x 16
# 1st Layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 32 x 64 x 16
# 2nd Layer: convolutional layer stride=1 padding=SAME, , convolution size: 5 x 5 x 16 x 32, output: batch x 32 x 64 x 32
# 2nd layer: sub-sampling layer stride=[1 1 2 1] padding=SAME, [1,2,2,1] output: batch_size x 32 x 32 x 32
# 3rd layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 32 x 64, output: batch x 28 x 28 x 64
# 3rd layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 14 x 14 x 64
# 4th layer: convolutional layer stride=1 padding=VALID, convolution size: 5 x 5 x 64 x 128, output: batch x 10 x 10 x 128
# 4th layer: sub-sampling layer stride 2 padding=SAME, [1,2,2,1] output: batch_size x 5 x 5 x 128

# Fully connect layer 1, weight size: 3200 x 128
# Logits Output layer, weight size: 128 x 11


# Fully connect layer 2, weight size: 3200 x 128
# Regression Output layer, weight size: 128 x 11

In [43]:
import cPickle as pickle
import numpy as np
import tensorflow as tf
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches


def batch_iou(a, b, epsilon=1e-5):
    
    
    #reformatting from left, top, widht, height to x1,y2, x2,y2
    ax1=a[:,0]
    bx1=b[:,0]  
    ax2=a[:,1] - a[:,3]
    bx2=b[:,1] - b[:,3]  
    ax3=ax1 + a[:,2]
    bx3=bx1 + b[:,2]
    ax4=a[:,1]
    bx4=a[:,1]
    
    a =np.stack((ax1,ax2,ax3,ax4), axis=1)
    b = np.stack((bx1,bx2,bx3,bx4), axis=1)
         
    """ Given two arrays `a` and `b` where each row contains a bounding
        box defined as a list of four numbers:
            [x1,y1,x2,y2]
        where:
            x1,y1 represent the upper left corner
            x2,y2 represent the lower right corner
        It returns the Intersect of Union scores for each corresponding
        pair of boxes.

    Args:
        a:          (numpy array) each row containing [x1,y1,x2,y2] coordinates
        b:          (numpy array) each row containing [x1,y1,x2,y2] coordinates
        epsilon:    (float) Small value to prevent division by zero

    Returns:
        (numpy array) The Intersect of Union scores for each pair of bounding
        boxes.
    """
    
    # COORDINATES OF THE INTERSECTION BOXES
    x1 = np.array([a[:, 0], b[:, 0]]).max(axis=0)
    y1 = np.array([a[:, 1], b[:, 1]]).max(axis=0)
    x2 = np.array([a[:, 2], b[:, 2]]).min(axis=0)
    y2 = np.array([a[:, 3], b[:, 3]]).min(axis=0)
    
    

    # AREAS OF OVERLAP - Area where the boxes intersect
    width = (x2 - x1)
    height = (y2 - y1)

    # handle case where there is NO overlap
    width[width < 0] = 0
    height[height < 0] = 0

    area_overlap = width * height

    # COMBINED AREAS
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    area_combined = area_a + area_b - area_overlap

    # RATIO OF AREA OF OVERLAP OVER COMBINED AREA
    iou = area_overlap / (area_combined + epsilon)
    return iou



def accuracy(predictions, labels):    
    return (100.0 * np.sum(np.argmax(predictions, 2).T == labels) / predictions.shape[1] / predictions.shape[0])

def inspect_pred(nr_images, valid_dataset, valid_labels, valid_predictions):
    valid_predictions=np.argmax(valid_predictions, 2).T
    for i in range (0, nr_images):
        image=random.randrange(0, len(valid_labels))
        plt.imshow(valid_dataset[image, :,:,0], cmap='gray')
        plt.show()
        print('Correct Label: ' + str(valid_labels[image]))
        print('Predicted label is: ' +str(valid_predictions[image]))
        
        
#Print random YUV image         
def cv2_random_images(nr_images, ndarr_images, ndarr_labels, boxes):
    for i in range (0,nr_images):
        #image=random.randrange(0, len(ndarr_images))
        
        ndarr_images=ndarr_images.reshape((-1, 64, 128))
        
        bboxes=np.copy(boxes)
        #box cordinates as fraction of image, returning to nr of pixels
        for j in range(0, len(bboxes[0])/4):
            bboxes[i, 4*j] = bboxes[i, 4*j]*ndarr_images[0].shape[1]
            bboxes[i, 4*j+1] = bboxes[i, 4*j+1]*ndarr_images[0].shape[0]
            bboxes[i, 4*j+2] = bboxes[i, 4*j+2]*ndarr_images[0].shape[1] 
            bboxes[i, 4*j+3] = bboxes[i, 4*j+3]*ndarr_images[0].shape[0]
        
        
        #plotting images with bounding boxes
        print('Image '+ str(i+1) +' Label is: ' + str(ndarr_labels[i]))
        fig,ax = plt.subplots(1)
        ax.imshow(ndarr_images[i, :,:], cmap='gray')
        
        #Create rectangle patch
        rect1=patches.Rectangle((bboxes[i,0],bboxes[i,1]), bboxes[i,2],bboxes[i,3], linewidth=1, edgecolor='b', facecolor='none')
        rect2=patches.Rectangle((bboxes[i,4],bboxes[i,5]), bboxes[i,6],bboxes[i,7], linewidth=1, edgecolor='b', facecolor='none')
        rect3=patches.Rectangle((bboxes[i,8],bboxes[i,9]), bboxes[i,10],bboxes[i,11], linewidth=1, edgecolor='b', facecolor='none')
        rect4=patches.Rectangle((bboxes[i,12],bboxes[i,13]), bboxes[i,14],bboxes[i,15], linewidth=1, edgecolor='b', facecolor='none')
        rect5=patches.Rectangle((bboxes[i,16],bboxes[i,17]), bboxes[i,18],bboxes[i,19], linewidth=1, edgecolor='b', facecolor='none')
  
        ax.add_patch(rect1);ax.add_patch(rect2);ax.add_patch(rect3);ax.add_patch(rect4);ax.add_patch(rect5)
        
        plt.show()
 

num_digits=5
num_boxes=5
image_height=64
image_width=128
batch_size = 128
patch_size = 5
depth1 = 16
depth2= 32
depth3 = 64
depth4 = 128
num_hidden = 128
num_labels=11
num_channels=1 

test_dataset=np.load('test_data.npy').astype(np.float32)[:]
test_dataset=test_dataset.reshape((-1, image_height, image_width, num_channels))
test_labels=np.load('test_labels.npy')[:,0:num_digits].astype(np.float32)
test_boxes=np.load('test_boxes.npy')[:,: 4*num_boxes ].astype(np.float32)

valid_dataset=np.load('valid_data.npy')[:100].astype(np.float32)
valid_dataset=valid_dataset.reshape((-1, image_height, image_width, num_channels))
valid_labels=np.load('valid_labels.npy')[:100,0:num_digits].astype(np.float32)
valid_boxes=np.load('valid_boxes.npy')[:100,: 4*num_boxes ].astype(np.float32)

train_dataset=np.load('train_data.npy').astype(np.float32)
train_dataset=train_dataset.reshape((-1, image_height, image_width, num_channels))
train_labels=np.load('train_labels.npy')[:,:num_digits]
train_boxes=np.load('train_boxes.npy')[:,: num_boxes*4]

graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_height, image_width, num_channels))
  tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, num_digits))
  tf_train_boxes = tf.placeholder(tf.float32, shape=(batch_size, num_boxes*4))
  
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  
  # Convnet Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, num_channels, depth1], mean=0.0, stddev=0.1))
  layer1_biases = tf.Variable(tf.constant(0.1, shape=[depth1]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth1, depth2], mean=0.0, stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(0.1, shape=[depth2]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth2, depth3], mean=0.0 , stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(0.1, shape=[depth3]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth3, depth4], mean=0.0, stddev=0.1))
  layer4_biases = tf.Variable(tf.constant(0.1, shape=[depth4]))
  #layer5_weights = tf.Variable(tf.truncated_normal(
  #    [patch_size, patch_size, depth4, depth5], stddev=0.1))
  #layer5_biases = tf.Variable(tf.constant(1.0, shape=[depth5]))


  dims=5*5*128
  connect_weights = tf.Variable(tf.truncated_normal(
      [dims, num_hidden], mean=0.0 , stddev=0.1))
  connect_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))
  
  soft1_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0 , stddev=0.1))
  soft1_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  soft2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0, stddev=0.1))
  soft2_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  soft3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0, stddev=0.1))
  soft3_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  soft4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0, stddev=0.1))
  soft4_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  soft5_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0, stddev=0.1))
  soft5_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  
  reg1_weights = tf.Variable(tf.truncated_normal(
      [dims, num_hidden], mean=0.0, stddev=0.1))
  reg1_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))  
  
  reg2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_boxes*4], mean=0.0 , stddev=0.1))
  reg2_biases = tf.Variable(tf.constant(0.1, shape=[num_boxes*4])) 

  # Model.
  def model(data, drop_rate=1.0):

    conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME') #LAYER 1
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 1
    drop = tf.nn.dropout(pool, drop_rate)
    relu = tf.nn.relu(drop + layer1_biases)
    conv = tf.nn.conv2d(relu, layer2_weights, [1, 1, 1, 1], padding='SAME') #LAYER 2
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,1,2,1], 'SAME') #LAYER 2
    drop = tf.nn.dropout(pool, drop_rate)    
    relu = tf.nn.relu(drop + layer2_biases)
    conv = tf.nn.conv2d(relu, layer3_weights, [1, 1, 1, 1], padding='VALID') #LAYER 3
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 3
    drop = tf.nn.dropout(pool, drop_rate) 
    relu = tf.nn.relu(drop + layer3_biases)
    conv = tf.nn.conv2d(relu, layer4_weights, [1, 1, 1, 1], padding='VALID') #LAYER 4
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 4
    drop = tf.nn.dropout(pool, drop_rate) 
    relu = tf.nn.relu(drop + layer4_biases)  
    shape = relu.get_shape().as_list()
    reshape = tf.reshape(relu, [shape[0], shape[1] * shape[2] * shape[3]])
    connected= tf.matmul(reshape, connect_weights)
    hidden = tf.nn.relu(connected + connect_biases)
    hidden = tf.nn.dropout(hidden, drop_rate) 
    logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
    logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
    logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
    logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
    logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
    
    reg_boxes = tf.matmul(reshape, reg1_weights)
    relu=tf.nn.relu(reg_boxes + reg1_biases)
    reg_boxes = tf.matmul(relu, reg2_weights + reg2_biases)
    
    return logits1, logits2, logits3, logits4, logits5, reg_boxes
  
  # Training computation.
  [logits1, logits2, logits3, logits4, logits5, reg_boxes] = model(tf_train_dataset, 0.985)

  digit_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits1, labels=tf_train_labels[:,0])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits2, labels=tf_train_labels[:,1])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits3, labels=tf_train_labels[:,2])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits4, labels=tf_train_labels[:,3])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits5, labels=tf_train_labels[:,4]))


  bbox_loss = tf.sqrt(tf.reduce_mean(tf.square(10*(reg_boxes - tf_train_boxes))), name="bbox_loss")

  loss = tf.add(digit_loss, bbox_loss, name="loss")
  
  # Optimizer.
  optimizer = tf.train.AdagradOptimizer(0.025).minimize(loss)

  #Training Predictions
  train_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)]) 
    
  train_reg_boxes = reg_boxes
  
  #Validation Predictions
  [logits1, logits2, logits3, logits4, logits5, reg_boxes] = model(tf_valid_dataset)
  valid_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])  
      
  valid_reg_boxes = reg_boxes
    
    
  #Testing Predictions  
  [logits1, logits2, logits3, logits4, logits5, regboxes] = model(tf_test_dataset)
  test_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)]) 
    
  test_reg_boxes = reg_boxes

num_steps = 1

import time
start = time.time()  
with tf.device('/gpu:0'):
  with tf.Session(graph=graph) as session:
      
    saver = tf.train.Saver()
    saver.restore(session, "tmp/saved_model_all_data2.ckpt")
    print("Model restored.")  
    #tf.global_variables_initializer().run()
    #print('Initialized')

    for step in range(num_steps):
      offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
      batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
      batch_labels = train_labels[offset:(offset + batch_size), :]
      batch_boxes = train_boxes[offset:(offset + batch_size), :]
      
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, tf_train_boxes : batch_boxes}
      _, l, predictions = session.run(
        [optimizer, loss, train_prediction], feed_dict=feed_dict)
      step += 1
      if (step % 1 == 0):
          
        loss1=session.run(digit_loss, feed_dict=feed_dict)
        loss2=session.run(bbox_loss, feed_dict=feed_dict)
        print('------------------------STATUS---------------------------------')
        print('Minibatch combined loss at step %d: %f' % (step, l))
        print('Minibatch digit loss: %f, bbox loss: %f' % (loss1, loss2))
        print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
       
        reg_boxes=session.run(train_reg_boxes, feed_dict=feed_dict)    
        IOU=[]#batch_iou(reg_boxes, batch_boxes)
        for i in range(0, num_boxes):
            IOU.append(batch_iou(reg_boxes[:, i*4:(i+1)*4], batch_boxes[:, i*4:(i+1)*4]))
        
        print("Average minibatch IOU score: %.1f%%" %(np.mean(IOU)*100) )
            
        lap = time.time()
        print('Elapsed time: ' + str(lap-start) + ' seconds')
        valid_predictions=valid_prediction.eval()
        reg_boxes = valid_reg_boxes.eval()
        acc = accuracy(valid_predictions, valid_labels)
        
        IOU=[]
        for i in range(0, num_boxes):
            IOU.append(batch_iou(reg_boxes[:, i*4:(i+1)*4], valid_boxes[:, i*4:(i+1)*4]))
        
        print('Validation accuracy: %.1f%%' % acc)
        print("Average validation IOU score: %.1f%%" %(np.mean(IOU)*100) )
        #inspect_pred(1, valid_dataset, valid_labels, valid_predictions)
     
    test_predictions= test_prediction.eval()
    test_boxes = test_reg_boxes.eval()
    print('Test accuracy: %.1f%%' % accuracy(test_predictions, test_labels))

    test_predictions=np.argmax(test_predictions, 2).T
    cv2_random_images(5, test_dataset, test_predictions, test_boxes)


Model restored.
------------------------STATUS---------------------------------
Minibatch combined loss at step 1: 2.937126
Minibatch digit loss: 1.677998, bbox loss: 1.126479
Minibatch accuracy: 90.0%
Average minibatch IOU score: 64.0%
Elapsed time: 17.1075718403 seconds
Validation accuracy: 89.6%
Average validation IOU score: 67.4%
Test accuracy: 90.5%
Image 1 Label is: [ 2  3 10 10 10]
Image 2 Label is: [ 2  9 10 10 10]
Image 3 Label is: [ 1  9  9 10 10]
Image 4 Label is: [ 2  7  7 10 10]
Image 5 Label is: [ 6  8 10 10 10]

Question 10

How well does your model localize numbers on the testing set from the realistic dataset? Do your classification results change at all with localization included?

Answer:

The image localization works resonable well for the algorithm. The average intersect of union score is calculated and is about 67% for testing and validation.

This score is a bit lower than expected when looking at the visual results, however it is probably penalized by the fact that an 'empy' digit is labelled with a box of starting in the lower right corner of the image and with a size corresponding to the image height and image width (normalize bounding box size (1.0, 1.0, 1.0, 1.0). When plotting most of these boxes are outside the image but if they do not have the correct size it will penalize the IOU score. For the score it would have been better to label the empty bounding boxes as (1.0, 1.0, 0.0, 0.0)

Unfortunately the digit recognition is not improved by the implementation of the bounding boxes, perhaps a better approach would be to have a two stage algorithm: Step 1 creates bounding boxes and Step 2 crops the images accordingly and identifies the correct numbers.

As it is the algorithm still struggles to identify small numbers in images.

Question 11

Test the localization function on the images you captured in Step 3. Does the model accurately calculate a bounding box for the numbers in the images you found? If you did not use a graphical interface, you may need to investigate the bounding boxes by hand. Provide an example of the localization created on a captured image.


In [57]:
import cPickle as pickle
import numpy as np
import tensorflow as tf
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches


def batch_iou(a, b, epsilon=1e-5):
    
    
    #reformatting from left, top, widht, height to x1,y2, x2,y2
    ax1=a[:,0]
    bx1=b[:,0]  
    ax2=a[:,1] - a[:,3]
    bx2=b[:,1] - b[:,3]  
    ax3=ax1 + a[:,2]
    bx3=bx1 + b[:,2]
    ax4=a[:,1]
    bx4=a[:,1]
    
    a =np.stack((ax1,ax2,ax3,ax4), axis=1)
    b = np.stack((bx1,bx2,bx3,bx4), axis=1)
         
    """ Given two arrays `a` and `b` where each row contains a bounding
        box defined as a list of four numbers:
            [x1,y1,x2,y2]
        where:
            x1,y1 represent the upper left corner
            x2,y2 represent the lower right corner
        It returns the Intersect of Union scores for each corresponding
        pair of boxes.

    Args:
        a:          (numpy array) each row containing [x1,y1,x2,y2] coordinates
        b:          (numpy array) each row containing [x1,y1,x2,y2] coordinates
        epsilon:    (float) Small value to prevent division by zero

    Returns:
        (numpy array) The Intersect of Union scores for each pair of bounding
        boxes.
    """
    
    # COORDINATES OF THE INTERSECTION BOXES
    x1 = np.array([a[:, 0], b[:, 0]]).max(axis=0)
    y1 = np.array([a[:, 1], b[:, 1]]).max(axis=0)
    x2 = np.array([a[:, 2], b[:, 2]]).min(axis=0)
    y2 = np.array([a[:, 3], b[:, 3]]).min(axis=0)
    
    

    # AREAS OF OVERLAP - Area where the boxes intersect
    width = (x2 - x1)
    height = (y2 - y1)

    # handle case where there is NO overlap
    width[width < 0] = 0
    height[height < 0] = 0

    area_overlap = width * height

    # COMBINED AREAS
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    area_combined = area_a + area_b - area_overlap

    # RATIO OF AREA OF OVERLAP OVER COMBINED AREA
    iou = area_overlap / (area_combined + epsilon)
    return iou



def accuracy(predictions, labels):    
    return (100.0 * np.sum(np.argmax(predictions, 2).T == labels) / predictions.shape[1] / predictions.shape[0])

def inspect_pred(nr_images, valid_dataset, valid_labels, valid_predictions):
    valid_predictions=np.argmax(valid_predictions, 2).T
    for i in range (0, nr_images):
        image=random.randrange(0, len(valid_labels))
        plt.imshow(valid_dataset[image, :,:,0], cmap='gray')
        plt.show()
        print('Correct Label: ' + str(valid_labels[image]))
        print('Predicted label is: ' +str(valid_predictions[image]))
        
        
#Print random YUV image         
def cv2_random_images(nr_images, ndarr_images, ndarr_labels, boxes):
    for i in range (0,nr_images):
        #image=random.randrange(0, len(ndarr_images))
        
        ndarr_images=ndarr_images.reshape((-1, 64, 128))
        
        bboxes=np.copy(boxes)
        #box cordinates as fraction of image, returning to nr of pixels
        for j in range(0, len(bboxes[0])/4):
            bboxes[i, 4*j] = bboxes[i, 4*j]*ndarr_images[0].shape[1]
            bboxes[i, 4*j+1] = bboxes[i, 4*j+1]*ndarr_images[0].shape[0]
            bboxes[i, 4*j+2] = bboxes[i, 4*j+2]*ndarr_images[0].shape[1] 
            bboxes[i, 4*j+3] = bboxes[i, 4*j+3]*ndarr_images[0].shape[0]
        
        
        #plotting images with bounding boxes
        print('Image '+ str(i+1) +' Label is: ' + str(ndarr_labels[i]))
        fig,ax = plt.subplots(1)
        ax.imshow(ndarr_images[i, :,:], cmap='gray')
        
        #Create rectangle patch
        rect1=patches.Rectangle((bboxes[i,0],bboxes[i,1]), bboxes[i,2],bboxes[i,3], linewidth=1, edgecolor='b', facecolor='none')
        rect2=patches.Rectangle((bboxes[i,4],bboxes[i,5]), bboxes[i,6],bboxes[i,7], linewidth=1, edgecolor='b', facecolor='none')
        rect3=patches.Rectangle((bboxes[i,8],bboxes[i,9]), bboxes[i,10],bboxes[i,11], linewidth=1, edgecolor='b', facecolor='none')
        rect4=patches.Rectangle((bboxes[i,12],bboxes[i,13]), bboxes[i,14],bboxes[i,15], linewidth=1, edgecolor='b', facecolor='none')
        rect5=patches.Rectangle((bboxes[i,16],bboxes[i,17]), bboxes[i,18],bboxes[i,19], linewidth=1, edgecolor='b', facecolor='none')
  
        ax.add_patch(rect1);ax.add_patch(rect2);ax.add_patch(rect3);ax.add_patch(rect4);ax.add_patch(rect5)
        
        plt.show()
 

num_digits=5
num_boxes=5
image_height=64
image_width=128
batch_size = 128
patch_size = 5
depth1 = 16
depth2= 32
depth3 = 64
depth4 = 128
num_hidden = 128
num_labels=11
num_channels=1 

test_dataset=np.load('test_data.npy').astype(np.float32)[:]
test_dataset=test_dataset.reshape((-1, image_height, image_width, num_channels))
test_labels=np.load('test_labels.npy')[:,0:num_digits].astype(np.float32)
test_boxes=np.load('test_boxes.npy')[:,: 4*num_boxes ].astype(np.float32)

valid_dataset=np.load('valid_data.npy')[:100].astype(np.float32)
valid_dataset=valid_dataset.reshape((-1, image_height, image_width, num_channels))
valid_labels=np.load('valid_labels.npy')[:100,0:num_digits].astype(np.float32)
valid_boxes=np.load('valid_boxes.npy')[:100,: 4*num_boxes ].astype(np.float32)

train_dataset=np.load('train_data.npy').astype(np.float32)
train_dataset=train_dataset.reshape((-1, image_height, image_width, num_channels))
train_labels=np.load('train_labels.npy')[:,:num_digits]
train_boxes=np.load('train_boxes.npy')[:,: num_boxes*4]

graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_height, image_width, num_channels))
  tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, num_digits))
  tf_train_boxes = tf.placeholder(tf.float32, shape=(batch_size, num_boxes*4))
  
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  
  # Convnet Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, num_channels, depth1], mean=0.0, stddev=0.1))
  layer1_biases = tf.Variable(tf.constant(0.1, shape=[depth1]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth1, depth2], mean=0.0, stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(0.1, shape=[depth2]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth2, depth3], mean=0.0 , stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(0.1, shape=[depth3]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth3, depth4], mean=0.0, stddev=0.1))
  layer4_biases = tf.Variable(tf.constant(0.1, shape=[depth4]))
  #layer5_weights = tf.Variable(tf.truncated_normal(
  #    [patch_size, patch_size, depth4, depth5], stddev=0.1))
  #layer5_biases = tf.Variable(tf.constant(1.0, shape=[depth5]))


  dims=5*5*128
  connect_weights = tf.Variable(tf.truncated_normal(
      [dims, num_hidden], mean=0.0 , stddev=0.1))
  connect_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))
  
  soft1_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0 , stddev=0.1))
  soft1_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  soft2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0, stddev=0.1))
  soft2_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  soft3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0, stddev=0.1))
  soft3_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  soft4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0, stddev=0.1))
  soft4_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  soft5_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], mean=0.0, stddev=0.1))
  soft5_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
  
  reg1_weights = tf.Variable(tf.truncated_normal(
      [dims, num_hidden], mean=0.0, stddev=0.1))
  reg1_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))  
  
  reg2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_boxes*4], mean=0.0 , stddev=0.1))
  reg2_biases = tf.Variable(tf.constant(0.1, shape=[num_boxes*4])) 

  # Model.
  def model(data, drop_rate=1.0):

    conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME') #LAYER 1
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 1
    drop = tf.nn.dropout(pool, drop_rate)
    relu = tf.nn.relu(drop + layer1_biases)
    conv = tf.nn.conv2d(relu, layer2_weights, [1, 1, 1, 1], padding='SAME') #LAYER 2
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,1,2,1], 'SAME') #LAYER 2
    drop = tf.nn.dropout(pool, drop_rate)    
    relu = tf.nn.relu(drop + layer2_biases)
    conv = tf.nn.conv2d(relu, layer3_weights, [1, 1, 1, 1], padding='VALID') #LAYER 3
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 3
    drop = tf.nn.dropout(pool, drop_rate) 
    relu = tf.nn.relu(drop + layer3_biases)
    conv = tf.nn.conv2d(relu, layer4_weights, [1, 1, 1, 1], padding='VALID') #LAYER 4
    pool = tf.nn.max_pool(conv, [1,2,2,1], [1,2,2,1], 'SAME') #LAYER 4
    drop = tf.nn.dropout(pool, drop_rate) 
    relu = tf.nn.relu(drop + layer4_biases)  
    shape = relu.get_shape().as_list()
    reshape = tf.reshape(relu, [shape[0], shape[1] * shape[2] * shape[3]])
    connected= tf.matmul(reshape, connect_weights)
    hidden = tf.nn.relu(connected + connect_biases)
    hidden = tf.nn.dropout(hidden, drop_rate) 
    logits1 = tf.matmul(hidden, soft1_weights + soft1_biases)
    logits2 = tf.matmul(hidden, soft2_weights + soft2_biases)
    logits3 = tf.matmul(hidden, soft3_weights + soft3_biases)
    logits4 = tf.matmul(hidden, soft4_weights + soft4_biases)
    logits5 = tf.matmul(hidden, soft5_weights + soft5_biases)
    
    reg_boxes = tf.matmul(reshape, reg1_weights)
    relu=tf.nn.relu(reg_boxes + reg1_biases)
    reg_boxes = tf.matmul(relu, reg2_weights + reg2_biases)
    
    return logits1, logits2, logits3, logits4, logits5, reg_boxes
  
  # Training computation.
  [logits1, logits2, logits3, logits4, logits5, reg_boxes] = model(tf_train_dataset, 0.985)

  digit_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits1, labels=tf_train_labels[:,0])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits2, labels=tf_train_labels[:,1])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits3, labels=tf_train_labels[:,2])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits4, labels=tf_train_labels[:,3])) +\
  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits5, labels=tf_train_labels[:,4]))


  bbox_loss = tf.sqrt(tf.reduce_mean(tf.square(10*(reg_boxes - tf_train_boxes))), name="bbox_loss")

  loss = tf.add(digit_loss, bbox_loss, name="loss")
  
  # Optimizer.
  optimizer = tf.train.AdagradOptimizer(0.025).minimize(loss)

  #Training Predictions
  train_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)]) 
    
  train_reg_boxes = reg_boxes
  
  #Validation Predictions
  [logits1, logits2, logits3, logits4, logits5, reg_boxes] = model(tf_valid_dataset)
  valid_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)])  
      
  valid_reg_boxes = reg_boxes
    
    
  #Testing Predictions  
  [logits1, logits2, logits3, logits4, logits5, regboxes] = model(tf_test_dataset)
  test_prediction = tf.stack([tf.nn.softmax(logits1),\
                      tf.nn.softmax(logits2),\
                      tf.nn.softmax(logits3),\
                      tf.nn.softmax(logits4),\
                      tf.nn.softmax(logits5)]) 
    
  test_reg_boxes = reg_boxes

num_steps = 1

import time
start = time.time()  
with tf.device('/gpu:0'):
  with tf.Session(graph=graph) as session:
      
    saver = tf.train.Saver()
    saver.restore(session, "tmp/saved_model_all_data2.ckpt")
    print("Model restored.")  
    #tf.global_variables_initializer().run()
    #print('Initialized')

    for step in range(num_steps):
      offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
      batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
      batch_labels = train_labels[offset:(offset + batch_size), :]
      batch_boxes = train_boxes[offset:(offset + batch_size), :]
      
      feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, tf_train_boxes : batch_boxes}
      _, l, predictions = session.run(
        [optimizer, loss, train_prediction], feed_dict=feed_dict)
      step += 1
 
     
    test_predictions= test_prediction.eval()
    test_boxes = test_reg_boxes.eval()


    test_predictions=np.argmax(test_predictions, 2).T
    cv2_random_images(5, test_dataset, test_predictions, test_boxes)


Model restored.
Image 1 Label is: [ 3 10 10 10 10]
Image 2 Label is: [ 6  9  9 10 10]
Image 3 Label is: [ 1  5 10 10 10]
Image 4 Label is: [ 1  4  6 10 10]
Image 5 Label is: [ 2  3  3 10 10]

Answer:

The model doest not create correct bounding boxes for the captured image.

The 3D image example seems to confuse the algorithm into believing that there are 2 numbers present, allthough the digit prediction only returns 1 digit. THe same is the case for the picture of a house number 5.

The remaining pictures have a correct number of bounding boxes but the algorithm fails to accuractly locate them.

To be able to correctly identify these type of images a more diverse dataset would perhaps be needed.


Optional Step 5: Build an Application or Program for a Model

Take your project one step further. If you're interested, look to build an Android application or even a more robust Python program that can interface with input images and display the classified numbers and even the bounding boxes. You can for example try to build an augmented reality app by overlaying your answer on the image like the Word Lens app does.

Loading a TensorFlow model into a camera app on Android is demonstrated in the TensorFlow Android demo app, which you can simply modify.

If you decide to explore this optional route, be sure to document your interface and implementation, along with significant results you find. You can see the additional rubric items that you could be evaluated on by following this link.

Optional Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.


In [ ]:
### Your optional code implementation goes here.
### Feel free to use as many code cells as needed.

Documentation

Provide additional documentation sufficient for detailing the implementation of the Android application or Python program for visualizing the classification of numbers in images. It should be clear how the program or application works. Demonstrations should be provided.

Write your documentation here.

Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to
File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.