Object recognition with CNN

Keras is a Python library for deep learning that wraps the powerful numerical libraries Theano and TensorFlow.

A difficult problem where traditional neural networks fall down is called object recognition. It is where a model is able to identify the objects in images.

In this post, we will discover how to develop and evaluate deep learning models for object recognition in Keras. After completing this tutorial we will know:

  1. About the CIFAR-10 object recognition dataset and how to load and use it in Keras.
  2. How to create a simple Convolutional Neural Network for object recognition.
  3. How to lift performance by creating deeper Convolutional Neural Networks.

The CIFAR-10 Problem Description

The problem of automatically identifying objects in photographs is difficult because of the near infinite number of permutations of objects, positions, lighting and so on. It’s a really hard problem.

This is a well-studied problem in computer vision and more recently an important demonstration of the capability of deep learning. A standard computer vision and deep learning dataset for this problem was developed by the Canadian Institute for Advanced Research (CIFAR).

The CIFAR-10 dataset consists of 60,000 photos divided into 10 classes (hence the name CIFAR-10). Classes include common objects such as airplanes, automobiles, birds, cats and so on. The dataset is split in a standard way, where 50,000 images are used for training a model and the remaining 10,000 for evaluating its performance.

The photos are in color with red, green and blue components, but are small measuring 32 by 32 pixel squares.

Loading The CIFAR-10 Dataset in Keras

The CIFAR-10 dataset can easily be loaded in Keras.

Keras has the facility to automatically download standard datasets like CIFAR-10 and store them in the ~/.keras/datasets directory using the cifar10.load_data() function. This dataset is large at 163 megabytes, so it may take a few minutes to download.

Once downloaded, subsequent calls to the function will load the dataset ready for use.

The dataset is stored as pickled training and test sets, ready for use in Keras. Each image is represented as a three dimensional matrix, with dimensions for red, green, blue, width and height. We can plot images directly using matplotlib.


In [1]:
# Plot ad hoc CIFAR10 instances
from keras.datasets import cifar10
from matplotlib import pyplot
# load data
(X_train, y_train), (X_test, y_test) = cifar10.load_data()


Using TensorFlow backend.

Simple Convolutional Neural Network for CIFAR-10

The CIFAR-10 problem is best solved using a Convolutional Neural Network (CNN).

We can quickly start off by defining all of the classes and functions we will need in this example.


In [2]:
# Simple CNN model for CIFAR-10
import numpy
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.constraints import maxnorm
from keras.optimizers import SGD
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.utils import np_utils
from keras import backend as K
K.set_image_dim_ordering('th')

As is good practice, we next initialize the random number seed with a constant to ensure the results are reproducible.


In [3]:
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

Next we can load the CIFAR-10 dataset.


In [4]:
# load data
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

The pixel values are in the range of 0 to 255 for each of the red, green and blue channels.

It is good practice to work with normalized data. Because the input values are well understood, we can easily normalize to the range 0 to 1 by dividing each value by the maximum observation which is 255.

Note, the data is loaded as integers, so we must cast it to floating point values in order to perform the division.


In [5]:
# normalize inputs from 0-255 to 0.0-1.0
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train = X_train / 255.0
X_test = X_test / 255.0

The output variables are defined as a vector of integers from 0 to 1 for each class.

We can use a one hot encoding to transform them into a binary matrix in order to best model the classification problem. We know there are 10 classes for this problem, so we can expect the binary matrix to have a width of 10.


In [6]:
# one hot encode outputs
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]

Let’s start off by defining a simple CNN structure as a baseline and evaluate how well it performs on the problem.

We will use a structure with two convolutional layers followed by max pooling and a flattening out of the network to fully connected layers to make predictions.

Our baseline network structure can be summarized as follows:

  1. Convolutional input layer, 32 feature maps with a size of 3×3, a rectifier activation function and a weight constraint of max norm set to 3.
  2. Dropout set to 20%.
  3. Convolutional layer, 32 feature maps with a size of 3×3, a rectifier activation function and a weight constraint of max norm set to 3.
  4. Max Pool layer with size 2×2.
  5. Flatten layer.
  6. Fully connected layer with 512 units and a rectifier activation function.
  7. Dropout set to 50%.
  8. Fully connected output layer with 10 units and a softmax activation function.

A logarithmic loss function is used with the stochastic gradient descent optimization algorithm configured with a large momentum and weight decay start with a learning rate of 0.01.


In [7]:
# Create the model
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(3, 32, 32), padding='same', activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', kernel_constraint=maxnorm(3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
# Compile model
epochs = 10
lrate = 0.01
decay = lrate/epochs
sgd = SGD(lr=lrate, momentum=0.9, decay=decay, nesterov=False)
model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['categorical_accuracy'])
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 32, 32, 32)        896       
_________________________________________________________________
dropout_1 (Dropout)          (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 32, 32, 32)        9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 16, 16)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 8192)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 512)               4194816   
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                5130      
=================================================================
Total params: 4,210,090
Trainable params: 4,210,090
Non-trainable params: 0
_________________________________________________________________
None

We can fit this model with 10 epochs and a batch size of 32.

A small number of epochs was chosen to help keep this tutorial moving. Normally the number of epochs would be one or two orders of magnitude larger for this problem.

Once the model is fit, we evaluate it on the test dataset and print out the classification accuracy.


In [8]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=32)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


Train on 50000 samples, validate on 10000 samples
Epoch 1/10
50000/50000 [==============================] - 647s 13ms/step - loss: 1.6141 - categorical_accuracy: 0.4203 - val_loss: 1.3088 - val_categorical_accuracy: 0.5389
Epoch 2/10
50000/50000 [==============================] - 691s 14ms/step - loss: 1.1987 - categorical_accuracy: 0.5752 - val_loss: 1.1130 - val_categorical_accuracy: 0.6090
Epoch 3/10
50000/50000 [==============================] - 685s 14ms/step - loss: 1.0322 - categorical_accuracy: 0.6374 - val_loss: 1.0067 - val_categorical_accuracy: 0.6496
Epoch 4/10
50000/50000 [==============================] - 694s 14ms/step - loss: 0.9367 - categorical_accuracy: 0.6711 - val_loss: 0.9716 - val_categorical_accuracy: 0.6610
Epoch 5/10
50000/50000 [==============================] - 699s 14ms/step - loss: 0.8595 - categorical_accuracy: 0.6990 - val_loss: 0.9385 - val_categorical_accuracy: 0.6734
Epoch 6/10
50000/50000 [==============================] - 693s 14ms/step - loss: 0.8011 - categorical_accuracy: 0.7212 - val_loss: 0.9027 - val_categorical_accuracy: 0.6855
Epoch 7/10
50000/50000 [==============================] - 708s 14ms/step - loss: 0.7488 - categorical_accuracy: 0.7395 - val_loss: 0.8850 - val_categorical_accuracy: 0.6913
Epoch 8/10
50000/50000 [==============================] - 729s 15ms/step - loss: 0.7026 - categorical_accuracy: 0.7575 - val_loss: 0.8820 - val_categorical_accuracy: 0.6935
Epoch 9/10
50000/50000 [==============================] - 712s 14ms/step - loss: 0.6573 - categorical_accuracy: 0.7739 - val_loss: 0.8735 - val_categorical_accuracy: 0.6992
Epoch 10/10
50000/50000 [==============================] - 666s 13ms/step - loss: 0.6167 - categorical_accuracy: 0.7884 - val_loss: 0.8690 - val_categorical_accuracy: 0.7013
Accuracy: 70.13%

We can improve the accuracy significantly by creating a much deeper network.

Larger Convolutional Neural Network for CIFAR-10

We have seen that a simple CNN performs poorly on this complex problem. In this section we look at scaling up the size and complexity of our model.

Let’s design a deep version of the simple CNN above. We can introduce an additional round of convolutions with many more feature maps. We will use the same pattern of Convolutional, Dropout, Convolutional and Max Pooling layers.

This pattern will be repeated 3 times with 32, 64, and 128 feature maps. The effect be an increasing number of feature maps with a smaller and smaller size given the max pooling layers. Finally an additional and larger Dense layer will be used at the output end of the network in an attempt to better translate the large number feature maps to class values.

We can summarize a new network architecture as follows:

Convolutional input layer, 32 feature maps with a size of 3×3 and a rectifier activation function.

Dropout layer at 20%.

Convolutional layer, 32 feature maps with a size of 3×3 and a rectifier activation function.

Max Pool layer with size 2×2.

Convolutional layer, 64 feature maps with a size of 3×3 and a rectifier activation function.

Dropout layer at 20%.

Convolutional layer, 64 feature maps with a size of 3×3 and a rectifier activation function.

Max Pool layer with size 2×2.

Convolutional layer, 128 feature maps with a size of 3×3 and a rectifier activation function.

Dropout layer at 20%.

Convolutional layer,128 feature maps with a size of 3×3 and a rectifier activation function.

Max Pool layer with size 2×2.

Flatten layer.

Dropout layer at 20%.

Fully connected layer with 1024 units and a rectifier activation function.

Dropout layer at 20%.

Fully connected layer with 512 units and a rectifier activation function.

Dropout layer at 20%.

Fully connected output layer with 10 units and a softmax activation function.

We can very easily define this network topology in Keras, as follows:


In [10]:
# Create the model

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(3, 32, 32), activation='relu', padding='same'))
model.add(Dropout(0.2))
model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Dropout(0.2))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(1024, activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

# Compile model

epochs = 10
lrate = 0.01
decay = lrate/epochs
sgd = SGD(lr=lrate, momentum=0.9, decay=decay, nesterov=False)
model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['categorical_accuracy'])
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_3 (Conv2D)            (None, 32, 32, 32)        896       
_________________________________________________________________
dropout_3 (Dropout)          (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 32, 32, 32)        9248      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 32, 16, 16)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 64, 16, 16)        18496     
_________________________________________________________________
dropout_4 (Dropout)          (None, 64, 16, 16)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 64, 16, 16)        36928     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 64, 8, 8)          0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 128, 8, 8)         73856     
_________________________________________________________________
dropout_5 (Dropout)          (None, 128, 8, 8)         0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 128, 8, 8)         147584    
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 128, 4, 4)         0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 2048)              0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 2048)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1024)              2098176   
_________________________________________________________________
dropout_7 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 512)               524800    
_________________________________________________________________
dropout_8 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                5130      
=================================================================
Total params: 2,915,114
Trainable params: 2,915,114
Non-trainable params: 0
_________________________________________________________________
None

We can fit and evaluate this model using the same a procedure above and the same number of epochs but a larger batch size of 64, found through some minor experimentation.


In [13]:
numpy.random.seed(seed)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


Train on 50000 samples, validate on 10000 samples
Epoch 1/10
50000/50000 [==============================] - 1091s 22ms/step - loss: 2.0590 - categorical_accuracy: 0.2101 - val_loss: 1.5999 - val_categorical_accuracy: 0.3941
Epoch 2/10
50000/50000 [==============================] - 1082s 22ms/step - loss: 1.4533 - categorical_accuracy: 0.4666 - val_loss: 1.2691 - val_categorical_accuracy: 0.5376
Epoch 3/10
50000/50000 [==============================] - 1120s 22ms/step - loss: 1.1911 - categorical_accuracy: 0.5700 - val_loss: 1.2213 - val_categorical_accuracy: 0.5705
Epoch 4/10
50000/50000 [==============================] - 1097s 22ms/step - loss: 1.0234 - categorical_accuracy: 0.6352 - val_loss: 1.0030 - val_categorical_accuracy: 0.6460
Epoch 5/10
50000/50000 [==============================] - 1070s 21ms/step - loss: 0.9110 - categorical_accuracy: 0.6764 - val_loss: 0.9320 - val_categorical_accuracy: 0.6756
Epoch 6/10
50000/50000 [==============================] - 1019s 20ms/step - loss: 0.8293 - categorical_accuracy: 0.7062 - val_loss: 0.8340 - val_categorical_accuracy: 0.7086
Epoch 7/10
50000/50000 [==============================] - 1020s 20ms/step - loss: 0.7600 - categorical_accuracy: 0.7305 - val_loss: 0.8423 - val_categorical_accuracy: 0.7081
Epoch 8/10
50000/50000 [==============================] - 1018s 20ms/step - loss: 0.7085 - categorical_accuracy: 0.7480 - val_loss: 0.8242 - val_categorical_accuracy: 0.7126
Epoch 9/10
50000/50000 [==============================] - 1049s 21ms/step - loss: 0.6614 - categorical_accuracy: 0.7679 - val_loss: 0.7554 - val_categorical_accuracy: 0.7397
Epoch 10/10
50000/50000 [==============================] - 1101s 22ms/step - loss: 0.6158 - categorical_accuracy: 0.7821 - val_loss: 0.7516 - val_categorical_accuracy: 0.7386
Accuracy: 73.86%

Running this example prints the classification accuracy and loss on the training and test datasets each epoch. The estimate of classification accuracy for the final model is 73.86% which is nearly 3 points better than our simpler model.

Extensions To Improve Model Performance

We have achieved good results on this very difficult problem, but we are still a good way from achieving world class results.

Below are some ideas that we can try to extend upon the models and improve model performance.

  1. Train for More Epochs. Each model was trained for a very small number of epochs, 10. It is common to train large convolutional neural networks for hundreds or thousands of epochs. I would expect that performance gains can be achieved by significantly raising the number of training epochs.
  2. Image Data Augmentation. The objects in the image vary in their position. Another boost in model performance can likely be achieved by using some data augmentation. Methods such as standardization and random shifts and horizontal image flips may be beneficial.
  3. Deeper Network Topology. The larger network presented is deep, but larger networks could be designed for the problem. This may involve more feature maps closer to the input and perhaps less aggressive pooling. Additionally, standard convolutional network topologies that have been shown useful may be adopted and evaluated on the problem.