This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Training a Deep Neural Net to Classify Handwritten Digits Using Keras

Although we achieved a formidable score with the MLP above, our result does not hold up to the state of the art. The best reported results currently reach close to 99.8% accuracy, better than human performance! This is why the task of classifying handwritten digits is nowadays largely regarded as solved.

To get closer to the state-of-the-art results, we need to use state-of-the-art techniques. Thus, we return to Keras.

Preprocessing the MNIST dataset

To make sure we get the same result every time we run the experiment, we will pick a random seed for NumPy's random number generator. This way, shuffling the training samples from the MNIST dataset will always result in the same order:


In [1]:
import numpy as np
np.random.seed(1337)  # for reproducibility

Keras provides a loading function similar to train_test_split from scikit-learn's model_selection module. Its syntax might look strangely familiar to you:


In [2]:
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()


Using Theano backend.
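The function returns the images and labels as NumPy arrays, already split into 60,000 training samples and 10,000 test samples. A quick sanity check of the shapes:

print(X_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(X_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)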

The neural nets in Keras act on the feature matrix slightly differently than the standard OpenCV and scikit-learn estimators. Whereas the rows of a feature matrix in Keras still correspond to the number of samples (X_train.shape[0] in the code below), we can preserve the two-dimensional nature of the input images by adding more dimensions to the feature matrix:


In [3]:
img_rows, img_cols = 28, 28
X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)
    
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

Here we have reshaped the feature matrix into a four-dimensional array with dimensions n_samples x 28 x 28 x 1. We also made sure we operate on 32-bit floating point numbers in the range [0, 1], rather than unsigned integers in [0, 255].
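If you want to convince yourself that the reshaping did not scramble the images, you can plot one of the preprocessed digits. This is a quick sketch using Matplotlib, which is not otherwise part of the pipeline:

import matplotlib.pyplot as plt

# show the first training image; drop the trailing channel dimension for plotting
plt.imshow(X_train[0, :, :, 0], cmap='gray')
plt.title('label: %d' % y_train[0])
plt.show()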

Then, we can one-hot encode the training labels as we did before. This makes sure that each target class gets assigned to its own neuron in the output layer. We could do this with scikit-learn's preprocessing module, but in this case it is easier to use Keras' own utility function:


In [4]:
from keras.utils import np_utils
n_classes = 10
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
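Each row of Y_train is now a vector with ten entries, containing a single 1 at the index of the corresponding digit. A quick way to verify the encoding:

print(Y_train.shape)                      # (60000, 10): one row per sample, one column per class
print(np.argmax(Y_train[0]), y_train[0])  # taking the argmax recovers the original label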

Creating a convolutional neural network

Once we have preprocessed the data, it is time to define the actual model. Here, we will once again rely on the Sequential model to define a feedforward neural network:


In [5]:
from keras.models import Sequential
model = Sequential()

However, this time, we will be smarter about the individual layers. We will design our neural network around a two-dimensional convolutional layer with a 3 x 3 pixel kernel.

A two-dimensional convolutional layer operates akin to image filtering in OpenCV, where each image in the input data is convolved with a small two-dimensional kernel. In Keras, we can specify the number of filters, the kernel size, and the padding mode:


In [6]:
from keras.layers import Conv2D
n_filters = 32
kernel_size = (3, 3)
model.add(Conv2D(n_filters, (kernel_size[0], kernel_size[1]),
                 padding='valid',
                 input_shape=input_shape))
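To make the analogy with image filtering concrete, here is a rough sketch of what a single, hand-crafted 3 x 3 kernel does to one of the digits in OpenCV. The difference is that the Conv2D layer above learns the weights of 32 such kernels from the data instead of having them specified by hand:

import cv2

# convert the first training image back to an 8-bit grayscale image
img = (X_train[0, :, :, 0] * 255).astype(np.uint8)

# a hand-crafted 3 x 3 kernel (here: a simple edge detector)
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=np.float32)

# filter the image with the kernel, analogous to what a single
# convolutional filter does inside the network
filtered = cv2.filter2D(img, -1, kernel)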

After that, we will use a rectified linear unit (ReLU) as an activation function:


In [7]:
from keras.layers import Activation
model.add(Activation('relu'))
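The name sounds more complicated than the operation is: the ReLU simply replaces every negative activation with zero, f(x) = max(0, x). A quick NumPy illustration:

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
np.maximum(0, x)  # array([0. , 0. , 0. , 1.5, 3. ])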

In a deep convolutional neural net, we can have as many layers as we want. A popular version of this structure applied to MNIST involves performing the convolution and rectification twice:


In [8]:
model.add(Conv2D(n_filters, (kernel_size[0], kernel_size[1])))
model.add(Activation('relu'))

Finally, we will pool the activations and add a Dropout layer:


In [9]:
from keras.layers import MaxPooling2D, Dropout
pool_size = (2, 2)
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Dropout(0.25))
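Max pooling keeps only the largest value in each non-overlapping 2 x 2 block of the activation map, halving the spatial resolution, while Dropout(0.25) randomly sets 25% of the remaining activations to zero during training to reduce overfitting. A toy NumPy example of 2 x 2 max pooling:

# a toy 4 x 4 activation map
a = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 2],
              [2, 2, 1, 3]])

# group the entries into 2 x 2 blocks and take the maximum of each block
a.reshape(2, 2, 2, 2).max(axis=(1, 3))  # array([[4, 2], [2, 5]])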

Then we will flatten the feature maps, pass them through a fully connected layer, and finally apply a softmax function to arrive at the output layer:


In [10]:
from keras.layers import Flatten, Dense
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes))
model.add(Activation('softmax'))
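If you want to double-check how the layers fit together, and how many parameters the network will have to learn, Keras can print a layer-by-layer summary of the model:

# prints each layer with its output shape and its number of trainable parameters
model.summary()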

Here, we will use the categorical cross-entropy loss and the Adadelta optimizer:


In [11]:
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])
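Passing the optimizer by its string name uses its default settings. If you want to tweak them, you can pass an optimizer instance instead; the learning rate below is purely illustrative:

from keras.optimizers import Adadelta

model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0),
              metrics=['accuracy'])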

Fitting the model

We fit the model like we do with all other classifiers:

Caution! This might take several hours depending on your machine.


In [12]:
model.fit(X_train, Y_train, batch_size=128, epochs=12,
          verbose=1, validation_data=(X_test, Y_test))


Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 88s - loss: 0.3501 - acc: 0.8929 - val_loss: 0.0887 - val_acc: 0.9714
Epoch 2/12
60000/60000 [==============================] - 87s - loss: 0.1263 - acc: 0.9618 - val_loss: 0.0596 - val_acc: 0.9803
Epoch 3/12
60000/60000 [==============================] - 87s - loss: 0.0999 - acc: 0.9705 - val_loss: 0.0475 - val_acc: 0.9844
Epoch 4/12
60000/60000 [==============================] - 88s - loss: 0.0839 - acc: 0.9752 - val_loss: 0.0416 - val_acc: 0.9861
Epoch 5/12
60000/60000 [==============================] - 87s - loss: 0.0742 - acc: 0.9780 - val_loss: 0.0384 - val_acc: 0.9876
Epoch 6/12
60000/60000 [==============================] - 87s - loss: 0.0661 - acc: 0.9802 - val_loss: 0.0366 - val_acc: 0.9876
Epoch 7/12
60000/60000 [==============================] - 87s - loss: 0.0603 - acc: 0.9814 - val_loss: 0.0378 - val_acc: 0.9869
Epoch 8/12
60000/60000 [==============================] - 87s - loss: 0.0580 - acc: 0.9825 - val_loss: 0.0351 - val_acc: 0.9880
Epoch 9/12
60000/60000 [==============================] - 87s - loss: 0.0535 - acc: 0.9830 - val_loss: 0.0325 - val_acc: 0.9881
Epoch 10/12
60000/60000 [==============================] - 87s - loss: 0.0518 - acc: 0.9845 - val_loss: 0.0305 - val_acc: 0.9893
Epoch 11/12
60000/60000 [==============================] - 87s - loss: 0.0490 - acc: 0.9854 - val_loss: 0.0308 - val_acc: 0.9899
Epoch 12/12
60000/60000 [==============================] - 87s - loss: 0.0459 - acc: 0.9862 - val_loss: 0.0301 - val_acc: 0.9898
Out[12]:
<keras.callbacks.History at 0x7f28d9adc390>

After training completes, we can evaluate the classifier:


In [13]:
model.evaluate(X_test, Y_test, verbose=1)


 9920/10000 [============================>.] - ETA: 0s
Out[13]:
[0.030051206669808015, 0.98980000000000001]
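The first entry is the test loss, the second the test accuracy. As an additional sanity check, you can also look at individual predictions; a short sketch, where the predicted digit is simply the class with the largest softmax output:

# class probabilities for the first ten test images
proba = model.predict(X_test[:10])

# the predicted digit is the class with the highest probability
print(np.argmax(proba, axis=1))  # predicted digits
print(y_test[:10])               # ground-truth digits for comparison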

And we achieve 98.98% accuracy! This is worlds apart from the MLP classifier we implemented before, and it is just one way to do things. As you can see, neural networks provide a plethora of tuning parameters, and it is not at all clear which ones will lead to the best performance.