Deep Learning with MNIST

This is a "censored" version of a Keras example from here: https://github.com/fchollet/keras

This version is meant to be run on the CPU. It trains on less data and will give you less impressive predictive performance than the original example code (which is meant to be run on a GPU).

Train a simple deep NN on the MNIST dataset.

  • 16 seconds per 6,000 data instance epoch on Intel Celeron N2840 @ 2.16GHz (Sigmoid)
  • 14 seconds per 6,000 data instance epoch on Intel Celeron N2840 @ 2.16GHz (ReLU)
  • 20 seconds per 6,000 data instance epoch on Intel Celeron N2840 @ 2.16GHz (ReLU, validation_data)
  • 2 seconds per 60,000 data instance epoch on a K520 GPU.

In [ ]:
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adam, RMSprop
from keras.utils import np_utils

In [ ]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [ ]:
X_train

In [ ]:
X_train.shape

Each of our 60,000 handwritten digits comes prepackaged as a 28x28 matrix, which you can think of as a primitive 28x28 pixel photo of a digit. The elements in the matrix take on values between 0 and 255, depending on how dark the writing is in that "pixel".

But we usually train our models on feature vectors, not matrices. Go ahead and use numpy's reshape method to reshape each 28x28 matrix into a 28x28 = 784-dimensional vector.

Signature

np.reshape(a, newshape, order='C')

Docstring

Gives a new shape to an array without changing its data.

Parameters

a : array_like Array to be reshaped.

newshape : int or tuple of ints The new shape should be compatible with the original shape. If an integer, then the result will be a 1-D array of that length. One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.

Returns

reshaped_array : ndarray This will be a new view object if possible; otherwise, it will be a copy.

See Also

ndarray.reshape : Equivalent method.


In [ ]:
# Reshape the 60,000 28x28 matrices in X_train into a 60000x784 dimensional matrix

Do the same for our test set, X_test. Don't forget to check how many data points it has with the shape method.


In [ ]:
# Do the same for X_test
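
If you get stuck, here's one possible way to handle both reshapes (a sketch; using each array's own first dimension avoids hard-coding the 60,000 and 10,000 counts):

In [ ]:
# One possible solution (sketch): flatten each 28x28 image into a 784-dimensional vector
X_train = X_train.reshape(X_train.shape[0], 784)
X_test = X_test.reshape(X_test.shape[0], 784)
print(X_train.shape)  # expect (60000, 784)
print(X_test.shape)   # expect (10000, 784)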

But since we're training on the CPU and don't want to wait all day for our models to train, we'll use only the first 6000 data points for our actual training set.


In [ ]:
X_train = X_train[:6000]
y_train = y_train[:6000]
X_test = X_test[:5000] # we'll use a half-sized test set, too
y_test = y_test[:5000]

This part just makes sure numpy casts our data values to the correct type.


In [ ]:
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

When training neural networks, it's important to normalize our data. There are multiple reasons for this. Normalizing means we don't have to worry about any scaling effects our input might have on our model (for example, one feature is measured in millimeters and another in miles). Furthermore, when we randomly initialize the weights of our neurons, we won't have to worry (as much) that our neurons will become saturated. Saturation occurs when the output of a neuron is very near one or zero, no matter the input. Remember, the output of a (sigmoidal) neuron is a linear combination of its inputs passed into a logistic function. The gradient of the logistic function is very small when its output is near 1 or 0, hence the gradient in the backpropagation algorithm will be very small and our neurons' weights will update too slowly.

Normalize (well, standardize, technically) X_train so that its mean is 0 and its standard deviation is 1. (Hint: subtract the mean from each data point, then divide by the standard deviation.) Also standardize X_test, but use the mean and standard deviation from the training data.


In [ ]:
# Standardize X_train and X_test
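
One simple approach (a sketch) is to compute a single global mean and standard deviation from the training set and apply them to both sets; a per-pixel version is also possible, but some border pixels never vary, so the global version avoids dividing by zero:

In [ ]:
# Possible solution (sketch): standardize using the training set's statistics only
train_mean = X_train.mean()
train_std = X_train.std()
X_train = (X_train - train_mean) / train_std
X_test = (X_test - train_mean) / train_std  # note: training mean/std, not the test set's own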

Since our output is a linear combination of the activations from our hidden layer, we need to make sure our classes are one-hot encoded rather than left as single integers. For example, if our neural net is 50% sure a digit is a 2 and 50% sure it is a 4, we don't want our network predicting a 3!


In [ ]:
print("Class representation before one hot encoding:\n", y_train[0:3])
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)
print("Class representation after one hot encoding:\n", Y_train[0:3])

Now we can finally start building our model! Keras has two model types. Sequential (below) is a typical neural network with layers stacked one on top of the other. The other model, Graph, models any network that can be represented as a directed acyclic graph (DAG). Can you imagine a neural network that violates one or both of the DAG properties? :-)


In [ ]:
model = Sequential()

Here is the blueprint for our model. It's your job to write the code. Reference the Keras docs (or look over this quick introduction for all the code you need to know).

  1. Add a dense hidden layer 512 neurons wide that accepts our 784 dimensional input. (What does it mean for a layer to be dense?).
  2. Give these neurons a 'sigmoid' activation function.
  3. Add another dense hidden layer 512 neurons wide, each neuron with a sigmoid activation function.
  4. Finally, add a dense layer 10 neurons wide with a 'softmax' activation function. (What do you think this layer is doing?)

In [ ]:
# Your multi-layer perceptron here
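
If you want to check your work against something, here's a sketch of one way to write it with the Dense and Activation layers imported above (the input_shape argument matches the Keras version this notebook appears to target; some releases expect input_dim=784 instead):

In [ ]:
# Possible solution (sketch): a 784 -> 512 -> 512 -> 10 multi-layer perceptron
model.add(Dense(512, input_shape=(784,)))  # dense: every input feeds every neuron in the layer
model.add(Activation('sigmoid'))
model.add(Dense(512))
model.add(Activation('sigmoid'))
model.add(Dense(10))                       # one output neuron per digit class
model.add(Activation('softmax'))           # squashes the 10 outputs into a probability distribution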

Our model requires two other parameters: a loss (objective) function and an optimization method. For our loss function we will use what Keras calls 'categorical_crossentropy', which also goes by 'multiclass logloss'. Keras has a fair number of objective functions to choose from, but categorical_crossentropy is the only one that really makes sense here.

We will use stochastic gradient descent ('sgd') for our optimization method. Stochastic, because it randomly selects a sample from our training data to calculate the gradient during backpropagation (or iterates sequentially over a random shuffling of the data). Gradient descent because... err, it's doing gradient descent.


In [ ]:
model.compile(loss='categorical_crossentropy', optimizer='sgd')

At last, the heavy lifting portion of our code, the fitting of the model.

Call model.fit with our training data, a batch_size of 128, and 1 epoch.

A batch is how many datapoints our neural net uses to calculate the gradient before taking a step in the gradient descent process. Batches are taken sequentially from the entire training set during optimization.

An epoch is a single pass through the data during optimization. For example, if we are training on 100 data points, and our batch size is 10, then we will have completed an epoch after processing 10 batches.


In [ ]:
# Call model.fit with batch_size = 128 and nb_epoch = 1
# If you get an I/O error, make sure your model 
# completely finishes compiling before running this line.
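
For reference, the call could look like this (nb_epoch is the epoch-count argument in the Keras version this notebook assumes; newer releases renamed it epochs):

In [ ]:
# Possible solution (sketch): one pass over the 6,000 training points, 128 points per gradient step
model.fit(X_train, Y_train, batch_size=128, nb_epoch=1)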

In [ ]:
results = model.evaluate(X_test, Y_test, show_accuracy=True)
print("Loss on Test:", results[0])
print("Accuracy on Test:", results[1])

Our results are terrible. What happened?!

Well, it turns out no one actually uses the sigmoid activation function in practice. The gradient of the logistic function is just too small to be useful, and calculating the exponential in the denominator is too computationally expensive. Instead, replace all your sigmoid activation functions with ReLU ('relu') activation functions. The ReLU takes value 0 from -infinity to 0 and value x from 0 to +infinity (equivalent to max(0, x)). Fit your model again and see what happens.
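
If you'd like a reference point, here's a sketch of one way to rebuild and refit the network with ReLU activations (starting from a fresh Sequential model so the new layers don't stack on top of the old ones):

In [ ]:
# Possible solution (sketch): same architecture as before, with sigmoid swapped for relu
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
model.fit(X_train, Y_train, batch_size=128, nb_epoch=1)
results = model.evaluate(X_test, Y_test, show_accuracy=True)
print("Accuracy on Test:", results[1])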

If you've done that, you should have seen a huge improvement. So why is ReLU so much better than sigmoid? They each have their own pros and cons (see "Commonly used activation functions"), but ReLU typically converges 6-8 times faster than sigmoid, since its gradient is always 1 for positive x (the sigmoid's gradient, by contrast, never exceeds 0.25, which it reaches only at x = 0). But there's a risk that your neuron will "die" if its input x is non-positive too often. The gradient for a non-positive x is zero, so there's no way for the neuron to update its weights once that happens. And if your neuron happens to combine its inputs such that they are non-positive values for every data instance in our training set, the neuron will never change its weights ever again.

We've only been optimizing our model for a single epoch so far, but in practice neural nets are optimized over tens to thousands of epochs. Go ahead and increase the number of epochs in our call to model.fit to 10 and pass in the new arguments validation_data=(X_test, Y_test) and show_accuracy=True. [It may look like we are using our test set in the model training process, which would put us at risk of overfitting. But that is not the case here, since we are not actually tuning our hyperparameters, picking the model with the best score, and so on. We are simply "checking" our progress against the test set, NOT making any changes to our model in response to our accuracy score on it. Of course, if we were doing hyperparameter tuning (which we are kind of doing here by changing activation functions, etc., but this is merely for illustrative purposes), we would need to remove the test data from our training process entirely.]
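
For reference, the updated call might look like this (show_accuracy and validation_data are arguments of the fit method in the older Keras API used throughout this notebook):

In [ ]:
# Possible solution (sketch): 10 epochs, reporting loss/accuracy on the held-out set after each one
model.fit(X_train, Y_train, batch_size=128, nb_epoch=10,
          show_accuracy=True, validation_data=(X_test, Y_test))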

Those results weren't bad, but we can overfit less with dropout. Dropout is a form of regularization that works by "dropping" neurons from the network at random at each batch. You can think of this as randomly selecting a proportion of edges in our neural network graph to ignore when doing the feed-forward (i.e., the matrix multiplication) part in the training of our models. Add Dropout layers that randomly set 20% of the input units to 0 at each of our hidden layers. You should notice that our validation accuracy increases.
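
Here's a sketch of one way to wire in the dropout layers (Dropout(0.2) randomly zeroes 20% of the preceding layer's outputs during training and is a no-op at prediction time):

In [ ]:
# Possible solution (sketch): the ReLU network with 20% dropout after each hidden layer
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
model.fit(X_train, Y_train, batch_size=128, nb_epoch=10,
          show_accuracy=True, validation_data=(X_test, Y_test))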

That's all there is to it! Your neural net architecture should now be identical to the architecture specified in the original example. The only difference is that we train and validate on a subset of the data, since we're using the CPU. If we were on the GPU, training on the full 60,000-point training set and validating on the full 10,000-point test set, each epoch would take only 2 seconds and we would reach 98.4% accuracy on the test set after 20 epochs. (Yes, 2 seconds an epoch on 10x the training data, you read that right!)

There are plenty more parameters to adjust, so play around with the batch size, dropout proportion, optimization method, hidden layer width and depth, and different activation functions.