In this guide we introduce neural networks for classification and, using Keras, begin tackling datasets much larger than the toy datasets like Iris we have used so far. In this notebook, we will define classification, introduce several innovations needed to do it correctly, and work with two large standard datasets that machine learning researchers have used for many years: MNIST and CIFAR-10.
Classification is a task in which every data point is assigned a discrete category. For regression of a single output variable, we have a single output neuron, as we have seen in previous notebooks. For multi-class classification, we instead have one output neuron for each possible class, and the predicted class is the one corresponding to the neuron with the highest output value. For example, given the task of classifying images of handwritten digits (which we will introduce later), we might build a neural network like the following, with 10 output neurons, one for each of the 10 digits.
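Whatever the architecture, the final decision rule is just an argmax over the output vector. Here is a minimal sketch (not part of the original notebook) using a hypothetical vector of 10 output values:
import numpy as np
# hypothetical outputs of the 10 output neurons (one per digit), just for illustration
outputs = np.array([0.02, 0.01, 0.05, 0.80, 0.01, 0.03, 0.02, 0.03, 0.02, 0.01])
predicted_digit = np.argmax(outputs)    # index of the neuron with the highest value
print("predicted digit:", predicted_digit)   # prints 3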
Before trying out neural networks for classification, we have to introduce two new concepts: the softmax function, and cross-entropy loss.
So far, we have learned about one activation function: the sigmoid. We typically use it in all the layers of a neural network except the output layer, which is usually left linear (without an activation function) for regression. For classification, however, it is standard to pass the final layer's outputs through the softmax activation function. Given a final layer output vector $z$, the softmax function is defined as follows:
$$\sigma(\mathbf{z})_{i} = \frac{e^{z_{i}}}{\sum_{j} e^{z_{j}}}$$

where the denominator $\sum_{j} e^{z_{j}}$ is the sum over all the classes. Softmax squashes the output $z$, which is unbounded, to values between 0 and 1, and dividing by the sum over all the classes means that the outputs sum to 1. This means we can interpret the output as class probabilities.
We will use the softmax activation for the classification output layer from here on out. A short example follows:
In [1]:
import matplotlib.pyplot as plt
import numpy as np
def softmax(Z):
    # exponentiate each element, then normalize so the outputs sum to 1
    expZ = np.exp(Z)
    return expZ / np.sum(expZ)
Z = [0.0, 2.3, 1.0, 0, 5.3, 0.0]
y = softmax(Z)
print("Z =", Z)
print("y =", y)
We were given a length-6 vector $Z$ containing the values $[0.0, 2.3, 1.0, 0.0, 5.3, 0.0]$. We ran it through the softmax function, and we plot the result below:
In [2]:
plt.bar(range(len(y)), y)
Out[2]:
Notice that because of the exponential, the 5th value in $Z$, 5.3, ends up with over 90% of the probability. Softmax tends to exaggerate the differences in the original output.
We introduced loss functions in the last guide, where we used the simple mean squared error (MSE) to evaluate the performance of our network. While MSE works nicely for regression, and can be used for classification as well, it is generally not preferred for classification: class labels are not naturally continuous quantities, so a continuous squared-error penalty is not a natural fit. Instead, what practitioners generally prefer for classification is categorical cross-entropy loss.
A discussion or derivation of cross-entropy loss is beyond the scope of this class but a good introduction to it can be found here. A discussion of what makes it superior to MSE for classification can be found here. We will just focus on its properties instead.
Letting $y_i$ denote the ground truth value of class $i$, and $\hat{y}_i$ be our prediction of class $i$, the cross-entropy loss is defined as:
$$ H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i $$

If the number of classes is 2, we can expand this:
$$ H(y, \hat{y}) = -\big(y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})\big) $$

Notice that as our predicted probability for the correct class approaches 1, the cross-entropy approaches 0. For example, if $y=1$, then as $\hat{y}\rightarrow 1$, $H(y, \hat{y}) \rightarrow 0$. If our predicted probability for the correct class approaches 0 (the exact wrong prediction), e.g. if $y=1$ and $\hat{y} \rightarrow 0$, then $H(y, \hat{y}) \rightarrow \infty$.
This is true for the general $M$-class cross-entropy loss as well, $H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i$: if our prediction is very close to the true label, the loss is close to 0, and the more the prediction diverges from the true class, the higher the loss.
Minor note: in practice, a very small $\epsilon$ is added to the log, e.g. $\log(\hat{y}+\epsilon)$ to avoid $\log 0$ which is undefined.
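To make these properties concrete, here is a minimal NumPy sketch (not from the original notebook) that computes the categorical cross-entropy between a hypothetical one-hot label and two hypothetical predictions, with the small $\epsilon$ added inside the log:
import numpy as np
def cross_entropy(y_true, y_pred, eps=1e-12):
    # -sum_i y_i * log(y_hat_i), with eps to avoid log(0)
    return -np.sum(y_true * np.log(y_pred + eps))
y_true = np.array([0, 0, 1, 0])              # one-hot label: the true class is index 2
good = np.array([0.05, 0.05, 0.85, 0.05])    # confident, correct prediction
bad = np.array([0.85, 0.05, 0.05, 0.05])     # confident, wrong prediction
print(cross_entropy(y_true, good))   # small loss, about 0.16
print(cross_entropy(y_true, bad))    # large loss, about 3.0
A confident wrong prediction is penalized far more heavily than a confident correct one is rewarded, which is exactly the behavior described above.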
In the last guide, we introduced Keras. We will now use it to solve a classification problem, that of MNIST. First, let's import Keras and the other python libraries we will need.
In [11]:
import os
import matplotlib.pyplot as plt
import numpy as np
import random
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, Flatten
from keras.layers import Activation
We are also going to scale up our setup by using a much more complicated dataset than Iris: MNIST, a dataset of 70,000 28x28-pixel grayscale images of handwritten digits, manually labeled with the 10 digit classes and split into a canonical training set and test set. We can load MNIST with the following code:
In [12]:
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
num_classes = 10
Let's see what the data is packaged like:
In [13]:
print('%d train samples, %d test samples'% (x_train.shape[0], x_test.shape[0]))
print("training data shape: ", x_train.shape, y_train.shape)
print("test data shape: ", x_test.shape, y_test.shape)
Let's look at some samples of the images.
In [14]:
samples = np.concatenate([np.concatenate([x_train[i] for i in [int(random.random() * len(x_train)) for i in range(16)]], axis=1) for i in range(4)], axis=0)
plt.figure(figsize=(16,4))
plt.imshow(samples, cmap='gray')
Out[14]:
As before, we need to pre-process the data for Keras. We will reshape the image arrays from $n$x28x28 to $n$x784, so each row of the data is the full "unrolled" list of pixels, and cast them to float32, the floating-point type Keras works with by default. We then normalize the pixel values (which are naturally between 0 and 255) so that they all lie between 0 and 1 instead.
In [15]:
# reshape to input vectors
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
# make float32
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# normalize to (0-1)
x_train /= 255
x_test /= 255
In classification, we structure our neural networks so that they have $n$ output neurons, one for each class; whichever output neuron has the highest value gives the predicted class. For this, we must structure our labels as "one-hot" vectors: vectors of length $n$, where $n$ is the number of classes, whose elements are all 0 except for the one at the position of the correct label, which is 1. For example, the label for an image of the digit 3 would be:
$[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]$
And for the number 7 it would be:
$[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]$
Notice we are zero-indexed again, so the first element is for the digit 0.
In [16]:
print("first sample of y_train before one-hot vectorization", y_train[0])
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print("first sample of y_train after one-hot vectorization", y_train[0])
Now let's make a neural network for MNIST. We'll give it two layers of 100 neurons each, with sigmoid activations. Then we will make the output layer go through a softmax activation, the standard for classification.
In [18]:
model = Sequential()
model.add(Dense(100, activation='sigmoid', input_dim=784))
model.add(Dense(100, activation='sigmoid'))
model.add(Dense(num_classes, activation='softmax'))
Thus, the network has 784 * 100 = 78,400 weights in the first layer, 100 * 100 = 10,000 weights in the second layer, and 100 * 10 = 1,000 weights in the output layer, plus 100 + 100 + 10 = 210 biases, giving us a total of 78,400 + 10,000 + 1,000 + 210 = 89,610 parameters. We can see this in the summary.
In [19]:
model.summary()
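If you want to sanity-check that parameter count yourself, a quick one-off computation (not part of the original notebook) does it in a few lines:
# weights (n_in * n_out) plus biases (n_out) for each Dense layer
layer_sizes = [784, 100, 100, 10]
total = sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(total)   # 89610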
We will compile our model to optimize the categorical cross-entropy loss described earlier, and we will again use SGD as our optimizer. We will also pass the optional metrics argument to keep track of the accuracy during training, in addition to the loss. Accuracy is the percentage of samples classified correctly.
In [21]:
model.compile(loss='categorical_crossentropy',
optimizer='sgd',
metrics=['accuracy'])
We now train our network for 20 epochs with a batch size of 100. We will talk in more detail later about how to choose these hyper-parameters. We pass the test set as validation data so we can monitor performance on data the network is not trained on.
In [22]:
model.fit(x_train, y_train,
batch_size=100,
epochs=20,
verbose=1,
validation_data=(x_test, y_test))
Out[22]:
Evaluate the performance of the network.
In [23]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
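As a sanity check on what the accuracy metric means, here is a short sketch (not from the original notebook) that recomputes it from the model's raw predictions, comparing the argmax of each predicted probability vector against the argmax of each one-hot label:
predictions = model.predict(x_test)                  # shape (10000, 10): class probabilities
predicted_classes = np.argmax(predictions, axis=1)   # most probable class for each sample
true_classes = np.argmax(y_test, axis=1)             # recover integer labels from the one-hot vectors
print('manually computed accuracy:', np.mean(predicted_classes == true_classes))
The result should match the test accuracy reported by model.evaluate, up to rounding.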
After 20 epochs, we have an accuracy of around 87%. Perhaps we can train for a bit longer to get better performance? Let's run fit again for another 20 epochs. Notice that as long as we don't recompile the model, we can keep calling fit to try to improve it; training simply continues from where it left off. So we don't necessarily have to decide ahead of time how long to train for; we can keep training as we see fit.
In [24]:
model.fit(x_train, y_train,
batch_size=100,
epochs=20,
verbose=1,
validation_data=(x_test, y_test))
Out[24]:
At this point our accuracy is around 90%. This seems not too bad! Random guesses would only get us 10% accuracy, so we must be doing something right. But 90% is not acceptable for MNIST. The current record for MNIST is 99.8% accuracy, which means our model makes roughly 50 times as many errors as the best network (a 10% error rate versus 0.2%).
So how can we improve it? What if we make the network bigger? And train for longer? Let's give it two layers of 256 neurons each, and then train for 60 epochs.
In [30]:
model = Sequential()
model.add(Dense(256, activation='sigmoid', input_dim=784))
model.add(Dense(256, activation='sigmoid'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='sgd',
metrics=['accuracy'])
model.summary()
The network now has 269,322 parameters, which is more than 3 times as many as the last network. Now train it.
In [31]:
model.fit(x_train, y_train,
batch_size=100,
epochs=60,
verbose=1,
validation_data=(x_test, y_test))
Out[31]:
Surprisingly, this new network achieves only 91.6% accuracy, just a bit better than the last one.
So maybe bigger is not better! Simply making the network bigger brings diminishing returns. We are going to need other kinds of improvements to get good results, and we will introduce some of them in the next notebook.
Before we do that, let's try what we have so far on CIFAR-10. CIFAR-10 is a dataset of 60,000 32x32x3 RGB color images (50,000 for training and 10,000 for testing) of airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
The next cell will import the dataset and tell us about its shape.
In [51]:
from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
num_classes = 10
print('%d train samples, %d test samples'%(x_train.shape[0], x_test.shape[0]))
print("training data shape: ", x_train.shape, y_train.shape)
print("test data shape: ", x_test.shape, y_test.shape)
Let's look at a random sample of images from CIFAR-10.
In [52]:
samples = np.concatenate([np.concatenate([x_train[i] for i in [int(random.random() * len(x_train)) for i in range(16)]], axis=1) for i in range(6)], axis=0)
plt.figure(figsize=(16,6))
plt.imshow(samples)  # the images are RGB, so no colormap is needed
Out[52]:
As with MNIST, we need to pre-process the data by converting to float32 precision, reshaping so each row is a single input vector, and normalizing between 0 and 1.
In [53]:
# reshape to input vectors
x_train = x_train.reshape(50000, 32*32*3)
x_test = x_test.reshape(10000, 32*32*3)
# make float32
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# normalize to (0-1)
x_train /= 255
x_test /= 255
Convert labels to one-hot vectors.
In [54]:
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
Let's reuse the architecture of our first MNIST network (two hidden layers of 100 neurons) and see how it does on CIFAR-10. Note that the input_dim
of the first layer is no longer 784 as it was for MNIST; it is now 32x32x3 = 3072, which means this network has more parameters than the MNIST network with the same architecture.
In [55]:
model = Sequential()
model.add(Dense(100, activation='sigmoid', input_dim=3072))
model.add(Dense(100, activation='sigmoid'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()
This network has 318,410 parameters, compared to the 89,610 of the equivalent MNIST network. Let's compile it to learn with SGD and the same categorical cross-entropy loss function.
In [47]:
model.compile(loss='categorical_crossentropy',
optimizer='sgd',
metrics=['accuracy'])
Now train for 60 epochs, same batch size.
In [48]:
model.fit(x_train, y_train,
batch_size=100,
epochs=60,
verbose=1,
validation_data=(x_test, y_test))
Out[48]:
After 60 epochs, our network only reaches about 40% accuracy. That is still better than random guessing (10%), but 40% is a poor result: the current record for CIFAR-10 is around 97% accuracy. So we have a long way to go!
In the next notebook, we will introduce convolutional neural networks, which will greatly improve our performance.