Classification, and scaling up our networks

In this guide we introduce neural networks for classification and, using Keras, begin working with problems and datasets much larger than the toy datasets like Iris we have used so far. We will define the classification task, introduce the new pieces we need to do it correctly, and work with two standard datasets that machine learning researchers have used for many years: MNIST and CIFAR-10.

Classification

Classification is the task of assigning each data point to one of a set of discrete categories. For regression of a single output variable, we have a single output neuron, as we have seen in previous notebooks. For multi-class classification, we instead have one output neuron for each of the possible classes, and the predicted class is the one corresponding to the neuron with the highest output value. For example, for the task of classifying images of handwritten digits (which we will introduce later), we might build a neural network with 10 output neurons, one for each of the 10 digits.
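
To make "the neuron with the highest output value" concrete, here is a tiny sketch (the output values are made up for illustration) showing how the predicted class is read off with an argmax:


In [ ]:
import numpy as np

# Hypothetical activations of a 10-neuron output layer (one neuron per digit)
outputs = np.array([0.02, 0.01, 0.05, 0.80, 0.03, 0.01, 0.02, 0.04, 0.01, 0.01])

predicted_class = np.argmax(outputs)  # index of the neuron with the highest value
print(predicted_class)                # prints 3, i.e. the digit "3"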

Before trying out neural networks for classification, we have to introduce two new concepts: the softmax function, and cross-entropy loss.

Softmax activation

So far, we have learned about one activation function, the sigmoid. We typically use it in every layer of a neural network except the output layer, which for regression is usually left linear (without an activation function). For classification, however, it is standard to pass the final layer's outputs through the softmax activation function. Given a final layer output vector $z$, the softmax function is defined as:

$$\sigma (\mathbf {z} )_{i}={\frac {e^{z_{i}}}{\sum _{j}e^{z_{j}}}}$$

Where the denominator $\sum _{j}e^{z_{j}}$ is the sum over all the classes. Softmax squashes the unbounded output $z$ to values between 0 and 1, and dividing by the sum over all the classes means the outputs sum to 1. This lets us interpret the output as class probabilities.

We will use the softmax activation for the classification output layer from here on out. A short example follows:


In [1]:
import matplotlib.pyplot as plt
import numpy as np

def softmax(Z):
    # exponentiate each element, then normalize so the outputs sum to 1
    Z = np.exp(Z)
    return Z / np.sum(Z)

Z = [0.0, 2.3, 1.0, 0, 5.3, 0.0]
y = softmax(Z)

print("Z =", Z)
print("y =", y)


Z = [0.0, 2.3, 1.0, 0, 5.3, 0.0]
y = [0.004629   0.04617051 0.01258293 0.004629   0.92735955 0.004629  ]

We were given a length-6 vector $Z$ containing the values $[0.0, 2.3, 1.0, 0, 5.3, 0.0]$. We ran it through the softmax function, and we plot the result below:


In [2]:
plt.bar(range(len(y)), y)


Out[2]:
<BarContainer object of 6 artists>

Notice that because softmax exponentiates its inputs, the 5th value in $Z$, 5.3, ends up with over 90% of the probability. Softmax tends to exaggerate the differences in the original output.
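
To see this exaggeration concretely, here is a small comparison (reusing the softmax function defined above) between simply dividing $Z$ by its sum and applying softmax:


In [ ]:
# Plain normalization vs. softmax for the same Z: softmax puts far more
# of the probability mass on the largest entry.
Z = np.array([0.0, 2.3, 1.0, 0.0, 5.3, 0.0])

plain = Z / np.sum(Z)   # largest entry gets about 0.62
soft = softmax(Z)       # largest entry gets about 0.93

print("plain  :", np.round(plain, 3))
print("softmax:", np.round(soft, 3))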

Categorical cross-entropy loss

We introduced loss functions in the last guide, where we used the simple mean-squared error (MSE) to evaluate the performance of our network. MSE works nicely for regression, and can be used for classification as well, but it is generally not preferred there: class labels are discrete rather than continuous, so measuring a continuous squared distance to them is not a natural fit. Instead, the loss generally preferred for classification is categorical cross-entropy.

A derivation of cross-entropy loss is beyond the scope of this class, but a good introduction can be found here, and a discussion of what makes it preferable to MSE for classification can be found here. We will focus on its properties instead.

Letting $y_i$ denote the ground truth value of class $i$, and $\hat{y}_i$ be our prediction of class $i$, the cross-entropy loss is defined as:

$$ H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i $$

If the number of classes is 2, we can expand this:

$$ H(y, \hat{y}) = -{(y\log(\hat{y}) + (1 - y)\log(1 - \hat{y}))}\ $$

Notice that as our predicted probability for the correct class approaches 1, the cross-entropy approaches 0. For example, if $y=1$, then as $\hat{y}\rightarrow 1$, $H(y, \hat{y}) \rightarrow 0$. If our probability for the correct class approaches 0 (the exact wrong prediction), e.g. if $y=1$ and $\hat{y} \rightarrow 0$, then $H(y, \hat{y}) \rightarrow \infty$.

This holds in the more general $M$-class cross-entropy loss as well, $ H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i $: if our prediction is very close to the true label, the loss is close to 0, and the more the prediction diverges from the true class, the higher the loss.

Minor note: in practice, a very small constant $\epsilon$ is added inside the log, e.g. $\log(\hat{y}+\epsilon)$, to avoid $\log 0$, which is undefined.
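
As a minimal sketch (not Keras's own implementation), the categorical cross-entropy for a single one-hot label can be written like this, with the $\epsilon$ trick included:


In [ ]:
def cross_entropy_loss(y_true, y_pred, eps=1e-7):
    # sketch of categorical cross-entropy, not Keras's implementation
    # y_true: one-hot ground-truth vector, y_pred: predicted class probabilities
    # eps avoids taking log(0) when a predicted probability is exactly zero
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0, 1, 0])  # the correct class is class 1
print(cross_entropy_loss(y_true, np.array([0.1, 0.8, 0.1])))  # small loss (~0.22)
print(cross_entropy_loss(y_true, np.array([0.8, 0.1, 0.1])))  # large loss (~2.30)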

MNIST: the "hello world" of classification

In the last guide, we introduced Keras. We will now use it to solve a classification problem: MNIST. First, let's import Keras and the other Python libraries we will need.


In [11]:
import os
import matplotlib.pyplot as plt
import numpy as np
import random

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, Flatten

from keras.layers import Activation

We are also going to scale up to a much more complicated dataset than Iris: MNIST, a dataset of 70,000 28x28-pixel grayscale images of handwritten digits, manually labeled with the 10 digit classes and split into a canonical training set and test set. We can load MNIST with the following code:


In [12]:
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
num_classes = 10

Let's see what the data is packaged like:


In [13]:
print('%d train samples, %d test samples'% (x_train.shape[0], x_test.shape[0]))
print("training data shape: ", x_train.shape, y_train.shape)
print("test data shape: ", x_test.shape, y_test.shape)


60000 train samples, 10000 test samples
training data shape:  (60000, 28, 28) (60000,)
test data shape:  (10000, 28, 28) (10000,)

Let's look at some samples of the images.


In [14]:
samples = np.concatenate([np.concatenate([x_train[i] for i in [int(random.random() * len(x_train)) for i in range(16)]], axis=1) for i in range(4)], axis=0)
plt.figure(figsize=(16,4))
plt.imshow(samples, cmap='gray')


Out[14]:
<matplotlib.image.AxesImage at 0x12cc06da0>

As before, we need to pre-process the data for Keras. We reshape the image arrays from $n$x28x28 to $n$x784, so that each row of the data is the full "unrolled" list of pixels, and convert them to float32. We then normalize the pixel values (which are originally between 0 and 255) so that they lie between 0 and 1 instead.


In [15]:
# reshape to input vectors
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)

# make float32
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# normalize to (0-1)
x_train /= 255
x_test /= 255

In classification, we structure our neural networks so that they have $n$ output neurons, one for each class; whichever output neuron has the highest value is the predicted class. To match this, we must structure our labels as "one-hot" vectors: vectors of length $n$, where $n$ is the number of classes, whose elements are all 0 except for the correct label, which is 1. For example, the label for an image of the digit 3 would be:

$[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]$

And for the number 7 it would be:

$[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]$

Notice we are zero-indexed again, so the first element is for the digit 0.
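
In the next cell we use Keras's built-in keras.utils.to_categorical for this, but as a quick hand-rolled sketch, a one-hot vector can also be built directly:


In [ ]:
def one_hot(label, num_classes=10):
    # hand-rolled sketch: all zeros except a 1 at the index of the correct class
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
print(one_hot(7))  # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]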


In [16]:
print("first sample of y_train before one-hot vectorization", y_train[0])

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print("first sample of y_train after one-hot vectorization", y_train[0])


first sample of y_train before one-hot vectorization 5
first sample of y_train after one-hot vectorization [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]

Now let's make a neural network for MNIST. We'll give it two hidden layers of 100 neurons each, with sigmoid activations, and an output layer with a softmax activation, the standard for classification.


In [18]:
model = Sequential()
model.add(Dense(100, activation='sigmoid', input_dim=784))
model.add(Dense(100, activation='sigmoid'))
model.add(Dense(num_classes, activation='softmax'))

Thus, the network has 784 * 100 = 78,400 weights in the first layer, 100 * 100 = 10,000 weights in the second layer, and 100 * 10 = 1,000 weights in the output layer, plus 100 + 100 + 10 = 210 biases, giving us a total of 78,400 + 10,000 + 1,000 + 210 = 89,610 parameters. We can see this in the summary.
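
As a quick arithmetic check (note that the per-layer counts in the summary below combine weights and biases):


In [ ]:
layer1 = 784 * 100 + 100   # weights + biases = 78,500
layer2 = 100 * 100 + 100   # weights + biases = 10,100
layer3 = 100 * 10 + 10     # weights + biases = 1,010
print(layer1, layer2, layer3, layer1 + layer2 + layer3)  # total: 89,610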


In [19]:
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)              (None, 100)               78500     
_________________________________________________________________
dense_5 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_6 (Dense)              (None, 10)                1010      
=================================================================
Total params: 89,610
Trainable params: 89,610
Non-trainable params: 0
_________________________________________________________________

We will compile our model to optimize the categorical cross-entropy loss described earlier, again using SGD as our optimizer. We also include the optional metrics argument to track accuracy during training in addition to the loss; accuracy is the percentage of samples classified correctly.
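
Keras reports accuracy for us, but as a sketch of what it measures, accuracy could be computed by hand from the model's predicted probabilities like this (a hypothetical helper, not part of Keras):


In [ ]:
def accuracy(model, x, y_onehot):
    # hypothetical helper: predicted class = argmax over the output probabilities
    # for each sample, compared with the argmax of the one-hot labels
    predicted = np.argmax(model.predict(x), axis=1)
    true = np.argmax(y_onehot, axis=1)
    return np.mean(predicted == true)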


In [21]:
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

We now train our network for 20 epochs with a batch size of 100. We will talk in more detail later about how to choose these hyper-parameters. We pass the test set as validation data so we can monitor performance on unseen data after each epoch.


In [22]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=20,
          verbose=1,
          validation_data=(x_test, y_test))


Train on 60000 samples, validate on 10000 samples
Epoch 1/20
60000/60000 [==============================] - 2s 28us/step - loss: 2.2794 - acc: 0.1910 - val_loss: 2.2425 - val_acc: 0.3578
Epoch 2/20
60000/60000 [==============================] - 1s 25us/step - loss: 2.2101 - acc: 0.3785 - val_loss: 2.1680 - val_acc: 0.5047
Epoch 3/20
60000/60000 [==============================] - 2s 25us/step - loss: 2.1182 - acc: 0.5069 - val_loss: 2.0495 - val_acc: 0.5663
Epoch 4/20
60000/60000 [==============================] - 1s 24us/step - loss: 1.9693 - acc: 0.5674 - val_loss: 1.8622 - val_acc: 0.6098
Epoch 5/20
60000/60000 [==============================] - 1s 24us/step - loss: 1.7526 - acc: 0.6198 - val_loss: 1.6146 - val_acc: 0.6593
Epoch 6/20
60000/60000 [==============================] - 2s 25us/step - loss: 1.5026 - acc: 0.6680 - val_loss: 1.3660 - val_acc: 0.7114
Epoch 7/20
60000/60000 [==============================] - 1s 24us/step - loss: 1.2761 - acc: 0.7147 - val_loss: 1.1638 - val_acc: 0.7313
Epoch 8/20
60000/60000 [==============================] - 1s 25us/step - loss: 1.1001 - acc: 0.7481 - val_loss: 1.0126 - val_acc: 0.7680
Epoch 9/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.9690 - acc: 0.7720 - val_loss: 0.9003 - val_acc: 0.7851
Epoch 10/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.8696 - acc: 0.7915 - val_loss: 0.8138 - val_acc: 0.8004
Epoch 11/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.7915 - acc: 0.8070 - val_loss: 0.7436 - val_acc: 0.8155
Epoch 12/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.7286 - acc: 0.8196 - val_loss: 0.6870 - val_acc: 0.8272
Epoch 13/20
60000/60000 [==============================] - 2s 26us/step - loss: 0.6769 - acc: 0.8297 - val_loss: 0.6399 - val_acc: 0.8379
Epoch 14/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.6340 - acc: 0.8389 - val_loss: 0.6007 - val_acc: 0.8462
Epoch 15/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.5980 - acc: 0.8462 - val_loss: 0.5673 - val_acc: 0.8517
Epoch 16/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.5674 - acc: 0.8527 - val_loss: 0.5390 - val_acc: 0.8569
Epoch 17/20
60000/60000 [==============================] - 2s 26us/step - loss: 0.5414 - acc: 0.8586 - val_loss: 0.5152 - val_acc: 0.8632
Epoch 18/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.5187 - acc: 0.8633 - val_loss: 0.4938 - val_acc: 0.8681
Epoch 19/20
60000/60000 [==============================] - 2s 26us/step - loss: 0.4990 - acc: 0.8681 - val_loss: 0.4756 - val_acc: 0.8723
Epoch 20/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.4817 - acc: 0.8722 - val_loss: 0.4598 - val_acc: 0.8744
Out[22]:
<keras.callbacks.History at 0x12ccfa518>

Evaluate the performance of the network.


In [23]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])


Test loss: 0.45980212097167966
Test accuracy: 0.8744
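
We can also sanity-check a single prediction (the index 0 is arbitrary; since the model is only about 87% accurate, not every sample will be classified correctly):


In [ ]:
probs = model.predict(x_test[:1])[0]     # predicted probabilities for one test image
print("predicted digit:", np.argmax(probs))
print("true digit:     ", np.argmax(y_test[0]))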

After 20 epochs, we have an accuracy of around 87%. Perhaps we can train it for a bit longer to get better performance? Let's run fit again for another 20 epochs. As long as we don't recompile the model, we can keep calling fit to try to improve it, so we don't have to decide ahead of time how long to train for; we can keep training as we see fit.


In [24]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=20,
          verbose=1,
          validation_data=(x_test, y_test))


Train on 60000 samples, validate on 10000 samples
Epoch 1/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.4664 - acc: 0.8753 - val_loss: 0.4453 - val_acc: 0.8788
Epoch 2/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.4528 - acc: 0.8786 - val_loss: 0.4327 - val_acc: 0.8828
Epoch 3/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.4405 - acc: 0.8812 - val_loss: 0.4211 - val_acc: 0.8845
Epoch 4/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.4295 - acc: 0.8835 - val_loss: 0.4109 - val_acc: 0.8879
Epoch 5/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.4196 - acc: 0.8862 - val_loss: 0.4017 - val_acc: 0.8900
Epoch 6/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.4106 - acc: 0.8880 - val_loss: 0.3937 - val_acc: 0.8916
Epoch 7/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.4023 - acc: 0.8897 - val_loss: 0.3858 - val_acc: 0.8941
Epoch 8/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.3949 - acc: 0.8917 - val_loss: 0.3793 - val_acc: 0.8944
Epoch 9/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.3880 - acc: 0.8933 - val_loss: 0.3727 - val_acc: 0.8971
Epoch 10/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.3817 - acc: 0.8948 - val_loss: 0.3673 - val_acc: 0.8981
Epoch 11/20
60000/60000 [==============================] - 1s 24us/step - loss: 0.3759 - acc: 0.8957 - val_loss: 0.3614 - val_acc: 0.8993
Epoch 12/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.3705 - acc: 0.8971 - val_loss: 0.3560 - val_acc: 0.9005
Epoch 13/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.3654 - acc: 0.8983 - val_loss: 0.3515 - val_acc: 0.9014
Epoch 14/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.3606 - acc: 0.8993 - val_loss: 0.3469 - val_acc: 0.9009
Epoch 15/20
60000/60000 [==============================] - 2s 26us/step - loss: 0.3562 - acc: 0.9007 - val_loss: 0.3431 - val_acc: 0.9027
Epoch 16/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.3520 - acc: 0.9016 - val_loss: 0.3393 - val_acc: 0.9027
Epoch 17/20
60000/60000 [==============================] - 1s 25us/step - loss: 0.3481 - acc: 0.9025 - val_loss: 0.3354 - val_acc: 0.9047
Epoch 18/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.3443 - acc: 0.9035 - val_loss: 0.3324 - val_acc: 0.9041
Epoch 19/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.3407 - acc: 0.9040 - val_loss: 0.3289 - val_acc: 0.9055
Epoch 20/20
60000/60000 [==============================] - 2s 25us/step - loss: 0.3373 - acc: 0.9046 - val_loss: 0.3259 - val_acc: 0.9069
Out[24]:
<keras.callbacks.History at 0x11b99fcf8>

At this point our accuracy is at 90%. This seems not too bad! Random guesses would only get us 10% accuracy, so we must be doing something right. But 90% is not acceptable for MNIST. The current record for MNIST has 99.8% accuracy, which means our model makes roughly 50 times as many errors as the best network.

So how can we improve it? What if we make the network bigger? And train for longer? Let's give it two layers of 256 neurons each, and then train for 60 epochs.


In [30]:
model = Sequential()
model.add(Dense(256, activation='sigmoid', input_dim=784))
model.add(Dense(256, activation='sigmoid'))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_21 (Dense)             (None, 256)               200960    
_________________________________________________________________
dense_22 (Dense)             (None, 256)               65792     
_________________________________________________________________
dense_23 (Dense)             (None, 10)                2570      
=================================================================
Total params: 269,322
Trainable params: 269,322
Non-trainable params: 0
_________________________________________________________________

The network now has 269,322 parameters, which is more than 3 times as many as the last network. Now train it.


In [31]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=60,
          verbose=1,
          validation_data=(x_test, y_test))


Train on 60000 samples, validate on 10000 samples
Epoch 1/60
60000/60000 [==============================] - 2s 39us/step - loss: 2.2673 - acc: 0.2190 - val_loss: 2.2256 - val_acc: 0.2695
Epoch 2/60
60000/60000 [==============================] - 2s 37us/step - loss: 2.1828 - acc: 0.4033 - val_loss: 2.1243 - val_acc: 0.4352
Epoch 3/60
60000/60000 [==============================] - 2s 40us/step - loss: 2.0538 - acc: 0.5165 - val_loss: 1.9559 - val_acc: 0.5852
Epoch 4/60
60000/60000 [==============================] - 2s 38us/step - loss: 1.8457 - acc: 0.5946 - val_loss: 1.7022 - val_acc: 0.6786
Epoch 5/60
60000/60000 [==============================] - 2s 38us/step - loss: 1.5702 - acc: 0.6608 - val_loss: 1.4133 - val_acc: 0.7070
Epoch 6/60
60000/60000 [==============================] - 2s 37us/step - loss: 1.3021 - acc: 0.7172 - val_loss: 1.1698 - val_acc: 0.7369
Epoch 7/60
60000/60000 [==============================] - 2s 39us/step - loss: 1.0892 - acc: 0.7600 - val_loss: 0.9857 - val_acc: 0.7899
Epoch 8/60
60000/60000 [==============================] - 2s 39us/step - loss: 0.9319 - acc: 0.7898 - val_loss: 0.8538 - val_acc: 0.8040
Epoch 9/60
60000/60000 [==============================] - 2s 39us/step - loss: 0.8170 - acc: 0.8108 - val_loss: 0.7549 - val_acc: 0.8228
Epoch 10/60
60000/60000 [==============================] - 2s 36us/step - loss: 0.7316 - acc: 0.8266 - val_loss: 0.6810 - val_acc: 0.8375
Epoch 11/60
60000/60000 [==============================] - 2s 36us/step - loss: 0.6667 - acc: 0.8371 - val_loss: 0.6248 - val_acc: 0.8436
Epoch 12/60
60000/60000 [==============================] - 2s 36us/step - loss: 0.6164 - acc: 0.8460 - val_loss: 0.5800 - val_acc: 0.8554
Epoch 13/60
60000/60000 [==============================] - 2s 38us/step - loss: 0.5765 - acc: 0.8529 - val_loss: 0.5437 - val_acc: 0.8604
Epoch 14/60
60000/60000 [==============================] - 2s 36us/step - loss: 0.5444 - acc: 0.8592 - val_loss: 0.5149 - val_acc: 0.8668
Epoch 15/60
60000/60000 [==============================] - 2s 37us/step - loss: 0.5180 - acc: 0.8645 - val_loss: 0.4904 - val_acc: 0.8707
Epoch 16/60
60000/60000 [==============================] - 2s 37us/step - loss: 0.4958 - acc: 0.8687 - val_loss: 0.4713 - val_acc: 0.8750
Epoch 17/60
60000/60000 [==============================] - 2s 40us/step - loss: 0.4770 - acc: 0.8724 - val_loss: 0.4535 - val_acc: 0.8798
Epoch 18/60
60000/60000 [==============================] - 2s 36us/step - loss: 0.4608 - acc: 0.8766 - val_loss: 0.4381 - val_acc: 0.8820
Epoch 19/60
60000/60000 [==============================] - 2s 37us/step - loss: 0.4469 - acc: 0.8790 - val_loss: 0.4254 - val_acc: 0.8859
Epoch 20/60
60000/60000 [==============================] - 2s 37us/step - loss: 0.4346 - acc: 0.8812 - val_loss: 0.4148 - val_acc: 0.8863
Epoch 21/60
60000/60000 [==============================] - 2s 39us/step - loss: 0.4239 - acc: 0.8843 - val_loss: 0.4040 - val_acc: 0.8891
Epoch 22/60
60000/60000 [==============================] - 2s 38us/step - loss: 0.4142 - acc: 0.8860 - val_loss: 0.3946 - val_acc: 0.8901
Epoch 23/60
60000/60000 [==============================] - 2s 37us/step - loss: 0.4055 - acc: 0.8882 - val_loss: 0.3880 - val_acc: 0.8915
Epoch 24/60
60000/60000 [==============================] - 2s 39us/step - loss: 0.3979 - acc: 0.8898 - val_loss: 0.3800 - val_acc: 0.8938
Epoch 25/60
60000/60000 [==============================] - 2s 39us/step - loss: 0.3909 - acc: 0.8916 - val_loss: 0.3730 - val_acc: 0.8951
Epoch 26/60
60000/60000 [==============================] - 2s 38us/step - loss: 0.3844 - acc: 0.8931 - val_loss: 0.3666 - val_acc: 0.8977
Epoch 27/60
60000/60000 [==============================] - 2s 39us/step - loss: 0.3786 - acc: 0.8942 - val_loss: 0.3616 - val_acc: 0.8983
Epoch 28/60
60000/60000 [==============================] - 2s 40us/step - loss: 0.3733 - acc: 0.8956 - val_loss: 0.3569 - val_acc: 0.8989
Epoch 29/60
60000/60000 [==============================] - 3s 42us/step - loss: 0.3683 - acc: 0.8969 - val_loss: 0.3526 - val_acc: 0.9000
Epoch 30/60
60000/60000 [==============================] - 3s 45us/step - loss: 0.3637 - acc: 0.8972 - val_loss: 0.3480 - val_acc: 0.9012
Epoch 31/60
60000/60000 [==============================] - 2s 41us/step - loss: 0.3593 - acc: 0.8989 - val_loss: 0.3442 - val_acc: 0.9013
Epoch 32/60
60000/60000 [==============================] - 3s 44us/step - loss: 0.3552 - acc: 0.8997 - val_loss: 0.3403 - val_acc: 0.9023
Epoch 33/60
60000/60000 [==============================] - 3s 48us/step - loss: 0.3515 - acc: 0.9004 - val_loss: 0.3367 - val_acc: 0.9031
Epoch 34/60
60000/60000 [==============================] - 3s 46us/step - loss: 0.3478 - acc: 0.9014 - val_loss: 0.3344 - val_acc: 0.9038
Epoch 35/60
60000/60000 [==============================] - 3s 44us/step - loss: 0.3445 - acc: 0.9019 - val_loss: 0.3300 - val_acc: 0.9049
Epoch 36/60
60000/60000 [==============================] - 3s 45us/step - loss: 0.3413 - acc: 0.9031 - val_loss: 0.3272 - val_acc: 0.9049
Epoch 37/60
60000/60000 [==============================] - 3s 48us/step - loss: 0.3382 - acc: 0.9039 - val_loss: 0.3247 - val_acc: 0.9059
Epoch 38/60
60000/60000 [==============================] - 3s 54us/step - loss: 0.3353 - acc: 0.9047 - val_loss: 0.3225 - val_acc: 0.9068
Epoch 39/60
60000/60000 [==============================] - 3s 52us/step - loss: 0.3325 - acc: 0.9052 - val_loss: 0.3195 - val_acc: 0.9076
Epoch 40/60
60000/60000 [==============================] - 3s 50us/step - loss: 0.3298 - acc: 0.9061 - val_loss: 0.3174 - val_acc: 0.9088
Epoch 41/60
60000/60000 [==============================] - 3s 48us/step - loss: 0.3273 - acc: 0.9064 - val_loss: 0.3151 - val_acc: 0.9096
Epoch 42/60
60000/60000 [==============================] - 3s 47us/step - loss: 0.3248 - acc: 0.9072 - val_loss: 0.3116 - val_acc: 0.9104
Epoch 43/60
60000/60000 [==============================] - 3s 46us/step - loss: 0.3225 - acc: 0.9074 - val_loss: 0.3099 - val_acc: 0.9107
Epoch 44/60
60000/60000 [==============================] - 3s 45us/step - loss: 0.3202 - acc: 0.9081 - val_loss: 0.3082 - val_acc: 0.9112
Epoch 45/60
60000/60000 [==============================] - 3s 45us/step - loss: 0.3180 - acc: 0.9088 - val_loss: 0.3061 - val_acc: 0.9124
Epoch 46/60
60000/60000 [==============================] - 3s 44us/step - loss: 0.3158 - acc: 0.9092 - val_loss: 0.3038 - val_acc: 0.9126
Epoch 47/60
60000/60000 [==============================] - 3s 44us/step - loss: 0.3139 - acc: 0.9099 - val_loss: 0.3024 - val_acc: 0.9127
Epoch 48/60
60000/60000 [==============================] - 3s 48us/step - loss: 0.3118 - acc: 0.9101 - val_loss: 0.3005 - val_acc: 0.9135
Epoch 49/60
60000/60000 [==============================] - 3s 45us/step - loss: 0.3098 - acc: 0.9114 - val_loss: 0.2985 - val_acc: 0.9136
Epoch 50/60
60000/60000 [==============================] - 3s 45us/step - loss: 0.3080 - acc: 0.9116 - val_loss: 0.2969 - val_acc: 0.9135
Epoch 51/60
60000/60000 [==============================] - 3s 47us/step - loss: 0.3062 - acc: 0.9119 - val_loss: 0.2957 - val_acc: 0.9147
Epoch 52/60
60000/60000 [==============================] - 3s 45us/step - loss: 0.3044 - acc: 0.9130 - val_loss: 0.2941 - val_acc: 0.9152
Epoch 53/60
60000/60000 [==============================] - 3s 48us/step - loss: 0.3027 - acc: 0.9135 - val_loss: 0.2928 - val_acc: 0.9152
Epoch 54/60
60000/60000 [==============================] - 3s 49us/step - loss: 0.3010 - acc: 0.9137 - val_loss: 0.2916 - val_acc: 0.9163
Epoch 55/60
60000/60000 [==============================] - 3s 56us/step - loss: 0.2994 - acc: 0.9140 - val_loss: 0.2890 - val_acc: 0.9157
Epoch 56/60
60000/60000 [==============================] - 4s 61us/step - loss: 0.2978 - acc: 0.9147 - val_loss: 0.2879 - val_acc: 0.9158
Epoch 57/60
60000/60000 [==============================] - 3s 48us/step - loss: 0.2962 - acc: 0.9150 - val_loss: 0.2864 - val_acc: 0.9176
Epoch 58/60
60000/60000 [==============================] - 3s 52us/step - loss: 0.2948 - acc: 0.9155 - val_loss: 0.2853 - val_acc: 0.9175
Epoch 59/60
60000/60000 [==============================] - 3s 48us/step - loss: 0.2933 - acc: 0.9158 - val_loss: 0.2838 - val_acc: 0.9178
Epoch 60/60
60000/60000 [==============================] - 3s 51us/step - loss: 0.2916 - acc: 0.9161 - val_loss: 0.2824 - val_acc: 0.9182
Out[31]:
<keras.callbacks.History at 0x11be5c8d0>

Surprisingly, this new network reaches only about 92% accuracy, just a bit better than the last one.

So maybe bigger is not better! Simply making the network bigger yields diminishing returns. We will need further improvements to get good results, and we will introduce some of them in the next notebook.

Before we do that, let's try what we have so far on CIFAR-10. CIFAR-10 is a dataset of 60,000 32x32x3 RGB color images of airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

The next cell imports the dataset and tells us about its shape.


In [51]:
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
num_classes = 10

print('%d train samples, %d test samples'%(x_train.shape[0], x_test.shape[0]))
print("training data shape: ", x_train.shape, y_train.shape)
print("test data shape: ", x_test.shape, y_test.shape)


50000 train samples, 10000 test samples
training data shape:  (50000, 32, 32, 3) (50000, 1)
test data shape:  (10000, 32, 32, 3) (10000, 1)
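
The labels are integers from 0 to 9. By the standard CIFAR-10 ordering (assumed here; the dataset itself only ships the integers), they correspond to the class names listed above:


In [ ]:
# Class names in the conventional CIFAR-10 label order (0-9); assumed, not
# provided by the dataset, which only contains the integer labels.
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

print("first training label:", y_train[0][0], "->", class_names[y_train[0][0]])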

Let's look at a random sample of images from CIFAR-10.


In [52]:
samples = np.concatenate([np.concatenate([x_train[i] for i in [int(random.random() * len(x_train)) for i in range(16)]], axis=1) for i in range(6)], axis=0)
plt.figure(figsize=(16,6))
plt.imshow(samples, cmap='gray')


Out[52]:
<matplotlib.image.AxesImage at 0x1256a9780>

As with MNIST, we need to pre-process the data by converting to float32 precision, reshaping so each row is a single input vector, and normalizing between 0 and 1.


In [53]:
# reshape to input vectors
x_train = x_train.reshape(50000, 32*32*3)
x_test = x_test.reshape(10000, 32*32*3)

# make float32
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# normalize to (0-1)
x_train /= 255
x_test /= 255

Convert labels to one-hot vectors.


In [54]:
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

Let's reuse the architecture of our first MNIST network (two hidden layers of 100 neurons) and see how it does on CIFAR-10. Note that the input_dim of the first layer is no longer 784 as it was for MNIST; it is now 32x32x3 = 3072, which means this network has more parameters than the MNIST network with the same architecture.
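
A quick check of the first-layer parameter counts shows where the growth comes from:


In [ ]:
print(3072 * 100 + 100)  # CIFAR-10 first layer: 307,300 weights + biases
print(784 * 100 + 100)   # MNIST first layer:     78,500 weights + biases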


In [55]:
model = Sequential()
model.add(Dense(100, activation='sigmoid', input_dim=3072))
model.add(Dense(100, activation='sigmoid'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_33 (Dense)             (None, 100)               307300    
_________________________________________________________________
dense_34 (Dense)             (None, 100)               10100     
_________________________________________________________________
dense_35 (Dense)             (None, 10)                1010      
=================================================================
Total params: 318,410
Trainable params: 318,410
Non-trainable params: 0
_________________________________________________________________

This network now has 318,410 parameters, more than even the larger 256-neuron MNIST network (269,322), because each input vector is now 3072 values long instead of 784. Let's compile it to learn with SGD and the same categorical cross-entropy loss function.


In [47]:
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

Now train for 60 epochs, same batch size.


In [48]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=60,
          verbose=1,
          validation_data=(x_test, y_test))


Train on 50000 samples, validate on 10000 samples
Epoch 1/60
50000/50000 [==============================] - 2s 44us/step - loss: 2.2953 - acc: 0.1635 - val_loss: 2.2710 - val_acc: 0.2044
Epoch 2/60
50000/50000 [==============================] - 2s 39us/step - loss: 2.2556 - acc: 0.2231 - val_loss: 2.2391 - val_acc: 0.2233
Epoch 3/60
50000/50000 [==============================] - 2s 41us/step - loss: 2.2188 - acc: 0.2426 - val_loss: 2.1973 - val_acc: 0.2510
Epoch 4/60
50000/50000 [==============================] - 2s 40us/step - loss: 2.1728 - acc: 0.2514 - val_loss: 2.1499 - val_acc: 0.2524
Epoch 5/60
50000/50000 [==============================] - 2s 39us/step - loss: 2.1261 - acc: 0.2618 - val_loss: 2.1060 - val_acc: 0.2519
Epoch 6/60
50000/50000 [==============================] - 2s 40us/step - loss: 2.0861 - acc: 0.2686 - val_loss: 2.0698 - val_acc: 0.2685
Epoch 7/60
50000/50000 [==============================] - 2s 39us/step - loss: 2.0537 - acc: 0.2761 - val_loss: 2.0405 - val_acc: 0.2734
Epoch 8/60
50000/50000 [==============================] - 2s 40us/step - loss: 2.0270 - acc: 0.2826 - val_loss: 2.0164 - val_acc: 0.2731
Epoch 9/60
50000/50000 [==============================] - 2s 40us/step - loss: 2.0042 - acc: 0.2865 - val_loss: 1.9949 - val_acc: 0.2907
Epoch 10/60
50000/50000 [==============================] - 2s 40us/step - loss: 1.9848 - acc: 0.2929 - val_loss: 1.9763 - val_acc: 0.2930
Epoch 11/60
50000/50000 [==============================] - 2s 40us/step - loss: 1.9676 - acc: 0.2987 - val_loss: 1.9616 - val_acc: 0.3052
Epoch 12/60
50000/50000 [==============================] - 2s 42us/step - loss: 1.9526 - acc: 0.3052 - val_loss: 1.9461 - val_acc: 0.3104
Epoch 13/60
50000/50000 [==============================] - 2s 40us/step - loss: 1.9389 - acc: 0.3108 - val_loss: 1.9327 - val_acc: 0.3146
Epoch 14/60
50000/50000 [==============================] - 2s 41us/step - loss: 1.9268 - acc: 0.3165 - val_loss: 1.9214 - val_acc: 0.3200
Epoch 15/60
50000/50000 [==============================] - 2s 41us/step - loss: 1.9155 - acc: 0.3223 - val_loss: 1.9115 - val_acc: 0.3209
Epoch 16/60
50000/50000 [==============================] - 2s 40us/step - loss: 1.9056 - acc: 0.3255 - val_loss: 1.9029 - val_acc: 0.3229
Epoch 17/60
50000/50000 [==============================] - 2s 42us/step - loss: 1.8965 - acc: 0.3308 - val_loss: 1.8921 - val_acc: 0.3268
Epoch 18/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.8880 - acc: 0.3329 - val_loss: 1.8845 - val_acc: 0.3393
Epoch 19/60
50000/50000 [==============================] - 2s 43us/step - loss: 1.8799 - acc: 0.3352 - val_loss: 1.8766 - val_acc: 0.3378
Epoch 20/60
50000/50000 [==============================] - 3s 57us/step - loss: 1.8728 - acc: 0.3381 - val_loss: 1.8685 - val_acc: 0.3401
Epoch 21/60
50000/50000 [==============================] - 3s 51us/step - loss: 1.8652 - acc: 0.3431 - val_loss: 1.8616 - val_acc: 0.3451
Epoch 22/60
50000/50000 [==============================] - 3s 56us/step - loss: 1.8582 - acc: 0.3457 - val_loss: 1.8552 - val_acc: 0.3389
Epoch 23/60
50000/50000 [==============================] - 3s 55us/step - loss: 1.8513 - acc: 0.3476 - val_loss: 1.8489 - val_acc: 0.3473
Epoch 24/60
50000/50000 [==============================] - 2s 44us/step - loss: 1.8445 - acc: 0.3503 - val_loss: 1.8416 - val_acc: 0.3550
Epoch 25/60
50000/50000 [==============================] - 2s 43us/step - loss: 1.8378 - acc: 0.3529 - val_loss: 1.8347 - val_acc: 0.3561
Epoch 26/60
50000/50000 [==============================] - 2s 45us/step - loss: 1.8314 - acc: 0.3544 - val_loss: 1.8275 - val_acc: 0.3566
Epoch 27/60
50000/50000 [==============================] - 2s 48us/step - loss: 1.8250 - acc: 0.3570 - val_loss: 1.8224 - val_acc: 0.3614
Epoch 28/60
50000/50000 [==============================] - 2s 44us/step - loss: 1.8187 - acc: 0.3586 - val_loss: 1.8156 - val_acc: 0.3615
Epoch 29/60
50000/50000 [==============================] - 2s 43us/step - loss: 1.8123 - acc: 0.3608 - val_loss: 1.8108 - val_acc: 0.3674
Epoch 30/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.8062 - acc: 0.3630 - val_loss: 1.8034 - val_acc: 0.3646
Epoch 31/60
50000/50000 [==============================] - 2s 45us/step - loss: 1.8008 - acc: 0.3640 - val_loss: 1.7986 - val_acc: 0.3657
Epoch 32/60
50000/50000 [==============================] - 2s 44us/step - loss: 1.7954 - acc: 0.3670 - val_loss: 1.7922 - val_acc: 0.3675
Epoch 33/60
50000/50000 [==============================] - 2s 42us/step - loss: 1.7899 - acc: 0.3686 - val_loss: 1.7869 - val_acc: 0.3682
Epoch 34/60
50000/50000 [==============================] - 2s 45us/step - loss: 1.7845 - acc: 0.3698 - val_loss: 1.7812 - val_acc: 0.3737
Epoch 35/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.7795 - acc: 0.3706 - val_loss: 1.7768 - val_acc: 0.3766
Epoch 36/60
50000/50000 [==============================] - 2s 42us/step - loss: 1.7742 - acc: 0.3722 - val_loss: 1.7718 - val_acc: 0.3721
Epoch 37/60
50000/50000 [==============================] - 2s 45us/step - loss: 1.7696 - acc: 0.3742 - val_loss: 1.7673 - val_acc: 0.3746
Epoch 38/60
50000/50000 [==============================] - 2s 44us/step - loss: 1.7646 - acc: 0.3765 - val_loss: 1.7639 - val_acc: 0.3697
Epoch 39/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.7603 - acc: 0.3766 - val_loss: 1.7566 - val_acc: 0.3808
Epoch 40/60
50000/50000 [==============================] - 2s 42us/step - loss: 1.7557 - acc: 0.3790 - val_loss: 1.7530 - val_acc: 0.3811
Epoch 41/60
50000/50000 [==============================] - 2s 42us/step - loss: 1.7512 - acc: 0.3804 - val_loss: 1.7476 - val_acc: 0.3834
Epoch 42/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.7469 - acc: 0.3812 - val_loss: 1.7437 - val_acc: 0.3833
Epoch 43/60
50000/50000 [==============================] - 2s 48us/step - loss: 1.7425 - acc: 0.3826 - val_loss: 1.7393 - val_acc: 0.3876
Epoch 44/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.7380 - acc: 0.3856 - val_loss: 1.7362 - val_acc: 0.3870
Epoch 45/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.7341 - acc: 0.3865 - val_loss: 1.7307 - val_acc: 0.3881
Epoch 46/60
50000/50000 [==============================] - 2s 47us/step - loss: 1.7299 - acc: 0.3870 - val_loss: 1.7269 - val_acc: 0.3917
Epoch 47/60
50000/50000 [==============================] - 2s 48us/step - loss: 1.7254 - acc: 0.3888 - val_loss: 1.7253 - val_acc: 0.3899
Epoch 48/60
50000/50000 [==============================] - 2s 49us/step - loss: 1.7216 - acc: 0.3905 - val_loss: 1.7190 - val_acc: 0.3896
Epoch 49/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.7177 - acc: 0.3927 - val_loss: 1.7144 - val_acc: 0.3954
Epoch 50/60
50000/50000 [==============================] - 2s 48us/step - loss: 1.7133 - acc: 0.3932 - val_loss: 1.7108 - val_acc: 0.3913
Epoch 51/60
50000/50000 [==============================] - 2s 48us/step - loss: 1.7096 - acc: 0.3960 - val_loss: 1.7077 - val_acc: 0.3975
Epoch 52/60
50000/50000 [==============================] - 2s 49us/step - loss: 1.7057 - acc: 0.3967 - val_loss: 1.7036 - val_acc: 0.3992
Epoch 53/60
50000/50000 [==============================] - 2s 50us/step - loss: 1.7016 - acc: 0.3978 - val_loss: 1.6998 - val_acc: 0.4012
Epoch 54/60
50000/50000 [==============================] - 2s 47us/step - loss: 1.6974 - acc: 0.3994 - val_loss: 1.6949 - val_acc: 0.4033
Epoch 55/60
50000/50000 [==============================] - 2s 50us/step - loss: 1.6938 - acc: 0.4008 - val_loss: 1.6905 - val_acc: 0.4011
Epoch 56/60
50000/50000 [==============================] - 2s 50us/step - loss: 1.6897 - acc: 0.4019 - val_loss: 1.6893 - val_acc: 0.4035
Epoch 57/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.6860 - acc: 0.4037 - val_loss: 1.6837 - val_acc: 0.4048
Epoch 58/60
50000/50000 [==============================] - 2s 46us/step - loss: 1.6821 - acc: 0.4045 - val_loss: 1.6802 - val_acc: 0.4047
Epoch 59/60
50000/50000 [==============================] - 2s 50us/step - loss: 1.6784 - acc: 0.4064 - val_loss: 1.6755 - val_acc: 0.4072
Epoch 60/60
50000/50000 [==============================] - 3s 57us/step - loss: 1.6749 - acc: 0.4074 - val_loss: 1.6734 - val_acc: 0.4098
Out[48]:
<keras.callbacks.History at 0x119700780>

After 60 epochs, our network reaches only about 41% accuracy. That is still better than random guessing (10%), but it is a poor result: the current record for CIFAR-10 is about 97% accuracy. So we have a long way to go!

In the next notebook, we will introduce convolutional neural networks, which will greatly improve our performance.