In [1]:
import keras
keras.__version__


Using TensorFlow backend.
Out[1]:
'2.0.8'

Overfitting and underfitting

This notebook contains the code samples found in Chapter 3, Section 6 of Deep Learning with Python. Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.


In all the examples we saw in the previous chapter -- movie review sentiment prediction, topic classification, and house price regression -- the performance of our model on the held-out validation data would always peak after a few epochs and then start degrading: our model would quickly start to overfit the training data. Overfitting happens in every single machine learning problem. Learning how to deal with overfitting is essential to mastering machine learning.

The fundamental issue in machine learning is the tension between optimization and generalization. "Optimization" refers to the process of adjusting a model to get the best performance possible on the training data (the "learning" in "machine learning"), while "generalization" refers to how well the trained model would perform on data it has never seen before. The goal of the game is to get good generalization, of course, but you do not control generalization; you can only adjust the model based on its training data.

At the beginning of training, optimization and generalization are correlated: the lower your loss on training data, the lower your loss on test data. While this is happening, your model is said to be under-fit: there is still progress to be made; the network hasn't yet modeled all relevant patterns in the training data. But after a certain number of iterations on the training data, generalization stops improving, and validation metrics stall and then start degrading: the model is starting to over-fit, i.e. it is starting to learn patterns that are specific to the training data but that are misleading or irrelevant when it comes to new data.

To prevent a model from learning misleading or irrelevant patterns found in the training data, the best solution is of course to get more training data. A model trained on more data will naturally generalize better. When that isn't possible, the next best solution is to modulate the quantity of information that your model is allowed to store, or to add constraints on what information it is allowed to store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.

The process of fighting overfitting in this way is called regularization. Let's review some of the most common regularization techniques, and let's apply them in practice to improve our movie classification model from the previous chapter.

Note: in this notebook we will be using the IMDB test set as our validation set. It doesn't matter in this context.

Let's prepare the data using the code from Chapter 3, Section 5:


In [2]:
from keras.datasets import imdb
import numpy as np

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
# Our vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

Fighting overfitting

Reducing the network's size

The simplest way to prevent overfitting is to reduce the size of the model, i.e. the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model's "capacity". Intuitively, a model with more parameters will have more "memorization capacity" and therefore will be able to easily learn a perfect dictionary-like mapping between training samples and their targets, a mapping without any generalization power. For instance, a model with 500,000 binary parameters could easily be made to learn the class of every digit in the MNIST training set: we would only need 10 binary parameters for each of the 50,000 digits. Such a model would be useless for classifying new digit samples. Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.
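
As a quick illustration of how capacity is counted, Keras can report a model's number of learnable parameters directly via count_params(). Here is a standalone sketch (not part of the book's code; demo_model is just an illustrative name):


In [ ]:
from keras import models, layers

# A standalone copy of the 16-unit architecture used later in this notebook,
# built only to show how the parameter count breaks down.
demo_model = models.Sequential()
demo_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
demo_model.add(layers.Dense(16, activation='relu'))
demo_model.add(layers.Dense(1, activation='sigmoid'))

# (10000 * 16 + 16) + (16 * 16 + 16) + (16 + 1) = 160,305 learnable parameters
print(demo_model.count_params())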

On the other hand, if the network has limited memorization resources, it will not be able to learn this mapping as easily, and thus, in order to minimize its loss, it will have to resort to learning compressed representations that have predictive power regarding the targets -- precisely the type of representations that we are interested in. At the same time, keep in mind that you should be using models that have enough parameters that they won't be underfitting: your model shouldn't be starved for memorization resources. There is a compromise to be found between "too much capacity" and "not enough capacity".

Unfortunately, there is no magical formula to determine what the right number of layers is, or what the right size for each layer is. You will have to evaluate an array of different architectures (on your validation set, not on your test set, of course) in order to find the right model size for your data. The general workflow to find an appropriate model size is to start with relatively few layers and parameters, and start increasing the size of the layers or adding new layers until you see diminishing returns with regard to the validation loss.
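
As a sketch of that workflow (assuming the x_train, y_train, x_test and y_test arrays prepared above; build_model is a hypothetical helper, and this is illustrative rather than the book's code), one could train the same two-layer architecture at a few different widths and compare the best validation loss each width reaches:


In [ ]:
from keras import models, layers

def build_model(units):
    # Same two-hidden-layer architecture, parameterized by layer width
    model = models.Sequential()
    model.add(layers.Dense(units, activation='relu', input_shape=(10000,)))
    model.add(layers.Dense(units, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])
    return model

for units in [4, 16, 512]:
    history = build_model(units).fit(x_train, y_train,
                                     epochs=20, batch_size=512,
                                     validation_data=(x_test, y_test),
                                     verbose=0)
    # Lowest validation loss reached at this width
    print(units, min(history.history['val_loss']))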

Let's try this on our movie review classification network. Our original network was as follows:


In [4]:
from keras import models
from keras import layers

original_model = models.Sequential()
original_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
original_model.add(layers.Dense(16, activation='relu'))
original_model.add(layers.Dense(1, activation='sigmoid'))

original_model.compile(optimizer='rmsprop',
                       loss='binary_crossentropy',
                       metrics=['acc'])

Now let's try to replace it with this smaller network:


In [5]:
smaller_model = models.Sequential()
smaller_model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
smaller_model.add(layers.Dense(4, activation='relu'))
smaller_model.add(layers.Dense(1, activation='sigmoid'))

smaller_model.compile(optimizer='rmsprop',
                      loss='binary_crossentropy',
                      metrics=['acc'])

Here's a comparison of the validation losses of the original network and the smaller network. The dots are the validation loss values of the smaller network, and the crosses are the initial network (remember: a lower validation loss signals a better model).


In [6]:
original_hist = original_model.fit(x_train, y_train,
                                   epochs=20,
                                   batch_size=512,
                                   validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/20
25000/25000 [==============================] - 3s - loss: 0.4594 - acc: 0.8188 - val_loss: 0.3429 - val_acc: 0.8812
Epoch 2/20
25000/25000 [==============================] - 2s - loss: 0.2659 - acc: 0.9075 - val_loss: 0.2873 - val_acc: 0.8905
Epoch 3/20
25000/25000 [==============================] - 2s - loss: 0.2060 - acc: 0.9276 - val_loss: 0.2827 - val_acc: 0.8879
Epoch 4/20
25000/25000 [==============================] - 2s - loss: 0.1697 - acc: 0.9401 - val_loss: 0.2921 - val_acc: 0.8851
Epoch 5/20
25000/25000 [==============================] - 2s - loss: 0.1495 - acc: 0.9470 - val_loss: 0.3100 - val_acc: 0.8812
Epoch 6/20
25000/25000 [==============================] - 2s - loss: 0.1283 - acc: 0.9572 - val_loss: 0.3336 - val_acc: 0.8748
Epoch 7/20
25000/25000 [==============================] - 2s - loss: 0.1121 - acc: 0.9624 - val_loss: 0.3987 - val_acc: 0.8593
Epoch 8/20
25000/25000 [==============================] - 2s - loss: 0.0994 - acc: 0.9670 - val_loss: 0.3788 - val_acc: 0.8702
Epoch 9/20
25000/25000 [==============================] - 2s - loss: 0.0889 - acc: 0.9716 - val_loss: 0.4242 - val_acc: 0.8603
Epoch 10/20
25000/25000 [==============================] - 2s - loss: 0.0782 - acc: 0.9757 - val_loss: 0.4256 - val_acc: 0.8653
Epoch 11/20
25000/25000 [==============================] - 2s - loss: 0.0691 - acc: 0.9792 - val_loss: 0.4515 - val_acc: 0.8638
Epoch 12/20
25000/25000 [==============================] - 2s - loss: 0.0603 - acc: 0.9820 - val_loss: 0.5102 - val_acc: 0.8610
Epoch 13/20
25000/25000 [==============================] - 2s - loss: 0.0518 - acc: 0.9851 - val_loss: 0.5281 - val_acc: 0.8587
Epoch 14/20
25000/25000 [==============================] - 2s - loss: 0.0446 - acc: 0.9873 - val_loss: 0.5441 - val_acc: 0.8589
Epoch 15/20
25000/25000 [==============================] - 2s - loss: 0.0367 - acc: 0.9903 - val_loss: 0.5777 - val_acc: 0.8574
Epoch 16/20
25000/25000 [==============================] - 2s - loss: 0.0313 - acc: 0.9922 - val_loss: 0.6377 - val_acc: 0.8555
Epoch 17/20
25000/25000 [==============================] - 2s - loss: 0.0247 - acc: 0.9941 - val_loss: 0.7269 - val_acc: 0.8501
Epoch 18/20
25000/25000 [==============================] - 2s - loss: 0.0203 - acc: 0.9956 - val_loss: 0.6920 - val_acc: 0.8516
Epoch 19/20
25000/25000 [==============================] - 2s - loss: 0.0156 - acc: 0.9970 - val_loss: 0.7689 - val_acc: 0.8425
Epoch 20/20
25000/25000 [==============================] - 2s - loss: 0.0144 - acc: 0.9966 - val_loss: 0.7694 - val_acc: 0.8487

In [7]:
smaller_model_hist = smaller_model.fit(x_train, y_train,
                                       epochs=20,
                                       batch_size=512,
                                       validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/20
25000/25000 [==============================] - 2s - loss: 0.5737 - acc: 0.8049 - val_loss: 0.4826 - val_acc: 0.8616
Epoch 2/20
25000/25000 [==============================] - 2s - loss: 0.3973 - acc: 0.8866 - val_loss: 0.3699 - val_acc: 0.8776
Epoch 3/20
25000/25000 [==============================] - 2s - loss: 0.2985 - acc: 0.9054 - val_loss: 0.3140 - val_acc: 0.8860
Epoch 4/20
25000/25000 [==============================] - 2s - loss: 0.2428 - acc: 0.9189 - val_loss: 0.2913 - val_acc: 0.8870
Epoch 5/20
25000/25000 [==============================] - 2s - loss: 0.2085 - acc: 0.9290 - val_loss: 0.2809 - val_acc: 0.8897
Epoch 6/20
25000/25000 [==============================] - 2s - loss: 0.1849 - acc: 0.9360 - val_loss: 0.2772 - val_acc: 0.8899
Epoch 7/20
25000/25000 [==============================] - 2s - loss: 0.1666 - acc: 0.9430 - val_loss: 0.2835 - val_acc: 0.8863
Epoch 8/20
25000/25000 [==============================] - 2s - loss: 0.1515 - acc: 0.9487 - val_loss: 0.2909 - val_acc: 0.8850
Epoch 9/20
25000/25000 [==============================] - 2s - loss: 0.1388 - acc: 0.9526 - val_loss: 0.2984 - val_acc: 0.8842
Epoch 10/20
25000/25000 [==============================] - 2s - loss: 0.1285 - acc: 0.9569 - val_loss: 0.3102 - val_acc: 0.8818
Epoch 11/20
25000/25000 [==============================] - 2s - loss: 0.1194 - acc: 0.9599 - val_loss: 0.3219 - val_acc: 0.8794
Epoch 12/20
25000/25000 [==============================] - 2s - loss: 0.1105 - acc: 0.9648 - val_loss: 0.3379 - val_acc: 0.8774
Epoch 13/20
25000/25000 [==============================] - 2s - loss: 0.1035 - acc: 0.9674 - val_loss: 0.3532 - val_acc: 0.8730
Epoch 14/20
25000/25000 [==============================] - 2s - loss: 0.0963 - acc: 0.9688 - val_loss: 0.3651 - val_acc: 0.8731
Epoch 15/20
25000/25000 [==============================] - 2s - loss: 0.0895 - acc: 0.9724 - val_loss: 0.3858 - val_acc: 0.8703
Epoch 16/20
25000/25000 [==============================] - 2s - loss: 0.0838 - acc: 0.9734 - val_loss: 0.4157 - val_acc: 0.8654
Epoch 17/20
25000/25000 [==============================] - 2s - loss: 0.0785 - acc: 0.9764 - val_loss: 0.4214 - val_acc: 0.8677
Epoch 18/20
25000/25000 [==============================] - 2s - loss: 0.0733 - acc: 0.9784 - val_loss: 0.4390 - val_acc: 0.8644
Epoch 19/20
25000/25000 [==============================] - 2s - loss: 0.0685 - acc: 0.9796 - val_loss: 0.4539 - val_acc: 0.8638
Epoch 20/20
25000/25000 [==============================] - 2s - loss: 0.0638 - acc: 0.9822 - val_loss: 0.4744 - val_acc: 0.8617

In [8]:
epochs = range(1, 21)
original_val_loss = original_hist.history['val_loss']
smaller_model_val_loss = smaller_model_hist.history['val_loss']

In [9]:
import matplotlib.pyplot as plt

# "b+" is for "blue cross"
plt.plot(epochs, original_val_loss, 'b+', label='Original model')
# "bo" is for "blue dot"
plt.plot(epochs, smaller_model_val_loss, 'bo', label='Smaller model')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.legend()

plt.show()


As you can see, the smaller network starts overfitting later than the reference one (after 6 epochs rather than 4) and its performance degrades much more slowly once it starts overfitting.
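
To read those turning points off the curves numerically, here is a small sketch using the original_val_loss and smaller_model_val_loss lists extracted above; it prints the epoch at which each validation loss bottoms out:


In [ ]:
import numpy as np

# Epoch (1-based) with the lowest validation loss -- roughly where overfitting sets in
print('Original model:', np.argmin(original_val_loss) + 1)
print('Smaller model: ', np.argmin(smaller_model_val_loss) + 1)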

Now, for kicks, let's add to this benchmark a network that has much more capacity, far more than the problem would warrant:


In [11]:
bigger_model = models.Sequential()
bigger_model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
bigger_model.add(layers.Dense(512, activation='relu'))
bigger_model.add(layers.Dense(1, activation='sigmoid'))

bigger_model.compile(optimizer='rmsprop',
                     loss='binary_crossentropy',
                     metrics=['acc'])

In [12]:
bigger_model_hist = bigger_model.fit(x_train, y_train,
                                     epochs=20,
                                     batch_size=512,
                                     validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/20
25000/25000 [==============================] - 3s - loss: 0.4539 - acc: 0.8011 - val_loss: 0.4150 - val_acc: 0.8229
Epoch 2/20
25000/25000 [==============================] - 3s - loss: 0.2148 - acc: 0.9151 - val_loss: 0.2742 - val_acc: 0.8901
Epoch 3/20
25000/25000 [==============================] - 3s - loss: 0.1217 - acc: 0.9544 - val_loss: 0.5442 - val_acc: 0.7975
Epoch 4/20
25000/25000 [==============================] - 3s - loss: 0.0552 - acc: 0.9835 - val_loss: 0.4316 - val_acc: 0.8842
Epoch 5/20
25000/25000 [==============================] - 3s - loss: 0.0662 - acc: 0.9888 - val_loss: 0.5098 - val_acc: 0.8822
Epoch 6/20
25000/25000 [==============================] - 3s - loss: 0.0017 - acc: 0.9998 - val_loss: 0.6867 - val_acc: 0.8811
Epoch 7/20
25000/25000 [==============================] - 3s - loss: 0.1019 - acc: 0.9882 - val_loss: 0.6737 - val_acc: 0.8800
Epoch 8/20
25000/25000 [==============================] - 3s - loss: 0.0735 - acc: 0.9896 - val_loss: 0.6185 - val_acc: 0.8772
Epoch 9/20
25000/25000 [==============================] - 3s - loss: 3.4759e-04 - acc: 1.0000 - val_loss: 0.7328 - val_acc: 0.8818
Epoch 10/20
25000/25000 [==============================] - 3s - loss: 0.0504 - acc: 0.9912 - val_loss: 0.7092 - val_acc: 0.8791
Epoch 11/20
25000/25000 [==============================] - 3s - loss: 0.0589 - acc: 0.9919 - val_loss: 0.6831 - val_acc: 0.8785
Epoch 12/20
25000/25000 [==============================] - 3s - loss: 1.9067e-04 - acc: 1.0000 - val_loss: 0.8005 - val_acc: 0.8784
Epoch 13/20
25000/25000 [==============================] - 3s - loss: 0.0623 - acc: 0.9916 - val_loss: 0.7540 - val_acc: 0.8740
Epoch 14/20
25000/25000 [==============================] - 3s - loss: 0.0274 - acc: 0.9954 - val_loss: 0.7806 - val_acc: 0.8670
Epoch 15/20
25000/25000 [==============================] - 3s - loss: 0.0011 - acc: 0.9998 - val_loss: 0.8107 - val_acc: 0.8783
Epoch 16/20
25000/25000 [==============================] - 3s - loss: 0.0445 - acc: 0.9943 - val_loss: 0.8394 - val_acc: 0.8702
Epoch 17/20
25000/25000 [==============================] - 3s - loss: 0.0268 - acc: 0.9959 - val_loss: 0.7708 - val_acc: 0.8745
Epoch 18/20
25000/25000 [==============================] - 3s - loss: 7.7057e-04 - acc: 1.0000 - val_loss: 0.8885 - val_acc: 0.8738
Epoch 19/20
25000/25000 [==============================] - 3s - loss: 0.0297 - acc: 0.9962 - val_loss: 0.8419 - val_acc: 0.8728
Epoch 20/20
25000/25000 [==============================] - 3s - loss: 0.0018 - acc: 0.9998 - val_loss: 0.8896 - val_acc: 0.8682

Here's how the bigger network fares compared to the reference one. The dots are the validation loss values of the bigger network, and the crosses are the initial network.


In [26]:
bigger_model_val_loss = bigger_model_hist.history['val_loss']

plt.plot(epochs, original_val_loss, 'b+', label='Original model')
plt.plot(epochs, bigger_model_val_loss, 'bo', label='Bigger model')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.legend()

plt.show()


The bigger network starts overfitting almost right away, after just one epoch, and overfits much more severely. Its validation loss is also noisier.

Meanwhile, here are the training losses for our two networks:


In [28]:
original_train_loss = original_hist.history['loss']
bigger_model_train_loss = bigger_model_hist.history['loss']

plt.plot(epochs, original_train_loss, 'b+', label='Original model')
plt.plot(epochs, bigger_model_train_loss, 'bo', label='Bigger model')
plt.xlabel('Epochs')
plt.ylabel('Training loss')
plt.legend()

plt.show()


As you can see, the bigger network gets its training loss near zero very quickly. The more capacity the network has, the quicker it will be able to model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss).
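
One crude way to quantify that difference (a quick sketch reusing the loss lists extracted above) is to compare validation and training loss at the final epoch for each model:


In [ ]:
# Gap between validation and training loss after 20 epochs --
# a rough proxy for how badly each model overfits.
print('Original model gap:', original_val_loss[-1] - original_train_loss[-1])
print('Bigger model gap:  ', bigger_model_val_loss[-1] - bigger_model_train_loss[-1])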

Adding weight regularization

You may be familiar with the principle of Occam's razor: given two explanations for something, the explanation most likely to be correct is the "simplest" one, the one that makes the fewest assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weight values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.

A "simple model" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to only take small values, which makes the distribution of weight values more "regular". This is called "weight regularization", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

  • L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights).
  • L2 regularization, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to the squared "L2 norm" of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: weight decay is mathematically identical to L2 regularization.
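
Concretely, both penalties are simple functions of the weight values. Here is a minimal, purely illustrative NumPy sketch (W and lam are made-up names for a weight matrix and a regularization factor):


In [ ]:
import numpy as np

W = np.random.randn(4, 4)   # a toy weight matrix
lam = 0.001                 # regularization factor

l1_penalty = lam * np.abs(W).sum()       # proportional to the sum of absolute weights
l2_penalty = lam * np.square(W).sum()    # proportional to the sum of squared weights

print(l1_penalty, l2_penalty)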

In Keras, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Let's add L2 weight regularization to our movie review classification network:


In [17]:
from keras import regularizers

l2_model = models.Sequential()
l2_model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                          activation='relu', input_shape=(10000,)))
l2_model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                          activation='relu'))
l2_model.add(layers.Dense(1, activation='sigmoid'))

In [18]:
l2_model.compile(optimizer='rmsprop',
                 loss='binary_crossentropy',
                 metrics=['acc'])

l2(0.001) means that every coefficient in the weight matrix of the layer will add 0.001 * weight_coefficient_value ** 2 to the total loss of the network. Note that because this penalty is only added at training time, the loss for this network will be much higher at training than at test time.

Here's the impact of our L2 regularization penalty:


In [19]:
l2_model_hist = l2_model.fit(x_train, y_train,
                             epochs=20,
                             batch_size=512,
                             validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/20
25000/25000 [==============================] - 3s - loss: 0.4880 - acc: 0.8218 - val_loss: 0.3820 - val_acc: 0.8798
Epoch 2/20
25000/25000 [==============================] - 2s - loss: 0.3162 - acc: 0.9068 - val_loss: 0.3353 - val_acc: 0.8896
Epoch 3/20
25000/25000 [==============================] - 2s - loss: 0.2742 - acc: 0.9185 - val_loss: 0.3306 - val_acc: 0.8898
Epoch 4/20
25000/25000 [==============================] - 2s - loss: 0.2489 - acc: 0.9288 - val_loss: 0.3363 - val_acc: 0.8866
Epoch 5/20
25000/25000 [==============================] - 2s - loss: 0.2420 - acc: 0.9318 - val_loss: 0.3492 - val_acc: 0.8820
Epoch 6/20
25000/25000 [==============================] - 2s - loss: 0.2322 - acc: 0.9359 - val_loss: 0.3567 - val_acc: 0.8788
Epoch 7/20
25000/25000 [==============================] - 2s - loss: 0.2254 - acc: 0.9385 - val_loss: 0.3632 - val_acc: 0.8787
Epoch 8/20
25000/25000 [==============================] - 2s - loss: 0.2219 - acc: 0.9380 - val_loss: 0.3630 - val_acc: 0.8794
Epoch 9/20
25000/25000 [==============================] - 2s - loss: 0.2162 - acc: 0.9430 - val_loss: 0.3704 - val_acc: 0.8763
Epoch 10/20
25000/25000 [==============================] - 2s - loss: 0.2144 - acc: 0.9428 - val_loss: 0.3876 - val_acc: 0.8727
Epoch 11/20
25000/25000 [==============================] - 2s - loss: 0.2091 - acc: 0.9439 - val_loss: 0.3883 - val_acc: 0.8724
Epoch 12/20
25000/25000 [==============================] - 2s - loss: 0.2061 - acc: 0.9455 - val_loss: 0.3870 - val_acc: 0.8740
Epoch 13/20
25000/25000 [==============================] - 2s - loss: 0.2069 - acc: 0.9445 - val_loss: 0.4073 - val_acc: 0.8714
Epoch 14/20
25000/25000 [==============================] - 2s - loss: 0.2028 - acc: 0.9475 - val_loss: 0.3976 - val_acc: 0.8714
Epoch 15/20
25000/25000 [==============================] - 2s - loss: 0.1998 - acc: 0.9472 - val_loss: 0.4362 - val_acc: 0.8670
Epoch 16/20
25000/25000 [==============================] - 2s - loss: 0.2019 - acc: 0.9462 - val_loss: 0.4088 - val_acc: 0.8711
Epoch 17/20
25000/25000 [==============================] - 2s - loss: 0.1953 - acc: 0.9495 - val_loss: 0.4185 - val_acc: 0.8698
Epoch 18/20
25000/25000 [==============================] - 2s - loss: 0.1945 - acc: 0.9508 - val_loss: 0.4371 - val_acc: 0.8674
Epoch 19/20
25000/25000 [==============================] - 2s - loss: 0.1934 - acc: 0.9486 - val_loss: 0.4136 - val_acc: 0.8699
Epoch 20/20
25000/25000 [==============================] - 2s - loss: 0.1924 - acc: 0.9504 - val_loss: 0.4200 - val_acc: 0.8704

In [30]:
l2_model_val_loss = l2_model_hist.history['val_loss']

plt.plot(epochs, original_val_loss, 'b+', label='Original model')
plt.plot(epochs, l2_model_val_loss, 'bo', label='L2-regularized model')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.legend()

plt.show()


As you can see, the model with L2 regularization (dots) has become much more resistant to overfitting than the reference model (crosses), even though both models have the same number of parameters.

As alternatives to L2 regularization, you could use one of the following Keras weight regularizers:


In [ ]:
from keras import regularizers

# L1 regularization
regularizers.l1(0.001)

# L1 and L2 regularization at the same time
regularizers.l1_l2(l1=0.001, l2=0.001)

Adding dropout

Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto. Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1]. The "dropout rate" is the fraction of the features that are zeroed out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out; instead, the layer's output values are scaled down by a factor equal to the fraction of units that were kept (one minus the dropout rate), so as to balance for the fact that more units are active than at training time.

Consider a Numpy matrix containing the output of a layer, layer_output, of shape (batch_size, features). At training time, we would zero out at random a fraction of the values in the matrix:


In [ ]:
# At training time: we drop out 50% of the units in the output
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)

At test time, we would scale the output down by the fraction of units we kept. Here we scale by 0.5 (because we were previously dropping half the units):


In [ ]:
# At test time:
layer_output *= 0.5

Note that this process can be implemented by doing both operations at training time and leaving the output unchanged at test time, which is often the way it is implemented in practice:


In [ ]:
# At training time:
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
# Note that we are scaling *up* rather than scaling *down* in this case
layer_output /= 0.5

This technique may seem strange and arbitrary. Why would this help reduce overfitting? Geoff Hinton has said that he was inspired, among other things, by a fraud prevention mechanism used by banks -- in his own words: "I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting".

The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that are not significant (what Hinton refers to as "conspiracies"), which the network would start memorizing if no noise was present.

In Keras you can introduce dropout in a network via the Dropout layer, which gets applied to the output of the layer right before it, e.g.:


In [ ]:
model.add(layers.Dropout(0.5))

Let's add two Dropout layers in our IMDB network to see how well they do at reducing overfitting:


In [22]:
dpt_model = models.Sequential()
dpt_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
dpt_model.add(layers.Dropout(0.5))
dpt_model.add(layers.Dense(16, activation='relu'))
dpt_model.add(layers.Dropout(0.5))
dpt_model.add(layers.Dense(1, activation='sigmoid'))

dpt_model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])

In [23]:
dpt_model_hist = dpt_model.fit(x_train, y_train,
                               epochs=20,
                               batch_size=512,
                               validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/20
25000/25000 [==============================] - 3s - loss: 0.6035 - acc: 0.6678 - val_loss: 0.4704 - val_acc: 0.8651
Epoch 2/20
25000/25000 [==============================] - 2s - loss: 0.4622 - acc: 0.8002 - val_loss: 0.3612 - val_acc: 0.8724
Epoch 3/20
25000/25000 [==============================] - 2s - loss: 0.3731 - acc: 0.8553 - val_loss: 0.2960 - val_acc: 0.8904
Epoch 4/20
25000/25000 [==============================] - 2s - loss: 0.3162 - acc: 0.8855 - val_loss: 0.2772 - val_acc: 0.8917
Epoch 5/20
25000/25000 [==============================] - 2s - loss: 0.2762 - acc: 0.9033 - val_loss: 0.2803 - val_acc: 0.8889
Epoch 6/20
25000/25000 [==============================] - 2s - loss: 0.2454 - acc: 0.9172 - val_loss: 0.2823 - val_acc: 0.8892
Epoch 7/20
25000/25000 [==============================] - 2s - loss: 0.2178 - acc: 0.9281 - val_loss: 0.2982 - val_acc: 0.8877
Epoch 8/20
25000/25000 [==============================] - 2s - loss: 0.1994 - acc: 0.9351 - val_loss: 0.3101 - val_acc: 0.8875
Epoch 9/20
25000/25000 [==============================] - 2s - loss: 0.1832 - acc: 0.9400 - val_loss: 0.3318 - val_acc: 0.8860
Epoch 10/20
25000/25000 [==============================] - 2s - loss: 0.1692 - acc: 0.9434 - val_loss: 0.3534 - val_acc: 0.8841
Epoch 11/20
25000/25000 [==============================] - 2s - loss: 0.1590 - acc: 0.9483 - val_loss: 0.3689 - val_acc: 0.8830
Epoch 12/20
25000/25000 [==============================] - 2s - loss: 0.1499 - acc: 0.9496 - val_loss: 0.4107 - val_acc: 0.8776
Epoch 13/20
25000/25000 [==============================] - 2s - loss: 0.1405 - acc: 0.9539 - val_loss: 0.4114 - val_acc: 0.8782
Epoch 14/20
25000/25000 [==============================] - 2s - loss: 0.1333 - acc: 0.9562 - val_loss: 0.4549 - val_acc: 0.8771
Epoch 15/20
25000/25000 [==============================] - 2s - loss: 0.1267 - acc: 0.9572 - val_loss: 0.4579 - val_acc: 0.8800
Epoch 16/20
25000/25000 [==============================] - 2s - loss: 0.1225 - acc: 0.9580 - val_loss: 0.4843 - val_acc: 0.8772
Epoch 17/20
25000/25000 [==============================] - 2s - loss: 0.1233 - acc: 0.9590 - val_loss: 0.4783 - val_acc: 0.8761
Epoch 18/20
25000/25000 [==============================] - 2s - loss: 0.1212 - acc: 0.9601 - val_loss: 0.5051 - val_acc: 0.8740
Epoch 19/20
25000/25000 [==============================] - 2s - loss: 0.1153 - acc: 0.9618 - val_loss: 0.5451 - val_acc: 0.8747
Epoch 20/20
25000/25000 [==============================] - 2s - loss: 0.1155 - acc: 0.9621 - val_loss: 0.5358 - val_acc: 0.8738

Let's plot the results:


In [32]:
dpt_model_val_loss = dpt_model_hist.history['val_loss']

plt.plot(epochs, original_val_loss, 'b+', label='Original model')
plt.plot(epochs, dpt_model_val_loss, 'bo', label='Dropout-regularized model')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.legend()

plt.show()


Again, a clear improvement over the reference network.

To recap: here are the most common ways to prevent overfitting in neural networks:

  • Getting more training data.
  • Reducing the capacity of the network.
  • Adding weight regularization.
  • Adding dropout.