Training a better model


In [1]:
from theano.sandbox import cuda


WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1060 6GB (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)

In [2]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function


Using Theano backend.

In [3]:
#path = os.path.join('input','sample')
path = os.path.join('input','sample-10')
#path = os.path.join('input')
output_path = os.path.join('output','sample')
model_path = os.path.join(output_path, 'models')
if not os.path.exists(model_path): os.mkdir(model_path)

#batch_size=64
#batch_size=32
#batch_size=16
batch_size=8

Are we underfitting?

Our validation accuracy so far has generally been higher than our training accuracy. That leads to two obvious questions:

  1. How is this possible?
  2. Is this desirable?

The answer to (1) is that this is happening because of dropout. Dropout refers to a layer that randomly deletes (i.e. sets to zero) each activation in the previous layer with probability p (generally 0.5). This only happens during training, not when calculating the accuracy on the validation set, which is why the validation set can show higher accuracy than the training set.

The purpose of dropout is to avoid overfitting. By deleting parts of the neural network at random during training, it ensures that no one part of the network can overfit to one part of the training set. The creation of dropout was one of the key developments in deep learning, and has allowed us to create rich models without overfitting. However, it can also result in underfitting if overused, and this is something we should be careful of with our model.

So the answer to (2) is: this is probably not desirable. It is likely that we can get better validation set results with less (or no) dropout, if we're seeing that validation accuracy is higher than training accuracy - a strong sign of underfitting. So let's try removing dropout entirely, and see what happens!

(We had dropout in this model already because the VGG authors found it necessary for the imagenet competition. But that doesn't mean it's necessary for dogs v cats, so we will do our own analysis of regularization approaches from scratch.)

Removing dropout

Our high level approach here will be to start with our fine-tuned cats vs dogs model (with dropout), then fine-tune all the dense layers, after removing dropout from them. The steps we will take are:

  • Re-create and load our modified VGG model with binary dependent (i.e. dogs v cats)
  • Split the model between the convolutional (conv) layers and the dense layers
  • Pre-calculate the output of the conv layers, so that we don't have to redundently re-calculate them on every epoch
  • Create a new model with just the dense layers, and dropout p set to zero
  • Train this new model using the output of the conv layers as training data.

Start with Vgg + binary output (dogs/cats)

As before we need to start with a working model, so let's bring in our working VGG 16 model and change it to predict our binary dependent...


In [4]:
model = vgg_ft(2)

In [4]:
??vgg_ft

...and load our fine-tuned weights.


In [5]:
#model.load_weights(model_path+'finetune3.h5')
model.load_weights(os.path.join(model_path, 'finetune_1_ll.h5'))

Split conv and dense layers

We're going to be training a number of iterations without dropout, so it would be best for us to pre-calculate the input to the fully connected layers - i.e. the Flatten() layer. We'll start by finding this layer in our model, and creating a new model that contains just the layers up to and including this layer:


In [6]:
layers = model.layers

In [7]:
last_conv_idx = [index for index,layer in enumerate(layers) 
                     if type(layer) is Convolution2D][-1]

In [8]:
last_conv_idx


Out[8]:
30

In [9]:
layers[last_conv_idx]


Out[9]:
<keras.layers.convolutional.Convolution2D at 0x15c4b9c18>

In [10]:
conv_layers = layers[:last_conv_idx+1] # convolutional layers only; i.e. first N layers to the index
conv_model = Sequential(conv_layers)
# Dense layers - also known as fully connected or 'FC' layers
fc_layers = layers[last_conv_idx+1:] # remaining layers are Dense/FC

Generate features for the FC layers by precalculating conv output

Now we can use the exact same approach to creating features as we used when we created the linear model from the imagenet predictions in the last lesson - it's only the model that has changed. As you're seeing, there's a fairly small number of "recipes" that can get us a long way!


In [11]:
batches = get_batches(os.path.join(path,'train'), shuffle=False, batch_size=batch_size)
val_batches = get_batches(os.path.join(path,'valid'), shuffle=False, batch_size=batch_size)

val_classes = val_batches.classes
trn_classes = batches.classes
val_labels = onehot(val_classes)
trn_labels = onehot(trn_classes)


Found 2000 images belonging to 2 classes.
Found 499 images belonging to 2 classes.

Below:

We're pre-calculating the inputs to the new model. The inputs are the training and valiation sets. So, we basically want to get the result of running those two sets thru the conv layers only and save them so we can run them through the new model.

So, we use the conv-only model to run prediction on the training and validation sets in order to precompute the values we need.


In [12]:
val_features = conv_model.predict_generator(val_batches, val_batches.nb_sample)

In [13]:
trn_features = conv_model.predict_generator(batches, batches.nb_sample)

In [14]:
save_array(os.path.join(model_path, 'train_convlayer_features.bc'), trn_features)
save_array(os.path.join(model_path,'valid_convlayer_features.bc'), val_features)

In [15]:
trn_features = load_array(os.path.join(model_path, 'train_convlayer_features.bc'))
val_features = load_array(os.path.join(model_path,'valid_convlayer_features.bc'))

In [16]:
trn_features.shape


Out[16]:
(2000L, 512L, 14L, 14L)

In [17]:
val_features.shape


Out[17]:
(499L, 512L, 14L, 14L)

Remove dropout from the fully-connected layer model

For our new fully connected model, we'll create it using the exact same architecture as the last layers of VGG 16, so that we can conveniently copy pre-trained weights over from that model. However, we'll set the dropout layer's p values to zero, so as to effectively remove dropout.


In [18]:
# Copy the weights from the pre-trained model.
# NB: Since we're removing dropout, we want to half the weights
def proc_wgts(layer): return [o/2 for o in layer.get_weights()]

In [19]:
# Such a finely tuned model needs to be updated very slowly!
opt = RMSprop(lr=0.00001, rho=0.7)

In [20]:
def get_fc_model():
    model = Sequential([
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(2, activation='softmax')
        ])

    for l1,l2 in zip(model.layers, fc_layers): l1.set_weights(proc_wgts(l2))

    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [21]:
fc_model = get_fc_model()

Fit the FC model to the training and validation data

And fit the model in the usual way:


In [22]:
fc_model.fit(trn_features, trn_labels, nb_epoch=8, 
             batch_size=batch_size, validation_data=(val_features, val_labels))


Train on 2000 samples, validate on 499 samples
Epoch 1/8
2000/2000 [==============================] - 12s - loss: 0.0552 - acc: 0.9800 - val_loss: 0.0842 - val_acc: 0.9840
Epoch 2/8
2000/2000 [==============================] - 12s - loss: 0.0024 - acc: 0.9990 - val_loss: 0.0916 - val_acc: 0.9820
Epoch 3/8
2000/2000 [==============================] - 12s - loss: 5.3425e-05 - acc: 1.0000 - val_loss: 0.1110 - val_acc: 0.9780
Epoch 4/8
2000/2000 [==============================] - 12s - loss: 1.2481e-05 - acc: 1.0000 - val_loss: 0.1033 - val_acc: 0.9760
Epoch 5/8
2000/2000 [==============================] - 12s - loss: 5.0694e-07 - acc: 1.0000 - val_loss: 0.1294 - val_acc: 0.9820
Epoch 6/8
2000/2000 [==============================] - 12s - loss: 1.3894e-07 - acc: 1.0000 - val_loss: 0.1276 - val_acc: 0.9820
Epoch 7/8
2000/2000 [==============================] - 12s - loss: 1.2118e-07 - acc: 1.0000 - val_loss: 0.1385 - val_acc: 0.9820
Epoch 8/8
2000/2000 [==============================] - 12s - loss: 1.1948e-07 - acc: 1.0000 - val_loss: 0.1388 - val_acc: 0.9820
Out[22]:
<keras.callbacks.History at 0x17f8cbc88>

Save the weights (no dropout)


In [23]:
fc_model.save_weights(os.path.join(model_path,'lesson3_no_dropout.h5'))

In [24]:
fc_model.load_weights(os.path.join(model_path,'lesson3_no_dropout.h5'))

Reducing overfitting

Now that we've gotten the model to overfit, we can take a number of steps to reduce this.

Add data augmentation to the training data

Let's try adding a small amount of data augmentation, and see if we reduce overfitting as a result. The approach will be identical to the method we used to finetune the dense layers in lesson 2, except that we will use a generator with augmentation configured. Here's how we set up the generator, and create batches from it:


In [25]:
gen = image.ImageDataGenerator(rotation_range=15, width_shift_range=0.1, 
                               height_shift_range=0.1, zoom_range=0.1, horizontal_flip=True)

In [26]:
batches = get_batches(os.path.join(path, 'train'), gen, batch_size=batch_size)
# NB: We don't want to augment or shuffle the validation set
val_batches = get_batches(os.path.join(path, 'valid'), shuffle=False, batch_size=batch_size)


Found 2000 images belonging to 2 classes.
Found 499 images belonging to 2 classes.

Combine the Conv and FC layers into a single model

When using data augmentation, we can't pre-compute our convolutional layer features, since randomized changes are being made to every input image. That is, even if the training process sees the same image multiple times, each time it will have undergone different data augmentation, so the results of the convolutional layers will be different.

Therefore, in order to allow data to flow through all the conv layers and our new dense layers, we attach our fully connected model to the convolutional model--after ensuring that the convolutional layers are not trainable:


In [27]:
fc_model = get_fc_model()

In [29]:
for layer in conv_model.layers: layer.trainable = False
# Look how easy it is to connect two models together!
conv_model.add(fc_model)

Now we can compile, train, and save our model as usual - note that we use fit_generator() since we want to pull random images from the directories on every batch.

Compile and train the combined model


In [30]:
conv_model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [31]:
conv_model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=8, 
                        validation_data=val_batches, nb_val_samples=val_batches.nb_sample)


Epoch 1/8
2000/2000 [==============================] - 51s - loss: 0.0746 - acc: 0.9725 - val_loss: 0.0721 - val_acc: 0.9840
Epoch 2/8
2000/2000 [==============================] - 49s - loss: 0.0375 - acc: 0.9910 - val_loss: 0.0825 - val_acc: 0.9820
Epoch 3/8
2000/2000 [==============================] - 49s - loss: 0.0269 - acc: 0.9920 - val_loss: 0.0997 - val_acc: 0.9800
Epoch 4/8
2000/2000 [==============================] - 49s - loss: 0.0133 - acc: 0.9940 - val_loss: 0.1536 - val_acc: 0.9780
Epoch 5/8
2000/2000 [==============================] - 49s - loss: 0.0175 - acc: 0.9965 - val_loss: 0.1053 - val_acc: 0.9800
Epoch 6/8
2000/2000 [==============================] - 49s - loss: 0.0140 - acc: 0.9975 - val_loss: 0.1150 - val_acc: 0.9860
Epoch 7/8
2000/2000 [==============================] - 49s - loss: 0.0105 - acc: 0.9970 - val_loss: 0.1363 - val_acc: 0.9840
Epoch 8/8
2000/2000 [==============================] - 49s - loss: 0.0093 - acc: 0.9975 - val_loss: 0.1600 - val_acc: 0.9840
Out[31]:
<keras.callbacks.History at 0x15e3304e0>

Save the weights (combined)


In [ ]:
conv_model.save_weights(os.path.join(model_path, 'aug1.h5'))

In [34]:
conv_model.load_weights(os.path.join(model_path, 'aug1.h5'))

Add batchnorm to the model

We can use nearly the same approach as before - but this time we'll add batchnorm layers (and dropout layers):


In [35]:
conv_layers[-1].output_shape[1:]  # last layer shape


Out[35]:
(512, 14, 14)

In [36]:
def get_bn_layers(p):
    return [
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(4096, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(1000, activation='softmax')
        ]

In [37]:
def load_fc_weights_from_vgg16bn(model):
    "Load weights for model from the dense layers of the Vgg16BN model."
    # See imagenet_batchnorm.ipynb for info on how the weights for
    # Vgg16BN can be generated from the standard Vgg16 weights.
    from vgg16bn import Vgg16BN
    vgg16_bn = Vgg16BN()
    _, fc_layers = split_at(vgg16_bn.model, Convolution2D)
    copy_weights(fc_layers, model.layers)

In [38]:
p=0.6

Create a standalone model from the BN layers of vgg16bn


In [39]:
#bn_model = Sequential(get_bn_layers(0.6))
bn_model = Sequential(get_bn_layers(p))

In [40]:
load_fc_weights_from_vgg16bn(bn_model)


Downloading data from http://files.fast.ai/models/vgg16_bn.h5

In [43]:
def proc_wgts(layer, prev_p, new_p):
    scal = (1-prev_p)/(1-new_p)
    return [o*scal for o in layer.get_weights()]

In [44]:
for l in bn_model.layers: 
    if type(l)==Dense: l.set_weights(proc_wgts(l, 0.5, 0.6))  # here l == 'l' not '1'

In [45]:
# Remove last layer and lock all the others
bn_model.pop()
for layer in bn_model.layers: layer.trainable=False

In [46]:
# Add linear layer (2-class) (just doing the ImageNet mapping to Kaggle dogs and cats)
bn_model.add(Dense(2,activation='softmax'))

Compile and fit the model


In [47]:
bn_model.compile(Adam(), 'categorical_crossentropy', metrics=['accuracy'])

In [48]:
bn_model.fit(trn_features, trn_labels, nb_epoch=8, validation_data=(val_features, val_labels))


Train on 2000 samples, validate on 499 samples
Epoch 1/8
2000/2000 [==============================] - 2s - loss: 0.5140 - acc: 0.8905 - val_loss: 0.1541 - val_acc: 0.9739
Epoch 2/8
2000/2000 [==============================] - 1s - loss: 0.1748 - acc: 0.9595 - val_loss: 0.0999 - val_acc: 0.98000.959
Epoch 3/8
2000/2000 [==============================] - 1s - loss: 0.2074 - acc: 0.9580 - val_loss: 0.1130 - val_acc: 0.9780
Epoch 4/8
2000/2000 [==============================] - 1s - loss: 0.2028 - acc: 0.9635 - val_loss: 0.0951 - val_acc: 0.9820
Epoch 5/8
2000/2000 [==============================] - 1s - loss: 0.1785 - acc: 0.9645 - val_loss: 0.1239 - val_acc: 0.9800
Epoch 6/8
2000/2000 [==============================] - 1s - loss: 0.2133 - acc: 0.9580 - val_loss: 0.1010 - val_acc: 0.9800
Epoch 7/8
2000/2000 [==============================] - 1s - loss: 0.1758 - acc: 0.9650 - val_loss: 0.1223 - val_acc: 0.9780
Epoch 8/8
2000/2000 [==============================] - 1s - loss: 0.1649 - acc: 0.9680 - val_loss: 0.1000 - val_acc: 0.9760
Out[48]:
<keras.callbacks.History at 0x176a57da0>

Save the weights (batchnorm)


In [49]:
bn_model.save_weights(os.path.join(model_path,'bn.h5'))

In [50]:
bn_model.load_weights(os.path.join(model_path,'bn.h5'))

Create another BN model and combine it with the conv layers into a final model


In [51]:
bn_layers = get_bn_layers(0.6)
bn_layers.pop()
bn_layers.append(Dense(2,activation='softmax'))

In [52]:
final_model = Sequential(conv_layers)
for layer in final_model.layers: layer.trainable = False
for layer in bn_layers: final_model.add(layer)

Set the BN layers weights from the first BN model


In [53]:
for l1,l2 in zip(bn_model.layers, bn_layers):
    l2.set_weights(l1.get_weights())

Fit the model


In [54]:
final_model.compile(optimizer=Adam(), 
                    loss='categorical_crossentropy', metrics=['accuracy'])

In [55]:
final_model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=1, 
                        validation_data=val_batches, nb_val_samples=val_batches.nb_sample)


Epoch 1/1
2000/2000 [==============================] - 51s - loss: 2.2081 - acc: 0.8415 - val_loss: 1.6770 - val_acc: 0.8858
Out[55]:
<keras.callbacks.History at 0x17ba69be0>

Save the weights (final model)


In [58]:
final_model.save_weights(os.path.join(model_path, 'final1.h5'))

Fit the model


In [59]:
final_model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=4, 
                        validation_data=val_batches, nb_val_samples=val_batches.nb_sample)


Epoch 1/4
2000/2000 [==============================] - 51s - loss: 1.8759 - acc: 0.8735 - val_loss: 1.1234 - val_acc: 0.9218
Epoch 2/4
2000/2000 [==============================] - 49s - loss: 1.7514 - acc: 0.8805 - val_loss: 1.0986 - val_acc: 0.9279
Epoch 3/4
2000/2000 [==============================] - 50s - loss: 1.6122 - acc: 0.8880 - val_loss: 0.6317 - val_acc: 0.9599
Epoch 4/4
2000/2000 [==============================] - 49s - loss: 1.3345 - acc: 0.9085 - val_loss: 0.5065 - val_acc: 0.9619
Out[59]:
<keras.callbacks.History at 0x18d06b860>

Save the weights (final model)


In [60]:
final_model.save_weights(os.path.join(model_path, 'final2.h5'))

In [61]:
final_model.optimizer.lr=0.001

Fit the model


In [62]:
final_model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=4, 
                        validation_data=val_batches, nb_val_samples=val_batches.nb_sample)


Epoch 1/4
2000/2000 [==============================] - 51s - loss: 1.4807 - acc: 0.8975 - val_loss: 0.4867 - val_acc: 0.9659
Epoch 2/4
2000/2000 [==============================] - 49s - loss: 1.4498 - acc: 0.9005 - val_loss: 0.5859 - val_acc: 0.9579
Epoch 3/4
2000/2000 [==============================] - 49s - loss: 1.4331 - acc: 0.9030 - val_loss: 0.7196 - val_acc: 0.9499
Epoch 4/4
2000/2000 [==============================] - 49s - loss: 1.3423 - acc: 0.9095 - val_loss: 0.5011 - val_acc: 0.9659
Out[62]:
<keras.callbacks.History at 0x18d057c18>

Save the weights (final model)


In [63]:
bn_model.save_weights(os.path.join(model_path, 'final3.h5'))

In [ ]: