Linear models with CNN features

We need to find a way to convert the imagenet predictions to a probability of being a cat or a dog, since that is what the Kaggle competition requires us to submit.

A very simple solution to this problem is to learn a linear model that is trained using the 1,000 predictions from the imagenet model for each image as input, and the dog/cat label as target.

Plan for Lesson 2 Assignment

  • Train a linear model using the ImageNet predictions to classify images into dogs or cats.
    • Get the true labels for every image
    • Get the 1,000 imagenet category predictions for every image
    • Save the loaded data as bcolz persisted arrays in the models/ folder.
    • Fit the linear model using these predictions as features.
    • Save the predictions as bcolz persisted arrays in the models/ folder.
  • Evaluate the result of the training.
    • View examples of the model predictions
      • A few correct labels at random
      • A few incorrect labels at random
      • The most correct labels of each class (ie those with highest probability that are correct)
      • The most incorrect labels of each class (ie those with highest probability that are incorrect)
      • The most uncertain labels (ie those with probability closest to 0.5).
    • Create a confusion matrix from the predictions
      • Print it out
      • Plot it
  • Retrain the VGG model with the new linear layer.
    • Pop off the existing Dense last layer.
    • Add the new Dense layer as the new last layer.
    • Save the weights from the training in the models/ folder
  • Run predictions on the Kaggle dogs and cats test data using the retrained model.

Basic Setup


In [1]:
# Rather than importing everything manually, we'll make things easy
#   and load them all in utils.py, and just import them from there.
%matplotlib inline
import utils; reload(utils)
from utils import *


Using Theano backend.

In [2]:
%matplotlib inline
from __future__ import division,print_function
import os, json
from glob import glob
import numpy as np
import scipy
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
np.set_printoptions(precision=4, linewidth=100)
from matplotlib import pyplot as plt
import utils; reload(utils)
from utils import plots, get_batches, plot_confusion_matrix, get_data

In [3]:
from numpy.random import random, permutation
from scipy import misc, ndimage
from scipy.ndimage.interpolation import zoom

import keras
from keras import backend as K
from keras.utils.data_utils import get_file
from keras.models import Sequential
from keras.layers import Input
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD, RMSprop
from keras.preprocessing import image

Train a linear model using the ImageNet predictions to classify images into dogs or cats

Using a Dense() layer in this way, we can easily convert the 1,000 predictions given by our model into a probability of dog vs cat.

  • Simply train a linear model to take the 1,000 predictions as input, and return dog or cat as output, learning from the Kaggle data.
  • This should be easier and more accurate than manually creating a map from imagenet categories to one dog/cat category.

We start with some basic configuration steps. We copy a small amount of our data into a 'sample' directory, with the exact same structure as our 'train' directory--this is always a good idea in machine learning, since we should do all of our initial testing using a dataset small enough that we never have to wait for it.
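
If you haven't already made a sample set, a minimal sketch along these lines would do it (the paths and the number of files copied are illustrative assumptions, not part of the course code):

In [ ]:
import os, shutil
from numpy.random import permutation

src, dst = 'data/dogscats/train/', 'data/dogscats/sample/train/'  # hypothetical paths
for cls in ['cats', 'dogs']:
    os.makedirs(dst + cls)  # assumes the sample directories don't exist yet
    # copy 100 random images per class into the sample set
    for f in permutation(os.listdir(src + cls))[:100]:
        shutil.copyfile(src + cls + '/' + f, dst + cls + '/' + f)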


In [4]:
#path = "data/dogscats/sample/"
#path = "data/dogscats/"
path = "/input/sample/"
model_path = path + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)

We will process as many images at a time as our graphics card allows. This is a case of trial and error to find the max batch size - the largest size that doesn't give an out of memory error.


In [5]:
batch_size=10
#batch_size=4

Our overall approach here will be:

  1. Get the true labels for every image
  2. Get the 1,000 imagenet category predictions for every image
  3. Feed these predictions as input to a simple linear model.

Let's start by grabbing training and validation batches.


In [6]:
# Use batch size of 1 since we're just doing preprocessing on the CPU

# NOTE: The "batches" are generators, more memory efficient.
# This step seems redundant because get_data below calls get_batches internally.
# However, we do use these batches directly to access their metadata properties later.
# A bit better design/refactor would allow us to call get_batches once and then concatenate 
# them into arrays like get_data does.
val_batches = get_batches(path+'valid', shuffle=False, batch_size=1)
trn_batches = get_batches(path+'train', shuffle=False, batch_size=1)


Found 50 images belonging to 2 classes.
Found 200 images belonging to 2 classes.

In [51]:
??val_batches

In [7]:
val_batches.classes


Out[7]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [8]:
trn_batches.classes


Out[8]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [9]:
trn_batches.filenames[:5]


Out[9]:
['cats/cat.101.jpg',
 'cats/cat.10465.jpg',
 'cats/cat.10474.jpg',
 'cats/cat.10491.jpg',
 'cats/cat.1059.jpg']

Loading and resizing the images every time we want to use them isn't necessary. Instead, we should save the processed arrays.

  • By far the fastest way to save and load numpy arrays is using bcolz.
  • This also compresses the arrays, so we save disk space.

Here are the functions we'll use to save and load using bcolz.


In [10]:
import bcolz
def save_array(fname, arr): 
    c=bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()
    
def load_array(fname): 
    return bcolz.open(fname)[:]
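
A quick round-trip sanity check (the 'demo.bc' filename is just an illustrative choice):

In [ ]:
arr = np.arange(6).reshape(2, 3)
save_array(model_path + 'demo.bc', arr)   # writes a compressed bcolz directory
assert (load_array(model_path + 'demo.bc') == arr).all()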

We have provided a simple function that joins the arrays from all the batches - let's use this to grab the training and validation data:


In [44]:
# get_batches loads raw image data from a folder as a continuous stream via a special python generator.
??get_batches

In [46]:
# get_data calls get_batches and concatenates all the output into a single list.
??get_data

In [95]:
# val_data is a single array containing all the elements returned from one full epoch of get_batches.
val_data = get_data(path+'valid')


Found 50 images belonging to 2 classes.

In [17]:
val_data.shape


Out[17]:
(50, 3, 224, 224)

In [94]:
trn_data = get_data(path+'train')


Found 200 images belonging to 2 classes.

In [19]:
trn_data.shape


Out[19]:
(200, 3, 224, 224)

In [20]:
# Save the preprocessed image arrays so we don't have to rebuild them from the raw files.
save_array(model_path+ 'train_data.bc', trn_data)
save_array(model_path + 'valid_data.bc', val_data)

We can load our training and validation data later without recalculating them:


In [11]:
trn_data = load_array(model_path+'train_data.bc')
val_data = load_array(model_path+'valid_data.bc')

Get the true labels for every image

Keras returns classes as a single column, so we convert to one-hot encoding


In [12]:
def onehot(x): 
    return np.array(OneHotEncoder().fit_transform(x.reshape(-1,1)).todense())
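
As a tiny illustrative check, a class vector like [0, 1, 1] becomes a two-column matrix:

In [ ]:
onehot(np.array([0, 1, 1]))
# -> array([[ 1.,  0.],
#           [ 0.,  1.],
#           [ 0.,  1.]])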

In [13]:
# NOTE:
# val_batches and trn_batches from an earlier step, near the start of all this. 
val_classes = val_batches.classes
trn_classes = trn_batches.classes

# Convert Keras classes from single column to two for one-hot encoding.
val_labels = onehot(val_classes)
trn_labels = onehot(trn_classes)

In [97]:
# Matrix shape
trn_labels.shape


Out[97]:
(200, 2)

In [14]:
# Vector contents
trn_classes #[:4]


Out[14]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [15]:
# One-hot Matrix contents
trn_labels[:4]


Out[15]:
array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])

Get the 1,000 imagenet category predictions for every image

...and their 1,000 imagenet probabilities from VGG16--these will be the features for our linear model:

  1. Remember that the Vgg16 model was trained on the ImageNet data originally.
  2. We now run prediction on the Kaggle data (train/valid) using the vgg16 model (with original weights).
  3. The output of this prediction is, for each image, a set of probabilities over the 1,000 ImageNet labels.
  4. So, now we have a set of probabilities for each labeled (dog or cat) image.
  5. These probabilities become the features of a new training set for our linear classifier, which maps them to just dog or cat.

We need to start with our VGG 16 model, since we'll be using its predictions and features.


In [16]:
from vgg16 import Vgg16
vgg = Vgg16()
model = vgg.model

In [17]:
# Here, 'model' is the pre-trained VGG model, which already has the ImageNet weights. 
# We're using it to classify each labeled dog or cat image into one of the 1000 ImageNet classes first,
# and later we'll re-classify them into dog or cat.
trn_features = model.predict(trn_data, batch_size=batch_size)
val_features = model.predict(val_data, batch_size=batch_size)

In [28]:
# Save the ImageNet-prediction features computed above.
save_array(model_path+ 'train_lastlayer_features.bc', trn_features)
save_array(model_path + 'valid_lastlayer_features.bc', val_features)

We can load our training and validation features later without recalculating them:


In [18]:
trn_features = load_array(model_path+'train_lastlayer_features.bc')
val_features = load_array(model_path+'valid_lastlayer_features.bc')

In [19]:
# Contents of the new training features matrix.
# This is the output of the Vgg model's predictions on the Kaggle sample training/validation sets.
# Each row holds the probabilities that one labeled image belongs to each of the 1000 ImageNet classes.
trn_features.shape


Out[19]:
(200, 1000)

In [20]:
val_features.shape
#val_features[:4][:4]


Out[20]:
(50, 1000)

Now we can define our linear model, just like we did earlier:


In [21]:
# 1000 inputs, since that's the saved features, and 2 outputs, for dog and cat
# This is a stand-alone linear model, not part of the larger CNN we'll be working with later.
lm = Sequential([ Dense(2, activation='softmax', input_shape=(1000,)) ])
lm.compile(optimizer=RMSprop(lr=0.1), loss='categorical_crossentropy', metrics=['accuracy'])
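
As a quick sanity check on the layer we just defined: a Dense layer with 1,000 inputs and 2 outputs has 1,000 × 2 = 2,000 weights plus 2 biases, i.e. 2,002 parameters - which is exactly the Param # we'll see in the summary below.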

Fit the linear model using these predictions as features.

We're ready to fit the model!

That is, we're going to train the new linear model using our recently generated training data (probabilities).


In [22]:
#batch_size=64
batch_size=4

In [33]:
??lm.fit

In [23]:
# Now we train the linear model on the ImageNet-prediction features.
lm.fit(trn_features, trn_labels, nb_epoch=3, batch_size=batch_size, 
       validation_data=(val_features, val_labels))


Train on 200 samples, validate on 50 samples
Epoch 1/3
200/200 [==============================] - 0s - loss: 0.2989 - acc: 0.8800 - val_loss: 0.1160 - val_acc: 0.9800
Epoch 2/3
200/200 [==============================] - 0s - loss: 0.1228 - acc: 0.9450 - val_loss: 0.0618 - val_acc: 0.9600
Epoch 3/3
200/200 [==============================] - 0s - loss: 0.0775 - acc: 0.9650 - val_loss: 0.0434 - val_acc: 0.9800
Out[23]:
<keras.callbacks.History at 0x7f0fd2a6afd0>

In [24]:
lm.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
dense_4 (Dense)                  (None, 2)             2002        dense_input_1[0][0]              
====================================================================================================
Total params: 2,002
Trainable params: 2,002
Non-trainable params: 0
____________________________________________________________________________________________________

Evaluate the result of the training

Viewing model prediction examples

Keras' fit() function conveniently shows us the value of the loss function, and the accuracy, after every epoch ("epoch" refers to one full run through all training examples). The most important metrics for us to look at are for the validation set, since we want to check for over-fitting.

  • Tip: with our first model we should try to overfit before we start worrying about how to handle that - there's no point even thinking about regularization, data augmentation, etc if you're still under-fitting! (We'll be looking at these techniques shortly).

As well as looking at the overall metrics, it's also a good idea to look at examples of each of:

  1. A few correct labels at random
  2. A few incorrect labels at random
  3. The most correct labels of each class (ie those with highest probability that are correct)
  4. The most incorrect labels of each class (ie those with highest probability that are incorrect)
  5. The most uncertain labels (ie those with probability closest to 0.5).

Let's see what, if anything, we can learn from these (in general, these are particularly useful for debugging problems in the model; since this model is so simple, there may not be too much to learn at this stage).

Calculate predictions on validation set, so we can find correct and incorrect examples:


In [25]:
# We want both the classes...
preds = lm.predict_classes(val_features, batch_size=batch_size)
# ...and the probabilities of being a cat: [:,0] selects the first column,
#    i.e. the predicted probability of class 0 ('cats')
probs = lm.predict_proba(val_features, batch_size=batch_size)[:,0]


 4/50 [=>............................] - ETA: 0s

In [49]:
??lm.predict_proba

In [26]:
probs#[:8]


Out[26]:
array([  9.9980e-01,   9.9986e-01,   9.3915e-01,   9.9852e-01,   9.9988e-01,   9.9987e-01,
         4.5572e-01,   9.7192e-01,   9.9932e-01,   9.9998e-01,   9.8503e-01,   9.9985e-01,
         9.9987e-01,   9.7467e-01,   8.0765e-01,   9.9869e-01,   9.9972e-01,   9.9664e-01,
         9.9111e-01,   9.9937e-01,   9.9962e-01,   9.9978e-01,   9.9563e-01,   6.6271e-02,
         8.1784e-03,   6.3789e-03,   4.6646e-02,   4.1006e-01,   5.9673e-03,   2.2139e-03,
         9.3492e-04,   4.7413e-02,   5.4551e-02,   1.2323e-03,   2.5998e-04,   1.0639e-03,
         1.7929e-03,   2.4355e-03,   3.6252e-03,   7.5339e-03,   6.1301e-03,   5.2193e-04,
         8.2584e-02,   4.9377e-04,   4.9389e-02,   4.7487e-04,   9.3915e-04,   2.0320e-03,
         3.5493e-04,   7.3503e-02], dtype=float32)

In [27]:
preds#[:8]


Out[27]:
array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Get the filenames for the validation set, so we can view images:


In [28]:
filenames = val_batches.filenames

In [29]:
# Number of images to view for each visualization task
n_view = 4

Helper function to plot images by index in the validation set:


In [30]:
def plots_idx(idx, titles=None):
    plots([image.load_img(path + 'valid/' + filenames[i]) for i in idx], titles=titles)

In [31]:
#1. A few correct labels at random
correct = np.where(preds==val_labels[:,1])[0]
idx = permutation(correct)[:n_view]
plots_idx(idx, probs[idx])



In [32]:
#2. A few incorrect labels at random
incorrect = np.where(preds!=val_labels[:,1])[0]
idx = permutation(incorrect)[:n_view]
plots_idx(idx, probs[idx])



In [33]:
#3. The images we were most confident were cats, and are actually cats
correct_cats = np.where((preds==0) & (preds==val_labels[:,1]))[0]
most_correct_cats = np.argsort(probs[correct_cats])[::-1][:n_view]
plots_idx(correct_cats[most_correct_cats], probs[correct_cats][most_correct_cats])



In [34]:
# as above, but dogs
correct_dogs = np.where((preds==1) & (preds==val_labels[:,1]))[0]
most_correct_dogs = np.argsort(probs[correct_dogs])[:n_view]
#plots_idx(correct_dogs[most_correct_dogs], 1-probs[correct_dogs][most_correct_dogs])
plots_idx(correct_dogs[most_correct_dogs], probs[correct_dogs][most_correct_dogs])



In [35]:
#4. The images we were most confident were cats, but are actually dogs
incorrect_cats = np.where((preds==0) & (preds!=val_labels[:,1]))[0]
most_incorrect_cats = np.argsort(probs[incorrect_cats])[::-1][:n_view]
plots_idx(incorrect_cats[most_incorrect_cats], probs[incorrect_cats][most_incorrect_cats])


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-35-88f5bb94f1fc> in <module>()
      2 incorrect_cats = np.where((preds==0) & (preds!=val_labels[:,1]))[0]
      3 most_incorrect_cats = np.argsort(probs[incorrect_cats])[::-1][:n_view]
----> 4 plots_idx(incorrect_cats[most_incorrect_cats], probs[incorrect_cats][most_incorrect_cats])

<ipython-input-30-5532fd1c2535> in plots_idx(idx, titles)
      1 def plots_idx(idx, titles=None):
----> 2     plots([image.load_img(path + 'valid/' + filenames[i]) for i in idx], titles=titles)

/code/utils.pyc in plots(ims, figsize, rows, interp, titles)
     78 
     79 def plots(ims, figsize=(12,6), rows=1, interp=False, titles=None):
---> 80     if type(ims[0]) is np.ndarray:
     81         ims = np.array(ims).astype(np.uint8)
     82         if (ims.shape[-1] != 3):

IndexError: list index out of range

In [36]:
#4. The images we were most confident were dogs, but are actually cats
incorrect_dogs = np.where((preds==1) & (preds!=val_labels[:,1]))[0]
most_incorrect_dogs = np.argsort(probs[incorrect_dogs])[:n_view]
#plots_idx(incorrect_dogs[most_incorrect_dogs], 1-probs[incorrect_dogs][most_incorrect_dogs])
plots_idx(incorrect_dogs[most_incorrect_dogs], 1-probs[incorrect_dogs][most_incorrect_dogs])



In [37]:
#5. The most uncertain labels (ie those with probability closest to 0.5).
most_uncertain = np.argsort(np.abs(probs-0.5))
plots_idx(most_uncertain[:n_view], probs[most_uncertain])


Create a confusion matrix from the predictions

Perhaps the most common way to analyze the result of a classification model is to use a confusion matrix. Scikit-learn has a convenient function we can use for this purpose:


In [38]:
cm = confusion_matrix(val_classes, preds)

We can just print out the confusion matrix, or we can show a graphical view (which is mainly useful for problems with a larger number of categories).


In [39]:
plot_confusion_matrix(cm, val_batches.class_indices)


[[22  1]
 [ 0 27]]

About activation functions

Do you remember how we defined our linear model? Here it is again for reference:

lm = Sequential([ Dense(2, activation='softmax', input_shape=(1000,)) ])

And do you remember the definition of a fully connected layer in the original VGG?:

model.add(Dense(4096, activation='relu'))

You might be wondering, what's going on with that activation parameter? Adding an 'activation' parameter to a layer in Keras causes an additional function to be called after the layer is calculated. You'll recall that we had no such parameter in our most basic linear model at the start of this lesson - that's because a simple linear model has no activation function. But nearly all deep model layers have an activation function - specifically, a non-linear activation function, such as tanh, sigmoid (1/(1+exp(-x))), or relu (max(0,x), called the rectified linear function). Why?

The reason for this is that if you stack purely linear layers on top of each other, then you just end up with a linear layer! For instance, if your first layer was 2*x, and your second was -2*x, then the combination is: -2*(2*x) = -4*x. If that's all we were able to do with deep learning, it wouldn't be very deep! But what if we added a relu activation after our first layer? Then the combination would be: -2 * max(0, 2*x). As you can see, that does not simplify to just a linear function like the previous example--and indeed we can stack as many of these on top of each other as we wish, to create arbitrarily complex functions.
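
We can check this collapse with sympy (which we'll use again further below); the Max call standing in for relu here is just for illustration:

In [ ]:
import sympy as sp
x = sp.var('x')
-2 * (2 * x)           # -> -4*x: two stacked linear "layers" collapse into one
-2 * sp.Max(0, 2 * x)  # -> -2*Max(0, 2*x): a relu in between prevents the collapse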

And why would we want to do that? Because it turns out that such a stack of linear functions and non-linear activations can approximate any other function as closely as we want. So we can use it to model anything! This extraordinary insight is known as the universal approximation theorem. For a visual understanding of how and why this works, I strongly recommend you read Michael Nielsen's excellent interactive visual tutorial.

The last layer generally needs a different activation function to the other layers--because we want to encourage the last layer's output to be of an appropriate form for our particular problem. For instance, if our output is a one hot encoded categorical variable, we want our final layer's activations to add to one (so they can be treated as probabilities) and to have generally a single activation much higher than the rest (since with one hot encoding we have just a single 'one', and all other target outputs are zero). Our classification problems will always have this form, so we will introduce the activation function that has these properties: the softmax function. Softmax is defined as (for the i'th output activation): exp(x[i]) / sum(exp(x)).
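
In code, softmax is only a couple of lines (a minimal numpy sketch; subtracting the max is a standard numerical-stability trick that doesn't change the result):

In [ ]:
import numpy as np

def softmax(x):
    # subtracting the max avoids overflow in exp, without changing the output
    e = np.exp(x - x.max())
    return e / e.sum()

softmax(np.array([1., 2., 3.]))
# -> array([ 0.09  ,  0.2447,  0.6652]): all positive, sums to 1,
#    and the largest input gets by far the largest share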

I suggest you try playing with that function in a spreadsheet to get a sense of how it behaves.

We will see other activation functions later in this course - but relu (and minor variations) for intermediate layers and softmax for output layers will be by far the most common.

Retrain the VGG model with the new linear layer

Since the original VGG16 network's last layer is Dense (i.e. a linear model) it seems a little odd that we are adding an additional linear model on top of it. This is especially true since the last layer had a softmax activation, which is an odd choice for an intermediate layer--and by adding an extra layer on top of it, we have made it an intermediate layer. What if we just removed the original final layer and replaced it with one that we train for the purpose of distinguishing cats and dogs? It turns out that this is a good idea - as we'll see!

We start by removing the last layer, and telling Keras that we want to fix the weights in all the other layers (since we aren't looking to learn new parameters for those other layers).


In [40]:
#vgg.model.summary()
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
lambda_1 (Lambda)                (None, 3, 224, 224)   0           lambda_input_1[0][0]             
____________________________________________________________________________________________________
zeropadding2d_1 (ZeroPadding2D)  (None, 3, 226, 226)   0           lambda_1[0][0]                   
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D)  (None, 64, 224, 224)  1792        zeropadding2d_1[0][0]            
____________________________________________________________________________________________________
zeropadding2d_2 (ZeroPadding2D)  (None, 64, 226, 226)  0           convolution2d_1[0][0]            
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)  (None, 64, 224, 224)  36928       zeropadding2d_2[0][0]            
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)    (None, 64, 112, 112)  0           convolution2d_2[0][0]            
____________________________________________________________________________________________________
zeropadding2d_3 (ZeroPadding2D)  (None, 64, 114, 114)  0           maxpooling2d_1[0][0]             
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)  (None, 128, 112, 112) 73856       zeropadding2d_3[0][0]            
____________________________________________________________________________________________________
zeropadding2d_4 (ZeroPadding2D)  (None, 128, 114, 114) 0           convolution2d_3[0][0]            
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D)  (None, 128, 112, 112) 147584      zeropadding2d_4[0][0]            
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)    (None, 128, 56, 56)   0           convolution2d_4[0][0]            
____________________________________________________________________________________________________
zeropadding2d_5 (ZeroPadding2D)  (None, 128, 58, 58)   0           maxpooling2d_2[0][0]             
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D)  (None, 256, 56, 56)   295168      zeropadding2d_5[0][0]            
____________________________________________________________________________________________________
zeropadding2d_6 (ZeroPadding2D)  (None, 256, 58, 58)   0           convolution2d_5[0][0]            
____________________________________________________________________________________________________
convolution2d_6 (Convolution2D)  (None, 256, 56, 56)   590080      zeropadding2d_6[0][0]            
____________________________________________________________________________________________________
zeropadding2d_7 (ZeroPadding2D)  (None, 256, 58, 58)   0           convolution2d_6[0][0]            
____________________________________________________________________________________________________
convolution2d_7 (Convolution2D)  (None, 256, 56, 56)   590080      zeropadding2d_7[0][0]            
____________________________________________________________________________________________________
maxpooling2d_3 (MaxPooling2D)    (None, 256, 28, 28)   0           convolution2d_7[0][0]            
____________________________________________________________________________________________________
zeropadding2d_8 (ZeroPadding2D)  (None, 256, 30, 30)   0           maxpooling2d_3[0][0]             
____________________________________________________________________________________________________
convolution2d_8 (Convolution2D)  (None, 512, 28, 28)   1180160     zeropadding2d_8[0][0]            
____________________________________________________________________________________________________
zeropadding2d_9 (ZeroPadding2D)  (None, 512, 30, 30)   0           convolution2d_8[0][0]            
____________________________________________________________________________________________________
convolution2d_9 (Convolution2D)  (None, 512, 28, 28)   2359808     zeropadding2d_9[0][0]            
____________________________________________________________________________________________________
zeropadding2d_10 (ZeroPadding2D) (None, 512, 30, 30)   0           convolution2d_9[0][0]            
____________________________________________________________________________________________________
convolution2d_10 (Convolution2D) (None, 512, 28, 28)   2359808     zeropadding2d_10[0][0]           
____________________________________________________________________________________________________
maxpooling2d_4 (MaxPooling2D)    (None, 512, 14, 14)   0           convolution2d_10[0][0]           
____________________________________________________________________________________________________
zeropadding2d_11 (ZeroPadding2D) (None, 512, 16, 16)   0           maxpooling2d_4[0][0]             
____________________________________________________________________________________________________
convolution2d_11 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_11[0][0]           
____________________________________________________________________________________________________
zeropadding2d_12 (ZeroPadding2D) (None, 512, 16, 16)   0           convolution2d_11[0][0]           
____________________________________________________________________________________________________
convolution2d_12 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_12[0][0]           
____________________________________________________________________________________________________
zeropadding2d_13 (ZeroPadding2D) (None, 512, 16, 16)   0           convolution2d_12[0][0]           
____________________________________________________________________________________________________
convolution2d_13 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_13[0][0]           
____________________________________________________________________________________________________
maxpooling2d_5 (MaxPooling2D)    (None, 512, 7, 7)     0           convolution2d_13[0][0]           
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 25088)         0           maxpooling2d_5[0][0]             
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 4096)          102764544   flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 4096)          0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 4096)          16781312    dropout_1[0][0]                  
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 4096)          0           dense_2[0][0]                    
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 1000)          4097000     dropout_2[0][0]                  
====================================================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
____________________________________________________________________________________________________

Pop off the existing Dense last layer.


In [41]:
model.pop()
for layer in model.layers: layer.trainable=False

In [42]:
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
lambda_1 (Lambda)                (None, 3, 224, 224)   0           lambda_input_1[0][0]             
____________________________________________________________________________________________________
zeropadding2d_1 (ZeroPadding2D)  (None, 3, 226, 226)   0           lambda_1[0][0]                   
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D)  (None, 64, 224, 224)  1792        zeropadding2d_1[0][0]            
____________________________________________________________________________________________________
zeropadding2d_2 (ZeroPadding2D)  (None, 64, 226, 226)  0           convolution2d_1[0][0]            
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)  (None, 64, 224, 224)  36928       zeropadding2d_2[0][0]            
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)    (None, 64, 112, 112)  0           convolution2d_2[0][0]            
____________________________________________________________________________________________________
zeropadding2d_3 (ZeroPadding2D)  (None, 64, 114, 114)  0           maxpooling2d_1[0][0]             
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)  (None, 128, 112, 112) 73856       zeropadding2d_3[0][0]            
____________________________________________________________________________________________________
zeropadding2d_4 (ZeroPadding2D)  (None, 128, 114, 114) 0           convolution2d_3[0][0]            
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D)  (None, 128, 112, 112) 147584      zeropadding2d_4[0][0]            
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)    (None, 128, 56, 56)   0           convolution2d_4[0][0]            
____________________________________________________________________________________________________
zeropadding2d_5 (ZeroPadding2D)  (None, 128, 58, 58)   0           maxpooling2d_2[0][0]             
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D)  (None, 256, 56, 56)   295168      zeropadding2d_5[0][0]            
____________________________________________________________________________________________________
zeropadding2d_6 (ZeroPadding2D)  (None, 256, 58, 58)   0           convolution2d_5[0][0]            
____________________________________________________________________________________________________
convolution2d_6 (Convolution2D)  (None, 256, 56, 56)   590080      zeropadding2d_6[0][0]            
____________________________________________________________________________________________________
zeropadding2d_7 (ZeroPadding2D)  (None, 256, 58, 58)   0           convolution2d_6[0][0]            
____________________________________________________________________________________________________
convolution2d_7 (Convolution2D)  (None, 256, 56, 56)   590080      zeropadding2d_7[0][0]            
____________________________________________________________________________________________________
maxpooling2d_3 (MaxPooling2D)    (None, 256, 28, 28)   0           convolution2d_7[0][0]            
____________________________________________________________________________________________________
zeropadding2d_8 (ZeroPadding2D)  (None, 256, 30, 30)   0           maxpooling2d_3[0][0]             
____________________________________________________________________________________________________
convolution2d_8 (Convolution2D)  (None, 512, 28, 28)   1180160     zeropadding2d_8[0][0]            
____________________________________________________________________________________________________
zeropadding2d_9 (ZeroPadding2D)  (None, 512, 30, 30)   0           convolution2d_8[0][0]            
____________________________________________________________________________________________________
convolution2d_9 (Convolution2D)  (None, 512, 28, 28)   2359808     zeropadding2d_9[0][0]            
____________________________________________________________________________________________________
zeropadding2d_10 (ZeroPadding2D) (None, 512, 30, 30)   0           convolution2d_9[0][0]            
____________________________________________________________________________________________________
convolution2d_10 (Convolution2D) (None, 512, 28, 28)   2359808     zeropadding2d_10[0][0]           
____________________________________________________________________________________________________
maxpooling2d_4 (MaxPooling2D)    (None, 512, 14, 14)   0           convolution2d_10[0][0]           
____________________________________________________________________________________________________
zeropadding2d_11 (ZeroPadding2D) (None, 512, 16, 16)   0           maxpooling2d_4[0][0]             
____________________________________________________________________________________________________
convolution2d_11 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_11[0][0]           
____________________________________________________________________________________________________
zeropadding2d_12 (ZeroPadding2D) (None, 512, 16, 16)   0           convolution2d_11[0][0]           
____________________________________________________________________________________________________
convolution2d_12 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_12[0][0]           
____________________________________________________________________________________________________
zeropadding2d_13 (ZeroPadding2D) (None, 512, 16, 16)   0           convolution2d_12[0][0]           
____________________________________________________________________________________________________
convolution2d_13 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_13[0][0]           
____________________________________________________________________________________________________
maxpooling2d_5 (MaxPooling2D)    (None, 512, 7, 7)     0           convolution2d_13[0][0]           
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 25088)         0           maxpooling2d_5[0][0]             
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 4096)          102764544   flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 4096)          0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 4096)          16781312    dropout_1[0][0]                  
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 4096)          0           dense_2[0][0]                    
====================================================================================================
Total params: 134,260,544
Trainable params: 0
Non-trainable params: 134,260,544
____________________________________________________________________________________________________

Careful! Now that we've modified the definition of model, be careful not to rerun any code in the previous sections, without first recreating the model from scratch! (Yes, I made that mistake myself, which is why I'm warning you about it now...)

Now we're ready to add our new final layer...

Add the new Dense layer as the new last layer.


In [43]:
model.add(Dense(2, activation='softmax'))

In [44]:
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
lambda_1 (Lambda)                (None, 3, 224, 224)   0           lambda_input_1[0][0]             
____________________________________________________________________________________________________
zeropadding2d_1 (ZeroPadding2D)  (None, 3, 226, 226)   0           lambda_1[0][0]                   
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D)  (None, 64, 224, 224)  1792        zeropadding2d_1[0][0]            
____________________________________________________________________________________________________
zeropadding2d_2 (ZeroPadding2D)  (None, 64, 226, 226)  0           convolution2d_1[0][0]            
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)  (None, 64, 224, 224)  36928       zeropadding2d_2[0][0]            
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)    (None, 64, 112, 112)  0           convolution2d_2[0][0]            
____________________________________________________________________________________________________
zeropadding2d_3 (ZeroPadding2D)  (None, 64, 114, 114)  0           maxpooling2d_1[0][0]             
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)  (None, 128, 112, 112) 73856       zeropadding2d_3[0][0]            
____________________________________________________________________________________________________
zeropadding2d_4 (ZeroPadding2D)  (None, 128, 114, 114) 0           convolution2d_3[0][0]            
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D)  (None, 128, 112, 112) 147584      zeropadding2d_4[0][0]            
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)    (None, 128, 56, 56)   0           convolution2d_4[0][0]            
____________________________________________________________________________________________________
zeropadding2d_5 (ZeroPadding2D)  (None, 128, 58, 58)   0           maxpooling2d_2[0][0]             
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D)  (None, 256, 56, 56)   295168      zeropadding2d_5[0][0]            
____________________________________________________________________________________________________
zeropadding2d_6 (ZeroPadding2D)  (None, 256, 58, 58)   0           convolution2d_5[0][0]            
____________________________________________________________________________________________________
convolution2d_6 (Convolution2D)  (None, 256, 56, 56)   590080      zeropadding2d_6[0][0]            
____________________________________________________________________________________________________
zeropadding2d_7 (ZeroPadding2D)  (None, 256, 58, 58)   0           convolution2d_6[0][0]            
____________________________________________________________________________________________________
convolution2d_7 (Convolution2D)  (None, 256, 56, 56)   590080      zeropadding2d_7[0][0]            
____________________________________________________________________________________________________
maxpooling2d_3 (MaxPooling2D)    (None, 256, 28, 28)   0           convolution2d_7[0][0]            
____________________________________________________________________________________________________
zeropadding2d_8 (ZeroPadding2D)  (None, 256, 30, 30)   0           maxpooling2d_3[0][0]             
____________________________________________________________________________________________________
convolution2d_8 (Convolution2D)  (None, 512, 28, 28)   1180160     zeropadding2d_8[0][0]            
____________________________________________________________________________________________________
zeropadding2d_9 (ZeroPadding2D)  (None, 512, 30, 30)   0           convolution2d_8[0][0]            
____________________________________________________________________________________________________
convolution2d_9 (Convolution2D)  (None, 512, 28, 28)   2359808     zeropadding2d_9[0][0]            
____________________________________________________________________________________________________
zeropadding2d_10 (ZeroPadding2D) (None, 512, 30, 30)   0           convolution2d_9[0][0]            
____________________________________________________________________________________________________
convolution2d_10 (Convolution2D) (None, 512, 28, 28)   2359808     zeropadding2d_10[0][0]           
____________________________________________________________________________________________________
maxpooling2d_4 (MaxPooling2D)    (None, 512, 14, 14)   0           convolution2d_10[0][0]           
____________________________________________________________________________________________________
zeropadding2d_11 (ZeroPadding2D) (None, 512, 16, 16)   0           maxpooling2d_4[0][0]             
____________________________________________________________________________________________________
convolution2d_11 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_11[0][0]           
____________________________________________________________________________________________________
zeropadding2d_12 (ZeroPadding2D) (None, 512, 16, 16)   0           convolution2d_11[0][0]           
____________________________________________________________________________________________________
convolution2d_12 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_12[0][0]           
____________________________________________________________________________________________________
zeropadding2d_13 (ZeroPadding2D) (None, 512, 16, 16)   0           convolution2d_12[0][0]           
____________________________________________________________________________________________________
convolution2d_13 (Convolution2D) (None, 512, 14, 14)   2359808     zeropadding2d_13[0][0]           
____________________________________________________________________________________________________
maxpooling2d_5 (MaxPooling2D)    (None, 512, 7, 7)     0           convolution2d_13[0][0]           
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 25088)         0           maxpooling2d_5[0][0]             
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 4096)          102764544   flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 4096)          0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 4096)          16781312    dropout_1[0][0]                  
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 4096)          0           dense_2[0][0]                    
____________________________________________________________________________________________________
dense_5 (Dense)                  (None, 2)             8194        dropout_2[0][0]                  
====================================================================================================
Total params: 134,268,738
Trainable params: 8,194
Non-trainable params: 134,260,544
____________________________________________________________________________________________________

In [35]:
??vgg.finetune

...and compile our updated model, and set up our batches to use the preprocessed images (note that now we will also shuffle the training batches, to add more randomness when using multiple epochs):


In [45]:
gen=image.ImageDataGenerator()
batches = gen.flow(trn_data, trn_labels, batch_size=batch_size, shuffle=True)
val_batches = gen.flow(val_data, val_labels, batch_size=batch_size, shuffle=False)

We'll define a simple function for fitting models, just to save a little typing...


In [46]:
def fit_model(model, batches, val_batches, nb_epoch=1):
    model.fit_generator(batches, samples_per_epoch=batches.n, nb_epoch=nb_epoch, 
                        validation_data=val_batches, nb_val_samples=val_batches.n)

...and now we can use it to train the last layer of our model!

(It runs quite slowly, since it still has to calculate all the previous layers in order to know what input to pass to the new final layer. We could precalculate the output of the penultimate layer, like we did for the final layer earlier - but since we're only likely to want one or two iterations, it's easier to follow this alternative approach.)
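
If you did want the precalculation route, a sketch would look something like this (not run in this notebook; penult_model, top and the feature arrays are hypothetical names, and this assumes the Keras 1.x API used throughout):

In [ ]:
# Everything except the new final Dense layer
penult_model = Sequential(model.layers[:-1])
trn_penult = penult_model.predict(trn_data, batch_size=batch_size)  # shape (n, 4096)
val_penult = penult_model.predict(val_data, batch_size=batch_size)

# Train just the small final layer on the precalculated features
top = Sequential([Dense(2, activation='softmax', input_shape=(4096,))])
top.compile(optimizer=RMSprop(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
top.fit(trn_penult, trn_labels, nb_epoch=3, batch_size=batch_size,
        validation_data=(val_penult, val_labels))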


In [47]:
#opt = RMSprop(lr=0.1)
#opt = RMSprop(lr=0.01)
opt = RMSprop(lr=0.001)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

Fit the updated model


In [48]:
fit_model(model, batches, val_batches, nb_epoch=2)
#fit_model(model, batches, val_batches, nb_epoch=4)


Epoch 1/2
200/200 [==============================] - 268s - loss: 0.4555 - acc: 0.8550 - val_loss: 0.0467 - val_acc: 0.9600
Epoch 2/2
200/200 [==============================] - 260s - loss: 0.2187 - acc: 0.9250 - val_loss: 0.0307 - val_acc: 1.0000

Before moving on, go back and look at how little code we had to write in this section to finetune the model. Because this is such an important and common operation, keras is set up to make it as easy as possible. We didn't even have to use any external helper functions in this section.

It's a good idea to save weights of all your models, so you can re-use them later. Be sure to note the git log number of your model when keeping a research journal of your results.


In [58]:
model.save_weights(model_path+'finetune1.h5')

In [59]:
model.load_weights(model_path+'finetune1.h5')

In [60]:
model.evaluate(val_data, val_labels)


50/50 [==============================] - 44s     
Out[60]:
[7.4143237733840985, 0.54000000000000004]

We can reproduce the earlier prediction-example visualizations by redefining probs and preds and re-using our earlier code.


In [67]:
#preds = model.predict_classes(val_data, batch_size=batch_size)
#probs = model.predict_proba(val_data, batch_size=batch_size)[:,0]

# What happens if I use the training data here, which has more examples?
preds = model.predict_classes(trn_data, batch_size=batch_size)
probs = model.predict_proba(trn_data, batch_size=batch_size)[:,0]


200/200 [==============================] - 206s   
200/200 [==============================] - 197s   

In [68]:
probs#[:8]


Out[68]:
array([  3.0636e-24,   1.3517e-23,   1.1271e-15,   1.2863e-22,   1.7958e-22,   6.3603e-32,
         3.6124e-25,   2.3463e-16,   7.8844e-21,   2.2752e-23,   5.8406e-27,   6.3743e-26,
         6.5587e-26,   8.7820e-21,   1.6448e-20,   8.0639e-20,   1.4843e-13,   4.3397e-20,
         5.2136e-21,   6.0793e-18,   1.8005e-23,   2.7200e-17,   3.0654e-19,   2.3814e-15,
         4.6877e-21,   6.7124e-22,   1.0752e-17,   3.6133e-25,   9.6471e-23,   2.5488e-21,
         1.1966e-25,   2.2114e-21,   7.4661e-20,   1.1317e-21,   2.0036e-17,   6.5325e-17,
         1.3659e-25,   6.8473e-21,   1.8764e-20,   1.4001e-21,   2.3879e-19,   1.1712e-23,
         1.8368e-26,   9.5270e-21,   2.1954e-23,   6.5802e-21,   2.1378e-19,   3.7486e-20,
         3.9816e-30,   1.6520e-20,   2.9834e-25,   5.9016e-21,   1.5992e-26,   2.4368e-18,
         3.0976e-22,   2.0723e-21,   3.3351e-15,   1.2196e-19,   9.0931e-31,   1.5938e-17,
         1.6258e-16,   9.6800e-31,   4.2162e-23,   1.3817e-26,   2.9371e-22,   1.1023e-27,
         2.6663e-30,   2.8125e-22,   1.4098e-27,   2.6091e-25,   7.2712e-18,   1.2987e-19,
         1.2177e-18,   1.3886e-25,   1.1951e-22,   4.6010e-25,   1.9926e-17,   1.5175e-21,
         2.2983e-24,   7.3397e-25,   2.5200e-23,   1.2230e-34,   5.4773e-26,   4.9689e-18,
         2.5682e-20,   1.5613e-17,   1.7126e-21,   7.3605e-14,   5.6422e-13,   1.5725e-27,
         1.4142e-26,   1.6820e-24,   4.5249e-18,   2.7864e-28,   3.6679e-31,   1.9860e-29,
         4.1718e-25,   4.4464e-24,   1.1654e-26,   1.0185e-30,   3.4307e-29,   7.6108e-23,
         7.0624e-29,   1.2404e-29,   1.8536e-28,   1.3048e-25,   1.1350e-31,   1.5642e-29,
         2.1039e-30,   3.2152e-25,   2.1200e-27,   6.9530e-38,   5.2214e-29,   1.1210e-44,
         1.0821e-23,   4.0809e-35,   5.2802e-32,   1.6497e-25,   4.8529e-36,   3.8353e-39,
         8.3976e-24,   1.8803e-13,   4.3513e-31,   3.8222e-41,   0.0000e+00,   3.4462e-23,
         2.1780e-25,   9.4977e-32,   1.6817e-32,   3.6546e-25,   1.7103e-24,   3.6364e-29,
         1.0278e-28,   1.3846e-29,   3.2467e-35,   1.1356e-21,   1.5728e-27,   2.9851e-32,
         2.1050e-29,   2.5345e-32,   3.3878e-28,   1.1435e-23,   1.3699e-41,   3.9945e-29,
         6.3915e-28,   1.1295e-29,   2.2297e-24,   6.5767e-34,   1.4013e-45,   1.1024e-24,
         2.3570e-42,   6.2737e-24,   2.0800e-23,   7.1527e-32,   5.1406e-32,   5.4818e-26,
         3.0574e-30,   4.6377e-24,   2.7405e-26,   4.4632e-31,   1.6071e-28,   1.3491e-19,
         6.1631e-22,   2.2672e-34,   2.4523e-29,   3.4353e-24,   7.7169e-40,   3.5083e-31,
         4.3466e-30,   6.7368e-34,   1.3119e-35,   2.4248e-37,   5.2427e-20,   3.0520e-38,
         1.1928e-21,   1.3732e-28,   1.7798e-23,   4.1921e-36,   9.9268e-26,   2.1344e-20,
         1.8944e-31,   8.3653e-25,   0.0000e+00,   3.8265e-28,   2.9305e-23,   3.3523e-34,
         7.2511e-35,   4.7572e-23,   3.7713e-34,   1.5917e-30,   3.2434e-25,   2.2061e-31,
         3.3029e-33,   1.6937e-31,   1.7550e-26,   2.9644e-26,   2.3787e-28,   7.4319e-40,
         1.1142e-24,   3.7431e-41], dtype=float32)

In [69]:
val_classes


Out[69]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [70]:
preds


Out[70]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [1]:
#cm = confusion_matrix(val_classes, preds)
cm = confusion_matrix(trn_classes, preds)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-0284b9cd2696> in <module>()
      1 #cm = confusion_matrix(val_classes, preds)
----> 2 cm = confusion_matrix(trn_classes, preds)

NameError: name 'confusion_matrix' is not defined

In [66]:
plot_confusion_matrix(cm, {'cat':0, 'dog':1})


[[ 0 23]
 [ 0 27]]

Retraining more layers

Now that we've fine-tuned the new final layer, can we, and should we, fine-tune all the dense layers? The answer to both questions, it turns out, is: yes! Let's start with the "can we" question...

An introduction to back-propagation

The key to training multiple layers of a model, rather than just one, lies in a technique called "back-propagation" (or backprop to its friends). Backprop is one of the many cases where deep learning parlance creates a new word for something that already exists - in this case, backprop simply refers to calculating gradients using the chain rule. (But we will still introduce the deep learning terms during this course, since it's important to know them when reading about or discussing deep learning.)

As you (hopefully!) remember from high school, the chain rule is how you calculate the gradient of a "function of a function"--something of the form f(u), where u=g(x). For instance, let's say your function is pow(2*x, 2). Then u is 2*x, and f(u) is pow(u, 2). The chain rule tells us that the derivative of this is simply the product of the derivatives of f() and g(). Using f'(x) to refer to the derivative, we can say that: f'(x) = f'(u) * g'(x) = 2*u * 2 = 2*(2*x) * 2 = 8*x.

Let's check our calculation:


In [6]:
# sympy lets us do symbolic differentiation (and much more!) in python
import sympy as sp
# we have to define our variables
x = sp.var('x')
# then we can request the derivative of any expression of that variable
pow(2*x,2).diff()


Out[6]:
8*x

The key insight is that the stacking of linear functions and non-linear activations we learnt about in the last section is simply defining a function of functions (of functions, of functions...). Each layer is taking the output of the previous layer's function, and using it as input into its function. Therefore, we can calculate the derivative at any layer by simply multiplying the gradients of that layer and all of its following layers together! This use of the chain rule to allow us to rapidly calculate the derivatives of our model at any layer is referred to as back propagation.

The good news is that you'll never have to worry about the details of this yourself, since libraries like Theano and Tensorflow (and therefore wrappers like Keras) provide automatic differentiation (or AD): you define the forward computation, and the library derives the gradients for you.
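
For instance, with the Theano backend used in this notebook, we can ask for a gradient directly; here's a minimal sketch checking the 8*x result numerically (the variable names are just illustrative):

In [ ]:
import theano
import theano.tensor as T

x = T.dscalar('x')
y = (2 * x) ** 2                        # the same f(g(x)) as above
grad_fn = theano.function([x], T.grad(y, x))
grad_fn(3.0)                            # -> array(24.0), i.e. 8*x at x=3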

Training multiple layers in Keras

The code below will work on any model that contains dense layers; it's not just for this VGG model.

NB: Don't skip the step of fine-tuning just the final layer first, since otherwise you'll have one layer with random weights, which will cause the other layers to quickly move a long way from their optimized imagenet weights.


In [253]:
layers = model.layers
# Get the index of the first dense layer...
first_dense_idx = [index for index,layer in enumerate(layers) if type(layer) is Dense][0]
# ...and set this and all subsequent layers to trainable
for layer in layers[first_dense_idx:]: layer.trainable=True

Since we haven't changed our architecture, there's no need to re-compile the model - instead, we just set the learning rate. Since we're training more layers, and since we've already optimized the last layer, we should use a lower learning rate than previously.


In [254]:
K.set_value(opt.lr, 0.01)
fit_model(model, batches, val_batches, 3)


Epoch 1/3
23000/23000 [==============================] - 272s - loss: 0.4665 - acc: 0.9703 - val_loss: 0.4522 - val_acc: 0.9710
Epoch 2/3
23000/23000 [==============================] - 273s - loss: 0.4468 - acc: 0.9716 - val_loss: 0.3980 - val_acc: 0.9740
Epoch 3/3
23000/23000 [==============================] - 272s - loss: 0.4187 - acc: 0.9736 - val_loss: 0.3903 - val_acc: 0.9745

This is an extraordinarily powerful 5 lines of code. We have fine-tuned all of our dense layers to be optimized for our specific data set. This kind of technique has only become accessible in the last year or two - and we can already do it in just 5 lines of python!


In [255]:
model.save_weights(model_path+'finetune2.h5')

There's generally little room for improvement in training the convolutional layers, if you're using the model on natural images (as we are). However, there's no harm in trying a few of the later conv layers, since it may give a slight improvement, and we can always load the previous weights if the accuracy decreases.


In [256]:
for layer in layers[12:]: layer.trainable=True
K.set_value(opt.lr, 0.001)

In [257]:
fit_model(model, batches, val_batches, 4)


Epoch 1/4
23000/23000 [==============================] - 273s - loss: 0.4165 - acc: 0.9736 - val_loss: 0.3984 - val_acc: 0.9740
Epoch 2/4
23000/23000 [==============================] - 272s - loss: 0.4314 - acc: 0.9728 - val_loss: 0.3952 - val_acc: 0.9750
Epoch 3/4
23000/23000 [==============================] - 272s - loss: 0.4240 - acc: 0.9730 - val_loss: 0.3941 - val_acc: 0.9750
Epoch 4/4
23000/23000 [==============================] - 272s - loss: 0.4166 - acc: 0.9736 - val_loss: 0.3891 - val_acc: 0.9745

In [259]:
model.save_weights(model_path+'finetune3.h5')

You can always load the weights later and use the model to do whatever you need:


In [ ]:
model.load_weights(model_path+'finetune2.h5')
model.evaluate_generator(get_batches(path+'valid', gen, False, batch_size*2), val_batches.n)

In [ ]: