This session is inspired by a blog post by François Chollet, the creator of the Keras library.
WARNING: executing this notebook requires a GPU, e.g. an NVIDIA K80 or a GTX 980 or later, with at least 6GB of GPU RAM.
For this session we are going to use the dogs-vs-cats dataset.
To download the data yourself, create a password-based account on Kaggle, then, while logged in in your browser, click on the download link of one of the data files to reach the form that asks you to accept the terms and conditions of the challenge.
Then, in a shell session (possibly on a remote server), do the following:
pip3 install kaggle
# You need to download a new API key from https://www.kaggle.com/{my_name}/account
# and save it as `~/.kaggle/kaggle.json`.
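chmod 600 ~/.kaggle/kaggle.json  # the kaggle CLI warns if the key file is readable by other users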
mkdir -p ~/data/dogs-vs-cats
cd ~/data/dogs-vs-cats
kaggle competitions download -c dogs-vs-cats
This should download 3 files, among which train.zip and test1.zip (and a CSV template file we won't need).
Once this is done we can extract the archives for the train set:
In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
import os.path as op
import shutil
from zipfile import ZipFile
data_folder = op.expanduser('~/data/dogs-vs-cats')
train_folder = op.join(data_folder, 'train')
if not op.exists(train_folder):
    train_zip = op.join(data_folder, 'dogs-vs-cats.zip')
    print('Extracting %s...' % train_zip)
    ZipFile(train_zip).extractall(data_folder)
    ZipFile(op.join(data_folder, "train.zip")).extractall(data_folder)
The Keras image data helpers want images for different classes ('cat' and 'dog') to live in distinct subfolders. Let's rearrange the image files to follow that convention:
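The target layout looks as follows (filenames are illustrative):

train/
    cat/
        cat.0.jpg
        ...
    dog/
        dog.0.jpg
        ...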
In [7]:
def rearrange_folders(folder):
    image_filenames = [op.join(folder, fn) for fn in os.listdir(folder)
                       if fn.endswith('.jpg')]
    if len(image_filenames) == 0:
        return
    print("Rearranging %d images in %s into one subfolder per class..."
          % (len(image_filenames), folder))
    for image_filename in image_filenames:
        # filenames look like 'cat.249.jpg': the class name is the
        # part before the first '.'
        class_name = op.basename(image_filename).split('.', 1)[0]
        subfolder = op.join(folder, class_name)
        if not op.exists(subfolder):
            os.mkdir(subfolder)
        shutil.move(image_filename, subfolder)
rearrange_folders(train_folder)
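After running this, the train folder should contain exactly the two class subfolders (a quick check):

print(sorted(os.listdir(train_folder)))  # expected: ['cat', 'dog']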
Let's build a validation dataset by taking 500 images of cats and 500 images of dogs out of the training set:
In [8]:
n_validation = 500
validation_folder = op.join(data_folder, 'validation')
if not op.exists(validation_folder):
    os.mkdir(validation_folder)
    for class_name in ['dog', 'cat']:
        train_subfolder = op.join(train_folder, class_name)
        validation_subfolder = op.join(validation_folder, class_name)
        print("Populating %s..." % validation_subfolder)
        os.mkdir(validation_subfolder)
        images_filenames = sorted(os.listdir(train_subfolder))
        for image_filename in images_filenames[-n_validation:]:
            shutil.move(op.join(train_subfolder, image_filename),
                        validation_subfolder)
        print("Moved %d images" % len(os.listdir(validation_subfolder)))
Let's use Keras utilities to manually load the first image file of the cat folder. If Keras complains about a missing "PIL" library, make sure to install it with one of the following commands:
conda install pillow
# or
pip install pillow
You might need to restart the kernel of this notebook to get Keras to work.
In [9]:
from tensorflow.keras.preprocessing.image import array_to_img, img_to_array, load_img
img = load_img(op.join(train_folder, 'cat', 'cat.249.jpg'))
x = img_to_array(img)
print(x.shape)
In [10]:
plt.imshow(x.astype(np.uint8))
plt.axis('off');
Keras provides tools to generate many variations from a single image: this is useful to augment the dataset with variants that should not affect the image label: a rotated image of a cat is still an image of a cat.
Doing data augmentation at train time encourages the neural network to ignore such label-preserving transformations and therefore helps reduce overfitting.
In [15]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
augmenting_datagen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    channel_shift_range=9,
    fill_mode='nearest'
)
In [12]:
plt.figure(figsize=(11, 5))
flow = augmenting_datagen.flow(x[np.newaxis, :, :, :])
for i, x_augmented in zip(range(15), flow):
    plt.subplot(3, 5, i + 1)
    plt.imshow(x_augmented[0])
    plt.axis('off')
The ImageDataGenerator object can then be pointed at the dataset folder to both load the images and augment them on the fly, and to resize / crop them to fit the input dimensions of the classification neural network.
In [13]:
flow = augmenting_datagen.flow_from_directory(
    train_folder, batch_size=1, target_size=(224, 224))

plt.figure(figsize=(11, 5))
for i, (X, y) in zip(range(15), flow):
    plt.subplot(3, 5, i + 1)
    plt.imshow(X[0])
    plt.axis('off')
In [16]:
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
full_imagenet_model = ResNet50(weights='imagenet')
In [17]:
# summary() prints the architecture directly (it returns None)
full_imagenet_model.summary()
If you have the graphviz system package and the pydot_ng Python package installed, you can uncomment the following cell to display the structure of the network.
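One possible way to install them (assuming a conda environment; pydot_ng is also on PyPI):
conda install graphviz
pip install pydot_ng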
In [ ]:
# from IPython.display import SVG
# from tensorflow.keras.utils import model_to_dot
# model_viz = model_to_dot(full_imagenet_model,
#                          show_layer_names=False,
#                          show_shapes=True)
# SVG(model_viz.create(prog='dot', format='svg'))
In [18]:
from tensorflow.keras.models import Model
output = full_imagenet_model.layers[-2].output
base_model = Model(full_imagenet_model.input, output)
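Note that an equivalent way to get this feature extractor is to load ResNet50 without its classification head; a sketch of that alternative (pooling='avg' produces the same 2048-dimensional output as the truncated model above):

# Alternative: skip the ImageNet classification head entirely;
# global average pooling yields the same 2048-dimensional features.
base_model_alt = ResNet50(weights='imagenet', include_top=False,
                          pooling='avg')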
When using this model we need to be careful to apply the same image preprocessing as was used during training, otherwise the marginal distribution of the input pixels might not be on the right scale:
In [20]:
def preprocess_function(x):
    if x.ndim == 3:
        x = x[np.newaxis, :, :, :]
    return preprocess_input(x)
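For reference, ResNet50's preprocess_input applies 'caffe'-style preprocessing. A rough manual equivalent is sketched below for illustration only (the channel means are the hard-coded ImageNet values; keep using preprocess_input in practice):

def manual_resnet_preprocess(x):
    # x: float array of RGB pixel values in the [0, 255] range
    x = x[..., ::-1]  # convert RGB to BGR
    return x - np.array([103.939, 116.779, 123.68])  # subtract ImageNet BGR means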
In [21]:
batch_size = 50

datagen = ImageDataGenerator(preprocessing_function=preprocess_function)
train_flow = datagen.flow_from_directory(
    train_folder,
    target_size=(224, 224),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=True,
)

X, y = next(train_flow)
print(X.shape, y.shape)
Exercise: write a function that iterates over 5000 images of the training set (batch after batch), extracts the activations of the last layer of base_model (by calling predict) and collects the results in a big numpy array with dimensions (5000, 2048) for the features and (5000,) for the matching image labels.
In [26]:
# %load solutions/dogs_vs_cats_extract_features.py
from time import time

features = []
labels = []
t0 = time()
count = 0

for X, y in train_flow:
    labels.append(y)
    features.append(base_model.predict(X))
    count += len(y)
    if count % 100 == 0:
        print("processed %d images at %d images/s"
              % (count, count / (time() - t0)))
    if count >= 5000:
        break

labels_train = np.concatenate(labels)
features_train = np.vstack(features)
np.save('labels_train.npy', labels_train)
np.save('features_train.npy', features_train)
Let's load precomputed features if available:
In [27]:
print("Loading precomputed features")
labels_train = np.load('labels_train.npy')
features_train = np.load('features_train.npy')
Let's train a simple linear model on those features. First let's check that the resulting small dataset has balanced classes:
In [28]:
print(labels_train.shape)
In [29]:
np.mean(labels_train)
In [30]:
n_samples, n_features = features_train.shape
print(n_features, "features extracted")
Let's define the classification model:
In [31]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
top_model = Sequential()
top_model.add(Dense(1, input_dim=n_features, activation='sigmoid'))
top_model.compile(optimizer=Adam(learning_rate=1e-4),
                  loss='binary_crossentropy', metrics=['accuracy'])

top_model.fit(features_train, labels_train,
              validation_split=0.1, verbose=2, epochs=15)
Alright, so transfer learning is already at ~0.98 / 0.99 accuracy. This is not too surprising as the cat and dog classes are already part of the ImageNet label set.
Note that this is already as good as or slightly better than the winner of the original Kaggle competition three years ago. At that time they did not have pretrained ResNet models at hand.
Our validation set has 1000 images, so an accuracy of 0.990 means only 10 classification errors.
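As a point of comparison, a linear model from scikit-learn trained on the same frozen features should reach a comparable accuracy (a minimal sketch, with an ad-hoc train/validation split):

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(features_train[:4500], labels_train[:4500])
print(lr.score(features_train[4500:], labels_train[4500:]))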
Let's plug this classifier on top of the base model to be able to make predictions on our held-out validation image folder:
In [32]:
model = Model(base_model.input, top_model(base_model.output))
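As a quick sanity check (reusing train_flow from above), the assembled model should map a batch of preprocessed images to one dog probability per image:

X_check, _ = next(train_flow)
print(model.predict(X_check).shape)  # expected: (batch_size, 1)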
In [33]:
flow = ImageDataGenerator().flow_from_directory(
    validation_folder, batch_size=1, target_size=(224, 224))

plt.figure(figsize=(12, 8))
for i, (X, y) in zip(range(15), flow):
    plt.subplot(3, 5, i + 1)
    plt.imshow(X[0] / 255)
    prediction = model.predict(preprocess_input(X))
    # y is one-hot encoded with alphabetical class order: cat=0, dog=1
    label = "dog" if y[0, 1] > 0.5 else "cat"
    plt.title("dog prob=%0.4f\ntrue label: %s"
              % (prediction[0, 0], label))
    plt.axis('off')
Let's compute the validation score on the full validation set:
In [34]:
valgen = ImageDataGenerator(preprocessing_function=preprocess_function)
val_flow = valgen.flow_from_directory(
    validation_folder, batch_size=batch_size, target_size=(224, 224),
    shuffle=False, class_mode='binary')

all_correct = []
for i, (X, y) in zip(range(val_flow.n // batch_size), val_flow):
    predictions = model.predict(X).ravel()
    correct = list((predictions > 0.5) == y)
    all_correct.extend(correct)
    print("Processed %d images" % len(all_correct))

print("Validation accuracy: %0.4f" % np.mean(all_correct))
Exercise: display the examples where the model makes its most confident mistakes.
To display images in a Jupyter notebook you can use:

from IPython.display import Image, display
import os.path as op

display(Image(op.join(validation_folder, image_name)))

The filenames of the items sampled by a flow (without random shuffling) can be accessed via val_flow.filenames.
In [ ]:
In [35]:
# %load solutions/dogs_vs_cats_worst_predictions.py
from IPython.display import Image, display

predicted_batches = []
label_batches = []
n_batches = val_flow.n // batch_size

for i, (X, y) in zip(range(n_batches), val_flow):
    predicted_batches.append(model.predict(X).ravel())
    label_batches.append(y)
    print("%d/%d" % (i + 1, n_batches))

predictions = np.concatenate(predicted_batches)
true_labels = np.concatenate(label_batches)
# largest absolute error between predicted probability and true label
top_offenders = np.abs(predictions - true_labels).argsort()[::-1][:10]

image_names = np.array(val_flow.filenames, dtype=object)[top_offenders]
for img, pred in zip(image_names, predictions[top_offenders]):
    print("predicted dog probability: %0.4f" % pred)
    display(Image(op.join(validation_folder, img)))
# Analysis:
#
# The worst offender has a grid occlusion: this kind of grid is
# probably much more frequent for dogs in the rest of the training
# set. This is an unwanted bias of our dataset.
#
# To fix it we would probably need to add other images with similar
# occlusion patterns to teach the model to be invariant to them.
# This could be achieved with a dedicated data augmentation scheme.
#
# The image with both a dog and a cat could clearly be considered a
# labeling error: this kind of ambiguous image should be removed
# from the validation set to properly assess the generalization ability
# of the model.
#
# The other errors are harder to understand. Introspecting the gradients
# back to the pixel space could help understand what's misleading the
# model. It could be some elements in the background that are
# statistically very correlated to dogs in the training set.
In [36]:
from tensorflow.keras.layers import Add

# List the indices of the residual blocks' merge (Add) layers:
[(i, l.output_shape)
 for (i, l) in enumerate(model.layers)
 if isinstance(l, Add)]
Let's freeze the weights of the low-level layers and fine-tune only the top-level layers:
In [37]:
for i, layer in enumerate(model.layers):
    layer.trainable = i >= 151
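A quick way to verify what will actually be updated during fine-tuning:

n_trainable = sum(layer.trainable for layer in model.layers)
print("%d trainable layers out of %d" % (n_trainable, len(model.layers)))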
Let's fine-tune the top-level layers a bit to see if we can further improve the accuracy. Use the nvidia-smi command in a bash terminal on the server to monitor the GPU usage while the model is training.
In [ ]:
from tensorflow.keras import optimizers

augmenting_datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    preprocessing_function=preprocess_function,
)
train_flow = augmenting_datagen.flow_from_directory(
    train_folder, target_size=(224, 224), batch_size=batch_size,
    class_mode='binary', shuffle=True, seed=0)

opt = optimizers.SGD(learning_rate=1e-4, momentum=0.9)
model.compile(optimizer=opt, loss='binary_crossentropy',
              metrics=['accuracy'])

# compute the validation metrics after every 5000 training samples
history = model.fit(train_flow,
                    steps_per_epoch=5000 // batch_size,
                    epochs=30,
                    validation_data=val_flow,
                    validation_steps=val_flow.n // batch_size)
# Note: the pretrained model was already very good. Fine tuning
# does not really seem to help. It might be more interesting to
# introspect the quality of the labeling in the training set to
# check for images that are too ambiguous and should be removed
# from the training set.
Bonus exercise: train your own architecture from scratch using the Adam optimizer and data augmentation. Start with a small architecture first (e.g. 4 convolution layers interleaved with 2 max pooling layers, followed by a Flatten layer and two fully connected layers); a possible starting point is sketched below.
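A sketch of such a small architecture (the filter counts and dense-layer width are illustrative, not tuned):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# 4 convolution layers interleaved with 2 max pooling layers,
# followed by Flatten and two fully connected layers
small_model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),  # note: yields a large dense layer at this input size
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid'),
])
small_model.compile(optimizer='adam', loss='binary_crossentropy',
                    metrics=['accuracy'])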
Bonus exercise: run this notebook on an instance with several GPUs (NC12 or NC24 instances on Azure) and try to speed up the training with: https://medium.com/@kuza55/transparent-multi-gpu-training-on-tensorflow-with-keras-8b0016fd9012
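Note that with recent TensorFlow versions, tf.distribute.MirroredStrategy is a built-in alternative to the approach described in that post; a minimal sketch (the model shown is only a placeholder):

import tensorflow as tf

# MirroredStrategy replicates the model on all visible GPUs and
# averages the gradients across replicas at each step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas: %d" % strategy.num_replicas_in_sync)

with strategy.scope():
    # build and compile the model inside the scope so its variables
    # are created as mirrored variables
    parallel_model = tf.keras.applications.ResNet50(weights=None, classes=2)
    parallel_model.compile(optimizer='adam',
                           loss='categorical_crossentropy',
                           metrics=['accuracy'])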
In [ ]: