Wayne H Nixalo
FAI 1 - Practical Deep Learning I - Lesson 1/2 - Image Recognition
Code-Along with dogs_cats_redux.ipynb
In this tutorial, you will learn how to generate and submit predictions to the Kaggle competition Dogs vs. Cats Redux: Kernels Edition.
Directory structure:
utils/
    vgg16.py
    utils.py
lesson1/
    WNx_lesson1_codealong_cats_vs_dogs_redux.ipynb
    data/
        redux/
            train/
                cat.437.jpg
                dog.9924.jpg
                ....
            test/
                231.jpg
                325.jpg
                ....
You can download the data files from the competition page here or you can download them from the command line using the Kaggle CLI.
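If you go the CLI route, the (unofficial) kaggle-cli tool works roughly like this -- a sketch only; the flags may differ by tool version:
pip install kaggle-cli
kg download -u <your-username> -p <your-password> -c dogs-vs-cats-redux-kernels-edition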
Launch notebook inside lesson1 directory
cd lesson1
jupyter notebook
In [1]:
#Verify we're in the lesson1 directory
%pwd
Out[1]:
In [2]:
# Create references to important directories we'll use over and over
import os, sys
current_dir = os.getcwd()
LESSON_HOME_DIR = current_dir
# %mkdir data # make it if it isn't there
DATA_HOME_DIR = current_dir+'/data'
For some reason, my current working directory wasn't on the system path; instead, sys.path was a bunch of references to my Miniconda3 directory. Maybe it's because of how I installed it, and os.path and sys.path read from a global path file; I don't know.
However, if I add the current working directory + '../utils' to my sys.path, I'm able to import modules living in the utils folder in the parent directory. I think J.Howard was doing this, but my setup is different. Maybe.
In [3]:
# Allow relative imports to directories above lesson1/
# sys.path.insert(1, os.path.join(sys.path[0], '..'))
sys.path.insert(1, os.path.join(LESSON_HOME_DIR, '../utils'))
# import modules
from utils import *
from vgg16 import Vgg16
# Instantiate plotting tool
# In Jupyter notebooks, you will need to run this command before doing any plotting
%matplotlib inline
In [4]:
# # Me playing w/ sys.path & os.path to get utils/ loaded above.
# sys.path.insert(1,'/Users/WayNoxchi/Deshar/CodeDeshar/FAI/utils')
# print(sys.path[1])
# help(os.path.join)
In [ ]:
# Create directories
%cd $DATA_HOME_DIR
%mkdir valid
%mkdir results
%mkdir -p sample/train
%mkdir -p sample/test
%mkdir -p sample/valid
%mkdir -p sample/results
%mkdir -p test/unknown
Go into the data/train directory and move a shuffled set of images to data/valid. Then copy a shuffled set of images from data/train to data/sample/train, and from data/valid copy a shuffled set to data/sample/valid.
The helper function copyfile(..) from shutil handles the copying; moving is done via os.rename(..).
On my MacBook I'm working with a cut-down dataset of 1000 imgs: 600 train, 400 test. From that: 50 for validation, 100 as a sample. On my Asus it'll be the full set: 25k train, 12.5k test.
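The cells below do this inline, but the same pattern can be wrapped up. This is a sketch of a hypothetical helper (sample_files is my name, not something from utils.py):
from glob import glob
from shutil import copyfile
import os
import numpy as np

def sample_files(src, dst, n, move=False):
    # Shuffle the jpgs in src, then move or copy the first n into dst
    g = np.random.permutation(glob(os.path.join(src, '*.jpg')))
    for f in g[:n]:
        target = os.path.join(dst, os.path.basename(f))
        if move: os.rename(f, target)
        else:    copyfile(f, target)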
In [5]:
%cd $DATA_HOME_DIR/train
In [ ]:
g = glob('*.jpg')
shuf = np.random.permutation(g)
# move validation set from data/train --> data/valid
for i in range(50): os.rename(shuf[i], DATA_HOME_DIR + '/valid/' + shuf[i])
In [6]:
from shutil import copyfile
In [ ]:
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(100): copyfile(shuf[i], DATA_HOME_DIR + '/sample/train/' + shuf[i])
In [7]:
%cd $DATA_HOME_DIR/valid
In [ ]:
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(10): copyfile(shuf[i], DATA_HOME_DIR + '/sample/valid/' + shuf[i])
In [ ]:
# Divide cat/dog images into separate directories
# gotta do this for both sample folders, and train, and valid
%cd $DATA_HOME_DIR/sample/train
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/
%cd $DATA_HOME_DIR/sample/valid
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/
%cd $DATA_HOME_DIR/train
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/
%cd $DATA_HOME_DIR/valid
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/
In [ ]:
# Create single 'unknown' class for test set
%cd $DATA_HOME_DIR/test
%mv *.jpg unknown/
In [8]:
%cd $DATA_HOME_DIR
# Point path at the sample/ directory if desired; otherwise just use DATA_HOME_DIR + '/'
path = DATA_HOME_DIR + '/sample/'
test_path = DATA_HOME_DIR + '/test/' # we're using all the test data
results_path = DATA_HOME_DIR + '/results/'
train_path = DATA_HOME_DIR + '/train/'
valid_path = DATA_HOME_DIR + '/valid/'
In [9]:
# from theano.ifelse import ifelse
In [10]:
# import Vgg16 helper class
# NOTE: self.FILE_PATH may be hard-coded as 'http://www.platform.ai/models/'. J.Howard updated
# this to 'http://files.fast.ai/models/', so update vgg16.py & vgg16bn.py accordingly
vgg = Vgg16()
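Since Vgg16 downloads its weights on instantiation, you can at least confirm afterwards which server your copy pointed at -- assuming, per the note above, that vgg16.py stores the URL as self.FILE_PATH:
print vgg.FILE_PATH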
In [11]:
# Set constants. You can experiment with no_of_epochs to improve the model
batch_size=64
no_of_epochs=3
In [12]:
# Finetune the model.
# validation doesn't need backpropagation, so it uses less memory and can take larger batches
batches = vgg.get_batches(train_path, batch_size=batch_size)
val_batches = vgg.get_batches(valid_path, batch_size=batch_size*2)
vgg.finetune(batches)
# Not sure if we set this for all fits
vgg.model.optimizer.lr = 0.01
We're also saving our weights after each epoch.
In [13]:
# NOTE we are passing in the validation dataset to the fit() method
# For each epoch, we test our model against the validation set
latest_weights_filename = None
for epoch in range(no_of_epochs):
    print "Running epoch %d" % epoch
    vgg.fit(batches, val_batches, nb_epoch=1)
    latest_weights_filename = 'ft%d.h5' % epoch
    vgg.model.save_weights(results_path + latest_weights_filename)
print "Completed %s fit operations" % no_of_epochs
# may want to start with a higher η, then lower to 0.01 after 1st epoch.
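A sketch of that idea, reusing the same loop shape and the vgg.model.optimizer.lr assignment from above (the rates here are illustrative, not tuned):
vgg.model.optimizer.lr = 0.1              # higher rate for the first epoch
vgg.fit(batches, val_batches, nb_epoch=1)
vgg.model.save_weights(results_path + 'ft0.h5')
vgg.model.optimizer.lr = 0.01             # then drop back down
for epoch in range(1, no_of_epochs):
    vgg.fit(batches, val_batches, nb_epoch=1)
    vgg.model.save_weights(results_path + 'ft%d.h5' % epoch)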
In [14]:
batches, preds = vgg.test(test_path, batch_size=batch_size*2)
In [17]:
# For every image, vgg.test() generates two probabilities, ordered according to how the
# cats/dogs directories are ordered. It looks like column 1 is cats, column 2 is dogs.
print preds[:5]
filenames = batches.filenames
print filenames[:5]
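A quicker check than eyeballing images: the column order comes from the class_indices mapping Keras built from the training directories, which the validation generator also carries (a sketch):
print val_batches.class_indices   # expect {'cats': 0, 'dogs': 1}, i.e. col 0 = cats, col 1 = dogs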
In [51]:
# You can verify the column ordering by viewing some images
from PIL import Image
from matplotlib.pyplot import imshow
imshow(Image.open(test_path + filenames[2]))
# print test_path + filenames[2]
Out[51]:
NOTE: I wasn't able to get PIL.Image.open(•) to display an image in the notebook. Instead it returns the image object:
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=499x343 at 0x12035B8C0>
If I run .show() after the call, PIL opens the image in my system's image viewer. I got around this by importing matplotlib.pyplot.imshow, then calling imshow(•) on the image object:
imshow(Image.open(test_path + filenames[2]))
And this displays the image, albeit as a plot.
In [52]:
# Save our test results arrays so we can use them again later
save_array(results_path + 'test_preds.dat', preds)
save_array(results_path + 'filenames.dat', filenames)
Keras' fit() function conveniently shows us the value of the loss function, and the accuracy, after every epoch (each full run through all the training examples). The most important metrics to look at are those for the validation set, since we want to check for over-fitting.
As well as looking at the overall metrics, it's also a good idea to look at examples of each of:
1. A few correct labels at random
2. A few incorrect labels at random
3. The most correct labels of each class (i.e., those with the highest probability that are correct)
4. The most incorrect labels of each class (i.e., those with the highest probability that are incorrect)
5. The most uncertain labels (i.e., those with probability closest to 0.5)
Let's see what we can learn from these examples. (In general, this is a particularly useful technique for debugging problems in the model. However, since this model is so simple, there may not be too much to learn at this stage.)
Calculate predictions on the validation set, so we can find correct and incorrect examples:
In [53]:
vgg.model.load_weights(results_path + latest_weights_filename)
In [54]:
val_batches, probs = vgg.test(valid_path, batch_size = batch_size)
In [55]:
filenames = val_batches.filenames
expected_labels = val_batches.classes # 0 = cat, 1 = dog
# Round our predictions to 0/1 to generate labels
our_predictions = probs[:,0]                 # probability the image is a cat
our_labels = np.round(1 - our_predictions)   # so 0 = cat, 1 = dog
In [56]:
from keras.preprocessing import image
# Helper function to plot images by index in the validation set
# Plots is a helper function in utils.py
def plots_idx(idx, titles=None):
    plots([image.load_img(valid_path + filenames[i]) for i in idx], titles=titles)
# Number of images to view for each visualization task
n_view = 4
In [58]:
# 1. A few correct labels at random
correct = np.where(our_labels==expected_labels)[0]
print "Found %d correct labels" % len(correct)
idx = permutation(correct)[:n_view]
plots_idx(idx, our_predictions[idx])
In [60]:
# 2. A few incorrect labels at random
incorrect = np.where(our_labels!=expected_labels)[0]
print "Found %d incorrect labels" % len(incorrect)
idx = permutation(incorrect)[:n_view]
plots_idx(idx, our_predictions[idx])
In [63]:
# 3a. The images we're most confident were cats, and are actually cats
correct_cats = np.where((our_labels==0) & (our_labels==expected_labels))[0]
print "Found %d confident correct cats labels" % len(correct_cats)
most_correct_cats = np.argsort(our_predictions[correct_cats])[::-1][:n_view]  # highest P(cat) first
plots_idx(correct_cats[most_correct_cats], our_predictions[correct_cats][most_correct_cats])
In [65]:
# 3b. The images we're most confident were dogs, and are actually dogs
correct_dogs = np.where((our_labels==1) & (our_labels==expected_labels))[0]
print "Found %d confident correct dogs labels" % len(correct_dogs)
most_correct_dogs = np.argsort(our_predictions[correct_dogs])[:n_view]  # lowest P(cat) = most confidently dog
plots_idx(correct_dogs[most_correct_dogs], our_predictions[correct_dogs][most_correct_dogs])
In [66]:
# 4a. The images we're most confident were cats, but are actually dogs
incorrect_cats = np.where((our_labels==0) & (our_labels!=expected_labels))[0]
print "Found %d incorrect cats labels" % len(incorrect_cats)
if len(incorrect_cats):
    most_incorrect_cats = np.argsort(our_predictions[incorrect_cats])[::-1][:n_view]
    plots_idx(incorrect_cats[most_incorrect_cats], our_predictions[incorrect_cats][most_incorrect_cats])
In [67]:
# 4b. The images we're most confident were dogs, but are actually cats
incorrect_dogs = np.where((our_labels==1) & (our_labels!=expected_labels))[0]
print "Found %d incorrect dogs labels" % len(incorrect_dogs)
if len(incorrect_dogs):
    most_incorrect_dogs = np.argsort(our_predictions[incorrect_dogs])[:n_view]  # lowest P(cat) = most confidently dog
    plots_idx(incorrect_dogs[most_incorrect_dogs], our_predictions[incorrect_dogs][most_incorrect_dogs])
In [68]:
# 5. The most uncertain labels (ie those with probability closest to 0.5)
most_uncertain = np.argsort(np.abs(our_predictions - 0.5))
plots_idx(most_uncertain[:n_view], our_predictions[most_uncertain])
Perhaps the most common way to analyze the result of a classification model is to use a confusion matrix. Scikit-learn has a convenient function we can use for this purpose:
In [69]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(expected_labels, our_labels)
We can just print out the confusion matrix, or we can show a graphical view (which is mainly useful for datasets with a larger number of categories).
In [70]:
plot_confusion_matrix(cm, val_batches.class_indices)
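As a cross-check against the fit() logs, overall validation accuracy can be read straight off the matrix (a sketch):
print 'Validation accuracy: %.4f' % (np.trace(cm) / float(cm.sum()))   # diagonal = correct predictions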
Here's the format Kaggle requires for new submissions:
imageId,isDog
1242, .3984
3947, .1000
4539, .9082
2345, .0000
Kaggle wants the imageId followed by the probability of the image being a dog. Kaggle uses a metric called Log Loss to evaluate your submission.
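For a binary problem, the per-image loss is -(y*log(p) + (1-y)*log(1-p)), averaged over all images. A minimal sketch (binary_log_loss is my own illustrative helper, not Kaggle's code):
import numpy as np

def binary_log_loss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)   # keep log() finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print binary_log_loss(np.array([1, 0]), np.array([0.9, 0.1]))   # ~0.1054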
In [71]:
# Load our test predictions from file
preds = load_array(results_path + 'test_preds.dat')
filenames = load_array(results_path + 'filenames.dat')
In [76]:
# Grab the dog prediction column
isdog = preds[:,1]
print "Raw Predictions: " + str(isdog[:5])
print "Mid Predictions: " + str(isdog[(isdog < 0.6) & (isdog > 0.4)])
print "Edge Predictions: " + str(isdog[(isdog == 1) | (isdog == 0)][:5])
Log Loss doesn't support probability values of 0 or 1 -- they're undefined (and we have many). Fortunately, Kaggle helps us by offsetting our 0s and 1s by a very small value. So if we upload our submission now, we'll have lots of 0.999999999 and 0.000000001 values. This seems good, right?
Not so. There's an additional twist due to how log loss is calculated -- log loss rewards predictions that are confident and correct (p=.9999,label=1), but it punishes predictions that are confident and wrong far more (p=.0001,label=1). See visualization below.
In [146]:
# Visualize Log Loss when True value = 1
# y-axis is log loss, x-axis is probability that label = 1
# As you can see Log Loss increases rapidly as we approach 0
# But increases slowly as our predicted probability gets closer to 1
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import log_loss
x = [i*.0001 for i in xrange(1,10000)]
y = [log_loss([1],[[i*.0001,1-(i*.0001)]],eps=1e-15,labels=[0,-1]) for i in xrange(1,10000,1)]
plt.plot(x, y)
plt.axis([-.05, 1.1, -.8, 10])
plt.title("Log Loss when true label = 1")
plt.xlabel("predicted probability")
plt.ylabel("log loss")
plt.show()
The method above was causing a ValueError because sklearn.metrics.log_loss(•) wants to know its labels (labels wasn't explicitly defined). The code above makes the correct plot; see this forum thread:
http://forums.fast.ai/t/sklearn-log-loss-error/839/6?u=borz
If I input y_true = [0] and labels=[0,1], OR y_true = [1] and labels=[0,-1], I get the correct plot. y_true = [1] and labels=[0,1] generates a reversed plot.
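So an equivalent fix, per that note, keeps labels=[0,1] and flips y_true -- a sketch of just the alternative call, using the same log_loss import as above:
y = [log_loss([0], [[i*.0001, 1-(i*.0001)]], eps=1e-15, labels=[0,1]) for i in xrange(1,10000)]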
In [142]:
# So to play it safe, we use a sneaky trick to pull our edge predictions inward:
# clip everything into [0.05, 0.95], so all 1s become .95 and all 0s become .05
isdog = isdog.clip(min=0.05, max=0.95)
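To see what the clip buys us, compare the loss term for a wrong, maximally-confident prediction against a clipped one (a sketch; 1e-9 stands in for Kaggle's tiny offset):
for p in (1e-9, 0.05):
    print 'p = %g  ->  loss = %.4f' % (p, -np.log(p))   # loss term when the true label is 1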
In [145]:
# Extract imageIds from the filenames in our test/unknown directory
filenames = batches.filenames
ids = np.array([int(f[8:f.find('.')]) for f in filenames]) # 'unknown/' is 8 chars long
Here we join the two columns into an array of [imageId, isDog]
In [147]:
subm = np.stack([ids,isdog],axis=1)
subm[:5]
Out[147]:
In [148]:
%cd $DATA_HOME_DIR
submission_file_name = 'submission1.csv'
np.savetxt(submission_file_name, subm, fmt='%d,%.5f', header='id,label', comments='')
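Before uploading, a quick peek at the first few rows to confirm the header and format (a sketch):
print ''.join(open(submission_file_name).readlines()[:5])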
In [153]:
from IPython.display import FileLink
%cd $LESSON_HOME_DIR
FileLink('data/' + submission_file_name)
Out[153]:
Some helpful links for IPython.display
-- Wayne H Nixalo 05 May 2017
In [ ]: