This setup code needs re-running every time the notebook is restarted.
In [1]:
%matplotlib inline
path = "/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/"
import utils
# from imp import reload  # needed in Python 3, where reload() is no longer a builtin
# reload(utils)
from utils import *
TO DO: run the data-preparation cells below once only.
N.B. Don't rerun the cells below, or you'll create extra directories and move/copy more files than intended! (A re-runnable, guarded sketch follows this note.)
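If the setup does need re-running, a guarded version avoids the duplicate-directory problem. A minimal sketch (my own helper, assuming the `path` variable from the setup cell above; not part of the lesson):
# Re-runnable sketch of the directory setup (not from the original notebook):
# each mkdir is skipped if the directory already exists. 'results' is included
# because the fine-tuned weights are saved there later on.
import os

for sub in ('valid', 'sample/train', 'sample/valid', 'results'):
    d = os.path.join(path, sub)
    if not os.path.exists(d):
        os.makedirs(d)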
In [59]:
%pwd
Out[59]:
In [13]:
%cd /home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux
In [66]:
!tree -d
In [67]:
%cd train
In [68]:
%mkdir ../valid
In [69]:
# Move 2,000 randomly chosen training images into the validation set
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(2000): os.rename(shuf[i], '../valid/' + shuf[i])
In [70]:
%mkdir ../sample
%mkdir ../sample/train
%mkdir ../sample/valid
In [72]:
!tree -d ../
In [75]:
from shutil import copyfile
In [ ]:
%pwd
In [74]:
# Copy 200 random training images into the sample training set
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(200): copyfile(shuf[i], '../sample/train/' + shuf[i])
In [77]:
%cd ../valid
In [78]:
%pwd
Out[78]:
In [79]:
# Copy 50 random validation images into the sample validation set
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(50): copyfile(shuf[i], '../sample/valid/' + shuf[i])
In [1]:
# %ls
Move all the cats into a cats/ directory and all the dogs into a dogs/ directory ... for each of the four sets? Is this why the %cd commands are separated out below?
Yes: the same mkdir/mv cell is re-run once inside each of train, valid, sample/train and sample/valid. (A consolidated Python sketch follows that cell.)
In [84]:
%cd ../train
In [86]:
%cd ../valid
In [88]:
%cd ../sample/train
In [90]:
%cd ../valid
In [32]:
%pwd
Out[32]:
In [93]:
# Run this cell once in each of: train, valid, sample/train, sample/valid
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/
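For reference, the per-class reorganisation above can also be scripted in plain Python instead of re-running the magics in each directory. A hedged sketch (my own helper names, assuming `path` from the setup cell; not from the lesson):
# Sketch only: move cat.*.jpg into cats/ and dog.*.jpg into dogs/ for each
# of the four image sets.
import os

def split_into_classes(dirname):
    for cls in ('cats', 'dogs'):
        if not os.path.exists(os.path.join(dirname, cls)):
            os.mkdir(os.path.join(dirname, cls))
    for fname in os.listdir(dirname):
        if not fname.endswith('.jpg'):
            continue
        cls = 'cats' if fname.startswith('cat.') else 'dogs'
        os.rename(os.path.join(dirname, fname),
                  os.path.join(dirname, cls, fname))

for s in ('train', 'valid', 'sample/train', 'sample/valid'):
    split_into_classes(path + s)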
In [12]:
%cd data/redux
In [10]:
# Create a single 'unknown' class for the test set
# (the batch generator expects images inside a class subdirectory)
%cd test
%mkdir unknown
%mv *.jpg unknown/
%cd ..
In [14]:
!tree -d
This code can just be copied from the standard Lesson-1 material
In [2]:
print(path)
In [3]:
# Use as large a batch size as your GPU allows; no larger than 64 is recommended here.
# If you have an older or smaller-memory GPU you'll run out of memory, so decrease this.
batch_size = 64
In [5]:
# Import our class, and instantiate
import vgg16
from vgg16 import Vgg16
vgg = Vgg16()
In [6]:
# Grab a few images at a time for training and validation.
# NB: They must be in subdirectories named based on their category
print(path+'train')
batches = vgg.get_batches(path+'train', batch_size = batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size = batch_size*2)
vgg.finetune(batches)
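For reference, a get_batches-style helper is basically a thin wrapper around Keras's directory iterator, which is why the class-named subdirectories matter. A rough sketch of the idea, not necessarily the exact vgg16.py code:
# Rough sketch (may differ from the real vgg16.get_batches): Keras reads the
# images straight from the class-named subdirectories and yields batches of
# (images, one-hot labels).
from keras.preprocessing.image import ImageDataGenerator

def get_batches_sketch(dirname, batch_size=64, shuffle=True):
    gen = ImageDataGenerator()
    return gen.flow_from_directory(dirname,
                                   target_size=(224, 224),  # VGG16's input size
                                   class_mode='categorical',
                                   shuffle=shuffle,
                                   batch_size=batch_size)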
In [40]:
vgg.fit(batches, val_batches, nb_epoch=1)
In [41]:
vgg.model.save_weights(path + 'results/ft1.h5')
In [43]:
vgg.model.load_weights(path + 'results/ft1.h5')
In [44]:
vgg.fit(batches, val_batches, nb_epoch=1)
In [45]:
vgg.model.save_weights(path + 'results/ft2.h5')
After 2 epochs, let's reduce the LR... (annealing):
In [46]:
vgg.model.optimizer.lr = 0.01
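Depending on the Keras version, assigning a plain float to the attribute may not update the optimizer's underlying variable. A more explicit alternative (my suggestion, not what the notebook does) is to set it through the backend:
# Alternative way to anneal the learning rate (not from the original notebook):
# set the value of the optimizer's learning-rate variable directly.
from keras import backend as K
K.set_value(vgg.model.optimizer.lr, 0.01)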
In [47]:
vgg.fit(batches, val_batches, nb_epoch=1)
In [48]:
vgg.model.save_weights(path + 'results/ft3.h5')
In [116]:
vgg.fit(batches, val_batches, nb_epoch=1)
In [122]:
vgg.model.save_weights(path + 'results/ft4.h5')
Kaggle shows what it expects by providing a sample submission file:
In [15]:
%ls
In [16]:
!head sample_submission.csv
So the format is (id, label), where the id is the image's file number and the label is the probability (0 ≤ p ≤ 1) that the image is a dog.
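Purely as an illustration of that format (the ids and probabilities below are made up; the real ones are computed further down):
# Illustration only -- invented id/probability values showing the target
# 'id,label' layout used for the submission file.
header = 'id,label'
example_rows = ['%d,%.5f' % (1, 0.95), '%d,%.5f' % (2, 0.05)]
print('\n'.join([header] + example_rows))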
And, of course, there's the test set itself:
In [17]:
%ls test
In [14]:
%ls test -d
In [18]:
print(path + 'test')
print(len(os.listdir(path + 'test')))
Let's make some real predictions, then...
In [49]:
batches, preds = vgg.test(path + 'test', batch_size = batch_size * 2)
N.B. We need the batches object in order to get the filenames, since the image IDs are parsed out of those filenames.
In [50]:
filenames = batches.filenames
N.B. Kaggle expects an 'is dog' probability, which corresponds to the second column (index 1) of the predictions below:
In [51]:
preds[:5]
Out[51]:
In [52]:
filenames[:5]
Out[52]:
In [53]:
save_array(path + 'results/test_preds.dat', preds)
save_array(path + 'results/filenames.dat', filenames)
In [54]:
preds = load_array(path + 'results/test_preds.dat')
filenames = load_array(path + 'results/filenames.dat')
In [55]:
from PIL import Image
In [56]:
Image.open(path + 'test/' + filenames[0])
Out[56]:
In [57]:
isdog = np.clip(preds[:,1], 0.02, 0.98) # take the 'dog' probability from column 1 of preds
# and clip it, so that over-confident predictions aren't punished too harshly by log loss
In [58]:
isdog[:5]
Out[58]:
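To see why the clipping matters, here's a quick worked example of the per-image log loss for a dog image (my own illustration; the 0.02/0.98 bounds are the ones used above):
# For a true 'dog' image the per-image log loss is -log(p), where p is the
# submitted dog-probability. A confidently wrong answer (p near 0) is punished
# enormously; clipping at 0.02 caps the penalty at about 3.91.
import numpy as np

for p in (1e-6, 0.02, 0.5, 0.98):
    print('p = %-8g log loss = %.2f' % (p, -np.log(p)))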
Each filename comes back with a partial path such as 'unknown/4392.jpg', so build the ID list with a list comprehension that slices each string from index 8 (just after 'unknown/') up to the first '.', giving e.g. '4392'. (A path-based alternative is sketched after the cell below.)
In [59]:
ids = [int(f[8:f.find('.')]) for f in filenames]
ids[:5]
Out[59]:
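A slightly more robust alternative (mine, not the notebook's) that doesn't depend on the 'unknown/' prefix being exactly 8 characters:
# Alternative ID parsing (not from the original notebook): strip the directory
# and the extension instead of slicing at a fixed offset.
import os
ids = [int(os.path.splitext(os.path.basename(f))[0]) for f in filenames]
ids[:5]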
Use NumPy's stack to put these two columns together into a single array of predictions (see the note after this cell about the resulting dtype).
In [60]:
subm = np.stack([ids, isdog], axis = 1)
subm[:5]
Out[60]:
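One small caveat (my observation, not from the notebook): stacking the integer ids with the float probabilities upcasts the whole array to float, which is why the savetxt call below uses fmt='%d,%.5f' to write the first column back out as integers.
# Tiny demo of the upcast, with invented values.
import numpy as np
demo = np.stack([[1, 2, 3], [0.98, 0.02, 0.50]], axis=1)
print(demo.dtype)   # float64 -- the ids are no longer stored as integers
print(demo[0])      # first row: id 1 and probability 0.98, both as floats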
This is our array of answers to the test data.
In [61]:
%pwd
Out[61]:
In [62]:
np.savetxt(path + 'subm98.csv', subm, fmt = '%d,%.5f', header = 'id,label', comments = '')
All we need to do now is present this for uploading. Here's a nice trick for making this easy.
In [63]:
from IPython.display import FileLink
FileLink(path + 'subm98.csv')
Out[63]:
N.B. this code is highly reusable!
We need to do 5 things. We should look at examples of:
1. A few correct labels at random
2. A few incorrect labels at random
3. The most correct labels of each class (the most confident predictions that were right)
4. The most incorrect labels of each class (the most confident predictions that were wrong)
5. The most uncertain labels (probabilities closest to 0.5)
Note that this reuses the 'test' method in vgg, this time on the validation set...
In [7]:
vgg.model.load_weights(path+'results/ft1.h5')
val_batches, probs = vgg.test(path+'valid', batch_size = batch_size)
In [8]:
labels = val_batches.classes
filenames = val_batches.filenames
N.B. the 'probs' here are the predicted class probabilities; column 0 is the 'cat' probability, so below probs becomes P(cat) and preds = round(1 - probs) gives 1 for dog, 0 for cat.
In [9]:
probs = probs[:,0]           # probability of 'cat' (column 0)
preds = np.round(1-probs)    # 1 = dog, 0 = cat
probs[:8]
Out[9]:
In [10]:
preds[:8]
Out[10]:
In [21]:
# Number of images to view for each visualisation task
n_view = 4
Helper function to plot images by index in the validation set:
In [22]:
def plots_idx(idx, titles = None):
plots([image.load_img(path + 'valid/' + filenames[i]) for i in idx], titles = titles)
In [23]:
#1. A few correct labels at random
correct = np.where(preds == labels)[0]
idx = permutation(correct)[:n_view]
plots_idx(idx, probs[idx])
In [24]:
#2. A few incorrect labels at random
incorrect = np.where(preds != labels)[0]
idx = permutation(incorrect)[:n_view]
plots_idx(idx, probs[idx])
How does 'correct_cats[most_correct_cats]' work? correct_cats holds the validation-set indices of the correctly predicted cats; most_correct_cats (from argsort) holds positions within that array; so correct_cats[most_correct_cats] maps those positions back to validation-set indices. (A small demo with made-up numbers follows the inspection cell below.)
In [25]:
#3a. The most confident correct cat predictions
correct_cats = np.where((preds == 0) & (preds == labels))[0]
most_correct_cats = np.argsort(probs[correct_cats])[::-1][:n_view]
plots_idx(correct_cats[most_correct_cats], probs[correct_cats][most_correct_cats])
In [31]:
print(correct_cats[:5])
print(correct_cats.shape)
print(most_correct_cats[:5])
print(most_correct_cats.shape)
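A tiny standalone demo of the same double-indexing pattern, with made-up numbers:
# Demo of correct_cats[most_correct_cats]-style indexing (invented values).
# 'hits' holds positions in the full validation set; argsort gives positions
# *within* hits; indexing hits with those positions maps back to the
# validation set.
import numpy as np

probs_demo = np.array([0.9, 0.2, 0.99, 0.7])   # pretend cat-probabilities
hits = np.array([0, 2, 3])                     # indices predicted (and labelled) cat
order = np.argsort(probs_demo[hits])[::-1]     # [1, 0, 2]: within-hits positions, highest prob first
print(hits[order])                             # [2 0 3]: mapped back to validation-set indices
print(probs_demo[hits][order])                 # [0.99 0.9 0.7]: the matching probabilities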
In [26]:
#3b. The most confident correct dog predictions
correct_dogs = np.where((preds == 1) & (preds == labels))[0]
most_correct_dogs = np.argsort(probs[correct_dogs])[:n_view]  # lowest cat-probability = most confidently dog
plots_idx(correct_dogs[most_correct_dogs], probs[correct_dogs][most_correct_dogs])
In [33]:
#4a. The images where we were most confident that they were cats, but they were dogs
incorrect_cats = np.where((preds == 0) & (preds != labels))[0]
most_incorrect_cats = np.argsort(probs[incorrect_cats])[::-1][:n_view]
plots_idx(incorrect_cats[most_incorrect_cats], probs[incorrect_cats][most_incorrect_cats])
Fair play ... some of these are very hard! If these are the images the model struggles with, it's probably doing okay.
In [34]:
#4b. The images where we were most confident that they were dogs, but they were cats
incorrect_dogs = np.where((preds == 1) & (preds != labels))[0]
most_incorrect_dogs = np.argsort(probs[incorrect_dogs])[:n_view]  # lowest cat-probability = most confidently 'dog'
plots_idx(incorrect_dogs[most_incorrect_dogs], probs[incorrect_dogs][most_incorrect_dogs])
In [38]:
#5. The most uncertain predictions (probabilities closest to 0.5)
most_uncertain = np.argsort(np.abs(probs - 0.5))
plots_idx(most_uncertain[:n_view], probs[most_uncertain])