Homework for Lesson 1

This setup code needs re-running every time the notebook is restarted.


In [1]:
%matplotlib inline
path = "/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/"
import utils
# from imp import reload  # fixes a P2-P3 incompatibility
# reload(utils)
from utils import *


WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1070 (CNMeM is disabled, cuDNN 5110)
Using Theano backend.

TO DO:

  1. create validation set and sample
  2. move to separate dirs for each set
  3. finetune and train
  4. submit

Create validation set and sample

In Kaggle: login and agree to the competition rules (easily forgotten).

In shell: kg download -c dogs-vs-cats-redux-kernels-edition
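
Once the zips are downloaded (train.zip and test.zip, as the %ls later shows), unpack them here. A minimal sketch, assuming unzip is installed:


In [ ]:
# Unpack the competition data (run once, from the redux/ directory)
!unzip -q train.zip
!unzip -q test.zip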

N.B. Don't rerun the below! Or you'll create lots of new dirs!


In [59]:
%pwd


Out[59]:
'/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/train'

In [13]:
%cd /home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux

In [66]:
!tree -d


.
├── test
└── train

2 directories

In [67]:
%cd train


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/train

In [68]:
%mkdir ../valid

In [69]:
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(2000): os.rename(shuf[i], '../valid/' + shuf[i])

In [70]:
%mkdir ../sample
%mkdir ../sample/train
%mkdir ../sample/valid

In [72]:
!tree -d ../


../
├── sample
│   ├── train
│   └── valid
├── test
├── train
└── valid

6 directories

In [75]:
from shutil import copyfile

In [ ]:
%pwd

In [74]:
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(200): copyfile(shuf[i], '../sample/train/' + shuf[i])

In [77]:
%cd ../valid


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/valid

In [78]:
%pwd


Out[78]:
'/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/valid'

In [79]:
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(50): copyfile(shuf[i], '../sample/valid/' + shuf[i])

In [1]:
# %ls

Move to separate dirs for each set

Move all cats to a cats directory and all dogs to a dogs dir

... for each set? Is this why the 'cd' commands are separated out below: so the same mkdir/mv cell can be re-run in each directory?

YES!
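
An alternative to re-running the same cell in four places (my own consolidation, assuming Python 3 for os.makedirs(exist_ok=True)):


In [ ]:
# Hypothetical one-shot version of the per-directory mkdir/mv cells below
import os
from glob import glob

base = '/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/'
for d in ['train', 'valid', 'sample/train', 'sample/valid']:
    for cls in ['cats', 'dogs']:
        os.makedirs(os.path.join(base, d, cls), exist_ok=True)
    for f in glob(os.path.join(base, d, '*.jpg')):
        name = os.path.basename(f)
        cls = 'cats' if name.startswith('cat.') else 'dogs'
        os.rename(f, os.path.join(base, d, cls, name))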


In [84]:
%cd ../train


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/train

In [86]:
%cd ../valid


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/valid

In [88]:
%cd ../sample/train


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/sample/train

In [90]:
%cd ../valid


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/sample/valid

In [32]:
%pwd


Out[32]:
'/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/sample/valid'

In [93]:
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/


mkdir: cannot create directory ‘cats’: File exists
mkdir: cannot create directory ‘dogs’: File exists
mv: cannot stat 'cat.*.jpg': No such file or directory
mv: cannot stat 'dog.*.jpg': No such file or directory

(These errors just mean the cell had already been run in this directory; the files were moved on the first pass.)

In [12]:
%cd data/redux


[Errno 2] No such file or directory: 'data/redux'
/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/test

In [10]:
# Create single 'unknown' class for test set
#%cd test
%mkdir unknown
%mv *.jpg unknown/

In [14]:
!tree -d


.
├── results
├── sample
│   ├── train
│   │   ├── cats
│   │   └── dogs
│   └── valid
│       ├── cats
│       └── dogs
├── test
│   └── unknown
├── train
│   ├── cats
│   └── dogs
└── valid
    ├── cats
    └── dogs

16 directories

Finetune and train

This code can just be copied from the standard Lesson-1 material


In [2]:
print(path)
%mkdir results


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/

This code (the next two blocks) needs re-running after a restart


In [3]:
# As large as your GPU allows; 64 is the recommended maximum.
# On an older or cheaper GPU you'll run out of memory, so decrease this.
batch_size=64

In [5]:
# Import our class, and instantiate
import vgg16
from vgg16 import Vgg16
vgg = Vgg16()

In [6]:
# Grab a few images at a time for training and validation.
# NB: They must be in subdirectories named based on their category
print(path+'train')
batches = vgg.get_batches(path+'train', batch_size = batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size = batch_size*2)
vgg.finetune(batches)


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/train
Found 21000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
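
Roughly speaking, finetune swaps the 1000-way ImageNet softmax for a fresh 2-way one and freezes the pretrained layers. A hand-rolled sketch of the idea in Keras-1 style (an assumption about the utils internals, not a copy of them):


In [ ]:
# Hypothetical equivalent of vgg.finetune(batches), Keras 1 API
from keras.layers import Dense

model = vgg.model
model.pop()                                   # drop the 1000-way ImageNet head
for layer in model.layers:
    layer.trainable = False                   # freeze the pretrained layers
model.add(Dense(batches.nb_class, activation='softmax'))   # new 2-way head
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])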

Run a few epochs of fitting, saving the weights each time
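
The cells below do this by hand; an equivalent loop (my own consolidation) would be:


In [ ]:
# Hypothetical loop form of the fit/save cells that follow
for epoch in range(1, 4):
    vgg.fit(batches, val_batches, nb_epoch=1)
    vgg.model.save_weights(path + 'results/ft%d.h5' % epoch)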


In [40]:
vgg.fit(batches, val_batches, nb_epoch=1)


Epoch 1/1
21000/21000 [==============================] - 232s - loss: 0.1393 - acc: 0.9636 - val_loss: 0.0652 - val_acc: 0.9835

In [41]:
vgg.model.save_weights(path + 'results/ft1.h5')

In [43]:
vgg.model.load_weights(path + 'results/ft1.h5')

In [44]:
vgg.fit(batches, val_batches, nb_epoch=1)


Epoch 1/1
21000/21000 [==============================] - 219s - loss: 0.1011 - acc: 0.9758 - val_loss: 0.0541 - val_acc: 0.9870

In [45]:
vgg.model.save_weights(path + 'results/ft2.h5')

After 2 epochs, let's reduce the LR... (annealing):


In [46]:
vgg.model.optimizer.lr = 0.01
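
A possible gotcha (my understanding of Keras 1, worth verifying): the optimizer's lr is a compiled shared variable, so rebinding the attribute to a plain float may not affect the already-compiled update rule. Updating the variable in place should be safer:


In [ ]:
# Hypothetical safer way to anneal the learning rate in Keras 1
# (assumption about Keras internals; verify against your version)
from keras import backend as K
K.set_value(vgg.model.optimizer.lr, 0.01)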

In [47]:
vgg.fit(batches, val_batches, nb_epoch=1)


Epoch 1/1
21000/21000 [==============================] - 219s - loss: 0.0942 - acc: 0.9784 - val_loss: 0.0687 - val_acc: 0.9850

In [48]:
vgg.model.save_weights(path + 'results/ft3.h5')

In [116]:
vgg.fit(batches, val_batches, nb_epoch=1)


Epoch 1/1
21000/21000 [==============================] - 218s - loss: 0.1161 - acc: 0.9692 - val_loss: 0.0584 - val_acc: 0.9825

In [122]:
vgg.model.save_weights(path + 'results/ft4.h5')

Submit

Kaggle shows what it expects by providing a sample submission:


In [15]:
%ls


results/  sample_submission.csv  test.zip  train.zip
sample/   test/                  train/    valid/

In [16]:
!head sample_submission.csv


id,label
1,0.5
2,0.5
3,0.5
4,0.5
5,0.5
6,0.5
7,0.5
8,0.5
9,0.5

So the format is (id, label), where id is the file number and label is the predicted probability that the image is a dog, with 0 < p < 1.

And, of course, there's the test set itself:


In [17]:
%ls test


unknown/

In [14]:
%ls test -d


test/

Making the predictions


In [18]:
print(path + 'test')
print(len(os.listdir(path + 'test')))


/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux/test
1

Let's make some real predictions, then...


In [49]:
batches, preds = vgg.test(path + 'test', batch_size = batch_size * 2)


Found 12500 images belonging to 1 classes.

N.B. We need the batches to get the filenames, as we can parse the filenames to get the IDs


In [50]:
filenames = batches.filenames

N.B. Kaggle is expecting the probability that each image is a dog ('isDog'), which corresponds to the second column in the following:


In [51]:
preds[:5]


Out[51]:
array([[  1.7251e-12,   1.0000e+00],
       [  1.1085e-11,   1.0000e+00],
       [  9.9999e-01,   6.1808e-06],
       [  1.7331e-08,   1.0000e+00],
       [  1.7599e-10,   1.0000e+00]], dtype=float32)

In [52]:
filenames[:5]


Out[52]:
['unknown/429.jpg',
 'unknown/7807.jpg',
 'unknown/7292.jpg',
 'unknown/5620.jpg',
 'unknown/8651.jpg']

In [53]:
save_array(path + 'results/test_preds.dat', preds)
save_array(path + 'results/filenames.dat', filenames)

In [54]:
preds = load_array('results/test_preds.dat')
filenames = load_array('results/filenames.dat')

In [55]:
from PIL import Image

In [56]:
Image.open('test/'+filenames[0])


Out[56]:
[image output omitted: test image 'unknown/429.jpg']
In [57]:
isdog = np.clip(preds[:,1], 0.02, 0.98)     # create an isdog index from col #1 ('dog' col) of preds 
                                            # and 'clip' it, to downgrade its confident predictions (penalised by log loss)

In [58]:
isdog[:5]


Out[58]:
array([ 0.98,  0.98,  0.02,  0.98,  0.98], dtype=float32)
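
Why clip? Log loss punishes confident mistakes without bound, so a single wrong answer at p ≈ 1 can dominate the score, while clipping a correct answer to 0.98 costs almost nothing. A quick worked example (hypothetical values):


In [ ]:
import numpy as np

def logloss(y, p):
    # per-example log loss for true label y in {0, 1} and predicted P(dog) p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(logloss(0, 0.98))       # wrong but clipped:    ~3.91
print(logloss(0, 0.999999))   # wrong and confident:  ~13.8
print(logloss(1, 0.98))       # right but clipped:    ~0.02, a small price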

Each filename comes back with a partial path such as 'unknown/4392.jpg', so build the ID list by slicing each name from index 8 (just past 'unknown/') up to the '.' (giving e.g. '4392'), using a list comprehension.


In [59]:
ids = [int(f[8:f.find('.')]) for f in filenames]
ids[:5]


Out[59]:
[429, 7807, 7292, 5620, 8651]
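
A slightly more robust way to get the same IDs without the magic index 8 (my own variant):


In [ ]:
# Hypothetical alternative: let os.path do the parsing
import os
ids = [int(os.path.splitext(os.path.basename(f))[0]) for f in filenames]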

Use NumPy's stack to put these two columns together into a submission array. (Stacking ints with floats promotes everything to float; the savetxt format string below writes the id column back out as an integer.)


In [60]:
subm = np.stack([ids, isdog], axis = 1)
subm[:5]


Out[60]:
array([[  4.2900e+02,   9.8000e-01],
       [  7.8070e+03,   9.8000e-01],
       [  7.2920e+03,   2.0000e-02],
       [  5.6200e+03,   9.8000e-01],
       [  8.6510e+03,   9.8000e-01]])

This is our array of answers to the test data.


In [61]:
%pwd


Out[61]:
'/home/mark/Study/dl.fast.ai/deeplearning1/nbs/data/redux'

In [62]:
np.savetxt(path + 'subm98.csv', subm, fmt = '%d,%.5f', header = 'id,label', comments = '')
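
Worth a quick sanity check of the file before uploading:


In [ ]:
# Confirm the header and format match the sample submission
!head -3 subm98.csv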

All we need to do now is get the file off the server for uploading. FileLink is a nice trick for making this easy:


In [63]:
from IPython.display import FileLink
FileLink(path + 'subm98.csv')

Investigating model error

N.B. this code is highly reusable!

We need to do 5 things. We should look at examples of:

  1. correct labels, at random
  2. incorrect labels, at random
  3. the labels, in each class, with highest probability of being correct
  4. the labels, in each class, with highest probability of being incorrect
  5. the 'most-uncertain' labels (p ≈ 0.5)

First, load a set of weights and make some new predictions ('probs')

Note that this uses the 'test' method in vgg...


In [7]:
vgg.model.load_weights(path+'results/ft1.h5')
val_batches, probs = vgg.test(path+'valid', batch_size = batch_size)


Found 2000 images belonging to 2 classes.

In [8]:
labels = val_batches.classes
filenames = val_batches.filenames

N.B. vgg.test returns one probability per class; probs[:,0] is the probability of 'cats' (class 0), so preds = np.round(1 - probs) recovers the predicted class index (0 = cat, 1 = dog).


In [9]:
probs = probs[:,0]
preds = np.round(1-probs)
probs[:8]


Out[9]:
array([ 1.    ,  0.5743,  1.    ,  0.3327,  1.    ,  1.    ,  1.    ,  1.    ], dtype=float32)

In [10]:
preds[:8]


Out[10]:
array([ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.], dtype=float32)

In [21]:
# Number of images to view for each visualisation task
n_view = 4

Helper function to plot images by index in the validation set:


In [22]:
def plots_idx(idx, titles = None):
    plots([image.load_img(path + 'valid/' + filenames[i]) for i in idx], titles = titles)

1. A few correct labels at random


In [23]:
#1. A few correct labels at random 
correct = np.where(preds == labels)[0]
idx = permutation(correct)[:n_view]
plots_idx(idx, probs[idx])


2. A few incorrect labels at random:


In [24]:
incorrect = np.where(preds != labels)[0]
idx = permutation(incorrect)[:n_view]
plots_idx(idx, probs[idx])


3. The labels, in each class, with highest probability of being correct

[] I should try to understand how 'correct_cats[most_correct_cats]' works! (See the toy example below.)
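
A toy example with made-up numbers: correct_cats holds validation-set indices, argsort returns positions *within* that subset, and indexing correct_cats with those positions maps back to validation-set indices:


In [ ]:
import numpy as np

preds  = np.array([0, 0, 1, 0, 1])
labels = np.array([0, 1, 1, 0, 1])
probs  = np.array([0.9, 0.6, 0.2, 0.7, 0.1])    # P(cat), made up

correct_cats = np.where((preds == 0) & (preds == labels))[0]  # [0, 3]: validation-set indices
order = np.argsort(probs[correct_cats])[::-1][:4]             # [0, 1]: positions within correct_cats
print(correct_cats[order])                                    # [0 3]: back to validation-set indices
print(probs[correct_cats][order])                             # [0.9 0.7]: their probabilities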


In [25]:
correct_cats = np.where((preds == 0) & (preds == labels))[0]
most_correct_cats = np.argsort(probs[correct_cats])[::-1][:n_view]
plots_idx(correct_cats[most_correct_cats], probs[correct_cats][most_correct_cats])



In [31]:
print(correct_cats[:5])
print(correct_cats.shape)
print(most_correct_cats[:5])
print(most_correct_cats.shape)


[0 1 2 4 5]
(991,)
[990 415 429 427]
(4,)

In [26]:
correct_dogs = np.where((preds == 1) & (preds == labels))[0]
# probs is P(cat), so the most confident correct dogs have the *smallest* probs
most_correct_dogs = np.argsort(probs[correct_dogs])[:n_view]
plots_idx(correct_dogs[most_correct_dogs], 1 - probs[correct_dogs][most_correct_dogs])


4. The labels, in each class, with highest probability of being incorrect


In [33]:
#4a. The images where we were most confident that they were cats, but they were dogs
incorrect_cats = np.where((preds == 0) & (preds != labels))[0]
most_incorrect_cats = np.argsort(probs[incorrect_cats])[::-1][:n_view]
plots_idx(incorrect_cats[most_incorrect_cats], probs[incorrect_cats][most_incorrect_cats])


Fair play ... some of these are v. hard! We could conclude that the model is doing okay if it struggles with these.


In [34]:
#4b. The images where we were most confident that they were dogs, but they were cats
incorrect_dogs = np.where((preds == 1) & (preds != labels))[0]
# probs is P(cat): the most confidently-dog mistakes have the *smallest* probs
most_incorrect_dogs = np.argsort(probs[incorrect_dogs])[:n_view]
plots_idx(incorrect_dogs[most_incorrect_dogs], 1 - probs[incorrect_dogs][most_incorrect_dogs])


5. The 'most-uncertain' labels (p ≈ 0.5)


In [38]:
most_uncertain = np.argsort(np.abs(probs - 0.5))
plots_idx(most_uncertain[:n_view], probs[most_uncertain])



In [ ]: