lesson1: Convolutional Neural Networks with dogscats

Let's classify images using deep learning and submit the result to Kaggle!

Prerequisite

This notebook assumes Keras with Theano backend.

  • TODO: make TensorFlow version as another notebook

It also assumes that you will run it on either one of these two cases:

  • Floydhub (--env theano:py2 -> Theano rel-0.8.2 + Keras 1.2.2 on Python2)
  • local conda virtual environment (Theano 0.9.0 + Keras 2.0.4 on Python3)

Refer to this FloydHub document for available FloydHub environments.

Setup

Make sure to have these files in the parent directory of the directory where you execute this notebook.

  • available in the official repo for Keras1 on Python2 (renamed from the original files)
    • utils_keras1.py
    • vgg16_keras1.py
    • vgg16bn_keras1.py
  • available in the unofficial repo for Keras2 on Python3
    • utils.py
    • vgg16.py
    • vgg16bn.py

The directory structure looks like this. Please modify the symlinks according to your environment.

  • (*) only for FloydHub
  • (**) only for local
floyd_requirements.txt (*)
floydhub.data.unzip/   (*)
floydhub.data.zipped/  (*)
    dogscats.zip
lesson1/
    data/ (**)
        redux/
            train/
                cat.437.jpg
                dog.9924.jpg
                ...
            test/
                231.jpg
                325.jpg
                ...
    dogscats_run.ipynb
    floyd_requirements.txt -> ../floyd_requirements.txt (*)
    utils.py -> ../utils(_keras1).py
    vgg16.py -> ../vgg16(_keras1).py
    vgg16bn.py -> ../vgg16bn(_keras1).py
utils.py
utils_keras1.py
vgg16.py
vgg16_keras1.py
vgg16bn.py
vgg16bn_keras1.py
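The symlinks above can be created like this. A sketch for the local (Keras2 on Python3) case, run from inside lesson1/; for FloydHub, point the links at the `_keras1` variants instead:

```shell
# local (Keras2 on Python3): link to the Keras2 files one directory up
ln -sf ../utils.py utils.py
ln -sf ../vgg16.py vgg16.py
ln -sf ../vgg16bn.py vgg16bn.py

# FloydHub (Keras1 on Python2) alternative:
#ln -sf ../utils_keras1.py utils.py
#ln -sf ../vgg16_keras1.py vgg16.py
#ln -sf ../vgg16bn_keras1.py vgg16bn.py
```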

Prepare data

The details of data preparation largely depend on which dataset you use. In this section, we will use a pre-organized dataset from http://files.fast.ai/files/dogscats.zip

For another example of data preparation, please refer to this notebook.

What the dataset looks like

After extracting the dogscats.zip file, the directory structure looks like this.

dogscats/
    models/
    sample/
        train/
            cats/
                cat.394.jpg
                ... (8 items)
            dogs/
                dog.1402.jpg
                ... (8 items)
        valid/
            cats/
                cat.10435.jpg
                ... (4 items)
            dogs/
                dog.10459.jpg
                ... (4 items)
            features.npy
            labels.npy
    test1/
        1.jpg
        10.jpg
        100.jpg
        ... (12500 items)
    train/
        cats/
            cat.0.jpg
            cat.1.jpg
            cat.3.jpg
            ... (11500 items)
        dogs/
            dog.0.jpg
            dog.1.jpg
            dog.2.jpg
            dog.4.jpg
            ... (11500 items)
    valid/
        cats/
            cat.2.jpg
            cat.5.jpg
            ... (1000 items, copied from the train/cats/ directory)
        dogs/
            dog.3.jpg
            dog.9.jpg
            ... (1000 items, copied from the train/dogs/ directory)
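Before training, it is worth sanity-checking the extracted tree against the counts above. A minimal sketch; the `count_files` helper is ours, not part of utils.py, and the commented-out usage assumes the FloydHub mount point:

```python
import os

def count_files(path, ext='.jpg'):
    """Count files with the given extension directly under path."""
    try:
        return sum(1 for f in os.listdir(path) if f.endswith(ext))
    except OSError:
        return 0  # missing directory counts as empty

# Hypothetical usage against the extracted dataset:
# root = '/input/dogscats'  # FloydHub mount; locally, data/redux etc.
# for sub in ('train/cats', 'train/dogs', 'valid/cats', 'valid/dogs'):
#     print(sub, count_files(os.path.join(root, sub)))
```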

FloydHub

The cell below shows how to upload data to FloydHub.

# from the directory where this notebook is executed
cd ../floydhub.data.zipped/; pwd

# expected: empty
ls -l

wget http://files.fast.ai/files/dogscats.zip

# upload the zipped dataset to FloydHub, creating a FloydHub dataset
floyd data init dogscats.zipped
floyd data upload

Now let's unzip the data we have just uploaded, directly on FloydHub.

# from the directory where this notebook is executed
cd ../floydhub.data.unzip/; pwd

# expected: empty
ls -l

floyd init dogscats.unzip
floyd run --gpu --data [data ID of the uploaded zip] "unzip /input/dogscats.zip -d /output"

Please note:

  • the data ID should be the one shown by the upload step above
  • the mounted data is available in /input/ directory, and you need to direct the unzipped files to /output/ directory

local

TODO

Run the notebook

Now let's run the notebook in the environment of your choice.

# from the directory where this notebook is executed
cd ./; pwd

# FloydHub
floyd init dogscats
floyd run --mode jupyter --data [data ID of unzipped data] --env theano:py2 --gpu

# alternatively, for local
#jupyter notebook

Then check ~/.keras/keras.json:

mkdir -p ~/.keras

# FloydHub (Keras1)
echo '{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}' > ~/.keras/keras.json

# alternatively, for local (Keras2)
#echo '{
#    "image_data_format": "channels_first",
#    "backend": "theano",
#    "floatx": "float32",
#    "epsilon": 1e-07
#}' > ~/.keras/keras.json
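To confirm the file was written correctly, you can parse it back. A quick check; note that the key names differ between Keras1 (`image_dim_ordering`) and Keras2 (`image_data_format`), as shown above:

```python
import json, os

def check_keras_config(text):
    """Return the backend declared in a keras.json string, validating the keys."""
    cfg = json.loads(text)
    assert cfg['backend'] == 'theano', 'expected the Theano backend'
    # Keras1 uses "image_dim_ordering"; Keras2 uses "image_data_format"
    assert 'image_dim_ordering' in cfg or 'image_data_format' in cfg
    return cfg['backend']

# Hypothetical usage, after writing the file:
# print(check_keras_config(open(os.path.expanduser('~/.keras/keras.json')).read()))
```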

Finally, let's start running the notebook.


In [ ]:
# make some Python3 functions available on Python2
from __future__ import division, print_function

import sys
print(sys.version_info)

import theano
print(theano.__version__)

import keras
print(keras.__version__)

In [ ]:
# FloydHub: check data
%ls /input/dogscats/

In [ ]:
# check current directory
%pwd
%ls

# check that supporting files are loaded fine
%cat floyd_requirements.txt

# check that no Keras2-specific function is used (when Keras1 is used)
%cat utils.py

In [ ]:
#Create references to important directories we will use over and over
import os, sys
current_dir = os.getcwd()

LESSON_HOME_DIR = current_dir

# FloydHub
DATA_HOME_DIR = "/input/dogscats/"
OUTPUT_HOME_DIR = "/output/"

# alternatively, for local
#DATA_HOME_DIR = current_dir+'/data/redux'

In [ ]:
#import modules
from utils import *
from vgg16 import Vgg16

#Instantiate plotting tool
#In Jupyter notebooks, you will need to run this command before doing any plotting
%matplotlib inline

Finetuning and Training


In [ ]:
%cd $DATA_HOME_DIR

#Set path to sample/ path if desired
path = DATA_HOME_DIR + '/' #'/sample/'
test_path = DATA_HOME_DIR + '/test1/' #We use all the test data

# FloydHub
# data needs to be output under /output
# if results_path cannot be created, execute mkdir directly in the terminal
results_path = OUTPUT_HOME_DIR + '/results/'
%mkdir $results_path

train_path = path + '/train/'
valid_path = path + '/valid/'

Use a pretrained VGG model with our Vgg16 class


In [ ]:
# As large as you can, but no larger than 64 is recommended.
#batch_size = 8
batch_size = 64

no_of_epochs=3

The original pre-trained Vgg16 class classifies images into one of 1000 categories. The number of categories is determined by the dataset Vgg16 was trained on. (http://image-net.org/challenges/LSVRC/2014/browse-synsets)

In order to classify images into the categories we prepare (2 categories, dogs/cats, in this notebook), the fine-tuning technique is useful. It:

  • keeps most of the weights from the pre-trained Vgg16 model, modifying only a small part of them
  • changes the dimension of the output layer (from 1000 to 2, in this notebook)
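The idea can be illustrated without Keras: treat the frozen pre-trained layers as a fixed feature extractor and train only a new output layer on top. A toy numpy sketch of this last-layer training (not the actual Vgg16.finetune code; the data here is synthetic):

```python
import numpy as np

rng = np.random.RandomState(0)

# Pretend the frozen pre-trained layers already turned each image into
# a 10-dimensional feature vector; class 1 features have a higher mean.
X = np.vstack([rng.randn(50, 10), rng.randn(50, 10) + 2.0])
y = np.array([0] * 50 + [1] * 50)

# Only the new output layer's weights are trained (logistic regression);
# the "frozen" layers that produced X are never touched.
w, b = np.zeros(10), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))  # sigmoid output
    grad = p - y                               # dLoss/dlogit for log loss
    w -= 0.1 * X.T.dot(grad) / len(y)
    b -= 0.1 * grad.mean()

acc = ((1.0 / (1.0 + np.exp(-(X.dot(w) + b))) > 0.5) == y).mean()
```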

In [ ]:
vgg = Vgg16()

In [ ]:
# Grab a few images at a time for training and validation.
batches = vgg.get_batches(train_path, batch_size=batch_size)
val_batches = vgg.get_batches(valid_path, batch_size=batch_size*2)

# Finetune: note that the vgg model is compiled inside the finetune method.
vgg.finetune(batches)

In [ ]:
# Fit: note that we are passing in the validation dataset to the fit() method
# For each epoch we test our model against the validation set
latest_weights_filename = None

# FloydHub (Keras1)
for epoch in range(no_of_epochs):
    print("Running epoch: %d" % epoch)
    vgg.fit(batches, val_batches, nb_epoch=1)
    latest_weights_filename = 'ft%d.h5' % epoch
    vgg.model.save_weights(results_path+latest_weights_filename)
print("Completed %s fit operations" % no_of_epochs)

# alternatively, for local (Keras2)
"""
for epoch in range(no_of_epochs):
    print("Running epoch: %d" % epoch)
    vgg.fit(batches, val_batches, batch_size, nb_epoch=1)
    latest_weights_filename = 'ft%d.h5' % epoch
    vgg.model.save_weights(results_path+latest_weights_filename)
print("Completed %s fit operations" % no_of_epochs)
"""

Generate Predictions


In [ ]:
# OUTPUT_HOME_DIR, not DATA_HOME_DIR due to FloydHub restriction
%cd $OUTPUT_HOME_DIR
%mkdir -p test1/unknown

%cd $OUTPUT_HOME_DIR/test1
%cp $test_path/*.jpg unknown/

# rewrite test_path
test_path = OUTPUT_HOME_DIR + '/test1/' #We use all the test data

In [ ]:
batches, preds = vgg.test(test_path, batch_size = batch_size*2)

In [ ]:
print(preds[:5])

filenames = batches.filenames
print(filenames[:5])

In [ ]:
# You can verify the column ordering by viewing some images
from PIL import Image
Image.open(test_path + filenames[2])

In [ ]:
#Save our test results arrays so we can use them again later
save_array(results_path + 'test_preds.dat', preds)
save_array(results_path + 'filenames.dat', filenames)

Validate Predictions

Calculate predictions on validation set, so we can find correct and incorrect examples:


In [ ]:
vgg.model.load_weights(results_path+latest_weights_filename)

In [ ]:
val_batches, probs = vgg.test(valid_path, batch_size = batch_size)

In [ ]:
filenames = val_batches.filenames
expected_labels = val_batches.classes #0 or 1

In [ ]:
#Round our predictions to 0/1 to generate labels
our_predictions = probs[:,0]
our_labels = np.round(1-our_predictions)
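Since the first column of probs is the cat probability (assuming the usual alphabetical class ordering, cats=0 and dogs=1), 1 - probs[:,0] is the dog probability and rounding it yields the 0/1 label. On a toy array:

```python
import numpy as np

probs = np.array([[0.90, 0.10],    # confident cat
                  [0.20, 0.80],    # confident dog
                  [0.45, 0.55]])   # borderline dog

our_predictions = probs[:, 0]                # cat probability
our_labels = np.round(1 - our_predictions)   # 0 = cat, 1 = dog
```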

(TODO) look at data to improve model

confusion matrix


In [ ]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(expected_labels, our_labels)

In [ ]:
plot_confusion_matrix(cm, val_batches.class_indices)

Submit Predictions to Kaggle!

This section also depends on which dataset you use (and which Kaggle competition you are participating in).


In [ ]:
#Load our test predictions from file
preds = load_array(results_path + 'test_preds.dat')
filenames = load_array(results_path + 'filenames.dat')

In [ ]:
#Grab the dog prediction column
isdog = preds[:,1]
print("Raw Predictions: " + str(isdog[:5]))
print("Mid Predictions: " + str(isdog[(isdog < .6) & (isdog > .4)]))
print("Edge Predictions: " + str(isdog[(isdog == 1) | (isdog == 0)]))

In [ ]:
# sneaky trick to round down our edge predictions
# Swap all ones with .95 and all zeros with .05
isdog = isdog.clip(min=0.05, max=0.95)
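The reason for the clipping: this competition is scored with log loss, and a single confidently wrong prediction at exactly 0 or 1 incurs an essentially unbounded penalty, while clipping caps the damage at a modest cost to correct answers. A quick check with hypothetical values (the small `eps` stands in for Kaggle's internal clipping of exact 0/1):

```python
import math

def log_loss_one(y_true, p):
    """Per-image log loss; p is the predicted probability of 'dog'."""
    eps = 1e-15  # avoid log(0) blowing up to infinity
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_wrong = log_loss_one(0, 1.0)   # predicted dog, was actually cat
clipped_wrong = log_loss_one(0, 0.95)    # same mistake, clipped: much cheaper
clipped_right = log_loss_one(1, 0.95)    # correct but hedged: tiny cost
```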

In [ ]:
#Extract imageIds from the filenames in our test/unknown directory 
filenames = batches.filenames
ids = np.array([int(f[8:f.find('.')]) for f in filenames])
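The slice f[8:f.find('.')] relies on every filename starting with the 8-character prefix 'unknown/' (the subdirectory created above). For example:

```python
# hypothetical filename as produced by the test1/unknown/ layout above
f = 'unknown/231.jpg'
image_id = int(f[8:f.find('.')])   # len('unknown/') == 8; stop at the first '.'
```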

In [ ]:
subm = np.stack([ids,isdog], axis=1)
subm[:5]

In [ ]:
# FloydHub
%cd $OUTPUT_HOME_DIR

# alternatively, for local
#%cd $DATA_HOME_DIR

submission_file_name = 'submission1.csv'
np.savetxt(submission_file_name, subm, fmt='%d,%.5f', header='id,label', comments='')
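The fmt and header arguments produce exactly the two-column CSV Kaggle expects (comments='' stops numpy prefixing the header with '# '). A quick round trip on toy data, using a temp file rather than the real submission:

```python
import numpy as np
import os, tempfile

subm = np.array([[1, 0.95], [2, 0.05]])
path = os.path.join(tempfile.mkdtemp(), 'submission1.csv')
np.savetxt(path, subm, fmt='%d,%.5f', header='id,label', comments='')

lines = open(path).read().splitlines()
```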

In [ ]:
from IPython.display import FileLink

# FloydHub
%cd $OUTPUT_HOME_DIR
FileLink(submission_file_name)

# alternatively, for local
#%cd $LESSON_HOME_DIR
#FileLink('data/redux/'+submission_file_name)