Let's classify images using deep learning and submit the result to Kaggle!
This notebook assumes Keras with Theano backend.
It also assumes that you will run it in either of these two cases:
Refer to this FloydHub document for available FloydHub environments.
Make sure to have these files in the parent directory of the directory where you execute this notebook.
The directory structure looks like this. Please modify the symlinks according to your environment.
floyd_requirements.txt (*)
floydhub.data.unzip/ (*)
floydhub.data.zipped/ (*)
dogscats.zip
lesson1/
data/ (**)
redux/
train/
cat.437.jpg
dog.9924.jpg
...
test/
231.jpg
325.jpg
...
dogscats_run.ipynb
floyd_requirements.txt -> ../floyd_requirements.txt (*)
utils.py -> ../utils(_keras1).py
vgg16.py -> ../vgg16(_keras1).py
vgg16bn.py -> ../vgg16bn(_keras1).py
utils.py
utils_keras1.py
vgg16.py
vgg16_keras1.py
vgg16bn.py
vgg16bn_keras1.py
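If you prefer, the symlinks in the listing can be created from Python. This is only a convenience sketch; the filenames mirror the listing above (Keras 1 variants shown), so adjust them to your environment:

```python
# Sketch: create the symlinks shown in the listing above.
# The target names assume the Keras 1 variants; swap in the plain
# utils.py/vgg16.py names for Keras 2.
import os

links = {
    "floyd_requirements.txt": "../floyd_requirements.txt",
    "utils.py": "../utils_keras1.py",
    "vgg16.py": "../vgg16_keras1.py",
    "vgg16bn.py": "../vgg16bn_keras1.py",
}

def make_links(lesson_dir, links):
    for name, target in links.items():
        path = os.path.join(lesson_dir, name)
        if not os.path.islink(path):
            os.symlink(target, path)  # relative target, resolved from lesson_dir
```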
The details of data preparation largely depend on which dataset you use. In this section, we will use a pre-organized dataset from http://files.fast.ai/files/dogscats.zip
For another example of data preparation, please refer to this notebook
After extracting the dogscats.zip file, the directory structure looks like this.
dogscats/
models/
sample/
train/
cats/
cat.394.jpg
... (8 items)
dogs/
dog.1402.jpg
... (8 items)
valid/
cats/
cat.10435.jpg
... (4 items)
dogs/
dog.10459.jpg
... (4 items)
features.npy
labels.npy
test1/
1.jpg
10.jpg
100.jpg
... (12500 items)
train/
cats/
cat.0.jpg
cat.1.jpg
cat.3.jpg
... (11500 items)
dogs/
dog.0.jpg
dog.1.jpg
dog.2.jpg
dog.4.jpg
... (11500 items)
valid/
cats/
cat.2.jpg
cat.5.jpg
... (1000 items; these were moved out of the train/cats/ directory)
dogs/
dog.3.jpg
dog.9.jpg
... (1000 items; these were moved out of the train/dogs/ directory)
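A split like the one above can be produced with a short sketch. The class names and the 1,000-file count follow the listing, and the seed is arbitrary. Since the listing shows 11,500 files left in train/ and 1,000 in valid/ (cat.2.jpg appears only under valid/cats/), the sketch moves files rather than copying them:

```python
# Sketch: carve a validation set out of train/, matching the layout above.
# n_valid=1000 per class mirrors the listing; adjust for your dataset.
import os, random, shutil

def make_valid_split(data_dir, classes=("cats", "dogs"), n_valid=1000, seed=42):
    rng = random.Random(seed)
    for cls in classes:
        src = os.path.join(data_dir, "train", cls)
        dst = os.path.join(data_dir, "valid", cls)
        if not os.path.isdir(dst):
            os.makedirs(dst)
        # move a random sample out of train/ so the two sets stay disjoint
        for name in rng.sample(os.listdir(src), n_valid):
            shutil.move(os.path.join(src, name), os.path.join(dst, name))
```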
The cell below shows how to upload data to FloydHub.
# from the directory which this notebook is executed
cd ../floydhub.data.zipped/; pwd
# expected: empty
ls -l
wget http://files.fast.ai/files/dogscats.zip
# upload the zipped dataset to FloydHub, creating a FloydHub dataset
floyd data init dogscats.zipped
floyd data upload
Using the data we have just uploaded to FloydHub, let's unzip it on FloydHub.
# from the directory which this notebook is executed
cd ../floydhub.data.unzip/; pwd
# expected: empty
ls -l
floyd init dogscats.unzip
floyd run --gpu --data [data ID of the uploaded zip] "unzip /input/dogscats.zip -d /output"
Please note:
TODO
Now let's run the notebook in the environment of your choice.
# from the directory which this notebook is executed
cd ./; pwd
# FloydHub
floyd init dogscats
floyd run --mode jupyter --data [data ID of unzipped data] --env theano:py2 --gpu
# alternatively, for local
#jupyter notebook
Next, create ~/.keras/keras.json and check its contents.
mkdir -p ~/.keras
# FloydHub (Keras1)
echo '{
"image_dim_ordering": "th",
"epsilon": 1e-07,
"floatx": "float32",
"backend": "theano"
}' > ~/.keras/keras.json
# alternatively, for local (Keras2)
#echo '{
# "image_data_format": "channels_first",
# "backend": "theano",
# "floatx": "float32",
# "epsilon": 1e-07
#}' > ~/.keras/keras.json
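The same configuration can be written from Python. The keys are exactly those shown above; the only difference between the two variants is the Keras 2 rename of "image_dim_ordering": "th" to "image_data_format": "channels_first":

```python
# Sketch: write the keras.json config from Python instead of echo.
import json, os

def write_keras_config(path, keras2=False):
    cfg = {"epsilon": 1e-07, "floatx": "float32", "backend": "theano"}
    if keras2:
        cfg["image_data_format"] = "channels_first"  # Keras 2 key name
    else:
        cfg["image_dim_ordering"] = "th"             # Keras 1 key name
    d = os.path.dirname(path)
    if d and not os.path.isdir(d):
        os.makedirs(d)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=4)
```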
Finally, let's start running the notebook.
In [ ]:
# make some Python3 functions available on Python2
from __future__ import division, print_function
import sys
print(sys.version_info)
import theano
print(theano.__version__)
import keras
print(keras.__version__)
In [ ]:
# FloydHub: check data
%ls /input/dogscats/
In [ ]:
# check current directory
%pwd
%ls
# check that the supporting files are present and readable
%cat floyd_requirements.txt
# check that no Keras2-specific function is used (when running Keras1)
%cat utils.py
In [ ]:
#Create references to important directories we will use over and over
import os, sys
current_dir = os.getcwd()
LESSON_HOME_DIR = current_dir
# FloydHub
DATA_HOME_DIR = "/input/dogscats/"
OUTPUT_HOME_DIR = "/output/"
# alternatively, for local
#DATA_HOME_DIR = current_dir+'/data/redux'
In [ ]:
#import modules
from utils import *
from vgg16 import Vgg16
#Instantiate plotting tool
#In Jupyter notebooks, you will need to run this command before doing any plotting
%matplotlib inline
In [ ]:
%cd $DATA_HOME_DIR
#Set path to sample/ path if desired
path = DATA_HOME_DIR + '/' #'/sample/'
test_path = DATA_HOME_DIR + '/test1/' #We use all the test data
# FloydHub
# data needs to be output under /output
# if results_path cannot be created, execute mkdir directly in the terminal
results_path = OUTPUT_HOME_DIR + '/results/'
%mkdir $results_path
train_path = path + '/train/'
valid_path = path + '/valid/'
In [ ]:
# Set the batch size as large as your GPU memory allows; no larger than 64 is recommended.
#batch_size = 8
batch_size = 64
no_of_epochs=3
The original pre-trained Vgg16 class classifies images into one of 1000 categories; that number comes from the dataset Vgg16 was trained on (http://image-net.org/challenges/LSVRC/2014/browse-synsets).
In order to classify images into the categories we prepare (two categories, dogs/cats, in this notebook), the fine-tuning technique is useful. It:
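As a rough illustration of the idea (not the Vgg16 class's actual implementation), fine-tuning keeps the pretrained layers fixed and trains only a newly attached last layer on our categories. The sketch below does this with a plain softmax layer trained by gradient descent on frozen feature vectors; all names and numbers here are illustrative:

```python
# Conceptual sketch of fine-tuning: the pretrained network's layers are
# frozen and act as a fixed feature extractor; only a new softmax "head"
# mapping features to our 2 categories is trained. vgg.finetune() does
# the equivalent inside Keras.
import numpy as np

def train_new_head(features, labels, n_classes=2, lr=0.5, epochs=200):
    # features: (n_samples, n_features) activations from the frozen layers
    W = np.zeros((features.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features.dot(W) + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)           # softmax probabilities
        grad = (p - onehot) / len(features)          # cross-entropy gradient
        W -= lr * features.T.dot(grad)               # only the head is updated
        b -= lr * grad.sum(axis=0)
    return W, b
```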
In [ ]:
vgg = Vgg16()
In [ ]:
# Grab a few images at a time for training and validation.
batches = vgg.get_batches(train_path, batch_size=batch_size)
val_batches = vgg.get_batches(valid_path, batch_size=batch_size*2)
# Finetune: note that the vgg model is compiled inside the finetune method.
vgg.finetune(batches)
In [ ]:
# Fit: note that we are passing in the validation dataset to the fit() method
# For each epoch we test our model against the validation set
latest_weights_filename = None
# FloydHub (Keras1)
for epoch in range(no_of_epochs):
print("Running epoch: %d" % epoch)
vgg.fit(batches, val_batches, nb_epoch=1)
latest_weights_filename = 'ft%d.h5' % epoch
vgg.model.save_weights(results_path+latest_weights_filename)
print("Completed %s fit operations" % no_of_epochs)
# alternatively, for local (Keras2)
"""
for epoch in range(no_of_epochs):
print("Running epoch: %d" % epoch)
vgg.fit(batches, val_batches, batch_size, nb_epoch=1)
latest_weights_filename = 'ft%d.h5' % epoch
vgg.model.save_weights(results_path+latest_weights_filename)
print("Completed %s fit operations" % no_of_epochs)
"""
In [ ]:
# OUTPUT_HOME_DIR, not DATA_HOME_DIR due to FloydHub restriction
%cd $OUTPUT_HOME_DIR
%mkdir -p test1/unknown
%cd $OUTPUT_HOME_DIR/test1
%cp $test_path/*.jpg unknown/
# rewrite test_path
test_path = OUTPUT_HOME_DIR + '/test1/' #We use all the test data
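The reason for the unknown/ subdirectory: Keras' directory iterators expect images grouped into class subdirectories, so the unlabeled test images are placed under a single dummy class. A minimal sketch of the same staging (function name and paths are illustrative):

```python
# Sketch: stage unlabeled test images under a dummy "unknown" class
# directory so a Keras-style directory iterator can read them.
import os, shutil

def stage_test_dir(src_dir, out_dir):
    unknown = os.path.join(out_dir, "test1", "unknown")
    os.makedirs(unknown)
    for name in os.listdir(src_dir):
        if name.endswith(".jpg"):
            shutil.copy(os.path.join(src_dir, name), unknown)
    return os.path.join(out_dir, "test1") + "/"  # the new test_path
```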
In [ ]:
batches, preds = vgg.test(test_path, batch_size = batch_size*2)
In [ ]:
print(preds[:5])
filenames = batches.filenames
print(filenames[:5])
In [ ]:
# You can verify the column ordering by viewing some images
from PIL import Image
Image.open(test_path + filenames[2])
In [ ]:
#Save our test results arrays so we can use them again later
save_array(results_path + 'test_preds.dat', preds)
save_array(results_path + 'filenames.dat', filenames)
Calculate predictions on validation set, so we can find correct and incorrect examples:
In [ ]:
vgg.model.load_weights(results_path+latest_weights_filename)
In [ ]:
val_batches, probs = vgg.test(valid_path, batch_size = batch_size)
In [ ]:
filenames = val_batches.filenames
expected_labels = val_batches.classes #0 or 1
In [ ]:
#Round our predictions to 0/1 to generate labels
our_predictions = probs[:,0]
our_labels = np.round(1-our_predictions)
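A worked example of that rounding, assuming the alphabetical class indices Keras assigns (cats = 0, dogs = 1): column 0 holds the cat probability, so rounding 1 - p gives the 0/1 class label:

```python
# Worked example: probs[:, 0] is the predicted probability of class 0
# (cats, given alphabetical class indices), so 1 - p rounds to the label.
import numpy as np

probs = np.array([[0.9, 0.1],     # confident cat  -> label 0
                  [0.2, 0.8],     # confident dog  -> label 1
                  [0.45, 0.55]])  # borderline dog -> label 1
our_predictions = probs[:, 0]
our_labels = np.round(1 - our_predictions)
```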
(TODO) look at data to improve model
confusion matrix
In [ ]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(expected_labels, our_labels)
In [ ]:
plot_confusion_matrix(cm, val_batches.class_indices)
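The headline numbers can be read straight off the 2x2 matrix; in scikit-learn's convention, rows are true classes and columns are predictions. A small helper (hypothetical, not part of utils.py):

```python
# Sketch: summary statistics recoverable from a 2x2 confusion matrix.
# sklearn convention: cm[i, j] = count of true class i predicted as j.
import numpy as np

def cm_stats(cm):
    tn, fp, fn, tp = cm.ravel()
    total = cm.sum()
    return {"accuracy": (tp + tn) / float(total),
            "errors": int(fp + fn)}
```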
This section also depends on which dataset you use (and which Kaggle competition you are participating in).
In [ ]:
#Load our test predictions from file
preds = load_array(results_path + 'test_preds.dat')
filenames = load_array(results_path + 'filenames.dat')
In [ ]:
#Grab the dog prediction column
isdog = preds[:,1]
print("Raw Predictions: " + str(isdog[:5]))
print("Mid Predictions: " + str(isdog[(isdog < .6) & (isdog > .4)]))
print("Edge Predictions: " + str(isdog[(isdog == 1) | (isdog == 0)]))
In [ ]:
# sneaky trick to round down our edge predictions
# Swap all ones with .95 and all zeros with .05
isdog = isdog.clip(min=0.05, max=0.95)
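The trick pays off because the competition is scored with log loss, which gives an unbounded penalty to a confidently wrong prediction of exactly 0 or 1. A small sketch of the effect (log_loss_one is a helper defined here, not a library function):

```python
# Why clip: log loss punishes a confidently wrong prediction without
# bound, so capping predictions at [0.05, 0.95] trades a tiny penalty on
# correct answers for a bounded loss on wrong ones.
import numpy as np

def log_loss_one(y_true, p, eps=1e-15):
    """Log loss of a single prediction p for true label y_true (0 or 1)."""
    p = min(max(p, eps), 1 - eps)  # guard against log(0)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# a wrong answer at full confidence vs the clipped 0.95
harsh = log_loss_one(0, 1.0)   # huge penalty (bounded only by the eps guard)
mild = log_loss_one(0, 0.95)   # about 3.0
```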
In [ ]:
#Extract imageIds from the filenames in our test/unknown directory
filenames = batches.filenames
ids = np.array([int(f[8:f.find('.')]) for f in filenames])
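The slice works because the generator prefixes each filename with its class directory: len('unknown/') is 8, so characters 8 up to the first '.' are the numeric id:

```python
# Worked example of the slice: filenames look like 'unknown/1.jpg', so
# skipping the 8-character 'unknown/' prefix and stopping at the first
# '.' leaves the numeric image id.
import numpy as np

filenames = ['unknown/1.jpg', 'unknown/10.jpg', 'unknown/100.jpg']
ids = np.array([int(f[8:f.find('.')]) for f in filenames])
```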
In [ ]:
subm = np.stack([ids,isdog], axis=1)
subm[:5]
In [ ]:
# FloydHub
%cd $OUTPUT_HOME_DIR
# alternatively, for local
#%cd $DATA_HOME_DIR
submission_file_name = 'submission1.csv'
np.savetxt(submission_file_name, subm, fmt='%d,%.5f', header='id,label', comments='')
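Before uploading, it can be worth sanity-checking the file's format. check_submission below is a hypothetical helper that mirrors the 'id,label' header and '%d,%.5f' format used above:

```python
# Sketch: sanity-check the submission CSV before uploading to Kaggle.
import numpy as np

def check_submission(path, n_expected):
    with open(path) as f:
        header = f.readline().strip()
        rows = [line.strip().split(',') for line in f if line.strip()]
    assert header == 'id,label'
    assert len(rows) == n_expected
    for rid, label in rows:
        int(rid)                               # id must parse as an integer
        assert 0.0 <= float(label) <= 1.0      # label is a probability
```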
In [ ]:
from IPython.display import FileLink
# FloydHub
%cd $OUTPUT_HOME_DIR
FileLink(submission_file_name)
# alternatively, for local
#%cd $LESSON_HOME_DIR
#FileLink('data/redux/'+submission_file_name)