In [1]:
import theano
In [2]:
import os, sys
sys.path.insert(1, os.path.join(os.getcwd(), 'utils'))
In [3]:
%matplotlib inline
from __future__ import print_function, division
# path = "data/sample/"
path = "data/statefarm/sample/"
import utils; reload(utils)
from utils import *
from IPython.display import FileLink
In [4]:
# batch_size = 64
batch_size = 32
In [5]:
%cd data/statefarm
%cd train
In [6]:
%mkdir ../sample
%mkdir ../sample/train
%mkdir ../sample/valid
In [7]:
for d in glob('c?'):
    os.mkdir('../sample/train/' + d)
    os.mkdir('../sample/valid/' + d)
In [5]:
from shutil import copyfile
In [23]:
g = glob('c?/*.jpg')
shuf = np.random.permutation(g)
for i in range(1500): copyfile(shuf[i], '../sample/train/' + shuf[i])
In [20]:
# removing copied sample training images
# help(os)
# for f in glob('c?/*.jpg'):
#     os.remove(f)
In [23]:
%cd ../../..
%mkdir data/statefarm/results
%mkdir data/statefarm/sample/test
How I'll do it: create a full validation set in the full valid/ folder, then copy over the same percentage as train to the sample/valid folder.
Actually: wouldn't it be better to use the full validation set for more accurate results? Then again, for processing on my MacBook, the first method may be good enough.
In [10]:
# run once, make sure you're in datadir first
# path = os.getcwd()
# os.mkdir(path + '/valid')
# for i in xrange(10): os.mkdir(path + '/valid' + '/c' + str(i))
def reset_valid(verbose=1, valid_path='', TRAIN_DIR=''):
    """Moves all images in validation set back to
    their respective classes in the training set."""
    counter = 0
    if not valid_path: valid_path = os.getcwd() + '/valid/'
    if not TRAIN_DIR: TRAIN_DIR = os.getcwd() + '/train'
    %cd $valid_path
    for i in xrange(10):
        %cd c"$i"
        g = glob('*.jpg')
        for n in xrange(len(g)):
            os.rename(g[n], TRAIN_DIR + '/c' + str(i) + '/' + g[n])
            counter += 1
        %cd ..
    if verbose: print("Moved {} files.".format(counter))
# %mv $VALID_DIR/c"$i"/$*.jpg $TRAIN_DIR/c"$i"/$*.jpg
# modified from: http://forums.fast.ai/t/statefarm-kaggle-comp/183/20
def set_valid(number=1, verbose=1, data_path=''):
    """Moves <number> of subjects from training to validation
    directories. Verbosity: 0: silent; 1: print no. of files moved;
    2: print each move operation."""
    if not data_path: data_path = os.getcwd() + '/'
    counter = 0
    if number < 0: number = 0
    for n in xrange(number):
        # read CSV file into a Pandas DataFrame
        dil = pd.read_csv(data_path + 'driver_imgs_list.csv')
        # group the frame by the subject in the image
        grouped_subjects = dil.groupby('subject')
        # pick one subject at random (NB: the same subject could be drawn in more
        # than one iteration, in which case os.rename will fail on files already moved)
        subject = grouped_subjects.groups.keys()[np.random.randint(0, high=len(grouped_subjects.groups))]  # <-- groups?
        # get the group associated with that subject
        group = grouped_subjects.get_group(subject)
        # loop over the group & move its images to the validation dir
        for (subject, clssnm, img) in group.values:
            source = '{}train/{}/{}'.format(data_path, clssnm, img)
            target = source.replace('train', 'valid')
            if verbose > 1: print('mv {} {}'.format(source, target))
            os.rename(source, target)
            counter += 1
    if verbose: print("Files moved: {}".format(counter))
In [11]:
%pwd
Out[11]:
In [6]:
# %cd ~/Deshar/Kaukasos/FAI
%cd ~/Kaukasos/FAI
In [13]:
%cd data/statefarm/
reset_valid()
%cd ..
set_valid(number=3)
I understand now why I was getting weird validation-accuracy results: I was separating a unique validation set from training in the full data directory, not in the sample directory. But then why was my model even able to train if there wasn't anything in the sample validation folders? Because I was only copying 1000 random images from sample/train to sample/valid. Ooof..
Never mind, ignore (some of) that: the 1000 sample validation images are taken from the valid set that was moved out of training in the full directory. The real problem affecting accuracy is that the validation set is separated from training after the sample training set is copied, so some of the validation images will have drivers who also appear in the sample training set. This explains why accuracy was off, though not as far off as one would expect. I'll reconfigure this.
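For reference, the reconfigured ordering would look roughly like this (a sketch using the reset_valid/set_valid helpers defined above; it assumes we start from the FAI directory and that the sample class folders already exist and have been emptied of the earlier copies):

# split by driver first, in the full data directory...
%cd data/statefarm
reset_valid()        # move any previously held-out images back into train/
%cd ..
set_valid(number=3)  # hold out 3 drivers as the full validation set

# ...then build the sample from the already-separated sets, so that no driver
# appears in both sample/train and sample/valid
%cd train
g = glob('c?/*.jpg')
shuf = np.random.permutation(g)
for i in range(1500): copyfile(shuf[i], '../sample/train/' + shuf[i])
%cd ../valid
g = glob('c?/*.jpg')
shuf = np.random.permutation(g)
for i in range(1000): copyfile(shuf[i], '../sample/valid/' + shuf[i])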
This notebook is being rerun on my Asus Linux machine. Upgrading from an Intel Core i5 CPU to an NVidia GTX 870M GPU should yield a good speedup.
CPU times:
In [26]:
%pwd
Out[26]:
In [27]:
%cd valid
# g = glob('valid/c?/*.jpg') # <-- this doesn't work because we've already cd'd into valid/
g = glob('c?/*.jpg')
shuf = np.random.permutation(g)
# for i in range(1000): copyfile(shuf[i], '/sample/' + shuf[i])
for i in range(1000): copyfile(shuf[i], '../sample/valid/' + shuf[i])
In [7]:
batches = get_batches(path + 'train', batch_size=batch_size)
val_batches = get_batches(path + 'valid', batch_size=batch_size*2, shuffle=False)
In [35]:
%pwd
os.mkdir(path + 'test')
In [8]:
(val_classes, trn_classes, val_labels, trn_labels, val_filenames, filenames,
test_filename) = get_classes(path)
In [39]:
model = Sequential([
        BatchNormalization(axis=1, input_shape=(3, 224, 224)),
        Flatten(),
        Dense(10, activation='softmax')
    ])
As you can see below, this training is going nowhere...
In [40]:
model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
Out[40]:
Let's first check the number of parameters to see that there are enough to find some useful relationships:
In [41]:
model.summary()
In [42]:
10*3*224*224
Out[42]:
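For scale, that product is the weight count of the Dense layer alone: 10 * 3 * 224 * 224 = 1,505,280 weights (plus 10 biases and the BatchNormalization parameters), so a shortage of parameters is clearly not the problem.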
Since we have a simple model with no regularization and plenty of parameters, it seems most likely that our learning rate is too high. Perhaps it is jumping to a solution where it predicts one or two classes with high confidence, so that it can give a zero prediction to as many classes as possible - that's the best approach for a model that is no better than random, and that is likely where we would end up with a high learning rate. So let's check:
In [43]:
np.round(model.predict_generator(batches, batches.n)[:10],2)
Out[43]:
In [16]:
# temp = model.predict_generator(batches, batches.n)
(Not quite so in this case, only partially, but back on the Mac it did indeed predict mostly class 1 or 6.)
Our hypothesis was correct. It's nearly always predicting class 1 or 6, with very high confidence. So let's try a lower learning rate:
In [44]:
# here's a way to take a look at the learning rate
import keras.backend as K
LR = K.eval(model.optimizer.lr)
print(LR)
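As an aside, K.eval reads the learning rate; the in-place counterpart in Keras 1 is K.set_value, which updates the shared variable that the compiled training function reads. A sketch only (the cells below simply reassign the attribute instead):

import keras.backend as K
K.set_value(model.optimizer.lr, 1e-5)  # change lr in place on the existing optimizer
print(K.eval(model.optimizer.lr))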
In [45]:
model = Sequential([
        BatchNormalization(axis=1, input_shape=(3,224,224)),
        Flatten(),
        Dense(10, activation='softmax')
    ])
model.compile(Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
Out[45]:
Great - we found our way out of that hole ... Now we can increase the learning rate and see where we can get to.
In [46]:
model.optimizer.lr=0.001
In [47]:
model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
Out[47]:
We're stabilizing at a validation accuracy of 0.39 (~0.35 in my notebook). Not great, but a lot better than random. Before moving on, let's check that our validation set on the sample is large enough that it gives consistent results:
In [48]:
rnd_batches = get_batches(path+'valid', batch_size=batch_size*2, shuffle=True)
In [49]:
val_res = [model.evaluate_generator(rnd_batches, rnd_batches.nb_sample) for i in range(10)]
In [50]:
np.round(val_res,2)
Out[50]:
Yup, pretty consistent - if we see improvements of 3% or more, it's probably not random, based on the above samples.
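To put a rough number on "pretty consistent" (assuming, as with metrics=['accuracy'] here, that evaluate_generator returns a [loss, accuracy] pair per run):

val_accs = np.array(val_res)[:, 1]   # accuracy column from the 10 evaluation runs
print(np.round(val_accs.mean(), 3), np.round(val_accs.std(), 3))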
The previous model is over-fitting a lot, but we can't use dropout since we only have one layer. We can try to decrease overfitting in our model by adding L2 regularization (i.e. adding the sum of squares of the weights to our loss function).
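Concretely, with the 0.01 weight used in the next cell, l2(0.01) adds 0.01 * sum(W**2) over the Dense layer's weight matrix to the cross-entropy loss, penalizing large weights and so discouraging the model from fitting noise in the small sample: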
In [51]:
model = Sequential([
        BatchNormalization(axis=1, input_shape=(3,224,224)),
        Flatten(),
        Dense(10, activation='softmax', W_regularizer=l2(0.01))
    ])
model.compile(Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
Out[51]:
In [52]:
model.optimizer.lr = 0.001
model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
Out[52]:
Looks like we can get a bit over 50% accuracy this way (almost, here: 42.8%). This will be a good benchmark for our future models - if we can't beat 50%, then we're not even beating a linear model trained on a sample, so we'll know that's not a good approach.
The next simplest model is to add a single hidden layer.
In [53]:
model = Sequential([
        BatchNormalization(axis=1, input_shape=(3, 224, 224)),
        Flatten(),
        Dense(100, activation='relu'),  # ¿would l2 regularization be good here?
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])
model.compile(Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
model.optimizer.lr = 0.01
model.fit_generator(batches, batches.nb_sample, nb_epoch=5, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
Out[53]:
(Odd, I may not have had a good validation set if I was getting noticeably higher validation-accuracy numbers... not an issue anymore now that I'm using a proper validation set. Of course, just as with JH's notebook, validation accuracy has decreased a bit.)
Not looking very encouraging... which isn't surprising since we know that CNNs are a much better choice for computer vision problems. So we'll try one.
2 conv layers with max pooling followed by a simple dense network is a good simple CNN to start with:
In [24]:
def conv1(batches):
    model = Sequential([
            BatchNormalization(axis=1, input_shape=(3,224,224)),
            Convolution2D(32, 3, 3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Convolution2D(64, 3, 3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Flatten(),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dense(10, activation='softmax')
        ])
    model.compile(Adam(1e-3), loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches,
                        nb_val_samples=val_batches.nb_sample)
    model.optimizer.lr = 0.001
    model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches,
                        nb_val_samples=val_batches.nb_sample)
    return model
The GPU was running out of memory here (2692/3017 MiB), so I'm restarting with a smaller batch size (32?).
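After the restart, the generators need to be rebuilt with the smaller batch size before calling conv1 again; a quick sketch mirroring the earlier cells:

batch_size = 32
batches = get_batches(path + 'train', batch_size=batch_size)
val_batches = get_batches(path + 'valid', batch_size=batch_size*2, shuffle=False)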
In [10]:
conv1(batches)
The training set here is very rapidly reaching a very high accuracy. So if we could regularize this, perhaps we could get reasonable results.
So, what kind of regularization should we try first? As we discussed in lesson 3, we should start with data augmentation.
To find the best data augmentation parameters, we can try each type of data augmentation, one at a time. For each type, we can try four very different levels of augmentation, and see which is the best. In the steps below we've only kept the single best results we found. We're using the CNN we defined above, since we have already observed it can model the data quickly and accurately.
Width shift: move the image left and right -
In [11]:
gen_t = image.ImageDataGenerator(width_shift_range=0.1)
batches = get_batches(path + 'train', gen_t, batch_size=batch_size)
In [12]:
model = conv1(batches)
Height shift: move the image up and down -
In [13]:
gen_t = image.ImageDataGenerator(height_shift_range=0.05)
batches = get_batches(path + 'train', gen_t, batch_size=batch_size)
In [14]:
model = conv1(batches)
Random shear angles (max in radians) -
In [15]:
gen_t = image.ImageDataGenerator(shear_range=0.1)
batches = get_batches(path + 'train', gen_t, batch_size=batch_size)
In [16]:
model = conv1(batches)
Rotation: max in degrees -
In [17]:
gen_t = image.ImageDataGenerator(rotation_range=15)
batches = get_batches(path + 'train', gen_t, batch_size=batch_size)
In [18]:
model = conv1(batches)
Channel shift: randomly changing the R,B,G colors -
In [19]:
gen_t = image.ImageDataGenerator(channel_shift_range=20)
batches = get_batches(path + 'train', gen_t, batch_size=batch_size)
In [20]:
model = conv1(batches)
And finally, putting it all together!
In [21]:
gen_t = image.ImageDataGenerator(rotation_range=15, height_shift_range=0.05,
                                 shear_range=0.1, channel_shift_range=20, width_shift_range=0.1)
batches = get_batches(path + 'train', gen_t, batch_size=batch_size)
In [25]:
model = conv1(batches)
At first glance, this isn't looking encouraging, since the validation accuracy is poor and getting worse. But the training accuracy is improving and still has a long way to go, so we should try annealing our learning rate and running more epochs before we make a decision.
In [26]:
model.optimizer.lr = 0.0001
model.fit_generator(batches, batches.nb_sample, nb_epoch=5, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
Out[26]:
Lucky we tried that - we're starting to make progress! Let's keep going.
In [27]:
model.fit_generator(batches, batches.nb_sample, nb_epoch=25, validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
Out[27]:
Amazingly, using nothing but a small sample, a simple (not pre-trained) model with no dropout, and data augmentation, we're getting results that would get us into the top 50% of the competition! This looks like a great foundation for our further experiments.
To go further, we'll need to use the whole dataset, since dropout and data volumes are very related, so we can't tweak dropout without using all the data.
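When we do move to the full dataset, one natural place for dropout is the dense part of the conv1 architecture; this is only a sketch with a placeholder rate to tune, not something run here:

model = Sequential([
        BatchNormalization(axis=1, input_shape=(3,224,224)),
        Convolution2D(32, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        MaxPooling2D((3,3)),
        Convolution2D(64, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        MaxPooling2D((3,3)),
        Flatten(),
        Dense(200, activation='relu'),
        BatchNormalization(),
        Dropout(0.5),   # placeholder rate; tune against the full training set
        Dense(10, activation='softmax')
    ])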
(I can confirm: my best first-attempt score was a loss of 1.51002, with an indicated val/trn loss of 1.1896/1.0269 and a val accuracy of 61.9%. Here it's obviously severely overfitting on the training data, but the indicated val loss/accuracy is 1.0082/76.8%. Looks good.)
Btw, a loss of ~1.51 gives a ranking of 658/1440: top 45.69%. So this would be much better.