Statefarm - Whole Dataset

from theano.sandbox import cuda

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)

%matplotlib inline
from __future__ import print_function, division
from importlib import reload
import utils; reload(utils)
from utils import *
from IPython.display import FileLink

Using Theano backend.

path = LESSON_HOME_DIR+'data/state/'
batch_size = 64  # assumed value; the cell that defines batch_size is not shown here

Setup batches

batches = get_batches(path+'train', batch_size=batch_size)
val_batches = get_batches(path+'valid', batch_size=batch_size*2, shuffle=False)

Found 20424 images belonging to 10 classes.
Found 2000 images belonging to 10 classes.

(val_classes, trn_classes, val_labels, trn_labels, 
    val_filenames, filenames, test_filenames) = get_classes(path)

Found 79726 images belonging to 1 classes.

Rather than using batches, we could just import all the data into an array to save some processing time. (In most examples I'm using the batches, however - just because that's how I happened to start out.)

trn = get_data(path+'train')
val = get_data(path+'valid')

save_array(path+'results/val.dat', val)
save_array(path+'results/trn.dat', trn)

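def get_data(path, target_size=(224,224)):
    # Read every image in directory order, one per batch, and stack them into one array
    batches = get_batches(path, shuffle=False, batch_size=1, class_mode=None, target_size=target_size)
    return np.concatenate([batches.next() for i in range(batches.nb_sample)])

def save_array(fname, arr):
    # Write the array to an on-disk bcolz carray, then flush so the data is persisted
    c=bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()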

val = load_array(path+'results/val.dat')
trn = load_array(path+'results/trn.dat')
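As a rough illustration of the "load everything into one array" idea, here is a minimal sketch that does not depend on Keras. `fake_batches` and `collect` are hypothetical names invented for this example; the generator merely stands in for an unshuffled image generator with a batch size of 1.

```python
import numpy as np

def fake_batches(n_images, shape=(3, 8, 8)):
    """Hypothetical stand-in for an unshuffled image generator with batch_size=1."""
    for i in range(n_images):
        # Each batch holds one "image" filled with its own index.
        yield np.full((1,) + shape, i, dtype=np.float32)

def collect(batch_iter):
    """Concatenate all batches along the first (sample) axis."""
    return np.concatenate(list(batch_iter))

data = collect(fake_batches(5))
print(data.shape)   # one row per image, in generator order
print(data.nbytes)  # total memory used by the stacked array
```

The same arithmetic applies to the real data: 20,424 images of shape (3, 224, 224) stored as float32 come to roughly 12.3 GB, so the stacked array either has to fit in memory or be kept on disk.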

Re-run sample experiments on full dataset

We should find that everything that worked on the sample (see statefarm-sample.ipynb) works on the full dataset too, only better, because now we have more data. So let's see how they go: the models in this section are exact copies of the sample notebook models.

Single conv layer

def conv1(batches):
    model = Sequential([
            BatchNormalization(axis=1, input_shape=(3,224,224)),
            Convolution2D(32,3,3, activation='relu'),
            MaxPooling2D((3,3)),  # assumed: pooling layers appear to be missing from this listing
            Convolution2D(64,3,3, activation='relu'),
            MaxPooling2D((3,3)),  # assumed, as above
            Flatten(),  # Dense layers need a flat (2D) input
            Dense(200, activation='relu'),
            Dense(10, activation='softmax')
        ])

    # model.compile(Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
    # model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
    #                nb_val_samples=val_batches.nb_sample)
    model.compile(Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    return model

model = conv1(batches)

Epoch 1/4
20424/20424 [==============================] - 287s - loss: 0.2771 - acc: 0.9316 - val_loss: 1.1387 - val_acc: 0.6385
Epoch 2/4
20424/20424 [==============================] - 281s - loss: 0.0262 - acc: 0.9957 - val_loss: 0.0374 - val_acc: 0.9930
Epoch 3/4
20424/20424 [==============================] - 280s - loss: 0.0068 - acc: 0.9989 - val_loss: 0.0278 - val_acc: 0.9940
Epoch 4/4
20424/20424 [==============================] - 279s - loss: 0.0064 - acc: 0.9990 - val_loss: 0.0441 - val_acc: 0.9890

Data augmentation

gen_t = image.ImageDataGenerator(rotation_range=15, height_shift_range=0.05, 
                shear_range=0.1, channel_shift_range=20, width_shift_range=0.1)
batches = get_batches(path+'train', gen_t, batch_size=batch_size)

model = conv1(batches)

Epoch 1/4
20424/20424 [==============================] - 337s - loss: 1.1472 - acc: 0.6281 - val_loss: 1.0366 - val_acc: 0.6240
Epoch 2/4
20424/20424 [==============================] - 292s - loss: 0.4675 - acc: 0.8528 - val_loss: 0.2315 - val_acc: 0.9300
Epoch 3/4
20424/20424 [==============================] - 291s - loss: 0.3025 - acc: 0.9072 - val_loss: 0.4220 - val_acc: 0.8610
Epoch 4/4
20424/20424 [==============================] - 290s - loss: 0.2379 - acc: 0.9235 - val_loss: 0.1794 - val_acc: 0.9365

Imagenet conv features

Since we have so little data, and it is similar to ImageNet images (full-color photos), using pre-trained VGG weights is likely to be helpful. In fact, it seems likely that we won't need to fine-tune the convolutional layer weights much, if at all, so we can pre-compute the output of the last convolutional layer, as we did in lesson 3 when we experimented with dropout. (However, this means that we can't use full data augmentation, since augmented inputs differ each time an image is drawn, and their features therefore can't be cached.)
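The pre-computation idea can be illustrated on a toy model. The sketch below is not VGG: `frozen_features` and `head` are hypothetical stand-ins for the frozen convolutional stage and the trainable dense layers, and all weights are random. It only shows that cached features give the same predictions as running the full stack, and that an augmented input invalidates the cache.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical frozen stage: fixed weights that are never trained.
W_frozen = rng.randn(12, 6)

def frozen_features(x):
    """Stand-in for the pre-trained conv layers (linear map + ReLU)."""
    return np.maximum(x.dot(W_frozen), 0.0)

def head(feats, W_head):
    """Stand-in for the trainable dense layers (linear map + softmax)."""
    logits = feats.dot(W_head)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = rng.randn(20, 12)
cached = frozen_features(x)          # computed once, reused on every epoch

W_head = rng.randn(6, 10) * 0.1
full = head(frozen_features(x), W_head)
fast = head(cached, W_head)
print(np.allclose(full, fast))       # same predictions, far less work per epoch

# An augmented copy of the input produces different features,
# so the cache cannot be reused for it.
x_aug = x + 0.1 * rng.randn(20, 12)
print(np.allclose(frozen_features(x_aug), cached))
```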

vgg = Vgg16()
model = vgg.model
last_conv_idx = [i for i,l in enumerate(model.layers) if type(l) is Convolution2D][-1]
conv_layers = model.layers[:last_conv_idx+1]

conv_model = Sequential(conv_layers)

# batches shuffle must be set to False when pre-computing features
batches = get_batches(path+'train', batch_size=batch_size, shuffle=False)

Found 20424 images belonging to 10 classes.

(val_classes, trn_classes, val_labels, trn_labels, 
    val_filenames, filenames, test_filenames) = get_classes(path)
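The `shuffle=False` comment matters because the labels are read in directory order, separately from the features. A minimal sketch of what goes wrong otherwise, using toy data in which each "image" encodes its own class; `features` and `recovered_class` are hypothetical helpers for this example, and the reversed order simply stands in for any shuffled order:

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2, 2])        # labels in directory order
images = labels.astype(np.float32) * 10.0    # toy "images" that encode their class

def features(x):
    """Hypothetical stand-in for running the conv model over a generator."""
    return x * 2.0

ordered = features(images)          # generator yields in directory order
shuffled = features(images[::-1])   # generator yields in some other order

def recovered_class(f):
    """Decode the class that each feature row actually belongs to."""
    return (f / 20.0).astype(int)

print(np.array_equal(recovered_class(ordered), labels))   # rows line up with labels
print(np.array_equal(recovered_class(shuffled), labels))  # rows no longer match labels
```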

conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
conv_val_feat = conv_model.predict_generator(val_batches, val_batches.nb_sample)

test_batches = get_batches(path+'test', batch_size=batch_size, shuffle=False)
conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)

Found 79726 images belonging to 1 classes.

# save_array(path+'results/conv_val_feat.dat', conv_val_feat)
save_array(path+'results/conv_test_feat.dat', conv_test_feat)
# save_array(path+'results/conv_feat.dat', conv_feat)

conv_feat = load_array(path+'results/conv_feat.dat')
conv_val_feat = load_array(path+'results/conv_val_feat.dat')

Batchnorm dense layers on pretrained conv layers

We'll find a good clipping amount using the validation set, prior to submitting.

def do_clip(arr, mx): return np.clip(arr, (1-mx)/9, mx)
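To see why clipping can help with a log-loss metric, here is a small self-contained sketch. The predictions are made up for illustration, and `log_loss` is a hypothetical helper that computes mean categorical cross-entropy with NumPy rather than through Keras.

```python
import numpy as np

def do_clip(arr, mx):
    # Same rule as in the notebook: floor at (1-mx)/9, cap at mx
    return np.clip(arr, (1 - mx) / 9, mx)

def log_loss(labels, preds):
    """Mean categorical cross-entropy over rows."""
    return float(-np.mean(np.sum(labels * np.log(preds), axis=1)))

# Two samples with true classes 0 and 1 (one-hot rows).
labels = np.eye(10)[[0, 1]]

# Both predictions put 0.991 on class 0: the first is confidently right,
# the second is confidently wrong.
preds = np.full((2, 10), 0.001)
preds[:, 0] = 0.991

raw = log_loss(labels, preds)
clipped = log_loss(labels, do_clip(preds, 0.96))
row_sum = do_clip(preds, 0.96)[0].sum()

print(raw, clipped)  # the confident mistake is penalised less after clipping
print(row_sum)       # a fully clipped row still sums to one
```

The floor of `(1-mx)/9` is what keeps a fully clipped 10-class row normalised: `mx + 9*(1-mx)/9 = 1`. Clipping costs a little on confident correct predictions and saves a lot on confident wrong ones, which is why the best `mx` is worth tuning on the validation set.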

val_preds = model.predict(val, batch_size=batch_size)

keras.metrics.categorical_crossentropy(val_labels, do_clip(val_preds, 0.96)).eval()


# test_batches = get_batches(path+'test', batch_size=batch_size, shuffle=False)
test = get_data(path+'test')
preds = model.predict(test, batch_size=batch_size*2)

Found 79726 images belonging to 1 classes.

subm = do_clip(preds,0.96)

subm_name = path+'results/subm.gz'

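# assumed: class names in label-index order, taken from the training generator
classes = sorted(batches.class_indices, key=batches.class_indices.get)
submission = pd.DataFrame(subm, columns=classes)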
submission.insert(0, 'img', [a[4:] for a in test_filenames])

submission.to_csv(subm_name, index=False, compression='gzip')