9 June 2017

Wayne Nixalo

This notebook started out as an attempt to generate convolutional test features with Sequential.predict_generator, using bcolz to save the generated features to disk in batches as they were created. That was successful after a few days of work (roughly 6–9 June).

The notebook then continues on to build a full set of submittable predictions. With that solved, I can build a strong model unconstrained by system-memory limits. Video memory is another matter, for now.
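The core pattern is: predict on one batch, append the result to an on-disk array, flush at the end. As a dependency-free sketch of the same idea (the notebook itself uses bcolz's `carray.append`; here a preallocated NumPy memmap plays that role, with dummy shapes rather than the real VGG feature shapes):

```python
import numpy as np
import tempfile, os

def save_features_in_batches(batches, n_samples, feat_shape, path):
    # Preallocate an on-disk array, then fill it one predicted batch
    # at a time, so RAM only ever holds a single batch of features.
    mm = np.lib.format.open_memmap(path, mode='w+', dtype='float32',
                                   shape=(n_samples,) + feat_shape)
    idx = 0
    for batch in batches:
        mm[idx:idx + len(batch)] = batch
        idx += len(batch)
    mm.flush()
    return idx

# dummy stand-ins for predicted conv features: 5 batches of 16 vectors
batches = [np.random.rand(16, 8).astype('float32') for _ in range(5)]
path = os.path.join(tempfile.mkdtemp(), 'feat.npy')
n = save_features_in_batches(batches, 80, (8,), path)
print(n)                                   # 80
print(np.load(path, mmap_mode='r').shape)  # (80, 8)
```

The reloaded memmap (like a reopened bcolz carray) can then be sliced batch-wise at prediction time without ever holding the full feature set in memory.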


In [1]:
import theano


/home/wnixalo/miniconda3/envs/FAI/lib/python2.7/site-packages/theano/gpuarray/dnn.py:135: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to version 5.1.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX 870M (0000:01:00.0)

In [2]:
import os, sys
sys.path.insert(1, os.path.join('utils'))

from __future__ import print_function, division

path = 'data/statefarm/'
import utils; reload(utils)
from utils import *


Using Theano backend.

In [3]:
batch_size=16
vgg = Vgg16()
model = vgg.model
last_conv_idx = [i for i, l in enumerate(model.layers) if type(l) is Convolution2D][-1]
conv_layers = model.layers[:last_conv_idx + 1]
conv_model = Sequential(conv_layers)

In [4]:
gen = image.ImageDataGenerator()
test_batches = get_batches(path + 'test', batch_size=batch_size, shuffle=False)


Found 79726 images belonging to 1 classes.

Manual iteration through the test images to generate convolutional test features. Each batch is saved to disk instead of being accumulated in memory.


In [5]:
# conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)

Note: make sure it's conv_test_feat (and not a leftover conv_feat from the training runs) that gets written and appended in the loop below.


In [7]:
fname = path + 'results/conv_test_feat.dat'
%rm -r $fname
for i in xrange(test_batches.n // batch_size + 1):
    conv_test_feat = conv_model.predict_on_batch(test_batches.next()[0])
    if not i:
        c = bcolz.carray(conv_test_feat, rootdir=fname, mode='a')
    else:
        c.append(conv_test_feat)
c.flush()


rm: cannot remove 'data/statefarm/results/conv_feat_test.dat': No such file or directory

Question: why does it look like I can have the entire conv_test_feat array open at once when it's opened with bcolz, but when it's explicitly loaded as a NumPy array via bcolz.open(fname)[:], RAM suddenly takes a severe hit?
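As I understand bcolz's design: `bcolz.open` only reads metadata and hands back a chunked, compressed, on-disk carray, decompressing chunks lazily as they're indexed, whereas `[:]` materializes the whole thing as one ordinary in-RAM NumPy array. The same lazy-versus-materialized distinction can be seen with a plain NumPy memmap:

```python
import numpy as np
import tempfile, os

path = os.path.join(tempfile.mkdtemp(), 'big.npy')
np.save(path, np.zeros((1000, 512), dtype='float32'))

lazy = np.load(path, mmap_mode='r')   # on-disk view: no bulk read yet
in_ram = np.array(lazy)               # forces a full copy into RAM
print(isinstance(lazy, np.memmap), type(in_ram) is np.ndarray)  # True True
```

So keeping the carray "open" costs almost nothing; it's the full slice that triggers the memory hit.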


In [8]:
# apparently you can just open a (massive) bcolz carray this way 
# without crashing memory... okay I'm learning things
# carr = bcolz.open(fname)


Out[8]:
79712

In [9]:
# forgot to add the '+1' so missed the last 14 images. Doing that here:
# NOTE: below code only adds on the missed batch
# iterate generator until final missed batch, then work:
fname = path + 'results/conv_test_feat.dat'
test_batches.reset()
iters = test_batches.n // batch_size
for i in xrange(iters): test_batches.next()
conv_test_feat = conv_model.predict_on_batch(test_batches.next()[0])
# c = bcolz.carray(conv_test_feat, rootdir=fname, mode='a')
c = bcolz.open(fname)
c.append(conv_test_feat)
c.flush()

As expected (and this is what motivated the whole exercise), the full set of convolutional test features does not fit in memory at once.
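A quick back-of-the-envelope check supports that, assuming VGG16's final convolutional output is a 512 × 14 × 14 float32 feature map per image:

```python
n_images = 79726                 # test-set size reported by the generator
feat_floats = 512 * 14 * 14      # floats per image in the last conv feature map
bytes_total = n_images * feat_floats * 4   # float32 = 4 bytes
print(round(bytes_total / 2**30, 1))       # ~29.8 GiB
```

Roughly 30 GiB of features: far more than this machine's RAM, hence the batched on-disk approach.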


In [5]:
fname = path + 'results/conv_test_feat.dat'

In [6]:
x = bcolz.open(fname)
len(x)


Out[6]:
79726

Loading train/valid features; defining & fitting NN model


In [31]:
# conv_train_feat_batches = get_batches(path + '/results/conv_feat.dat')
# conv_valid_feat_batches = get_batches(path + '/results/conv_val_feat.dat')
conv_trn_feat = load_array(path + '/results/conv_feat.dat')
conv_val_feat = load_array(path + '/results/conv_val_feat.dat')

In [7]:
(val_classes, trn_classes, val_labels, trn_labels, 
    val_filenames, filenames, test_filenames) = get_classes(path)


Found 19463 images belonging to 10 classes.
Found 2961 images belonging to 10 classes.
Found 79726 images belonging to 1 classes.

In [8]:
p = 0.8
bn_model = Sequential([
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dropout(p/2),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(p/2),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(10, activation='softmax')
        ])
bn_model.compile(Adam(lr=1e-3), loss='categorical_crossentropy', metrics=['accuracy'])

In [ ]:
# Sequential.fit_generator(self, generator, samples_per_epoch, nb_epoch, verbose=1, callbacks=None, validation_data=None, nb_val_samples=None, class_weight=None, max_q_size=10, nb_worker=1, pickle_safe=False, initial_epoch=0, **kwargs)
# bn_model.fit_generator((conv_train_feat_batches, trn_labels), conv_train_feat_batches.nb_sample, nb_epoch=1,
#                        validation_data=(conv_valid_feat_batches, val_labels), nb_val_samples=conv_valid_feat_batches.nb_sample)

In [34]:
bn_model.fit(conv_trn_feat, trn_labels, batch_size=batch_size, nb_epoch=1,
             validation_data = (conv_val_feat, val_labels))


Train on 19463 samples, validate on 2961 samples
Epoch 1/1
19463/19463 [==============================] - 8s - loss: 1.4585 - acc: 0.5945 - val_loss: 0.7479 - val_acc: 0.7207
Out[34]:
<keras.callbacks.History at 0x7fb53ffbc210>

In [35]:
# NOTE: in Keras 1, assigning a plain float here may not affect an
# already-compiled training function; K.set_value(bn_model.optimizer.lr, 1e-2)
# is the reliable way to change the learning rate.
bn_model.optimizer.lr=1e-2
bn_model.fit(conv_trn_feat, trn_labels, batch_size=batch_size, nb_epoch=4,
             validation_data = (conv_val_feat, val_labels))


Train on 19463 samples, validate on 2961 samples
Epoch 1/4
19463/19463 [==============================] - 8s - loss: 0.2751 - acc: 0.9171 - val_loss: 0.5394 - val_acc: 0.8173
Epoch 2/4
19463/19463 [==============================] - 8s - loss: 0.1748 - acc: 0.9462 - val_loss: 0.6580 - val_acc: 0.7693
Epoch 3/4
19463/19463 [==============================] - 8s - loss: 0.1337 - acc: 0.9578 - val_loss: 0.7747 - val_acc: 0.7305
Epoch 4/4
19463/19463 [==============================] - 8s - loss: 0.1088 - acc: 0.9670 - val_loss: 0.8638 - val_acc: 0.7518
Out[35]:
<keras.callbacks.History at 0x7fb53fbae410>

In [9]:
# bn_model.save_weights(path + 'models/da_conv8.h5')
bn_model.load_weights(path + 'models/da_conv8.h5')

In [55]:
# conv_test_feat_batches = bcolz.iterblocks(path + fname)
fname = path + 'results/conv_test_feat.dat'
idx, inc = 0, 4096
preds = []
while idx < test_batches.n - inc:
    conv_test_feat = bcolz.open(fname)[idx:idx+inc]
    idx += inc
    if len(preds):
        next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
        preds = np.concatenate([preds, next_preds])
    else:
        preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
conv_test_feat = bcolz.open(fname)[idx:]
next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
preds = np.concatenate([preds, next_preds])
print(len(preds))
if len(preds) != len(bcolz.open(fname)):
    print("Ya done fucked up, son.")


81920
Ya done fucked up, son.

Made a mistake in the last loop above. The penultimate batch -- the last full 4096-image block -- was appended to the predictions array twice, and the final 1,902 image predictions were never run.

Easy enough to fix: modify the above code so it runs correctly. Then either:

  • create an entirely new set of predictions from scratch (~ 1 hour), or
  • remove the last increment (4096) of predictions from the array and append the final batch.

Gonna take option 2.
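With hypothetical stand-in arrays (the real tail would come from running bn_model.predict on the last slice of features), option 2 amounts to:

```python
import numpy as np

block = 4096
# stand-in for the 81,920 bad predictions (last full block duplicated)
preds = np.zeros((81920, 10))
# stand-in for predictions on the 79726 % 4096 = 1902-image tail never run
tail_preds = np.ones((79726 % block, 10))

preds = preds[:-block]                       # drop the duplicated block
preds = np.concatenate([preds, tail_preds])  # append the real tail
print(len(preds))  # 79726
```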

EDIT:

Actually, option 1: preds was only stored in memory, which was wiped when I shut this machine down for the night. So this time I'll just build the predictions array properly from the start.

Below is testing/debugging output from the night before


In [52]:
print(81920 - 79726)
print(79726 % 4096)
print(81920 % 4096) # <-- that's yeh problem right there, kid


2194
1902
0

In [ ]:
x = preds[len(preds) - 4096]
print(preds[-1])
print(x)

In [43]:



Out[43]:
array([[  6.2483e-07,   2.4578e-06,   2.9354e-05,   6.7996e-05,   8.1581e-07,   2.9132e-06,
          9.9981e-01,   2.7024e-07,   8.0846e-05,   3.0252e-06],
       [  4.0233e-04,   3.4477e-05,   7.3304e-07,   3.4403e-01,   6.5541e-01,   9.2372e-06,
          7.4395e-05,   8.1555e-06,   1.8162e-05,   8.2546e-06],
       [  3.7808e-06,   1.2269e-06,   4.5053e-07,   3.5392e-06,   1.5726e-05,   3.7257e-06,
          1.5923e-05,   7.0004e-06,   9.9993e-01,   2.0967e-05],
       [  2.1178e-05,   6.0503e-06,   1.8488e-06,   7.9847e-06,   7.7963e-06,   9.9988e-01,
          3.8778e-05,   4.0426e-06,   1.3915e-05,   1.8222e-05],
       [  4.2161e-01,   1.3603e-04,   9.1913e-02,   5.2514e-04,   4.0447e-02,   2.0817e-01,
          1.7152e-02,   3.7824e-03,   2.7693e-02,   1.8857e-01],
       [  6.9312e-04,   5.2366e-02,   1.6738e-05,   5.5922e-06,   3.7776e-05,   1.1497e-04,
          1.4271e-05,   9.1994e-05,   6.5573e-05,   9.4659e-01],
       [  9.8337e-10,   4.1691e-08,   2.3664e-07,   4.8789e-09,   2.7257e-08,   2.3041e-08,
          2.1754e-07,   1.0000e+00,   5.1427e-07,   2.3709e-09],
       [  1.1594e-06,   5.4730e-09,   2.5601e-09,   8.1659e-09,   5.9669e-06,   9.9999e-01,
          5.0917e-07,   1.1032e-06,   3.9713e-07,   2.8886e-09],
       [  9.6761e-05,   9.8503e-01,   2.0876e-04,   1.8814e-05,   8.8029e-05,   1.1453e-04,
          1.1641e-02,   4.8791e-05,   2.6841e-03,   7.0924e-05],
       [  6.7607e-06,   1.3271e-07,   9.9734e-01,   9.7243e-06,   5.7442e-04,   1.3320e-05,
          5.3985e-07,   1.3169e-03,   1.1960e-04,   6.1661e-04]], dtype=float32)

In [40]:
preds[0]


Out[40]:
array([  7.4223e-07,   4.0707e-05,   5.6510e-05,   1.0678e-05,   1.6999e-04,   3.3456e-03,
         2.3427e-05,   9.9596e-01,   9.5046e-05,   2.9750e-04], dtype=float32)

In [10]:
# ??image.ImageDataGenerator.flow_from_directory

In [12]:
# ??Sequential.predict()

Redoing predictions here:


In [10]:
fname = path + 'results/conv_test_feat.dat'
idx, inc = 4096, 4096
preds = []

conv_test_feat = bcolz.open(fname)[:idx]
preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
while idx < test_batches.n - inc:
    conv_test_feat = bcolz.open(fname)[idx:idx+inc]
    idx += inc
    next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
    preds = np.concatenate([preds, next_preds])
conv_test_feat = bcolz.open(fname)[idx:]
next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
preds = np.concatenate([preds, next_preds])

print(len(preds))
if len(preds) != len(bcolz.open(fname)):
    print("Ya done fucked up, son.")


79726

Oh, I forgot: predictions through a fully-connected NN are fast. It's the convolutional layers that take a long time.

This is just a quick test that it works. The full, polished version will be in the reworked statefarm-codealong (or just statefarm) JNB:


In [11]:
def do_clip(arr, mx): return np.clip(arr, (1-mx)/9, mx)
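The clip matters because Kaggle's multiclass log loss punishes confidently wrong answers without bound: capping the top probability at 0.93 (and raising the rest to (1 − 0.93)/9, so each row still sums to ~1 over 10 classes) limits the worst-case per-image loss. A quick illustration, with do_clip redefined so the block is self-contained:

```python
import numpy as np

def do_clip(arr, mx): return np.clip(arr, (1 - mx) / 9, mx)

# a confidently wrong prediction: true class is 0, model put ~1.0 on class 7
pred = np.array([1e-7] * 10)
pred[7] = 1 - 9e-7

loss_raw = -np.log(pred[0])                      # huge penalty
loss_clipped = -np.log(do_clip(pred, 0.93)[0])   # bounded penalty
print(round(loss_raw, 1), round(loss_clipped, 1))  # 16.1 4.9
```

A single unclipped wrong answer costs as much as three clipped ones, so clipping usually improves the leaderboard score even though it throws away "information" on confident correct answers.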

In [12]:
subm = do_clip(preds, 0.93)

In [13]:
subm_name = path + 'results/subm01.gz'

In [15]:
trn_batches = get_batches(path + 'train', batch_size=batch_size, shuffle=False)


Found 19463 images belonging to 10 classes.

In [16]:
# make sure training batches defined before this:
classes = sorted(trn_batches.class_indices, key=trn_batches.class_indices.get)
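Sorting class_indices by its values (not its keys) recovers the column order that matches the one-hot label indices the generator assigned. A tiny example with a hypothetical class_indices dict shows why sorting by value matters:

```python
# hypothetical class_indices, as Keras builds it from directory names
class_indices = {'c0': 0, 'c1': 1, 'c2': 2, 'c10': 3}

# sort the keys by their assigned index, not alphabetically
classes = sorted(class_indices, key=class_indices.get)
print(classes)  # ['c0', 'c1', 'c2', 'c10']
```

A plain alphabetical sort would put 'c10' before 'c2' and scramble the submission columns.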

In [19]:
import pandas as pd
submission = pd.DataFrame(subm, columns=classes)
# f[8:] drops the leading 8-character directory prefix from each test filename
submission.insert(0, 'img', [f[8:] for f in test_filenames])
submission.head()


Out[19]:
img c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
0 img_93169.jpg 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.930000 0.007778 0.007778
1 img_81727.jpg 0.007778 0.007778 0.007778 0.930000 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778
2 img_53095.jpg 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.930000 0.007778 0.007778 0.007778
3 img_13927.jpg 0.052475 0.007778 0.007778 0.007778 0.081378 0.007778 0.007778 0.007778 0.857765 0.007778
4 img_36496.jpg 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.930000 0.007778 0.007778

In [20]:
submission.to_csv(subm_name, index=False, compression='gzip')

In [24]:
from IPython.display import FileLink
FileLink(subm_name)




This 'just good enough to pass' code/model scored 0.70947 on the Kaggle competition. My previous best was 1.50925, at place 658/1440 (top ~45.7%). This one reaches place 415/1440, top ~28.9%.


In [ ]: