9 June 2017

Wayne Nixalo

This notebook started out as an attempt to generate convolutional test features with Sequential.predict_generator, using bcolz to save the generated features to disk in batches as they were created. That was successful after a few days of work (roughly 6–9 June).

The notebook then continues on to build a full set of submittable predictions. With that solved, I can build a strong model unconstrained by system-memory limits. Video memory is another matter, for now.
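The core pattern is: predict on one batch, append the result to an on-disk array, flush at the end. As a dependency-free sketch of the same idea (the notebook itself uses bcolz's `carray.append`; here a preallocated NumPy memmap plays that role, with dummy shapes rather than the real VGG feature shapes):

```python
import numpy as np
import tempfile, os

def save_features_in_batches(batches, n_samples, feat_shape, path):
    # Preallocate an on-disk array, then fill it one predicted batch
    # at a time, so RAM only ever holds a single batch of features.
    mm = np.lib.format.open_memmap(path, mode='w+', dtype='float32',
                                   shape=(n_samples,) + feat_shape)
    idx = 0
    for batch in batches:
        mm[idx:idx + len(batch)] = batch
        idx += len(batch)
    mm.flush()
    return idx

# dummy stand-ins for predicted conv features: 5 batches of 16 vectors
batches = [np.random.rand(16, 8).astype('float32') for _ in range(5)]
path = os.path.join(tempfile.mkdtemp(), 'feat.npy')
n = save_features_in_batches(batches, 80, (8,), path)
print(n)                                   # 80
print(np.load(path, mmap_mode='r').shape)  # (80, 8)
```

The reloaded memmap (like a reopened bcolz carray) can then be sliced batch-wise at prediction time without ever holding the full feature set in memory.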


In [1]:
import theano


/home/wnixalo/miniconda3/envs/FAI/lib/python2.7/site-packages/theano/gpuarray/dnn.py:135: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to version 5.1.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX 870M (0000:01:00.0)

In [2]:
import os, sys
sys.path.insert(1, os.path.join('utils'))

from __future__ import print_function, division

path = 'data/statefarm/'
import utils; reload(utils)
from utils import *


Using Theano backend.

In [3]:
batch_size=16
vgg = Vgg16()
model = vgg.model
last_conv_idx = [i for i, l in enumerate(model.layers) if type(l) is Convolution2D][-1]
conv_layers = model.layers[:last_conv_idx + 1]
conv_model = Sequential(conv_layers)

In [4]:
gen = image.ImageDataGenerator()
test_batches = get_batches(path + 'test', batch_size=batch_size, shuffle=False)


Found 79726 images belonging to 1 classes.

Manual iteration through the test images to generate convolutional test features. Each batch is saved to disk instead of being accumulated in memory.


In [5]:
# conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)

Note: make sure it's conv_test_feat (and not a leftover conv_feat from the training runs) that gets written and appended in the loop below.


In [7]:
fname = path + 'results/conv_test_feat.dat'
%rm -r $fname
for i in xrange(test_batches.n // batch_size + 1):
    conv_test_feat = conv_model.predict_on_batch(test_batches.next()[0])
    if not i:
        c = bcolz.carray(conv_test_feat, rootdir=fname, mode='a')
    else:
        c.append(conv_test_feat)
c.flush()


rm: cannot remove 'data/statefarm/results/conv_feat_test.dat': No such file or directory

Question: why does it look like I can have the entire conv_test_feat array open at once when it's opened with bcolz, but when it's explicitly loaded as a NumPy array via bcolz.open(fname)[:], RAM suddenly takes a severe hit?
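As I understand bcolz's design: `bcolz.open` only reads metadata and hands back a chunked, compressed, on-disk carray, decompressing chunks lazily as they're indexed, whereas `[:]` materializes the whole thing as one ordinary in-RAM NumPy array. The same lazy-versus-materialized distinction can be seen with a plain NumPy memmap:

```python
import numpy as np
import tempfile, os

path = os.path.join(tempfile.mkdtemp(), 'big.npy')
np.save(path, np.zeros((1000, 512), dtype='float32'))

lazy = np.load(path, mmap_mode='r')   # on-disk view: no bulk read yet
in_ram = np.array(lazy)               # forces a full copy into RAM
print(isinstance(lazy, np.memmap), type(in_ram) is np.ndarray)  # True True
```

So keeping the carray "open" costs almost nothing; it's the full slice that triggers the memory hit.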


In [8]:
# apparently you can just open a (massive) bcolz carray this way 
# without crashing memory... okay I'm learning things
# carr = bcolz.open(fname)


Out[8]:
79712

In [9]:
# forgot to add the '+1' so missed the last 14 images. Doing that here:
# NOTE: below code only adds on the missed batch
# iterate generator until final missed batch, then work:
fname = path + 'results/conv_test_feat.dat'
test_batches.reset()
iters = test_batches.n // batch_size
for i in xrange(iters): test_batches.next()
conv_test_feat = conv_model.predict_on_batch(test_batches.next()[0])
# c = bcolz.carray(conv_test_feat, rootdir=fname, mode='a')
c = bcolz.open(fname)
c.append(conv_test_feat)
c.flush()

As expected (and this is what motivated the whole exercise), the full set of convolutional test features does not fit in memory at once.
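A quick back-of-the-envelope check supports that, assuming VGG16's final convolutional output is a 512 × 14 × 14 float32 feature map per image:

```python
n_images = 79726                 # test-set size reported by the generator
feat_floats = 512 * 14 * 14      # floats per image in the last conv feature map
bytes_total = n_images * feat_floats * 4   # float32 = 4 bytes
print(round(bytes_total / 2**30, 1))       # ~29.8 GiB
```

Roughly 30 GiB of features: far more than this machine's RAM, hence the batched on-disk approach.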


In [5]:
fname = path + 'results/conv_test_feat.dat'

In [6]:
x = bcolz.open(fname)
len(x)


Out[6]:
79726

Loading train/valid features; defining & fitting NN model


In [31]:
# conv_train_feat_batches = get_batches(path + '/results/conv_feat.dat')
# conv_valid_feat_batches = get_batches(path + '/results/conv_val_feat.dat')
conv_trn_feat = load_array(path + '/results/conv_feat.dat')
conv_val_feat = load_array(path + '/results/conv_val_feat.dat')

In [7]:
(val_classes, trn_classes, val_labels, trn_labels, 
    val_filenames, filenames, test_filenames) = get_classes(path)


Found 19463 images belonging to 10 classes.
Found 2961 images belonging to 10 classes.
Found 79726 images belonging to 1 classes.

In [8]:
p = 0.8
bn_model = Sequential([
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dropout(p/2),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(p/2),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(10, activation='softmax')
        ])
bn_model.compile(Adam(lr=1e-3), loss='categorical_crossentropy', metrics=['accuracy'])

In [ ]:
# Sequential.fit_generator(self, generator, samples_per_epoch, nb_epoch, verbose=1, callbacks=None, validation_data=None, nb_val_samples=None, class_weight=None, max_q_size=10, nb_worker=1, pickle_safe=False, initial_epoch=0, **kwargs)
# bn_model.fit_generator((conv_train_feat_batches, trn_labels), conv_train_feat_batches.nb_sample, nb_epoch=1,
#                        validation_data=(conv_valid_feat_batches, val_labels), nb_val_samples=conv_valid_feat_batches.nb_sample)

In [34]:
bn_model.fit(conv_trn_feat, trn_labels, batch_size=batch_size, nb_epoch=1,
             validation_data = (conv_val_feat, val_labels))


Train on 19463 samples, validate on 2961 samples
Epoch 1/1
19463/19463 [==============================] - 8s - loss: 1.4585 - acc: 0.5945 - val_loss: 0.7479 - val_acc: 0.7207
Out[34]:
<keras.callbacks.History at 0x7fb53ffbc210>

In [35]:
# NOTE: in Keras 1, assigning a plain float here may not affect an
# already-compiled training function; K.set_value(bn_model.optimizer.lr, 1e-2)
# is the reliable way to change the learning rate.
bn_model.optimizer.lr=1e-2
bn_model.fit(conv_trn_feat, trn_labels, batch_size=batch_size, nb_epoch=4,
             validation_data = (conv_val_feat, val_labels))


Train on 19463 samples, validate on 2961 samples
Epoch 1/4
19463/19463 [==============================] - 8s - loss: 0.2751 - acc: 0.9171 - val_loss: 0.5394 - val_acc: 0.8173
Epoch 2/4
19463/19463 [==============================] - 8s - loss: 0.1748 - acc: 0.9462 - val_loss: 0.6580 - val_acc: 0.7693
Epoch 3/4
19463/19463 [==============================] - 8s - loss: 0.1337 - acc: 0.9578 - val_loss: 0.7747 - val_acc: 0.7305
Epoch 4/4
19463/19463 [==============================] - 8s - loss: 0.1088 - acc: 0.9670 - val_loss: 0.8638 - val_acc: 0.7518
Out[35]:
<keras.callbacks.History at 0x7fb53fbae410>

In [9]:
# bn_model.save_weights(path + 'models/da_conv8.h5')
bn_model.load_weights(path + 'models/da_conv8.h5')

In [55]:
# conv_test_feat_batches = bcolz.iterblocks(path + fname)
fname = path + 'results/conv_test_feat.dat'
idx, inc = 0, 4096
preds = []
while idx < test_batches.n - inc:
    conv_test_feat = bcolz.open(fname)[idx:idx+inc]
    idx += inc
    if len(preds):
        next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
        preds = np.concatenate([preds, next_preds])
    else:
        preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
conv_test_feat = bcolz.open(fname)[idx:]
next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
preds = np.concatenate([preds, next_preds])
print(len(preds))
if len(preds) != len(bcolz.open(fname)):
    print("Ya done fucked up, son.")


81920
Ya done fucked up, son.

Made a mistake in the last loop above. The penultimate batch -- the last full 4096-image block -- was appended to the predictions array twice, and the final 1,902 image predictions were never run.

Easy enough to fix: modify the above code so it runs correctly. Then either:

  • create an entirely new set of predictions from scratch (~ 1 hour), or
  • remove the last increment (4096) of predictions from the array and append the final batch.

Gonna take option 2.
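With hypothetical stand-in arrays (the real tail would come from running bn_model.predict on the last slice of features), option 2 amounts to:

```python
import numpy as np

block = 4096
# stand-in for the 81,920 bad predictions (last full block duplicated)
preds = np.zeros((81920, 10))
# stand-in for predictions on the 79726 % 4096 = 1902-image tail never run
tail_preds = np.ones((79726 % block, 10))

preds = preds[:-block]                       # drop the duplicated block
preds = np.concatenate([preds, tail_preds])  # append the real tail
print(len(preds))  # 79726
```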

EDIT:

Actually, option 1: preds was only stored in memory, which was wiped when I shut this machine down for the night. So this time I'll just build the predictions array properly from the start.

Below is testing/debugging output from the night before


In [52]:
print(81920 - 79726)
print(79726 % 4096)
print(81920 % 4096) # <-- that's yeh problem right there, kid


2194
1902
0

In [ ]:
x = preds[len(preds) - 4096]
print(preds[-1])
print(x)

In [43]:



Out[43]:
array([[  6.2483e-07,   2.4578e-06,   2.9354e-05,   6.7996e-05,   8.1581e-07,   2.9132e-06,
          9.9981e-01,   2.7024e-07,   8.0846e-05,   3.0252e-06],
       [  4.0233e-04,   3.4477e-05,   7.3304e-07,   3.4403e-01,   6.5541e-01,   9.2372e-06,
          7.4395e-05,   8.1555e-06,   1.8162e-05,   8.2546e-06],
       [  3.7808e-06,   1.2269e-06,   4.5053e-07,   3.5392e-06,   1.5726e-05,   3.7257e-06,
          1.5923e-05,   7.0004e-06,   9.9993e-01,   2.0967e-05],
       [  2.1178e-05,   6.0503e-06,   1.8488e-06,   7.9847e-06,   7.7963e-06,   9.9988e-01,
          3.8778e-05,   4.0426e-06,   1.3915e-05,   1.8222e-05],
       [  4.2161e-01,   1.3603e-04,   9.1913e-02,   5.2514e-04,   4.0447e-02,   2.0817e-01,
          1.7152e-02,   3.7824e-03,   2.7693e-02,   1.8857e-01],
       [  6.9312e-04,   5.2366e-02,   1.6738e-05,   5.5922e-06,   3.7776e-05,   1.1497e-04,
          1.4271e-05,   9.1994e-05,   6.5573e-05,   9.4659e-01],
       [  9.8337e-10,   4.1691e-08,   2.3664e-07,   4.8789e-09,   2.7257e-08,   2.3041e-08,
          2.1754e-07,   1.0000e+00,   5.1427e-07,   2.3709e-09],
       [  1.1594e-06,   5.4730e-09,   2.5601e-09,   8.1659e-09,   5.9669e-06,   9.9999e-01,
          5.0917e-07,   1.1032e-06,   3.9713e-07,   2.8886e-09],
       [  9.6761e-05,   9.8503e-01,   2.0876e-04,   1.8814e-05,   8.8029e-05,   1.1453e-04,
          1.1641e-02,   4.8791e-05,   2.6841e-03,   7.0924e-05],
       [  6.7607e-06,   1.3271e-07,   9.9734e-01,   9.7243e-06,   5.7442e-04,   1.3320e-05,
          5.3985e-07,   1.3169e-03,   1.1960e-04,   6.1661e-04]], dtype=float32)

In [40]:
preds[0]


Out[40]:
array([  7.4223e-07,   4.0707e-05,   5.6510e-05,   1.0678e-05,   1.6999e-04,   3.3456e-03,
         2.3427e-05,   9.9596e-01,   9.5046e-05,   2.9750e-04], dtype=float32)

In [10]:
# ??image.ImageDataGenerator.flow_from_directory

In [12]:
# ??Sequential.predict()

Redoing predictions here:


In [10]:
fname = path + 'results/conv_test_feat.dat'
idx, inc = 4096, 4096
preds = []

conv_test_feat = bcolz.open(fname)[:idx]
preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
while idx < test_batches.n - inc:
    conv_test_feat = bcolz.open(fname)[idx:idx+inc]
    idx += inc
    next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
    preds = np.concatenate([preds, next_preds])
conv_test_feat = bcolz.open(fname)[idx:]
next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
preds = np.concatenate([preds, next_preds])

print(len(preds))
if len(preds) != len(bcolz.open(fname)):
    print("Ya done fucked up, son.")


79726

Oh, I forgot: predictions through a fully-connected NN are fast. It's the convolutional layers that take a long time.

This is just a quick test that it works. The full, polished version will be in the reworked statefarm-codealong (or just statefarm) JNB:


In [11]:
def do_clip(arr, mx): return np.clip(arr, (1-mx)/9, mx)
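The clip matters because Kaggle's multiclass log loss punishes confidently wrong answers without bound: capping the top probability at 0.93 (and raising the rest to (1 − 0.93)/9, so each row still sums to ~1 over 10 classes) limits the worst-case per-image loss. A quick illustration, with do_clip redefined so the block is self-contained:

```python
import numpy as np

def do_clip(arr, mx): return np.clip(arr, (1 - mx) / 9, mx)

# a confidently wrong prediction: true class is 0, model put ~1.0 on class 7
pred = np.array([1e-7] * 10)
pred[7] = 1 - 9e-7

loss_raw = -np.log(pred[0])                      # huge penalty
loss_clipped = -np.log(do_clip(pred, 0.93)[0])   # bounded penalty
print(round(loss_raw, 1), round(loss_clipped, 1))  # 16.1 4.9
```

A single unclipped wrong answer costs as much as three clipped ones, so clipping usually improves the leaderboard score even though it throws away "information" on confident correct answers.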

In [12]:
subm = do_clip(preds, 0.93)

In [13]:
subm_name = path + 'results/subm01.gz'

In [15]:
trn_batches = get_batches(path + 'train', batch_size=batch_size, shuffle=False)


Found 19463 images belonging to 10 classes.

In [16]:
# make sure training batches defined before this:
classes = sorted(trn_batches.class_indices, key=trn_batches.class_indices.get)
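Sorting class_indices by its values (not its keys) recovers the column order that matches the one-hot label indices the generator assigned. A tiny example with a hypothetical class_indices dict shows why sorting by value matters:

```python
# hypothetical class_indices, as Keras builds it from directory names
class_indices = {'c0': 0, 'c1': 1, 'c2': 2, 'c10': 3}

# sort the keys by their assigned index, not alphabetically
classes = sorted(class_indices, key=class_indices.get)
print(classes)  # ['c0', 'c1', 'c2', 'c10']
```

A plain alphabetical sort would put 'c10' before 'c2' and scramble the submission columns.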

In [19]:
import pandas as pd
submission = pd.DataFrame(subm, columns=classes)
# f[8:] drops the leading 8-character directory prefix from each test filename
submission.insert(0, 'img', [f[8:] for f in test_filenames])
submission.head()


Out[19]:
img c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
0 img_93169.jpg 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.930000 0.007778 0.007778
1 img_81727.jpg 0.007778 0.007778 0.007778 0.930000 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778
2 img_53095.jpg 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.930000 0.007778 0.007778 0.007778
3 img_13927.jpg 0.052475 0.007778 0.007778 0.007778 0.081378 0.007778 0.007778 0.007778 0.857765 0.007778
4 img_36496.jpg 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.007778 0.930000 0.007778 0.007778

In [20]:
submission.to_csv(subm_name, index=False, compression='gzip')

In [24]:
from IPython.display import FileLink
FileLink(subm_name)




This 'just good enough to pass' code/model scored 0.70947 on the Kaggle competition. My previous best was 1.50925, at place 658/1440 (top ~45.7%). This one reaches place 415/1440, top ~28.9%.


In [ ]: