9 June 2017
Wayne Nixalo
This notebook started out trying to generate convolutional test features (the job Sequential.predict_generator normally does) by using bcolz to save the generated features to disk in batches, as they were created. This was successful after a few days of work (roughly 6 - 9 June).
The notebook then continues on to build a full set of submittable predictions. With that solved, I can build a strong model unconstrained by system-memory limits. Video memory is another matter, for now.
In [1]:
import theano
In [2]:
import os, sys
sys.path.insert(1, os.path.join('utils'))
from __future__ import print_function, division
path = 'data/statefarm/'
import utils; reload(utils)
from utils import *
In [3]:
batch_size=16
vgg = Vgg16()
model = vgg.model
last_conv_idx = [i for i, l in enumerate(model.layers) if type(l) is Convolution2D][-1]
conv_layers = model.layers[:last_conv_idx + 1]
conv_model = Sequential(conv_layers)
In [4]:
gen = image.ImageDataGenerator()
test_batches = get_batches(path + 'test', batch_size=batch_size, shuffle=False)
Manual iteration through the test images to generate convolutional test features, saving each batch to disk instead of accumulating everything in memory.
In [5]:
# conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)
Note: the array written to the bcolz carray below has to be conv_test_feat -- it's easy to accidentally write conv_feat (the training features) instead.
In [7]:
fname = path + 'results/conv_test_feat.dat'
%rm -r $fname
# predict conv features one batch at a time, appending each batch to an on-disk bcolz carray
for i in xrange(test_batches.n // batch_size + 1):
    conv_test_feat = conv_model.predict_on_batch(test_batches.next()[0])
    if not i:
        c = bcolz.carray(conv_test_feat, rootdir=fname, mode='a')
    else:
        c.append(conv_test_feat)
c.flush()
Question: Why does it look like I can have the entire conv_test_feat array open at once, when opened w/ bcolz; but when it's explicitly loaded as a Numpy array via bcolz.open(fname)[:], all of a sudden the RAM takes a severe memory hit?
In [8]:
# apparently you can just open a (massive) bcolz carray this way
# without crashing memory... okay I'm learning things
# carr = bcolz.open(fname)
Out[8]:
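Answer, as far as I understand bcolz (not verified in this notebook): a carray lives on disk as compressed chunks, and bcolz.open() only returns a handle that decompresses whatever chunks a given slice touches. Indexing with [:] forces every chunk into a single in-memory NumPy array, which is the RAM hit. A minimal sketch, assuming fname points at the saved carray:

carr = bcolz.open(fname)    # lazy handle: metadata + a small chunk cache in RAM
block = carr[0:4096]        # decompresses only the chunks covering this slice
whole = carr[:]             # materializes the whole array as one ndarray -- the big RAM hit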
In [9]:
# forgot the '+1' above at first, so the last 14 images were missed. Appending them here:
# NOTE: the code below only adds on that one missed batch --
# it iterates the generator up to the final partial batch, then runs just that:
fname = path + 'results/conv_test_feat.dat'
test_batches.reset()
iters = test_batches.n // batch_size
for i in xrange(iters): test_batches.next()
conv_test_feat = conv_model.predict_on_batch(test_batches.next()[0])
# c = bcolz.carray(conv_test_feat, rootdir=fname, mode='a')
c = bcolz.open(fname)
c.append(conv_test_feat)
c.flush()
As expected (and this is what motivated the whole exercise), the full set of convolutional test features does not fit in memory at once.
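A rough back-of-envelope check (my numbers, assuming float32 features of shape (512, 14, 14) from VGG16's last conv layer and the ~79,726 test images referenced below):

n_test = 79726
feat_size = 512 * 14 * 14                  # 100,352 floats per image
print(n_test * feat_size * 4 / 2.**30)     # ~29.8 GiB as float32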
In [5]:
fname = path + 'results/conv_test_feat.dat'
In [6]:
x = bcolz.open(fname)
len(x)
Out[6]:
Loading train/valid features; defining & fitting NN model
In [31]:
# conv_train_feat_batches = get_batches(path + '/results/conv_feat.dat')
# conv_valid_feat_batches = get_batches(path + '/results/conv_val_feat.dat')
conv_trn_feat = load_array(path + '/results/conv_feat.dat')
conv_val_feat = load_array(path + '/results/conv_val_feat.dat')
In [7]:
(val_classes, trn_classes, val_labels, trn_labels,
    val_filenames, filenames, test_filenames) = get_classes(path)
In [8]:
p = 0.8
bn_model = Sequential([
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dropout(p/2),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(p/2),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(10, activation='softmax')
    ])
bn_model.compile(Adam(lr=1e-3), loss='categorical_crossentropy', metrics=['accuracy'])
In [ ]:
# Sequential.fit_generator(self, generator, samples_per_epoch, nb_epoch, verbose=1, callbacks=None, validation_data=None, nb_val_samples=None, class_weight=None, max_q_size=10, nb_worker=1, pickle_safe=False, initial_epoch=0, **kwargs)
# bn_model.fit_generator((conv_train_feat_batches, trn_labels), conv_train_feat_batches.nb_sample, nb_epoch=1,
# validation_data=(conv_valid_feat_batches, val_labels), nb_val_samples=conv_valid_feat_batches.nb_sample)
In [34]:
bn_model.fit(conv_trn_feat, trn_labels, batch_size=batch_size, nb_epoch=1,
             validation_data=(conv_val_feat, val_labels))
Out[34]:
In [35]:
bn_model.optimizer.lr=1e-2
bn_model.fit(conv_trn_feat, trn_labels, batch_size=batch_size, nb_epoch=4,
             validation_data=(conv_val_feat, val_labels))
Out[35]:
In [9]:
# bn_model.save_weights(path + 'models/da_conv8.h5')
bn_model.load_weights(path + 'models/da_conv8.h5')
In [55]:
# conv_test_feat_batches = bcolz.iterblocks(path + fname)
fname = path + 'results/conv_test_feat.dat'
idx, inc = 0, 4096
preds = []
while idx < test_batches.n - inc:
    conv_test_feat = bcolz.open(fname)[idx:idx+inc]
    idx += inc
    if len(preds):
        next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
        preds = np.concatenate([preds, next_preds])
    else:
        preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
conv_test_feat = bcolz.open(fname)[idx:]
next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
preds = np.concatenate([preds, next_preds])
print(len(preds))
if len(preds) != len(bcolz.open(fname)):
    print("Ya done fucked up, son.")
Made a mistake in the loop above: the penultimate batch -- the last full 4096-image batch -- was appended onto the end of the predictions array twice, and the final 2194 image predictions were never run.
Easy enough to fix: modify the above code so it works correctly, then either (1) rerun the whole prediction loop from scratch to rebuild preds properly, or (2) patch the existing in-memory preds array -- drop the duplicated block and run only the missing final predictions.
Gonna take option 2.
EDIT:
actually, option 1. preds was stored in memory, which was erased when I closed this machine for the night. So this time I'll just build the predictions array properly.
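In hindsight, the cheap insurance would have been to persist the intermediate predictions to disk -- utils.py has save_array as the counterpart to the load_array used earlier. A sketch (the filename is just a placeholder):

save_array(path + 'results/preds_partial.dat', preds)    # writes preds to an on-disk bcolz carray
# ...after a reboot...
preds = load_array(path + 'results/preds_partial.dat')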
Below is testing/debugging output from the night before. (79726 = 19*4096 + 1902, so a correct run produces 79726 predictions; the 81920 seen here is exactly 20*4096 -- one full block too many.)
In [52]:
print(81920 - 79726)
print(79726 % 4096)
print(81920 % 4096) # <-- that's yeh problem right there, kid
In [ ]:
x = preds[len(preds) - 4096]
print(preds[-1])
print(x)
In [40]:
preds[0]
Out[40]:
In [10]:
# ??image.ImageDataGenerator.flow_from_directory
In [12]:
# ??Sequential.predict()
Redoing predictions here:
In [10]:
fname = path + 'results/conv_test_feat.dat'
idx, inc = 4096, 4096
# first block primes the preds array
conv_test_feat = bcolz.open(fname)[:idx]
preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
# middle (full) blocks
while idx < test_batches.n - inc:
    conv_test_feat = bcolz.open(fname)[idx:idx+inc]
    idx += inc
    next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
    preds = np.concatenate([preds, next_preds])
# final partial block
conv_test_feat = bcolz.open(fname)[idx:]
next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
preds = np.concatenate([preds, next_preds])
print(len(preds))
if len(preds) != len(bcolz.open(fname)):
    print("Ya done fucked up, son.")
Oh, I forgot: predictions through a fully-connected NN are fast. It's the convolutional layers that take a long time.
This is just quick testing that it works; the full/polished version will go in the reworked statefarm-codealong (or just statefarm) notebook.
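Aside, and only a sketch (not something run in this notebook): the bcolz.iterblocks idea commented out at the top of the first prediction cell would drop the manual index arithmetic entirely, since it streams the on-disk carray back as fixed-size NumPy blocks:

preds_list = []
for block in bcolz.iterblocks(bcolz.open(fname), blen=4096):
    preds_list.append(bn_model.predict(block, batch_size=batch_size, verbose=0))
preds = np.concatenate(preds_list)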
In [11]:
def do_clip(arr, mx): return np.clip(arr, (1-mx)/9, mx)  # cap each probability at mx; the floor (1-mx)/9 spreads the remainder over the other 9 classes
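Why clip at all: Kaggle's multiclass log loss blows up on confident wrong answers, so bounding each probability bounds the worst-case per-image penalty. A quick check of what the clip does on a made-up 10-class row (nothing here comes from the real predictions):

row = np.zeros((1, 10)); row[0, 3] = 1.     # hypothetical over-confident prediction
print(do_clip(row, 0.93))                   # values now lie in [~0.00778, 0.93]
print(-np.log((1 - 0.93) / 9))              # worst-case per-image log loss: ~4.86
print(-np.log(0.93))                        # best-case per-image log loss: ~0.073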
In [12]:
subm = do_clip(preds, 0.93)
In [13]:
subm_name = path + 'results/subm01.gz'
In [15]:
trn_batches = get_batches(path + 'train', batch_size=batch_size, shuffle=False)
In [16]:
# make sure training batches are defined before this:
# class_indices maps class-folder name (c0..c9) -> column index; sorting by it
# puts the names in the same order as the model's softmax outputs
classes = sorted(trn_batches.class_indices, key=trn_batches.class_indices.get)
In [19]:
import pandas as pd
submission = pd.DataFrame(subm, columns=classes)
submission.insert(0, 'img', [f[8:] for f in test_filenames])  # drop the leading subfolder (e.g. 'unknown/', 8 chars) from each filename
submission.head()
Out[19]:
In [20]:
submission.to_csv(subm_name, index=False, compression='gzip')
In [24]:
from IPython.display import FileLink
FileLink(subm_name)
Out[24]:
This 'just good enough to pass' code/model scored 0.70947 on the Kaggle competition. My previous best was 1.50925, at place 658/1440 (top ~45.7%); this one comes in at 415/1440 (top ~28.9%).
In [ ]: