Our deep network is a CNN adapted from a Keras website example, trained with a binary cross-entropy loss and a sigmoidal activation on the output layer. All images were brought to a fixed 300x185x3 shape when loaded. More details about the model appear later in the document.
We ran into some issues during training related to the number of images an AWS instance can hold in memory at once. On a "p2.xlarge" instance, roughly 15,000 images fit in memory (12 GB on the Tesla K80). Since we have many more images than that, we trained iteratively: fit the model for 10 epochs on a sample of 10,000 images, draw another sample, and train again, repeating the process to improve test performance. We also standardized each image to zero mean and unit standard deviation.
After 100 epochs of training on a small subset of our data, our model achieved a binary accuracy of 0.9939. As the matplotlib visualizations show, four of our seven genres were predicted as almost entirely zero, while the predicted labels for the other three genres were more accurate. Rather than relying on overall accuracy, we examined the average precision and recall across test observations when tuning the model's parameters. The main limiting factor for improvement in this round was the time required to test new parameters without "breaking the bank" on AWS credits. We hope to continue improving this model prior to the final paper.
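One way to compute those per-genre precision and recall numbers from the sigmoid outputs is sketched below; it is meant to be run after the model defined later has been trained and X_test / y_test loaded. The 0.5 threshold and the use of sklearn.metrics here are illustrative assumptions, not part of the pipeline above.
In [ ]:
# Sketch: per-genre precision/recall from thresholded sigmoid outputs.
from sklearn.metrics import precision_score, recall_score

y_prob = model.predict(X_test, batch_size=32)   # sigmoid outputs in [0, 1]
y_pred = (y_prob > 0.5).astype(int)             # hard multilabel predictions

prec = precision_score(y_test, y_pred, average=None)  # one value per genre
rec = recall_score(y_test, y_pred, average=None)
print 'per-genre precision:', prec
print 'per-genre recall:', rec
print 'macro-averaged precision / recall:', prec.mean(), rec.mean()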
In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
from scipy import ndimage
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Activation, Conv2D, MaxPooling2D
from keras.optimizers import SGD, RMSprop
from keras.utils import plot_model
from keras.utils.vis_utils import model_to_dot
from keras.preprocessing.image import ImageDataGenerator
from IPython.display import SVG
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Load data
%cd ~/data/
labs = pd.read_csv('multilabels.csv')
ids = pd.read_csv('features_V1.csv', usecols=[0])
# Take care of some weirdness that led to duplicate entries
labs = pd.concat([ids,labs], axis=1, ignore_index=True)
labs = labs.drop_duplicates(subset=[0])
ids = labs.pop(0).as_matrix()
labs = labs.as_matrix()
In [3]:
# Split train/test - 15k is about the limit of what we can hold in memory (12GB on Tesla K80)
n_train = 10000
n_test = 5000
rnd_ids = np.random.choice(np.squeeze(ids), size=n_train+n_test, replace=False)
train_ids = rnd_ids[:n_train]
test_ids = rnd_ids[n_train:]
# Pull in multilabels
y_train = labs[np.nonzero(np.in1d(np.squeeze(ids),train_ids))[0]]
y_test = labs[np.nonzero(np.in1d(np.squeeze(ids),test_ids))[0]]
# Read in images - need to do some goofy stuff here to handle the highly irregular image sizes and formats
X_train = np.zeros([n_train, 600, 185, 3])
ct = 0
for i in train_ids:
    IM = ndimage.imread('posters/{}.jpg'.format(i))
    try:
        X_train[ct,:IM.shape[0],:,:] = IM[:,:,:3]
    except:
        # greyscale (2-D) poster: fall back to filling a single channel
        X_train[ct,:IM.shape[0],:,0] = IM
    ct += 1
    if ct % 100 == 0:
        print 'training data {i}/{n} loaded'.format(i=ct, n=n_train)
X_train = X_train[:,:300,:,:] # trim excess off edges
print 'training data loaded'
X_test = np.zeros([n_test, 600, 185, 3])
ct = 0
for i in test_ids:
    IM = ndimage.imread('posters/{}.jpg'.format(i))
    try:
        X_test[ct,:IM.shape[0],:,:] = IM[:,:,:3]
    except:
        # greyscale (2-D) poster: fall back to filling a single channel
        X_test[ct,:IM.shape[0],:,0] = IM
    ct += 1
    if ct % 100 == 0:
        print 'test data {i}/{n} loaded'.format(i=ct, n=n_test)
X_test = X_test[:,:300,:,:] # trim excess off edges
print 'test data loaded'
# Create dataGenerator to feed image batches -
# this is nice because it also standardizes training data
datagen = ImageDataGenerator(
samplewise_center=True,
samplewise_std_normalization=True)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(X_train)
In [7]:
# Build CNN model
model = Sequential()
# input: 300x185 images with 3 channels -> (300, 185, 3) tensors.
# this applies 32 convolution filters of size 3x3 each.
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(300, 185, 3)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(7, activation='sigmoid'))
#sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='binary_crossentropy', optimizer=sgd)
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['binary_accuracy'])
model.summary()
# Visualize network graph
#SVG(model_to_dot(model).create(prog='dot', format='svg'))
We decided to implement a convolutional neural network adapted from the Keras "VGG-like convnet" tutorial; this is essentially a much-simplified version of the pre-trained VGG-16 model that we fine-tuned.
The network consists of two convolution layers with ReLU activation, followed by max pooling, a repeat of this motif, and then two fully connected layers: the first with ReLU activation and the output layer with sigmoidal activation. The model is regularized with dropout after each max-pooling layer and between the two fully connected layers.
We trained the network with a binary cross-entropy loss function and the RMSprop optimizer. We tried a number of other optimizers, including SGD with momentum, Adam, and Nadam; none showed much difference in performance, and all were slightly slower to learn than RMSprop.
Because we use the RMSprop optimizer, we plan to tune only the learning rate, as the documentation suggests leaving the other parameters at their defaults. First we test the model on a smaller set of 512 images to make sure it works, then move on to training on a set of 10,000 images.
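As a compact illustration of that plan, the sweep we carry out by hand in the cells below would look roughly like this sketch; only the learning rate is varied, with rho, epsilon and decay left at the Keras defaults. Each setting should start from a freshly rebuilt model, as noted further down.
In [ ]:
# Sketch of the learning-rate sweep run manually in the cells below.
for lr in [0.01, 0.001, 0.0001]:
    model.compile(loss='binary_crossentropy',
                  optimizer=RMSprop(lr=lr),
                  metrics=['binary_accuracy'])
    # ...rebuild the model and re-run fit_generator / evaluate here,
    # exactly as in the individual cells that follow.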
In [11]:
# Fit the model with a small batch of training data, n=512 images
model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=100)
score = model.evaluate(X_test, y_test, batch_size=32)
In [15]:
model.predict(X_test[:50,:,:,:])
Out[15]:
In [25]:
plt.pcolor(model.predict(X_test))
Out[25]:
In [26]:
plt.pcolor(y_test)
Out[26]:
In [34]:
# Now fit the model using a much bigger training set, n=1e4
model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=10)
# Note: this may be slightly off, since model.evaluate does not apply the
# same sample-wise standardization as the training data feeder.
score = model.evaluate(X_test, y_test, batch_size=32)
# A way to evaluate through the generator instead is sketched after this cell,
# e.g. something like:
#score = model.evaluate_generator(datagen.flow(X_test, y_test, batch_size=32), steps=len(X_test) / 32)
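One way to make the test-time preprocessing match the training feeder would be to evaluate through the same generator, so every test image receives the identical sample-wise standardization. This is a sketch only; the scores reported below were not computed this way.
In [ ]:
# Sketch: evaluate through the same ImageDataGenerator so test images are
# standardized sample-wise exactly like the training batches.
test_flow = datagen.flow(X_test, y_test, batch_size=32, shuffle=False)
score_std = model.evaluate_generator(test_flow, steps=len(X_test) / 32)
print score_std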
In [35]:
# Performance on the test set
print score
We can see that performance on the test set is much worse, with an overall binary accuracy of 0.77, well below the final training-set accuracy of 0.89. Either we have not trained the model on a large enough set of images, or we are simply overfitting the data we have. We can try loading a new set of training images and continuing to train the model to see if test performance improves.
In [ ]:
# Continue fitting the model for another 5 epochs and see if much changes - we could just be overfitting at this point
model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=5)
score = model.evaluate(X_test, y_test, batch_size=32)
Note that every time we test a different hyperparameter or a different number of epochs, we rebuild the model from scratch, because Keras will otherwise continue training a model that has already been fit. When testing different parameters we do not want to train multiple configurations on top of one another.
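In practice we do that rebuild by re-running the model-definition cell above. The same idea wrapped in a helper function would look roughly like the sketch below; build_model is a hypothetical name and is not used elsewhere in this notebook.
In [ ]:
# Hypothetical helper: rebuild the architecture from scratch so every
# hyperparameter test starts from freshly initialized weights.
def build_model(lr=0.001):
    m = Sequential()
    m.add(Conv2D(32, (3, 3), activation='relu', input_shape=(300, 185, 3)))
    m.add(Conv2D(32, (3, 3), activation='relu'))
    m.add(MaxPooling2D(pool_size=(2, 2)))
    m.add(Dropout(0.25))
    m.add(Conv2D(64, (3, 3), activation='relu'))
    m.add(Conv2D(64, (3, 3), activation='relu'))
    m.add(MaxPooling2D(pool_size=(2, 2)))
    m.add(Dropout(0.25))
    m.add(Flatten())
    m.add(Dense(256, activation='relu'))
    m.add(Dropout(0.5))
    m.add(Dense(7, activation='sigmoid'))
    m.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=lr),
              metrics=['binary_accuracy'])
    return m

# e.g. model = build_model(lr=0.0001) before each new hyperparameter run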
In [36]:
#RMSProp optimizer to tune learning rate, others at default as suggested by documentation
# try learning rate of 0.01
model.compile(loss='binary_crossentropy',
optimizer=RMSprop(lr=0.01, rho=0.9, epsilon=1e-08, decay=0.0),
metrics=['binary_accuracy'])
In [37]:
# Now fit the model using a much bigger training set, n=1e4
model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=10)
score = model.evaluate(X_test, y_test, batch_size=32)
In [38]:
print score
In [16]:
#RMSProp optimizer to tune learning rate, others at default as suggested by documentation
# try learning rate of 0.0001
model.compile(loss='binary_crossentropy',
optimizer=RMSprop(lr=0.0001, rho=0.9, epsilon=1e-08, decay=0.0),
metrics=['binary_accuracy'])
In [17]:
# Now fit the model using a much bigger training set, n=1e4
model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=10)
Out[17]:
In [18]:
score = model.evaluate(X_test, y_test, batch_size=32)
In [19]:
print score
Looking at the results above, as the number of epochs increases the binary accuracy improves and the loss decreases, which improves the model's overall score. Turning to the learning rate, the RMSprop default of 0.001 over 10 epochs performs better than 0.01, giving lower loss and higher binary accuracy, while a learning rate of 0.0001 gives loss and binary accuracy similar to 0.001. For the sake of this analysis we choose a learning rate of 0.0001, though we may explore even smaller rates as we move to training with more epochs. Next we train a model for 15 epochs to see how performance varies; for the sake of time we do not extend beyond this here, but we plan to use more epochs as we progress.
In [24]:
#RMSProp optimizer to tune learning rate, others at default as suggested by documentation
# choose learning rate of 0.0001
model.compile(loss='binary_crossentropy',
optimizer=RMSprop(lr=0.0001, rho=0.9, epsilon=1e-08, decay=0.0),
metrics=['binary_accuracy'])
In [25]:
# Now fit the model using a much bigger training set, n=1e4
base_model = model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=15)
In [26]:
score = model.evaluate(X_test, y_test, batch_size=32)
print score
Comparing this result to the previous models, it appears worthwhile to test a larger number of epochs (perhaps 100) to see whether model performance improves.
In [33]:
print(base_model.history.keys())
In [38]:
plt.plot(base_model.history['binary_accuracy'])
plt.title('Model Binary Accuracy for Base model')
plt.ylabel('binary accuracy')
plt.xlabel('epoch')
plt.show()
In [51]:
plt.plot(base_model.history['loss'])
plt.title('Model Loss for Base model')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()
In [8]:
# now the same thing as above but with 50 epochs to see if the loss continues to shrink toward 0
# if not, we will have to reevaluate some of our model's parameters
#RMSProp optimizer to tune learning rate, others at default as suggested by documentation
# choose learning rate of 0.0001
model.compile(loss='binary_crossentropy',
optimizer=RMSprop(lr=0.0001, rho=0.9, epsilon=1e-08, decay=0.0),
metrics=['binary_accuracy'])
In [9]:
# Now fit the model using a much bigger training set, n=1e4
base_model_100e = model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=50, verbose=2)
In [10]:
plt.plot(base_model_100e.history['loss'])
plt.title('Model Loss for 50 epochs - LR = 0.0001')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()
Plotting the loss over 50 epochs, we do not see anything too concerning: the loss decreases quickly, as expected, and then levels out around 0.1.
In [11]:
plt.plot(base_model_100e.history['binary_accuracy'])
plt.title('Model Accuracy for 50 epochs - LR = 0.0001')
plt.ylabel('binary accuracy')
plt.xlabel('epoch')
plt.show()
Now we will create a new model with the same hyperparameters, again fitting 10 epochs at a time, but this time using 10,000 new training observations for each set of 10 epochs.
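The cells that follow do this chunk by chunk, by hand. As a roadmap, the whole procedure in one loop is sketched below; load_images is a hypothetical shorthand for the poster-loading code that is repeated verbatim for each chunk.
In [ ]:
# Sketch of the chunked training carried out cell-by-cell below.
for start in range(0, 30000, 10000):
    chunk_ids = train_ids[start:start + 10000]
    y_chunk = labs[np.nonzero(np.in1d(np.squeeze(ids), chunk_ids))[0]]
    X_chunk = load_images(chunk_ids)   # hypothetical helper for the loading code below
    datagen.fit(X_chunk)
    model.fit_generator(datagen.flow(X_chunk, y_chunk, batch_size=32),
                        steps_per_epoch=len(X_chunk) / 32, epochs=10)
    del X_chunk, y_chunk               # free memory before loading the next chunk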
In [3]:
# Sample train/test IDs - the images themselves are loaded 10,000 at a time below,
# since ~15k is about the limit of what we can hold in memory (12GB on Tesla K80)
n_train = 145000
n_test = 5000
rnd_ids = np.random.choice(np.squeeze(ids), size=n_train+n_test, replace=False)
train_ids = rnd_ids[:n_train]
test_ids = rnd_ids[n_train:]
In [4]:
train_set1 = train_ids[0:10000]
In [11]:
# Pull in multilabels
y_train = labs[np.nonzero(np.in1d(np.squeeze(ids),train_set1))[0]]
y_test = labs[np.nonzero(np.in1d(np.squeeze(ids),test_ids))[0]]
In [5]:
# Read in images - need to do some goofy stuff here to handle the highly irregular image sizes and formats
X_train = np.zeros([train_set1.shape[0], 600, 185, 3])
ct = 0
for i in train_set1:
    IM = ndimage.imread('posters/{}.jpg'.format(i))
    try:
        X_train[ct,:IM.shape[0],:,:] = IM[:,:,:3]
    except:
        X_train[ct,:IM.shape[0],:,0] = IM
    ct += 1
    if ct % 100 == 0:
        print 'training data {i}/{n} loaded'.format(i=ct, n=train_set1.shape[0])
X_train = X_train[:,:300,:,:] # trim excess off edges
print 'training data loaded'
X_test = np.zeros([n_test, 600, 185, 3])
ct = 0
for i in test_ids:
    IM = ndimage.imread('posters/{}.jpg'.format(i))
    try:
        X_test[ct,:IM.shape[0],:,:] = IM[:,:,:3]
    except:
        X_test[ct,:IM.shape[0],:,0] = IM
    ct += 1
    if ct % 100 == 0:
        print 'test data {i}/{n} loaded'.format(i=ct, n=n_test)
X_test = X_test[:,:300,:,:] # trim excess off edges
print 'test data loaded'
# Create dataGenerator to feed image batches -
# this is nice because it also standardizes training data
datagen = ImageDataGenerator(
samplewise_center=True,
samplewise_std_normalization=True)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(X_train)
In [6]:
# Build CNN model
model = Sequential()
# input: 300x185 images with 3 channels -> (300, 185, 3) tensors.
# this applies 32 convolution filters of size 3x3 each.
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(300, 185, 3)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(7, activation='sigmoid'))
#sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='binary_crossentropy', optimizer=sgd)
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['binary_accuracy'])
In [7]:
#RMSProp optimizer to tune learning rate, others at default as suggested by documentation
# choose learning rate of 0.0001
model.compile(loss='binary_crossentropy',
optimizer=RMSprop(lr=0.0001, rho=0.9, epsilon=1e-08, decay=0.0),
metrics=['binary_accuracy'])
In [12]:
# fit the model on successive sets of 10,000 training observations: first set
multi_train_sets = model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=10, verbose=2)
In [ ]:
multi_train_loss = multi_train_sets.history['loss']
multi_train_acc = multi_train_sets.history['binary_accuracy']
In [19]:
# need to do this to free up memory
del X_train
del train_set1
del y_train
In [23]:
train_set2 = train_ids[10000:20000]
In [24]:
y_train = labs[np.nonzero(np.in1d(np.squeeze(ids),train_set2))[0]]
In [26]:
# Read in images - need to do some goofy stuff here to handle the highly irregular image sizes and formats
X_train = np.zeros([train_set2.shape[0], 600, 185, 3])
ct = 0
for i in train_set2:
    IM = ndimage.imread('posters/{}.jpg'.format(i))
    try:
        X_train[ct,:IM.shape[0],:,:] = IM[:,:,:3]
    except:
        X_train[ct,:IM.shape[0],:,0] = IM
    ct += 1
    if ct % 100 == 0:
        print 'training data {i}/{n} loaded'.format(i=ct, n=train_set2.shape[0])
X_train = X_train[:,:300,:,:] # trim excess off edges
print 'training data loaded'
# Create dataGenerator to feed image batches -
# this is nice because it also standardizes training data
datagen = ImageDataGenerator(
samplewise_center=True,
samplewise_std_normalization=True)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(X_train)
In [27]:
# fit the model on successive sets of 10,000 training observations: second set
multi_train_sets = model.fit_generator(datagen.flow(X_train, y_train, batch_size=32),
steps_per_epoch=len(X_train) / 32, epochs=10, verbose=2)
In [28]:
score = model.evaluate(X_test, y_test, batch_size=32)
print score
In [29]:
multi_train_loss2 = multi_train_sets.history['loss']
multi_train_acc2 = multi_train_sets.history['binary_accuracy']
In [46]:
plt.plot(multi_train_loss)
plt.plot(multi_train_loss2)
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['0-10,000 obs.', ' Adding 10,000-20,000 obs.'], loc='upper right')
plt.show()
Preliminary results show that as we add additional training observations the model's loss improves. This suggests that continuing to add observations would improve performance further; however, after this second set we run into memory problems when trying to load the next 10,000 training observations.
In [30]:
# need to do this to free up memory
del X_train
del train_set2
del y_train
In [ ]:
# Anything beyond this point results in a memory error, even after deleting the big training arrays.
# One possible direction is sketched in the cell below.
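One direction we may try next (a sketch only, not run here) is to stop holding a full 10,000-image array in memory at all and instead yield standardized batches straight from disk with a small Python generator. In the sketch, y_all is a hypothetical label array row-aligned with the id list passed in.
In [ ]:
# Sketch: read and standardize posters batch-by-batch from disk so only
# batch_size images are ever in memory at once.
def poster_batches(id_list, label_rows, batch_size=32):
    # label_rows is assumed to be row-aligned with id_list
    while True:  # Keras generators are expected to loop forever
        for start in range(0, len(id_list) - batch_size + 1, batch_size):
            batch_ids = id_list[start:start + batch_size]
            X = np.zeros([batch_size, 300, 185, 3])
            for j, i in enumerate(batch_ids):
                IM = ndimage.imread('posters/{}.jpg'.format(i))
                if IM.ndim == 2:                  # greyscale poster
                    IM = np.stack([IM] * 3, axis=-1)
                IM = IM[:300, :185, :3]           # trim to the model's input size
                X[j, :IM.shape[0], :IM.shape[1], :] = IM
            X -= X.mean(axis=(1, 2, 3), keepdims=True)        # sample-wise center
            X /= X.std(axis=(1, 2, 3), keepdims=True) + 1e-7  # sample-wise scale
            yield X, label_rows[start:start + batch_size]

# e.g. model.fit_generator(poster_batches(train_ids[:100000], y_all),
#                          steps_per_epoch=100000 // 32, epochs=10)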
In [31]:
train_set3 = train_ids[20000:30000]
y_train = labs[np.nonzero(np.in1d(np.squeeze(ids),train_set3))[0]]
In [32]:
# Read in images - need to do some goofy stuff here to handle the highly irregular image sizes and formats
X_train = np.zeros([train_set3.shape[0], 600, 185, 3])
ct = 0
for i in train_set3:
    IM = ndimage.imread('posters/{}.jpg'.format(i))
    try:
        X_train[ct,:IM.shape[0],:,:] = IM[:,:,:3]
    except:
        X_train[ct,:IM.shape[0],:,0] = IM
    ct += 1
    if ct % 100 == 0:
        print 'training data {i}/{n} loaded'.format(i=ct, n=train_set3.shape[0])
X_train = X_train[:,:300,:,:] # trim excess off edges
print 'training data loaded'
# Create dataGenerator to feed image batches -
# this is nice because it also standardizes training data
datagen = ImageDataGenerator(
samplewise_center=True,
samplewise_std_normalization=True)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(X_train)