Visual Question Answering with LSTM and VGG features

In this notebook, we build a VQA model with an LSTM as the language model and VGG-19 as the visual model. Since the full dataset is quite large, we load and experiment with a small portion of it on a local machine.


In [1]:
# don't reinvent the wheel
import h5py, json, spacy

import numpy as np
import cPickle as pickle

%matplotlib inline
import matplotlib.pyplot as plt

from model import LSTMModel
from utils import prepare_ques_batch, prepare_im_batch, get_batches_idx


Using TensorFlow backend.

Word Embeddings

For word embeddings, we use the pre-trained word vectors provided by the spacy package.


In [2]:
# run `python -m spacy.en.download` to collect the embeddings (1st time only)
embeddings = spacy.en.English()
word_dim = 300

Loading Tiny Dataset

Here we load a small subset of the dataset, prepared using the script in Dataset Handling.ipynb.


In [4]:
h5_img_file_tiny = h5py.File('data/vqa_data_img_vgg_train_small.h5', 'r')
fv_im_tiny = h5_img_file_tiny.get('/images_train')

with open('data/qa_data_train_small.pkl', 'rb') as fp:
    qa_data_tiny = pickle.load(fp)

json_file = json.load(open('data/vqa_data_prepro.json', 'r'))
ix_to_word = json_file['ix_to_word']
ix_to_ans = json_file['ix_to_ans']

vocab_size = len(ix_to_word)
print "Loading tiny dataset of %d image features and %d question/answer pairs for training." % (len(fv_im_tiny), len(qa_data_tiny))


Loading tiny dataset of 8000 image features and 24000 question/answer pairs for training.
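The `ix_to_word` and `ix_to_ans` tables loaded above map integer indices back to question tokens and answer strings. A toy sketch of decoding a zero-padded question with an `ix_to_word`-style table (the dictionary and helper below are illustrative, not part of the notebook's utils; note the json tables store keys as strings):

```python
# Toy stand-in for the lookup table loaded from vqa_data_prepro.json
ix_to_word = {'1': 'what', '2': 'color', '3': 'is', '4': 'the', '5': 'cat'}

def decode_question(token_ixs, ix_to_word):
    """Hypothetical helper: turn a padded index sequence back into text.
    Index 0 is assumed to be padding."""
    return ' '.join(ix_to_word[str(ix)] for ix in token_ixs if ix != 0)

print(decode_question([1, 2, 3, 4, 5, 0, 0], ix_to_word))
# 'what color is the cat'
```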

In this dataset, each image is associated with multiple question/answer pairs (3 in this case). Therefore, we need to manually bind each question/answer pair to its corresponding image feature for training.


In [5]:
questions, ques_len, im_ix, ans = zip(*qa_data_tiny)

nb_classes = 1000
max_ques_len = 26

X_ques = prepare_ques_batch(questions, ques_len, max_ques_len, embeddings, word_dim, ix_to_word)
X_im = prepare_im_batch(fv_im_tiny, im_ix)
y = np.zeros((len(ans), nb_classes))
y[np.arange(len(ans)), ans] = 1
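The internals of `prepare_ques_batch` (from utils) are not shown here; a minimal numpy sketch of what such a padding routine might look like, assuming each question has already been mapped to a list of word vectors:

```python
import numpy as np

def pad_question_batch(ques_vectors, max_ques_len, word_dim):
    """Sketch of a prepare_ques_batch-style routine (the real one in
    utils may differ): stack variable-length lists of word vectors into
    a fixed-size (batch, max_ques_len, word_dim) array, zero-padded."""
    batch = np.zeros((len(ques_vectors), max_ques_len, word_dim), dtype='float32')
    for i, vecs in enumerate(ques_vectors):
        n = min(len(vecs), max_ques_len)
        batch[i, :n] = vecs[:n]
    return batch

# Two toy "questions" of lengths 2 and 3, with 4-d word vectors
toy = [np.ones((2, 4)), np.ones((3, 4))]
X = pad_question_batch(toy, max_ques_len=5, word_dim=4)
print(X.shape)  # (2, 5, 4)
```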

Overfit LSTM + VGG

Finally, we are getting to the fun part! Let's build our model...


In [20]:
model = LSTMModel()
model.build()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
bidirectional_2 (Bidirectional)  (None, 1024)          2497536                                      
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)    (None, 7, 7, 512)     0                                            
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 25088)         0                                            
____________________________________________________________________________________________________
dense_7 (Dense)                  (None, 4096)          102764544                                    
____________________________________________________________________________________________________
batchnormalization_7 (BatchNormal(None, 4096)          8192                                         
____________________________________________________________________________________________________
dense_8 (Dense)                  (None, 4096)          16781312                                     
____________________________________________________________________________________________________
batchnormalization_8 (BatchNormal(None, 4096)          8192                                         
____________________________________________________________________________________________________
dense_9 (Dense)                  (None, 4096)          16781312                                     
____________________________________________________________________________________________________
batchnormalization_9 (BatchNormal(None, 4096)          8192                                         
____________________________________________________________________________________________________
batchnormalization_10 (BatchNorma(None, 5120)          10240       merge_2[0][0]                    
____________________________________________________________________________________________________
dense_10 (Dense)                 (None, 2014)          10313694    batchnormalization_10[0][0]      
____________________________________________________________________________________________________
batchnormalization_11 (BatchNorma(None, 2014)          4028        dense_10[0][0]                   
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 2014)          0           batchnormalization_11[0][0]      
____________________________________________________________________________________________________
dense_11 (Dense)                 (None, 2014)          4058210     dropout_3[0][0]                  
____________________________________________________________________________________________________
batchnormalization_12 (BatchNorma(None, 2014)          4028        dense_11[0][0]                   
____________________________________________________________________________________________________
dropout_4 (Dropout)              (None, 2014)          0           batchnormalization_12[0][0]      
____________________________________________________________________________________________________
dense_12 (Dense)                 (None, 1000)          2015000     dropout_4[0][0]                  
====================================================================================================
Total params: 155254480
____________________________________________________________________________________________________

Since the dataset we are using is tiny, we can pass the whole dataset to the convenient fit method and specify the batch_size. Note that this already eats up a lot of memory and won't work for the full dataset.
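For the full dataset, one would instead train on index batches and load each batch's features on demand. A sketch of how a `get_batches_idx`-style helper (imported from utils above, actual implementation may differ) could generate those batches:

```python
import numpy as np

def batch_indices(n_samples, batch_size, shuffle=True, seed=0):
    """Sketch of a get_batches_idx-style helper: split sample indices
    into batches, optionally shuffled, for a manual training loop."""
    idx = np.arange(n_samples)
    if shuffle:
        np.random.RandomState(seed).shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, n_samples, batch_size)]

batches = batch_indices(10, batch_size=4)
print([len(b) for b in batches])  # [4, 4, 2]
```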


In [21]:
loss = model.fit(X_ques, X_im, y, nb_epoch=30, batch_size=1000)


Epoch 1/30
24000/24000 [==============================] - 548s - loss: 4.5991 - acc: 0.2435    
Epoch 2/30
24000/24000 [==============================] - 588s - loss: 3.4552 - acc: 0.3030    
Epoch 3/30
24000/24000 [==============================] - 576s - loss: 3.1088 - acc: 0.3350    
Epoch 4/30
24000/24000 [==============================] - 588s - loss: 2.8409 - acc: 0.3665    
Epoch 5/30
24000/24000 [==============================] - 602s - loss: 2.5720 - acc: 0.4008    
Epoch 6/30
24000/24000 [==============================] - 594s - loss: 2.3287 - acc: 0.4408    
Epoch 7/30
24000/24000 [==============================] - 555s - loss: 2.0898 - acc: 0.4841    
Epoch 8/30
24000/24000 [==============================] - 557s - loss: 1.8417 - acc: 0.5358    
Epoch 9/30
24000/24000 [==============================] - 564s - loss: 1.6218 - acc: 0.5834    
Epoch 10/30
24000/24000 [==============================] - 648s - loss: 1.4237 - acc: 0.6224    
Epoch 11/30
24000/24000 [==============================] - 621s - loss: 1.2406 - acc: 0.6636    
Epoch 12/30
24000/24000 [==============================] - 623s - loss: 1.0669 - acc: 0.6996    
Epoch 13/30
24000/24000 [==============================] - 633s - loss: 0.9554 - acc: 0.7297    
Epoch 14/30
24000/24000 [==============================] - 610s - loss: 0.8357 - acc: 0.7590    
Epoch 15/30
24000/24000 [==============================] - 662s - loss: 0.7391 - acc: 0.7840    
Epoch 16/30
24000/24000 [==============================] - 785s - loss: 0.6712 - acc: 0.7978    
Epoch 17/30
24000/24000 [==============================] - 680s - loss: 0.5996 - acc: 0.8158    
Epoch 18/30
24000/24000 [==============================] - 585s - loss: 0.5548 - acc: 0.8306    
Epoch 19/30
24000/24000 [==============================] - 585s - loss: 0.5015 - acc: 0.8412    
Epoch 20/30
24000/24000 [==============================] - 649s - loss: 0.4528 - acc: 0.8540    
Epoch 21/30
24000/24000 [==============================] - 612s - loss: 0.4146 - acc: 0.8652    
Epoch 22/30
24000/24000 [==============================] - 620s - loss: 0.3931 - acc: 0.8712    
Epoch 23/30
24000/24000 [==============================] - 616s - loss: 0.3630 - acc: 0.8832    
Epoch 24/30
24000/24000 [==============================] - 624s - loss: 0.3395 - acc: 0.8897    
Epoch 25/30
24000/24000 [==============================] - 636s - loss: 0.3170 - acc: 0.8960    
Epoch 26/30
24000/24000 [==============================] - 600s - loss: 0.2919 - acc: 0.9058    
Epoch 27/30
24000/24000 [==============================] - 594s - loss: 0.2733 - acc: 0.9088    
Epoch 28/30
24000/24000 [==============================] - 561s - loss: 0.2601 - acc: 0.9155    
Epoch 29/30
24000/24000 [==============================] - 553s - loss: 0.2377 - acc: 0.9246    
Epoch 30/30
24000/24000 [==============================] - 544s - loss: 0.2243 - acc: 0.9283    

In [22]:
plt.plot(loss.history['loss'], label='train_loss')
plt.plot(loss.history['acc'], label='train_acc')
plt.legend(loc='best')


Out[22]:
<matplotlib.legend.Legend at 0x223dc7510>

Let's see how far we can get with this overfitted model...


In [23]:
h5_img_file_test_tiny = h5py.File('data/vqa_data_img_vgg_test_small.h5', 'r')
fv_im_test_tiny = h5_img_file_test_tiny.get('/images_test')

with open('data/qa_data_test_small.pkl', 'rb') as fp:
    qa_data_test_tiny = pickle.load(fp)
    
print "Loading tiny dataset of %d image features and %d question/answer pairs for testing" % (len(fv_im_test_tiny), len(qa_data_test_tiny))


Loading tiny dataset of 4000 image features and 12000 question/answer pairs for testing

In [24]:
questions, ques_len, im_ix, ans = zip(*qa_data_test_tiny)

X_ques_test = prepare_ques_batch(questions, ques_len, max_ques_len, embeddings, word_dim, ix_to_word)
X_im_test = prepare_im_batch(fv_im_test_tiny, im_ix)
y_test = np.zeros((len(ans), nb_classes))
# answers outside the 1000 answer classes fall back to class 494
y_test[np.arange(len(ans)), [494 if a > 1000 else a for a in ans]] = 1
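The list comprehension above maps test answers that fall outside the 1000 training answer classes to a single fallback class (494). A small self-contained sketch of that one-hot construction (here guarding with `>= nb_classes` to keep every index in range; the fallback value is taken from the cell above):

```python
import numpy as np

nb_classes = 1000
fallback_class = 494  # assumption: class used for out-of-vocabulary answers

def one_hot_answers(ans, nb_classes, fallback_class):
    """Sketch of the one-hot encoding above: answer indices outside the
    known classes are mapped to a fallback class before encoding."""
    clipped = [fallback_class if a >= nb_classes else a for a in ans]
    y = np.zeros((len(ans), nb_classes))
    y[np.arange(len(ans)), clipped] = 1
    return y

y_demo = one_hot_answers([3, 1500, 999], nb_classes, fallback_class)
print(np.argmax(y_demo, axis=1))  # [  3 494 999]
```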

In [25]:
loss, acc = model.evaluate(X_ques_test, X_im_test, y_test)


12000/12000 [==============================] - 106s   

In [26]:
print loss, acc


4.25542008877 0.290333333425
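To turn the model's class probabilities into readable answers, one would take the argmax over classes and look the index up in `ix_to_ans`. A toy sketch (the probability array and dictionary below are illustrative; depending on the prepro script, the real `ix_to_ans` keys may be 1-based strings):

```python
import numpy as np

# Toy stand-in for ix_to_ans from vqa_data_prepro.json
ix_to_ans = {'0': 'yes', '1': 'no', '2': '2', '3': 'white'}

# Toy class probabilities, e.g. what a predict call might return
probs = np.array([[0.1, 0.2, 0.1, 0.6],
                  [0.7, 0.1, 0.1, 0.1]])

pred_ix = probs.argmax(axis=1)
answers = [ix_to_ans[str(ix)] for ix in pred_ix]
print(answers)  # ['white', 'yes']
```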
