Deep Learning course, HSE, fall 2016

Image Captioning

In this seminar you'll be going through the image captioning pipeline.

To begin with, let us download the dataset of image features extracted with a pre-trained GoogLeNet.


In [7]:
###
### Or alternatively
# !wget https://www.dropbox.com/s/3hj16b0fj6yw7cc/data.tar.gz?dl=1 -O data.tar.gz
# !tar -xvzf data.tar.gz
###

DATA_DIR = './'

Data preprocessing


In [8]:
%%time
# Read Dataset
import numpy as np
import os.path as osp
import pickle

img_codes = np.load(osp.join(DATA_DIR, "data/image_codes.npy"))
captions = pickle.load(open(osp.join(DATA_DIR, 'data/caption_tokens.pcl'), 'rb'))


CPU times: user 549 ms, sys: 317 ms, total: 866 ms
Wall time: 866 ms

In [9]:
print ("each image code is a 1000-unit vector:", img_codes.shape)
print (img_codes[0,:10])
print ('\n\n')
print ("for each image there are 5-7 descriptions, e.g.:\n")
print ('\n'.join(captions[0]))


each image code is a 1000-unit vector: (123287, 1000)
[ 1.38901556 -3.82951474 -1.94360816 -0.5317238  -0.03120959 -2.87483215
 -2.9554503   0.6960277  -0.68551242 -0.7855981 ]



for each image there are 5-7 descriptions, e.g.:

a man with a red helmet on a small moped on a dirt road
man riding a motor bike on a dirt road on the countryside
a man riding on the back of a motorcycle
a dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud wreathed mountains
a man in a red shirt and a red hat is on a motorcycle on a hill side

In [10]:
#split descriptions into tokens
for img_i in range(len(captions)):
    for caption_i in range(len(captions[img_i])):
        sentence = captions[img_i][caption_i] 
        captions[img_i][caption_i] = ["#START#"]+sentence.split(' ')+["#END#"]
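
A quick sanity check (assuming the cell above has run): every caption is now a list of tokens wrapped in #START# / #END#.


In [ ]:
# the first caption of the first image should start with #START# and end with #END#
print(captions[0][0][:3], '...', captions[0][0][-1])
assert captions[0][0][0] == "#START#" and captions[0][0][-1] == "#END#"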

In [12]:
# Build a Vocabulary
from collections import Counter
word_counts = Counter()

for row in captions:
    for caption in row:
        word_counts.update(caption)

In [14]:
vocab  = ['#UNK#', '#START#', '#END#']
vocab += [k for k, v in word_counts.items() if v >= 5]
n_tokens = len(vocab)

assert 10000 <= n_tokens <= 10500

word_to_index = {w: i for i, w in enumerate(vocab)}
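
A quick spot-check of the mapping (a sketch): frequent words get their own index, while anything rare or unseen should fall back to #UNK#, exactly as as_matrix does below.


In [ ]:
# frequent words have their own index; out-of-vocabulary strings fall back to #UNK#
print(word_to_index['a'])
print(word_to_index.get('qwertyuiop', word_to_index['#UNK#']))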

In [15]:
PAD_ix = -1
UNK_ix = vocab.index('#UNK#')

#good old as_matrix for the third time
def as_matrix(sequences,max_len=None):
    max_len = max_len or max(map(len,sequences))
    
    matrix = np.zeros((len(sequences),max_len),dtype='int32')+PAD_ix
    for i,seq in enumerate(sequences):
        row_ix = [word_to_index.get(word,UNK_ix) for word in seq[:max_len]]
        matrix[i,:len(row_ix)] = row_ix
    
    return matrix

In [16]:
#try it out on several descriptions of a random image
as_matrix(captions[1337])


Out[16]:
array([[ 1361,  8055, 10041,  6998,  7821,  7304,  2954,  9035,  6730,
         6614,  8332,  9930,  3223,    -1,    -1],
       [ 1361,  6730,  6614,  3903,  2726,  3570,  9930,  8332,  9970,
           11,  2954,  3223,    -1,    -1,    -1],
       [ 1361,  1784,  6998,  7821,  7304,  2954,  9035,  6730,  8297,
         9705,  4190,  9339,   908,  7598,  3223],
       [ 1361,  1784,  9721,  3941,  5763,  7374,  1784,  9721,  9759,
         3223,    -1,    -1,    -1,    -1,    -1],
       [ 1361,  8055, 10041,  6998,  7821,  7304,  2954,  9035,  6730,
         6614,  4190,  9339,   503,  3223,    -1]], dtype=int32)

Mah Neural Network


In [17]:
# network shapes. 
CNN_FEATURE_SIZE = img_codes.shape[1]
EMBED_SIZE = 128 #pls change me if u want
LSTM_UNITS = 200 #pls change me if u want

In [18]:
import theano
import theano.tensor as T


WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K40m (CNMeM is disabled, cuDNN 5105)

In [19]:
# Input Variable
sentences = T.imatrix()# [batch_size x time] of word ids
image_vectors = T.matrix() # [batch size x unit] of CNN image features
sentence_mask = T.neq(sentences,PAD_ix)

In [20]:
import lasagne
from lasagne.layers import *

In [23]:
#network inputs
l_words = InputLayer((None,None),sentences )
l_mask = InputLayer((None,None),sentence_mask )

#embeddings for words 
l_word_embeddings = EmbeddingLayer(l_words, n_tokens, EMBED_SIZE)

#kudos for using a pre-trained embedding :)

In [24]:
# input layer for image features
l_image_features = InputLayer((None,CNN_FEATURE_SIZE),image_vectors )

#convert 1000 image features from googlenet to whatever LSTM_UNITS you have set
#it's also a good idea to add some dropout here and there
l_image_features_small = DenseLayer(l_image_features, LSTM_UNITS)

assert l_image_features_small.output_shape == (None,LSTM_UNITS)
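
If you want the dropout mentioned in the comment above, one possible variant is sketched below (alternative layer names; it is not wired into the network that follows):


In [ ]:
# optional sketch: pass image features through dropout before projecting them to LSTM_UNITS
l_image_features_dropped = DropoutLayer(l_image_features, p=0.5)
l_image_features_small_alt = DenseLayer(l_image_features_dropped, LSTM_UNITS,
                                        nonlinearity=lasagne.nonlinearities.tanh)
assert l_image_features_small_alt.output_shape == (None, LSTM_UNITS)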

In [25]:
# Decoder LSTM: reads word embeddings, with its cell state initialized from the converted image features
decoder = LSTMLayer(l_word_embeddings,
                    LSTM_UNITS,
                    cell_init=l_image_features_small,
                    grad_clipping=10,
                    mask_input=l_mask)
#    * takes word embeddings as an input
#    * has LSTM_UNITS units in the final layer
#    * has cell_init (or hid init for gru) set to converted image features
#    * mask_input = input_mask
#    * don't forget the grad clipping (~5-10)

#find out better recurrent architectures for bonus point
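
One alternative recurrent decoder for the bonus task, sketched with a GRU that is conditioned on the image through hid_init (alternative name, not used below):


In [ ]:
# bonus sketch: a GRU decoder whose initial hidden state comes from the image features
decoder_gru = GRULayer(l_word_embeddings,
                       LSTM_UNITS,
                       hid_init=l_image_features_small,
                       grad_clipping=10,
                       mask_input=l_mask)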

In [28]:
# Decoding of RNN hidden states into token probabilities
from broadcast import BroadcastLayer,UnbroadcastLayer

#apply whatever comes next to each tick of each example in a batch. Equivalent to 2 reshapes
broadcast_decoder_ticks = BroadcastLayer(decoder,(0,1))
print ("broadcasted decoder shape = ",broadcast_decoder_ticks.output_shape)


broadcasted decoder shape =  (None, 200)
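
Roughly the same thing with a plain reshape, just to illustrate the "equivalent to 2 reshapes" comment (a sketch, not used below):


In [ ]:
# first reshape: (batch, time, units) -> (batch*time, units);
# the second reshape (back to 3D) is what UnbroadcastLayer does after the dense layer
flat_decoder = ReshapeLayer(decoder, (-1, LSTM_UNITS))
print(flat_decoder.output_shape)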

In [30]:
#predict probabilities for next tokens
predicted_probabilities_each_tick = DenseLayer(broadcast_decoder_ticks,
                                               n_tokens,
                                               nonlinearity=lasagne.nonlinearities.softmax)
# maybe a more complicated architecture will work better?

In [31]:
#un-broadcast back into (batch,tick,probabilities)
predicted_probabilities = UnbroadcastLayer(predicted_probabilities_each_tick,
                                           broadcast_layer=broadcast_decoder_ticks)

print ("output shape = ",predicted_probabilities.output_shape)

#remove if you know what you're doing (e.g. 1d convolutions or fixed shape)
assert predicted_probabilities.output_shape == (None, None, 10373)


output shape =  (None, None, 10373)

Some tricks

  • If you train a large network, it is usually a good idea to make a 2-stage prediction (see the sketch after this list)
    1. (large recurrent state) -> (bottleneck, e.g. 256)
    2. (bottleneck) -> (vocabulary size)
    • this way you won't need to store/train a (large_recurrent_state x vocabulary_size) matrix
  • Also, maybe use Hierarchical Softmax?
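
A minimal sketch of the 2-stage (bottleneck) output head, reusing the layer names defined above; it uses alternative variable names and is not wired into the network trained below:


In [ ]:
# 2-stage prediction sketch (unused below): recurrent state -> small bottleneck -> vocabulary
bottleneck = DenseLayer(broadcast_decoder_ticks, 256,
                        nonlinearity=lasagne.nonlinearities.tanh)
predicted_probabilities_each_tick_alt = DenseLayer(bottleneck, n_tokens,
                                                   nonlinearity=lasagne.nonlinearities.softmax)
predicted_probabilities_alt = UnbroadcastLayer(predicted_probabilities_each_tick_alt,
                                               broadcast_layer=broadcast_decoder_ticks)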

In [33]:
next_word_probas = get_output(predicted_probabilities)


predictions_flat = next_word_probas[:,:-1].reshape((-1,n_tokens))
reference_answers = sentences[:,1:].reshape((-1,))

#write symbolic loss function to train NN for
loss = lasagne.objectives.categorical_crossentropy(predictions_flat, reference_answers)

#simple mean over all positions; the commented-out line below averages over non-PAD tokens only
output_mask = sentence_mask[:,1:]
# loss = (loss.reshape(output_mask.shape)*output_mask).sum() / output_mask.sum()
loss = loss.mean()

In [34]:
#trainable NN weights
weights = get_all_params(predicted_probabilities,trainable=True)
updates = lasagne.updates.adam(loss, weights)

In [44]:
#compile functions for training and evaluation
#please note that your functions must accept image features as the FIRST param and sentences as the second one
train_step = theano.function([image_vectors, sentences], [loss], updates=updates)
val_step   = theano.function([image_vectors, sentences], [loss])
#for val_step use deterministic=True if you have any dropout/noise
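
If you do add dropout or noise, a deterministic validation variant might look like this (a sketch with alternative names; without dropout it gives the same values as val_step):


In [ ]:
# deterministic forward pass: dropout/noise layers are switched off at validation time
next_word_probas_det = get_output(predicted_probabilities, deterministic=True)
predictions_flat_det = next_word_probas_det[:, :-1].reshape((-1, n_tokens))
val_loss_det = lasagne.objectives.categorical_crossentropy(predictions_flat_det,
                                                           reference_answers).mean()
val_step_det = theano.function([image_vectors, sentences], [val_loss_det])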

Training

  • You first have to implement a batch generator
  • Then the network will be trained the usual way

In [45]:
captions = np.array(captions)

In [46]:
from random import choice

def generate_batch(images,captions,batch_size,max_caption_len=None):
    
    #sample random numbers for image/caption indices
    random_image_ix = np.random.randint(0,len(images),size=batch_size)
    
    #get images
    batch_images = images[random_image_ix]
    
    #5-7 captions for each image
    captions_for_batch_images = captions[random_image_ix]
    
    #pick 1 from 5-7 captions for each image
    batch_captions = list(map(choice,captions_for_batch_images))
    
    #convert to matrix
    batch_captions_ix = as_matrix(batch_captions,max_len=max_caption_len)
    
    return batch_images, batch_captions_ix

In [47]:
generate_batch(img_codes,captions,3)


Out[47]:
(array([[-2.0480957 ,  2.38128543,  1.66498172, ..., -2.92565465,
          0.95502877,  3.54487467],
        [-0.79817939, -1.75958812,  1.04067791, ..., -1.2281096 ,
         -0.78799808, -0.29764307],
        [ 1.14972579, -5.34672832, -0.62096763, ..., -1.72912693,
          0.03900388, -3.06219769]], dtype=float32),
 array([[1361, 1784, 9240, 7219, 8623, 2472, 8332, 8933, 5099, 3223,   -1,
           -1],
        [1361, 9970, 1902, 7664, 5516, 3306, 2402, 8332, 1784, 4947, 7273,
         3223],
        [1361, 1784, 5943, 9035, 8545, 8332, 1784, 7346, 5422, 9700, 9191,
         3223]], dtype=int32))

Main loop

  • We recommend periodically evaluating the network using the "apply trained model" block below
    • it's safe to interrupt training, run a few examples and then resume training

In [48]:
batch_size=50 #adjust me
n_epochs=100 #adjust me
n_batches_per_epoch = 50 #adjust me
n_validation_batches = 5 #how many batches are used for validation after each epoch

In [52]:
from IPython.display import clear_output

In [ ]:
from tqdm import tqdm

for epoch in range(n_epochs):
    train_loss=0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))[0]
    train_loss /= n_batches_per_epoch
    
    val_loss=0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))[0]
    val_loss /= n_validation_batches
    clear_output(True)
    print('\nEpoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))


print("Finish :)")


  2%|▏         | 1/50 [00:00<00:05,  9.24it/s]
Epoch: 63, train loss: 1.714716968536377, val loss: 1.8070559740066527
 92%|█████████▏| 46/50 [00:05<00:00,  8.75it/s]

apply trained model


In [ ]:
#the same kind of pre-trained network you used last week, but a bit smaller
from pretrained_lenet import build_model,preprocess,MEAN_VALUES

# build googlenet
lenet = build_model()

#load weights
#lenet_weights = pickle.load(open(osp.join(DATA_DIR, 'data/blvc_googlenet.pkl')), encoding='latin1')['param values']
lenet_weights = np.load(osp.join(DATA_DIR, 'data/blvc_googlenet.npz'), encoding='latin1')['param values']
set_all_param_values(lenet["prob"], lenet_weights)

#compile get_features
cnn_input_var = lenet['input'].input_var
cnn_feature_layer = lenet['loss3/classifier']
get_cnn_features = theano.function([cnn_input_var], lasagne.layers.get_output(cnn_feature_layer))

In [ ]:
from matplotlib import pyplot as plt
%matplotlib inline

#sample image
img = plt.imread(osp.join(DATA_DIR, 'data/Dog-and-Cat.jpg'))
img = preprocess(img)

In [ ]:
#deprocess and show, one line :)
from pretrained_lenet import MEAN_VALUES
plt.imshow(np.transpose((img[0] + MEAN_VALUES)[::-1],[1,2,0]).astype('uint8'))

Generate caption


In [ ]:
#network-predicted probabilities at the last tick (one possible way to fill in the template blank)
#TRY OUT deterministic=True if you want more steady results
last_word_probas = get_output(predicted_probabilities)[:,-1]

get_probs = theano.function([image_vectors,sentences], last_word_probas)

#this is exactly the generation function from week5 classwork,
#except now we condition on image features instead of words
def generate_caption(image,caption_prefix = ("#START#",),t=1,sample=True,max_len=100):
    image_features = get_cnn_features(image)
    caption = list(caption_prefix)
    for _ in range(max_len):
        
        #one possible way to fill in the template blank: feed image features and the caption so far
        next_word_probs = get_probs(image_features, as_matrix([caption]))[0]
        assert len(next_word_probs.shape) ==1 #must be one-dimensional
        #apply temperature
        next_word_probs = next_word_probs**t / np.sum(next_word_probs**t)

        if sample:
            next_word = np.random.choice(vocab,p=next_word_probs) 
        else:
            next_word = vocab[np.argmax(next_word_probs)]

        caption.append(next_word)

        if next_word=="#END#":
            break
            
    return caption

In [ ]:
for i in range(10):
    print(' '.join(generate_caption(img, t=5.)[1:-1]))

Demo

Find at least 10 images to test it on.

  • Seriously, that's part of the assignment. Go get at least 10 pictures to caption
  • Make sure it works okay on simple images before moving on to something more complex
  • Photos, not animation/3d/drawings, unless you want to train a CNN on anime
  • Mind the aspect ratio (see what preprocess does to your image)

In [ ]:
#apply your network on image sample you found
#
#
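
One possible way to fill in the cell above ('my_test_image.jpg' is a hypothetical filename; use one of the photos you found):


In [ ]:
# caption one of your own photos: read it, preprocess to the GoogLeNet input format, then sample a few captions
my_img = preprocess(plt.imread('my_test_image.jpg'))
for _ in range(5):
    print(' '.join(generate_caption(my_img, t=5.)[1:-1]))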

In [ ]:

grading

  • base 5 if it compiles and trains without exploding
  • +1 for finding a representative set of reference examples
  • +2 for providing 10+ examples where the network produces reasonable captions (at least sometimes :) )
    • you may want to predict with sample=False and deterministic=True for consistent results
    • kudos for submitting network params that reproduce it
  • +2 for providing 10+ examples where the network fails IF you also got the previous 10 examples right
  • bonus points for experiments with architecture and initialization (see above)
  • bonus points for trying out other pre-trained nets for captioning
  • a whole lot of bonus points if you also train via metric learning
    • image -> vec
    • caption -> vec (encoder, not decoder)
    • loss = correct captions must be closer, wrong ones must be farther
    • prediction = choose caption that is closest to image
  • a freaking whole lot of points if you also obtain statistically significant results the other way round
    • take caption, get closest image
