Deep Learning Course HSE 2016 fall:

Image Captioning

In this seminar you'll be going through the image captioning pipeline.

To begin with, let us download the dataset of image features from a pre-trained GoogleNet.

In [ ]:
### Or alternatively
### !wget -O data.tar.gz
### !tar -xvzf data.tar.gz

DATA_DIR = '/wrk/trng103/'

Data preprocessing

In [ ]:
# Read Dataset
import numpy as np
import os.path as osp
import pickle

img_codes = np.load(osp.join(DATA_DIR, "data/image_codes.npy"))
captions = pickle.load(open(osp.join(DATA_DIR, 'data/caption_tokens.pcl'), 'rb'))

In [ ]:
print ("each image code is a 1000-unit vector:", img_codes.shape)
print (img_codes[0,:10])
print ('\n\n')
print ("for each image there are 5-7 descriptions, e.g.:\n")
print ('\n'.join(captions[0]))

In [ ]:
#split descriptions into tokens
for img_i in range(len(captions)):
    for caption_i in range(len(captions[img_i])):
        sentence = captions[img_i][caption_i] 
        captions[img_i][caption_i] = ["#START#"]+sentence.split(' ')+["#END#"]

In [ ]:
# Build a Vocabulary
from collections import Counter
word_counts = Counter()

<Compute word frequencies for each word in captions. See code above for data structure>
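One possible way to fill in the counting step, shown on a toy stand-in for `captions` (the names `toy_captions` and `toy_word_counts` are made up for this sketch). It assumes the nested layout produced by the tokenization cell above: a list of images, each holding a list of token lists.

```python
from collections import Counter

# Toy stand-in for `captions`: a list of images, each holding a list of
# tokenized captions (the real data has 5-7 captions per image).
toy_captions = [
    [["#START#", "a", "dog", "#END#"], ["#START#", "a", "brown", "dog", "#END#"]],
    [["#START#", "a", "cat", "#END#"]],
]

toy_word_counts = Counter()
for image_captions in toy_captions:
    for tokens in image_captions:
        # Counter.update adds one count per occurrence of each token
        toy_word_counts.update(tokens)

print(toy_word_counts.most_common(3))
```

The same double loop over `captions` with `word_counts.update(...)` should work on the real data, provided the captions have already been split into tokens.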

In [ ]:
vocab  = ['#UNK#', '#START#', '#END#']
vocab += [k for k, v in word_counts.items() if v >= 5]
n_tokens = len(vocab)

assert 10000 <= n_tokens <= 10500

word_to_index = {w: i for i, w in enumerate(vocab)}

In [ ]:
PAD_ix = -1
UNK_ix = vocab.index('#UNK#')

#good old as_matrix for the third time
def as_matrix(sequences,max_len=None):
    max_len = max_len or max(map(len,sequences))
    matrix = np.zeros((len(sequences),max_len),dtype='int32')+PAD_ix
    for i,seq in enumerate(sequences):
        row_ix = [word_to_index.get(word,UNK_ix) for word in seq[:max_len]]
        matrix[i,:len(row_ix)] = row_ix
    return matrix

In [ ]:
#try it out on several descriptions of a random image
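A self-contained sketch of what trying it out might look like, using a made-up five-word vocabulary (`toy_vocab`, `toy_as_matrix` and the other `toy_` names are illustrative only, not part of the assignment). It shows how shorter captions are padded with -1, how unknown words map to `#UNK#`, and how a mask can be recovered from the padding.

```python
import numpy as np

# Hypothetical toy vocabulary; the real one is built from word_counts.
toy_vocab = ["#UNK#", "#START#", "#END#", "a", "dog"]
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}
TOY_PAD_ix = -1

def toy_as_matrix(sequences, max_len=None):
    """Same logic as as_matrix, but bound to the toy vocabulary."""
    max_len = max_len or max(map(len, sequences))
    matrix = np.full((len(sequences), max_len), TOY_PAD_ix, dtype="int32")
    for i, seq in enumerate(sequences):
        row_ix = [toy_word_to_index.get(w, 0) for w in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    return matrix

toy_batch = [
    ["#START#", "a", "dog", "#END#"],
    ["#START#", "a", "zebra", "dog", "#END#"],  # "zebra" is out of vocabulary
]
toy_matrix = toy_as_matrix(toy_batch)
toy_mask = toy_matrix != TOY_PAD_ix  # True wherever there is a real token

print(toy_matrix)
print(toy_mask.sum(axis=1))  # number of real tokens per caption
```

On the real data the equivalent check would be `as_matrix(captions[i])` for some image index `i`.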

Mah Neural Network

In [ ]:
# network shapes. 
CNN_FEATURE_SIZE = img_codes.shape[1]
EMBED_SIZE = 128 #pls change me if u want
LSTM_UNITS = 200 #pls change me if u want

In [ ]:
import theano
import theano.tensor as T

In [ ]:
# Input Variable
sentences = T.imatrix()# [batch_size x time] of word ids
image_vectors = T.matrix() # [batch size x unit] of CNN image features
sentence_mask = T.neq(sentences,PAD_ix)

In [ ]:
import lasagne
from lasagne.layers import *

In [ ]:
#network inputs
l_words = InputLayer((None,None),sentences )
l_mask = InputLayer((None,None),sentence_mask )

#embeddings for words 
l_word_embeddings = <apply word embedding. use EMBED_SIZE>

#kudos for using some pre-trained embedding :)

In [ ]:
# input layer for image features
l_image_features = InputLayer((None,CNN_FEATURE_SIZE),image_vectors )

#convert 1000 image features from googlenet to whatever LSTM_UNITS you have set
#it's also a good idea to add some dropout here and there
l_image_features_small = <convert l_image_features to a shape equal to rnn hidden state. Also play with dropout/noise>

assert l_image_features_small.output_shape == (None,LSTM_UNITS)

In [ ]:
# Concatenate image features and word embeddings in one sequence 
decoder = <a recurrent layer (gru/lstm) satisfying the checklist below>
#    * takes word embeddings as an input
#    * has LSTM_UNITS units in the final layer
#    * has cell_init (or hid_init for gru) set to converted image features
#    * mask_input = l_mask
#    * don't forget the grad clipping (~5-10)

#find out better recurrent architectures for bonus point

In [ ]:
# Decoding of rnn hidden states
from broadcast import BroadcastLayer,UnbroadcastLayer

#apply whatever comes next to each tick of each example in a batch. Equivalent to 2 reshapes
broadcast_decoder_ticks = BroadcastLayer(decoder,(0,1))
print("broadcasted decoder shape = ",broadcast_decoder_ticks.output_shape)
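The "equivalent to 2 reshapes" remark can be illustrated in plain NumPy. This is only a sketch of the idea, not the implementation of `BroadcastLayer`; the shapes and the weight matrix `W` are made up for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
batch, ticks, units, n_out = 2, 3, 4, 5

hidden = rng.randn(batch, ticks, units)   # stand-in for the decoder output
W = rng.randn(units, n_out)               # stand-in for a dense layer's weights

# "Broadcast": merge batch and time axes so a dense layer sees a 2D input
flat = hidden.reshape(batch * ticks, units)
flat_out = flat.dot(W)
# "Unbroadcast": split the merged axis back into (batch, tick)
out = flat_out.reshape(batch, ticks, n_out)

# Reference: apply the same dense layer to every tick separately
per_tick = np.stack([hidden[:, t].dot(W) for t in range(ticks)], axis=1)
print(out.shape, np.allclose(out, per_tick))
```

Flattening lets any layer that expects `[batch, features]` input be reused at every time step without writing an explicit loop over ticks.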

In [ ]:
#predict probabilities for next tokens
predicted_probabilities_each_tick = <predict probabilities for each tick, using broadcast_decoder_ticks as an input. No reshaping needed here.>
# maybe a more complicated architecture will work better?

In [ ]:
#un-broadcast back into (batch,tick,probabilities)
predicted_probabilities = UnbroadcastLayer(predicted_probabilities_each_tick,
                                           broadcast_decoder_ticks)

print("output shape = ",predicted_probabilities.output_shape)

#remove if you know what you're doing (e.g. 1d convolutions or fixed shape)
assert predicted_probabilities.output_shape == (None, None, 10373)

Some tricks

  • If you train a large network, it is usually a good idea to make a 2-stage prediction
    1. (large recurrent state) -> (bottleneck e.g. 256)
    2. (bottleneck) -> (vocabulary size)
    • this way you won't need to store/train (large_recurrent_state x vocabulary size) matrix
  • Also maybe use Hierarchical Softmax?
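A quick back-of-the-envelope check of the bottleneck trick. The recurrent state size of 1024 is a hypothetical value chosen for illustration; the bottleneck size and vocabulary size are taken from this notebook.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a single dense layer."""
    return n_in * n_out + n_out

state_size = 1024      # hypothetical large recurrent state
bottleneck = 256       # bottleneck size from the example above
vocab_size = 10373     # vocabulary size asserted earlier in this notebook

# One big layer: state -> vocabulary
direct = dense_params(state_size, vocab_size)
# Two smaller layers: state -> bottleneck -> vocabulary
two_stage = dense_params(state_size, bottleneck) + dense_params(bottleneck, vocab_size)

print("direct:", direct, "two-stage:", two_stage)
```

With the default LSTM_UNITS = 200 used here the state is already smaller than the bottleneck, so the trick only pays off once the recurrent state grows well beyond the bottleneck size.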

In [ ]:
next_word_probas = <get network output>

predictions_flat = next_word_probas[:,:-1].reshape((-1,n_tokens))
reference_answers = sentences[:,1:].reshape((-1,))

#write a symbolic loss function to train the NN with
loss = <compute elementwise loss function>
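A NumPy sketch of one common choice for this loss: categorical cross-entropy on the next word, averaged over non-padding positions. It is an assumption that this is the intended loss, and the `toy_` arrays below stand in for the symbolic variables; the symbolic version would be written with Theano operations instead.

```python
import numpy as np

PAD = -1
# Toy batch: 2 sentences of word ids, the second one padded with PAD
toy_sentences = np.array([[1, 3, 4, 2],
                          [1, 3, 2, PAD]])
n_toy_tokens = 5

# Stand-in for the network output: uniform probabilities, [batch, time, n_tokens]
toy_probas = np.full(toy_sentences.shape + (n_toy_tokens,), 1.0 / n_toy_tokens)

# The prediction at tick t is scored against the word at tick t+1
pred_flat = toy_probas[:, :-1].reshape(-1, n_toy_tokens)
target_flat = toy_sentences[:, 1:].reshape(-1)
mask_flat = target_flat != PAD

# Negative log-probability of each reference word; PAD targets are replaced
# by index 0 for the lookup and then zeroed out by the mask
safe_targets = np.where(mask_flat, target_flat, 0)
nll = -np.log(pred_flat[np.arange(len(safe_targets)), safe_targets])
toy_loss = (nll * mask_flat).sum() / mask_flat.sum()

print(toy_loss)  # for uniform predictions this equals log(n_toy_tokens)
```

Masking matters because `PAD_ix = -1` is not a valid class index, so padded positions must not contribute to the average.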

In [ ]:
#trainable NN weights
weights = get_all_params(predicted_probabilities,trainable=True)
updates = <parameter updates using your favorite algorithm>

In [ ]:
#compile functions for training and evaluation
#please note that your functions must accept image features as the FIRST param and sentences as the second one
train_step = <function that takes image features and input sentences, outputs loss and updates weights>
val_step   = <function that takes image features and input sentences and outputs loss>
#for val_step use deterministic=True if you have any dropout/noise


  • You first have to implement a batch generator
  • Then the network will get trained the usual way

In [ ]:
#images have different numbers of captions, so store them as an object array
captions = np.array(captions, dtype=object)

In [ ]:
from random import choice

def generate_batch(images,captions,batch_size,max_caption_len=None):
    #sample random numbers for image/caption indices
    random_image_ix = np.random.randint(0,len(images),size=batch_size)
    #get images
    batch_images = images[random_image_ix]
    #5-7 captions for each image
    captions_for_batch_images = captions[random_image_ix]
    #pick 1 from 5-7 captions for each image
    #(list is needed in Python 3, where map returns a one-shot iterator)
    batch_captions = list(map(choice,captions_for_batch_images))
    #convert to matrix
    batch_captions_ix = as_matrix(batch_captions,max_len=max_caption_len)
    return batch_images, batch_captions_ix

In [ ]:

Main loop

  • We recommend periodically evaluating the network using the next "apply trained model" block
    • it's safe to interrupt training, run a few examples and start training again

In [ ]:
batch_size=50 #adjust me
n_epochs=100 #adjust me
n_batches_per_epoch = 50 #adjust me
n_validation_batches = 5 #how many batches are used for validation after each epoch

In [ ]:
!pip3 install --user tqdm

In [ ]:
from tqdm import tqdm

for epoch in range(n_epochs):
    #reset the running losses at the start of every epoch
    train_loss = 0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch

    val_loss = 0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))
    val_loss /= n_validation_batches
    print('\nEpoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))

print("Finish :)")

Apply trained model

In [ ]:
!pip3 install --user scikit-image

In [ ]:
#the same kind you did last week, but a bit smaller
from pretrained_lenet import build_model,preprocess,MEAN_VALUES

# build googlenet
lenet = build_model()

#load weights
#lenet_weights = pickle.load(open(osp.join(DATA_DIR, 'data/blvc_googlenet.pkl')), encoding='latin1')['param values']
lenet_weights = np.load(osp.join(DATA_DIR, 'data/blvc_googlenet.npz'), encoding='latin1')['param values']
set_all_param_values(lenet["prob"], lenet_weights)

#compile get_features
cnn_input_var = lenet['input'].input_var
cnn_feature_layer = lenet['loss3/classifier']
get_cnn_features = theano.function([cnn_input_var], lasagne.layers.get_output(cnn_feature_layer))

In [ ]:
from matplotlib import pyplot as plt
%matplotlib inline

#sample image
img = plt.imread(osp.join(DATA_DIR, 'data/Dog-and-Cat.jpg'))
img = preprocess(img)

In [ ]:
#deprocess and show, one line :)
from pretrained_lenet import MEAN_VALUES
plt.imshow(np.transpose((img[0] + MEAN_VALUES)[::-1],[1,2,0]).astype('uint8'))

Generate caption

In [ ]:
last_word_probas = <get network-predicted probas at last tick>
#TRY OUT deterministic=True if you want more steady results

get_probs = theano.function([image_vectors,sentences], last_word_probas)

#this is exactly the generation function from week5 classwork,
#except now we condition on image features instead of words
def generate_caption(image,caption_prefix = ("#START#",),t=1,sample=True,max_len=100):
    image_features = get_cnn_features(image)
    caption = list(caption_prefix)
    for _ in range(max_len):
        next_word_probs = <obtain probabilities for next words>
        assert len(next_word_probs.shape) ==1 #must be one-dimensional
        #apply temperature
        next_word_probs = next_word_probs**t / np.sum(next_word_probs**t)

        if sample:
            next_word = np.random.choice(vocab,p=next_word_probs) 
        else:
            next_word = vocab[np.argmax(next_word_probs)]

        caption.append(next_word)

        #stop once the end-of-caption token is generated
        if next_word=="#END#":
            break

    return caption

In [ ]:
for i in range(10):
    print(' '.join(generate_caption(img,t=5.)[1:-1]))
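To see what the `t` argument does, here is the same reweighting formula used in `generate_caption`, applied to a made-up three-word distribution (`apply_temperature` and `toy_probs` are names invented for this sketch). Note that `t` acts as an exponent here, so larger values sharpen the distribution and values below 1 flatten it.

```python
import numpy as np

def apply_temperature(probs, t):
    """Raise probabilities to the power t and renormalize."""
    powered = probs ** t
    return powered / powered.sum()

toy_probs = np.array([0.5, 0.3, 0.2])
for t in (0.5, 1.0, 5.0):
    # t < 1 moves towards uniform, t = 1 leaves probs unchanged,
    # t > 1 concentrates mass on the most likely word
    print(t, apply_temperature(toy_probs, t).round(3))
```

With `t=5.` as in the loop above, sampling is close to greedy decoding, so the ten generated captions will tend to look alike.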


Find at least 10 images to test it on.

  • Seriously, that's part of an assignment. Go get at least 10 pictures to get captioned
  • Make sure it works okay on simple images before going to something more complex
  • Photos, not animation/3d/drawings, unless you want to train a CNN on anime
  • Mind the aspect ratio (see what preprocess does to your image)

In [ ]:
#apply your network on image sample you found

In [ ]:


  • base 5 if it compiles and trains without exploding
  • +1 for finding representative set of reference examples
  • +2 for providing 10+ examples where network provides reasonable captions (at least sometimes :) )
    • you may want to predict with sample=False and deterministic=True for consistent results
    • kudos for submitting network params that reproduce it
  • +2 for providing 10+ examples where network fails IF you also got previous 10 examples right
  • bonus points for experiments with architecture and initialization (see above)
  • bonus points for trying out other pre-trained nets for captioning
  • a whole lot of bonus points if you also train via metric learning
    • image -> vec
    • caption -> vec (encoder, not decoder)
    • loss = correct captions must be closer, wrong ones must be farther
    • prediction = choose caption that is closest to image
  • a freaking whole lot of points if you also obtain statistically significant results the other way round
    • take caption, get closest image
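The metric learning bonus can be sketched in NumPy as follows. This is one possible formulation (a margin-based ranking loss on cosine similarity), chosen here as an assumption; the assignment does not prescribe a specific loss. The function names, the margin value and the toy vectors are all hypothetical, and a trainable version would need to be written symbolically.

```python
import numpy as np

def l2_normalize(x):
    """Scale every row to unit length so dot products become cosines."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def ranking_loss(image_vecs, caption_vecs, margin=0.2):
    """Hinge loss over a batch where row i of both matrices is a matching pair."""
    sims = l2_normalize(image_vecs).dot(l2_normalize(caption_vecs).T)
    positive = np.diag(sims)[:, None]        # similarity of the correct pairs
    # a wrong caption is penalized if it is not at least `margin` farther away
    violations = np.maximum(0.0, margin + sims - positive)
    np.fill_diagonal(violations, 0.0)        # correct pairs are not penalized
    return violations.mean()

def closest_caption(image_vecs, caption_vecs):
    """Prediction: index of the most similar caption for every image."""
    sims = l2_normalize(image_vecs).dot(l2_normalize(caption_vecs).T)
    return sims.argmax(axis=1)

toy_image_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_caption_vecs = np.array([[0.9, 0.1], [0.2, 0.8]])

print(ranking_loss(toy_image_vecs, toy_caption_vecs))
print(closest_caption(toy_image_vecs, toy_caption_vecs))
```

Going "the other way round" (caption to closest image) amounts to taking `argmax` over axis 0 of the same similarity matrix.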

In [ ]: