Image Captioning

In this seminar you'll be going through the image captioning pipeline.

To begin with, let us download the dataset of image features from a pre-trained GoogLeNet.



In [ ]:

    
###
### Or alternatively
### !wget https://www.dropbox.com/s/3hj16b0fj6yw7cc/data.tar.gz?dl=1 -O data.tar.gz
### !tar -xvzf data.tar.gz
###

DATA_DIR = '/wrk/trng103/'

Data preprocessing



In [ ]:

    
%%time
# Read Dataset
import numpy as np
import os.path as osp
import pickle

img_codes = np.load(osp.join(DATA_DIR, "data/image_codes.npy"))
with open(osp.join(DATA_DIR, 'data/caption_tokens.pcl'), 'rb') as f:
    captions = pickle.load(f)



In [ ]:

    
print("each image code is a 1000-unit vector:", img_codes.shape)
print(img_codes[0,:10])
print('\n\n')
print("for each image there are 5-7 descriptions, e.g.:\n")
print('\n'.join(captions[0]))



In [ ]:

    
#split descriptions into tokens
for img_i in range(len(captions)):
    for caption_i in range(len(captions[img_i])):
        sentence = captions[img_i][caption_i] 
        captions[img_i][caption_i] = ["#START#"]+sentence.split(' ')+["#END#"]



In [ ]:

    
# Build a Vocabulary
from collections import Counter
word_counts = Counter()

<Compute word frequencies for each word in captions. See code above for data structure>
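
One possible way to approach the placeholder above, shown on made-up data. `toy_captions` and `toy_word_counts` are hypothetical names; the toy list only mirrors the nesting of `captions` after tokenization (images -> captions -> tokens). A minimal sketch, not the reference solution:

```python
from collections import Counter

# Hypothetical toy data with the same nesting as `captions`:
# a list of images, each holding a list of tokenized captions.
toy_captions = [
    [["#START#", "a", "dog", "runs", "#END#"],
     ["#START#", "a", "dog", "on", "grass", "#END#"]],
    [["#START#", "a", "cat", "sleeps", "#END#"]],
]

toy_word_counts = Counter()
for image_captions in toy_captions:
    for tokens in image_captions:
        # Counter.update adds one count per occurrence of each token
        toy_word_counts.update(tokens)

print(toy_word_counts.most_common(3))
```

The same double loop should carry over to the real `captions` and `word_counts`, since the tokenization cell above produces exactly this nesting.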



In [ ]:

    
vocab  = ['#UNK#', '#START#', '#END#']
vocab += [k for k, v in word_counts.items() if v >= 5]
n_tokens = len(vocab)

assert 10000 <= n_tokens <= 10500

word_to_index = {w: i for i, w in enumerate(vocab)}



In [ ]:

    
PAD_ix = -1
UNK_ix = vocab.index('#UNK#')

#good old as_matrix for the third time
def as_matrix(sequences,max_len=None):
    max_len = max_len or max(map(len,sequences))
    
    matrix = np.zeros((len(sequences),max_len),dtype='int32')+PAD_ix
    for i,seq in enumerate(sequences):
        row_ix = [word_to_index.get(word,UNK_ix) for word in seq[:max_len]]
        matrix[i,:len(row_ix)] = row_ix
    
    return matrix



In [ ]:

    
#try it out on several descriptions of a random image
as_matrix(captions[1337])
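
To see what the padding does, here is a small self-contained sketch. The vocabulary, the `toy_` names and the sentences are invented for illustration; `toy_as_matrix` repeats the logic of `as_matrix` but is bound to the toy vocabulary. The mask is the NumPy counterpart of the comparison the notebook later expresses with `T.neq(sentences, PAD_ix)`.

```python
import numpy as np

# Hypothetical toy vocabulary; the real one is built from word_counts.
toy_vocab = ["#UNK#", "#START#", "#END#", "a", "dog"]
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}
TOY_PAD_ix = -1
TOY_UNK_ix = toy_word_to_index["#UNK#"]

def toy_as_matrix(sequences, max_len=None):
    """Same logic as as_matrix, but bound to the toy vocabulary."""
    max_len = max_len or max(map(len, sequences))
    matrix = np.full((len(sequences), max_len), TOY_PAD_ix, dtype="int32")
    for i, seq in enumerate(sequences):
        row_ix = [toy_word_to_index.get(word, TOY_UNK_ix) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    return matrix

toy_batch = [
    ["#START#", "a", "dog", "#END#"],
    ["#START#", "a", "zebra", "barks", "loudly", "#END#"],  # 3 unknown words
]
toy_matrix = toy_as_matrix(toy_batch)
toy_mask = toy_matrix != TOY_PAD_ix  # True for real tokens, False for padding
print(toy_matrix)
print(toy_mask.sum(axis=1))  # number of real tokens per row
```

Shorter rows are filled with the padding index on the right, and out-of-vocabulary words fall back to the `#UNK#` index.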

Mah Neural Network



In [ ]:

    
# network shapes. 
CNN_FEATURE_SIZE = img_codes.shape[1]
EMBED_SIZE = 128 #pls change me if u want
LSTM_UNITS = 200 #pls change me if u want



In [ ]:

    
import theano
import theano.tensor as T



In [ ]:

    
# Input Variable
sentences = T.imatrix()# [batch_size x time] of word ids
image_vectors = T.matrix() # [batch size x unit] of CNN image features
sentence_mask = T.neq(sentences,PAD_ix)



In [ ]:

    
import lasagne
from lasagne.layers import *



In [ ]:

    
#network inputs
l_words = InputLayer((None,None),sentences )
l_mask = InputLayer((None,None),sentence_mask )

#embeddings for words 
l_word_embeddings = <apply word embedding. use EMBED_SIZE>

#kudos for using some pre-trained embedding :)



In [ ]:

    
# input layer for image features
l_image_features = InputLayer((None,CNN_FEATURE_SIZE),image_vectors )

#convert 1000 image features from googlenet to whatever LSTM_UNITS you have set
#it's also a good idea to add some dropout here and there
l_image_features_small = <convert l_image_features to a shape equal to rnn hidden state. Also play with dropout/noise>

assert l_image_features_small.output_shape == (None,LSTM_UNITS)



In [ ]:

    
# Combine image features and word embeddings in one recurrent decoder
decoder = <a recurrent layer (gru/lstm) that satisfies the checklist below>
#    * takes word embeddings as an input
#    * has LSTM_UNITS units in the final layer
#    * has cell_init (or hid_init for gru) set to converted image features
#    * mask_input = l_mask
#    * don't forget the grad clipping (~5-10)

#find out better recurrent architectures for bonus point



In [ ]:

    
# Decoding of rnn hidden states
from broadcast import BroadcastLayer,UnbroadcastLayer

#apply whatever comes next to each tick of each example in a batch. Equivalent to 2 reshapes
broadcast_decoder_ticks = BroadcastLayer(decoder,(0,1))
print("broadcasted decoder shape = ", broadcast_decoder_ticks.output_shape)
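
The "equivalent to 2 reshapes" remark can be illustrated in plain NumPy. This is a sketch under the assumption that `BroadcastLayer`/`UnbroadcastLayer` behave as the comment describes; all sizes and arrays below are invented stand-ins.

```python
import numpy as np

batch_size, n_ticks, n_units, n_out = 2, 3, 4, 5  # toy sizes, chosen arbitrarily
rng = np.random.RandomState(0)
hidden = rng.randn(batch_size, n_ticks, n_units)  # stand-in for the decoder output
W = rng.randn(n_units, n_out)                     # stand-in for a dense layer's weights

# reshape 1: merge batch and time axes so every tick becomes a separate row
flat = hidden.reshape(batch_size * n_ticks, n_units)
flat_out = flat.dot(W)  # the per-tick computation sees an ordinary 2D matrix

# reshape 2: restore the [batch, time, ...] layout
out = flat_out.reshape(batch_size, n_ticks, n_out)
print(out.shape)
```

Each `out[b, t]` equals `hidden[b, t].dot(W)`, so the layer in between never needs to know about the time axis.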



In [ ]:

    
#predict probabilities for next tokens
predicted_probabilities_each_tick = <predict probabilities for each tick, using broadcast_decoder_ticks as an input. No reshaping needed here.>
# maybe a more complicated architecture will work better?



In [ ]:

    
#un-broadcast back into (batch,tick,probabilities)
predicted_probabilities = UnbroadcastLayer(predicted_probabilities_each_tick,
                                           broadcast_layer=broadcast_decoder_ticks)

print("output shape = ", predicted_probabilities.output_shape)

#remove if you know what you're doing (e.g. 1d convolutions or fixed shape)
assert predicted_probabilities.output_shape == (None, None, n_tokens)

Some tricks

If you train a large network, it is usually a good idea to make a 2-stage prediction:
1. (large recurrent state) -> (bottleneck, e.g. 256)
2. (bottleneck) -> (vocabulary size)
- this way you won't need to store/train a (large_recurrent_state x vocabulary size) matrix
Also, maybe use Hierarchical Softmax?
- https://gist.github.com/justheuristic/581853c6d6b87eae9669297c2fb1052d
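
A quick back-of-the-envelope count of what the bottleneck saves. The recurrent state size of 1024 is a made-up example, the vocabulary size is just a value inside the 10000-10500 range asserted for `n_tokens`, and biases are ignored.

```python
large_state = 1024   # hypothetical recurrent state size
bottleneck = 256     # bottleneck size from the example above
vocab_size = 10373   # a value inside the range asserted for n_tokens

# one big matrix: state -> vocabulary
direct_weights = large_state * vocab_size
# two smaller matrices: state -> bottleneck -> vocabulary
two_stage_weights = large_state * bottleneck + bottleneck * vocab_size

print(direct_weights, two_stage_weights)
```

With these numbers the two-stage variant needs well under a third of the weights, at the cost of squeezing every prediction through a 256-unit representation.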



In [ ]:

    
next_word_probas = <get network output>


predictions_flat = next_word_probas[:,:-1].reshape((-1,n_tokens))
reference_answers = sentences[:,1:].reshape((-1,))

#write a symbolic loss function to train the NN with
loss = <compute elementwise loss function>

#mean over non-PAD tokens
output_mask = sentence_mask[:,1:]
loss = (loss.reshape(reference_answers.shape)*output_mask).sum() / output_mask.sum()
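
The shifting and masking above can be checked numerically. The following NumPy sketch uses invented token ids and random "probabilities"; it assumes crossentropy as the elementwise loss, which is one common choice rather than the only valid one.

```python
import numpy as np

PAD = -1  # same padding convention as PAD_ix
toy_sentences = np.array([[1, 3, 2, PAD],
                          [1, 4, 3, 2]])  # [batch, time] token ids
n_toy_tokens = 5
rng = np.random.RandomState(0)
logits = rng.randn(2, 4, n_toy_tokens)
probas = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax

# predictions at ticks 0..T-2 are compared with the tokens at ticks 1..T-1
predictions_flat = probas[:, :-1].reshape(-1, n_toy_tokens)
reference = toy_sentences[:, 1:].reshape(-1)
mask = (toy_sentences != PAD)[:, 1:].reshape(-1)

# In NumPy a PAD target (-1) silently indexes the last column;
# the mask removes those positions from the average.
elementwise = -np.log(predictions_flat[np.arange(len(reference)), reference])
masked_mean = (elementwise * mask).sum() / mask.sum()
print(masked_mean)
```

Dividing by `mask.sum()` rather than by the number of matrix cells keeps the loss comparable between batches with different amounts of padding.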



In [ ]:

    
#trainable NN weights
weights = get_all_params(predicted_probabilities,trainable=True)
updates = <parameter updates using your favorite algorithm>



In [ ]:

    
#compile functions for training and evaluation
#please note that your functions must accept image features as the FIRST param and sentences as the second one
train_step = <function that takes image features and sentences, outputs loss and updates weights>
val_step   = <function that takes image features and sentences and outputs loss>
#for val_step use deterministic=True if you have any dropout/noise

Training

You first have to implement a batch generator.
Then the network will get trained the usual way.



In [ ]:

    
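#dtype=object keeps the ragged nested lists intact (recent NumPy refuses to build ragged arrays implicitly)
captions = np.array(captions, dtype=object)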



In [ ]:

    
from random import choice

def generate_batch(images,captions,batch_size,max_caption_len=None):
    
    #sample random numbers for image/caption indices
    random_image_ix = np.random.randint(0,len(images),size=batch_size)
    
    #get images
    batch_images = images[random_image_ix]
    
    #5-7 captions for each image
    captions_for_batch_images = captions[random_image_ix]
    
    #pick 1 from 5-7 captions for each image
    batch_captions = list(map(choice,captions_for_batch_images))
    
    #convert to matrix
    batch_captions_ix = as_matrix(batch_captions,max_len=max_caption_len)
    
    return batch_images, batch_captions_ix



In [ ]:

    
generate_batch(img_codes,captions,3)

Main loop

We recommend that you periodically evaluate the network using the next "apply trained model" block
- it's safe to interrupt training, run a few examples and start training again



In [ ]:

    
batch_size=50 #adjust me
n_epochs=100 #adjust me
n_batches_per_epoch = 50 #adjust me
n_validation_batches = 5 #how many batches are used for validation after each epoch



In [ ]:

    
!pip3 install --user tqdm



In [ ]:

    
from tqdm import tqdm

for epoch in range(n_epochs):
    
    train_loss=0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch
    
    val_loss=0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))
    val_loss /= n_validation_batches
    
    print('\nEpoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))

print("Finish :)")

Apply trained model



In [ ]:

    
!pip3 install --user scikit-image



In [ ]:

    
#the same kind you did last week, but a bit smaller
from pretrained_lenet import build_model,preprocess,MEAN_VALUES

# build googlenet
lenet = build_model()

#load weights
#lenet_weights = pickle.load(open(osp.join(DATA_DIR, 'data/blvc_googlenet.pkl')), encoding='latin1')['param values']
lenet_weights = np.load(osp.join(DATA_DIR, 'data/blvc_googlenet.npz'), encoding='latin1')['param values']
set_all_param_values(lenet["prob"], lenet_weights)

#compile get_features
cnn_input_var = lenet['input'].input_var
cnn_feature_layer = lenet['loss3/classifier']
get_cnn_features = theano.function([cnn_input_var], lasagne.layers.get_output(cnn_feature_layer))



In [ ]:

    
from matplotlib import pyplot as plt
%matplotlib inline

#sample image
img = plt.imread(osp.join(DATA_DIR, 'data/Dog-and-Cat.jpg'))
img = preprocess(img)



In [ ]:

    
#deprocess and show, one line :)
from pretrained_lenet import MEAN_VALUES
plt.imshow(np.transpose((img[0] + MEAN_VALUES)[::-1],[1,2,0]).astype('uint8'))

Generate caption



In [ ]:

    
last_word_probas = <get network-predicted probas at last tick>
#TRY OUT deterministic=True if you want more steady results

get_probs = theano.function([image_vectors,sentences], last_word_probas)

#this is exactly the generation function from week5 classwork,
#except now we condition on image features instead of words
def generate_caption(image,caption_prefix = ("#START#",),t=1,sample=True,max_len=100):
    image_features = get_cnn_features(image)
    caption = list(caption_prefix)
    for _ in range(max_len):
        
        next_word_probs = <obtain probabilities for next words>
        assert len(next_word_probs.shape) ==1 #must be one-dimensional
        #apply temperature (t is an exponent here: t > 1 sharpens the distribution, t < 1 flattens it)
        next_word_probs = next_word_probs**t / np.sum(next_word_probs**t)

        if sample:
            next_word = np.random.choice(vocab,p=next_word_probs) 
        else:
            next_word = vocab[np.argmax(next_word_probs)]

        caption.append(next_word)

        if next_word=="#END#":
            break
            
    return caption
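
The temperature step inside `generate_caption` can be tried in isolation. In the sketch below `apply_temperature` is a hypothetical helper that repeats the formula from the loop, and the probabilities are made up.

```python
import numpy as np

def apply_temperature(probs, t):
    """Re-normalize probs ** t, the same formula as in generate_caption."""
    powered = probs ** t
    return powered / powered.sum()

toy_probs = np.array([0.5, 0.3, 0.2])  # made-up next-word distribution

sharp = apply_temperature(toy_probs, 5.0)      # mass moves to the most likely word
flat = apply_temperature(toy_probs, 0.2)       # distribution moves towards uniform
unchanged = apply_temperature(toy_probs, 1.0)  # t=1 leaves probabilities as they are

print(sharp, flat, unchanged)
```

Because `t` is used as an exponent, a call such as `generate_caption(img, t=5.)` samples close to the greedy choice, while small `t` gives more diverse and more error-prone captions.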



In [ ]:

    
for i in range(10):
    print(' '.join(generate_caption(img,t=5.)[1:-1]))

Demo

Find at least 10 images to test it on.

Seriously, that's part of the assignment. Go get at least 10 pictures to caption.
Make sure it works okay on simple images before going to something more complex.
Photos, not animation/3d/drawings, unless you want to train a CNN on anime.
Mind the aspect ratio (see what preprocess does to your image).



In [ ]:

    
#apply your network to the image samples you found
#
#



In [ ]:

Grading

base 5 if it compiles and trains without exploding
+1 for finding a representative set of reference examples
+2 for providing 10+ examples where network provides reasonable captions (at least sometimes :) )
- you may want to predict with sample=False and deterministic=True for consistent results
- kudos for submitting network params that reproduce it
+2 for providing 10+ examples where network fails IF you also got previous 10 examples right

bonus points for experiments with architecture and initialization (see above)
bonus points for trying out other pre-trained nets for captioning
a whole lot of bonus points if you also train via metric learning
- image -> vec
- caption -> vec (encoder, not decoder)
- loss = correct captions must be closer, wrong ones must be farther
- prediction = choose caption that is closest to image
a freaking whole lot of points if you also obtain statistically significant results the other way round
- take caption, get closest image



In [ ]: