First of all -- Checking Questions

Question 1: Can convolutional networks be used for text classification? If not, justify :D; if yes, then how? How do you deal with arbitrary input length?

Yes. Arbitrary input length can be handled either by padding/truncating all texts to a fixed length, or by pooling over the time axis (e.g. global max pooling after the convolutions), which makes the output independent of the sequence length; see the sketch below.
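
A minimal sketch of the second option in Lasagne (the vocabulary size, embedding size, filter count and number of classes are made-up numbers, not taken from this seminar):

import theano.tensor as T
import lasagne
from lasagne.layers import (InputLayer, EmbeddingLayer, DimshuffleLayer,
                            Conv1DLayer, GlobalPoolLayer, DenseLayer)

l_in   = InputLayer((None, None), input_var=T.imatrix())    # [batch, time] word ids, any length
l_emb  = EmbeddingLayer(l_in, input_size=10000, output_size=128)
l_emb  = DimshuffleLayer(l_emb, (0, 2, 1))                   # -> [batch, channels, time] for Conv1D
l_conv = Conv1DLayer(l_emb, num_filters=100, filter_size=3)  # n-gram feature detectors
l_pool = GlobalPoolLayer(l_conv, pool_function=T.max)        # max over time removes the length dependence
l_out  = DenseLayer(l_pool, num_units=2,
                    nonlinearity=lasagne.nonlinearities.softmax)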

Question 2: In what ways is an LSTM better/worse than a plain RNN?

Gradients vanish much more slowly, so the network can use information not only from nearby time steps but also from distant ones. The price is roughly four times as many recurrent parameters and a slower, more memory-hungry training step.

Question 3: Write out the derivative $\frac{d c_{n+1}}{d c_{k}}$ for an LSTM (notation as in http://colah.github.io/posts/2015-08-Understanding-LSTMs/), explain the formula: when does the derivative vanish, and when does it explode?

<Answer>
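
A sketch of the standard argument, keeping only the direct path through the cell state (the indirect paths through $h_{t-1}$ are ignored here): the cell update is $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, so

$$\frac{d c_{n+1}}{d c_{k}} \approx \prod_{t=k+1}^{n+1} \frac{\partial c_t}{\partial c_{t-1}} = \prod_{t=k+1}^{n+1} \operatorname{diag}(f_t).$$

The gradient vanishes when the forget gates stay close to 0 over many steps and survives when they stay close to 1; since $f_t \in (0, 1)$, this direct path cannot explode by itself, so explosions can only come through the ignored indirect paths.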

Question 4: Why do we need TBPTT, and what is wrong with plain BPTT?

<Answer>
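
For context: plain BPTT backpropagates through the entire unrolled sequence, which for long sequences is slow, memory-hungry and prone to vanishing/exploding gradients, while truncated BPTT backpropagates only through the last few ticks. A minimal sketch of how this can be requested in Lasagne via the gradient_steps argument of its recurrent layers (the sizes are made up):

import theano.tensor as T
from lasagne.layers import InputLayer, EmbeddingLayer, LSTMLayer

l_in  = InputLayer((None, None), input_var=T.imatrix())  # [batch, time] word ids
l_emb = EmbeddingLayer(l_in, input_size=10000, output_size=128)
l_rnn = LSTMLayer(l_emb, num_units=200,
                  gradient_steps=20,   # truncated BPTT: backprop through at most 20 ticks
                  grad_clipping=10)    # clip per-step gradients as an extra safeguard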

Question 5: How can recurrent and convolutional networks be combined, and more importantly, why? Give a few examples of real tasks.

First a convolutional network recognizes objects / extracts features from the image, and then a recurrent network generates an annotation (caption) from those features; this is exactly the image-captioning pipeline of this seminar.

Question 6: Explain the intuition behind choosing the size of the embedding layer. Why is it a dangerous place?

<Answer>
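
For a sense of scale with the sizes used below (n_tokens ≈ 10373, EMBED_SIZE = 128, LSTM_UNITS = 200), the embedding matrix alone holds

$$\underbrace{10373 \times 128}_{\text{embedding}} \approx 1.33\text{M} \quad \text{vs} \quad \underbrace{4\left((128 + 200) \times 200 + 200\right)}_{\text{LSTM, roughly}} \approx 0.26\text{M}$$

parameters, i.e. it dominates the model's parameter count and is the easiest place to overfit, especially on rare words.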
  • Arseniy Ashuha, you can write to me at ars.ashuha@gmail.com; Александр Панин

Image Captioning

In this seminar you'll be going through the image captioning pipeline.

To begin with, let us download the dataset of image features from a pre-trained GoogleNet.


In [ ]:
!wget https://www.dropbox.com/s/3hj16b0fj6yw7cc/data.tar.gz?dl=1 -O data.tar.gz
!tar -xvzf data.tar.gz

Data preprocessing


In [1]:
%%time
# Read Dataset
import numpy as np
import pickle

img_codes = np.load("data/image_codes.npy")
captions = pickle.load(open('data/caption_tokens.pcl', 'rb'))


CPU times: user 4.63 s, sys: 1.03 s, total: 5.66 s
Wall time: 12.8 s

In [2]:
print "each image code is a 1000-unit vector:", img_codes.shape
print img_codes[0,:10]
print '\n\n'
print "for each image there are 5-7 descriptions, e.g.:\n"
print '\n'.join(captions[0])


each image code is a 1000-unit vector: (123287, 1000)
[ 1.38901556 -3.82951474 -1.94360816 -0.5317238  -0.03120959 -2.87483215
 -2.9554503   0.6960277  -0.68551242 -0.7855981 ]



for each image there are 5-7 descriptions, e.g.:

a man with a red helmet on a small moped on a dirt road
man riding a motor bike on a dirt road on the countryside
a man riding on the back of a motorcycle
a dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud wreathed mountains
a man in a red shirt and a red hat is on a motorcycle on a hill side

In [3]:
#split descriptions into tokens
for img_i in range(len(captions)):
    for caption_i in range(len(captions[img_i])):
        sentence = captions[img_i][caption_i] 
        captions[img_i][caption_i] = ["#START#"]+sentence.split(' ')+["#END#"]

In [4]:
# Build a Vocabulary

############# TO CODE IT BY YOURSELF ##################
#<here should be a dict word -> number of occurrences>
word_counts = {}
for img_i in captions:
    for caption_i in img_i:
        for word in caption_i:
            try:
                word_counts[word] += 1
            except KeyError:
                word_counts[word] = 1
# print word_counts
vocab  = ['#UNK#', '#START#', '#END#']
vocab += [k for k, v in word_counts.items() if v >= 5]
n_tokens = len(vocab)

assert 10000 <= n_tokens <= 10500

word_to_index = {w: i for i, w in enumerate(vocab)}

In [5]:
PAD_ix = -1
UNK_ix = vocab.index('#UNK#')

def as_matrix(sequences,max_len=None):
    max_len = max_len or max(map(len,sequences))
    
    matrix = np.zeros((len(sequences),max_len),dtype='int32')+PAD_ix
    for i,seq in enumerate(sequences):
        row_ix = [word_to_index.get(word,UNK_ix) for word in seq[:max_len]]
        matrix[i,:len(row_ix)] = row_ix
    
    return matrix

In [6]:
#try it out on several descriptions of a random image
as_matrix(captions[1337])


Out[6]:
array([[ 8481,  2852,  7829,  4136, 10058,  9934,  5915,  4859,  6766,
         1243,  3980,  6254,  8134,    -1,    -1],
       [ 8481,  6766,  1243,  8902,  1021,  9095,  6254,  3980,  8256,
          727,  5915,  8134,    -1,    -1,    -1],
       [ 8481,  8717,  4136, 10058,  9934,  5915,  4859,  6766,  5627,
         8639,   535,  5470,  7115,  5155,  8134],
       [ 8481,  8717,  8897,  3069,  2538,   781,  8717,  8897,  1558,
         8134,    -1,    -1,    -1,    -1,    -1],
       [ 8481,  2852,  7829,  4136, 10058,  9934,  5915,  4859,  6766,
         1243,   535,  5470,  8919,  8134,    -1]], dtype=int32)

Mah Neural Network


In [7]:
# network shapes. 
CNN_FEATURE_SIZE = img_codes.shape[1]
EMBED_SIZE = 128 # feel free to change
LSTM_UNITS = 200 # feel free to change

In [8]:
import theano
import lasagne
import theano.tensor as T
from lasagne.layers import *

In [9]:
# Input Variable
sentences = T.imatrix()# [batch_size x time] of word ids
image_vectors = T.matrix() # [batch size x unit] of CNN image features
sentence_mask = T.neq(sentences, PAD_ix)

In [10]:
#network inputs
l_words = InputLayer((None, None), sentences)
l_mask = InputLayer((None, None), sentence_mask)

#embeddings for words 
############# TO CODE IT BY YOURSELF ##################
l_word_embeddings = EmbeddingLayer(l_words, input_size=n_tokens, output_size=EMBED_SIZE)

In [11]:
# input layer for image features
l_image_features = InputLayer((None, CNN_FEATURE_SIZE), image_vectors)

############# TO CODE IT BY YOURSELF ##################
#convert 1000 image features from googlenet to whatever LSTM_UNITS you have set
#it's also a good idea to add some dropout here and there
l_image_features_small = DropoutLayer(l_image_features)
l_image_features_small = DenseLayer(l_image_features_small, LSTM_UNITS)
assert l_image_features_small.output_shape == (None, LSTM_UNITS)

In [12]:
############# TO CODE IT BY YOURSELF ##################
# Condition the LSTM on the image: instead of concatenating image features with the word embeddings,
# feed them in as the initial cell state (cell_init) of the decoder LSTM
decoder = LSTMLayer(l_word_embeddings,
                    num_units=LSTM_UNITS,
                    cell_init=l_image_features_small,
                    mask_input=l_mask,
                    grad_clipping=10**10)

In [13]:
# Decoding of RNN hidden states
from broadcast import BroadcastLayer,UnbroadcastLayer

#apply whatever comes next to each tick of each example in a batch. Equivalent to 2 reshapes
broadcast_decoder_ticks = BroadcastLayer(decoder, (0, 1))
print "broadcasted decoder shape = ",broadcast_decoder_ticks.output_shape

predicted_probabilities_each_tick = DenseLayer(
    broadcast_decoder_ticks,n_tokens, nonlinearity=lasagne.nonlinearities.softmax)

#un-broadcast back into (batch,tick,probabilities)
predicted_probabilities = UnbroadcastLayer(
    predicted_probabilities_each_tick, broadcast_layer=broadcast_decoder_ticks)

print "output shape = ", predicted_probabilities.output_shape

#remove if you know what you're doing (e.g. 1d convolutions or fixed shape)
assert predicted_probabilities.output_shape == (None, None, 10373)


broadcasted decoder shape =  (None, 200)
output shape =  (None, None, 10373)

In [14]:
next_word_probas = get_output(predicted_probabilities)

reference_answers = sentences[:,1:]
output_mask = sentence_mask[:,1:]

#write symbolic loss function to train NN for
loss = lasagne.objectives.categorical_crossentropy(
    next_word_probas[:, :-1].reshape((-1, n_tokens)),
    reference_answers.reshape((-1,))
).reshape(reference_answers.shape)

############# TO CODE IT BY YOURSELF ##################
loss = (loss * output_mask).sum() / output_mask.sum()

In [15]:
#trainable NN weights
############# TO CODE IT BY YOURSELF ##################
weights = get_all_params(predicted_probabilities, trainable=True)
updates = lasagne.updates.adam(loss, weights)

In [16]:
#compile a function that takes image features and sentences, outputs the loss and updates the weights
#please note that your functions must accept image features as the FIRST param and sentences as the second one
############# TO CODE IT BY YOURSELF ##################
train_step = theano.function([image_vectors, sentences], loss, updates=updates)
val_step   = theano.function([image_vectors, sentences], loss)

Training

  • You first have to implement a batch generator
  • Then the network will get trained the usual way

In [17]:
captions = np.array(captions)

In [18]:
from random import choice

def generate_batch(images,captions,batch_size,max_caption_len=None):
    #sample random numbers for image/caption indices
    random_image_ix = np.random.randint(0, len(images), size=batch_size)
    
    #get images
    batch_images = images[random_image_ix]
    
    #5-7 captions for each image
    captions_for_batch_images = captions[random_image_ix]
    
    #pick 1 from 5-7 captions for each image
    batch_captions = map(choice, captions_for_batch_images)
    
    #convert to matrix
    batch_captions_ix = as_matrix(batch_captions,max_len=max_caption_len)
    
    return batch_images, batch_captions_ix

In [19]:
generate_batch(img_codes,captions, 3)


Out[19]:
(array([[-1.3130095 , -1.84587181, -0.89477605, ...,  0.67713088,
          1.55226099,  1.77054608],
        [-4.06723881, -4.88099098,  2.45802474, ..., -2.21407533,
         -1.55124247, -0.56788898],
        [ 4.188025  , -1.60059953,  2.21920943, ..., -3.53720665,
          0.88908136,  2.98351288]], dtype=float32),
 array([[8481, 5164, 5047, 7206, 3017, 7614, 7260,  781, 8717, 8693, 8134,
           -1],
        [8481, 8717, 5570,  595,  781, 3323, 4364, 4859, 8717, 8534,  998,
         8134],
        [8481, 8717,  478, 4859, 2603, 8902, 3069, 3980, 8717, 4212, 8134,
           -1]], dtype=int32))

Main loop

  • We recommend periodically evaluating the network using the "apply trained model" block below
    • it's safe to interrupt training, run a few examples, and then resume training

For the first ~30 epochs I used batches of 20-30 examples, gradually increasing the batch size to 130. In total, about 110 epochs were trained.


In [170]:
batch_size = 100 #adjust me
n_epochs   = 5 #adjust me
n_batches_per_epoch = 50 #adjust me
n_validation_batches = 5 #how many batches are used for validation after each epoch

In [172]:
%%time
from tqdm import tqdm

for epoch in range(n_epochs):
    train_loss=0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch
    
    val_loss=0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))
    val_loss /= n_validation_batches
    
    print('\nEpoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))

print("Finish :)")


100%|██████████| 50/50 [05:19<00:00,  7.27s/it]
  0%|          | 0/50 [00:00<?, ?it/s]
Epoch: 0, train loss: 2.65756696846, val loss: 2.59482125641
100%|██████████| 50/50 [06:46<00:00,  8.18s/it]
  0%|          | 0/50 [00:00<?, ?it/s]
Epoch: 1, train loss: 2.65778410162, val loss: 2.6127486045
100%|██████████| 50/50 [06:55<00:00,  9.87s/it]
  0%|          | 0/50 [00:00<?, ?it/s]
Epoch: 2, train loss: 2.63677129403, val loss: 2.67636922386
100%|██████████| 50/50 [05:30<00:00,  5.77s/it]
  0%|          | 0/50 [00:00<?, ?it/s]
Epoch: 3, train loss: 2.63122560129, val loss: 2.6475341315
100%|██████████| 50/50 [05:15<00:00,  6.04s/it]
Epoch: 4, train loss: 2.63119870206, val loss: 2.58934502373
Finish :)
CPU times: user 1h 25s, sys: 14min 46s, total: 1h 15min 12s
Wall time: 31min 18s

apply trained model


In [174]:
#the same kind you did last week, but a bit smaller
from pretrained_lenet import build_model,preprocess, MEAN_VALUES

# build googlenet
lenet = build_model()

#load weights
lenet_weights = pickle.load(open('data/blvc_googlenet.pkl', 'rb'))['param values']
set_all_param_values(lenet["prob"], lenet_weights)

#compile get_features
cnn_input_var = lenet['input'].input_var
cnn_feature_layer = lenet['loss3/classifier']
get_cnn_features = theano.function([cnn_input_var], lasagne.layers.get_output(cnn_feature_layer))

In [175]:
from matplotlib import pyplot as plt
%matplotlib inline

#sample image
img = plt.imread('data/Dog-and-Cat.jpg')
img = preprocess(img)

In [176]:
#deprocess and show, one line :)
from pretrained_lenet import MEAN_VALUES
plt.imshow(np.transpose((img[0] + MEAN_VALUES)[::-1],[1,2,0]).astype('uint8'))


Out[176]:
<matplotlib.image.AxesImage at 0x7f8e5c1b7350>

Generate caption


In [199]:
last_word_probas_det = get_output(predicted_probabilities, deterministic=True)[:,-1]  # disable dropout at generation time

get_probs = theano.function([image_vectors,sentences], last_word_probas_det)

#this is exactly the generation function from week5 classwork,
#except now we condition on image features instead of words
def generate_caption(image, caption_prefix=("#START#",), t=1, sample=True, max_len=100):
    image_features = get_cnn_features(image)
    caption = list(caption_prefix)
    for _ in range(max_len):
        
        next_word_probs = get_probs(image_features,as_matrix([caption]) ).ravel()
        #apply temperature
        next_word_probs = next_word_probs**t / np.sum(next_word_probs**t)

        if sample:
            next_word = np.random.choice(vocab,p=next_word_probs) 
        else:
            next_word = vocab[np.argmax(next_word_probs)]

        caption.append(next_word)

        if next_word=="#END#":
            break
            
    return caption

In [200]:
for i in range(10):
    print ' '.join(generate_caption(img,t=1.)[1:-1])


small brown brown and white cat is looking at the camera
brown dog and a cat laying on the ground table
brown bear and a cat are together
older white cat sitting next to a bunch of donuts
a brown and white photo of cute dogs
alcoholic white orange and white cat sitting in an open suitcase
garden with a cat sitting on the ground
an orange and white furry suit and kid
brown dog and a cat laying on the ground table
black and white cat is sitting on a street corner

Bonus Part

  • Use ResNet instead of GoogLeNet
  • Use W2V as the embedding (see the sketch below)
  • Use Attention :)
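
A minimal sketch of the W2V option, assuming gensim is installed and a pre-trained binary word2vec file is available locally (the file name below is an assumption); vocab, n_tokens and l_words are the objects defined earlier in this notebook:

In [ ]:
# initialize the embedding matrix from pre-trained word2vec vectors (hypothetical file path)
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

W2V_SIZE = 300
W_init = np.random.normal(scale=0.1, size=(n_tokens, W2V_SIZE)).astype('float32')
for i, word in enumerate(vocab):
    if word in w2v:                      # copy the pre-trained vector when the word is known
        W_init[i] = w2v[word]

# use this instead of the randomly initialized EmbeddingLayer above
l_word_embeddings = EmbeddingLayer(l_words, input_size=n_tokens,
                                   output_size=W2V_SIZE, W=W_init)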

In [ ]: