First of all -- Checking Questions

Question 1: Can convolutional networks be used for text classification? If not, explain why :D; if yes, how? How do you handle inputs of arbitrary length?

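Yes, if the texts have the same length. Texts of different lengths can be padded to a common length, or the convolution outputs can be max-pooled over time to get a fixed-size vector.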
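
One way to handle arbitrary input length is max-pooling over time. The sketch below is a minimal pure-Python illustration, not part of the seminar code: `conv1d_max_over_time`, the toy embeddings and the filters are all hypothetical, and it assumes every text is at least as long as the filter width.

```python
def conv1d_max_over_time(embeddings, filters):
    """Convolve each filter over the token sequence, then take the max over time.

    embeddings: list of token vectors, one per token
    filters: list of filters, each a list of `width` weight vectors
    Returns one number per filter, so the output size does not depend on text length.
    Assumes len(embeddings) >= filter width (shorter texts would need padding).
    """
    features = []
    for filt in filters:
        width = len(filt)
        responses = []
        for start in range(len(embeddings) - width + 1):
            window = embeddings[start:start + width]
            # dot product between the filter and the current window
            responses.append(sum(w * x
                                 for w_vec, x_vec in zip(filt, window)
                                 for w, x in zip(w_vec, x_vec)))
        features.append(max(responses))
    return features

# toy 2-d embeddings and two hypothetical filters of width 2
filters = [[[1.0, 0.0], [0.0, 1.0]],
           [[-1.0, 1.0], [1.0, -1.0]]]
short_text = [[1.0, 2.0], [3.0, 4.0]]
long_text = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0], [2.0, 2.0], [5.0, 0.0]]

short_features = conv1d_max_over_time(short_text, filters)
long_features = conv1d_max_over_time(long_text, filters)
```

Both texts end up as vectors with one entry per filter, so a dense classifier can be put on top regardless of the text length.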

Question 2: How is an LSTM better or worse than a plain RNN?

Gradients vanish more slowly, so the network can use information from distant time steps, not only from nearby ones.

Question 3: Write out the derivative $\frac{d c_{n+1}}{d c_{k}}$ for an LSTM and explain the formula. When does the derivative vanish, and when does it explode?
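
A possible starting point, as a sketch under stated assumptions rather than a full answer: take a standard LSTM without peepholes, with cell update $c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot g_{t}$, where $f_{t}$ is the forget gate, $i_{t}$ the input gate and $g_{t}$ the candidate update (these symbols are not defined in the notebook and are introduced here). If the dependence of the gates on $c_{t-1}$ through the hidden state $h_{t-1}$ is ignored, the derivative reduces to a product of forget gates:

$$\frac{d c_{n+1}}{d c_{k}} \approx \prod_{t=k+1}^{n+1} \operatorname{diag}(f_{t})$$

Each entry of $f_{t}$ lies in $(0, 1)$, so this product stays close to 1 while the forget gates are close to 1 and decays towards 0 when they are small. Under this approximation the product cannot exceed 1; any explosion has to come from the omitted terms that pass through $h_{t-1}$ and the recurrent weight matrices.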


Question 4: Why is TBPTT (truncated backpropagation through time) needed, and what is wrong with plain BPTT?


Question 5: How can recurrent and convolutional networks be combined, and more importantly, why? Give several examples of real tasks.

First recognize the objects in an image with a convolutional network, then caption the image with a recurrent one.

Question 6: Explain the intuition behind choosing the size of the embedding layer. Why is this a dangerous spot?

Image Captioning

In this seminar you'll be going through the image captioning pipeline.

To begin with, let us download the dataset of image features from a pre-trained GoogleNet.

!wget -O data.tar.gz
!tar -xvzf data.tar.gz

Data preprocessing

In [1]:
# Read Dataset
import numpy as np
import pickle

img_codes = np.load("data/image_codes.npy")
captions = pickle.load(open('data/caption_tokens.pcl', 'rb'))

In [2]:
print("each image code is a 1000-unit vector: {}".format(img_codes.shape))
print(img_codes[0, :10])
print('\n\n')
print("for each image there are 5-7 descriptions, e.g.:\n")
print('\n'.join(captions[0]))

each image code is a 1000-unit vector: (123287, 1000)
[ 1.38901556 -3.82951474 -1.94360816 -0.5317238  -0.03120959 -2.87483215
 -2.9554503   0.6960277  -0.68551242 -0.7855981 ]

for each image there are 5-7 descriptions, e.g.:

a man with a red helmet on a small moped on a dirt road
man riding a motor bike on a dirt road on the countryside
a man riding on the back of a motorcycle
a dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud wreathed mountains
a man in a red shirt and a red hat is on a motorcycle on a hill side

In [3]:
#split descriptions into tokens
for img_i in range(len(captions)):
    for caption_i in range(len(captions[img_i])):
        sentence = captions[img_i][caption_i] 
        captions[img_i][caption_i] = ["#START#"]+sentence.split(' ')+["#END#"]

In [4]:
# Build a Vocabulary

############# TO CODE IT BY YOURSELF ##################
# word_counts maps each word to its number of occurrences
word_counts = {}
for img_i in captions:
    for caption_i in img_i:
        for word in caption_i:
            try:
                word_counts[word] += 1
            except KeyError:
                word_counts[word] = 1
# print word_counts
# keep only words seen at least 5 times; note that '#START#' and '#END#' are
# counted too, so they appear twice in vocab and word_to_index keeps the later index
vocab  = ['#UNK#', '#START#', '#END#']
vocab += [k for k, v in word_counts.items() if v >= 5]
n_tokens = len(vocab)

assert 10000 <= n_tokens <= 10500

word_to_index = {w: i for i, w in enumerate(vocab)}
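
The same vocabulary can be built more compactly with `collections.Counter`. The sketch below is an alternative, not the notebook's code: `build_vocab` and its parameters are hypothetical names, and unlike the cell above it skips the special tokens when collecting frequent words (so each appears once) and sorts the words to make the indices reproducible. That changes `n_tokens` relative to the cell above.

```python
from collections import Counter

def build_vocab(tokenized_captions, min_count=5,
                special_tokens=('#UNK#', '#START#', '#END#')):
    """Return (vocab, word_to_index) for words seen at least `min_count` times."""
    counts = Counter(word
                     for image_captions in tokenized_captions
                     for caption in image_captions
                     for word in caption)
    vocab = list(special_tokens)
    # special tokens are already in the list, so do not add them a second time
    vocab += sorted(w for w, c in counts.items()
                    if c >= min_count and w not in special_tokens)
    return vocab, {w: i for i, w in enumerate(vocab)}

# toy data: two images with tokenized captions
toy_captions = [[['#START#', 'a', 'cat', '#END#'], ['#START#', 'a', 'dog', '#END#']],
                [['#START#', 'a', 'cat', 'sits', '#END#']]]
toy_vocab, toy_index = build_vocab(toy_captions, min_count=2)
```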

In [5]:
PAD_ix = -1
UNK_ix = vocab.index('#UNK#')

def as_matrix(sequences,max_len=None):
    max_len = max_len or max(map(len,sequences))
    matrix = np.zeros((len(sequences),max_len),dtype='int32')+PAD_ix
    for i,seq in enumerate(sequences):
        row_ix = [word_to_index.get(word,UNK_ix) for word in seq[:max_len]]
        matrix[i,:len(row_ix)] = row_ix
    return matrix

In [6]:
#try it out on several descriptions of a random image
as_matrix(captions[np.random.randint(len(captions))])

array([[ 8481,  2852,  7829,  4136, 10058,  9934,  5915,  4859,  6766,
         1243,  3980,  6254,  8134,    -1,    -1],
       [ 8481,  6766,  1243,  8902,  1021,  9095,  6254,  3980,  8256,
          727,  5915,  8134,    -1,    -1,    -1],
       [ 8481,  8717,  4136, 10058,  9934,  5915,  4859,  6766,  5627,
         8639,   535,  5470,  7115,  5155,  8134],
       [ 8481,  8717,  8897,  3069,  2538,   781,  8717,  8897,  1558,
         8134,    -1,    -1,    -1,    -1,    -1],
       [ 8481,  2852,  7829,  4136, 10058,  9934,  5915,  4859,  6766,
         1243,   535,  5470,  8919,  8134,    -1]], dtype=int32)

Mah Neural Network

In [7]:
# network shapes. 
CNN_FEATURE_SIZE = img_codes.shape[1]
EMBED_SIZE = 128 #pls change me if u want
LSTM_UNITS = 200 #pls change me if u want

In [8]:
import theano
import lasagne
import theano.tensor as T
from lasagne.layers import *

In [9]:
# Input Variable
sentences = T.imatrix()# [batch_size x time] of word ids
image_vectors = T.matrix() # [batch size x unit] of CNN image features
sentence_mask = T.neq(sentences, PAD_ix)

In [10]:
#network inputs
l_words = InputLayer((None, None), sentences)
l_mask = InputLayer((None, None), sentence_mask)

#embeddings for words 
############# TO CODE IT BY YOURSELF ##################
l_word_embeddings = EmbeddingLayer(l_words, input_size=n_tokens, output_size=EMBED_SIZE)

In [11]:
# input layer for image features
l_image_features = InputLayer((None, CNN_FEATURE_SIZE), image_vectors)

############# TO CODE IT BY YOURSELF ##################
#convert 1000 image features from googlenet to whatever LSTM_UNITS you have set
#it's also a good idea to add some dropout here and there
l_image_features_small = DropoutLayer(l_image_features)
l_image_features_small = DenseLayer(l_image_features_small, LSTM_UNITS)
assert l_image_features_small.output_shape == (None, LSTM_UNITS)

In [12]:
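############# TO CODE IT BY YOURSELF ##################
# Combine image features and word embeddings in one sequence model:
# word embeddings are the inputs, and (as an assumption here) the image
# features initialize the LSTM cell state
decoder = LSTMLayer(l_word_embeddings,
                    num_units=LSTM_UNITS,
                    cell_init=l_image_features_small,
                    mask_input=l_mask)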

In [13]:
# Decoding of rnn hidden states
from broadcast import BroadcastLayer,UnbroadcastLayer

#apply whatever comes next to each tick of each example in a batch. Equivalent to 2 reshapes
broadcast_decoder_ticks = BroadcastLayer(decoder, (0, 1))
print "broadcasted decoder shape = ",broadcast_decoder_ticks.output_shape

predicted_probabilities_each_tick = DenseLayer(
    broadcast_decoder_ticks,n_tokens, nonlinearity=lasagne.nonlinearities.softmax)

#un-broadcast back into (batch,tick,probabilities)
predicted_probabilities = UnbroadcastLayer(
    predicted_probabilities_each_tick, broadcast_layer=broadcast_decoder_ticks)

print "output shape = ", predicted_probabilities.output_shape

#remove if you know what you're doing (e.g. 1d convolutions or fixed shape)
assert predicted_probabilities.output_shape == (None, None, n_tokens)

broadcasted decoder shape =  (None, 200)
output shape =  (None, None, 10373)

In [14]:
next_word_probas = get_output(predicted_probabilities)

reference_answers = sentences[:,1:]
output_mask = sentence_mask[:,1:]

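#write symbolic loss function to train NN for
############# TO CODE IT BY YOURSELF ##################
# per-token crossentropy between the prediction at tick t and the actual word at tick t+1
loss = lasagne.objectives.categorical_crossentropy(
    next_word_probas[:, :-1].reshape((-1, n_tokens)),
    reference_answers.reshape((-1,))
).reshape(reference_answers.shape)

# average over real (non-padding) tokens only
loss = (loss * output_mask).sum() / output_mask.sum()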
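
To see what the masked average in the cell above computes, here is a small pure-Python sketch on toy numbers. The function name `masked_mean_crossentropy` and the toy data are hypothetical; the point is only that padded positions (mask 0) contribute nothing to the loss.

```python
import math

def masked_mean_crossentropy(probas, targets, mask):
    """Mean of -log p(target) over positions where mask is 1.

    probas[b][t] is a probability distribution over tokens,
    targets[b][t] is a token id, mask[b][t] is 1 for real tokens and 0 for padding.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(probas, targets, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += -math.log(p[t])
                count += 1
    return total / count

# two sequences of two ticks each over a 2-token vocabulary;
# the last position of the second sequence is padding (target -1, mask 0)
probas = [[[0.5, 0.5], [0.25, 0.75]],
          [[0.5, 0.5], [0.9, 0.1]]]
targets = [[0, 1], [1, -1]]
mask = [[1, 1], [1, 0]]
toy_loss = masked_mean_crossentropy(probas, targets, mask)

# changing the prediction at the padded position leaves the loss unchanged
probas_changed = [[[0.5, 0.5], [0.25, 0.75]],
                  [[0.5, 0.5], [0.1, 0.9]]]
toy_loss_changed = masked_mean_crossentropy(probas_changed, targets, mask)
```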

In [15]:
#trainable NN weights
############# TO CODE IT BY YOURSELF ##################
weights = get_all_params(predicted_probabilities, trainable=True)
updates = lasagne.updates.adam(loss, weights)

In [16]:
#compile a function that takes image features and input sentences, outputs loss and updates weights
#please note that your functions must accept image features as FIRST param and sentences as second one
############# TO CODE IT BY YOURSELF ##################
train_step = theano.function([image_vectors, sentences], loss, updates=updates)
val_step   = theano.function([image_vectors, sentences], loss)


  • You first have to implement a batch generator
  • Then the network will get trained the usual way

In [17]:
captions = np.array(captions, dtype=object)  # ragged: 5-7 captions per image

In [18]:
from random import choice

def generate_batch(images,captions,batch_size,max_caption_len=None):
    #sample random indices for images/captions
    random_image_ix = np.random.randint(0, len(images), size=batch_size)
    #get images
    batch_images = images[random_image_ix]
    #5-7 captions for each image
    captions_for_batch_images = captions[random_image_ix]
    #pick 1 from 5-7 captions for each image (a list, so len() works on it)
    batch_captions = [choice(image_captions) for image_captions in captions_for_batch_images]
    #convert to matrix
    batch_captions_ix = as_matrix(batch_captions,max_len=max_caption_len)
    return batch_images, batch_captions_ix

In [19]:
generate_batch(img_codes,captions, 3)

(array([[-1.3130095 , -1.84587181, -0.89477605, ...,  0.67713088,
          1.55226099,  1.77054608],
        [-4.06723881, -4.88099098,  2.45802474, ..., -2.21407533,
         -1.55124247, -0.56788898],
        [ 4.188025  , -1.60059953,  2.21920943, ..., -3.53720665,
          0.88908136,  2.98351288]], dtype=float32),
 array([[8481, 5164, 5047, 7206, 3017, 7614, 7260,  781, 8717, 8693, 8134,
        [8481, 8717, 5570,  595,  781, 3323, 4364, 4859, 8717, 8534,  998,
        [8481, 8717,  478, 4859, 2603, 8902, 3069, 3980, 8717, 4212, 8134,
           -1]], dtype=int32))

Main loop

  • We recommend periodically evaluating the network using the "apply trained model" block below
    • it's safe to interrupt training, run a few examples and start training again

For the first 30 or so epochs I used batches of 20-30 elements, then gradually increased the batch size to 130. About 110 epochs were run in total.
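
A schedule like the one described above could be written as a small helper. This is only a sketch: the note does not say how the batch size was increased, so the linear ramp, the function name `batch_size_for_epoch` and the exact starting size of 25 are assumptions.

```python
def batch_size_for_epoch(epoch, warmup_epochs=30, total_epochs=110,
                         start_size=25, final_size=130):
    """Small batches during warm-up, then a linear ramp up to final_size (assumed shape)."""
    if epoch < warmup_epochs:
        return start_size
    # fraction of the ramp completed, reaching 1.0 at the last epoch
    progress = (epoch - warmup_epochs) / float(total_epochs - warmup_epochs - 1)
    progress = min(progress, 1.0)
    return int(round(start_size + progress * (final_size - start_size)))

schedule = [batch_size_for_epoch(epoch) for epoch in range(110)]
```

The training loop would then call `generate_batch(img_codes, captions, batch_size_for_epoch(epoch))` instead of using a fixed `batch_size`.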

In [170]:
batch_size = 100 #adjust me
n_epochs   = 5 #adjust me
n_batches_per_epoch = 50 #adjust me
n_validation_batches = 5 #how many batches are used for validation after each epoch

In [172]:
from tqdm import tqdm

for epoch in range(n_epochs):
    # reset the running losses at the start of every epoch
    train_loss = 0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch

    val_loss = 0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))
    val_loss /= n_validation_batches
    print('\nEpoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))

print("Finish :)")

Epoch: 0, train loss: 2.65756696846, val loss: 2.59482125641
Epoch: 1, train loss: 2.65778410162, val loss: 2.6127486045
Epoch: 2, train loss: 2.63677129403, val loss: 2.67636922386
Epoch: 3, train loss: 2.63122560129, val loss: 2.6475341315
Epoch: 4, train loss: 2.63119870206, val loss: 2.58934502373
Finish :)

Apply trained model

In [174]:
#the same kind you did last week, but a bit smaller
from pretrained_lenet import build_model,preprocess, MEAN_VALUES

# build googlenet
lenet = build_model()

#load weights
lenet_weights = pickle.load(open('data/blvc_googlenet.pkl'))['param values']
set_all_param_values(lenet["prob"], lenet_weights)

#compile get_features
cnn_input_var = lenet['input'].input_var
cnn_feature_layer = lenet['loss3/classifier']
get_cnn_features = theano.function([cnn_input_var], lasagne.layers.get_output(cnn_feature_layer))

In [175]:
from matplotlib import pyplot as plt
%matplotlib inline

#sample image
img = plt.imread('data/Dog-and-Cat.jpg')
img = preprocess(img)

In [176]:
#deprocess and show, one line :)
from pretrained_lenet import MEAN_VALUES
plt.imshow(np.transpose((img[0] + MEAN_VALUES)[::-1],[1,2,0]).astype('uint8'))

<matplotlib.image.AxesImage at 0x7f8e5c1b7350>

Generate caption

In [199]:
last_word_probas_det = get_output(predicted_probabilities,deterministic=True)[:,-1]

get_probs = theano.function([image_vectors,sentences], last_word_probas_det)

#this is exactly the generation function from week5 classwork,
#except now we condition on image features instead of words
def generate_caption(image,caption_prefix = ("#START#",),t=1,sample=True,max_len=100):
    image_features = get_cnn_features(image)
    caption = list(caption_prefix)
    for _ in range(max_len):
        next_word_probs = get_probs(image_features,as_matrix([caption]) ).ravel()
        #apply temperature: t > 1 sharpens the distribution, t < 1 flattens it
        next_word_probs = next_word_probs**t / np.sum(next_word_probs**t)

        if sample:
            #draw the next word at random according to its probability
            next_word = np.random.choice(vocab,p=next_word_probs)
        else:
            #greedy decoding: take the most likely word
            next_word = vocab[np.argmax(next_word_probs)]

        caption.append(next_word)

        if next_word=="#END#":
            break
    return caption
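
The temperature step in `generate_caption` raises every probability to the power `t` and renormalizes. The standalone sketch below shows its effect on a toy distribution; `apply_temperature` is a hypothetical helper written in pure Python, not a function from the notebook.

```python
def apply_temperature(probs, t):
    """Raise probabilities to the power t and renormalize them to sum to 1."""
    powered = [p ** t for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

probs = [0.5, 0.3, 0.2]
sharp = apply_temperature(probs, 5.0)   # t > 1: mass concentrates on the most likely word
flat = apply_temperature(probs, 0.2)    # t < 1: the distribution moves towards uniform
same = apply_temperature(probs, 1.0)    # t = 1: the distribution is unchanged
```

With this parameterization `t` acts as an inverse temperature: large values make sampling close to greedy decoding, small values make the captions more varied.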

In [200]:
for i in range(10):
    print(' '.join(generate_caption(img,t=1.)[1:-1]))

small brown brown and white cat is looking at the camera
brown dog and a cat laying on the ground table
brown bear and a cat are together
older white cat sitting next to a bunch of donuts
a brown and white photo of cute dogs
alcoholic white orange and white cat sitting in an open suitcase
garden with a cat sitting on the ground
an orange and white furry suit and kid
brown dog and a cat laying on the ground table
black and white cat is sitting on a street corner

Bonus Part

  • Use ResNet Instead of GoogLeNet
  • Use W2V as embedding
  • Use Attention :)

