Visual Turing Test - Tutorial

Scalable Learning and Perception Group, authored by Dr. Mario Fritz and Mateusz Malinowski. We are also curious about your opinion on the tutorial; all feedback is welcome, please write to mmalinow@mpi-inf.mpg.de.

This tutorial is based on our ICCV'15 paper "Ask Your Neurons: A Neural-based Approach to Answering Questions about Images", and, more broadly, our project on Visual Turing Test.

Since the visual features are large, you should download them separately and put them into the data/visual_features/visual_features or data/vqa/visual_features directory.

If you use this tutorial, its parts, or Kraino in your project, please consider citing at least our 'Ask Your Neurons' paper (you can find the bibtex below).

Bibtex:

@article{malinowski2016ask,
  title={Ask your neurons: A Deep Learning Approach to Visual Question Answering},
  author={Malinowski, Mateusz and Rohrbach, Marcus and Fritz, Mario},
  journal={arXiv preprint arXiv:1605.02697},
  year={2016}
}
or
@inproceedings{malinowski2015ask,
  title={Ask your neurons: A neural-based approach to answering questions about images},
  author={Malinowski, Mateusz and Rohrbach, Marcus and Fritz, Mario},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  pages={1--9},
  year={2015}
}
or
@article{malinowski2016tutorial,
  title={Tutorial on Answering Questions about Images with Deep Learning},
  author={Malinowski, Mateusz and Fritz, Mario},
  journal={arXiv preprint arXiv:1610.01076},
  year={2016}
}

Before starting the tutorial, make sure that you have the following files and folders: boring_function.py neural_solver.py visual_turing_test.ipynb data/ kraino/ local/ fig/

Introduction

How to use this notebook?

Before focusing on the actual task, let's briefly see what we can do in the Jupyter Notebook. We will also introduce notation that we use in this tutorial.

Shortcuts:

  • Shift + Enter - runs the cell and steps into the next cell
  • Ctrl + Enter - runs the cell (stays in the same cell)
  • Esc + x - deletes the cell (be careful)
  • Esc + b - creates a cell below the current cell

The following is an exercise that doesn't require programming. Its role is to let you practice some newly introduced concepts.

Exercise

The following is a Python script.

print("Hello world")

# Comment
# Now we write a loop printing numbers from 0 to 9
for k in xrange(10):
    # remember that Python's syntax is driven by indentation
    print(k)

Some exercises require programming, or at least executing code. However, in this tutorial we try to keep the programming part rather minimal and focus on the Visual Turing Test. The following cell is a small programming exercise. You can edit it by double-clicking the cell, and execute it by running the cell (Shift + Enter). We use #TODO: to give hints or a more detailed explanation.


In [ ]:
def print_n_numbers(n):
    #TODO: write a loop that prints numbers from 0 to n (excluding n)
    for i in xrange(n):
        print(i)

# now we execute the function
print_n_numbers(5)

The function below prints each element of the list on a new line. We will use this function later, so please run the following cell (Shift+Enter).


In [ ]:
def print_list(ll):
    # Prints the list
    print('\n'.join(ll))
    
print_list(['Visual Turing Test', 'Summer School', 'Dr. Mario Fritz', 'Mateusz Malinowski'])

The notebook can also interface with the command line. Try the following line (again Shift+Enter).


In [ ]:
! ls

And now let's execute the Python script 'boring_function' with an argument. It prints the available GPU, the argument, as well as the versions of Theano and Keras (we will talk about both frameworks later in the tutorial). Since boring_function imports Theano, its execution may take a while.


In [1]:
! python boring_function.py 'hello world'


Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 0: Tesla K40m (CNMeM is disabled, cuDNN 5004)
hello world
0.9.0dev0.dev-beefa9396a0e089f6b35b43f71c05557b1083515
0.3.3

To view the contents of the script, run the following:


In [2]:
! tail boring_function.py


import sys
import theano
import keras

if __name__ == '__main__':
    print sys.argv[1]
    print theano.__version__
    print keras.__version__

The command below lists the available GPUs.


In [3]:
! nvidia-smi


Mon Jan 30 00:54:14 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:05:00.0     Off |                    0 |
| N/A   39C    P0    61W / 235W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:42:00.0     Off |                    0 |
| N/A   29C    P0    61W / 235W |      0MiB / 11439MiB |     63%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Challenge

During this tutorial, we will look at a recent research thread that interlinks language and vision -- the Visual Turing Test -- in which machines answer natural language questions about images.

Roadmap

Datasets

In this section, we will get familiar with both datasets, accuracy measures, and features.

DAQUAR

Let's first look into the folder data/daquar to make the problem a bit more tangible for us. Here, we have training and test data in qa.894.raw.train.format_triple and qa.894.raw.test.format_triple.

Execute the cell below to see what the input data look like.
Please use Shift+Enter on the cell below.
Make sure you understand the format.

In [4]:
! head -15 data/daquar/qa.894.raw.train.format_triple


what is on the right side of the black telephone and on the left side of the red chair ?
desk
image3
what is in front of the white door on the left side of the desk ?
telephone
image3
what is on the desk ?
book, scissor, papers, tape_dispenser
image3
what is the largest brown objects ?
carton
image3
what color is the chair in front of the white wall ?
red
image3

Let's have a look at the figure in Introduction->Challenge. The figure lists images with associated question-answer pairs. It also comments on the challenges associated with every question-answer-image triplet. We see that to answer the questions properly, the answerer needs to understand the scene visually and understand the question, but also, arguably, has to resort to common sense knowledge, or even know the preferences of the person asking the question ('What is behind the table?' - what does 'behind' mean here?).

Can you see anything particularly interesting in the first column of the figure in [Introduction->Challenge]?
Think about the spatial relationship between the observer, the object of interest, and the world.

In [5]:
#TODO: Execute the following procedure (Shift+Enter)
from kraino.utils import data_provider

dp = data_provider.select['daquar-triples']
dp


Out[5]:
{'perception': <function kraino.utils.data_provider.<lambda>>,
 'save_predictions': <function kraino.utils.data_provider.daquar_save_results>,
 'text': <function kraino.utils.data_provider.daquar_qa_triples>}

The code above returns a dictionary with three representations of the DAQUAR dataset. For now, we will look only into the 'text' representation. dp['text'] is a function that maps a dataset split into the dataset's textual representation. It will become clearer after executing the following instruction.


In [6]:
# check the keys of the representation of DAQUAR train
train_text_representation = dp['text'](train_or_test='train')
train_text_representation.keys()


Out[6]:
['answer_words_delimiter',
 'end_of_answer',
 'img_name',
 'y',
 'x',
 'end_of_question',
 'img_ind',
 'question_id']

This representation specifies how questions end ('?'), how answers end ('.'), and how answer words are delimited (DAQUAR sometimes has a set of answer words as an answer, for instance 'knife, fork' may be a valid answer). Most importantly, it contains the questions (key 'x'), the answers (key 'y'), and the names of the corresponding images (key 'img_name').


In [7]:
# let's check some entries of the text's representation
n_elements = 10
print('== Questions:')
print_list(train_text_representation['x'][:n_elements])
print
print('== Answers:')
print_list(train_text_representation['y'][:n_elements])
print
print('== Image Names:')
print_list(train_text_representation['img_name'][:n_elements])


== Questions:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-a25d18255c1b> in <module>()
      2 n_elements = 10
      3 print('== Questions:')
----> 4 print_list(train_text_representation['x'][:n_elements])
      5 print
      6 print('== Answers:')

NameError: name 'print_list' is not defined

Summary

DAQUAR consists of (question, answer, image) triplets. Question-answer pairs for the different folds are accessible from

data_provider.select['daquar-triples']['text']

Textual Features

Ok. We have the text. But unfortunately neural networks expect numerical input, so we cannot really work with the raw text. We need to transform the raw input into some numerical value or a vector of values. One particularly successful representation is called the one-hot vector: a binary vector with exactly one non-zero entry, where this entry points to the corresponding word in the vocabulary. See the illustration below.

* What does the vector computed by summing up the one-hot representations of 'what is behind the table' (see the illustration above) look like?
* What if we sum up 'What table is behind the table'? Can you interpret the resulting vector?
* Can you guess why it's nice to work with the one-hot vector representation of text?

As we see from the illustrative example above, we first need to build a suitable vocabulary from our raw textual training data, and then transform the text into its one-hot representation.
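To make the idea concrete, here is a minimal, self-contained sketch (with a tiny made-up vocabulary, independent of the DAQUAR vocabulary we build below) that constructs explicit one-hot vectors and sums them into a word-count histogram:

In [ ]:
import numpy as np

# a toy vocabulary, made up purely for illustration
toy_vocab = ['<pad>', '<unk>', 'what', 'is', 'behind', 'the', 'table']
toy_word2index = {w: i for i, w in enumerate(toy_vocab)}

def one_hot(word, word2index):
    # binary vector with a single 1 at the word's index; unknown words map to <unk>
    v = np.zeros(len(word2index), dtype=np.int32)
    v[word2index.get(word, word2index['<unk>'])] = 1
    return v

question = 'what is behind the table'
one_hots = [one_hot(w, toy_word2index) for w in question.split()]
print(one_hots[0])               # one-hot vector for 'what'
print(np.sum(one_hots, axis=0))  # summing one-hots yields a word-count histogram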


In [8]:
from toolz import frequencies
train_raw_x = train_text_representation['x']
# we start from building the frequencies table
wordcount_x = frequencies(' '.join(train_raw_x).split(' '))
# print the most and least frequent words
n_show = 5
print(sorted(wordcount_x.items(), key=lambda x: x[1], reverse=True)[:n_show])
print(sorted(wordcount_x.items(), key=lambda x: x[1])[:n_show])


[('the', 9847), ('?', 6795), ('what', 5847), ('is', 5368), ('on', 2909)]
[('all', 1), ('surrounded', 1), ('four', 1), ('displaying', 1), ('children', 1)]

In [9]:
# Kraino is a framework that helps in fast prototyping Visual Turing Test models
from kraino.utils.input_output_space import build_vocabulary

# This function takes wordcounts and returns word2index - mapping from words into indices, 
# and index2word - mapping from indices to words.
word2index_x, index2word_x = build_vocabulary(
    this_wordcount=wordcount_x,
    truncate_to_most_frequent=0)
word2index_x


Out[9]:
{'3': 506,
 u'<eoa>': 2,
 u'<eoq>': 3,
 u'<pad>': 0,
 u'<unk>': 1,
 '?': 52,
 'a': 205,
 'above': 80,
 'ac': 817,
 'across': 512,
 'against': 533,
 'air': 588,
 'airconditionerg': 790,
 'alarm': 424,
 'all': 4,
 'along': 92,
 'amidst': 783,
 'and': 382,
 'any': 390,
 'apart': 688,
 'apples': 513,
 'appliance': 510,
 'appliances': 112,
 'are': 488,
 'arm': 494,
 'armchair': 546,
 'armchairs': 320,
 'around': 765,
 'at': 828,
 'attached': 807,
 'audio': 647,
 'available': 514,
 'away': 501,
 'baby': 137,
 'back': 708,
 'backpack': 541,
 'bag': 471,
 'bags': 607,
 'ball': 623,
 'bananas': 651,
 'bars': 477,
 'base': 303,
 'basin': 566,
 'basins': 545,
 'basket': 215,
 'baskets': 110,
 'bath': 257,
 'bathroom': 558,
 'bathtub': 592,
 'bean': 463,
 'bear': 462,
 'bed': 374,
 'bedding': 720,
 'beds': 873,
 'bedside': 51,
 'been': 563,
 'before': 234,
 'behind': 660,
 'beige': 757,
 'below': 811,
 'belt': 556,
 'bench': 667,
 'beneath': 269,
 'benhind': 95,
 'between': 509,
 'bicycle': 335,
 'big': 679,
 'bigger': 712,
 'biggest': 787,
 'billing': 559,
 'bills': 270,
 'bin': 595,
 'bins': 219,
 'black': 124,
 'blackboard': 411,
 'blackboards': 265,
 'blades': 404,
 'blanket': 33,
 'blind': 323,
 'blinds': 350,
 'blocks': 282,
 'blue': 70,
 'board': 569,
 'boards': 746,
 'body': 86,
 'book': 677,
 'bookcase': 625,
 'books': 812,
 'bookshelf': 797,
 'border': 389,
 'both': 571,
 'bottle': 645,
 'bottles': 200,
 'bottom': 718,
 'bowl': 614,
 'bowls': 278,
 'box': 96,
 'boxes': 640,
 'bricks': 260,
 'briefcase': 285,
 'bright': 438,
 'broom': 238,
 'brown': 28,
 'brush': 772,
 'brushes': 361,
 'bucket': 480,
 'built': 395,
 'bunk': 177,
 'burner': 502,
 'burners': 436,
 'business': 733,
 'but': 792,
 'by': 741,
 'cabin': 378,
 'cabinet': 793,
 'cabinets': 835,
 'cable': 43,
 'cables': 832,
 'cage': 258,
 'cair': 310,
 'calendar': 492,
 'camel': 652,
 'cameras': 293,
 'can': 152,
 'candelabra': 148,
 'candle': 711,
 'candles': 208,
 'candlestick': 150,
 'candlesticks': 261,
 'cap': 149,
 'card': 308,
 'cardboard': 93,
 'carpet': 82,
 'carpets': 107,
 'carrying': 136,
 'carton': 582,
 'cartons': 551,
 'case': 505,
 'cash': 587,
 'casing': 576,
 'cd': 549,
 'ceiling': 356,
 'center': 368,
 'centerpiece': 865,
 'central': 745,
 'centre': 721,
 'certificates': 81,
 'chair': 15,
 'chairs': 696,
 'chandelier': 439,
 'chart': 413,
 'chiar': 312,
 'children': 17,
 'circle': 454,
 'circles': 405,
 'circular': 189,
 'cistern': 386,
 'cleaner': 383,
 'clock': 73,
 'clockes': 83,
 'close': 491,
 'closed': 508,
 'closer': 495,
 'closest': 719,
 'closet': 498,
 'cloth': 794,
 'clothes': 56,
 'clothing': 176,
 'cobalt': 10,
 'coffee': 525,
 'coil': 419,
 'color': 687,
 'colored': 98,
 'colorful': 459,
 'colors': 47,
 'colour': 638,
 'coloured': 202,
 'colours': 504,
 'column': 749,
 'comforter': 334,
 'comforters': 246,
 'computer': 487,
 'computers': 313,
 'conditioner': 603,
 'conditioning': 304,
 'conference': 557,
 'connected': 416,
 'connecting': 726,
 'container': 287,
 'containers': 568,
 'containing': 608,
 'content': 295,
 'control': 108,
 'controls': 450,
 'cooker': 806,
 'cooking': 75,
 'cooler': 830,
 'corck': 372,
 'cord': 145,
 'cork': 736,
 'corner': 67,
 'corridor': 249,
 'cot': 530,
 'couch': 661,
 'couches': 42,
 'counter': 210,
 'counters': 85,
 'countries': 468,
 'course': 226,
 'cover': 426,
 'covered': 244,
 'covering': 263,
 'covers': 823,
 'cpu': 872,
 'cream': 620,
 'crib': 328,
 'cup': 532,
 'cupbaord': 739,
 'cupboard': 521,
 'cupboards': 18,
 'cups': 336,
 'curtain': 565,
 'curtains': 355,
 'cushion': 465,
 'cushions': 358,
 'cutting': 254,
 'cylinder': 109,
 'dark': 681,
 'decoration': 599,
 'decorations': 351,
 'decorative': 156,
 'deoderant': 777,
 'design': 65,
 'desk': 655,
 'desks': 306,
 'details': 846,
 'detergent': 723,
 'diagonally': 232,
 'dining': 665,
 'dinner': 756,
 'dinning': 666,
 'dish': 14,
 'dishes': 396,
 'dishwasher': 309,
 'dispenser': 381,
 'dispensers': 101,
 'display': 815,
 'displayed': 224,
 'displaying': 16,
 'divider': 188,
 'do': 517,
 'does': 725,
 'dog': 427,
 'doll': 782,
 'dolls': 247,
 'domes': 314,
 'dominant': 458,
 'dominantly': 102,
 'dominating': 331,
 'door': 130,
 'doors': 678,
 'doorway': 534,
 'down': 768,
 'draw': 853,
 'drawer': 166,
 'drawers': 104,
 'drawing': 648,
 'draws': 497,
 'dress': 170,
 'dresser': 851,
 'dressers': 876,
 'drier': 58,
 'dryer': 683,
 'drying': 178,
 'dustbin': 479,
 'dvd': 869,
 'each': 267,
 'edge': 643,
 'edible': 144,
 'eggplant': 624,
 'eimage999': 266,
 'either': 729,
 'electric': 408,
 'electrical': 845,
 'electronic': 337,
 'end': 174,
 'equipment': 400,
 'equipments': 868,
 'every': 168,
 'exercise': 473,
 'exhauster': 833,
 'ext': 430,
 'extinguisher': 173,
 'facing': 695,
 'fan': 36,
 'fans': 538,
 'far': 32,
 'faucet': 132,
 'fax': 29,
 'few': 121,
 'file': 829,
 'files': 789,
 'filing': 432,
 'filled': 84,
 'fire': 572,
 'fireplace': 38,
 'first': 322,
 'fishes': 860,
 'fitted': 433,
 'fixed': 474,
 'flags': 644,
 'flat': 129,
 'flipboard': 415,
 'floor': 639,
 'flour': 251,
 'flower': 286,
 'flowers': 366,
 'flush': 822,
 'folder': 535,
 'food': 250,
 'foosball': 633,
 'foot': 423,
 'for': 717,
 'fornt': 50,
 'found': 268,
 'four': 11,
 'fourth': 186,
 'frame': 481,
 'framed': 767,
 'frames': 773,
 'frequent': 321,
 'from': 117,
 'front': 255,
 'fruit': 630,
 'fruits': 520,
 'furnitures': 540,
 'game': 682,
 'garbage': 764,
 'gate': 187,
 'gery': 125,
 'girls': 345,
 'glass': 135,
 'globe': 277,
 'gloves': 472,
 'glue': 444,
 'glued': 54,
 'gray': 467,
 'green': 212,
 'grinder': 609,
 'ground': 443,
 'guitar': 528,
 'gym': 360,
 'hair': 709,
 'hand': 629,
 'handle': 748,
 'handles': 122,
 'hang': 628,
 'hanged': 280,
 'hanger': 273,
 'hangers': 297,
 'hanging': 13,
 'has': 673,
 'hat': 585,
 'have': 387,
 'heat': 795,
 'heater': 760,
 'held': 827,
 'help': 703,
 'her': 779,
 'high': 165,
 'hight': 196,
 'hing': 671,
 'hold': 139,
 'holder': 448,
 'holders': 539,
 'holds': 222,
 'hole': 138,
 'hood': 339,
 'hooks': 7,
 'horse': 831,
 'hot': 240,
 'house': 276,
 'how': 182,
 'hung': 700,
 'id': 605,
 'if': 606,
 'image10': 394,
 'image1007': 292,
 'image1008': 864,
 'image1035': 375,
 'image1043': 694,
 'image1045': 185,
 'image1072': 672,
 'image108': 420,
 'image114': 659,
 'image116': 657,
 'image12': 392,
 'image135': 275,
 'image139': 274,
 'image379': 675,
 'image486': 175,
 'image654': 550,
 'image888': 264,
 'image912': 114,
 'image929': 445,
 'image95': 133,
 'image982': 537,
 'immediately': 685,
 'in': 602,
 'inbetweeng': 674,
 'inner': 785,
 'inside': 805,
 'instrument': 669,
 'iron': 379,
 'ironing': 162,
 'is': 597,
 'island': 750,
 'it': 598,
 'item': 49,
 'items': 100,
 'its': 233,
 'jackets': 705,
 'jeans': 548,
 'jersey': 542,
 'juice': 385,
 'juicers': 713,
 'kept': 365,
 'kettle': 821,
 'keyboard': 230,
 'keys': 626,
 'kid': 635,
 'kids': 97,
 'kind': 370,
 'kinds': 854,
 'kitchen': 527,
 'knife': 728,
 'knifes': 650,
 'knives': 377,
 'knob': 676,
 'knobs': 402,
 'ladder': 194,
 'ladders': 40,
 'lamp': 183,
 'lamps': 738,
 'lampshade': 78,
 'laptop': 654,
 'laptops': 574,
 'larg': 26,
 'large': 48,
 'largest': 618,
 'last': 529,
 'laundry': 192,
 'leaf': 716,
 'leaning': 591,
 'leather': 469,
 'left': 649,
 'leg': 88,
 'legs': 353,
 'letter': 763,
 'letters': 762,
 'light': 380,
 'lights': 79,
 'liquid': 460,
 'liquod': 493,
 'lists': 362,
 'little': 340,
 'locker': 151,
 'long': 332,
 'lying': 877,
 'machine': 181,
 'machines': 69,
 'made': 804,
 'magazines': 771,
 'main': 245,
 'maker': 641,
 'man': 204,
 'mantel': 867,
 'manu': 580,
 'many': 531,
 'map': 190,
 'markers': 722,
 'maroon': 577,
 'mat': 193,
 'mats': 288,
 'mattress': 241,
 'mauve': 707,
 'metallic': 798,
 'microwave': 12,
 'middle': 589,
 'mirror': 710,
 'mirrors': 159,
 'mixer': 753,
 'modem': 422,
 'monitor': 236,
 'monitors': 284,
 'monkey': 343,
 'more': 128,
 'most': 414,
 'mounted': 20,
 'mouse': 604,
 'mouses': 155,
 'mug': 307,
 'mugs': 446,
 'music': 123,
 'name': 262,
 'napkin': 818,
 'near': 77,
 'nect': 217,
 'nest': 627,
 'newspapers': 751,
 'next': 120,
 'night': 699,
 'not': 837,
 'notebook': 197,
 'notebooks': 239,
 'notes': 46,
 'notice': 515,
 'number': 327,
 'nyu': 180,
 'obejct': 483,
 'object': 412,
 'objects': 90,
 'of': 160,
 'off': 636,
 'oil': 279,
 'on': 744,
 'one': 329,
 'onto': 319,
 'open': 291,
 'opened': 391,
 'opposite': 409,
 'or': 759,
 'orange': 243,
 'ornamental': 57,
 'other': 844,
 'ottoman': 441,
 'outlet': 464,
 'outlets': 858,
 'outside': 622,
 'oven': 229,
 'ovens': 730,
 'packet': 484,
 'pad': 161,
 'painted': 842,
 'painting': 544,
 'pairs': 774,
 'pan': 621,
 'pane': 478,
 'paper': 227,
 'papers': 847,
 'part': 363,
 'partition': 425,
 'patterns': 299,
 'peices': 131,
 'pen': 724,
 'pencil': 820,
 'pens': 235,
 'people': 706,
 'perfume': 470,
 'person': 642,
 'pf': 564,
 'photo': 653,
 'photocopying': 143,
 'photographs': 691,
 'photos': 575,
 'piano': 141,
 'picture': 852,
 'pictures': 500,
 'piece': 814,
 'pieces': 164,
 'pilar': 870,
 'pile': 834,
 'piled': 755,
 'pillar': 693,
 'pillars': 668,
 'pillow': 106,
 'pillows': 855,
 'ping': 179,
 'pink': 172,
 'pinned': 348,
 'pipe': 429,
 'place': 318,
 'placed': 499,
 'plant': 191,
 'plants': 581,
 'plastic': 347,
 'plate': 457,
 'plates': 619,
 'play': 406,
 'player': 146,
 'playing': 218,
 'plug': 743,
 'plugged': 841,
 'plum': 154,
 'pockets': 758,
 'pong': 171,
 'pool': 863,
 'portion': 875,
 'post': 740,
 'poster': 791,
 'posters': 861,
 'pot': 547,
 'pots': 326,
 'pouch': 140,
 'predominantly': 399,
 'present': 579,
 'printed': 417,
 'printer': 407,
 'printers': 453,
 'projector': 118,
 'pull': 803,
 'puncher': 596,
 'purple': 874,
 'put': 305,
 'rack': 316,
 'racket': 490,
 'racks': 203,
 'railing': 61,
 'reception': 878,
 'rectangle': 455,
 'rectangular': 68,
 'red': 298,
 'reflected': 252,
 'reflection': 637,
 'refridgerator': 184,
 'refrigerator': 421,
 'remote': 866,
 'rest': 656,
 'rested': 859,
 'rests': 99,
 'righ': 302,
 'right': 704,
 'rocking': 578,
 'rod': 8,
 'rods': 169,
 'roll': 850,
 'rolled': 518,
 'rolling': 769,
 'rolls': 776,
 'roof': 536,
 'room': 39,
 'rotates': 802,
 'round': 55,
 'row': 19,
 'rows': 778,
 'rug': 735,
 'rulers': 600,
 'runner': 384,
 'runners': 754,
 's': 615,
 'saffron': 840,
 'salt': 431,
 'sandal': 664,
 'saucer': 714,
 'says': 59,
 'scanner': 819,
 'scattered': 437,
 'scissor': 397,
 'scissors': 281,
 'screen': 519,
 'sculptures': 207,
 'seat': 485,
 'seater': 523,
 'seats': 482,
 'second': 64,
 'section': 74,
 'sections': 788,
 'see': 486,
 'seen': 388,
 'self': 398,
 'shade': 214,
 'shades': 103,
 'shadow': 686,
 'shaker': 731,
 'shape': 658,
 'shaped': 690,
 'shaver': 511,
 'sheet': 342,
 'sheets': 418,
 'shelf': 359,
 'shelve': 153,
 'shelves': 311,
 'shirt': 610,
 'shirts': 66,
 'shoe': 440,
 'shoes': 223,
 'show': 434,
 'shower': 376,
 'shutter': 715,
 'side': 272,
 'sides': 862,
 'sidetable': 289,
 'silver': 259,
 'sink': 91,
 'sinks': 87,
 'sitting': 31,
 'slippers': 593,
 'small': 53,
 'smallest': 507,
 'soap': 403,
 'soaps': 338,
 'socks': 766,
 'sofa': 692,
 'sofas': 611,
 'soft': 701,
 'solor': 237,
 'spatula': 838,
 'speaker': 616,
 'speakers': 393,
 'spice': 801,
 'splits': 584,
 'spoon': 35,
 'spot': 195,
 'spray': 524,
 'spread': 670,
 'sprinklers': 702,
 'square': 770,
 'squer': 296,
 'squere': 594,
 'sre': 62,
 'stack': 634,
 'stacked': 37,
 'stacks': 742,
 'stair': 447,
 'stairs': 367,
 'stamp': 561,
 'stand': 799,
 'standing': 116,
 'stands': 72,
 'stapler': 41,
 'staplers': 826,
 'statue': 632,
 'steel': 369,
 'step': 737,
 'steps': 697,
 'sticker': 127,
 'sticks': 775,
 'sticky': 824,
 'stnad': 503,
 'stool': 76,
 'stools': 680,
 'storage': 839,
 'store': 354,
 'stove': 27,
 'striped': 466,
 'stuck': 784,
 'student': 325,
 'students': 856,
 'suitcase': 813,
 'surround': 167,
 'surrounded': 5,
 'surrounding': 63,
 'suspended': 333,
 'sweatshirts': 410,
 'switch': 211,
 'switchboard': 796,
 'switchboards': 586,
 'switched': 6,
 'switches': 221,
 'swivel': 163,
 'symbols': 225,
 'system': 346,
 't': 698,
 'table': 553,
 'tablecloth': 201,
 'tables': 663,
 'tall': 216,
 'tallest': 567,
 'tank': 583,
 'tap': 158,
 'tape': 452,
 'tea': 60,
 'teacher': 94,
 'teddy': 560,
 'telelvision': 290,
 'telephon': 761,
 'telephone': 134,
 'telephones': 213,
 'television': 461,
 'televisions': 157,
 'text': 435,
 'th': 24,
 'that': 357,
 'the': 646,
 'their': 349,
 'them': 248,
 'there': 324,
 'thing': 317,
 'things': 612,
 'third': 371,
 'this': 147,
 'through': 228,
 'tier': 199,
 'tin': 843,
 'tissue': 601,
 'to': 23,
 'toaster': 613,
 'together': 871,
 'toilet': 816,
 'toiletries': 209,
 'tooth': 428,
 'toothbrush': 522,
 'toothbrushes': 105,
 'toothpaste': 562,
 'top': 344,
 'topmost': 442,
 'towel': 689,
 'towels': 552,
 'toy': 341,
 'toys': 113,
 'trampoline': 734,
 'trash': 849,
 'tray': 271,
 'trays': 555,
 'treadmill': 727,
 'tree': 373,
 'tricycle': 44,
 'trollies': 115,
 'tshirtsg': 570,
 'tub': 526,
 'tube': 879,
 'tumblers': 780,
 'tupperware': 857,
 'tv': 22,
 'two': 119,
 'type': 126,
 'umbrella': 300,
 'under': 25,
 'up': 808,
 'us': 809,
 'used': 315,
 'utensil': 301,
 'utensils': 825,
 'vacuum': 684,
 'varieties': 206,
 'variety': 836,
 'vase': 34,
 'vases': 198,
 'vegetables': 781,
 'vent': 543,
 'view': 476,
 'viisble': 294,
 'violet': 489,
 'visible': 111,
 'waht': 662,
 'wall': 617,
 'walls': 253,
 'wardrobe': 364,
 'wash': 330,
 'washing': 256,
 'water': 89,
 'way': 786,
 'wearing': 456,
 'weighing': 516,
 'what': 71,
 'whats': 231,
 'where': 475,
 'which': 401,
 'while': 590,
 'whit': 283,
 'white': 352,
 'whiteboard': 752,
 'who': 732,
 'window': 242,
 'windows': 810,
 'windowsill': 21,
 'wine': 220,
 'wire': 496,
 'wired': 631,
 'wires': 573,
 'with': 800,
 'woman': 30,
 'women': 142,
 'wood': 451,
 'wooden': 45,
 'world': 747,
 'writing': 554,
 'written': 449,
 'yellow': 9,
 'you': 848}

In addition, we use a few special symbols that don't occur in the training dataset. The most important ones are $<pad>$ and $<unk>$. We will use the former to pad sequences so that they have the same number of temporal elements; we will use the latter for words (at test time) that don't exist in the training dataset.

Armed with the vocabulary, we can build a one-hot representation of the training data. However, this is not necessary and may even be wasteful. Our one-hot representation of the input text doesn't explicitly build long vectors; instead it operates on indices. The example above would be encoded as [0,1,4,2,7,3].

Can you prove the equivalence stated in the claim below?

Claim:

Let $x$ be a binary vector with exactly one value $1$ at position $index$, that is $x[index]=1$. Then $$W[:,index] = Wx$$ where $W[:,b]$ denotes a vector built from a column $b$ of $W$.
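The claim is also easy to check numerically. Below is a small sketch (with a random matrix, used purely for illustration) that compares the matrix-vector product against the corresponding column lookup:

In [ ]:
import numpy as np

num_words, embedding_dim = 7, 4
W = np.random.rand(embedding_dim, num_words)  # one column per word

index = 3
x = np.zeros(num_words)
x[index] = 1.0                                # one-hot vector with a 1 at position `index`

# multiplying by a one-hot vector simply selects the corresponding column of W
print(np.allclose(W.dot(x), W[:, index]))     # True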


In [10]:
from kraino.utils.input_output_space import encode_questions_index
one_hot_x = encode_questions_index(train_raw_x, word2index_x)
print(train_raw_x[:3])
print(one_hot_x[:3])


['what is on the right side of the black telephone and on the left side of the red chair ?', 'what is in front of the white door on the left side of the desk ?', 'what is on the desk ?']
[[71, 597, 744, 646, 704, 272, 160, 646, 124, 134, 382, 744, 646, 649, 272, 160, 646, 298, 15, 52, 3], [71, 597, 602, 255, 160, 646, 352, 130, 744, 646, 649, 272, 160, 646, 655, 52, 3], [71, 597, 744, 646, 655, 52, 3]]

As we can see, the sequences have different lengths. We will pad them so that they all have the same length $MAXLEN$.


In [11]:
# We use another framework that is useful to build deep learning models - Keras
from keras.preprocessing import sequence
MAXLEN=30
train_x = sequence.pad_sequences(one_hot_x, maxlen=MAXLEN)
train_x[:3]


Out[11]:
array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,  71, 597, 744, 646,
        704, 272, 160, 646, 124, 134, 382, 744, 646, 649, 272, 160, 646,
        298,  15,  52,   3],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         71, 597, 602, 255, 160, 646, 352, 130, 744, 646, 649, 272, 160,
        646, 655,  52,   3],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  71, 597, 744,
        646, 655,  52,   3]], dtype=int32)

And do the same with the answers.


In [14]:
# for simplicity, we consider only the first answer word; that is, if the answer is 'knife,fork' we encode only 'knife'
# note, however, that the tutorial supports multiple-word answers (is_only_first_answer_word=False)
# with MAX_ANSWER_TIME_STEPS defining the number of answer words (ignored if is_only_first_answer_word=True)
MAX_ANSWER_TIME_STEPS=10

from kraino.utils.input_output_space import encode_answers_one_hot
train_raw_y = train_text_representation['y']
wordcount_y = frequencies(' '.join(train_raw_y).replace(', ',',').split(' '))
word2index_y, index2word_y = build_vocabulary(this_wordcount=wordcount_y)
train_y, _ = encode_answers_one_hot(
    train_raw_y, 
    word2index_y, 
    answer_words_delimiter=train_text_representation['answer_words_delimiter'],
    is_only_first_answer_word=True,
    max_answer_time_steps=MAX_ANSWER_TIME_STEPS)
print(train_x.shape)
print(train_y.shape)
# word2index_y


(6795, 30)
(6795, 972)

Finally, we can also encode the test questions. We need them later to see how well our models generalise to new (question, answer, image) triplets. Remember, however, that we should use the vocabulary generated from the training samples.

Why should we use the training vocabulary to encode test questions?
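As a hint, here is a plain-Python sketch of how the index encoding treats a word that is missing from the training vocabulary (this only mimics the idea; it is not Kraino's exact implementation). The made-up word 'zebra' stands in for any unseen test word:

In [ ]:
# a hypothetical test question containing a word that does not occur in the training data
unseen_question = 'what is behind the zebra ?'
encoded = [word2index_x.get(w, word2index_x['<unk>']) for w in unseen_question.split()]
print(encoded)  # 'zebra' is encoded as 1, the index of <unk>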

In [15]:
test_text_representation = dp['text'](train_or_test='test')
test_raw_x = test_text_representation['x']
test_one_hot_x = encode_questions_index(test_raw_x, word2index_x)
test_x = sequence.pad_sequences(test_one_hot_x, maxlen=MAXLEN)
print_list(test_raw_x[:3])
test_x[:3]


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-f9c5c8c52ceb> in <module>()
      3 test_one_hot_x = encode_questions_index(test_raw_x, word2index_x)
      4 test_x = sequence.pad_sequences(test_one_hot_x, maxlen=MAXLEN)
----> 5 print_list(test_raw_x[:3])
      6 test_x[:3]

NameError: name 'print_list' is not defined

With encoded question-answer pairs we finish the first section. But before delving into building and training models, let's have a look at a summary to see the bigger picture.

Summary

We started from the raw questions of the training set and used them to build a vocabulary. Next, we encoded the questions into sequences of one-hot vectors (indices) based on that vocabulary. Finally, we used the same vocabulary to encode the questions from the test set; if a word is absent from the vocabulary, we use the extra token $<unk>$ to encode this fact (we encode the $<unk>$ token, not the word).

Language Only

Training - overall picture

Ok. We already have the textual features built. Let's create some models that we will train and later use to answer questions about images.

As you may already know, we train models by updating their weights. Let $x$ and $y$ be a training sample (input and output), and let $\ell(x,y;w)$ be an objective function. The weight update formula is: $$w := w - \alpha \nabla \ell(x,y; w)$$ where $\alpha$ is the learning rate (a hyper-parameter that must be set in advance), and $\nabla \ell(x,y;w)$ is the gradient of the objective with respect to the weights $w$. The rule shown above is called the SGD update, but other variants are also possible. In fact, we will use a variant called ADAM.
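To see the update rule in action, here is a minimal numpy sketch (a toy one-dimensional linear model with a squared loss, not our question answering model) that performs the SGD update by hand:

In [ ]:
import numpy as np

# toy data: y is roughly 2*x; we want to recover the slope w
xs = np.random.rand(100)
ys = 2.0 * xs + 0.01 * np.random.randn(100)

w = 0.0       # initial weight
alpha = 0.1   # learning rate (a hyper-parameter)
for x_i, y_i in zip(xs, ys):
    grad = 2.0 * (w * x_i - y_i) * x_i  # gradient of the squared loss (w*x - y)^2 wrt w
    w = w - alpha * grad                # the SGD update: w := w - alpha * gradient
print(w)  # should end up close to 2.0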

We cast the question answering problem into a classification framework, so that we classify the input $x$ into some class that represents an answer word. Therefore, we use the cross-entropy (logistic regression) objective, popular in classification: $$\ell(x,y;w):=-\sum_{y' \in \mathcal{C}} \mathbb{1}\{y'=y\}\log p(y'\;|\;x,w)$$ where $\mathcal{C}$ is the set of all classes (note the minus sign: we minimise the negative log-likelihood), and for $p(y\;|\;x,w)$ we use the softmax: $e^{w^y\phi(x)} / \sum_{z}e^{w^z\phi(x)}$. Here $\phi(x)$ denotes the output of a model (more precisely, it's often a neural network's response to the input, just before the softmax).
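The following sketch (with made-up scores for three hypothetical answer classes) computes the softmax distribution and the corresponding cross-entropy term for a single training example:

In [ ]:
import numpy as np

def softmax(scores):
    # subtracting the maximum improves numerical stability and does not change the result
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # plays the role of w^y phi(x) for three answer classes
probs = softmax(scores)
true_class = 0
loss = -np.log(probs[true_class])   # the cross-entropy objective for this example
print(probs)
print(loss)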

The training can be formalised (and automated) so that you only need to execute a procedure that looks something like this:

training(gradient_of_the_model, optimizer='Adam')

Summary

Given a model and an optimisation procedure (SGD, Adam, etc.), all we need is to compute the gradient of the objective $\nabla \ell(x,y;w)$ with respect to the parameters $w$, and then plug it into the optimisation procedure.

Theano

Since computing gradients $\nabla \ell(x,y;w)$ may quickly become tedious, especially for more complex models, we look for tools that can automate this step as well. Imagine that you build some model $M$ and get its gradient $\nabla M$ by just executing

nabla_M = compute_gradient_symbolically(M,x,y)

This would definitely speed up prototyping.

Theano is such a tool, specifically tailored to work with deep learning models. For a broader understanding of Theano, you can check this nice tutorial.

The programming example in the cell below defines ReLU, a popular activation function, and computes its derivative using Theano. However, with this example we only scratch the surface.

Assume ReLU is defined as follows $ReLU(x) = \max(x,0)$.

What is the gradient of ReLU? Consider two cases.

By the way, ReLU is not differentiable everywhere (consider $x=0$), so technically we are computing a subgradient - this is still fine for Theano.


In [16]:
import theano
import theano.tensor as T

# Theano is using symbolic calculations, so we need to first create symbolic variables
theano_x = T.scalar()
# we define a relationship between a symbolic input and a symbolic output
theano_y = T.maximum(0,theano_x)
# now it's time for a symbolic gradient wrt. to symbolic variable x
theano_nabla_y = T.grad(theano_y, theano_x)

# we can see that both variables are symbolic, they don't have any numerical values
print(theano_x)
print(theano_y)
print(theano_nabla_y)

# theano.function compiles the symbolic representation of the network
theano_f_x = theano.function([theano_x], theano_y)
print(theano_f_x(3))
print(theano_f_x(-3))
# and now for gradients

nabla_f_x = theano.function([theano_x], theano_nabla_y)
print(nabla_f_x(3))
print(nabla_f_x(-3))


Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 0: Tesla K40m (CNMeM is disabled, cuDNN 5004)
WARNING (theano.sandbox.cuda.opt): Optimization Warning: Got the following error, but you can ignore it. This could cause less GpuElemwise fused together.
CudaNdarray_ptr_int_size: error when calling the gpu code. (invalid device function)
<TensorType(float32, scalar)>
Elemwise{maximum,no_inplace}.0
Elemwise{mul}.0
3.0
0.0
1.0
0.0

Summary

To compute gradient symbolically, we can use Theano.

Keras

Keras builds on top of Theano and significantly simplifies creating and training new models, effectively speeding up prototyping even further. Keras also abstracts away some of the technical burden, such as symbolic variable creation. Metaphorically, while Theano can be seen as the deep learning equivalent of assembler, Keras is more like Java :)


In [18]:
# we sample from noisy x^2 function
from numpy import asarray
from numpy import random
def myfun(x):
    return x*x

NUM_SAMPLES = 10000
HIGH_VALUE=10
keras_x = asarray(random.randint(low=0, high=HIGH_VALUE, size=NUM_SAMPLES))
keras_noise = random.normal(loc=0.0, scale=0.1, size=NUM_SAMPLES)
keras_noise = asarray([max(x,0) for x in keras_noise])
keras_y = asarray([myfun(x) + n for x,n in zip(keras_x, keras_noise)])
# print('X:')
# print(keras_x)
# print('Noise')
# print(keras_noise)
# print('Noisy X^2:')
# print(keras_y)

keras_x = keras_x.reshape(keras_x.shape[0],1)
keras_y = keras_y.reshape(keras_y.shape[0],1)

# import keras packages
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

# build a regression network
KERAS_NUM_HIDDEN = 150
KERAS_NUM_HIDDEN_SECOND = 150
KERAS_NUM_HIDDEN_THIRD = 150
KERAS_DROPOUT_FRACTION = 0.5
m = Sequential()
m.add(Dense(KERAS_NUM_HIDDEN, input_dim=1))
m.add(Activation('relu'))
m.add(Dropout(KERAS_DROPOUT_FRACTION))
#TODO: add one more layer
# m.add(Dense(KERAS_NUM_HIDDEN_SECOND))
# m.add(Activation('relu'))
# m.add(Dropout(KERAS_DROPOUT_FRACTION))
#TODO: add one more layer
# m.add(Dense(KERAS_NUM_HIDDEN_THIRD))
# m.add(Activation('relu'))
# m.add(Dropout(KERAS_DROPOUT_FRACTION))
m.add(Dense(1))

# compile and fit
m.compile(loss='mse', optimizer='adam')
m.fit(keras_x, keras_y, nb_epoch=100, batch_size=250)

keras_x_predict = asarray([1,3,6,12,HIGH_VALUE+10])
keras_x_predict = keras_x_predict.reshape(keras_x_predict.shape[0],1)
keras_predictions = m.predict(keras_x_predict)
print("{0:>10}{1:>10}{2:>10}".format('X', 'Y', 'GT'))
for x,y in zip(keras_x_predict, keras_predictions):
    print("{0:>10}{1:>10.2f}{2:>10}".format(x[0], y[0], myfun(x[0])))


Epoch 1/100
10000/10000 [==============================] - 0s - loss: 1430.5141     
Epoch 2/100
10000/10000 [==============================] - 0s - loss: 1145.2346     
Epoch 3/100
10000/10000 [==============================] - 0s - loss: 850.6267     
Epoch 4/100
10000/10000 [==============================] - 0s - loss: 565.8305     
Epoch 5/100
10000/10000 [==============================] - 0s - loss: 346.8505     
Epoch 6/100
10000/10000 [==============================] - 0s - loss: 217.7273     
Epoch 7/100
10000/10000 [==============================] - 0s - loss: 163.6030     
Epoch 8/100
10000/10000 [==============================] - 0s - loss: 150.3268     
Epoch 9/100
10000/10000 [==============================] - 0s - loss: 140.7599     
Epoch 10/100
10000/10000 [==============================] - 0s - loss: 137.4868     
Epoch 11/100
10000/10000 [==============================] - 0s - loss: 130.3813     
Epoch 12/100
10000/10000 [==============================] - 0s - loss: 127.8312     
Epoch 13/100
10000/10000 [==============================] - 0s - loss: 125.1653     
Epoch 14/100
10000/10000 [==============================] - 0s - loss: 120.6361     
Epoch 15/100
10000/10000 [==============================] - 0s - loss: 113.2545     
Epoch 16/100
10000/10000 [==============================] - 0s - loss: 108.8070     
Epoch 17/100
10000/10000 [==============================] - 0s - loss: 107.1277     
Epoch 18/100
10000/10000 [==============================] - 0s - loss: 103.7616     
Epoch 19/100
10000/10000 [==============================] - 0s - loss: 99.4801     
Epoch 20/100
10000/10000 [==============================] - 0s - loss: 98.4273     
Epoch 21/100
10000/10000 [==============================] - 0s - loss: 95.5986     
Epoch 22/100
10000/10000 [==============================] - 0s - loss: 89.4088     
Epoch 23/100
10000/10000 [==============================] - 0s - loss: 89.0551     
Epoch 24/100
10000/10000 [==============================] - 0s - loss: 84.7646     
Epoch 25/100
10000/10000 [==============================] - 0s - loss: 82.3652     
Epoch 26/100
10000/10000 [==============================] - 0s - loss: 80.0207     
Epoch 27/100
10000/10000 [==============================] - 0s - loss: 78.4248     
Epoch 28/100
10000/10000 [==============================] - 0s - loss: 74.7715     
Epoch 29/100
10000/10000 [==============================] - 0s - loss: 73.6129     
Epoch 30/100
10000/10000 [==============================] - 0s - loss: 71.8952     
Epoch 31/100
10000/10000 [==============================] - 0s - loss: 70.8732     
Epoch 32/100
10000/10000 [==============================] - 0s - loss: 69.9281     
Epoch 33/100
10000/10000 [==============================] - 0s - loss: 66.7601     
Epoch 34/100
10000/10000 [==============================] - 0s - loss: 68.5129     
Epoch 35/100
10000/10000 [==============================] - 0s - loss: 64.2157     
Epoch 36/100
10000/10000 [==============================] - 0s - loss: 64.0088     
Epoch 37/100
10000/10000 [==============================] - 0s - loss: 62.1084     
Epoch 38/100
10000/10000 [==============================] - 0s - loss: 60.4780     
Epoch 39/100
10000/10000 [==============================] - 0s - loss: 58.8717     
Epoch 40/100
10000/10000 [==============================] - 0s - loss: 59.4814     
Epoch 41/100
10000/10000 [==============================] - 0s - loss: 57.8463     
Epoch 42/100
10000/10000 [==============================] - 0s - loss: 56.6349     
Epoch 43/100
10000/10000 [==============================] - 0s - loss: 53.3060     
Epoch 44/100
10000/10000 [==============================] - 0s - loss: 53.8152     
Epoch 45/100
10000/10000 [==============================] - 0s - loss: 53.2868     
Epoch 46/100
10000/10000 [==============================] - 0s - loss: 51.8249     
Epoch 47/100
10000/10000 [==============================] - 0s - loss: 51.4696     
Epoch 48/100
10000/10000 [==============================] - 0s - loss: 49.4345     
Epoch 49/100
10000/10000 [==============================] - 0s - loss: 49.5804     
Epoch 50/100
10000/10000 [==============================] - 0s - loss: 46.1419     
Epoch 51/100
10000/10000 [==============================] - 0s - loss: 46.0174     
Epoch 52/100
10000/10000 [==============================] - 0s - loss: 45.5138     
Epoch 53/100
10000/10000 [==============================] - 0s - loss: 43.0275     
Epoch 54/100
10000/10000 [==============================] - 0s - loss: 44.4757     
Epoch 55/100
10000/10000 [==============================] - 0s - loss: 43.1267     
Epoch 56/100
10000/10000 [==============================] - 0s - loss: 43.1435     
Epoch 57/100
10000/10000 [==============================] - 0s - loss: 42.6600     
Epoch 58/100
10000/10000 [==============================] - 0s - loss: 40.1952     
Epoch 59/100
10000/10000 [==============================] - 0s - loss: 39.9561     
Epoch 60/100
10000/10000 [==============================] - 0s - loss: 37.6587     
Epoch 61/100
10000/10000 [==============================] - 0s - loss: 38.5544     
Epoch 62/100
10000/10000 [==============================] - 0s - loss: 38.8892     
Epoch 63/100
10000/10000 [==============================] - 0s - loss: 37.9775     
Epoch 64/100
10000/10000 [==============================] - 0s - loss: 36.9681     
Epoch 65/100
10000/10000 [==============================] - 0s - loss: 37.5887     
Epoch 66/100
10000/10000 [==============================] - 0s - loss: 35.8916     
Epoch 67/100
10000/10000 [==============================] - 0s - loss: 34.5924     
Epoch 68/100
10000/10000 [==============================] - 0s - loss: 35.3941     
Epoch 69/100
10000/10000 [==============================] - 0s - loss: 34.5986     
Epoch 70/100
10000/10000 [==============================] - 0s - loss: 34.1270     
Epoch 71/100
10000/10000 [==============================] - 0s - loss: 33.3335     
Epoch 72/100
10000/10000 [==============================] - 0s - loss: 34.9812     
Epoch 73/100
10000/10000 [==============================] - 0s - loss: 33.6671     
Epoch 74/100
10000/10000 [==============================] - 0s - loss: 32.1354     
Epoch 75/100
10000/10000 [==============================] - 0s - loss: 32.4637     
Epoch 76/100
10000/10000 [==============================] - 0s - loss: 32.2159     
Epoch 77/100
10000/10000 [==============================] - 0s - loss: 31.7502     
Epoch 78/100
10000/10000 [==============================] - 0s - loss: 31.5369     
Epoch 79/100
10000/10000 [==============================] - 0s - loss: 31.7230     
Epoch 80/100
10000/10000 [==============================] - 0s - loss: 30.7868     
Epoch 81/100
10000/10000 [==============================] - 0s - loss: 31.0717     
Epoch 82/100
10000/10000 [==============================] - 0s - loss: 30.1934     
Epoch 83/100
10000/10000 [==============================] - 0s - loss: 30.4590     
Epoch 84/100
10000/10000 [==============================] - 0s - loss: 31.4997     
Epoch 85/100
10000/10000 [==============================] - 0s - loss: 28.4967     
Epoch 86/100
10000/10000 [==============================] - 0s - loss: 30.0666     
Epoch 87/100
10000/10000 [==============================] - 0s - loss: 29.2898     
Epoch 88/100
10000/10000 [==============================] - 0s - loss: 29.7264     
Epoch 89/100
10000/10000 [==============================] - 0s - loss: 29.2317     
Epoch 90/100
10000/10000 [==============================] - 0s - loss: 28.1746     
Epoch 91/100
10000/10000 [==============================] - 0s - loss: 29.5667     
Epoch 92/100
10000/10000 [==============================] - 0s - loss: 29.0125     
Epoch 93/100
10000/10000 [==============================] - 0s - loss: 28.0620     
Epoch 94/100
10000/10000 [==============================] - 0s - loss: 27.6503     
Epoch 95/100
10000/10000 [==============================] - 0s - loss: 28.5997     
Epoch 96/100
10000/10000 [==============================] - 0s - loss: 27.9306     
Epoch 97/100
10000/10000 [==============================] - 0s - loss: 27.1531     
Epoch 98/100
10000/10000 [==============================] - 0s - loss: 27.0039     
Epoch 99/100
10000/10000 [==============================] - 0s - loss: 27.0244     
Epoch 100/100
10000/10000 [==============================] - 0s - loss: 27.4526     
         X         Y        GT
         1      0.76         1
         3      8.03         9
         6     37.11        36
        12    116.20       144
        20    221.47       400

You can play with the example above.

  • What happens if you add one more layer? Or two more layers?
  • What happens if you change the hidden size?
  • What happens if you use more or fewer samples?

Models

For the purpose of the Visual Turing Test and this tutorial, we have compiled a light framework that builds on top of Keras and simplifies building and training question answering machines. Following the tradition of using fancy Greek names, we call it Kraino. Note that you have already seen some parts of Kraino, such as the data provider.

In the following, we will go through BOW and RNN approaches to answering questions about images, but, surprisingly, without the images. It turns out that a substantial fraction of questions can be answered without access to the image, by resorting to common sense (or the statistics of the dataset). For instance: what can be placed at the table? How many eyes does this human have? Answers like 'chair' and '2' are quite likely to be good answers.

Please make sure that all the cells from Datasets and Textual Features have been executed.
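As a quick illustration of how far pure dataset statistics can go, here is a minimal sketch (reusing train_raw_y and frequencies from the cells above) of a baseline that predicts the single most frequent training answer for every question:

In [ ]:
# the single most frequent answer in the training set
answer_counts = frequencies(train_raw_y)
most_frequent_answer, count = max(answer_counts.items(), key=lambda kv: kv[1])
print(most_frequent_answer)
print(count)
# predicting this answer for every test question is a simple, language-free baseline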

BOW

The figure below illustrates the BOW (Bag of Words) approach. As we have already seen in Textual Features, we first encode the input sentence into one-hot vector representations. Such a (very) sparse representation is next embedded into a denser space by a matrix $W_e$. Next, the denser representations are summed up and classified via a softmax. Notice that, if $W_e$ were an identity matrix, we would obtain a histogram of word occurrences.
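Below is a minimal numpy sketch of the BOW pipeline described above (random weights and a simplified classifier, for illustration only; it is not Kraino's implementation): look up the word embeddings, sum them, and classify the result with a softmax.

In [ ]:
import numpy as np

vocabulary_size, embedding_dim, num_answers = 880, 500, 972
W_e = 0.01 * np.random.randn(vocabulary_size, embedding_dim)  # word embeddings, one row per word
W_c = 0.01 * np.random.randn(embedding_dim, num_answers)      # softmax classifier weights

question_indices = [71, 597, 744, 646, 655, 52, 3]  # 'what is on the desk ?' encoded as before
bow = W_e[question_indices].sum(axis=0)             # sum of word embeddings; word order is ignored

scores = bow.dot(W_c)
probs = np.exp(scores - scores.max())
probs /= probs.sum()                                # softmax over answer words
print(probs.argmax())                               # index of the predicted answer word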

What is your biggest complaint about such a BOW representation? What happens if instead of 
'What is behind the table' we had 'is What the behind table'? How does the BOW representation change?


In [19]:
#== Model definition

# First we define a model using keras/kraino
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.core import TimeDistributedMerge
from keras.layers.embeddings import Embedding

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class BlindBOW(AbstractSequentialModel, AbstractSingleAnswer):
    """
    BOW Language only model that produces single word answers.
    """
    def create(self):
        self.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        self.add(LambdaWithMask(time_distributed_masked_ave, output_shape=[self.output_shape[2]]))
        self.add(DropMask())
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))

In [22]:
model_config = Config(
    textual_embedding_dim=500,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()))
model = BlindBOW(model_config)
model.create()

model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
text_bow_model = model

In [23]:
#== Model training
text_bow_model.fit(
    train_x, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)


Train on 6115 samples, validate on 680 samples
Epoch 1/40
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-23-52e919d865ac> in <module>()
      6     nb_epoch=40,
      7     validation_split=0.1,
----> 8     show_accuracy=True)

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Keras-0.3.3-py2.7.egg/keras/models.pyc in fit(self, X, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, show_accuracy, class_weight, sample_weight)
    699                          verbose=verbose, callbacks=callbacks,
    700                          val_f=val_f, val_ins=val_ins,
--> 701                          shuffle=shuffle, metrics=metrics)
    702 
    703     def predict(self, X, batch_size=128, verbose=0):

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Keras-0.3.3-py2.7.egg/keras/models.pyc in _fit(self, f, ins, out_labels, batch_size, nb_epoch, verbose, callbacks, val_f, val_ins, shuffle, metrics)
    315                 batch_logs['size'] = len(batch_ids)
    316                 callbacks.on_batch_begin(batch_index, batch_logs)
--> 317                 outs = f(ins_batch)
    318                 if type(outs) != list:
    319                     outs = [outs]

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Keras-0.3.3-py2.7.egg/keras/backend/theano_backend.pyc in __call__(self, inputs)
    444     def __call__(self, inputs):
    445         assert type(inputs) in {list, tuple}
--> 446         return self.function(*inputs)
    447 
    448 

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Theano-0.9.0.dev0-py2.7.egg/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    906                     node=self.fn.nodes[self.fn.position_of_error],
    907                     thunk=thunk,
--> 908                     storage_map=getattr(self.fn, 'storage_map', None))
    909             else:
    910                 # old-style linkers raise their own exceptions

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Theano-0.9.0.dev0-py2.7.egg/theano/gof/link.pyc in raise_with_op(node, thunk, exc_info, storage_map)
    312         # extra long error message in that case.
    313         pass
--> 314     reraise(exc_type, exc_value, exc_trace)
    315 
    316 

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Theano-0.9.0.dev0-py2.7.egg/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    893         try:
    894             outputs =\
--> 895                 self.fn() if output_subset is None else\
    896                 self.fn(output_subset=output_subset)
    897         except Exception:

RuntimeError: Cuda error: CudaNdarray_TakeFrom: invalid device function.

Apply node that caused the error: GpuAdvancedSubtensor1(GpuElemwise{mul,no_inplace}.0, Elemwise{Cast{int64}}.0)
Toposort index: 40
Inputs types: [CudaNdarrayType(float32, matrix), TensorType(int64, vector)]
Inputs shapes: [(880, 500), (15360,)]
Inputs strides: [(500, 1), (8,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[GpuReshape{3}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

Recurrent Neural Network

Although BOW is working pretty well, there is still something very disturbing about this approach. Consider the following question:


In [ ]:
train_raw_x[0]

If we swap 'chair' with 'telephone' in the question, we would get a different meaning, wouldn't we? Recurrent Neural Networks (RNNs) have been developed to mitigate this issue by directly processing the time series. As the figure below illustrates, the (temporally) first word's embedding is given to an RNN unit. The RNN unit 'processes' this embedding and passes its output to the second RNN unit. The second unit takes both the output of the first RNN unit and the second word's embedding as inputs, and outputs some algebraic combination of both. And so on. The last recurrent unit builds the representation of the whole sequence, and its output is given to a softmax for classification. One of the challenges that such approaches have to deal with is keeping long-term dependencies: roughly speaking, as new inputs arrive, it becomes easier to 'forget' information from the beginning of the sequence. LSTM and GRU are two particularly successful recurrent neural networks that can preserve such longer dependencies to some degree.
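Here is a minimal numpy sketch of a vanilla recurrent unit processing a sequence of word embeddings (random weights, for illustration; LSTM and GRU add gating mechanisms on top of this basic recurrence):

In [ ]:
import numpy as np

embedding_dim, hidden_dim = 500, 500
W_xh = 0.01 * np.random.randn(embedding_dim, hidden_dim)  # input-to-hidden weights
W_hh = 0.01 * np.random.randn(hidden_dim, hidden_dim)     # hidden-to-hidden (recurrent) weights

word_embeddings = np.random.randn(7, embedding_dim)       # embeddings of a 7-word question
h = np.zeros(hidden_dim)                                  # initial hidden state
for x_t in word_embeddings:
    # each step combines the current word with the summary of all previous words
    h = np.tanh(x_t.dot(W_xh) + h.dot(W_hh))

# h now summarises the whole question and is what gets fed to the softmax classifier
print(h.shape)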

Note: If the code below does not compile, please restart the notebook and run only the Datasets and Textual Features sections. In particular, don't run BOW.


In [24]:
#== Model definition

# First we define a model using keras/kraino
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.layers.recurrent import LSTM

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class BlindRNN(AbstractSequentialModel, AbstractSingleAnswer):
    """
    RNN Language only model that produces single word answers.
    """
    def create(self):
        self.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        #TODO: Replace averaging with RNN (you can choose between LSTM and GRU)
#         self.add(LambdaWithMask(time_distributed_masked_ave, output_shape=[self.output_shape[2]]))
        self.add(GRU(self._config.hidden_state_dim, 
                      return_sequences=False))
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))

In [25]:
model_config = Config(
    textual_embedding_dim=500,
    hidden_state_dim=500,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()))
model = BlindRNN(model_config)
model.create()
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
text_rnn_model = model

In [26]:
#== Model training
text_rnn_model.fit(
    train_x, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)


Train on 6115 samples, validate on 680 samples
Epoch 1/40
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-c93cbd64b6a3> in <module>()
      6     nb_epoch=40,
      7     validation_split=0.1,
----> 8     show_accuracy=True)

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Keras-0.3.3-py2.7.egg/keras/models.pyc in fit(self, X, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, show_accuracy, class_weight, sample_weight)
    699                          verbose=verbose, callbacks=callbacks,
    700                          val_f=val_f, val_ins=val_ins,
--> 701                          shuffle=shuffle, metrics=metrics)
    702 
    703     def predict(self, X, batch_size=128, verbose=0):

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Keras-0.3.3-py2.7.egg/keras/models.pyc in _fit(self, f, ins, out_labels, batch_size, nb_epoch, verbose, callbacks, val_f, val_ins, shuffle, metrics)
    315                 batch_logs['size'] = len(batch_ids)
    316                 callbacks.on_batch_begin(batch_index, batch_logs)
--> 317                 outs = f(ins_batch)
    318                 if type(outs) != list:
    319                     outs = [outs]

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Keras-0.3.3-py2.7.egg/keras/backend/theano_backend.pyc in __call__(self, inputs)
    444     def __call__(self, inputs):
    445         assert type(inputs) in {list, tuple}
--> 446         return self.function(*inputs)
    447 
    448 

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Theano-0.9.0.dev0-py2.7.egg/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    906                     node=self.fn.nodes[self.fn.position_of_error],
    907                     thunk=thunk,
--> 908                     storage_map=getattr(self.fn, 'storage_map', None))
    909             else:
    910                 # old-style linkers raise their own exceptions

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Theano-0.9.0.dev0-py2.7.egg/theano/gof/link.pyc in raise_with_op(node, thunk, exc_info, storage_map)
    312         # extra long error message in that case.
    313         pass
--> 314     reraise(exc_type, exc_value, exc_trace)
    315 
    316 

/BS/mmalinow-projects/work/gpuEnv/local/lib/python2.7/site-packages/Theano-0.9.0.dev0-py2.7.egg/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    893         try:
    894             outputs =\
--> 895                 self.fn() if output_subset is None else\
    896                 self.fn(output_subset=output_subset)
    897         except Exception:

RuntimeError: Cuda error: CudaNdarray_TakeFrom: invalid device function.

Apply node that caused the error: GpuAdvancedSubtensor1(GpuElemwise{mul,no_inplace}.0, Elemwise{Cast{int64}}.0)
Toposort index: 68
Inputs types: [CudaNdarrayType(float32, matrix), TensorType(int64, vector)]
Inputs shapes: [(880, 500), (15360,)]
Inputs strides: [(500, 1), (8,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[GpuReshape{2}(GpuAdvancedSubtensor1.0, TensorConstant{[ -1 500]})]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

At the end of this tutorial, you are free to experiment with the two examples above.

  • You can change the size of the embedding.
  • You can change the size of the RNN hidden state.
  • You can change the number of training epochs.
  • You can experiment with different batch sizes.
  • You can modify the models (multiple RNN layers, deeper classifiers). Use the Keras documentation if needed.

Summary

RNN models, as opposed to BOW, take the order of the words in the question into account. Moreover, a substantial number of questions can apparently be answered without any access to the image. This can be explained by the models learning dataset-specific biases, some of which can be interpreted as common sense knowledge.

Evaluation Measures

First of all, please run the cell below to set up a link to the NLTK data.


In [ ]:
%env NLTK_DATA=/home/ubuntu/data/visual_turing_test/nltk_data

Ambiguities

To be able to monitor progress on any task, we need to find ways to evaluate the task. Otherwise, we wouldn't know how to compare two architectures, or even worse, we wouldn't even know what our goal is. Moreover, we should aim at automatic evaluation measures; otherwise reproducibility is questionable and the costs are high (time and money - just imagine that you want to evaluate 100 different architectures).

On the other hand, it's difficult to automatically evaluate holistic tasks such as question answering about images because of, in just one word, ambiguities. We have ambiguities in naming objects, sometimes due to synonyms, and sometimes due to fuzziness. For instance, is 'chair' == 'armchair', or 'chair' != 'armchair', or something in between? Such semantic boundaries become even fuzzier when we increase the number of categories. We could easily find a mutually exclusive set of 10 different categories, but what if there are 1000 categories, or 10000 categories? Arguably, we cannot think in terms of equivalence classes anymore, but rather in terms of similarities. That is, 'chair' is semantically more similar to 'armchair' than to 'horse'. This simple example shows the main drawback of the traditional binary evaluation measure, Accuracy, which scores 1 if the names are the same and 0 otherwise, so that Acc('chair', 'armchair') == Acc('chair', 'horse'). We use WUPS to handle such ambiguities.
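
As a toy illustration (a sketch of ours, not the official evaluation code), plain accuracy on single-word answers is just an exact string match, which is why both pairs above receive the same score:


In [ ]:
def accuracy(answer, ground_truth):
    # exact string match: 1 if identical, 0 otherwise
    return 1.0 if answer == ground_truth else 0.0

print(accuracy('chair', 'armchair'), accuracy('chair', 'horse'))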

We call these ambiguities word-level ambiguities, but there are other ambiguities that are arguably more difficult to handle. For instance, the same question can be phrased in multiple other ways. The language of spatial relations is also ambiguous (you may be surprised that what you think is on the left may be on the right for others). Language also tends to be rather vague - we sometimes skip details and resort to common sense. Some such ambiguities are rooted in culture. We handle a couple of such question-level ambiguities with the Consensus measure.

From another perspective, it is arguably easier to evaluate architectures on DAQUAR than on Image Captioning tasks. The former restricts the output space to $N$ categories, while still requiring holistic (visual and linguistic) comprehension.

Wu-Palmer Similarity

Given an ontology, the Wu-Palmer Similarity between two words (or, more broadly, concepts) is a soft measure defined as $$WuP(a,b) := \frac{2 \cdot depth(lca(a,b))}{depth(a) + depth(b)}$$ where $lca(a,b)$ is the least common ancestor of $a$ and $b$, and $depth(a)$ is the depth of $a$ in the ontology.

What is WuP(Dog, Horse) and WuP(Dog, Dalmatian) according to the ontology above? 
Can you also calculate Acc(Dog, Horse) and Acc(Dog, Dalmatian)?
What are your conclusions?
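
If you would like to cross-check your intuition against WordNet (note that WordNet's taxonomy differs from the toy ontology in the figure, so the exact numbers will differ), here is a minimal sketch:


In [ ]:
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
horse = wn.synset('horse.n.01')
# pick the dog-breed sense of 'dalmatian'; the exact sense index may differ across WordNet versions
dalmatian = [s for s in wn.synsets('dalmatian') if 'breed' in s.definition()][0]

print(dog.wup_similarity(horse))      # different branches of the taxonomy
print(dog.wup_similarity(dalmatian))  # a dalmatian is a kind of dog, so this should be much higher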

WUPS

The Wu-Palmer Similarity depends on an ontology. One popular, large ontology is WordNet. Although the Wu-Palmer Similarity may work well on shallow ontologies, we are rather interested in ontologies with hundreds or even thousands of categories. In the indoor scenario, it turns out that many indoor objects sit at similar depths of the taxonomy, and hence their pairwise Wu-Palmer Similarities are all quite high and hard to tell apart.

The code below exemplifies the issue.


In [ ]:
from nltk.corpus import wordnet as wn
armchair_synset = wn.synset('armchair.n.01')
chair_synset = wn.synset('chair.n.01')
wardrobe_synset = wn.synset('wardrobe.n.01')

print(armchair_synset.wup_similarity(armchair_synset))
print(armchair_synset.wup_similarity(chair_synset))
print(armchair_synset.wup_similarity(wardrobe_synset))
wn.synset('chair.n.01').wup_similarity(wn.synset('person.n.01'))

From the code we see that 'armchair' and 'wardrobe' are surprisingly similar. This is because, in a large ontology like WordNet, all the indoor things are essentially 'indoor things'.

This issue motivated us to define a thresholded Wu-Palmer Similarity score: $$WuP_\tau(a,b) := \begin{cases} WuP(a,b) & \text{if}\; WuP(a,b) \ge \tau \\ 0.1 \cdot WuP(a,b) & \text{otherwise} \end{cases}$$ where $\tau$ is a hand-chosen threshold. Empirically, we found that $\tau=0.9$ works well on DAQUAR.
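
Below is a minimal sketch of the thresholded score (our own illustration that compares two synsets directly; the official script additionally deals with mapping plain words to synsets and with sets of answer words):


In [ ]:
from nltk.corpus import wordnet as wn

def thresholded_wup(synset_a, synset_b, tau=0.9):
    # scale down similarities that fall below the threshold tau
    wup = synset_a.wup_similarity(synset_b)
    return wup if wup >= tau else 0.1 * wup

print(thresholded_wup(wn.synset('armchair.n.01'), wn.synset('chair.n.01')))
print(thresholded_wup(wn.synset('armchair.n.01'), wn.synset('wardrobe.n.01')))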

Moreover, since DAQUAR answers are sets of answer words, so that 'knife,fork' == 'fork,knife', we have extended the above measure to work with sets. We call it the Wu-Palmer Set score, or WUPS for short.

A detailed exposition of WUPS is beyond this tutorial, but the curious reader is encouraged to read the 'Performance Measure' paragraph in our paper. Note that the measure in the paper is defined more broadly, and essentially abstracts away from any particular similarity such as the Wu-Palmer Similarity. WUPS at 0.9 is WUPS with threshold $\tau=0.9$.

Although WUPS is conceptually as described here, technically it is slightly different, as it also needs to deal with synsets. It is therefore recommended to download the script from here, or to re-implement it with caution.

Consensus

In this tutorial, we won't cover the consensus measure. The curious reader is encouraged to read the 'Human Consensus' paragraph in the Ask Your Neurons paper.

A few caveats

A few caveats with WUPS, especially useful if you want to use the measure for your own dataset.

Lack of coverage Since WUPS is based on an ontology, it does not always recognise words. For instance, 'garbage bin' is missing, while 'garbage can' is perfectly fine. You can check this yourself, either with the source code provided above, or by using this online script.

Synsets If you execute

wn.synsets('chair')

you will notice a list with many elements; each element is a different sense (synset) of the word 'chair'. You can check their definitions, for instance

wn.synset('chair.n.03').definition()

indicates a person. Indeed, the following has quite a high value

wn.synset('chair.n.03').wup_similarity(wn.synset('person.n.01'))

but this one has a low value, which is what we would prefer

wn.synset('chair.n.01').wup_similarity(wn.synset('person.n.01'))

How do we deal with this? In DAQUAR we take an optimistic perspective and always consider the highest similarity score. This works well with WUPS at 0.9 and a restricted indoor domain with a vocabulary built only from the training set. This may not hold in other domains though.
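
A minimal sketch of this 'optimistic' matching (our own illustration, not the official WUPS evaluation code):


In [ ]:
from nltk.corpus import wordnet as wn

def optimistic_wup(word_a, word_b):
    # take the highest Wu-Palmer Similarity over all noun senses of both words
    scores = [a.wup_similarity(b)
              for a in wn.synsets(word_a, pos=wn.NOUN)
              for b in wn.synsets(word_b, pos=wn.NOUN)]
    return max(s for s in scores if s is not None)

# 'chair' matches 'person' well through its chairperson sense
print(optimistic_wup('chair', 'person'))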

Ontology Since WUPS is based on an ontology, specifically on WordNet, it may give different scores on different ontologies, or even on different versions of the same ontology.

Threshold A good threshold $\tau$ is dataset dependent. In our case $\tau = 0.9$ seems to work well, while $\tau = 0.0$ is too forgiving and is reported mainly for 'historical' reasons. However, following our papers, you should still consider reporting plain set-based accuracy scores (so that Acc('knife,fork', 'fork,knife') == 1; it can be computed with WUPS at -1 using our script), as this metric is widely recognised.

Summary

WUPS is an evaluation measure that handles sets of answer words and word-level ambiguities. Arguably, WUPS at 0.9 is the most practical variant.

New Predictions

Predictions - BOW

With more and more iterations we can increase the training accuracy; however, our goal is to see how well the models generalise. For that, we use a previously unseen test set.


In [ ]:
test_text_representation = dp['text'](train_or_test='test')
test_raw_x = test_text_representation['x']
test_one_hot_x = encode_questions_index(test_raw_x, word2index_x)
test_x = sequence.pad_sequences(test_one_hot_x, maxlen=MAXLEN)

Given the encoded test questions, we use the maximum likelihood principle to pick the answers.


In [ ]:
from numpy import argmax
# predict the probabilities for every word
predictions_scores = text_bow_model.predict([test_x])
print(predictions_scores.shape)
# follow the maximum likelihood principle, and get the best indices to vocabulary
predictions_best = argmax(predictions_scores, axis=-1)
print(predictions_best.shape)
# decode the predicted indices into word answers
predictions_answers = [index2word_y[x] for x in predictions_best]
print(len(predictions_answers))

We can now evaluate the answers using WUPS scores. For this tutorial, we care only about Accuracy, and WUPS at 0.9.


In [ ]:
from kraino.utils import print_metrics
test_raw_y = test_text_representation['y']
_ = print_metrics.select['wups'](
        gt_list=test_raw_y,
        pred_list=predictions_answers,
        verbose=1,
        extra_vars=None)

Let's also see the predictions.


In [ ]:
from numpy import random
test_image_name_list = test_text_representation['img_name']
indices_to_see = random.randint(low=0, high=len(test_image_name_list), size=5)
for index_now in indices_to_see:
    print(test_raw_x[index_now], predictions_answers[index_now])
Do you agree with the answers given above? What are your guesses?
Of course, neither you nor the model has seen any images so far.

But, what if you actually see the images?

Execute the code below.
Do your answers change after seeing the images?

In [ ]:
from matplotlib.pyplot import axis
from matplotlib.pyplot import figure
from matplotlib.pyplot import imshow

import numpy as np
from PIL import Image

%matplotlib inline
for index_now in indices_to_see:
    image_name_now = test_image_name_list[index_now]
    pil_im = Image.open('data/daquar/images/{0}.png'.format(image_name_now), 'r')
    fig = figure()
    fig.text(.2,.05,test_raw_x[index_now], fontsize=14)
    axis('off')
    imshow(np.asarray(pil_im))

Finally, let's also see the ground truths.


In [ ]:
print('question, prediction, ground truth answer')
for index_now in indices_to_see:
    print(test_raw_x[index_now], predictions_answers[index_now], test_raw_y[index_now])

In the code above, we have picked questions at random, so different executions may give different answers.

Predictions - RNN

Curious how the predictions of the blind RNN turned out?

This time, we will use the help of Kraino to make the prediction code shorter.


In [ ]:
from kraino.core.model_zoo import word_generator
# we first need to add word_generator to _config (we could have done this before, in the Config constructor)
# we use maximum likelihood as a word generator
text_rnn_model._config.word_generator = word_generator['max_likelihood']
predictions_answers = text_rnn_model.decode_predictions(
    X=test_x,
    temperature=None,
    index2word=index2word_y,
    verbose=0)

In [ ]:
_ = print_metrics.select['wups'](
        gt_list=test_raw_y,
        pred_list=predictions_answers,
        verbose=1,
        extra_vars=None)
Visualise the questions, predicted answers, and ground truth answers as before.
Check the images as well.
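
If you want a starting point, the cell below is a minimal sketch that re-uses variables already defined above (test_raw_x, test_raw_y, test_image_name_list from the BOW prediction section, and the predictions_answers just computed) to print a few random question/prediction/ground-truth triples; you can add the image-display loop from before in the same way.


In [ ]:
from numpy import random
indices_to_see = random.randint(low=0, high=len(test_image_name_list), size=5)
print('question, prediction, ground truth answer')
for index_now in indices_to_see:
    print(test_raw_x[index_now], predictions_answers[index_now], test_raw_y[index_now])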

Visual Features

We won't get very far using only textual features. Hence, it's now time to consider their visual counterpart.

As shown in the figure below, a quite common procedure works as follows:

  • Use a CNN already pre-trained on some large-scale classification task; most often this is ImageNet with $1000$ object categories.
  • 'Chop off' the CNN after some layer. We will use the responses of that layer as visual features.

In this tutorial, we will use CNN features that we have already extracted in advance; the VGG Net-19 features come from its second to last, $4096$-dimensional layer (fc7) and were extracted using Caffe - another excellent framework for deep learning, particularly good for CNNs. The cell below loads (l2-normalized) ResNet-152 features by default, but you can switch to VGG Net-19 or GoogLeNet by uncommenting the corresponding lines.

Please run the cell below in order to get the visual features aligned with the textual features.


In [ ]:
# this contains a list of the image names of our interest; 
# it also makes sure that visual and textual features are aligned correspondingly
train_image_names = train_text_representation['img_name']
# the name for visual features that we use
# CNN_NAME='vgg_net'
# CNN_NAME='googlenet'
CNN_NAME='fb_resnet'
# the layer in CNN that is used to extract features
# PERCEPTION_LAYER='fc7'
# PERCEPTION_LAYER='pool5-7x7_s1'
# PERCEPTION_LAYER='res5c-152'
PERCEPTION_LAYER='l2_res5c-152' # l2 prefix since there are l2-normalized visual features

train_visual_features = dp['perception'](
    train_or_test='train',
    names_list=train_image_names,
    parts_extractor=None,
    max_parts=None,
    perception=CNN_NAME,
    layer=PERCEPTION_LAYER,
    second_layer=None
    )
train_visual_features.shape
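
The 'l2_' prefix in the layer name above indicates that the stored features were l2-normalized. If you ever extract your own features and want to reproduce this normalization, it is a one-liner; below is a small numpy sketch on a hypothetical feature matrix (one row per image, made-up numbers).


In [ ]:
import numpy as np
# hypothetical example: 3 images with 4-dimensional features
features = np.random.rand(3, 4).astype(np.float32)
l2_normalized = features / np.sqrt((features ** 2).sum(axis=1))[:, np.newaxis]
print(np.sqrt((l2_normalized ** 2).sum(axis=1)))  # every row now has unit norm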

Vision+Language

Since we are talking about answering questions about images, we likely need the images too :)

Take a look at the figure below one more time. How far can you get with blind guesses?

Let's create an input as a pair of textual and visual features.


In [ ]:
train_input = [train_x, train_visual_features]

BOW + Vision

As with the Language Only model, we start from a simpler BOW model that we combine with visual features. Here, we explore two ways of combining both modalities (the circle with 'C' in the figure below): concatenation and piece-wise multiplication. We will use CNN features extracted from the image, but for the sake of simplicity we won't backprop into the visual representation to fine-tune it (the dotted line in the figure below symbolizes the barrier that blocks back-propagation). Although in our Ask Your Neurons paper fine-tuning the last layer was actually important, the benefits of end-to-end training on DAQUAR or the larger VQA datasets remain an open question.
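
To make the two fusion operators concrete, here is a tiny numpy illustration (our own toy example with made-up numbers, not part of the model code): concatenation stacks the two embeddings into a longer vector, while piece-wise (element-wise) multiplication keeps the dimensionality but requires both embeddings to have the same size.


In [ ]:
import numpy as np
q = np.array([0.1, 0.2, 0.3])   # toy question embedding
v = np.array([0.5, 0.4, 0.1])   # toy visual embedding
print(np.concatenate([q, v]))   # concatenation: 6-dimensional joint representation
print(q * v)                    # piece-wise multiplication: stays 3-dimensional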


In [ ]:
#== Model definition

# First we define a model using keras/kraino
from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.core import Layer
from keras.layers.core import Merge
from keras.layers.core import TimeDistributedMerge
from keras.layers.embeddings import Embedding

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class VisionLanguageBOW(AbstractSequentialModel, AbstractSingleAnswer):
    """
    BOW Vision+Language model that produces single word answers.
    """
    def create(self):
        language_model = Sequential()
        language_model.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        language_model.add(LambdaWithMask(
                time_distributed_masked_ave, 
                output_shape=[language_model.output_shape[2]]))
        language_model.add(DropMask())
        visual_model = Sequential()
        if self._config.visual_embedding_dim > 0:
            visual_model.add(Dense(
                    self._config.visual_embedding_dim,
                    input_shape=(self._config.visual_dim,)))
        else:
            visual_model.add(Layer(input_shape=(self._config.visual_dim,)))
        self.add(Merge([language_model, visual_model], mode=self._config.multimodal_merge_mode))
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))

In [ ]:
# dimensionality of embeddings
EMBEDDING_DIM = 500
# kind of multimodal fusion (ave, concat, mul, sum)
MULTIMODAL_MERGE_MODE = 'concat'

model_config = Config(
    textual_embedding_dim=EMBEDDING_DIM,
    visual_embedding_dim=0,
    multimodal_merge_mode=MULTIMODAL_MERGE_MODE,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()),
    visual_dim=train_visual_features.shape[1])
model = VisionLanguageBOW(model_config)
model.create()
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')

Now, we can train the model.


In [ ]:
#== Model training
model.fit(
    train_input, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

Interestingly, if we use a piece-wise multiplication to merge both modalities together, we will get better results.


In [ ]:
#== Model definition

# First we define a model using keras/kraino
from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.core import Layer
from keras.layers.core import Merge
from keras.layers.core import TimeDistributedMerge
from keras.layers.embeddings import Embedding

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class VisionLanguageBOW(AbstractSequentialModel, AbstractSingleAnswer):
    """
    BOW Vision+Language model that produces single word answers.
    """
    def create(self):
        language_model = Sequential()
        language_model.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        language_model.add(LambdaWithMask(
                time_distributed_masked_ave, 
                output_shape=[language_model.output_shape[2]]))
        language_model.add(DropMask())
        visual_model = Sequential()
        if self._config.visual_embedding_dim > 0:
            visual_model.add(Dense(
                    self._config.visual_embedding_dim,
                    input_shape=(self._config.visual_dim,)))
        else:
            visual_model.add(Layer(input_shape=(self._config.visual_dim,)))
        self.add(Merge([language_model, visual_model], mode=self._config.multimodal_merge_mode))
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))

In [ ]:
# dimensionality of embeddings
EMBEDDING_DIM = 500
# kind of multimodal fusion (ave, concat, mul, sum)
MULTIMODAL_MERGE_MODE = 'mul'

model_config = Config(
    textual_embedding_dim=EMBEDDING_DIM,
    visual_embedding_dim=EMBEDDING_DIM,
    multimodal_merge_mode=MULTIMODAL_MERGE_MODE,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()),
    visual_dim=train_visual_features.shape[1])
model = VisionLanguageBOW(model_config)
model.create()
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
text_image_bow_model = model
If we merge language and visual features with 'mul', do we need to set both embeddings to the same number of dimensions (textual_embedding_dim == visual_embedding_dim)?

In [ ]:
#== Model training
text_image_bow_model.fit(
    train_input, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

RNN + Vision

Now, we will repeat the BOW experiments but with RNN.


In [ ]:
#== Model definition

# First we define a model using keras/kraino
from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.core import Layer
from keras.layers.core import Merge
from keras.layers.core import TimeDistributedMerge
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.layers.recurrent import LSTM

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class VisionLanguageLSTM(AbstractSequentialModel, AbstractSingleAnswer):
    """
    LSTM Vision+Language model that produces single word answers.
    """
    def create(self):
        language_model = Sequential()
        language_model.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        #TODO: Replace averaging with RNN (you can choose between LSTM and GRU)
#         language_model.add(LambdaWithMask(time_distributed_masked_ave, output_shape=[self.output_shape[2]]))
        language_model.add(LSTM(self._config.hidden_state_dim, 
                      return_sequences=False))

        visual_model = Sequential()
        if self._config.visual_embedding_dim > 0:
            visual_model.add(Dense(
                    self._config.visual_embedding_dim,
                    input_shape=(self._config.visual_dim,)))
        else:
            visual_model.add(Layer(input_shape=(self._config.visual_dim,)))
        self.add(Merge([language_model, visual_model], mode=self._config.multimodal_merge_mode))
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))
        
        
# dimensionality of embeddings
EMBEDDING_DIM = 500
# kind of multimodal fusion (ave, concat, mul, sum)
MULTIMODAL_MERGE_MODE = 'sum'

model_config = Config(
    textual_embedding_dim=EMBEDDING_DIM,
    visual_embedding_dim=EMBEDDING_DIM,
    hidden_state_dim=EMBEDDING_DIM,
    multimodal_merge_mode=MULTIMODAL_MERGE_MODE,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()),
    visual_dim=train_visual_features.shape[1])
model = VisionLanguageLSTM(model_config)
model.create()
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
text_image_rnn_model = model

Batch Size

So, again, let's train the model (if the following cell crashes, please move to the next cell).


In [ ]:
#== Model training
text_image_rnn_model.fit(
    train_input, 
    train_y,
    batch_size=5500,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

Oops, apparently we ran out of memory on our GPUs. Note how large our batches are! Let's make them much smaller (argument batch_size).
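
To get a rough feeling for why this happens (a back-of-the-envelope estimate, not an exact memory profile), consider just one intermediate float32 tensor of embedded question words for such a batch; Theano keeps many tensors of this order of magnitude alive at once (gate activations, pre-activations, gradients), which quickly exceeds typical GPU memory.


In [ ]:
# rough size of ONE intermediate float32 tensor (batch x timesteps x embedding) for batch_size=5500
batch_size, timesteps, embedding_dim = 5500, 30, 500
print(batch_size * timesteps * embedding_dim * 4 / 1024.0 ** 2, 'MB')  # roughly 315 MB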


In [ ]:
#== Model training
text_image_rnn_model.fit(
    train_input, 
    train_y,
    batch_size=1,
    nb_epoch=1,
    validation_split=0.1,
    show_accuracy=True)

OK. Please stop it! Batch size 1 is no good either - training is very slow. Let's use the standard batch size of 512. Please re-run the cell with the model definition, and then run the cell below.


In [ ]:
#== Model training
text_image_rnn_model.fit(
    train_input, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)
Can you explain both issues regarding the batch size? Why is training impossible in the first case, and very tedious in the second?

When do you get the best performance, with multiplication, concatenation, or summation?

Summary

As previously, using an RNN makes the sequence processing order-aware. This time, however, we combine two modalities so that the whole model 'sees' the image. Finally, it matters how both modalities are combined: we have found that piece-wise multiplication outperforms traditional concatenation.

New Predictions with Vision+Language

Predictions (Features)


In [ ]:
test_image_names = test_text_representation['img_name']
test_visual_features = dp['perception'](
    train_or_test='test',
    names_list=test_image_names,
    parts_extractor=None,
    max_parts=None,
    perception=CNN_NAME,
    layer=PERCEPTION_LAYER,
    second_layer=None
    )
test_visual_features.shape

In [ ]:
test_input = [test_x, test_visual_features]

Predictions (Bow with Vision)

Let's evaluate the Vision+Language architectures as well.


In [ ]:
from kraino.core.model_zoo import word_generator
# we first need to add word_generator to _config (we could have done this before, in the Config constructor)
# we use maximum likelihood as a word generator
text_image_bow_model._config.word_generator = word_generator['max_likelihood']
predictions_answers = text_image_bow_model.decode_predictions(
    X=test_input,
    temperature=None,
    index2word=index2word_y,
    verbose=0)

In [ ]:
_ = print_metrics.select['wups'](
        gt_list=test_raw_y,
        pred_list=predictions_answers,
        verbose=1,
        extra_vars=None)

Predictions (RNN with Vision)


In [ ]:
from kraino.core.model_zoo import word_generator
# we first need to add word_generator to _config (we could have done this before, in the Config constructor)
# we use maximum likelihood as a word generator
text_image_rnn_model._config.word_generator = word_generator['max_likelihood']
predictions_answers = text_image_rnn_model.decode_predictions(
    X=test_input,
    temperature=None,
    index2word=index2word_y,
    verbose=0)

In [ ]:
_ = print_metrics.select['wups'](
        gt_list=test_raw_y,
        pred_list=predictions_answers,
        verbose=1,
        extra_vars=None)

VQA

The models that we have built so far can be transferred to other datasets. Let's consider the recently introduced large-scale VQA dataset, built on top of COCO. In this section, we will train and evaluate VQA models. Since we are re-using all the pieces introduced before, we will move quickly through the code. For the sake of simplicity, we will use only BOW architectures, but you are free to experiment with RNNs. Please also pay attention to the comments.

Since VQA hides the test data for the purpose of the challenge, we will use the publicly available validation set to evaluate the architectures.

VQA Language Features


In [ ]:
#TODO: Execute the following procedure (Shift+Enter)
from kraino.utils import data_provider

vqa_dp = data_provider.select['vqa-real_images-open_ended']
# VQA has a few answers associated with one question. 
# We take the most frequently occurring answer (single_frequent).
# The argument 'keep_top_qa_pairs' filters out rare answers together with the associated questions.
# We use 1000 to keep only the pairs with the 1000 most frequent answers; set it to 0 to keep all pairs and see how the results differ
vqa_train_text_representation = vqa_dp['text'](
    train_or_test='train',
    answer_mode='single_frequent',
    keep_top_qa_pairs=1000)
vqa_val_text_representation = vqa_dp['text'](
    train_or_test='val',
    answer_mode='single_frequent')

In [ ]:
from toolz import frequencies
vqa_train_raw_x = vqa_train_text_representation['x']
vqa_train_raw_y = vqa_train_text_representation['y']
vqa_val_raw_x = vqa_val_text_representation['x']
vqa_val_raw_y = vqa_val_text_representation['y']
# we start from building the frequencies table
vqa_wordcount_x = frequencies(' '.join(vqa_train_raw_x).split(' '))
# we can keep all answer words in the answer as a class
# therefore we use an artificial split symbol '{' to not split the answer into words
# you can see the difference if you replace '{' with ' ' and print vqa_wordcount_y
vqa_wordcount_y = frequencies('{'.join(vqa_train_raw_y).split('{'))
vqa_wordcount_y

Language-Only


In [ ]:
from keras.preprocessing import sequence
from kraino.utils.input_output_space import build_vocabulary
from kraino.utils.input_output_space import encode_questions_index
from kraino.utils.input_output_space import encode_answers_one_hot
MAXLEN=30
vqa_word2index_x, vqa_index2word_x = build_vocabulary(this_wordcount = vqa_wordcount_x)
vqa_word2index_y, vqa_index2word_y = build_vocabulary(this_wordcount = vqa_wordcount_y)
vqa_train_x = sequence.pad_sequences(encode_questions_index(vqa_train_raw_x, vqa_word2index_x), maxlen=MAXLEN)
vqa_val_x = sequence.pad_sequences(encode_questions_index(vqa_val_raw_x, vqa_word2index_x), maxlen=MAXLEN)
vqa_train_y, _ = encode_answers_one_hot(
    vqa_train_raw_y, 
    vqa_word2index_y, 
    answer_words_delimiter=vqa_train_text_representation['answer_words_delimiter'],
    is_only_first_answer_word=True,
    max_answer_time_steps=1)
vqa_val_y, _ = encode_answers_one_hot(
    vqa_val_raw_y, 
    vqa_word2index_y, 
    answer_words_delimiter=vqa_train_text_representation['answer_words_delimiter'],
    is_only_first_answer_word=True,
    max_answer_time_steps=1)

In [ ]:
from kraino.core.model_zoo import Config
from kraino.core.model_zoo import word_generator
# We are re-using the BlindBOW model
# Please make sure you have run the cell with the class definition
# VQA is larger, so we can increase the dimensionality of the embedding
vqa_model_config = Config(
    textual_embedding_dim=1000,
    input_dim=len(vqa_word2index_x.keys()),
    output_dim=len(vqa_word2index_y.keys()),
    word_generator = word_generator['max_likelihood'])
vqa_text_bow_model = BlindBOW(vqa_model_config)
vqa_text_bow_model.create()
vqa_text_bow_model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')

In [ ]:
vqa_text_bow_model.fit(
    vqa_train_x, 
    vqa_train_y,
    batch_size=512,
    nb_epoch=10,
    validation_split=0.1,
    show_accuracy=True)

In [ ]:
# we first need to add word_generator to _config (we could have done this before, in the Config constructor)
# we use maximum likelihood as a word generator
vqa_predictions_answers = vqa_text_bow_model.decode_predictions(
    X=vqa_val_x,
    temperature=None,
    index2word=vqa_index2word_y,
    verbose=0)

In [ ]:
# Using VQA is unfortunately not that transparent
# we need extra VQA object.
vqa_vars = {
    'question_id':vqa_val_text_representation['question_id'],
    'vqa_object':vqa_val_text_representation['vqa_object'],
    'resfun': 
        lambda x: \
            vqa_val_text_representation['vqa_object'].loadRes(x, vqa_val_text_representation['questions_path'])
}

In [ ]:
from kraino.utils import print_metrics


_ = print_metrics.select['vqa'](
        gt_list=vqa_val_raw_y,
        pred_list=vqa_predictions_answers,
        verbose=1,
        extra_vars=vqa_vars)

VQA Language+Vision


In [ ]:
# the name for visual features that we use
VQA_CNN_NAME='vgg_net'
# VQA_CNN_NAME='googlenet'
# the layer in CNN that is used to extract features
VQA_PERCEPTION_LAYER='fc7'
# PERCEPTION_LAYER='pool5-7x7_s1'

vqa_train_visual_features = vqa_dp['perception'](
    train_or_test='train',
    names_list=vqa_train_text_representation['img_name'],
    parts_extractor=None,
    max_parts=None,
    perception=VQA_CNN_NAME,
    layer=VQA_PERCEPTION_LAYER,
    second_layer=None
    )
vqa_train_visual_features.shape

In [ ]:
vqa_val_visual_features = vqa_dp['perception'](
    train_or_test='val',
    names_list=vqa_val_text_representation['img_name'],
    parts_extractor=None,
    max_parts=None,
    perception=VQA_CNN_NAME,
    layer=VQA_PERCEPTION_LAYER,
    second_layer=None
    )
vqa_val_visual_features.shape

In [ ]:
from kraino.core.model_zoo import Config
from kraino.core.model_zoo import word_generator

# dimensionality of embeddings
VQA_EMBEDDING_DIM = 1000
# kind of multimodal fusion (ave, concat, mul, sum)
VQA_MULTIMODAL_MERGE_MODE = 'mul'

vqa_model_config = Config(
    textual_embedding_dim=VQA_EMBEDDING_DIM,
    visual_embedding_dim=VQA_EMBEDDING_DIM,
    multimodal_merge_mode=VQA_MULTIMODAL_MERGE_MODE,
    input_dim=len(vqa_word2index_x.keys()),
    output_dim=len(vqa_word2index_y.keys()),
    visual_dim=vqa_train_visual_features.shape[1],
    word_generator=word_generator['max_likelihood'])
vqa_text_image_bow_model = VisionLanguageBOW(vqa_model_config)
vqa_text_image_bow_model.create()
vqa_text_image_bow_model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')

In [ ]:
vqa_train_input = [vqa_train_x, vqa_train_visual_features]
vqa_val_input = [vqa_val_x, vqa_val_visual_features]

In [ ]:
#== Model training
vqa_text_image_bow_model.fit(
    vqa_train_input, 
    vqa_train_y,
    batch_size=512,
    nb_epoch=10,
    validation_split=0.1,
    show_accuracy=True)

In [ ]:
# we first need to add word_generator to _config (we could have done this before, in the Config constructor)
# we use maximum likelihood as a word generator
vqa_predictions_answers = vqa_text_image_bow_model.decode_predictions(
    X=vqa_val_input,
    temperature=None,
    index2word=vqa_index2word_y,
    verbose=0)

In [ ]:
# Using VQA is unfortunately not that transparent
# we need extra VQA object.
vqa_vars = {
    'question_id':vqa_val_text_representation['question_id'],
    'vqa_object':vqa_val_text_representation['vqa_object'],
    'resfun': 
        lambda x: \
            vqa_val_text_representation['vqa_object'].loadRes(x, vqa_val_text_representation['questions_path'])
}

In [ ]:
from kraino.utils import print_metrics


_ = print_metrics.select['vqa'](
        gt_list=vqa_val_raw_y,
        pred_list=vqa_predictions_answers,
        verbose=1,
        extra_vars=vqa_vars)

Kraino

For the purpose of fast experimentations on Visual Turing Test, we have prepared Kraino that builds on top of Keras. In this short section, you will see how to use it from a command line, example by example.

Kraino on DAQUAR

We use a blind model with temporal fusion of the question words (the equivalent of BOW).

One era consists of max_epoch epochs; at the end of each era we gather some statistics, such as the model performance, and dump the model weights. Since calculating WUPS scores is slow, we only compute them at the end of each era. In the example below we use 20 epochs per era (--max_epoch=20) and 2 eras (--max_era=2), so in total we perform 40 epochs.

In the example below we monitor WUPS scores on the test set (--metric=wups, --verbosity=monitor_test_metric), but we also use 10% of the training data for validation (--validation_split=0.1). Please remember to pick the model based on the validation set, NOT the test set!

We use a one_hot vector representation. As an alternative, we could use --word_representation=dense with a pre-trained embedding such as Word2Vec or GloVe. You need to download both pre-trained embeddings yourself.

The code below may be slow due to WUPS calculations.


In [ ]:
! python neural_solver.py --dataset=daquar-triples --model=sequential-blind-temporal_fusion-single_answer --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=2 --verbosity=monitor_test_metric --word_representation=one_hot

Maybe we should use a smaller embedding layer with --textual_embedding_size=500 (500 dimensions).


In [ ]:
! python neural_solver.py --textual_embedding_size=500 --dataset=daquar-triples --model=sequential-blind-temporal_fusion-single_answer --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=2 --verbosity=monitor_test_metric --word_representation=one_hot

Now we replace the temporal fusion with recurrent fusion (LSTM) via --model=sequential-blind-recurrent_fusion-single_answer


In [ ]:
! python neural_solver.py --dataset=daquar-triples --model=sequential-blind-recurrent_fusion-single_answer --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot

We can easily replace the LSTM with a GRU as the question encoder via --text_encoder=gru.


In [ ]:
! python neural_solver.py --text_encoder=gru --dataset=daquar-triples --model=sequential-blind-recurrent_fusion-single_answer --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot

We can also use 1 dimensional CNN to represent questions with --model=sequential-blind-cnn_fusion-single_answer_with_temporal_fusion.


In [ ]:
! python neural_solver.py --text_encoder=gru --dataset=daquar-triples --model=sequential-blind-cnn_fusion-single_answer_with_temporal_fusion --temporal_fusion=sum --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot

Or we can combine GRU with visual CNN.

--model=sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer

--multimodal_fusion=mul


In [ ]:
! python neural_solver.py --text_encoder=gru --dataset=daquar-triples --model=sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer --temporal_fusion=sum --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot --multimodal_fusion=mul

Or use the above with Resnet (by default it's GoogLeNet) with piece-wise summation. We use parameters: --perception=fb_resnet --perception_layer=l2_res5c-152.


In [ ]:
! python neural_solver.py --text_encoder=gru --dataset=daquar-triples --model=sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer --temporal_fusion=sum --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot --multimodal_fusion=sum --perception=fb_resnet --perception_layer=l2_res5c-152

But there are more possibilities. Unfortunately, the documentation is not ready yet, but you are welcome to experiment with different settings. Interestingly, some combinations of parameters don't go well with each other.

To see available models, check kraino/core/model_zoo.py.

To see available command-line parameters together with default values, check kraino/utils/parsers.py.

Kraino on VQA

We can, however, also work with VQA.

We need to switch the dataset (--dataset=vqa-real_images-open_ended), the metric (--metric=vqa), and add the answer mode (--vqa_answer_mode=single_frequent). Moreover, we keep only the question answer pairs associated with the 2000 most frequent answers, and add --use_whole_answer_as_answer_word since we want to treat an answer such as 'yellow cab' as a single answer class rather than split it into words.

If the following code is too slow, try using BOW (--model=sequential-blind-temporal_fusion-single_answer).


In [ ]:
! python neural_solver.py --dataset=vqa-real_images-open_ended --model=sequential-blind-recurrent_fusion-single_answer --vqa_answer_mode=single_frequent --metric=vqa --max_epoch=10 --max_era=1 --verbosity=monitor_val_metric --word_representation=one_hot --number_most_frequent_qa_pairs=2000 --use_whole_answer_as_answer_word

Or we can use a Vision + Language model. If there are memory problems, try smaller batches (--batch_size=...), a smaller model (--hidden_state_size or --textual_embedding_size), or the BOW model (--model=sequential-multimodal-temporal_fusion-single_answer).


In [ ]:
! python neural_solver.py --dataset=vqa-real_images-open_ended --model=sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer --vqa_answer_mode=single_frequent --metric=vqa --max_epoch=10 --max_era=1 --verbosity=monitor_val_metric --word_representation=one_hot --number_most_frequent_qa_pairs=2000 --use_whole_answer_as_answer_word --perception=fb_resnet --perception_layer=l2_res5c-152 --multimodal_fusion=mul

Further Experiments

  • Data and Task Understanding.
    • Try to find out for yourself how difficult/easy it is to answer questions without looking at the images (do this on DAQUAR or VQA). If you are better than our models, that's great. If you are worse, that's fine - our models, even the blind ones, were trained to answer on this particular dataset.
    • Take a look at more images and questions. Can you recognise other challenges that machines can potentially face? Can you classify the new challenges?
    • Experiment with evaluation measures. For instance, compute Wu-Palmer Similarities for other categories, look into the WUPS source code, or check Consensus. You are also encouraged to read 'Performance Measure' in the Multiworld paper, 'Quantifying the Performance of Holistic Architectures' in the 'Towards a Visual Turing Challenge' paper, and 'Human Consensus' in the Ask Your Neurons paper.
    • You can experiment with the New Predictions sections. In particular, the last predictions section was quite short. For instance, you can visualise the predictions, similarly to what you did here. Maybe you can dump a file with the predictions and come up with new conclusions by inspecting it.
    • Look into VQA. Check the images and question answer pairs. What are the differences between VQA and DAQUAR?
  • Experiment with the provided code.
    • More RNN layers, deeper classifiers.
    • Different RNN models (GRU, LSTM, your own?).
    • Using the best model found on a validation set. For this you may want to pass the checkpoint callback to the fit function, as this example suggests.
    • Recognise and change hyperparameters such as dimensionality of embeddings or the number of hidden units.
    • Different ways to fuse two modalities (concat, ave, ...). If you use many RNN layers, you can fuse the modalities at different levels.
    • Investigate different visual features
      • vgg_net: fc6, fc7, fc8
      • googlenet: loss3-classifier, pool5-7x7_s1
    • If needed, please consult the official documentation.
  • Experiments with Keras.
  • Experiments with Kraino.

New Research Opportunities

  • Global Representation So far we have been using so-called global representations of the images. Such representations may destroy too much information, so we should consider fine-grained alternatives. Maybe we should use detections, or attention models; the latter have recently become quite successful in answering questions about images. However, there is still hope for global representations if they are trained end-to-end for the task. Recall that our global representation is extracted from a CNN trained on a different dataset (ImageNet), and for a different task (object classification).
  • 3D Scene Representation Most current approaches, and all neural approaches, are trained on 2D images. However, it seems that some spatial relations such as 'behind' may need a 3D representation of the scene. Luckily, DAQUAR is built on the NYU-Depth dataset, which provides both modes (2D images and 3D depth). The question of whether such extra information helps remains open.
  • Recurrent Neural Networks There is a disturbingly small gap between the BOW and RNN models. As we have seen before, some questions clearly require an order, but such questions at the same time tend to be longer, semantically more difficult, and to require a better visual understanding of the world. To handle them we may need other RNN architectures, better ways of fusing the two modalities, or a better Global Representation.
  • Logical Reasoning There are a few questions that require more sophisticated logical reasoning, such as negation. Can Recurrent Neural Networks learn such logical operators? What about compositionality of the language?
  • Language + Vision The gap between Language Only and Vision + Language models is too small. But clearly, we need pictures to answer questions about images. So what is missing here? Is it due to the Global Representation, the 3D Scene Representation, or is something missing in the fusion of the two modalities?
  • Learning from Few Examples In the Visual Turing Test, many questions are quite unique. But then how can the models generalise to new questions? What if a question is completely new, but its parts have already been observed (compositionality)? Can models guess the meaning of a new word from its context?
  • Ambiguities How do we deal with ambiguities? They are inherent in the task, so they cannot simply be ignored, and should be incorporated into question answering methods as well as evaluation metrics.
  • Evaluation Measures Although we have WUPS and Consensus, both are far from perfect. Consensus has a higher annotation cost for ambiguous tasks, and it is unclear how to formally define a good consensus measure. WUPS is ontology dependent, but can we build one complete ontology that covers all cases?

External Links

Logs

  1. 25.04.2016: Added residual net features, more on Kraino, made it publicly available
  2. 20.03.2016: 2nd Summer School on Integrating Vision & Language: Deep Learning