NMT-Keras tutorial

This set of notebooks describes how to build a neural machine translation model with NMT-Keras. It is assumed that you have properly installed all the required dependencies (Theano, Keras, Staged Keras Wrapper, COCO Caption, ...). First, we'll create a Dataset instance in order to properly manage the data. Next, we'll create and train a Neural Translation Model. Finally, we'll apply the trained model to new (unseen) data.
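
Before starting, it may be worth checking that the main Python dependencies can actually be imported. This is only an optional sanity check, not part of the tutorial code:

# Optional sanity check: make sure the main dependencies are importable.
import theano
import keras
import keras_wrapper  # Staged Keras Wrapper

print('Theano version:', theano.__version__)
print('Keras version:', keras.__version__)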

1. Building a Dataset instance

First, we create a Dataset object (from the Staged Keras Wrapper library). Let's make the required imports and create an empty Dataset instance:


In [1]:
from keras_wrapper.dataset import Dataset, saveDataset
from data_engine.prepare_data import keep_n_captions
ds = Dataset('tutorial_dataset', 'tutorial', silence=False)


Using Theano backend.
Using cuDNN version 5105 on context None
Mapped name None to device cuda: GeForce GTX 1080 (0000:01:00.0)

Now that we have the empty dataset, we must indicate its inputs and outputs. In our case, we'll have two different inputs and one single output:

1) Outputs:
target_text: Sentences in our target language.

2) Inputs:
source_text: Sentences in the source language.

state_below: Sentences in the target language, but shifted one position to the right (for teacher-forcing training of the model).
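
To make the teacher-forcing setup concrete, here is a tiny, hypothetical illustration (the sentence and the '<null>' filler token are only examples, not taken from the tutorial data) of how 'state_below' relates to 'target_text':

# Hypothetical example: relation between the decoder output ('target_text')
# and its teacher-forced input ('state_below').
target_text = ['I', 'would', 'like', 'a', 'single', 'room', '</s>']

# 'state_below' is the same sentence shifted one position to the right, so at
# each timestep the decoder receives the previous reference word as input:
state_below = ['<null>'] + target_text[:-1]
print(state_below)  # ['<null>', 'I', 'would', 'like', 'a', 'single', 'room']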

For setting up the outputs, we use the setOutput function with the appropriate parameters. Note that, when building the dataset for the training split, we also build the vocabulary (of up to 30000 words).


In [3]:
ds.setOutput('examples/EuTrans/training.en',
             'train',
             type='text',
             id='target_text',
             tokenization='tokenize_none',
             build_vocabulary=True,
             pad_on_batch=True,
             sample_weights=True,
             max_text_len=30,
             max_words=30000,
             min_occ=0)

ds.setOutput('examples/EuTrans/dev.en',
             'val',
             type='text',
             id='target_text',
             pad_on_batch=True,
             tokenization='tokenize_none',
             sample_weights=True,
             max_text_len=30,
             max_words=0)


[26/04/2017 13:48:48] Creating vocabulary for data with id 'target_text'.
[26/04/2017 13:48:48] 	 Total: 513 unique words in 9900 sentences with a total of 98304 words.
[26/04/2017 13:48:48] Creating dictionary of 30000 most common words, covering 100.0% of the text.
[26/04/2017 13:48:48] Loaded "train" set outputs of type "text" with id "target_text" and length 9900.
[26/04/2017 13:48:48] Loaded "val" set outputs of type "text" with id "target_text" and length 100.
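
At this point we can take a quick look at the vocabulary that has just been built. This is a minimal sketch, assuming the Dataset object exposes the vocabulary and vocabulary_len dictionaries (as in recent versions of the Staged Keras Wrapper):

# Inspect the vocabulary built for the 'target_text' data
# (attribute names are an assumption about the Dataset API).
print(ds.vocabulary_len['target_text'])  # vocabulary size
print(list(ds.vocabulary['target_text']['words2idx'].items())[:10])  # some (word, index) pairs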

Similarly, we introduce the source text data with the setInput function. Again, when building the training split, we must construct the vocabulary.


In [4]:
ds.setInput('examples/EuTrans/training.es',
            'train',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            build_vocabulary=True,
            fill='end',
            max_text_len=30,
            max_words=30000,
            min_occ=0)
ds.setInput('examples/EuTrans/dev.es',
            'val',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            fill='end',
            max_text_len=30,
            min_occ=0)


[26/04/2017 13:48:52] Creating vocabulary for data with id 'source_text'.
[26/04/2017 13:48:52] 	 Total: 686 unique words in 9900 sentences with a total of 96172 words.
[26/04/2017 13:48:52] Creating dictionary of 30000 most common words, covering 100.0% of the text.
[26/04/2017 13:48:52] Loaded "train" set inputs of type "text" with id "source_text" and length 9900.
[26/04/2017 13:48:52] Loaded "val" set inputs of type "text" with id "source_text" and length 100.

...and the same for the 'state_below' data. Note that: 1) The offset flag is set to 1, which means that the text will be shifted one position to the right. 2) At sampling time, we won't have this input. Hence, we 'hack' the dataset by inserting an artificial input of type 'ghost' for the validation split.


In [5]:
ds.setInput('examples/EuTrans/training.en',
            'train',
            type='text',
            id='state_below',
            required=False,
            tokenization='tokenize_none',
            pad_on_batch=True,
            build_vocabulary='target_text',
            offset=1,
            fill='end',
            max_text_len=30,
            max_words=30000)
ds.setInput(None,
            'val',
            type='ghost',
            id='state_below',
            required=False)


[26/04/2017 13:48:58] 	Reusing vocabulary named "target_text" for data with id "state_below".
[26/04/2017 13:48:58] Loaded "train" set inputs of type "text" with id "state_below" and length 9900.
[26/04/2017 13:48:58] Loaded "val" set inputs of type "ghost" with id "state_below" and length 100.

We must match the references with the inputs, keeping a single reference per sample in the validation set:


In [6]:
# Keep a single reference per input sample (also covers the case of multiple references per sentence)
keep_n_captions(ds, repeat=1, n=1, set_names=['val'])


[26/04/2017 13:48:59] Keeping 1 captions per input on the val set.
[26/04/2017 13:48:59] Samples reduced to 100 in val set.
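
As a quick sanity check, we can print the number of samples loaded for each split. The len_train / len_val counters are an assumption about the Staged Keras Wrapper Dataset API:

# Sanity check: number of samples per split (attribute names are an
# assumption about the Dataset API in recent versions of the wrapper).
print('Training samples:  ', ds.len_train)
print('Validation samples:', ds.len_val)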

Finally, we can save our Dataset instance for use in other experiments:


In [7]:
saveDataset(ds, 'datasets')


[26/04/2017 13:49:01] <<< Saving Dataset instance to datasets/Dataset_tutorial_dataset.pkl ... >>>
[26/04/2017 13:49:01] <<< Dataset instance saved >>>
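
In the following notebooks, the stored instance can be loaded back with the loadDataset helper, using the path printed above. A minimal sketch:

from keras_wrapper.dataset import loadDataset

# Reload the Dataset instance saved in the previous step
ds = loadDataset('datasets/Dataset_tutorial_dataset.pkl')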