This set of notebooks describes how to build a neural machine translation model with Keras-NMT. It is assumed that you have properly set up all the required dependencies (Theano, Keras, Staged Keras Wrapper, COCO Caption...). First, we'll create a Dataset instance in order to properly manage the data. Next, we'll create and train a Neural Translation Model. Finally, we'll apply the trained model to new (unseen) data.
First, we create a Dataset object (from the Staged Keras Wrapper library). Let's make some imports and create an empty Dataset instance:
In [1]:
from keras_wrapper.dataset import Dataset, saveDataset
from data_engine.prepare_data import keep_n_captions
ds = Dataset('tutorial_dataset', 'tutorial', silence=False)
Now that we have the empty dataset, we must indicate its inputs and outputs. In our case, we'll have two different inputs and one single output:
1) Outputs:
target_text: Sentences in our target language.
2) Inputs:
source_text: Sentences in the source language.
state_below: Sentences in the target language, but shifted one position to the right, used for teacher-forced training of the model (see the short illustration after this list).
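To make the role of state_below concrete, here is a tiny, library-independent sketch of the one-position shift used for teacher forcing. The '<null>' start/padding symbol below is only a placeholder for illustration; it is not necessarily the special token the wrapper inserts when an offset is applied.
# Illustration only: the teacher-forcing shift applied to a toy sentence.
# '<null>' is a placeholder start symbol, not necessarily the token used
# internally by the Staged Keras Wrapper.
target = ['I', 'would', 'like', 'a', 'double', 'room', '.']
state_below = ['<null>'] + target[:-1]  # shifted one position to the right
for prev_word, next_word in zip(state_below, target):
    print('{} -> {}'.format(prev_word, next_word))
During training, the decoder receives each previous target word (state_below) and is trained to predict the corresponding next word (target_text).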
For setting up the outputs, we use the setOutput function with the appropriate parameters. Note that, when building the dataset for the training split, we also build the vocabulary (up to 30,000 words).
In [3]:
ds.setOutput('examples/EuTrans/training.en',
             'train',
             type='text',
             id='target_text',
             tokenization='tokenize_none',
             build_vocabulary=True,
             pad_on_batch=True,
             sample_weights=True,
             max_text_len=30,
             max_words=30000,
             min_occ=0)
ds.setOutput('examples/EuTrans/dev.en',
             'val',
             type='text',
             id='target_text',
             pad_on_batch=True,
             tokenization='tokenize_none',
             sample_weights=True,
             max_text_len=30,
             max_words=0)
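At this point, it can be useful to check that the target vocabulary was actually built. The quick sanity check below assumes that the Dataset instance stores its vocabularies in the vocabulary and vocabulary_len dictionaries, indexed by the data id (as in recent versions of the Staged Keras Wrapper); adapt the attribute names if your version differs.
# Sanity check: size and first entries of the target vocabulary.
# Attribute names ('vocabulary', 'vocabulary_len') assumed from the wrapper API.
print('Target vocabulary size: %d' % ds.vocabulary_len['target_text'])
print('Some target words: %s' % str(list(ds.vocabulary['target_text']['idx2words'].values())[:10]))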
Similarly, we introduce the source text data with the setInput function. Again, when building the training split, we must construct the vocabulary.
In [4]:
ds.setInput('examples/EuTrans/training.es',
            'train',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            build_vocabulary=True,
            fill='end',
            max_text_len=30,
            max_words=30000,
            min_occ=0)
ds.setInput('examples/EuTrans/dev.es',
            'val',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            fill='end',
            max_text_len=30,
            min_occ=0)
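As a quick check that the text files were correctly loaded, we can look at the number of samples registered per split. This assumes the Dataset keeps per-split counters named len_train and len_val (again, an assumption based on recent versions of the Staged Keras Wrapper).
# Number of loaded samples per split (counter names assumed from the wrapper API).
print('Training samples: %d' % ds.len_train)
print('Validation samples: %d' % ds.len_val)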
...and the same for the 'state_below' data. Note that: 1) The offset flag is set to 1, which means that the text will be shifted one position to the right. 2) At sampling time, this input won't be available, so we 'hack' the dataset by inserting an artificial input of type 'ghost' for the validation split.
In [5]:
ds.setInput('examples/EuTrans/training.en',
            'train',
            type='text',
            id='state_below',
            required=False,
            tokenization='tokenize_none',
            pad_on_batch=True,
            build_vocabulary='target_text',
            offset=1,
            fill='end',
            max_text_len=30,
            max_words=30000)
ds.setInput(None,
            'val',
            type='ghost',
            id='state_below',
            required=False)
We must match the references with the inputs, keeping a single reference (n=1) per sentence for the validation split:
In [6]:
# If we had multiple references per sentence
keep_n_captions(ds, repeat=1, n=1, set_names=['val'])
Finally, we can save our Dataset instance for use in other experiments:
In [7]:
saveDataset(ds, 'datasets')
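The saved instance can later be recovered with the loadDataset utility (as we will do in the next notebook). The path below follows the usual 'Dataset_<name>.pkl' naming scheme used by saveDataset; adjust it if your version of the wrapper stores the file differently.
from keras_wrapper.dataset import loadDataset
# Filename assumed from the 'Dataset_<name>.pkl' naming convention of saveDataset.
ds = loadDataset('datasets/Dataset_tutorial_dataset.pkl')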