This notebook describes, step by step, how to build a neural machine translation model with NMT-Keras. The tutorial is organized in several stages: building the dataset, creating and training the model, translating new text with beam search, and evaluating the output.
All of these steps are run automatically by the toolkit, but walking through them here is a good way to learn and understand the full process.
So, let's start by installing the toolkit.
In [1]:
!pip install --upgrade pip
!pip uninstall -y keras # Avoid crashes with pre-installed packages
!git clone https://github.com/lvapeab/nmt-keras
import os
os.chdir('nmt-keras')
!pip install -e .
First, we create a Dataset object (from the Multimodal Keras Wrapper library). This object will be the interface between our data (text files) and the model:
In [0]:
from keras_wrapper.dataset import Dataset, saveDataset
from data_engine.prepare_data import keep_n_captions
ds = Dataset('tutorial_dataset', 'tutorial', silence=False)
Now that we have an empty dataset, we must indicate its inputs and outputs. In our case, we'll have two different inputs and one single output:

Outputs:
- target_text: sentences in our target language.

Inputs:
- source_text: sentences in the source language.
- state_below: sentences in the target language, shifted one position to the right, used for teacher-forced training of the model (see the toy sketch below).
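To make the teacher-forcing shift concrete, here is a toy, toolkit-independent sketch of what an offset of 1 does to the decoder input (the tokens and the '<null>' padding symbol are made up for illustration):
# Toy illustration (not part of NMT-Keras): with an offset of 1, the decoder input
# at time step t ('state_below') is the reference word at time step t-1.
target_text = ['I', 'would', 'like', 'a', 'room', '.']   # hypothetical reference sentence
state_below = ['<null>'] + target_text[:-1]              # shifted one position to the right
for t, (inp, out) in enumerate(zip(state_below, target_text)):
    print('step %d: decoder input %-8s -> expected output %s' % (t, inp, out))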
For setting up the output, we use the setOutput method with the appropriate parameters. Note that, when building the training split, we also build the vocabulary (up to 30000 words).
In [3]:
ds.setOutput('examples/EuTrans/training.en',
'train',
type='text',
id='target_text',
tokenization='tokenize_none',
build_vocabulary=True,
pad_on_batch=True,
sample_weights=True,
max_text_len=30,
max_words=30000,
min_occ=0)
ds.setOutput('examples/EuTrans/dev.en',
'val',
type='text',
id='target_text',
pad_on_batch=True,
tokenization='tokenize_none',
sample_weights=True,
max_text_len=30,
max_words=0)
Similarly, we introduce the source text data with the setInput method. Again, when building the training split, we must construct the vocabulary.
In [4]:
ds.setInput('examples/EuTrans/training.es',
'train',
type='text',
id='source_text',
pad_on_batch=True,
tokenization='tokenize_none',
build_vocabulary=True,
fill='end',
max_text_len=30,
max_words=30000,
min_occ=0)
ds.setInput('examples/EuTrans/dev.es',
'val',
type='text',
id='source_text',
pad_on_batch=True,
tokenization='tokenize_none',
fill='end',
max_text_len=30,
min_occ=0)
...and the same for the 'state_below' data. Note that: 1) the offset parameter is set to 1, which means the text will be shifted one position to the right; 2) at sampling time we won't have this input, so we 'hack' the dataset by inserting an artificial input of type 'ghost' for the validation split.
In [5]:
ds.setInput('examples/EuTrans/training.en',
'train',
type='text',
id='state_below',
required=False,
tokenization='tokenize_none',
pad_on_batch=True,
build_vocabulary='target_text',
offset=1,
fill='end',
max_text_len=30,
max_words=30000)
ds.setInput(None,
'val',
type='ghost',
id='state_below',
required=False)
We can also keep the literal source words (for replacing unknown words).
In [6]:
for split, input_text_filename in zip(['train', 'val'], ['examples/EuTrans/training.es', 'examples/EuTrans/dev.es']):
    ds.setRawInput(input_text_filename,
                   split,
                   type='file-name',
                   id='raw_source_text',
                   overwrite_split=True)
We also need to match the references with the inputs. Since we only have one reference per input sample, we set repeat=1.
In [7]:
keep_n_captions(ds, repeat=1, n=1, set_names=['val'])
Finally, we can save our dataset instance for use in other experiments:
In [8]:
saveDataset(ds, 'datasets')
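When we need the dataset in another experiment, we can restore it with loadDataset, exactly as we will do at the beginning of the next section:
# Restore the dataset instance saved above (same path as used later in this tutorial).
from keras_wrapper.dataset import loadDataset
ds = loadDataset('datasets/Dataset_tutorial_dataset.pkl')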
Now, we'll create and train a Neural Machine Translation (NMT) model. Since there is a significant number of hyperparameters, we'll use the default ones, specified in the config.py file. Note that almost every hardcoded parameter is set automatically from the config if we run main.py.
We'll create an 'AttentionRNNEncoderDecoder' (an LSTM encoder-decoder with an attention mechanism). Refer to the model_zoo.py file for other models (e.g. the Transformer).
So first, let's import the model and the hyperparameters. We'll also load the dataset we stored in the previous section (not strictly necessary, since it is still in memory, but useful as a demonstration):
In [9]:
from config import load_parameters
from nmt_keras.model_zoo import TranslationModel
from keras_wrapper.cnn_model import loadModel
from keras_wrapper.dataset import loadDataset
from keras_wrapper.extra.callbacks import PrintPerformanceMetricOnEpochEndOrEachNUpdates
params = load_parameters()
dataset = loadDataset('datasets/Dataset_tutorial_dataset.pkl')
Since the number of words in the dataset may be unknown beforehand, we must update the params dictionary according to the dataset instance:
In [0]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['source_text']
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['target_text']
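As a quick (optional) sanity check, we can print the sizes we just copied from the dataset instance:
# Confirm the vocabulary sizes were picked up from the dataset.
print('Source vocabulary size:', params['INPUT_VOCABULARY_SIZE'])
print('Target vocabulary size:', params['OUTPUT_VOCABULARY_SIZE'])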
Now, we create a TranslationModel instance:
In [11]:
params['MODEL_TYPE'] = 'AttentionRNNEncoderDecoder' # Supported models: 'AttentionRNNEncoderDecoder' and 'Transformer'.
nmt_model = TranslationModel(params,
model_type=params['MODEL_TYPE'],
model_name='tutorial_model',
vocabularies=dataset.vocabulary,
store_path='trained_models/tutorial_model/',
verbose=True)
Next, we must define the inputs and outputs mapping from our Dataset instance to our model:
In [0]:
inputMapping = dict()
for i, id_in in enumerate(params['INPUTS_IDS_DATASET']):
    pos_source = dataset.ids_inputs.index(id_in)
    id_dest = nmt_model.ids_inputs[i]
    inputMapping[id_dest] = pos_source
nmt_model.setInputsMapping(inputMapping)

outputMapping = dict()
for i, id_out in enumerate(params['OUTPUTS_IDS_DATASET']):
    pos_target = dataset.ids_outputs.index(id_out)
    id_dest = nmt_model.ids_outputs[i]
    outputMapping[id_dest] = pos_target
nmt_model.setOutputsMapping(outputMapping)
We can add some callbacks for controlling the training (e.g. sampling every N updates, early stopping, learning rate annealing...). For instance, let's build a sampling callback: after each epoch, it will compute the BLEU score on the development set using the sacreBLEU package. We need to pass some configuration variables to the callback (via the extra_vars argument):
In [0]:
is_transformer = params.get('ATTEND_ON_OUTPUT', 'transformer' in params['MODEL_TYPE'].lower())
search_params = {
'language': 'en',
'tokenize_f': eval('dataset.' + 'tokenize_none'),
'beam_size': 12,
'optimized_search': True,
'model_inputs': params['INPUTS_IDS_MODEL'],
'model_outputs': params['OUTPUTS_IDS_MODEL'],
'dataset_inputs': params['INPUTS_IDS_DATASET'],
'dataset_outputs': params['OUTPUTS_IDS_DATASET'],
'n_parallel_loaders': 1,
'maxlen': 50,
'normalize_probs': True,
'pos_unk': True and not is_transformer, # Pos_unk is unimplemented for transformer models
'heuristic': 0,
'state_below_maxlen': -1,
'attend_on_output': is_transformer,
'val': {'references': dataset.extra_variables['val']['target_text']}
}
vocab = dataset.vocabulary['target_text']['idx2words']
callbacks = []
input_text_id = params['INPUTS_IDS_DATASET'][0]
callbacks.append(PrintPerformanceMetricOnEpochEndOrEachNUpdates(nmt_model,
dataset,
gt_id='target_text',
metric_name=['sacrebleu'],
set_name=['val'],
batch_size=50,
each_n_epochs=1,
extra_vars=search_params,
reload_epoch=0,
is_text=True,
input_text_id=input_text_id,
index2word_y=vocab,
sampling_type='max_likelihood',
beam_search=True,
save_path=nmt_model.model_path,
start_eval_on_epoch=0,
write_samples=True,
write_type='list',
verbose=True))
Now we are ready to train. Let's set up some training parameters...
In [0]:
training_params = {'n_epochs': 4,
'batch_size': 50,
'maxlen': 30,
'epochs_for_save': 1,
'verbose': 1,
'eval_on_sets': [],
'n_parallel_loaders': 1,
'extra_callbacks': callbacks,
'reload_epoch': 0,
'epoch_offset': 0}
And train!
In [15]:
nmt_model.trainNet(dataset, training_params)
Now, we'll load from disk the model we just trained and apply it to translate new text. In this case, we want to translate the 'test' split of our dataset.
Since we want to translate a new data split ('test'), we must add it to the dataset instance, just as we did before (in the first section). If we also had the references of the test split and wanted to evaluate on it, we could add them to the dataset as well. Note that this is not mandatory: we could just predict without evaluating.
In [16]:
dataset.setInput('examples/EuTrans/test.es',
'test',
type='text',
id='source_text',
pad_on_batch=True,
tokenization='tokenize_none',
fill='end',
max_text_len=30,
min_occ=0)
dataset.setInput(None,
'test',
type='ghost',
id='state_below',
required=False)
dataset.setRawInput('examples/EuTrans/test.es',
'test',
type='file-name',
id='raw_source_text',
overwrite_split=True)
Now, let's load the translation model. Suppose we want to load the model saved at the end of epoch 4:
In [17]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len[params['INPUTS_IDS_DATASET'][0]]
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len[params['OUTPUTS_IDS_DATASET'][0]]
# Load model
nmt_model = loadModel('trained_models/tutorial_model', 4)
Once we have loaded the model, we just have to invoke the sampling method (in this case, the beam search algorithm) on the 'test' split:
In [18]:
is_transformer = params.get('ATTEND_ON_OUTPUT', 'transformer' in params['MODEL_TYPE'].lower())
params_prediction = {
'language': 'en',
'tokenize_f': eval('dataset.' + 'tokenize_none'),
'beam_size': 12,
'optimized_search': True,
'model_inputs': params['INPUTS_IDS_MODEL'],
'model_outputs': params['OUTPUTS_IDS_MODEL'],
'dataset_inputs': params['INPUTS_IDS_DATASET'],
'dataset_outputs': params['OUTPUTS_IDS_DATASET'],
'n_parallel_loaders': 1,
'maxlen': 50,
'normalize_probs': True,
'pos_unk': True and not is_transformer,
'heuristic': 0,
'state_below_maxlen': -1,
'predict_on_sets': ['test'],
'verbose': 0,
'attend_on_output': is_transformer
}
predictions = nmt_model.predictBeamSearchNet(dataset, params_prediction)['test']
So far, the variable 'predictions' contains the indices of the words of the hypotheses. We must decode them into words. To do this, we'll use the vocabulary stored in the dataset object:
In [19]:
from keras_wrapper.utils import decode_predictions_beam_search
vocab = dataset.vocabulary['target_text']['idx2words']
samples = predictions['samples'] # Get word indices from the samples.
predictions = decode_predictions_beam_search(samples,
vocab,
verbose=params['VERBOSE'])
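As a quick (optional) check, the decoded predictions are now plain-text sentences (a list of strings), so we can print a couple of them directly:
# Inspect the first decoded hypotheses.
for hyp in predictions[:2]:
    print(hyp)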
Finally, we store the hypotheses:
In [20]:
filepath = 'test.pred'
from keras_wrapper.extra.read_write import list2file
list2file(filepath, predictions)
!head -n 4 test.pred
If we have the references of this split, we can also evaluate the performance of our system on it. First, we must add them to the dataset object:
In [21]:
dataset.setOutput('examples/EuTrans/test.en',
'test',
type='text',
id='target_text',
pad_on_batch=True,
tokenization='tokenize_none',
sample_weights=True,
max_text_len=30,
max_words=0)
keep_n_captions(dataset, repeat=1, n=1, set_names=['test'])
Next, we call the evaluation system (the sacreBLEU package):
In [22]:
from keras_wrapper.extra.evaluation import select
metric = 'sacrebleu'
# Apply sampling
extra_vars = dict()
extra_vars['tokenize_f'] = eval('dataset.' + 'tokenize_none')
extra_vars['language'] = params['TRG_LAN']
extra_vars['test'] = dict()
extra_vars['test']['references'] = dataset.extra_variables['test']['target_text']
metrics = select[metric](pred_list=predictions,
verbose=1,
extra_vars=extra_vars,
split='test')
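To inspect the result, we can simply print the returned object (a minimal sketch, assuming the selected metric function returns a dictionary mapping metric names to scores):
# Print the computed scores (assumed to be a dict of metric name -> value).
print(metrics)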
And that's all!