Now, we'll create and train a Neural Machine Translation (NMT) model. Since there is a significant number of hyperparameters, we'll use the default values specified in the config.py file. Note that almost every hardcoded parameter is automatically set from the config if we run main.py.
We'll create the so-called 'GroundHogModel', which is defined in the model_zoo.py file. See neural_machine_translation.pdf for an overview of such a system.
If you followed the notebook 1_dataset_tutorial.ipynb, you should already have a dataset instance. Otherwise, you should go through that notebook first.
First, we'll make some imports, load the default parameters and load the dataset.
In [8]:
from config import load_parameters
from model_zoo import TranslationModel
import utils
from keras_wrapper.cnn_model import loadModel
from keras_wrapper.dataset import loadDataset
from keras_wrapper.extra.callbacks import PrintPerformanceMetricOnEpochEndOrEachNUpdates

# Load the default hyperparameters and the Dataset instance built in the dataset tutorial
params = load_parameters()
dataset = loadDataset('datasets/Dataset_tutorial_dataset.pkl')
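We can optionally take a quick look at the loaded dataset to see which data ids it defines and the size of each vocabulary. The attributes used below (ids_inputs, ids_outputs, vocabulary_len) are the same ones relied on later in this notebook; the example values in the comments assume the default ids of this tutorial.
In [ ]:
# Inspect the Dataset instance (purely optional)
print(dataset.ids_inputs)      # data ids used as inputs, e.g. ['source_text', 'state_below']
print(dataset.ids_outputs)     # data ids used as outputs, e.g. ['target_text']
print(dataset.vocabulary_len)  # vocabulary size per data id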
Since the number of words in the dataset is not known beforehand, we must update the corresponding params entries according to the dataset instance:
In [2]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['source_text']
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['target_text']
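As a quick sanity check, we can print the values we have just copied into params:
In [ ]:
# Sanity check: vocabulary sizes picked up from the dataset
print(params['INPUT_VOCABULARY_SIZE'])
print(params['OUTPUT_VOCABULARY_SIZE'])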
Now, we create a TranslationModel instance:
In [4]:
nmt_model = TranslationModel(params,
                             model_type='GroundHogModel',
                             model_name='tutorial_model',
                             vocabularies=dataset.vocabulary,
                             store_path='trained_models/tutorial_model/',
                             verbose=True)
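If you want to inspect the architecture that has just been built, the wrapper keeps the underlying Keras model in the model attribute (this is how keras_wrapper usually exposes it; treat the attribute name as an assumption if your version differs), so the standard Keras summary is available:
In [ ]:
# Print the architecture of the underlying Keras model (assumes it is exposed as `.model`)
nmt_model.model.summary()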
Now, we must define the mapping between the inputs and outputs of our Dataset instance and those of our model:
In [6]:
# Map each input of the model to the position of the corresponding data id in the dataset
inputMapping = dict()
for i, id_in in enumerate(params['INPUTS_IDS_DATASET']):
    pos_source = dataset.ids_inputs.index(id_in)
    id_dest = nmt_model.ids_inputs[i]
    inputMapping[id_dest] = pos_source
nmt_model.setInputsMapping(inputMapping)

# And do the same for the outputs
outputMapping = dict()
for i, id_out in enumerate(params['OUTPUTS_IDS_DATASET']):
    pos_target = dataset.ids_outputs.index(id_out)
    id_dest = nmt_model.ids_outputs[i]
    outputMapping[id_dest] = pos_target
nmt_model.setOutputsMapping(outputMapping)
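With the default data ids of this tutorial ('source_text', 'state_below' and 'target_text'), the resulting dictionaries should simply map each model id to the position of the corresponding data id in the dataset. Printing them is an easy way to verify the mapping; the values shown in the comments below are what we would expect under that assumption.
In [ ]:
# Verify the mappings (expected values assume the default data ids of this tutorial)
print(inputMapping)   # e.g. {'source_text': 0, 'state_below': 1}
print(outputMapping)  # e.g. {'target_text': 0}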
We can add some callbacks for controlling the training process (e.g. sampling every N updates, early stopping, learning rate annealing...). For instance, let's build an Early-Stop callback: every 2 epochs, it will compute the 'coco' scores on the development set, and if the 'Bleu_4' metric doesn't improve for more than 5 checks, training will stop. We need to pass some variables to the callback (in the extra_vars dictionary):
In [10]:
extra_vars = {'language': 'en',
              'n_parallel_loaders': 8,
              'tokenize_f': eval('dataset.' + 'tokenize_none'),
              'beam_size': 12,
              'maxlen': 50,
              'model_inputs': ['source_text', 'state_below'],
              'model_outputs': ['target_text'],
              'dataset_inputs': ['source_text', 'state_below'],
              'dataset_outputs': ['target_text'],
              'normalize': True,
              'alpha_factor': 0.6,
              'val': {'references': dataset.extra_variables['val']['target_text']}
              }
vocab = dataset.vocabulary['target_text']['idx2words']
callbacks = []
callbacks.append(PrintPerformanceMetricOnEpochEndOrEachNUpdates(nmt_model,
                                                                dataset,
                                                                gt_id='target_text',
                                                                metric_name=['coco'],
                                                                set_name=['val'],
                                                                batch_size=50,
                                                                each_n_epochs=2,
                                                                extra_vars=extra_vars,
                                                                reload_epoch=0,
                                                                is_text=True,
                                                                index2word_y=vocab,
                                                                sampling_type='max_likelihood',
                                                                beam_search=True,
                                                                save_path=nmt_model.model_path,
                                                                start_eval_on_epoch=0,
                                                                write_samples=True,
                                                                write_type='list',
                                                                verbose=True))
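As a side note, the eval('dataset.' + 'tokenize_none') expression used in extra_vars above is just a way of retrieving a method of the dataset by its name. An equivalent and arguably clearer formulation, shown here as a sketch, uses getattr:
In [ ]:
# Equivalent to eval('dataset.' + 'tokenize_none'): fetch the bound method by name
extra_vars['tokenize_f'] = getattr(dataset, 'tokenize_none')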
Now we are almost ready to train. We set up some training parameters...
In [11]:
training_params = {'n_epochs': 100,
                   'batch_size': 40,
                   'maxlen': 30,                  # maximum length of the training sequences
                   'epochs_for_save': 1,          # save a checkpoint every epoch
                   'verbose': 0,
                   'eval_on_sets': [],            # evaluation is handled by our callback
                   'n_parallel_loaders': 8,
                   'extra_callbacks': callbacks,
                   'reload_epoch': 0,
                   'epoch_offset': 0}
And train!
In [ ]:
nmt_model.trainNet(dataset, training_params)
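Once training has produced some checkpoints (one per epoch, given the epochs_for_save value above), the model can be reloaded with the loadModel function imported at the beginning of this notebook. Here is a minimal sketch, assuming we want the checkpoint saved after the first epoch and that checkpoints are stored under the store_path given to the TranslationModel:
In [ ]:
# Reload the checkpoint saved after epoch 1 (sketch; adjust the epoch number to your run)
trained_nmt_model = loadModel('trained_models/tutorial_model/', 1)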