Now, we'll create and train a Neural Machine Translation (NMT) model. Since there are quite a few hyperparameters, we'll use the default values specified in the config.py file. Note that almost every hardcoded parameter is automatically set from the config if we run main.py.
We'll create the so-called 'GroundHogModel'. It is defined in the model_zoo.py file. See neural_machine_translation.pdf for an overview of such a system.
If you followed the notebook 1_dataset_tutorial.ipynb, you should already have a Dataset instance. Otherwise, you should follow that notebook first.
First, we'll make some imports, load the default parameters and load the dataset.
In [8]:
from config import load_parameters
from model_zoo import TranslationModel
import utils
from keras_wrapper.cnn_model import loadModel
from keras_wrapper.dataset import loadDataset
from keras_wrapper.extra.callbacks import PrintPerformanceMetricOnEpochEndOrEachNUpdates
params = load_parameters()
dataset = loadDataset('datasets/Dataset_tutorial_dataset.pkl')
Since the number of words in the dataset may be unknown beforehand, we must update the vocabulary sizes in params according to the dataset instance:
In [2]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['source_text']
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['target_text']
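As a quick (optional) sanity check, we can print the vocabulary sizes we just copied into params:
In [ ]:
# Optional check: both sizes should be positive integers matching the dataset vocabularies
print('Source vocabulary size:', params['INPUT_VOCABULARY_SIZE'])
print('Target vocabulary size:', params['OUTPUT_VOCABULARY_SIZE'])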
Now, we create a TranslationModel instance:
In [4]:
nmt_model = TranslationModel(params,
                             model_type='GroundHogModel',
                             model_name='tutorial_model',
                             vocabularies=dataset.vocabulary,
                             store_path='trained_models/tutorial_model/',
                             verbose=True)
Now, we must define the mapping of inputs and outputs from our Dataset instance to our model:
In [6]:
# Map each model input to the position of the corresponding input in the dataset
inputMapping = dict()
for i, id_in in enumerate(params['INPUTS_IDS_DATASET']):
    pos_source = dataset.ids_inputs.index(id_in)
    id_dest = nmt_model.ids_inputs[i]
    inputMapping[id_dest] = pos_source
nmt_model.setInputsMapping(inputMapping)

# Do the same for the outputs
outputMapping = dict()
for i, id_out in enumerate(params['OUTPUTS_IDS_DATASET']):
    pos_target = dataset.ids_outputs.index(id_out)
    id_dest = nmt_model.ids_outputs[i]
    outputMapping[id_dest] = pos_target
nmt_model.setOutputsMapping(outputMapping)
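With the default configuration, the resulting mappings typically look like {'source_text': 0, 'state_below': 1} for the inputs and {'target_text': 0} for the outputs, although the exact ids depend on INPUTS_IDS_DATASET and OUTPUTS_IDS_DATASET in config.py. We can print them to check:
In [ ]:
# Inspect the mappings we just defined (their exact contents depend on config.py)
print(inputMapping)
print(outputMapping)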
We can add some callbacks for controlling the training (e.g. sampling each N updates, early stopping, learning rate annealing...). For instance, let's build an Early-Stop callback. Every 2 epochs, it will compute the 'coco' scores on the development set. If the metric 'Bleu_4' doesn't improve for more than 5 evaluations, training will stop. We need to pass some variables to the callback (in the extra_vars dictionary):
In [10]:
extra_vars = {'language': 'en',
              'n_parallel_loaders': 8,
              'tokenize_f': eval('dataset.' + 'tokenize_none'),
              'beam_size': 12,
              'maxlen': 50,
              'model_inputs': ['source_text', 'state_below'],
              'model_outputs': ['target_text'],
              'dataset_inputs': ['source_text', 'state_below'],
              'dataset_outputs': ['target_text'],
              'normalize': True,
              'alpha_factor': 0.6,
              'val': {'references': dataset.extra_variables['val']['target_text']}
              }
vocab = dataset.vocabulary['target_text']['idx2words']
callbacks = []
callbacks.append(PrintPerformanceMetricOnEpochEndOrEachNUpdates(nmt_model,
                                                                dataset,
                                                                gt_id='target_text',
                                                                metric_name=['coco'],
                                                                set_name=['val'],
                                                                batch_size=50,
                                                                each_n_epochs=2,
                                                                extra_vars=extra_vars,
                                                                reload_epoch=0,
                                                                is_text=True,
                                                                index2word_y=vocab,
                                                                sampling_type='max_likelihood',
                                                                beam_search=True,
                                                                save_path=nmt_model.model_path,
                                                                start_eval_on_epoch=0,
                                                                write_samples=True,
                                                                write_type='list',
                                                                verbose=True))
Now we are almost ready to train. We set up some training parameters...
In [11]:
training_params = {'n_epochs': 100,
                   'batch_size': 40,
                   'maxlen': 30,                  # maximum length of the training sequences
                   'epochs_for_save': 1,          # store the model after each epoch
                   'verbose': 0,
                   'eval_on_sets': [],            # evaluation is handled by the callback defined above
                   'n_parallel_loaders': 8,
                   'extra_callbacks': callbacks,
                   'reload_epoch': 0,
                   'epoch_offset': 0}
And train!
In [ ]:
nmt_model.trainNet(dataset, training_params)
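Once some epochs have been stored, we can reload a trained model with the loadModel function imported above. A minimal sketch, assuming a model saved after the first epoch exists under the store_path we used:
In [ ]:
# Hypothetical example: reload the model stored after epoch 1
nmt_model = loadModel('trained_models/tutorial_model/', 1)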