NMT-Keras tutorial

2. Creating and training a Neural Translation Model

Now, we'll create and train a Neural Machine Translation (NMT) model. Since the model has a large number of hyperparameters, we'll use the defaults specified in the config.py file. Note that, when running main.py, almost every hardcoded parameter is set automatically from config.py.

We'll create the so-called 'GroundHogModel': an attention-based sequence-to-sequence model defined in the model_zoo.py file. See neural_machine_translation.pdf for an overview of such a system.

If you followed the notebook 1_dataset_tutorial.ipynb, you should already have a Dataset instance stored on disk. Otherwise, go through that notebook first.

First, we'll make some imports, then load the default parameters and the dataset.


In [8]:
from config import load_parameters
from model_zoo import TranslationModel
import utils
from keras_wrapper.cnn_model import loadModel
from keras_wrapper.dataset import loadDataset
from keras_wrapper.extra.callbacks import PrintPerformanceMetricOnEpochEndOrEachNUpdates

# Load the default hyperparameters and the Dataset built in the previous tutorial
params = load_parameters()
dataset = loadDataset('datasets/Dataset_tutorial_dataset.pkl')


[26/04/2017 13:51:24] <<< Loading Dataset instance from datasets/Dataset_tutorial_dataset.pkl ... >>>
[26/04/2017 13:51:24] <<< Dataset instance loaded >>>

Since the vocabulary sizes are not known beforehand, we must update params with the values computed by the Dataset instance:


In [2]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['source_text']
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['target_text']
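For example, we can quickly check the values we just set (they come straight from the Dataset built in the previous notebook):

print('Source vocabulary size: %d' % params['INPUT_VOCABULARY_SIZE'])
print('Target vocabulary size: %d' % params['OUTPUT_VOCABULARY_SIZE'])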

Now, we create a TranslationModel instance:


In [4]:
nmt_model = TranslationModel(params,
                             model_type='GroundHogModel', 
                             model_name='tutorial_model',
                             vocabularies=dataset.vocabulary,
                             store_path='trained_models/tutorial_model/',
                             verbose=True)


[26/04/2017 13:50:11] <<< Building GroundHogModel Translation_Model >>>
-----------------------------------------------------------------------------------
		TranslationModel instance
-----------------------------------------------------------------------------------
_model_type: GroundHogModel
name: tutorial_model
model_path: trained_models/tutorial_model/
verbose: True

MODEL params:
{'SAMPLE_ON_SETS': ['train', 'val'], 'CLIP_C': 1.0, 'HEURISTIC': 0, 'SAMPLING_SAVE_MODE': 'list', 'TRG_LAN': 'pe', 'SAMPLING': 'max_likelihood', 'SAMPLE_EACH_UPDATES': 300, 'N_LAYERS_DECODER': 1, 'TRG_PRETRAINED_VECTORS_TRAINABLE': True, 'USE_PRELU': False, 'POS_UNK': False, 'ENCODER_HIDDEN_SIZE': 256, 'REBUILD_DATASET': True, 'METRICS': ['coco'], 'TOKENIZE_REFERENCES': True, 'OPTIMIZER': 'Adadelta', 'SOURCE_TEXT_EMBEDDING_SIZE': 300, 'EVAL_EACH_EPOCHS': True, 'EPOCHS_FOR_SAVE': 1, 'USE_BATCH_NORMALIZATION': True, 'BATCH_SIZE': 50, 'MODEL_NAME': 'APE_mtpe_GroundHogModel_src_emb_300_bidir_True_enc_LSTM_256_dec_LSTM_256_deepout_linear_trg_emb_300_Adadelta_1.0', 'BATCH_NORMALIZATION_MODE': 1, 'EVAL_ON_SETS_KERAS': [], 'N_SAMPLES': 5, 'RECURRENT_DROPOUT_P': 0.5, 'WEIGHT_DECAY': 0.0001, 'OUTPUTS_IDS_DATASET': ['target_text'], 'INIT_LAYERS': ['tanh'], 'EARLY_STOP': True, 'DATA_AUGMENTATION': False, 'TEXT_FILES': {'test': 'test.', 'train': 'training.', 'val': 'dev.'}, 'MAPPING': '/media/HDD_2TB/DATASETS/APE/in-domain/joint_bpe//mapping.mt_pe.pkl', 'DROPOUT_P': 0.5, 'RECURRENT_WEIGHT_DECAY': 0.0, 'ADDITIONAL_OUTPUT_MERGE_MODE': 'sum', 'PARALLEL_LOADERS': 1, 'ALIGN_FROM_RAW': True, 'SAMPLE_WEIGHTS': True, 'EVAL_EACH': 1, 'EXTRA_NAME': '', 'MIN_OCCURRENCES_INPUT_VOCAB': 0, 'INIT_FUNCTION': 'glorot_uniform', 'LOSS': 'categorical_crossentropy', 'WRITE_VALID_SAMPLES': True, 'INPUTS_IDS_MODEL': ['source_text', 'state_below'], 'MODE': 'training', 'LR_GAMMA': 0.8, 'NOISE_AMOUNT': 0.01, 'SRC_PRETRAINED_VECTORS_TRAINABLE': True, 'STOP_METRIC': 'TER', 'N_LAYERS_ENCODER': 1, 'INPUTS_IDS_DATASET': ['source_text', 'state_below'], 'BIDIRECTIONAL_ENCODER': True, 'MAX_OUTPUT_TEXT_LEN_TEST': 150, 'FORCE_RELOAD_VOCABULARY': False, 'layer': ('linear', 300), 'TOKENIZATION_METHOD': 'tokenize_none', 'OUTPUT_VOCABULARY_SIZE': 516, 'SRC_PRETRAINED_VECTORS': None, 'START_EVAL_ON_EPOCH': 1, 'BEAM_SEARCH': True, 'TARGET_TEXT_EMBEDDING_SIZE': 300, 'DECODER_HIDDEN_SIZE': 256, 'MODEL_TYPE': 'GroundHogModel', 'STORE_PATH': '/media/HDD_2TB/MODELS/APE/trained_models/APE_mtpe_GroundHogModel_src_emb_300_bidir_True_enc_LSTM_256_dec_LSTM_256_deepout_linear_trg_emb_300_Adadelta_1.0/', 'TRG_PRETRAINED_VECTORS': None, 'JOINT_BATCHES': 4, 'CLIP_V': 0.0, 'SKIP_VECTORS_HIDDEN_SIZE': 300, 'NORMALIZE_SAMPLING': True, 'MAX_OUTPUT_TEXT_LEN': 50, 'PAD_ON_BATCH': True, 'START_SAMPLING_ON_EPOCH': 1, 'RNN_TYPE': 'LSTM', 'INPUT_VOCABULARY_SIZE': 689, 'BEAM_SIZE': 6, 'TRAIN_ON_TRAINVAL': False, 'LR': 1.0, 'SRC_LAN': 'mt', 'OPTIMIZED_SEARCH': True, 'CLASSIFIER_ACTIVATION': 'softmax', 'FILL': 'end', 'ALPHA_FACTOR': 0.6, 'TEMPERATURE': 1, 'MAX_INPUT_TEXT_LEN': 50, 'USE_RECURRENT_DROPOUT': False, 'DATASET_STORE_PATH': 'datasets/', 'APPLY_DETOKENIZATION': False, 'PATIENCE': 20, 'SAVE_EACH_EVALUATION': True, 'DATA_ROOT_PATH': '/media/HDD_2TB/DATASETS/APE/in-domain/joint_bpe/', 'HOMOGENEOUS_BATCHES': False, 'LR_DECAY': None, 'DATASET_NAME': 'APE', 'USE_DROPOUT': False, 'TOKENIZE_HYPOTHESES': True, 'VERBOSE': 1, 'MIN_OCCURRENCES_OUTPUT_VOCAB': 0, 'BIDIRECTIONAL_DEEP_ENCODER': True, 'OUTPUTS_IDS_MODEL': ['target_text'], 'USE_NOISE': True, 'DETOKENIZATION_METHOD': 'detokenize_bpe', 'DEEP_OUTPUT_LAYERS': [('linear', 300)], 'RELOAD': 0, 'EVAL_ON_SETS': ['val'], 'MAX_EPOCH': 500, 'USE_L2': False}
-----------------------------------------------------------------------------------
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
source_text (InputLayer)         (None, None)          0                                            
____________________________________________________________________________________________________
source_word_embedding (Embedding (None, None, 300)     206700      source_text[0][0]                
____________________________________________________________________________________________________
src_embedding_gaussian_noise (Ga (None, None, 300)     0           source_word_embedding[0][0]      
____________________________________________________________________________________________________
src_embedding_batch_normalizatio (None, None, 300)     1200        src_embedding_gaussian_noise[0][0
____________________________________________________________________________________________________
bidirectional_encoder_LSTM (Bidi (None, None, 512)     1140736     src_embedding_batch_normalization
____________________________________________________________________________________________________
annotations_gaussian_noise (Gaus (None, None, 512)     0           bidirectional_encoder_LSTM[0][0] 
____________________________________________________________________________________________________
annotations_batch_normalization  (None, None, 512)     2048        annotations_gaussian_noise[0][0] 
____________________________________________________________________________________________________
state_below (InputLayer)         (None, None)          0                                            
____________________________________________________________________________________________________
maskedmean_1 (MaskedMean)        (None, 512)           0           annotations_batch_normalization[0
____________________________________________________________________________________________________
target_word_embedding (Embedding (None, None, 300)     154800      state_below[0][0]                
____________________________________________________________________________________________________
initial_state (Dense)            (None, 256)           131328      maskedmean_1[0][0]               
____________________________________________________________________________________________________
initial_memory (Dense)           (None, 256)           131328      maskedmean_1[0][0]               
____________________________________________________________________________________________________
state_below_gaussian_noise (Gaus (None, None, 300)     0           target_word_embedding[0][0]      
____________________________________________________________________________________________________
initial_state_gaussian_noise (Ga (None, 256)           0           initial_state[0][0]              
____________________________________________________________________________________________________
initial_memory_gaussian_noise (G (None, 256)           0           initial_memory[0][0]             
____________________________________________________________________________________________________
state_below_batch_normalization  (None, None, 300)     1200        state_below_gaussian_noise[0][0] 
____________________________________________________________________________________________________
masklayer_1 (MaskLayer)          (None, None, 512)     0           annotations_batch_normalization[0
____________________________________________________________________________________________________
initial_state_batch_normalizatio (None, 256)           1024        initial_state_gaussian_noise[0][0
____________________________________________________________________________________________________
initial_memory_batch_normalizati (None, 256)           1024        initial_memory_gaussian_noise[0][
____________________________________________________________________________________________________
decoder_AttLSTMCond (AttLSTMCond [(None, None, 256), ( 1488897     state_below_batch_normalization[0
                                                                   masklayer_1[0][0]                
                                                                   initial_state_batch_normalization
                                                                   initial_memory_batch_normalizatio
____________________________________________________________________________________________________
proj_h0_gaussian_noise (Gaussian (None, None, 256)     0           decoder_AttLSTMCond[0][0]        
____________________________________________________________________________________________________
proj_h0_batch_normalization (Bat (None, None, 256)     1024        proj_h0_gaussian_noise[0][0]     
____________________________________________________________________________________________________
logit_ctx (TimeDistributed)      multiple              153900      decoder_AttLSTMCond[0][1]        
____________________________________________________________________________________________________
logit_lstm (TimeDistributed)     multiple              77100       proj_h0_batch_normalization[0][0]
____________________________________________________________________________________________________
permutegeneral_1 (PermuteGeneral (None, None, 300)     0           logit_ctx[0][0]                  
____________________________________________________________________________________________________
logit_emb (TimeDistributed)      multiple              90300       state_below_batch_normalization[0
____________________________________________________________________________________________________
out_layer_mlp_gaussian_noise (Ga (None, None, 300)     0           logit_lstm[0][0]                 
____________________________________________________________________________________________________
out_layer_ctx_gaussian_noise (Ga (None, None, 300)     0           permutegeneral_1[0][0]           
____________________________________________________________________________________________________
out_layer_emb_gaussian_noise (Ga (None, None, 300)     0           logit_emb[0][0]                  
____________________________________________________________________________________________________
out_layer_mlp_batch_normalizatio (None, None, 300)     1200        out_layer_mlp_gaussian_noise[0][0
____________________________________________________________________________________________________
out_layer_ctx_batch_normalizatio (None, None, 300)     1200        out_layer_ctx_gaussian_noise[0][0
____________________________________________________________________________________________________
out_layer_emb_batch_normalizatio (None, None, 300)     1200        out_layer_emb_gaussian_noise[0][0
____________________________________________________________________________________________________
additional_input (Merge)         (None, None, 300)     0           out_layer_mlp_batch_normalization
                                                                   out_layer_ctx_batch_normalization
                                                                   out_layer_emb_batch_normalization
____________________________________________________________________________________________________
activation_1 (Activation)        (None, None, 300)     0           additional_input[0][0]           
____________________________________________________________________________________________________
linear_0 (TimeDistributed)       multiple              90300       activation_1[0][0]               
____________________________________________________________________________________________________
out_layerlinear_gaussian_noise ( (None, None, 300)     0           linear_0[0][0]                   
____________________________________________________________________________________________________
out_layerlinear_batch_normalizat (None, None, 300)     1200        out_layerlinear_gaussian_noise[0]
____________________________________________________________________________________________________
target_text (TimeDistributed)    multiple              155316      out_layerlinear_batch_normalizati
====================================================================================================
[26/04/2017 13:50:15] Preparing optimizer: Adadelta [LR: 1.0 - LOSS: categorical_crossentropy] and compiling.
Total params: 3,833,025
Trainable params: 3,826,865
Non-trainable params: 6,160
____________________________________________________________________________________________________
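The summary confirms the architecture: source words are embedded into 300-dimensional vectors and encoded by a bidirectional LSTM (256 units per direction); the decoder is a conditional LSTM with attention (AttLSTMCond) whose initial state and memory are computed from the mean of the encoder annotations; and the decoder output, the attended context and the previous word embedding are summed and passed through a deep output layer before the final softmax over the target vocabulary.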

Now, we must define the mapping between the inputs and outputs of our Dataset instance and those of our model:


In [6]:
# Match each input of the model with the position of the
# corresponding input in the Dataset instance
inputMapping = dict()
for i, id_in in enumerate(params['INPUTS_IDS_DATASET']):
    pos_source = dataset.ids_inputs.index(id_in)
    id_dest = nmt_model.ids_inputs[i]
    inputMapping[id_dest] = pos_source
nmt_model.setInputsMapping(inputMapping)

# And do the same for the outputs
outputMapping = dict()
for i, id_out in enumerate(params['OUTPUTS_IDS_DATASET']):
    pos_target = dataset.ids_outputs.index(id_out)
    id_dest = nmt_model.ids_outputs[i]
    outputMapping[id_dest] = pos_target
nmt_model.setOutputsMapping(outputMapping)
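In this tutorial the Dataset and the model use the same identifiers in the same order, so both mappings should end up trivial:

inputMapping   # expected: {'source_text': 0, 'state_below': 1}
outputMapping  # expected: {'target_text': 0}

This indirection pays off when a trained model is reused with a Dataset whose inputs or outputs are named or ordered differently.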

We can add some callbacks for controlling the training (e.g. sampling every N updates, early stopping, learning rate annealing...). For instance, let's build a callback that evaluates the model as it trains: every 2 epochs, it will compute the 'coco' scores (BLEU, METEOR, etc.) on the development set. Combined with an early-stopping criterion (see the sketch after the code below), training can then be halted if the 'Bleu_4' metric doesn't improve for more than 5 evaluations. We need to pass some variables to the callback (in the extra_vars dictionary):


In [10]:
extra_vars = {'language': 'en',
              'n_parallel_loaders': 8,
              'tokenize_f': getattr(dataset, 'tokenize_none'),  # tokenizer applied before scoring
              'beam_size': 12,
              'maxlen': 50,
              'model_inputs': ['source_text', 'state_below'],
              'model_outputs': ['target_text'],
              'dataset_inputs': ['source_text', 'state_below'],
              'dataset_outputs': ['target_text'],
              'normalize': True,
              'alpha_factor': 0.6,
              'val': {'references': dataset.extra_variables['val']['target_text']}
              }

vocab = dataset.vocabulary['target_text']['idx2words']
callbacks = []
callbacks.append(PrintPerformanceMetricOnEpochEndOrEachNUpdates(nmt_model,
                                                                dataset,
                                                                gt_id='target_text',
                                                                metric_name=['coco'],
                                                                set_name=['val'],
                                                                batch_size=50,
                                                                each_n_epochs=2,
                                                                extra_vars=extra_vars,
                                                                reload_epoch=0,
                                                                is_text=True,
                                                                index2word_y=vocab,
                                                                sampling_type='max_likelihood',
                                                                beam_search=True,
                                                                save_path=nmt_model.model_path,
                                                                start_eval_on_epoch=0,
                                                                write_samples=True,
                                                                write_type='list',
                                                                verbose=True))
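The 'normalize' and 'alpha_factor' entries enable length-normalized beam search: roughly speaking, hypothesis scores are normalized by a power (0.6) of their length, which counteracts the tendency of beam search to prefer short translations.

If we also want training to stop automatically, as described above, an early-stopping callback can be appended as well. A minimal sketch, assuming the EarlyStopping callback shipped in keras_wrapper.extra.callbacks and its usual argument names (check the signature in your installed version):

from keras_wrapper.extra.callbacks import EarlyStopping

# Hypothetical setup: stop when 'Bleu_4' on the validation split has not
# improved for 5 consecutive evaluations (argument names are assumptions)
callbacks.append(EarlyStopping(nmt_model,
                               patience=5,
                               metric_check='Bleu_4',
                               check_split='val',
                               want_to_minimize=False,
                               verbose=1))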

Now we are almost ready to train. We set up some training parameters...


In [11]:
training_params = {'n_epochs': 100,              # maximum number of training epochs
                   'batch_size': 40,             # mini-batch size
                   'maxlen': 30,                 # maximum length of the training sequences
                   'epochs_for_save': 1,         # save the model after every epoch
                   'verbose': 0,
                   'eval_on_sets': [],           # built-in evaluation disabled: our callback handles it
                   'n_parallel_loaders': 8,      # number of parallel data loaders
                   'extra_callbacks': callbacks, # the evaluation callback built above
                   'reload_epoch': 0,            # start from scratch (no checkpoint reloaded)
                   'epoch_offset': 0}
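If a run is interrupted, it can be resumed from a saved checkpoint by pointing these two parameters at the stored epoch; a sketch, assuming the model from epoch 4 was saved:

training_params['reload_epoch'] = 4   # load the weights saved after epoch 4
training_params['epoch_offset'] = 4   # continue counting epochs from there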

And train!


In [ ]:
nmt_model.trainNet(dataset, training_params)
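Training will now run for up to 100 epochs, saving the model after every epoch. A saved model can later be brought back with the loadModel function imported at the beginning; a sketch, assuming the checkpoint from epoch 4 exists:

# Reload the model stored after epoch 4
nmt_model = loadModel('trained_models/tutorial_model', 4)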
