Now we'll load a trained Neural Machine Translation (NMT) model from disk and apply it to translate new text. In this case, we want to translate the 'test' split of our dataset.
This tutorial assumes that you followed both previous tutorials.
As before, let's import some stuff and load the dataset instance.
In [2]:
from config import load_parameters
from data_engine.prepare_data import keep_n_captions
from keras_wrapper.cnn_model import loadModel
from keras_wrapper.dataset import loadDataset
params = load_parameters()
dataset = loadDataset('datasets/Dataset_tutorial_dataset.pkl')
Since we want to translate a new data split ('test'), we must add it to the dataset instance, just as we did before (in the first tutorial). If we also had the references for the test split and wanted to evaluate on it, we could add them to the dataset as well. Note that this is not mandatory: we could simply predict without evaluating.
In [3]:
dataset.setInput('examples/EuTrans/DATA/test.es',
                 'test',
                 type='text',
                 id='source_text',
                 pad_on_batch=True,
                 tokenization='tokenize_none',
                 fill='end',
                 max_text_len=30,
                 min_occ=0)
dataset.setInput(None,
                 'test',
                 type='ghost',
                 id='state_below',
                 required=False)
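As an optional sanity check, we can verify how many test samples were registered. This is a minimal sketch that assumes the Dataset object exposes a per-split counter named len_test, as in recent versions of keras_wrapper:
# Optional sanity check (assumes the Dataset instance keeps a `len_test` counter)
print('Loaded test samples:', dataset.len_test)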
Now, let's load the translation model. Suppose we want to load the model saved at the end of epoch 4:
In [4]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len[params['INPUTS_IDS_DATASET'][0]]
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len[params['OUTPUTS_IDS_DATASET'][0]]
# Load model
nmt_model = loadModel('trained_models/tutorial_model', 4)
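If you want to inspect the restored architecture, a standard Keras summary is a quick way to check that everything was loaded. This sketch assumes the wrapper object keeps the underlying Keras model in its model attribute, as in the versions used for these tutorials:
# Optional: inspect the restored network (assumes the Keras model lives in `nmt_model.model`)
nmt_model.model.summary()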
Once the model is loaded, we just have to invoke the sampling method (in this case, the beam search algorithm) on the 'test' split:
In [5]:
params_prediction = {'max_batch_size': 50,
                     'n_parallel_loaders': 8,
                     'predict_on_sets': ['test'],
                     'beam_size': 12,
                     'maxlen': 50,
                     'model_inputs': ['source_text', 'state_below'],
                     'model_outputs': ['target_text'],
                     'dataset_inputs': ['source_text', 'state_below'],
                     'dataset_outputs': ['target_text'],
                     'normalize': True,
                     'alpha_factor': 0.6
                     }
predictions = nmt_model.predictBeamSearchNet(dataset, params_prediction)['test']
At this point, the variable 'predictions' contains the word indices of the hypotheses, which we must decode into words. To do this, we'll use the vocabulary stored in the dataset object:
In [6]:
from keras_wrapper.utils import decode_predictions_beam_search
vocab = dataset.vocabulary['target_text']['idx2words']
predictions = decode_predictions_beam_search(predictions,
                                             vocab,
                                             verbose=params['VERBOSE'])
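Before writing anything to disk, it can be useful to eyeball a few hypotheses. The decoded predictions are a plain Python list of strings, so a simple loop is enough:
# Print the first few decoded hypotheses as a quick sanity check
for i, hypothesis in enumerate(predictions[:5]):
    print('Hypothesis %d: %s' % (i, hypothesis))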
Finally, we store the system hypotheses:
In [9]:
from keras_wrapper.extra.read_write import list2file
filepath = nmt_model.model_path + '/' + 'test' + '_sampling.pred'  # results file
list2file(filepath, predictions)
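The resulting file is plain text with one translation per line, so, if we want to confirm it was written correctly, we can peek at it with standard Python:
# Peek at the stored hypotheses (plain text, one translation per line)
with open(filepath) as f:
    for line in list(f)[:3]:
        print(line.strip())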
If we have the references for this split, we can also evaluate the performance of our system on it. First, we must add them to the dataset object:
In [10]:
# In case we had the references of this split, we could also load the split and evaluate on it
dataset.setOutput('examples/EuTrans/DATA/test.en',
                  'test',
                  type='text',
                  id='target_text',
                  pad_on_batch=True,
                  tokenization='tokenize_none',
                  sample_weights=True,
                  max_text_len=30,
                  max_words=0)
keep_n_captions(dataset, repeat=1, n=1, set_names=['test'])
Next, we call the evaluation system: the COCO package. Although its main usage is multimodal captioning, we can also use it for machine translation:
In [19]:
from keras_wrapper.extra.evaluation import select
metric = 'coco'
# Prepare the extra variables required by the evaluation function
extra_vars = dict()
extra_vars['tokenize_f'] = eval('dataset.' + 'tokenize_none')
extra_vars['language'] = params['TRG_LAN']
extra_vars['test'] = dict()
extra_vars['test']['references'] = dataset.extra_variables['test']['target_text']
metrics = select[metric](pred_list=predictions,
                         verbose=1,
                         extra_vars=extra_vars,
                         split='test')
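The evaluation call returns the computed scores, which we can then report. The sketch below assumes the COCO wrapper returns a plain dict mapping metric names (e.g. Bleu_4, METEOR) to float scores:
# Report the scores (assumes `metrics` is a dict of {metric_name: score})
for metric_name, score in sorted(metrics.items()):
    print('%s: %.4f' % (metric_name, score))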