Learning Objectives
In this lab we'll build a translation model from Spanish to English using an RNN encoder-decoder architecture.
We will start by creating train and eval datasets (using the tf.data.Dataset
API) that are typical for seq2seq problems. Then we will use the Keras functional API to train an RNN encoder-decoder model, which we will save as two separate models, the encoder and the decoder. Using these two separate pieces we will implement the translation function.
Finally, we'll benchmark our results using the industry-standard BLEU score.
In [1]:
import os
import pickle
import sys
import nltk
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import (
Dense,
Embedding,
GRU,
Input,
)
from tensorflow.keras.models import (
load_model,
Model,
)
import utils_preproc
print(tf.__version__)
In [2]:
SEED = 0
MODEL_PATH = 'translate_models/baseline'
DATA_URL = 'http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip'
LOAD_CHECKPOINT = False
In [3]:
tf.random.set_seed(SEED)
We'll use a language dataset provided by http://www.manythings.org/anki/. The dataset contains Spanish-English translation pairs in the format:
May I borrow this book? ¿Puedo tomar prestado este libro?
The dataset is a curated list of 120K translation pairs from http://tatoeba.org/, a platform for community-contributed translations by native speakers.
In [4]:
path_to_zip = tf.keras.utils.get_file(
'spa-eng.zip', origin=DATA_URL, extract=True)
path_to_file = os.path.join(
os.path.dirname(path_to_zip),
"spa-eng/spa.txt"
)
print("Translation data stored at:", path_to_file)
In [5]:
data = pd.read_csv(
path_to_file, sep='\t', header=None, names=['english', 'spanish'])
In [6]:
data.sample(3)
Out[6]:
From the utils_preproc package we have written for you, we will use the following functions to pre-process our dataset of sentence pairs.

The utils_preproc.preprocess_sentence() method normalizes each sentence (lowercasing it and separating the punctuation from the words) and wraps it with the <start> and <end> tokens. For example:
In [7]:
raw = [
"No estamos comiendo.",
"Está llegando el invierno.",
"El invierno se acerca.",
"Tom no comio nada.",
"Su pierna mala le impidió ganar la carrera.",
"Su respuesta es erronea.",
"¿Qué tal si damos un paseo después del almuerzo?"
]
In [8]:
processed = [utils_preproc.preprocess_sentence(s) for s in raw]
processed
Out[8]:
The utils_preproc.tokenize() method integerizes a list of preprocessed sentences. It returns an instance of a Keras Tokenizer containing the token-integer mapping, along with the integerized sentences:
In [9]:
integerized, tokenizer = utils_preproc.tokenize(processed)
integerized
Out[9]:
The outputted tokenizer can be used to get back the actual words from the integers representing them:
In [10]:
tokenizer.sequences_to_texts(integerized)
Out[10]:
Implement a function that will read the raw sentence-pair file and preprocess the sentences with utils_preproc.preprocess_sentence.

The load_and_preprocess function takes as input
- the path where the data is located
- the number of examples we want to read in

It returns a tuple whose first component contains the English preprocessed sentences, while the second component contains the Spanish ones:
In [11]:
def load_and_preprocess(path, num_examples):
    with open(path, 'r') as fp:
        lines = fp.read().strip().split('\n')
    sentence_pairs = # TODO 1a
    return zip(*sentence_pairs)
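A possible way to fill in TODO 1a (a sketch, assuming each line of the file contains an English sentence and its Spanish translation separated by a tab, as shown above):

# Preprocess both sentences of each tab-separated pair
sentence_pairs = [
    [utils_preproc.preprocess_sentence(sentence) for sentence in line.split('\t')]
    for line in lines[:num_examples]
]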
In [12]:
en, sp = load_and_preprocess(path_to_file, num_examples=10)
print(en[-1])
print(sp[-1])
Using utils_preproc.tokenize, implement the function load_and_integerize that takes as input the data path along with the number of examples we want to read in and returns the following tuple:

(input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer)

where
- input_tensor is an integer tensor of shape (num_examples, max_length_inp) containing the integerized versions of the source language sentences
- target_tensor is an integer tensor of shape (num_examples, max_length_targ) containing the integerized versions of the target language sentences
- inp_lang_tokenizer is the source language tokenizer
- targ_lang_tokenizer is the target language tokenizer
In [13]:
def load_and_integerize(path, num_examples=None):
    targ_lang, inp_lang = load_and_preprocess(path, num_examples)
    # TODO 1b
    input_tensor, inp_lang_tokenizer = # TODO
    target_tensor, targ_lang_tokenizer = # TODO
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
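A possible way to fill in TODO 1b, using the utils_preproc.tokenize helper described above (a sketch; inside this function, inp_lang and targ_lang hold the preprocessed sentence lists returned by load_and_preprocess):

# tokenize returns the padded integer tensor and the fitted tokenizer
input_tensor, inp_lang_tokenizer = utils_preproc.tokenize(inp_lang)
target_tensor, targ_lang_tokenizer = utils_preproc.tokenize(targ_lang)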
We'll split this data 80/20 into train and validation sets, and we'll use only the first 30K examples, since we'll be training on a single GPU.
Let us set variables for that:
In [14]:
TEST_PROP = 0.2
NUM_EXAMPLES = 30000
Now let's load and integerize the sentence pairs, and store the tokenizers for the source and the target language into the inp_lang and targ_lang variables, respectively:
In [15]:
input_tensor, target_tensor, inp_lang, targ_lang = load_and_integerize(
path_to_file, NUM_EXAMPLES)
Let us store the maximum sentence length of both languages into two variables:
In [16]:
max_length_targ = target_tensor.shape[1]
max_length_inp = input_tensor.shape[1]
We now use scikit-learn's train_test_split to create our splits:
In [17]:
splits = train_test_split(
input_tensor, target_tensor, test_size=TEST_PROP, random_state=SEED)
input_tensor_train = splits[0]
input_tensor_val = splits[1]
target_tensor_train = splits[2]
target_tensor_val = splits[3]
Let's make sure the number of examples in each split looks good:
In [18]:
(len(input_tensor_train), len(target_tensor_train),
len(input_tensor_val), len(target_tensor_val))
Out[18]:
The utils_preproc.int2word function allows you to transform the integerized sentences back into words. Note that the <start> token is always encoded as 1, while 0 is reserved for padding:
In [19]:
print("Input Language; int to word mapping")
print(input_tensor_train[0])
print(utils_preproc.int2word(inp_lang, input_tensor_train[0]), '\n')
print("Target Language; int to word mapping")
print(target_tensor_train[0])
print(utils_preproc.int2word(targ_lang, target_tensor_train[0]))
Implement the create_dataset function that takes as input
- encoder_input, an integer tensor of shape (num_examples, max_length_inp) containing the integerized versions of the source language sentences
- decoder_input, an integer tensor of shape (num_examples, max_length_targ) containing the integerized versions of the target language sentences

It returns a tf.data.Dataset containing examples of the form

((source_sentence, target_sentence), shifted_target_sentence)

where source_sentence and target_sentence are the integerized versions of the source-target language pairs and shifted_target_sentence is the same as target_sentence but with indices shifted by 1.

Remark: In the training code, source_sentence (resp. target_sentence) will be fed as the encoder (resp. decoder) input, while shifted_target_sentence will be used to compute the cross-entropy loss by comparing the decoder output with the shifted target sentences.
In [20]:
def create_dataset(encoder_input, decoder_input):
    # shift ahead by 1
    target = tf.roll(decoder_input, -1, 1)

    # replace last column with 0s
    zeros = tf.zeros([target.shape[0], 1], dtype=tf.int32)
    target = tf.concat((target[:, :-1], zeros), axis=-1)

    dataset = # TODO
    return dataset
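One possible way to fill in the TODO above, pairing the encoder and decoder inputs with the shifted target so each example has the ((source_sentence, target_sentence), shifted_target_sentence) structure described earlier (a sketch):

# Each element: ((source_sentence, target_sentence), shifted_target_sentence)
dataset = tf.data.Dataset.from_tensor_slices(
    ((encoder_input, decoder_input), target))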
Let's now create the actual train and eval dataset using the function above:
In [21]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
In [22]:
train_dataset = create_dataset(
input_tensor_train, target_tensor_train).shuffle(
BUFFER_SIZE).repeat().batch(BATCH_SIZE, drop_remainder=True)
eval_dataset = create_dataset(
input_tensor_val, target_tensor_val).batch(
BATCH_SIZE, drop_remainder=True)
We use an encoder-decoder architecture; however, we embed our words into a latent space prior to feeding them into the RNN.
In [23]:
EMBEDDING_DIM = 256
HIDDEN_UNITS = 1024
INPUT_VOCAB_SIZE = len(inp_lang.word_index) + 1
TARGET_VOCAB_SIZE = len(targ_lang.word_index) + 1
Implement the encoder network with the Keras functional API. It will have
- an Input layer that consumes the source language integerized sentences
- an Embedding layer of EMBEDDING_DIM dimensions
- a GRU recurrent layer with HIDDEN_UNITS units

The output of the encoder will be the encoder_outputs and the encoder_state.
In [24]:
encoder_inputs = Input(shape=(None,), name="encoder_input")
encoder_inputs_embedded = # TODO
encoder_rnn = # TODO
encoder_outputs, encoder_state = encoder_rnn(encoder_inputs_embedded)
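A possible way to fill in the encoder TODOs (a sketch; return_sequences and return_state are set so that the GRU returns both its output sequence and its final state):

# Embed the integerized source sentences into the latent space
encoder_inputs_embedded = Embedding(
    input_dim=INPUT_VOCAB_SIZE,
    output_dim=EMBEDDING_DIM)(encoder_inputs)

# GRU returning both the full output sequence and the final state
encoder_rnn = GRU(
    HIDDEN_UNITS, return_sequences=True, return_state=True)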
Implement the decoder network, which is very similar to the encoder network. It will have
- an Input layer that consumes the target language integerized sentences
- an Embedding layer of EMBEDDING_DIM dimensions
- a GRU recurrent layer with HIDDEN_UNITS units

Important: The main difference from the encoder is that the recurrent GRU layer will take as input not only the decoder input embeddings, but also the encoder_state as outputted by the encoder above. This is where the two networks are linked!

The output of the decoder will be the decoder_outputs and the decoder_state.
In [25]:
decoder_inputs = Input(shape=(None,), name="decoder_input")
decoder_inputs_embedded = # TODO
decoder_rnn = # TODO
decoder_outputs, decoder_state = decoder_rnn(
decoder_inputs_embedded, initial_state=encoder_state)
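A possible way to fill in the decoder TODOs, mirroring the encoder but using the target vocabulary (a sketch):

# Embed the integerized target sentences into the latent space
decoder_inputs_embedded = Embedding(
    input_dim=TARGET_VOCAB_SIZE,
    output_dim=EMBEDDING_DIM)(decoder_inputs)

# GRU returning both the full output sequence and the final state
decoder_rnn = GRU(
    HIDDEN_UNITS, return_sequences=True, return_state=True)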
The last part of the encoder-decoder architecture is a softmax Dense layer that creates the next-word probability vector, or next word predictions, from the decoder_outputs:
In [26]:
decoder_dense = Dense(TARGET_VOCAB_SIZE, activation='softmax')
predictions = decoder_dense(decoder_outputs)
To be able to train the encoder-decoder network defined above, create a trainable Keras Model by specifying which are the inputs and the outputs of our problem. They should correspond exactly to the inputs and outputs of our train and eval tf.data.Dataset, since that's what will be fed to the inputs and outputs we declare while instantiating the Keras Model.

While compiling our model, we should make sure that the loss is sparse_categorical_crossentropy, so that we can compare the true word indices for the target language, as outputted by our train tf.data.Dataset, with the next word predictions vector, as outputted by the decoder:
In [27]:
model = # TODO
model.compile(# TODO)
model.summary()
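A possible way to fill in these TODOs (a sketch; the two inputs match the (source_sentence, target_sentence) pairs produced by our tf.data.Dataset, and 'adam' is one reasonable optimizer choice):

# The inputs mirror the ((source, target), shifted_target) dataset structure
model = Model(
    inputs=[encoder_inputs, decoder_inputs],
    outputs=predictions)

model.compile(
    optimizer='adam', loss='sparse_categorical_crossentropy')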
Let's now train the model!
In [28]:
STEPS_PER_EPOCH = len(input_tensor_train)//BATCH_SIZE
EPOCHS = 1
history = model.fit(
train_dataset,
steps_per_epoch=STEPS_PER_EPOCH,
validation_data=eval_dataset,
epochs=EPOCHS
)
We can't just use model.predict(), because we don't know all the inputs we used during training. We only know the encoder_input (source language) but not the decoder_input (target language), which is what we want to predict (i.e., the translation of the source language)!
We do however know the first token of the decoder input, which is the <start> token. So using this token plus the state of the encoder RNN, we can predict the next token. We will then use that token as the second token of the decoder input, and continue like this until we predict the <end> token, or we reach some defined max length.
So, the strategy now is to split our trained network into two independent Keras models:
encoder_inputs -> encoder_state
[decoder_inputs, decoder_state_input] -> [predictions, decoder_state]
This way, we will be able to encode the source language sentence into the vector encoder_state
using the encoder and feed it to the decoder model along with the <start>
token at step 1.
Given that input, the decoder will produce the first word of the translation, by sampling from the predictions
vector (for simplicity, our sampling strategy here will be to take the next word to be the one whose index has the maximum probability in the predictions
vector) along with a new state vector, the decoder_state
.
At this point, we can feed the predicted first word as well as the new decoder_state back to the decoder to predict the second word of the translation.
This process can be continued until the decoder produces the <end> token.
This is how we will implement our translation (or decoding) function, but let us first extract a separate encoder and a separate decoder from our trained encoder-decoder model.
Remark: If we have already trained and saved the models (i.e., LOAD_CHECKPOINT is True), we will just load the models; otherwise, we extract them from the trained network above by explicitly creating the encoder and decoder Keras Models with the signatures we want.
In [29]:
if LOAD_CHECKPOINT:
    encoder_model = load_model(os.path.join(MODEL_PATH, 'encoder_model.h5'))
    decoder_model = load_model(os.path.join(MODEL_PATH, 'decoder_model.h5'))
else:
    encoder_model = # TODO

    decoder_state_input = Input(shape=(HIDDEN_UNITS,), name="decoder_state_input")

    # Reuses weights from the decoder_rnn layer
    decoder_outputs, decoder_state = decoder_rnn(
        decoder_inputs_embedded, initial_state=decoder_state_input)

    # Reuses weights from the decoder_dense layer
    predictions = decoder_dense(decoder_outputs)

    decoder_model = # TODO
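A possible way to fill in the two remaining TODOs, matching the encoder_inputs -> encoder_state and [decoder_inputs, decoder_state_input] -> [predictions, decoder_state] signatures described above (a sketch):

# Encoder model: source sentence in, final encoder state out
encoder_model = Model(inputs=encoder_inputs, outputs=encoder_state)

# Decoder model: (token, state) in, (next-word predictions, new state) out
decoder_model = Model(
    inputs=[decoder_inputs, decoder_state_input],
    outputs=[predictions, decoder_state])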
Now that we have a separate encoder and a separate decoder, implement a translation function, to which we will give the generic name of decode_sequences (to stress that this procedure is general to all seq2seq problems).

decode_sequences will take as input
- input_seqs, the integerized source language sentence tensor that the encoder can consume
- output_tokenizer, the target language tokenizer we will need to extract back words from the predicted word integers
- max_decode_length, the length after which we stop decoding if the <end> token has not been predicted

Note: Now that the encoder and decoder have been turned into Keras models, to feed them their input, we need to use the .predict method.
In [30]:
def decode_sequences(input_seqs, output_tokenizer, max_decode_length=50):
    """
    Arguments:
    input_seqs: int tensor of shape (BATCH_SIZE, SEQ_LEN)
    output_tokenizer: Tokenizer used to convert from int to words

    Returns translated sentences
    """
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seqs)

    # Populate the first token of the target sequence with the <start> token.
    batch_size = input_seqs.shape[0]
    target_seq = tf.ones([batch_size, 1])

    decoded_sentences = [[] for _ in range(batch_size)]

    for i in range(max_decode_length):
        output_tokens, decoder_state = decoder_model.predict(
            [target_seq, states_value])

        # Sample a token
        sampled_token_index = # TODO
        tokens = # TODO

        for j in range(batch_size):
            decoded_sentences[j].append(tokens[j])

        # Update the target sequence (of length 1).
        target_seq = tf.expand_dims(tf.constant(sampled_token_index), axis=-1)

        # Update states
        states_value = decoder_state

    return decoded_sentences
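A possible way to fill in the sampling TODOs (a sketch; it greedily picks the most probable next word, and assumes utils_preproc.int2word accepts a sequence of predicted word integers, as in the earlier cells):

# Greedy sampling: take the word with maximum probability for each batch row
sampled_token_index = np.argmax(output_tokens[:, -1, :], axis=-1)
# Map the sampled integers back to actual words
tokens = utils_preproc.int2word(output_tokenizer, sampled_token_index)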
Now we're ready to predict!
In [31]:
sentences = [
"No estamos comiendo.",
"Está llegando el invierno.",
"El invierno se acerca.",
"Tom no comio nada.",
"Su pierna mala le impidió ganar la carrera.",
"Su respuesta es erronea.",
"¿Qué tal si damos un paseo después del almuerzo?"
]
reference_translations = [
"We're not eating.",
"Winter is coming.",
"Winter is coming.",
"Tom ate nothing.",
"His bad leg prevented him from winning the race.",
"Your answer is wrong.",
"How about going for a walk after lunch?"
]
machine_translations = decode_sequences(
utils_preproc.preprocess(sentences, inp_lang),
targ_lang,
max_length_targ
)
for i in range(len(sentences)):
    print('-')
    print('INPUT:')
    print(sentences[i])
    print('REFERENCE TRANSLATION:')
    print(reference_translations[i])
    print('MACHINE TRANSLATION:')
    print(machine_translations[i])
In [32]:
if not LOAD_CHECKPOINT:
    os.makedirs(MODEL_PATH, exist_ok=True)
    # TODO
    with open(os.path.join(MODEL_PATH, 'encoder_tokenizer.pkl'), 'wb') as fp:
        pickle.dump(inp_lang, fp)
    with open(os.path.join(MODEL_PATH, 'decoder_tokenizer.pkl'), 'wb') as fp:
        pickle.dump(targ_lang, fp)
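The TODO above is where the encoder and decoder models get saved; a possible sketch, using file names matching the ones expected by the LOAD_CHECKPOINT branch earlier:

# Save the two models so they can be reloaded when LOAD_CHECKPOINT is True
encoder_model.save(os.path.join(MODEL_PATH, 'encoder_model.h5'))
decoder_model.save(os.path.join(MODEL_PATH, 'decoder_model.h5'))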
Unlike, say, image classification, there is no single right answer for a machine translation. However, our current loss metric, cross entropy, only gives credit when the machine translation matches the exact same words in the same order as the reference translation.

Many attempts have been made to develop a better metric for natural language evaluation. The most popular currently is the Bilingual Evaluation Understudy (BLEU) score.

The score ranges from 0 to 1, where 1 is an exact match.

It works by counting matching n-grams between the machine and reference texts, regardless of order. BLEU-4 counts matching n-grams of length 1 to 4 (1-gram, 2-gram, 3-gram, and 4-gram). It is common to report both BLEU-1 and BLEU-4.

BLEU is still imperfect, since it gives no credit to synonyms, so human evaluation is still best when feasible. However, BLEU is commonly considered the best among bad options for an automated metric.

The NLTK framework has an implementation that we will use.

We can't calculate BLEU during training, because at that time the correct decoder input is used (teacher forcing). Instead, we'll calculate it now.
For more info: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
In [33]:
def bleu_1(reference, candidate):
    reference = list(filter(lambda x: x != '', reference))  # remove padding
    candidate = list(filter(lambda x: x != '', candidate))  # remove padding
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1
    # sentence_bleu expects a list of reference sentences
    return nltk.translate.bleu_score.sentence_bleu(
        [reference], candidate, (1,), smoothing_function)
In [34]:
def bleu_4(reference, candidate):
    reference = list(filter(lambda x: x != '', reference))  # remove padding
    candidate = list(filter(lambda x: x != '', candidate))  # remove padding
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1
    # sentence_bleu expects a list of reference sentences
    return nltk.translate.bleu_score.sentence_bleu(
        [reference], candidate, (.25, .25, .25, .25), smoothing_function)
In [35]:
%%time
num_examples = len(input_tensor_val)
bleu_1_total = 0
bleu_4_total = 0

for idx in range(num_examples):
    reference_sentence = utils_preproc.int2word(
        targ_lang, target_tensor_val[idx][1:])

    decoded_sentence = decode_sequences(
        input_tensor_val[idx:idx+1], targ_lang, max_length_targ)[0]

    bleu_1_total += # TODO
    bleu_4_total += # TODO
print('BLEU 1: {}'.format(bleu_1_total/num_examples))
print('BLEU 4: {}'.format(bleu_4_total/num_examples))
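A possible way to fill in the two TODO lines inside the loop above, using the bleu_1 and bleu_4 helpers we just defined (a sketch):

# Accumulate per-sentence BLEU scores over the validation set
bleu_1_total += bleu_1(reference_sentence, decoded_sentence)
bleu_4_total += bleu_4(reference_sentence, decoded_sentence)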