In the following we will build a translation model from French phrases describing numbers to the corresponding numeric representation (base 10).
This is a toy machine translation task with a restricted vocabulary and a single valid translation for each source phrase, which makes it more tractable to train on a laptop computer and easier to evaluate. Despite those limitations, we expect this task to highlight interesting properties of Seq2Seq models.
The parallel text data is generated from a "ground-truth" Python function named to_french_phrase
that captures common rules. Hyphenation was intentionally omitted to make the phrases more ambiguous and therefore make the translation problem slightly harder to solve (and also because Olivier had no particular interest in properly implementing hyphenation rules :).
In [1]:
from french_numbers import to_french_phrase
for x in [21, 80, 81, 300, 213, 1100, 1201, 301000, 80080]:
print(str(x).rjust(6), to_french_phrase(x))
The following generates 20,000 example phrases for numbers between 1 and 1,000,000 (excluded). We chose to over-represent small numbers by generating all the possible phrases for numbers between 1 and the exhaustive parameter (set to 5,000 below).
We then split the generated set into non-overlapping train, validation and test splits.
In [2]:
from french_numbers import generate_translations
from sklearn.model_selection import train_test_split
numbers, french_numbers = generate_translations(
low=1, high=int(1e6) - 1, exhaustive=5000, random_seed=0)
num_train, num_dev, fr_train, fr_dev = train_test_split(
numbers, french_numbers, test_size=0.5, random_state=0)
num_val, num_test, fr_val, fr_test = train_test_split(
num_dev, fr_dev, test_size=0.5, random_state=0)
In [3]:
len(fr_train), len(fr_val), len(fr_test)
Out[3]:
In [4]:
for i, fr_phrase, num_phrase in zip(range(5), fr_train, num_train):
print(num_phrase.rjust(6), fr_phrase)
In [5]:
for i, fr_phrase, num_phrase in zip(range(5), fr_val, num_val):
print(num_phrase.rjust(6), fr_phrase)
Build the vocabularies from the training set only, so as to get a chance to have some out-of-vocabulary words in the validation and test sets.
First we need to introduce specific symbols that will be used to pad the sequences to a fixed length, mark the start of the decoded sequence, mark its end, and stand in for out-of-vocabulary words.
Here we use the same convention as the TensorFlow seq2seq tutorial:
In [6]:
PAD, GO, EOS, UNK = START_VOCAB = ['_PAD', '_GO', '_EOS', '_UNK']
To build the vocabulary we need to tokenize the sequences of symbols. For the numeric representation we use character-level tokenization, while whitespace-based word-level tokenization will do for the French phrases:
In [7]:
def tokenize(sentence, word_level=True):
if word_level:
return sentence.split()
else:
return [sentence[i:i + 1] for i in range(len(sentence))]
In [8]:
tokenize('1234', word_level=False)
Out[8]:
In [9]:
tokenize('mille deux cent trente quatre', word_level=True)
Out[9]:
Let's now use this tokenization strategy to assign a unique integer token id to each possible token string found in the training set of each language ('French' and 'numeric'):
In [10]:
def build_vocabulary(tokenized_sequences):
rev_vocabulary = START_VOCAB[:]
unique_tokens = set()
for tokens in tokenized_sequences:
unique_tokens.update(tokens)
rev_vocabulary += sorted(unique_tokens)
vocabulary = {}
for i, token in enumerate(rev_vocabulary):
vocabulary[token] = i
return vocabulary, rev_vocabulary
In [11]:
tokenized_fr_train = [tokenize(s, word_level=True) for s in fr_train]
tokenized_num_train = [tokenize(s, word_level=False) for s in num_train]
fr_vocab, rev_fr_vocab = build_vocabulary(tokenized_fr_train)
num_vocab, rev_num_vocab = build_vocabulary(tokenized_num_train)
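Since the vocabularies are built from the training set only, we can quickly check (a small sketch, not in the original lab) how many validation tokens are missing from fr_vocab and would therefore be mapped to _UNK; on this tiny vocabulary the count may well be zero:

# Sanity check (sketch): validation tokens absent from the training vocabulary
# would be mapped to the _UNK symbol at vectorization time.
oov_tokens = [t for sentence in fr_val
              for t in tokenize(sentence, word_level=True)
              if t not in fr_vocab]
print(len(oov_tokens), "out-of-vocabulary token occurrences in fr_val")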
The two languages do not have the same vocabulary sizes:
In [12]:
len(fr_vocab)
Out[12]:
In [13]:
len(num_vocab)
Out[13]:
In [14]:
for k, v in sorted(fr_vocab.items())[:10]:
print(k.rjust(10), v)
print('...')
In [15]:
for k, v in sorted(num_vocab.items()):
print(k.rjust(10), v)
We also built the reverse mappings from token ids to token string representations:
In [16]:
print(rev_fr_vocab)
In [17]:
print(rev_num_vocab)
For a given source sequence - target sequence pair, we will:
- reverse the order of the source sequence tokens,
- append the _GO token as a delimiter followed by the target sequence to build the input sequence,
- append the _EOS token to the end of the target sequence to build the output sequence.

Let's do this as a function using the original string representations for the tokens so as to make it easier to debug:
Exercise:
Implement make_input_output to build the input and output token sequences for a (source, target) pair, with an option to reverse the source sequence.

Note: don't forget to insert the _GO
and _EOS
special symbols at the right locations.
In [18]:
def make_input_output(source_tokens, target_tokens, reverse_source=True):
# TODO: build input_tokens and output_tokens here
return input_tokens, output_tokens
In [19]:
# %load solutions/make_input_output.py
def make_input_output(source_tokens, target_tokens, reverse_source=True):
if reverse_source:
source_tokens = source_tokens[::-1]
input_tokens = source_tokens + [GO] + target_tokens
output_tokens = target_tokens + [EOS]
return input_tokens, output_tokens
In [20]:
input_tokens, output_tokens = make_input_output(
['cent', 'vingt', 'et', 'un'],
['1', '2', '1'],
)
In [21]:
input_tokens
Out[21]:
In [22]:
output_tokens
Out[22]:
In [23]:
all_tokenized_sequences = tokenized_fr_train + tokenized_num_train
shared_vocab, rev_shared_vocab = build_vocabulary(all_tokenized_sequences)
In [24]:
import numpy as np
max_length = 20 # found by introspection of our training set
def vectorize_corpus(source_sequences, target_sequences, shared_vocab,
word_level_source=True, word_level_target=True,
max_length=max_length):
assert len(source_sequences) == len(target_sequences)
n_sequences = len(source_sequences)
source_ids = np.empty(shape=(n_sequences, max_length), dtype=np.int32)
source_ids.fill(shared_vocab[PAD])
target_ids = np.empty(shape=(n_sequences, max_length), dtype=np.int32)
target_ids.fill(shared_vocab[PAD])
numbered_pairs = zip(range(n_sequences), source_sequences, target_sequences)
for i, source_seq, target_seq in numbered_pairs:
source_tokens = tokenize(source_seq, word_level=word_level_source)
target_tokens = tokenize(target_seq, word_level=word_level_target)
in_tokens, out_tokens = make_input_output(source_tokens, target_tokens)
in_token_ids = [shared_vocab.get(t, UNK) for t in in_tokens]
source_ids[i, -len(in_token_ids):] = in_token_ids
out_token_ids = [shared_vocab.get(t, UNK) for t in out_tokens]
target_ids[i, -len(out_token_ids):] = out_token_ids
return source_ids, target_ids
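As a quick verification of the max_length = 20 comment above (a sketch, not in the original lab), we can measure the longest combined input sequence in the training set:

# Longest reversed source + _GO + target sequence over the training set;
# it should not exceed max_length.
lengths = [len(tokenize(fr, word_level=True)) + 1 + len(tokenize(num, word_level=False))
           for fr, num in zip(fr_train, num_train)]
print(max(lengths))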
In [25]:
X_train, Y_train = vectorize_corpus(fr_train, num_train, shared_vocab,
word_level_target=False)
In [26]:
X_train.shape
Out[26]:
In [27]:
Y_train.shape
Out[27]:
In [28]:
fr_train[0]
Out[28]:
In [29]:
num_train[0]
Out[29]:
In [30]:
X_train[0]
Out[30]:
In [31]:
Y_train[0]
Out[31]:
This looks good. In particular we can note that both arrays are left-padded with the PAD symbol (id 0), that the input sequence contains the reversed source phrase followed by the _GO delimiter and the target digits, and that the output sequence ends with the _EOS symbol.
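As an extra sanity check (a small sketch, not in the original lab), we can map these integer ids back to token strings with the shared reverse vocabulary:

# Decode the first vectorized pair back to token strings (PAD included).
print([rev_shared_vocab[token_id] for token_id in X_train[0]])
print([rev_shared_vocab[token_id] for token_id in Y_train[0]])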
Let's vectorize the validation and test set to be able to evaluate our models:
In [32]:
X_val, Y_val = vectorize_corpus(fr_val, num_val, shared_vocab,
word_level_target=False)
X_test, Y_test = vectorize_corpus(fr_test, num_test, shared_vocab,
word_level_target=False)
In [33]:
X_val.shape, Y_val.shape
Out[33]:
In [34]:
X_test.shape, Y_test.shape
Out[34]:
To keep the architecture simple we will use the same RNN model and weights for both the encoder part (before the _GO
token) and the decoder part (after the _GO
token).
We may use a GRU recurrent cell instead of an LSTM because it is slightly faster to compute and should give comparable results.
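As a rough illustration of the cost difference (a sketch with the sizes used below; exact Keras parameter counts may differ slightly depending on the version):

# Approximate recurrent-layer parameter counts: a GRU has 3 gates, an LSTM 4,
# for the same hidden size and input dimension.
hidden_size, embedding_size = 256, 32
gru_params = 3 * (hidden_size * (hidden_size + embedding_size) + hidden_size)
lstm_params = 4 * (hidden_size * (hidden_size + embedding_size) + hidden_size)
print(gru_params, lstm_params)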
Exercise:
Build a simple_seq2seq model that takes the concatenated (reversed source + _GO + target) sequence as input: an Embedding layer, a recurrent layer with return_sequences=True, and a Dense softmax output layer over the shared vocabulary.

Note: the output of the network should have shape [batch, sequence_length, vocab_size]
.
In [35]:
# %load solutions/simple_seq2seq.py
from keras.models import Sequential
from keras.layers import Embedding, Dropout, GRU, Dense
vocab_size = len(shared_vocab)
simple_seq2seq = Sequential()
simple_seq2seq.add(Embedding(vocab_size, 32, input_length=max_length))
simple_seq2seq.add(Dropout(0.2))
simple_seq2seq.add(GRU(256, return_sequences=True))
simple_seq2seq.add(Dense(vocab_size, activation='softmax'))
# Here we use the sparse_categorical_crossentropy loss to be able to pass
# integer-coded output for the token ids without having to convert to one-hot
# codes
simple_seq2seq.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
Let's use a callback mechanism to automatically snapshot the best model found so far on the validation set:
In [36]:
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
best_model_fname = "simple_seq2seq_checkpoint.h5"
best_model_cb = ModelCheckpoint(best_model_fname, monitor='val_loss',
save_best_only=True, verbose=1)
We need to use the np.expand_dims trick on Y: this is required by Keras because we use a sparse (integer-based) representation for the output.
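A quick shape check (sketch, not in the original lab) makes the trick explicit; the fit call in the next cell then passes the expanded arrays:

# The sparse integer targets need a trailing singleton dimension:
# (n_sequences, max_length) -> (n_sequences, max_length, 1).
print(Y_train.shape, np.expand_dims(Y_train, -1).shape)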
In [37]:
%matplotlib inline
import matplotlib.pyplot as plt
history = simple_seq2seq.fit(X_train, np.expand_dims(Y_train, -1),
validation_data=(X_val, np.expand_dims(Y_val, -1)),
epochs=15, verbose=2, batch_size=32,
callbacks=[best_model_cb])
plt.figure(figsize=(12, 6))
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], '--', label='validation')
plt.ylabel('negative log likelihood')
plt.xlabel('epoch')
plt.title('Convergence plot for Simple Seq2Seq')
Out[37]:
Let's load the best model found on the validation set at the end of training:
In [38]:
simple_seq2seq = load_model(best_model_fname)
If you don't have access to a GPU and cannot wait 10 minutes for the model to converge to a reasonably good state, feel free to use the pretrained model. This model has been obtained by training the above model for ~150 epochs. Its validation loss is significantly lower than 1e-5. In practice it should hardly ever make any prediction error on this easy translation problem.
Alternatively we will load this imperfect model (trained for only 50 epochs) with a validation loss of ~7e-4. This model makes funny translation errors so I would suggest trying it first:
In [39]:
from keras.utils.data_utils import get_file
import os
get_file("simple_seq2seq_partially_pretrained.h5",
"https://github.com/m2dsupsdlclass/lectures-labs/releases/"
"download/0.4/simple_seq2seq_partially_pretrained.h5")
filename = os.path.expanduser(os.path.join('~',
'.keras/datasets/simple_seq2seq_partially_pretrained.h5'))
### Uncomment the following to replace for the fully trained network
#get_file("simple_seq2seq_pretrained.h5",
# "https://github.com/m2dsupsdlclass/lectures-labs/releases/"
# "download/0.4/simple_seq2seq_pretrained.h5")
#filename = os.path.expanduser(os.path.join('~',
# '.keras/datasets/simple_seq2seq_pretrained.h5'))
simple_seq2seq = load_model(filename)
Let's have a look at a raw prediction on the first sample of the test set:
In [40]:
fr_test[0]
Out[40]:
As a numeric array this is provided (along with the expected target sequence) as the following padded input sequence:
In [41]:
first_test_sequence = X_test[0:1]
first_test_sequence
Out[41]:
Remember that the _GO symbol (indexed at 1) separates the reversed source from the expected target sequence:
In [42]:
rev_shared_vocab[1]
Out[42]:
Exercise:
Feed this test sequence to the model, take the argmax over the vocabulary dimension at each time step, and use rev_shared_vocab to map the predicted token ids back to a readable string (dropping the padding and _EOS symbols).

Interpretation: compare the decoded prediction with the expected target num_test[0]. Is feeding the full input sequence (which already contains the solution) a realistic decoding procedure at test time?
In [43]:
# %load solutions/interpret_output.py
prediction = simple_seq2seq.predict(first_test_sequence)
print("prediction shape:", prediction.shape)
# Let's use `argmax` to extract the predicted token ids at each step:
predicted_token_ids = prediction[0].argmax(-1)
print("prediction token ids:", predicted_token_ids)
# We can use the shared reverse vocabulary to map
# this back to the string representation of the tokens,
# as well as removing Padding and EOS symbols
predicted_numbers = [rev_shared_vocab[token_id] for token_id in predicted_token_ids
if token_id not in (shared_vocab[PAD], shared_vocab[EOS])]
print("predicted number:", "".join(predicted_numbers))
print("test number:", num_test[0])
# The model successfully predicted the test sequence.
# However, we provided the full sequence as input, including all the solution
# (except for the last number). In a real testing condition, one wouldn't
# have the full input sequence, but only what is provided before the "GO"
# symbol
In the previous exercise we cheated a bit because we gave the complete sequence along with the solution in the input sequence. To correctly predict we need to predict one token at a time and reinject the predicted token in the input sequence to predict the next token:
In [44]:
def greedy_translate(model, source_sequence, shared_vocab, rev_shared_vocab,
word_level_source=True, word_level_target=True):
"""Greedy decoder recursively predicting one token at a time"""
# Initialize the list of input token ids with the source sequence
source_tokens = tokenize(source_sequence, word_level=word_level_source)
input_ids = [shared_vocab.get(t, UNK) for t in source_tokens[::-1]]
input_ids += [shared_vocab[GO]]
# Prepare a fixed size numpy array that matches the expected input
# shape for the model
input_array = np.empty(shape=(1, model.input_shape[1]),
dtype=np.int32)
decoded_tokens = []
while len(input_ids) <= max_length:
# Vectorize the list of input token ids, left-padding with the PAD symbol.
input_array.fill(shared_vocab[PAD])
input_array[0, -len(input_ids):] = input_ids
# Predict the next output: greedy decoding with argmax
next_token_id = model.predict(input_array)[0, -1].argmax()
# Stop decoding if the network predicts end of sentence:
if next_token_id == shared_vocab[EOS]:
break
# Otherwise use the reverse vocabulary to map the prediction
# back to the string space
decoded_tokens.append(rev_shared_vocab[next_token_id])
# Append prediction to input sequence to predict the next
input_ids.append(next_token_id)
separator = " " if word_level_target else ""
return separator.join(decoded_tokens)
In [45]:
phrases = [
"un",
"deux",
"trois",
"onze",
"quinze",
"cent trente deux",
"cent mille douze",
"sept mille huit cent cinquante neuf",
"vingt et un",
"vingt quatre",
"quatre vingts",
"quatre vingt onze mille",
"quatre vingt onze mille deux cent deux",
]
for phrase in phrases:
translation = greedy_translate(simple_seq2seq, phrase,
shared_vocab, rev_shared_vocab,
word_level_target=False)
print(phrase.ljust(40), translation)
Why is the partially trained network able to correctly give the output for
"sept mille huit cent cinquante neuf"
but not for
"cent mille douze"
?
The answer is the following: in the first phrase most words map rather directly to digits of the output, whereas words such as "cent" and "mille" mostly determine how many 0s have to be inserted; producing the right number of 0s for "cent mille douze" requires more reasoning and the ability to count.
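A rough way to see this (a small sketch, not in the original lab): compare the number of source words with the number of output digits for the two phrases above.

# "cent mille douze" expands into twice as many digits (mostly zeros) as it has words.
for n in [7859, 100012]:
    phrase = to_french_phrase(n)
    print(phrase.ljust(40), len(phrase.split()), "words ->", len(str(n)), "digits")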
In [46]:
phrases = [
"quatre vingt et un",
"quarante douze",
"onze cent soixante vingt quatorze",
]
for phrase in phrases:
translation = greedy_translate(simple_seq2seq, phrase,
shared_vocab, rev_shared_vocab,
word_level_target=False)
print(phrase.ljust(40), translation)
Because we expect only one correct translation for a given source sequence, we can use phrase-level accuracy as a metric to quantify our model quality.
Note that this is not the case for real translation models (e.g. from French to English on arbitrary sentences). Evaluation of a machine translation model is tricky in general. Automated evaluation can somehow be done at the corpus level with the BLEU score (bilingual evaluation understudy), given a large enough sample of correct translations provided by certified translators, but it is only a noisy proxy.
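For reference only (not needed in this lab), corpus-level BLEU can be computed with nltk, assuming the nltk package is installed; the sentences below are a toy example:

from nltk.translate.bleu_score import corpus_bleu
# One list of acceptable reference translations per hypothesis.
references = [[['cent', 'vingt', 'et', 'un']]]
hypotheses = [['cent', 'vingt', 'et', 'un']]
print(corpus_bleu(references, hypotheses))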
The only good evaluation is to give a large enough sample of the model predictions on some test sentences to certified translators and ask them to give an evaluation (e.g. a score between 0 and 6, 0 for nonsensical and 6 for the hypothetical perfect translation). However in practice this is very costly to do.
Fortunately we can just use phrase-level accuracy on our very domain-specific toy problem:
In [47]:
def phrase_accuracy(model, num_sequences, fr_sequences, n_samples=300,
decoder_func=greedy_translate):
correct = []
n_samples = len(num_sequences) if n_samples is None else n_samples
for i, num_seq, fr_seq in zip(range(n_samples), num_sequences, fr_sequences):
if i % 100 == 0:
print("Decoding %d/%d" % (i, n_samples))
predicted_seq = decoder_func(model, fr_seq,
shared_vocab, rev_shared_vocab,
word_level_target=False)
correct.append(num_seq == predicted_seq)
return np.mean(correct)
In [48]:
print("Phrase-level test accuracy: %0.3f"
% phrase_accuracy(simple_seq2seq, num_test, fr_test))
In [49]:
print("Phrase-level train accuracy: %0.3f"
% phrase_accuracy(simple_seq2seq, num_train, fr_train))
Instead of decoding with a greedy strategy that only considers the most likely next token at each prediction step, we can hold a priority queue of the top-n most promising sequences, ordered by their log-likelihoods.
This could potentially improve the final accuracy of an imperfect model: indeed it can be the case that the most likely sequence (based on the conditional probability estimated by the model) starts with a token that is not the most likely on its own.
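A tiny illustration with made-up probabilities (not produced by the model): the most likely first token does not necessarily start the most likely full sequence.

import numpy as np
p_greedy_seq = 0.6 * 0.5    # pick "1" first (p=0.6), then its best continuation
p_beam_seq = 0.4 * 0.95     # "8" (p=0.4) followed by a near-certain continuation
print(np.log(p_greedy_seq), np.log(p_beam_seq))  # the beam candidate wins overall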
Bonus Exercise:
Implement beam_translate: a decoder that performs a beam search, keeping at each step the beam_size most promising candidates and their corresponding log-likelihoods.
In [50]:
def beam_translate(model, source_sequence, shared_vocab, rev_shared_vocab,
word_level_source=True, word_level_target=True,
beam_size=10, return_ll=False):
"""Decode candidate translations with a beam search strategy
If return_ll is False, only the best candidate string is returned.
If return_ll is True, all the candidate strings and their loglikelihoods
are returned.
"""
In [51]:
# %load solutions/beam_search.py
def beam_translate(model, source_sequence, shared_vocab, rev_shared_vocab,
word_level_source=True, word_level_target=True,
beam_size=10, return_ll=False):
"""Decode candidate translations with a beam search strategy
If return_ll is False, only the best candidate string is returned.
If return_ll is True, all the candidate strings and their loglikelihoods
are returned.
"""
# Initialize the list of input token ids with the source sequence
source_tokens = tokenize(source_sequence, word_level=word_level_source)
input_ids = [shared_vocab.get(t, UNK) for t in source_tokens[::-1]]
input_ids += [shared_vocab[GO]]
# initialize loglikelihood, input token ids, decoded tokens for
# each candidate in the beam
candidates = [(0, input_ids[:], [], False)]
# Prepare a fixed size numpy array that matches the expected input
# shape for the model
input_array = np.empty(shape=(beam_size, model.input_shape[1]),
dtype=np.int32)
while any([not done and (len(input_ids) < max_length)
for _, input_ids, _, done in candidates]):
# Vectorize the list of input token ids, left-padding with the PAD symbol.
input_array.fill(shared_vocab[PAD])
for i, (_, input_ids, _, done) in enumerate(candidates):
if not done:
input_array[i, -len(input_ids):] = input_ids
# Predict the next output in a single call to the model to amortize
# the overhead and benefit from vector data parallelism on GPU.
next_likelihood_batch = model.predict(input_array)
# Build the new candidates list by summing the loglikelihood of the
# next token with that of its parent for each new possible expansion.
new_candidates = []
for i, (ll, input_ids, decoded, done) in enumerate(candidates):
if done:
new_candidates.append((ll, input_ids, decoded, done))
else:
next_loglikelihoods = np.log(next_likelihood_batch[i, -1])
for next_token_id, next_ll in enumerate(next_loglikelihoods):
new_ll = ll + next_ll
new_input_ids = input_ids[:]
new_input_ids.append(next_token_id)
new_decoded = decoded[:]
new_done = done
if next_token_id == shared_vocab[EOS]:
new_done = True
if not new_done:
new_decoded.append(rev_shared_vocab[next_token_id])
new_candidates.append(
(new_ll, new_input_ids, new_decoded, new_done))
# Only keep a beam of the most promising candidates
new_candidates.sort(reverse=True)
candidates = new_candidates[:beam_size]
separator = " " if word_level_target else ""
if return_ll:
return [(separator.join(decoded), ll) for ll, _, decoded, _ in candidates]
else:
_, _, decoded, done = candidates[0]
return separator.join(decoded)
In [52]:
candidates = beam_translate(simple_seq2seq, "cent mille un",
shared_vocab, rev_shared_vocab,
word_level_target=False,
return_ll=True, beam_size=10)
candidates
Out[52]:
In [53]:
candidates = beam_translate(simple_seq2seq, "quatre vingts",
shared_vocab, rev_shared_vocab,
word_level_target=False,
return_ll=True, beam_size=10)
candidates
Out[53]:
In [54]:
print("Phrase-level test accuracy: %0.3f"
% phrase_accuracy(simple_seq2seq, num_test, fr_test,
decoder_func=beam_translate))
In [55]:
print("Phrase-level train accuracy: %0.3f"
% phrase_accuracy(simple_seq2seq, num_train, fr_train,
decoder_func=beam_translate))
When using the partially trained model, the phrase-level test accuracy is slightly better (0.38 vs 0.37) with the beam decoder than with the greedy decoder. However the improvement is not that important on our toy task. Training the model to convergence would yield a perfect score on the test set anyway.
Properly tuned beam search decoding can be critical to improve the quality of Machine Translation systems trained on natural language pairs though.
We only scratched the surface of sequence-to-sequence systems. To go further, we recommend reading the initial Sequence to Sequence paper as well as the subsequent developments citing this work. Here are also a few pointers if you're interested.
We may want to build a model with a separate encoder and decoder, to improve performance and be more flexible with the architecture.
Having a separate encoder-decoder framework also enables us to build an attention model:
attention models are effective at modeling longer sequences, at finding the alignment between input and output sequences, and at modeling different parts of sequences with separate meanings.
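For illustration only, here is a minimal sketch of such a separate encoder / decoder with the Keras functional API, reusing the vocab_size defined above (this assumes a Keras version with return_state support; layer sizes are arbitrary and this is not part of the lab):

from keras.models import Model
from keras.layers import Input, Embedding, GRU, Dense

# The encoder GRU summarizes the source sequence; its final state
# initializes the decoder GRU which predicts the target tokens.
encoder_inputs = Input(shape=(None,), dtype='int32')
encoder_embedded = Embedding(vocab_size, 32)(encoder_inputs)
_, encoder_state = GRU(256, return_state=True)(encoder_embedded)

decoder_inputs = Input(shape=(None,), dtype='int32')
decoder_embedded = Embedding(vocab_size, 32)(decoder_inputs)
decoder_hidden = GRU(256, return_sequences=True)(decoder_embedded,
                                                 initial_state=encoder_state)
decoder_outputs = Dense(vocab_size, activation='softmax')(decoder_hidden)

encoder_decoder = Model([encoder_inputs, decoder_inputs], decoder_outputs)
encoder_decoder.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

An attention mechanism would additionally let the decoder look at all encoder outputs instead of a single fixed-size state, which is what makes it effective on longer sequences.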
In complement to studying the TensorFlow seq2seq and OpenNMT code bases, you might also want to read the following 55-page tutorial:
Neural Machine Translation and Sequence-to-sequence Models: A Tutorial by Graham Neubig.
In [ ]: