Licensed under the Apache License, Version 2.0 (the "License").

Neural Machine Translation with Attention

This notebook is still under construction! Please come back later.

This notebook trains a sequence to sequence (seq2seq) model for Spanish to English translation using TF 2.0 APIs. This is an advanced example that assumes some knowledge of sequence to sequence models.

After training the model in this notebook, you will be able to input a Spanish sentence, such as "¿todavia estan en casa?", and return the English translation: "are you still at home?"

The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence have the model's attention while translating.

Note: This example takes approximately 10 minutes to run on a single P100 GPU.

In [0]:
import collections
import io
import itertools
import os
import random
import re
import time
import unicodedata

import numpy as np

import tensorflow as tf
assert tf.__version__.startswith('2')

import matplotlib.pyplot as plt


Download and prepare the dataset

We'll use a language dataset that contains language translation pairs in the format:

May I borrow this book? ¿Puedo tomar prestado este libro?

There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, we've hosted a copy of this dataset on Google Cloud, but you can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data:

  1. Clean the sentences by removing special characters.
  2. Add a start and end token to each sentence.
  3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
  4. Pad each sentence to a maximum length.
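
As a rough illustration of steps 3 and 4, here is a minimal pure-Python sketch. The helper names `build_word_index` and `pad_post` are hypothetical and are not part of this notebook, which relies on Keras utilities instead. The sketch assigns ids in order of first appearance, so the actual ids will differ from what the Keras tokenizer produces.

```python
def build_word_index(sentences):
    """Map each distinct word to an integer id, reserving 0 for padding."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    # Reverse mapping: id -> word.
    index_word = {i: w for w, i in word_index.items()}
    return word_index, index_word


def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


sentences = ['<start> hello . <end>',
             '<start> may i borrow this book ? <end>']
word_index, index_word = build_word_index(sentences)
ids = [[word_index[w] for w in s.split()] for s in sentences]
padded = pad_post(ids)
print(padded)
```

Every row of `padded` has the same length, which is what lets the sentences be stacked into a single batch tensor.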

In [0]:
# TODO(brianklee): This preprocessing should ideally be implemented in TF
# because preprocessing should be exported as part of the SavedModel.

# Converts the unicode file to ascii
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
                 if unicodedata.category(c) != 'Mn')

START_TOKEN = u'<start>'
END_TOKEN = u'<end>'

def preprocess_sentence(w):
  # remove accents; lowercase everything
  w = unicode_to_ascii(w.strip()).lower()

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  w = re.sub(r'([?.!,¿])', r' \1 ', w)

  # replacing everything with space except (a-z, '.', '?', '!', ',', '¿')
  w = re.sub(r'[^a-z?.!,¿]+', ' ', w).strip()

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = START_TOKEN + ' ' + w + ' ' + END_TOKEN
  return w

In [0]:
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"

Training on the complete dataset of >100,000 sentences will take a long time. To train faster, we can limit the size of the dataset (of course, translation quality degrades with less data).

In [0]:
def load_anki_data(num_examples=None):
  # Download the file
  path_to_zip = tf.keras.utils.get_file(
      '', origin='',
      extract=True)

  path_to_file = os.path.dirname(path_to_zip) + '/spa-eng/spa.txt'
  with io.open(path_to_file, 'rb') as f:
    lines = f.read().decode('utf8').strip().split('\n')

  # Data comes as tab-separated strings; one per line.
  eng_spa_pairs = [[preprocess_sentence(w) for w in line.split('\t')] for line in lines]

  # The translations file is ordered from shortest to longest, so slicing from
  # the front will select the shorter examples. This also speeds up training.
  if num_examples is not None:
    eng_spa_pairs = eng_spa_pairs[:num_examples]
  eng_sentences, spa_sentences = zip(*eng_spa_pairs)

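  # Build each vocabulary from its own language's sentences.
  eng_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  eng_tokenizer.fit_on_texts(eng_sentences)
  spa_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  spa_tokenizer.fit_on_texts(spa_sentences)
  return (eng_spa_pairs, eng_tokenizer, spa_tokenizer)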

In [0]:
sentence_pairs, english_tokenizer, spanish_tokenizer = load_anki_data(NUM_EXAMPLES)

In [0]:
# Turn our english/spanish pairs into TF Datasets by mapping words -> integers.

def make_dataset(eng_spa_pairs, eng_tokenizer, spa_tokenizer):
  eng_sentences, spa_sentences = zip(*eng_spa_pairs)
  eng_ints = eng_tokenizer.texts_to_sequences(eng_sentences)
  spa_ints = spa_tokenizer.texts_to_sequences(spa_sentences)

  padded_eng_ints = tf.keras.preprocessing.sequence.pad_sequences(
      eng_ints, padding='post')
  padded_spa_ints = tf.keras.preprocessing.sequence.pad_sequences(
      spa_ints, padding='post')

  dataset = tf.data.Dataset.from_tensor_slices((padded_eng_ints, padded_spa_ints))
  return dataset

In [0]:
# Train/test split
train_size = int(len(sentence_pairs) * 0.8)
train_sentence_pairs, test_sentence_pairs = sentence_pairs[:train_size], sentence_pairs[train_size:]
# Show length
len(train_sentence_pairs), len(test_sentence_pairs)

In [0]:
_english, _spanish = train_sentence_pairs[0]
_eng_ints, _spa_ints = english_tokenizer.texts_to_sequences([_english])[0], spanish_tokenizer.texts_to_sequences([_spanish])[0]
print("Source language: ")
print('\n'.join('{:4d} ----> {}'.format(i, word) for i, word in zip(_eng_ints, _english.split())))
print("Target language: ")
print('\n'.join('{:4d} ----> {}'.format(i, word) for i, word in zip(_spa_ints, _spanish.split())))

In [0]:
# Set up datasets

train_ds = make_dataset(train_sentence_pairs, english_tokenizer, spanish_tokenizer)
test_ds = make_dataset(test_sentence_pairs, english_tokenizer, spanish_tokenizer)
train_ds = train_ds.shuffle(len(train_sentence_pairs)).batch(BATCH_SIZE, drop_remainder=True)
test_ds = test_ds.batch(BATCH_SIZE, drop_remainder=True)

In [0]:
print("Dataset outputs elements with shape ({}, {})".format(
    *train_ds.output_shapes))

Write the encoder and decoder model

Here, we'll implement an encoder-decoder model with attention. The following diagram shows that each input word is assigned a weight by the attention mechanism which is then used by the decoder to predict the next word in the sentence.

The input is put through an encoder model which gives us the encoder output of shape (batch_size, max_length, hidden_size) and the encoder hidden state of shape (batch_size, hidden_size).

In [0]:
MAX_OUTPUT_LENGTH = train_ds.output_shapes[1][1]

def gru(units):
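  # The encoder and decoder both unpack (output, state), so the layer must
  # return the full output sequence as well as the final state.
  return tf.keras.layers.GRU(units,
                             return_sequences=True,
                             return_state=True)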

In [0]:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, encoder_size):
    super(Encoder, self).__init__()
    self.embedding_dim = embedding_dim
    self.encoder_size = encoder_size
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = gru(encoder_size)

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state=hidden)
    return output, state

  def initial_hidden_state(self, batch_size):
    return tf.zeros((batch_size, self.encoder_size))

For the decoder, we're using Bahdanau attention. Let's decide on notation before writing the simplified form of the equations it implements:

  • FC = Fully connected (dense) layer
  • EO = Encoder output
  • H = hidden state
  • X = input to the decoder

And the pseudo-code:

  • score = FC(tanh(FC(EO) + FC(H)))
  • attention weights = softmax(score, axis = 1). Softmax is applied on the last axis by default, but here we want to apply it on axis 1, since the shape of score is (batch_size, max_length, 1). max_length is the length of our input. Since we are trying to assign a weight to each input position, softmax should be applied on that axis.
  • context vector = sum(attention weights * EO, axis = 1). Same reason as above for choosing axis as 1.
  • embedding output = The input to the decoder X is passed through an embedding layer.
  • merged vector = concat(embedding output, context vector)
  • This merged vector is then given to the GRU
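
To make the pseudo-code concrete, here is a small pure-Python sketch of the score, attention weights and context vector for a single example (no batch dimension). It is an illustration under simplifying assumptions rather than the notebook's implementation: the weights `W1`, `W2` and `v` are hand-picked toy values, the bias terms of the dense layers are omitted, and the function name `additive_attention` is hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(x - top) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(enc_output, hidden, W1, W2, v):
    """enc_output: max_length vectors of size hidden_size; hidden: one vector."""
    projected_hidden = matvec(W2, hidden)              # FC(H)
    scores = []
    for eo in enc_output:
        projected_eo = matvec(W1, eo)                  # FC(EO)
        activated = [math.tanh(a + b)
                     for a, b in zip(projected_eo, projected_hidden)]
        scores.append(sum(vi * ai for vi, ai in zip(v, activated)))  # FC(tanh(...))
    weights = softmax(scores)  # normalized over input positions (axis 1)
    hidden_size = len(enc_output[0])
    # Weighted sum of encoder outputs over the input positions.
    context = [sum(w * eo[j] for w, eo in zip(weights, enc_output))
               for j in range(hidden_size)]
    return context, weights


# Toy sizes: max_length = 3, hidden_size = 2, units = 2.
enc_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
context, weights = additive_attention(enc_output, hidden, W1, W2, v)
print(weights, context)
```

The weights sum to 1 across the three input positions, and the context vector has one entry per hidden unit, matching the (batch_size, hidden_size) shape once a batch dimension is added.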

The shapes of all the vectors at each step have been specified in the comments in the code:

In [0]:
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)
  def call(self, hidden_state, enc_output):
    # enc_output shape = (batch_size, max_length, hidden_size)

    # (batch_size, hidden_size) -> (batch_size, 1, hidden_size)
    hidden_with_time = tf.expand_dims(hidden_state, 1)
    # score shape == (batch_size, max_length, 1)
    score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time)))
    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum = (batch_size, hidden_size)
    context_vector = attention_weights * enc_output
    context_vector = tf.reduce_sum(context_vector, axis=1)
    return context_vector, attention_weights

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, decoder_size):
    super(Decoder, self).__init__()
    self.vocab_size = vocab_size
    self.embedding_dim = embedding_dim
    self.decoder_size = decoder_size
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = gru(decoder_size)
    self.fc = tf.keras.layers.Dense(vocab_size)
    self.attention = BahdanauAttention(decoder_size)

  def call(self, x, hidden, enc_output):
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU, starting from the
    # previous decoder hidden state
    output, state = self.gru(x, initial_state=hidden)

    # output shape == (batch_size, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

Define a translate function

Now, let's put the encoder and decoder halves together. The encoder step is fairly straightforward; we'll just reuse Keras's dynamic unroll. For the decoder, we have to make some choices about how to feed the decoder RNN. Overall the process goes as follows:

  1. Pass the input through the encoder, which returns the encoder output and the encoder hidden state.
  2. The encoder output, encoder hidden state and the <START> token are passed to the decoder.
  3. The decoder returns the predictions and the decoder hidden state.
  4. The encoder output, hidden state and next token are then fed back into the decoder repeatedly. This has two different behaviors under training and inference:
    • during training, we use teacher forcing, where the correct next token is fed into the decoder, regardless of what the decoder emitted.
    • during inference, we use tf.argmax(predictions) to select the most likely continuation and feed it back into the decoder. Another strategy that yields more robust results is called beam search.
  5. Repeat step 4 until either the decoder emits an <END> token, indicating that it's done translating, or we run into a hardcoded length limit.
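
The difference between the two feeding strategies can be sketched without any TensorFlow at all. In the example below, `toy_decoder_step` is a hypothetical stand-in for a trained decoder (it simply looks up a fixed next token), and the token ids are made up for illustration.

```python
START, END = 1, 2
VOCAB_SIZE = 5


def toy_decoder_step(token_id, hidden):
    """Hypothetical decoder step: returns scores for the next token."""
    next_id = {START: 3, 3: 4, 4: END}.get(token_id, END)
    scores = [0.0] * VOCAB_SIZE
    scores[next_id] = 1.0
    return scores, hidden + 1  # "hidden" is just a step counter here


def decode(target=None, max_output_length=10):
    """Run the decoder loop with teacher forcing (target given) or greedily."""
    token, hidden, emitted = START, 0, []
    length = len(target) if target is not None else max_output_length
    for i in range(length - 1):
        scores, hidden = toy_decoder_step(token, hidden)
        prediction = max(range(VOCAB_SIZE), key=scores.__getitem__)  # argmax
        emitted.append(prediction)
        if target is not None:
            token = target[i + 1]  # teacher forcing: feed the correct token
        else:
            token = prediction     # inference: feed back our own prediction
        if token == END:
            break
    return emitted


greedy = decode()
forced = decode(target=[START, 4, 3, END])
print(greedy, forced)
```

With teacher forcing, the decoder's input at each step comes from the target, so an early mistake does not change what is fed in at later steps. In greedy decoding, every prediction becomes the next input.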

In [0]:
class NmtTranslator(tf.keras.Model):
  def __init__(self, encoder, decoder, start_token_id, end_token_id):
    super(NmtTranslator, self).__init__()
    self.encoder = encoder
    self.decoder = decoder
    # (The token_id should match the decoder's language.)
    # Uses start_token_id to initialize the decoder.
    self.start_token_id = tf.constant(start_token_id)
    # Check for sequence completion using this token_id
    self.end_token_id = tf.constant(end_token_id)

  def call(self, inp, target=None, max_output_length=MAX_OUTPUT_LENGTH):
    '''Translate an input.

    If target is provided, teacher forcing is used to generate the translation.
    '''
    batch_size = inp.shape[0]
    hidden = self.encoder.initial_hidden_state(batch_size)

    enc_output, enc_hidden = self.encoder(inp, hidden)
    dec_hidden = enc_hidden

    if target is not None:
      output_length = target.shape[1]
    else:
      output_length = max_output_length

    predictions_array = tf.TensorArray(tf.float32, size=output_length - 1)
    attention_array = tf.TensorArray(tf.float32, size=output_length - 1)
    # Feed <START> token to start decoder.
    dec_input = tf.cast([self.start_token_id] * batch_size, tf.int32)
    # Keep track of which sequences have emitted an <END> token
    is_done = tf.zeros([batch_size], dtype=tf.bool)

    for i in tf.range(output_length - 1):
      dec_input = tf.expand_dims(dec_input, 1)
      predictions, dec_hidden, attention_weights = self.decoder(dec_input, dec_hidden, enc_output)
      predictions = tf.where(is_done, tf.zeros_like(predictions), predictions)
      # Write predictions/attention for later visualization.
      predictions_array = predictions_array.write(i, predictions)
      attention_array = attention_array.write(i, attention_weights)

      # Decide what to pass into the next iteration of the decoder.
      if target is not None:
        # if target is known, use teacher forcing
        dec_input = target[:, i + 1]
      else:
        # Otherwise, pick the most likely continuation
        dec_input = tf.argmax(predictions, axis=1, output_type=tf.int32)

      # Figure out which sentences just completed.
      is_done = tf.logical_or(is_done, tf.equal(dec_input, self.end_token_id))
      # Exit early if all our sentences are done.
      if tf.reduce_all(is_done):
        break

    # [time, batch, predictions] -> [batch, time, predictions]
    return tf.transpose(predictions_array.stack(), [1, 0, 2]), tf.transpose(attention_array.stack(), [1, 0, 2, 3])

Define the loss function

Our loss function is a word-for-word comparison between the true answer and the model's prediction.

real = ['<start>', 'This', 'is', 'the', 'correct', 'answer', '.', '<end>', '<oov>']
pred = ['This', 'is', 'what', 'the', 'model', 'emitted', '.', '<end>']

results in comparing

This/This, is/is, the/what, correct/the, answer/model, ./emitted, <end>/.

and ignoring the rest of the prediction.
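
The same comparison can be written out in plain Python for a single sentence. This is a sketch under assumptions, not the notebook's loss: the function name `masked_cross_entropy` is hypothetical, id 0 is assumed to mark the positions to be masked, and the logits are toy values.

```python
import math


def masked_cross_entropy(real_ids, logits, mask_id=0):
    """Sum per-step cross-entropy, skipping steps whose label is mask_id."""
    real_ids = real_ids[1:]           # the prediction has no <start> token
    logits = logits[:len(real_ids)]   # ignore extra predicted steps
    total = 0.0
    for label, step_logits in zip(real_ids, logits):
        if label == mask_id:
            continue                  # masked position contributes no loss
        top = max(step_logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in step_logits))
        total += log_norm - step_logits[label]  # -log softmax(logits)[label]
    return total


# Toy vocabulary of 4 ids: 0 = masked, 1 = <start>, 2 = <end>, 3 = a word.
real = [1, 3, 2, 0]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]  # one extra step
loss = masked_cross_entropy(real, uniform_logits)
print(loss)
```

Only the two unmasked labels contribute here, and with uniform logits each one costs log(4), so the total is 2 * log(4).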

In [0]:
def loss_fn(real, pred):
  # The prediction doesn't include the <start> token.
  real = real[:, 1:]
  # Cut down the prediction to the correct shape (We ignore extra words).
  pred = pred[:, :real.shape[1]]
  # If real == 0 (the padding id, displayed as <OOV>), then mask out the loss.
  mask = tf.cast(tf.not_equal(real, 0), pred.dtype)
  loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask

  # Sum loss over the time dimension, but average it over the batch dimension.
  return tf.reduce_mean(tf.reduce_sum(loss_, axis=1))

Configure model directory

We'll use one directory to save all of our relevant artifacts (summary logs, checkpoints, SavedModel exports, etc.)

In [0]:
# Where to save checkpoints, tensorboard summaries, etc.
MODEL_DIR = '/tmp/tensorflow/nmt_attention'

def apply_clean():
  if tf.io.gfile.exists(MODEL_DIR):
    print('Removing existing model dir: {}'.format(MODEL_DIR))
    tf.io.gfile.rmtree(MODEL_DIR)

In [0]:
# Optional: remove existing data
# apply_clean()

In [0]:
# Summary writers
train_summary_writer = tf.summary.create_file_writer(
  os.path.join(MODEL_DIR, 'summaries', 'train'), flush_millis=10000)
test_summary_writer = tf.summary.create_file_writer(
  os.path.join(MODEL_DIR, 'summaries', 'eval'), flush_millis=10000, name='test')

In [0]:
# Set up all stateful objects
encoder = Encoder(len(english_tokenizer.word_index) + 1, EMBEDDING_DIM, ENCODER_SIZE)
decoder = Decoder(len(spanish_tokenizer.word_index) + 1, EMBEDDING_DIM, DECODER_SIZE)
start_token_id = spanish_tokenizer.word_index[START_TOKEN]
end_token_id = spanish_tokenizer.word_index[END_TOKEN]
model = NmtTranslator(encoder, decoder, start_token_id, end_token_id)

# TODO(brianklee): Investigate whether Adam defaults have changed and whether it affects training.
optimizer = tf.keras.optimizers.Adam(epsilon=1e-8)

In [0]:
# Checkpoints
checkpoint_dir = os.path.join(MODEL_DIR, 'checkpoints')
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(
    encoder=encoder, decoder=decoder, optimizer=optimizer)
# Restore variables on creation if a checkpoint exists.
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

In [0]:
# SavedModel exports
export_path = os.path.join(MODEL_DIR, 'export')

Visualize the model's output

Let's visualize our model's output. (It hasn't been trained yet, so it will output gibberish.)

We'll use this visualization to check on the model's progress.

In [0]:
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    fontdict = {'fontsize': 14}
    ax.set_xticklabels([''] + sentence.split(), fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence.split(), fontdict=fontdict)

def ints_to_words(tokenizer, ints):
    return ' '.join(tokenizer.index_word[int(i)] if int(i) != 0 else '<OOV>' for i in ints)
def sentence_to_ints(tokenizer, sentence):
    sentence = preprocess_sentence(sentence)
    return tf.constant(tokenizer.texts_to_sequences([sentence])[0])

def translate_and_plot_ints(model, english_tokenizer, spanish_tokenizer, ints, target_ints=None):
    """Run translation on a sentence and plot an attention matrix.

    Sentence should be passed in as list of integers.
    """
    ints = tf.expand_dims(ints, 0)
    predictions, attention = model(ints)
    prediction_ids = tf.squeeze(tf.argmax(predictions, axis=-1))
    attention = tf.squeeze(attention)
    sentence = ints_to_words(english_tokenizer, ints[0])
    predicted_sentence = ints_to_words(spanish_tokenizer, prediction_ids)
    print(u'Input: {}'.format(sentence))
    print(u'Predicted translation: {}'.format(predicted_sentence))
    if target_ints is not None:
      print(u'Correct translation: {}'.format(ints_to_words(spanish_tokenizer, target_ints)))
    plot_attention(attention, sentence, predicted_sentence)    

def translate_and_plot_words(model, english_tokenizer, spanish_tokenizer, sentence, target_sentence=None):
    """Same as translate_and_plot_ints, but pass in a sentence as a string."""
    english_ints = sentence_to_ints(english_tokenizer, sentence)
    spanish_ints = sentence_to_ints(spanish_tokenizer, target_sentence) if target_sentence is not None else None
    translate_and_plot_ints(model, english_tokenizer, spanish_tokenizer, english_ints, target_ints=spanish_ints)

In [0]:
translate_and_plot_words(model, english_tokenizer, spanish_tokenizer, u"it's really cold here", u'hace mucho frio aqui')

Train the model

In [0]:
def train(model, optimizer, dataset):
  """Trains model on `dataset` using `optimizer`."""
  start = time.time()
  avg_loss = tf.keras.metrics.Mean('loss', dtype=tf.float32)
  for inp, target in dataset:
    with tf.GradientTape() as tape:
      predictions, _ = model(inp, target=target)
      loss = loss_fn(target, predictions)

    # Update the running mean that is written to the summary below.
    avg_loss(loss)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    if tf.equal(optimizer.iterations % 10, 0):
      tf.summary.scalar('loss', avg_loss.result(), step=optimizer.iterations)
      rate = 10 / (time.time() - start)
      print('Step #%d\tLoss: %.6f (%.2f steps/sec)' % (optimizer.iterations, loss, rate))
      start = time.time()
    if tf.equal(optimizer.iterations % 100, 0):
      translate_and_plot_ints(model, english_tokenizer, spanish_tokenizer, inp[0], target[0])

def test(model, dataset, step_num):
  """Perform an evaluation of `model` on the examples from `dataset`."""
  avg_loss = tf.keras.metrics.Mean('loss', dtype=tf.float32)
  for inp, target in dataset:
    predictions, _ = model(inp)
    loss = loss_fn(target, predictions)
    avg_loss(loss)

  print('Model test set loss: {:0.4f}'.format(avg_loss.result()))
  tf.summary.scalar('loss', avg_loss.result(), step=step_num)

In [0]:
for i in range(NUM_TRAIN_EPOCHS):
  start = time.time()
  with train_summary_writer.as_default():
    train(model, optimizer, train_ds)
  end = time.time()
  print('\nTrain time for epoch #{} ({} total steps): {}'.format(
      i + 1, optimizer.iterations, end - start))
  with test_summary_writer.as_default():
    test(model, test_ds, optimizer.iterations)

In [0]:
# TODO(brianklee): This seems to be complaining about input shapes not being set?
# tf.saved_model.save(model, export_path)

Next steps

  • Download a different dataset to experiment with translations, for example, English to German, or English to French.
  • Experiment with training on a larger dataset, or using more epochs.
