Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and to accelerate ML research. In this notebook we walk through a translation task step by step: defining a problem, generating the data, training the model, evaluating its quality, translating sequences, and visualizing attention. We will also see how to download a pre-trained model.
In [0]:
#@title
# Copyright 2018 Google LLC.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# https://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
In [0]:
# Install deps
!pip install -q -U tensor2tensor
In [0]:
import sys
if 'google.colab' in sys.modules: # Colab-only TensorFlow version selector
  %tensorflow_version 1.x
import tensorflow as tf
import os
DATA_DIR = os.path.expanduser("/t2t/data") # This folder contains the data
TMP_DIR = os.path.expanduser("/t2t/tmp")
TRAIN_DIR = os.path.expanduser("/t2t/train") # This folder contains the model
EXPORT_DIR = os.path.expanduser("/t2t/export") # This folder contains the exported model for production
TRANSLATIONS_DIR = os.path.expanduser("/t2t/translation") # This folder contains all translated sequences
EVENT_DIR = os.path.expanduser("/t2t/event") # This folder contains the events used to compute the BLEU score
USR_DIR = os.path.expanduser("/t2t/user") # This folder contains the custom data/problems we want to add
tf.gfile.MakeDirs(DATA_DIR)
tf.gfile.MakeDirs(TMP_DIR)
tf.gfile.MakeDirs(TRAIN_DIR)
tf.gfile.MakeDirs(EXPORT_DIR)
tf.gfile.MakeDirs(TRANSLATIONS_DIR)
tf.gfile.MakeDirs(EVENT_DIR)
tf.gfile.MakeDirs(USR_DIR)
In [0]:
PROBLEM = "translate_enfr_wmt32k" # English-to-French translation problem with a 32,768-subword vocabulary
MODEL = "transformer" # Our model
HPARAMS = "transformer_big" # Default hyperparameter set for this model
# If you have a single GPU, use transformer_big_single_gpu instead
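To check which hparams sets are registered (for example, to confirm that transformer_big_single_gpu is available in your install), here is a minimal sketch; it assumes registry.list_hparams() exists in your Tensor2Tensor version.
In [0]:
from tensor2tensor.utils import registry
from tensor2tensor import models  # importing models registers their hparams sets
# List registered hparams sets whose names mention "transformer_big"
print([h for h in registry.list_hparams() if "transformer_big" in h])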
In [0]:
# Show all problems and models
from tensor2tensor.utils import registry
from tensor2tensor import problems
print(problems.available()) # Show all problems
print(registry.list_models()) # Show all registered models
# Or from the command line:
!t2t-trainer --registry_help # Show the full registry (models, hparams sets, problems)
!t2t-trainer --problems_help # Show all problems
For more information: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t_datagen.py
In [0]:
!t2t-datagen \
--data_dir=$DATA_DIR \
--tmp_dir=$TMP_DIR \
--problem=$PROBLEM \
--t2t_usr_dir=$USR_DIR
In [0]:
t2t_problem = problems.problem(PROBLEM)
t2t_problem.generate_data(DATA_DIR, TMP_DIR)
Data generation can be done either from the command line (t2t-datagen) or from code, as shown above.
batch_size: prefer the largest value that fits in your GPU memory.
train_steps: the paper reports 300k steps on 8 GPUs for the big Transformer, so on a single GPU you need roughly 8x as many steps to process a comparable number of examples (see https://arxiv.org/abs/1706.03762 for more information).
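As a back-of-the-envelope check of that scaling rule (illustrative only; the cell below keeps the paper's 300k steps):
In [0]:
# Rule of thumb: the paper used 300k steps on 8 GPUs for the big Transformer.
# With fewer GPUs, scale the step count to process a comparable number of examples.
paper_gpus, paper_steps = 8, 300000
my_gpus = 1
print(paper_steps * paper_gpus // my_gpus)  # 2400000 steps, i.e. ~8x more on 1 GPU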
In [0]:
train_steps = 300000 # Total number of training steps (across all epochs)
eval_steps = 100 # Number of steps to perform for each evaluation
batch_size = 4096
save_checkpoints_steps = 1000
ALPHA = 0.1
schedule = "continuous_train_and_eval"
You can choose the training schedule; see:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t_trainer.py
In [0]:
!t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --hparams="batch_size=$batch_size" \
  --schedule=$schedule \
  --output_dir=$TRAIN_DIR \
  --train_steps=$train_steps \
  --worker_gpu=1 \
  --eval_steps=$eval_steps
--worker_gpu=1 trains on a single GPU (optional).
For distributed training see: https://github.com/tensorflow/tensor2tensor/blob/master/docs/distributed_training.md
create_hparams: https://github.com/tensorflow/tensor2tensor/blob/28adf2690c551ef0f570d41bef2019d9c502ec7e/tensor2tensor/utils/hparams_lib.py#L42
Change hyperparameters: https://github.com/tensorflow/tensor2tensor/blob/28adf2690c551ef0f570d41bef2019d9c502ec7e/tensor2tensor/models/transformer.py#L1627
In [0]:
from tensor2tensor.utils.trainer_lib import create_run_config, create_experiment
from tensor2tensor.utils.trainer_lib import create_hparams
from tensor2tensor.utils import registry
from tensor2tensor import models
from tensor2tensor import problems
# Init Hparams object from T2T Problem
hparams = create_hparams(HPARAMS)
# Make Changes to Hparams
hparams.batch_size = batch_size
hparams.learning_rate = ALPHA
#hparams.max_length = 256
# You can see all hparams with the line below
#import json; print(json.loads(hparams.to_json()))
In [0]:
RUN_CONFIG = create_run_config(
model_dir=TRAIN_DIR,
model_name=MODEL,
save_checkpoints_steps= save_checkpoints_steps
)
tensorflow_exp_fn = create_experiment(
run_config=RUN_CONFIG,
hparams=hparams,
model_name=MODEL,
problem_name=PROBLEM,
data_dir=DATA_DIR,
train_steps=train_steps,
eval_steps=eval_steps,
#use_xla=True # For acceleration
)
tensorflow_exp_fn.train_and_evaluate()
In [0]:
# Files used for translation (WMT newstest2014 dev set)
SOURCE_TEST_TRANSLATE_DIR = TMP_DIR+"/dev/newstest2014-fren-src.en.sgm"
REFERENCE_TEST_TRANSLATE_DIR = TMP_DIR+"/dev/newstest2014-fren-ref.fr.sgm"
BEAM_SIZE=1
In [0]:
!t2t-translate-all \
--source=$SOURCE_TEST_TRANSLATE_DIR \
--model_dir=$TRAIN_DIR \
--translations_dir=$TRANSLATIONS_DIR \
--data_dir=$DATA_DIR \
--problem=$PROBLEM \
--hparams_set=$HPARAMS \
--output_dir=$TRAIN_DIR \
--t2t_usr_dir=$USR_DIR \
--beam_size=$BEAM_SIZE \
--model=$MODEL
The BLEU score for all translations: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t_bleu.py#L68
In [0]:
!t2t-bleu \
--translations_dir=$TRANSLATIONS_DIR \
--model_dir=$TRAIN_DIR \
--data_dir=$DATA_DIR \
--problem=$PROBLEM \
--hparams_set=$HPARAMS \
--source=$SOURCE_TEST_TRANSLATE_DIR \
--reference=$REFERENCE_TEST_TRANSLATE_DIR \
--event_dir=$EVENT_DIR
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t_decoder.py
In [0]:
!echo "the business of the house" > "inputs.en"
!echo -e "les affaires de la maison" > "reference.fr" # You can add other references
!t2t-decoder \
--data_dir=$DATA_DIR \
--problem=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--output_dir=$TRAIN_DIR \
--decode_hparams="beam_size=1,alpha=$ALPHA" \
--decode_from_file="inputs.en" \
--decode_to_file="outputs.fr"
# See the translations
!cat outputs.fr
In [0]:
import tensorflow as tf
# After training the model, restart the runtime, run this cell first, then run the prediction cells.
tfe = tf.contrib.eager
tfe.enable_eager_execution()
Modes = tf.estimator.ModeKeys
In [0]:
#Config
from tensor2tensor import models
from tensor2tensor import problems
from tensor2tensor.layers import common_layers
from tensor2tensor.utils import trainer_lib
from tensor2tensor.utils import t2t_model
from tensor2tensor.utils import registry
from tensor2tensor.utils import metrics
import numpy as np
enfr_problem = problems.problem(PROBLEM)
# Locate the vocab file so we can encode inputs and decode model outputs
vocab_name = "vocab.translate_enfr_wmt32k.32768.subwords"
vocab_file = os.path.join(DATA_DIR, vocab_name)
# Get the encoders from the problem
encoders = enfr_problem.feature_encoders(DATA_DIR)
ckpt_path = tf.train.latest_checkpoint(TRAIN_DIR)
print(ckpt_path)
def translate(inputs):
  encoded_inputs = encode(inputs)
  with tfe.restore_variables_on_create(ckpt_path):
    model_output = translate_model.infer(encoded_inputs)["outputs"]
  return decode(model_output)

def encode(input_str, output_str=None):
  """Input str to features dict, ready for inference."""
  inputs = encoders["inputs"].encode(input_str) + [1]  # add EOS id
  batch_inputs = tf.reshape(inputs, [1, -1, 1])  # Make it 3D.
  return {"inputs": batch_inputs}

def decode(integers):
  """List of ints to str."""
  integers = list(np.squeeze(integers))
  if 1 in integers:
    integers = integers[:integers.index(1)]  # truncate at EOS
  return encoders["inputs"].decode(integers)
In [0]:
#Predict
hparams = trainer_lib.create_hparams(HPARAMS, data_dir=DATA_DIR, problem_name=PROBLEM)
translate_model = registry.model(MODEL)(hparams, Modes.PREDICT)
inputs = "the animal didn't cross the river because it was too tired"
ref = "l'animal n'a pas traversé la rivière parce qu'il était trop fatigué" # this is just a reference to evaluate the quality of the translation
outputs = translate(inputs)
print("Inputs: %s" % inputs)
print("Outputs: %s" % outputs)
with open("outputs.fr", "w+") as file_output:
  file_output.write(outputs)
with open("reference.fr", "w+") as file_reference:
  file_reference.write(ref)
BLEU score for a sequence translation: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t_bleu.py#L24
In [0]:
!t2t-bleu \
--translation=outputs.fr \
--reference=reference.fr
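The same score can also be computed from Python; a minimal sketch, assuming bleu_hook.bleu_wrapper (the helper t2t-bleu calls internally) keeps this signature in your T2T version:
In [0]:
# Compute BLEU in Python instead of via the CLI (signature assumed from
# recent T2T versions; t2t-bleu uses this helper internally).
from tensor2tensor.utils import bleu_hook
bleu = bleu_hook.bleu_wrapper("reference.fr", "outputs.fr", case_sensitive=False)
print("BLEU: %.2f" % (100 * bleu))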
In [0]:
from tensor2tensor.visualization import attention
from tensor2tensor.data_generators import text_encoder
SIZE = 35
def encode_eval(input_str, output_str):
  inputs = tf.reshape(encoders["inputs"].encode(input_str) + [1], [1, -1, 1, 1])  # Make it 4D.
  outputs = tf.reshape(encoders["inputs"].encode(output_str) + [1], [1, -1, 1, 1])  # Make it 4D.
  return {"inputs": inputs, "targets": outputs}
def get_att_mats():
  enc_atts = []
  dec_atts = []
  encdec_atts = []
  for i in range(hparams.num_hidden_layers):
    enc_att = translate_model.attention_weights[
      "transformer/body/encoder/layer_%i/self_attention/multihead_attention/dot_product_attention" % i][0]
    dec_att = translate_model.attention_weights[
      "transformer/body/decoder/layer_%i/self_attention/multihead_attention/dot_product_attention" % i][0]
    encdec_att = translate_model.attention_weights[
      "transformer/body/decoder/layer_%i/encdec_attention/multihead_attention/dot_product_attention" % i][0]
    enc_atts.append(resize(enc_att))
    dec_atts.append(resize(dec_att))
    encdec_atts.append(resize(encdec_att))
  return enc_atts, dec_atts, encdec_atts
def resize(np_mat):
  # Keep at most SIZE tokens in each dimension
  np_mat = np_mat[:, :SIZE, :SIZE]
  # Normalize across heads
  row_sums = np.sum(np_mat, axis=0)
  layer_mat = np_mat / row_sums[np.newaxis, :]
  lsh = layer_mat.shape
  # Add an extra dim so the viz code works.
  layer_mat = np.reshape(layer_mat, (1, lsh[0], lsh[1], lsh[2]))
  return layer_mat
def to_tokens(ids):
  ids = np.squeeze(ids)
  subtokenizer = hparams.problem_hparams.vocabulary['targets']
  tokens = []
  for _id in ids:
    if _id == 0:
      tokens.append('<PAD>')
    elif _id == 1:
      tokens.append('<EOS>')
    elif _id == -1:
      tokens.append('<NULL>')
    else:
      tokens.append(subtokenizer._subtoken_id_to_subtoken_string(_id))
  return tokens
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
      <script src="/static/components/requirejs/require.js"></script>
      <script>
        requirejs.config({
          paths: {
            base: '/static/base',
            "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
            jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
          },
        });
      </script>
      '''))
In [0]:
import numpy as np
# Convert inputs and outputs to subwords
inp_text = to_tokens(encoders["inputs"].encode(inputs))
out_text = to_tokens(encoders["inputs"].encode(outputs))
hparams = trainer_lib.create_hparams(HPARAMS, data_dir=DATA_DIR, problem_name=PROBLEM)
# Run eval to collect attention weights
example = encode_eval(inputs, outputs)
with tfe.restore_variables_on_create(ckpt_path):
  translate_model.set_mode(Modes.EVAL)
  translate_model(example)
# Get normalized attention weights for each layer
enc_atts, dec_atts, encdec_atts = get_att_mats()
call_html()
attention.show(inp_text, out_text, enc_atts, dec_atts, encdec_atts)
For more information: https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/serving
In [0]:
# Export the model
!t2t-exporter \
--data_dir=$DATA_DIR \
--output_dir=$TRAIN_DIR \
--problem=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--decode_hparams="beam_size=1,alpha=$ALPHA" \
--export_dir=$EXPORT_DIR
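Once exported, the model can be served with TensorFlow Serving and queried from the command line; a minimal sketch following the serving docs linked above, where the server address and servable name are placeholders:
In [0]:
# Hedged sketch: query a running TensorFlow Serving instance.
# "localhost:9000" and "my_model" are placeholders for your deployment.
!t2t-query-server \
  --server=localhost:9000 \
  --servable_name=my_model \
  --problem=$PROBLEM \
  --data_dir=$DATA_DIR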
In [0]:
print("checkpoint: ")
!gsutil ls "gs://tensor2tensor-checkpoints"
print("data: ")
!gsutil ls "gs://tensor2tensor-data"
In [0]:
PROBLEM_PRETRAINED = "translate_ende_wmt32k"
MODEL_PRETRAINED = "transformer"
HPARAMS_PRETRAINED = "transformer_base"
In [0]:
import tensorflow as tf
import os
DATA_DIR_PRETRAINED = os.path.expanduser("/t2t/data_pretrained")
CHECKPOINT_DIR_PRETRAINED = os.path.expanduser("/t2t/checkpoints_pretrained")
tf.gfile.MakeDirs(DATA_DIR_PRETRAINED)
tf.gfile.MakeDirs(CHECKPOINT_DIR_PRETRAINED)
gs_data_dir = "gs://tensor2tensor-data/"
vocab_name = "vocab.translate_ende_wmt32k.32768.subwords"
vocab_file = os.path.join(gs_data_dir, vocab_name)
gs_ckpt_dir = "gs://tensor2tensor-checkpoints/"
ckpt_name = "transformer_ende_test"
gs_ckpt = os.path.join(gs_ckpt_dir, ckpt_name)
TRAIN_DIR_PRETRAINED = os.path.join(CHECKPOINT_DIR_PRETRAINED, ckpt_name)
!gsutil cp {vocab_file} {DATA_DIR_PRETRAINED}
!gsutil -q cp -R {gs_ckpt} {CHECKPOINT_DIR_PRETRAINED}
CHECKPOINT_NAME_PRETRAINED = tf.train.latest_checkpoint(TRAIN_DIR_PRETRAINED) # for translate with code
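The pre-trained checkpoint can also be used with the eager translate() helper defined earlier; a minimal sketch, assuming the eager-execution cells above (translate/encode/decode) have already been run in this session:
In [0]:
# Reuse the eager pipeline with the pre-trained En-De model.
# translate() reads the globals reassigned here (hparams, translate_model,
# encoders, ckpt_path), so the earlier cells must have been executed.
hparams = trainer_lib.create_hparams(HPARAMS_PRETRAINED,
                                     data_dir=DATA_DIR_PRETRAINED,
                                     problem_name=PROBLEM_PRETRAINED)
translate_model = registry.model(MODEL_PRETRAINED)(hparams, Modes.PREDICT)
encoders = problems.problem(PROBLEM_PRETRAINED).feature_encoders(DATA_DIR_PRETRAINED)
ckpt_path = CHECKPOINT_NAME_PRETRAINED
print(translate("the business of the house"))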
In [0]:
!echo "the business of the house" > "inputs.en"
!echo -e "das Geschäft des Hauses" > "reference.de"
!t2t-decoder \
--data_dir=$DATA_DIR_PRETRAINED \
--problem=$PROBLEM_PRETRAINED \
--model=$MODEL_PRETRAINED \
--hparams_set=$HPARAMS_PRETRAINED \
--output_dir=$TRAIN_DIR_PRETRAINED \
--decode_hparams="beam_size=1" \
--decode_from_file="inputs.en" \
--decode_to_file="outputs.de"
# See the translations
!cat outputs.de
!t2t-bleu \
--translation=outputs.de \
--reference=reference.de
To add a new dataset/problem, subclass Problem and register it with @registry.register_problem. See TranslateEnfrWmt8k for an example: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/translate_enfr.py
Adding your own components: https://github.com/tensorflow/tensor2tensor#adding-your-own-components
See this example: https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/test_data/example_usr_dir
In [0]:
from tensor2tensor.utils import registry
from tensor2tensor.data_generators import translate_enfr

@registry.register_problem
class MyTranslateEnFr(translate_enfr.TranslateEnfrWmt8k):

  def generator(self, data_dir, tmp_dir, train):
    # your code here
    pass
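For new problems, the Text2TextProblem API is often simpler than overriding generator; below is a minimal, self-contained sketch. The problem name, corpus file names, and vocabulary size are all placeholders, and the overridden methods follow the interface described in the T2T docs linked above. Save it in $USR_DIR and pass --t2t_usr_dir so the registry picks it up.
In [0]:
import os
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

@registry.register_problem
class MyCustomTranslateEnfr(text_problems.Text2TextProblem):
  """Sketch: En-Fr translation from a custom parallel corpus."""

  @property
  def approx_vocab_size(self):
    return 2**13  # ~8k subwords

  @property
  def is_generate_per_split(self):
    return False  # generate once; T2T shards into train/eval automatically

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    del data_dir, dataset_split
    # "corpus.en"/"corpus.fr" are placeholder names for your parallel data.
    with open(os.path.join(tmp_dir, "corpus.en")) as f_en, \
         open(os.path.join(tmp_dir, "corpus.fr")) as f_fr:
      for en_line, fr_line in zip(f_en, f_fr):
        yield {"inputs": en_line.strip(), "targets": fr_line.strip()}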