NMT-Keras tutorial

In this module, we are going to create an encoder-decoder model with:

  • A bidirectional GRU encoder and a GRU decoder
  • An attention model
  • The previously generated word is fed back to the decoder
  • An MLP for computing the initial RNN state
  • Skip connections from inputs to outputs
  • Beam search.

As usual, first we import the necessary stuff.


In [1]:
from keras.layers import *
from keras.models import model_from_json, Model
from keras.optimizers import Adam, RMSprop, Nadam, Adadelta, SGD, Adagrad, Adamax
from keras.regularizers import l2
from keras_wrapper.cnn_model import Model_Wrapper
from keras_wrapper.extra.regularize import Regularize


Using Theano backend.
Using cuDNN version 5105 on context None
Mapped name None to device cuda: GeForce GTX 1080 (0000:01:00.0)
17/07/2017_12:31:04:  Log file (/home/lvapeab/.picloud/cloud.log) opened

And define the dimensions of our model. For instance, a word embedding size of 50 and 100 units in the RNNs. The inputs/outputs are defined as in previous tutorials.


In [2]:
ids_inputs = ['source_text', 'state_below']
ids_outputs = ['target_text']
word_embedding_size = 50
hidden_state_size = 100
input_vocabulary_size = 686  # Autoset in the library
output_vocabulary_size = 513  # Autoset in the library

Now, let's define our encoder. First, we have to create an Input layer to connect the input text to our model. Next, we'll apply a word embedding to the sequence of input indices. This word embedding will feed a Bidirectional GRU network, which will produce our sequence of annotations:


In [3]:
# 1. Source text input
src_text = Input(name=ids_inputs[0],
                 batch_shape=tuple([None, None]), # Since the input sequences have variable length, we do not restrict the Input shape
                 dtype='int32')
# 2. Encoder
# 2.1. Source word embedding
src_embedding = Embedding(input_vocabulary_size, word_embedding_size, 
                          name='source_word_embedding', mask_zero=True # Zeroes as mask
                          )(src_text)
# 2.2. BRNN encoder (GRU/LSTM)
annotations = Bidirectional(GRU(hidden_state_size, 
                                return_sequences=True  # Return the full sequence
                                ),
                            name='bidirectional_encoder',
                            merge_mode='concat')(src_embedding)

Once we have built the encoder, let's build our decoder. First, we have an additional input: The previously generated word (the so-called state_below). We introduce it by means of an Input layer and a (target language) word embedding:


In [4]:
# 3. Decoder
# 3.1.1. Previously generated words as inputs for training -> Teacher forcing
next_words = Input(name=ids_inputs[1], batch_shape=tuple([None, None]), dtype='int32')
# 3.1.2. Target word embedding
state_below = Embedding(output_vocabulary_size, word_embedding_size,
                        name='target_word_embedding', 
                        mask_zero=True)(next_words)
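
As a quick illustration of teacher forcing, state_below is simply the target sentence shifted one position to the right, starting with the beginning-of-sentence token (the tokens below are purely illustrative):

# Hypothetical example (illustrative tokens only):
#   target_text : [w1, w2, w3, <eos>]
#   state_below : [<bos>, w1, w2, w3]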

The initial hidden state of the decoder's GRU is computed by means of an MLP (in this case, single-layered) from the average of the annotations:


In [5]:
ctx_mean = MaskedMean()(annotations)
annotations = MaskLayer()(annotations)  # We may want the padded annotations

initial_state = Dense(hidden_state_size, name='initial_state',
                      activation='tanh')(ctx_mean)
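
In other words, this cell roughly computes (with the mean taken over the non-padded source positions):

$$ h_0 = \tanh\left(W_{init}\,\bar{a} + b_{init}\right), \qquad \bar{a} = \frac{1}{|x|}\sum_{t=1}^{|x|} a_t $$

where $a_t$ are the encoder annotations and $W_{init}$, $b_{init}$ are the weights of the initial_state Dense layer.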

So, we have the input of our decoder:


In [6]:
input_attentional_decoder = [state_below, annotations, initial_state]

Note that, for a given sample, the sequence of annotations and the initial state are the same regardless of the decoding time-step. In order to avoid recomputing them at every step, we build two models: one for training and another one for sampling. They will share weights, but the sampling model will be made up of two different models. The first one (model_init) will compute the sequence of annotations and the initial_state. The other model (model_next) will compute a single recurrent step, given the sequence of annotations, the previous hidden state and the words generated so far.

Therefore, we now slightly change the way we declare layers: we must share them among the different models.

So, let's start by building the attentional-conditional GRU:


In [7]:
# Define the AttGRUCond layer
sharedAttGRUCond = AttGRUCond(hidden_state_size,
                              return_sequences=True,
                              return_extra_variables=True, # Return attended input and attention weights
                              return_states=True # Returns the sequence of hidden states (see discussion above)
                              )
[proj_h, x_att, alphas, h_state] = sharedAttGRUCond(input_attentional_decoder) # Apply sharedAttGRUCond to our input

Now, we set skip connections between the input and the output layers. Note that, since we have a temporal dimension because of the RNN decoder, we must apply the layers in a TimeDistributed way. Finally, we will merge all skip connections and apply a 'tanh' non-linearity:


In [10]:
# Define layer function
shared_FC_mlp = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_lstm')
# Apply layer function
out_layer_mlp = shared_FC_mlp(proj_h)

# Define layer function
shared_FC_ctx = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_ctx')
# Apply layer function
out_layer_ctx = shared_FC_ctx(x_att)
shared_Lambda_Permute = PermuteGeneral((1, 0, 2))
out_layer_ctx = shared_Lambda_Permute(out_layer_ctx)

# Define layer function
shared_FC_emb = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_emb')
# Apply layer function
out_layer_emb = shared_FC_emb(state_below)

shared_additional_output_merge = Add(name='additional_input')
additional_output = shared_additional_output_merge([out_layer_mlp, out_layer_ctx, out_layer_emb])
shared_activation_tanh = Activation('tanh')
out_layer = shared_activation_tanh(additional_output)
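
Putting this cell into a single (approximate) equation, with all biases folded into $b$:

$$ o_t = \tanh\left(W_h\, h_t + W_c\, \tilde{c}_t + W_e\, E(y_{t-1}) + b\right) $$

where $h_t$ is the hidden state of the conditional GRU (proj_h), $\tilde{c}_t$ is the attended context (x_att) and $E(y_{t-1})$ is the embedding of the previously generated word (state_below).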

Now, we'll apply a deep output layer with linear activation:


In [11]:
shared_deep_out = TimeDistributed(Dense(word_embedding_size, activation='linear', name='maxout_layer'))
out_layer = shared_deep_out(out_layer)

Finally, we apply a softmax function to obtain a probability distribution over the target vocabulary at each time-step.


In [14]:
shared_FC_soft = TimeDistributed(Dense(output_vocabulary_size,
                                       activation='softmax',
                                       name='softmax_layer'),
                                 name=ids_outputs[0])
softout = shared_FC_soft(out_layer)
model = Model(inputs=[src_text, next_words], outputs=softout)
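
At this point, the training model could be compiled and fitted as any other Keras model (in practice, the library takes care of this for us). A minimal sketch, with arbitrary hyperparameters:

model.compile(optimizer=Adam(lr=1e-3),  # any of the imported optimizers could be used
              loss='categorical_crossentropy',
              sample_weight_mode='temporal')  # per-timestep weights, useful with padded targets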

That's all! We have built an NMT model! Now, let's build the models required for sampling. Recall that we are building two models: one for encoding the inputs and another one for advancing steps in the decoding stage.

Let's start with model_init. It will take the usual inputs (src_text and state_below) and will output:

  • The output probabilities (for the first time-step)
  • The sequence of annotations (from the encoder)
  • The current decoder hidden state

The only restriction here is that the first output must be the output layer (probabilities) of the model.


In [15]:
model_init = Model(inputs=[src_text, next_words], outputs=[softout, annotations, h_state])
# Store inputs and outputs names for model_init
ids_inputs_init = ids_inputs

# first output must be the output probs.
ids_outputs_init = ids_outputs + ['preprocessed_input', 'next_state']

Next, we build model_next. It will have the following inputs:

  • Preprocessed input
  • Previously generated word
  • Previous hidden state

And the following outputs:

  • Model probabilities
  • Current hidden state

First, we define the inputs:


In [16]:
preprocessed_size = hidden_state_size * 2  # The annotations concatenate the forward and backward encoder states
preprocessed_annotations = Input(name='preprocessed_input', shape=tuple([None, preprocessed_size]))
prev_h_state = Input(name='prev_state', shape=tuple([hidden_state_size]))
input_attentional_decoder = [state_below, preprocessed_annotations, prev_h_state]

And now, we build the model, reusing the shared layers declared before:


In [18]:
# Apply decoder
[proj_h, x_att, alphas, h_state] = sharedAttGRUCond(input_attentional_decoder)
out_layer_mlp = shared_FC_mlp(proj_h)
out_layer_ctx = shared_FC_ctx(x_att)
out_layer_ctx = shared_Lambda_Permute(out_layer_ctx)
out_layer_emb = shared_FC_emb(state_below)
additional_output = shared_additional_output_merge([out_layer_mlp, out_layer_ctx, out_layer_emb])
out_layer = shared_activation_tanh(additional_output)
out_layer = shared_deep_out(out_layer)
softout = shared_FC_soft(out_layer)
model_next = Model(inputs=[next_words, preprocessed_annotations, prev_h_state],
                   outputs=[softout, preprocessed_annotations, h_state])

Finally, we store the inputs/outputs for model_next. In addition, we create a couple of dictionaries matching the inputs/outputs of the different models (model_init -> model_next and model_next -> model_next):


In [19]:
# Store inputs and outputs names for model_next
# first input must be previous word
ids_inputs_next = [ids_inputs[1]] + ['preprocessed_input', 'prev_state']
# first output must be the output probs.
ids_outputs_next = ids_outputs + ['preprocessed_input', 'next_state']

# Input -> Output matchings from model_init to model_next and from model_next to model_next
matchings_init_to_next = {'preprocessed_input': 'preprocessed_input', 'next_state': 'prev_state'}
matchings_next_to_next = {'preprocessed_input': 'preprocessed_input', 'next_state': 'prev_state'}
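
To make the role of the two sampling models and these matchings more concrete, here is a conceptual greedy-decoding sketch. It is only illustrative: in practice, keras_wrapper runs this loop (with beam search) for us, and the <bos>/<eos> indices below are hypothetical:

import numpy as np

def greedy_decode(src_seq, max_len=50, bos_index=1, eos_index=2):
    # src_seq: array of shape (1, source_length) with source word indices
    prev_word = np.asarray([[bos_index]], dtype='int32')
    # model_init encodes the source and performs the first decoding step
    probs, annotations, prev_state = model_init.predict([src_seq, prev_word])
    hypothesis = []
    for _ in range(max_len):
        next_word = int(np.argmax(probs[0, -1]))  # most probable word at the last step
        if next_word == eos_index:
            break
        hypothesis.append(next_word)
        prev_word = np.asarray([[next_word]], dtype='int32')
        if prev_state.ndim == 3:  # the layer may return the full sequence of states
            prev_state = prev_state[:, -1]
        # The arguments below follow matchings_init_to_next / matchings_next_to_next:
        # 'preprocessed_input' -> 'preprocessed_input' and 'next_state' -> 'prev_state'
        probs, annotations, prev_state = model_next.predict([prev_word, annotations, prev_state])
    return hypothesis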

And that's all! To use this model together with the facilities provided by the keras_wrapper library, we should declare the model as a method of a Model_Wrapper class. A complete example of this can be found in model_zoo.py.
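
For reference, a minimal and hypothetical sketch of such a wrapping is shown below. The class, method and attribute names (other than Model_Wrapper itself) are assumptions modeled on the variables of this tutorial; the real implementation in model_zoo.py should be consulted:

class TutorialTranslationModel(Model_Wrapper):
    def build_attentional_encoder_decoder(self):
        # Reuse the Keras graphs built along this tutorial
        self.model = model            # training model
        self.model_init = model_init  # encodes the source and performs the first decoding step
        self.model_next = model_next  # performs one decoding step
        # Bookkeeping needed at sampling time
        self.ids_inputs_init = ids_inputs_init
        self.ids_outputs_init = ids_outputs_init
        self.ids_inputs_next = ids_inputs_next
        self.ids_outputs_next = ids_outputs_next
        self.matchings_init_to_next = matchings_init_to_next
        self.matchings_next_to_next = matchings_next_to_next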