RNN Encoder-Decoder for Machine Translation

2017-04-30 jkang

This tutorial summarizes the statistical concepts behind the RNN Encoder-Decoder for machine translation.

ref: Cho et al. (2014), "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (arXiv:1406.1078)


What is RNN?

  • A recurrent neural network (RNN) is an architecture in which a variable-length input sequence ($\mathbf{x} = (x_1, x_2, ... , x_T)$) is fed into the network to produce a hidden state and, optionally, an output $\mathbf{y}$ (Cho et al., 2014).
  • The hidden state of the RNN is updated as
    $$\mathbf{h}_{t} = f(\mathbf{h}_{t-1},\ x_t)$$

    • $\mathbf{h}_{t}$ is the hidden state at time $t$
    • $f$ is a non-linear activation function (it can be as simple as a sigmoid or as complex as an LSTM unit); a minimal code sketch of this update follows this list
  • An RNN predicts the next output based on the previous elements of the sequence
  • RNN learns a probability distribution
    $$p(x_{t,j} = 1 | x_{t-1}, ... ,x_1) = \frac{\exp(\mathbf{w}_j \mathbf{h}_t)}{\sum_{j'=1}^K \exp(\mathbf{w}_{j'} \mathbf{h}_t)}$$

    • The possible labels are indexed by $j = 1, ... , K$ (the output $x_t$ is a 1-of-$K$ coded vector, so $x_{t,j} = 1$ means label $j$ is emitted at time $t$)
    • $\mathbf{w}_j$ are the rows of a weight matrix $\mathbf{W}$
    • "How probable it is to predict $x_j$ at time $t$ given the previous sequence"
  • Combining these conditionals gives the probability of the whole sequence $\mathbf{x}$ (also sketched below) $$p(\mathbf{x}) = \prod_{t=1}^T p(x_t | x_{t-1}, ... , x_1)$$
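
Below is a minimal NumPy sketch of the two ingredients above: one hidden-state update $\mathbf{h}_t = f(\mathbf{h}_{t-1}, x_t)$ with $f = \tanh$, and the softmax output distribution over the $K$ labels. The sizes and weight names (`W_xh`, `W_hh`, `W`) are illustrative assumptions, not the exact parameterization used by Cho et al. (2014).

```python
import numpy as np

K, H = 10, 8                                 # label count (1-of-K) and hidden size -- arbitrary
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, K))    # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(H, H))    # hidden-to-hidden weights
W = rng.normal(scale=0.1, size=(K, H))       # output weights; w_j are its rows

def rnn_step(h_prev, x_t):
    """h_t = f(h_{t-1}, x_t), here with f = tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

def output_distribution(h_t):
    """p(x_{t,j} = 1 | x_{t-1}, ..., x_1): softmax over the K labels."""
    logits = W @ h_t                         # w_j . h_t for every label j
    logits -= logits.max()                   # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()
```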
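
The chain-rule factorization of $p(\mathbf{x})$ can then be evaluated by stepping through the sequence and accumulating log-probabilities (to avoid underflow on long sequences). This continues the sketch above and assumes the inputs are one-hot (1-of-$K$) vectors.

```python
def sequence_log_prob(xs):
    """log p(x) = sum_t log p(x_t | x_{t-1}, ..., x_1) under the RNN sketched above.

    `xs` is a list of one-hot (1-of-K) vectors.
    """
    h = np.zeros(H)                          # h_0: empty history
    log_p = 0.0
    for x_t in xs:
        p_t = output_distribution(h)         # distribution over the next symbol
        log_p += np.log(p_t @ x_t)           # probability assigned to the observed x_t
        h = rnn_step(h, x_t)                 # fold x_t into the hidden state
    return log_p

# Example: log-probability of a random 5-symbol sequence
xs = [np.eye(K)[i] for i in rng.integers(0, K, size=5)]
print(sequence_log_prob(xs))
```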

RNN Encoder-Decoder

  • RNN Encoder-Decoder is "a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence"
    $$e.g. \quad p(y_1, ... , y_{T'} | x_1, ... ,x_T)$$

    • The output length $T'$ and the input length $T$ need not be the same
  • As illustrated in the RNN Encoder-Decoder diagram of Cho et al. (2014):
    • The hidden state of the decoder at time $t$ is computed as $$\mathbf{h}_t = f(\mathbf{h}_{t-1}, y_{t-1}, \mathbf{c})$$
      and the next output symbol is drawn from $$p(y_t | y_{t-1}, y_{t-2}, ... , y_1, \mathbf{c}) = g(\mathbf{h}_t, y_{t-1}, \mathbf{c})$$
    • $\mathbf{c}$ is a fixed-length encoded vector of the input sequence
    • $g$ produces the probability distribution over the next output symbol (see the decoder sketch after this list)
  • To summarize, the RNN Encoder-Decoder is a conditional probability mapping from inputs to outputs; the two networks are "jointly trained to maximize the conditional log-likelihood" (see the objective sketch after this list)
    $$\arg\max_\theta \frac{1}{N}\sum_{n=1}^N \log p_{\theta}(\mathbf{y}_n | \mathbf{x}_n)$$

    • $N$ is the number of training examples
    • ($\mathbf{x}_n$, $\mathbf{y}_n$) is the $n$-th input-output sequence pair
    • $\theta$ is the set of model parameters
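
As a rough illustration of the decoder equations, the sketch below computes $\mathbf{h}_t = f(\mathbf{h}_{t-1}, y_{t-1}, \mathbf{c})$ and $p(y_t | y_{t-1}, ... , y_1, \mathbf{c}) = g(\mathbf{h}_t, y_{t-1}, \mathbf{c})$. The shapes, the choice of $\tanh$ for $f$, and the softmax-over-a-concatenation for $g$ are assumptions made only for illustration; the paper itself uses gated hidden units.

```python
import numpy as np

K, H, C = 10, 8, 8                           # output vocab, decoder hidden size, context size -- arbitrary
rng = np.random.default_rng(1)
W_yh = rng.normal(scale=0.1, size=(H, K))    # previous-output-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(H, H))    # hidden-to-hidden weights
W_ch = rng.normal(scale=0.1, size=(H, C))    # context-to-hidden weights
W_g = rng.normal(scale=0.1, size=(K, H + K + C))  # output weights for g

def decoder_step(h_prev, y_prev, c):
    """h_t = f(h_{t-1}, y_{t-1}, c), with f = tanh in this sketch."""
    return np.tanh(W_hh @ h_prev + W_yh @ y_prev + W_ch @ c)

def decoder_output(h_t, y_prev, c):
    """p(y_t | y_{<t}, c) = g(h_t, y_{t-1}, c): softmax over the K output labels."""
    logits = W_g @ np.concatenate([h_t, y_prev, c])
    logits -= logits.max()                   # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()
```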
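
The training criterion $\arg\max_\theta \frac{1}{N}\sum_{n=1}^N \log p_{\theta}(\mathbf{y}_n | \mathbf{x}_n)$ then follows by averaging the decoder's conditional log-likelihood over the training pairs. This continues the decoder sketch above; `encode` is a hypothetical function standing in for the encoder RNN that maps $\mathbf{x}_n$ to its context vector $\mathbf{c}$.

```python
def conditional_log_prob(ys, c):
    """log p(y | x) = sum_t log p(y_t | y_{<t}, c), using decoder_step/decoder_output above."""
    h, y_prev, log_p = np.zeros(H), np.zeros(K), 0.0   # a zero vector stands in for a start symbol
    for y_t in ys:                           # ys: list of one-hot output vectors
        h = decoder_step(h, y_prev, c)
        p_t = decoder_output(h, y_prev, c)
        log_p += np.log(p_t @ y_t)           # probability assigned to the observed y_t
        y_prev = y_t
    return log_p

def objective(pairs, encode):
    """(1/N) * sum_n log p_theta(y_n | x_n): the quantity maximized during training."""
    return sum(conditional_log_prob(y_n, encode(x_n)) for x_n, y_n in pairs) / len(pairs)
```

In practice the parameters $\theta$ (encoder and decoder together) are estimated by gradient-based optimization of this criterion.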