2017-04-30 jkang
ref:
$$\mathbf{h}_{t} = f(\mathbf{h}_{t-1},\ x_t)$$
- $\mathbf{h}_{t}$ is the hidden state at time $t$
- $f$ is a non-linear activation function (as simple as a sigmoid or as complex as an LSTM unit); see the sketch below
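A minimal sketch of one recurrence step, taking $f$ to be a plain tanh RNN cell (the weight names `W_hh`, `W_xh`, and `b` are assumptions for illustration, not from the reference):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """One application of f: h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b)."""
    # W_hh, W_xh, b are hypothetical recurrent weights, input weights, and bias
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)
```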
$$p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp(\mathbf{w}_j \mathbf{h}_t)}{\sum_{j'=1}^K \exp(\mathbf{w}_{j'} \mathbf{h}_t)}$$
- The possible labels range over $j = 1, \ldots, K$
- $\mathbf{w}_j$ are the rows of a weight matrix $\mathbf{W}$
- "How probable it is to predict $x_j$ at time $t$ given the previous sequence"
e.g. for sequence-to-sequence mapping: $$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$$
- Output length $T'$ and input length $T$ need not be the same
- The hidden state of the decoder at time $t$ is computed as $$\mathbf{h}_t = f(\mathbf{h}_{t-1}, y_{t-1}, \mathbf{c})$$
$$P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \mathbf{c}) = g(\mathbf{h}_t, y_{t-1}, \mathbf{c})$$
- $\mathbf{c}$ is a fixed-length encoded vector of the input sequence
- $g$ produces a valid probability distribution over the next output symbol (e.g. a softmax); a decoder-step sketch follows
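One possible decoder step under these equations; the weight names and the choice of $g$ as a linear map followed by a softmax are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(h_prev, y_prev, c, W_h, W_y, W_c, b, W_out):
    """f updates the hidden state from (h_{t-1}, y_{t-1}, c);
    g maps (h_t, y_{t-1}, c) to a distribution over output symbols."""
    h_t = np.tanh(W_h @ h_prev + W_y @ y_prev + W_c @ c + b)  # f
    p_t = softmax(W_out @ np.concatenate([h_t, y_prev, c]))   # g
    return h_t, p_t
```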
$$\arg\max_\theta \frac{1}{N}\sum_{n=1}^N \log p_{\theta}(\mathbf{y}_n \mid \mathbf{x}_n)$$
- $N$ is the number of training examples
- $(\mathbf{x}_n, \mathbf{y}_n)$ is an input-sequence and output-sequence pair
- $\theta$ is the set of model parameters (a sketch of the objective follows)
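A sketch of the quantity being maximized; `seq_log_prob` is a hypothetical helper that returns $\log p_{\theta}(\mathbf{y}_n \mid \mathbf{x}_n)$ by summing the per-step log-probabilities from the decoder:

```python
import numpy as np

def avg_log_likelihood(pairs, seq_log_prob):
    """(1/N) * sum_n log p_theta(y_n | x_n) over training pairs (x_n, y_n)."""
    return np.mean([seq_log_prob(x, y) for x, y in pairs])
```

In practice this is maximized with gradient-based methods (equivalently, by minimizing the average negative log-likelihood).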