Attention-based Neural Machine Translation (NMT)

2017-04-15 jkang

This tutorial covers the following concepts for NMT (based on Luong et al., 2015):

  • What is NMT?
    • Definition of NMT
    • Loss function

  • What is 'Attention-based' NMT?
    • Definition of attention
    • Global attention? Local attention?

What is Neural Machine Translation (NMT)?

  • Definition

    • According to Luong et al., 2015, NMT is "a neural network that directly models the conditional probability $p(y|x)$ of translating a source sentence ($x_1, x_2, \ldots, x_n$) to a target sentence ($y_1, y_2, \ldots, y_m$)".
    • In other words, NMT learns the probability of a target sentence given a source sentence.
    • Here is the conditional probability:

      $\begin{align*} \log{p(y|x)} &= \sum_{j=1}^{m} \log{p(y_j \mid y_{<j}, \mathbf{s})} \\ &= \log{\prod_{j=1}^{m} p(y_j \mid y_{<j}, \mathbf{s})} \end{align*}$

    • Let's break down the equation above
    • The left-hand side, $\log{p(y|x)}$, is the log probability of the target sentence $y$ given the source sentence $x$; a good model assigns the highest value to the best translation.

      For example, think about English-to-Korean translation.
      If $x$ is "I want an apple" (English), $y$ should be "나는 사과 한개를 원해" (Korean).
      $\log{p(y|x)}$ will assign a higher log probability to "나는 사과 한개를 원해"
      than to "나는 감자 한개를 원해" ("I want a potato").

    • The right-hand side, the sum of log probabilities $\sum_{j=1}^{m} \log{p(y_j \mid y_{<j}, \mathbf{s})}$, describes the actual process of decoding the source sentence "I want an apple" one target word at a time (a code sketch of this decomposition follows this list).

      It will be the sum of:
      $\log{p(y_1 \mid \mathbf{s})}\ +$
      $\log{p(y_2 \mid y_{1},\ \mathbf{s})}\ +$
      $\log{p(y_3 \mid \{y_1,\ y_2\},\ \mathbf{s})}\ +$
      $\log{p(y_4 \mid \{y_1,\ y_2,\ y_3\},\ \mathbf{s})}\ +\ \dots$

    • As you can see, $\mathbf{s}$ is used at every step to predict/decode the next target word.
    • This $\mathbf{s}$ is called the "source representation" or "thought vector", and providing it to the decoder is the job of the attention mechanism:

      $\mathbf{s}\ =\ \text{``Attention''}$
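
    • To make this decomposition concrete, here is a minimal Python sketch (not Luong et al.'s actual model): decoder_step is a hypothetical stand-in for a trained decoder that returns a distribution over a toy target vocabulary given the previous target words and the source representation $\mathbf{s}$

      import numpy as np

      def decoder_step(prev_targets, s, vocab_size=5):
          # Placeholder decoder: returns a normalized probability vector
          # over a toy target vocabulary (illustrative only).
          rng = np.random.default_rng(len(prev_targets))
          logits = rng.normal(size=vocab_size) + s.mean()
          exp = np.exp(logits - logits.max())
          return exp / exp.sum()

      def sentence_log_prob(y, s):
          """log p(y|x) = sum_j log p(y_j | y_<j, s)"""
          total = 0.0
          for j, y_j in enumerate(y):
              p = decoder_step(y[:j], s)   # p( . | y_<j, s)
              total += np.log(p[y_j])      # add log p(y_j | y_<j, s)
          return total

      s = np.ones(4)      # toy "source representation" / thought vector
      y = [1, 3, 0, 2]    # toy target sentence as word indices
      print(sentence_log_prob(y, s))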

  • Loss function

    • So, NMT has two parts: an encoder and a decoder
    • The output translation is produced only by the decoder
    • (See the encoder-decoder figure in Luong et al., 2015 for a visual overview.)
    • The loss function is simply the sum of negative log probabilities over the training corpus, minimized during training (a minimal code sketch follows the equation below)

      $J_t = \sum_{(x,y)\in\mathbb{D}} -\log{p(y \mid x)}$, where $\mathbb{D}$ is the training corpus
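
    • A minimal sketch of this objective, with log_p_y_given_x standing in (purely as a hypothetical placeholder) for a trained model that scores a (source, target) pair:

      import numpy as np

      def log_p_y_given_x(x, y):
          # Placeholder model score: a deterministic fake log probability,
          # used only to make the loss computation runnable.
          rng = np.random.default_rng(len(x) * 31 + len(y))
          return float(-rng.uniform(1.0, 10.0))

      def nmt_loss(corpus):
          """J_t = sum over (x, y) pairs in the corpus of -log p(y|x)."""
          return sum(-log_p_y_given_x(x, y) for x, y in corpus)

      corpus = [([1, 2, 3], [1, 3, 0, 2]),   # toy (source, target) index pairs
                ([4, 5],    [2, 2, 1])]
      print(nmt_loss(corpus))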

What is 'Attention-based' NMT?

  • Definition of attention

    • Attention in NMT is a fixed-length vector that carries information about the input
    • The attention mechanism helps predict/decode the outputs, as in the images below (a code sketch follows this list):

      (Image from WildML: $h_3$ serves as the attention vector.)

      (Image from WildML: the attention weights $\alpha$ are fed into decoding the next output.)


    • In Luong et al., 2015, both a global and a local attention mechanism are used

      Global attention feeds the decoder an attention (context) vector computed from all source positions every time the next output is decoded
      Local attention, on the other hand, computes the attention vector from only a windowed portion of the source when decoding the next output
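
    • Here is a minimal sketch of one attention step, assuming dot-product scoring (one of the score functions discussed in Luong et al., 2015); the variable names and sizes are illustrative only

      import numpy as np

      def softmax(z):
          e = np.exp(z - z.max())
          return e / e.sum()

      def attention_step(decoder_state, encoder_states):
          scores = encoder_states @ decoder_state   # one score per source position
          alpha = softmax(scores)                   # attention weights (sum to 1)
          context = alpha @ encoder_states          # weighted sum of source states
          return alpha, context

      rng = np.random.default_rng(0)
      H = rng.normal(size=(6, 4))    # 6 source positions, hidden size 4
      h_t = rng.normal(size=4)       # current decoder hidden state
      alpha, context = attention_step(h_t, H)
      print(alpha.round(3))          # how strongly each source word is attended to
      print(context.round(3))        # fed into predicting the next target word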

  • Global attention model:

    • Luong et al. describe global attention as a form of soft attention, which is differentiable because it considers all source positions
  • Local attention model:

    • Luong et al. describe local attention as a blend of hard and soft attention: like hard attention it looks at only a window of source positions, but unlike hard attention it stays differentiable (the windowing idea is sketched below)
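
    • The sketch below contrasts the two under the same dot-product scoring assumption; the local variant simply restricts attention to a window around a given center position, which is a simplification of Luong et al.'s local attention (their model also predicts the window center and applies a Gaussian weighting around it)

      import numpy as np

      def softmax(z):
          e = np.exp(z - z.max())
          return e / e.sum()

      def global_context(h_t, H):
          alpha = softmax(H @ h_t)          # weights over ALL source positions
          return alpha @ H

      def local_context(h_t, H, center, D=2):
          lo, hi = max(0, center - D), min(len(H), center + D + 1)
          alpha = softmax(H[lo:hi] @ h_t)   # weights over the window only
          return alpha @ H[lo:hi]

      rng = np.random.default_rng(1)
      H = rng.normal(size=(8, 4))           # 8 source positions, hidden size 4
      h_t = rng.normal(size=4)              # current decoder hidden state
      print(global_context(h_t, H).round(3))
      print(local_context(h_t, H, center=3).round(3))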