2017-04-15 jkang
Definition
Here is the conditional probability of a target sentence $y$ given a source sentence $x$:
$\begin{align*} \log{p(y|x)} &= \sum_{j=1}^m {\log{p(y_j\ |\ y_{<j}, \mathbf{s})}} = \log {\prod_{j=1}^{m} p(y_j\ |\ y_{<j}, \mathbf{s})} \end{align*}$
What does this mean?
For example, think about English-to-Korean translation.
If $x$ is "I want an apple" (English), $y$ will be "나는 사과 한개를 원해" (Korean).
The model should assign a higher log probability $\log{p(y|x)}$ to "나는 사과 한개를 원해"
than to "나는 감자 한개를 원해" ("I want a potato").
This log probability is the sum of:
$\log{p(y_1\ |\ \mathbf{s})}\ +$
$\log{p(y_2\ |\ y_{1},\ \mathbf{s})}\ +$
$\log{p(y_3\ |\ \{y_1,\ y_2\},\ \mathbf{s})}\ +$
$\log{p(y_4\ |\ \{y_1,\ y_2,\ y_3\},\ \mathbf{s})}\ +\ ...$
Here $\mathbf{s}$ denotes the attention (context) vector computed from the source sentence $x$.
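To make the decomposition concrete, here is a minimal numpy sketch (the per-token probabilities are made-up numbers, not from a real model) showing that summing the token-level log probabilities gives the same value as taking the log of their product:

```python
import numpy as np

# Hypothetical per-token probabilities p(y_j | y_<j, s) that a trained
# decoder might assign to the correct tokens of one target sentence.
token_probs = np.array([0.9, 0.7, 0.8, 0.6])

# log p(y | x) as a sum of per-token log probabilities ...
log_p_sum = np.sum(np.log(token_probs))
# ... equals the log of the product of the per-token probabilities.
log_p_prod = np.log(np.prod(token_probs))

print(log_p_sum, log_p_prod)  # the two values are equal
```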
Loss function
$\begin{align*} J_t &= \sum_{(x,y)\in\mathbb{D}}-\log{p(y\ |\ x)} \end{align*}$
where $\mathbb{D}$ is the training corpus.
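A minimal sketch of this loss, assuming we already have the per-token probabilities the model assigns to each reference translation in a toy corpus (the numbers are arbitrary placeholders):

```python
import numpy as np

# Toy "corpus" D: for each (x, y) pair, the per-token probabilities
# the model assigns to the reference translation y.
corpus_token_probs = [
    np.array([0.9, 0.7, 0.8, 0.6]),  # sentence 1
    np.array([0.5, 0.95, 0.4]),      # sentence 2
]

# J = sum over (x, y) in D of -log p(y | x)
loss = sum(-np.sum(np.log(p)) for p in corpus_token_probs)
print(loss)
```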
Definition of attention
*Image from WildML
$h_3$ serves as the attention vector
*Image from WildML
The attention weight vector $\alpha$ is fed into the decoder when generating the next output
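As a rough illustration, here is a minimal numpy sketch of one attention step; dot-product scoring is an assumption here (other score functions are also common):

```python
import numpy as np

def attention(encoder_states, decoder_state):
    """One attention step with dot-product scoring (an assumption).

    encoder_states: (src_len, hidden) -- one vector per source position
    decoder_state:  (hidden,)         -- current decoder hidden state
    Returns the attention weights alpha and the context vector.
    """
    scores = encoder_states @ decoder_state      # (src_len,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                  # softmax -> weights sum to 1
    context = alpha @ encoder_states             # weighted sum of encoder states
    return alpha, context

# Tiny example: 4 source positions, hidden size 3 (random numbers).
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 3))
dec = rng.normal(size=(3,))
alpha, context = attention(enc, dec)
print(alpha, context)
```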
Global attention feeds the decoder an attention vector computed over all source positions at every decoding step
Local attention, on the other hand, computes the attention vector only from a windowed portion of the source positions at each decoding step
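The difference can be sketched as masking the attention scores: global attention normalizes over every source position, while local attention normalizes only inside a window. This is a simplified sketch; the window center and size below are arbitrary assumptions, and the original local-attention formulation also applies a Gaussian weighting around the predicted center:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(scores):
    # Weights over *all* source positions at every decoding step.
    return softmax(scores)

def local_attention(scores, center, window=2):
    # Weights only over positions inside a window around `center`;
    # everything outside the window is masked out before the softmax.
    mask = np.full_like(scores, -np.inf)
    lo, hi = max(0, center - window), min(len(scores), center + window + 1)
    mask[lo:hi] = 0.0
    return softmax(scores + mask)

scores = np.array([0.1, 2.0, 0.3, -0.5, 1.2, 0.0])
print(global_attention(scores))
print(local_attention(scores, center=1))
```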
Global attention model:
Local attention model: