"Using the output embedding to improve language models"

by Press and Wolf, 2017

The input word is represented as a one-hot vector $\mathbf{c} \in \mathbb{R}^C$, where $C$ is the vocabulary size.

Its embedding is $\mathbf{U}^T \mathbf{c}$, i.e., the row of the input embedding matrix $\mathbf{U}$ corresponding to the input word.

The network's computation then gives the hidden state $\mathbf{h}_2 = f(\mathbf{U}^T \mathbf{c})$, where $f$ denotes the rest of the network's computation.
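To make the lookup concrete, here is a minimal NumPy sketch (the sizes $C=5$, $H=3$ and the random values are illustrative) showing that multiplying $\mathbf{U}^T$ by a one-hot vector simply selects a row of $\mathbf{U}$:

```python
import numpy as np

C, H = 5, 3                      # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
U = rng.standard_normal((C, H))  # input embedding matrix, shape (C, H)

word_id = 2
c = np.zeros(C)
c[word_id] = 1.0                 # one-hot representation of the input word

# U^T c picks out the row of U for the input word
embedding = U.T @ c              # shape (H,)
assert np.allclose(embedding, U[word_id])
```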

A second matrix $\mathbf{V}$ projects $\mathbf{h}_2$ to a vector $\mathbf{h}_3$ containing one score per vocabulary word: $\mathbf{h}_3 = \mathbf{V}\mathbf{h}_2$.

The vector of scores is then converted to a vector of probabilities using the softmax function: $\mathbf{p} = \operatorname{softmax}(\mathbf{h}_3)$.
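A sketch of the scoring and softmax step under the same toy shapes; the matrix and hidden-state values are random placeholders:

```python
import numpy as np

C, H = 5, 3
rng = np.random.default_rng(1)
V = rng.standard_normal((C, H))   # output projection matrix, shape (C, H)
h2 = rng.standard_normal(H)       # hidden state from the network body

h3 = V @ h2                       # one score per vocabulary word, shape (C,)

# numerically stable softmax turns scores into probabilities
p = np.exp(h3 - h3.max())
p /= p.sum()
assert np.isclose(p.sum(), 1.0)
```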

$\mathbf{U} \in \mathbb{R}^{C\, \times\, H}$ (the input embedding).

$\mathbf{h}_2 \in \mathbb{R}^H$.

$\mathbf{V} \in \mathbb{R}^{C\,\times\,H}$ (the output embedding, the same shape as $\mathbf{U}$).
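Because $\mathbf{V}$ and $\mathbf{U}$ have the same shape, the paper's proposal is to tie them, $\mathbf{V} = \mathbf{U}$, so one shared matrix serves as both input and output embedding. A minimal PyTorch sketch of the tying, assuming an LSTM body for $f$; the class name `TiedLM` and all sizes are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy language model with tied input/output embeddings (V = U)."""
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_size)                 # U, shape (C, H)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)  # the body f
        self.out = nn.Linear(hidden_size, vocab_size, bias=False)       # V, shape (C, H)
        self.out.weight = self.emb.weight  # tie: V and U share one parameter tensor

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h2, _ = self.rnn(self.emb(tokens))  # h2 = f(U^T c)
        return self.out(h2)                 # h3 = V h2, one score per vocabulary word

model = TiedLM(vocab_size=5, hidden_size=3)
scores = model(torch.tensor([[0, 2, 4]]))   # shape (batch=1, seq=3, C=5)
probs = scores.softmax(dim=-1)              # softmax over the vocabulary
```

Tying halves the number of embedding parameters, since the $C \times H$ matrix is stored once rather than twice.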