2017-04-30 jkang
ref:
$$\mathbf{h}_{t} = f(\mathbf{h}_{t-1},\ x_t)$$
- $\mathbf{h}_{t}$ is the hidden state at time $t$
- $f$ is a non-linear activation function (as simple as a sigmoid or as complex as an LSTM unit); see the sketch below
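A minimal sketch of one recurrence step, taking $f$ to be a plain tanh RNN cell (the weight names `W_hh`, `W_xh`, and `b` are assumptions for illustration, not from the reference):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """One application of f: h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b)."""
    # W_hh, W_xh, b are hypothetical recurrent weights, input weights, and bias
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)
```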
$$p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp(\mathbf{w}_j \mathbf{h}_t)}{\sum_{j'=1}^K \exp(\mathbf{w}_{j'} \mathbf{h}_t)}$$
- The possible labels range over $j = 1, \ldots, K$
- $\mathbf{w}_j$ are the rows of a weight matrix $\mathbf{W}$
- "How probable it is to predict $x_j$ at time $t$ given the previous sequence"
e.g. for sequence-to-sequence mapping: $$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$$
- Output length $T'$ and input length $T$ need not be the same
- The hidden state of the decoder at time $t$ is computed as $$\mathbf{h}_t = f(\mathbf{h}_{t-1}, y_{t-1}, \mathbf{c})$$
$$P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \mathbf{c}) = g(\mathbf{h}_t, y_{t-1}, \mathbf{c})$$
- $\mathbf{c}$ is a fixed-length encoded vector of the input sequence
- $g$ produces a valid probability distribution over the next output symbol (e.g. a softmax); a decoder-step sketch follows
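One possible decoder step under these equations; the weight names and the choice of $g$ as a linear map followed by a softmax are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(h_prev, y_prev, c, W_h, W_y, W_c, b, W_out):
    """f updates the hidden state from (h_{t-1}, y_{t-1}, c);
    g maps (h_t, y_{t-1}, c) to a distribution over output symbols."""
    h_t = np.tanh(W_h @ h_prev + W_y @ y_prev + W_c @ c + b)  # f
    p_t = softmax(W_out @ np.concatenate([h_t, y_prev, c]))   # g
    return h_t, p_t
```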
$$\arg\max_\theta \frac{1}{N}\sum_{n=1}^N \log p_{\theta}(\mathbf{y}_n \mid \mathbf{x}_n)$$
- $N$ is the number of training examples
- $(\mathbf{x}_n, \mathbf{y}_n)$ is an input-sequence and output-sequence pair
- $\theta$ is the set of model parameters (a sketch of the objective follows)
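A sketch of the quantity being maximized; `seq_log_prob` is a hypothetical helper that returns $\log p_{\theta}(\mathbf{y}_n \mid \mathbf{x}_n)$ by summing the per-step log-probabilities from the decoder:

```python
import numpy as np

def avg_log_likelihood(pairs, seq_log_prob):
    """(1/N) * sum_n log p_theta(y_n | x_n) over training pairs (x_n, y_n)."""
    return np.mean([seq_log_prob(x, y) for x, y in pairs])
```

In practice this is maximized with gradient-based methods (equivalently, by minimizing the average negative log-likelihood).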