Modeling Sequences with Neural Networks

  • One-to-one: a single fixed-size input maps to a single fixed-size output (the standard feed-forward setting)
  • One-to-many: e.g., generating a text description (caption) for a single image
  • Many-to-one: e.g., sentiment analysis of a whole sentence
  • Many-to-many: e.g., language translation (sequence to sequence)
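
To make the four patterns concrete, here is a small sketch of rough input/output shapes for each case; the names and shapes below are illustrative assumptions rather than something fixed by the taxonomy ($T$, $T'$ are sequence lengths, $d$ and $k$ feature sizes).

```python
# Rough input/output shapes for the four sequence-modelling patterns
# (illustrative only; T, T' are sequence lengths, d and k feature sizes).
patterns = {
    "one_to_one":   {"in": "(d,)",    "out": "(k,)"},    # plain feed-forward network
    "one_to_many":  {"in": "(d,)",    "out": "(T, k)"},  # e.g. image captioning
    "many_to_one":  {"in": "(T, d)",  "out": "(k,)"},    # e.g. sentiment analysis
    "many_to_many": {"in": "(T, d)",  "out": "(T', k)"}, # e.g. machine translation
}
for name, shapes in patterns.items():
    print(f"{name:12s}  input {shapes['in']:8s} -> output {shapes['out']}")
```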

Recurrent Neural Networks



$W_{hh}$ is shared across time.

$$h_t = g\left(W_{hh} h_{t-1} + W_{xh} x_t\right)$$

$$\frac{\partial h_t}{\partial h_{t-1}} = W_{hh} \odot g'\left(W_{hh} h_{t-1} + W_{xh} x_t\right)$$
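
To make the recurrence and the per-step Jacobian concrete, here is a minimal NumPy sketch; the dimensions, random weights, and the choice of $g = \tanh$ are illustrative assumptions. The accumulated product of these Jacobians over many time steps is exactly what the next section is about.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, T = 3, 5, 20                      # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(d_h, d_x))
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))   # shared across all time steps
g, g_prime = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

h = np.zeros(d_h)
jacobian_prod = np.eye(d_h)                 # product of d h_t / d h_{t-1} over time
for t in range(T):
    x_t = rng.normal(size=d_x)
    z_t = W_hh @ h + W_xh @ x_t
    h = g(z_t)
    # per-step Jacobian: rows of W_hh scaled elementwise by g'(z_t)
    J_t = g_prime(z_t)[:, None] * W_hh
    jacobian_prod = J_t @ jacobian_prod

print("norm of the accumulated Jacobian product:", np.linalg.norm(jacobian_prod))
```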

Vanishing and exploding gradients

  • Backpropagating through $t-k$ time steps multiplies the gradient by a factor that behaves like $(\alpha \beta)^{t-k}$, where $\alpha$ and $\beta$ bound the recurrent weights and the activation derivative $g'$
  • if $\alpha\beta < 1$, the gradients vanish towards zero
  • if $\alpha\beta > 1$, the gradients explode (become too large)
  • Solution: gate the hidden state

    • Hidden states act like a memory of the system: $h_{t} = h_{t-1} + \gamma x_t$ accumulates information over time. But if the relevant context lies too far in the past, we should be able to forget the old content.

    • $h_{t} = \theta_t h_{t-1} + \gamma x_t$:

      • To remember something: $\theta_t = 1$
      • To forget something: $\theta_t = 0$, which means the memory depends only on the current input.

      Hadamard products act as gates in the LSTM cell (a minimal sketch follows this list):

      • input gate: $i_t$
      • candidate activation: $a_t$
      • forget gate: $f_t$
      • output gate: $o_t$
      • $c_t = f_t \odot c_{t-1} + i_t \odot a_t$
      • $h_t = o_t \odot \tanh(c_t) = o_t \odot \tanh(f_t \odot c_{t-1} + i_t \odot a_t)$
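
Here is a minimal NumPy sketch of one step of this gated cell. The weight shapes, the use of sigmoids for the three gates, and $\tanh$ for the candidate activation are standard LSTM choices assumed here rather than spelled out above.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of the gated cell: c_t = f⊙c_{t-1} + i⊙a, h_t = o⊙tanh(c_t)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    W, U, b = params["W"], params["U"], params["b"]           # input, recurrent, bias per gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output gate
    a_t = np.tanh(W["a"] @ x_t + U["a"] @ h_prev + b["a"])    # candidate activation
    c_t = f_t * c_prev + i_t * a_t                            # cell state (Hadamard products)
    h_t = o_t * np.tanh(c_t)                                  # hidden state
    return h_t, c_t

# illustrative sizes and random parameters
rng = np.random.default_rng(1)
d_x, d_h = 3, 5
params = {
    "W": {k: rng.normal(size=(d_h, d_x)) for k in "ifoa"},
    "U": {k: rng.normal(size=(d_h, d_h)) for k in "ifoa"},
    "b": {k: np.zeros(d_h) for k in "ifoa"},
}
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(4):                      # run a few steps on random inputs
    h, c = lstm_step(rng.normal(size=d_x), h, c, params)
print("h_t:", h)
```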

For training deep recurrent neural networks, plain SGD does not work well. Instead, we should use an adaptive optimizer such as Adadelta.
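
As one way to set this up, the sketch below swaps Adadelta in for SGD; it assumes PyTorch, and the toy many-to-one model, the dummy batch, and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# illustrative many-to-one model: LSTM encoder + linear read-out
class Classifier(nn.Module):
    def __init__(self, d_in=8, d_hidden=32, n_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(d_in, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, n_classes)

    def forward(self, x):                      # x: (batch, time, features)
        _, (h_n, _) = self.rnn(x)
        return self.out(h_n[-1])               # classify from the last hidden state

model = Classifier()
optimizer = torch.optim.Adadelta(model.parameters())   # adaptive per-parameter step sizes
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 10, 8)                     # dummy batch: 16 sequences of length 10
y = torch.randint(0, 2, (16,))
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```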

