Modeling Sequences with Neural Networks

  • One-to-one: a single fixed-size input maps to a single fixed-size output (the standard feed-forward setting)
  • One-to-many: e.g., generating a text description (caption) for a single image
  • Many-to-one: e.g., sentiment analysis of a whole sentence
  • Many-to-many: e.g., language translation (sequence to sequence)
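
To make the four patterns concrete, here is a small sketch of rough input/output shapes for each case; the names and shapes below are illustrative assumptions rather than something fixed by the taxonomy ($T$, $T'$ are sequence lengths, $d$ and $k$ feature sizes).

```python
# Rough input/output shapes for the four sequence-modelling patterns
# (illustrative only; T, T' are sequence lengths, d and k feature sizes).
patterns = {
    "one_to_one":   {"in": "(d,)",    "out": "(k,)"},    # plain feed-forward network
    "one_to_many":  {"in": "(d,)",    "out": "(T, k)"},  # e.g. image captioning
    "many_to_one":  {"in": "(T, d)",  "out": "(k,)"},    # e.g. sentiment analysis
    "many_to_many": {"in": "(T, d)",  "out": "(T', k)"}, # e.g. machine translation
}
for name, shapes in patterns.items():
    print(f"{name:12s}  input {shapes['in']:8s} -> output {shapes['out']}")
```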

Recurrent Neural Networks



$W_{hh}$ is shared across time.

$$h_t = g\left(W_{hh} h_{t-1} + W_{xh} x_t\right)$$

$$\frac{\partial h_t}{\partial h_{t-1}} = W_{hh} \odot g'\left(W_{hh} h_{t-1} + W_{xh} x_t\right)$$
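
To make the recurrence and the per-step Jacobian concrete, here is a minimal NumPy sketch; the dimensions, random weights, and the choice of $g = \tanh$ are illustrative assumptions. The accumulated product of these Jacobians over many time steps is exactly what the next section is about.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, T = 3, 5, 20                      # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(d_h, d_x))
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))   # shared across all time steps
g, g_prime = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

h = np.zeros(d_h)
jacobian_prod = np.eye(d_h)                 # product of d h_t / d h_{t-1} over time
for t in range(T):
    x_t = rng.normal(size=d_x)
    z_t = W_hh @ h + W_xh @ x_t
    h = g(z_t)
    # per-step Jacobian: rows of W_hh scaled elementwise by g'(z_t)
    J_t = g_prime(z_t)[:, None] * W_hh
    jacobian_prod = J_t @ jacobian_prod

print("norm of the accumulated Jacobian product:", np.linalg.norm(jacobian_prod))
```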

Vanishing and exploding gradients

  • Backpropagating through $t-k$ time steps multiplies the gradient by a factor that behaves like $(\alpha \beta)^{t-k}$, where $\alpha$ and $\beta$ bound the recurrent weights and the activation derivative $g'$
  • if $\alpha\beta < 1$, the gradients vanish towards zero
  • if $\alpha\beta > 1$, the gradients explode (become too large)
  • Solution: gate the hidden state

    • Hidden states act like a memory of the system: $h_{t} = h_{t-1} + \gamma x_t$ accumulates information over time. But if the relevant context lies too far in the past, we should be able to forget the old content.

    • $h_{t} = \theta_t h_{t-1} + \gamma x_t$:

      • To remember something: $\theta_t = 1$
      • To forget something: $\theta_t = 0$, which means the memory depends only on the current input.

      Hadamard products act as gates in the LSTM cell (a minimal sketch follows this list):

      • input gate: $i_t$
      • candidate activation: $a_t$
      • forget gate: $f_t$
      • output gate: $o_t$
      • $c_t = f_t \odot c_{t-1} + i_t \odot a_t$
      • $h_t = o_t \odot \tanh(c_t) = o_t \odot \tanh(f_t \odot c_{t-1} + i_t \odot a_t)$
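
Here is a minimal NumPy sketch of one step of this gated cell. The weight shapes, the use of sigmoids for the three gates, and $\tanh$ for the candidate activation are standard LSTM choices assumed here rather than spelled out above.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of the gated cell: c_t = f⊙c_{t-1} + i⊙a, h_t = o⊙tanh(c_t)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    W, U, b = params["W"], params["U"], params["b"]           # input, recurrent, bias per gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output gate
    a_t = np.tanh(W["a"] @ x_t + U["a"] @ h_prev + b["a"])    # candidate activation
    c_t = f_t * c_prev + i_t * a_t                            # cell state (Hadamard products)
    h_t = o_t * np.tanh(c_t)                                  # hidden state
    return h_t, c_t

# illustrative sizes and random parameters
rng = np.random.default_rng(1)
d_x, d_h = 3, 5
params = {
    "W": {k: rng.normal(size=(d_h, d_x)) for k in "ifoa"},
    "U": {k: rng.normal(size=(d_h, d_h)) for k in "ifoa"},
    "b": {k: np.zeros(d_h) for k in "ifoa"},
}
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(4):                      # run a few steps on random inputs
    h, c = lstm_step(rng.normal(size=d_x), h, c, params)
print("h_t:", h)
```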

For training deep recurrent neural networks, plain SGD does not work well. Instead, we should use an adaptive optimizer such as Adadelta.
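
As one way to set this up, the sketch below swaps Adadelta in for SGD; it assumes PyTorch, and the toy many-to-one model, the dummy batch, and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# illustrative many-to-one model: LSTM encoder + linear read-out
class Classifier(nn.Module):
    def __init__(self, d_in=8, d_hidden=32, n_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(d_in, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, n_classes)

    def forward(self, x):                      # x: (batch, time, features)
        _, (h_n, _) = self.rnn(x)
        return self.out(h_n[-1])               # classify from the last hidden state

model = Classifier()
optimizer = torch.optim.Adadelta(model.parameters())   # adaptive per-parameter step sizes
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 10, 8)                     # dummy batch: 16 sequences of length 10
y = torch.randint(0, 2, (16,))
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```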

