$W_{hh}$ is shared across time.
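As a minimal sketch of this weight sharing (NumPy, with illustrative names and dimensions), the unrolled forward pass below reuses the same $W_{hh}$ (and input weights) at every time step:

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, h0):
    """Vanilla RNN over a sequence xs, reusing W_hh at every step."""
    h = h0
    hs = []
    for x in xs:                          # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x)  # same W_hh, W_xh shared across time
        hs.append(h)
    return hs

# toy example: 5 time steps, input dim 3, hidden dim 4
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
W_hh = rng.normal(size=(4, 4)) * 0.1
W_xh = rng.normal(size=(4, 3)) * 0.1
hs = rnn_forward(xs, W_hh, W_xh, h0=np.zeros(4))
print(len(hs), hs[-1].shape)  # 5 (4,)
```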
Solution:
Hidden states act like a memory of the system: $h_{t} = h_{t-1} + \gamma x_t$ accumulates information over time. But if the time distance is too large, the old information should be forgotten:
$h_{t} = \theta_t \odot h_{t-1} + \gamma x_t$, where $\odot$ is the Hadamard (elementwise) product. The gate $\theta_t$ controls, per component, how much of the old hidden state is kept and how much is forgotten.
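A minimal sketch of this gated update (NumPy, with hypothetical toy values), where `*` implements the Hadamard product:

```python
import numpy as np

def gated_update(h_prev, x_t, theta_t, gamma):
    # theta_t near 1 keeps old memory; near 0 forgets it (elementwise)
    return theta_t * h_prev + gamma * x_t

h_prev  = np.array([0.5, -0.2, 0.8])
x_t     = np.array([1.0,  0.0, -1.0])
theta_t = np.array([0.9,  0.1,  0.5])   # per-dimension forget gate
h_t = gated_update(h_prev, x_t, theta_t, gamma=0.3)
print(h_t)  # [ 0.75 -0.02  0.1 ]
```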
For training deep recurrent neural networks, plain SGD does not work well. Instead, we should use an adaptive optimizer such as Adadelta.
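A minimal sketch, assuming PyTorch is available, of swapping plain SGD for the built-in `torch.optim.Adadelta` when training a small RNN (the model, dimensions, and data here are illustrative placeholders):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=3, hidden_size=4, batch_first=True)
head = nn.Linear(4, 1)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adadelta(params)   # adaptive per-parameter step sizes
loss_fn = nn.MSELoss()

x = torch.randn(8, 5, 3)   # batch of 8 sequences, 5 steps, input dim 3
y = torch.randn(8, 1)

for _ in range(10):
    optimizer.zero_grad()
    out, h_n = model(x)            # out: hidden states at every step
    pred = head(out[:, -1, :])     # predict from the last hidden state
    loss = loss_fn(pred, y)
    loss.backward()
    optimizer.step()
```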