Plan of attack:

  • The idea behind Recurrent Neural Networks
  • The Vanishing Gradient Problem
  • Long Short-Term Memory (LSTM)
  • Practical Intuition
  • Extra: LSTM Variations

The idea behind Recurrent Neural Networks

The whole concept behind deep learning is to try to mimic the human brain.

ANN

  • The main concept of an ANN is its weights. The weights represent long-term memory: once you train an ANN, you can switch it off and come back later, and the learned weights are still there. That long-term memory makes the ANN similar to the Temporal Lobe.

CNN

  • CNNs handle vision: the recognition of images and objects. That is the Occipital Lobe.

RNN

  • RNNs are like short-term memory. They can remember things that happened in the previous couple of observations and apply that knowledge going forward. That is similar to the Frontal Lobe, which is responsible for personality, behavior, the motor cortex and working memory.

We have a simple Artificial Neural Network: 1 input, 1 output, and 1 hidden layer.

We can represent the RNN by squashing the ANN. Think of it as looking at the ANN from underneath, along a new dimension.

To simplify things, we collapse the multiple arrows into two, then twist the ANN so it stands vertically.

We change the color of the hidden layer from green to blue and add another line, which represents the temporal loop. This means the hidden layer not only gives an output but also feeds back to itself.

The standard way to represent an RNN is to unroll the temporal loop and lay the network out in the following manner.

Note that we are looking along the new dimension, so each circle is not a single neuron but a whole layer of neurons.

This means inputs come into the neurons and produce outputs, but the neurons also connect to themselves through time. That is the whole concept of short-term memory: each neuron remembers what its value was just previously.
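To make the temporal loop concrete, here is a minimal NumPy sketch of a single hidden layer stepping through a sequence, where the layer receives both the current input and its own previous value. The sizes and names (`input_size`, `hidden_size`, `W_x`, `W_h`) are illustrative assumptions, not anything fixed by the diagrams above.

```python
import numpy as np

# Illustrative sizes (assumptions for the sketch, not taken from the diagrams)
input_size, hidden_size, seq_len = 3, 4, 5
rng = np.random.default_rng(0)

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden: the temporal loop
b = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))  # a toy input sequence
h = np.zeros(hidden_size)                    # the hidden state starts empty

for t, x_t in enumerate(xs):
    # The hidden layer sees the current input AND its own previous value:
    h = np.tanh(W_x @ x_t + W_h @ h + b)
    print(f"t={t}, h={np.round(h, 3)}")
```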

The Vanishing Gradient Problem

  • Gradient Descent algorithm: we are trying to find the global minimum of the cost function, which is going to be the optimal solution for our neural network.

  • With Gradient Descent in an ANN, information travels forward through the network to produce the outputs, then the error is calculated and propagated back through the network to update the weights.

  • In a Recurrent Neural Network, information travels both through the network and through time: the state from each previous time point keeps feeding into the network.
    • At each point in time, you can calculate the cost function for your error

  • Let's focus on a single time step $t$ to see what happens:
    • You calculate the cost function $\large \epsilon_t$, then you need to propagate it back through the network, because every single neuron that participated in the calculation of the output should have its weights updated.
    • This means it is not only the neurons directly below $\large \epsilon_t$, but all the neurons that contributed further back in time.

  • Here we have $w_{rec}$, which stands for the recurrent weight: the weight used to connect the hidden layers to themselves in the unrolled temporal loop.

    • In simple terms, we multiply the values by the weights to get from one layer to the next.
    • This is where the problem is: when you multiply by something small over and over, the value decreases quickly.
    • Weights are randomly assigned at the start of training, and the initial weights tend to be close to 0. Because the gradient is multiplied by the recurrent weight once for every step it travels back through, the further back you move, the smaller the gradient becomes. That is the vanishing gradient problem.
    • The lower the gradient, the harder it is for the network to update the weights.
    • Training becomes a vicious cycle: because the gradients are so small, the earlier time steps train slowly and their outputs are incorrect. You then train on those non-final outputs, and since the earlier weights are not being trained properly, the whole network is not being trained properly.
  • In summary

    • If $w_{rec}$ is small, you have the vanishing gradient problem
    • If $w_{rec}$ is large, you have the exploding gradient problem (a small numerical sketch of both cases follows below)
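A tiny numerical sketch of this summary, using a scalar toy model: the gradient that reaches a layer 20 steps back in time has been multiplied by $w_{rec}$ 20 times, so a recurrent weight below 1 shrinks it towards zero and a weight above 1 blows it up. The specific numbers are only illustrative.

```python
# Scalar toy model: how a gradient of 1.0 is scaled after travelling
# 20 steps back through the unrolled network, if each step multiplies
# it by the recurrent weight w_rec.
for w_rec in (0.5, 1.0, 1.5):
    grad = 1.0
    for _ in range(20):
        grad *= w_rec
    print(f"w_rec={w_rec}: gradient factor after 20 steps = {grad:.6f}")
# w_rec=0.5 -> ~0.000001 (vanishing), w_rec=1.0 -> 1.0, w_rec=1.5 -> ~3325 (exploding)
```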

Solutions:

  1. Exploding Gradient
    • Truncated Backpropagation: stop back-propagating after a certain number of time steps
    • Penalties: the gradient is penalized and artificially reduced.
    • Gradient Clipping: set a maximum limit for the gradient (see the sketch after this list)
  2. Vanishing Gradient
    • Weight Initialization: initialize the weights in a way that minimizes the potential for vanishing gradients.
    • Echo State Networks
    • Long Short-Term Memory Networks (LSTMs): very popular.
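As a concrete illustration of gradient clipping, here is a minimal NumPy sketch that rescales a gradient whenever its norm exceeds a limit; the threshold of 5.0 and the function name are arbitrary choices for the example. (In Keras, the same effect is typically obtained by passing `clipnorm` or `clipvalue` to the optimizer.)

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient if its norm exceeds max_norm (illustrative threshold)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Toy example: an 'exploded' gradient gets pulled back to the limit
g = np.array([30.0, -40.0])   # norm = 50
print(clip_gradient(g))       # -> [ 3. -4.]  (norm = 5)
```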

Additional Reading

Long Short-Term Memory Networks - LSTMs

  • Let's simplify the problem of RNNs:
    • If $w_{rec}$ < 1, you have the vanishing gradient problem
    • If $w_{rec}$ > 1, you have the exploding gradient problem
    • The simplest solution is to make the recurrent weight $w_{rec} = 1$; that is the core idea of LSTMs

  • The LSTM has a memory cell, the straight line running along the top of the diagram, which flows through time; information is only occasionally removed from it or added to it

  • Notation:
    • x: input
    • C: memory cell
    • h: output
  • An LSTM module takes in 3 inputs ($x_t$, $h_{t-1}$ and $C_{t-1}$) and has 2 outputs (the 2 $h_t$ arrows carry the same value). Note that the inputs, outputs and memory cell are all vectors
    • Vector transfer: each arrow represents an entire vector being transferred
    • Concatenate: two vectors running in parallel, like two pipes, are joined into one
    • Copy: the vector is copied and sent to two destinations
    • Pointwise operations:
      • $\bigotimes$: valves. The 3 valves are the forget valve, the memory valve, and the output valve respectively, each controlled by a sigmoid function.
        • Think of it like a water valve: if it is open, memory flows through freely; if it is closed, the flow of memory is cut off.
      • $\bigoplus$: T-shaped joint, where additional memory is added.
      • $\tanh$ operation: squashes the output to values between -1 and 1.
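Putting the notation together, here is a minimal NumPy sketch of one LSTM step: the three sigmoid-controlled valves ($\bigotimes$), the additive T-joint ($\bigoplus$), and the $\tanh$ squashing. The variable names and sizes are illustrative assumptions; this follows the standard LSTM equations rather than any particular library's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W holds four weight matrices mapping the concatenated
    [h_prev, x_t] to the four internal signals; b holds the matching biases."""
    z = np.concatenate([h_prev, x_t])     # concatenate: two pipes joined into one

    f = sigmoid(W[0] @ z + b[0])          # forget valve
    i = sigmoid(W[1] @ z + b[1])          # memory (input) valve
    C_tilde = np.tanh(W[2] @ z + b[2])    # candidate new memory, squashed to (-1, 1)
    o = sigmoid(W[3] @ z + b[3])          # output valve

    C_t = f * C_prev + i * C_tilde        # valves then T-joint: update the memory cell
    h_t = o * np.tanh(C_t)                # squashed cell state leaves through the output valve
    return h_t, C_t

# Toy usage with illustrative sizes
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = rng.normal(scale=0.1, size=(4, hidden, hidden + inputs))
b = np.zeros((4, hidden))
h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(rng.normal(size=inputs), h, C, W, b)
print(np.round(h, 3), np.round(C, 3))
```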