This notebook summarizes recurrent neural networks (RNN) as presented in the following resources:
The vocabulary contains all the possible words. Commercial applications use dictionaries of 25,000 - 50,000 words, though larger sizes are becoming more common, and the largest internet companies use vocabularies with 1,000,000+ entries.
Use one-hot notation to represent a single word: a 10,000-dimensional vector that is all zeros except for a single 1. Ex: $x^{<1>}$ may represent the word "and" as a 10,000-long vector with a 1 near the beginning, in the 2nd spot.
$\begin{pmatrix} 0 & \text{'a'}\\ 1 & \text{'and'}\\ \vdots & \vdots \\ 0 & \text{'zyzzogeton'}\\ \end{pmatrix}$
*Note above - zyzzogeton probably wouldn't be in our vocabulary because it's unlikely to occur over a threshold number of times in our corpus of training text
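A minimal sketch of the one-hot representation, assuming a hypothetical (tiny) vocabulary list and an `<UNK>` token for out-of-vocabulary words:

```python
import numpy as np

# Hypothetical vocabulary; a real one would have ~10,000 entries.
vocab = ["a", "and", "apple", "zulu", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a column vector of zeros with a 1 at the word's vocabulary index."""
    vec = np.zeros((len(vocab), 1))
    vec[word_to_index.get(word, word_to_index["<UNK>"])] = 1
    return vec

x1 = one_hot("and")   # x^<1>: all zeros except a 1 in the 2nd spot
```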
Standard neural network will not work:
Recurrent Neural Network (RNN):
Forward Propagation Basic Processing:
Instead of carrying around 2 parameter matrices ($W_{aa}$ and $W_{ax}$), they can be compressed into one ($W_a$) by stacking them side by side. This simplifies notation when we have more complex models.
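A minimal sketch of one forward-propagation step with the compressed parameter matrix, assuming a tanh hidden activation and softmax output; the layer sizes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

def rnn_step(a_prev, x_t, Wa, ba, Wy, by):
    """One forward step. Wa is [W_aa | W_ax], so
    W_aa @ a_prev + W_ax @ x_t == Wa @ concat(a_prev, x_t)."""
    a_t = np.tanh(Wa @ np.vstack([a_prev, x_t]) + ba)
    y_t = softmax(Wy @ a_t + by)
    return a_t, y_t

# Illustrative sizes: n_a = 100 hidden units, n_x = 10,000 vocabulary words
n_a, n_x = 100, 10_000
Wa = np.random.randn(n_a, n_a + n_x) * 0.01   # [W_aa | W_ax] stacked side by side
ba = np.zeros((n_a, 1))
Wy = np.random.randn(n_x, n_a) * 0.01
by = np.zeros((n_x, 1))
```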
Many-to-many
Many-to-one
One-to-many
A language model gives the probability of a particular sentence occurring.
Training:
RNN Model:
Sampling:
Randomly generate a sentence from the trained RNN language model. At each timestep, randomly sample a word from the softmax output, then pass that sampled word in as the input to the next timestep ($x^{<2>} = \hat{y}^{<1>}$).
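A sketch of the sampling loop, reusing the hypothetical `rnn_step`, `one_hot`, and `vocab` from the earlier sketches and assuming the vocabulary contains an `<EOS>` token:

```python
import numpy as np

def sample_sentence(Wa, ba, Wy, by, max_len=50):
    """Sample words one at a time until an assumed <EOS> token (or max_len)."""
    a = np.zeros((ba.shape[0], 1))
    x = np.zeros((len(vocab), 1))      # x^<1> is the zero vector
    words = []
    for _ in range(max_len):
        a, y = rnn_step(a, x, Wa, ba, Wy, by)
        idx = np.random.choice(len(vocab), p=y.ravel())  # sample from the softmax
        word = vocab[idx]
        if word == "<EOS>":
            break
        words.append(word)
        x = one_hot(word)              # feed the sampled word back in as the next input
    return " ".join(words)
```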
Vanishing Gradients:
Most outputs are mainly influenced by nearby inputs; it is hard for the RNN to backpropagate errors all the way from the end of a sentence to the beginning. Vanishing gradients occur when the derivatives decay exponentially, so earlier parameters in the network receive very small (or no) updates and therefore stop learning. GRUs are useful for fixing this problem.
Exploding Gradients:
Gradient clipping can fix exponentially growing gradients by capping their values at a chosen threshold.
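A minimal sketch of element-wise gradient clipping; the threshold of 5.0 and the dictionary of gradients are illustrative choices:

```python
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    """Clip each gradient array element-wise into [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}

# Example: gradients stored in a dict keyed by parameter name
grads = {"dWa": np.random.randn(100, 10_100) * 100, "dba": np.random.randn(100, 1)}
grads = clip_gradients(grads, max_value=5.0)
```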
Modification to RNN hidden layer
GRU has new variable: c = memory cell
The job of the gate is to determine when to update the value of c
$\tilde{c}^{<t>}$ is the update candidate which can replace $c^{<t>}$
The update gate $\Gamma_u$ determines how much of $c^{<t-1>}$ is replaced by the candidate: $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$. Because $\Gamma_u$ can be extremely close to 0, the value of $c^{<t-1>}$ can be carried forward with almost no decay, so very long-term dependencies can be maintained (this also fixes the vanishing gradient issue).
Full GRU:
With the right gate settings, $c^{<0>}$ could be passed all the way from the beginning to the end of a long temporal sequence. It would be able to remember data across a very large sequence.
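A sketch of the full GRU cell with the update gate $\Gamma_u$ and relevance gate $\Gamma_r$; parameter names and sizes are illustrative, and in the GRU the hidden state is $a^{<t>} = c^{<t>}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, params):
    """Full GRU cell: c^<t> = Gamma_u * c_tilde + (1 - Gamma_u) * c^<t-1>."""
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])   # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])   # relevance gate
    c_tilde = np.tanh(params["Wc"] @ np.vstack([gamma_r * c_prev, x_t]) + params["bc"])
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev        # gate near 0 keeps the old memory
    return c_t    # a^<t> = c^<t> in the GRU
```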
Peephole connection
Effective for many NLP problems, but you need the entire sequence of text before you can start. If this were speech recognition, you would have to wait for the person to stop talking to get the whole sequence before processing begins. More complex algorithms are able to work in real time for speech recognition.
3 layers is already deep for an RNN ... you usually don't see 10s-100s of layers.
Learning Word Embeddings
Skip-grams
Pick a context word from the sentence and then randomly choose target words within a window of 1, 2, 3, etc. words before and after it. The algorithm then learns to predict each target word from the context word, which is what forces it to learn useful embeddings.
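A minimal sketch of generating (context, target) training pairs for a skip-gram model; the window size and number of pairs per context word are arbitrary choices:

```python
import random

def skipgram_pairs(tokens, window=4, pairs_per_context=2):
    """For each context word, sample target words uniformly within +/- window."""
    pairs = []
    for i, context in enumerate(tokens):
        for _ in range(pairs_per_context):
            offset = random.choice([o for o in range(-window, window + 1) if o != 0])
            j = i + offset
            if 0 <= j < len(tokens):
                pairs.append((context, tokens[j]))
    return pairs

print(skipgram_pairs("I want a glass of orange juice".split()))
```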
GloVe: Global Vectors for Word Representation
Word embeddings allow you to build good sentiment classifiers with only a modestly sized labeled training set
The RNN can generalize well even to words that weren't in the labeled sentiment training set, as long as they appear in the huge corpus of text used to create the embedding matrix. For example, in the figure below the sentiment score is very bad for the "Completely lacking..." sentence. If "lacking" is replaced with "absent", we can still get a correct sentiment score, provided "absent" appeared in that huge corpus so the embedding matrix has already learned the relationship between "absent" and "lacking". When we then classify the sentiment of a sentence containing "absent", the model still captures the connotation of the whole sentence.
NLP models can also pick up the gender, racial, and other biases present in the world, simply because those biases appear in the training text.
Addressing Bias
How do you pick which words are gender-specific and which are not?
Translate a French sentence into an English sentence
This also works for Image Captioning
Conditional Language Model:
Greedy Search
This is a better way of doing translation because it considers more than just 1 option. Instead of picking the highest-probability first word, then the highest-probability second word, etc. (like greedy search), beam search keeps the $B$ most likely alternatives at each step, where $B$ is the beam width.
Now, for each word in the beam, estimate the probabilities of the word that follows it. In this step we are finding the $B$ (here 3) most likely first-and-second-word pairs.
This continues until the end of sentence (EOS) is reached and the translation is completed.
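A sketch of the beam search loop; `next_log_probs(prefix)` is an assumed callable that the trained translation model would provide, returning a vector of $\log P(\text{next word} \mid x, \text{prefix})$ over the vocabulary, and `<EOS>` is an assumed end-of-sentence token:

```python
import numpy as np

def beam_search(next_log_probs, vocab, beam_width=3, max_len=20, eos="<EOS>"):
    """Keep the beam_width most likely partial translations at each step."""
    beams = [([], 0.0)]                    # (prefix, summed log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = next_log_probs(prefix)
            for i in np.argsort(log_p)[-beam_width:]:      # top-B extensions of this beam
                candidates.append((prefix + [vocab[i]], score + log_p[i]))
        candidates.sort(key=lambda c: c[1], reverse=True)  # keep the overall top-B
        beams = []
        for prefix, score in candidates[:beam_width]:
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return completed or beams
```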
Numeric underflow can occur when multiplying many small numbers less than 1: the product eventually becomes so small that it can't be represented by the computer's floating-point types. By summing log probabilities instead, we achieve the same result while avoiding this issue.
Also, the more small numbers you multiply together, the smaller the product gets. Since these are probabilities, the objective will tend to favor shorter sentences, which have fewer probabilities multiplied together. To fix this, normalize by dividing by the number of words in the translation.
Additionally, an $\alpha$ exponent can be added to the translated sentence length in order to finetune the amount of normalization. With $\alpha = 1$ there is full normalization, with $\alpha = 0$ there is no normalization.
After running through all combinations for beam search, score each sentence with the normalized log-probability objective to find the best translated sentence: $\frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>},\dots,y^{<t-1>})$
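A minimal sketch of this scoring, where `log_probs` holds the per-word $\log P$ values of a candidate translation; the example sentences and their log-probabilities are purely illustrative:

```python
import numpy as np

def normalized_log_prob(log_probs, alpha=0.7):
    """Score = (1 / T_y^alpha) * sum_t log P(y^<t> | x, y^<1..t-1>)."""
    T_y = len(log_probs)
    return np.sum(log_probs) / (T_y ** alpha)

# Illustrative candidates with made-up per-word log-probabilities
candidates = {
    "jane visits africa in september": [-0.9, -1.2, -2.1, -0.4, -1.0],
    "jane is visiting africa in september": [-0.8, -1.5, -1.9, -2.0, -0.5, -1.1],
}
best = max(candidates, key=lambda s: normalized_log_prob(candidates[s]))
```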
How to choose $B$?
In large production systems, a beam width $B$ of 10-100 is common. For cutting-edge research, 1,000+ is normal.
Need to attribute error to either the Beam Search algorithm or the RNN
For translations, there can be more than 1 good answer - how do you pick the best one?
BLEU score (bilingual evaluation understudy) measures how good the machine translation is:
BLEU score on bigrams:
There are open-source BLEU score implementations already in existence that can be used to score your own algorithm.
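A sketch of the modified bigram precision that underlies the BLEU score, assuming a single reference translation; the full BLEU score combines n-gram precisions for n = 1..4 with a brevity penalty:

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def modified_bigram_precision(candidate, reference):
    """Count candidate bigrams, crediting each only up to its count in the reference (clipping)."""
    cand_counts = Counter(bigrams(candidate))
    ref_counts = Counter(bigrams(reference))
    clipped = sum(min(count, ref_counts[bg]) for bg, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

print(modified_bigram_precision(
    "the cat the cat on the mat".split(),
    "the cat is on the mat".split()))   # 3 clipped matches out of 6 bigrams -> 0.5
```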
Humans doing language translation would not read an entire lengthy sentence before starting to translate parts of it. They would read a chunk of the sentence, translate, and then continue. When doing machine translation, it is the same concept where translating certain words requires more or less attention for the surrounding words.
The parameter $\alpha$ is used to denote how much attention should be paid to one word when translating another. For example, $\alpha^{<1,2>}$ represents a weight for how much we should consider the 2nd input word while translating the first word of the output sequence. These weights give a context, $C$, for translating a word at an RNN cell.
$\alpha^{<t,t'>}$ is the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$
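A sketch of computing the attention weights $\alpha^{<t,t'>}$ as a softmax over alignment scores and the context $c^{<t>}$ as a weighted sum of the encoder activations $a^{<t'>}$; `score_fn` stands in for the small scoring network and is an assumption here:

```python
import numpy as np

def attention_context(a_enc, s_prev, score_fn):
    """a_enc: (T_x, n_a) encoder activations a^<t'>,
    s_prev: decoder state before producing y^<t>,
    score_fn: assumed small network giving an alignment score e^<t,t'>."""
    e = np.array([score_fn(s_prev, a_enc[tp]) for tp in range(len(a_enc))])
    alphas = np.exp(e - e.max())
    alphas = alphas / alphas.sum()                     # alpha^<t,t'> sums to 1 over t'
    context = (alphas[:, None] * a_enc).sum(axis=0)    # c^<t> = sum_t' alpha^<t,t'> * a^<t'>
    return context, alphas
```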
Audio is represented in computers as a sound clip which is just variations in sound pressure over time. The different frequencies end up making up different natural language words that humans are used to hearing.
Academic datasets may contain around 300 - 3,000 hours of transcribed audio used to train a system. Commercial-quality systems may require 10,000 - 100,000 hours of transcribed audio.
By having a fixed-length RNN output (e.g., 1,000 timesteps), shorter transcripts can be represented by outputting repeated characters and blanks. Repeated characters not separated by a blank get collapsed into single characters to form understandable words (the CTC collapse rule), as in the sketch below.
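A minimal sketch of that collapse rule, using "_" as the blank symbol (the example string is illustrative):

```python
from itertools import groupby

def ctc_collapse(output_chars, blank="_"):
    """Collapse repeated characters not separated by a blank, then drop the blanks."""
    merged = [ch for ch, _ in groupby(output_chars)]   # "ttt" -> "t", "___" -> "_"
    return "".join(ch for ch in merged if ch != blank)

print(ctc_collapse("ttt_h_eee___ _qq__"))   # -> "the q"
```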
Trigger words are used to have a computer "wake-up" and start doing something. For example, Amazon devices use the word "Alexa" and Google home products use the phrase "Okay, Google".