This notebook summarizes recurrent neural networks (RNN) as presented in the following resources:
The vocabulary contains all the possible words. Commercial applications use dictionaries of 25,000 - 50,000 words, though larger sizes are becoming more common, and the largest internet companies use vocabularies with 1,000,000+ entries.
Use one-hot notation to represent a single word: a 10,000-dimensional vector that is all zeros except for a single 1. Ex: $x^{<1>}$ may represent the word "and" as a 10,000-long vector with a 1 near the beginning, in the 2nd spot.
$\begin{pmatrix} 0 & \text{'a'}\\ 1 & \text{'and'}\\ \vdots & \vdots \\ 0 & \text{'zyzzogeton'}\\ \end{pmatrix}$
*Note above - zyzzogeton probably wouldn't be in our vocabulary because it's unlikely to occur over a threshold number of times in our corpus of training text
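A minimal sketch of the one-hot representation, assuming a hypothetical (tiny) vocabulary list and an `<UNK>` token for out-of-vocabulary words:

```python
import numpy as np

# Hypothetical vocabulary; a real one would have ~10,000 entries.
vocab = ["a", "and", "apple", "zulu", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a column vector of zeros with a 1 at the word's vocabulary index."""
    vec = np.zeros((len(vocab), 1))
    vec[word_to_index.get(word, word_to_index["<UNK>"])] = 1
    return vec

x1 = one_hot("and")   # x^<1>: all zeros except a 1 in the 2nd spot
```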
Standard neural network will not work:
Recurrent Neural Network (RNN):
Forward Propagation Basic Processing:
Instead of carrying around 2 parameter matrices ($W_{aa}$ and $W_{ax}$), they can be compressed into one ($W_a$) by stacking them side by side. This simplifies notation when we have more complex models.
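A minimal sketch of one forward-propagation step with the compressed parameter matrix, assuming a tanh hidden activation and softmax output; the layer sizes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

def rnn_step(a_prev, x_t, Wa, ba, Wy, by):
    """One forward step. Wa is [W_aa | W_ax], so
    W_aa @ a_prev + W_ax @ x_t == Wa @ concat(a_prev, x_t)."""
    a_t = np.tanh(Wa @ np.vstack([a_prev, x_t]) + ba)
    y_t = softmax(Wy @ a_t + by)
    return a_t, y_t

# Illustrative sizes: n_a = 100 hidden units, n_x = 10,000 vocabulary words
n_a, n_x = 100, 10_000
Wa = np.random.randn(n_a, n_a + n_x) * 0.01   # [W_aa | W_ax] stacked side by side
ba = np.zeros((n_a, 1))
Wy = np.random.randn(n_x, n_a) * 0.01
by = np.zeros((n_x, 1))
```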
Many-to-many
Many-to-one
One-to-many
A language model gives the probability of a particular sentence occurring.
Training:
RNN Model:
Sampling:
Randomly generate a sentence from the trained RNN language model. At each timestep, randomly sample a word from the softmax output, then pass that sampled word in as the input to the next timestep ($x^{<2>} = \hat{y}^{<1>}$).
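A sketch of the sampling loop, reusing the hypothetical `rnn_step`, `one_hot`, and `vocab` from the earlier sketches and assuming the vocabulary contains an `<EOS>` token:

```python
import numpy as np

def sample_sentence(Wa, ba, Wy, by, max_len=50):
    """Sample words one at a time until an assumed <EOS> token (or max_len)."""
    a = np.zeros((ba.shape[0], 1))
    x = np.zeros((len(vocab), 1))      # x^<1> is the zero vector
    words = []
    for _ in range(max_len):
        a, y = rnn_step(a, x, Wa, ba, Wy, by)
        idx = np.random.choice(len(vocab), p=y.ravel())  # sample from the softmax
        word = vocab[idx]
        if word == "<EOS>":
            break
        words.append(word)
        x = one_hot(word)              # feed the sampled word back in as the next input
    return " ".join(words)
```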
Vanishing Gradients:
Most outputs are mainly influenced by nearby inputs; it is hard for the RNN to backpropagate errors all the way from the end of a sentence to the beginning. Vanishing gradients occur when the derivatives decay exponentially, so earlier parameters in the network receive very small (or no) updates and therefore stop learning. GRUs are useful for fixing this problem.
Exploding Gradients:
Gradient clipping can fix exponentially growing gradients by capping their values at a chosen threshold.
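A minimal sketch of element-wise gradient clipping; the threshold of 5.0 and the dictionary of gradients are illustrative choices:

```python
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    """Clip each gradient array element-wise into [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}

# Example: gradients stored in a dict keyed by parameter name
grads = {"dWa": np.random.randn(100, 10_100) * 100, "dba": np.random.randn(100, 1)}
grads = clip_gradients(grads, max_value=5.0)
```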
Modification to RNN hidden layer
GRU has new variable: c = memory cell
The job of the gate is to determine when to update the value of c
$\tilde{c}^{<t>}$ is the update candidate which can replace $c^{<t>}$
The update gate $\Gamma_u$ determines how much of $c^{<t-1>}$ is replaced by the candidate: $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$. Because $\Gamma_u$ can be extremely close to 0, the value of $c^{<t-1>}$ can be carried forward with almost no decay, so very long-term dependencies can be maintained (this also fixes the vanishing gradient issue).
Full GRU:
With the right gate settings, $c^{<0>}$ could be passed all the way from the beginning to the end of a long temporal sequence. It would be able to remember data across a very large sequence.
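A sketch of the full GRU cell with the update gate $\Gamma_u$ and relevance gate $\Gamma_r$; parameter names and sizes are illustrative, and in the GRU the hidden state is $a^{<t>} = c^{<t>}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, params):
    """Full GRU cell: c^<t> = Gamma_u * c_tilde + (1 - Gamma_u) * c^<t-1>."""
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])   # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])   # relevance gate
    c_tilde = np.tanh(params["Wc"] @ np.vstack([gamma_r * c_prev, x_t]) + params["bc"])
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev        # gate near 0 keeps the old memory
    return c_t    # a^<t> = c^<t> in the GRU
```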
Peephole connection
Effective for many NLP problems, but you need the entire sequence of text before you can start. If this were speech recognition, you would have to wait for the person to stop talking to get the whole sequence before processing begins. More complex algorithms are able to work in real time for speech recognition.
3 layers is already deep for an RNN ... you usually don't see 10s-100s of layers.
Learning Word Embeddings
Skip-grams
Pick a context word from the sentence and then randomly choose target words within a window of 1, 2, 3, etc. words before and after it. The algorithm then learns to predict each target word from the context word, which is what forces it to learn useful embeddings.
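A minimal sketch of generating (context, target) training pairs for a skip-gram model; the window size and number of pairs per context word are arbitrary choices:

```python
import random

def skipgram_pairs(tokens, window=4, pairs_per_context=2):
    """For each context word, sample target words uniformly within +/- window."""
    pairs = []
    for i, context in enumerate(tokens):
        for _ in range(pairs_per_context):
            offset = random.choice([o for o in range(-window, window + 1) if o != 0])
            j = i + offset
            if 0 <= j < len(tokens):
                pairs.append((context, tokens[j]))
    return pairs

print(skipgram_pairs("I want a glass of orange juice".split()))
```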
GloVe: Global Vectors for Word Representation
Word embeddings allow you to build good sentiment classifiers with only a modestly sized labeled training set
The RNN can generalize well even to words that weren't in the labeled sentiment training set, as long as they appear in the huge corpus of text used to create the embedding matrix. For example, in the figure below the sentiment score is very bad for the "Completely lacking..." sentence. If "lacking" is replaced with "absent", we can still get a correct sentiment score, provided "absent" appeared in that huge corpus so the embedding matrix has already learned the relationship between "absent" and "lacking". When we then classify the sentiment of a sentence containing "absent", the model still captures the connotation of the whole sentence.
NLP models can also pick up the gender, racial, and other biases present in the world, simply because those biases appear in the training text.
Addressing Bias
How do you pick which words are gender-specific and which are not?
Translate a French sentence into an English sentence
This also works for Image Captioning
Conditional Language Model:
Greedy Search
This is a better way of doing translation because it considers more than just 1 option. Instead of picking the highest-probability first word, then the highest-probability second word, etc. (like greedy search), beam search keeps the $B$ most likely alternatives at each step, where $B$ is the beam width.
Now, for each word in the beam, estimate the probabilities of the word that follows it. In this step we are finding the $B$ (here 3) most likely first-and-second-word pairs.
This continues until the end of sentence (EOS) is reached and the translation is completed.
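A sketch of the beam search loop; `next_log_probs(prefix)` is an assumed callable that the trained translation model would provide, returning a vector of $\log P(\text{next word} \mid x, \text{prefix})$ over the vocabulary, and `<EOS>` is an assumed end-of-sentence token:

```python
import numpy as np

def beam_search(next_log_probs, vocab, beam_width=3, max_len=20, eos="<EOS>"):
    """Keep the beam_width most likely partial translations at each step."""
    beams = [([], 0.0)]                    # (prefix, summed log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = next_log_probs(prefix)
            for i in np.argsort(log_p)[-beam_width:]:      # top-B extensions of this beam
                candidates.append((prefix + [vocab[i]], score + log_p[i]))
        candidates.sort(key=lambda c: c[1], reverse=True)  # keep the overall top-B
        beams = []
        for prefix, score in candidates[:beam_width]:
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return completed or beams
```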
Numeric underflow can occur when multiplying many small numbers less than 1: the product eventually becomes so small that it can't be represented by the computer's floating-point types. By summing log probabilities instead, we achieve the same result while avoiding this issue.
Also, the more small numbers you multiply together, the smaller the product gets. Since these are probabilities, the objective will tend to favor shorter sentences, which have fewer probabilities multiplied together. To fix this, normalize by dividing by the number of words in the translation.
Additionally, an $\alpha$ exponent can be added to the translated sentence length in order to finetune the amount of normalization. With $\alpha = 1$ there is full normalization, with $\alpha = 0$ there is no normalization.
After running through all combinations for beam search, score each sentence with the normalized log-probability objective to find the best translated sentence: $\frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>},\dots,y^{<t-1>})$
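A minimal sketch of this scoring, where `log_probs` holds the per-word $\log P$ values of a candidate translation; the example sentences and their log-probabilities are purely illustrative:

```python
import numpy as np

def normalized_log_prob(log_probs, alpha=0.7):
    """Score = (1 / T_y^alpha) * sum_t log P(y^<t> | x, y^<1..t-1>)."""
    T_y = len(log_probs)
    return np.sum(log_probs) / (T_y ** alpha)

# Illustrative candidates with made-up per-word log-probabilities
candidates = {
    "jane visits africa in september": [-0.9, -1.2, -2.1, -0.4, -1.0],
    "jane is visiting africa in september": [-0.8, -1.5, -1.9, -2.0, -0.5, -1.1],
}
best = max(candidates, key=lambda s: normalized_log_prob(candidates[s]))
```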
How to choose $B$?
In large production systems, a beam width $B$ of 10-100 is common. For cutting-edge research, 1,000+ is normal.
Need to attribute error to either the Beam Search algorithm or the RNN
For translations, there can be more than 1 good answer - how do you pick the best one?
BLEU score (bilingual evaluation understudy) measures how good the machine translation is:
BLEU score on bigrams:
There are open-source BLEU score implementations already in existence that can be used to score your own algorithm.
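A sketch of the modified bigram precision that underlies the BLEU score, assuming a single reference translation; the full BLEU score combines n-gram precisions for n = 1..4 with a brevity penalty:

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def modified_bigram_precision(candidate, reference):
    """Count candidate bigrams, crediting each only up to its count in the reference (clipping)."""
    cand_counts = Counter(bigrams(candidate))
    ref_counts = Counter(bigrams(reference))
    clipped = sum(min(count, ref_counts[bg]) for bg, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

print(modified_bigram_precision(
    "the cat the cat on the mat".split(),
    "the cat is on the mat".split()))   # 3 clipped matches out of 6 bigrams -> 0.5
```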
Humans doing language translation would not read an entire lengthy sentence before starting to translate parts of it. They would read a chunk of the sentence, translate, and then continue. When doing machine translation, it is the same concept where translating certain words requires more or less attention for the surrounding words.
The parameter $\alpha$ is used to denote how much attention should be paid to one word when translating another. For example, $\alpha^{<1,2>}$ represents a weight for how much we should consider the 2nd input word while translating the first word of the output sequence. These weights give a context, $C$, for translating a word at an RNN cell.
$\alpha^{<t,t'>}$ is the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$
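A sketch of computing the attention weights $\alpha^{<t,t'>}$ as a softmax over alignment scores and the context $c^{<t>}$ as a weighted sum of the encoder activations $a^{<t'>}$; `score_fn` stands in for the small scoring network and is an assumption here:

```python
import numpy as np

def attention_context(a_enc, s_prev, score_fn):
    """a_enc: (T_x, n_a) encoder activations a^<t'>,
    s_prev: decoder state before producing y^<t>,
    score_fn: assumed small network giving an alignment score e^<t,t'>."""
    e = np.array([score_fn(s_prev, a_enc[tp]) for tp in range(len(a_enc))])
    alphas = np.exp(e - e.max())
    alphas = alphas / alphas.sum()                     # alpha^<t,t'> sums to 1 over t'
    context = (alphas[:, None] * a_enc).sum(axis=0)    # c^<t> = sum_t' alpha^<t,t'> * a^<t'>
    return context, alphas
```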
Audio is represented in computers as a sound clip which is just variations in sound pressure over time. The different frequencies end up making up different natural language words that humans are used to hearing.
Academic datasets may contain around 300 - 3,000 hours of transcribed audio used to train a system. Commercial-quality systems may require 10,000 - 100,000 hours of transcribed audio.
By having a fixed-length RNN output (e.g., 1,000 timesteps), shorter transcripts can be represented by outputting repeated characters and blanks. Repeated characters not separated by a blank get collapsed into single characters to form understandable words (the CTC collapse rule), as in the sketch below.
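A minimal sketch of that collapse rule, using "_" as the blank symbol (the example string is illustrative):

```python
from itertools import groupby

def ctc_collapse(output_chars, blank="_"):
    """Collapse repeated characters not separated by a blank, then drop the blanks."""
    merged = [ch for ch, _ in groupby(output_chars)]   # "ttt" -> "t", "___" -> "_"
    return "".join(ch for ch in merged if ch != blank)

print(ctc_collapse("ttt_h_eee___ _qq__"))   # -> "the q"
```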
Trigger words are used to have a computer "wake-up" and start doing something. For example, Amazon devices use the word "Alexa" and Google home products use the phrase "Okay, Google".