This notebook summarizes Natural Language Processing (NLP) as presented in the following resources:
Natural language processing (NLP) has the goal of making computers "understand" human languages in order to perform some useful tasks.
Human language is a symbolic/categorical signaling system. Its large vocabulary creates a machine learning problem with extreme sparsity in word encodings.
NLP is difficult because human language is complex and ambiguous. Interpretation depends on learning and then using situational, contextual, world, and visual knowledge about the language.
Processing Text
Generating Text
Generating Image Descriptions
Speech Recognition
bag-of-words
Language modeling is the task of predicting what word comes next. More mathematically: given a sequence of words, compute the probability distribution of the next word.
Given $x^{(1)}, x^{(2)}, ..., x^{(t)}$, the probability of the next word $x^{(t+1)}$ is:
$P(x^{(t+1)} = w_j | x^{(t)},...,x^{(1)})$ where
$w_j$ is a word in the vocabulary $V = \{w_1, ..., w_{|V|}\}$
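By the chain rule, the same conditional probabilities also give the probability of an entire sequence of text, which is how a language model scores whole sentences:

$P(x^{(1)}, ..., x^{(T)}) = \prod_{t=1}^{T} P(x^{(t)} \mid x^{(t-1)}, ..., x^{(1)})$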
A language model can also generate text by repeatedly choosing the most likely next word after training has been completed. However, a 3-gram model will not lead to very meaningful sentences; an n-gram that looks further back is needed, but this would increase the size of the model exponentially.
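A rough sketch of that generation loop, assuming a hypothetical `next_word_probs` function; the hard-coded bigram table below is purely illustrative, not a trained model:

```python
# Minimal sketch of text generation from a language model.
# TOY_BIGRAM_PROBS and next_word_probs are illustrative stand-ins, not a real model.
TOY_BIGRAM_PROBS = {
    "the":      {"students": 0.6, "books": 0.4},
    "students": {"opened": 0.7, "read": 0.3},
    "opened":   {"their": 0.8, "the": 0.2},
    "their":    {"books": 0.9, "laptops": 0.1},
    "books":    {"the": 1.0},
    "read":     {"the": 1.0},
    "laptops":  {"the": 1.0},
}

def next_word_probs(context):
    """Return P(next word | context); this toy model only looks at the last word."""
    return TOY_BIGRAM_PROBS[context[-1]]

def generate(seed, num_words=8):
    """Generate text by repeatedly picking the most likely next word.
    (Sampling from the distribution is a common alternative to arg-max.)"""
    words = list(seed)
    for _ in range(num_words):
        probs = next_word_probs(words)
        words.append(max(probs, key=probs.get))  # greedy choice of the most likely word
    return " ".join(words)

print(generate(["the"]))  # -> "the students opened their books the students opened their"
```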
An n-gram is a chunk of $n$ consecutive words.
Collect statistics about the frequencies of different n-grams and use them to predict the next word. Some large corpus of text is used to produce the n-gram probabilities, so the probabilities will reflect the text the model was trained on.
Example: in "this part doesn't matter, because only these words ___", a 4-gram model discards the first clause and conditions only on the last three words:
$P(w_j \mid \text{only these words}) = \frac{\text{count}(\text{only these words } w_j)}{\text{count}(\text{only these words})}$
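A minimal sketch of that count-based estimate, assuming plain Python over a tokenized corpus; the tiny corpus and function names are only for illustration:

```python
from collections import Counter

def train_ngram_counts(tokens, n=4):
    """Count n-grams and their (n-1)-word contexts over a token list."""
    ngram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    context_counts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngram_counts, context_counts

def ngram_prob(word, context, ngram_counts, context_counts):
    """P(word | context) = count(context + word) / count(context)."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0  # unseen context: the sparsity problem in action
    return ngram_counts[context + (word,)] / context_counts[context]

# Tiny illustrative corpus; a real corpus would be far larger.
tokens = "only these words matter because only these words appear".split()
ngrams, contexts = train_ngram_counts(tokens, n=4)
print(ngram_prob("matter", ("only", "these", "words"), ngrams, contexts))  # 0.5 here
```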
Increasing $n$ makes the sparsity problem worse and the model size huge.
The image above shows an RNN being unrolled (unfolded) into its complete sequence. If the sequence is 5 words long, then it is unrolled into a 5-layer neural network.
An RNN shares the same parameters (U, V, W) across all steps, unlike a traditional NN, which has different parameters at each layer. The RNN performs the same task at each step, just with different inputs. This means there are far fewer parameters in an RNN than in, for example, a convolutional NN.
The main feature of an RNN is its hidden state, which captures some information about the sequence. Inputs/outputs may not be needed at every step: when predicting the sentiment of an entire sentence, only the final output is needed, not any of the intermediate ones.
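A rough numpy sketch of the forward pass, reusing the same U, V, W at every step; the shapes, tanh nonlinearity, and softmax output are assumptions made here for illustration, not the only possible choices:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(inputs, U, V, W):
    """Run an RNN over a sequence of one-hot input vectors.

    The same U (input->hidden), W (hidden->hidden) and V (hidden->output)
    are reused at every time step; only the input and the hidden state change.
    """
    hidden_size = W.shape[0]
    h = np.zeros(hidden_size)            # initial hidden state
    hs, ys = [], []
    for x_t in inputs:
        h = np.tanh(U @ x_t + W @ h)     # new hidden state from input + previous state
        y_t = softmax(V @ h)             # output distribution over the vocabulary
        hs.append(h)
        ys.append(y_t)
    return hs, ys

# Toy dimensions (assumed): vocabulary of 10 words, hidden state of size 16.
vocab_size, hidden_size = 10, 16
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

sequence = [np.eye(vocab_size)[i] for i in (3, 1, 4, 1, 5)]  # 5 one-hot "words"
hidden_states, outputs = rnn_forward(sequence, U, V, W)
print(outputs[-1].sum())  # each output sums to 1 (a probability distribution)
```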
Training is similar to that of a traditional NN, but with a tweaked version of backpropagation. Because parameters are shared across all time steps, the gradient at each output depends on the current and all previous time steps.
This is called Backpropagation Through Time (BPTT) - to calculate the gradient at time $t$, we have to backpropagate through multiple previous steps and sum up the gradients.
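As a worked example of that sum (using $E^{(t)}$ for the loss at step $t$, $\hat{y}^{(t)}$ for the output, and $h^{(k)}$ for the hidden states; this notation is introduced here for the sketch), the gradient of the loss at step $t$ with respect to the shared matrix $W$ collects a contribution from every earlier step:

$\frac{\partial E^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W}$

where the last factor is the immediate derivative of $h^{(k)}$ with respect to $W$, treating $h^{(k-1)}$ as a constant.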