Natural Language Processing

This notebook summarizes Natural Language Processing (NLP) as presented in the following resources:

What is NLP?

Natural language processing (NLP) has the goal of making computers "understand" human languages in order to perform some useful tasks.

Human language is a symbolic/categorical signaling system. Large vocabulary creates a machine learning problem with extreme sparsity in word encodings.
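
As a small illustration (a toy example, not from the notes), a one-hot encoding over a tiny made-up vocabulary shows where that sparsity comes from: each word becomes a vector as long as the vocabulary with a single non-zero entry.

```python
# Toy one-hot encoding sketch; the vocabulary here is a made-up example.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector that is all zeros except at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("cat"))  # [0, 1, 0, 0, 0]; a real vocabulary has tens of thousands of entries
```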

NLP is difficult because human language is complex and ambiguous. Interpretation depends on learning and then using situational, contextual, world, and visual knowledge about the language.

Applications

Processing Text

  • Sentiment Analysis
  • Translation

Generating Text

Generating Image Descriptions

Speech Recognition

Processing

bag-of-words
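
A minimal sketch of a bag-of-words representation, assuming naive whitespace tokenization (the notes only name the technique): a document is reduced to word counts and word order is discarded.

```python
from collections import Counter

def bag_of_words(text):
    # Naive tokenization: lowercase and split on whitespace.
    return Counter(text.lower().split())

print(bag_of_words("there it goes there it goes again"))
# Counter({'there': 2, 'it': 2, 'goes': 2, 'again': 1})
```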

Language Modeling

Task of predicting what word comes next. More mathematically: given a sequence of words, compute the probability distribution of the next word.

given $x^{(1)}, x^{(2)},..., x^{(t)}$ then the probability of the next word $x^{(t+1)}$ is:
$P(x^{(t+1)} = w_j | x^{(t)},...,x^{(1)})$ where
$w_j$ is a word in the vocabulary $V = \{w_1, ..., w_{|V|}\}$

A language model can also generate text by repeatedly choosing the most likely next word once training is complete. However, a 3-gram model will not lead to very meaningful sentences; an n-gram that looks further back is needed, though this increases the size of the model exponentially.
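
A minimal sketch of that generation loop, assuming a hypothetical `next_word_distribution(context)` that returns a `{word: probability}` dict for the next word (any trained language model could supply it):

```python
def generate(next_word_distribution, seed, length=20):
    """Generate text by repeatedly predicting the next word."""
    words = list(seed)
    for _ in range(length):
        dist = next_word_distribution(tuple(words))
        # Pick the most likely next word, as described above; sampling from the
        # distribution instead would give more varied text.
        words.append(max(dist, key=dist.get))
    return " ".join(words)
```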

n-gram Language Model

An n-gram is a chunk of n consecutive words.

  • unigram: "there", "it", "goes"
  • bigram: "there it", "it goes"
  • trigram: "there it goes"

Collect statistics about the frequencies of different n-grams and use them to predict the next word. A large corpus of text is used to produce the n-gram probabilities, so the probabilities will reflect the text the model was trained on.

Example

this part doesn't matter, because only these words ___

  • discard the first words
  • condition on the last $n-1 = 3$ words (i.e., a 4-gram model)

$P(w_j|only~these~words) = \frac{count(only~these~words~w_j)}{count(only~these~words)}$

  • if "only these words" appear 1000 times, followed by "matter" in 400 cases, then: $P(matter|only~these~words) = 0.4$
Problems

Increasing $n$ makes the sparsity problems worse and the model size huge

  • if $only~these~words~w_j$ never occurs in the data, then $w_j$ has a probability of 0
  • if $only~these~words$ never occurs in the data, then we can't calculate probabilities for any $w_j$
  • we have to store counts for all possible n-grams, so the model size grows exponentially with $n$ ($O(|V|^n)$ possible n-grams)

Recurrent Neural Network

  • Main idea is to process sequential information
  • Normal neural networks assume that all inputs/outputs are independent
    • not good for sentences where words have context with one another
  • RNNs are recurrent because they perform the same task for every element in a sequence
    • output depends on previous computations
    • have memory from previous calculations
  • RNNs can in principle make use of long sequences, but in practice it becomes too computationally expensive to look back very far

Unrolling/unfolding an RNN expands it into its complete sequence: if the sequence is 5 words long, the network is unrolled into a 5-layer neural network, one layer per word.

  • $x_t$ is the input at time step $t$
  • $s_t$ is the hidden state at time step $t$. It is the memory of the network and is calculated from the previous hidden state and the input at the current step: $s_t = f(U x_t + W s_{t-1})$. The function $f$ is usually non-linear (tanh or ReLU). $s_{-1}$ is usually initialized to all zeros for the first hidden state.
  • $o_t$ is the output at time step $t$. To predict the next word in a sentence, this output would be a vector of probabilities across the vocabulary: $o_t = \mathrm{softmax}(V s_t)$ (see the sketch after this list)
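
A minimal NumPy sketch of a single forward step using the equations above, with toy dimensions and randomly initialized $U$, $W$, $V$ (assumptions for illustration, not from the notes):

```python
import numpy as np

vocab_size, hidden_size = 10, 4
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_size, vocab_size))   # input  -> hidden
W = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
V = rng.normal(size=(vocab_size, hidden_size))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev):
    """s_t = tanh(U x_t + W s_{t-1}); o_t = softmax(V s_t)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = softmax(V @ s_t)
    return s_t, o_t

s = np.zeros(hidden_size)   # s_{-1} initialized to all zeros
x = np.eye(vocab_size)[3]   # one-hot input for word index 3
s, o = rnn_step(x, s)
print(o.shape, o.sum())     # (10,) and the probabilities sum to 1
```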

An RNN shares the same parameters ($U$, $V$, $W$) across all steps, unlike a traditional NN, which has different parameters at each layer. The RNN performs the same task at each step, just with different inputs, so there are far fewer parameters to learn than in, for example, a convolutional NN.

The main feature of an RNN is its hidden state, which captures information about the sequence. Inputs/outputs may not be needed at every step: when predicting the sentiment of an entire sentence, only the final output is needed, not the intermediate ones.

RNN Training

Training an RNN is similar to training a traditional neural network, using a tweaked version of backpropagation. Because parameters are shared across all time steps, the gradient at each output depends not only on the current time step but also on the previous ones.

This is called Backpropagation Through Time (BPTT) - to calculate the gradient at time $t$, we have to backprop to multiple previous steps and sum the gradients.
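
Written out with the chain rule, where $E_t$ is assumed here to denote the loss at time step $t$, the gradient of $E_t$ with respect to the shared weights $W$ sums contributions from every earlier step:

$\frac{\partial E_t}{\partial W} = \sum_{k=0}^{t} \frac{\partial E_t}{\partial o_t} \frac{\partial o_t}{\partial s_t} \frac{\partial s_t}{\partial s_k} \frac{\partial s_k}{\partial W}$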

RNN Flavors

Bidirectional RNN

Deep Bidirectional RNN

LSTM Networks