Python Tutorial 1: Smoothing

Version: 1.0, September 2016

This is a tutorial about developing simple Part-of-Speech taggers using Python 3.x and the NLTK.

This tutorial was developed as part of the course material for the course Advanced Natural Language Processing in the Computational Linguistics Program of the Department of Linguistics at Indiana University.


Smoothing means re-estimating the probabilities in, for example, an n-gram frequency profile: probability mass is shifted away from the zero and very small counts so that zero-count n-grams, that is, n-grams that have not been seen in the training corpus, receive a very small non-zero probability.

For example, if we extract a frequency profile of tuples of tokens and Part-of-Speech tags from the Brown corpus, we will surely not find any occurrence of a tuple like (iPhone, NN).

There are various smoothing techniques:

  • Additive smoothing

Add-One Smoothing

To estimate the probability of a token (or word) $w_n$ occurring, given that the words $w_1\dots{}w_{n-1}$ occurred, we can use the Maximum Likelihood Estimation (MLE) as described below. The conditional probability of an n-gram like the cat, i.e. $P(cat|the)$, is the probability of the n-gram, $P(the\ cat)$, divided by the probability of the token the, i.e. $P(the)$.

$$P(w_n\ |\ w_1\ \dots{}\ w_{n-1}) = \frac{P(w_1\ \dots{}\ w_n)}{P(w_1\ \dots{}\ w_{n-1})}$$

Let us assume that $C(w_1\dots{}w_n)$ is the count or frequency of a particular n-gram with words $w_1\dots{}w_n$, and that $C(w_1\dots{}w_{n-1})$ is the count of the $(n-1)$-gram of words preceding $w_n$ in the n-gram. $C(t)$ is the total count of tokens (or words), and $C(t)-(N-1)$ is the total count of n-grams of size $N$. If we have a text with 3 words, i.e. $C(t)=3$, and use n-grams of size $N=2$, we will have $C(t)-(N-1)=3-(2-1)=2$ bigrams. We can now derive the conditional probability as an estimate, assuming that for a large number of n-grams the count of n-grams $C(t)-(N-1)$ and the count of $(n-1)$-grams $C(t)-(N-2)$ are very close and can thus be treated as equal:

$$P(w_n\ |\ w_1\ \dots{}\ w_{n-1}) = \frac{P(w_1\ \dots{}\ w_n)}{P(w_1\ \dots{}\ w_{n-1})} = \frac{\frac{C(w_1\dots{}w_n)}{C(t)-(N-1)}}{\frac{C(w_1\dots{}w_{n-1})}{C(t)-(N-2)}}=\frac{C(w_1\dots{}w_n)}{C(t)-(N-1)} \frac{C(t)-(N-2)}{C(w_1\dots{}w_{n-1})}=\frac{C(w_1\dots{}w_n)}{C(w_1\dots{}w_{n-1})}$$
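The cancellation above means the MLE conditional probability reduces to a simple ratio of counts. A minimal sketch in Python, using a toy corpus of our own (the sentence and counts are illustrative assumptions, not taken from the Brown corpus):

```python
# MLE estimate of P(w_n | w_{n-1}) from raw bigram and unigram counts.
from collections import Counter

# Toy corpus (an assumption for illustration).
tokens = "the cat sat on the mat the cat slept".split()
bigrams = list(zip(tokens, tokens[1:]))

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)

def mle(prev, word):
    # P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# C(the cat) = 2 and C(the) = 3, so P(cat | the) = 2/3.
print(mle("the", "cat"))
```

Note that this estimate is exactly zero for any bigram absent from the corpus, which is the problem smoothing addresses.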

The general idea in Add-One Smoothing is to pretend that an unseen n-gram should be assumed to have occurred at least once.
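This idea can be sketched directly in Python with the standard Add-One (Laplace) formula, $(C+1)/(C(w_{n-1})+V)$, where $V$ is the vocabulary size; the toy corpus is an illustrative assumption:

```python
# Sketch of Add-One (Laplace) smoothing for bigram probabilities:
# every bigram count is incremented by one, so unseen bigrams
# receive a small non-zero probability.
from collections import Counter

# Toy corpus (an assumption for illustration).
tokens = "the black cat is chasing the white mouse".split()
bigrams = list(zip(tokens, tokens[1:]))

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)
V = len(unigram_counts)  # vocabulary size = number of types

def add_one(prev, word):
    # P(word | prev) = (C(prev word) + 1) / (C(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(add_one("the", "black"))    # seen bigram: (1+1)/(2+7) = 2/9
print(add_one("black", "mouse"))  # unseen bigram, still non-zero: 1/8
```

The denominator grows by $V$ because one pretended occurrence is added for every possible continuation, which keeps the conditional probabilities summing to one.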

We differentiate between the number of types and the number of tokens. The number of tokens in the following text is 12: $tokens = \{the, black, cat, is, chasing, the, white, mouse, with, the, black, tail\}$. The number of types is 9: $types = \{the, black, cat, is, chasing, white, mouse, with, tail\}$.

The black cat is chasing the white mouse with the black tail.
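The token and type counts above can be verified with a few lines of Python: tokens are all word occurrences, while types are the distinct words.

```python
# Counting tokens (all occurrences) vs. types (distinct words).
text = "The black cat is chasing the white mouse with the black tail"
tokens = text.lower().split()  # lowercase so "The" and "the" are one type
types = set(tokens)

print(len(tokens))  # 12 tokens
print(len(types))   # 9 types
```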

With $C$ the count of a particular n-gram in a corpus, $N$ the count of all n-grams, and $V$ the vocabulary size or the number of types, Add-One Smoothing can be computed for all n-grams as: