Python Tutorial 1: Part-of-Speech Tagging

(C) 2016-2019 by Damir Cavar <dcavar@iu.edu>

Version: 1.3, October 2019

Download: This and various other Jupyter notebooks are available from my GitHub repo.

Introduction

This is a tutorial about developing simple Part-of-Speech taggers using Python 3.x and the NLTK.

This tutorial was developed as part of the course material for Advanced Natural Language Processing classes at Indiana University.

The Brown corpus is distributed as part of the NLTK Data. To be able to use the NLTK Data and the Brown corpus on your local machine, you need to install the data as described on the Installing NLTK Data page. If you want to use iPython on your local machine, I recommend installing a Python 3.x distribution, for example the most recent Anaconda release, and reading the instructions on how to run iPython with Anaconda.

Part-of-Speech Tagging

We refer to Part-of-Speech (PoS) tagging as the task of assigning class information to individual words (tokens) in some text. The tags are defined in tagsets that specify character sequences representing sets of, for example, lexical, morphological, syntactic, or semantic features. See the Categorizing and Tagging Words chapter of the NLTK book for more details.

Using the Brown Corpus

The documentation of the Brown corpus design and properties can be found on this page.

With the following line of code we import the Brown corpus into the running Python instance. This makes the tokens and PoS-tags from the Brown corpus available for further processing.


In [1]:
from nltk.corpus import brown

Our goal is to assign PoS-tags to a sequence of words that represents a phrase, utterance, or sentence.

Let us assume that the probability of a sequence of 5 tags $t_1\ t_2\ t_3\ t_4\ t_5$ given a sequence of 5 tokens $w_1\ w_2\ w_3\ w_4\ w_5$, that is $P(t_1\ t_2\ t_3\ t_4\ t_5\ |\ w_1\ w_2\ w_3\ w_4\ w_5)$, can be computed as the product of two kinds of probabilities: the probability of one tag given the preceding tag, e.g. the probability of tag 2 given that tag 1 occurred, $P(t_2\ |\ t_1)$, and the probability of a word given a specific tag, e.g. the probability of word 2 given that tag 2 occurred, $P(w_2\ |\ t_2)$.

Let us assume that we use two extra symbols S and E. S stands for the sentence beginning and E for the sentence end. We use these symbols to keep track of the different distributions of tags and tokens relative to sentence position. The token the, for example, is very unlikely to occur in sentence-final position and much more likely to occur in sentence-initial position.

$$P(t_1 \dots t_5\ |\ w_1 \dots w_5) = P(t_1\ |\ S)\ P(w_1\ |\ t_1)\ P(t_2\ |\ t_1)\ P(w_2\ |\ t_2)\ P(t_3\ |\ t_2)\ P(w_3\ |\ t_3)\ P(t_4\ |\ t_3)\ P(w_4\ |\ t_4)\ P(t_5\ |\ t_4)\ P(w_5\ |\ t_5)\ P(E\ |\ t_5)$$

This equation can be abbreviated as follows:

$$P(t_1 \dots t_5\ |\ w_1 \dots w_5) = P(t_1\ |\ S)\ P(w_1\ |\ t_1)\ P(E\ |\ t_5)\ \prod_{i=1}^{4} P(t_{i+1}\ |\ t_i)\ P(w_{i+1}\ |\ t_{i+1})$$
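
To make the notation concrete, for the two-token sequence the cat with the candidate tags AT NN (the example used further below), the formula instantiates as:

$$P(AT\ NN\ |\ the\ cat) = P(AT\ |\ S)\ P(the\ |\ AT)\ P(NN\ |\ AT)\ P(cat\ |\ NN)\ P(E\ |\ NN)$$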

We estimate the probabilities for a word (or token) given that a certain tag occurred, that is $P(w_1\ |\ t_1)$, from the frequency profiles of tags and tokens in the Brown corpus. The necessary corpus reader should be loaded and in memory after executing the code cell above.

Since we loaded the Brown corpus into memory, we can now use specific methods to access tokens and PoS-tags from the corpus. The following line of code unzips the list of tuples that contain tokens and tags in the sequence found in the corpus and stores the tokens in the tokens variable and the tags in the tags variable. Note that the * operator is used here to unzip a list; see the documentation page for more details on the Python zip function. The function brown.tagged_words() returns a list of (token, tag) tuples. The zip function creates two tuples, one of tokens and one of tags, and assigns them to the variables tokens and tags respectively.


In [2]:
tokens, tags = zip(*brown.tagged_words())

You can inspect the resulting token sequence by displaying the first 20 elements:


In [3]:
tokens[:20]


Out[3]:
('The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that')

You can display the first 20 tags as well:


In [4]:
tags[:20]


Out[4]:
('AT',
 'NP-TL',
 'NN-TL',
 'JJ-TL',
 'NN-TL',
 'VBD',
 'NR',
 'AT',
 'NN',
 'IN',
 'NP$',
 'JJ',
 'NN',
 'NN',
 'VBD',
 '``',
 'AT',
 'NN',
 "''",
 'CS')

The sequence of tokens and tags is aligned, that is, the first tag in the tags list belongs to the first token in the tokens list. You can print the token-tag pair out in the following way:


In [5]:
print("Token:", tokens[0], "Tag:", tags[0])


Token: The Tag: AT

To create a frequency profile of tags, for example, we can make use of the Counter container datatype from the collections module. We import the Counter datatype with the following code:


In [6]:
from collections import Counter

We can create a frequency profile of the tags from the Brown corpus and store it in the variable tagCounter using the following code:


In [7]:
tagCounter = Counter(tags)

The tagCounter datatype now contains a hash-table with tags as keys and their frequencies as values. Accessing the frequency of a specific tag can be achieved using the following code:


In [8]:
tagCounter["NNS"]


Out[8]:
55110

The frequency of a specific token can be accessed by generating a frequency profile from the token-list in the same way as for tags:


In [9]:
tokenCounter = Counter(tokens)

We access the token frequency in the same way as for tags:


In [10]:
tokenCounter["the"]


Out[10]:
62713

Since one type (or word) in the Brown corpus can have more than one corresponding tag, each with its own frequency, we need to store this information in a suitable datastructure. We use a defaultdict from the collections module for this purpose.


In [11]:
from collections import defaultdict

The following loop reads the individual token and tag pairs from the list of token-tag tuples in brown.tagged_words() and increments their counts in the dictionary of Counter datastructures.


In [12]:
tokenTags = defaultdict(Counter)
for token, tag in brown.tagged_words():
    tokenTags[token][tag] += 1

We can now ask for the Counter datastructure for a particular key, for example loves. The Counter datastructure is a hash-table with tags as keys and the corresponding frequencies as values.


In [27]:
tokenTags["loves"]


Out[27]:
Counter({'VBZ': 17, 'NNS': 2})

In [14]:
tokenTags["the"]["AT"]


Out[14]:
62288

In [26]:
len(tokens)


Out[26]:
1161192

For the calculation of the probability of a $tag_2$ given that a $tag_1$ occurred, that is $P(tag_2\ |\ tag_1)$, we will need to count the bigrams from the tags list. The ngrams function from the NLTK util module provides a convenient way to achieve this:


In [16]:
from nltk.util import ngrams

As for the tokenTags datatype above, we can create a tags bigram model using a dictionary of Counter datatypes. The dictionary keys will be the first tag of the tag-bigram. The value will contain a Counter datatype with the second tag of the tag-bigram as the key and the frequency of the bigram as value.


In [17]:
tagTags = defaultdict(Counter)

Using the ngrams function we generate a bigram model from the tags list and store it in the variable posBigrams using the following code:


In [18]:
posBigrams = list(ngrams(tags, 2))

The following loop goes through the list of bigram tuples, assigns the left bigram tag to the variable tag1 and the right bigram tag to the variable tag2, and increments the count of the bigram in the tagTags datastructure:


In [19]:
for tag1, tag2 in posBigrams:
    tagTags[tag1][tag2] += 1

We can now list all tags that follow the AT tag with the corresponding frequency:


In [20]:
tagTags["AT"]


Out[20]:
Counter({'NP-TL': 809,
         'NN': 48376,
         'NN-TL': 2565,
         'NP': 2230,
         'JJ': 19488,
         'JJT': 675,
         'AP': 3007,
         'NNS': 7215,
         'NN$': 907,
         'VBG': 1568,
         'CD': 981,
         'JJS': 206,
         'VBN': 1468,
         'JJ-TL': 1414,
         'NPS': 588,
         'OD': 1251,
         '``': 620,
         'NNS$': 97,
         'RB': 350,
         'QL': 1377,
         'JJS-TL': 2,
         'NN$-TL': 162,
         'JJR': 630,
         'VBN-TL': 390,
         'NR-TL': 208,
         'NNS-TL': 284,
         'FW-IN': 7,
         'ABN': 42,
         'NR': 218,
         'NPS$': 30,
         'PN': 149,
         'NNS$-TL': 28,
         '*': 4,
         'NP$': 62,
         "'": 24,
         'VBG-TL': 34,
         'OD-TL': 98,
         'JJR-TL': 3,
         'FW-NN-TL': 52,
         'RB-TL': 1,
         'CD-TL': 29,
         'FW-JJ-TL': 8,
         'NR$-TL': 8,
         'FW-NN': 76,
         'RBT': 11,
         '(': 15,
         "''": 7,
         'CC': 4,
         'VB': 16,
         'RB-NC': 1,
         'VBZ': 1,
         'RP': 1,
         'NP$-TL': 17,
         'VBD': 2,
         ',': 12,
         '--': 4,
         'PN$': 1,
         'RBR': 40,
         'NN+HVZ': 3,
         'FW-IN-TL': 4,
         'IN': 3,
         'NN+BEZ': 13,
         '.': 4,
         'HV': 1,
         'MD': 1,
         'FW-RB': 1,
         'NPS-TL': 2,
         'FW-NNS': 11,
         'NN+MD': 2,
         'AT': 1,
         'PPO': 1,
         'FW-JJ': 2,
         'FW-VB': 2,
         'NIL': 1,
         'JJT-TL': 3,
         'FW-NNS-TL': 5,
         'FW-VBN': 1,
         'FW-NN-TL-NC': 1,
         ')': 1,
         'NN-HL': 3,
         'VB-TL': 3,
         'BEZ-NC': 1,
         'JJ-HL': 1,
         'FW-JJT': 1,
         'NN+IN': 1,
         'FW-NN$': 1,
         'WDT': 2,
         'UH': 2,
         'AP$': 1,
         'FW-CC': 1,
         'AP-TL': 2,
         'PN+HVZ': 1,
         'RBR+CS': 1,
         'PPS': 1,
         'IN-TL': 1})

We can request the frequency of a specific tag bigram, for example NP NNS, using the following code:


In [35]:
tagTags["NP"]["NNS"]


Out[35]:
489


We can calculate the total number of bigrams and turn the count of any particular bigram into a relative frequency:


In [26]:
total = float(len(tags))
print(total)
tagTags["NNS"]["NNS"]/(total-1)


1161192.0
Out[26]:
0.00012228823681892126

If we want to know how many times a certain tag occurs in sentence-initial position, for example to estimate the initial probabilities of the start states in a Hidden Markov Model, we can loop through the sentences and count the tags in initial position.


In [27]:
offset = 0
initialTags = Counter()
for x in brown.sents():
    initTag = tags[offset]
    initialTags[initTag] += 1
    offset += len(x)
print("Example:")
print("AT:", initialTags["AT"])


Example:
AT: 8297

Note: for the code above, I do not know how to access the initial sentence tag directly, so I am accessing the tag indirectly via an offset count. If you know a better way, please let me know.
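
One possible alternative, shown here only as a sketch, is to read the first tag of each tagged sentence directly from brown.tagged_sents(), which returns every sentence as a list of (token, tag) tuples (the variable name initialTagsAlt is just for illustration):

initialTagsAlt = Counter()
# Count the tag of the first token of every non-empty tagged sentence.
for sent in brown.tagged_sents():
    if sent:
        initialTagsAlt[sent[0][1]] += 1
print("AT:", initialTagsAlt["AT"])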

We can now estimate the probability of any tag being in sentence initial position in the following way:


In [28]:
initialTags["AT"]/total


Out[28]:
0.007145243852868432

We can estimate the probability of any tag being followed by any other, in the following way:


In [29]:
tagTags["AT"]["NN"]/(total-1)


Out[29]:
0.04166067425600095

Note that we divide by total - 1, since the number of bigrams in the tagTags data structure is exactly that: a sequence of total tags yields total - 1 bigrams.
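
As a quick sanity check, a small sketch using the posBigrams list created above, the number of tag bigrams should indeed be one less than the number of tags:

# The bigram list should contain exactly len(tags) - 1 elements.
print(len(posBigrams) == len(tags) - 1)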

We can estimate the likelihood of a tag token combination using the tokenTags data-structure:


In [30]:
tokenTags["John"]["NN"]/total


Out[30]:
0.0

Given the data structures tokenTags, tagTags, and tagCounter, we can now estimate the probability of a word given a specific tag, or intuitively, the probability that a specific word is assigned a tag. For the token cat and the tag NN this is $P(cat\ |\ NN)$, estimated using the following equation and corresponding code (with $C(cat\ NN)$ the absolute frequency or count of the cat NN tuple, and $C(NN)$ the count of the NN-tag):

$$P(w_n\ |\ t_n) = \frac{C(w_n\ t_n)}{C(t_n)}$$

In [31]:
tokenTags["cat"]["NN"] / tagCounter["NN"]


Out[31]:
0.00013117334557617892

We can estimate the probability of a $tag_2$ following a $tag_1$ using a similar approach:

$$P(t_n\ |\ t_{n-1}) = \frac{C(t_{n-1}\ t_n)}{C(t_{n-1})}$$

Here $C(t_{n-1}\ t_n)$ is the count of the bigram of these two tags in sequence, and $C(t_{n-1})$ is the count or absolute frequency of the first or left tag in the bigram. Let us assume that the input sequence is the cat ... and that the most likely initial tag for the is AT; then the probability of the tag NN given that the tag AT occurred can be estimated as:


In [32]:
tagTags["AT"]["NN"] / tagCounter["AT"]


Out[32]:
0.4938392592819445

The product of the two probabilities $P(w_n\ |\ t_n)\ P(t_n\ |\ t_{n-1})$ for the tokens the cat and the possible tags AT NN should be:


In [33]:
(tokenTags["cat"]["NN"] / tagCounter["NN"]) * (tagTags["AT"]["NN"] / tagCounter["AT"])


Out[33]:
6.477854781687473e-05

If we wanted to calculate this for any sequence of words, we would wrap this code in a function and loop over all tokens. To avoid numeric underflow from the product of many small probabilities, we can sum the log-likelihoods of these probabilities instead. We would calculate the score for all possible tag combinations assigned to the sequence of words or tokens and select the largest one as the best.
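
As an illustration only, here is one possible sketch of such a function, built on the tokenTags, tagTags, tagCounter, initialTags, and total structures from above; the helper names logprob, scoreSequence, and bestTagSequence are hypothetical and not part of the tutorial code, and unseen tokens simply fall back to NN:

from itertools import product
from math import log

def logprob(count, denom):
    # Log of a relative frequency; negative infinity stands in for log(0).
    return log(count / denom) if count > 0 and denom > 0 else float("-inf")

def scoreSequence(words, tagSeq):
    # Log-likelihood of one candidate tag sequence for a token sequence,
    # combining the initial, emission, and transition estimates from above.
    score = logprob(initialTags[tagSeq[0]], total)
    score += logprob(tokenTags[words[0]][tagSeq[0]], tagCounter[tagSeq[0]])
    for i in range(1, len(words)):
        score += logprob(tagTags[tagSeq[i - 1]][tagSeq[i]], tagCounter[tagSeq[i - 1]])
        score += logprob(tokenTags[words[i]][tagSeq[i]], tagCounter[tagSeq[i]])
    return score

def bestTagSequence(words):
    # Enumerate all tag combinations observed for each token and keep the
    # highest-scoring one (brute force; a Viterbi search would be more efficient).
    candidates = [list(tokenTags[w].keys()) or ["NN"] for w in words]
    return max(product(*candidates), key=lambda seq: scoreSequence(words, seq))

print(bestTagSequence(["The", "cat", "is", "sleeping"]))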

In the next section we will discuss Hidden Markov Models (HMMs) for Part-of-Speech Tagging.


(C) 2016-2019 by Damir Cavar <dcavar@iu.edu>