Part-of-Speech Tagging using NLTK

One long-standing task in NLP is reliably identifying a word's part of speech. This helps with the ever-present task of identifying content words, but it is also useful in a variety of other analyses. Part-of-speech tagging is a specific instance of the larger task of word tagging: placing words into pre-determined categories.

Today we'll learn how to identify a word's part of speech and think through reasons we may want to do this.

Learning Goals:

  • Understand the intuition behind tagging and information extraction
  • Use NLTK to tag the part of speech of each word
  • Count most frequent words based on their part of speech

Outline

  • Key Terms
  • Further Resources
  • Part-of-Speech Tagging
  • Counting words based on their part of speech

Key Terms

  • part-of-speech tagging:
    • the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context
  • named entity recognition:
    • a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc
  • tree:
    • a data structure made up of nodes (or vertices) connected by edges, with no cycles
  • treebank:
    • a parsed text corpus that annotates syntactic or semantic sentence structure
  • tuple:
    • an immutable sequence of Python objects (see the short example after this list)
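
To make that last term concrete, here is a minimal sketch (not part of the tutorial's data) of a (word, tag) tuple like the ones we will build below:


In [ ]:
#a (word, tag) pair stored as a tuple, like the pairs nltk.pos_tag returns
example_pair = ('work', 'NN')
print(example_pair[0], example_pair[1])

#tuples are immutable: example_pair[0] = 'play' would raise a TypeError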

Further Resources

For more information on information extraction using NLTK, see chapter 7: http://www.nltk.org/book/ch07.html

Part-of-Speech Tagging

You may have noticed that stop words are typically short function words. Intuitively, if we could identify the part of speech of a word, we would have another way of identifying content words. NLTK can do that too!

NLTK has a function that will tag the part of speech of every token in a text. For this, we will re-create our original tokenized text sentence from the previous tutorial, with the stop words and punctuation intact.

NLTK uses the Penn Treebank Project's tag set to label the part of speech of each word. The tagging is deterministic: the same word in the same context always receives the same tag, based on a model trained on the Penn Treebank. You can find a list of all the part-of-speech tags here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
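
You can also look up what a tag means without leaving the notebook. The sketch below uses NLTK's built-in tag-set help; it assumes the optional 'tagsets' resource has been downloaded (run nltk.download('tagsets') once if it has not).


In [ ]:
#optional: print the Penn Treebank definition and examples for a tag
#assumes the 'tagsets' resource is installed (nltk.download('tagsets'))
import nltk
nltk.help.upenn_tagset('JJ')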


In [ ]:
import nltk
from nltk import word_tokenize

sentence = "For me it has to do with the work that gets done at the crossroads of \
digital media and traditional humanistic study. And that happens in two different ways. \
On the one hand, it's bringing the tools and techniques of digital media to bear \
on traditional humanistic questions; on the other, it's also bringing humanistic modes \
of inquiry to bear on digital media."

sentence_tokens = word_tokenize(sentence)

#check we did everything correctly
sentence_tokens

In [ ]:
#use the nltk pos function to tag the tokens
tagged_sentence_tokens = nltk.pos_tag(sentence_tokens)

#view tagged sentence
tagged_sentence_tokens

Now comes more complicated code. Stay with me. The output above is a list of tuples. A tuple is an immutable sequence of Python objects; in this case, each tuple is a pair of strings. Looping through a list of tuples works much like looping through a list, but the syntax for unpacking each tuple is slightly different.

Note that this is not a list of lists, as we saw in our lesson on Pandas. This is a list of tuples.
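
As a quick illustration of that syntax, each tuple can be unpacked directly in the loop header. This is a small sketch using the tagged_sentence_tokens list created above:


In [ ]:
#unpack each (word, tag) tuple directly in the for statement
for word, tag in tagged_sentence_tokens[:5]:
    print(word, tag)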

Let's pull out the part-of-speech tag from each tuple above and save that to a list. Notice the order stays exactly the same.


In [ ]:
word_tags = [tag for (word, tag) in tagged_sentence_tokens]
print(word_tags)

Question: What is the difference in syntax for the above code compared to our standard list comprehension code?

Counting words based on their part of speech

We can count the part-of-speech tags in much the same way we counted words, to find the most frequent types of words in our text. We can also count words based on their part of speech.

First, we count the frequency of each part-of-speech tag.


In [ ]:
tagged_frequency = nltk.FreqDist(word_tags)
tagged_frequency.most_common()

This sentence contains a lot of adjectives, so let's look at those first. Notice the syntax in the list comprehension below.


In [ ]:
adjectives = [word for (word,pos) in tagged_sentence_tokens if pos == 'JJ' or pos=='JJR' or pos=='JJS']

#print all of the adjectives
print(adjectives)

Let's do the same for nouns.


In [ ]:
nouns = [word for (word,pos) in tagged_sentence_tokens if pos=='NN' or pos=='NNS']

#print all of the nouns
print(nouns)

And now verbs.


In [ ]:
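#two equivalent ways to pull out the verbs: the commented-out line uses 'or', the active line checks membership in a list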
#verbs = [word for (word,pos) in tagged_sentence_tokens if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']
verbs = [word for (word,pos) in tagged_sentence_tokens if pos in ['VB', 'VBD','VBG','VBN','VBP','VBZ']]

#print all of the verbs
print(verbs)

In [ ]:
##Ex: Print the most frequent nouns, adjectives, and verbs in the sentence
######What does this tell us?
######Compare this to what we did earlier with removing stop words.
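
One possible starting point (a sketch, not the only approach) is to apply nltk.FreqDist to the lists built above:


In [ ]:
#a possible approach: count the noun, adjective, and verb lists from above
print(nltk.FreqDist(nouns).most_common(5))
print(nltk.FreqDist(adjectives).most_common(5))
print(nltk.FreqDist(verbs).most_common(5))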

In [ ]:
##Ex: Compare the most frequent part-of-speech used in two of the texts in our data folder
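
As a starting point for the last exercise, here is a hedged sketch of how the comparison might look. The file paths below are placeholders, not real files from this tutorial; substitute two texts from your own data folder.


In [ ]:
#a sketch only: 'data/text1.txt' and 'data/text2.txt' are placeholder paths
import nltk
from nltk import word_tokenize

for filename in ['data/text1.txt', 'data/text2.txt']:
    with open(filename, encoding='utf-8') as f:
        text = f.read()
    tokens = word_tokenize(text)
    tags = [tag for (word, tag) in nltk.pos_tag(tokens)]
    print(filename, nltk.FreqDist(tags).most_common(5))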