One long-standing task in NLP is reliably identifying a word's part of speech. This can help us with the ever-present task of identifying content words, but it is also useful in a variety of other analyses. Part-of-speech tagging is a specific instance of the larger category of word tagging: placing words into pre-determined categories.
Today we'll learn how to identify a word's part of speech and think through reasons we may want to do this.
For more information on information extraction using NLTK, see chapter 7: http://www.nltk.org/book/ch07.html
You may have noticed that stop words are typically short function words. Intuitively, if we could identify the part of speech of a word, we would have another way of identifying content words. NLTK can do that too!
NLTK has a function that will tag the part of speech of every token in a text. For this, we will re-create our original tokenized text sentence from the previous tutorial, with the stop words and punctuation intact.
NLTK uses the Penn Treebank Project's tag set to label the part of speech of each word. The default NLTK tagger is deterministic: given the same input, it always assigns the same tags, based on patterns found in the Penn Treebank. You can find a list of all the part-of-speech tags here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
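You can also look up what a tag means from inside NLTK with nltk.help.upenn_tagset(). Here is a minimal sketch; it assumes the 'tagsets' help data is installed (you may need to run nltk.download('tagsets') first).
In [ ]:
import nltk
#look up the definitions for the adjective tags; patterns are regular expressions, so 'JJ.*' covers JJ, JJR, and JJS
#nltk.download('tagsets')   #uncomment if the help data is not yet installed
nltk.help.upenn_tagset('JJ.*')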
In [ ]:
import nltk
from nltk import word_tokenize

#if you have not used them before, you may need to download the tokenizer and tagger models:
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
sentence = "For me it has to do with the work that gets done at the crossroads of \
digital media and traditional humanistic study. And that happens in two different ways. \
On the one hand, it's bringing the tools and techniques of digital media to bear \
on traditional humanistic questions; on the other, it's also bringing humanistic modes \
of inquiry to bear on digital media."
sentence_tokens = word_tokenize(sentence)
#check we did everything correctly
sentence_tokens
In [ ]:
#use the nltk pos function to tag the tokens
tagged_sentence_tokens = nltk.pos_tag(sentence_tokens)
#view tagged sentence
tagged_sentence_tokens
Now comes more complicated code. Stay with me. The above output is a list of tuples. A tuple is a sequence of Python objects; in this case, each tuple is a pair of strings: the word and its part-of-speech tag. Looping through a list of tuples is conceptually the same as looping through a list, but the syntax is slightly different.
Note that this is not a list of lists, as we saw in our lesson on Pandas. This is a list of tuples.
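To see that syntax in action, here is a quick sketch of a plain for loop over the tagged tokens, unpacking each (word, tag) pair as it goes. This is just an illustration; the list comprehension below does the same kind of unpacking more compactly.
In [ ]:
#loop through the list of (word, tag) tuples, unpacking each pair into two variables
for (word, tag) in tagged_sentence_tokens:
    print(word, tag)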
Let's pull out the part-of-speech tag from each tuple above and save the tags to a list. Notice the order stays exactly the same.
In [ ]:
word_tags = [tag for (word, tag) in tagged_sentence_tokens]
print(word_tags)
Question: What is the difference in syntax for the above code compared to our standard list comprehension code?
We can count the part-of-speech tags in much the same way we counted words, to see the most frequent types of words in our text. We can also count words based on their part of speech, as sketched a bit further below.
First, we count the frequency of each part-of-speech tag.
In [ ]:
tagged_frequency = nltk.FreqDist(word_tags)
tagged_frequency.most_common()
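To count words grouped by their part of speech, one option is NLTK's ConditionalFreqDist, which keeps a separate frequency distribution for each tag. Here is a minimal sketch (the variable name pos_words is our own, not part of the tutorial):
In [ ]:
#build one frequency distribution of words per part-of-speech tag
pos_words = nltk.ConditionalFreqDist((pos, word) for (word, pos) in tagged_sentence_tokens)

#the most common words tagged as singular nouns
pos_words['NN'].most_common()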
This sentence contains a lot of adjectives, so let's look at the adjectives first. Notice the syntax here.
In [ ]:
adjectives = [word for (word, pos) in tagged_sentence_tokens if pos == 'JJ' or pos == 'JJR' or pos == 'JJS']
#print all of the adjectives
print(adjectives)
Let's do the same for nouns.
In [ ]:
nouns = [word for (word, pos) in tagged_sentence_tokens if pos == 'NN' or pos == 'NNS']
#print all of the nouns
print(nouns)
And now verbs.
In [ ]:
#the verbose way, checking each verb tag separately:
#verbs = [word for (word,pos) in tagged_sentence_tokens if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']
#the same thing more compactly, using 'in' with a list of tags:
verbs = [word for (word, pos) in tagged_sentence_tokens if pos in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']]
#print all of the verbs
print(verbs)
In [ ]:
##Ex: Print the most frequent nouns, adjectives, and verbs in the sentence
######What does this tell us?
######Compare this to what we did earlier with removing stop words.
In [ ]:
##Ex: Compare the most frequent parts of speech used in two of the texts in our data folder