Part-of-speech (POS) tagging identifies the grammatical role each word plays in a particular sentence. With NLTK we can get the POS tags for a sentence and then pull out only a particular type of word (say, only the nouns) to learn something about a document that contains many sentences. Here we will extract the nouns from the book "Moby Dick", build a frequency distribution over them, and look at the most common words in that distribution.
In [1]:
# Import NLTK
import nltk
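If NLTK's data packages are not already installed, the tokenizer, tagger, and corpus calls below will raise a LookupError. A minimal sketch of the one-time downloads this notebook relies on (the resource names are an assumption for recent NLTK releases; newer versions may use slightly different tagger resource names):
In [ ]:
# One-time data downloads; safe to re-run
# (resource names assume a recent NLTK version)
nltk.download('punkt')                       # tokenizer models used by nltk.word_tokenize
nltk.download('averaged_perceptron_tagger')  # default tagger used by nltk.pos_tag
nltk.download('universal_tagset')            # mapping used for tagset='universal'
nltk.download('gutenberg')                   # corpus containing melville-moby_dick.txt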
In [2]:
# the original text
text = "I walked to the cafe to buy coffee after work."
In [3]:
# Tokenizing the sentences into words
tokens = nltk.word_tokenize(text)
In [4]:
# Call nltk.pos_tag() to get the POS tags of the tokens
nltk.pos_tag(tokens)
Out[4]:
In [5]:
# To understand the POS tag abbreviations shown above, try the command below
nltk.help.upenn_tagset()
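The full listing is long; upenn_tagset() also accepts a tag pattern, so a single tag's definition can be looked up directly:
In [ ]:
# Look up the definition of a single tag, e.g. NN
nltk.help.upenn_tagset('NN')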
In [6]:
# Now let's see how the same word is tagged differently depending on the context of two different sentences
# Observe the word 'desert'
# Sentence 1
nltk.pos_tag(nltk.word_tokenize('I will have a desert'))
Out[6]:
In [7]:
# Sentence 2
nltk.pos_tag(nltk.word_tokenize('They will desert us.'))
Out[7]:
In [8]:
# From the two sentences above it is clear that
# in the first sentence 'desert' is used as a course of a meal, so it is tagged NN (noun)
# in the second sentence 'desert' means to abandon someone, so it is tagged VB (verb)
In [9]:
# Now let's work on the 'Moby Dick' book and use POS tags to get information about it
# Get the words of the Moby Dick book
md = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
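If you are unsure of the exact file name, the Gutenberg corpus can list everything it contains:
In [ ]:
# List the books available in NLTK's Gutenberg sample
nltk.corpus.gutenberg.fileids()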
In [10]:
# Normalize: keep only purely alphabetic words and convert them to lower case
md_norm = [word.lower() for word in md if word.isalpha()]
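To see how much the normalization dropped (punctuation, numbers, and so on), it can be worth comparing the token counts before and after:
In [ ]:
# Compare token counts before and after normalization
len(md), len(md_norm)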
In [11]:
# Rather than the Penn Treebank shortcuts, let's get the universal abbreviations like NOUN, VERB, etc.
md_tags = nltk.pos_tag(md_norm, tagset='universal')
In [12]:
# Print the first five (word, tag) pairs; note that the tags are now in universal form
md_tags[:5]
Out[12]:
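Before filtering for nouns, it can be useful to see how the universal tags are distributed over the whole book; a FreqDist over just the tags gives a quick overview:
In [ ]:
# Distribution of universal POS tags across the whole book
tag_fd = nltk.FreqDist(tag for (word, tag) in md_tags)
tag_fd.most_common()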
In [13]:
# Keep only the nouns: each item is a (word, tag) tuple, so take the word wherever tag == 'NOUN'
md_nouns = [word for (word, tag) in md_tags if tag == 'NOUN']
In [14]:
# Print the first 10 nouns we identified
md_nouns[:10]
Out[14]:
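The same pattern works for any other universal tag; for example, extracting the verbs instead:
In [ ]:
# Extract the verbs in the same way
md_verbs = [word for (word, tag) in md_tags if tag == 'VERB']
md_verbs[:10]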
In [15]:
# Now use a frequency distribution to find the most used nouns
md_nouns_fd = nltk.FreqDist(md_nouns)
# Get the 10 most common nouns
md_nouns_fd.most_common(10)
# Nouns like 'whale', 'man', and 'sea' suggest the book is largely about men sailing the sea
Out[15]:
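FreqDist can also plot the distribution directly (this assumes matplotlib is installed):
In [ ]:
# Plot the counts of the 10 most common nouns (requires matplotlib)
md_nouns_fd.plot(10)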