Introduction to NLTK

We have seen how to do some basic text processing in Python; now we introduce an open-source framework for natural language processing that can further help us work with human languages: NLTK (the Natural Language Toolkit).

Tokenise a text

Let's start with a simple text in a Python string:


In [1]:
sampleText1 = "The Elephant's 4 legs: THE Pub! You can't believe it or can you, the believer?"
sampleText2 = "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."

Tokens

The basic atomic parts of each text are the tokens. A token is the NLP name for a sequence of characters that we want to treat as a group. We have seen how we can extract tokens by splitting the text at the blank spaces.
NLTK has a function word_tokenize() for this:


In [2]:
import nltk
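
Note that the tokenisers, stop word lists, WordNet lemmatiser and PoS tagger used below rely on data packages that NLTK installs separately; if they are missing on your machine (an assumption about your local setup), something like the following should fetch them:


In [ ]:
# Download the NLTK data packages used in this notebook (only needed once)
nltk.download('punkt')                       # tokeniser models
nltk.download('stopwords')                   # stop word lists
nltk.download('wordnet')                     # WordNet, used by the lemmatiser
nltk.download('averaged_perceptron_tagger')  # PoS tagger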

In [3]:
s1Tokens = nltk.word_tokenize(sampleText1)
s1Tokens


Out[3]:
['The',
 'Elephant',
 "'s",
 '4',
 'legs',
 ':',
 'THE',
 'Pub',
 '!',
 'You',
 'ca',
 "n't",
 'believe',
 'it',
 'or',
 'can',
 'you',
 ',',
 'the',
 'believer',
 '?']

In [4]:
len(s1Tokens)


Out[4]:
21

21 tokens are extracted, which include words and punctuation.
Note that the tokens are different from what a split by blank spaces would have obtained, e.g. "can't" is considered by NLTK as TWO tokens, "ca" and "n't" (= "not"), while a tokeniser that splits text by spaces would consider it a single token: "can't" (a quick comparison is sketched below).
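
Here is that comparison (a minimal sketch; output not shown): splitting the same string by blank spaces keeps the punctuation attached to the words.


In [ ]:
sampleText1.split(' ')  # e.g. "can't" and "legs:" stay as single tokens
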
Let's see another example:


In [5]:
s2Tokens = nltk.word_tokenize(sampleText2)
s2Tokens


Out[5]:
['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

And we can apply it to an entire book, "The Prince" by Machiavelli, which we used last time:


In [8]:
# If you would like to work with the raw text you can use 'bookRaw'
with open('../datasets/ThePrince.txt', 'r') as f:
    bookRaw = f.read()

In [9]:
bookTokens = nltk.word_tokenize(bookRaw)
bookText = nltk.Text(bookTokens) # wrap the tokens in nltk.Text, NLTK's special format for text analysis
nBookTokens= len(bookTokens) # or alternatively len(bookText)

In [11]:
print ("*** Analysing book ***")    
print ("The book is {} chars long".format (len(bookRaw)))
print ("The book has {} tokens".format (nBookTokens))


*** Analysing book ***
The book is 300814 chars long
The book has 59792 tokens

As mentioned above, the NLTK tokeniser works in a more sophisticated way than simply splitting by spaces, which is why this time we get more tokens.
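
A quick way to check this (a small sketch; output not shown) is to count the tokens a naive whitespace split would produce:


In [ ]:
len(bookRaw.split())  # compare with nBookTokens above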

Sentences

NLTK also has a function to tokenise a text not into words but into sentences.


In [12]:
text1 = "This is the first sentence. A liter of milk in the U.S. costs $0.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text1)
len(sentences)


Out[12]:
4

In [13]:
sentences


Out[13]:
['This is the first sentence.',
 'A liter of milk in the U.S. costs $0.99.',
 'Is this the third sentence?',
 'Yes, it is!']

As you can see, it does not split simply after each full stop, but checks whether the full stop is part of an acronym (U.S.) or a number (0.99).
It also correctly splits sentences after question or exclamation marks, but not after commas.


In [14]:
sentences = nltk.sent_tokenize(bookRaw) # extract sentences
nSent = len(sentences)
print ("The book has {} sentences".format (nSent))
print ("and each sentence has in average {} tokens".format (nBookTokens / nSent))


The book has 1416 sentences
and each sentence has on average 42.22598870056497 tokens

Most common tokens

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

The NLTK FreqDist class is used to encode “frequency distributions”, which count the number of times that something occurs, for example a token.

Its most_common() method then returns a list of tuples where each tuple is of the form (token, frequency). The list is sorted in descending order of frequency.


In [15]:
def get_top_words(tokens):
    # Calculate the frequency distribution of the tokens
    fdist = nltk.FreqDist(tokens)
    return fdist.most_common()

In [17]:
topBook = get_top_words(bookTokens)
# Output top 20 words
topBook[:20]


Out[17]:
[(',', 4192),
 ('the', 2954),
 ('to', 2081),
 ('and', 1794),
 ('of', 1772),
 ('.', 1397),
 ('in', 946),
 ('he', 844),
 ('a', 759),
 ('that', 735),
 ('his', 631),
 ('not', 562),
 ('it', 537),
 (';', 531),
 ('by', 495),
 ('with', 491),
 ('be', 467),
 ('is', 436),
 ('they', 422),
 ('him', 416)]

The comma is the most common token: we need to remove the punctuation.

Most common alphanumeric tokens

We can use isalpha() to check if the token is a word and not punctuation.


In [18]:
topWords = [(freq, word) for (word,freq) in topBook if word.isalpha() and freq > 400]
topWords


Out[18]:
[(2954, 'the'),
 (2081, 'to'),
 (1794, 'and'),
 (1772, 'of'),
 (946, 'in'),
 (844, 'he'),
 (759, 'a'),
 (735, 'that'),
 (631, 'his'),
 (562, 'not'),
 (537, 'it'),
 (495, 'by'),
 (491, 'with'),
 (467, 'be'),
 (436, 'is'),
 (422, 'they'),
 (416, 'him'),
 (409, 'for')]

We can also remove any capital letters before tokenising:


In [19]:
def preprocessText(text, lowercase=True):
    if lowercase:
        tokens = nltk.word_tokenize(text.lower())
    else:
        tokens = nltk.word_tokenize(text)

    return [word for word in tokens if word.isalpha()]

In [20]:
bookWords = preprocessText(bookRaw)

In [21]:
topBook = get_top_words(bookWords)
# Output top 20 words
topBook[:20]


Out[21]:
[('the', 3110),
 ('to', 2108),
 ('and', 1938),
 ('of', 1802),
 ('in', 993),
 ('he', 921),
 ('a', 781),
 ('that', 745),
 ('his', 640),
 ('it', 586),
 ('not', 566),
 ('by', 506),
 ('with', 497),
 ('be', 471),
 ('for', 443),
 ('they', 442),
 ('is', 438),
 ('him', 417),
 ('have', 390),
 ('was', 380)]

In [22]:
print ("*** Analysing book ***")    
print ("The text has now {} words (tokens)".format (len(bookWords)))


*** Analysing book ***
The text has now 52202 words (tokens)

Now we have removed the punctuation and the capital letters, but the most common token is "the", which is not a very significant word ...
As we saw last time, these are so-called stop words: words that are very common and are normally stripped from a text when doing this kind of analysis.

Meaningful most common tokens

A simple approach could be to filter the tokens that have a length greater than 5 and a frequency of more than 80.


In [23]:
meaningfulWords = [word for (word,freq) in topBook if len(word) > 5 and freq > 80]
sorted(meaningfulWords)


Out[23]:
['against',
 'always',
 'because',
 'castruccio',
 'having',
 'himself',
 'others',
 'people',
 'prince',
 'project',
 'should',
 'therefore']

This would work, but it would also leave out tokens such as I and you, which are actually significant.
The better approach, as we have seen earlier, is to remove stop words using external lists of stop words.
NLTK has a corpus of stop words in several languages:


In [24]:
from nltk.corpus import stopwords

In [25]:
stopwordsEN = set(stopwords.words('english')) # english language

In [26]:
betterWords = [w for w in bookWords if w not in stopwordsEN]

In [27]:
topBook = get_top_words(betterWords)
# Output top 20 words
topBook[:20]


Out[27]:
[('one', 302),
 ('prince', 222),
 ('would', 165),
 ('men', 163),
 ('castruccio', 142),
 ('people', 116),
 ('may', 110),
 ('many', 101),
 ('others', 96),
 ('time', 95),
 ('ought', 94),
 ('therefore', 92),
 ('duke', 91),
 ('great', 89),
 ('project', 87),
 ('state', 86),
 ('always', 81),
 ('man', 80),
 ('without', 79),
 ('new', 75)]

Now we have excluded words such as the, but we can improve the list further by looking at semantically similar words, such as singular and plural versions of the same word.


In [28]:
'princes' in betterWords


Out[28]:
True

In [29]:
betterWords.count("prince") + betterWords.count("princes")


Out[29]:
281

Stemming

Above, in the list of words we have both prince and princes, which are respectively the singular and plural versions of the same word (the stem). The same happens with verb conjugation (love and loving are counted as different words, but are actually inflections of the same verb).
A stemmer is a tool that reduces such inflectional forms to their stem, base or root form, and NLTK has several of them (each with a different heuristic algorithm).


In [30]:
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1


Out[30]:
['list', 'listed', 'lists', 'listing', 'listings']

And now we apply one of the NLTK stemmers, the Porter stemmer:


In [31]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]


Out[31]:
['list', 'list', 'list', 'list', 'list']

As you can see, all 5 different words have been reduced to the same stem and would now be treated as the same lexical token.


In [32]:
stemmedWords = [porter.stem(w) for w in betterWords]
topBook = get_top_words(stemmedWords)
topBook[:20]  # Output top 20 words


Out[32]:
[('one', 316),
 ('princ', 281),
 ('would', 165),
 ('men', 163),
 ('castruccio', 142),
 ('state', 137),
 ('time', 129),
 ('peopl', 118),
 ('may', 110),
 ('work', 108),
 ('great', 106),
 ('mani', 101),
 ('other', 96),
 ('ought', 94),
 ('duke', 92),
 ('therefor', 92),
 ('arm', 92),
 ('make', 90),
 ('project', 87),
 ('wish', 83)]

Now the word princ is counted 281 times, exactly the sum of the counts of prince and princes.

A note here: stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and it often includes the removal of derivational affixes (a quick extra check of this is sketched below).
Prince and princes both become princ.
A different flavour is lemmatisation, which we will see in a moment, but first a note about stemming in languages other than English.
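
Here is that quick check (a small sketch reusing the porter stemmer defined above; output not shown): a handful of derivationally related words should all collapse to a single stem.


In [ ]:
# These related forms should all reduce to a common stem such as 'connect'
[porter.stem(w) for w in ['connect', 'connected', 'connection', 'connections', 'connective']]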

Stemming in other languages

Snowball is an improvement created by Porter: a small language for building stemmers, with rules for many more languages than English. For example, Italian:


In [33]:
from nltk.stem.snowball import SnowballStemmer
stemmerIT = SnowballStemmer("italian")

In [34]:
inputIT = "Io ho tre mele gialle, tu hai una mela gialla e due pere verdi"
wordsIT = inputIT.split(' ')

In [35]:
[stemmerIT.stem(w) for w in wordsIT]


Out[35]:
['io',
 'ho',
 'tre',
 'mel',
 'gialle,',
 'tu',
 'hai',
 'una',
 'mel',
 'giall',
 'e',
 'due',
 'per',
 'verd']
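
Note that gialle, kept its trailing comma because we split the sentence by spaces instead of using a proper tokeniser. The full list of languages covered by the Snowball stemmer can be inspected directly (a quick check; output not shown):


In [ ]:
print(SnowballStemmer.languages)  # languages for which Snowball provides rules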

Lemma

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
While a stemmer operates on a single word without knowledge of the context, a lemmatiser can take the context into consideration.

NLTK also has a built-in lemmatiser, so let's see it in action:


In [36]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [37]:
words1


Out[37]:
['list', 'listed', 'lists', 'listing', 'listings']

In [38]:
[lemmatizer.lemmatize(w, 'n') for w in words1] # n = nouns


Out[38]:
['list', 'listed', 'list', 'listing', 'listing']

We tell the lemmatiser that the words are nouns. In this case it maps words such as list (singular noun) and lists (plural noun) to the same lemma, but leaves the other words as they are.


In [39]:
[lemmatizer.lemmatize(w, 'v') for w in words1] # v = verbs


Out[39]:
['list', 'list', 'list', 'list', 'list']

We get a different result if we say that the words are verbs.
They all have the same lemma; in fact, they could all be different inflections or conjugations of the same verb.

The word types that can be passed to the lemmatiser are:
'n' = noun, 'v' = verb, 'a' = adjective, 'r' = adverb


In [40]:
words2 = ['good', 'better']

In [41]:
[porter.stem(w) for w in words2]


Out[41]:
['good', 'better']

In [42]:
[lemmatizer.lemmatize(w, 'a') for w in words2]


Out[42]:
['good', 'good']

It works across different adjectives: it doesn't look only at prefixes and suffixes.
You might wonder why stemmers are used at all instead of always using lemmatisers: stemmers are much simpler, smaller and faster, and for many applications good enough.

Now we lemmatise the book:


In [43]:
lemmatisedWords = [lemmatizer.lemmatize(w, 'n') for w in betterWords]
topBook = get_top_words(lemmatisedWords)
topBook[:20]  # Output top 20 words


Out[43]:
[('one', 316),
 ('prince', 281),
 ('would', 165),
 ('men', 163),
 ('castruccio', 142),
 ('state', 130),
 ('time', 129),
 ('people', 118),
 ('may', 110),
 ('work', 103),
 ('many', 101),
 ('others', 96),
 ('ought', 94),
 ('duke', 92),
 ('therefore', 92),
 ('great', 89),
 ('project', 87),
 ('way', 83),
 ('make', 81),
 ('always', 81)]

Yes, the lemma is now prince.
But note that we treated every word in the book as a noun, while a proper approach would be to apply the correct word type to each single word (a possible sketch of this is shown below).
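
A possible sketch of that (it uses nltk.pos_tag, which is introduced in the next section, and a hypothetical penn_to_wordnet helper to map Penn Treebank tags to the 'n'/'v'/'a'/'r' letters expected by the lemmatiser):


In [ ]:
def penn_to_wordnet(tag):
    # Hypothetical helper: map a Penn Treebank tag to a WordNet PoS letter
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # default to noun

# Lemmatise each word using the PoS estimated by the tagger
posLemmatisedWords = [lemmatizer.lemmatize(word, penn_to_wordnet(tag))
                      for (word, tag) in nltk.pos_tag(betterWords)]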

Part of speech (PoS)

In traditional grammar, a part of speech (abbreviated form: PoS or POS) is a category of words which have similar grammatical properties. 

For example, an adjective (red, big, quiet, ...) describes properties, while a verb (throw, walk, have) describes actions or states.

Commonly listed parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection.


In [44]:
text1 = "Children shouldn't drink a sugary drink before bed."
tokensT1 = nltk.word_tokenize(text1)
nltk.pos_tag(tokensT1)


Out[44]:
[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]

The NLTK function pos_tag() will tag each token with the estimated PoS, using the Penn Treebank tagset by default. You can check the meaning of each tag using the NLTK help function:


In [45]:
nltk.help.upenn_tagset('RB')


RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...

Which are the most common PoS tags in The Prince?


In [46]:
tokensAndPos = nltk.pos_tag(bookTokens)
posList = [thePOS for (word, thePOS) in tokensAndPos]
fdistPos = nltk.FreqDist(posList)
fdistPos.most_common(5)


Out[46]:
[('IN', 7218), ('NN', 5992), ('DT', 5374), (',', 4192), ('PRP', 3489)]

In [47]:
nltk.help.upenn_tagset('IN')


IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...

It's not nouns (NN) but IN, i.e. prepositions and subordinating conjunctions.
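
As an aside (a small sketch; it assumes the extra universal_tagset resource has been downloaded, e.g. via nltk.download('universal_tagset')), pos_tag can also map its output to a coarser "universal" tagset, which is often easier to read:


In [ ]:
universalTags = [tag for (word, tag) in nltk.pos_tag(bookTokens, tagset='universal')]
nltk.FreqDist(universalTags).most_common(5)  # coarse tags such as NOUN, VERB, ADP, ...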

Extra note: Parsing the grammar structure

Words can be ambiguous, and sometimes it is not easy to tell which PoS a word is; for example, in the sentence "visiting aunts can be a nuisance", is visiting a verb or an adjective?
Tagging a PoS depends on the context, which can be ambiguous.

Making sense of a sentence is easier if it follows a well-defined grammatical structure, such as: subject + verb + object.
NLTK allows us to define a formal grammar which can then be used to parse a text. The NLTK ChartParser is a procedure for finding one or more trees (sentences have an internal organisation that can be represented as a tree) corresponding to a grammatically well-formed sentence.


In [48]:
# Parsing sentence structure
text2 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text2)
for tree in trees:
    print(tree)


(S (NP Alice) (VP (V loves) (NP Bob)))

This is a "toy grammar," a small grammar that illustrate the key aspects of parsing. But there is an obvious question as to whether the approach can be scaled up to cover large corpora of natural languages. How hard would it be to construct such a set of productions by hand? In general, the answer is: very hard.
Nevertheless, there are efforts to develop broad-coverage grammars, such as weighted and probabilistic grammars.
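
As a tiny glimpse of that probabilistic flavour, the toy grammar above can be rewritten as a PCFG and parsed with NLTK's ViterbiParser (a minimal sketch; the probabilities here are made up purely for illustration):


In [ ]:
# Each left-hand side's rule probabilities must sum to 1
toyPCFG = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
VP -> V NP [1.0]
NP -> 'Alice' [0.5] | 'Bob' [0.5]
V -> 'loves' [1.0]
""")
viterbiParser = nltk.ViterbiParser(toyPCFG)
for tree in viterbiParser.parse(nltk.word_tokenize("Alice loves Bob")):
    print(tree)  # prints the tree together with its probability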

The world outside NLTK

As a final note, NLTK was used here for educational purposes, but you should be aware that it has its own limitations.
NLTK is a solid library, but it is old and slow. In particular, NLTK's lemmatisation functionality is slow enough that it can become the bottleneck in almost any application that uses it.

For industrial NLP applications, a very performance-minded Python library is spaCy (spacy.io) instead.
And for robust multilingual support there is polyglot, which has much wider language coverage than any of the above.
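
As a small taste of that (a minimal sketch, assuming spaCy is installed and its small English model en_core_web_sm has been downloaded with python -m spacy download en_core_web_sm):


In [ ]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Children shouldn't drink a sugary drink before bed.")
# Tokenisation, lemmatisation and PoS tagging happen in a single pipeline pass
[(token.text, token.lemma_, token.pos_) for token in doc]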

Other tools exist in other programming languages, such as Stanford CoreNLP and Apache OpenNLP, both in Java.

