Preprocessing with NLTK

In this notebook, we will use NLTK to preprocess text documents. NLTK is a widely used library for Natural Language Processing. In addition to many built-in capabilities, it provides interfaces to many corpora and other libraries, as well as a good online textbook/cookbook (http://www.nltk.org/book/).

Installing NLTK

You need to install the NLTK package and download the data it uses.

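A minimal setup sketch, assuming installation from PyPI; nltk.download('book') fetches the corpora and models that the nltk.book module relies on.


In [ ]:
# One-time setup: install the package (run the pip command in a shell, or
# prefix it with '!' inside Jupyter), then download the data for this notebook.
#   pip install nltk
import nltk
nltk.download('book')   # corpora and models used by nltk.book
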
If everything is installed correctly, you should be able to run the following code:


In [195]:
from pprint import pprint
import nltk
from nltk.book import *

In [145]:
doc = text6
print(doc)


<Text: Monty Python and the Holy Grail>

Examining the text


In [146]:
word = 'swallow'
doc.concordance(word, width=100, lines=30)


Displaying 10 of 10 matches:
ell , this is a temperate zone . ARTHUR : The swallow may fly south with the sun or the house marti
hey could be carried . SOLDIER # 1 : What ? A swallow carrying a coconut ? ARTHUR : It could grip i
In order to maintain air - speed velocity , a swallow needs to beat its wings forty - three times e
LDIER # 2 : It could be carried by an African swallow ! SOLDIER # 1 : Oh , yeah , an African swallo
wallow ! SOLDIER # 1 : Oh , yeah , an African swallow maybe , but not a European swallow . That ' s
an African swallow maybe , but not a European swallow . That ' s my point . SOLDIER # 2 : Oh , yeah
ing Arthur and Sir Bedevere , not more than a swallow ' s flight away , had discovered something . 
scovered something . Oh , that ' s an unladen swallow ' s flight , obviously . I mean , they were m
hat is the air - speed velocity of an unladen swallow ? ARTHUR : What do you mean ? An African or E
R : What do you mean ? An African or European swallow ? BRIDGEKEEPER : Huh ? I -- I don ' t know th

In [147]:
word = 'live'
doc.concordance(word, width=100, lines=30)


Displaying 6 of 6 matches:
NNIS : Man ! ARTHUR : Man . Sorry . What knight live in that castle over there ? DENNIS : I ' m thir
ste . Who lives in that castle ? WOMAN : No one live there . ARTHUR : Then who is your lord ? WOMAN 
hee ha ! Ha ha ha ha ... ARTHUR : Where does he live ? OLD MAN : ... Heh heh heh heh ... ARTHUR : Ol
eh heh heh ... ARTHUR : Old man , where does he live ? OLD MAN : ... Hee ha ha ha . He knows of a ca
eee - wom ! ARTHUR : Those who hear them seldom live to tell the tale ! HEAD KNIGHT : The Knights Wh
 ,-- HERBERT : Herbert . FATHER : ' Erbert . We live in a bloody swamp . We need all the land we can

In [148]:
word = 'lived'
doc.concordance(word, width=100, lines=30)


Displaying 1 of 1 matches:
o cruel that no man yet has fought with it and lived ! Bones of full fifty men lie strewn about its

In [149]:
word = 'CASTLE' 
doc.concordance(word, width=100, lines=30)


Displaying 25 of 25 matches:
I , Arthur , son of Uther Pendragon , from the castle of Camelot . King of the Britons , defeator of
RTHUR : Man . Sorry . What knight live in that castle over there ? DENNIS : I ' m thirty - seven . A
 . I am Arthur , King of the Britons . Who ' s castle is that ? WOMAN : King of the who ? ARTHUR : T
ood people . I am in haste . Who lives in that castle ? WOMAN : No one live there . ARTHUR : Then wh
se are my Knights of the Round Table . Who ' s castle is this ? FRENCH GUARD : This is the castle of
 s castle is this ? FRENCH GUARD : This is the castle of my master Guy de Loimbard . ARTHUR : Go and
ill not show us the Grail , we shall take your castle by force ! FRENCH GUARD : You don ' t frighten
 DIRECTOR : Action ! HISTORIAN : Defeat at the castle seems to have utterly disheartened King Arthur
T : Welcome gentle Sir Knight . Welcome to the Castle Anthrax . GALAHAD : The Castle Anthrax ? ZOOT 
 Welcome to the Castle Anthrax . GALAHAD : The Castle Anthrax ? ZOOT : Yes . Oh , it ' s not a very 
nd nineteen - and - a - half , cut off in this castle with no one to protect us . Oooh . It is a lon
seek the Grail ! I have seen it , here in this castle ! DINGO : Oh no . Oh , no ! Bad , bad Zoot ! G
n , and she must pay the penalty . And here in Castle Anthrax , we have but one punishment for setti
swamp . Other kings said I was daft to build a castle on a swamp , but I built it all the same , jus
 what you ' re gonna get , lad : the strongest castle in these islands . HERBERT : But I don ' t wan
nd rescue me . I am in the Tall Tower of Swamp Castle .' At last ! A call ! A cry of distress ! This
hen . Shall I , sir ? Yeah SCENE 16 : [ inside castle ] PRINCESS LUCKY and GIRLS : [ giggle giggle g
and GIRLS : [ giggle giggle giggle ] [ outside castle ] GUEST : ' Morning ! SENTRY # 1 : ' Morning .
ight of King Arthur , sir . FATHER : Very nice castle , Camelot . Uh , very good pig country ... LAU
 pure of spirit may find the Holy Grail in the Castle of uuggggggh '. ARTHUR : What ? MAYNARD : '...
uggggggh '. ARTHUR : What ? MAYNARD : '... the Castle of uuggggggh '. BEDEVERE : What is that ? MAYN
inging stops ] [ ethereal music ] ARTHUR : The Castle Aaagh . Our quest is at an end ! God be praise
 of Camelot , to open the doors of this sacred castle , to which God Himself has guided us ! FRENCH 
f the Lord , we demand entrance to this sacred castle ! FRENCH GUARD : No chance , English bed - wet
you do not open this door , we shall take this castle by force ! [ splat ] In the name of God and th

Creating your own text objects


In [150]:
from nltk.text import Text
from nltk import word_tokenize # sentence => words
from nltk import sent_tokenize # document => sentences

#str1 = "to be or not to BE? That's a question. "
str1 = "To be or not to BE?\n That's a question. "

tokens = word_tokenize(str1)
doc2 = Text(tokens)

Getting the statistics

  • Document length
  • Vocabulary size
  • tf(token), i.e., how many times a token occurs (a sketch with a normalized variant follows this list)

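A quick sketch of tf(token), assuming it denotes the raw count of a token in the document, optionally normalized by document length; the cells below compute the other statistics directly:


In [ ]:
# tf(token): raw count of a token, plus a length-normalized variant
# (the normalization convention is an assumption, not used elsewhere here).
token = 'swallow'
raw_tf = doc.count(token)
rel_tf = raw_tf / len(doc)
print(raw_tf, rel_tf)
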
In [151]:
print('# of tokens = {}'.format(len(doc)))
print('# of unique tokens = {}'.format(len(set(doc))))


# of tokens = 16967
# of unique tokens = 2166

In [152]:
print(doc2)
print('# of tokens = {}'.format(len(doc2)))
print('# of unique tokens = {}'.format(len(set(doc2))))
print(sorted(set(doc2)))
doc2.concordance('be', width=100, lines=30)


<Text: To be or not to BE ? That...>
# of tokens = 12
# of unique tokens = 12
["'s", '.', '?', 'BE', 'That', 'To', 'a', 'be', 'not', 'or', 'question', 'to']
Displaying 2 of 2 matches:
                                              To be or not to BE ? That 's a question .
                                 To be or not to BE ? That 's a question .

In [153]:
print(doc2.count('to'))
print(doc2.count('To'))
print(doc2.count('be'))
print(doc2.count('BE'))
print(doc2.count('bE'))


1
1
1
1
0

In [154]:
occ = doc2.index('to')
print(occ)


4

In [155]:
half_window = 3
doc2[occ - half_window : occ + half_window +1]


Out[155]:
['be', 'or', 'not', 'to', 'BE', '?', 'That']

In [156]:
fd = FreqDist(doc)
fd.most_common(20)


Out[156]:
[(':', 1197),
 ('.', 816),
 ('!', 801),
 (',', 731),
 ("'", 421),
 ('[', 319),
 (']', 312),
 ('the', 299),
 ('I', 255),
 ('ARTHUR', 225),
 ('?', 207),
 ('you', 204),
 ('a', 188),
 ('of', 158),
 ('--', 148),
 ('to', 144),
 ('s', 141),
 ('and', 135),
 ('#', 127),
 ('...', 118)]

In [157]:
words = ['the', 'knight', 'swallow']
print( [fd[word] for word in words] )


[299, 5, 10]

Collocations and n-grams

Collocations are good for getting a quick glimpse of what a text is about. The collocations(num=VAL) method returns multi-word expressions that commonly co-occur in the text. Note that a collocation is not necessarily made up of individually frequent words.


In [158]:
doc.collocations(num=30)


BLACK KNIGHT; clop clop; HEAD KNIGHT; mumble mumble; Holy Grail;
squeak squeak; FRENCH GUARD; saw saw; Sir Robin; Run away; CARTOON
CHARACTER; King Arthur; Iesu domine; Pie Iesu; DEAD PERSON; Round
Table; clap clap; OLD MAN; dramatic chord; dona eis; eis requiem; LEFT
HEAD; FRENCH GUARDS; music stops; Sir Launcelot; MIDDLE HEAD; RIGHT
HEAD; Sir Galahad; angels sing; Arthur music

nltk.ngrams(text, n) returns a generator of all n-grams in text.


In [159]:
print(list(nltk.ngrams(doc, 3))[:10])


[('SCENE', '1', ':'), ('1', ':', '['), (':', '[', 'wind'), ('[', 'wind', ']'), ('wind', ']', '['), (']', '[', 'clop'), ('[', 'clop', 'clop'), ('clop', 'clop', 'clop'), ('clop', 'clop', ']'), ('clop', ']', 'KING')]
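
Because nltk.ngrams returns a generator, it can be fed straight into a FreqDist to count the most common n-grams (a small sketch, not part of the original run):


In [ ]:
# Count the most frequent trigrams; the generator is exhausted after one pass,
# so recreate it if you need to iterate again.
trigram_fd = FreqDist(nltk.ngrams(doc, 3))
trigram_fd.most_common(5)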

Stopwords and Regular Expressions


In [160]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

print(len(text6))
print(len(set(text6)))
new_text6 = [w for w in text6 if w not in stopwords]
print(len(new_text6))
print(len(set(new_text6)))


16967
2166
13288
2034

In [161]:
import re

newer_text6 = [w for w in new_text6 if re.search('^ab',w)]
print(len(newer_text6))
print(newer_text6)


3
['able', 'able', 'absolutely']

Tokenization and Lemmatization

In the following, we demonstrate

  • sentence segmentation
  • tokenization
  • normalization
  • stemming
  • lemmatization

In [162]:
raw = " ".join(list(doc2))
print(raw)


To be or not to BE ? That 's a question .

In [163]:
sentences = sent_tokenize(raw)
sentences


Out[163]:
['To be or not to BE ?', "That 's a question ."]

In [164]:
tokens = nltk.word_tokenize(raw)
tokens


Out[164]:
['To', 'be', 'or', 'not', 'to', 'BE', '?', 'That', "'s", 'a', 'question', '.']

In [165]:
lc_tokens = [tk.lower() for tk in tokens]
lc_tokens


Out[165]:
['to', 'be', 'or', 'not', 'to', 'be', '?', 'that', "'s", 'a', 'question', '.']

In [166]:
porter = nltk.PorterStemmer()
stemmed_tokens = [porter.stem(tk) for tk in tokens]
stemmed_tokens


Out[166]:
['To', 'be', 'or', 'not', 'to', 'BE', '?', 'That', "'s", 'a', 'question', '.']

In [238]:
raw1 = 'Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field, in their natural contexts, and with minimal experimental-interference.'
tokens = nltk.word_tokenize(raw1)
print("   ".join(stemmed_tokens))


Corpus   linguistics   proposes   that   reliable   language   analysis   is   more   feasible   with   corpora   collected   in   the   field   ,   in   their   natural   contexts   ,   and   with   minimal   experimental-interference   .

Note that the tokenization algorithm is smart enough not to split experimental-interference into two tokens.


In [168]:
stemmed_tokens = [porter.stem(tk) for tk in tokens]
print("   ".join(stemmed_tokens))


Corpu   linguist   propos   that   reliabl   languag   analysi   is   more   feasibl   with   corpora   collect   in   the   field   ,   in   their   natur   context   ,   and   with   minim   experimental-interfer   .

In [169]:
wnl = nltk.WordNetLemmatizer()
lemmatized_tokens = [wnl.lemmatize(tk) for tk in tokens]
print("   ".join(lemmatized_tokens))


Corpus   linguistics   proposes   that   reliable   language   analysis   is   more   feasible   with   corpus   collected   in   the   field   ,   in   their   natural   context   ,   and   with   minimal   experimental-interference   .

You may wonder why proposes above is not lemmatized to propose. This is because the lemmatize method has an optional parameter pos that defaults to 'n' (i.e., proposes is treated as a noun). If we specify the correct POS tag ('v' for verb), the output is correct.


In [236]:
wnl.lemmatize('proposes', 'v')


Out[236]:
'propose'

In [237]:
wnl.lemmatize('is', 'v')


Out[237]:
'be'

Exercise

Write a function that lemmatizes all words in a sentence by considering their POS tags.

WordNet only accepts the following POS tags (from the source: http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html): ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'

You can use nltk.pos_tag(tokens) to obtain POS tags for the input token list.

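For example, a minimal illustration (the tagger model may need to be downloaded first, e.g. with nltk.download('averaged_perceptron_tagger')):


In [ ]:
# POS-tag the tokens of raw1; pos_tag returns a list of (token, tag) pairs.
nltk.pos_tag(nltk.word_tokenize(raw1))
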
Your output should look like this (the tabular output just shows additional debugging info):

% proper_lemmatize_sentence(raw1, True)


                       token/POS           lemmatized_token
0                     Corpus/NNP                     Corpus
1                linguistics/NNS                linguistics
2                   proposes/VBZ                    propose
3                        that/IN                       that
4                    reliable/JJ                   reliable
5                    language/NN                   language
6                    analysis/NN                   analysis
7                         is/VBZ                         be
8                       more/RBR                       more
9                    feasible/JJ                   feasible
10                       with/IN                       with
11                   corpora/NNS                     corpus
12                 collected/VBN                    collect
13                         in/IN                         in
14                        the/DT                        the
15                      field/NN                      field
16                           ,/,                          ,
17                         in/IN                         in
18                    their/PRP$                      their
19                    natural/JJ                    natural
20                   contexts/NN                    context
21                           ,/,                          ,
22                        and/CC                        and
23                       with/IN                       with
24                    minimal/JJ                    minimal
25  experimental-interference/NN  experimental-interference
26                           ./.                          .

['Corpus',
 'linguistics',
 'propose',
 'that',
 'reliable',
 'language',
 'analysis',
 'be',
 'more',
 'feasible',
 'with',
 'corpus',
 'collect',
 'in',
 'the',
 'field',
 ',',
 'in',
 'their',
 'natural',
 'context',
 ',',
 'and',
 'with',
 'minimal',
 'experimental-interference',
 '.']

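One possible sketch of such a function (not the only solution): map the Penn Treebank tags returned by nltk.pos_tag to the WordNet tags listed above, defaulting to noun, and reuse the wnl lemmatizer created earlier. The pandas DataFrame used for the debug table is an assumption; any tabular display works.


In [ ]:
def penn_to_wordnet(penn_tag):
    # Map a Penn Treebank tag to a POS tag WordNet accepts
    # ('a' adjective, 'r' adverb, 'n' noun, 'v' verb); default to noun.
    if penn_tag.startswith('J'):
        return 'a'
    if penn_tag.startswith('V'):
        return 'v'
    if penn_tag.startswith('R'):
        return 'r'
    return 'n'

def proper_lemmatize_sentence(sentence, debug=False):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)          # list of (token, Penn tag) pairs
    lemmas = [wnl.lemmatize(tok, penn_to_wordnet(tag)) for tok, tag in tagged]
    if debug:
        import pandas as pd                # assumed available, for the debug table only
        print(pd.DataFrame({'token/POS': ['{}/{}'.format(t, p) for t, p in tagged],
                            'lemmatized_token': lemmas}))
    return lemmas
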
In [ ]:


In [ ]:


In [ ]: