In this notebook, we will use NLTK to preprocess text documents. NLTK is a widely used library for Natural Language Processing. In addition to many built-in capabilities, it provides interfaces to many corpora, other libraries, and a good online textbook/cookbook (http://www.nltk.org/book/).
You need to install NLTK and download its data first.
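A minimal setup sketch, assuming a pip-based environment; the 'book' download collection bundles everything used by nltk.book:
# in a shell: pip install nltk
import nltk
nltk.download('book')  # fetch the texts, corpora, and models used by nltk.book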
If everything is installed correctly, you should be able to run the following code:
In [195]:
from pprint import pprint
import nltk
from nltk.book import *
In [145]:
doc = text6
print(doc)
In [146]:
word = 'swallow'
doc.concordance(word, width=100, lines=30)
In [147]:
word = 'live'
doc.concordance(word, width=100, lines=30)
In [148]:
word = 'lived'
doc.concordance(word, width=100, lines=30)
In [149]:
word = 'CASTLE'
doc.concordance(word, width=100, lines=30)
In [150]:
from nltk.text import Text
from nltk import word_tokenize # sentence => words
from nltk import sent_tokenize # document => sentences
#str1 = "to be or not to BE? That's a question. "
str1 = "To be or not to BE?\n That's a question. "
tokens = word_tokenize(str1)
doc2 = Text(tokens)
In [151]:
print('# of tokens = {}'.format(len(doc)))
print('# of unique tokens = {}'.format(len(set(doc))))
In [152]:
print(doc2)
print('# of tokens = {}'.format(len(doc2)))
print('# of unique tokens = {}'.format(len(set(doc2))))
print(sorted(set(doc2)))
doc2.concordance('be', width=100, lines=30)
In [153]:
print(doc2.count('to'))
print(doc2.count('To'))
print(doc2.count('be'))
print(doc2.count('BE'))
print(doc2.count('bE'))
In [154]:
occ = doc2.index('to')
print(occ)
In [155]:
half_window = 3
doc2[occ - half_window : occ + half_window +1]
In [156]:
fd = nltk.FreqDist(doc)  # frequency distribution over all tokens in the document
fd.most_common(20)
In [157]:
words = ['the', 'knight', 'swallow']
print( [fd[word] for word in words] )
In [158]:
doc.collocations(num=30)
nltk.ngrams(text, n) returns a generator of all n-grams in text.
In [159]:
print(list(nltk.ngrams(doc, 3))[:10])
In [160]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')  # avoid shadowing the stopwords module
print(len(text6))
print(len(set(text6)))
new_text6 = [w for w in text6 if w not in stop_words]
print(len(new_text6))
print(len(set(new_text6)))
In [161]:
import re
newer_text6 = [w for w in new_text6 if re.search('^ab',w)]
print(len(newer_text6))
print(newer_text6)
In [162]:
raw = " ".join(list(doc2))
print(raw)
In [163]:
sentences = sent_tokenize(raw)
sentences
In [164]:
tokens = nltk.word_tokenize(raw)
tokens
In [165]:
lc_tokens = [tk.lower() for tk in tokens]
lc_tokens
In [166]:
porter = nltk.PorterStemmer()
stemmed_tokens = [porter.stem(tk) for tk in tokens]
stemmed_tokens
In [238]:
raw1 = 'Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field, in their natural contexts, and with minimal experimental-interference.'
tokens = nltk.word_tokenize(raw1)
print(" ".join(stemmed_tokens))
Note that the tokenization algorithm is smart enough not to split experimental-interference into two tokens.
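For contrast, here is a small sketch showing that a purely punctuation-based tokenizer (nltk.wordpunct_tokenize) does split the hyphenated token:
from nltk import word_tokenize, wordpunct_tokenize
s = 'minimal experimental-interference.'
print(word_tokenize(s))       # ['minimal', 'experimental-interference', '.']
print(wordpunct_tokenize(s))  # ['minimal', 'experimental', '-', 'interference', '.']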
In [168]:
stemmed_tokens = [porter.stem(tk) for tk in tokens]
print(" ".join(stemmed_tokens))
In [169]:
wnl = nltk.WordNetLemmatizer()
lemmatized_tokens = [wnl.lemmatize(tk) for tk in tokens]
print(" ".join(lemmatized_tokens))
You probably wonder why proposes above is not lemmatized to propose. This is because the lemmatize method has a default optional parameter pos='n' (i.e., it treats proposes as a noun). If we specify the correct POS tag ('v' for verb), the output will be correct.
In [236]:
wnl.lemmatize('proposes', 'v')
Out[236]:
'propose'
In [237]:
wnl.lemmatize('is', 'v')
Out[237]:
'be'
Write a function that lemmatizes all words in a sentence by considering their POS tags.
WordNet only accepts the following POS tags (from the source: http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
You can use nltk.pos_tag(tokens) to obtain POS tags for the input token list.
Your output should look like this (the tabular output just shows additional debugging info); a possible implementation sketch follows the expected output below.
% proper_lemmatize_sentence(raw1, True)
token/POS lemmatized_token
0 Corpus/NNP Corpus
1 linguistics/NNS linguistics
2 proposes/VBZ propose
3 that/IN that
4 reliable/JJ reliable
5 language/NN language
6 analysis/NN analysis
7 is/VBZ be
8 more/RBR more
9 feasible/JJ feasible
10 with/IN with
11 corpora/NNS corpus
12 collected/VBN collect
13 in/IN in
14 the/DT the
15 field/NN field
16 ,/, ,
17 in/IN in
18 their/PRP$ their
19 natural/JJ natural
20 contexts/NN context
21 ,/, ,
22 and/CC and
23 with/IN with
24 minimal/JJ minimal
25 experimental-interference/NN experimental-interference
26 ./. .
['Corpus',
'linguistics',
'propose',
'that',
'reliable',
'language',
'analysis',
'be',
'more',
'feasible',
'with',
'corpus',
'collect',
'in',
'the',
'field',
',',
'in',
'their',
'natural',
'context',
',',
'and',
'with',
'minimal',
'experimental-interference',
'.']
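Here is a minimal sketch of such a function; the Penn-Treebank-to-WordNet tag mapping below is our own assumption, with anything unmapped falling back to WordNet's default noun tag:
import nltk
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(penn_tag):
    # Map a Penn Treebank tag to a WordNet POS tag ('a', 'r', 'n', 'v').
    if penn_tag.startswith('J'):
        return 'a'  # adjective
    elif penn_tag.startswith('V'):
        return 'v'  # verb
    elif penn_tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun (WordNet's own default)

def proper_lemmatize_sentence(sentence, debug=False):
    wnl = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    lemmas = [wnl.lemmatize(tok, penn_to_wordnet(tag)) for tok, tag in tagged]
    if debug:
        for i, ((tok, tag), lemma) in enumerate(zip(tagged, lemmas)):
            print('{:>2}  {}/{:<4}  {}'.format(i, tok, tag, lemma))
    return lemmas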