In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import nltk
import re
import pprint
from nltk import word_tokenize
downloading the raw text of Crime and Punishment from Project Gutenberg
In [2]:
from urllib import request
url = 'http://www.gutenberg.org/files/2554/2554.txt'
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)
Out[2]:
number of characters:
In [3]:
len(raw)
Out[3]:
In [4]:
raw[:75]
Out[4]:
In [5]:
tokens = word_tokenize(raw)
type(tokens)
Out[5]:
In [6]:
len(tokens)
Out[6]:
In [7]:
tokens[:10]
Out[7]:
Create a Text object from tokens
In [8]:
text = nltk.Text(tokens)
type(text)
Out[8]:
In [9]:
text[1024:1062]
Out[9]:
find collocations (words that frequently appear together)
In [10]:
text.collocations()
'Project Gutenberg' shows up as a collocation because the raw file includes the Project Gutenberg header (and footer), which are not part of the novel itself
find the start of the novel manually using find()
In [11]:
raw.find('PART I')
Out[11]:
reverse find using rfind
In [12]:
raw.rfind("End of Project Gutenberg's Crime")
Out[12]:
In [13]:
raw = raw[5338:1157746] # slightly different from NLTK Book value
In [14]:
raw.find("PART I")
Out[14]:
In [15]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]
Out[15]:
In [16]:
type(html)
Out[16]:
In [17]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
tokens[:50]
Out[17]:
find the start and end indices of the article content (manually) and create a Text object
In [18]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text
Out[18]:
get a concordance for 'gene' -- shows each occurrence of the word in context
In [19]:
text.concordance('gene')
In [20]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
path
Out[20]:
In [21]:
with open(path, encoding='latin2') as f:
    for line in f:
        line_strip = line.strip()
        print(line_strip)
In [22]:
with open(path, encoding='latin2') as f:
    for line in f:
        line_strip = line.strip()
        print(line_strip.encode('unicode_escape'))
In [23]:
import unicodedata
with open(path, encoding='latin2') as f:
    lines = f.readlines()
line = lines[2]
print(line.encode('unicode_escape'))
In [24]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))
In [25]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c, ord(c), unicodedata.name(c)))
Using Python string methods and re with Unicode characters
In [26]:
line
Out[26]:
In [27]:
line.find('zosta\u0142y')
Out[27]:
In [28]:
line = line.lower()
line
Out[28]:
In [29]:
line.encode('unicode_escape')
Out[29]:
In [30]:
import re
m = re.search('\u015b\w*', line)
m.group()
Out[30]:
In [31]:
m.group().encode('unicode_escape')
Out[31]:
Can use Unicode strings with NLTK tokenizers
In [32]:
word_tokenize(line)
Out[32]:
skipping this section of the book; regular expression cheatsheet (a few of these operators are demonstrated in the short example after the table):
Operator Behavior
. Wildcard, matches any character
^abc Matches some pattern abc at the start of a string
abc$ Matches some pattern abc at the end of a string
[abc] Matches one of a set of characters
[A-Z0-9] Matches one of a range of characters
ed|ing|s Matches one of the specified strings (disjunction)
* Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+ One or more of previous item, e.g. a+, [a-z]+
? Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n} Exactly n repeats where n is a non-negative integer
{n,} At least n repeats
{,n} No more than n repeats
{m,n} At least m and no more than n repeats
a(b|c)+ Parentheses that indicate the scope of the operators
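As a quick illustration (a minimal sketch, not taken from the book), here are a few of these operators applied with re.search and re.findall; the expected results are noted in the comments:
import re

re.search(r'^abc', 'abcdef')             # '^abc' anchors the match at the start of the string
re.search(r'ing$', 'walking')            # 'ing$' anchors the match at the end of the string
re.findall(r'[aeiou]+', 'sequence')      # '+' = one or more: runs of vowels -> ['e', 'ue', 'e']
re.findall(r'colou?r', 'color colour')   # '?' makes the 'u' optional -> ['color', 'colour']
re.findall(r'\d{2,4}', 'year 2024')      # {m,n} = between 2 and 4 repeats -> ['2024']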
find all vowels in a word and count them
In [33]:
import re
word = 'supercalifragilisticexpialidocious'
vowel_matches = re.findall(r'[aeiou]', word)
vowel_matches
Out[33]:
In [34]:
len(vowel_matches)
Out[34]:
Frequencies for sequences of 2+ vowels in the text
In [35]:
wsj = sorted(set(nltk.corpus.treebank.words()))
len(wsj)
Out[35]:
In [36]:
fd = nltk.FreqDist(vowels for word in wsj
                   for vowels in re.findall(r'[aeiou]{2,}', word))
len(fd)
Out[36]:
In [37]:
fd.most_common(12)
Out[37]:
In [38]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)
In [39]:
re.findall(regexp, 'Universal')
Out[39]:
In [40]:
english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
Extract consonant-vowel sequences from text
In [41]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [consonant_vowel for w in rotokas_words
       for consonant_vowel in re.findall(r'[ptksvr][aeiou]', w)]
cvs[:25]
Out[41]:
In [42]:
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
create an index, using nltk.Index(), such that cv_index['su'] returns all words containing 'su'
In [43]:
cv_word_pairs = [(cv, w) for w in rotokas_words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
type(cv_index)
Out[43]:
In [44]:
cv_index['su']
Out[44]:
In [45]:
cv_index['po']
Out[45]:
In [46]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word
In [47]:
stem('walking')
Out[47]:
alternative using the re module...
In [48]:
def stem_regexp(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem
In [49]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
tokens
Out[49]:
In [50]:
[stem_regexp(t) for t in tokens]
Out[50]:
In [51]:
from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")
In [52]:
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")
In [53]:
chat.findall(r"<l.*>{3,}")
In [54]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
tokens
Out[54]:
In [55]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]
Out[55]:
In [56]:
[lancaster.stem(t) for t in tokens]
Out[56]:
The Porter stemmer correctly handled lying -> lie, while the Lancaster stemmer did not.
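Checking that single word directly (reusing the porter and lancaster objects defined above):
porter.stem('lying')     # 'lie'
lancaster.stem('lying')  # 'lying' -- left unchanged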
Defining a custom Text class that uses the Porter stemmer and can generate a concordance for a text using word stems
In [57]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)  # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
In [58]:
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')
In [59]:
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]
Out[59]:
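A note on the lemmatizer: it only removes affixes when the resulting word is in WordNet, and it assumes the noun part of speech unless told otherwise. A small sketch (not from the book):
wnl.lemmatize('women')       # 'woman' -- nouns are the default
wnl.lemmatize('lying')       # 'lying' -- treated as a noun, so left alone
wnl.lemmatize('lying', 'v')  # 'lie'   -- with the verb POS the affix is removed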
In [60]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""
split on whitespace
In [61]:
re.split(r' ', raw)
Out[61]:
In [62]:
re.split(r'[ \t\n]+', raw)
Out[62]:
re offers \w (word characters) and \W (all characters except letters, digits, and underscore); split on non-word characters:
In [63]:
re.split(r'\W+', raw)
Out[63]:
to exclude the empty strings produced by re.split(), use re.findall() instead...
In [64]:
re.findall(r'\w+|\S\w*', raw)
Out[64]:
allow internal hyphens and apostrophes in words
In [65]:
re.findall(r"\w+(?:[-']\w+)*|'|[-.(\)]+|\S\w*", raw)
Out[65]:
In [66]:
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)         # set flag to allow verbose regexps
      (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*         # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?   # currency and percentages, e.g. $12.40, 82%
    | \.\.\.               # ellipsis
    | [][.,;"'?():_`-]     # these are separate tokens; includes ], [
'''
nltk.regexp_tokenize(text, pattern)
Out[66]:
(?x) is the verbose flag -- it tells re to strip embedded whitespace and comments out of the pattern before matching
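The same behaviour can be requested by passing re.VERBOSE when compiling a pattern instead of embedding (?x) -- a minimal sketch, separate from the book's example:
import re

verbose_pattern = re.compile(r'''
      \$?\d+(?:\.\d+)?   # currency-like numbers, e.g. $12.40
    | \.\.\.             # ellipsis
''', re.VERBOSE)
verbose_pattern.findall('That costs $12.40...')   # ['$12.40', '...']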
It is important to have a "gold standard" for tokenization so that the performance of a custom tokenizer can be compared against it.
The NLTK data collection includes the Penn Treebank corpus in both raw and tokenized form for this purpose:
nltk.corpus.treebank_raw.raw()
and nltk.corpus.treebank.words()
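One rough way to use that gold standard is to check which tokens a custom tokenizer produces that never occur in the Treebank tokenization; this is only a naive sanity check (a sketch under that assumption), not a proper evaluation:
import nltk

gold_vocab = set(nltk.corpus.treebank.words())          # gold-standard tokens
sample = nltk.corpus.treebank_raw.raw()[:2000]          # a slice of the raw text
my_tokens = nltk.regexp_tokenize(sample, r'\w+|\S\w*')  # a simple custom tokenizer
suspect = [t for t in my_tokens if t not in gold_vocab] # tokens the gold standard never produces
suspect[:20]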
Tokenization is a specific case of the more general problem of segmentation.
Average number of words per sentence:
In [67]:
len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
Out[67]:
Segmenting a stream of characters into sentences: sent_tokenize
In [68]:
import pprint
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
pprint.pprint(sents[79:89])