Language Processing and Python

Computing with Language: Texts and Words

Ran the following in the Python 3 interpreter:

import nltk
nltk.download()

In the downloader, select the "book" collection to download the corpora used in the NLTK Book


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import nltk

In [2]:
from nltk.book import *


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [3]:
text1


Out[3]:
<Text: Moby Dick by Herman Melville 1851>

In [4]:
text2


Out[4]:
<Text: Sense and Sensibility by Jane Austen 1811>

concordance shows every occurrence of a given word, together with some surrounding context


In [5]:
text1.concordance("monstrous")


Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

In [6]:
text2.concordance("affection")


Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
 can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
 the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This 
 opinion . But by an appeal to her affection for her mother , by representing t
 every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without 
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
 was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if 

In [7]:
text3.concordance("lived")


Displaying 25 of 38 matches:
ay when they were created . And Adam lived an hundred and thirty years , and be
ughters : And all the days that Adam lived were nine hundred and thirty yea and
nd thirty yea and he died . And Seth lived an hundred and five years , and bega
ve years , and begat Enos : And Seth lived after he begat Enos eight hundred an
welve years : and he died . And Enos lived ninety years , and begat Cainan : An
 years , and begat Cainan : And Enos lived after he begat Cainan eight hundred 
ive years : and he died . And Cainan lived seventy years and begat Mahalaleel :
rs and begat Mahalaleel : And Cainan lived after he begat Mahalaleel eight hund
years : and he died . And Mahalaleel lived sixty and five years , and begat Jar
s , and begat Jared : And Mahalaleel lived after he begat Jared eight hundred a
and five yea and he died . And Jared lived an hundred sixty and two years , and
o years , and he begat Eno And Jared lived after he begat Enoch eight hundred y
 and two yea and he died . And Enoch lived sixty and five years , and begat Met
 ; for God took him . And Methuselah lived an hundred eighty and seven years , 
 , and begat Lamech . And Methuselah lived after he begat Lamech seven hundred 
nd nine yea and he died . And Lamech lived an hundred eighty and two years , an
ch the LORD hath cursed . And Lamech lived after he begat Noah five hundred nin
naan shall be his servant . And Noah lived after the flood three hundred and fi
xad two years after the flo And Shem lived after he begat Arphaxad five hundred
at sons and daughters . And Arphaxad lived five and thirty years , and begat Sa
ars , and begat Salah : And Arphaxad lived after he begat Salah four hundred an
begat sons and daughters . And Salah lived thirty years , and begat Eber : And 
y years , and begat Eber : And Salah lived after he begat Eber four hundred and
 begat sons and daughters . And Eber lived four and thirty years , and begat Pe
y years , and begat Peleg : And Eber lived after he begat Peleg four hundred an
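
To make the idea concrete, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name concordance_lines and the toy token list are made up for illustration, and the window is measured in tokens rather than characters. It lowercases before comparing because the output above matches both "monstrous" and "Monstrous".

```python
def concordance_lines(tokens, word, width=3):
    """Return (left, match, right) context windows for each occurrence of `word`.

    Matching is case-insensitive; `width` is the number of tokens kept per side.
    """
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]      # tokens just before the match
            right = tokens[i + 1:i + 1 + width]     # tokens just after the match
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines


# Toy data (hypothetical), not taken from the corpus
tokens = "the monstrous whale and the Monstrous Pictures of Whales".split()
for left, match, right in concordance_lines(tokens, "monstrous", width=2):
    # Right-align the left context so the matches line up in a column
    print(f"{left:>15} [{match}] {right}")
```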

similar lists other words that appear in the same kinds of context as the given word


In [8]:
text1.similar("monstrous")


trustworthy mean impalpable true singular reliable wise doleful
domineering passing uncommon contemptible delightfully exasperate
gamesome mystifying part imperial maddens determined

In [9]:
text2.similar("monstrous")


very exceedingly so heartily as remarkably great vast extremely good a
amazingly sweet

text1 (Melville) uses monstrous very differently from text2 (Austen)

  • text2: monstrous has positive connotations and sometimes functions as an intensifier, like very

common_contexts shows contexts that are shared by two or more words


In [10]:
text2.common_contexts(["monstrous", "very"])


be_glad a_pretty am_glad is_pretty a_lucky

trying out other words...


In [11]:
text2.similar("affection")


attention regard time love mother heart sister wife kindness opinion
arrival marianne wishes visit behaviour engagement letter brother
elinor marriage

In [12]:
text2.common_contexts(["affection", "regard"])


my_for her_and his_for her_for continued_for the_of your_for no_for
his_and
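
A rough sketch of what "shared contexts" means, assuming a context is simply the (previous word, next word) pair around a token. The helper names contexts_by_word and shared_contexts and the toy sentence are hypothetical; NLTK's own scoring and ranking may differ.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    lowered = [t.lower() for t in tokens]
    contexts = defaultdict(set)
    for prev, word, nxt in zip(lowered, lowered[1:], lowered[2:]):
        contexts[word].add((prev, nxt))
    return contexts


def shared_contexts(tokens, words):
    """Return contexts common to every word in `words` (a non-empty list),
    formatted as prev_next like the output above."""
    contexts = contexts_by_word(tokens)
    common = set.intersection(*(contexts[w.lower()] for w in words))
    return sorted(f"{prev}_{nxt}" for prev, nxt in common)


# Toy data (hypothetical), written to echo the Austen usage
tokens = ("I am very glad . I am monstrous glad . "
          "a very pretty girl . a monstrous pretty sight").split()
print(shared_contexts(tokens, ["monstrous", "very"]))
```

Two words count as "similar" in this sketch when the intersection of their context sets is large.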

Lexical Dispersion Plot

dispersion_plot shows where each word occurs in the text, measured as the number of words from the beginning


In [44]:
plt.figure(figsize=(18,10))
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America", "liberty", "constitution"])
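
The data behind such a plot is just a list of token offsets per word. A minimal sketch of that computation on a toy token list (the function name word_offsets and the sentence are made up; this is not how NLTK draws the plot):

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Toy data (hypothetical)
tokens = "we the citizens value freedom and the citizens defend freedom".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])
print(offsets)
```

Each offset would become one tick on the row for that word; a word that never occurs ("duties" here) gets an empty row.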


Generating some random text in the style of text3 -- using generate()

not available in NLTK 3.0 (the NLTK Book notes that generate() will be reinstated in a later version)


In [14]:
# (not available in NLTK 3.0)

# text3.generate()

Counting Vocabulary

Count the number of tokens using len


In [15]:
len(text3)


Out[15]:
44764

View/count vocabulary using set(text_obj)


In [16]:
len(set(text3))


Out[16]:
2789

In [17]:
# first 50
sorted(set(text3))[:50]


Out[17]:
['!',
 "'",
 '(',
 ')',
 ',',
 ',)',
 '.',
 '.)',
 ':',
 ';',
 ';)',
 '?',
 '?)',
 'A',
 'Abel',
 'Abelmizraim',
 'Abidah',
 'Abide',
 'Abimael',
 'Abimelech',
 'Abr',
 'Abrah',
 'Abraham',
 'Abram',
 'Accad',
 'Achbor',
 'Adah',
 'Adam',
 'Adbeel',
 'Admah',
 'Adullamite',
 'After',
 'Aholibamah',
 'Ahuzzath',
 'Ajah',
 'Akan',
 'All',
 'Allonbachuth',
 'Almighty',
 'Almodad',
 'Also',
 'Alvah',
 'Alvan',
 'Am',
 'Amal',
 'Amalek',
 'Amalekites',
 'Ammon',
 'Amorite',
 'Amorites']

Calculating the lexical richness of the text: the number of distinct tokens divided by the total number of tokens


In [18]:
len(set(text3)) / len(text3)


Out[18]:
0.06230453042623537

Count how often a word occurs in the text


In [19]:
text3.count("smote")


Out[19]:
5

Compute what percentage of the text is taken up by a specific word


In [20]:
100 * text4.count('a') / len(text4)


Out[20]:
1.4643016433938312

In [21]:
text5.count('lol')


Out[21]:
704

In [22]:
100 * text5.count('lol') / len(text5)


Out[22]:
1.5640968673628082

Define some simple functions to calculate these values


In [23]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total

In [24]:
lexical_diversity(text3), lexical_diversity(text5)


Out[24]:
(0.06230453042623537, 0.13477005109975562)

In [25]:
percentage(text4.count('a'), len(text4))


Out[25]:
1.4643016433938312
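
Both helpers divide by a length, so an empty token list would raise ZeroDivisionError. A hedged variant that returns 0.0 instead, tried on a toy sentence (the names safe_lexical_diversity and safe_percentage are made up for this sketch):

```python
def safe_lexical_diversity(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty text."""
    return len(set(tokens)) / len(tokens) if len(tokens) else 0.0


def safe_percentage(count, total):
    """Percentage of `total` represented by `count`; 0.0 when total is 0."""
    return 100 * count / total if total else 0.0


saying = "After all is said and done more is said than done".split()
# 11 tokens, 8 of them distinct ('is', 'said' and 'done' each occur twice)
print(safe_lexical_diversity(saying))
print(safe_percentage(saying.count("said"), len(saying)))
print(safe_lexical_diversity([]))
```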

A Closer Look at Python: Texts as Lists of Words

skipping some basic python parts of this section...


In [26]:
sent1


Out[26]:
['Call', 'me', 'Ishmael', '.']

In [27]:
sent2


Out[27]:
['The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.']

In [28]:
lexical_diversity(sent1)


Out[28]:
1.0

List Concatenation


In [29]:
['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']


Out[29]:
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

Indexing Lists (...and Text objects)


In [30]:
text4[173]


Out[30]:
'awaken'

In [31]:
text4.index('awaken')


Out[31]:
173

In [32]:
text5[16715:16735]


Out[32]:
['U86',
 'thats',
 'why',
 'something',
 'like',
 'gamefly',
 'is',
 'so',
 'good',
 'because',
 'you',
 'can',
 'actually',
 'play',
 'a',
 'full',
 'game',
 'without',
 'buying',
 'it']

In [33]:
text6[1600:1625]


Out[33]:
['We',
 "'",
 're',
 'an',
 'anarcho',
 '-',
 'syndicalist',
 'commune',
 '.',
 'We',
 'take',
 'it',
 'in',
 'turns',
 'to',
 'act',
 'as',
 'a',
 'sort',
 'of',
 'executive',
 'officer',
 'for',
 'the',
 'week']

Computing with Language: Simple Statistics


In [34]:
saying = 'After all is said and done more is said than done'.split()
tokens = sorted(set(saying))
tokens[-2:]


Out[34]:
['said', 'than']

Frequency Distributions


In [35]:
fdist1 = FreqDist(text1)
print(fdist1)


<FreqDist with 19317 samples and 260819 outcomes>

In [36]:
fdist1.most_common(50)


Out[36]:
[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767),
 ('have', 760),
 ('there', 715),
 ('But', 705),
 ('or', 697),
 ('were', 680),
 ('now', 646),
 ('which', 640),
 ('?', 637),
 ('me', 627),
 ('like', 624)]

In [37]:
fdist1['whale']


Out[37]:
906

The cumulative frequency plot shows that the 50 most frequent tokens account for almost half of the book


In [38]:
plt.figure(figsize=(18,10))
fdist1.plot(50, cumulative=True)
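
The same "how much of the text do the top n tokens cover" figure can be computed directly rather than read off the plot. A small sketch using collections.Counter from the standard library in place of FreqDist; the function name coverage and the toy sentence are hypothetical:

```python
from collections import Counter


def coverage(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent token types."""
    counts = Counter(tokens)
    top_total = sum(count for _, count in counts.most_common(n))
    return top_total / len(tokens)


# Toy data (hypothetical): 10 tokens, 'the' x3 and ',' x2
tokens = "the whale , the sea , and the ship .".split()
print(coverage(tokens, 2))
```

With the real data, coverage(text1, 50) should give the value the cumulative curve reaches at its 50th point, assuming text1 is loaded as above.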


Fine-grained Selection of Words

Looking at long words of a text (maybe these will be more meaningful words?)


In [39]:
V = set(text1)
long_words = [w for w in V if len(w) > 15]

sorted(long_words)


Out[39]:
['CIRCUMNAVIGATION',
 'Physiognomically',
 'apprehensiveness',
 'cannibalistically',
 'characteristically',
 'circumnavigating',
 'circumnavigation',
 'circumnavigations',
 'comprehensiveness',
 'hermaphroditical',
 'indiscriminately',
 'indispensableness',
 'irresistibleness',
 'physiognomically',
 'preternaturalness',
 'responsibilities',
 'simultaneousness',
 'subterraneousness',
 'supernaturalness',
 'superstitiousness',
 'uncomfortableness',
 'uncompromisedness',
 'undiscriminating',
 'uninterpenetratingly']

words that are longer than 7 characters and occur more than 7 times


In [40]:
fdist5 = FreqDist(text5)

sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)


Out[40]:
['#14-19teens',
 '#talkcity_adults',
 '((((((((((',
 '........',
 'Question',
 'actually',
 'anything',
 'computer',
 'cute.-ass',
 'everyone',
 'football',
 'innocent',
 'listening',
 'remember',
 'seriously',
 'something',
 'together',
 'tomorrow',
 'watching']

collocation: a sequence of words that occur together unusually often (red wine is a collocation; the wine is not)


In [41]:
list(nltk.bigrams(['more', 'is', 'said', 'than', 'done'])) # bigrams() returns a generator


Out[41]:
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
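
For intuition, adjacent pairs can also be produced with zip alone. This is a sketch of the idea, not NLTK's code; the name simple_bigrams is made up:

```python
def simple_bigrams(tokens):
    """Return a lazy iterator over adjacent token pairs."""
    # Pair each token with its successor; zip stops at the shorter sequence
    return zip(tokens, tokens[1:])


pairs = list(simple_bigrams(['more', 'is', 'said', 'than', 'done']))
print(pairs)
```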

collocations are essentially frequent bigrams, except that we want to focus on the cases that involve rare words

collocations() returns bigrams that occur more often than expected, based on word frequency


In [42]:
text4.collocations()


United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

In [43]:
text8.collocations()


would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
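
One way to see what "more often than expected" can mean is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. This is a swapped-in, simpler measure for illustration only; it is not claimed to be what collocations() uses. The function name pmi_bigrams, the min_count parameter, and the toy sentence are assumptions.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank bigrams seen at least `min_count` times by PMI, highest first."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = n_tokens - 1
    scores = {}
    for (w1, w2), count in pairs.items():
        if count < min_count:
            continue  # pairs seen only once give unreliable scores
        p_pair = count / n_pairs
        p_independent = (unigrams[w1] / n_tokens) * (unigrams[w2] / n_tokens)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical)
tokens = ("red wine and the wine and the bread and "
          "red wine with the cheese").split()
ranked = pmi_bigrams(tokens)
print(ranked[0][0])
```

Here "red" is rarer than "and" or "the", so a pair involving it scores higher than equally frequent pairs made of common words, which is the "rare words" point above.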

counting other things

word length distribution in text1


In [47]:
[len(w) for w in text1][:10]


Out[47]:
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1]

In [48]:
fdist = FreqDist(len(w) for w in text1)
print(fdist)


<FreqDist with 19 samples and 260819 outcomes>

In [49]:
fdist


Out[49]:
Counter({1: 47933,
         2: 38513,
         3: 50223,
         4: 42345,
         5: 26597,
         6: 17111,
         7: 14399,
         8: 9966,
         9: 6428,
         10: 3528,
         11: 1873,
         12: 1053,
         13: 567,
         14: 177,
         15: 70,
         16: 22,
         17: 12,
         18: 1,
         20: 1})

In [50]:
fdist.most_common()


Out[50]:
[(3, 50223),
 (1, 47933),
 (4, 42345),
 (2, 38513),
 (5, 26597),
 (6, 17111),
 (7, 14399),
 (8, 9966),
 (9, 6428),
 (10, 3528),
 (11, 1873),
 (12, 1053),
 (13, 567),
 (14, 177),
 (15, 70),
 (16, 22),
 (17, 12),
 (18, 1),
 (20, 1)]

In [52]:
fdist.max()


Out[52]:
3

In [53]:
fdist[3]


Out[53]:
50223

In [54]:
fdist.freq(3)


Out[54]:
0.19255882431878046

words of length 3 (~50k) make up ~20% of all words in the book
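
The same length statistics can be reproduced with collections.Counter on any token list. A minimal sketch (the helper names and the toy tokens are made up); the last line rechecks the ratio behind the ~20% figure from the counts shown above:

```python
from collections import Counter


def length_counts(tokens):
    """Count how many tokens there are of each length."""
    return Counter(len(token) for token in tokens)


def length_freq(tokens, length):
    """Proportion of tokens that have exactly `length` characters."""
    return length_counts(tokens)[length] / len(tokens)


# Toy data (hypothetical)
tokens = "Call me Ishmael . Some years ago".split()
counts = length_counts(tokens)
most_common_length = counts.most_common(1)[0][0]
print(most_common_length, length_freq(tokens, most_common_length))

# Ratio from the notebook output: tokens of length 3 over all tokens in text1
print(50223 / 260819)
```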

Back to Python: Making Decisions and Taking Control

skipping basic python stuff

A more accurate vocabulary count: convert all tokens to lowercase so that, e.g., This and this are counted once


In [56]:
len(text1)


Out[56]:
260819

In [57]:
len(set(text1))


Out[57]:
19317

In [58]:
len(set(word.lower() for word in text1))


Out[58]:
17231

Only include alphabetic tokens, which drops punctuation and numbers


In [59]:
len(set(word.lower() for word in text1 if word.isalpha()))


Out[59]:
16948
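
The effect of each normalization step is easier to see on a tiny token list. A sketch that wraps both steps in one helper (the function name vocabulary, its parameters, and the toy tokens are hypothetical):

```python
def vocabulary(tokens, lowercase=True, alphabetic_only=True):
    """Return the set of distinct tokens after optional normalization."""
    words = (t.lower() if lowercase else t for t in tokens)
    # str.isalpha() is False for punctuation and for digit strings like '1851'
    return {w for w in words if not alphabetic_only or w.isalpha()}


# Toy data (hypothetical)
tokens = ["The", "whale", ",", "the", "Whale", "!", "1851"]
print(len(set(tokens)))                                # raw distinct tokens
print(len(vocabulary(tokens, alphabetic_only=False)))  # after lowercasing
print(len(vocabulary(tokens)))                         # lowercased, alphabetic only
```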
