Language Processing and Python

Computing with Language: Texts and Words

Ran the following in a Python 3 interpreter to open the NLTK downloader:

import nltk
nltk.download()

In the downloader, select the "book" collection to download the corpora used in the NLTK Book



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt

import nltk



In [2]:

    
from nltk.book import *









    



*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908



In [3]:

    
text1









    Out[3]:





<Text: Moby Dick by Herman Melville 1851>



In [4]:

    
text2









    Out[4]:





<Text: Sense and Sensibility by Jane Austen 1811>

concordance() shows every occurrence of a word, together with some surrounding context



In [5]:

    
text1.concordance("monstrous")









    



Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u



In [6]:

    
text2.concordance("affection")









    



Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
 can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
 the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This 
 opinion . But by an appeal to her affection for her mother , by representing t
 every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without 
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
 was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if



In [7]:

    
text3.concordance("lived")









    



Displaying 25 of 38 matches:
ay when they were created . And Adam lived an hundred and thirty years , and be
ughters : And all the days that Adam lived were nine hundred and thirty yea and
nd thirty yea and he died . And Seth lived an hundred and five years , and bega
ve years , and begat Enos : And Seth lived after he begat Enos eight hundred an
welve years : and he died . And Enos lived ninety years , and begat Cainan : An
 years , and begat Cainan : And Enos lived after he begat Cainan eight hundred 
ive years : and he died . And Cainan lived seventy years and begat Mahalaleel :
rs and begat Mahalaleel : And Cainan lived after he begat Mahalaleel eight hund
years : and he died . And Mahalaleel lived sixty and five years , and begat Jar
s , and begat Jared : And Mahalaleel lived after he begat Jared eight hundred a
and five yea and he died . And Jared lived an hundred sixty and two years , and
o years , and he begat Eno And Jared lived after he begat Enoch eight hundred y
 and two yea and he died . And Enoch lived sixty and five years , and begat Met
 ; for God took him . And Methuselah lived an hundred eighty and seven years , 
 , and begat Lamech . And Methuselah lived after he begat Lamech seven hundred 
nd nine yea and he died . And Lamech lived an hundred eighty and two years , an
ch the LORD hath cursed . And Lamech lived after he begat Noah five hundred nin
naan shall be his servant . And Noah lived after the flood three hundred and fi
xad two years after the flo And Shem lived after he begat Arphaxad five hundred
at sons and daughters . And Arphaxad lived five and thirty years , and begat Sa
ars , and begat Salah : And Arphaxad lived after he begat Salah four hundred an
begat sons and daughters . And Salah lived thirty years , and begat Eber : And 
y years , and begat Eber : And Salah lived after he begat Eber four hundred and
 begat sons and daughters . And Eber lived four and thirty years , and begat Pe
y years , and begat Peleg : And Eber lived after he begat Peleg four hundred an
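
A rough sketch of how a concordance view like the ones above could be built from a plain list of tokens. This is not NLTK's implementation: the function name, the `width` parameter (counted in tokens rather than characters) and the toy data are assumptions for illustration. Matching is case-insensitive, in line with the "Monstrous" hit in the text1 output.

```python
def concordance_lines(tokens, target, width=3):
    """Return (left, word, right) for each occurrence of `target`.

    `width` is the number of context tokens kept on each side.
    """
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Toy token list (hypothetical data, not taken from the NLTK corpora).
toy_tokens = "the whale was of a most monstrous size and a monstrous bulk".split()

for left, word, right in concordance_lines(toy_tokens, "monstrous"):
    # Right-align the left context so the matched word lines up in a column.
    print(f"{left:>20} {word} {right}")
```

Aligning on the matched word is what makes the view easy to scan: the eye can run down one column and compare what comes before and after.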

similar() shows other words that appear in contexts similar to those of the given word



In [8]:

    
text1.similar("monstrous")









    



trustworthy mean impalpable true singular reliable wise doleful
domineering passing uncommon contemptible delightfully exasperate
gamesome mystifying part imperial maddens determined



In [9]:

    
text2.similar("monstrous")









    



very exceedingly so heartily as remarkably great vast extremely good a
amazingly sweet

text1 (Melville) uses "monstrous" very differently from text2 (Austen)

In text2, "monstrous" has positive connotations and sometimes functions as an intensifier, like "very"
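
One way to think about what "similar context" means: treat the (previous word, next word) pair around each token as its context, then rank other words by how many contexts they share with the target. The sketch below does that on a toy token list. It is a simplification under that assumption and does not reproduce NLTK's own ranking; the function name and the data are hypothetical.

```python
from collections import Counter, defaultdict


def similar_words(tokens, target, top_n=5):
    """Rank words by the number of (previous, next) contexts shared with `target`."""
    contexts = defaultdict(set)
    # The first and last tokens have no two-sided context, so they are skipped.
    for i in range(1, len(tokens) - 1):
        word = tokens[i].lower()
        contexts[word].add((tokens[i - 1].lower(), tokens[i + 1].lower()))

    target = target.lower()
    target_contexts = contexts.get(target, set())
    scores = Counter()
    for word, word_contexts in contexts.items():
        shared = len(word_contexts & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]


# Toy token list (hypothetical data).
toy_tokens = ("a monstrous fine day and a very fine day "
              "and a rather fine night").split()

print(similar_words(toy_tokens, "monstrous"))
```

Here "very" and "rather" come out as similar to "monstrous" because all three occur between "a" and "fine".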

common_contexts shows contexts that are shared by two or more words



In [10]:

    
text2.common_contexts(["monstrous", "very"])









    



be_glad a_pretty am_glad is_pretty a_lucky

trying out other words...



In [11]:

    
text2.similar("affection")









    



attention regard time love mother heart sister wife kindness opinion
arrival marianne wishes visit behaviour engagement letter brother
elinor marriage



In [12]:

    
text2.common_contexts(["affection", "regard"])









    



my_for her_and his_for her_for continued_for the_of your_for no_for
his_and
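
The outputs above read as left_right pairs: "a_pretty" means both words were seen between "a" and "pretty". A minimal sketch of that idea, assuming a context is just the (previous, next) word pair; the helper names and toy data are hypothetical, and this is not NLTK's implementation.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    contexts = defaultdict(set)
    # Skip the first and last token: they lack a neighbour on one side.
    for i in range(1, len(tokens) - 1):
        word = tokens[i].lower()
        contexts[word].add((tokens[i - 1].lower(), tokens[i + 1].lower()))
    return contexts


def shared_contexts(tokens, words):
    """Return the sorted contexts in which every word in `words` appears."""
    contexts = contexts_by_word(tokens)
    sets = [contexts.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*sets)) if sets else []


# Toy token list (hypothetical data).
toy_tokens = ("I am very glad and a very pretty girl is monstrous glad "
              "to be a monstrous pretty sight").split()

for left, right in shared_contexts(toy_tokens, ["monstrous", "very"]):
    print(f"{left}_{right}")
```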

Lexical Dispersion Plot

Determining where words occur in a text (how many words from the beginning each occurrence appears) -- using dispersion_plot



In [44]:

    
plt.figure(figsize=(18,10))
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America", "liberty", "constitution"])
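
The data behind a dispersion plot is just the list of token offsets at which each word occurs. A small sketch of collecting those offsets, without the plotting step; the function name and toy tokens are hypothetical, and exact (case-sensitive) matching is an assumption of this sketch.

```python
def word_offsets(tokens, targets):
    """Return {target: [offsets]} giving the token positions of each target word."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Toy token list (hypothetical data).
toy_tokens = "freedom and liberty for citizens ; liberty and duties for America".split()

print(word_offsets(toy_tokens, ["liberty", "America", "democracy"]))
```

Each offset would become one tick on the plot: x is the position in the text, y is the row for that word. A word that never occurs, like "democracy" here, gets an empty row.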

Generating some random text in the style of text3 -- using generate()

generate() is not available in NLTK 3.0, so the call below is commented out



In [14]:

    
# (not available in NLTK 3.0)

# text3.generate()

Counting Vocabulary

Count the number of tokens using len



In [15]:

    
len(text3)









    Out[15]:





44764

View/count vocabulary using set(text_obj)



In [16]:

    
len(set(text3))









    Out[16]:





2789



In [17]:

    
# first 50
sorted(set(text3))[:50]









    Out[17]:





['!',
 "'",
 '(',
 ')',
 ',',
 ',)',
 '.',
 '.)',
 ':',
 ';',
 ';)',
 '?',
 '?)',
 'A',
 'Abel',
 'Abelmizraim',
 'Abidah',
 'Abide',
 'Abimael',
 'Abimelech',
 'Abr',
 'Abrah',
 'Abraham',
 'Abram',
 'Accad',
 'Achbor',
 'Adah',
 'Adam',
 'Adbeel',
 'Admah',
 'Adullamite',
 'After',
 'Aholibamah',
 'Ahuzzath',
 'Ajah',
 'Akan',
 'All',
 'Allonbachuth',
 'Almighty',
 'Almodad',
 'Also',
 'Alvah',
 'Alvan',
 'Am',
 'Amal',
 'Amalek',
 'Amalekites',
 'Ammon',
 'Amorite',
 'Amorites']

Calculating the lexical richness of the text: number of distinct tokens divided by total number of tokens



In [18]:

    
len(set(text3)) / len(text3)









    Out[18]:





0.06230453042623537

Count how often a word occurs in the text



In [19]:

    
text3.count("smote")









    Out[19]:





5

Compute what percentage of the text is taken up by a specific word



In [20]:

    
100 * text4.count('a') / len(text4)









    Out[20]:





1.4643016433938312



In [21]:

    
text5.count('lol')









    Out[21]:





704



In [22]:

    
100 * text5.count('lol') / len(text5)









    Out[22]:





1.5640968673628082

Define some simple functions to calculate these values



In [23]:

    
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total



In [24]:

    
lexical_diversity(text3), lexical_diversity(text5)









    Out[24]:





(0.06230453042623537, 0.13477005109975562)



In [25]:

    
percentage(text4.count('a'), len(text4))









    Out[25]:





1.4643016433938312
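
A variant of the two helpers above, checked on a toy token list where the arithmetic is easy to follow by hand. The guard for empty input is an addition made for this sketch and is not in the notebook; the toy data is hypothetical.

```python
def lexical_diversity(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def percentage(count, total):
    """`count` as a percentage of `total`; 0.0 when `total` is zero."""
    return 100 * count / total if total else 0.0


# 8 tokens, 3 distinct types ("a", "rose", "is"); "a" occurs 3 times.
toy_tokens = ["a", "rose", "is", "a", "rose", "is", "a", "rose"]

print(lexical_diversity(toy_tokens))                      # 3 / 8
print(percentage(toy_tokens.count("a"), len(toy_tokens)))  # 100 * 3 / 8
```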

A Closer Look at Python: Texts as Lists of Words

skipping some basic python parts of this section...



In [26]:

    
sent1









    Out[26]:





['Call', 'me', 'Ishmael', '.']



In [27]:

    
sent2









    Out[27]:





['The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.']



In [28]:

    
lexical_diversity(sent1)









    Out[28]:





1.0

List Concatenation



In [29]:

    
['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']









    Out[29]:





['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

Indexing Lists (...and Text objects)



In [30]:

    
text4[173]









    Out[30]:





'awaken'



In [31]:

    
text4.index('awaken')









    Out[31]:





173



In [32]:

    
text5[16715:16735]









    Out[32]:





['U86',
 'thats',
 'why',
 'something',
 'like',
 'gamefly',
 'is',
 'so',
 'good',
 'because',
 'you',
 'can',
 'actually',
 'play',
 'a',
 'full',
 'game',
 'without',
 'buying',
 'it']



In [33]:

    
text6[1600:1625]









    Out[33]:





['We',
 "'",
 're',
 'an',
 'anarcho',
 '-',
 'syndicalist',
 'commune',
 '.',
 'We',
 'take',
 'it',
 'in',
 'turns',
 'to',
 'act',
 'as',
 'a',
 'sort',
 'of',
 'executive',
 'officer',
 'for',
 'the',
 'week']

Computing with Language: Simple Statistics



In [34]:

    
saying = 'After all is said and done more is said than done'.split()
tokens = sorted(set(saying))
tokens[-2:]









    Out[34]:





['said', 'than']

Frequency Distributions



In [35]:

    
fdist1 = FreqDist(text1)
print(fdist1)









    



<FreqDist with 19317 samples and 260819 outcomes>



In [36]:

    
fdist1.most_common(50)









    Out[36]:





[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767),
 ('have', 760),
 ('there', 715),
 ('But', 705),
 ('or', 697),
 ('were', 680),
 ('now', 646),
 ('which', 640),
 ('?', 637),
 ('me', 627),
 ('like', 624)]



In [37]:

    
fdist1['whale']









    Out[37]:





906

The 50 most frequent tokens (punctuation included) account for almost half of the book



In [38]:

    
plt.figure(figsize=(18,10))
fdist1.plot(50, cumulative=True)
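
The cumulative plot answers the question "what share of all tokens do the top n types cover?". A sketch of that calculation with the standard library's Counter standing in for FreqDist; the function name and toy tokens are hypothetical.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent token types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens) if tokens else 0.0


# Toy token list (hypothetical data): 11 tokens, "the" x4, "and" x2, the rest once.
toy_tokens = "the cat and the dog and the bird saw the fish".split()

print(top_n_coverage(toy_tokens, 1))  # share covered by "the" alone
print(top_n_coverage(toy_tokens, 2))  # share covered by "the" and "and"
```

With the notebook's objects loaded, the same helper should work as top_n_coverage(list(text1), 50), assuming text1 can be iterated as a sequence of tokens.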

Fine-grained Selection of Words

Looking at long words of a text (maybe these will be more meaningful words?)



In [39]:

    
V = set(text1)
long_words = [w for w in V if len(w) > 15]

sorted(long_words)









    Out[39]:





['CIRCUMNAVIGATION',
 'Physiognomically',
 'apprehensiveness',
 'cannibalistically',
 'characteristically',
 'circumnavigating',
 'circumnavigation',
 'circumnavigations',
 'comprehensiveness',
 'hermaphroditical',
 'indiscriminately',
 'indispensableness',
 'irresistibleness',
 'physiognomically',
 'preternaturalness',
 'responsibilities',
 'simultaneousness',
 'subterraneousness',
 'supernaturalness',
 'superstitiousness',
 'uncomfortableness',
 'uncompromisedness',
 'undiscriminating',
 'uninterpenetratingly']

words that are longer than 7 characters and occur more than 7 times



In [40]:

    
fdist5 = FreqDist(text5)

sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)









    Out[40]:





['#14-19teens',
 '#talkcity_adults',
 '((((((((((',
 '........',
 'Question',
 'actually',
 'anything',
 'computer',
 'cute.-ass',
 'everyone',
 'football',
 'innocent',
 'listening',
 'remember',
 'seriously',
 'something',
 'together',
 'tomorrow',
 'watching']
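
The same length-and-frequency filter, written as a reusable function over a Counter so the thresholds are explicit. A sketch only: the function name, parameter names and toy data are hypothetical.

```python
from collections import Counter


def frequent_long_words(tokens, min_length, min_count):
    """Words longer than `min_length` characters that occur more than `min_count` times."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items()
                  if len(word) > min_length and count > min_count)


# Toy token list (hypothetical data).
toy_tokens = ["lol"] * 5 + ["everyone"] * 3 + ["computer"] * 2 + ["something"]

# "lol" is frequent but short; "something" is long but occurs only once.
print(frequent_long_words(toy_tokens, min_length=7, min_count=1))
```

Combining both conditions filters out the short high-frequency words and the long one-off words, which is why the result tends to be more typical of the text's content.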

collocation: a sequence of words that occur together unusually often ("red wine" is a collocation; "the wine" is not)



In [41]:

    
list(nltk.bigrams(['more', 'is', 'said', 'than', 'done'])) # bigrams() returns a generator









    Out[41]:





[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

collocations are essentially frequent bigrams, except that we want to pay more attention to the cases that involve rare words

collocations() returns bigrams that occur more often than expected, based on word frequency



In [42]:

    
text4.collocations()









    



United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties



In [43]:

    
text8.collocations()









    



would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
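
To make "more often than expected, based on word frequency" concrete, here is a sketch that scores bigrams with pointwise mutual information (PMI): log2 of the observed bigram probability over the product of the two word probabilities. PMI is swapped in here for illustration; this sketch does not reproduce the scoring or filtering that collocations() applies. The function name, the `min_count` threshold and the toy data are hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by PMI, highest first.

    Pairs seen fewer than `min_count` times are dropped, since PMI
    overrates pairs that occur only once.
    """
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total_words = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_first = word_counts[first] / total_words
        p_second = word_counts[second] / total_words
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy token list (hypothetical data).
toy_tokens = "the red wine and the cheese and the red wine and the bread".split()

for pair, score in pmi_bigrams(toy_tokens):
    print(pair, round(score, 2))
```

"red wine" scores highest because "red" and "wine" are individually rare but always appear together, whereas "and the" is frequent mostly because both of its words are frequent.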

counting other things

word length distribution in text1



In [47]:

    
[len(w) for w in text1][:10]









    Out[47]:





[1, 4, 4, 2, 6, 8, 4, 1, 9, 1]



In [48]:

    
fdist = FreqDist(len(w) for w in text1)
print(fdist)









    



<FreqDist with 19 samples and 260819 outcomes>



In [49]:

    
fdist









    Out[49]:





Counter({1: 47933,
         2: 38513,
         3: 50223,
         4: 42345,
         5: 26597,
         6: 17111,
         7: 14399,
         8: 9966,
         9: 6428,
         10: 3528,
         11: 1873,
         12: 1053,
         13: 567,
         14: 177,
         15: 70,
         16: 22,
         17: 12,
         18: 1,
         20: 1})



In [50]:

    
fdist.most_common()









    Out[50]:





[(3, 50223),
 (1, 47933),
 (4, 42345),
 (2, 38513),
 (5, 26597),
 (6, 17111),
 (7, 14399),
 (8, 9966),
 (9, 6428),
 (10, 3528),
 (11, 1873),
 (12, 1053),
 (13, 567),
 (14, 177),
 (15, 70),
 (16, 22),
 (17, 12),
 (18, 1),
 (20, 1)]



In [52]:

    
fdist.max()









    Out[52]:





3



In [53]:

    
fdist[3]









    Out[53]:





50223



In [54]:

    
fdist.freq(3)









    Out[54]:





0.19255882431878046

tokens of length 3 are the most frequent (~50k occurrences) and make up roughly 19% of all tokens in the book
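
The three lookups above (most frequent length, its count, its relative frequency) in one place, using the standard library's Counter in place of FreqDist. A sketch on toy data; the function name and tokens are hypothetical.

```python
from collections import Counter


def most_common_length(tokens):
    """Return (length, count, share) for the most frequent token length."""
    length_counts = Counter(len(token) for token in tokens)
    length, count = length_counts.most_common(1)[0]
    return length, count, count / len(tokens)


# Toy token list (hypothetical data); token lengths are 4, 2, 7, 1, 4, 5, 3.
toy_tokens = "Call me Ishmael . Some years ago".split()

length, count, share = most_common_length(toy_tokens)
print(length, count, share)
```

Note that the samples here are lengths (integers), not words, so the "most common" sample is the most frequent length, just as fdist.max() returned 3 above.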

Back to Python: Making Decisions and Taking Control

skipping basic python stuff

More accurate vocabulary size counting -- convert all strings to lowercase



In [56]:

    
len(text1)









    Out[56]:





260819



In [57]:

    
len(set(text1))









    Out[57]:





19317



In [58]:

    
len(set(word.lower() for word in text1))









    Out[58]:





17231

Only include purely alphabetic tokens -- this drops punctuation and any token containing digits



In [59]:

    
len(set(word.lower() for word in text1 if word.isalpha()))









    Out[59]:





16948
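
The three vocabulary counts above, side by side on a toy token list so the effect of each normalization step is visible. A sketch; the function name and data are hypothetical.

```python
def vocabulary_sizes(tokens):
    """Return (raw, lowercased, lowercased alphabetic-only) vocabulary sizes."""
    raw = set(tokens)
    lowered = {token.lower() for token in tokens}
    # str.isalpha() is False for punctuation and for tokens containing digits.
    alphabetic = {token.lower() for token in tokens if token.isalpha()}
    return len(raw), len(lowered), len(alphabetic)


# Toy token list (hypothetical data).
toy_tokens = ["The", "whale", ",", "the", "Whale", "!", "25", "whale"]

print(vocabulary_sizes(toy_tokens))
```

Lowercasing merges "The"/"the" and "Whale"/"whale"; the isalpha() filter then removes ",", "!" and "25".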


