Introduction

(Be sure to start this notebook with the command "ipython notebook --pylab inline".)

Section 1.1 of the NLTK book describes some pre-loaded books and the pre-defined functions that come with them. Section 1.2 reviews fundamental concepts about Python lists and strings; if you need to brush up on these concepts, study this subsection carefully. Be sure you know the difference between a set and a list and that you can work easily with Python slices.
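
As a quick self-check on lists, sets, and slices, here is a minimal sketch. The token list is a made-up toy example, not one of the NLTK texts, and the variable names are only illustrative.

```python
# Toy tokens standing in for a text (hypothetical data, not from the NLTK book).
tokens = ["the", "swallow", "may", "fly", "south", "with", "the", "sun"]

first_three = tokens[:3]    # slice: items at indices 0, 1 and 2
last_two = tokens[-2:]      # slice: the final two items
vocabulary = set(tokens)    # a set drops duplicates and keeps no order

print(first_three)
print(last_two)
print(len(tokens), len(vocabulary))   # the list keeps both copies of "the"
```

A list preserves order and repeats, which is why slicing it makes sense; a set only answers "which distinct items are there?".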

The part that I most want you to focus on is Section 1.3, which introduces NLTK's frequency distribution data structure. You need the books from Section 1.1 loaded and accessible for this part to work.

NLTK's Frequency Distribution Object

This data structure makes it easy to tally up frequencies across words and other items, and incorporate them into list comprehensions (and later we'll see the conditional frequency distribution as well).
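
To make the idea of tallying concrete without loading any corpus, here is a small sketch that uses the standard library's collections.Counter as a stand-in. In recent NLTK releases FreqDist is built on Counter, so the lookups below should behave the same way there, but treat that as an assumption to check against your installed version. The token list is invented for illustration.

```python
from collections import Counter

# Hypothetical toy tokens; Counter is used here as a stand-in for FreqDist.
tokens = ["Ni", "!", "Ni", "!", "Ni", "!", "shrubbery"]
freq = Counter(tokens)

print(freq["Ni"])       # how many times "Ni" occurs
print(freq["knight"])   # items never seen count as zero

# The tallies can be used directly inside a list comprehension.
long_items = sorted(w for w in freq if len(w) > 2)
print(long_items)
```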

The code below counts all of the words in Monty Python and the Holy Grail (text6 in the nltk.book collection), and the final line shows the 50 most frequent.


In [1]:
import nltk
from nltk.book import *        # loads in pre-defined texts
mp_freqdist = FreqDist(text6)  # compute the frequency distribution
mp_freqdist.items()[:50]       # show the top 50 (word, frequency) pairs


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Out[1]:
[(':', 1197),
 ('.', 816),
 ('!', 801),
 (',', 731),
 ("'", 421),
 ('[', 319),
 (']', 312),
 ('the', 299),
 ('I', 255),
 ('ARTHUR', 225),
 ('?', 207),
 ('you', 204),
 ('a', 188),
 ('of', 158),
 ('--', 148),
 ('to', 144),
 ('s', 141),
 ('and', 135),
 ('#', 127),
 ('...', 118),
 ('Oh', 110),
 ('it', 107),
 ('is', 106),
 ('-', 88),
 ('in', 86),
 ('that', 84),
 ('t', 77),
 ('1', 76),
 ('LAUNCELOT', 76),
 ('No', 76),
 ('your', 75),
 ('not', 70),
 ('GALAHAD', 69),
 ('KNIGHT', 68),
 ('What', 65),
 ('FATHER', 63),
 ('we', 62),
 ('BEDEVERE', 61),
 ('You', 61),
 ('We', 60),
 ('this', 59),
 ('no', 55),
 ('HEAD', 54),
 ('Well', 54),
 ('GUARD', 53),
 ('have', 53),
 ('Sir', 52),
 ('are', 52),
 ('A', 50),
 ('And', 50)]
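
A note for anyone running this on a newer setup: the slice on items() in the cell above appears to rely on the older NLTK behaviour of returning a list already sorted by frequency. Under Python 3, items() returns a view that cannot be sliced, and most_common(n) is the usual way to ask for the top n. The sketch below uses collections.Counter with invented tokens as a stand-in; I am assuming your FreqDist offers the same method, which is the case in the NLTK 3 releases I know of.

```python
from collections import Counter

# Hypothetical toy tokens; Counter stands in for FreqDist here.
tokens = ["the", "the", "the", "of", "of", "Camelot"]
freq = Counter(tokens)

# Ask for the two most frequent (item, count) pairs, highest count first.
top_two = freq.most_common(2)
print(top_two)   # [('the', 3), ('of', 2)]
```

Items with equal counts may come out in a different order than in the listing above, since tie-breaking is not guaranteed to match across versions.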

Task 1 Wow, those are some weird results. It might make some sense to look at the actual text itself. In the line below, write a line of code that pulls out the first 500 words of the text and shows them to you (hint: the text object is simply a list of strings).


In [2]:
" ".join(text6[:500])


Out[2]:
"SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there ! [ clop clop clop ] SOLDIER # 1 : Halt ! Who goes there ? ARTHUR : It is I , Arthur , son of Uther Pendragon , from the castle of Camelot . King of the Britons , defeator of the Saxons , sovereign of all England ! SOLDIER # 1 : Pull the other one ! ARTHUR : I am , ... and this is my trusty servant Patsy . We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot . I must speak with your lord and master . SOLDIER # 1 : What ? Ridden on a horse ? ARTHUR : Yes ! SOLDIER # 1 : You ' re using coconuts ! ARTHUR : What ? SOLDIER # 1 : You ' ve got two empty halves of coconut and you ' re bangin ' ' em together . ARTHUR : So ? We have ridden since the snows of winter covered this land , through the kingdom of Mercea , through -- SOLDIER # 1 : Where ' d you get the coconuts ? ARTHUR : We found them . SOLDIER # 1 : Found them ? In Mercea ? The coconut ' s tropical ! ARTHUR : What do you mean ? SOLDIER # 1 : Well , this is a temperate zone . ARTHUR : The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter , yet these are not strangers to our land ? SOLDIER # 1 : Are you suggesting coconuts migrate ? ARTHUR : Not at all . They could be carried . SOLDIER # 1 : What ? A swallow carrying a coconut ? ARTHUR : It could grip it by the husk ! SOLDIER # 1 : It ' s not a question of where he grips it ! It ' s a simple question of weight ratios ! A five ounce bird could not carry a one pound coconut . ARTHUR : Well , it doesn ' t matter . Will you go and tell your master that Arthur from the Court of Camelot is here . SOLDIER # 1 : Listen . In order to maintain air - speed velocity , a swallow needs to beat its wings forty - three times every second , right ? ARTHUR : Please ! SOLDIER # 1 : Am I right ? ARTHUR : I ' m not interested ! SOLDIER # 2 : It could be carried by an African swallow ! 
SOLDIER # 1 : Oh , yeah , an African swallow maybe , but not a European swallow . That ' s my point . SOLDIER # 2 : Oh , yeah , I agree with that . ARTHUR : Will you ask your"

Task 2 Now that you've looked at the text, what are two reasons for these strange results?

  • Answer 1: The text is a script, so much of the punctuation comes from script formatting (such as the colon after each speaker name) and from bracketed scene descriptions.
  • Answer 2: Because it is a script, speaker names are repeatedly entered in ALL CAPS.
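
One way to check both answers is to count how many tokens fall into each group. The sketch below does this for a short stretch of the opening scene; the helper names is_punct and is_all_caps are hypothetical, and the definitions are one reasonable choice rather than the only one.

```python
# A few tokens copied from the opening scene shown above.
tokens = "SOLDIER # 1 : Halt ! Who goes there ? ARTHUR : It is I , Arthur".split()

def is_punct(tok):
    # Hypothetical helper: a token with no letters or digits at all.
    return not any(ch.isalnum() for ch in tok)

def is_all_caps(tok):
    # Hypothetical helper: upper-case and longer than one character,
    # so the pronoun "I" is not treated as a speaker name.
    return tok.isupper() and len(tok) > 1

punct_count = sum(1 for t in tokens if is_punct(t))
caps_count = sum(1 for t in tokens if is_all_caps(t))
print(len(tokens), punct_count, caps_count)
```

Running the same two counts over all of text6 would show how much of the frequency table is taken up by script formatting.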

Task 3 Address one of the problems by modifying the text of Monty Python and rerunning the frequency distribution calculation. In the box below write your code to modify the text:


In [3]:
# Create a new text that removes the ALL CAPS names.
# The length check keeps single-character tokens such as "." since ".".upper() == ".";
# multi-character punctuation such as "--" and "..." is still removed.
new_text = [t for t in text6 if len(t) == 1 or t != t.upper()]
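
To see exactly what this filter keeps and drops, here is the same condition applied to a handful of hand-picked tokens. The second comprehension is an alternative I am suggesting, not part of the original answer: str.isupper() is False for tokens with no letters, so it leaves multi-character punctuation in place.

```python
# Hypothetical sample covering the interesting cases.
tokens = ["ARTHUR", ":", "I", "am", "--", "...", "Camelot", "!", "1"]

# Same condition as the cell above.
kept = [t for t in tokens if len(t) == 1 or t != t.upper()]
print(kept)       # "ARTHUR", "--" and "..." are dropped

# Alternative sketch: drop only tokens that contain letters and are all upper-case.
alt_kept = [t for t in tokens if not (t.isupper() and len(t) > 1)]
print(alt_kept)   # only "ARTHUR" is dropped
```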

Task 4 In the box below, show the output after applying this version of the text to a FreqDist.


In [4]:
new_freq = FreqDist(new_text)
new_freq.items()[:50]


Out[4]:
[(':', 1197),
 ('.', 816),
 ('!', 801),
 (',', 731),
 ("'", 421),
 ('[', 319),
 (']', 312),
 ('the', 299),
 ('I', 255),
 ('?', 207),
 ('you', 204),
 ('a', 188),
 ('of', 158),
 ('to', 144),
 ('s', 141),
 ('and', 135),
 ('#', 127),
 ('Oh', 110),
 ('it', 107),
 ('is', 106),
 ('-', 88),
 ('in', 86),
 ('that', 84),
 ('t', 77),
 ('1', 76),
 ('No', 76),
 ('your', 75),
 ('not', 70),
 ('What', 65),
 ('we', 62),
 ('You', 61),
 ('We', 60),
 ('this', 59),
 ('no', 55),
 ('Well', 54),
 ('have', 53),
 ('Sir', 52),
 ('are', 52),
 ('A', 50),
 ('And', 50),
 ('Ni', 47),
 ('on', 47),
 ('He', 46),
 ('me', 46),
 ('boom', 45),
 ('be', 43),
 ('he', 43),
 ('2', 42),
 ('Yes', 42),
 ('ha', 42)]

Task 5 How, if at all, has the output changed? Answer: The upper-case names have been removed, and so have the multi-character punctuation tokens '--' and '...'. This has made room for more very short words in the top 50.

Task 6 Following the example from the book, show a cumulative frequency plot for the words in Monty Python as newly computed, in the box below.


In [5]:
new_freq.plot(50, cumulative=True)
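
If you want to see the numbers behind a cumulative plot, the sketch below builds the running totals by hand. It uses collections.Counter with invented tokens as a stand-in for the FreqDist, and it is my own illustration of what "cumulative" means here, not a description of how NLTK draws the plot internally.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical toy tokens; Counter stands in for FreqDist.
tokens = ["the"] * 4 + ["of"] * 3 + ["a"] * 2 + ["Ni"]
freq = Counter(tokens)

ranked = freq.most_common(3)                                # top 3 by count
running = list(accumulate(count for _, count in ranked))    # running totals
total = sum(freq.values())                                  # all tokens
share = [r / total for r in running]                        # fraction covered so far

for (word, _), r, s in zip(ranked, running, share):
    print(word, r, round(s, 2))
```

Each point on the cumulative curve is the sum of the counts of that item and every more frequent item before it.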


Task 7 In the box below, write a list comprehension that uses the FreqDist you computed above to find all words in Monty Python that are longer than 5 characters and occur at least 5 times (hint: the text shows how to do a variation of this).
Show the output sorted in alphabetical order.


In [6]:
long_words = [word for word, count in new_freq.items() if len(word) > 5 and count >= 5]
long_words.sort()
long_words


Out[6]:
['Aaaaugh',
 'Arthur',
 'Bedevere',
 'Bridge',
 'Britons',
 'Camelot',
 'Castle',
 'Christ',
 'Concorde',
 'English',
 'Father',
 'French',
 'Galahad',
 'Knight',
 'Knights',
 'Launcelot',
 'Please',
 'afraid',
 'angels',
 'better',
 'carried',
 'castle',
 'chanting',
 'coconut',
 'course',
 'domine',
 'dramatic',
 'easily',
 'escape',
 'father',
 'forest',
 'giggle',
 'killed',
 'knight',
 'knights',
 'master',
 'mumble',
 'nothing',
 'people',
 'please',
 'questions',
 'rabbit',
 'really',
 'requiem',
 'sacred',
 'saying',
 'second',
 'shrubberies',
 'shrubbery',
 'simple',
 'singing',
 'spanking',
 'squeak',
 'swallow',
 'swallows',
 'taunting',
 'through']
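
Notice that the capitalized words all come before the lower-case ones in the output above, because Python's default string sort compares upper-case letters first. The sketch below repeats the filter on invented counts and shows one optional way to sort without regard to case; collections.Counter stands in for the FreqDist.

```python
from collections import Counter

# Hypothetical toy counts; Counter stands in for FreqDist.
tokens = (["shrubbery"] * 5 + ["Knights"] * 6 + ["coconut"] * 5
          + ["swallow"] * 4 + ["Ni"] * 9)
freq = Counter(tokens)

# Longer than 5 characters and seen at least 5 times, in default sort order.
long_words = sorted(w for w, n in freq.items() if len(w) > 5 and n >= 5)
print(long_words)   # ['Knights', 'coconut', 'shrubbery']

# Optional: ignore case when ordering.
folded = sorted(long_words, key=str.lower)
print(folded)       # ['coconut', 'Knights', 'shrubbery']
```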