Goal In this assignment, you'll take a first-pass look at your newly adopted text collection, similar to Wolfram Alpha's view of a text.
Title, author, and other metadata. First, print out some summary information that gives the background: what this collection is and where it comes from:
This collection contains transcripts of United States Presidential debates from 1960 to the present. The transcripts are taken from The Commission on Presidential Debates. The collection consists of 39 text files with some HTML markup.
In [1]:
import nltk
import re
# load the pre-trained Punkt sentence tokenizer for English
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
First, load in the file or files below and take a look at your text. An easy way to get started is to read it in and then run it through the sentence tokenizer to divide it up, even if this division is not fully accurate. You may have to do a bit of work to figure out which will be the "opening phrase" that Wolfram Alpha shows. Below, write the code to read in the text, split it into sentences, and print out the opening phrase.
In [2]:
# I created a module to make it easier to load a text corpus.
import corpii
# strip the HTML markup from the raw text before tokenizing
debates = nltk.clean_html(corpii.load_pres_debates().raw())
sents = sent_tokenizer.sentences_from_text(debates)
print sents[0]
Next, tokenize. Look at the first several dozen sentences to see what kinds of tokenization issues you'll have. Write a regular expression tokenizer, using nltk.regexp_tokenize() as seen in class, to do a nice job of breaking your text up into words. You may need to make changes to the regex pattern given in the book to make it work well for your text collection.
Note that this is the key part of the assignment. How you break up the words will have effects down the line for how you can manipulate your text collection. You may want to refine this code later.
In [3]:
# just looking at the first 50 sentences
sents[0:50]
Out[3]:
In [4]:
token_regex= """(?x)
[A-Z][A-Z ]+: #Catch name of speaker
# taken from ntlk book
|([A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():-_`] # these are separate tokens
"""
tokens = nltk.regexp_tokenize(debates, token_regex)
tokens
Out[4]:
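As a quick sanity check (an optional sketch, not required by the assignment), the same pattern can be run over a single sentence to eyeball how it breaks things up:

# run the tokenizer on just the first sentence to inspect the results
print nltk.regexp_tokenize(sents[0], token_regex)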
Compute word counts. Now compute your frequency distribution using a FreqDist over the words. Let's not do lowercasing or stemming yet. You can run this over the whole collection together, or sentence by sentence. Write the code for computing the FreqDist below.
In [5]:
freq = nltk.FreqDist(tokens)
freq.items()[:50]
Out[5]:
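If you'd rather go the sentence-by-sentence route mentioned above, a minimal sketch (assuming NLTK 2's FreqDist.inc(), consistent with the Python 2 code in this notebook) would look like:

# build the same distribution incrementally, one sentence at a time
freq_by_sent = nltk.FreqDist()
for s in sents:
    for w in nltk.regexp_tokenize(s, token_regex):
        freq_by_sent.inc(w)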
Creating a table. Python provides an easy way to line columns up in a table. You can specify a width for a string such as %6s, producing a string that is padded to width 6. It is right-justified by default, but a minus sign in front of the width switches it to left-justified, so %-3d means left-justify an integer with width 3. AND if you don't know the width in advance, you can make it a variable by using an asterisk rather than a number, as in '%*s' or '%-*d', and passing the width as an extra argument. Check out this example (this is just fyi):
In [6]:
print '%-16s' % 'Info type', '%-16s' % 'Value'
print '%-16s' % 'number of words', '%-16d' % 100000
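For instance, the variable-width asterisk form described above works like this (the width value here is just illustrative):

width = 16
print '%-*s' % (width, 'number of words'), '%-*d' % (width, 100000)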
Word Properties Table Next there is a table of word properties, which you should compute (skip unique word stems, since we will do stemming in class on Wed). Make a table that prints out the number of words, the number of unique words, the average word length, and the longest word.
You can make your table look prettier than the example I showed above if you like!
You can decide for yourself if you want to eliminate punctuation and function words (stop words) or not. It's your collection!
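If you do decide to drop them, a minimal sketch using NLTK's English stopword list (the content_tokens name is just illustrative; the tables below still use the full token list) might look like:

from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
# keep tokens that start with a word character and are not stopwords
content_tokens = [w for w in tokens if re.match(r"\w", w) and w.lower() not in stops]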
In [7]:
word_count = len(tokens)
unique_count = len(set(tokens))
avg_length = sum(len(w) for w in tokens)/float(word_count)
# get longest word
longest = ''
for w in tokens:
    if len(w) >= len(longest):
        longest = w
print "{:<16}|{:<16}|{:<16}|{:<16}".format("# Words", "Unique Words", "Avg Length", "Longest Word")
print "{:<16,d}|{:<16,d}|{:<16.3f}|{:<16}".format(word_count, unique_count, avg_length, longest)
Most Frequent Words List. Next is the most frequent words list. This table shows each word's percent of the total as well as its raw count, so compute that percentage as well.
In [8]:
print "{:<16}|{:<16}|{:<16}".format("Word", "Count", "Frequency")
for word, count in freq.items()[:50]:
    print "{:<16}|{:<16,d}|{:<16.3%}".format(word, count, freq.freq(word))
Most Frequent Capitalized Words List We haven't lower-cased the text, so you should be able to compute this one. Don't worry about whether capitalization comes from proper nouns, the start of sentences, or elsewhere. You need to make a different FreqDist to do this. Write the code for the new FreqDist and show the list here.
In [9]:
# count only the tokens that begin with an uppercase letter
cap_freq = nltk.FreqDist([w for w in tokens if re.match(r"[A-Z]", w)])
print "{:<16}|{:<16}|{:<16}".format("Word", "Count", "Frequency")
for word, count in cap_freq.items()[:50]:
    print "{:<16}|{:<16,d}|{:<16.3%}".format(word, count, cap_freq.freq(word))
Sentence Properties Table This summarizes the number of sentences and the average sentence length in words and in characters (you decide if you want to include stopwords/punctuation or not). Print those out in a table here.
In [10]:
sent_count = len(sents)
# average sentence length in characters and in words
avg_char = sum(len(s) for s in sents)/float(sent_count)
avg_word = sum(len(nltk.regexp_tokenize(s, token_regex)) for s in sents)/float(sent_count)
print "{:<16}|{:<16}|{:<16}".format("# Sents", "Avg Words", "Avg Char")
print "{:<16,d}|{:<16.3f}|{:<16.3f}".format(sent_count, avg_word, avg_char)