## IS620 week 8 - NLTK High Frequency Words

Daina Bouquin

Perform an analysis of high frequency words in a corpus of interest.

1. Choose a corpus of interest.
2. How many total unique words are in the corpus? (Please feel free to define unique words in any interesting, defensible way).
3. Taking the most common words, how many unique words represent half of the total words in the corpus?
4. Identify the 200 highest frequency words in this corpus.
5. Create a graph that shows the relative frequency of these 200 words.
6. Does the observed relative frequency of these words follow Zipf’s law? Explain.
7. In what ways do you think the frequency of the words in this corpus differs from "all words in all corpora"?
``````

In [1]:

import nltk
nltk.download('all')   # fetch the NLTK data collection; returns True on success

``````
``````

[nltk_data]    |
[nltk_data]    |     /Users/dainabouquin/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    |   ... (remaining corpora, grammars, taggers, and
[nltk_data]    |   models unzipped similarly)
[nltk_data]    |

Out[1]:

True

``````
``````

In [8]:

# make sure it's all set :)
nltk.word_tokenize("hello world")

``````
``````

Out[8]:

['hello', 'world']

``````
``````

In [3]:

%matplotlib inline
import pandas as pd
import seaborn as sns

``````

### Choose Corpus and Find Unique Words

I chose the text of Jane Austen's *Emma* from NLTK's Gutenberg corpus. I define unique words as the set of distinct alphabetic strings in the corpus, after removing common stop words such as 'a' and 'the'; keeping only alphabetic strings also removes numbers and punctuation. The size of the resulting set gives the number of unique words.
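One caveat worth noting: NLTK's English stopword list is all lowercase, and the filter below compares tokens verbatim, so capitalized forms such as 'I' and 'She' survive the filter (and indeed show up in the frequency table later). A minimal sketch of that behavior, using a tiny hand-picked stopword set rather than NLTK's full list:

```python
# Tiny stand-in stopword set (NLTK's real English list is much longer)
stops = {"the", "a", "i", "she", "and"}
tokens = ["The", "dog", "and", "the", "cat", "I", "saw"]

# Membership testing is case-sensitive, so 'The' and 'I' survive
kept = [w for w in tokens if w not in stops]
# kept == ['The', 'dog', 'cat', 'I', 'saw']
```

Lowercasing the tokens before filtering would remove these as well, at the cost of merging e.g. proper nouns with common nouns.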

``````

In [16]:

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

# strip punctuation and numerics using isalpha() method
emma = [w for w in emma if w.isalpha()]
# strip out stop words (note: the membership test is case-sensitive,
# so capitalized forms like 'I' and 'She' are kept)
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
emma = [w for w in emma if w not in stops]

``````
``````

In [17]:

# How many total unique words are in the corpus
emma_unique = set(emma)
len(emma_unique)

``````
``````

Out[17]:

7406

``````

### Most Common Words and Building a Frequency Distribution

Here we build a frequency distribution from the corpus and isolate the 200 most common words. `FreqDist.most_common()` returns a list of (word, count) tuples sorted by count, which is then loaded into a dataframe in order to calculate relative frequencies.

``````

In [39]:

# build the frequency distribution using FreqDist()
freq_emma = nltk.FreqDist(emma)

# make a dataframe to produce relative frequencies - top 200
emma_top200 = pd.DataFrame(freq_emma.most_common(200), columns=['word','count'])
emma_top200['rel_freq'] = emma_top200['count']/float(len(emma))
emma_top200.head(10)

``````
``````

Out[39]:

       word  count  rel_freq
0         I   3178  0.039032
1        Mr   1153  0.014161
2      Emma    865  0.010624
3     could    825  0.010133
4     would    815  0.010010
5       Mrs    699  0.008585
6      Miss    592  0.007271
7      must    564  0.006927
8       She    562  0.006902
9   Harriet    506  0.006215

``````
``````

We want to find the number of most-common words that together make up approximately 50% of the corpus. Plotting the cumulative distribution shows that roughly the 250 most frequent words account for half of all words in the corpus; this is confirmed by summing the relative frequencies of the first 250 rows.
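The coverage threshold can also be computed directly with a cumulative sum instead of being read off a plot. A sketch of the idea on hypothetical counts (the real analysis would substitute the counts from `freq_emma.most_common()`):

```python
import pandas as pd

# Hypothetical word counts, most common first, standing in for the corpus
counts = pd.Series([40, 25, 15, 10, 5, 3, 2],
                   index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

rel = counts / counts.sum()           # relative frequencies
cum = rel.cumsum()                    # cumulative coverage by rank
n_half = int((cum < 0.5).sum()) + 1   # first rank reaching 50% coverage
# here the top 2 words already cover 65% of all tokens, so n_half == 2
```

The same three lines applied to the real frequency table give the exact count rather than the visual estimate of ~250.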

``````

In [40]:

# top 500
emma_top500 = pd.DataFrame(freq_emma.most_common(500),columns=['word','count'])
emma_top500['rel_freq'] = emma_top500['count']/float(len(emma))

``````
``````

In [19]:

len(emma)/2.0   ## half of all words

``````
``````

Out[19]:

40710.5

``````
``````

In [41]:

sum(emma_top500[:250]['rel_freq']) # The first 250 words account for approximately half of all words

``````
``````

Out[41]:

0.51471978973483457

``````
``````

In [42]:

freq_emma.plot(250, cumulative=True)

``````
``````
(cumulative frequency plot for the top 250 words)
``````

The following barplot shows the relative frequencies of the 200 most frequent words.

``````

In [43]:

g = sns.barplot(x=emma_top200.word, y=emma_top200.rel_freq)

``````
``````
(bar plot of relative frequencies for the top 200 words)
``````

The observed relative frequencies do follow Zipf's law: the frequency of each word is approximately inversely proportional to its rank in the frequency table. This rank–frequency relationship tends to hold across corpora generally; what differs from corpus to corpus is which words occupy the top ranks, since that depends on the content of the corpus being analyzed.
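One way to make the Zipf check quantitative is to fit a line to log(frequency) versus log(rank): a slope near -1 indicates Zipfian behavior. A sketch on synthetic, perfectly Zipfian counts (a real check would substitute the observed counts from `freq_emma`):

```python
import numpy as np

# Synthetic, perfectly Zipfian frequencies: freq proportional to 1/rank
ranks = np.arange(1, 201)
freqs = 3000.0 / ranks

# Fit log(freq) = slope * log(rank) + intercept; Zipf predicts slope near -1
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
# for these synthetic counts the fitted slope is exactly -1
```

Real corpora typically deviate somewhat at the extremes (the very top ranks and the long tail of hapaxes), so a slope in the rough neighborhood of -1 over the middle ranks is the usual evidence.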