## IS620 week 8 - NLTK High Frequency Words

Daina Bouquin

Perform an analysis of high frequency words in a corpus of interest.

1. Choose a corpus of interest.
2. How many total unique words are in the corpus? (Please feel free to define unique words in any interesting, defensible way).
3. Taking the most common words, how many unique words represent half of the total words in the corpus?
4. Identify the 200 highest frequency words in this corpus.
5. Create a graph that shows the relative frequency of these 200 words.
6. Does the observed relative frequency of these words follow Zipf’s law? Explain.
7. In what ways do you think the frequency of the words in this corpus differs from "all words in all corpora"?
``````

In [1]:

import nltk
nltk.download('all')   # fetch the NLTK data collection; returns True on success

``````
``````

[nltk_data]    |
[nltk_data]    |     /Users/dainabouquin/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    |   ... (remaining corpora, grammars, taggers, and
[nltk_data]    |   models unzipped similarly)
[nltk_data]    |

Out[1]:

True

``````
``````

In [8]:

# make sure it's all set :)
nltk.word_tokenize("hello world")

``````
``````

Out[8]:

['hello', 'world']

``````
``````

In [3]:

%matplotlib inline
import pandas as pd
import seaborn as sns

``````

### Choose Corpus and Find Unique Words

I chose the text of Jane Austen's *Emma* from NLTK's Gutenberg corpus. I define unique words as the set of distinct alphabetic strings in the corpus, after removing common stop words such as 'a' and 'the'; keeping only alphabetic strings also removes numbers and punctuation. The size of the resulting set gives the number of unique words.
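One caveat worth noting: NLTK's English stopword list is all lowercase, and the filter below compares tokens verbatim, so capitalized forms such as 'I' and 'She' survive the filter (and indeed show up in the frequency table later). A minimal sketch of that behavior, using a tiny hand-picked stopword set rather than NLTK's full list:

```python
# Tiny stand-in stopword set (NLTK's real English list is much longer)
stops = {"the", "a", "i", "she", "and"}
tokens = ["The", "dog", "and", "the", "cat", "I", "saw"]

# Membership testing is case-sensitive, so 'The' and 'I' survive
kept = [w for w in tokens if w not in stops]
# kept == ['The', 'dog', 'cat', 'I', 'saw']
```

Lowercasing the tokens before filtering would remove these as well, at the cost of merging e.g. proper nouns with common nouns.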

``````

In [16]:

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

# strip punctuation and numerics using isalpha() method
emma = [w for w in emma if w.isalpha()]
# strip out stop words (note: the membership test is case-sensitive,
# so capitalized forms like 'I' and 'She' are kept)
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
emma = [w for w in emma if w not in stops]

``````
``````

In [17]:

# How many total unique words are in the corpus
emma_unique = set(emma)
len(emma_unique)

``````
``````

Out[17]:

7406

``````

### Most Common Words and Building a Frequency Distribution

Here we build a frequency distribution from the corpus and isolate the 200 most common words. `FreqDist.most_common()` returns a list of (word, count) tuples sorted by count, which is then loaded into a dataframe in order to calculate relative frequencies.

``````

In [39]:

# build the frequency distribution using FreqDist()
freq_emma = nltk.FreqDist(emma)

# make a dataframe to produce relative frequencies - top 200
emma_top200 = pd.DataFrame(freq_emma.most_common(200), columns=['word','count'])
emma_top200['rel_freq'] = emma_top200['count']/float(len(emma))
emma_top200.head(10)

``````
``````

Out[39]:

       word  count  rel_freq
0         I   3178  0.039032
1        Mr   1153  0.014161
2      Emma    865  0.010624
3     could    825  0.010133
4     would    815  0.010010
5       Mrs    699  0.008585
6      Miss    592  0.007271
7      must    564  0.006927
8       She    562  0.006902
9   Harriet    506  0.006215

``````
``````

We want to find the number of most-common words that together make up approximately 50% of the corpus. Plotting the cumulative distribution shows that roughly the 250 most frequent words account for half of all words in the corpus; this is confirmed by summing the relative frequencies of the first 250 rows.
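The coverage threshold can also be computed directly with a cumulative sum instead of being read off a plot. A sketch of the idea on hypothetical counts (the real analysis would substitute the counts from `freq_emma.most_common()`):

```python
import pandas as pd

# Hypothetical word counts, most common first, standing in for the corpus
counts = pd.Series([40, 25, 15, 10, 5, 3, 2],
                   index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

rel = counts / counts.sum()           # relative frequencies
cum = rel.cumsum()                    # cumulative coverage by rank
n_half = int((cum < 0.5).sum()) + 1   # first rank reaching 50% coverage
# here the top 2 words already cover 65% of all tokens, so n_half == 2
```

The same three lines applied to the real frequency table give the exact count rather than the visual estimate of ~250.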

``````

In [40]:

# top 500
emma_top500 = pd.DataFrame(freq_emma.most_common(500),columns=['word','count'])
emma_top500['rel_freq'] = emma_top500['count']/float(len(emma))

``````
``````

In [19]:

len(emma)/2.0   ## half of all words

``````
``````

Out[19]:

40710.5

``````
``````

In [41]:

sum(emma_top500[:250]['rel_freq']) # The first 250 words account for approximately half of all words

``````
``````

Out[41]:

0.51471978973483457

``````
``````

In [42]:

freq_emma.plot(250, cumulative=True)

``````
``````
(cumulative frequency plot for the top 250 words)
``````

The following barplot shows the relative frequencies of the 200 most frequent words.

``````

In [43]:

g = sns.barplot(x=emma_top200.word, y=emma_top200.rel_freq)

``````
``````
(bar plot of relative frequencies for the top 200 words)
``````

The observed relative frequencies do follow Zipf's law: the frequency of each word is approximately inversely proportional to its rank in the frequency table. This rank–frequency relationship tends to hold across corpora generally; what differs from corpus to corpus is which words occupy the top ranks, since that depends on the content of the corpus being analyzed.
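One way to make the Zipf check quantitative is to fit a line to log(frequency) versus log(rank): a slope near -1 indicates Zipfian behavior. A sketch on synthetic, perfectly Zipfian counts (a real check would substitute the observed counts from `freq_emma`):

```python
import numpy as np

# Synthetic, perfectly Zipfian frequencies: freq proportional to 1/rank
ranks = np.arange(1, 201)
freqs = 3000.0 / ranks

# Fit log(freq) = slope * log(rank) + intercept; Zipf predicts slope near -1
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
# for these synthetic counts the fitted slope is exactly -1
```

Real corpora typically deviate somewhat at the extremes (the very top ranks and the long tail of hapaxes), so a slope in the rough neighborhood of -1 over the middle ranks is the usual evidence.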