Daina Bouquin
Perform an analysis of high frequency words in a corpus of interest.
Complete the following tasks:
In [1]:
import nltk
nltk.download('all')
Out[1]:
In [8]:
# make sure it's all set :)
nltk.word_tokenize("hello world")
Out[8]:
In [3]:
%matplotlib inline
import pandas as pd
import seaborn as sns
I chose the Emma corpus (Jane Austen's Emma) from NLTK's Gutenberg collection. I define unique words as the set of distinct alphabetic strings in the corpus, after removing common stop words such as 'a' and 'the'. Filtering to alphabetic strings also removes numbers and punctuation from the corpus. The length of the resulting set gives the number of unique words.
In [16]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
# strip punctuation and numerics using the isalpha() method
emma = [w for w in emma if w.isalpha()]
# strip out stop words (use a set so membership checks are fast)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
emma = [w for w in emma if w not in stop_words]
In [17]:
# How many total unique words are in the corpus
emma_unique = set(emma)
len(emma_unique)
Out[17]:
In [39]:
# build the frequency distribution using FreqDist()
freq_emma = nltk.FreqDist(emma)
# make a dataframe to produce relative frequencies - top 200
emma_top200 = pd.DataFrame(freq_emma.most_common(200),columns=['word','count'])
emma_top200['rel_freq'] = emma_top200['count']/float(len(emma))
emma_top200.head(10)
Out[39]:
We want to find the number of most common unique words that make up approximately 50% of the dataset. Plotting the cumulative distribution shows that roughly the 250 most frequent words account for 50% of all words in the dataset. This is confirmed by summing the relative frequencies of the first 250 words.
In [40]:
# top 500
emma_top500 = pd.DataFrame(freq_emma.most_common(500),columns=['word','count'])
emma_top500['rel_freq'] = emma_top500['count']/float(len(emma))
In [19]:
len(emma)/2.0 ## half of all words
Out[19]:
In [41]:
sum(emma_top500[:250]['rel_freq']) # The first 250 words account for approximately half of all words
Out[41]:
In [42]:
freq_emma.plot(250, cumulative=True)
The following barplot shows the relative frequencies of the 200 most frequent unique words.
In [43]:
# with 200 words the x-axis labels overlap; the overall shape of the distribution is what matters here
g = sns.barplot(x=emma_top200.word, y=emma_top200.rel_freq)
The observed relative frequencies do follow Zipf's Law: the frequency of any word is approximately inversely proportional to its rank in the frequency table. This pattern appears across most natural-language corpora; what differs is the words themselves, which depend on the content of the corpus being analyzed.
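As a rough check (a minimal sketch, not part of the original assignment), we can plot word frequency against rank on log-log axes; under Zipf's Law the points should fall roughly along a straight line with slope near -1. This assumes freq_emma from the earlier cell is still in scope.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# frequencies sorted from most to least common; FreqDist behaves like a Counter
freqs = np.array(sorted(freq_emma.values(), reverse=True))
ranks = np.arange(1, len(freqs) + 1)

plt.loglog(ranks, freqs, marker='.', linestyle='none')
plt.xlabel('rank (log scale)')
plt.ylabel('frequency (log scale)')
plt.title('Zipf check: word frequency vs. rank in Emma')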