Word Frequency in Literary Text

Click on the play icon above to "run" each box of code.

This program builds a table of how often each word appears in a text file and sorts it to show the words the author used most frequently. This example uses Jane Eyre, but many other books are available in .txt format to choose from.



In [1]:

    
import re
import pandas as pd
import urllib.request

# maps each word to the number of times it appears
frequency = {}

# download the novel as plain text
document_text = urllib.request.urlopen(
    'http://www.textfiles.com/etext/FICTION/bronte-jane-178.txt'
).read().decode('utf-8')

# lowercase everything so "The" and "the" count as the same word
text_string = document_text.lower()
# find every word that is 3 to 15 letters long
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# build one table row per word
d = []
for word, count in frequency.items():
    d.append({'word': word, 'Frequency': count})

df = pd.DataFrame(d)
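
The counting loop above can also be written with Python's built-in collections.Counter. The sketch below is an alternative, not the notebook's own method, and it uses a short made-up sentence (sample_text) instead of the downloaded novel so it runs without a network connection.

```python
import re
from collections import Counter

# Hypothetical sample text, standing in for the downloaded novel.
sample_text = "The cat saw the dog. The dog saw a bird, and the bird flew."

# Same pattern as the notebook: whole words of 3 to 15 lowercase letters.
words = re.findall(r'\b[a-z]{3,15}\b', sample_text.lower())

# Counter tallies every word in a single pass.
word_counts = Counter(words)

# most_common(n) returns the n most frequent words, highest count first.
top_words = word_counts.most_common(3)
print(top_words)
```

Because the pattern requires at least three letters, one- and two-letter words such as "a" are never counted.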

Word frequency list



In [2]:

    
df1 = df.sort_values(by="Frequency", ascending=False)

# head(10) displays the first 10 rows; change the number to show more or fewer
df1.head(10)

Filtering the results

This next part removes some of the less interesting words from the list.



In [3]:

    
df2 = df1.query(
    'word not in ("the","and","it","was","for","but","that")'
)
df2.head(10)
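
The same idea, dropping common words, can be sketched without pandas by skipping them before counting. This is a swapped-in approach for illustration, not the notebook's query; sample_text and stop_words are made-up names and values.

```python
import re
from collections import Counter

# Hypothetical sample text and stop-word set, for illustration only.
sample_text = "It was the best of times and it was the worst of times."
stop_words = {"the", "and", "was", "for", "but", "that"}

# Same pattern as the notebook: words of 3 to 15 letters.
# Two-letter words such as "it" and "of" never match, so they need no filtering.
words = re.findall(r'\b[a-z]{3,15}\b', sample_text.lower())

# Keep only the words that are not in the stop-word set, then count them.
filtered_counts = Counter(w for w in words if w not in stop_words)
print(filtered_counts.most_common(5))
```

Using a set for the omitted words keeps each membership check fast, and adding a word to the list is a one-word change.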

Don't be afraid to edit the code to display more or fewer results. Maybe add some words to the omitted list? Then go to kernel --> restart and run all in the menu at the top. You can also paste in the URL for a different novel.

You can also start over by clicking file --> revert to checkpoint.



In [ ]:

	Frequency	word
11630	7834	the
7927	6622	and
3092	2991	you
1025	2523	was
368	1712	her
267	1662	that
4638	1558	not
5145	1487	had
6686	1475	she
3158	1403	with

	Frequency	word
3092	2991	you
368	1712	her
4638	1558	not
5145	1487	had
6686	1475	she
3158	1403	with
4364	1215	his
12035	1083	have
2092	722	what
6596	720	him