Word Frequency in Literary Text

Click on the play icon above to "run" each box of code.

This program builds a table of how often each word appears in a text file, then sorts the table so the words the author used most come first. This example uses Jane Eyre, but any book available online in .txt format will work.


In [1]:
import re
import urllib.request

import pandas as pd

# Download the novel as plain text
url = 'http://www.textfiles.com/etext/FICTION/bronte-jane-178.txt'
document_text = urllib.request.urlopen(url).read().decode('utf-8')

# Lowercase everything so "The" and "the" count as the same word
text_string = document_text.lower()

# Find every whole word that is 3 to 15 letters long
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Tally how many times each word appears
frequency = {}
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Build one table row per word
d = []
for word in frequency:
    d.append({'word': word, 'Frequency': frequency[word]})

df = pd.DataFrame(d)
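
To see what the pattern and the counting loop in the cell above are doing, here is a small sketch that runs them on one made-up sentence instead of the whole novel. The sentence in `sample` is invented for illustration, and `Counter` from the standard library stands in for the `frequency.get(word, 0) + 1` loop; it should give the same tallies, but it is not what the notebook itself uses.

```python
import re
from collections import Counter

# Hypothetical sample text, not taken from the novel.
sample = "It was a cold day. The day was long, and the walk was out of the question."

# Same pattern as the notebook: whole words of 3 to 15 lowercase letters.
# Shorter words such as "it", "a" and "of" are skipped.
words = re.findall(r'\b[a-z]{3,15}\b', sample.lower())

# Counter tallies each word, like the dictionary loop in the cell above.
counts = Counter(words)

print(words)
print(counts.most_common(3))
```

Note that two-letter words never reach the table, so there is no need to filter them out later.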

Word frequency list


In [2]:
df1 = df.sort_values(by="Frequency", ascending=False)

# head(n) displays the first n rows; change 10 to show more or fewer
df1.head(10)


Out[2]:
       Frequency  word
11630       7834  the
7927        6622  and
3092        2991  you
1025        2523  was
368         1712  her
267         1662  that
4638        1558  not
5145        1487  had
6686        1475  she
3158        1403  with

Filtering the results

This next part removes some of the less interesting words from the list.


In [3]:
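# Words to leave out of the table; add your own to this list.
# ("it" never shows up anyway: the pattern in In [1] only keeps words of 3+ letters.)
omit_words = ["the", "and", "it", "was", "for", "but", "that"]
df2 = df1.query('word not in @omit_words')
df2.head(10)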


Out[3]:
       Frequency  word
3092        2991  you
368         1712  her
4638        1558  not
5145        1487  had
6686        1475  she
3158        1403  with
4364        1215  his
12035       1083  have
2092         722  what
6596         720  him

Don't be afraid to edit the code to display more or fewer results. Maybe add some words to the omitted list? Then go to Kernel --> Restart & Run All in the menu at the top. You can also paste in the URL of a different novel in .txt format.

You can also start over by clicking File --> Revert to Checkpoint.
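
If you want to try several omitted lists or several texts, one option is to wrap the steps in a small function. The sketch below is only an illustration: `top_words`, `sample` and `omit_words` are made-up names that do not appear in the notebook, and it uses `Counter` from the standard library rather than a pandas table.

```python
import re
from collections import Counter

def top_words(text, omit=(), n=10):
    """Return the n most frequent 3-to-15-letter words in text, skipping any in omit."""
    words = re.findall(r'\b[a-z]{3,15}\b', text.lower())
    omitted = set(omit)
    counts = Counter(w for w in words if w not in omitted)
    return counts.most_common(n)

# Hypothetical sample text; in the notebook you could pass document_text instead.
sample = "She had the book and he had the pen, but the pen was not with her."
omit_words = ["the", "and", "was", "for", "but", "that"]

print(top_words(sample, omit=omit_words, n=3))
```

Adding a word to `omit_words` and calling the function again is enough to see how the top of the list changes.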

