Click on the play icon above to "run" each box of code.
This program builds a table of how often each word appears in a file, then sorts it to show the words the author used most frequently. This example uses Jane Eyre, but there are many other books in .txt format to choose from here.
In [1]:
import re
import pandas as pd
import urllib.request
frequency = {}
# download the novel and decode it into a string
document_text = urllib.request.urlopen(
    'http://www.textfiles.com/etext/FICTION/bronte-jane-178.txt'
).read().decode('utf-8')
text_string = document_text.lower()
# find every word that is 3 to 15 letters long
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
# count how many times each word appears
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
# build a table with one row per word
d = []
for word in frequency:
    d.append({'word': word, 'Frequency': frequency[word]})
df = pd.DataFrame(d)
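To see what the counting step does without downloading a whole novel, here is a minimal sketch that runs the same pattern on one short sentence. The sample sentence is made up for illustration, and `collections.Counter` is swapped in for the dictionary loop above; it is an alternative way to count, not the method this notebook uses.

```python
import re
from collections import Counter

# Made-up sample text, standing in for the novel
sample = "The cat sat on the mat. The cat is at home."

# Same pattern as above: only words of 3 to 15 letters are kept,
# so short words such as "on", "is" and "at" are skipped
words = re.findall(r'\b[a-z]{3,15}\b', sample.lower())

# Counter tallies each word; most_common(2) returns the two most frequent
counts = Counter(words)
top = counts.most_common(2)
print(top)  # [('the', 3), ('cat', 2)]
```

Because the pattern requires at least three letters, very short words never reach the table, which is why "on" and "at" do not appear in the counts.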
In [2]:
df1 = df.sort_values(by="Frequency", ascending=False)
# head(10) displays the first 10 rows; change the number to show more or fewer
df1.head(10)
Out[2]:
In [3]:
# leave out very common words that say little about the book
df2 = df1.query('word not in ("the","and","it","was","for","but","that")')
df2.head(10)
Out[3]:
Don't be afraid to edit the code to display more or fewer results. Maybe add some words to the omitted list? Then go to Kernel --> Restart & Run All in the menu at the top. You can paste in the URL for a different novel from here.
You can also start over by clicking File --> Revert to Checkpoint.
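If the omitted list grows long, one option is to keep it in a Python list and filter with `isin` instead of typing every word into the query string. This is only a sketch: the small `df1` below uses made-up counts so the example runs on its own, and the extra words "not" and "you" are hypothetical additions.

```python
import pandas as pd

# Small stand-in for df1 with made-up counts, for illustration only
df1 = pd.DataFrame({
    'word': ['the', 'and', 'jane', 'that', 'rochester'],
    'Frequency': [50, 40, 12, 30, 9],
})

# The omitted words from the query above, kept in an editable list
omitted = ["the", "and", "it", "was", "for", "but", "that"]
omitted += ["not", "you"]  # hypothetical additions

# Keep only the rows whose word is NOT in the omitted list
df2 = df1[~df1['word'].isin(omitted)]
print(df2.head(10))
```

In the real notebook you would skip the stand-in table and apply the last three statements to the `df1` built earlier.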