DM_09_04

Import packages

We'll use "codecs" for reading the text files, "re" (for "regular expressions") and "collections" for working with tokens, and "nltk" ("natural language toolkit") in several operations.


In [45]:
% matplotlib inline

import codecs
import re
import copy
import collections

import numpy as np
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from __future__ import division

We need some specialized functions from NLTK that are not included by default. It is possible to download just the "stopwords" portion but it may be easier to simply download everything in NLTK. Note that this is very time consuming; it took over 30 minutes on my machine.


In [ ]:
nltk.download('all')

Get the "stopwords" package from NLTK.


In [46]:
from nltk.corpus import stopwords

Read data


In [47]:
with codecs.open("JaneEyre.txt", "r", encoding="utf-8") as f:
    jane_eyre = f.read()
with codecs.open("WutheringHeights.txt", "r", encoding="utf-8") as f:
    wuthering_heights = f.read()

Process data

Check for English stopwords.


In [48]:
esw = stopwords.words('english')
esw.append("would")

Filter tokens (using regular expressions).


In [49]:
word_pattern = re.compile("^\w+$")

Create a token counter function.


In [50]:
def get_text_counter(text):
    tokens = WordPunctTokenizer().tokenize(PorterStemmer().stem(text))
    tokens = list(map(lambda x: x.lower(), tokens))
    tokens = [token for token in tokens if re.match(word_pattern, token) and token not in esw]
    return collections.Counter(tokens), len(tokens)

Create a function to calculate the absolute frequency and relative frequency of the most common words.


In [51]:
def make_df(counter, size):
    abs_freq = np.array([el[1] for el in counter])
    rel_freq = abs_freq / size
    index = [el[0] for el in counter]
    df = pd.DataFrame(data=np.array([abs_freq, rel_freq]).T, index=index, columns=["Absolute frequency", "Relative frequency"])
    df.index.name = "Most common words"
    return df

Analyze individual texts

Calculate the most common words of Jane Eyre. This takes a while. Then display the 10 most common.


In [52]:
je_counter, je_size = get_text_counter(jane_eyre)
make_df(je_counter.most_common(10), je_size)


Out[52]:
Absolute frequency Relative frequency
Most common words
one 593.0 0.006789
said 584.0 0.006686
mr 543.0 0.006217
could 504.0 0.005770
like 397.0 0.004545
rochester 366.0 0.004190
well 348.0 0.003984
jane 342.0 0.003916
little 341.0 0.003904
sir 315.0 0.003607

Save the 1000 most common words of Jane Eyre to CSV.


In [53]:
je_df = make_df(je_counter.most_common(1000), je_size)
je_df.to_csv("JE_1000.csv")

Calculate the most common words of Wuthering Heights. This also takes a while. Display the 10 most common.


In [54]:
wh_counter, wh_size = get_text_counter(wuthering_heights)
make_df(wh_counter.most_common(10), wh_size)


Out[54]:
Absolute frequency Relative frequency
Most common words
heathcliff 475.0 0.008735
linton 404.0 0.007429
catherine 379.0 0.006970
said 375.0 0.006896
mr 312.0 0.005738
one 290.0 0.005333
could 279.0 0.005131
master 205.0 0.003770
shall 191.0 0.003512
come 190.0 0.003494

Save the 1000 most common words of Wuthering Heights to CSV.


In [55]:
wh_df = make_df(wh_counter.most_common(1000), wh_size)
wh_df.to_csv("WH_1000.csv")

Compare texts

Find the most common words across the two documents.


In [56]:
all_counter = wh_counter + je_counter
all_df = make_df(wh_counter.most_common(1000), 1)
most_common_words = all_df.index.values

Create a data frame with the word frequency differences.


In [57]:
df_data = []
for word in most_common_words:
    je_c = je_counter.get(word, 0) / je_size
    wh_c = wh_counter.get(word, 0) / wh_size
    d = abs(je_c - wh_c)
    df_data.append([je_c, wh_c, d])
dist_df = pd.DataFrame(data=df_data, index=most_common_words,
                       columns=["Jane Eyre relative frequency", "Wuthering Heights relative frequency",
                                "Relative frequency difference"])
dist_df.index.name = "Most common words"
dist_df.sort_values("Relative frequency difference", ascending=False, inplace=True)

Display the most distinctive words.


In [58]:
dist_df.head(10)


Out[58]:
Jane Eyre relative frequency Wuthering Heights relative frequency Relative frequency difference
Most common words
heathcliff 0.000000 0.008735 0.008735
linton 0.000000 0.007429 0.007429
catherine 0.000011 0.006970 0.006958
hareton 0.000000 0.003292 0.003292
sir 0.003607 0.000791 0.002816
master 0.001133 0.003770 0.002636
joseph 0.000000 0.002575 0.002575
earnshaw 0.000000 0.002372 0.002372
cathy 0.000000 0.002280 0.002280
edgar 0.000000 0.002133 0.002133

Save the full list of distinctive words to a CSV entitled "bronte.csv."


In [59]:
dist_df.to_csv("bronte.csv")

In [ ]: