DM_09_04

Import packages

We'll use "codecs" for reading the text files, "re" (for "regular expressions") and "collections" for working with tokens, and "nltk" ("natural language toolkit") in several operations.



In [45]:

    
% matplotlib inline

import codecs
import re
import copy
import collections

import numpy as np
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from __future__ import division

We need some specialized functions from NLTK that are not included by default. It is possible to download just the "stopwords" portion but it may be easier to simply download everything in NLTK. Note that this is very time consuming; it took over 30 minutes on my machine.



In [ ]:

    
nltk.download('all')

Get the "stopwords" package from NLTK.



In [46]:

    
from nltk.corpus import stopwords

Read data



In [47]:

    
with codecs.open("JaneEyre.txt", "r", encoding="utf-8") as f:
    jane_eyre = f.read()
with codecs.open("WutheringHeights.txt", "r", encoding="utf-8") as f:
    wuthering_heights = f.read()

Process data

Check for English stopwords.



In [48]:

    
esw = stopwords.words('english')
esw.append("would")

Filter tokens (using regular expressions).



In [49]:

    
word_pattern = re.compile("^\w+$")

Create a token counter function.



In [50]:

    
def get_text_counter(text):
    tokens = WordPunctTokenizer().tokenize(PorterStemmer().stem(text))
    tokens = list(map(lambda x: x.lower(), tokens))
    tokens = [token for token in tokens if re.match(word_pattern, token) and token not in esw]
    return collections.Counter(tokens), len(tokens)

Create a function to calculate the absolute frequency and relative frequency of the most common words.



In [51]:

    
def make_df(counter, size):
    abs_freq = np.array([el[1] for el in counter])
    rel_freq = abs_freq / size
    index = [el[0] for el in counter]
    df = pd.DataFrame(data=np.array([abs_freq, rel_freq]).T, index=index, columns=["Absolute frequency", "Relative frequency"])
    df.index.name = "Most common words"
    return df

Analyze individual texts

Calculate the most common words of Jane Eyre. This takes a while. Then display the 10 most common.



In [52]:

    
je_counter, je_size = get_text_counter(jane_eyre)
make_df(je_counter.most_common(10), je_size)









    Out[52]:






  
    
      
      Absolute frequency
      Relative frequency
    
    
      Most common words
      
      
    
  
  
    
      one
      593.0
      0.006789
    
    
      said
      584.0
      0.006686
    
    
      mr
      543.0
      0.006217
    
    
      could
      504.0
      0.005770
    
    
      like
      397.0
      0.004545
    
    
      rochester
      366.0
      0.004190
    
    
      well
      348.0
      0.003984
    
    
      jane
      342.0
      0.003916
    
    
      little
      341.0
      0.003904
    
    
      sir
      315.0
      0.003607

Save the 1000 most common words of Jane Eyre to CSV.



In [53]:

    
je_df = make_df(je_counter.most_common(1000), je_size)
je_df.to_csv("JE_1000.csv")

Calculate the most common words of Wuthering Heights. This also takes a while. Display the 10 most common.



In [54]:

    
wh_counter, wh_size = get_text_counter(wuthering_heights)
make_df(wh_counter.most_common(10), wh_size)









    Out[54]:






  
    
      
      Absolute frequency
      Relative frequency
    
    
      Most common words
      
      
    
  
  
    
      heathcliff
      475.0
      0.008735
    
    
      linton
      404.0
      0.007429
    
    
      catherine
      379.0
      0.006970
    
    
      said
      375.0
      0.006896
    
    
      mr
      312.0
      0.005738
    
    
      one
      290.0
      0.005333
    
    
      could
      279.0
      0.005131
    
    
      master
      205.0
      0.003770
    
    
      shall
      191.0
      0.003512
    
    
      come
      190.0
      0.003494

Save the 1000 most common words of Wuthering Heights to CSV.



In [55]:

    
wh_df = make_df(wh_counter.most_common(1000), wh_size)
wh_df.to_csv("WH_1000.csv")

Compare texts

Find the most common words across the two documents.



In [56]:

    
all_counter = wh_counter + je_counter
all_df = make_df(wh_counter.most_common(1000), 1)
most_common_words = all_df.index.values

Create a data frame with the word frequency differences.



In [57]:

    
df_data = []
for word in most_common_words:
    je_c = je_counter.get(word, 0) / je_size
    wh_c = wh_counter.get(word, 0) / wh_size
    d = abs(je_c - wh_c)
    df_data.append([je_c, wh_c, d])
dist_df = pd.DataFrame(data=df_data, index=most_common_words,
                       columns=["Jane Eyre relative frequency", "Wuthering Heights relative frequency",
                                "Relative frequency difference"])
dist_df.index.name = "Most common words"
dist_df.sort_values("Relative frequency difference", ascending=False, inplace=True)

Display the most distinctive words.



In [58]:

    
dist_df.head(10)









    Out[58]:






  
    
      
      Jane Eyre relative frequency
      Wuthering Heights relative frequency
      Relative frequency difference
    
    
      Most common words
      
      
      
    
  
  
    
      heathcliff
      0.000000
      0.008735
      0.008735
    
    
      linton
      0.000000
      0.007429
      0.007429
    
    
      catherine
      0.000011
      0.006970
      0.006958
    
    
      hareton
      0.000000
      0.003292
      0.003292
    
    
      sir
      0.003607
      0.000791
      0.002816
    
    
      master
      0.001133
      0.003770
      0.002636
    
    
      joseph
      0.000000
      0.002575
      0.002575
    
    
      earnshaw
      0.000000
      0.002372
      0.002372
    
    
      cathy
      0.000000
      0.002280
      0.002280
    
    
      edgar
      0.000000
      0.002133
      0.002133

Save the full list of distinctive words to a CSV entitled "bronte.csv."



In [59]:

    
dist_df.to_csv("bronte.csv")



In [ ]:

	Absolute frequency	Relative frequency
Most common words
one	593.0	0.006789
said	584.0	0.006686
mr	543.0	0.006217
could	504.0	0.005770
like	397.0	0.004545
rochester	366.0	0.004190
well	348.0	0.003984
jane	342.0	0.003916
little	341.0	0.003904
sir	315.0	0.003607

	Absolute frequency	Relative frequency
Most common words
heathcliff	475.0	0.008735
linton	404.0	0.007429
catherine	379.0	0.006970
said	375.0	0.006896
mr	312.0	0.005738
one	290.0	0.005333
could	279.0	0.005131
master	205.0	0.003770
shall	191.0	0.003512
come	190.0	0.003494

	Jane Eyre relative frequency	Wuthering Heights relative frequency	Relative frequency difference
Most common words
heathcliff	0.000000	0.008735	0.008735
linton	0.000000	0.007429	0.007429
catherine	0.000011	0.006970	0.006958
hareton	0.000000	0.003292	0.003292
sir	0.003607	0.000791	0.002816
master	0.001133	0.003770	0.002636
joseph	0.000000	0.002575	0.002575
earnshaw	0.000000	0.002372	0.002372
cathy	0.000000	0.002280	0.002280
edgar	0.000000	0.002133	0.002133