As a bonus, we will calculate the chi-squared statistic for all of the words in two novels, Pride and Prejudice and Garland for Girls, and then calculate the non-normalized correlation for two sample words in the corpus. This can be used, for example, to derive dictionaries from labeled text, as is done in
Jacob Jensen, Ethan Kaplan, Suresh Naidu, and Laurence Wilse-Samson (2012). Political Polarization and the Dynamics of Political Language: Evidence from 130 Years of Partisan Speech. Brookings Papers on Economic Activity.
Don't worry if you don't understand all of this. If it helps some of you, great. If the math is a bit advanced, no problem; stick with me as much as you can.
First, I'll create a document term matrix from the two novels.
In [ ]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
text_list = []
#open and read the novels, save them as variables
austen_string = open('../Data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../Data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()
#append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)
countvec = CountVectorizer(stop_words="english")
#create the document term matrix: one row per novel, one column per word
novels_df = pandas.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df
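Note that row 0 of the dataframe holds the counts for Pride and Prejudice and row 1 the counts for Garland for Girls, matching the order in which we appended the novels to text_list. We can check the shape: two rows (one per novel), one column per word.
In [ ]:
#two rows (one per novel), one column per word in the vocabulary
novels_df.shape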
In [ ]:
novels_df[['sister', 'child']]
To calculate the chi-squared statistic for these two words we first need the expected frequency: the count we would see in each novel if the word were used equally in both. To get this, we divide the total frequency across both novels by two.
In [ ]:
expected_sister = novels_df['sister'].sum(axis=0)/2
expected_sister
In [ ]:
expected_child = novels_df['child'].sum(axis=0)/2
expected_child
To calculate the chi-squared statistic we subtract the expected frequency from the observed frequency for each novel, square this value, divide by the expected frequency, and add the two numbers together.
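In symbols, with observed counts $O_1$ and $O_2$ for the two novels and expected count $E$:

$$\chi^2 = \frac{(O_1 - E)^2}{E} + \frac{(O_2 - E)^2}{E}$$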
In [ ]:
chi_sister = ((novels_df.loc[0,'sister'] - expected_sister)**2 / expected_sister) + ((novels_df.loc[1,'sister'] - expected_sister)**2 / expected_sister)
chi_sister
In [ ]:
chi_child = ((novels_df.loc[0,'child'] - expected_child)**2 / expected_child) + ((novels_df.loc[1,'child'] - expected_child)**2 / expected_child)
chi_child
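As an optional sanity check (not part of the original walkthrough), scipy's `chisquare` function computes the same statistic: when no expected frequencies are passed, it assumes the counts are split evenly across categories, which matches our divide-by-two expectation.
In [ ]:
from scipy.stats import chisquare
#with no f_exp argument, chisquare assumes equal expected counts,
#matching the expected frequencies we computed above
print(chisquare(novels_df['sister']))
print(chisquare(novels_df['child']))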
These are large values: the larger the chi-squared statistic, the more the observed counts deviate from what we would expect if the word were used equally in both novels. Let's try a word whose frequency is much closer across the two novels. The result is a much smaller chi-squared statistic.
In [ ]:
novels_df['writing']
In [ ]:
expected_writing = novels_df['writing'].sum(axis=0)/2
chi_writing = ((novels_df.loc[0,'writing'] - expected_writing)**2 / expected_writing) + ((novels_df.loc[1,'writing'] - expected_writing)**2 / expected_writing)
chi_writing
Next, we can find the partisan score for our chosen words. We do this simply: multiply the word frequency in Pride and Prejudice by 1, multiply the word frequency in Garland for Girls by -1, and add the two together. A partisan score above 0 indicates the word is used more often in Austen; a negative score means it is used more often in Alcott.
In [ ]:
sister_corr = (novels_df.loc[0,'sister']*1) + (novels_df.loc[1,'sister']*-1)
sister_corr
In [ ]:
child_corr = (novels_df.loc[0,'child']*1) + (novels_df.loc[1,'child']*-1)
child_corr
In [ ]:
writing_corr = (novels_df.loc[0,'writing']*1) + (novels_df.loc[1,'writing']*-1)
writing_corr
In [ ]:
writes_corr = (novels_df.loc[0,'writes']*1) + (novels_df.loc[1,'writes']*-1)
writes_corr
What does a partisan score of 0 mean?
In [ ]:
novels_df['writes']
Now we can calculate this for each word in our corpus. To do this we have to introduce the for loop. We've seen this before in list comprehensions, but we're now splitting it out into multiple lines. To think about this intuitively, take this example:
For every child that knocks on my door on Halloween, I will give them a piece of candy.
The for loop in Python is intuitively the same. For every element in a list, we want to do something to that element.
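For instance, a minimal loop over a short list of our words just prints each one in turn:
In [ ]:
#a for loop runs its indented body once for each element in the list
for word in ['sister', 'child', 'writing']:
    print(word)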
In this case, we will loop through all columns in our dataframe and calculate the chi-squared statistic. We will then append both the column name (our word) and the chi-squared statistic to a list using .append().
In [ ]:
columns = list(novels_df)
chi_list = []
for c in columns:
    #expected frequency if the word were used equally in both novels
    expected = novels_df[c].sum(axis=0)/2
    #chi-squared: squared deviation from expected in each novel, divided by expected
    chi = ((novels_df.loc[0,c] - expected)**2 / expected) + ((novels_df.loc[1,c] - expected)**2 / expected)
    chi_list.append([c, chi])
In [ ]:
chi_list[:10]
We can now sort this list by the second element in each pair (each pair is technically a list, not a tuple, but no matter) and print the 50 most "partisan" words.
In [ ]:
chi_list.sort(key=lambda x: x[1], reverse=True)
chi_list[:50]
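As an aside, pandas can compute the same statistic for every word at once, without an explicit loop. This vectorized version is just a sketch of an alternative, assuming the dataframe layout above:
In [ ]:
#expected count for every word if usage were equal across the two novels
expected = novels_df.sum(axis=0)/2
#chi-squared for every word; the subtraction broadcasts across both rows
chi_all = ((novels_df - expected)**2 / expected).sum(axis=0)
chi_all.sort_values(ascending=False).head(50)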
In [ ]:
##For exercise code