As a bonus, we will calculate the chi-squared statistic for all of the words in two novels, Pride and Prejudice and Garland for Girls, and then calculate the non-normalized correlation for two sample words in the corpus. This can be used, for example, to derive dictionaries from labeled text, as is done in
Jacob Jensen, Ethan Kaplan, Suresh Naidu, and Laurence Wilse-Samson (2012). Political Polarization and the Dynamics of Political Language: Evidence from 130 Years of Partisan Speech. Brookings Papers on Economic Activity.
Don't worry if you don't understand all of this. If it helps some of you, great. If the math is a bit advanced, no problem; stick with me as much as you can.
First, I'll create a document term matrix from the two novels.
In [ ]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
text_list = []
#open and read the novels, save them as variables
austen_string = open('../Data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../Data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()
#append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)
countvec = CountVectorizer(stop_words="english")
#create the document term matrix: one row per novel, one column per word
novels_df = pandas.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df
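Note that row 0 of the dataframe holds the counts for Pride and Prejudice and row 1 the counts for Garland for Girls, matching the order in which we appended the novels to text_list. We can check the shape: two rows (one per novel), one column per word.
In [ ]:
#two rows (one per novel), one column per word in the vocabulary
novels_df.shape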
In [ ]:
novels_df[['sister', 'child']]
To calculate the chi-squared statistic for these two words we first need the expected frequency: the count we would see in each novel if the word were used equally in both. To get this, we divide the total frequency across both novels by two.
In [ ]:
expected_sister = novels_df['sister'].sum(axis=0)/2
expected_sister
In [ ]:
expected_child = novels_df['child'].sum(axis=0)/2
expected_child
To calculate the chi-squared statistic we subtract the expected frequency from the observed frequency for each novel, square this value, divide by the expected frequency, and add the two numbers together.
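In symbols, with observed counts $O_1$ and $O_2$ for the two novels and expected count $E$:

$$\chi^2 = \frac{(O_1 - E)^2}{E} + \frac{(O_2 - E)^2}{E}$$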
In [ ]:
chi_sister = ((novels_df.loc[0,'sister'] - expected_sister)**2 / expected_sister) + ((novels_df.loc[1,'sister'] - expected_sister)**2 / expected_sister)
chi_sister
In [ ]:
chi_child = ((novels_df.loc[0,'child'] - expected_child)**2 / expected_child) + ((novels_df.loc[1,'child'] - expected_child)**2 / expected_child)
chi_child
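As an optional sanity check (not part of the original walkthrough), scipy's `chisquare` function computes the same statistic: when no expected frequencies are passed, it assumes the counts are split evenly across categories, which matches our divide-by-two expectation.
In [ ]:
from scipy.stats import chisquare
#with no f_exp argument, chisquare assumes equal expected counts,
#matching the expected frequencies we computed above
print(chisquare(novels_df['sister']))
print(chisquare(novels_df['child']))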
These are large values: the larger the chi-squared statistic, the more the observed counts deviate from what we would expect if the word were used equally in both novels. Let's try a word whose frequency is much closer across the two novels. The result is a much smaller chi-squared statistic.
In [ ]:
novels_df['writing']
In [ ]:
expected_writing = novels_df['writing'].sum(axis=0)/2
chi_writing = ((novels_df.loc[0,'writing'] - expected_writing)**2 / expected_writing) + ((novels_df.loc[1,'writing'] - expected_writing)**2 / expected_writing)
chi_writing
Next, we can find the partisan score for our chosen words. We do this simply: multiply the word frequency in Pride and Prejudice by 1, multiply the word frequency in Garland for Girls by -1, and add the two together. A partisan score above 0 indicates the word is used more often in Austen; a negative score means it is used more often in Alcott.
In [ ]:
sister_corr = (novels_df.loc[0,'sister']*1) + (novels_df.loc[1,'sister']*-1)
sister_corr
In [ ]:
child_corr = (novels_df.loc[0,'child']*1) + (novels_df.loc[1,'child']*-1)
child_corr
In [ ]:
writing_corr = (novels_df.loc[0,'writing']*1) + (novels_df.loc[1,'writing']*-1)
writing_corr
In [ ]:
writes_corr = (novels_df.loc[0,'writes']*1) + (novels_df.loc[1,'writes']*-1)
writes_corr
What does a partisan score of 0 mean?
In [ ]:
novels_df['writes']
Now we can calculate this for each word in our corpus. To do this we have to introduce the for loop. We've seen this before in list comprehensions, but we're now splitting it out into multiple lines. To think about this intuitively, take this example:
For every child that knocks on my door on Halloween, I will give them a piece of candy.
The for loop in Python is intuitively the same. For every element in a list, we want to do something to that element.
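For instance, a minimal loop over a short list of our words just prints each one in turn:
In [ ]:
#a for loop runs its indented body once for each element in the list
for word in ['sister', 'child', 'writing']:
    print(word)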
In this case, we will loop through all columns in our dataframe and calculate the chi-squared statistic. We will then append both the column name (our word) and the chi-squared statistic to a list using .append().
In [ ]:
columns = list(novels_df)
chi_list = []
for c in columns:
    #expected frequency if the word were used equally in both novels
    expected = novels_df[c].sum(axis=0)/2
    #chi-squared: squared deviation from expected in each novel, divided by expected
    chi = ((novels_df.loc[0,c] - expected)**2 / expected) + ((novels_df.loc[1,c] - expected)**2 / expected)
    chi_list.append([c, chi])
In [ ]:
chi_list[:10]
We can now sort this list by the second element in each pair (each pair is technically a list, not a tuple, but no matter) and print the 50 most "partisan" words.
In [ ]:
chi_list.sort(key=lambda x: x[1], reverse=True)
chi_list[:50]
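As an aside, pandas can compute the same statistic for every word at once, without an explicit loop. This vectorized version is just a sketch of an alternative, assuming the dataframe layout above:
In [ ]:
#expected count for every word if usage were equal across the two novels
expected = novels_df.sum(axis=0)/2
#chi-squared for every word; the subtraction broadcasts across both rows
chi_all = ((novels_df - expected)**2 / expected).sum(axis=0)
chi_all.sort_values(ascending=False).head(50)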
In [ ]:
##For exercise code