Another simple way to find distinctive words in two texts is to compute each word's difference of proportions: its relative frequency in one text minus its relative frequency in the other. In theory, frequent words like 'the' and 'of' should have a small difference. In practice, as we will see, this doesn't happen.
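Before turning to real texts, here is a minimal sketch of the calculation for a single word, using made-up counts:
In [ ]:
#toy difference of proportions for one word; the counts are invented for illustration
count_a, total_a = 500, 10000   #occurrences of 'the' and total words in text A
count_b, total_b = 450, 12000   #occurrences of 'the' and total words in text B
print(count_a/total_a - count_b/total_b)   #0.0125: positive means more characteristic of text A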
To demonstrate this on real texts, we will run a difference of proportions calculation on Pride and Prejudice and A Garland for Girls.
To get the texts in shape for scikit-learn we need to create a list with each novel as an element. We'll use the list's append method to do this.
In [ ]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
In [ ]:
text_list = []
#open and read the novels, save them as variables
austen_string = open('../Data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../Data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()
#append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)
print(text_list[0][:100])
Create a document-term matrix (DTM) from these two novels, force it into a pandas DataFrame, and inspect the output:
In [ ]:
countvec = CountVectorizer()
novels_df = pandas.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df
Notice the number of rows and columns.
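If the dimensions aren't obvious from the truncated display, we can check them directly:
In [ ]:
novels_df.shape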
Question: What does this mean?
Next, we need a total word count for each novel, which we can do by summing across the entire row. Note how the syntax differs from when we summed one column across all rows; the two directions are contrasted after the next cell.
In [ ]:
novels_df['word_count'] = novels_df.sum(axis=1)
novels_df
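As a quick contrast of the two directions (setting aside the word_count column we just added, so it doesn't pollute the sums):
In [ ]:
counts_only = novels_df.drop(columns='word_count')
#axis=0 collapses the rows: one total per word (column)
print(counts_only.sum(axis=0)[:5])
#axis=1 collapses the columns: one total per novel (row)
print(counts_only.sum(axis=1))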
Next we divide each frequency cell by the row's word count, turning raw counts into proportions. This syntax gets a bit tricky, so let's walk through it: .div(novels_df.word_count, axis=0) divides every column by the word count for that row, with axis=0 aligning the divisor along the row index. (The word_count column is divided by itself, so it becomes 1.0.)
In [ ]:
#divide every cell by that row's word count; axis=0 aligns the divisor with the rows
novels_df = novels_df.div(novels_df.word_count, axis=0)
novels_df
Finally, we subtract the Alcott row (row 1) from the Austen row (row 0), and add the output as a third row.
In [ ]:
novels_df.loc[2] = novels_df.loc[0] - novels_df.loc[1]
novels_df
We can sort based on the values of this row. Positive values mark words distinctive of Austen, negative values words distinctive of Alcott.
In [ ]:
novels_df.loc[2].sort_values(ascending=False)
Stop words are still in there. Why?
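One way to probe this is to inspect a single stop word. Assuming 'the' appears in both novels (it will in any English text), we can compare its proportions and their difference:
In [ ]:
#proportion of 'the' in each novel, then the difference between them
print(novels_df.loc[0, 'the'], novels_df.loc[1, 'the'])
print(novels_df.loc[2, 'the'])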
We can, of course, manually remove stop words, and doing so does successfully identify distinctive content words. More simply, we can remove them in the CountVectorizer step by setting the stop_words option.
In [ ]:
#set the stop_words option to 'english'
countvec_sw = CountVectorizer(stop_words="english")
#same as code above
novels_df_sw = pandas.DataFrame(countvec_sw.fit_transform(text_list).toarray(), columns=countvec_sw.get_feature_names_out())
novels_df_sw['word_count'] = novels_df_sw.sum(axis=1)
novels_df_sw = novels_df_sw.div(novels_df_sw.word_count, axis=0)
novels_df_sw.loc[2] = novels_df_sw.loc[0] - novels_df_sw.loc[1]
novels_df_sw.loc[2].sort_values(axis=0, ascending=False)
We can also remove frequent words by setting the max_df option (maximum document frequency) to either an absolute count or a decimal between 0 and 1. An absolute value indicates that if a word occurs in more documents than the stated value, it will not be included in the DTM. A decimal does the same, but as a proportion of documents.
Question: In the case of this corpus, what does setting the max_df value to 1 do? What output do you expect?
In [ ]:
#Change max_df option to 1
countvec_freq = CountVectorizer(max_df=1)
#same as the code above
novels_df_freq = pandas.DataFrame(countvec_freq.fit_transform(text_list).toarray(), columns=countvec_freq.get_feature_names_out())
novels_df_freq['word_count'] = novels_df_freq.sum(axis=1)
novels_df_freq = novels_df_freq.div(novels_df_freq.word_count, axis=0)
novels_df_freq.loc[2] = novels_df_freq.loc[0] - novels_df_freq.loc[1]
novels_df_freq.loc[2].sort_values(axis=0, ascending=False)
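max_df also accepts a decimal. As a sketch only (the thresholds here are illustrative, not tuned), a corpus with many documents might combine a proportional max_df with min_df:
In [ ]:
#hypothetical thresholds for a larger corpus:
#drop words in more than 80% of documents, and words in fewer than 2 documents
countvec_large = CountVectorizer(max_df=0.8, min_df=2)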
Question: What would happen if we set max_df to 2 in this case?
Question: What might we do for the music reviews dataset?