This is the simplest way to measure the prevalence of a theme in a corpus, and it is used for many purposes, including sentiment analysis. It is one of the longest-standing and most ubiquitous methods in automated text analysis, so it's important both to understand the method and to be able to implement it.
The method is simple: it involves grouping words into categories or themes, and then counting the number of words from each theme in your corpus. We will use this method to do sentiment analysis, a popular text analysis task, on our Music Review corpus, using a standard sentiment analysis dictionary.
Jockers, Matt. “A Novel Method for Detecting Plot.”
Enns, Peter, Nathan Kelly, Jana Morgan, and Christopher Witko. 2015. “Money and the Supply of Political Rhetoric: Understanding the Congressional (Non-)Response to Economic Inequality.” Paper presented at the APSA Annual Meetings, San Francisco.
Neal Caren has a tutorial using MPQA that implements the dictionary method in Python, though in a rather different way.
The dictionary method is based on the assumption that themes or categories consist of a group of words, and that texts covering a theme will contain a higher proportion of words from that group than other texts. Dictionary methods are used for many purposes, with sentiment analysis being one of the most common.
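To make the counting-and-normalizing idea concrete, here is a minimal toy sketch (the theme words and text are invented for illustration):
In [ ]:
#a toy illustration of the dictionary method, with a made-up theme and text
theme_words = {'loud', 'guitar', 'riff'}
tokens = "a loud guitar riff opens the loud track".split()
#count tokens that appear in the theme dictionary
theme_count = sum(1 for word in tokens if word in theme_words)
#normalize by document length so longer texts don't automatically score higher
print(theme_count / len(tokens))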
There are two forms of dictionaries: standard or general dictionaries, and custom dictionaries.
There are a number of standard dictionaries that have been created by field experts. The benefit of standardized dictionaries is that they're developed by experts and have been thoroughly validated. Others have likely published using these dictionaries, so reviewers are more likely to accept them as valid. Because of this, they are good options if they fit your research question.
A well-known example is the MPQA sentiment dictionary, which we'll use below; LIWC and the Harvard General Inquirer are other widely used options.
Many research questions or datasets are domain specific, however, and will thus require you to create your own dictionary based on your knowledge of the domain and question. Creating your own dictionary requires a lot of thought and must be validated. These dictionaries are typically created in an iterative fashion and modified as they are validated. See Enns et al. (2015) for an example of how the authors constructed their own dictionary.
Today we will use the free, standard sentiment dictionary from MPQA to measure positive and negative sentiment in the music reviews.
Our first step, as with any technique, is pre-processing, to get the data ready for analysis.
First, read in our Music Reviews corpus as a Pandas dataframe.
In [ ]:
#import the necessary packages
import pandas
import nltk
from nltk import word_tokenize
import string
#read the Music Reviews corpus into a Pandas dataframe
df = pandas.read_csv("../Data/BDHSI2016_music_reviews.csv", encoding='utf-8', sep = '\t')
#view the dataframe
df
The next step is to create a new column in our dataset that contains tokenized words with all the pre-processing steps.
In [ ]:
#first remove digits from the review text, since we only care about words
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
#then create a new column called "body_tokens" and transform to lowercase by applying the string function str.lower()
df['body_tokens'] = df['body'].str.lower()
In [ ]:
#tokenize
df['body_tokens'] = df['body_tokens'].apply(nltk.word_tokenize)
#view output
print(df['body_tokens'])
In [ ]:
punctuations = list(string.punctuation)
#remove punctuation. Let's talk about that lambda x.
#note: this only drops tokens that are a single punctuation character
df['body_tokens'] = df['body_tokens'].apply(lambda x: [word for word in x if word not in punctuations])
#view output
print(df['body_tokens'])
Pre-processing is done. What other pre-processing steps might we use?
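For instance, we might remove stop words or stem the tokens. Here is a sketch of both (it assumes the NLTK stopwords corpus has been downloaded; the column name body_tokens_stemmed is my own):
In [ ]:
#a sketch of two common additional pre-processing steps: stopword removal and stemming
#requires: nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = set(stopwords.words('english'))
stemmer = PorterStemmer()
#note: if we stem the reviews, we would also need to stem the dictionary words later
df['body_tokens_stemmed'] = df['body_tokens'].apply(lambda x: [stemmer.stem(word) for word in x if word not in stop])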
One more step before getting to the dictionary method. We want a total token count for each row, so we can normalize the dictionary counts. To do this we simply create a new column that contains the length of the token list in each row.
In [ ]:
#the token count is simply the length of each token list
df['token_count'] = df['body_tokens'].apply(len)
print(df[['body_tokens','token_count']])
I created two text files: one is a list of positive words from the MPQA dictionary, the other a list of negative words, with one word per line. Our goal is to count the number of positive and negative words in each row of our dataframe, and to add two columns to our dataset with those counts.
First, read in the positive and negative words and create list variables for each.
In [ ]:
#use with blocks so the files are closed after reading
with open("../Data/positive_words.txt", encoding='utf-8') as f:
    pos_sent = f.read()
with open("../Data/negative_words.txt", encoding='utf-8') as f:
    neg_sent = f.read()
#view part of the pos_sent variable, to see how it's formatted.
print(pos_sent[:101])
In [ ]:
#remember the split function? We'll split on the newline character (\n) to create a list
#strip the string first, so a trailing newline doesn't leave an empty string in the list
positive_words = pos_sent.strip().split('\n')
negative_words = neg_sent.strip().split('\n')
#view the first elements in the lists
print(positive_words[:10])
print(negative_words[:10])
In [ ]:
#count number of words in each list
print(len(positive_words))
print(len(negative_words))
Great! You know what to do now.
Exercise: count the number of positive and negative words in each row, and add two new columns to the dataframe with those counts.
In [ ]:
#exercise code here
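If you get stuck, here is one possible approach (a sketch; the column names pos_count, neg_count, pos_prop, and neg_prop are my own choices):
In [ ]:
#one possible solution sketch: count dictionary words per review, then normalize
#sets make the membership test fast
positive_set = set(positive_words)
negative_set = set(negative_words)
df['pos_count'] = df['body_tokens'].apply(lambda x: len([word for word in x if word in positive_set]))
df['neg_count'] = df['body_tokens'].apply(lambda x: len([word for word in x if word in negative_set]))
#normalize by the total token count so long reviews don't dominate
df['pos_prop'] = df['pos_count'] / df['token_count']
df['neg_prop'] = df['neg_count'] / df['token_count']
print(df[['token_count', 'pos_count', 'neg_count', 'pos_prop', 'neg_prop']])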
That's the dictionary method! You can do this with any dictionary you want, whether standard or custom.
We can also do this using the document term matrix. We'll again do this in pandas, to make it conceptually clear. As you get more comfortable with programming you may want to eventually shift over to working with sparse matrix format.
In [ ]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
#create our document term matrix as a pandas dataframe
#note: on scikit-learn versions older than 1.0, use get_feature_names() instead of get_feature_names_out()
dtm_df = pandas.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names_out(), index=df.index)
dtm_df
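As an aside, here is a sketch of the sparse route mentioned above, computing the same counts without ever converting to a dense array (the variable names are my own):
In [ ]:
#a sketch of the sparse-matrix route: positive-word counts without .toarray()
import numpy
sparse_dtm = countvec.fit_transform(df.body)  #a scipy sparse matrix
vocab = countvec.get_feature_names_out()
positive_set = set(positive_words)
#column indices of vocabulary words that appear in the positive words list
pos_idx = [i for i, word in enumerate(vocab) if word in positive_set]
#slice those columns and sum across each row (document)
pos_counts = numpy.asarray(sparse_dtm[:, pos_idx].sum(axis=1)).ravel()
print(pos_counts[:10])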
Now we can keep only those columns that occur in our positive words list. To do this, we'll first save a list of the column names as a variable, and then keep only the elements of that list that occur in our positive words list. We'll then create a new dataframe keeping only those selected columns.
In [ ]:
#create a columns variable that is a list of all column names
columns = list(dtm_df)
columns
In [ ]:
#create a new variable that contains only column names that are in our positive words list
pos_columns = [word for word in columns if word in positive_words]
pos_columns
In [ ]:
#create a dtm from our dtm_df that keeps only positive sentiment columns
#.copy() avoids a pandas SettingWithCopyWarning when we add a column below
dtm_pos = dtm_df[pos_columns].copy()
dtm_pos
In [ ]:
#count the number of positive words for each document
dtm_pos['pos_count'] = dtm_pos.sum(axis=1)
#uncomment the next line to drop pos_count before re-running this cell
#dtm_pos.drop('pos_count', axis=1, inplace=True)
dtm_pos['pos_count']
EX: Do the same for negative words.
EX: Calculate the proportion of negative and positive words for each document.
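If you want to check your work, here is one possible approach to both exercises (a sketch; neg_count and the total_tokens variable are my own names):
In [ ]:
#one possible solution sketch for the two exercises
#repeat the column-filtering step for negative words
neg_columns = [word for word in columns if word in negative_words]
dtm_neg = dtm_df[neg_columns].copy()
dtm_neg['neg_count'] = dtm_neg.sum(axis=1)
#proportions: normalize each count by the total number of tokens in the document
total_tokens = dtm_df.sum(axis=1)
print(dtm_pos['pos_count'] / total_tokens)
print(dtm_neg['neg_count'] / total_tokens)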