We have been dealing with texts as strings, or as lists of strings. Another way to represent text which opens up a variety of other possibilities for analysis is the Document Term Matrix (DTM).
The best Python library for this, and for the subsequent analyses we can perform on a DTM, is scikit-learn. It's a powerful library, and one you will continually return to as you advance in text analysis (and it looks great on your CV!). At its core, this library allows us to implement a variety of machine learning algorithms on our text.
Because scikit-learn is such a large and powerful library, the goal today is not to become experts, but instead to learn the basic functions in the library and gain an intuition about how you might use it to do text analysis. To give an overview, scikit-learn lets you, among other things, build document term matrices, weight words with tf-idf, and classify, cluster, and reduce the dimensionality of your documents.
Today, we'll start with the Document Term Matrix (DTM). The DTM is the bread and butter of computational text analysis, underlying both simple and more sophisticated methods. In short, the DTM represents each text as a vector of word counts, which allows us to do matrix manipulation on the corpus. We'll see further uses of the DTM in future tutorials.
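To build intuition before we work with real data, here is a toy sketch of what a DTM looks like: rows are documents, columns are words, and each cell counts how often a word appears in a document. (The two mini-documents and their counts below are made up purely for illustration.)
In [ ]:
#a toy document term matrix: rows are documents, columns are words, cells are counts
import pandas
toy_dtm = pandas.DataFrame(
    [[1, 1, 1, 0], [0, 1, 0, 1]],
    columns=['cat', 'sat', 'mat', 'dog'],
    index=['doc1', 'doc2'])
print(toy_dtm)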
In this lesson we will use Python's scikit-learn package to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will visualize the DTM in a pandas dataframe. We will then use the DTM and a word weighting technique called tf-idf (term frequency-inverse document frequency) to identify important and discriminating words within this dataset. The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums? Finally, we'll use the DTM to implement a difference of proportions calculation on two novels in our data folder.
This blog post goes through finding distinctive words using Python in more detail.
Paper: Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict, Burt Monroe, Michael Colaresi, Kevin Quinn
First, we read a music reviews corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe. These data were collected from MetaCritic.com, and include all user reviews of albums from the start of the website through 2014.
In [ ]:
import pandas
#create a dataframe called "df"
df = pandas.read_csv("../Data/BDHSI2016_music_reviews.csv", sep = '\t', encoding = 'utf-8')
#view the dataframe
#The column "body" contains our text of interest.
df
In [ ]:
#print the first review from the column 'body'
df.loc[0,'body']
You folks are experts at this now. Write Python code using pandas to explore the data: for example, how many reviews are there, what columns does the dataframe contain, and how many reviews fall into each genre?
In [ ]:
#Write your code here
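One possible exploration, as a sketch (your own exploration may look different):
In [ ]:
#one possible exploration (a sketch)
#how many reviews and columns do we have?
print(df.shape)
#what columns are in the dataframe?
print(df.columns)
#how many reviews are there per genre?
print(df['genre'].value_counts())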
Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column. First, a preprocessing step to remove numbers. To do this we will use a lambda function.
In [ ]:
#remove all digits from each review, using a lambda function
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
Our next step is to turn the text into a document term matrix using the scikit-learn function CountVectorizer. There are two ways to represent the result. The first is a sparse matrix type, which can be used within scikit-learn for further analyses. We produce it by "fitting" the text to our CountVectorizer object, which learns the full vocabulary of our corpus, and then "transforming" the text by counting the number of times each word occurs in each document. We combine these two steps by calling the fit_transform() function.
In [ ]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#Create our CountVectorizer object
countvec = CountVectorizer()
sklearn_dtm = countvec.fit_transform(df.body)
print(sklearn_dtm)
How do we know what word corresponds to each number? We can access the words themselves through the CountVectorizer function get_feature_names().
In [ ]:
print(countvec.get_feature_names()[:10])
This format is called compressed sparse row (CSR) format. It saves a lot of memory to store the dtm in this format, but it is difficult for a human to read. To illustrate the techniques in this lesson we will first convert this matrix back to a Pandas dataframe, a format we're more familiar with.
Note: This is a case of do as I say, not as I do. As we continue we will rarely transform a DTM into a Pandas dataframe, because of memory issues. I'm doing it today so we can understand the intuition behind the DTM, word scores, and distinctive words.
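Before converting, it's worth a quick peek at the sparse matrix itself; a minimal sketch:
In [ ]:
#a quick look at the sparse matrix before converting it
print(type(sklearn_dtm))       #a scipy sparse matrix
print(sklearn_dtm.shape)       #(number of documents, size of vocabulary)
print(sklearn_dtm.nnz)         #number of non-zero cells actually stored in memory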
In [ ]:
#we do the same as we did above, but convert it into a Pandas dataframe. Note this takes quite a bit more memory, so it will not be good for bigger data.
#don't understand this code? we'll go through it, but don't worry about understanding it.
dtm_df = pandas.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names(), index = df.index)
#view the dtm dataframe
dtm_df
We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took using NLTK).
In [ ]:
print(dtm_df.sum().sort_values(ascending=False))
In [ ]:
##Ex: print the average number of times each word is used in a review
#Print this out sorted from highest to lowest.
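One possible approach, as a sketch:
In [ ]:
#one possible approach: the column means give the average count of each word per review
print(dtm_df.mean().sort_values(ascending=False))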
Question: What does this tell us about our data?
What else does the DTM enable? Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean geometry). But what do we lose when we represent text in this format?
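As a small example of the vector-space point above: because each review is now a vector of word counts, we can measure how geometrically similar any two reviews are. A minimal sketch using cosine similarity on the first two reviews (chosen arbitrarily):
In [ ]:
#each review is a vector, so we can compute the cosine similarity between any two reviews
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(sklearn_dtm[0], sklearn_dtm[1]))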
Today, we will use variations on the DTM to find distinctive words in this dataset.
How to find content words in a corpus is a long-standing question in text analysis. We have seen a few ways of doing this: removing stop words, and identifying and counting only nouns, verbs, and adjectives. Today, we'll learn one more simple approach: word scores. The idea behind word scores is to weight words not just by their raw frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent but also used in every single document will not be indicative of the content of any one document. We want instead to identify frequent words that are unevenly distributed across the corpus.
One of the most popular ways to weight words (beyond raw frequency counts) is the tf-idf score. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and': what we have been calling stop words.
More precisely, the inverse document frequency is calculated as:
number_of_documents / number_of_documents_with_term
so:
tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_of_documents_with_word1)
You can, and often should, normalize the numerator by the length of the document:
tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_of_documents_with_word1)
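To make the formulas concrete, here is a minimal sketch of the normalized calculation on a tiny made-up count table (note that scikit-learn's TfidfVectorizer, used below, applies a smoothed, log-scaled variant, so its numbers will differ):
In [ ]:
#a sketch of the normalized tf-idf formula on a tiny made-up count table
import pandas
toy_counts = pandas.DataFrame({'the': [10, 8, 12], 'saxophone': [5, 0, 0]})
tf = toy_counts.div(toy_counts.sum(axis=1), axis=0)   #word frequency / word count of the document
idf = len(toy_counts) / (toy_counts > 0).sum()        #number_of_documents / number_of_documents_with_term
print(tf * idf)   #'saxophone' now outweighs 'the' in the first document, despite being less frequent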
We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use it, but a challenge for you: use Pandas to calculate this manually.
To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.
In [ ]:
#import the function
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer()
#create the dtm, but with cells weighted by the tf-idf score.
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.body).toarray(), columns=tfidfvec.get_feature_names(), index = df.index)
#view results
dtm_tfidf_df
Let's look at the 20 words with highest tf-idf weights.
In [ ]:
print(dtm_tfidf_df.max().sort_values(ascending=False)[:20])
Ok! We have successfully identified content words, without removing stop words and without part-of-speech tagging. What else do you notice about this list?
What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.
First we add the genre of the document into our dtm weighted by tf-idf scores, and then compare genres.
In [ ]:
#Copy our tfidf df to a new df to add genre
dtm_tfidf_df_genre = dtm_tfidf_df
#add a 'GENRE' column to our tfidf df
dtm_tfidf_df_genre['GENRE'] = df['genre']
#Question: Why is 'GENRE' in caps?
dtm_tfidf_df_genre
Now let's compare the words with the highest tf-idf weight for each genre.
Note: there are other ways to do this. Challenge: what is a different approach to identifying rows from a certain genre in our dtm?
In [ ]:
#pull out the reviews for three genres, Rap, Alternative/Indie Rock, and Jazz
dtm_rap = dtm_tfidf_df_genre[dtm_tfidf_df_genre['GENRE']=="Rap"]
dtm_indie = dtm_tfidf_df_genre[dtm_tfidf_df_genre['GENRE']=="Alternative/Indie Rock"]
dtm_jazz = dtm_tfidf_df_genre[dtm_tfidf_df_genre['GENRE']=="Jazz"]
#print the words with the highest tf-idf scores for each genre
print("Rap Words")
print(dtm_rap.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Indie Words")
print(dtm_indie.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Jazz Words")
print(dtm_jazz.max(numeric_only=True).sort_values(ascending=False)[0:20])
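Coming back to the challenge above: one alternative, as a sketch, is to skip the GENRE column entirely and filter the tf-idf dtm with a boolean mask built from df, since the two share the same row index:
In [ ]:
#an alternative: filter the tf-idf dtm directly with a boolean mask from df (they share an index)
print(dtm_tfidf_df[df['genre']=="Rap"].max(numeric_only=True).sort_values(ascending=False)[:10])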
There we go! A method of identifying content words, and distinctive words based on groups of texts. You'll notice there are some proper nouns in there. How might we remove those if we're not interested in them?
In [ ]:
##Write your code here
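One possible approach, as a sketch: use NLTK's part-of-speech tagger to drop tokens tagged as proper nouns (NNP/NNPS) before building the DTM. This assumes NLTK and its tokenizer/tagger resources are installed; the new column name is just for illustration.
In [ ]:
#a sketch: strip proper nouns with NLTK's part-of-speech tagger before vectorizing
import nltk

def remove_proper_nouns(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return ' '.join(word for word, tag in tagged if tag not in ('NNP', 'NNPS'))

#the column name 'body_no_proper' is just for illustration
df['body_no_proper'] = df['body'].apply(remove_proper_nouns)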