The Document Term Matrix and Discriminating Words

The Document Term Matrix (DTM) is the bread and butter of most computational text analysis techniques, both simple and sophisticated. In this lesson we will use Python's scikit-learn package to make a document term matrix from the .csv Music Reviews dataset. We will then use the DTM and a word weighting technique called tf-idf (term frequency-inverse document frequency) to identify important and discriminating words within this dataset (utilizing the Pandas package). The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums? Theoretical exercise: what can we learn from these words?

Note: Python's scikit-learn package is an enormous package with a lot of functionality. Knowing this package will enable you to do some very sophisticated analyses, including almost all machine learning techniques. (It looks great on your CV too!). We'll get back to this package later in the workshop.

Learning Goals

  • Understand the DTM and why it's important to text analysis
  • Learn how to create a DTM from a .csv file
  • Learn basic functionality of Python's package scikit-learn (we'll return to scikit-learn in lesson 06)
  • Understand tf-idf scores, and word scores in general
  • Learn a simple way to identify distinctive words
  • In the process, gain more familiarity and comfort with the Pandas package and manipulating data

Outline

  • The Pandas Dataframe: Music Reviews
  • Explore the Data using Pandas
    • Basic descriptive statistics
    • The Groupby function
  • Creating the DTM: scikit-learn
    • CountVectorizer function
  • Tf-idf scores
    • TfidfVectorizer
  • Identifying Distinctive Words
    • Identify distinctive reviews by genre

Key Jargon

  • Document Term Matrix:
    • a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
  • TF-IDF Scores:
    • short for term frequency–inverse document frequency; a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

Further Resources

This blog post goes through finding distinctive words using Python in more detail

Paper: Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict, Burt Monroe, Michael Colaresi, Kevin Quinn

0. The Pandas Dataframe: Music Reviews

First, we read our music reviews corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe.


In [ ]:
import pandas

#create a dataframe called "df"
df = pandas.read_csv("BDHSI2016_music_reviews.csv", sep = '\t')

##I'm going to do a pre-processing step to remove digits in the text, for analytical purposes.
##If you don't understand this code right now it's ok. But challenge yourself to make sense of it!
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

#view the dataframe
#notice the metadata. The column "body" contains our text of interest.
df

In [ ]:
## Review Ex: Think back to yesterday's tutorial on Pandas.
###Use the dataframe slicing methods to print the full text of the first review.

1. Explore the Data using Pandas

Let's first look at some descriptive statistics about this dataset, to get a feel for what's in it. We'll do this using the Pandas package.

Note: this is always good practice. It serves two purposes. It checks to make sure your data is correct, and there's no major errors. It also keeps you in touch with your data, which will help with interpretation. <3 your data!

There are a number of built-in functions in the Pandas package that make summarizing, manipulating, and visualizing your data easy. That's the point of Pandas!

Summarizing your Data

We'll start with summarizing our data. The most basic way to summarize your data is using the describe function. The syntax is simply DataFrame.describe().


In [ ]:
df.describe()

You can see that this provides summary statistics for numerical columns. In our case, our only numerical column is score. We can also access each of these summary statistics separately, using Pandas' built-in functions. For example, we can get the max score using the syntax DataFrame['column name'].max().


In [ ]:
df['score'].max()

In [ ]:
##EX: Print the mean and the standard deviation for the score column.

What about the other columns, the ones that contain strings rather than numbers? One way we can summarize these columns is by counting the unique string values in each column. We can do this by calling the value_counts() function on a column.

Let's see what the different genres in our dataframe are:


In [ ]:
df['genre'].value_counts()

In [ ]:
##EX: Print the most frequent reviewers and artists in the dataframe.

Groupby

One of the most frequent things we might want to do with a dataframe is to leverage the different groups in our data and obtain summary statistics for them. For example, if we had a dataframe with individuals and their wages, we might want to compare the average wage for men versus women.
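To make this concrete, here's a minimal sketch of that wage comparison on a tiny made-up dataframe (the data and column names are hypothetical), using the groupby function we'll introduce next:


In [ ]:
import pandas

#a tiny, hypothetical dataframe of individuals, their gender, and their wage
toy = pandas.DataFrame({'gender': ['male', 'female', 'female', 'male'],
                        'wage': [20, 25, 30, 22]})

#group the rows by gender, then take the mean wage within each group
toy.groupby('gender')['wage'].mean()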

Pandas has made this easy with the DataFrame.groupby() function. You specify the column you want to group your data into. As an example, let's group our data into the different genres, and calculate the average score by genre. We'll do this in a couple of steps.

First, create a groupby object by specifying the name of the column in the parentheses.


In [ ]:
#create a groupby dataframe grouped by genre
df_genres = df.groupby("genre")

#What kind of object is df_genres? Let's find out.
df_genres

We now have a pandas groupby object. We can perform most of the built-in Pandas functions on this object, but we'll see slightly different output than what we saw with a dataframe object.

Let's find the average score by genre by calling the mean() function on our groupby object. We'll discuss the different options we can use.


In [ ]:
#calculate the mean score by genre, print out the results
df_genres['score'].mean().sort_values(ascending=False)

How is this output different from the previous time we used the mean() function?


In [ ]:
##EX: Print the maximum score for each genre.

In [ ]:
##Bonus EX: Find the artist with the highest average score. Find the artist with the lowest average score.

2. Creating the DTM: scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column.

Our first step is to turn the text into a document term matrix (DTM). A DTM is a different way of representing text, called vector representation. The goal is to turn the text into numbers. To do so, we transform each document into a vector, where each number in the vector represents the count of a particular word. To get a feel for this, we'll jump right into an example.

We've learned how to count each word using Python's NLTK. We could, then, construct a DTM manually. Luckily, however, Python has a great package, scikit-learn, with a built-in function to do this: CountVectorizer().
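Here's a minimal hand-rolled sketch of a DTM for a toy two-document corpus (the documents are made up), just to see the structure before we let scikit-learn do the work:


In [ ]:
from collections import Counter

#a toy corpus of two very short "documents"
toy_docs = ["the cat sat on the mat", "the dog sat on the log"]

#the vocabulary: every unique word in the corpus, in a fixed order
vocab = sorted(set(word for doc in toy_docs for word in doc.split()))

#each row is a document, each column a word, each cell the count of that word
toy_dtm = [[Counter(doc.split())[word] for word in vocab] for doc in toy_docs]

print(vocab)
print(toy_dtm)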

Let's first look at the documentation for CountVectorizer.

We'll implement this by creating a CountVectorizer() object.


In [ ]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

#fit and transform our text into a DTM. Ask me about what this code does...
sklearn_dtm = countvec.fit_transform(df.body)
print(sklearn_dtm)

This format is called Compressed Sparse Row format. How do we know what each number indicates? We can access the words themselves through the CountVectorizer method get_feature_names().


In [ ]:
print(countvec.get_feature_names()[:10])

In [ ]:
##EX: What word is indicated by the first row of the DTM printed above?
###Hint: Think back to the tutorial on lists, and how to slice a list.

It saves a lot of memory to store the DTM in this format, but it is difficult for a human to read. To illustrate the techniques in this lesson, we will first convert this matrix back into a Pandas dataframe, a format we're more familiar with. For larger datasets, you will have to use the compressed sparse format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!


In [ ]:
#we do the same as we did above, but convert it into a Pandas dataframe
#Don't worry about understanding every line of this code
dtm_df = pandas.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names(), index = df.index)

#view the dtm dataframe
dtm_df

3. What can we do with a DTM?

We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took in lesson 1, where we found the most frequent words using NLTK).


In [ ]:
dtm_df.sum().sort_values(ascending=False)

In [ ]:
##Ex: print the average number of times each word is used in a review
#Print this out sorted from highest to lowest.

We'll see more things we can do with a DTM in the days to come. Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean geometry). But what do we lose when we represent text in this format?
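As one small taste of that matrix-and-vector side (a sketch, assuming the dtm_df we built above is still in memory), we can treat each review as a vector and ask how similar two reviews are using cosine similarity:


In [ ]:
from sklearn.metrics.pairwise import cosine_similarity

#compare the word-count vectors of the first two reviews
#1 means identical word distributions, 0 means no words in common
cosine_similarity(dtm_df.iloc[[0]], dtm_df.iloc[[1]])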

Today, we will use variations on the DTM to find distinctive words in this dataset.

4. Tf-idf scores

How to find distinctive words in a corpus is a long-standing question in text analysis. We saw two approaches to this in lesson 1 (removing stop words and identifying nouns, verbs, and adjectives). Today, we'll learn one more approach: word scores. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent but are also used in every single document will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond raw frequency counts) is tf-idf scores. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.

More precisely, the inverse document frequency is calculated as such:

number_of_documents / number_documents_with_term

so:

tf_idf_word1 = word1_frequency_in_document1 * (number_of_documents / number_of_documents_with_word1)

We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use that, but here's a challenge for you: use Pandas to calculate it manually. (Note that scikit-learn's TfidfVectorizer uses a smoothed, logged version of the idf term and normalizes each document vector by default, so its numbers will differ from the simple formula above, though the intuition is the same.)
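As a quick illustration of the formula (on a made-up toy corpus, not our reviews, so the challenge is still yours), here is a sketch that computes a raw, unsmoothed tf-idf score by hand:


In [ ]:
#a toy corpus of three documents
toy_docs = ["the cat sat", "the cat slept", "the dog barked"]

number_of_documents = len(toy_docs)

#term frequency of 'cat' in the first document
cat_frequency_document1 = toy_docs[0].split().count('cat')

#number of documents containing 'cat'
number_documents_with_cat = sum(1 for doc in toy_docs if 'cat' in doc.split())

#raw tf-idf following the simple formula above (scikit-learn's version is smoothed and logged)
tf_idf_cat_document1 = cat_frequency_document1 * (number_of_documents / number_documents_with_cat)
print(tf_idf_cat_document1)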

Back to scikit-learn: we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.


In [ ]:
#import the function
from sklearn.feature_extraction.text import TfidfVectorizer

#define our tfidfvec object
tfidfvec = TfidfVectorizer()
#create the dtm, but with cells weighted by the tf-idf score.
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.body).toarray(), columns=tfidfvec.get_feature_names(), index = df.index)

#view results
dtm_tfidf_df

It's still mostly zeros. Let's look at the 20 words with the highest tf-idf weights.


In [ ]:
dtm_tfidf_df.max().sort_values(ascending=False)[0:20]

Ok! We have successfully identified content words, without removing stop words. What else do you notice about this list?

5. Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we add the genre of the document into our dtm weighted by tf-idf scores.

Why can we do this? Remember that Pandas dataframes are mutable, and that simply assigning a dataframe to a new name does not copy it. What does this mean, and why do we call .copy() below before adding the new column?


In [ ]:
#create dataset with document index and genre

#first make an explicit copy of our dtm_tfidf_df, so the original is left unchanged
dtm_tfidf_df_genre = dtm_tfidf_df.copy()

#add a GENRE column to it. Why am I making the name of the column GENRE and not genre?
dtm_tfidf_df_genre['GENRE'] = df['genre']
dtm_tfidf_df_genre

Now let's compare the words with the highest tf-idf weight for each genre.

First, create a groupby object, grouped by GENRE, and then create a dataframe out of that object by taking the max of each column. Intuitively, the words with the highest values in each genre are most distinctive to that genre.


In [ ]:
groupby_dtm_genre = dtm_tfidf_df_genre.groupby('GENRE').max()
groupby_dtm_genre

In [ ]:
#sort the values in the Indie column.
groupby_dtm_genre.loc['Indie'].sort_values(ascending=False)

In [ ]:
##EX: do the same for Rap and Jazz genres. Compare the most distinctive words. What do you notice?

There we go! A method of identifying distinctive words.

Questions:

  • How would you describe in words what these lists are telling us about our data?
  • What theory of language does this measure rely on?
  • What are the drawbacks of the method for finding distinctive words?
  • What are other methods we could use to find distinctive words?