Introduction to Topic Modeling

Today we'll implement LDA, the most basic and most widely used topic modeling algorithm, using Python's scikit-learn. The other major Python topic modeling package is Gensim.

Learning Goals

  • Implement a basic topic modeling algorithm and learn how to tweak it
  • Learn how to use different methods to calculate topic prevalence
  • Learn how to create some simple graphs with this output
  • Think through how and why you might use topic modeling in a text analysis project

Outline

  0. [The Pandas Dataframe: Children's Literature](#df)
  1. [Fit an LDA Topic Model using scikit-learn](#fit)
  2. [Document by Topic Distribution](#dtd)
  3. [Words Aligned with each Topic](#words)
  4. [Topic Prevalence](#prev)
  5. [Topics Over Time](#time)

Key Terms

  • Topic Modeling:

    • A statistical model to uncover abstract topics within a collection of texts. It uses the co-occurrence of words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words for each topic, which indicates the subject of that topic, and a distribution of weights across topics for each document.
  • LDA:

    • Latent Dirichlet Allocation. An implementation of topic modeling that assumes a Dirichlet prior. Unlike some other topic modeling algorithms, it does not take document order into account.
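
In scikit-learn terms, these two outputs are the fitted model's components_ array (one row of word weights per topic) and the array returned by transform (one row of topic weights per document). Below is a minimal toy sketch of the idea; the three-sentence mini-corpus is purely hypothetical and just illustrates the shapes of the two outputs.


In [ ]:
#a toy illustration of the two LDA outputs described above
#(hypothetical mini-corpus, not our data)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

toy_docs = ["the cat sat on the mat", "dogs and cats make good pets", "stocks fell on the market today"]
toy_tf = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_tf)

print(toy_lda.components_.shape)        #(2 topics, vocabulary size): weighted words per topic
print(toy_lda.transform(toy_tf).shape)  #(3 documents, 2 topics): topic weights per document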

Further Resources

A more detailed description of implementing LDA using scikit-learn is available in the scikit-learn documentation.

0. The Pandas Dataframe: Children's Literature

First, we read our children's literature corpus, which is stored as a compressed .csv file on our hard drive, into a Pandas dataframe.


In [ ]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
df_lit = pandas.read_csv("../Data/childrens_lit.csv.bz2", sep='\t', index_col=0, encoding='utf-8', compression='bz2')

#drop rows where the text is missing.
df_lit = df_lit.dropna(subset=['text'])

#view the dataframe
df_lit
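
Before modeling, it's worth confirming that the columns we'll rely on later in the notebook are present (the column names here match the cells below).


In [ ]:
#confirm the columns we'll use later: 'text', 'title', 'author gender', and 'year'
print(df_lit.columns.tolist())
df_lit[['title', 'author gender', 'year']].head()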

1. Fit a Topic Model using LDA

Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn class LatentDirichletAllocation.

See here for more information about this class.


In [ ]:
####Adapted From: 
#Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_samples = 2000
n_topics = 4
n_top_words = 50

##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [ ]:
# Vectorize our text using CountVectorizer
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english'
                                )

tf = tf_vectorizer.fit_transform(df_lit.text)
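
The result is a document-term matrix: one row per document and one column per word kept by the vectorizer. A quick look at its dimensions:


In [ ]:
#rows are documents, columns are words in the vocabulary
tf.shape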

In [ ]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

#define the lda function, with desired options
#Check the documentation, linked above, to look through the options
#note: in newer versions of scikit-learn this parameter is n_components, not n_topics
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=20,
                                learning_method='online',
                                learning_offset=80.,
                                total_samples=n_samples,
                                random_state=0)
#fit the model
lda.fit(tf)

In [ ]:
#print the top words per topic, using the function defined above.
#Unlike R, which has a built-in function to print top words, we have to write our own for scikit-learn
#I think this demonstrates the different aims of the two packages: R is for social scientists, Python for computer scientists

print("\nTopics in LDA model:")
#note: get_feature_names() was renamed get_feature_names_out() in newer scikit-learn
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

In [ ]:
####Exercise:
###Copy and paste the above code and fit a new model, lda_new, by changing some of the parameters. How does this change the output?
###Suggestions:
## 1. Change the number of topics. 
## 2. Do not remove stop words. 
## 3. Change other options, either in the vectorize stage or the LDA model

lda_new = LatentDirichletAllocation(n_components=10, max_iter=20,
                                learning_method='online',
                                learning_offset=80.,
                                total_samples=n_samples,
                                random_state=0)
#fit the model
lda_new.fit(tf)
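
To see how the topics change, we can reuse the printing function from above on the new model.


In [ ]:
#print the top words for the new 10-topic model, using the same helper function
print("\nTopics in lda_new model:")
print_top_words(lda_new, tf_feature_names, n_top_words)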

2. Document by Topic Distribution

One thing we may want to do with the output is find the most representative texts for each topic. A simple way to do this (though not memory efficient) is to merge the topic distribution back into the Pandas dataframe.

First get the topic distribution array.


In [ ]:
topic_dist = lda.transform(tf)
topic_dist
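
Each row of this array is one document's distribution over the topics, so every row should sum to (approximately) one. A quick sanity check:


In [ ]:
#each row is a probability distribution over topics, so rows should sum to ~1
print(topic_dist.shape)
print(topic_dist.sum(axis=1))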

Merge back in with the original dataframe.


In [ ]:
topic_dist_df = pandas.DataFrame(topic_dist)
#reset df_lit's index so its rows line up with the rows of the topic-distribution array
df_w_topics = topic_dist_df.join(df_lit.reset_index(drop=True))
df_w_topics

Now we can sort the dataframe by the topic of interest and view the top documents for that topic. Below we sort the documents first by Topic 0 (looking at the top words for this topic, I think it's about family, health, and domestic activities), and then by Topic 1 (again looking at the top words, I think this topic is about children playing outside in nature). These topics may reflect a family/nature split.

Look at the titles for the two different topics. Look at the gender of the author. Hypotheses?


In [ ]:
print(df_w_topics[['title', 'author gender', 0]].sort_values(by=[0], ascending=False))

In [ ]:
print(df_w_topics[['title', 'author gender', 1]].sort_values(by=[1], ascending=False))

In [ ]:
#EX: What is the average topic weight by author gender, for each topic?
### Graph these results
#Hint: You can use the python 'range' function and a for-loop

#mean(numeric_only=True) averages only the numeric columns, skipping the text fields
grouped_mean = df_w_topics.groupby('author gender').mean(numeric_only=True)
grouped_mean[[0,1,2,3]].plot(kind='bar')
plt.show()

3. Words Aligned with each Topic

Following DiMaggio et al., we can calculate the total number of words aligned with each topic, and compare by author gender.


In [ ]:
#first create word count column

df_w_topics['word_count'] = df_w_topics['text'].apply(lambda x: len(str(x).split()))
df_w_topics['word_count']

In [ ]:
#multiply topic weight by word count

df_w_topics['0_wc'] = df_w_topics[0] * df_w_topics['word_count']
df_w_topics['0_wc']

In [ ]:
#create a for loop to do this for every topic

In [ ]:
topic_columns = range(0, n_topics)
col_list = []
for num in topic_columns:
    col = "%d_wc" % num
    col_list.append(col)
    #Solution
    df_w_topics[col] = df_w_topics[num] * df_w_topics['word_count']
    
df_w_topics
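
As a sanity check, the per-topic word counts for each document should add back up to (approximately) that document's total word count, since the topic weights sum to one.


In [ ]:
#the *_wc columns should sum back to word_count for each document (up to rounding)
(df_w_topics[col_list].sum(axis=1) - df_w_topics['word_count']).abs().max()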

In [ ]:
#EX: What is the total number of words aligned with each topic, by author gender?
    
###Solution
grouped = df_w_topics.groupby("author gender")
#sum only the numeric columns (topic weights and word counts)
grouped.sum(numeric_only=True)

In [ ]:
#EX: What is the proportion of total words aligned with each topic, by author gender?
wc_columns = ['0_wc', '1_wc', '2_wc', '3_wc']
for n in wc_columns:
    print(n)
    print(grouped[n].sum()/grouped['word_count'].sum())

Question: Why might we prefer one calculation over the other: the average topic weight per document, or the number of words aligned with each topic? Note that averaging topic weights treats every document equally, while counting aligned words gives longer documents more influence.

This brings us to...

4. Topic Prevalence


In [ ]:
###EX: 
#       Find the most prevalent topic in the corpus.
#       Find the least prevalent topic in the corpus. 
#       Hint: How do we define prevalence? What are different ways of measuring this,
#              and the benefits/drawbacks of each? 

for e in col_list:
    print(e)
    print(df_w_topics[e].sum()/df_w_topics['word_count'].sum())

In [ ]:
for e in topic_columns:
    print(e)
    print(df_w_topics[e].mean())
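
To compare the two measures directly, we can put them side by side in one dataframe. This is a small sketch using only names defined above; the column labels are my own.


In [ ]:
#side-by-side comparison of the two prevalence measures for each topic
prevalence = pandas.DataFrame({
    'mean_doc_weight': [df_w_topics[t].mean() for t in topic_columns],
    'share_of_words': [df_w_topics["%d_wc" % t].sum() / df_w_topics['word_count'].sum() for t in topic_columns]
})
prevalence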

5. Topics Over Time

We can do the same as above, but by year, to graph the prevalence of each topic over time.


In [ ]:
grouped_year = df_w_topics.groupby('year')
fig3 = plt.figure()
chrt = 0
for e in col_list:
    chrt += 1
    ax2 = fig3.add_subplot(2, 3, chrt)
    #plot this topic's yearly share of words on its own subplot
    (grouped_year[e].sum()/grouped_year['word_count'].sum()).plot(kind='line', title=e, ax=ax2)
    
fig3.tight_layout()
plt.show()
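
Yearly proportions can be noisy, especially in years with few books. One optional refinement is to smooth each line with a rolling mean before plotting; the 5-year window here is an arbitrary choice for illustration.


In [ ]:
#smooth the yearly topic shares with a rolling mean to make trends easier to read
for e in col_list:
    smoothed = (grouped_year[e].sum()/grouped_year['word_count'].sum()).rolling(5).mean()
    smoothed.plot(kind='line', label=e)
plt.legend()
plt.show()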

I interpret Topic 2 to be about battles in France. What was going on in France between 1800 and 1804 that might make this topic increasingly popular over this time period?