Today we'll implement the most basic, and the original, topic modeling algorithm, latent Dirichlet allocation (LDA), using Python's scikit-learn. The other major topic modeling package is Gensim.
Another option for topic modeling is the software MALLET. Check out this blog post to learn more about implementing MALLET.
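For comparison, here is a minimal sketch of what the same workflow looks like in Gensim. It assumes a list of pre-tokenized documents called tokenized_docs, which is not part of this lesson's data, and the parameter values are placeholders.
In [ ]:
#A minimal Gensim sketch, for comparison only; not run in this lesson.
#Assumes tokenized_docs is a list of token lists, e.g. [['the', 'secret', 'garden'], ...]
from gensim.corpora import Dictionary
from gensim.models import LdaModel
dictionary = Dictionary(tokenized_docs)  #map each token to an integer id
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]  #bag-of-words vectors
gensim_lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=4, passes=10, random_state=0)
print(gensim_lda.print_topics(num_words=10))  #top words for each topic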
Further resources:
Topic Modeling
LDA
More detailed description of implementing LDA using scikit-learn
First, we read our children's literature corpus, which is stored as a compressed .csv file on our hard drive, into a Pandas dataframe.
In [ ]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
df_lit = pandas.read_csv("../Data/childrens_lit.csv.bz2", sep='\t', index_col=0, encoding = 'utf-8', compression='bz2')
#drop rows where the text is missing.
df_lit = df_lit.dropna(subset=['text'])
#view the dataframe
df_lit
Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn class LatentDirichletAllocation.
See the scikit-learn documentation for more information about this class.
In [ ]:
####Adapted from:
#Author: Olivier Grisel <olivier.grisel@ensta.org>
# Lars Buitinck
# Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
n_samples = 2000
n_topics = 4
n_top_words = 50
##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
In [ ]:
# Vectorize our text using CountVectorizer
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(df_lit.text)
In [ ]:
print("Fitting LDA models with tf features, "
"n_samples=%d and n_topics=%d..."
% (n_samples, n_topics))
#define the lda function, with desired options
#Check the documentation, linked above, to look through the options
#note: recent versions of scikit-learn call this parameter n_components rather than n_topics
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=20,
                                learning_method='online',
                                learning_offset=80.,
                                total_samples=n_samples,
                                random_state=0)
#fit the model
lda.fit(tf)
In [ ]:
#print the top words per topic, using the function defined above.
#Unlike R, which has a built-in function to print top words, we have to write our own for scikit-learn
#I think this demonstrates the different aims of the two packages: R is for social scientists, Python for computer scientists
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out() #get_feature_names() in older versions of scikit-learn
print_top_words(lda, tf_feature_names, n_top_words)
In [ ]:
####Exercise:
###Copy and paste the above code and fit a new model, lda_new, by changing some of the parameters. How does this change the output?
###Suggestions:
## 1. Change the number of topics.
## 2. Do not remove stop words.
## 3. Change other options, either in the vectorize stage or the LDA model
One thing we may want to do with the output is find the most representative texts for each topic. A simple (but not memory-efficient) way to do this is to merge the topic distribution back into the Pandas dataframe.
First get the topic distribution array.
In [ ]:
topic_dist = lda.transform(tf)
topic_dist
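Each row of topic_dist is one document's distribution over the topics, so each row should sum to (approximately) 1. A quick sanity check, just as a sketch:
In [ ]:
#sanity check: one row per document, one column per topic, and each row sums to ~1
print(topic_dist.shape)
print(topic_dist[:5].sum(axis=1))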
Merge back in with the original dataframe.
In [ ]:
#build a dataframe of topic weights, using the same index as df_lit so the join lines up row by row
topic_dist_df = pandas.DataFrame(topic_dist, index=df_lit.index)
df_w_topics = topic_dist_df.join(df_lit)
df_w_topics
Now we can sort the dataframe by the topic of interest and view the top documents for that topic. Below we sort the documents first by Topic 0 (looking at the top words for this topic, I think it's about family, health, and domestic activities), and next by Topic 1 (again looking at the top words, I think this topic is about children playing outside in nature). Perhaps these topics capture a family/nature split?
Look at the titles for the two different topics. Look at the gender of the author. Hypotheses?
In [ ]:
print(df_w_topics[['title', 'author gender', 0]].sort_values(by=[0], ascending=False))
In [ ]:
print(df_w_topics[['title', 'author gender', 1]].sort_values(by=[1], ascending=False))
In [ ]:
#EX: What is the average topic weight by author gender, for each topic?
### Graph these results (one possible approach is sketched below)
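One possible approach, as a sketch; it assumes the 'author gender' column used above and simply averages over whatever values it contains.
In [ ]:
#Sketch of one possible solution: average topic weight by author gender, for each topic
topic_cols = list(range(n_topics))
mean_by_gender = df_w_topics.groupby('author gender')[topic_cols].mean()
print(mean_by_gender)
#graph the results as a grouped bar chart
mean_by_gender.plot(kind='bar')
plt.ylabel('mean topic weight')
plt.show()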
Following DiMaggio et al., we can calculate the total number of words aligned with each topic, and compare by author gender.
In [ ]:
#first create word count column
df_w_topics['word_count'] = df_w_topics['text'].apply(lambda x: len(str(x).split()))
df_w_topics['word_count']
In [ ]:
#multiply topic weight by word count
df_w_topics['0_wc'] = df_w_topics[0] * df_w_topics['word_count']
df_w_topics['0_wc']
In [ ]:
#create a for loop to do this for every topic
In [ ]:
topic_columns = range(0, n_topics)
col_list = []
for num in topic_columns:
    col = "%d_wc" % num
    col_list.append(col)
    #Solution
    df_w_topics[col] = df_w_topics[num] * df_w_topics['word_count']
df_w_topics
In [ ]:
#EX: What is the total number of words aligned with each topic, by author gender?
#EX: What is the proportion of total words aligned with each topic, by author gender?
Question: Why might we prefer one calculation over the other, i.e., the average topic weight per document versus the number of words aligned with each topic?
This brings us to...
In [ ]:
###EX:
# Find the most prevalent topic in the corpus.
# Find the least prevalent topic in the corpus.
# Hint: How do we define prevalence? What are different ways of measuring this,
# and the benefits/drawbacks of each?
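As a hint, here is a sketch of two possible measures, using the columns created above; deciding which one better captures "prevalence" is the point of the exercise.
In [ ]:
#Sketch: two possible measures of topic prevalence
topic_cols = list(range(n_topics))
#1. average topic weight per document
print(df_w_topics[topic_cols].mean().sort_values(ascending=False))
#2. share of all words aligned with each topic
print((df_w_topics[col_list].sum() / df_w_topics['word_count'].sum()).sort_values(ascending=False))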
We can do the same as above, but by year, to graph the prevalence of each topic over time.
In [ ]:
grouped_year = df_w_topics.groupby('year')
fig3 = plt.figure()
chrt = 0
for e in col_list:
    chrt += 1
    ax2 = fig3.add_subplot(2, 3, chrt)
    (grouped_year[e].sum()/grouped_year['word_count'].sum()).plot(kind='line', title=e)
fig3.tight_layout()
plt.show()
I interpret Topic 2 to be about battles in France. What was going on in France between 1800 and 1804 that might make this topic increasingly popular over this period?