In [ ]:
%%capture
!rm -rf data/*
!unzip data.zip -d data/
!pip install --no-cache-dir pyldavis
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pyLDAvis
import pyLDAvis.sklearn
import pickle
%matplotlib inline
In Lisa Rhody's article, "Topic Modeling and Figurative Language", she uses LDA topic modeling to look at ekphrastic poetry. She argues that ekphrastic poetry is particularly well-suited to an LDA analysis because of the assumption of a previously existing set of topics. She's able to extract a number of topics, each consisting of a set of words and probabilities. While we don't have Rhody's corpus, we can use this technique on any large text corpus. We'll use a corpus of novels curated by Andrew Piper.
We'll look at an English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts are each in a separate plaintext file in our data folder. Metadata is contained in a spreadsheet distributed with the novel files by the txtLAB at McGill.
The metadata provided describes the corpus that exists as .txt files. So let's first read in the metadata:
In [ ]:
metadata_tb = Table.read_table('data/txtlab_Novel150_English.csv')
metadata_tb.show(5)
Before we go anywhere, let's randomly shuffle the rows so that we don't have them ordered by dates or anything else:
In [ ]:
np.random.seed(0)
metadata_tb = Table.from_df(metadata_tb.to_df().sample(frac=1))
metadata_tb.show(5)
We can see the column variables we have in the metadata with the .labels attribute:
In [ ]:
metadata_tb.labels
To clarify:
We see a list of filenames in the table; these map into a folder we have called txtlab_Novel150_English:
In [ ]:
!ls data/txtlab_Novel150_English/
We can then read in the full text for each novel by iterating through the column, reading each file and appending the string to our novel_list:
In [ ]:
# create empty list; each entry will be the cleaned text of one novel (a single string)
novel_list = []
# iterate through filenames in metadata table
for filename in metadata_tb['filename']:
    # read in novel text as single string
    with open('data/txtlab_Novel150_English/' + filename, 'r') as f:
        novel = f.read()
    # clean up (no titles)
    toks = novel.split()  # split to tokens
    toks = [t for t in toks if not t.istitle() and not t.isupper()]  # quick & dirty no titles/proper nouns
    novel = ' '.join(toks)  # join to single string
    # add string
    novel_list.append(novel)
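To see what the quick and dirty filter in the loop above actually removes, here is a minimal sketch on a single invented sentence (the sentence is an assumption for illustration, not text from the corpus). Note that the filter drops sentence-initial words as well as proper nouns, since both are title-cased.

```python
# Toy sentence invented for illustration; not taken from the corpus.
sentence = "Mr. Darcy walked into the room. He said NOTHING to Elizabeth."

tokens = sentence.split()
# Drop title-cased tokens (proper nouns, but also sentence-initial words)
# and tokens written entirely in capitals.
kept = [t for t in tokens if not t.istitle() and not t.isupper()]
cleaned = " ".join(kept)
print(cleaned)
```

The trade-off is a deliberate one: losing a few ordinary words at the start of sentences is a small price for keeping character names out of the topics.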
Let's double check they all came through:
In [ ]:
len(novel_list)
And look at the first 200 characters of the fourth novel:
In [ ]:
metadata_tb['author'][3], metadata_tb['title'][3], novel_list[3][:200]
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
While you may not have seen the importance of max_features, max_df and min_df before, for topic modeling this is extremely important, because otherwise your topics will not be very coherent.
Let's start out with this:
max_features = 5000 (i.e. only include the 5000 most frequent tokens in our dtm)
max_df = .8 (i.e. don't keep any tokens that appear in > 80% of the documents)
min_df = 5 (i.e. only keep the token if it appears in at least 5 documents)
We'll add in stop_words='english' too, which automatically uses sklearn's built-in stopword list to remove those words from our dtm:
In [ ]:
cv = CountVectorizer(max_features=5000, stop_words='english', max_df=0.80, min_df=5)
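The effect of max_df and min_df is easier to see on a tiny corpus. The sketch below uses four invented documents and smaller thresholds than the ones above (both are assumptions chosen so the pruning is visible by eye).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four invented toy documents; "apple" occurs in all of them,
# "banana" and "cherry" in two each, everything else in only one.
toy_docs = [
    "apple banana cherry",
    "apple banana date",
    "apple cherry elder",
    "apple fig grape",
]

# max_df=0.8 drops terms found in more than 80% of documents ("apple").
# min_df=2 drops terms found in fewer than 2 documents ("date", "elder", ...).
toy_cv = CountVectorizer(max_df=0.8, min_df=2)
toy_cv.fit(toy_docs)

kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
```

Only the mid-frequency terms survive, which is the point: words that appear everywhere or almost nowhere tell the model little about how documents differ.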
As with most machine learning approaches, to validate your model you need training and testing partitions. Since we don't have any labels (topic modeling is unsupervised machine learning), we just need to do this for the novel strings:
In [ ]:
train = novel_list[:120]
test = novel_list[120:]
Now we can use our cv to fit_transform our training list of novels (strings!):
In [ ]:
dtm = cv.fit_transform(train)
To get our words back out we'll use the method get_feature_names(). In scikit-learn 1.2 and later this method has been removed in favor of get_feature_names_out():
In [ ]:
dtm_feature_names = cv.get_feature_names()
dtm_feature_names[:10]
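Because get_feature_names() is only available in older scikit-learn releases and get_feature_names_out() only in newer ones, a small version-tolerant lookup can be handy. This is a sketch on an invented two-document corpus; the names toy_cv and toy_names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus, used only to have a fitted vectorizer to query.
toy_cv = CountVectorizer()
toy_cv.fit(["the cat sat", "the dog sat"])

# Prefer the newer method when it exists, fall back to the older one otherwise.
if hasattr(toy_cv, "get_feature_names_out"):
    toy_names = list(toy_cv.get_feature_names_out())
else:
    toy_names = toy_cv.get_feature_names()

print(toy_names)
```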
We can double check that our feature limit was enforced by calling len on the dtm_feature_names:
In [ ]:
len(dtm_feature_names)
We can throw our dtm into a Table like we have before too:
In [ ]:
dtm_tb = Table(dtm_feature_names).with_rows(dtm.toarray())
dtm_tb.show(5)
LDA reflects an intuition that words in a text are not merely chosen at random but are drawn from underlying concepts (the so-called "latent variables"). The goal of LDA is to look across many texts in order to reverse engineer these concepts by finding words that tend to cluster with one another. For this reason, LDA has been referred to as "the mother of all word collocation techniques."
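One way to make this intuition concrete is to run the assumed generative story forwards: pick a mixture of topics for a document, then draw each word by first picking a topic and then a word from that topic. The sketch below does this with two hand-made topics; the vocabulary, the topic probabilities and the Dirichlet parameter are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["ship", "sea", "captain", "love", "heart", "marriage"]
# Two hand-made topics: each row is a probability distribution over vocab.
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "seafaring" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "romance" topic
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    # Draw this document's mixture over topics from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)           # pick a topic
        w = rng.choice(len(vocab), p=topic_word[z])   # pick a word from it
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(12)
print(theta.round(2), " ".join(words))
```

Fitting LDA amounts to running this story in reverse: given only the words, estimate the topic-word rows and each document's mixture.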
sklearn has the LatentDirichletAllocation class:
In [ ]:
from sklearn.decomposition import LatentDirichletAllocation
Let's check the doc string:
In [ ]:
LatentDirichletAllocation?
Importantly, we'll note:
Let's just say we'll look for 10 topics. We'll use a max_iter of 5. Generally, the higher the max_iter value, the more opportunity the model has to converge, but training also takes much longer.
In [ ]:
lda = LatentDirichletAllocation(n_components=10, max_iter=5)
Before we fit the model, we need to remember that many of these probabilistic models use a random number generator to start the algorithm. If we want our results to be reproducible, we need to set the random seed of the math library we use, in this case numpy:
In [ ]:
np.random.seed(0)
Now we just fit the model, as we've done with all sklearn models! This may take a while, a lot is going on:
In [ ]:
lda_model = lda.fit(dtm)
One measure of the model's fit is perplexity, which tells us how well the model predicts unseen data. We need to compute it on our test portion after it's been transformed into a dtm:
In [ ]:
lda_model.perplexity(cv.transform(test))
NOTE: Currently sklearn's perplexity algorithm is broken.
The lower the perplexity, the better the fit of the model. So one way to get the optimal number of topics would be to loop through several numbers of topics and minimize the perplexity value.
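The loop described above can be sketched as follows. This is only an illustration under stated assumptions: the toy corpus is invented, the candidate topic counts (2, 3, 4) and max_iter=10 are arbitrary, and on real data each fit would take far longer.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents standing in for the train/test novels.
toy_train = [
    "ship sea captain storm", "sea ship sailor storm",
    "love heart marriage letter", "heart love letter marriage",
    "captain sailor sea ship", "marriage love heart letter",
]
toy_test = ["ship storm sea", "love letter heart"]

toy_cv = CountVectorizer()
X_train = toy_cv.fit_transform(toy_train)
X_test = toy_cv.transform(toy_test)  # reuse the training vocabulary

perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, max_iter=10, random_state=0)
    model.fit(X_train)
    # Held-out perplexity: lower means the model predicts the test words better.
    perplexities[k] = model.perplexity(X_test)

best_k = min(perplexities, key=perplexities.get)
print(perplexities, best_k)
```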
Unfortunately, it has been shown time and again that minimizing perplexity does not necessarily yield topics that humans would judge to be coherent.
Since traditional metrics for evaluating a model's fit have not proven to conform to human understanding, a new approach was developed by David Mimno in 2011.
This score measures how much, within the words used to describe a topic, a common word is on average a good predictor of a less common word. (More on topic coherency.)
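To make the idea of "a common word predicting a less common word" concrete, here is a minimal sketch of a UMass-style coherence score: for each pair of top words, take the log of (number of documents containing both, plus 1) divided by the number of documents containing the more common word, and sum over pairs. The toy documents and topics are invented, the +1 smoothing follows the usual textbook formulation, and library implementations such as gensim's may differ in details like smoothing and averaging.

```python
import math

# Invented toy corpus: each document is reduced to its set of words.
toy_docs = [
    {"ship", "sea", "captain"},
    {"ship", "sea"},
    {"ship", "love"},
    {"love", "heart"},
]

def doc_freq(*words):
    # Number of documents that contain all of the given words.
    return sum(1 for d in toy_docs if all(w in d for w in words))

def umass_coherence(top_words):
    # top_words is assumed to be ordered from most to least prominent.
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            rarer, commoner = top_words[m], top_words[l]
            score += math.log((doc_freq(rarer, commoner) + 1) / doc_freq(commoner))
    return score

coherent = umass_coherence(["ship", "sea", "captain"])
incoherent = umass_coherence(["ship", "heart", "captain"])
print(coherent, incoherent)
```

Scores are at most slightly above zero and usually negative; the topic whose words actually co-occur in documents gets the higher (less negative) value.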
Here we look for the highest value. This algorithm has only been implemented in the Python gensim library. I ran the following code for you on a remote server because it takes a while!
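import pickle
import multiprocessing

import gensim
from joblib import Parallel, delayed

# NOTE: `corpus` (a gensim bag-of-words corpus) and `dictionary` (a gensim
# Dictionary) are assumed to have been built from the novels beforehand.
try_topic_n = list(range(5, 200, 2))  # candidate numbers of topics

def try_topic_number(i):
    # train an LDA model with i topics
    lda_model = gensim.models.LdaModel(
        corpus,
        num_topics=i,
        id2word=dictionary,
        iterations=1000,
        alpha='auto',
        passes=4)
    # score the model with the UMass coherence measure
    cm = gensim.models.CoherenceModel(
        model=lda_model,
        corpus=corpus,
        dictionary=dictionary,
        coherence='u_mass')
    return cm.get_coherence()

if __name__ == '__main__':
    num_cores = multiprocessing.cpu_count()
    # evaluate every candidate number of topics in parallel
    results = Parallel(n_jobs=num_cores)(
        delayed(try_topic_number)(i) for i in try_topic_n)
    with open('scores.pkl', 'wb') as f:
        pickle.dump(results, f)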
You can see above I've dumped the coherence scores into a binary pickle file. A pickle is simply any Python object that has been saved to a binary file. We can load these in too:
In [ ]:
try_topic_n = list(range(5,200,2))
scores = pickle.load(open('scripts/scores.pkl', 'rb'))
list(zip(try_topic_n, scores))
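If pickles are new to you, the round trip looks like this in miniature. The scores and the file name here are invented, and the file is written to a temporary directory so nothing in the project folder is touched.

```python
import os
import pickle
import tempfile

toy_scores = [-1.9, -1.2, -1.5]  # invented values, not real coherence scores

# Write the object to a binary file...
path = os.path.join(tempfile.mkdtemp(), "toy_scores.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_scores, f)

# ...and read it back into a new Python object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

One caution worth keeping in mind: loading a pickle can execute arbitrary code, so only unpickle files from sources you trust.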
Let's plot these results:
In [ ]:
plt.plot(try_topic_n, [x for x in scores])
plt.xlabel('number of topics')
plt.ylabel('coherence score')
numpy has handy argmax and argmin functions that return the index of the highest or lowest value in an array:
In [ ]:
np.argmax(scores)
Then we can just index our topic numbers to get the corresponding number of topics with the highest coherency:
In [ ]:
try_topic_n[np.argmax(scores)]
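The same two steps on a handful of invented numbers make the indexing explicit: argmax gives a position in the scores list, and that position is then looked up in the parallel list of topic counts.

```python
import numpy as np

# Invented candidates and scores, aligned by position.
toy_topic_counts = [5, 7, 9, 11]
toy_scores = [-1.9, -1.2, -1.5, -1.7]

best_index = int(np.argmax(toy_scores))       # position of the highest score
best_n_topics = toy_topic_counts[best_index]  # topic count at that position
print(best_index, best_n_topics)
```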
I've retrained the model for 13 topics and exported it as below (note that max_iter=1000 takes a long time, so I've pickled the model again):
lda = LatentDirichletAllocation(n_components=13, max_iter=1000)
lda_model = lda.fit(dtm)
pickle.dump((lda, lda_model, dtm, cv), open('13-topics.pkl', 'wb'))
We can load in the pre-trained model from the pickle:
In [ ]:
lda, lda_model, dtm, cv = pickle.load(open('scripts/13-topics.pkl', 'rb'))
Many papers in the social sciences still don't use a quantitative evaluation metric. Many use the library pyLDAvis to simply visualize the topic distributions, looking for appropriately sized topics with little overlap as markers of a well-chosen number of topics:
In [ ]:
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda_model, dtm, cv)
In [ ]:
def display_topics(model, feature_names, num_top_words):
    # for each topic, print its index followed by its highest-weighted words
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]
        print(topic_idx, " ".join(top_words))
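The slice `argsort()[:-num_top_words - 1:-1]` in display_topics is compact but easy to misread. A minimal sketch with four invented word weights shows what it does: argsort orders indices from lowest to highest weight, and the negative-step slice walks back from the end to pick the top n in descending order.

```python
import numpy as np

# Invented feature names and weights for a single toy topic.
toy_features = ["sea", "ship", "love", "heart"]
toy_weights = np.array([0.1, 0.7, 0.2, 0.5])
n = 2

order = toy_weights.argsort()   # indices in ascending weight order: [0, 2, 3, 1]
top_idx = order[:-n - 1:-1]     # last n indices, reversed: [1, 3]
top_words = [toy_features[i] for i in top_idx]
print(top_words)
```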
Now let's print the top 10 words of the 13 topics for the model we trained, using our display_topics function. Have a look through the output and see what topics you can spot:
In [ ]:
display_topics(lda, dtm_feature_names, 10)
We can print which topic each novel is closest to by indexing the topic probabilities and using the argmax function:
In [ ]:
doc_topic = lda.transform(dtm)
for n in range(doc_topic.shape[0]):
    # index of the topic with the highest probability for novel n
    topic_most_pr = doc_topic[n].argmax()
    print(metadata_tb['author'][n], metadata_tb['title'][n])
    print("doc: {} topic: {}\n".format(n, topic_most_pr))
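It can help to check on a toy model what transform returns: one row per document, one column per topic, with each row summing to 1. The sketch below uses an invented four-document corpus and arbitrary settings (2 topics, max_iter=20); taking the argmax along each row is a vectorized alternative to the loop above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus standing in for the novels.
toy_docs = [
    "ship sea captain storm ship sea",
    "sea sailor ship storm captain",
    "love heart marriage letter love",
    "heart marriage love letter heart",
]

X = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
toy_doc_topic = toy_lda.fit_transform(X)  # shape: (n_documents, n_topics)

# Most probable topic for every document at once.
dominant = toy_doc_topic.argmax(axis=1)
print(toy_doc_topic.round(2))
print(dominant)
```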
To get the probabilities for each topic for a given book we can print the whole probability list for a given novel:
In [ ]:
metadata_tb['author'][25], metadata_tb['title'][25], doc_topic[25]
In [ ]:
# YOUR CODE HERE
There are many strategies that can be used to interpret the output of a topic model. In this case, we will look for any correlations between the topic distributions and metadata.
We'll first grab all the topic distributions similar to what we did above. Remember, the order of the novels is still the same!
In [ ]:
list_of_doctopics = [doc_topic[n] for n in range(len(doc_topic))]
list_of_doctopics[0]
We'll make a DataFrame, which is similar to a Table, with the probabilities for the topics (columns) and documents (rows):
In [ ]:
df = pd.DataFrame(list_of_doctopics)
df.head()
We can add these columns to our metadata_tb Table:
In [ ]:
meta = metadata_tb.to_df()
meta[df.columns] = df
meta.head()
The corr() method will give us a correlation matrix:
In [ ]:
meta.corr()
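A tiny invented example shows how to read the matrix: pick the date column and look at how strongly each topic column moves with it. All values and column names below are made up; every column is numeric, which also sidesteps the fact that, depending on the pandas version, corr() may refuse frames containing text columns unless numeric_only=True is passed.

```python
import pandas as pd

# Invented metadata: one topic rises steadily with date, the other falls.
toy_meta = pd.DataFrame({
    "date": [1800, 1850, 1900, 1950],
    "topic_0": [0.1, 0.2, 0.3, 0.4],
    "topic_1": [0.9, 0.8, 0.7, 0.6],
})

toy_corr = toy_meta.corr()                   # pairwise Pearson correlations
date_corr = toy_corr["date"].drop("date")    # each topic's correlation with date
print(date_corr)
```

Values near +1 or -1 mark topics whose weight changes consistently over time; values near 0 mark topics with no linear relationship to date.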
We see some strong correlations between topics and date. Recall the topics:
In [ ]:
display_topics(lda, dtm_feature_names, 10)
In [ ]:
meta.plot.scatter(x='date', y=1)
In [ ]:
meta.plot.scatter(x='date', y=5)
In [ ]:
meta.plot.scatter(x='date', y=6)
In [ ]:
meta.plot.scatter(x='date', y=12)
Why do you think we see this?
We're going to download the 20 Newsgroups, a widely used corpus for demos of general texts:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
Let's read in the training data:
In [ ]:
train_subset = pickle.load(open('scripts/20-news-train.pkl', 'rb'))
Here are the predetermined categories:
In [ ]:
train_subset.target_names
Since we're topic modeling, we don't care about what they've been labeled, but it'll be interesting to see how our topics line up with these!
How many documents are there?
In [ ]:
len(train_subset.data)
Let's get a list of documents as strings just like we did with the novels, and then we'll randomly shuffle them in case they're ordered by category already:
In [ ]:
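# shuffle documents and labels with the same permutation so they stay aligned
perm = np.random.permutation(len(train_subset.data))
documents_train = [train_subset.data[i] for i in perm]
targets_train = np.asarray(train_subset.target)[perm]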
In [ ]:
print(documents_train[0])
Now we'll do the same for the test set:
In [ ]:
test_subset = pickle.load(open('scripts/20-news-test.pkl', 'rb'))
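# shuffle test documents and labels with the same permutation
perm = np.random.permutation(len(test_subset.data))
documents_test = [test_subset.data[i] for i in perm]
targets_test = np.asarray(test_subset.target)[perm]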
print(documents_test[0])
You now have two lists of strings: documents_train and documents_test. Create a dtm and then a topic model for k topics. Just choose one value of k and a very low max_iter value for the training so it doesn't take too long.
See how the topics match up to the annotated categories, and play with different ways of preprocessing the data. Use the pyLDAvis library to evaluate your model.
What did you have to do to get decent results?
In [ ]:
In [ ]:
train_subset.target
In [ ]:
test_subset.target
In [ ]: