The Hierarchical Dirichlet Process (HDP) is typically used for topic modeling when the number of topics is unknown.
Let's explore the topics of the political blog Daily Kos, using data from the UCI Machine Learning Repository.
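Concretely, here is a standard sketch of the HDP-LDA generative story (following Teh et al.; the notation is generic, not specific to this library): a global distribution over topics is drawn from a Dirichlet process, and each document draws its own topic distribution from a second Dirichlet process centered on the global one:

$$G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_d \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_{di} \sim G_d, \qquad w_{di} \sim \mathrm{Multinomial}(\theta_{di})$$

The concentration parameters $\gamma$ and $\alpha$ correspond to the gamma and alpha hyperparameters we'll set below; because $G_0$ has countably infinite support, the number of topics is inferred from the data rather than fixed in advance.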
In [1]:
import itertools
import pyLDAvis
import pandas as pd
import re
import simplejson
import seaborn as sns
from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda import model, runner
from random import shuffle
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
sns.set_context('talk')
%matplotlib inline
First, let's grab the data from UCI:
In [2]:
!curl http://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.kos.txt.gz | gunzip > docword.kos.txt
!head docword.kos.txt
In [3]:
!curl http://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.kos.txt > vocab.kos.txt
!head vocab.kos.txt
The format of the docword.kos.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---
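As a quick sanity check (a throwaway snippet, assuming the download above succeeded), we can read the three header values directly:

with open("docword.kos.txt") as f:
    # Number of docs (D), vocabulary size (W), and count of triples (NNZ).
    D, W, NNZ = (int(f.readline()) for _ in range(3))
print "D:", D, "W:", W, "NNZ:", NNZ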
We'll process the data into a list of lists of words ready to be fed into our algorithm:
In [4]:
def parse_bag_of_words_file(docword, vocab):
    # Load the vocabulary; wordIDs in the docword file are 1-based.
    with open(vocab, "r") as f:
        kos_vocab = [word.strip() for word in f.readlines()]
    id_to_word = {i: word for i, word in enumerate(kos_vocab)}
    # Skip the three header lines (D, W, NNZ) and parse the triples.
    with open(docword, "r") as f:
        raw = [map(int, line.strip().split()) for line in f.readlines()][3:]
    # Group the triples by docID and repeat each word `count` times.
    docs = []
    for _, grp in itertools.groupby(raw, lambda x: x[0]):
        doc = []
        for _, word_id, word_cnt in grp:
            doc += word_cnt * [id_to_word[word_id - 1]]
        docs.append(doc)
    return docs, id_to_word
In [6]:
docs, id_to_word = parse_bag_of_words_file("docword.kos.txt", "vocab.kos.txt")
vocab_size = len(set(word for doc in docs for word in doc))
We must define our model before we initialize it. In this case, we need the number of documents and the size of the vocabulary.
From there, we can initialize our model and set the hyperparameters.
In [7]:
defn = model_definition(len(docs), vocab_size)
prng = rng()
kos_state = initialize(defn, docs, prng,
                       vocab_hp=1,
                       dish_hps={"alpha": 1, "gamma": 1})
r = runner.runner(defn, docs, kos_state)
print "number of docs:", defn.n, "vocabulary size:", defn.v
Given the size of the dataset, it'll take some time to run.
We'll run our model for 500 iterations, printing the perplexity and number of topics every 50 iterations.
In [8]:
%%time
step_size = 50
steps = 500 / step_size
print "randomly initialized model:", "perplexity:", kos_state.perplexity(), "num topics:", kos_state.ntopics()
for s in range(steps):
    r.run(prng, step_size)
    print "iteration:", (s + 1) * step_size, "perplexity:", kos_state.perplexity(), "num topics:", kos_state.ntopics()
pyLDAvis is a Python implementation of the LDAvis tool created by Carson Sievert and Kenny Shirley.
LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.
In [9]:
prepared = pyLDAvis.prepare(**kos_state.pyldavis_data())
pyLDAvis.display(prepared)
Out[9]:
Other Functionality
Model Serialization
LDA state objects are fully serializable with pickle and cPickle.
In [10]:
import cPickle as pickle
with open('kos_state.pkl', 'wb') as f:
    pickle.dump(kos_state, f)
with open('kos_state.pkl', 'rb') as f:
    new_state = pickle.load(f)
In [11]:
kos_state.assignments() == new_state.assignments()
Out[11]:
In [12]:
kos_state.dish_assignments() == new_state.dish_assignments()
Out[12]:
In [13]:
kos_state.table_assignments() == new_state.table_assignments()
Out[13]:
In [14]:
kos_state = new_state
Term Relevance
We can generate term relevances (as defined by Sievert and Shirley 2014) for each topic.
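Following their paper, the relevance of a word $w$ to topic $k$ blends the word's probability within the topic, $\phi_{kw}$, with its lift over the corpus-wide frequency $p_w$, via a weight $\lambda \in [0, 1]$ (the particular $\lambda$ used by the library isn't shown here):

$$r(w, k \mid \lambda) = \lambda \log \phi_{kw} + (1 - \lambda) \log \frac{\phi_{kw}}{p_w}$$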
In [17]:
relevance = new_state.term_relevance_by_topic()
Here are the ten most relevant words for each topic:
In [19]:
for topic in relevance:
    words = [word for word, _ in topic[:10]]
    print ' '.join(words)
Topic Prediction
We can also predict how the topics will be distributed within an arbitrary document.
Let's create a document from the 100 most relevant words in the 7th topic.
In [31]:
doc7 = [word for word, _ in relevance[6][:100]]
shuffle(doc7)
print ' '.join(doc7)
In [32]:
predictions = pd.DataFrame()
predictions['From Topic 7'] = pd.Series(kos_state.predict(doc7, prng)[0])
The prediction is that this document was mostly generated by topic 7.
Similarly, if we create a document from the most relevant words in the 1st and 7th topics, our prediction is that the document was generated mostly by those two topics.
In [40]:
doc17 = [word for word, _ in relevance[0][:100]] + [word for word, _ in relevance[6][:100]]
shuffle(doc17)
predictions['From Topics 1 and 7'] = pd.Series(kos_state.predict(doc17, prng)[0])
predictions.plot(kind='bar')
plt.title('Predicted topic distribution')
plt.xticks(predictions.index, ['%d' % (d + 1) for d in xrange(len(predictions.index))])
plt.xlabel('Topic Number')
plt.ylabel('Probability of Each Topic')
Out[40]:
Topic and Term Distributions
Of course, we can also get the topic distribution for each document (commonly called $\Theta$).
In [41]:
theta = pd.Series(kos_state.topic_distribution_by_document()[0])
theta.plot(kind='bar').set_title('Topic distribution for first document')
plt.xticks(theta.index, ['%d' % (d + 1) for d in xrange(len(theta.index))])
plt.xlabel('Topic Number')
plt.ylabel('Probability of Each Topic')
Out[41]:
We can also get the raw word distribution for each topic (commonly called $\Phi$). This is related to the word relevance. Here are the most common words in one of the topics.
In [44]:
pd.Series(kos_state.word_distribution_by_topic()[3]).sort(inplace=False).tail(10).plot(kind='barh')
plt.title('Top 10 Words in Topic 4')
plt.xlabel('Probability')
Out[44]:
To use our HDP implementation, install our library with conda:
$ conda install microscopes-lda
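For reference, here is a condensed sketch of the workflow from this notebook (the same calls used above; the tiny two-document corpus and the hyperparameter values are just placeholders to make the snippet self-contained):

from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda import runner

# Each document is a list of word strings.
docs = [["economy", "jobs", "tax"], ["election", "senate", "poll"]]
vocab_size = len(set(word for doc in docs for word in doc))

defn = model_definition(len(docs), vocab_size)   # number of docs, vocabulary size
prng = rng()
state = initialize(defn, docs, prng,
                   vocab_hp=1,
                   dish_hps={"alpha": 1, "gamma": 1})
r = runner.runner(defn, docs, state)
r.run(prng, 500)                                 # run the sampler for 500 iterations

print "num topics:", state.ntopics()
print "perplexity:", state.perplexity()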