Note: This is best viewed on NBViewer. It is part of a series on Dirichlet Processes and Nonparametric Bayes.

Nonparametric Latent Dirichlet Allocation

Analysis of the topics of Econtalk

In 2003, a groundbreaking statistical model called "Latent Dirichlet Allocation" was presented by David Blei, Andrew Ng, and Michael Jordan.

LDA provides a method for summarizing the topics discussed in a document. LDA defines topics to be discrete probability distrbutions over words. For an introduction to LDA, see Edwin Chen's post.

The original LDA model requires the number of topics in the document to be specfied as a known parameter of the model. In 2005, Yee Whye Teh and others published a "nonparametric" version of this model that doesn't require the number of topics to be specified. This model uses a prior distribution over the topics called a hierarchical Dirichlet process. I wrote an introduction to this HDP-LDA model earlier this year.

For the last six months, I have been developing a Python-based Gibbs sampler for the HDP-LDA model. This is part of a larger library of "robust, validated Bayesian nonparametric models for discovering structure in data" known as Data Microscopes.

This notebook demonstrates the functionality of this implementation.

The Data Microscopes library is available on anaconda.org for Linux and OS X. microscopes-lda can be installed with:

$ conda install -c datamicroscopes -c distributions microscopes-lda 

In [1]:
%matplotlib inline
import pyLDAvis
import json
import sys
import cPickle

from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda import utils
from microscopes.lda import model, runner

from numpy import genfromtxt 
from numpy import linalg
from numpy import array

dtm.csv contains a document-term matrix representation of the words used in Econtalk transcripts. The columns of the matrix correspond to the words in vocab.txt. The rows in the matrix correspond to the show urls in urls.txt.

Our LDA implementation takes input data as a list of lists of hashable objects (typically words). We can use a utility function to convert the document-term matrix to the list of tokenized documents.


In [2]:
vocab = genfromtxt('./econtalk-data/vocab.txt', delimiter=",", dtype='str').tolist()
dtm = genfromtxt('./econtalk-data/dtm.csv', delimiter=",", dtype=int)
docs = utils.docs_from_document_term_matrix(dtm, vocab=vocab)
urls = [s.strip() for s in open('./econtalk-data/urls.txt').readlines()]

In [3]:
dtm.shape[1] == len(vocab)


Out[3]:
True

In [4]:
dtm.shape[0] == len(urls)


Out[4]:
True

Here's a utility method to get the title of a webpage that we'll use later.


In [5]:
def get_title(url):
    """Scrape webpage title
    """
    import lxml.html
    t = lxml.html.parse(url)
    return t.find(".//title").text.split("|")[0].strip()

Let's set up our model. First we created a model definition describing the basic structure of our data. Next we initialize an MCMC state object using the model definition, documents, random number generator, and hyper-parameters.


In [8]:
N, V = len(docs), len(vocab)
defn = model_definition(N, V)
prng = rng(12345)
state = initialize(defn, docs, prng,
                        vocab_hp=1,
                        dish_hps={"alpha": .6, "gamma": 2})
r = runner.runner(defn, docs, state, )

When we first create a state object, the words are randomly assigned to topics. Thus, our perplexity (model score) is quite high. After we start to run the MCMC, the score will drop quickly.


In [9]:
print "randomly initialized model:"
print " number of documents", defn.n
print " vocabulary size", defn.v
print " perplexity:", state.perplexity(), "num topics:", state.ntopics()


randomly initialized model:
 number of documents 454
 vocabulary size 16445
 perplexity: 16523.1820356 num topics: 9

Run one iteration of the MCMC to make sure everything is working.


In [17]:
%%time
r.run(prng, 1)


CPU times: user 12.3 s, sys: 128 ms, total: 12.4 s
Wall time: 12.6 s

Now lets run 1000 generations of the MCMC.

Unfortunately, MCMC is slow going.


In [15]:
%%time
r.run(prng, 500)


CPU times: user 2h 9min 8s, sys: 26.8 s, total: 2h 9min 35s
Wall time: 2h 10min 12s

In [16]:
with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:
    cPickle.dump(state, f)

In [17]:
%%time
r.run(prng, 500)


CPU times: user 2h 17min 35s, sys: 30 s, total: 2h 18min 6s
Wall time: 2h 18min 50s

In [18]:
with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:
    cPickle.dump(state, f)

Now that we've run the MCMC, the perplexity has dropped significantly.


In [19]:
print "after 1000 iterations:"
print " perplexity:", state.perplexity(), "num topics:", state.ntopics()


after 1000 iterations:
 perplexity: 2363.65138771 num topics: 11

pyLDAvis projects the topics into two dimensions using techniques described by Carson Sievert.


In [21]:
vis = pyLDAvis.prepare(**state.pyldavis_data())
pyLDAvis.display(vis)


Out[21]:

We can extract the term relevance (shown in the right hand side of the visualization) right from our state object. Here are the 10 most relevant words for each topic:


In [23]:
relevance = state.term_relevance_by_topic()
for i, topic in enumerate(relevance):
    print "topic", i, ":",
    for term, _ in topic[:10]:
        print term, 
    print


topic 0 : banks fed bank money financial monetary debt inflation crisis rates
topic 1 : party republicans constitution vote democrats republican tax election president stalin
topic 2 : fat science diet eat insulin disease immune replication scientific eating
topic 3 : growth trade water cities china city development climate inequality oil
topic 4 : people think don just going like say things lot way
topic 5 : smith hayek moral economics society adam liberty coase theory rules
topic 6 : bitcoin internet software google technology store bitcoins computer machines company
topic 7 : prison health drug care drugs medicaid medical patients patient women
topic 8 : schools teachers school kids teacher education students parents teaching sports
topic 9 : bees honey pollination colony ants bee queen cheung ant colonies
topic 10 : museum museums art gallery galleries monet seating trustees admission director

We could assign titles to each of these topics. For example, Topic 5 appears to be about the foundations of classical liberalism. Topic 6 is obviously Bitcoin and Software. Topic 0 is the financial system and monetary policy. Topic 4 seems to be generic words used in most episodes; unfortunately, the prevalence of "don" is a result of my preprocessing which splits up the contraction "don't".

We can also get the topic distributions for each document.


In [24]:
topic_distributions = state.topic_distribution_by_document()

Topic 5 appears to be about the theory of classical liberalism. Let's find the 20 episodes which have the highest proportion of words from that topic.


In [25]:
austrian_topic = 5
foundations_episodes = sorted([(dist[austrian_topic], url) for url, dist in zip(urls, topic_distributions)], reverse=True)
for url in [url for _, url in foundations_episodes][:20]:
    print get_title(url), url


The Economics of Organ Donations http://www.econtalk.org/archives/2006/06/the_economics_o_4.html
Klein on The Theory of Moral Sentiments, Episode 2--A Discussion of Part I http://www.econtalk.org/archives/2009/04/klein_on_the_th_1.html
Boudreaux on Law and Legislation http://www.econtalk.org/archives/2006/12/boudreaux_on_la.html
Klein on The Theory of Moral Sentiments, Episode 4--A Discussion of Part III http://www.econtalk.org/archives/2009/04/klein_on_the_th_3.html
Klein on The Theory of Moral Sentiments, Episode 5--A Discussion of Parts III  (cont.), IV, and V http://www.econtalk.org/archives/2009/05/klein_on_the_th_4.html
Klein on The Theory of Moral Sentiments, Episode 3--A Discussion of Part II http://www.econtalk.org/archives/2009/04/klein_on_the_th_2.html
Klein on The Theory of Moral Sentiments, Episode 1--An Overview http://www.econtalk.org/archives/2009/04/klein_on_the_th.html
Klein on The Theory of Moral Sentiments, Episode 6--A Discussion of Parts VI and VII, and Summary http://www.econtalk.org/archives/2009/05/klein_on_the_th_5.html
Wolfe on Liberalism http://www.econtalk.org/archives/2009/05/wolfe_on_libera.html
Boettke on Katrina and the Economics of Disaster http://www.econtalk.org/archives/2006/12/boettke_on_katr.html
Richard Thaler on Libertarian Paternalism http://www.econtalk.org/archives/2006/11/richard_thaler_1.html
Marglin on Markets and Community http://www.econtalk.org/archives/2008/03/marglin_on_mark.html
Dan Klein on Coordination and Cooperation http://www.econtalk.org/archives/2008/02/dan_klein_on_co.html
Leeson on Pirates and the Invisible Hook http://www.econtalk.org/archives/2009/05/leeson_on_pirat.html
Vernon Smith on Markets and Experimental Economics http://www.econtalk.org/archives/2007/05/vernon_smith_on.html
Boettke on Elinor Ostrom, Vincent Ostrom, and the Bloomington School http://www.econtalk.org/archives/2009/11/boettke_on_elin.html
Postrel on Style http://www.econtalk.org/archives/2006/11/postrel_on_styl.html
The Economics of Religion http://www.econtalk.org/archives/2006/10/the_economics_o_7.html
Boettke on Mises http://www.econtalk.org/archives/2010/12/boettke_on_mise.html
The Economics of Medical Malpractice http://www.econtalk.org/archives/2006/05/the_economics_o_3.html

We could also find the episodes that have notable discussion of both politics AND the financial system.


In [29]:
topic_a = 0
topic_b = 1
joint_episodes = [url for url, dist in zip(urls, topic_distributions) if dist[0] > 0.18 and dist[1] > 0.18]
for url in joint_episodes:
    print get_title(url), url


Benn Steil on the Battle of Bretton Woods http://www.econtalk.org/archives/2015/02/benn_steil_on_t.html
Epstein on the Rule of Law http://www.econtalk.org/archives/2009/06/epstein_on_the.html
Shlaes on the Great Depression http://www.econtalk.org/archives/2007/06/shlaes_on_the_g.html
Rabushka on the Flat Tax http://www.econtalk.org/archives/2007/04/rabushka_on_the.html
Greg Mankiw on Gasoline Taxes, Keynes and Macroeconomics http://www.econtalk.org/archives/2007/01/greg_mankiw_on.html

We can look at the topic distributions as projections of the documents into a much lower dimension (16). We can try to find shows that are similar by comparing the topic distributions of the documents.


In [30]:
def find_similar(url, max_distance=0.2):
    """Find episodes most similar to input url.
    """
    index = urls.index(url)
    for td, url in zip(topic_distributions, urls):
        if linalg.norm(array(topic_distributions[index]) - array(td)) < max_distance:
            print get_title(url), url

Which Econtalk episodes are most similar, in content, to "Mike Munger on the Division of Labor"?


In [31]:
find_similar('http://www.econtalk.org/archives/2007/04/mike_munger_on.html')


Brink Lindsey on the Age of Abundance http://www.econtalk.org/archives/2009/03/brink_lindsey_o.html
Richard Epstein on Happiness, Inequality, and Envy http://www.econtalk.org/archives/2008/11/richard_epstein.html
Sowell on Economic Facts and Fallacies http://www.econtalk.org/archives/2008/02/sowell_on_econo.html
Yandle on the Tragedy of the Commons and the Implications for Environmental Regulation http://www.econtalk.org/archives/2007/10/yandle_on_the_t.html
McCraw on Schumpeter, Innovation, and Creative Destruction http://www.econtalk.org/archives/2007/10/mccraw_on_schum.html
Boudreaux on Market Failure, Government Failure and the Economics of Antitrust Regulation http://www.econtalk.org/archives/2007/10/boudreaux_on_ma.html
Mike Munger on the Division of Labor http://www.econtalk.org/archives/2007/04/mike_munger_on.html

How about episodes similar to "Kling on Freddie and Fannie and the Recent History of the U.S. Housing Market"?


In [32]:
find_similar('http://www.econtalk.org/archives/2008/09/kling_on_freddi.html')


Irwin on the Great Depression and the Gold Standard http://www.econtalk.org/archives/2010/10/irwin_on_the_gr.html
Rustici on Smoot-Hawley and the Great Depression http://www.econtalk.org/archives/2010/01/rustici_on_smoo.html
Reinhart on Financial Crises http://www.econtalk.org/archives/2009/11/reinhart_on_fin.html
Posner on the Financial Crisis http://www.econtalk.org/archives/2009/11/posner_on_the_f.html
Sumner on Monetary Policy http://www.econtalk.org/archives/2009/11/sumner_on_monet.html
Calomiris on the Financial Crisis http://www.econtalk.org/archives/2009/10/calomiris_on_th.html
Gary Stern on Too Big to Fail http://www.econtalk.org/archives/2009/10/gary_stern_on_t.html
Cohan on the Life and Death of Bear Stearns http://www.econtalk.org/archives/2009/09/cohan_on_the_li.html
John Taylor on the Financial Crisis http://www.econtalk.org/archives/2009/07/john_taylor_on_1.html
Meltzer on Inflation http://www.econtalk.org/archives/2009/02/meltzer_on_infl.html
Boettke on the Austrian Perspective on Business Cycles and Monetary Policy http://www.econtalk.org/archives/2009/01/boettke_on_the.html
Selgin on Free Banking http://www.econtalk.org/archives/2008/11/selgin_on_free.html
Kling on Credit Default Swaps, Counterparty Risk, and the Political Economy of Financial Regulation http://www.econtalk.org/archives/2008/11/kling_on_credit.html
Kling on Freddie and Fannie and the Recent History of the U.S. Housing Market http://www.econtalk.org/archives/2008/09/kling_on_freddi.html
John Taylor on Monetary Policy http://www.econtalk.org/archives/2008/08/john_taylor_on.html
Gene Epstein on Gold, the Fed, and Money http://www.econtalk.org/archives/2008/06/gene_epstein_on.html
Meltzer on the Fed, Money, and Gold http://www.econtalk.org/archives/2008/05/meltzer_on_the.html
Cowen on Monetary Policy http://www.econtalk.org/archives/2008/03/cowen_on_moneta.html

The model also gives us distributions over words for each topic.


In [33]:
word_dists = state.word_distribution_by_topic()

We can use this to find the topics a word is most likely to occur in.


In [34]:
def bars(x, scale_factor=10000):
    return int(x * scale_factor) * "="

def topics_related_to_word(word, n=10):
    for wd, rel in zip(word_dists, relevance):
        score = wd[word]
        rel_words = ' '.join([w for w, _ in rel][:n]) 
        if bars(score):
            print bars(score), rel_words

What topics are most likely to contain the word "Munger" (as in Mike Munger). The number of equal signs indicates the probability the word is generated by the topic. If a topic isn't shown, it's extremely unlikley to generate the word.


In [35]:
topics_related_to_word('munger')


==== growth trade water cities china city development climate inequality oil
================== smith hayek moral economics society adam liberty coase theory rules
=== bitcoin internet software google technology store bitcoins computer machines company

Where does Munger come up? In discussing the moral foundations of classical liberalism and microeconomics!

How about the word "lovely"? Russ Roberts uses it often when talking about the Theory of Moral Sentiments. It looks like it also comes up when talking about schools.


In [36]:
topics_related_to_word('lovely')


= smith hayek moral economics society adam liberty coase theory rules

If you have feedback on this implementation of HDP-LDA, you can reach me on Twitter or open an issue on Github.