Anchored CorEx: Topic Modeling with Minimal Domain Knowledge

Author: Ryan J. Gallagher

Last updated: 07/21/2018

This notebook walks through how to use the CorEx topic model code. This includes fitting CorEx to your data, examining the topic model output, outputting results, building a hierarchical topic model, and anchoring words to topics.

Details of the CorEx topic model and evaluations against unsupervised and semi-supervised variants of LDA can be found in our TACL paper:

Gallagher, Ryan J., Kyle Reing, David Kale, and Greg Ver Steeg. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (TACL), 2017.


In [2]:
import numpy as np
import scipy.sparse as ss
import matplotlib.pyplot as plt

from corextopic import corextopic as ct
from corextopic import vis_topic as vt # jupyter notebooks will complain matplotlib is being loaded twice

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline

Loading the 20 Newsgroups Dataset

We first need to load data to run the CorEx topic model. We'll use the 20 Newsgroups dataset, which scikit-learn provides functionality for fetching.


In [3]:
# Get 20 newsgroups data
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

The topic model assumes input in the form of a doc-word matrix, where rows are documents, columns are words, and each entry is a binary indicator of whether a word appears in a document. We'll vectorize the newsgroups data, take the top 20,000 words, and convert the result to a sparse matrix to save on memory usage. Note, we use binary count vectors as input to the CorEx topic model.


In [4]:
# Transform 20 newsgroup data into a sparse matrix
vectorizer = CountVectorizer(stop_words='english', max_features=20000, binary=True)
doc_word = vectorizer.fit_transform(newsgroups.data)
doc_word = ss.csr_matrix(doc_word)

doc_word.shape # n_docs x m_words


Out[4]:
(11314, 20000)

Our doc-word matrix is 11,314 documents by 20,000 words. Let's get the words that label the columns. We'll need these for outputting readable topics and later for anchoring.


In [5]:
# Get words that label the columns (needed to extract readable topics and make anchoring easier)
words = list(vectorizer.get_feature_names())

We'll do a final step of preprocessing where we remove all integers from our set of words. This brings us down to 19,038 words.


In [6]:
not_digit_inds = [ind for ind,word in enumerate(words) if not word.isdigit()]
doc_word = doc_word[:,not_digit_inds]
words    = [word for ind,word in enumerate(words) if not word.isdigit()]

doc_word.shape # n_docs x m_words


Out[6]:
(11314, 19038)

CorEx Topic Model

The main parameters of the CorEx topic model are:

  • n_hidden: number of topics ("hidden" because the topics are the model's latent variables)
  • words: words that label the columns of the doc-word matrix (optional)
  • docs: document labels that label the rows of the doc-word matrix (optional)
  • max_iter: number of iterations to run through the update equations (optional, defaults to 200)
  • verbose: if verbose=1, then CorEx will print the topic TCs with each iteration
  • seed: random number seed to use for model initialization (optional)

We'll train a topic model with 50 topics. (This will take a few minutes.)


In [7]:
# Train the CorEx topic model with 50 topics
topic_model = ct.Corex(n_hidden=50, max_iter=200, verbose=False, seed=1)
topic_model.fit(doc_word, words=words);

CorEx Output

Topics

The CorEx topic model provides functionality for easily accessing the topics. Let's take a look at one of the topics.


In [8]:
# Get a single topic from the CorEx topic model
topic_model.get_topics(topic=1, n_words=10)


Out[8]:
[('team', 0.07605118792683423),
 ('game', 0.06666754127982474),
 ('season', 0.04887993842517729),
 ('players', 0.04734426771441264),
 ('league', 0.043708887910980376),
 ('play', 0.043335932010217425),
 ('hockey', 0.042396968873452616),
 ('games', 0.039692707277392415),
 ('teams', 0.03682532840163069),
 ('nhl', 0.03191145093230828)]

The topic words are those with the highest mutual information with the topic, rather than those with the highest probability within the topic, as in LDA. The mutual information with the topic is the number reported in each tuple. Theoretically, mutual information is always non-negative. If get_topics() returns a negative mutual information, then the absolute value of that quantity is the mutual information between the topic and the absence of that word.

If the column labels have not been specified through words, then the code will return the column indices for the top words in each topic.

We can also retrieve all of the topics at once if we would like.


In [9]:
# Print all topics from the CorEx topic model
topics = topic_model.get_topics()
for n,topic in enumerate(topics):
    topic_words,_ = zip(*topic)
    print('{}: '.format(n) + ','.join(topic_words))


0: dsl,n3jxp,chastity,cadre,geb,shameful,intellect,skepticism,banks,pitt
1: team,game,season,players,league,play,hockey,games,teams,nhl
2: government,law,public,rights,state,encryption,clipper,federal,security,secure
3: god,jesus,bible,christians,christian,christ,religion,jews,church,faith
4: people,say,fact,point,believe,person,saying,world,reason,mean
5: armenians,armenian,national,international,argic,press,policy,serdar,soviet,armenia
6: file,program,window,directory,ftp,pub,server,application,unix,available
7: based,issue,sense,clear,truth,subject,certain,known,particular,existence
8: cs,ma,au,gmt,cc,uu,id,sites,fi,host
9: windows,software,card,thanks,pc,dos,files,disk,advance,ram
10: drive,sale,scsi,controller,board,shipping,ide,drives,cd,bus
11: pitching,hit,staff,braves,runs,hitter,nl,smith,hr,baltimore
12: just,don,like,time,going,right,better,let,come,didn
13: archive,various,document,related,addition,modified,published,contents,complete,distributed
14: information,internet,university,systems,send,following,address,phone,contact,computer
15: year,april,san,york,los,washington,north,angeles,city,california
16: war,country,children,killed,military,population,society,live,soldiers,anti
17: space,nasa,orbit,earth,moon,launch,shuttle,lunar,mission,flight
18: life,sin,words,mind,spirit,born,father,follow,accept,son
19: pp,special,van,berkeley,journal,ai,mark,mu,la,vol
20: years,away,later,left,came,days,old,ago,took,gave
21: disease,medical,doctor,patients,food,cause,treatment,medicine,blood,health
22: provide,questions,provides,developed,specific,development,standard,require,appropriate,commercial
23: given,number,note,end,present,taken,according,purpose,numbers,major
24: key,keys,data,algorithm,details,des,process,contains,users,provided
25: members,turkish,involved,army,organizations,troops,received,organization,land,fighting
26: read,different,example,does,word,having,groups,written,book,try
27: united,states,american,force,individual,independent,arms,community,nation,forces
28: death,human,said,evidence,crime,self,kill,lives,murder,killing
29: new,including,sent,single,department,short,news,ii,school,placed
30: use,using,work,used,need,run,problems,line,help,type
31: large,small,control,needed,outside,local,light,parts,useful,ground
32: general,important,far,course,non,times,actually,consider,likely,result
33: think,way,good,things,really,know,did,thing,ve,probably
34: problem,set,place,called,change,trying,return,open,support,instead
35: man,history,today,women,went,told,coming,happened,stand,knew
36: second,john,period,1st,2nd,3rd,points,goal,ed,followed
37: bike,ride,engine,riding,dod,bikes,miles,motorcycle,rear,honda
38: drivers,mode,mb,faster,interface,os,driver,hp,color,fast
39: gun,guns,weapons,firearms,defense,weapon,batf,armed,assault,shooting
40: high,power,low,current,model,al,lower,higher,series,average
41: ways,dr,break,passed,kinds,reach,mass,larry,content,stands
42: wide,included,volume,remote,bit,pages,notes,fully,fields,operations
43: long,day,especially,situation,rest,body,century,ones,family,worse
44: make,want,real,case,possible,order,quite,free,able,ask
45: car,money,cars,pay,tax,road,deal,insurance,worth,dollars
46: drug,certainly,considered,taking,effective,expect,generally,social,child,purposes
47: necessary,strong,prevent,required,plan,safe,carefully,attention,aside,unique
48: little,wants,takes,comes,lead,trouble,looks,pass,capable,unfortunately
49: bring,brought,happy,charge,smart,improve,shows,england,cast,belong

The first topic tends to be less coherent than the others because of encodings and other oddities in the newsgroups data.

We can also get the column indices instead of the column labels if necessary.


In [10]:
topic_model.get_topics(topic=5, n_words=10, print_words=False)


Out[10]:
[(1336, 0.037197519799330066),
 (1335, 0.03608567121432171),
 (11448, 0.03531603719337086),
 (8968, 0.03075981567134055),
 (1306, 0.029126522278008646),
 (13303, 0.02777130708538792),
 (13051, 0.027594397620650163),
 (15437, 0.026729063880204122),
 (16092, 0.026121923976488198),
 (1334, 0.02589034185072226)]

If we need to directly access the topic assignments for each word, they can be accessed through clusters.


In [11]:
print(topic_model.clusters)
print(topic_model.clusters.shape) # m_words


[ 8  9 38 ... 37  0  0]
(19038,)
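
For instance, here is a minimal sketch of recovering the words assigned to a topic directly from clusters. (Unlike get_topics(), this does not sort the words by mutual information.)


In [ ]:
# Sketch: recover the words assigned to topic 1 via the clusters attribute
topic1_words = [w for w, c in zip(words, topic_model.clusters) if c == 1]
print(len(topic1_words))
print(topic1_words[:10])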

Document Labels

As with the topic words, the most probable documents per topic can also be easily accessed. Documents are sorted according to log probabilities, which is why the highest probability documents have a score of 0 ($e^0 = 1$) and all other documents have negative scores (for example, $e^{-0.5} \approx 0.6$).


In [12]:
# Get the top documents for the first topic
topic_model.get_top_docs(topic=0, n_docs=10, sort_by='log_prob')


NOTE: 'docs' not provided to CorEx. Returning top docs as lists of row indices
Out[12]:
[(3097, 0.0),
 (2350, 0.0),
 (105, 0.0),
 (3864, 0.0),
 (9396, 0.0),
 (11229, 0.0),
 (6440, 0.0),
 (6437, 0.0),
 (2284, 0.0),
 (8445, 0.0)]

CorEx is a discriminative model, whereas LDA is a generative model. While LDA infers a probability distribution over topics for each document, CorEx instead estimates the probability that a document belongs to each topic given that document's words. As a result, the probabilities across topics for a given document do not have to add up to 1. The estimated topic probabilities for each document can be accessed through log_p_y_given_x or p_y_given_x.


In [13]:
print(topic_model.p_y_given_x.shape) # n_docs x k_topics


(11314, 50)
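
As a quick check (a sketch, not part of the original workflow), we can verify that a single document's topic probabilities need not sum to 1:


In [ ]:
# Sketch: the topic probabilities for one document do not sum to 1
print(topic_model.p_y_given_x[0])
print(np.sum(topic_model.p_y_given_x[0]))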

We can also use a softmax to make a binary determination of which documents belong to each topic. These softmax labels can be accessed through labels.


In [14]:
print(topic_model.labels.shape) # n_docs x k_topics


(11314, 50)

Since CorEx does not prescribe a probability distribution of topics over each document, this means that a document could possibly belong to no topics (all 0's across topics in labels) or all topics (all 1's across topics in labels).
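
A minimal sketch of checking this on our model, counting how many topic labels each document receives:


In [ ]:
# Sketch: count topic labels per document
topics_per_doc = topic_model.labels.sum(axis=1)
print('Docs with no topics:', np.sum(topics_per_doc == 0))
print('Docs labeled with every topic:', np.sum(topics_per_doc == topic_model.labels.shape[1]))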

Total Correlation and Model Selection

Overall TC

Total correlation is the measure that CorEx maximizes when constructing the topic model. It can be accessed through tc and is reported in nats.


In [15]:
topic_model.tc


Out[15]:
44.54780845461276

Model selection: CorEx starts its algorithm with a random initialization, and so different runs can result in different topic models. One way of finding a better topic model is to restart the CorEx algorithm several times and take the run that has the highest TC value (i.e. the run that produces topics that are most informative about the documents).
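
A minimal sketch of this restart strategy is below. (Each fit takes a few minutes, so this loop is slow.)


In [ ]:
# Sketch: restart CorEx with several seeds and keep the run with the highest TC
best_model, best_tc = None, -np.inf
for seed in range(5):
    model = ct.Corex(n_hidden=50, seed=seed)
    model.fit(doc_word, words=words)
    if model.tc > best_tc:
        best_model, best_tc = model, model.tc
print('Best TC: {}'.format(best_tc))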

Topic TC

The overall total correlation is the sum of the per-topic total correlations, which can be accessed through tcs. For an unsupervised CorEx topic model, the topics are always sorted from high to low according to their TC. For an anchored CorEx topic model, the topics are not sorted; they are output such that the anchored topics come first.


In [16]:
topic_model.tcs.shape # k_topics


Out[16]:
(50,)

In [17]:
print(np.sum(topic_model.tcs))
print(topic_model.tc)


44.54780845461276
44.54780845461276

Selecting number of topics: one way to choose the number of topics is to observe the distribution of TCs for each topic to see how much each additional topic contributes to the overall TC. We should keep adding topics until additional topics do not significantly contribute to the overall TC. This is similar to choosing a cutoff eigenvalue when doing topic modeling via LSA.


In [18]:
plt.figure(figsize=(10,5))
plt.bar(range(topic_model.tcs.shape[0]), topic_model.tcs, color='#4e79a7', width=0.5)
plt.xlabel('Topic', fontsize=16)
plt.ylabel('Total Correlation (nats)', fontsize=16);


We see that the first topic is much more informative than the others. Given its high TC and the incoherence of most of its words, we suspect it is picking up on image encodings (note "dsl" and "n3jxp") and other boilerplate text. Additional investigation and preprocessing could help ensure the topic model does not pick up on these uninsightful patterns.
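
One possible way to do this, sketched below on the assumption that the suspect topic's top words really are boilerplate, is to drop those columns from the doc-word matrix and refit:


In [ ]:
# Sketch: drop the words from the suspect boilerplate topic and refit
bad_words = set(w for w, _ in topic_model.get_topics(topic=0, n_words=10))
keep_inds = [i for i, w in enumerate(words) if w not in bad_words]
doc_word_clean = doc_word[:, keep_inds]
words_clean = [words[i] for i in keep_inds]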

Pointwise Document TC

We can decompose total correlation further: each topic's TC is the average of the pointwise total correlations over the individual documents. The pointwise total correlations can be accessed through log_z.


In [19]:
topic_model.log_z.shape # n_docs x k_topics


Out[19]:
(11314, 50)

In [20]:
print(np.mean(topic_model.log_z, axis=0))
print(topic_model.tcs)


[3.64813418 1.57958569 1.48835238 1.42639341 1.42043661 1.39232446
 1.37222362 1.36535367 1.34334261 1.08793513 1.06264005 1.03767991
 1.01780633 0.98444346 0.98350034 0.97751588 0.97425573 0.96560217
 0.9170467  0.91160502 0.87920818 0.86341492 0.83259909 0.82311403
 0.79730235 0.78507652 0.76449939 0.74186846 0.73348486 0.7232203
 0.70714498 0.70407292 0.6876558  0.68271949 0.66403509 0.60590956
 0.59919508 0.58802044 0.5869777  0.58426131 0.57613847 0.57032326
 0.5434933  0.54324576 0.504955   0.48664151 0.28893956 0.27183877
 0.24173114 0.21054385]
[3.64813418 1.57958569 1.48835238 1.42639341 1.42043661 1.39232446
 1.37222362 1.36535367 1.34334261 1.08793513 1.06264005 1.03767991
 1.01780633 0.98444346 0.98350034 0.97751588 0.97425573 0.96560217
 0.9170467  0.91160502 0.87920818 0.86341492 0.83259909 0.82311403
 0.79730235 0.78507652 0.76449939 0.74186846 0.73348486 0.7232203
 0.70714498 0.70407292 0.6876558  0.68271949 0.66403509 0.60590956
 0.59919508 0.58802044 0.5869777  0.58426131 0.57613847 0.57032326
 0.5434933  0.54324576 0.504955   0.48664151 0.28893956 0.27183877
 0.24173114 0.21054385]

The pointwise total correlations in log_z represent the correlations within an individual document explained by a particular topic. These correlations have been used to measure how "surprising" documents are with respect to given topics (see references below).
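
For instance, here is a minimal sketch of ranking documents by how much correlation a given topic explains in them (here topic 1, the sports topic):


In [ ]:
# Sketch: rank documents by the pointwise TC explained by topic 1
scores = np.asarray(topic_model.log_z)[:, 1]
doc_order = np.argsort(scores)[::-1]
print(doc_order[:10])  # row indices of the documents topic 1 explains most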

Hierarchical Topic Models

The labels attribute gives the binary topic expressions for each document and each topic. We can use this output as input to another CorEx topic model to get latent representations of the topics themselves, yielding a hierarchical CorEx topic model. As with the first layer, the number of latent variables to add in higher layers can be determined by examining the topic TCs.


In [21]:
# Train a second layer to the topic model
tm_layer2 = ct.Corex(n_hidden=10)
tm_layer2.fit(topic_model.labels);

# Train a third layer to the topic model
tm_layer3 = ct.Corex(n_hidden=1)
tm_layer3.fit(tm_layer2.labels);


WARNING: Some words never appear (or always appear)

If you have graphviz installed, then you can output visualizations of the hierarchical topic model to your current working directory. One can also create custom visualizations of the hierarchy by properly making use of the labels attribute of each layer.


In [ ]:
vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=200, prefix='topic-model-example')
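
Alternatively, here is a minimal sketch of reading the hierarchy directly, assuming the second-layer model exposes the same clusters attribute as the first layer: each entry maps a layer-1 topic to its parent layer-2 node.


In [ ]:
# Sketch: see which layer-2 node each of the 50 layer-1 topics falls under
for n, parent in enumerate(tm_layer2.clusters):
    print('topic {} -> layer-2 node {}'.format(n, parent))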

Anchoring for Semi-Supervised Topic Modeling

Anchored CorEx is an extension of CorEx that allows the "anchoring" of words to topics. When anchoring a word to a topic, CorEx is trying to maximize the mutual information between that word and the anchored topic. So, anchoring provides a way to guide the topic model towards specific subsets of words that the user would like to explore.

The anchoring mechanism is flexible, and so there are many possible anchoring strategies. We explored the following types of anchoring in our TACL paper:

  1. Anchoring a single set of words to a single topic. This can help promote a topic that did not naturally emerge when running an unsupervised instance of the CorEx topic model. For example, one might anchor words like "snow," "cold," and "avalanche" to a topic if one suspects there should be a snow avalanche topic within a set of disaster relief articles.

  2. Anchoring a single set of words to multiple topics. This can help find different aspects of a topic that may be discussed in several different contexts. For example, one might anchor "protest" to three topics and "riot" to three other topics to understand the different framings that arise in tweets about political protests.

  3. Anchoring different sets of words to multiple topics. This can help enforce topic separability if there appear to be chimera topics. For example, one might anchor "mountain," "Bernese," and "dog" to one topic and "mountain," "rocky," and "colorado" to another topic to help separate topics that merge discussion of Bernese Mountain Dogs and the Rocky Mountains.

We'll demonstrate how to anchor words to topics in the CorEx topic model and how to develop other anchoring strategies.


In [22]:
# Anchor one word to the first topic
anchor_words = ['nasa']

In [23]:
# Anchor the word 'nasa' to the first topic
anchored_topic_model = ct.Corex(n_hidden=50, seed=2)
anchored_topic_model.fit(doc_word, words=words, anchors=anchor_words, anchor_strength=6);

This anchors the single word "nasa" to the first topic.


In [24]:
topic_words,_ = zip(*anchored_topic_model.get_topics(topic=0))
print('0: ' + ','.join(topic_words))


0: nasa,gov,ames,institute,jpl,station,propulsion,jsc,arc,shafer

We can anchor multiple groups of words to multiple topics as well.


In [25]:
# Anchor 'nasa' and 'space' to first topic, 'sports' and 'stadium' to second topic, so on...
anchor_words = [['nasa', 'space'], ['sports', 'stadium'], ['politics', 'government'], ['love', 'hope']]

anchored_topic_model = ct.Corex(n_hidden=50, seed=2)
anchored_topic_model.fit(doc_word, words=words, anchors=anchor_words, anchor_strength=6);

In [26]:
for n in range(len(anchor_words)):
    topic_words,_ = zip(*anchored_topic_model.get_topics(topic=n))
    print('{}: '.format(n) + ','.join(topic_words))


0: space,nasa,orbit,moon,shuttle,launch,gov,earth,lunar,ames
1: sports,stadium,april,san,city,los,york,washington,angeles,center
2: government,politics,state,rights,law,war,country,military,public,security
3: hope,love,helps,relates,virile,tatoos,sustaining,whosoever,weird,allegory

Note that in the above topic model, the topics are no longer sorted according to descending TC. Instead, the first topic is the one with "nasa" and "space" anchored to it, the second topic is the one with "sports" and "stadium" anchored to it, and so on.

Observe that the topic anchored with "love" and "hope" is less interpretable than the other three. This may be a sign that there is no good topic around these two words, and one should consider whether it is appropriate to anchor on them.

We can continue to develop even more involved anchoring strategies. Here we anchor "nasa" to one topic by itself, and also pair it with "politics" and "news" in two other topics, to surface different aspects of discussion around the word "nasa". We also anchor "war" to a fourth topic.


In [27]:
# Anchor with single words and groups of words
anchor_words = ['nasa', ['nasa', 'politics'], ['nasa', 'news'], 'war']

anchored_topic_model = ct.Corex(n_hidden=50, seed=2)
anchored_topic_model.fit(doc_word, words=words, anchors=anchor_words, anchor_strength=6);

In [28]:
for n in range(len(anchor_words)):
    topic_words,_ = zip(*anchored_topic_model.get_topics(topic=n))
    print('{}: '.format(n) + ','.join(topic_words))


0: nasa,space,orbit,launch,shuttle,moon,earth,lunar,satellite,commercial
1: nasa,politics,research,gov,science,scientific,institute,organization,studies,providing
2: news,nasa,insisting,edwards,hal,llnl,cso,cfv,nodak,admin
3: war,israel,armenians,armenian,israeli,jews,soldiers,military,killed,history

Note: If you do not specify the column labels through words, then you can still anchor by specifying the column indices of the features you wish to anchor on. You may also specify anchors using a mix of strings and indices if desired.
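
For example, here is a minimal sketch of anchoring on the column index of "nasa" rather than the string itself, assuming the fit call accepts raw column indices as the note above describes:


In [ ]:
# Sketch: anchor by column index instead of by word string
nasa_ind = words.index('nasa')
anchored_topic_model = ct.Corex(n_hidden=50, seed=2)
anchored_topic_model.fit(doc_word, anchors=[nasa_ind], anchor_strength=6);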

Choosing anchor strength: the anchor strength controls how much weight CorEx puts towards maximizing the mutual information between the anchor words and their respective topics. Anchor strength should always be set to a value greater than 1, since setting anchor strength between 0 and 1 only recovers the unsupervised CorEx objective. Empirically, anchor strengths between 1.5 and 3 gently nudge the topic model towards the anchor words, while an anchor strength above 5 strongly enforces that the CorEx topic model find a topic associated with them.

We encourage users to experiment with the anchor strength and determine what values are best for their needs.
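
A minimal sketch of such an experiment, sweeping a few anchor strengths on the "nasa" anchor and comparing the resulting topic:


In [ ]:
# Sketch: compare the anchored topic at several anchor strengths
for strength in [2, 4, 6]:
    model = ct.Corex(n_hidden=50, seed=2)
    model.fit(doc_word, words=words, anchors=['nasa'], anchor_strength=strength)
    topic_words, _ = zip(*model.get_topics(topic=0))
    print('strength {}: '.format(strength) + ','.join(topic_words))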

Other Output

The vis_topic module provides support for outputting topics and visualizations of the CorEx topic model. The code below creates a results directory named "twenty" in your working directory.


In [ ]:
vt.vis_rep(topic_model, column_label=words, prefix='twenty')

Further Reading

Our TACL paper details the theory of the CorEx topic model, its sparsity optimization, anchoring via the information bottleneck, comparisons to LDA, and anchoring experiments. The two papers from Greg Ver Steeg and Aram Galstyan develop the CorEx theory in general and provide further motivation and details of the underlying CorEx mechanisms. Hodas et al. demonstrated early CorEx topic model results and investigated an application of pointwise total correlations to quantify "surprising" documents.

  1. Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge, Gallagher et al., TACL 2017.

  2. Discovering Structure in High-Dimensional Data Through Correlation Explanation, Ver Steeg and Galstyan, NIPS 2014.

  3. Maximally Informative Hierarchical Representations of High-Dimensional Data, Ver Steeg and Galstyan, AISTATS 2015.

  4. Disentangling the Lexicons of Disaster Response in Twitter, Hodas et al., WWW 2015.