Performing Model Selection Using Topic Coherence

This notebook will perform topic modeling on the 20 Newsgroups corpus using LDA. We will perform model selection (over the number of topics) using topic coherence as our evaluation metric. This will showcase some of the features of the topic coherence pipeline implemented in gensim. In particular, we will see several features of the CoherenceModel.


In [1]:
from __future__ import print_function

import os
import re
import logging
from collections import OrderedDict

from gensim.corpora import TextCorpus, MmCorpus
from gensim import utils, models

logging.basicConfig(level=logging.ERROR)  # disable warning logging


Loading the Dataset

The 20 Newsgroups dataset consists of 20 different newsgroups (forum discussion groups), with many posts per group. The original data is available here: http://qwone.com/~jason/20Newsgroups/. However, sklearn also provides a wrapper around this data, so we'll use that for simplicity; it takes care of downloading the texts and loading them into memory.

The documents are in the newsgroup format, which includes headers, quoting of previous messages in the thread, and possibly PGP signature blocks. The code below builds on the TextCorpus preprocessing to handle this newsgroup-specific parsing. By default, the TextCorpus preprocessing asciifolds and lowercases all text, tokenizes it by pulling out contiguous sequences of alphabetic characters, and then discards stopwords and tokens shorter than 3 characters.
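
As a rough illustration of those defaults, the snippet below approximates them with gensim.utils.simple_preprocess plus stopword removal. This is only an approximation for intuition, not the exact TextCorpus code path.

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

sample = "The quick brown fox could not jump over the lazy dog in 1991!"
tokens = [token for token in simple_preprocess(sample, deacc=True, min_len=3)
          if token not in STOPWORDS]
print(tokens)  # lowercased alphabetic tokens of length >= 3, stopwords removed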


In [2]:
from sklearn import datasets


class NewsgroupCorpus(TextCorpus):
    """Parse 20 Newsgroups dataset."""
    
    def __init__(self, *args, **kwargs):
        super(NewsgroupCorpus, self).__init__(
            datasets.fetch_20newsgroups(subset='all'), *args, **kwargs)

    def getstream(self):
        for doc in self.input.data:
            yield doc  # already unicode

    def preprocess_text(self, text):
        body = extract_body(text)
        return super(NewsgroupCorpus, self).preprocess_text(body)

    
def extract_body(text):
    return strip_newsgroup_header(
        strip_newsgroup_footer(
            strip_newsgroup_quoting(text)))


def strip_newsgroup_header(text):
    """Given text in "news" format, strip the headers, by removing everything
    before the first blank line.
    """
    _before, _blankline, after = text.partition('\n\n')
    return after


_QUOTE_RE = re.compile(r'(writes in|writes:|wrote:|says:|said:'
                       r'|^In article|^Quoted from|^\||^>)')
def strip_newsgroup_quoting(text):
    """Given text in "news" format, strip lines beginning with the quote
    characters > or |, plus lines that often introduce a quoted section
    (for example, because they contain the string 'writes:').
    """
    good_lines = [line for line in text.split('\n')
                  if not _QUOTE_RE.search(line)]
    return '\n'.join(good_lines)


_PGP_SIG_BEGIN = "-----BEGIN PGP SIGNATURE-----"
def strip_newsgroup_footer(text):
    """Given text in "news" format, attempt to remove a signature block."""
    try:
        return text[:text.index(_PGP_SIG_BEGIN)]
    except ValueError:
        return text

Loading and Serializing the Dataset

Now that we have defined the necessary code for preprocessing the dataset, let's load it up and serialize it into Matrix Market format. We'll do this because we want to train LDA on it with several different parameter settings, and this will allow us to avoid repeating the preprocessing.


In [3]:
%%time

corpus = NewsgroupCorpus()
corpus.dictionary.filter_extremes(no_below=5, no_above=0.8)
dictionary = corpus.dictionary
print(len(corpus))
print(dictionary)


18846
Dictionary(23776 unique tokens: [u'circuitry', u'hanging', u'woody', u'localized', u'gaa']...)
CPU times: user 18.8 s, sys: 259 ms, total: 19.1 s
Wall time: 19.1 s

In [4]:
%%time

mm_path = '20_newsgroups.mm'
MmCorpus.serialize(mm_path, corpus, id2word=dictionary)
mm_corpus = MmCorpus(mm_path)  # load back in to use for LDA training


CPU times: user 21.4 s, sys: 233 ms, total: 21.6 s
Wall time: 21.7 s

Training the Models

Our goal is to determine which number of topics produces the most coherent topics for the 20 Newsgroups corpus. The corpus contains 18,846 documents. If we used 100 topics and the documents were evenly distributed among them, we'd have clusters of roughly 188 documents each, which seems like a reasonable upper bound. The corpus also comes with category labels, which we list below; there are 20 of them (hence the name of the dataset), so we'll use 20 as our lower bound for the number of topics.

One could argue that we already know the model should have 20 topics. I'll argue there may be additional categorizations within each newsgroup and we might hope to capture those by using more topics. We'll step by increments of 10 from 20 to 100.


In [5]:
print('\n'.join(corpus.input.target_names))


alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

In [7]:
%%time

trained_models = OrderedDict()
for num_topics in range(20, 101, 10):
    print("Training LDA(k=%d)" % num_topics)
    lda = models.LdaMulticore(
        mm_corpus, id2word=dictionary, num_topics=num_topics, workers=4,
        passes=10, iterations=100, random_state=42, eval_every=None,
        alpha='asymmetric',  # shown to be better than symmetric in most cases
        decay=0.5, offset=64  # best params from Hoffman paper
    )
    trained_models[num_topics] = lda


Training LDA(k=20)
Training LDA(k=30)
Training LDA(k=40)
Training LDA(k=50)
Training LDA(k=60)
Training LDA(k=70)
Training LDA(k=80)
Training LDA(k=90)
Training LDA(k=100)
CPU times: user 19min 41s, sys: 3min 32s, total: 23min 13s
Wall time: 23min 25s

In [6]:
# Some useful utility functions in case you want to save your models.

home = os.path.expanduser('~/')
models_dir = os.path.join(home, 'workshop', 'nlp', 'models')  # use whatever directory you prefer

def save_models(named_models):
    for num_topics, model in named_models.items():
        model_path = os.path.join(models_dir, 'lda-newsgroups-k%d.lda' % num_topics)
        model.save(model_path, separately=False)

        
def load_models():
    trained_models = OrderedDict()
    for num_topics in range(20, 101, 10):
        model_path = os.path.join(models_dir, 'lda-newsgroups-k%d.lda' % num_topics)
        print("Loading LDA(k=%d) from %s" % (num_topics, model_path))
        trained_models[num_topics] = models.LdaMulticore.load(model_path)

    return trained_models

In [8]:
# save_models(trained_models)
# trained_models = load_models()

Evaluation Using Coherence

Now we get to the heart of this notebook. In this section, we'll evaluate each of our LDA models using topic coherence. Coherence is a measure of how interpretable the topics are to humans, based on representing each topic by its top-N most probable words. More specifically, given the topic-term matrix for LDA, we sort each topic's term weights from highest to lowest and then select the first N terms.
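
As a quick look at that top-N representation, here is how to peek at one topic's highest-weight terms for one of the models trained above (a minimal sketch using the k=20 model):

# Inspect the top terms of topic 0 in the k=20 model trained earlier.
lda20 = trained_models[20]
for term, weight in lda20.show_topic(0, topn=10):
    print('%s\t%.4f' % (term, weight))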

Coherence essentially measures how similar these words are to each other. There are various methods for doing this, most of which have been explored in the paper "Exploring the Space of Topic Coherence Measures". The authors performed a comparative analysis of various methods, correlating them to human judgements. The method named "c_v" coherence was found to be the most highly correlated. This and several of the other methods have been implemented in gensim.models.CoherenceModel. We will use this to perform our evaluations.

The "c_v" coherence method makes an expensive pass over the corpus, accumulating term occurrence and co-occurrence counts. It only accumulates counts for the terms in the lists of top-N terms for each topic. In order to ensure we only need to make one pass, we'll construct a "super topic" from the top-N lists of each of the models. This will consist of a single topic with all the relevant terms from all the models. We choose 20 as N.


In [8]:
%%time

# Now estimate the probabilities for the CoherenceModel.
# This performs a single pass over the reference corpus, accumulating
# the necessary statistics for all of the models at once.
cm = models.CoherenceModel.for_models(
    trained_models.values(), dictionary, texts=corpus.get_texts(), coherence='c_v')


CPU times: user 44.9 s, sys: 2.09 s, total: 47 s
Wall time: 1min 32s

In [9]:
%%time

coherence_estimates = cm.compare_models(trained_models.values())
coherences = dict(zip(trained_models.keys(), coherence_estimates))


CPU times: user 1min 14s, sys: 332 ms, total: 1min 15s
Wall time: 1min 15s

In [9]:
def print_coherence_rankings(coherences):
    avg_coherence = \
        [(num_topics, avg_coherence)
         for num_topics, (_, avg_coherence) in coherences.items()]
    ranked = sorted(avg_coherence, key=lambda tup: tup[1], reverse=True)
    print("Ranked by average '%s' coherence:\n" % cm.coherence)
    for item in ranked:
        print("num_topics=%d:\t%.4f" % item)
    print("\nBest: %d" % ranked[0][0])

In [11]:
print_coherence_rankings(coherences)


Ranked by average 'c_v' coherence:

num_topics=30:	0.5551
num_topics=20:	0.5374
num_topics=90:	0.5022
num_topics=80:	0.4986
num_topics=50:	0.4873
num_topics=40:	0.4856
num_topics=100:	0.4848
num_topics=70:	0.4769
num_topics=60:	0.4662

Best: 30

Results so Far

So far in this notebook, we have used gensim's CoherenceModel to perform model selection over the number of topics for LDA. We found that for the 20 Newsgroups corpus, 30 topics is best, followed by 20. We showcased the coherence pipeline's ability to evaluate aggregate model coherence (note that cm.compare_models also returns the individual topic coherences). We also demonstrated how to avoid repeated passes over the corpus by accumulating the term occurrence and co-occurrence statistics for all relevant terms just once. Topic coherence is a powerful alternative to evaluation using perplexity on a held-out document set; it is appropriate whenever the objective of the topic modeling is to present the topics as top-N lists for human consumption.

Note that coherence calculations are generally much more accurate when a larger reference corpus is used to estimate the probabilities. Here we used the same corpus as for our modeling, which is relatively small at under 19,000 documents. A better reference corpus is the full Wikipedia corpus; the motivated explorer of this notebook is encouraged to download that corpus (see Experiments on the English Wikipedia) and use it for probability estimation.
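
That would look roughly like the sketch below. The dump filename is a placeholder, and this cell is not run here; building a Wikipedia corpus takes a long time.

# Sketch: use Wikipedia as the reference corpus for probability estimation.
# 'enwiki-latest-pages-articles.xml.bz2' is a placeholder path to a dump you
# would download yourself.
from gensim.corpora import WikiCorpus

wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary=dictionary)
cm_wiki = models.CoherenceModel.for_models(
    trained_models.values(), dictionary, texts=wiki.get_texts(), coherence='c_v')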

Next we'll look at another method of coherence evaluation using distributed word embeddings.

Evaluating Coherence with Word2Vec

The fact that "c_v" coherence uses distributional semantics to evalaute word similarity motivates the use of Word2Vec for coherence evaluation. This idea is explored further in an appendix at the end of the notebook. The CoherenceModel implemented in gensim also supports this, so let's look at a few examples.


In [12]:
%%time

class TextsIterable(object):
    """Wrap a TextCorpus in something that yields texts from its __iter__.
    
    It's necessary to use this because the Word2Vec model is built by scanning
    over the texts several times. Passing in corpus.get_texts() would result in
    an empty iterable on passes after the first.
    """
    
    def __init__(self, corpus):
        self.corpus = corpus
    
    def __iter__(self):
        return self.corpus.get_texts()


cm = models.CoherenceModel.for_models(
    trained_models.values(), dictionary, texts=TextsIterable(corpus), coherence='c_w2v')


CPU times: user 3min 15s, sys: 5.57 s, total: 3min 20s
Wall time: 2min 1s

In [13]:
%%time

coherence_estimates = cm.compare_models(trained_models.values())
coherences = dict(zip(trained_models.keys(), coherence_estimates))
print_coherence_rankings(coherences)


Ranked by average 'c_w2v' coherence:

num_topics=90:	0.7450
num_topics=100:	0.7433
num_topics=80:	0.7421
num_topics=70:	0.7310
num_topics=50:	0.7181
num_topics=60:	0.7169
num_topics=30:	0.7149
num_topics=20:	0.7131
num_topics=40:	0.7107

Best: 90
CPU times: user 10.5 s, sys: 149 ms, total: 10.7 s
Wall time: 10.8 s

Using Pre-trained Word Vectors for Coherence Evaluation

Whoa! These results are completely different from those of the "c_v" method, and "c_w2v" is saying the models we thought were best are actually some of the worst! So what happened here?

The same caveat we made for "c_v" applies to Word2Vec ("c_w2v"): results are more accurate when a larger reference corpus is used. For "c_w2v", this is actually far more important: distributional word embedding techniques such as Word2Vec fit a probability distribution with a large number of parameters, and doing that well takes a lot of data.

Luckily, there are a variety of pre-trained word vectors freely available for download. Below we demonstrate using word vectors trained on ~100 billion words from Google News, available at this link. Note that this file is about 1.5 GB, so downloading it can take quite some time. It is also quite slow to load, and it ends up occupying about 3.35 GB in memory (the load time is included in the timing below). There is no need to use such a large set of word vectors for this evaluation; this one is just readily available.


In [18]:
%%time

models_dir = os.path.join(home, 'workshop', 'nlp', 'models')
vectors_path = os.path.join(models_dir, 'GoogleNews-vectors-negative300.bin.gz')
keyed_vectors = models.KeyedVectors.load_word2vec_format(vectors_path, binary=True)

# still need to estimate_probabilities, but corpus is not scanned
cm = models.CoherenceModel.for_models(
    trained_models.values(), dictionary, texts=corpus.get_texts(),
    coherence='c_w2v', keyed_vectors=keyed_vectors)


CPU times: user 3min 8s, sys: 5.45 s, total: 3min 13s
Wall time: 3min 15s

In [19]:
%%time

coherence_estimates = cm.compare_models(trained_models.values())
coherences = dict(zip(trained_models.keys(), coherence_estimates))
print_coherence_rankings(coherences)


Ranked by average 'c_w2v' coherence:

num_topics=30:	0.5044
num_topics=20:	0.4996
num_topics=40:	0.4965
num_topics=50:	0.4950
num_topics=80:	0.4906
num_topics=60:	0.4839
num_topics=90:	0.4832
num_topics=70:	0.4826
num_topics=100:	0.4796

Best: 30
CPU times: user 10.4 s, sys: 146 ms, total: 10.6 s
Wall time: 10.6 s

Watch out for Out-of-Vocabulary (OOV) terms!

At first glance, it might seem like "c_w2v" coherence with the GoogleNews vectors is a great improvement over the "c_v" method; it certainly improves on the vectors trained on the newsgroups corpus itself. However, if you run the above code with logging enabled, you'll notice a ton of warning messages along the lines of "3 terms for topic 10 not in word2vec model." This is a real gotcha to watch out for! In this case, we might already suspect a significant mismatch, because all the coherence values are so similar (within about 0.02 of each other).

When using pre-trained word vectors, there is likely to be some vocabulary mismatch. So unless the corpus you're modeling was included in the training data for the vectors, you need to watch out for this. In the results above, the problem is easy to spot because all of the models have very similar coherence values. You can use the function below to dig in and see exactly how bad the issue is.


In [20]:
from gensim import utils


def report_on_oov_terms(cm, topic_models):
    """OOV = out-of-vocabulary"""
    topics_as_topn_terms = [
        models.CoherenceModel.top_topics_as_word_lists(model, dictionary)
        for model in topic_models
    ]

    oov_words = cm._accumulator.not_in_vocab(topics_as_topn_terms)
    print('number of oov words: %d' % len(oov_words))
    
    for num_topics, words in zip(trained_models.keys(), topics_as_topn_terms):
        oov_words = cm._accumulator.not_in_vocab(words)
        print('number of oov words for num_topics=%d: %d' % (num_topics, len(oov_words)))

report_on_oov_terms(cm, trained_models.values())


number of oov words: 411
number of oov words for num_topics=20: 18
number of oov words for num_topics=30: 45
number of oov words for num_topics=40: 68
number of oov words for num_topics=50: 93
number of oov words for num_topics=60: 117
number of oov words for num_topics=70: 135
number of oov words for num_topics=80: 151
number of oov words for num_topics=90: 182
number of oov words for num_topics=100: 218

Yikes! That's a lot of terms being ignored when calculating the coherence metrics, so these results are not really reliable. Let's use a different set of pre-trained word vectors instead: I trained these on a recent Wikipedia dump using skip-gram negative sampling (SGNS) with a context window of 5 and 300 dimensions.


In [14]:
%%time

vectors_path = os.path.join(models_dir, 'wiki-en_sgns5-w5-s300.bin.gz')
keyed_vectors = models.KeyedVectors.load_word2vec_format(vectors_path, binary=True)

# still need to estimate_probabilities, but corpus is not scanned
cm = models.CoherenceModel.for_models(
    trained_models.values(), dictionary, texts=corpus.get_texts(),
    coherence='c_w2v', keyed_vectors=keyed_vectors)


CPU times: user 2min 7s, sys: 7.12 s, total: 2min 14s
Wall time: 2min 16s

In [15]:
%%time

coherence_estimates = cm.compare_models(trained_models.values())
coherences = dict(zip(trained_models.keys(), coherence_estimates))
print_coherence_rankings(coherences)


/Users/vru959/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2909: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/Users/vru959/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Ranked by average 'c_w2v' coherence:

num_topics=30:	0.5907
num_topics=20:	0.5833
num_topics=40:	0.5750
num_topics=50:	0.5673
num_topics=60:	0.5614
num_topics=80:	0.5595
num_topics=70:	0.5571
num_topics=90:	0.5520
num_topics=100:	0.5471

Best: 30
CPU times: user 10.3 s, sys: 158 ms, total: 10.5 s
Wall time: 10.5 s

In [16]:
report_on_oov_terms(cm, trained_models.values())


number of oov words: 64
number of oov words for num_topics=20: 1
number of oov words for num_topics=30: 10
number of oov words for num_topics=40: 11
number of oov words for num_topics=50: 13
number of oov words for num_topics=60: 16
number of oov words for num_topics=70: 23
number of oov words for num_topics=80: 27
number of oov words for num_topics=90: 37
number of oov words for num_topics=100: 32

Looks like we've now restored order

The results with the Wikipedia-trained word vectors are much better because far fewer terms are OOV. The "c_w2v" evaluation now agrees with "c_v" on the best two models, and the rest of the ordering is broadly similar. Note that the "c_w2v" values should not be compared directly to those produced by the "c_v" method; only the ranking of models is comparable.
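
If you want to quantify that agreement rather than eyeball it, you can compare the two rankings with a rank correlation. The sketch below assumes you stashed the earlier results under the hypothetical names coherences_cv and coherences_w2v before they were overwritten.

# Compare rankings (not raw values) from the two measures with Spearman's rho.
# coherences_cv / coherences_w2v are hypothetical saved copies of the results.
from scipy.stats import spearmanr

ks = sorted(coherences_cv.keys())
avg_cv = [coherences_cv[k][1] for k in ks]    # values are (per-topic, average) tuples
avg_w2v = [coherences_w2v[k][1] for k in ks]
rho, pval = spearmanr(avg_cv, avg_w2v)
print("Spearman rank correlation: %.3f (p=%.3f)" % (rho, pval))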

Appendix: Why Word2Vec for Coherence?

The "c_v" coherence method drags a sliding window across all documents in the corpus to accumulate co-occurrence statistics. Similarity is calculated using normalized pointwise mutual information (PMI) values estimated from these statistics. More specifically, each word is represented by a vector of its NPMI with every other word in its top-N topic list. These vectors are then used to compute (cosine) similarity between words. The restriction to the other words in the top-N list was found to produce better results than using the entire vocabulary and other methods of reducing the vocabulary (see section 3.2.2 of http://www.aclweb.org/anthology/W13-0102).

The fact that a reduced space is superior for these metrics indicates there is noise getting in the way. The "c_v" method can be seen as constructing an NPMI matrix between words. The vector of NPMI values for a particular word can then be looked up by indexing the row or column corresponding to that word's Dictionary ID. The reduction to the "topic word space" can then be achieved by using a mask to select out the top-N topic words. If we are constructing an NPMI matrix between words, then discarding some elements to reduce noise, why not factorize the matrix instead? Dimensionality reduction techniques such as SVD do a great job of reducing noise along with dimensionality, while also providing a compressed representation to work with.
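
A minimal sketch of that factorization idea, using a random symmetric matrix as a stand-in for a real word-word NPMI matrix:

import numpy as np

rng = np.random.RandomState(42)
npmi_matrix = rng.uniform(-0.2, 1.0, size=(500, 500))   # stand-in for a real NPMI matrix
npmi_matrix = (npmi_matrix + npmi_matrix.T) / 2.0        # symmetrize

k = 50                                                   # reduced dimensionality
U, S, Vt = np.linalg.svd(npmi_matrix)
word_vectors = U[:, :k] * np.sqrt(S[:k])                 # dense, denoised word representations

print(word_vectors.shape)  # (500, 50): one compressed vector per word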

Recent work has shown that Word2Vec (trained with Skip-Gram Negative Sampling (SGNS)) is actually implicitly factorizing a PMI matrix shifted by a positive constant. A subsequent paper compared Word2Vec to a few different PMI-based metrics and showed that it found coherence values that correlated more strongly with human judgements.