Performing Model Selection Using Topic Coherence

In this notebook we perform topic modeling on the 20 Newsgroups corpus using LDA and select the number of topics using topic coherence as our evaluation metric. Along the way, this showcases the topic coherence pipeline implemented in gensim, and in particular several features of the CoherenceModel.


In [39]:
from __future__ import print_function

import os
import re

from gensim.corpora import TextCorpus, MmCorpus
from gensim import utils, models
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import deaccent

Parsing the Dataset

The 20 Newsgroups dataset uses a hierarchical directory structure to store the articles. The structure looks something like this:

20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119
|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002

The files are in the newsgroup markup format, which includes headers, quoting of previous messages in the thread, and possibly PGP signature blocks. The message body itself is raw text and requires preprocessing. The first code block below is an adaptation of an active PR for parsing hierarchical directory structures into corpora; the block after it builds on this basic corpus parser to handle the newsgroup-specific text parsing.


In [34]:
class TextDirectoryCorpus(TextCorpus):
    """Read documents recursively from a directory,
    where each file is interpreted as a plain text document.
    """
    
    def iter_filepaths(self):
        """Lazily yield paths to each file in the directory structure within the specified
        range of depths. If a filename pattern to match was given, further filter to only
        those filenames that match.
        """
        for dirpath, dirnames, filenames in os.walk(self.input):
            for name in filenames:
                yield os.path.join(dirpath, name)
                
    def getstream(self):
        """Yield the full raw text of each file, one document at a time."""
        for path in self.iter_filepaths():
            with utils.smart_open(path) as f:
                doc_content = f.read()
            yield doc_content
    
    def preprocess_text(self, text):
        """Normalize whitespace, lowercase, deaccent, tokenize, then
        drop stopwords and short tokens.
        """
        text = deaccent(
            lower_to_unicode(
                strip_multiple_whitespaces(text)))
        tokens = simple_tokenize(text)
        return remove_short(
            remove_stopwords(tokens))
        
    def get_texts(self):
        """Iterate over the collection, yielding one document at a time. A document
        is a sequence of words (strings) that can be fed into `Dictionary.doc2bow`.
        Override this function to match your input (parse input files, do any
        text preprocessing, lowercasing, tokenizing etc.). There will be no further
        preprocessing of the words coming out of this function.
        """
        lines = self.getstream()
        if self.metadata:
            for lineno, line in enumerate(lines):
                yield self.preprocess_text(line), (lineno,)
        else:
            for line in lines:
                yield self.preprocess_text(line)

    
def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [token for token in tokens if token not in stopwords]

def remove_short(tokens, minsize=3):
    return [token for token in tokens if len(token) >= minsize]

def lower_to_unicode(text):
    return utils.to_unicode(text.lower(), 'ascii', 'ignore')

RE_WHITESPACE = re.compile(r"(\s)+", re.UNICODE)
def strip_multiple_whitespaces(text):
    return RE_WHITESPACE.sub(" ", text)

PAT_ALPHABETIC = re.compile(r'(((?![\d])\w)+)', re.UNICODE)
def simple_tokenize(text):
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()
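
As a quick sanity check of the helpers above, here is the full preprocessing chain applied to a made-up sentence (illustrative only; the expected output assumes the default STOPWORDS and minsize=3):


In [ ]:
# Illustrative check of the preprocessing chain on a fabricated sentence.
text = deaccent(lower_to_unicode(strip_multiple_whitespaces("The  Quick   Brown fox9 a")))
print(remove_short(remove_stopwords(simple_tokenize(text))))
# 'The' and 'a' are stopwords, and the digit in 'fox9' is not tokenized,
# so the expected tokens are: quick, brown, fox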

In [35]:
class NewsgroupCorpus(TextDirectoryCorpus):
    """Parse 20 Newsgroups dataset."""

    def extract_body(self, text):
        return strip_newsgroup_header(
            strip_newsgroup_footer(
                strip_newsgroup_quoting(text)))

    def preprocess_text(self, text):
        body = self.extract_body(text)
        return super(NewsgroupCorpus, self).preprocess_text(body)


def strip_newsgroup_header(text):
    """Given text in "news" format, strip the headers, by removing everything
    before the first blank line.
    """
    _before, _blankline, after = text.partition('\n\n')
    return after


_QUOTE_RE = re.compile(r'(writes in|writes:|wrote:|says:|said:'
                       r'|^In article|^Quoted from|^\||^>)')
def strip_newsgroup_quoting(text):
    """Given text in "news" format, strip lines beginning with the quote
    characters > or |, plus lines that often introduce a quoted section
    (for example, because they contain the string 'writes:').
    """
    good_lines = [line for line in text.split('\n')
                  if not _QUOTE_RE.search(line)]
    return '\n'.join(good_lines)


_PGP_SIG_BEGIN = "-----BEGIN PGP SIGNATURE-----"
def strip_newsgroup_footer(text):
    """Given text in "news" format, attempt to remove a signature block."""
    try:
        return text[:text.index(_PGP_SIG_BEGIN)]
    except ValueError:
        return text
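
To verify the newsgroup-specific stripping, here is a minimal check on a fabricated message (the text is illustrative, not from the corpus). Applying the three helpers in the same order as extract_body should leave only the body:


In [ ]:
# Fabricated message: header, quote intro, quoted line, body, PGP block.
sample = (
    "From: someone@example.com\n"
    "Subject: a test\n"
    "\n"
    "John Doe writes:\n"
    "> quoted text from an earlier message\n"
    "the actual reply text\n"
    "-----BEGIN PGP SIGNATURE-----\n"
    "fake signature lines\n"
)
print(strip_newsgroup_header(
    strip_newsgroup_footer(
        strip_newsgroup_quoting(sample))))
# expected output: the actual reply text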

Loading the Dataset

Now that we have defined the necessary code for parsing the dataset, let's load it up and serialize it into Matrix Market format. We'll do this because we want to train LDA on it with several different parameter settings, and this will allow us to avoid repeating the preprocessing.


In [36]:
# Replace data_path with path to your own copy of the corpus.
# You can download it from here: http://qwone.com/~jason/20Newsgroups/
# I'm using the original, called: 20news-19997.tar.gz

home = os.path.expanduser('~')
data_dir = os.path.join(home, 'workshop', 'nlp', 'data')
data_path = os.path.join(data_dir, '20_newsgroups')

In [49]:
%%time

corpus = NewsgroupCorpus(data_path)
dictionary = corpus.dictionary
print(len(corpus))
print(dictionary)


19998
Dictionary(107980 unique tokens: [u'jbwn', u'porkification', u'sowell', u'sonja', u'luanch']...)
CPU times: user 38.3 s, sys: 2.43 s, total: 40.7 s
Wall time: 43.7 s

In [38]:
%%time

mm_path = os.path.join(data_dir, '20_newsgroups.mm')
MmCorpus.serialize(mm_path, corpus, id2word=dictionary)
mm_corpus = MmCorpus(mm_path)  # load back in to use for LDA training


CPU times: user 25.9 s, sys: 2.76 s, total: 28.7 s
Wall time: 34 s

Training the Models

Our goal is to determine which number of topics produces the most coherent topics for the 20 Newsgroups corpus. The corpus contains roughly 20,000 documents: with 100 topics and documents evenly distributed among them, we'd have clusters of 200 documents, so 100 seems like a reasonable upper bound. The corpus also comes with predefined categories, given by the first-level directory structure shown above (for example alt.atheism, comp.graphics, and comp.os.ms-windows.misc). There are 20 of these (hence the name of the dataset), so we'll use 20 as our lower bound for the number of topics.

One could argue that we already know the model should have 20 topics. I'll argue there may be additional categorizations within each newsgroup and we might hope to capture those by using more topics. We'll step by increments of 10 from 20 to 100.


In [40]:
%%time

trained_models = {}
for num_topics in range(20, 101, 10):
    print("Training LDA(k=%d)" % num_topics)
    lda = models.LdaMulticore(
        mm_corpus, id2word=dictionary, num_topics=num_topics, workers=4,
        passes=10, iterations=200, random_state=42,
        alpha='asymmetric',  # shown to be better than symmetric in most cases
        decay=0.5, offset=64  # best params from Hoffman paper
    )
    trained_models[num_topics] = lda


Training LDA(k=20)
Training LDA(k=30)
Training LDA(k=40)
Training LDA(k=50)
Training LDA(k=60)
Training LDA(k=70)
Training LDA(k=80)
Training LDA(k=90)
Training LDA(k=100)
CPU times: user 1h 27min 7s, sys: 7min 54s, total: 1h 35min 2s
Wall time: 1h 3min 27s

Evaluation Using Coherence

Now we get to the heart of this notebook. In this section, we'll evaluate each of our LDA models using topic coherence. Coherence is a measure of how interpretable the topics are to humans. It is based on the representation of topics as the top-N most probable words for a particular topic. More specifically, given the topic-term matrix for LDA, we sort each topic from highest to lowest term weights and then select the first N terms.
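
For a single trained model, gensim exposes this top-N representation directly. As a minimal sketch (assuming a gensim version in which show_topic returns (term, weight) pairs), the top terms of one topic from the k=20 model trained above can be printed like so:


In [ ]:
# Top-10 most probable terms and their weights for topic 0 of the 20-topic model.
for term, weight in trained_models[20].show_topic(0, topn=10):
    print("%s\t%.4f" % (term, weight))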

Coherence essentially measures how similar these words are to each other. There are various methods for doing this, most of which have been explored in the paper "Exploring the Space of Topic Coherence Measures". The authors performed a comparative analysis of various methods, correlating them to human judgements. The method named "c_v" coherence was found to be the most highly correlated. This and several of the other methods have been implemented in gensim.models.CoherenceModel. We will use this to perform our evaluations.

The "c_v" coherence method makes an expensive pass over the corpus, accumulating term occurrence and co-occurrence counts. It only accumulates counts for the terms in the lists of top-N terms for each topic. In order to ensure we only need to make one pass, we'll construct a "super topic" from the top-N lists of each of the models. This will consist of a single topic with all the relevant terms from all the models. We choose 20 as N.


In [53]:
# Build topic listings from each model.
import itertools
from gensim import matutils


def top_topics(lda, num_words=20):
    str_topics = []
    for topic in lda.state.get_lambda():
        topic = topic / topic.sum()  # normalize to probability distribution
        bestn = matutils.argsort(topic, topn=num_words, reverse=True)
        beststr = [lda.id2word[_id] for _id in bestn]
        str_topics.append(beststr)
    return str_topics


model_topics = {}
super_topic = set()
for num_topics, model in trained_models.items():
    topics_as_topn_terms = top_topics(model)
    model_topics[num_topics] = topics_as_topn_terms
    super_topic.update(itertools.chain.from_iterable(topics_as_topn_terms))
    
print("Number of relevant terms: %d" % len(super_topic))


Number of relevant terms: 3517

In [54]:
%%time
# Now estimate the probabilities for the CoherenceModel

cm = models.CoherenceModel(
    topics=[super_topic], texts=corpus.get_texts(),
    dictionary=dictionary, coherence='c_v')
cm.estimate_probabilities()


CPU times: user 34 s, sys: 3.1 s, total: 37.1 s
Wall time: 56.9 s

In [64]:
%%time
import numpy as np
# Next we perform the coherence evaluation for each of the models.
# Since we have already precomputed the probabilities, this simply
# involves using the accumulated stats in the `CoherenceModel` to
# perform the evaluations, which should be pretty quick.

coherences = {}
for num_topics, topics in model_topics.items():
    cm.topics = topics

    # We evaluate at various values of N and average them. This is more robust,
    # according to: http://people.eng.unimelb.edu.au/tbaldwin/pubs/naacl2016.pdf
    coherence_at_n = {}
    for n in (20, 15, 10, 5):
        cm.topn = n
        topic_coherences = cm.get_coherence_per_topic()
        
        # Let's record the coherences for each topic, as well as the aggregated
        # coherence across all of the topics.
        coherence_at_n[n] = (topic_coherences, cm.aggregate_measures(topic_coherences))
        
    topic_coherences, avg_coherences = zip(*coherence_at_n.values())
    avg_topic_coherences = np.vstack(topic_coherences).mean(0)
    avg_coherence = np.mean(avg_coherences)
    print("Avg coherence for num_topics=%d: %.5f" % (num_topics, avg_coherence))
    coherences[num_topics] = (avg_topic_coherences, avg_coherence)


Avg coherence for num_topics=100: 0.48958
Avg coherence for num_topics=70: 0.50393
Avg coherence for num_topics=40: 0.51029
Avg coherence for num_topics=80: 0.51147
Avg coherence for num_topics=50: 0.51582
Avg coherence for num_topics=20: 0.49602
Avg coherence for num_topics=90: 0.47067
Avg coherence for num_topics=60: 0.48913
Avg coherence for num_topics=30: 0.48709
CPU times: user 2min 39s, sys: 524 ms, total: 2min 39s
Wall time: 2min 40s

In [68]:
# Print the coherence rankings

avg_coherence = \
    [(num_topics, avg_coherence)
     for num_topics, (_, avg_coherence) in coherences.items()]
ranked = sorted(avg_coherence, key=lambda tup: tup[1], reverse=True)
print("Ranked by average '%s' coherence:\n" % cm.coherence)
for item in ranked:
    print("num_topics=%d:\t%.4f" % item)
print("\nBest: %d" % ranked[0][0])


Ranked by average 'c_v' coherence:

num_topics=50:	0.5158
num_topics=80:	0.5115
num_topics=40:	0.5103
num_topics=70:	0.5039
num_topics=20:	0.4960
num_topics=100:	0.4896
num_topics=60:	0.4891
num_topics=30:	0.4871
num_topics=90:	0.4707

Best: 50
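
Before concluding, it is worth spot-checking a few topics from the winning model. A quick sketch (in recent gensim versions, show_topics with formatted=True returns (topic_id, string) pairs):


In [ ]:
# Print a handful of topics from the model selected by coherence.
best_model = trained_models[ranked[0][0]]
for topic_id, topic_str in best_model.show_topics(num_topics=5, num_words=10, formatted=True):
    print("Topic %d: %s" % (topic_id, topic_str))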

Conclusion

In this notebook, we used gensim's CoherenceModel to perform model selection over the number of topics for LDA. We found that for the 20 Newsgroups corpus, 50 topics is best. We showcased the ability of the coherence pipeline to evaluate individual topic coherence as well as aggregated model coherence. We also demonstrated how to avoid repeated passes over the corpus by estimating the term occurrence and co-occurrence probabilities for all relevant terms just once. Topic coherence is a powerful alternative to evaluation using perplexity on a held-out document set, and it is appropriate whenever the objective of the topic modeling is to present the topics as top-N lists for human consumption.

Note that coherence calculations are generally much more accurate when a larger reference corpus is used to estimate the probabilities. In this case, we used the same corpus as for our modeling, which is relatively small at only about 20,000 documents. A better reference corpus is the full Wikipedia corpus. The motivated explorer of this notebook is encouraged to download that corpus (see Experiments on the English Wikipedia) and use it for probability estimation.
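
Swapping in a larger reference corpus only changes the texts passed to the CoherenceModel. Here is a sketch under the assumption that you have downloaded an English Wikipedia dump (the filename below is hypothetical):


In [ ]:
# Sketch only: estimate term probabilities from Wikipedia instead of 20 Newsgroups.
from gensim.corpora import WikiCorpus

# Hypothetical path to a dump from https://dumps.wikimedia.org/.
# Passing our existing dictionary skips WikiCorpus's expensive vocabulary scan.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary=dictionary)
cm_wiki = models.CoherenceModel(
    topics=[super_topic], texts=wiki.get_texts(),
    dictionary=dictionary, coherence='c_v')
cm_wiki.estimate_probabilities()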

