Topic Modeling with MALLET

Version: 1.0, September 2016

License: Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)

This is a basic tutorial on topic modeling using MALLET. It draws on material from several tutorials available online, in particular:

Shawn Graham, Scott Weingart, and Ian Milligan (2012) Getting Started with Topic Modeling and MALLET.

Introduction

Topic modeling is a method of identifying topics in texts that are represented as bags of words, that is, as word counts without regard to word order. The assumption is that words that tend to occur together in texts reflect a common topic or discourse thread.
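To make the bag-of-words idea concrete, here is a minimal Python sketch. The function name `bag_of_words` and the simple letters-only tokenizer are assumptions made for illustration; MALLET uses its own tokenization when importing data.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring word order and case."""
    # Assumption: a token is any run of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

bag = bag_of_words("The cat chased the dog. The dog barked.")
print(bag["the"], bag["dog"], bag["cat"])
```

Two texts with the same words in a different order produce the same bag, which is exactly the information a topic model gets to see.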

Latent Dirichlet Allocation

The Latent Dirichlet Allocation (LDA) method automatically discovers topics in a collection of sentences or texts.

It groups the words of the collection into $k$ groups (topics), where $k$ is chosen in advance.

Initialization: Assign each word in a document to a topic randomly.

The initialization results in topics being assigned words, and thus in initial distributional models of words over topics and of topics over documents. The model is then improved in an optimization cycle.

Optimization:

for each document $d$:
- for each word $w$ in $d$:
  - for each topic $t$ compute:
    - $p(t\ |\ d)$ (the proportion of words in document $d$ that are currently assigned to topic $t$)
    - $p(w\ |\ t)$ (the proportion of assignments to topic $t$ over all documents that come from this word $w$)
  - Reassign $w$ to a new topic $t$, chosen with probability proportional to $p(t\ |\ d) * p(w\ |\ t)$ (according to our generative model, this is essentially the probability that topic $t$ generated word $w$, so it makes sense to resample the current word’s topic with this probability).
  - The reassignment assumes that all topic assignments except the one for the current word $w$ are correct; only the assignment of the current word is updated, using our model of how documents are generated.

Repeating this process a large number of times leads to a reasonably stable model of the associations between words, topics, and documents.
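The initialization and optimization steps above can be sketched in a few lines of Python. This is a toy illustration, not MALLET's implementation: the function name `lda_gibbs`, the smoothing constants `alpha` and `beta`, and the sample documents are all assumptions. The smoothing constants keep every probability above zero, and the current word's own assignment is removed before its new topic is drawn.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy Gibbs sampler for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]                # words in doc d assigned to topic t
    topic_word = [defaultdict(int) for _ in range(k)]  # assignments of word w to topic t
    topic_total = [0] * k                              # all assignments to topic t
    assignments = []

    # Initialization: assign each word in each document to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(k)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    # Optimization: repeatedly resample the topic of every word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out of the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight of topic j: smoothed p(j | d) * p(w | j).
                # The denominator of p(j | d) is the same for all j, so it is dropped.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(k)
                ]
                # Draw the new topic with probability proportional to its weight.
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return assignments, doc_topic, topic_word

# Two tiny made-up documents, two topics.
docs = [["apple", "banana", "apple"], ["goal", "match", "goal", "team"]]
assignments, doc_topic, topic_word = lda_gibbs(docs, k=2, iterations=50)
print(doc_topic)
```

Each row of `doc_topic` shows how many words of a document are currently assigned to each topic; dividing a row by its sum gives the estimated topic proportions for that document.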

Installation and Using MALLET

MALLET is a Java-based program and requires at least a Java runtime on your system. Java is not available on every type of device; it will not necessarily run on tablets or Chromebooks. Please follow the instructions on the Oracle Java pages to set up Java on your computer.

As explained in Graham et al. (2012), MALLET uses a system variable, MALLET_HOME, to point to the path of the MALLET code and data folder. Set this variable on your system.
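A quick way to check the setup from within a notebook is sketched below. The helper name `find_mallet` is hypothetical, and the sketch assumes that the launcher script is called `bin/mallet` (or `bin\mallet.bat` on Windows) inside the folder that MALLET_HOME points to.

```python
import os
from pathlib import Path

def find_mallet(env=os.environ):
    """Return the path of the MALLET launcher under MALLET_HOME, or None."""
    home = env.get("MALLET_HOME")
    if not home:
        return None  # variable is not set
    # Assumption: the launcher is bin/mallet, or bin/mallet.bat on Windows.
    name = "mallet.bat" if os.name == "nt" else "mallet"
    launcher = Path(home) / "bin" / name
    return launcher if launcher.exists() else None

print(find_mallet() or "MALLET_HOME is not set or does not contain bin/mallet")
```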

Running MALLET from the command line (from within the MALLET folder):

bin/mallet

bin/mallet import-dir --help

bin/mallet import-dir --input sample-data/web/en --output tutorial.mallet --keep-sequence --remove-stopwords

Running the same import on the folder with the German data (note that this overwrites tutorial.mallet):

bin/mallet import-dir --input sample-data/web/de --output tutorial.mallet --keep-sequence --remove-stopwords

bin/mallet train-topics --input tutorial.mallet

bin/mallet train-topics --input tutorial.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt

bin/mallet train-topics --input tutorial.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt
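The topic keys written by the commands above can be loaded for further inspection. The sketch below assumes that each line of tutorial_keys.txt has three tab-separated fields: the topic number, a weight for the topic, and the top words separated by spaces. Check this against your own output file. The function name `parse_topic_keys` and the two sample lines are made up for illustration.

```python
def parse_topic_keys(lines):
    """Map topic number -> (weight, list of top words)."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        # Assumed layout: topic number <TAB> weight <TAB> space-separated words
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics

# Illustrative lines only, not real MALLET output.
sample = [
    "0\t0.25\tcricket test australian\n",
    "1\t0.25\tbook life story\n",
]
for topic_id, (weight, words) in parse_topic_keys(sample).items():
    print(topic_id, weight, " ".join(words))
```

To read the real file, pass an open file handle instead of the sample list, for example `parse_topic_keys(open("tutorial_keys.txt", encoding="utf-8"))`.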

References

Edwin Chen (2011) Introduction to Latent Dirichlet Allocation.

Shawn Graham, Scott Weingart, and Ian Milligan (2012) Getting Started with Topic Modeling and MALLET.


