(C) 2016 by Damir Cavar <dcavar@iu.edu>
Version: 1.0, September 2016
This is a basic tutorial on topic modeling using MALLET. Various tutorials are available online, and this notebook draws on their material, in particular Chen (2011) and Graham et al. (2012).
Topic modeling is a method for identifying topics in texts that are represented as bags of words. The assumption is that certain words in a text signal a topic or a discourse thread.
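To make the bag-of-words representation concrete, here is a minimal sketch in Python. The function name `bag_of_words` is an illustrative choice, and splitting on whitespace is a simplifying assumption; real tokenization, including the one MALLET applies, is more careful.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order and case."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    return Counter(text.lower().split())


bag = bag_of_words("The cat chased the dog")
print(bag["the"])  # 2
```

Only the counts survive; the order in which the words occurred is discarded.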
The Latent Dirichlet Allocation (LDA) method automatically discovers topics in a collection of sentences or texts.
LDA groups the words into $k$ topics, where $k$ is chosen in advance.
Initialization: Assign each word in a document to a topic randomly.
The initialization yields topics with words assigned to them, and with that a first distributional model relating all words and documents to topics. The model is then improved in an optimization cycle.
Optimization:
Repeating this process a large number of times results in an acceptable or accurate model of word, topic, and document associations.
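The individual optimization steps are not listed above. One common way to realize such a cycle is collapsed Gibbs sampling: each word is visited in turn, its current assignment is removed, and a new topic is drawn with a weight proportional to how prevalent the topic is in the document times how strongly the word is associated with the topic. The following is a minimal sketch under that assumption, not MALLET's implementation; the function name `lda_gibbs` and the smoothing parameters `alpha` and `beta` are illustrative choices.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not MALLET's code).

    docs is a list of documents, each a list of word strings.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Initialization: assign each word in each document to a random topic.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [[0] * k for _ in docs]                # words of document d in topic t
    topic_word = [defaultdict(int) for _ in range(k)]  # count of word w in topic t
    topic_total = [0] * k                              # all words in topic t
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Optimization: revisit every word and resample its topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight of topic j, proportional to
                # p(topic j | document d) * p(word w | topic j).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                # Record the new assignment.
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return assignments, doc_topic, topic_word
```

Calling `lda_gibbs([["cat", "dog", "cat"], ["stock", "market", "stock"]], k=2)` returns the final topic assignment of every word together with the document-topic and topic-word counts. On a corpus this small the result is only a demonstration of the mechanics.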
MALLET is a Java-based program and requires at least the Java runtime on your system. Java is not available for every platform, so MALLET will not necessarily run on tablets or Chromebooks. Please follow the instructions on the Oracle Java pages on how to set up Java on your computer.
As explained in Graham et al. (2012), MALLET uses a system variable to point to the path of the MALLET code and data folder. Set this variable, MALLET_HOME, on your system.
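Before running any commands, it can help to check from a notebook cell that the variable is actually visible. The helper below is a small sketch; the name `find_mallet_home` is hypothetical and not part of MALLET.

```python
import os


def find_mallet_home(environ=os.environ):
    """Return the folder named by MALLET_HOME, or raise if it is not usable."""
    home = environ.get("MALLET_HOME")
    if not home:
        raise RuntimeError("MALLET_HOME is not set")
    if not os.path.isdir(home):
        raise RuntimeError(f"MALLET_HOME does not point to a folder: {home}")
    return home
```

Calling `find_mallet_home()` without arguments inspects the environment of the running Python process. That environment may differ from your shell's if the variable was set after the notebook server started.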
Running MALLET in the command line:
bin/mallet
bin/mallet import-dir --help
bin/mallet import-dir --input sample-data/web/en --output tutorial.mallet --keep-sequence --remove-stopwords
Running the same import on the folder with the German data (note that this overwrites tutorial.mallet):
bin/mallet import-dir --input sample-data/web/de --output tutorial.mallet --keep-sequence --remove-stopwords
bin/mallet train-topics --input tutorial.mallet
bin/mallet train-topics --input tutorial.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt
bin/mallet train-topics --input tutorial.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt
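If you would rather launch the training from a notebook cell than from a terminal, the command can be assembled as an argument list. This is a sketch only: the function name `train_topics_command` and its parameter names are hypothetical, while the flags and default file names are the ones used in the commands above.

```python
import os


def train_topics_command(mallet_home, input_file, num_topics=20,
                         optimize_interval=None,
                         state_file="topic-state.gz",
                         keys_file="tutorial_keys.txt",
                         composition_file="tutorial_composition.txt"):
    """Build the argument list for a `mallet train-topics` call."""
    command = [
        os.path.join(mallet_home, "bin", "mallet"), "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
    ]
    # Only pass --optimize-interval when a value was requested.
    if optimize_interval is not None:
        command += ["--optimize-interval", str(optimize_interval)]
    command += [
        "--output-state", state_file,
        "--output-topic-keys", keys_file,
        "--output-doc-topics", composition_file,
    ]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids quoting problems when paths contain spaces.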
Edwin Chen (2011) Introduction to Latent Dirichlet Allocation.
Shawn Graham, Scott Weingart and Ian Milligan (2012) Getting Started with Topic Modeling and MALLET.