In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

Introduction to Topic Modeling

Topic modeling is a technique for discovering groups of words that frequently occur together within a collection of documents. A topic modeling algorithm identifies the words that belong to each topic and the topics that belong to each document:

Each word has a certain proportion in each topic, and each topic has a certain proportion in each document. The goal of the topic modeling algorithm is to infer these proportions.

Topic Modeling Methods

There are two popular techniques for building topic models: Non-Negative Matrix Factorization and Latent Dirichlet Allocation.

NMF

Non-Negative Matrix Factorization is a dimension reduction technique that factors an input matrix of shape m x n into a matrix of shape m x k and another matrix of shape k x n.

In text mining, one can use NMF to build topic models. To do so, one first builds a Term-Document Matrix, or matrix in which each row represents a term, each column represents a document, and each cell value represents the number of times the given term occurs in the given document:

After building this matrix, one can use NMF to "factor" the matrix into two smaller matrices. Those two matrices, when multiplied, should approximate the original Term-Document Matrix as closely as possible:

n = Terms
v = Documents
k = Topics

In other words, the n x v matrix represents the Term Document Matrix, the n x k matrix represents a Term by Topic matrix, and the k x v matrix represents the Topic by Document matrix.

The Term by Topic matrix captures the weight of each term in each topic, and the Topic by Document matrix captures the weight of each topic in each document.
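
To make the factorization concrete, here is a minimal sketch using scikit-learn (not the nmf package we use later in this notebook) with a few invented sample sentences:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF as SklearnNMF

# three tiny documents, invented purely for illustration
texts = [
  'salt dissolves in the water',
  'blood flows from the heart through the vessels',
  'the river carries salt water to the sea',
]

# scikit-learn builds a Document-Term Matrix (documents x terms), which is
# the transpose of the Term-Document Matrix pictured above
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(texts)

# factor that matrix into a documents x topics matrix (W) and a
# topics x terms matrix (H); W @ H approximates the original matrix
sk_model = SklearnNMF(n_components=2, random_state=0)
W = sk_model.fit_transform(dtm)   # documents x topics
H = sk_model.components_          # topics x terms
print(W.shape, H.shape)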

LDA

Latent Dirichlet Allocation is a generative statistical technique that attempts to model each document in an input corpus as a combination of a fixed number of topics. In LDA, each document is modeled as a probability distribution over a fixed number of topics, and each topic is modeled as a probability distribution over the linguistic types (or unique words) contained within the input corpus.

In other words, each term has a certain probability in each topic, and each topic has a certain probability in each document. These probabilities are learned through a statistical technique known as Bayesian inference [mathematics outlined here].
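
For comparison, here is a minimal sketch of LDA using scikit-learn's LatentDirichletAllocation, again with invented sample sentences; it learns the same kinds of document-topic and topic-term distributions described above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = [
  'the heart pumps blood through the vessels',
  'the moon passed before the sun during the eclipse',
  'blood vessels carry blood to the brain',
]

# LDA expects raw term counts rather than tf-idf weights
vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(texts)

# fit a two-topic model; each row of doc_topics is a document's
# probability distribution over the two topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics)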

Comparing NMF and LDA

There has been some experimental research to evaluate the coherence of topics generated by NMF and LDA. O'Callaghan 2015 found the following results for one topical coherence measure:

While other measures produce different results, generally NMF and LDA perform quite similarly.

Getting Started with NMF

Because the math behind NMF is much more intuitive than the math behind LDA, we'll use NMF in our session. To do so, we'll use the nmf package to build our topic models. You can install this package and all the other packages we'll use in this session with the following commands:

git clone https://github.com/yaledhlab/lab-workshops
cd lab-workshops/topic-modeling-python
pip install -r requirements.txt
jupyter notebook model-topics.ipynb

Once that completes, you can verify the packages are installed by running pip freeze, which displays the version of each Python package you've installed. You should see each of the packages listed in requirements.txt.

Acquiring Sample Data

To build topic models, we'll need some text files. For the purposes of the workshop, let's download this collection of sample scientific papers from the Philosophical Transactions, the first professional scientific journal:

To download those files from the command line on Unix, you can run (you may need to brew install wget):

wget https://s3.amazonaws.com/duhaime/github/nmf/texts.tar.gz

Once you've downloaded those files, if you're on OSX or Linux, you can unzip them by running:

tar -zxf texts.tar.gz

If you're on Windows, you can download and use 7-zip to unzip the archive.

Once you've installed nmf and downloaded the sample documents, we're ready to begin!


In [2]:
from subprocess import check_output
import os

try:
  if not os.path.exists('texts.tar.gz'):
    result = check_output('wget https://s3.amazonaws.com/duhaime/github/nmf/texts.tar.gz', shell=True)
  if not os.path.exists('texts'):
    result = check_output('tar -zxf texts.tar.gz', shell=True)
except Exception as exc:
  print(' ! ', exc)
  print(' ! could not download data -- please download manually')

In [3]:
from nmf import NMF
import json

# `files` specifies the directory with text documents
# `topics` indicates the number of topics in the model
model = NMF(files='texts', topics=20)

'''
The following attributes are accessible on `model`:
  topics_to_words: {dict} maps each topic to its top terms
  docs_to_topics: {dict} maps each document to each topic's presence in the document
  documents_by_topics: {numpy.ndarray} - contains one row per document and one column per topic
  topics_by_terms: {numpy.ndarray} - contains one row per topic and one column per term
'''

# e.g. Fetch the topics_to_words attribute
print(json.dumps(model.topics_to_words, indent=4))


{
    "0": [
        "particles",
        "salt",
        "glass",
        "small",
        "salts",
        "water",
        "coagulated",
        "said",
        "sand",
        "silver"
    ],
    "1": [
        "ad",
        "ut",
        "est",
        "non",
        "quod",
        "cum",
        "vel",
        "quae",
        "si",
        "ex"
    ],
    "2": [
        "10",
        "11",
        "00",
        "12",
        "30",
        "20",
        "15",
        "16",
        "21",
        "13"
    ],
    "3": [
        "book",
        "books",
        "author",
        "printing",
        "printed",
        "manuscripts",
        "ancient",
        "library",
        "account",
        "language"
    ],
    "4": [
        "read",
        "line",
        "pag",
        "paul",
        "printers",
        "smith",
        "arms",
        "printed",
        "sam",
        "benj"
    ],
    "5": [
        "malab",
        "tab",
        "leaves",
        "pl",
        "mal",
        "plant",
        "folio",
        "fig",
        "flowers",
        "pluk"
    ],
    "6": [
        "blood",
        "heart",
        "brain",
        "vessels",
        "matter",
        "body",
        "lungs",
        "child",
        "left",
        "kidney"
    ],
    "7": [
        "air",
        "experiment",
        "receiver",
        "mercury",
        "weight",
        "glass",
        "gage",
        "inches",
        "recipient",
        "atmosphere"
    ],
    "8": [
        "mr",
        "stone",
        "letter",
        "dr",
        "concerning",
        "bladder",
        "society",
        "account",
        "numb",
        "royal"
    ],
    "9": [
        "sea",
        "river",
        "great",
        "earth",
        "island",
        "shells",
        "trees",
        "west",
        "east",
        "country"
    ],
    "10": [
        "water",
        "salt",
        "spirit",
        "mineral",
        "waters",
        "oyl",
        "glass",
        "acid",
        "freezing",
        "quantity"
    ],
    "11": [
        "fair",
        "cloudy",
        "rain",
        "weather",
        "29",
        "frost",
        "close",
        "30",
        "gales",
        "20"
    ],
    "12": [
        "fig",
        "vessel",
        "brass",
        "figures",
        "ring",
        "instrument",
        "roman",
        "microscope",
        "instruments",
        "iron"
    ],
    "13": [
        "worms",
        "flies",
        "animalcula",
        "seed",
        "creatures",
        "eggs",
        "glass",
        "worm",
        "did",
        "body"
    ],
    "14": [
        "hail",
        "storm",
        "fell",
        "stones",
        "house",
        "great",
        "windows",
        "broke",
        "lightning",
        "thunder"
    ],
    "15": [
        "sun",
        "spot",
        "moon",
        "spots",
        "limb",
        "suns",
        "seconds",
        "june",
        "time",
        "clock"
    ],
    "16": [
        "ex",
        "est",
        "folia",
        "seu",
        "indis",
        "aut",
        "ut",
        "planta",
        "ad",
        "fructus"
    ],
    "17": [
        "inferred",
        "4to",
        "la",
        "1697",
        "8vo",
        "paris",
        "des",
        "1698",
        "cum",
        "di"
    ],
    "18": [
        "inches",
        "horns",
        "foot",
        "bone",
        "head",
        "inch",
        "fingers",
        "like",
        "bones",
        "feet"
    ],
    "19": [
        "bark",
        "china",
        "tree",
        "wood",
        "vessels",
        "canals",
        "horizontal",
        "inferred",
        "trees",
        "sap"
    ]
}

More NMF

Our first code cell gave us the top terms for each of our 20 topics. There are three other queries we can run on our sample data using the NMF library.

The last example gave us a dictionary mapping each topic to its top terms. In the following example, running model.documents_by_topics will output a matrix that contains one row per document and one column per topic. The values in the matrix represent the weight or salience of each topic in each document. A value of 0 means that the topic did not appear in that document.

Based on these results, we could begin to explore the co-occurrence of topics in documents to answer questions like "If 'Topic X' appears in 'Document 1', how likely is it that 'Topic Z' will also appear in 'Document 1'?"
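
For example, here is a rough sketch of one way to count topic co-occurrence from the documents_by_topics matrix; treating any non-zero weight as "presence" is our own simplifying assumption:

# mark each topic as present in a document if its weight is non-zero
presence = (model.documents_by_topics > 0).astype(int)   # documents x topics

# cooccurrence[i, j] counts the documents in which topics i and j both appear
cooccurrence = presence.T @ presence                     # topics x topics

# e.g. how many documents contain both topic 0 and topic 6
print(cooccurrence[0, 6])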


In [4]:
# the documents by topics matrix; shape = (documents, topics)
model.documents_by_topics

# the topics by terms matrix; shape = (topics, terms)
# model.topics_by_terms


Out[4]:
array([[0.00000000e+00, 0.00000000e+00, 3.62174267e-01, ...,
        0.00000000e+00, 0.00000000e+00, 5.34145380e-02],
       [0.00000000e+00, 9.45554015e-01, 0.00000000e+00, ...,
        3.42394251e-02, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 2.39929625e-02],
       ...,
       [1.64507894e-02, 0.00000000e+00, 4.20577747e-03, ...,
        0.00000000e+00, 1.75471473e-01, 5.44733807e-02],
       [0.00000000e+00, 1.57498764e-04, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 5.95188907e-04, 2.21912345e-01, ...,
        4.06178321e-02, 0.00000000e+00, 0.00000000e+00]])

Visualize a Matrix

We can now visualize our matrix (documents x topics) using the matplotlib library.

The x-axis represents each of the 20 topics. The y-axis represents each of the 500 documents. A darker cell means that the given topic is more prevalent in the given document.


In [5]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

fig, ax = plt.subplots()

c = ax.pcolor(model.documents_by_topics, cmap='Reds')
fig.colorbar(c, ax=ax)
ax.set_title('Documents x Topics')
ax.set_xlabel('Topics')
ax.set_ylabel('Documents')
ax.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))

fig.tight_layout()
plt.show()


Visualizing Term Vectors

As we noted above, NMF breaks a Term-Document Matrix into two matrices: model.topics_by_terms, which has shape (topics x terms), and model.documents_by_topics, which has shape (documents x topics).

Because the former matrix represents each term as a distribution over topics, we can treat each column of that matrix as a "word embedding", or vector representation of a word's semantic significance. Let's try to visualize what those word embeddings look like.

Each column of model.topics_by_terms has n rows, where n = the number of topics we specified in the NMF() constructor. In what follows, we'll use a dimension reduction algorithm (UMAP in the cell below) to reduce each term vector down to a small number of dimensions, then we'll visualize the resulting low-dimensional embedding of each term.

In this scatterplot, each term is positioned near terms with similar semantic content.


In [6]:
from helpers import plot_term_scatterplot
%matplotlib inline

# draw a scatterplot of the terms in the model
plot_term_scatterplot(model, threshold=0.1, method='umap')
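
If you'd like a sense of what such a plot involves without the helper, here is an illustrative sketch (not the implementation in helpers.py) that reduces each term's topic vector to two dimensions; it uses scikit-learn's TSNE for simplicity, whereas the cell above asks the helper for UMAP:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# each column of topics_by_terms is one term's weights over the topics
term_vectors = model.topics_by_terms.T                       # terms x topics
embedding = TSNE(n_components=2, random_state=0).fit_transform(term_vectors)

plt.scatter(embedding[:, 0], embedding[:, 1], s=2)
plt.title('Term embeddings (TSNE)')
plt.show()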


Hierarchical Document Similarity

model.documents_by_topics contains one row for each document and one column for each topic. Using this data we can fairly easily construct a hierarchical visualization of the similarity between the documents in the model (or topics in the model).

In this hierarchical visualization, documents with more similar content appear close together, while less closely related documents appear more distant from one another.


In [7]:
from helpers import plot_document_similarity

# plot the similarity between documents in the model
plot_document_similarity(model, csv='metadata.csv', n_docs=100)
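
Under the hood, a plot like this can be produced with ordinary hierarchical clustering. Here is an illustrative sketch (not the code in helpers.py) that clusters the first 100 documents by their topic distributions and draws a dendrogram with scipy:

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# agglomeratively cluster the first 100 documents by their topic weights
Z = linkage(model.documents_by_topics[:100], method='ward')

plt.figure(figsize=(6, 10))
dendrogram(Z, orientation='left')
plt.title('Document similarity (Ward linkage)')
plt.show()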


Topics over Time

When working with topic models, it's often helpful to visualize the distribution of the topics over time. To do so, one of course needs some time-based metadata, which we happen to have available for our sample corpus.

The visualization below displays the mean presence of each topic in each decade.


In [8]:
from helpers import plot_topics_over_time

# plot the distribution of each topic over time in the sample data
plot_topics_over_time(model, csv='metadata.csv')
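
If you want to compute a view like this yourself, here is a rough sketch using pandas. It assumes metadata.csv contains one row per document, in the same order as the model's documents, with a year column; the real helper may expect a different layout:

import pandas as pd

metadata = pd.read_csv('metadata.csv')

# one row per document, one column per topic
topics = pd.DataFrame(model.documents_by_topics)

# bucket each document into its decade (assumes an integer `year` column)
topics['decade'] = (metadata['year'] // 10) * 10

# mean presence of each topic in each decade
topics.groupby('decade').mean().plot(legend=False, title='Mean topic presence by decade')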


Going Further with Topic Modeling

We've now covered the basics of working with topic models to answer some basic questions of a text corpus. For those looking to go further with topic modeling, there are a few resources that may be of interest.

Building Topic Models on the GPU

If your research lab has access to a server with graphical processing unit chips, you can leverage GPU-based implementations of topic modeling to greatly expedite the process of building topic models.

If you have access to a discrete GPU chip, you may wish to consider the Functional Bioinformatics Group's GPU-based implementation of NMF.

Distributed Topic Models

In the work we did above, we assumed the original Term Document Matrix would fit in RAM concurrently. When working with huge text collections, this is not always feasible. One popular technique for building extremely large topic models is to use a multi-host implementation in which each host machine only works on a subset of the input data. If this is your situation, you may wish to consider the LDA implementation in Spark, a popular map-reduce style process manager often used for very large compute operations.
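
To give a flavor of what that looks like, here is a hedged PySpark sketch (assuming Spark is available and the corpus is stored as one text file per document); it tokenizes the documents, builds count vectors, and fits Spark's distributed LDA:

from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName('distributed-lda').getOrCreate()

# read each file in `texts` as one document (one row per file)
docs = spark.read.text('texts/*.txt', wholetext=True).withColumnRenamed('value', 'text')

# tokenize, then build sparse count vectors over a capped vocabulary
tokens = RegexTokenizer(inputCol='text', outputCol='tokens', pattern='\\W+').transform(docs)
vectorizer = CountVectorizer(inputCol='tokens', outputCol='features', vocabSize=10000).fit(tokens)
vectors = vectorizer.transform(tokens)

# fit a 20-topic LDA model across the cluster and inspect the top term
# indices and weights for each topic
lda_model = LDA(k=20, maxIter=10, featuresCol='features').fit(vectors)
lda_model.describeTopics(10).show()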

Experimenting with LDA

We've focused on using Non-Negative Matrix Factorization in this session, but Latent Dirichlet Allocation is often used to generate topic models. To try building topic models using LDA, feel free to build sample LDA topics here. This interface uses JavaScript to build and display topic models:

If you inspect the JavaScript file that powers this interface, you'll see that the topic weight estimation all happens inside the sweep() function, which is less than 100 lines of code.

Bring Your Own Data

Now that you've seen different methods for topic modeling using sample data, let's work through the process using real data.

Sample Data vs Real Data

Your research data will always be messier than sample data. Make sure your text data is clean before starting your topic modeling. Look out for common errors in text data like OCR artifacts, HTML or XML elements, and page headers or footers. Some textual data from newspapers and periodicals will include advertisements, images, or other extraneous text. If those are not important to your research, you will have to clean your data before beginning topic modeling so as not to skew your results.
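
As a starting point, here is a rough sketch of the kind of cleanup pass you might run before modeling; the tag pattern and the page-header string are purely illustrative and will need adjusting for your own corpus:

import re

def clean(text):
  text = re.sub(r'<[^>]+>', ' ', text)                         # strip HTML/XML tags
  text = re.sub(r'(?m)^\s*THE DAILY GAZETTE.*$', '', text)     # drop a (hypothetical) running page header
  text = re.sub(r'\s+', ' ', text)                             # collapse whitespace
  return text.strip()

sample = '<p>THE DAILY GAZETTE\nThe hail broke the windows of the house.</p>'
print(clean(sample))   # -> 'The hail broke the windows of the house.'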

Template Code for Topic Modeling

Below we've included sample Python code to get started with topic modeling for any text corpus.

The template assumes your text files are in a directory called 'corpus', which is a sub-directory of your current directory (where this code is saved). You can change this value in the code to match the directory name where your corpus is stored. This sample code assumes the following directory structure:

project_directory
│   model-topics.ipynb 
│
└───corpus
│   │   file011.txt
│   │   file012.txt
│   │   file013.txt

There are four attributes that you can query on your text corpus (stored in the variable called model): topics_to_words, doc_to_topics, documents_by_topics, topics_by_terms. Uncomment (remove the #) any of the lines beginning with model to run that query on your corpus.


In [ ]:
import requests as r
import os
import codecs

# create a directory for sample data
if not os.path.exists('corpus'): os.makedirs('corpus')
  
# download some sample texts from project gutenberg into `corpus`
for i in ['25', '571', '27348', '27509', '29233', '35830', '35829']:
  with codecs.open(os.path.join('corpus', os.path.basename(i) + '.txt'), 'w', 'utf8') as out:
    out.write(r.get('http://www.gutenberg.org/cache/epub/' + i + '/pg' + i + '.txt').text)

In [ ]:
from nmf import NMF

# `files` specifies the directory with text documents
# `topics` indicates the number of topics in the model
model = NMF(files='corpus', topics=5)

'''
The following attributes are queryable on `model`:
  topics_to_words: {dict} maps each topic to its top terms
  docs_to_topics: {dict} maps each document to each topic's presence in the document
  documents_by_topics: {numpy.ndarray} - contains one row per document and one column per topic
  topics_by_terms: {numpy.ndarray} - contains one row per topic and one column per term
'''

# the top terms in each topic
#model.topics_to_words 

# the presence of each topic in each document
#model.doc_to_topics 

# the documents by topics matrix; shape = (documents, topics)
#model.documents_by_topics

# the topics by terms matrix; shape = (topics, terms)
#model.topics_by_terms