Topic Modelling

Author: Jesús Cid Sueiro

Date: 2017/04/21

In this notebook we will explore some tools for text analysis in Python. To do so, we will first import the required Python libraries.


In [3]:
# %matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# import pylab

# Required imports
from wikitools import wiki
from wikitools import category

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import gensim

import lda
import lda.datasets

from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

from test_helper import Test

1. Corpus acquisition.

In this notebook we will explore some tools for text processing and analysis and two topic modeling algorithms available from Python toolboxes.

To do so, we will explore and analyze collections of Wikipedia articles from a given category, using wikitools, which makes it easy to capture content from Wikimedia sites.

(As a side note, there are many other text collections available for testing topic modelling algorithms. In particular, the NLTK library includes many examples, which you can explore using the nltk.download() tool.

import nltk
nltk.download()

For instance, you can take the Gutenberg dataset:

Mycorpus = nltk.corpus.gutenberg
text_name = Mycorpus.fileids()[0]
raw = Mycorpus.raw(text_name)
Words = Mycorpus.words(text_name)

Also, tools like Gensim or scikit-learn include text databases to work with.)

In order to use Wikipedia data, we will select a single category of articles:


In [5]:
site = wiki.Wiki("https://en.wikipedia.org/w/api.php")
# Select a category with a reasonable number of articles (>100)
# cat = "Economics"
cat = "Pseudoscience"
print cat

You can try any other category. Take into account that the behavior of topic modelling algorithms may depend on the number of documents available for the analysis, so select a category with at least 100 articles. You can browse the Wikipedia category tree, for instance, at https://en.wikipedia.org/wiki/Category:Contents.

We start by downloading the text collection.


In [7]:
# Loading category data. This may take a while
print "Loading category data. This may take a while..."
cat_data = category.Category(site, cat)

corpus_titles = []
corpus_text = []

for n, page in enumerate(cat_data.getAllMembersGen()):
    print "\r Loading article {0}".format(n + 1),
    corpus_titles.append(page.title)
    corpus_text.append(page.getWikiText())

n_art = len(corpus_titles)
print "\nLoaded " + str(n_art) + " articles from category " + cat

Now, we have stored the whole text collection in two lists:

  • corpus_titles, which contains the titles of the selected articles
  • corpus_text, with the text content of the selected Wikipedia articles

You can browse the content of the Wikipedia articles to get some intuition about the kind of documents that will be processed.


In [9]:
# n = 5
# print corpus_titles[n]
# print corpus_text[n]

2. Corpus Processing

Topic modelling algorithms process vectorized data. In order to apply them, we need to transform the raw text input data into a vector representation. To do so, we will remove irrelevant information from the text data and preserve as much relevant information as possible to capture the semantic content in the document collection.

Thus, we will proceed with the following steps:

  1. Tokenization, filtering and cleaning
  2. Homogenization (stemming or lemmatization)
  3. Vectorization

2.1. Tokenization, filtering and cleaning.

The first step consists of the following:

  1. Tokenization: convert text string into lists of tokens.
  2. Filtering:
    • Removing capitalization: capital alphabetic characters will be transformed to their corresponding lowercase characters.
    • Removing non-alphanumeric tokens (e.g., punctuation signs)
  3. Cleaning: Removing stopwords, i.e., those words that are very common in the language and do not carry useful semantic content (articles, pronouns, etc.).

To do so, we will need some packages from the Natural Language Toolkit.


In [13]:
# You can comment this if the package is already available.
# nltk.download("punkt")
# nltk.download("stopwords")

In [14]:
stopwords_en = stopwords.words('english')
corpus_clean = []

for n, art in enumerate(corpus_text): 
    print "\rProcessing article {0} out of {1}".format(n + 1, n_art),
    # This is to make sure that all characters have the appropriate encoding.
    art = art.decode('utf-8')  
    
    # Tokenize each text entry. 
    # scode: tokens = <FILL IN>
    token_list = word_tokenize(art)
    
    # Convert all tokens in token_list to lowercase and remove non-alphanumeric tokens.
    # Store the result in a new token list, filtered_tokens.
    # scode: filtered_tokens = <FILL IN>
    filtered_tokens = [token.lower() for token in token_list if token.isalnum()]

    # Remove all tokens in the stopwords list and append the result to corpus_clean
    # scode: clean_tokens = <FILL IN>
    clean_tokens = [token for token in filtered_tokens if token not in stopwords_en]    

    # scode: <FILL IN>
    corpus_clean.append(clean_tokens)
print "\nLet's check the first tokens from document 0 after processing:"
print corpus_clean[0][0:30]

In [15]:
Test.assertTrue(len(corpus_clean) == n_art, 'List corpus_clean does not contain the expected number of articles')
Test.assertTrue(len([c for c in corpus_clean[0] if c in stopwords_en])==0, 'Stopwords have not been removed')

2.2. Stemming vs Lemmatization

At this point, we can choose between applying simple stemming or using lemmatization. We will try both to compare their results.

Task: Apply the .stem() method, from the stemmer object created in the first line, to corpus_filtered.


In [18]:
# Select stemmer.
stemmer = nltk.stem.SnowballStemmer('english')
corpus_stemmed = []

for n, token_list in enumerate(corpus_clean):
    print "\rStemming article {0} out of {1}".format(n + 1, n_art),
    
    # Apply the stemmer to all tokens in token_list.
    # Store the result in a new token list, stemmed_tokens.
    # scode: stemmed_tokens = <FILL IN>
    stemmed_tokens = [stemmer.stem(token) for token in token_list]
    
    # Add stemmed_tokens to the stemmed corpus
    # scode: <FILL IN>
    corpus_stemmed.append(stemmed_tokens)

print "\nLet's check the first tokens from document 0 after stemming:"
print corpus_stemmed[0][0:30]

In [19]:
Test.assertTrue((len([c for c in corpus_stemmed[0] if c!=stemmer.stem(c)]) < 0.1*len(corpus_stemmed[0])), 
                'It seems that stemming has not been applied properly')

Alternatively, we can apply lemmatization. For English texts, we can use the lemmatizer from NLTK, which is based on WordNet. If you have not used WordNet before, you will likely need to download it from NLTK:


In [21]:
# You can comment this if the package is already available.
# nltk.download("wordnet")

Task: Apply the .lemmatize() method, from the WordNetLemmatizer object created in the first line, to corpus_filtered.


In [23]:
wnl = WordNetLemmatizer()

# Select stemmer.
corpus_lemmat = []

for n, token_list in enumerate(corpus_clean):
    print "\rLemmatizing article {0} out of {1}".format(n + 1, n_art),
    
    # scode: lemmat_tokens = <FILL IN>
    lemmat_tokens = [wnl.lemmatize(token) for token in token_list]

    # Add lemmat_tokens to the lemmatized corpus
    # scode: <FILL IN>
    corpus_lemmat.append(lemmat_tokens)

print "\nLet's check the first tokens from document 0 after stemming:"
print corpus_lemmat[0][0:30]

One of the advantages of lemmatization is that its output is still a true word, which makes it preferable when text processing results have to be presented or inspected.

However, without using contextual information, lemmatize() does not remove grammatical differences. This is the reason why "is" or "are" are preserved and not replaced by the infinitive "be".

As an alternative, we can apply .lemmatize(word, pos), where 'pos' is a string code specifying the part-of-speech (pos), i.e., the grammatical role of the word in its sentence. For instance, you can check the difference between wnl.lemmatize('is') and wnl.lemmatize('is', pos='v').
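
A minimal check (using the wnl object created above; the second call should return the infinitive form):

print wnl.lemmatize('is')            # without POS information, the default is pos='n', so 'is' is returned unchanged
print wnl.lemmatize('is', pos='v')   # tagged as a verb, 'is' is lemmatized to 'be'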

2.3. Vectorization

Up to this point, we have transformed the raw text collection into a list of articles, where each article is represented as the collection of word roots that are most relevant for semantic analysis. Now, we need to convert these data (a list of token lists) into a numerical representation (a list of vectors, or a matrix). To do so, we will start using the tools provided by the gensim library.

As a first step, we create a dictionary containing all tokens in our text corpus, assigning an integer identifier to each one of them.


In [26]:
# Create dictionary of tokens
D = gensim.corpora.Dictionary(corpus_clean)
n_tokens = len(D)

print "The dictionary contains {0} tokens".format(n_tokens)
print "First tokens in the dictionary: "
for n in range(10):
    print str(n) + ": " + D[n]

In the second step, let us create a numerical version of our corpus using the doc2bow method. In general, D.doc2bow(token_list) transforms any list of tokens into a list of tuples (token_id, n), one for each distinct token in token_list, where token_id is the token identifier (according to dictionary D) and n is the number of occurrences of that token in token_list.
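
As a quick, minimal illustration (reusing the first tokens of document 0 and repeating two of them, so that a count greater than one shows up):

sample_tokens = corpus_clean[0][0:5] + corpus_clean[0][0:2]
print sample_tokens
print D.doc2bow(sample_tokens)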

Task: Apply the doc2bow method from gensim dictionary D, to all tokens in every article in corpus_clean. The result must be a new list named corpus_bow where each element is a list of tuples (token_id, number_of_occurrences).


In [29]:
# Transform token lists into sparse vectors on the D-space
# scode: corpus_bow = <FILL IN>
corpus_bow = [D.doc2bow(doc) for doc in corpus_clean]

In [30]:
Test.assertTrue(len(corpus_bow)==n_art, 'corpus_bow has not the appropriate size')

At this point, it is worth making sure we understand what has happened. In corpus_clean we had a list of token lists. From it, we constructed a Dictionary, D, which assigns an integer identifier to each token in the corpus. After that, we transformed each article (in corpus_clean) into a list of tuples (id, n).


In [32]:
print "Original article (after cleaning): "
print corpus_clean[0][0:30]
print "Sparse vector representation (first 30 components):"
print corpus_bow[0][0:30]
print "The first component, {0} from document 0, states that token 0 ({1}) appears {2} times".format(
    corpus_bow[0][0], D[0], corpus_bow[0][0][1])

Note that we can interpret each element of corpus_bow as a sparse vector. For example, a list of tuples

[(0, 1), (3, 3), (5,2)] 

for a dictionary of 10 elements can be represented as the following vector, where each tuple (id, n) states that position id takes value n and the remaining positions are zero:

[1, 0, 0, 3, 0, 2, 0, 0, 0, 0]
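
The conversion from this sparse representation to a dense vector can be sketched as follows (a minimal example for the toy vector above; for whole corpora, gensim also provides helpers such as gensim.matutils.corpus2dense):

sparse_doc = [(0, 1), (3, 3), (5, 2)]
dense_doc = np.zeros(10)
for token_id, count in sparse_doc:
    dense_doc[token_id] = count
print dense_doc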

These sparse vectors will be the inputs to the topic modeling algorithms.

Note that, at this point, we have built a Dictionary containing


In [34]:
print "{0} tokens".format(len(D))

and a bow representation of a corpus with


In [36]:
print "{0} Wikipedia articles".format(len(corpus_bow))

Before starting with the semantic analysis, it is interesting to observe the token distribution for the given corpus.


In [38]:
# SORTED TOKEN FREQUENCIES (I):
# Create a "flat" corpus with all tuples in a single list
corpus_bow_flat = [item for sublist in corpus_bow for item in sublist]

# Initialize a numpy array that we will use to count tokens.
# token_count[n] should store the number of occurrences of the n-th token, D[n]
token_count = np.zeros(n_tokens)

# Count the number of occurrences of each token.
for x in corpus_bow_flat:
    # Update the proper element in token_count
    # scode: <FILL IN>
    token_count[x[0]] += x[1]

# Sort by decreasing number of occurrences
ids_sorted = np.argsort(- token_count)
tf_sorted = token_count[ids_sorted]

ids_sorted is a list of all token ids, sorted by decreasing number of occurrences in the whole corpus. For instance, the most frequent term is


In [40]:
print D[ids_sorted[0]]

which appears


In [42]:
print "{0} times in the whole corpus".format(tf_sorted[0])

In the following we plot the most frequent terms in the corpus.


In [44]:
# SORTED TOKEN FREQUENCIES (II):
plt.rcdefaults()

# Example data
n_bins = 25
hot_tokens = [D[i] for i in ids_sorted[n_bins-1::-1]]
y_pos = np.arange(len(hot_tokens))
z = tf_sorted[n_bins-1::-1]/n_art

plt.barh(y_pos, z, align='center', alpha=0.4)
plt.yticks(y_pos, hot_tokens)
plt.xlabel('Average number of occurrences per article')
plt.title('Token distribution')
plt.show()

In [45]:
# SORTED TOKEN FREQUENCIES:

# Plot the sorted token frequencies in logarithmic scale
plt.semilogy(tf_sorted)
plt.xlabel('Token rank')
plt.ylabel('Total number of occurrences')
plt.title('Token distribution')
plt.show()

Exercise: There are usually many tokens that appear with very low frequency in the corpus. Count the number of tokens appearing only once, and compute the proportion they represent in the token list.


In [47]:
# scode: <WRITE YOUR CODE HERE>
# Tokens appearing only once in the whole corpus
cold_tokens = [D[i] for i in range(n_tokens) if token_count[i] == 1]

print "There are {0} cold tokens, which represent {1}% of the total number of tokens in the dictionary".format(
    len(cold_tokens), float(len(cold_tokens))/n_tokens*100)

Exercise: Represent graphically those 20 tokens that appear in the highest number of articles. Note that you can use the code above (headed by # SORTED TOKEN FREQUENCIES) with a very minor modification.


In [49]:
# scode: <WRITE YOUR CODE HERE>

# SORTED TOKEN FREQUENCIES (I):
# Count the number of articles in which each token appears.
token_count2 = np.zeros(n_tokens)
for x in corpus_bow_flat:
    token_count2[x[0]] += (x[1]>0)

# Sort by decreasing number of occurences
ids_sorted2 = np.argsort(- token_count2)
tf_sorted2 = token_count2[ids_sorted2]

# SORTED TOKEN FREQUENCIES (II):
n_bins = 20
hot_tokens2 = [D[i] for i in ids_sorted2[n_bins-1::-1]]
y_pos2 = np.arange(len(hot_tokens2))
z2 = tf_sorted2[n_bins-1::-1]/n_art

plt.figure()
plt.barh(y_pos2, z2, align='center', alpha=0.4)
plt.yticks(y_pos2, hot_tokens2)
plt.xlabel('Fraction of articles containing the token')
plt.title('Document frequency')
plt.show()

3. Semantic Analysis

The dictionary D and the Bag of Words in corpus_bow are the key inputs to the topic model algorithms. In this section we will explore two algorithms:

  1. Latent Semantic Indexing (LSI)
  2. Latent Dirichlet Allocation (LDA)

In this notebook, the topic model algorithms from gensim will take as input documents parameterized using the tf-idf model. This can be done using


In [51]:
tfidf = gensim.models.TfidfModel(corpus_bow)

From now on, tfidf can be used to convert any vector from the old representation (bow integer counts) to the new one (tf-idf real-valued weights):


In [53]:
doc_bow = [(0, 1), (1, 1)]
tfidf[doc_bow]

Or to apply a transformation to a whole corpus


In [55]:
corpus_tfidf = tfidf[corpus_bow]
print corpus_tfidf[0][0:5]

3.1. Latent Semantic Indexing (LSI)

Now we are ready to apply a topic modeling algorithm. Latent Semantic Indexing is provided by LsiModel.

Task: Generate an LSI model with 5 topics for corpus_tfidf and dictionary D. You can check the syntax of gensim.models.LsiModel.


In [59]:
# Initialize an LSI transformation
n_topics = 5

# scode: lsi = <FILL IN>
lsi = gensim.models.LsiModel(corpus_tfidf, id2word=D, num_topics=n_topics)

From LSI, we can check both the topic-tokens matrix and the document-topics matrix.

Now we can check the topics generated by LSI. An intuitive visualization is provided by the show_topics method.


In [62]:
lsi.show_topics(num_topics=-1, num_words=10, log=False, formatted=True)

However, a more useful representation of topics is as a list of tuples (token, value). This is provided by the show_topic method.

Task: Represent the columns of the topic-token matrix as a series of bar diagrams (one per topic) with the top 25 tokens of each topic.


In [65]:
# TOPIC VISUALIZATION: plot the top tokens of each LSI topic
plt.rcdefaults()

n_bins = 25
    
# Example data
y_pos = range(n_bins-1, -1, -1)

# pylab.rcParams['figure.figsize'] = 16, 8  # Set figure size
plt.figure(figsize=(16, 8))

for i in range(n_topics):
    
    ### Plot top 25 tokens for topic i
    # Read the i-th topic as a list of (token, weight) tuples
    # scode: topic_i = <FILL IN>
    topic_i = lsi.show_topic(i, topn=n_bins)
    # scode: tokens = <FILL IN>
    tokens = [t[0] for t in topic_i]
    # scode: weights = <FILL IN>
    weights = [t[1] for t in topic_i]

    # Plot
    plt.subplot(1, n_topics, i+1)
    # scode: plt.barh(<FILL IN>)
    plt.barh(y_pos, weights, align='center', alpha=0.4)
    plt.yticks(y_pos, tokens)
    plt.xlabel('Top {0} topic weights'.format(n_bins))
    plt.title('Topic {0}'.format(i))
    
plt.show()

LSI approximates any document as a linear combination of the topic vectors. We can compute the topic weights for the whole corpus by passing it as input to the lsi model.


In [67]:
# On real corpora, target dimensionality of
# 200–500 is recommended as a “golden standard”
# Create a double wrapper over the original
# corpus: bow -> tfidf -> fold-in-LSI
corpus_lsi = lsi[corpus_tfidf]
print corpus_lsi[0]

Task: Find the document with the largest positive weight for topic 0. Compare the document and the topic.


In [69]:
# Extract weights from corpus_lsi
# scode weight0 = <FILL IN>
weight0 = [doc[0][1] for doc in corpus_lsi]

# Locate the maximum positive weight
nmax = np.argmax(weight0)
print nmax
print weight0[nmax]
print corpus_lsi[nmax]

# Get topic 0 as a list of (token, weight) tuples
# scode: topic_0 = <FILL IN>
topic_0 = lsi.show_topic(0, topn=n_bins)

# Compute a list of tuples (token, wordcount) for all tokens in topic_0, where wordcount is the number of
# occurrences of the token in the article.
# scode: token_counts = <FILL IN>
token_counts = [(token, corpus_clean[nmax].count(token)) for token, weight in topic_0]

print "Topic 0 is:"
print topic_0
print "Token counts:"
print token_counts

3.2. Latent Dirichlet Allocation (LDA)

There are several implementations of the LDA topic model in python:

  • Python library lda.
  • Gensim module: gensim.models.ldamodel.LdaModel
  • Scikit-learn module: sklearn.decomposition

3.2.1. LDA using Gensim

The use of the LDA module in gensim is similar to that of LSI. In this notebook we will feed it the tf-idf parametrization, which is not in complete agreement with the theoretical model, which assumes documents represented as vectors of token counts.

To use LDA in gensim, we must first create an LDA model object.


In [74]:
ldag = gensim.models.ldamodel.LdaModel(
    corpus=corpus_tfidf, id2word=D, num_topics=10, update_every=1, passes=10)

In [75]:
ldag.print_topics()

3.2.2. LDA using the Python lda library

An alternative to gensim for LDA is the Python lda library. It requires a document-term matrix of token counts as input.


In [78]:
# For testing LDA, you can use the reuters dataset
# X = lda.datasets.load_reuters()
# vocab = lda.datasets.load_reuters_vocab()
# titles = lda.datasets.load_reuters_titles()
X = np.int32(np.zeros((n_art, n_tokens)))
for n, art in enumerate(corpus_bow):
    for t in art:
        X[n, t[0]] = t[1]
print X.shape
print X.sum()

vocab = D.values()
titles = corpus_titles

In [79]:
# Default parameters:
# model = lda.LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10)
model = lda.LDA(n_topics=10, n_iter=1500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available
topic_word = model.topic_word_  # model.components_ also works

# Show topics...
n_top_words = 8
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Document-topic distribution


In [81]:
doc_topic = model.doc_topic_
for i in range(10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))

In [82]:
# This is to apply the model to a new doc(s)
# doc_topic_test = model.transform(X_test)
# for title, topics in zip(titles_test, doc_topic_test):
#    print("{} (top topic: {})".format(title, topics.argmax()))

As shown in the commented cell above, the fitted model can also be applied to new documents.

3.2.3. LDA using Scikit-learn

The input matrix to the sklearn implementation of LDA contains the token counts for all documents in the corpus. sklearn provides a powerful CountVectorizer class that can be used to construct this input matrix directly from the (cleaned) text corpus.

First, we will define an auxiliary function to print the top tokens of each topic in the model, which has been adapted from the sklearn documentation.


In [86]:
# Adapted from an example in sklearn site 
# http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf_lda.html

# You can try also with the dataset provided by sklearn in 
# from sklearn.datasets import fetch_20newsgroups
# dataset = fetch_20newsgroups(shuffle=True, random_state=1,
#                              remove=('headers', 'footers', 'quotes'))

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

Now, we need a dataset to feed the CountVectorizer object. We build it by joining all tokens of each article in corpus_clean into a single string, using a space ' ' as separator.


In [88]:
print("Loading dataset...")
# scode: data_samples = <FILL IN>
data_samples = [' '.join(doc) for doc in corpus_clean]

print 'Document 0:'
print data_samples[0][0:200], '...'

Now we are ready to compute the token counts.


In [90]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
n_features = 1000
n_samples = 2000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print tf[0][0][0]

Now we can apply the LDA algorithm.

Task: Create an LDA object with the following parameters: n_topics=n_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0


In [92]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# scode: lda = <FILL IN>
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

Task: Fit model lda with the token frequencies computed by tf_vectorizer.


In [94]:
# scode: t0 = <FILL IN>
t0 = time()
# scode: corpus_lda = <FILL IN>
corpus_lda = lda.fit_transform(tf)
print corpus_lda[0]/np.sum(corpus_lda[0])
print("done in %0.3fs." % (time() - t0))

In [95]:
print("\nTopics in LDA model:")
# scode: tf_feature_names = <FILL IN>
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

In [96]:
topics = lda.components_
topic_probs = [t/np.sum(t) for t in topics]
# print topic_probs[0]
print -np.sort(-topic_probs[0])

Exercise: Represent graphically the topic distributions for the top 25 tokens with highest probability for each topic.


In [98]:
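# scode: <WRITE YOUR CODE HERE>
# A possible solution sketch (one option among many): for each topic of the
# sklearn LDA model, plot the 25 tokens with the highest probability, reusing
# tf_feature_names and the normalized topic distributions.
n_bins = 25
plt.figure(figsize=(16, 8))
for i, topic in enumerate(lda.components_):
    topic_prob = topic/np.sum(topic)
    top_ids = np.argsort(- topic_prob)[0:n_bins]
    tokens = [tf_feature_names[k] for k in top_ids[::-1]]
    weights = topic_prob[top_ids[::-1]]
    plt.subplot(1, n_topics, i+1)
    plt.barh(np.arange(n_bins), weights, align='center', alpha=0.4)
    plt.yticks(np.arange(n_bins), tokens)
    plt.xlabel('Token probability')
    plt.title('Topic {0}'.format(i))
plt.show()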

Exercise: Explore the influence of the concentration parameters, $\alpha$ (doc_topic_prior in sklearn) and $\eta$ (topic_word_prior). In particular, observe how the topic and document distributions change as these parameters increase.


In [100]:
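# scode: <WRITE YOUR CODE HERE>
# A possible exploration sketch (an illustration, not a unique answer): refit the
# sklearn LDA model for several values of the document-topic prior and inspect the
# resulting distributions. The same loop can be repeated for topic_word_prior.
for alpha in [0.01, 0.1, 1., 10.]:
    lda_alpha = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                          learning_method='online', learning_offset=50.,
                                          random_state=0, doc_topic_prior=alpha)
    doc_topics_alpha = lda_alpha.fit_transform(tf)
    print("doc_topic_prior = {0}".format(alpha))
    print("Topic distribution of document 0:")
    print(doc_topics_alpha[0]/np.sum(doc_topics_alpha[0]))
    print_top_words(lda_alpha, tf_feature_names, n_top_words)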

Exercise: The token dictionary and the token distribution have shown that:

  1. Some tokens, despite being very frequent in the corpus, have no semantic relevance for topic modeling. Unfortunately, they were not present in the stopword list and have not been eliminated before the analysis.

  2. A large portion of tokens appear only once and, thus, they are not statistically relevant for the inference engine of the topic models.

Revise the analysis by removing all these sets of terms from the corpus.


In [102]:
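# scode: <WRITE YOUR CODE HERE>
# A possible sketch: remove rare and over-frequent tokens with gensim's
# filter_extremes and rebuild the bow corpus. The thresholds below
# (no_below, no_above) are arbitrary choices that should be tuned.
D_filtered = gensim.corpora.Dictionary(corpus_clean)
D_filtered.filter_extremes(no_below=2, no_above=0.5)
corpus_bow_filtered = [D_filtered.doc2bow(doc) for doc in corpus_clean]
print("The filtered dictionary contains {0} tokens".format(len(D_filtered)))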

Exercise: Note that we have not used the terms in the article titles, though they can be expected to contain words that are relevant for the topic modeling. Include the title words in the analysis. In order to give them special relevance, insert them in the corpus several times, so as to make their words more significant.


In [104]:
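# scode: <WRITE YOUR CODE HERE>
# A possible sketch: tokenize and clean each title with the same steps used for
# the article bodies, and append it several times to the article tokens.
# The number of repetitions (n_rep) is an arbitrary choice.
n_rep = 10
corpus_with_titles = []
for title, tokens in zip(corpus_titles, corpus_clean):
    title_tokens = [t.lower() for t in word_tokenize(title) if t.isalnum()]
    title_tokens = [t for t in title_tokens if t not in stopwords_en]
    corpus_with_titles.append(tokens + n_rep*title_tokens)
# The dictionary and the bow corpus can now be rebuilt from corpus_with_titles.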

Exercise: The topic modelling algorithms we have tested in this notebook are unsupervised, which makes them difficult to evaluate objectively. In order to test whether LDA captures real topics, construct a dataset as a mixture of Wikipedia articles from 4 different categories, and test whether LDA with 4 topics identifies topics closely related to the original categories.


In [106]:
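# scode: <WRITE YOUR CODE HERE>
# A possible sketch (the category names below are just an illustration):
# download articles from 4 categories, keep the true category of each article,
# and check whether the dominant LDA topic aligns with it.
cats4 = ["Pseudoscience", "Economics", "Astronomy", "Sculpture"]
mix_text = []
mix_labels = []
for c in cats4:
    cat_data_c = category.Category(site, c)
    for page in cat_data_c.getAllMembersGen():
        mix_text.append(page.getWikiText())
        mix_labels.append(c)
# From here, repeat the cleaning and vectorization steps of Section 2 on mix_text,
# fit an LDA model with 4 topics, and compare the most likely topic of each
# document (doc_topic.argmax()) with its label in mix_labels.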