Text Analysis and Topic Modelling

Author: Jesús Cid Sueiro

Date: 2016/04/03

In this notebook we will explore some tools for text analysis in Python. To do so, we will first import the required Python libraries.


In [ ]:
%matplotlib inline

# Required imports
from wikitools import wiki
from wikitools import category

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import gensim

import numpy as np
import lda
import lda.datasets

from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import matplotlib.pyplot as plt
import pylab

from test_helper import Test

In [ ]:
import pickle
data = pickle.load(open("wikiresults.p", "rb"))
D = data['D']
corpus_bow = data['corpus_bow']

3. Semantic Analysis

The dictionary D and the bag-of-words corpus in corpus_bow are the key inputs to the topic modeling algorithms. The topic modeling algorithms in gensim assume that input documents are parameterized using the tf-idf model.


In [ ]:
tfidf = gensim.models.TfidfModel(corpus_bow)

From now on, tfidf can be used to convert any vector from the old representation (bow integer counts) to the new one (TfIdf real-valued weights):


In [ ]:
doc_bow = [(0, 1), (1, 1)]
tfidf[doc_bow]

Or to apply the transformation to a whole corpus:


In [ ]:
corpus_tfidf = tfidf[corpus_bow]

3.1. Latent Semantic Indexing (LSI)

Now we are ready to apply a topic modeling algorithm. Latent Semantic Indexing is provided by LsiModel.

Task: Generate an LSI model with 5 topics for corpus_tfidf and dictionary D. You can check the syntax of gensim.models.LsiModel.


In [ ]:
# Initialize an LSI transformation
n_topics = 5
# scode: lsi = <FILL IN>
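# A possible solution (sketch): an LSI model over the tf-idf corpus, using
# dictionary D as the id-to-token mapping.
lsi = gensim.models.LsiModel(corpus_tfidf, id2word=D, num_topics=n_topics)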

From the LSI model, we can check both the topic-token matrix and the document-topic matrix.

Now we can check the topics generated by LSI. An intuitive visualization is provided by the show_topics method.


In [ ]:
lsi.show_topics(num_topics=-1, num_words=10, log=False, formatted=True)

However, a more useful representation of topics is as a list of tuples (token, value). This is provided by the show_topic method.

Task: Represent the columns of the topic-token matrix as a series of bar diagrams (one per topic) with the top 25 tokens of each topic.


In [ ]:
# TOP TOKENS PER LSI TOPIC (bar diagrams):
plt.rcdefaults()

n_bins = 25

# Example data
y_pos = range(n_bins-1, -1, -1)

pylab.rcParams['figure.figsize'] = 16, 8  # Set figure size

for i in range(n_topics):

    ### Plot top 25 tokens for topic i
    # Read the i-th topic
    # scode: <FILL IN>
    topic_i = lsi.show_topic(i, topn=n_bins)
    tokens = [t[0] for t in topic_i]
    weights = [t[1] for t in topic_i]
    
    # Plot
    # scode: <FILL IN>
    plt.subplot(1, n_topics, i+1)
    plt.barh(y_pos, weights, align='center', alpha=0.4)
    plt.yticks(y_pos, tokens)
    plt.xlabel('Top {0} topic weights'.format(n_bins))
    plt.title('Topic {0}'.format(i))

plt.show()

LSI approximates any document as a linear combination of the topic vectors. We can compute the topic weights of all documents by passing the whole corpus through the lsi model.


In [ ]:
# On real corpora, target dimensionality of
# 200–500 is recommended as a “golden standard”
# Create a double wrapper over the original 
# corpus: bow -> tfidf -> fold-in-lsi
corpus_lsi = lsi[corpus_tfidf]
print corpus_lsi[0]

Task: Find the document with the largest positive weight for topic 0. Compare the document and the topic.


In [ ]:
# Extract weights from corpus_lsi
# scode: weight0 = <FILL IN>
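# A possible solution (sketch): each entry of corpus_lsi is a list of
# (topic_id, weight) tuples, so collect the weight assigned to topic 0.
weight0 = [dict(doc).get(0, 0.0) for doc in corpus_lsi]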

# Locate the maximum positive weight
nmax = np.argmax(weight0)
print nmax
print weight0[nmax]
print corpus_lsi[nmax]

# Get topic 0
# scode: topic_0 = <FILL IN>
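# A possible solution (sketch): top tokens of topic 0 as (token, weight) tuples.
topic_0 = lsi.show_topic(0, topn=n_bins)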

# Compute a list of tuples (token, wordcount) for all tokens in topic_0, where wordcount is the number of 
# occurrences of the token in the article.
# scode: token_counts = <FILL IN>
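# A possible solution (sketch): map token ids to their counts in article nmax,
# then look up each token of topic_0 (tokens absent from the article get 0).
bow_nmax = dict(corpus_bow[nmax])
token_counts = [(token, bow_nmax.get(D.token2id[token], 0))
                for token, weight in topic_0]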

print "Topic 0 is:"
print topic_0
print "Token counts:"
print token_counts

3.2. Latent Dirichlet Allocation (LDA)

There are several implementations of the LDA topic model in Python:

  • Python library lda.
  • Gensim module: gensim.models.ldamodel.LdaModel
  • Scikit-learn module: sklearn.decomposition.LatentDirichletAllocation

3.2.1. LDA using Gensim

The use of the LDA module in gensim is similar to that of LSI. Note that we will feed it the tf-idf parametrization of the corpus, which is not in complete agreement with the theoretical model: LDA assumes that documents are represented as vectors of token counts.

To use LDA in gensim, we must first create an LDA model object.


In [ ]:
ldag = gensim.models.ldamodel.LdaModel(
    corpus=corpus_tfidf, id2word=D, num_topics=10, update_every=1, passes=10)

In [ ]:
ldag.print_topics()
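
As with LSI, the fitted model can be applied to individual documents to obtain their topic distributions. A minimal sketch, assuming corpus_tfidf from the previous section is still available (the output is a list of (topic_id, weight) tuples):


In [ ]:
# Topic distribution of the first document under the gensim LDA model
print ldag[corpus_tfidf[0]]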

3.2.2. LDA using Scikit-learn

The input matrix to the sklearn implementation of LDA contains the token counts for all documents in the corpus. sklearn provides a powerful CountVectorizer class that can be used to construct this matrix from the raw text of the documents.
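
As a minimal illustration (with made-up toy documents, not part of the Wikipedia corpus), CountVectorizer maps a list of raw text strings to a sparse matrix of token counts:


In [ ]:
# Toy example of CountVectorizer (illustrative only)
toy_docs = ["the cat sat on the mat", "the dog sat"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)
print toy_vectorizer.get_feature_names()
print toy_counts.toarray()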

First, we will define an auxiliary function to print the top tokens of each topic in the model, which has been adapted from the sklearn documentation.


In [ ]:
# Adapted from an example in sklearn site 
# http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf_lda.html

# You can try also with the dataset provided by sklearn in 
# from sklearn.datasets import fetch_20newsgroups
# dataset = fetch_20newsgroups(shuffle=True, random_state=1,
#                              remove=('headers', 'footers', 'quotes'))

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

Now, we need a dataset to feed the CountVectorizer object. We will build it by joining all tokens of each document in corpus_clean into a single string, using a white space ' ' as separator (i.e., " ".join(list_of_tokens)).

Task: Join all tokens from each document in a single string, using a white space as separator.


In [ ]:
print("Loading dataset...")
# scode: data_samples = <FILL IN>   # Use join over corpus_clean.
data_samples = [" ".join(doc) for doc in corpus_clean]

print 'Document 0:'
print data_samples[0][0:200], '...'

Now we are ready to compute the token counts.


In [ ]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
n_features = 1000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Token counts of the first document (sparse representation)
print tf[0]

Now we can apply the LDA algorithm.

Task: Create an LDA object with the following parameters: n_topics=n_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0


In [ ]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# scode: lda = <FILL IN>
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

Task: Fit model lda with the token frequencies computed by tf_vectorizer.


In [ ]:
t0 = time()

corpus_lda = lda.fit_transform(tf)

print("done in %0.3fs." % (time() - t0))

In [ ]:
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
n_top_words = 10  # number of top tokens to display per topic
print_top_words(lda, tf_feature_names, n_top_words)

Exercise: Represent the topic distributions graphically.
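
One possible starting point (a sketch only, assuming lda, tf_feature_names and n_topics from the cells above): plot the top tokens of each row of lda.components_ as bar diagrams, as we did for the LSI topics.


In [ ]:
n_bins = 25
y_pos = range(n_bins-1, -1, -1)

for i in range(n_topics):
    # Top tokens and weights of topic i
    topic = lda.components_[i]
    top_ids = topic.argsort()[:-n_bins - 1:-1]
    tokens = [tf_feature_names[k] for k in top_ids]
    weights = [topic[k] for k in top_ids]

    plt.subplot(1, n_topics, i+1)
    plt.barh(y_pos, weights, align='center', alpha=0.4)
    plt.yticks(y_pos, tokens)
    plt.title('Topic {0}'.format(i))

plt.show()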

Exercise: Explore the influence of the concentration parameters, $\alpha$ (doc_topic_prior in sklearn) and $\eta$ (topic_word_prior). In particular, observe how the topic and document distributions change as these parameters increase.
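
For reference, a sketch of how these priors can be set explicitly in sklearn (the values below are arbitrary, chosen only for illustration):


In [ ]:
# LDA with explicit concentration parameters (illustrative values)
lda_priors = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                       learning_method='online',
                                       learning_offset=50., random_state=0,
                                       doc_topic_prior=0.1,
                                       topic_word_prior=0.1)
corpus_lda_priors = lda_priors.fit_transform(tf)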

Exercise: The token dictionary and the token distribution have shown that:

  1. Some tokens, despite being very frequent in the corpus, have no semantic relevance for topic modeling. Unfortunately, they were not present in the stopword list and have not been eliminated before the analysis.

  2. A large portion of tokens appear only once and, thus, they are not statistically relevant for the inference engine of the topic models.

Revise the entire analysis by removing all these sets of terms from the corpus.

Exercise: Note that we have not used the terms in the article titles, though they can be expected to contain words relevant to the topic modeling. Include the title words in the analysis. In order to give them special relevance, insert them in the corpus several times, so as to make their words more significant.

Exercise: The topic modeling algorithms we have tested in this notebook are unsupervised, which makes them difficult to evaluate objectively. In order to test whether LDA captures real topics, construct a dataset as a mixture of Wikipedia articles from 4 different categories, and test whether LDA with 4 topics identifies topics closely related to the original categories.