In this notebook we will explore some tools for text analysis in Python. First, we import the required Python libraries.
In [ ]:
%matplotlib inline
# Required imports
from wikitools import wiki
from wikitools import category
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import gensim
import numpy as np
import lda
import lda.datasets
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt
import pylab
from test_helper import Test
In [ ]:
import pickle
# Load the dictionary and the bag-of-words corpus saved by a previous notebook
with open("wikiresults.p", "rb") as f:
    data = pickle.load(f)
D = data['D']
corpus_bow = data['corpus_bow']
In [ ]:
tfidf = gensim.models.TfidfModel(corpus_bow)
From now on, tfidf can be used to convert any vector from the old representation (bow integer counts) to the new one (TfIdf real-valued weights):
In [ ]:
doc_bow = [(0, 1), (1, 1)]
tfidf[doc_bow]
Or to apply the transformation to a whole corpus:
In [ ]:
corpus_tfidf = tfidf[corpus_bow]
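To make the transformation less of a black box, the following is a minimal pure-Python sketch of a TF-IDF weighting. It assumes the weight of a token is its count times log2(N / df), followed by L2 normalization of each document, which is our understanding of the gensim defaults; the library's implementation may differ in details. The names tfidf_transform and toy_corpus are hypothetical and only serve this illustration.

```python
import math

def tfidf_transform(corpus_bow):
    """Toy TF-IDF: weight = count * log2(N / df), then L2-normalize each document.

    This mirrors what we assume to be the default gensim weighting; it is an
    illustration, not the library's code.
    """
    n_docs = len(corpus_bow)
    # Document frequency: number of documents in which each token id appears
    df = {}
    for doc in corpus_bow:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    corpus_tfidf = []
    for doc in corpus_bow:
        weights = [(tid, count * math.log2(n_docs / df[tid])) for tid, count in doc]
        # Tokens that appear in every document get weight 0 and are dropped
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        corpus_tfidf.append([(tid, w / norm) for tid, w in weights] if norm > 0 else [])
    return corpus_tfidf

# Hypothetical toy corpus in the same (token_id, count) format as corpus_bow
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 2), (2, 1)],
    [(0, 1), (1, 1), (2, 3)],
]
toy_tfidf = tfidf_transform(toy_corpus)
for doc in toy_tfidf:
    print(doc)
```

In this toy corpus, token 0 appears in every document, so it carries no information and disappears from the TF-IDF vectors.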
Now we are ready to apply a topic modeling algorithm. Latent Semantic Indexing (LSI) is provided by LsiModel.
Task: Generate an LSI model with 5 topics for corpus_tfidf and dictionary D. You can check the syntax of gensim.models.LsiModel.
In [ ]:
# Initialize an LSI transformation
n_topics = 5
# scode: lsi = <FILL IN>
# One possible solution
lsi = gensim.models.LsiModel(corpus_tfidf, id2word=D, num_topics=n_topics)
From LSI, we can check both the topic-tokens matrix and the document-topics matrix.
Now we can check the topics generated by LSI. An intuitive visualization is provided by the show_topics method.
In [ ]:
lsi.show_topics(num_topics=-1, num_words=10, log=False, formatted=True)
However, a more useful representation of topics is as a list of tuples (token, value). This is provided by the show_topic method.
Task: Represent the columns of the topic-token matrix as a series of bar diagrams (one per topic) with the top 25 tokens of each topic.
In [ ]:
# Bar plots of the top tokens of each topic
plt.rcdefaults()
n_bins = 25
# Vertical positions of the bars (largest weight on top)
y_pos = range(n_bins-1, -1, -1)
pylab.rcParams['figure.figsize'] = 16, 8  # Set figure size
for i in range(n_topics):
    ### Plot top 25 tokens for topic i
    # Read i-th topic
    # scode: <FILL IN>
    topic_i = lsi.show_topic(i, topn=n_bins)
    tokens = [t[0] for t in topic_i]
    weights = [t[1] for t in topic_i]
    # Plot
    # scode: <FILL IN>
    plt.subplot(1, n_topics, i+1)
    plt.barh(y_pos, weights, align='center', alpha=0.4)
    plt.yticks(y_pos, tokens)
    plt.xlabel('Top {0} topic weights'.format(n_bins))
    plt.title('Topic {0}'.format(i))
# Show all subplots in a single figure
plt.show()
LSI approximates any document as a linear combination of the topic vectors. We can compute the topic weights of any corpus by passing it to the lsi model.
In [ ]:
# On real corpora, a target dimensionality of
# 200–500 is recommended as a “golden standard”.
# Create a double wrapper over the original
# corpus: bow -> tfidf -> fold-in-lsi
corpus_lsi = lsi[corpus_tfidf]
print(corpus_lsi[0])
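The idea of a document as a linear combination of topic vectors can be illustrated without gensim. The sketch below assumes two hypothetical orthonormal topic vectors over a 4-token vocabulary and ignores any scaling that the library may apply: the topic weights are the projections of the document onto the topics, and the approximation is the weighted sum of the topic vectors.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Two hypothetical orthonormal topic vectors over a 4-token vocabulary
topics = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]
# A hypothetical document, as a dense vector of token weights
doc = [2.0, 1.0, 0.0, 1.0]

# Topic weights: projection of the document onto each topic vector
weights = [dot(doc, t) for t in topics]

# Approximation of the document: linear combination of the topic vectors
approx = [sum(w * t[j] for w, t in zip(weights, topics)) for j in range(len(doc))]

print(weights)  # [2.0, 1.0]
print(approx)   # [1.5, 1.5, 0.5, 0.5]
```

The approximation is not exact because two topics cannot span a 4-dimensional token space; adding topics reduces the error.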
Task: Find the document with the largest positive weight for topic 0. Compare the document and the topic.
In [ ]:
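# Extract weights from corpus_lsi
# scode: weight0 = <FILL IN>
# One possible solution: if topic 0 is missing from a document's sparse vector, use 0.0
weight0 = [dict(doc).get(0, 0.0) for doc in corpus_lsi]
# Locate the maximum positive weight
nmax = np.argmax(weight0)
print(nmax)
print(weight0[nmax])
print(corpus_lsi[nmax])
# Get topic 0
# scode: topic_0 = <FILL IN>
topic_0 = lsi.show_topic(0, topn=n_bins)
# Compute a list of tuples (token, wordcount) for all tokens in topic_0, where wordcount is the number of
# occurrences of the token in the article.
# scode: token_counts = <FILL IN>
# One possible solution, assuming D is a gensim Dictionary (so that D.token2id exists)
bow_nmax = dict(corpus_bow[nmax])
token_counts = [(t[0], bow_nmax.get(D.token2id[t[0]], 0)) for t in topic_0]
print("Topic 0 is:")
print(topic_0)
print("Token counts:")
print(token_counts)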
There are several implementations of the LDA topic model in Python: the lda package, gensim.models.ldamodel.LdaModel, and sklearn.decomposition.
The use of the LDA module in gensim is similar to LSI. Here we feed it the tf-idf representation of the corpus, which is not in complete agreement with the theoretical model, since LDA assumes documents represented as vectors of token counts.
To use LDA in gensim, we must first create an LDA model object.
In [ ]:
ldag = gensim.models.ldamodel.LdaModel(
    corpus=corpus_tfidf, id2word=D, num_topics=10, update_every=1, passes=10)
In [ ]:
ldag.print_topics()
The input matrix to the sklearn implementation of LDA contains the token counts for all documents in the corpus.
sklearn provides a powerful CountVectorizer class that can be used to construct this input matrix from the documents in the corpus.
First, we will define an auxiliary function to print the top tokens in the model, which has been taken from the sklearn documentation.
In [ ]:
# Adapted from an example on the sklearn site
# http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf_lda.html
# You can also try the dataset provided by sklearn:
# from sklearn.datasets import fetch_20newsgroups
# dataset = fetch_20newsgroups(shuffle=True, random_state=1,
#                              remove=('headers', 'footers', 'quotes'))
def print_top_words(model, feature_names, n_top_words):
    # Each row of model.components_ holds the token weights of one topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
" ".join(ListaTokens)
Now we need a dataset to feed the CountVectorizer object. We build it by joining all tokens of each document in corpus_clean into a single string, using a space ' ' as separator.
Task: Join all tokens from each document in a single string, using a white space as separator.
In [ ]:
print("Loading dataset...")
# scode: data_samples = <FILL IN>  # Use join over corpus_clean.
# corpus_clean is assumed to hold one list of tokens per document (from the preprocessing step)
data_samples = [" ".join(doc) for doc in corpus_clean]
print('Document 0:')
print(data_samples[0][0:200], '...')
Now we are ready to compute the token counts.
In [ ]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
n_features = 1000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Show the nonzero token counts of document 0
print(tf[0])
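The effect of the max_df, min_df and max_features arguments can be reproduced on a toy corpus. The sketch below is a simplified stand-in for CountVectorizer, not its implementation: it splits on white space only, skips stopword removal, and assumes that a float max_df is a fraction of the documents while an integer min_df is an absolute document count. The function names build_vocabulary and count_matrix are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs, max_df=0.95, min_df=2, max_features=None):
    """Select tokens by document frequency, then keep the most frequent ones."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each token
    term_freq = Counter()  # total number of occurrences of each token
    for doc in docs:
        tokens = doc.split()
        doc_freq.update(set(tokens))
        term_freq.update(tokens)
    # Drop tokens that are too rare (df < min_df) or too common (df > max_df * n_docs)
    kept = [t for t in doc_freq if min_df <= doc_freq[t] <= max_df * n_docs]
    # Keep at most max_features tokens, preferring the most frequent ones
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

def count_matrix(docs, vocabulary):
    """Dense document-token count matrix, one row per document."""
    index = {t: j for j, t in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows

# Hypothetical toy documents, already joined into strings
toy_docs = ["topic model topic", "topic model data", "topic data graph", "topic model"]
vocabulary = build_vocabulary(toy_docs, max_df=0.95, min_df=2)
counts = count_matrix(toy_docs, vocabulary)
print(vocabulary)  # ['data', 'model']
print(counts)      # [[0, 1], [1, 1], [1, 0], [0, 1]]
```

Here "topic" is removed because it appears in all four documents, and "graph" because it appears in only one.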
Now we can apply the LDA algorithm.
Task: Create an LDA object with the following parameters: n_topics=n_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0
In [ ]:
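n_samples = tf.shape[0]  # Number of documents in the count matrix
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# scode: lda = <FILL IN>
# Note: scikit-learn 0.19 and later call the n_topics parameter n_components
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)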
Task: Fit model lda with the token frequencies computed by tf_vectorizer.
In [ ]:
t0 = time()
corpus_lda = lda.fit_transform(tf)
print("done in %0.3fs." % (time() - t0))
In [ ]:
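print("\nTopics in LDA model:")
n_top_words = 20  # Assumed value; n_top_words is not defined earlier in this notebook
# Note: recent scikit-learn versions replace this call with get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)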
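The print_top_words helper selects the top tokens with the slice topic.argsort()[:-n_top_words - 1:-1], which can look cryptic. The following pure-Python sketch reproduces the same indexing with sorted, using hypothetical feature names and weights: the indices are sorted by increasing weight and then read backwards from the end for n_top steps.

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Ascending order of weights, like numpy's argsort()
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the last element, taking n_top indices
    return order[:-n_top - 1:-1]

# Hypothetical vocabulary and topic weights
feature_names = ["city", "river", "music", "piano", "bridge"]
topic = [0.1, 0.7, 0.05, 0.3, 0.9]

top = top_indices(topic, 3)
print(top)                                       # [4, 1, 3]
print(" ".join(feature_names[i] for i in top))   # bridge river piano
```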
Exercise: Represent the topic distributions graphically.
Exercise: Explore the influence of the concentration parameters, $\alpha$ (doc_topic_prior in sklearn) and $\eta$ (topic_word_prior). In particular, observe how the topic and document distributions change as these parameters increase.
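As a starting point for the exercise on concentration parameters, it may help to look at the Dirichlet prior in isolation. The sketch below draws samples from a symmetric Dirichlet distribution by normalizing independent Gamma draws, and reports the average size of the largest component. The helper sample_dirichlet and the chosen values (5 topics, 1000 samples, seed 0) are assumptions made for illustration; they are not part of sklearn.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension k."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # Fixed seed so that runs are repeatable
avg_max_by_alpha = {}
for alpha in (0.1, 1.0, 10.0):
    samples = [sample_dirichlet(alpha, 5, rng) for _ in range(1000)]
    # Average weight of the dominant topic across the sampled distributions
    avg_max_by_alpha[alpha] = sum(max(s) for s in samples) / len(samples)
    print("alpha = %5.1f  average largest component = %.2f"
          % (alpha, avg_max_by_alpha[alpha]))
```

Small values of the concentration parameter favor sparse distributions dominated by one component, while large values push the samples toward the uniform distribution. The same trend should be visible in the document and topic distributions of the fitted model.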
Exercise: The token dictionary and the token distribution have shown that:
Some tokens, despite being very frequent in the corpus, have no semantic relevance for topic modeling. Unfortunately, they were not present in the stopword list, and have not been eliminated before the analysis.
A large portion of tokens appear only once and, thus, they are not statistically relevant for the inference engine of the topic models.
Revise the entire corpus by removing all these sets of terms.
Exercise: Note that we have not used the terms in the article titles, though they can be expected to contain relevant words for topic modeling. Include the title words in the analysis. To give them special relevance, insert them in the corpus several times, so as to make their words more significant.
Exercise: The topic modeling algorithms we have tested in this notebook are unsupervised. This makes them difficult to evaluate objectively. To test whether LDA captures real topics, construct a dataset as a mixture of Wikipedia articles from 4 different categories, and test whether LDA with 4 topics identifies topics closely related to the original categories.
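For the last exercise, one simple way to compare topics with known categories is to assign each document to its dominant topic and measure how pure the resulting groups are. The sketch below is only one possible evaluation, with hypothetical names and toy data: doc_topics plays the role of the document-topic matrix returned by fit_transform, and categories holds the known category of each article.

```python
from collections import Counter

def dominant_topics(doc_topics):
    """Index of the topic with the largest weight in each document."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in doc_topics]

def purity(dominant, categories):
    """Fraction of documents whose category is the majority category of their topic."""
    by_topic = {}
    for topic, cat in zip(dominant, categories):
        by_topic.setdefault(topic, Counter())[cat] += 1
    matched = sum(counter.most_common(1)[0][1] for counter in by_topic.values())
    return matched / len(categories)

# Hypothetical document-topic weights (one row per document) and true categories
doc_topics = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
categories = ["Physics", "Physics", "Biology", "Biology"]

dominant = dominant_topics(doc_topics)
print(dominant)                      # [0, 0, 1, 0]
print(purity(dominant, categories))  # 0.75
```

A purity close to 1 would suggest that the topics align with the original categories. Note that purity alone rewards models with many topics, so it is most meaningful when the number of topics equals the number of categories, as in the exercise.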