In [1]:
import pandas as pd
import random
from random import shuffle
import numpy as np
import matplotlib.pyplot as plt
import time
#Specialized packages
import nltk
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import corpora, models
from scipy.special import gammaln
import re
In [2]:
%pylab inline
In [253]:
from jyquickhelper import add_notebook_menu
add_notebook_menu()
Out[253]:
We are considering a collection of English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn (May 2011). It was obtained through a keyword Web search and generously provided by Aurélien Lauf, Leila Khouas and Mohamed Dermouche on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/NYSK
In [4]:
#get raw data
import xml.etree.ElementTree as ET
tree = ET.parse('../dataset/nysk.xml')
root = tree.getroot()
root1 = root.getchildren()[150].getchildren()
texts=[]
for document in root.iter('document'):
    text = document.find('text').text
    texts += [text]
In [235]:
#Example of text
texts[1]
Out[235]:
We can see from this example that the textual data are not very clean: the title and paratext may be included in the sample.
For computational cost, we will restrict the implementation to 250 articles out of the 10421. Please note that the code itself scales to a higher limit.
In [180]:
# Sample texts
print(len(texts))
shuffle(texts)
texts_test = texts[:250]
print(len(texts_test))
In [181]:
from nltk.tokenize import MWETokenizer
from nltk.stem.snowball import SnowballStemmer
In [182]:
my_stop_words = nltk.corpus.stopwords.words('english')
# Add my stopwords
my_stop_words = my_stop_words + ["n't", "'s", "wednesday", "year",
                                 "ve", "said", "a", "would", "may", "say", "saturday",
                                 "thursday", "select", "one", "part"]
In [183]:
tokenizer = MWETokenizer([("world", "trade", "organisation"), ('dominique', 'strauss-kahn'),
                          ("international", "monetary", "fund"), ('new', 'york'), ("wall", "street")])
In [184]:
stemmer = SnowballStemmer("english")
In [185]:
texts_tok = []
for text in texts_test:
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    tokens = tokenizer.tokenize(tokens)
    filtered_tokens = []
    for token in tokens:
        if token not in my_stop_words:
            if re.search('[a-zA-Z]', token):
                stemmed_token = stemmer.stem(token)
                filtered_tokens.append(stemmed_token)
    texts_tok += [filtered_tokens]
One problem in text analysis is the large number of unique words found in a corpus and, as a corollary, the sparsity of the representation. We first give an overview of this issue and then address it with the filtering implemented in scikit-learn's TfidfVectorizer.
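As a reminder of the weighting used (with, to the best of our knowledge, scikit-learn's default settings: raw term counts, smooth_idf=True and L2 normalisation of each document vector), the weight of term $t$ in document $d$ is $tfidf(t,d) = tf(t,d) \times \left(\ln\frac{1 + n_{docs}}{1 + df(t)} + 1\right)$, where $df(t)$ is the number of documents containing $t$. The max_df=0.8 and min_df=0.2 options used below simply discard terms appearing in more than 80% or fewer than 20% of the documents.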
In [228]:
from collections import defaultdict
frequency = defaultdict(int)
for text in texts_tok:
    for token in text:
        frequency[token] += 1
df_freq = pd.DataFrame(frequency, index=['value']).T
print('Most frequent words')
print(df_freq.sort_values(['value'], ascending=False).head())
df_freq[df_freq['value'] <10].hist(['value'], bins=50)
plt.title('Distribution for the least frequent words')
print()
In [187]:
# Extract the most discriminant tokens
def tokenStem(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    tokens = tokenizer.tokenize(tokens)
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            stemmed_token = stemmer.stem(token)
            filtered_tokens.append(stemmed_token)
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=my_stop_words,
                                   use_idf=True, tokenizer=tokenStem, ngram_range=(1, 1))
%time tfidf_matrix = tfidf_vectorizer.fit_transform(texts_test)
terms = tfidf_vectorizer.get_feature_names()
# Build the reduced corpus, keeping only the tokens retained by the tf-idf filtering
texts_red = [[token for token in text if token in terms] for text in texts_tok]
In [190]:
# Building the dictionary
dictionary = corpora.Dictionary(texts_red)
# Store the translation rule in a pd.DataFrame
dict_df = pd.DataFrame(data=dictionary.token2id, index=['value']).T
dict_df.head()
Out[190]:
In [191]:
# Translating our corpus
texts_idx = [dictionary.doc2idx(text) for text in texts_red]
# Example
texts_idx[2]
Out[191]:
In [230]:
length = []
for text in range(len(texts_idx)):
    length += [len(texts_idx[text])]
plt.figure()
plt.boxplot(length)
print(max(length), min(length))
To fix the notation, let us briefly recall the generative graphical model underlying LDA; we will not discuss it further here.
Given $d \in \{1, \cdots, n_{docs}\}$ the document index, $w \in \{1, \cdots, n_{words}\}$ the word index and $k \in \{1, \cdots, n_{topics}\}$ the topic index, the underlying process modeled by LDA is: draw a topic distribution $\pi_d \sim Dirichlet(\alpha)$ for each document $d$ and a word distribution $\phi_k \sim Dirichlet(\beta)$ for each topic $k$; then, for each token of document $d$, draw a topic assignment $t \sim Multinomial(\pi_d)$ and a word $w \sim Multinomial(\phi_t)$.
The variables to infer are the two unknown latent variables $\pi_d$ and $\phi_k$, which can be done with several techniques, as described in the article's commentary. We implement here collapsed Gibbs sampling, for its familiarity and for the clear explanation given in Griffiths TL, Steyvers M., "Finding scientific topics". From their derivation, the full conditional distribution is:
$ \Pi(t_{dw}) \equiv P(t_{dw}=j \mid t_{-dw}, w) \propto \frac{n_{-dw,j}^{w} + \beta}{n_{-dw,j} + W\beta}\,\frac{n_{-dw,j}^{d} + \alpha}{n_{-dw}^{d} + T\alpha}$
where $n_{-dw,j}^{w}$ is the number of times word $w$ is assigned to topic $j$, $n_{-dw,j}$ the total number of tokens assigned to topic $j$, $n_{-dw,j}^{d}$ the number of tokens of document $d$ assigned to topic $j$ and $n_{-dw}^{d}$ the number of tokens of document $d$, all counts excluding the current token; $W$ is the vocabulary size and $T$ the number of topics.
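As an illustration only (not part of the inference code below), a minimal forward simulation of this generative process with toy sizes could look like:
import numpy as np

rng = np.random.default_rng(0)
n_topics_toy, n_words_toy, doc_len = 3, 10, 20                      # toy dimensions, purely illustrative
phi = rng.dirichlet(np.full(n_words_toy, 0.1), size=n_topics_toy)   # phi_k ~ Dirichlet(beta)
pi_d = rng.dirichlet(np.full(n_topics_toy, 0.5))                    # pi_d ~ Dirichlet(alpha)
doc = []
for _ in range(doc_len):
    t = rng.choice(n_topics_toy, p=pi_d)    # topic assignment t ~ Multinomial(pi_d)
    w = rng.choice(n_words_toy, p=phi[t])   # word w ~ Multinomial(phi_t)
    doc.append((t, w))
print(doc[:5])                              # first five (topic, word index) pairs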
In [194]:
n_docs = len(texts_idx) # Number of documents in corpus
n_words = len(dict_df) # Number of words in full corpus
n_topics =5 # Number of topics we want to find
n_iter =20
alpha = 0.1
beta =0.1
In [195]:
def initialisation(n_docs, n_topics, n_words, texts_idx):
    doc_topic = np.zeros((n_docs, n_topics))   # number of words per topic for each doc
    word_topic = np.zeros((n_topics, n_words)) # count of each word for each topic
    doc = np.zeros(n_docs)                     # number of words for each doc / length of each doc
    topic = np.zeros(n_topics)                 # number of words for each topic
    topics_peridx = {}                         # topic assigned to each word of each document
    for d in range(n_docs):
        idx = 0
        for w in texts_idx[d]:
            # generate a random assignment for the first step
            t = np.random.randint(n_topics)
            doc_topic[d, t] += 1
            doc[d] += 1
            word_topic[t, w] += 1
            topic[t] += 1
            topics_peridx[(d, idx)] = t
            idx += 1
    output = [doc_topic, doc, word_topic, topic, topics_peridx]
    return output
new = initialisation(n_docs,n_topics,n_words,texts_idx)
doc_topic = new[0]
doc = new[1]
word_topic = new[2]
topic = new[3]
topics_peridx =new[4]
In [196]:
print(word_topic.shape) # n_topics*n_words (number of topics * number of words in final dictionary)
print(word_topic)
We can check that the total count of word_0 in the full corpus is dispatched among the topics.
NB: this illustrates an interesting feature of Latent Dirichlet Allocation: the same word can appear in several topics and thus carry a different meaning depending on the context.
In [197]:
# Find the word corresponding to the index 0
value0 = dict_df[dict_df.value==0].index[0]
print(value0)
In [198]:
# Look for its frequency inside the final vocabulary (of texts_red)
freq_red = defaultdict(int)
for text in texts_red:
    for token in text:
        freq_red[token] += 1
print('Number of occurrences in full corpus:', freq_red[value0])
print('Dispatched word_0 to each topic:', freq_red[value0] == sum(word_topic[:, 0]))
doc_topic
This matrix contains, for each document d, the count of its words assigned to each topic t. It describes which topics are present in document d and with what relative intensity.
Example: if doc_topic[d, t] = 13, then 13 words of document d are currently assigned to topic t.
NB: it illustrates the fact that one document can be generated by several topics.
In [199]:
print(doc_topic[0:10])
print()
print('Matrix shape',doc_topic.shape) # n_docs*n_topics
We can check that the number of classified words matches the number of words in the document.
In [200]:
print('Number of words in document_0:', len(texts_idx[0]))
print('Equals to sum of words in each topic for document_0:',sum(doc_topic[0])==len(texts_idx[0]))
In [201]:
def pi(d, w, alpha, beta):
    '''
    Compute p(t | w, -t):
    the full conditional distribution of topic t given the word w
    '''
    left = (word_topic[:, w] + beta) / (topic + beta*n_words)
    right = (doc_topic[d, :] + alpha) / (doc[d] + alpha*n_topics)
    p_t = left * right
    p_t /= np.sum(p_t)  # normalization to get a probability
    return p_t
In [202]:
start_time = time.time()
for iteration in range(n_iter):
    print('iteration:', iteration)
    for d in range(n_docs):
        idx = 0
        for w in texts_idx[d]:
            t = topics_peridx[(d, idx)]
            # withdraw the current assignment of t
            doc_topic[d, t] -= 1
            doc[d] -= 1
            word_topic[t, w] -= 1
            topic[t] -= 1
            # compute the conditional distribution
            p_t = pi(d, w, alpha, beta)
            # choose the new topic for word w
            t = np.random.multinomial(1, p_t)
            t = t.argmax()
            doc_topic[d, t] += 1
            doc[d] += 1
            word_topic[t, w] += 1
            topic[t] += 1
            topics_peridx[(d, idx)] = t
            idx += 1
print("--- %s seconds ---" % (time.time() - start_time))
In [203]:
# Relative number of words per topic
pd.DataFrame(topic/sum(topic)*100).T
Out[203]:
In [204]:
# Distribution of words per topic
word_topic_df = pd.DataFrame(word_topic)
word_topic_df.columns = dict_df.sort_values(['value']).index
word_topic_df
Out[204]:
In [205]:
# Estimation of phi: P(w|t), the word distribution of each topic
word_topic_df / word_topic_df.sum(axis=0)
Out[205]:
In [208]:
for t in range(n_topics):
    topic_ = word_topic_df.iloc[t]
    print('topic', t)
    print(topic_[topic_ > 50].index)
A complete analysis of the result is beyond our goal, since we have not examined the data closely enough: they are somewhat unclean due to the fetching method (other information surrounding the article on the web page is, we assume, sometimes merged into the text). Still, we can interpret topic 4 as related to international matters, topic 3 to economics, topic 2 to judicial matters, topic 1 to the global picture in terms of network, and topic 0 to the act itself.
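As a complement to the fixed threshold of 50 used above, a small sketch (relying on the word_topic_df and n_topics defined earlier) that lists the ten most frequent words of each topic could look like:
# Hypothetical helper: rank the words of each topic by their count in word_topic_df
for t in range(n_topics):
    top_words = word_topic_df.iloc[t].sort_values(ascending=False).head(10)
    print('topic', t, ':', list(top_words.index))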
The next idea is to understand the impact of varying the hyperparameters or the parameters we have so far held constant. For that purpose, we compare the performance along two axes: the convergence of the log-likelihood across iterations, and its sensitivity to the parameters held constant so far ($\alpha$, $\beta$ and the number of topics).
For that, we recall the form of the likelihood and express it using the logarithm of the multinomial Beta function; we borrow this idea from Mathieu Blondel (http://mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/).
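Concretely, the quantity computed by the loglikelihood function below is the joint log-probability of the words and the topic assignments under the symmetric priors, $\log p(w, t \mid \alpha, \beta) = \sum_{k=1}^{T} \left[\log B(n_k + \beta) - \log B(\beta \mathbf{1}_W)\right] + \sum_{d=1}^{n_{docs}} \left[\log B(m_d + \alpha) - \log B(\alpha \mathbf{1}_T)\right]$, where $B(x) = \frac{\prod_i \Gamma(x_i)}{\Gamma(\sum_i x_i)}$ is the multinomial Beta function, $n_k$ the vector of word counts of topic $k$ (a row of word_topic) and $m_d$ the vector of topic counts of document $d$ (a row of doc_topic).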
In [209]:
def log_multi_beta(alpha, K=None):
    """
    Logarithm of the multinomial Beta function.
    """
    if K is None:
        # alpha is assumed to be a vector
        return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))
    else:
        # alpha is assumed to be a scalar
        return K * gammaln(alpha) - gammaln(K * alpha)
In [210]:
def loglikelihood():
    """
    Compute the log-likelihood that the model generated the data.
    """
    loglik = 0
    for t in range(n_topics):
        loglik += log_multi_beta(word_topic[t, :] + beta)
        loglik -= log_multi_beta(beta, n_words)
    for d in range(n_docs):
        loglik += log_multi_beta(doc_topic[d, :] + alpha)
        loglik -= log_multi_beta(alpha, n_topics)
    return loglik
In [211]:
def LDA(n_iter, alpha, beta, verbose=False):
    logliks = []
    for iteration in range(n_iter):
        for d in range(n_docs):
            idx = 0
            for w in texts_idx[d]:
                t = topics_peridx[(d, idx)]
                # withdraw the current assignment of t
                doc_topic[d, t] -= 1
                doc[d] -= 1
                word_topic[t, w] -= 1
                topic[t] -= 1
                # compute the conditional distribution and resample the topic of word w
                p_t = pi(d, w, alpha, beta)
                t = np.random.multinomial(1, p_t)
                t = t.argmax()
                doc_topic[d, t] += 1
                doc[d] += 1
                word_topic[t, w] += 1
                topic[t] += 1
                topics_peridx[(d, idx)] = t
                idx += 1
        if iteration % 5 == 0:
            print('iteration:', iteration)
        if verbose:
            loglik = loglikelihood()
            print("loglikelihood", round(loglik))
            logliks += [loglik]
    if not verbose:
        logliks = loglikelihood()
        print("loglikelihood", round(logliks))
    return logliks
In [212]:
new = initialisation(n_docs,n_topics,n_words,texts_idx)
doc_topic = new[0]
doc = new[1]
word_topic = new[2]
topic = new[3]
topics_peridx =new[4]
In [213]:
n_iter = 70
%time convergenceLDA = LDA(n_iter, alpha,beta, verbose = True)
In [249]:
x = range(n_iter)
fig = plt.figure()
plt.plot(x,convergenceLDA)
plt.ylabel('loglikelihood')
plt.xlabel('iterations')
plt.ylim(-290000, -210000)
Out[249]:
$\alpha$ and $\beta$ are the parameters of the two Dirichlet priors. They are called concentration parameters since, in our symmetric case, a high value means a higher degree of mixing of the input. More precisely, the higher the $\alpha$, the more likely each document is to contain a mixture of most of the topics rather than any single topic specifically. Likewise, a high $\beta$ value means that each topic is likely to contain a mixture of most of the words rather than any specific word, while a low value means that a topic may contain a mixture of just a few of the words.
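To make the effect of the concentration parameter concrete, here is a small illustrative sketch (outside the model itself) drawing once from a symmetric Dirichlet prior over 5 components for several concentration values:
import numpy as np

rng = np.random.default_rng(0)
for conc in [0.1, 1.0, 10.0]:
    sample = rng.dirichlet(np.full(5, conc))   # one draw from a symmetric Dirichlet(conc)
    print(conc, np.round(sample, 3))
# A low concentration puts most of the mass on a few components (e.g. a document dominated by
# one topic); a high concentration yields a nearly even mixture of all components.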
In [223]:
#Study on alpha
lik_alpha = []
iter_alpha = np.linspace(0.1, 2.0, num=10).tolist()
for alpha in iter_alpha:
    print('alpha:', alpha)
    new = initialisation(n_docs, n_topics, n_words, texts_idx)
    doc_topic = new[0]
    doc = new[1]
    word_topic = new[2]
    topic = new[3]
    topics_peridx = new[4]
    lik = LDA(20, alpha, beta)
    lik_alpha += [lik]
In [250]:
fig = plt.figure()
plt.plot(iter_alpha,lik_alpha)
plt.ylabel('loglikelihood')
plt.xlabel('alpha')
plt.ylim(-290000, -210000)
Out[250]:
In [146]:
alpha=0.1
lik_beta = []
iter_beta = np.linspace(0.1, 2.0, num=10).tolist()
for beta in iter_beta:
    print('beta:', beta)
    new = initialisation(n_docs, n_topics, n_words, texts_idx)
    doc_topic = new[0]
    doc = new[1]
    word_topic = new[2]
    topic = new[3]
    topics_peridx = new[4]
    lik = LDA(20, alpha, beta)
    lik_beta += [lik]
In [150]:
fig = plt.figure()
plt.plot(iter_beta,lik_beta)
plt.ylabel('pseudo loglikelihood')
plt.xlabel('beta')
plt.ylim(-77000, -57000)
print(min(lik_beta), max(lik_beta))
In [245]:
alpha=0.1
beta =0.75
iter_topics = range(2,6)
lik_topics = []
for n_topics in iter_topics:
    print('n_topics:', n_topics)
    new = initialisation(n_docs, n_topics, n_words, texts_idx)
    doc_topic = new[0]
    doc = new[1]
    word_topic = new[2]
    topic = new[3]
    topics_peridx = new[4]
    lik = LDA(20, alpha, beta)
    lik_topics += [lik]
In [251]:
fig = plt.figure()
plt.plot(iter_topics,lik_topics)
plt.ylabel('loglikelihood')
plt.xlabel('number of topics')
plt.ylim(-290000, -210000)
print(min(lik_topics), max(lik_topics))
We decided to study the collapsed Gibbs sampling implementation of Latent Dirichlet Allocation because of its similarities with algorithms seen in the course. One idea would have been to compare our results with other implementations, but given the running time of our implementation and the low capacity of the computer used for this study, this script would have difficulty competing with alternatives, especially those written in C (cf. https://github.com/davidandrzej/cvbLDA for a direct implementation of the article's methodology).
One remark to make is the rather poor likelihood. This scales with the size of the corpus (cf. a previous experiment where n_docs=50 gave loglikelihood = -61950; see also the Experimental notebook that we attached).
Furthermore, on top of the rather slow computation time, we can see that it performs better with a lower number of topics. Even if this is understandable, since the difficulty of partitioning assignment probabilities into several groups increases with the number of groups, it is a pity for real data where we want to cluster into many groups.