EGC is a French-speaking conference on knowledge discovery in databases (KDD). In this notebook we show how to use TOM to infer, with non-negative matrix factorization, the latent topics that pervade the corpus of articles published at EGC between 2004 and 2015. Based on the discovered topics, we then use TOM to shed light on interesting facts about the topical structure of the EGC society.
We prune words whose absolute frequency in the corpus is less than 4, as well as words whose relative frequency is higher than 80%, so as to keep only the most significant ones. We then build the vector space representation of these articles with $tf \cdot idf$ weighting. It is an $n \times m$ matrix, denoted by $A$, where each row represents an article, with $n = 817$ (the number of articles) and $m = 1738$ (the number of words).
In [1]:
from tom_lib.structure.corpus import Corpus
from tom_lib.visualization.visualization import Visualization
corpus = Corpus(source_file_path='input/egc_lemmatized.csv',
                language='french',
                vectorization='tfidf',
                max_relative_frequency=0.8,
                min_absolute_frequency=4)
print('corpus size:', corpus.size)
print('vocabulary size:', len(corpus.vocabulary))
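For comparison, the pruning and weighting described above can be sketched with scikit-learn's TfidfVectorizer. This is only an illustrative analogue, not TOM's internal implementation: the tab separator and the 'text' column name are assumptions about the CSV layout, and min_df thresholds the number of documents a word appears in rather than its total number of occurrences.
In [ ]:
# Illustrative sketch only (assumed CSV layout; not TOM's own code).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = pd.read_csv('input/egc_lemmatized.csv', sep='\t')  # assumed tab-separated, with a 'text' column
vectorizer = TfidfVectorizer(min_df=4,    # ignore words that appear in fewer than 4 documents
                             max_df=0.8)  # ignore words that appear in more than 80% of documents
A = vectorizer.fit_transform(docs['text'])  # sparse n x m document-term matrix with tf-idf weights
print(A.shape)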
Non-negative matrix factorization approximates $A$, the document-term matrix, in the following way:
$$ A \approx HW $$
where $H$ is an $n \times k$ matrix that describes the documents in terms of topics, and $W$ is a $k \times m$ matrix that describes the topics in terms of words. More precisely, the coefficient $h_{i,j}$ quantifies the importance of topic $j$ in article $i$, and the coefficient $w_{i,j}$ quantifies the importance of word $j$ in topic $i$.
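As a minimal sketch of this factorization, here is the equivalent decomposition with scikit-learn's NMF, applied to the matrix $A$ built in the previous sketch (shown for illustration only). Note that scikit-learn's naming is reversed with respect to the notation above: fit_transform() returns the document-topic matrix and components_ holds the topic-word matrix.
In [ ]:
from sklearn.decomposition import NMF

k = 15  # number of topics; how to choose this value is discussed next
nmf = NMF(n_components=k, init='nndsvd', random_state=0)
H = nmf.fit_transform(A)  # n x k: h[i, j] = importance of topic j in article i
W = nmf.components_       # k x m: w[i, j] = importance of word j in topic i
print(H.shape, W.shape)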
Determining an appropriate value of $k$ is critical to ensure a relevant analysis of the EGC anthology. If $k$ is too small, the discovered topics will be too vague; if $k$ is too large, they will be too narrow and may be redundant. To help with this task, we compute two metrics implemented in TOM: the stability metric proposed by Greene et al. (2014) and the spectral metric proposed by Arun et al. (2010).
In [2]:
from tom_lib.nlp.topic_model import NonNegativeMatrixFactorization
topic_model = NonNegativeMatrixFactorization(corpus)
In [3]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
output_notebook()
p = figure(plot_height=250)
p.line(range(10, 51), topic_model.greene_metric(min_num_topics=10, step=1, max_num_topics=50, top_n_words=10, tao=10), line_width=2)
show(p)
In [4]:
p = figure(plot_height=250)
p.line(range(10, 51), topic_model.arun_metric(min_num_topics=10, max_num_topics=50, iterations=10), line_width=2)
show(p)
Guided by the two metrics described above, we manually evaluate the quality of the topics identified with $k$ varying between 15 and 20. In the end, we judge that the best results are achieved with NMF for $k=15$.
In [5]:
k = 15
topic_model.infer_topics(num_topics=k)
The table below lists the most relevant words for each of the 15 topics discovered from the articles with NMF. They reveal that the people who form the EGC society are interested in a wide variety of both theoretical and applied issues. For instance, topics 11 and 12 are related to theoretical issues: topic 11 covers papers about model and variable selection, and topic 12 covers papers that propose new or improved learning algorithms. On the other hand, topics 13 and 6 are related to applied issues: topic 13 covers papers about social network analysis, and topic 6 covers papers about Web usage mining.
In [6]:
import pandas as pd
pd.set_option('display.max_colwidth', 500)
d = {'Most relevant words': [', '.join([word for word, weight in topic_model.top_words(i, 10)]) for i in range(k)]}
df = pd.DataFrame(data=d)
df.head(k)
Out[6]:
In the following, we leverage the discovered topics to highlight interesting particularities of the EGC society. To analyze the topics, supplemented with information about the related papers, we partition the papers into 15 non-overlapping clusters, i.e. one cluster per topic. Each article $i \in \{0, \dots, n-1\}$ is assigned to the cluster $j$ that corresponds to the topic with the highest weight $h_{i,j}$:
\begin{equation} \text{cluster}_i = \underset{j}{\mathrm{argmax}} \, h_{i,j} \label{eq:cluster} \end{equation}
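A minimal sketch of this assignment with NumPy, assuming $H$ is the $n \times k$ document-topic matrix from the factorization sketch above; the clusters array computed here is only for illustration, while the statistics plotted below come from TOM itself.
In [ ]:
import numpy as np

# H: n x k document-topic matrix (h[i, j] = weight of topic j in article i),
# e.g. the matrix returned by the factorization sketch above.
clusters = np.argmax(H, axis=1)  # cluster of article i = index of its dominant topic
print(np.bincount(clusters, minlength=k))  # number of articles per cluster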
In [7]:
p = figure(x_range=[str(_) for _ in range(k)], plot_height=350, x_axis_label='topic', y_axis_label='proportion')
p.vbar(x=[str(_) for _ in range(k)], top=topic_model.topics_frequency(), width=0.7)
show(p)
In [8]:
def plot_top_words(topic_id):
    words = [word for word, weight in topic_model.top_words(topic_id, 10)]
    weights = [weight for word, weight in topic_model.top_words(topic_id, 10)]
    p = figure(x_range=words, plot_height=300, plot_width=800, x_axis_label='word', y_axis_label='weight')
    p.vbar(x=words, top=weights, width=0.7)
    show(p)

def top_documents_df(topic_id):
    top_docs = topic_model.top_documents(topic_id, 3)
    d = {'Article title': [corpus.title(doc_id) for doc_id, weight in top_docs], 'Year': [int(corpus.date(doc_id)) for doc_id, weight in top_docs]}
    df = pd.DataFrame(data=d)
    return df
In [9]:
plot_top_words(12)
In [10]:
top_documents_df(12).head()
Out[10]:
In [11]:
plot_top_words(3)
In [12]:
top_documents_df(3).head()
Out[12]:
The figure below shows the frequency of topics 12 (social network analysis and mining) and 3 (association rule mining) per year, from 2004 to 2015. The frequency of a topic for a given year is defined as the proportion of the articles published that year that belong to the corresponding cluster. The figure reveals two opposite trends: topic 12 is emerging while topic 3 is fading over time. While apparently no article was about social network analysis in 2004, 12% of the articles presented at the 2013 conference were related to this topic. In contrast, papers related to association rule mining were the most frequent in 2006 (12%), but their frequency dropped to as low as 0.2% in 2014. This illustrates how the attention of the members of the EGC society shifts between topics over time, and shows that the society keeps evolving and enlarging its scope to incorporate work on novel issues.
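As a sketch, this per-year frequency can be computed directly from the illustrative clusters array built earlier; it mirrors the stated definition, not necessarily TOM's exact implementation of topic_frequency(). The corpus.date() accessor is the one already used in top_documents_df() above, and the helper name yearly_frequency is introduced here for illustration only.
In [ ]:
def yearly_frequency(topic_id, year):
    # Proportion of the articles published in `year` whose dominant topic is `topic_id`,
    # based on the illustrative `clusters` array computed above.
    doc_ids = [i for i in range(corpus.size) if int(corpus.date(i)) == year]
    if not doc_ids:
        return 0.0
    return sum(1 for i in doc_ids if clusters[i] == topic_id) / len(doc_ids)

print([round(yearly_frequency(12, year), 3) for year in range(2004, 2016)])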
In [13]:
p = figure(plot_height=250, x_axis_label='year', y_axis_label='topic frequency')
p.line(range(2004, 2016), [topic_model.topic_frequency(3, date=i) for i in range(2004, 2016)], line_width=2, line_color='blue', legend='topic #3')
p.line(range(2004, 2016), [topic_model.topic_frequency(12, date=i) for i in range(2004, 2016)], line_width=2, line_color='red', legend='topic #12')
show(p)