Introduction: To validate any model adapted from a paper, it is important that benchmark results on the datasets listed in the paper match between the reference implementation (Palmetto) and gensim. This coherence pipeline implements the work of Röder et al., "Exploring the Space of Topic Coherence Measures". The paper can be found here.
Approach:
In [1]:
from __future__ import print_function
import re
import os
from scipy.stats import pearsonr
from datetime import datetime
from gensim.models import CoherenceModel
from gensim.corpora.dictionary import Dictionary
Download the dataset (movie.zip) and the gold standard data (topicsMovie.txt and goldMovie.txt) from the link and plug in the locations below.
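The download step can also be scripted. The sketch below is illustrative only and assumes Python 3's standard library; the actual download URL is not reproduced here, so MOVIE_ZIP_URL is a placeholder you would fill in from the link above.
import os
import zipfile
from urllib.request import urlretrieve  # Python 3 standard library

MOVIE_ZIP_URL = "..."  # placeholder: the movie.zip link referenced above
base_dir = os.path.join(os.path.expanduser('~'), "workshop/nlp/data/")
os.makedirs(base_dir, exist_ok=True)

zip_path = os.path.join(base_dir, "movie.zip")
if not os.path.exists(zip_path):
    urlretrieve(MOVIE_ZIP_URL, zip_path)  # fetch the archive
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(base_dir)               # unpack into the data directory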
In [2]:
base_dir = os.path.join(os.path.expanduser('~'), "workshop/nlp/data/")
data_dir = os.path.join(base_dir, 'wiki-movie-subset')
if not os.path.exists(data_dir):
    raise ValueError("SKIP: Please download the movie corpus.")
ref_dir = os.path.join(base_dir, 'reference')
topics_path = os.path.join(ref_dir, 'topicsMovie.txt')
human_scores_path = os.path.join(ref_dir, 'goldMovie.txt')
In [3]:
%%time
texts = []
file_num = 0
preprocessed = 0
listing = os.listdir(data_dir)
for fname in listing:
    file_num += 1
    if 'disambiguation' in fname:
        continue  # discard disambiguation and redirect pages
    elif fname.startswith('File_'):
        continue  # discard images, gifs, etc.
    elif fname.startswith('Category_'):
        continue  # discard category articles

    # Not sure how to identify portal and redirect pages,
    # as well as pages about a single year.
    # As a result, this preprocessing differs from the paper.

    with open(os.path.join(data_dir, fname)) as f:
        for line in f:
            # lower case all words
            lowered = line.lower()
            # remove punctuation and split into separate words
            words = re.findall(r'\w+', lowered, flags=re.UNICODE)
            texts.append(words)

    preprocessed += 1
    if file_num % 10000 == 0:
        print('PROGRESS: %d/%d, preprocessed %d, discarded %d' % (
            file_num, len(listing), preprocessed, (file_num - preprocessed)))
In [4]:
%%time
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
According to the paper, the number of documents should be 108,952 with a vocabulary of 1,625,124. The discrepancy here comes from the difference in preprocessing; however, the results obtained are still very similar.
In [5]:
print(len(corpus))
print(dictionary)
In [6]:
topics = [] # list of 100 topics
with open(topics_path) as f:
    topics = [line.split() for line in f if line]
len(topics)
Out[6]:
In [7]:
human_scores = []
with open(human_scores_path) as f:
    for line in f:
        human_scores.append(float(line.strip()))
len(human_scores)
Out[7]:
In [8]:
# We first need to filter out any topics that contain terms not in our dictionary
# These may occur as a result of preprocessing steps differing from those used to
# produce the reference topics. In this case, this only occurs in one topic.
invalid_topic_indices = set(
    i for i, topic in enumerate(topics)
    if any(t not in dictionary.token2id for t in topic)
)
print("Topics with out-of-vocab terms: %s" % ', '.join(map(str, invalid_topic_indices)))
usable_topics = [topic for i, topic in enumerate(topics) if i not in invalid_topic_indices]
In [9]:
%%time
cm = CoherenceModel(topics=usable_topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')
u_mass = cm.get_coherence_per_topic()
print("Calculated u_mass coherence for %d topics" % len(u_mass))
In [10]:
%%time
cm = CoherenceModel(topics=usable_topics, texts=texts, dictionary=dictionary, coherence='c_v')
c_v = cm.get_coherence_per_topic()
print("Calculated c_v coherence for %d topics" % len(c_v))
c_v, c_uci and c_npmi all use the boolean sliding window approach to estimating word probabilities. Since the CoherenceModel caches the accumulated statistics, calculating c_uci and c_npmi is practically free once the c_v coherence has been computed. These two measures are simpler and were shown to correlate with human judgements less strongly than c_v, but more strongly than u_mass.
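To make the boolean sliding window idea concrete, here is a small illustrative sketch (not the CoherenceModel internals): each window position is treated as a virtual document, and a word's probability is the fraction of windows that contain it at least once. The function name and toy text are made up for the example; the paper uses a window size of 110 for c_v.
def boolean_sliding_window_prob(tokens, word, window_size=10):
    # Estimate P(word) as (# windows containing the word) / (# windows).
    # "Boolean" means a word counts at most once per window, regardless
    # of how many times it occurs within it.
    n_windows = 0
    n_windows_with_word = 0
    for start in range(max(1, len(tokens) - window_size + 1)):
        window = tokens[start:start + window_size]
        n_windows += 1
        if word in set(window):  # presence, not frequency
            n_windows_with_word += 1
    return n_windows_with_word / float(n_windows)

toy_doc = "the cat sat on the mat while the dog watched the cat".split()
print(boolean_sliding_window_prob(toy_doc, "cat", window_size=5))  # 3 of 8 windows -> 0.375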
In [11]:
%%time
cm.coherence = 'c_uci'
c_uci = cm.get_coherence_per_topic()
print("Calculated c_uci coherence for %d topics" % len(c_uci))
In [12]:
%%time
cm.coherence = 'c_npmi'
c_npmi = cm.get_coherence_per_topic()
print("Calculated c_npmi coherence for %d topics" % len(c_npmi))
In [13]:
final_scores = [
    score for i, score in enumerate(human_scores)
    if i not in invalid_topic_indices
]
len(final_scores)
Out[13]:
The values reported in the paper were:
u_mass correlation: 0.093
c_v correlation: 0.548
c_uci correlation: 0.473
c_npmi correlation: 0.438
Our values are very close to these, which validates the correctness of the pipeline: the remaining differences can reasonably be attributed to the differences in preprocessing.
In [14]:
for our_scores in (u_mass, c_v, c_uci, c_npmi):
    print(pearsonr(our_scores, final_scores)[0])
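For easier side-by-side comparison, the same correlations can also be printed with labels next to the paper's values; this is just a labelled variant of the loop above, with the paper's numbers hard-coded from the table.
# Print each measure's Pearson correlation alongside the value reported in the paper.
paper_values = {'u_mass': 0.093, 'c_v': 0.548, 'c_uci': 0.473, 'c_npmi': 0.438}
our_scores_by_name = {'u_mass': u_mass, 'c_v': c_v, 'c_uci': c_uci, 'c_npmi': c_npmi}
for name in ('u_mass', 'c_v', 'c_uci', 'c_npmi'):
    r, _ = pearsonr(our_scores_by_name[name], final_scores)
    print('%s: ours=%.3f, paper=%.3f' % (name, r, paper_values[name]))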