Wordrank is a new approach to word embeddings that formulates embedding learning as a ranking problem: given a word w, it aims to output an ordered list (c1, c2, · · ·) of context words such that words that co-occur with w appear at the top of the list. This formulation fits naturally with popular word-embedding tasks such as word similarity and word analogy, since rather than modeling the likelihood of each word, we are interested in finding the most relevant words in a given context [1].
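To see the "ranked list" view in practice, a nearest-neighbour query does exactly this: the top of the returned list should hold the words most relevant to the query word. A minimal sketch (runnable only after the training cells below; the query word and model path are illustrative assumptions):

from gensim.models import KeyedVectors
# Load previously trained vectors (assumes the training cells below have been run)
vectors = KeyedVectors.load_word2vec_format('models/brown_wr.vec')
# Words that co-occur most with 'king' should appear at the top of the ranked list
print(vectors.most_similar('king', topn=5))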
This notebook accompanies a more theoretical blog post here.
Gensim is used to train and evaluate the models, which are compared on analogical reasoning and word similarity tasks. The Word2Vec and FastText embeddings here are trained with the skip-gram architecture.
In [1]:
import nltk
from gensim.parsing.preprocessing import strip_punctuation, strip_multiple_whitespaces
# Download the Brown corpus in case you don't have it
nltk.download('brown')

# Generate the Brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
    f.seek(0)
    brown = f.read()

# Preprocess the Brown corpus
with open('proc_brown_corp.txt', 'w') as f:
    proc_brown = strip_punctuation(brown)
    proc_brown = strip_multiple_whitespaces(proc_brown).lower()
    f.write(proc_brown)

# Set WR_HOME and FT_HOME to the respective installation roots
WR_HOME = 'wordrank/'
FT_HOME = 'fastText/'

# Download the text8 corpus (a 100 MB sample of preprocessed Wikipedia text)
import os.path
if not os.path.isfile('text8'):
    !wget -c http://mattmahoney.net/dc/text8.zip
    !unzip text8.zip
In [2]:
MODELS_DIR = 'models/'
!mkdir -p {MODELS_DIR}
from gensim.models import Word2Vec
from gensim.models.wrappers import Wordrank
from gensim.models.word2vec import Text8Corpus
# fasttext params
lr = 0.05
dim = 100
ws = 5
epoch = 5
minCount = 5
neg = 5
loss = 'ns'
t = 1e-4
w2v_params = {
    'alpha': 0.025,
    'size': 100,
    'window': 15,
    'iter': 5,
    'min_count': 5,
    'sample': t,
    'sg': 1,
    'hs': 0,
    'negative': 5
}
wr_params = {
    'size': 100,
    'window': 15,
    'iter': 91,
    'min_count': 5
}
def train_models(corpus_file, output_name):
    # Train using word2vec
    output_file = '{:s}_gs'.format(output_name)
    if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):
        print('\nTraining word2vec on {:s} corpus..'.format(corpus_file))
        # Text8Corpus class for reading a space-separated words file
        %time gs_model = Word2Vec(Text8Corpus(corpus_file), **w2v_params); gs_model
        locals()['gs_model'].save_word2vec_format(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file)))
        print('\nSaved gensim model as {:s}.vec'.format(output_file))
    else:
        print('\nUsing existing model file {:s}.vec'.format(output_file))

    # Train using fasttext
    output_file = '{:s}_ft'.format(output_name)
    if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):
        print('Training fasttext on {:s} corpus..'.format(corpus_file))
        %time !{FT_HOME}fasttext skipgram -input {corpus_file} -output {MODELS_DIR+output_file} -lr {lr} -dim {dim} -ws {ws} -epoch {epoch} -minCount {minCount} -neg {neg} -loss {loss} -t {t}
    else:
        print('\nUsing existing model file {:s}.vec'.format(output_file))

    # Train using wordrank
    output_file = '{:s}_wr'.format(output_name)
    output_dir = 'model'  # directory to save embeddings and metadata to
    if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):
        print('\nTraining wordrank on {:s} corpus..'.format(corpus_file))
        %time wr_model = Wordrank.train(WR_HOME, corpus_file, output_dir, **wr_params); wr_model
        locals()['wr_model'].save_word2vec_format(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file)))
        print('\nSaved wordrank model as {:s}.vec'.format(output_file))
    else:
        print('\nUsing existing model file {:s}.vec'.format(output_file))

    # Load ensemble embeddings (vector combination of word and context embeddings)
    output_file = '{:s}_wr_ensemble'.format(output_name)
    if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):
        print('\nLoading ensemble embeddings (vector combination of word and context embeddings)..')
        %time wr_model = Wordrank.load_wordrank_model(os.path.join(WR_HOME, 'model/wordrank.words'), os.path.join(WR_HOME, 'model/meta/vocab.txt'), os.path.join(WR_HOME, 'model/wordrank.contexts'), sorted_vocab=1, ensemble=1); wr_model
        locals()['wr_model'].wv.save_word2vec_format(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file)))
        print('\nSaved wordrank (ensemble) model as {:s}.vec'.format(output_file))
    else:
        print('\nUsing existing model file {:s}.vec'.format(output_file))

train_models(corpus_file='proc_brown_corp.txt', output_name='brown')
In [3]:
train_models(corpus_file='text8', output_name='text8')
Above, we additionally load WordRank's ensemble embeddings in the second case, since ensembling is known to give a small performance boost in some cases. We'll therefore test accuracy for both variants.
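Conceptually, the ensemble embedding combines each word's embedding with its context embedding. A minimal sketch of the idea, assuming simple addition as the combination rule (the helper below is illustrative, not a replacement for Wordrank.load_wordrank_model):

import numpy as np

def ensemble_vector(word_vec, context_vec):
    # Combine a word embedding with its context embedding by addition;
    # this mirrors the idea behind the ensemble option used above.
    return np.asarray(word_vec) + np.asarray(context_vec)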
In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
def print_analogy_accuracy(model, questions_file):
    acc = model.accuracy(questions_file)

    sem_correct = sum(len(acc[i]['correct']) for i in range(5))
    sem_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5))
    sem_acc = 100 * float(sem_correct) / sem_total
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, sem_acc))

    syn_correct = sum(len(acc[i]['correct']) for i in range(5, len(acc) - 1))
    syn_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5, len(acc) - 1))
    syn_acc = 100 * float(syn_correct) / syn_total
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, syn_acc))

def print_similarity_accuracy(model, similarity_file):
    acc = model.evaluate_word_pairs(similarity_file)
    print('Pearson correlation coefficient: {:.2f}'.format(acc[0][0]))
    print('Spearman rank correlation coefficient: {:.2f}'.format(acc[1][0]))
In [2]:
from gensim.models import KeyedVectors

MODELS_DIR = 'models/'
word_analogies_file = './datasets/questions-words.txt'
simlex_file = '../../gensim/test/test_data/simlex999.txt'
wordsim_file = '../../gensim/test/test_data/wordsim353.tsv'
print('\nLoading Gensim embeddings')
brown_gs = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for Word2Vec:')
print_analogy_accuracy(brown_gs, word_analogies_file)
print('SimLex-999 similarity')
print_similarity_accuracy(brown_gs, simlex_file)
print('\nWordSim-353 similarity')
print_similarity_accuracy(brown_gs, wordsim_file)
print('\nLoading FastText embeddings')
brown_ft = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_analogy_accuracy(brown_ft, word_analogies_file)
print('SimLex-999 similarity')
print_similarity_accuracy(brown_ft, simlex_file)
print('\nWordSim-353 similarity')
print_similarity_accuracy(brown_ft, wordsim_file)
print('\nLoading Wordrank embeddings')
brown_wr = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_wr.vec')
print('Accuracy for Wordrank:')
print_analogy_accuracy(brown_wr, word_analogies_file)
print('SimLex-999 similarity')
print_similarity_accuracy(brown_wr, simlex_file)
print('\nWordSim-353 similarity')
print_similarity_accuracy(brown_wr, wordsim_file)
print('\nLoading Wordrank ensemble embeddings')
brown_wr_ensemble = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_wr_ensemble.vec')
print('Accuracy for Wordrank (ensemble):')
print_analogy_accuracy(brown_wr_ensemble, word_analogies_file)
print('SimLex-999 similarity')
print_similarity_accuracy(brown_wr_ensemble, simlex_file)
print('\nWordSim-353 similarity')
print_similarity_accuracy(brown_wr_ensemble, wordsim_file)
As evident from the above outputs, WordRank performs significantly better on semantic analogies, whereas FastText leads on syntactic analogies. The ensemble embeddings also give WordRank a small performance boost here.
WordRank's effectiveness on semantic analogies is plausibly due to its ranking objective, which focuses on getting the most relevant words right at the top of the list. FastText, in turn, is designed to incorporate morphological information about words, which boosts its performance on syntactic analogies, since most syntactic analogies are morphology-based [2].
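A quick, informal way to see the morphology effect (the word pairs below are illustrative assumptions; the scores depend on the trained model):

# Morphological variants tend to sit close together in FastText's space,
# since their subword n-grams overlap heavily.
for w1, w2 in [('great', 'greater'), ('quick', 'quickly')]:
    if w1 in brown_ft.vocab and w2 in brown_ft.vocab:
        print('{} ~ {}: {:.3f}'.format(w1, w2, brown_ft.similarity(w1, w2)))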
For word similarity, Word2Vec performed better on the SimLex-999 test data, whereas WordRank did better on WS-353. This is probably due to the different types of similarity these datasets measure: SimLex-999 scores how well two words are interchangeable in similar contexts, while WS-353 estimates the relatedness or co-occurrence of two words. Also, ensemble embeddings don't help in the word similarity task [1], as the results above show, so we'll use just the word embeddings for it.
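To make that distinction concrete, we can contrast a near-synonym pair (SimLex-style similarity) with a merely related pair (WS-353-style relatedness). The word pairs are illustrative assumptions:

# ('cup', 'mug') are near-synonyms, while ('cup', 'coffee') are merely related;
# a model tuned for relatedness would score the second pair relatively higher.
for w1, w2 in [('cup', 'mug'), ('cup', 'coffee')]:
    if w1 in brown_gs.vocab and w2 in brown_gs.vocab:
        print('{} ~ {}: {:.3f}'.format(w1, w2, brown_gs.similarity(w1, w2)))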
Now let's evaluate on a larger corpus, text8, and see how it affects the performance of the different embedding models.
In [3]:
print('Loading Gensim embeddings')
text8_gs = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for Word2Vec:')
print_analogy_accuracy(text8_gs, word_analogies_file)
print('SimLex-999 similarity')
print_similarity_accuracy(text8_gs, simlex_file)
print('\nWordSim-353 similarity')
print_similarity_accuracy(text8_gs, wordsim_file)
print('Loading FastText embeddings')
text8_ft = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText (with n-grams):')
print_analogy_accuracy(text8_ft, word_analogies_file)
print('SimLex-999 similarity')
print_similarity_accuracy(text8_ft, simlex_file)
print('\nWordSim-353 similarity')
print_similarity_accuracy(text8_ft, wordsim_file)
print('\nLoading Wordrank embeddings')
text8_wr = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_wr.vec')
print('Accuracy for Wordrank:')
print_analogy_accuracy(text8_wr, word_analogies_file)
print('SimLex-999 similarity')
print_similarity_accuracy(text8_wr, simlex_file)
print('\nWordSim-353 similarity')
print_similarity_accuracy(text8_wr, wordsim_file)
print('\nLoading Wordrank ensemble embeddings')
text8_wr_ensemble = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_wr_ensemble.vec')
print('Accuracy for Wordrank (ensemble):')
print_analogy_accuracy(text8_wr_ensemble, word_analogies_file)
print('SimLex-999 similarity')
print_similarity_accuracy(text8_wr_ensemble, simlex_file)
print('\nWordSim-353 similarity')
print_similarity_accuracy(text8_wr_ensemble, wordsim_file)
With a larger corpus, we observe similar patterns in the accuracies: WordRank again dominates the semantic analogies and FastText the syntactic ones, and Word2Vec again performs better on SimLex-999 while WordRank does on WS-353. Here, though, the ensemble embeddings cause a small performance drop for WordRank, so it's worth trying both variants when evaluating.
In this section, we'll see whether the frequency of a word has any effect on the embedding models' performance in the analogy task, using an accuracy-vs-frequency graph. The mean frequency of the four words involved in each analogy is computed, and analogies with similar mean frequencies are bucketed together; each bucket holds six percent of the total analogies for the given task. You can go to this repo if you want to inspect which analogies (with their sorted frequencies) were used for each plot.
In [9]:
from __future__ import division
import copy

import matplotlib.pyplot as plt
import numpy as np

def compute_accuracies(model_file, freq):
    # mean_freq maps an analogy id to [analogy, mean frequency of the 4 words involved]
    mean_freq = {}
    with open(word_analogies_file, 'r') as r:
        for i, line in enumerate(r):
            if ':' not in line:
                analogy = tuple(line.split())
            else:
                continue
            try:
                mfreq = sum(int(freq[x.lower()]) for x in analogy) / 4
                mean_freq['a%d' % i] = [analogy, mfreq]
            except KeyError:
                continue
    # compute the model's accuracy
    model = KeyedVectors.load_word2vec_format(model_file)
    acc = model.accuracy(word_analogies_file)
    sem_correct = [acc[i]['correct'] for i in range(5)]
    sem_total = [acc[i]['correct'] + acc[i]['incorrect'] for i in range(5)]
    syn_correct = [acc[i]['correct'] for i in range(5, len(acc) - 1)]
    syn_total = [acc[i]['correct'] + acc[i]['incorrect'] for i in range(5, len(acc) - 1)]
    total_correct = sem_correct + syn_correct
    total_total = sem_total + syn_total
    sem_x, sem_y = calc_axis(sem_correct, sem_total, mean_freq)
    syn_x, syn_y = calc_axis(syn_correct, syn_total, mean_freq)
    total_x, total_y = calc_axis(total_correct, total_total, mean_freq)
    return ((sem_x, sem_y), (syn_x, syn_y), (total_x, total_y))

def calc_axis(correct, total, mean_freq):
    # make flat lists of analogies
    correct_analogies = [analogy for section in correct for analogy in section]
    total_analogies = [analogy for section in total for analogy in section]
    copy_mean_freq = copy.deepcopy(mean_freq)
    # drop analogies belonging to the other case (semantic vs. syntactic)
    for key, value in list(copy_mean_freq.items()):
        value[0] = tuple(x.upper() for x in value[0])
        if value[0] not in total_analogies:
            del copy_mean_freq[key]
    # append 1 for a correct analogy, 0 for an incorrect one
    for key, value in copy_mean_freq.items():
        if value[0] in correct_analogies:
            copy_mean_freq[key].append(1)
        else:
            copy_mean_freq[key].append(0)
    x = []
    y = []
    bucket_size = int(len(copy_mean_freq) * 0.06)
    # sort analogies according to their mean frequencies
    copy_mean_freq = sorted(copy_mean_freq.items(), key=lambda item: item[1][1])
    # prepare analogy buckets of the given size and compute per-bucket accuracy
    for centre_p in range(bucket_size // 2, len(copy_mean_freq), bucket_size):
        bucket = copy_mean_freq[centre_p - bucket_size // 2:centre_p + bucket_size // 2]
        b_acc = sum(1 for analogy in bucket if analogy[1][2] == 1)
        y.append(b_acc / bucket_size)
        x.append(np.log(copy_mean_freq[centre_p][1][1]))
    return x, y

# a sample model using gensim's Word2Vec, just for getting vocab counts
corpus = Text8Corpus('proc_brown_corp.txt')
model = Word2Vec(min_count=5)
model.build_vocab(corpus)
freq = {}
for word in model.wv.index2word:
    freq[word] = model.wv.vocab[word].count

# plot results
word2vec = compute_accuracies(MODELS_DIR + 'brown_gs.vec', freq)
wordrank = compute_accuracies(MODELS_DIR + 'brown_wr_ensemble.vec', freq)
fasttext = compute_accuracies(MODELS_DIR + 'brown_ft.vec', freq)
fig = plt.figure(figsize=(7, 15))
for i, subplot, title in zip([0, 1, 2], ['311', '312', '313'], ['Semantic Analogies', 'Syntactic Analogies', 'Total Analogies']):
    ax = fig.add_subplot(subplot)
    ax.plot(word2vec[i][0], word2vec[i][1], 'r-', label='Word2Vec')
    ax.plot(wordrank[i][0], wordrank[i][1], 'g--', label='WordRank')
    ax.plot(fasttext[i][0], fasttext[i][1], 'b:', label='FastText')
    ax.set_ylabel('Average accuracy')
    ax.set_xlabel('Log mean frequency')
    ax.set_title(title)
    ax.legend(loc='upper right', prop={'size': 10})
plt.show()
These graphs show the results of training on the Brown corpus (1 million tokens).
The main observations that can be drawn here are:
Now, let's see whether a larger corpus changes this pattern of model performance across frequencies.
In [10]:
# a sample model using gensim's Word2Vec, just for getting vocab counts
corpus = Text8Corpus('text8')
model = Word2Vec(min_count=5)
model.build_vocab(corpus)
freq = {}
for word in model.wv.index2word:
    freq[word] = model.wv.vocab[word].count

# plot results
word2vec = compute_accuracies(MODELS_DIR + 'text8_gs.vec', freq)
wordrank = compute_accuracies(MODELS_DIR + 'text8_wr.vec', freq)
fasttext = compute_accuracies(MODELS_DIR + 'text8_ft.vec', freq)
fig = plt.figure(figsize=(7, 15))
for i, subplot, title in zip([0, 1, 2], ['311', '312', '313'], ['Semantic Analogies', 'Syntactic Analogies', 'Total Analogies']):
    ax = fig.add_subplot(subplot)
    ax.plot(word2vec[i][0], word2vec[i][1], 'r-', label='Word2Vec')
    ax.plot(wordrank[i][0], wordrank[i][1], 'g--', label='WordRank')
    ax.plot(fasttext[i][0], fasttext[i][1], 'b:', label='FastText')
    ax.set_ylabel('Average accuracy')
    ax.set_xlabel('Log mean frequency')
    ax.set_title(title)
    ax.legend(loc='upper right', prop={'size': 10})
plt.show()
These graphs show the results for text8 (17 million tokens). The following points can be observed in this case:
The graphs also suggest that WordRank is the best-suited method for semantic analogies and FastText for syntactic analogies, across all frequency ranges and both corpus sizes, though all the embedding methods can become very competitive as the corpus size grows larger [2].
The experiments here support two main conclusions about comparing word embeddings. First, there is no single global embedding model we can rely on across different types of NLP applications. For example, in word similarity, WordRank performed better than the other two algorithms on the WS-353 test data, whereas Word2Vec performed better on SimLex-999; this is probably due to the different types of similarity these datasets measure [3]. In the word analogy task, WordRank performed better on semantic analogies and FastText on syntactic ones. This tells us that we need to choose the embedding method carefully according to our final use case.
Second, our query words matter, beyond the generalized model performance. As the accuracy-vs-frequency graphs show, models perform differently depending on the frequency of the analogy words in the training corpus. For example, we are likely to get poor results if our query words are all highly frequent.
Note: WordRank can sometimes produce NaN values during model evaluation, when the embedding values diverge too far at some iteration. However, it dumps the embedding vectors every few iterations, so you can simply load the embeddings from an earlier iteration's text file.
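For example, reusing the wrapper paths from the training cell above, recovery might look like the sketch below. The iteration-specific dump names are hypothetical assumptions; check your wordrank output directory for the actual file names.

# Hypothetical recovery: load an earlier iteration's dump instead of the final,
# NaN-containing one.
wr_model = Wordrank.load_wordrank_model(
    os.path.join(WR_HOME, 'model/wordrank_iter_80.words'),      # hypothetical dump name
    os.path.join(WR_HOME, 'model/meta/vocab.txt'),
    os.path.join(WR_HOME, 'model/wordrank_iter_80.contexts'),   # hypothetical dump name
    sorted_vocab=1, ensemble=0)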