When working with words, dealing with the huge but sparse domain of language can be challenging. Even for a small corpus, your neural network (or any type of model) needs to support many thousands of discrete inputs and outputs.
Besides the sheer number of words, the standard technique of representing words as one-hot vectors (e.g. "the" = [0 0 0 1 0 0 0 0 ...]) does not capture any information about relationships between words.
Word vectors address this problem by representing words in a multi-dimensional vector space. This can bring the dimensionality of the problem from hundreds-of-thousands to just hundreds. Plus, the vector space is able to capture semantic relationships between words in terms of distance and vector arithmetic.
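To make the contrast concrete, here's a rough sketch (the vocabulary size and the dense values below are made up purely for illustration):

import torch

# One-hot: one dimension per vocabulary word, all zeros except a single 1.
# With a 100,000-word vocabulary, "the" might look like:
vocab_size = 100_000
the_onehot = torch.zeros(vocab_size)
the_onehot[3] = 1  # the index 3 is arbitrary here

# Dense word vector: a few hundred learned dimensions, all informative.
# (Values are invented; real vectors come from training.)
the_dense = torch.tensor([0.21, -1.03, 0.54, 0.07])  # imagine ~100 dims

# Similar words end up close together in the dense space, whereas any two
# distinct one-hot vectors are always equally far apart.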
There are a few techniques for creating word vectors. The word2vec algorithm predicts words in a context (e.g. what is the most likely word to appear in "the cat ? the mouse"), while GloVe vectors are based on global counts across the corpus — see How is GloVe different from word2vec? on Quora for some better explanations.
In my opinion the best feature of GloVe is that multiple sets of pre-trained vectors are easily available for download, so that's what we'll use here.
In [1]:
import torch
import torchtext.vocab as vocab
In [2]:
glove = vocab.GloVe(name='6B', dim=100)
print('Loaded {} words'.format(len(glove.itos)))
The returned GloVe object includes these attributes:

- stoi (string-to-index): a dictionary mapping words to indexes
- itos (index-to-string): a list of words ordered by index
- vectors: the actual word vectors, one row per word

To get the vector for a word, look up its index with stoi and use it to index into vectors:
In [3]:
def get_word(word):
    return glove.vectors[glove.stoi[word]]
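As a quick sanity check, here's a sketch of how the attributes fit together (the exact index values depend on the downloaded vectors, so don't rely on them):

print(get_word('the').shape)   # torch.Size([100]) for the dim=100 vectors
idx = glove.stoi['cat']        # word -> index
print(glove.itos[idx])         # index -> word, round-trips back to 'cat'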
In [4]:
def closest(vec, n=10):
    """
    Find the closest words for a given vector
    """
    all_dists = [(w, torch.dist(vec, get_word(w))) for w in glove.itos]
    return sorted(all_dists, key=lambda t: t[1])[:n]
This returns a list of (word, distance) tuples. Here's a helper function to print that list:
In [5]:
def print_tuples(tuples):
    for word, dist in tuples:
        print('(%.4f) %s' % (dist, word))
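A side note: looping over the whole vocabulary in Python is slow. If speed matters, an equivalent vectorized sketch (same glove object, same Euclidean distance) could look like this:

def closest_fast(vec, n=10):
    # Compute distances to every vector at once via broadcasting
    dists = torch.norm(glove.vectors - vec, dim=1)
    # Take the n smallest distances and map indexes back to words
    top_dists, top_idxs = torch.topk(dists, n, largest=False)
    return [(glove.itos[i], d.item()) for i, d in zip(top_idxs.tolist(), top_dists)]

It returns the same (word, distance) pairs, so print_tuples works on its output too.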
Now using a known word vector we can see which other vectors are closest:
In [6]:
print_tuples(closest(get_word('google')))
The most interesting feature of a well-trained word vector space is that certain semantic relationships (beyond just closeness of words) can be captured with regular vector arithmetic.
(image borrowed from a slide by Omer Levy and Yoav Goldberg)
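For example, with the helpers defined above, the raw arithmetic behind the classic king/queen analogy is a one-liner; the helper defined in the next cell just generalizes this pattern:

# king - man + woman should land near "queen"
print_tuples(closest(get_word('king') - get_word('man') + get_word('woman')))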
In [6]:
# In the form w1 : w2 :: w3 : ?
def analogy(w1, w2, w3, n=5, filter_given=True):
    print('\n[%s : %s :: %s : ?]' % (w1, w2, w3))
    # w2 - w1 + w3 = w4
    closest_words = closest(get_word(w2) - get_word(w1) + get_word(w3))
    # Optionally filter out the given words
    if filter_given:
        closest_words = [t for t in closest_words if t[0] not in [w1, w2, w3]]
    print_tuples(closest_words[:n])
The classic example:
In [7]:
analogy('king', 'man', 'queen')
Now let's explore the word space and see what stereotypes we can uncover:
In [8]:
analogy('man', 'actor', 'woman')
analogy('cat', 'kitten', 'dog')
analogy('dog', 'puppy', 'cat')
analogy('russia', 'moscow', 'france')
analogy('obama', 'president', 'trump')
analogy('rich', 'mansion', 'poor')
analogy('elvis', 'rock', 'eminem')
analogy('paper', 'newspaper', 'screen')
analogy('monet', 'paint', 'michelangelo')
analogy('beer', 'barley', 'wine')
analogy('earth', 'moon', 'sun') # Interesting failure mode
analogy('house', 'roof', 'castle')
analogy('building', 'architect', 'software')
analogy('boston', 'bruins', 'phoenix')
analogy('good', 'heaven', 'bad')
analogy('jordan', 'basketball', 'woods')