The goal of this assignment is to train a Word2Vec skip-gram model on the Text8 data.
Some reading material follows to get familiarised with the word2vec approach.
When you're dealing with words in text, you end up with tens of thousands of classes to predict, one for each word. Trying to one-hot encode these words is massively inefficient: you'll have one element set to 1 and the other 50,000 set to 0. The matrix multiplication going into the first hidden layer will have almost all of the resulting values be zero. This is a huge waste of computation.
To solve this problem and greatly increase the efficiency of our networks, we use what are called embeddings. An embedding is just a fully connected layer like you've seen before. We call this layer the embedding layer and its weights the embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding to the index of the "on" input unit.
Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you just take the 958th row of the embedding matrix. This process is called an embedding lookup and the number of hidden units is the embedding dimension.
There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix as well.
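As a quick sanity check of this claim, here is a minimal numpy sketch (the vocabulary size, embedding dimension, and word index are illustrative, not taken from the notebook): multiplying a one-hot vector into the weight matrix returns exactly the row you would get by indexing.
import numpy as np
vocab_size, embed_dim = 50000, 128        # illustrative sizes
embedding_weights = np.random.rand(vocab_size, embed_dim)
word_idx = 958                            # e.g. the integer code for "heart"
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1.0
via_matmul = one_hot @ embedding_weights  # multiply the one-hot vector into the layer
via_lookup = embedding_weights[word_idx]  # just grab that row of the weight matrix
assert np.allclose(via_matmul, via_lookup)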
Embeddings aren't only used for words of course. You can use them for any model where you have a massive number of classes. A particular type of model called Word2Vec uses the embedding layer to find vector representations of words that contain semantic meaning.
The word2vec algorithm finds much more efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words. Words that show up in similar contexts, such as "black", "white", and "red" will have vectors near each other.
Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model (Sections 3.1 and 3.2 in Mikolov et al.). Algorithmically, these models are similar, except that CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'), while the skip-gram model does the inverse and predicts source context words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.
The two architectures for implementing word2vec, CBOW (Continuous Bag-of-Words) and Skip-gram, are depicted in the figure below.
We will first work with the skip-gram model and later with CBOW. In the skip-gram model, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.
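To make the difference concrete, here is a small sketch (the sentence and window size are made up for illustration) of the training examples each model would derive from the same text:
sentence = ['the', 'quick', 'brown', 'fox', 'jumps']
window = 1
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    skipgram_pairs = [(target, c) for c in context]  # skip-gram: target -> each context word
    cbow_example = (context, target)                 # CBOW: whole context -> target
    print(skipgram_pairs, cbow_example)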
In [1]:
# These are all the modules we'll be using later.
# Make sure you can import them before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE
import shutil # high level file operations
Download the data from the source website if necessary.
In [2]:
url = 'http://mattmahoney.net/dc/'
def maybe_download(filename, expected_bytes):
"""Download a file if not present, and make sure it's the right size."""
# Go to parent directory and then go to data directory
fpath = os.getcwd()
cpath = os.path.abspath(os.path.join(fpath, os.pardir))
cpath = os.path.join(cpath, 'data')
cpath = os.path.join(cpath, filename)
# Flag: does the file already exist in the data directory?
cpathl = os.path.exists(cpath)
if not cpathl:
filename, _ = urlretrieve(url + filename, filename)
statinfo = os.stat(filename)
else:
statinfo = os.stat(cpath)
if statinfo.st_size == expected_bytes:
print('Found and verified %s' % filename)
else:
print(statinfo.st_size)
raise Exception(
'Failed to verify ' + filename + '. Can you get to it with a browser?')
if not cpathl:
# After file has been verified move it to data folder
fpath = os.path.join(fpath, filename)
shutil.move(fpath, cpath)
return filename
filename = maybe_download('text8.zip', 31344016)
Read the data into a string.
In [3]:
def read_data(filename):
"""Extract the first file enclosed in a zip file as a list of words."""
# Go to parent directory and then go to data directory
fpath = os.getcwd()
cpath = os.path.abspath(os.path.join(fpath, os.pardir))
cpath = os.path.join(cpath, 'data')
cpath = os.path.join(cpath, filename)
with zipfile.ZipFile(cpath) as f:
# ZipFile.namelist() :: Returns a list of archive members by name.
data = tf.compat.as_str(f.read(f.namelist()[0])).split()
return data
words = read_data(filename)
print('Data size %d' % len(words))
Build the dictionary and replace rare words with the UNK token.
Note: The UNK token is a special token used to capture out-of-vocabulary (OOV) words.
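As a reminder of how collections.Counter.most_common (used in the cell below) behaves, a toy example with made-up words:
import collections
toy_words = 'the cat sat on the mat with the cat'.split()
print(collections.Counter(toy_words).most_common(3))
# -> [('the', 3), ('cat', 2), ('sat', 1)]  (in recent Python versions ties keep first-occurrence order)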
In [4]:
vocabulary_size = 50000
def build_dataset(words):
count = [['UNK', -1]]
# Add the most common words and their counts as (word, count) tuples to the count list
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
dictionary = dict()
for word, _ in count:
# dictionary[word] is the index of the word, i.e. the order in which it was added,
# so indices run from the most frequent word (smallest index) downwards
dictionary[word] = len(dictionary)
data = list()
unk_count = 0
for word in words:
if word in dictionary:
index = dictionary[word]
else:
index = 0 # dictionary['UNK']
unk_count = unk_count + 1
# data collects the dictionary index of every word in the corpus,
# so data[i] is the index of words[i] (0 for out-of-vocabulary words)
data.append(index)
count[0][1] = unk_count
# Build the reverse mapping: index -> word
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
return (data, count, dictionary, reverse_dictionary)
data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del(words) # Hint to reduce memory.
Let's explore the data structures created by the previous function.
In [5]:
print('data type is {} with length {}'.format(type(data), len(data)))
print(data[:24])
print('\ncount type is {} with length {}'.format(type(count), len(count)))
print(count[:6])
print('\ndictionary type is {} with length {}'.format(type(dictionary), len(dictionary)))
print(list(dictionary.items())[:6])
print('\nreverse_dictionary type is {} with length {}'.format(
type(reverse_dictionary), len(reverse_dictionary)))
print(list(reverse_dictionary.items())[:6])
Why do we need to return the data variable? It is used later to generate training batches.
data is a list of dictionary indices, one for every word in our corpus. Each entry points either to UNK (index 0) or to one of the most common words in the dictionary.
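For instance, assuming the cells above have been run, the first few corpus tokens can be recovered from data like this:
print([reverse_dictionary[idx] for idx in data[:8]])  # index 0 would print as 'UNK'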
Function to generate a training batch for the skip-gram model.
In [6]:
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
"""
Function to generate a batch from the dataset.
Args:
batch_size: size of the batch to be created
num_skips: How many times to reuse an input to generate a label.
skip_window: radius of the context window around each target word
"""
global data_index
# batch_size must be divisible by num_skips because we loop batch_size // num_skips times below.
assert batch_size % num_skips == 0
# num_skips can be at most 2 * skip_window, otherwise there are not enough
# distinct context words to form num_skips (word, label) pairs.
assert num_skips <= 2 * skip_window
batch = np.ndarray(shape=(batch_size), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
span = 2 * skip_window + 1 # [ skip_window target skip_window ]
# https://docs.python.org/3.5/library/collections.html#collections.deque
# deque :list-like container with fast appends and pops on either end
buffer = collections.deque(maxlen=span)
# Fill the buffer with the first span word indices from data.
for __ in range(span):
buffer.append(data[data_index])
# data_index is a global variable; advance it and wrap around to the
# start of the dataset when the end is reached.
data_index = (data_index + 1) % len(data)
# loop to create batch data
for i in range(batch_size // num_skips):
target = skip_window # target label at the center of the buffer
targets_to_avoid = [ skip_window ]
# For each centre word we emit num_skips (word, label) pairs.
for j in range(num_skips):
while target in targets_to_avoid:
target = random.randint(0, span - 1)
# Remember this target so it is not picked again for the same centre word.
targets_to_avoid.append(target)
# Create one (word, label) pair: the input is always the centre word of the
# buffer (position skip_window) and the label is the randomly chosen context word.
batch[i * num_skips + j] = buffer[skip_window]
labels[i * num_skips + j, 0] = buffer[target]
# Appending to the full deque drops the oldest word, sliding the window one position to the right.
buffer.append(data[data_index])
# data_index advances with the buffer, just as in the initialisation loop above.
data_index = (data_index + 1) % len(data)
return (batch, labels)
print('data:', [reverse_dictionary[di] for di in data[:8]])
for num_skips, skip_window in [(2, 1), (4, 2)]:
data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
print(' batch:', [reverse_dictionary[bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
In [7]:
for num_skips, skip_window in [(1, 1), (1, 3), (1, 4)]:
data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
print(' batch:', [reverse_dictionary[bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
print(batch)
print(labels)
In [8]:
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.
graph = tf.Graph()
# An alternative is `with graph.as_default(), tf.device('/cpu:0'):` — pinning the graph
# to the CPU avoids placement errors on setups where some of these ops lack GPU kernels;
# here we let TensorFlow place the ops itself.
with graph.as_default():
# Input data.
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Variables.
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Model.
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
tf.nn.sampled_softmax_loss(weights=softmax_weights,
biases=softmax_biases,
inputs=embed,
labels=train_labels,
num_sampled=num_sampled,
num_classes=vocabulary_size))
# Optimizer.
# Note: The optimizer will optimize the softmax_weights AND the embeddings.
# This is because the embeddings are defined as a variable quantity and the
# optimizer's `minimize` method will by default modify all variable quantities
# that contribute to the tensor it is passed.
# See docs on `tf.train.Optimizer.minimize()` for more details.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
# Compute the similarity between the validation examples and all embeddings.
# The embeddings are L2-normalised first, so the matmul below yields cosine similarities:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
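A note on why the matmul above gives cosine similarities: once each row of the embedding matrix is divided by its L2 norm, the dot product of two rows equals the cosine of the angle between them. A standalone numpy sketch (sizes are illustrative):
import numpy as np
emb = np.random.rand(10, 4)                                # 10 toy embeddings of dimension 4
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-length rows
cos = normed @ normed.T                                    # cos[i, j] = cosine similarity of rows i and j
assert np.allclose(np.diag(cos), 1.0)                      # every vector has similarity 1 with itself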
In [9]:
num_steps = 100001
with tf.Session(graph=graph) as session:
tf.global_variables_initializer().run()
print('Initialized')
average_loss = 0
for step in range(num_steps):
batch_data, batch_labels = generate_batch(
batch_size, num_skips, skip_window)
feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
_, l = session.run([optimizer, loss], feed_dict=feed_dict)
average_loss += l
if step % 2000 == 0:
if step > 0:
average_loss = average_loss / 2000
# The average loss is an estimate of the loss over the last 2000 batches.
print('Average loss at step %d: %f' % (step, average_loss))
average_loss = 0
# note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
sim = similarity.eval()
for i in range(valid_size):
valid_word = reverse_dictionary[valid_examples[i]]
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k+1] # index 0 is the word itself, so skip it
log = 'Nearest to %s:' % valid_word
for k in range(top_k):
close_word = reverse_dictionary[nearest[k]]
log = '%s %s,' % (log, close_word)
print(log)
final_embeddings = normalized_embeddings.eval()
In [10]:
num_points = 400
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
In [11]:
def plot(embeddings, labels):
assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
pylab.figure(figsize=(15,15)) # in inches
for i, label in enumerate(labels):
x, y = embeddings[i,:]
pylab.scatter(x, y)
pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
ha='right', va='bottom')
pylab.show()
words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)
An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from the centre word's vector, you predict the centre word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.
In [12]:
data_index = 0
def generate_batch(batch_size, bag_window):
global data_index
span = 2 * bag_window + 1 # [ bag_window target bag_window ]
batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
buffer = collections.deque(maxlen=span)
for _ in range(span):
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
for i in range(batch_size):
buffer_list = list(buffer)
labels[i, 0] = buffer_list.pop(bag_window)
batch[i] = buffer_list
# Slide the window one word to the right.
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
return batch, labels
print('data:', [reverse_dictionary[di] for di in data[:16]])
for bag_window in [1, 2]:
data_index = 0
batch, labels = generate_batch(batch_size=4, bag_window=bag_window)
print('\nwith bag_window = %d:' % (bag_window))
print(' batch:', [[reverse_dictionary[w] for w in bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(4)])
In [13]:
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
# skip_window and num_skips from the skip-gram model are replaced here by bag_window.
bag_window = 2 # How many words to consider left and right.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.
graph = tf.Graph()
with graph.as_default():
# Input data.
train_dataset = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Variables.
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Model.
# Look up embeddings for inputs.
embeds = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
tf.nn.sampled_softmax_loss(
weights=softmax_weights,
biases=softmax_biases,
inputs=tf.reduce_sum(embeds, 1),  # sum the 2*bag_window context embeddings per example
labels=train_labels,
num_sampled=num_sampled,
num_classes=vocabulary_size))
# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
# Compute the similarity between the validation examples and all embeddings.
# The embeddings are L2-normalised first, so the matmul below yields cosine similarities:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
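The only structural change from the skip-gram graph is the input side: each example now looks up 2 * bag_window context embeddings and sums them into a single vector before the sampled softmax. A numpy sketch of the shape bookkeeping (sizes are illustrative and deliberately different from the notebook's):
import numpy as np
n_examples, n_context, emb_dim = 4, 2 * 2, 8           # n_context = 2 * bag_window
context_embeds = np.random.rand(n_examples, n_context, emb_dim)
summed = context_embeds.sum(axis=1)                    # plays the role of tf.reduce_sum(embeds, 1) above
assert summed.shape == (n_examples, emb_dim)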
In [14]:
num_steps = 100001
with tf.Session(graph=graph) as session:
tf.global_variables_initializer().run()
print('Initialized')
average_loss = 0
for step in range(num_steps):
batch_data, batch_labels = generate_batch(
batch_size, bag_window)
feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
_, l = session.run([optimizer, loss], feed_dict=feed_dict)
average_loss += l
if step % 2000 == 0:
if step > 0:
average_loss = average_loss / 2000
# The average loss is an estimate of the loss over the last 2000 batches.
print('Average loss at step %d: %f' % (step, average_loss))
average_loss = 0
# note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
sim = similarity.eval()
for i in range(valid_size):
valid_word = reverse_dictionary[valid_examples[i]]
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k+1] # index 0 is the word itself, so skip it
log = 'Nearest to %s:' % valid_word
for k in range(top_k):
close_word = reverse_dictionary[nearest[k]]
log = '%s %s,' % (log, close_word)
print(log)
final_embeddings = normalized_embeddings.eval()
Re-use the plotting code from above to visualise the CBOW embeddings.
In [15]:
num_points = 400
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)