5_word2vec


Deep Learning

Assignment 5

The goal of this assignment is to train a Word2Vec skip-gram model over Text8 data.


In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

Download the data from the source website if necessary.


In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)


Found and verified text8.zip

Read the data into a list of words.


In [3]:
def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data
  
words = read_data(filename)
print('Data size %d' % len(words))


Data size 17005207

Build the dictionary and replace rare words with UNK token.


In [4]:
vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.


Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]

Let's display the internal variables to better understand their structure:


In [5]:
print(data[:10])
print(count[:10])
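# Wrap items() in list() so the slice also works on Python 3, where items() returns a view.
print(list(dictionary.items())[:10])
print(list(reverse_dictionary.items())[:10])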


[5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
[['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201), ('a', 325873), ('to', 316376), ('zero', 264975), ('nine', 250430)]
[('fawn', 45848), ('homomorphism', 9648), ('nordisk', 39343), ('nunnery', 36075), ('chthonic', 33554), ('sowell', 40562), ('sonja', 38175), ('showa', 32906), ('woods', 6263), ('hsv', 44222)]
[(0, 'UNK'), (1, 'the'), (2, 'of'), (3, 'and'), (4, 'one'), (5, 'in'), (6, 'a'), (7, 'to'), (8, 'zero'), (9, 'nine')]
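
To see how the four structures relate on a small scale, here is a minimal sketch that applies the same steps to a toy corpus. The helper `build_toy_dataset` and the toy sentence are made up for illustration; the helper takes the vocabulary size as an argument instead of reading the global.

```python
import collections

def build_toy_dataset(words, vocabulary_size):
    # Hypothetical helper: same steps as build_dataset, parameterised for a toy corpus.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # Index 0 is reserved for UNK; more frequent words get lower indices.
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    # Words outside the vocabulary map to index 0 (UNK).
    data = [dictionary.get(word, 0) for word in words]
    count[0][1] = data.count(0)
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return data, count, dictionary, reverse_dictionary

toy_words = 'the cat sat on the mat the cat ran'.split()
toy_data, toy_count, toy_dictionary, toy_reverse = build_toy_dataset(
    toy_words, vocabulary_size=3)
print(toy_data)                            # integer indices, one per input word
print(toy_count)                           # [word, frequency] pairs, UNK first
print([toy_reverse[i] for i in toy_data])  # rare words come back as 'UNK'
```

With a vocabulary of 3, only 'the' and 'cat' keep their own index; every other word is counted as UNK.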

Function to generate a training batch for the skip-gram model.


In [6]:
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1 # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:32]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=16, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(16)])
    
for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 1
    batch, labels = generate_batch(batch_size=16, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(16)])


data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'UNK', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term']

with num_skips = 2 and skip_window = 1:
    batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term', 'of', 'of', 'abuse', 'abuse', 'first', 'first', 'used', 'used']
    labels: ['as', 'anarchism', 'a', 'originated', 'term', 'as', 'of', 'a', 'abuse', 'term', 'of', 'first', 'abuse', 'used', 'first', 'against']

with num_skips = 4 and skip_window = 2:
    batch: ['as', 'as', 'as', 'as', 'a', 'a', 'a', 'a', 'term', 'term', 'term', 'term', 'of', 'of', 'of', 'of']
    labels: ['a', 'term', 'anarchism', 'originated', 'as', 'of', 'originated', 'term', 'as', 'abuse', 'a', 'of', 'term', 'a', 'first', 'abuse']

with num_skips = 2 and skip_window = 1:
    batch: ['as', 'as', 'a', 'a', 'term', 'term', 'of', 'of', 'abuse', 'abuse', 'first', 'first', 'used', 'used', 'against', 'against']
    labels: ['originated', 'a', 'as', 'term', 'of', 'a', 'abuse', 'term', 'of', 'first', 'used', 'abuse', 'first', 'against', 'early', 'used']

with num_skips = 4 and skip_window = 2:
    batch: ['a', 'a', 'a', 'a', 'term', 'term', 'term', 'term', 'of', 'of', 'of', 'of', 'abuse', 'abuse', 'abuse', 'abuse']
    labels: ['of', 'originated', 'as', 'term', 'a', 'of', 'as', 'abuse', 'abuse', 'a', 'first', 'term', 'first', 'of', 'used', 'term']

Note: each label is a word picked at random from the window surrounding the corresponding batch word, and that window slides along the text.

The output above shows words for readability, but the batch and the labels actually hold integer word indices, not the words themselves.


In [7]:
print(batch)
print(labels)


[   6    6    6    6  195  195  195  195    2    2    2    2 3137 3137 3137
 3137]
[[   2]
 [3084]
 [  12]
 [ 195]
 [   6]
 [   2]
 [  12]
 [3137]
 [3137]
 [   6]
 [  46]
 [ 195]
 [  46]
 [   2]
 [  59]
 [ 195]]
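
The skip-gram `generate_batch` picks `num_skips` context positions at random for each center word. When `num_skips == 2 * skip_window` it ends up visiting every position in the window, only in random order. The sketch below lists the same pairs deterministically, which can make the sampling easier to follow. `skipgram_pairs` is a hypothetical helper for illustration, not a replacement for `generate_batch`.

```python
def skipgram_pairs(tokens, skip_window):
    """Return every (center, context) pair within skip_window of each position."""
    pairs = []
    # Only positions with a full window on both sides are used as centers.
    for center in range(skip_window, len(tokens) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset != 0:  # the center word is never its own label
                pairs.append((tokens[center], tokens[center + offset]))
    return pairs

tokens = ['anarchism', 'originated', 'as', 'a', 'term']
pairs = skipgram_pairs(tokens, skip_window=1)
for center_word, context_word in pairs:
    print(center_word, '->', context_word)
```

The first element of each pair plays the role of `batch` and the second the role of `labels`.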

Train a skip-gram model.


In [8]:
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
  # Variables.
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Model.
  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                               train_labels, num_sampled, vocabulary_size))

  # Optimizer.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine similarity (dot product of unit-length vectors):
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
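
The loss in the cell above relies on `tf.nn.sampled_softmax_loss`, which avoids computing a softmax over all 50000 words at every step. The NumPy sketch below illustrates the general idea under simplifying assumptions: for each example it scores the true word against a handful of negatives drawn uniformly at random and applies a softmax over that small set only. TensorFlow's implementation differs in its details (for instance in how the negatives are drawn), so treat this as an illustration of the principle rather than a reimplementation. The function name and toy sizes are made up.

```python
import numpy as np

def sampled_softmax_sketch(weights, biases, inputs, labels, num_sampled, rng):
    """Simplified sampled softmax: true class vs. uniformly drawn negatives."""
    num_classes = weights.shape[0]
    losses = []
    for x, true_id in zip(inputs, labels):
        # Assumption: negatives are drawn uniformly from all classes but the true one.
        candidates = np.delete(np.arange(num_classes), true_id)
        negatives = rng.choice(candidates, size=num_sampled, replace=False)
        ids = np.concatenate(([true_id], negatives))
        logits = np.dot(weights[ids], x) + biases[ids]
        logits = logits - logits.max()  # shift for numerical stability
        log_probs = logits - np.log(np.sum(np.exp(logits)))
        losses.append(-log_probs[0])    # cross-entropy for the true class
    return np.mean(losses)

rng = np.random.RandomState(0)
toy_vocab, toy_dim = 20, 4
toy_weights = rng.normal(scale=1.0 / np.sqrt(toy_dim), size=(toy_vocab, toy_dim))
toy_biases = np.zeros(toy_vocab)
toy_inputs = rng.uniform(-1.0, 1.0, size=(3, toy_dim))
toy_labels = np.array([2, 7, 11])
toy_loss = sampled_softmax_sketch(
    toy_weights, toy_biases, toy_inputs, toy_labels, num_sampled=5, rng=rng)
print(toy_loss)
```

Each example costs `num_sampled + 1` dot products instead of one per vocabulary word, which is what makes training with a large vocabulary affordable.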

In [9]:
num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()


Initialized
Average loss at step 0: 7.973985
Nearest to people: preserved, shear, cheats, refusing, elca, transcontinental, galician, ppg,
Nearest to called: suitability, extracellular, hangzhou, conducts, surrender, roadway, biographies, babylonians,
Nearest to there: clicks, pumps, hipc, depictions, streaming, overwritten, conifers, urartu,
Nearest to three: inadequately, idiopathic, ancona, epigram, northumbrian, rains, rquez, ldl,
Nearest to such: cylindrical, kanal, disposition, cao, denotation, spreads, engraver, outspoken,
Nearest to as: breakdancing, konqueror, sealing, eritreans, brubeck, olwen, denotation, marty,
Nearest to or: charlestown, gleaming, subsumed, proposal, anisotropic, goodfellas, auvergne, schloss,
Nearest to between: sufficed, privacy, amar, equates, peloponnesus, fuji, statuette, endowed,
Nearest to into: usr, glucagon, responds, rupture, caeiro, galvani, alexandra, pitfalls,
Nearest to an: copycat, darts, polytechnique, sartre, anglicanism, morrison, technologies, cortona,
Nearest to with: mansi, amortized, abendana, baal, ifrcs, unpaved, nokia, opposite,
Nearest to for: wadis, wide, optimal, accustomed, mezzo, davids, dns, retroviruses,
Nearest to i: lorica, conn, guadalajara, feat, krzy, randomization, yat, repression,
Nearest to in: transmigration, paralleled, uniquely, ouadda, coding, panama, worshipping, oems,
Nearest to the: reflex, marcellinus, anchovies, ecm, parlement, fitzroy, cavers, angels,
Nearest to years: eminem, fortunate, progression, signalled, elmore, disallowed, uvular, hotspot,
Average loss at step 2000: 4.363197
Average loss at step 4000: 3.864227
Average loss at step 6000: 3.793599
Average loss at step 8000: 3.684167
Average loss at step 10000: 3.617109
Nearest to people: preserved, transcontinental, mention, addressee, driver, myspace, flower, anarchists,
Nearest to called: conducts, shielded, eeg, chapels, extracellular, sisak, hangzhou, wandered,
Nearest to there: it, still, he, also, overwritten, not, they, pumps,
Nearest to three: five, four, six, eight, two, seven, nine, zero,
Nearest to such: kanal, known, processes, cylindrical, demo, dag, annoyed, disposition,
Nearest to as: login, sensation, keel, neighboring, capitalization, belong, by, denotation,
Nearest to or: and, characterizations, coerce, texans, s, organometallic, acct, subsidised,
Nearest to between: forsyth, laminar, moire, fuji, endowed, banana, in, equates,
Nearest to into: usr, caeiro, galvani, extradition, gerald, revealing, succeed, from,
Nearest to an: grimoires, unsettling, the, copycat, slings, terrible, anglicanism, ectopic,
Nearest to with: in, lilly, ohm, by, bayezid, mansi, ebrd, from,
Nearest to for: rothbard, towards, over, mamertines, of, cha, dubna, theodor,
Nearest to i: guadalajara, conn, juice, it, danner, malay, yetzirah, bra,
Nearest to in: of, on, during, at, by, subtype, panama, greatest,
Nearest to the: a, its, this, his, their, alamo, lavrenty, an,
Nearest to years: eminem, nzenberg, rescued, signalled, disallowed, center, sutras, cesium,
Average loss at step 12000: 3.604228
Average loss at step 14000: 3.570798
Average loss at step 16000: 3.409652
Average loss at step 18000: 3.457004
Average loss at step 20000: 3.536956
Nearest to people: preserved, addressee, transcontinental, mention, anarchists, semicolon, members, myspace,
Nearest to called: named, conducts, sisak, universally, spines, olney, eeg, chapels,
Nearest to there: it, they, still, he, which, also, overwritten, pumps,
Nearest to three: six, four, five, seven, two, eight, zero, one,
Nearest to such: known, kanal, demo, well, liege, these, processes, cylindrical,
Nearest to as: login, roo, adema, halting, murray, postmodernist, kashmiri, classes,
Nearest to or: and, organometallic, hoare, than, stefano, texans, activate, visits,
Nearest to between: in, with, forsyth, from, secular, selling, laminar, fuji,
Nearest to into: usr, caeiro, galvani, from, extradition, gerald, recites, pitfalls,
Nearest to an: the, unsettling, grimoires, slings, xxxix, sartre, expanse, mehr,
Nearest to with: at, in, between, artifact, for, duals, bayezid, caa,
Nearest to for: rothbard, towards, by, of, in, from, with, alene,
Nearest to i: conn, brainstem, malay, guadalajara, brute, designs, danner, program,
Nearest to in: at, on, during, from, between, panama, with, for,
Nearest to the: its, an, their, a, this, any, nanaimo, his,
Nearest to years: eminem, nzenberg, rescued, centuries, circularly, center, disallowed, hours,
Average loss at step 22000: 3.499382
Average loss at step 24000: 3.491234
Average loss at step 26000: 3.482409
Average loss at step 28000: 3.477295
Average loss at step 30000: 3.501148
Nearest to people: members, players, addressee, students, transcontinental, asparagales, carthusian, ammanati,
Nearest to called: named, universally, polymorphic, objection, spines, chapels, olney, shielded,
Nearest to there: they, it, still, he, jerky, often, pumps, this,
Nearest to three: four, five, seven, six, eight, two, nine, zero,
Nearest to such: known, well, these, liege, demo, spaceflights, kanal, processes,
Nearest to as: became, puritanism, gediminas, halting, ambiguities, capitalization, by, clamp,
Nearest to or: and, but, biotechnology, without, widowed, organometallic, than, gloster,
Nearest to between: with, from, volunteers, secular, forsyth, drilled, in, swore,
Nearest to into: from, galvani, usr, caeiro, extradition, through, over, in,
Nearest to an: unsettling, tut, sartre, xxxix, busby, grimoires, desk, disguising,
Nearest to with: between, in, by, secular, among, when, guybrush, ohm,
Nearest to for: eduardo, of, from, alene, rothbard, in, dubna, gamecube,
Nearest to i: ii, you, designs, we, guadalajara, iii, dial, they,
Nearest to in: during, at, of, from, on, under, by, with,
Nearest to the: their, its, his, smile, some, delson, chromatography, many,
Nearest to years: centuries, hours, eminem, circularly, center, year, thence, times,
Average loss at step 32000: 3.503519
Average loss at step 34000: 3.490309
Average loss at step 36000: 3.457498
Average loss at step 38000: 3.294679
Average loss at step 40000: 3.427992
Nearest to people: members, someone, addressee, students, players, codepoint, ver, ammanati,
Nearest to called: named, considered, conducts, overhaul, tachi, universally, olney, enver,
Nearest to there: they, it, still, often, which, he, now, also,
Nearest to three: four, two, five, six, seven, eight, nine, zero,
Nearest to such: known, well, many, these, including, kanal, liege, described,
Nearest to as: capitalization, gediminas, when, sphinx, bathtub, commonly, expectancies, petition,
Nearest to or: and, than, organometallic, a, splash, com, deflecting, coerce,
Nearest to between: with, drilled, volunteers, swore, melting, forsyth, secular, jun,
Nearest to into: galvani, through, from, during, over, bankhead, iron, usr,
Nearest to an: tut, grimoires, probus, unsettling, roamed, busby, sartre, robots,
Nearest to with: between, secular, by, when, in, caa, sophist, ruck,
Nearest to for: in, dubna, of, without, kilda, to, is, towards,
Nearest to i: we, you, they, ii, t, he, it, designs,
Nearest to in: during, of, and, from, for, on, without, at,
Nearest to the: this, a, its, his, their, shilton, celestines, gulden,
Nearest to years: hours, centuries, times, eminem, year, days, anu, thence,
Average loss at step 42000: 3.436011
Average loss at step 44000: 3.454509
Average loss at step 46000: 3.448703
Average loss at step 48000: 3.359070
Average loss at step 50000: 3.382532
Nearest to people: men, members, children, addressee, someone, austronesian, students, leave,
Nearest to called: named, overhaul, conducts, tachi, polymorphic, universally, nan, chapels,
Nearest to there: they, it, he, still, hattie, jerky, believed, she,
Nearest to three: six, four, seven, eight, two, five, nine, zero,
Nearest to such: well, known, these, many, follows, spaceflights, liege, regarded,
Nearest to as: became, capitalization, equity, francophones, martius, walters, puritanism, stipend,
Nearest to or: and, organometallic, than, gloster, hagbard, conjectured, com, whose,
Nearest to between: with, from, in, volunteers, melting, forsyth, crabbe, maximilien,
Nearest to into: through, from, during, galvani, usr, over, caeiro, afghan,
Nearest to an: unsettling, grimoires, probus, rooted, roamed, polytechnique, officio, sartre,
Nearest to with: between, among, by, caa, clinician, ruck, mcdonough, when,
Nearest to for: dubna, towards, against, glasnost, without, hermeticism, siemens, rhotic,
Nearest to i: we, you, ii, they, brasses, t, danner, inbound,
Nearest to in: during, on, of, within, since, from, throughout, including,
Nearest to the: its, this, their, his, architecturally, a, epilepsy, some,
Nearest to years: times, days, centuries, hours, months, year, decades, nattiez,
Average loss at step 52000: 3.436867
Average loss at step 54000: 3.426783
Average loss at step 56000: 3.439991
Average loss at step 58000: 3.400027
Average loss at step 60000: 3.390829
Nearest to people: men, members, children, someone, students, players, semicolon, scientists,
Nearest to called: named, considered, used, overhaul, chapels, conducts, universally, prost,
Nearest to there: they, it, now, still, today, this, he, often,
Nearest to three: four, six, five, eight, two, seven, nine, zero,
Nearest to such: known, well, these, regarded, spaceflights, including, described, liege,
Nearest to as: puritanism, biomass, halting, em, belong, misdemeanor, capitalization, netsplit,
Nearest to or: and, organometallic, but, than, without, gloster, volunteering, whit,
Nearest to between: with, in, among, crabbe, within, goodall, volunteers, maximilien,
Nearest to into: through, from, over, in, galvani, within, presbyter, goth,
Nearest to an: unsettling, grimoires, roamed, probus, xxxix, upheld, cabbage, colophon,
Nearest to with: between, when, among, by, duals, caa, in, although,
Nearest to for: of, including, alene, against, dubna, coughing, messerschmitt, bureaucrat,
Nearest to i: you, we, ii, they, brasses, danner, she, t,
Nearest to in: during, within, of, from, on, between, among, since,
Nearest to the: a, their, its, each, epilepsy, this, nanaimo, stimulates,
Nearest to years: centuries, days, times, hours, months, decades, year, nattiez,
Average loss at step 62000: 3.236575
Average loss at step 64000: 3.254499
Average loss at step 66000: 3.404697
Average loss at step 68000: 3.394414
Average loss at step 70000: 3.359899
Nearest to people: men, children, members, someone, students, players, sides, semicolon,
Nearest to called: named, considered, overhaul, known, reflector, used, referred, universally,
Nearest to there: they, it, now, still, she, usually, often, he,
Nearest to three: six, five, four, seven, two, eight, nine, zero,
Nearest to such: well, these, known, regarded, liege, spaceflights, many, described,
Nearest to as: capitalization, segmented, equity, puritanism, sphinx, is, corals, when,
Nearest to or: and, organometallic, fried, your, than, coupe, objecting, patton,
Nearest to between: with, within, among, rebelling, drilled, from, jun, separatist,
Nearest to into: through, from, within, toward, attachments, bogota, destabilize, usr,
Nearest to an: unsettling, probus, robots, bleaching, upheld, roamed, busby, diffuse,
Nearest to with: between, embittered, caa, ruck, when, confidential, jun, duals,
Nearest to for: including, without, if, alene, in, towards, gamecube, against,
Nearest to i: you, we, ii, she, they, t, brasses, g,
Nearest to in: during, within, of, on, including, until, by, for,
Nearest to the: their, this, its, a, any, these, your, some,
Nearest to years: days, months, hours, centuries, times, decades, year, minutes,
Average loss at step 72000: 3.373863
Average loss at step 74000: 3.346582
Average loss at step 76000: 3.316972
Average loss at step 78000: 3.355325
Average loss at step 80000: 3.380022
Nearest to people: men, children, students, members, someone, players, women, individuals,
Nearest to called: named, considered, used, known, universally, overhaul, referred, prost,
Nearest to there: it, they, he, often, still, she, now, believed,
Nearest to three: four, six, seven, five, eight, two, nine, zero,
Nearest to such: well, regarded, known, these, spaceflights, including, follows, described,
Nearest to as: massive, hexen, like, puritanism, equity, hao, gne, bugzilla,
Nearest to or: and, organometallic, gloster, while, com, coupe, than, objecting,
Nearest to between: within, with, jun, among, over, drilled, goodall, through,
Nearest to into: through, from, within, attachments, towards, toward, during, using,
Nearest to an: unsettling, grimoires, desk, probus, busby, slovakian, dhtml, cabbage,
Nearest to with: between, when, duals, barnum, caa, formalization, among, embittered,
Nearest to for: kilda, prefaced, towards, alene, after, gamecube, siemens, against,
Nearest to i: you, ii, t, we, danner, g, iii, dial,
Nearest to in: during, within, on, until, at, throughout, of, since,
Nearest to the: a, its, his, tofu, mellitus, their, episcopi, this,
Nearest to years: days, times, decades, months, hours, centuries, year, minutes,
Average loss at step 82000: 3.404380
Average loss at step 84000: 3.408689
Average loss at step 86000: 3.388645
Average loss at step 88000: 3.351643
Average loss at step 90000: 3.363047
Nearest to people: men, children, students, women, authors, players, christians, someone,
Nearest to called: named, considered, used, referred, known, universally, overhaul, said,
Nearest to there: it, they, still, he, we, now, she, believed,
Nearest to three: four, two, five, seven, eight, six, zero, nine,
Nearest to such: well, known, regarded, spaceflights, these, certain, follows, selves,
Nearest to as: nuisance, puritanism, notoc, kampf, sphinx, gernika, biomass, varuna,
Nearest to or: and, organometallic, than, houghton, but, polyester, eightfold, lulu,
Nearest to between: with, within, among, from, jun, through, under, drilled,
Nearest to into: through, from, within, under, attachments, caeiro, between, cowes,
Nearest to an: cabbage, probus, unsettling, desk, elysium, kickboxing, slovakian, shemini,
Nearest to with: between, by, in, jun, when, prost, among, ruck,
Nearest to for: against, including, of, without, after, while, bureaucrat, when,
Nearest to i: you, g, we, t, ii, danner, dial, iii,
Nearest to in: during, within, of, under, at, and, on, throughout,
Nearest to the: its, this, his, a, conti, any, each, their,
Nearest to years: days, hours, decades, months, centuries, times, minutes, year,
Average loss at step 92000: 3.398833
Average loss at step 94000: 3.253199
Average loss at step 96000: 3.356412
Average loss at step 98000: 3.240982
Average loss at step 100000: 3.359134
Nearest to people: children, men, students, women, players, authors, speakers, individuals,
Nearest to called: named, referred, overhaul, used, considered, www, misplaced, known,
Nearest to there: they, it, still, we, he, now, generally, often,
Nearest to three: four, two, five, seven, eight, six, zero, nine,
Nearest to such: known, well, regarded, these, spaceflights, follows, certain, many,
Nearest to as: sphinx, puritanism, notoc, varuna, equity, hotbed, segmented, gru,
Nearest to or: and, than, organometallic, coupe, houghton, adding, cafeteria, objecting,
Nearest to between: within, among, with, around, drilled, jun, goodall, rebelling,
Nearest to into: through, within, from, during, towards, off, across, attachments,
Nearest to an: grimoires, probus, unsettling, another, bleaching, copycat, roamed, xxxix,
Nearest to with: between, including, when, using, jets, embittered, in, to,
Nearest to for: without, when, hgp, while, after, during, if, dubna,
Nearest to i: you, we, ii, danner, v, t, they, g,
Nearest to in: during, throughout, within, until, on, among, of, from,
Nearest to the: a, their, your, lodge, his, its, this, our,
Nearest to years: days, hours, decades, centuries, months, times, year, minutes,
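
The "Nearest to" lines above come from the `similarity` matrix: every embedding is scaled to unit length, so a dot product between two rows equals their cosine similarity. The following NumPy sketch reproduces that computation on a small random matrix. All names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_embeddings = rng.uniform(-1.0, 1.0, size=(6, 4))  # 6 hypothetical words, 4 dimensions
toy_valid_ids = np.array([0, 3])                      # words whose neighbors we want

# Scale every row to unit length, as the graph does with `norm`.
row_norm = np.sqrt(np.sum(np.square(toy_embeddings), axis=1, keepdims=True))
toy_normalized = toy_embeddings / row_norm

# Row i holds the cosine similarity of validation word i to every word.
toy_similarity = np.dot(toy_normalized[toy_valid_ids], toy_normalized.T)

top_k = 2
toy_nearest = {}
for row, word_id in enumerate(toy_valid_ids):
    order = (-toy_similarity[row]).argsort()  # most similar first
    # order[0] is the word itself (similarity 1), so it is skipped.
    toy_nearest[int(word_id)] = order[1:top_k + 1]
    print(word_id, '->', toy_nearest[int(word_id)])
```

Skipping the first entry of the sorted order is why the training loop slices with `[1:top_k+1]`.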

This is what an embedding looks like:


In [10]:
print(final_embeddings[0])


[-0.02051404 -0.01841166 -0.1244709   0.05074964 -0.06100134 -0.00200008
  0.11170887 -0.08766419  0.02409515  0.1101989   0.16667137 -0.00165691
 -0.14681569 -0.11056326 -0.03933677 -0.02154945 -0.10062175 -0.05628699
  0.01922635  0.1639808   0.01684226 -0.0540074  -0.03108865  0.01777779
  0.04706197 -0.01260165 -0.29304644 -0.0302702  -0.10937209  0.03770837
  0.15830405  0.06936062  0.16194052 -0.03061437  0.04542041 -0.07499804
 -0.00841922  0.01684443  0.07148984 -0.06406043  0.09492613 -0.09933698
 -0.00687152 -0.08409496  0.01544465 -0.11804878  0.14000055 -0.09653874
  0.02137009  0.01145039 -0.06205352  0.02977284 -0.01883831 -0.11986119
  0.05494772  0.05486212 -0.00120243 -0.02710927 -0.0569164  -0.11858083
 -0.01223189  0.09930295  0.09467366 -0.02073516 -0.12372935  0.00244062
  0.05924704  0.14948699 -0.1119219  -0.07551062  0.00928106 -0.01831606
 -0.05067647 -0.10931925  0.0324928   0.00191521 -0.01634123  0.08661727
  0.12392872 -0.01820764 -0.16330577 -0.24632891  0.02271111  0.00542232
  0.00801045 -0.00690781  0.16911526  0.01407609  0.02365589 -0.07911306
  0.07899767  0.16927634  0.07133543  0.05735794  0.08887751  0.19834489
  0.17564225 -0.08074389  0.17213371  0.04207939  0.02885242  0.02857619
 -0.05606408 -0.12006876 -0.07494419 -0.10423107 -0.05341928 -0.09745979
 -0.13150343  0.05612956 -0.08829316  0.12047683  0.00201026 -0.09098013
  0.00637597  0.08963603 -0.02186407  0.01224304 -0.02592809  0.05893643
 -0.0640655   0.00615067 -0.07741468  0.10470562  0.01330675  0.08730693
 -0.0361954  -0.05102392]

The individual values are abstract: no single component has a practical meaning on its own. Moreover, the final embeddings are normalized to unit length, as you can see here:


In [11]:
print(np.sum(np.square(final_embeddings[0])))


1.0

In [12]:
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])

In [13]:
def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15,15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i,:]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)



Problem

An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.


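For the continuous bag of words, the training inputs differ slightly from those of the skip-gram model: each example is the list of context words around a position, and its label is the word at the center.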


In [14]:
data_index = 0

def generate_batch(batch_size, bag_window):
  global data_index
  span = 2 * bag_window + 1 # [ bag_window target bag_window ]
  batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size):
    # copy the window, then split it into the center word (label) and its context (input)
    buffer_list = list(buffer)
    labels[i, 0] = buffer_list.pop(bag_window)
    batch[i] = buffer_list
    # iterate to the next buffer
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:16]])

for bag_window in [1, 2]:
  data_index = 0
  batch, labels = generate_batch(batch_size=4, bag_window=bag_window)
  print('\nwith bag_window = %d:' % (bag_window))  
  print('    batch:', [[reverse_dictionary[w] for w in bi] for bi in batch])  
  print('    labels:', [reverse_dictionary[li] for li in labels.reshape(4)])


data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the']

with bag_window = 1:
    batch: [['anarchism', 'as'], ['originated', 'a'], ['as', 'term'], ['a', 'of']]
    labels: ['originated', 'as', 'a', 'term']

with bag_window = 2:
    batch: [['anarchism', 'originated', 'a', 'term'], ['originated', 'as', 'term', 'of'], ['as', 'a', 'of', 'abuse'], ['a', 'term', 'abuse', 'first']]
    labels: ['as', 'a', 'term', 'of']

Note the change in the loss function: tf.reduce_sum sums the word vectors of the context before they are passed to the sampled softmax:


In [15]:
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
###skip_window = 1 # How many words to consider left and right.
###num_skips = 2 # How many times to reuse an input to generate a label.
bag_window = 2 # How many words to consider left and right.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
  # Variables.
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Model.
  # Look up embeddings for inputs.
  embeds = tf.nn.embedding_lookup(embeddings, train_dataset)
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, tf.reduce_sum(embeds, 1),
                               train_labels, num_sampled, vocabulary_size))

  # Optimizer.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine similarity (dot product of L2-normalized embeddings):
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
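
The shape change behind the `tf.reduce_sum(embeds, 1)` call in the loss above can be illustrated with NumPy. This is only a sketch with toy sizes and made-up context ids (assumptions, not values from the notebook). Looking up a 2-D array of ids gives a 3-D array, and summing over axis 1 collapses the context words into one vector per example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_size = 10, 4   # toy sizes, not the notebook's
embeddings = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

# Two training examples, each with 2 * bag_window = 4 context word ids
# (arbitrary ids chosen for illustration).
context_ids = np.array([[1, 2, 4, 5],
                        [2, 3, 5, 6]])

embeds = embeddings[context_ids]   # shape (batch, 2 * bag_window, embedding_size)
summed = embeds.sum(axis=1)        # shape (batch, embedding_size)
print(embeds.shape, summed.shape)
```

The summed array has the `[batch_size, embedding_size]` shape that the sampled softmax expects as input. Some CBOW implementations average the context vectors instead of summing them; the notebook sums.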

In [16]:
num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, bag_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()


Initialized
Average loss at step 0: 7.983871
Nearest to an: ambiguously, tennessee, cats, regents, numerals, inpatient, madre, krak,
Nearest to d: woolly, atypical, climatic, textual, caapi, sungorus, collusion, disarmament,
Nearest to one: modi, optimizations, williams, overtake, finned, glissando, inciting, gasoline,
Nearest to may: fogo, kwanzaa, rivaled, doi, cavour, cante, tansley, incorruptible,
Nearest to has: pollock, literatures, existential, evelyn, inspector, ventricle, sweeps, tending,
Nearest to while: embouchure, analogue, petomane, etymology, etienne, referencing, starvation, hinds,
Nearest to will: waived, introduced, lithography, purge, nss, fabian, torquato, helicopters,
Nearest to nine: dore, promise, ailments, forty, boast, tarleton, courtney, apo,
Nearest to but: fucked, catalyzed, uart, commemorating, minicomputers, theatrically, sombart, meshech,
Nearest to are: bosch, handedly, dummies, bearings, repulsive, allegedly, eph, nk,
Nearest to this: shore, smog, dwell, injuries, poet, engraver, wafers, leviathan,
Nearest to when: derives, postfix, shimei, warlords, gorbachev, obtain, whale, prophesies,
Nearest to see: symbolized, algebra, fanpage, undesirable, jena, exerting, sierpinski, albans,
Nearest to up: cumming, thresholds, organic, ullman, manoeuvre, cnt, maguey, philanthropic,
Nearest to time: connie, amnesia, photoelectric, archdiocese, tennant, auburn, turbofan, rubicon,
Nearest to is: silicon, organisms, honesty, ghola, duckworth, glaze, wealthier, sheldon,
Average loss at step 2000: 7.571952
Average loss at step 4000: 4.962989
Average loss at step 6000: 5.264466
Average loss at step 8000: 4.020034
Average loss at step 10000: 3.823763
Nearest to an: the, a, expense, contends, thumb, stroke, bad, ontario,
Nearest to d: b, chicxulub, qf, kotzebue, balls, berenguer, apo, mustelids,
Nearest to one: crannog, fmln, accented, vivian, adl, baths, nesting, amedeo,
Nearest to may: can, should, could, would, might, must, will, cannot,
Nearest to has: had, have, was, is, agilent, finegold, lockyer, schottky,
Nearest to while: and, of, creed, than, thus, secundus, width, athanasian,
Nearest to will: would, must, creed, fish, at, oneself, thus, to,
Nearest to nine: eight, seven, five, six, zero, iga, three, apollo,
Nearest to but: however, meso, yamaha, korean, encounter, which, does, injustice,
Nearest to are: is, include, fish, creed, of, were, possibly, at,
Nearest to this: what, nigeria, stings, physiol, chili, dalek, danielle, husayn,
Nearest to when: where, fertilize, relative, sovereignty, because, oxidise, by, awake,
Nearest to see: utica, metonic, rimfire, subgroups, declarative, ratsiraka, commoner, bani,
Nearest to up: ranges, azores, thresholds, mano, imax, application, down, migrated,
Nearest to time: least, prosciutto, oxidant, hellish, place, end, kennel, year,
Nearest to is: are, was, of, holds, include, will, at, a,
Average loss at step 12000: 3.904979
Average loss at step 14000: 3.926719
Average loss at step 16000: 3.572611
Average loss at step 18000: 3.749069
Average loss at step 20000: 3.532388
Nearest to an: the, a, this, umlauts, doc, sorel, stad, melaka,
Nearest to d: b, balls, artifact, mahavira, broadly, zf, currency, remainders,
Nearest to one: sourcewatch, compaq, inflation, glissando, durch, artisan, amendment, duration,
Nearest to may: would, can, will, must, could, should, might, latvia,
Nearest to has: had, have, having, is, agilent, cordillera, pods, was,
Nearest to while: when, among, although, with, and, but, like, willie,
Nearest to will: would, must, can, could, should, might, may, cannot,
Nearest to nine: zero, eight, five, seven, two, vigor, appellation, complexities,
Nearest to but: however, although, and, though, while, paradigmatic, than, asleep,
Nearest to are: were, is, include, teleological, have, jfif, was, mutate,
Nearest to this: it, what, that, any, an, bismarck, a, beekeeping,
Nearest to when: if, where, while, because, during, before, since, tidewater,
Nearest to see: declarative, subgroups, lian, rimfire, known, ligase, mosquitia, utica,
Nearest to up: down, out, back, intelligibility, him, organic, spotters, azores,
Nearest to time: least, end, year, rubicon, period, subpixels, aemilius, unsurprising,
Nearest to is: was, are, became, were, has, wired, does, refers,
Average loss at step 22000: 3.588760
Average loss at step 24000: 4.065557
Average loss at step 26000: 3.706526
Average loss at step 28000: 3.455289
Average loss at step 30000: 3.427290
Nearest to an: another, this, ncipe, cryonics, propel, kanagawa, sade, albeit,
Nearest to d: zf, r, lysis, intercity, tua, b, unnamed, rubble,
Nearest to one: francs, seven, dominus, two, statism, paymaster, addition, toe,
Nearest to may: can, would, will, could, must, should, might, cannot,
Nearest to has: had, have, having, was, agilent, unionism, smiths, includes,
Nearest to while: although, berlin, afc, wall, cow, sse, convertible, mandelbrot,
Nearest to will: would, can, could, may, might, should, must, cannot,
Nearest to nine: eight, seven, late, five, th, derry, four, six,
Nearest to but: however, and, than, see, although, button, afterward, which,
Nearest to are: were, is, possibly, johnstone, chad, cow, afc, berlin,
Nearest to this: it, what, that, which, piecewise, stoicism, a, an,
Nearest to when: because, before, however, where, episcopate, since, after, by,
Nearest to see: but, prematurely, scandinavian, rimfire, include, staircase, cartier, deductions,
Nearest to up: down, off, out, back, appendix, intelligibility, him, animated,
Nearest to time: least, period, year, end, start, beginning, unsurprising, times,
Nearest to is: mandelbrot, wall, berlin, finland, are, cow, chad, afc,
Average loss at step 32000: 3.131431
Average loss at step 34000: 3.298668
Average loss at step 36000: 3.281630
Average loss at step 38000: 3.314990
Average loss at step 40000: 3.389277
Nearest to an: another, the, chechen, amway, ascend, sade, vanzetti, a,
Nearest to d: b, zf, acorn, intercity, hitters, gentlemen, xilinx, shifter,
Nearest to one: inappropriately, headquartered, dominus, granny, irreconcilable, recollections, sweep, smuggle,
Nearest to may: can, will, should, would, could, must, might, does,
Nearest to has: have, had, having, is, was, includes, contains, agilent,
Nearest to while: however, although, though, and, but, when, jakobson, before,
Nearest to will: would, could, may, must, should, can, does, might,
Nearest to nine: eight, seven, six, september, five, four, late, bb,
Nearest to but: and, however, than, though, although, while, see, asks,
Nearest to are: were, is, was, include, being, have, exist, mutate,
Nearest to this: it, which, the, any, what, piecewise, a, another,
Nearest to when: if, where, while, before, since, by, because, laughs,
Nearest to see: introduce, but, undamaged, rimfire, include, known, called, cartier,
Nearest to up: off, out, down, forward, appendix, intelligibility, veered, migrated,
Nearest to time: least, end, beginning, rubicon, base, times, manner, bijection,
Nearest to is: was, are, has, became, remains, makes, includes, were,
Average loss at step 42000: 3.285913
Average loss at step 44000: 3.211095
Average loss at step 46000: 3.221205
Average loss at step 48000: 3.127906
Average loss at step 50000: 3.156376
Nearest to an: another, reconsideration, eden, the, stad, any, mohamed, turboprop,
Nearest to d: b, asin, pricing, origami, entourage, odie, whispering, peacemaker,
Nearest to one: enroll, pumps, thither, honor, favor, papen, taya, batavians,
Nearest to may: can, must, will, could, should, would, might, cannot,
Nearest to has: had, have, having, contains, is, includes, agilent, agreeable,
Nearest to while: although, when, though, and, however, if, where, before,
Nearest to will: must, may, should, could, would, can, might, does,
Nearest to nine: eight, seven, six, zero, august, january, four, five,
Nearest to but: although, though, than, while, include, monies, ucla, formulaic,
Nearest to are: were, is, was, include, exist, being, mugabe, awaiting,
Nearest to this: which, what, it, piecewise, another, that, epictetus, hypnotic,
Nearest to when: while, where, because, if, although, since, before, during,
Nearest to see: called, include, includes, rimfire, known, islami, operates, deductions,
Nearest to up: off, out, them, back, forward, together, down, spotters,
Nearest to time: least, period, times, night, end, subpixels, moment, height,
Nearest to is: was, are, remains, were, became, makes, has, contains,
Average loss at step 52000: 3.190961
Average loss at step 54000: 3.200878
Average loss at step 56000: 3.025388
Average loss at step 58000: 3.134411
Average loss at step 60000: 3.130078
Nearest to an: another, this, the, talons, doc, waza, a, sedition,
Nearest to d: berlin, cow, school, acid, afc, creed, joseph, dissociation,
Nearest to one: two, favor, afghanistan, terms, finlandization, adl, lipi, allein,
Nearest to may: can, must, will, might, should, would, could, cannot,
Nearest to has: had, have, having, is, includes, contains, agilent, isambard,
Nearest to while: although, though, however, when, and, including, if, before,
Nearest to will: would, may, must, could, might, can, should, cannot,
Nearest to nine: eight, seven, six, five, zero, late, january, september,
Nearest to but: farad, examines, dissociation, somewhere, ghanima, afc, sire, perhaps,
Nearest to are: were, is, observances, include, use, was, suborders, many,
Nearest to this: it, which, another, longwave, lasorda, epictetus, wasp, she,
Nearest to when: if, where, though, while, since, before, although, after,
Nearest to see: include, rimfire, known, surfing, provides, backers, minnesotans, references,
Nearest to up: off, out, back, veered, intelligibility, together, appendix, forward,
Nearest to time: least, night, moment, period, start, point, times, end,
Nearest to is: was, are, becomes, has, remains, exists, makes, became,
Average loss at step 62000: 3.094546
Average loss at step 64000: 3.011245
Average loss at step 66000: 3.013084
Average loss at step 68000: 3.084206
Average loss at step 70000: 3.157549
Nearest to an: another, a, the, mohamed, ruth, no, reconsideration, disrepair,
Nearest to d: b, unobstructed, zf, liza, teborg, intercity, militants, acorn,
Nearest to one: recollections, nchen, offside, adl, hultsfred, deakin, dilemmas, heraclius,
Nearest to may: can, should, must, might, could, would, will, cannot,
Nearest to has: had, have, having, is, includes, was, agilent, contains,
Nearest to while: scrip, and, though, montage, although, level, without, thus,
Nearest to will: would, must, should, could, can, may, might, cannot,
Nearest to nine: eight, th, zero, july, january, april, september, kapitan,
Nearest to but: though, and, although, however, which, while, until, that,
Nearest to are: were, is, include, exist, have, semmelweis, contain, was,
Nearest to this: another, what, some, which, that, aryeh, lasorda, beekeeping,
Nearest to when: if, before, because, after, where, although, since, however,
Nearest to see: allows, rimfire, include, includes, putnam, provides, gives, enables,
Nearest to up: off, down, out, forth, back, veered, forward, gaius,
Nearest to time: least, night, speed, summer, moment, beginning, end, period,
Nearest to is: was, are, makes, remains, provides, becomes, exists, includes,
Average loss at step 72000: 3.039138
Average loss at step 74000: 2.935000
Average loss at step 76000: 3.065070
Average loss at step 78000: 3.059412
Average loss at step 80000: 2.909747
Nearest to an: another, rhyming, the, durand, doc, ascend, bau, turboprop,
Nearest to d: b, c, zf, unobstructed, teborg, newington, blotter, symbolizing,
Nearest to one: intrauterine, gentile, involved, spectacularly, ericaceae, underpinnings, sociopathic, two,
Nearest to may: can, might, could, should, must, would, will, cannot,
Nearest to has: having, had, have, is, includes, contains, maintains, was,
Nearest to while: although, though, where, and, however, but, trypanosomiasis, including,
Nearest to will: would, must, could, can, should, might, may, cannot,
Nearest to nine: eight, seven, oi, agile, august, berlin, ghanima, methodologies,
Nearest to but: though, however, although, see, and, or, while, which,
Nearest to are: were, is, was, include, exist, have, semmelweis, including,
Nearest to this: what, which, it, that, beekeeping, a, the, another,
Nearest to when: if, because, though, before, after, although, where, finally,
Nearest to see: but, include, provides, rimfire, gives, rada, known, surfing,
Nearest to up: off, down, together, quickly, empty, back, me, decisions,
Nearest to time: night, least, period, moment, manner, battle, arrival, antidisestablishmentarianism,
Nearest to is: was, are, makes, remains, exists, becomes, has, includes,
Average loss at step 82000: 3.032796
Average loss at step 84000: 2.954726
Average loss at step 86000: 2.974028
Average loss at step 88000: 3.004254
Average loss at step 90000: 2.893942
Nearest to an: another, the, reconsideration, doc, xvii, any, a, turboprop,
Nearest to d: disrepair, village, h, reckoning, emitting, berlin, dissociation, hindi,
Nearest to one: two, racer, harman, shaker, qh, wallach, attenuation, resulting,
Nearest to may: can, might, should, could, would, will, must, cannot,
Nearest to has: had, have, having, includes, contains, is, was, rb,
Nearest to while: although, though, and, including, where, programmes, amongst, trypanosomiasis,
Nearest to will: would, can, could, might, should, cannot, may, did,
Nearest to nine: eight, seven, mid, late, six, five, four, zero,
Nearest to but: although, however, though, see, declarative, and, than, inundated,
Nearest to are: were, is, include, exist, was, contain, denote, have,
Nearest to this: which, it, another, what, any, lasorda, beekeeping, piecewise,
Nearest to when: if, after, because, before, where, though, by, actually,
Nearest to see: includes, rimfire, include, provides, ponty, but, schechter, gives,
Nearest to up: off, down, together, back, forward, out, forth, spotters,
Nearest to time: least, height, odds, moment, beginning, period, corner, look,
Nearest to is: was, remains, are, makes, becomes, exists, includes, became,
Average loss at step 92000: 2.952135
Average loss at step 94000: 2.945493
Average loss at step 96000: 2.800004
Average loss at step 98000: 2.569032
Average loss at step 100000: 2.762196
Nearest to an: another, inpatient, doc, stad, xvii, the, reconsideration, csc,
Nearest to d: b, pricing, kocher, wheat, deansgate, tightening, berenguer, wiesbaden,
Nearest to one: implicated, allein, crannog, excerpted, hoop, heraclius, cocos, irreconcilable,
Nearest to may: might, should, can, could, would, will, must, cannot,
Nearest to has: had, have, having, contains, includes, is, holds, takes,
Nearest to while: although, where, however, but, savory, though, until, when,
Nearest to will: would, should, could, might, can, may, cannot, must,
Nearest to nine: eight, seven, five, six, zero, july, four, june,
Nearest to but: however, although, and, though, than, while, or, nor,
Nearest to are: were, is, was, include, exist, contain, semmelweis, have,
Nearest to this: which, some, another, every, piecewise, fateful, any, that,
Nearest to when: if, where, because, although, though, before, that, while,
Nearest to see: include, allow, references, known, allows, say, but, schechter,
Nearest to up: off, down, together, out, back, forward, forth, sends,
Nearest to time: moment, stage, period, least, start, night, distances, subpixels,
Nearest to is: was, are, makes, exists, includes, remains, contains, provides,
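
The "Nearest to" lists above come from ranking cosine similarities. The ranking step can be sketched in NumPy on a toy embedding matrix (the four 2-D vectors below are invented for illustration). After L2-normalizing the rows, a matrix product gives all cosine similarities, and `argsort` on the negated row orders the words from most to least similar.

```python
import numpy as np

# Toy embedding matrix (made-up values): rows 0 and 1 point in almost the
# same direction, row 2 is orthogonal to row 0.
embeddings = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0],
                       [-1.0, 0.5]])

norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
normalized = embeddings / norm

query_ids = np.array([0, 2])
sim = np.dot(normalized[query_ids], normalized.T)  # cosine similarities, shape (2, 4)

top_k = 2
nearest_by_query = {}
for row, word_id in enumerate(query_ids):
    # Position 0 of the sorted order is the query word itself (similarity 1),
    # so the slice starts at 1.
    nearest_by_query[int(word_id)] = (-sim[row]).argsort()[1:top_k + 1]
    print(word_id, nearest_by_query[int(word_id)])
```

Skipping the first position assumes the query word ranks first, which holds as long as no other row points in exactly the same direction.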

In [17]:
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])

In [18]:
def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15,15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i,:]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)


Some clusters are less obvious (like the standalone characters), but it clearly works!

How does your CBOW model perform compared to the given Word2Vec model? (to be answered)

At first sight, they look similar. The CBOW embedding looks a bit more compact.