Deep Learning

Assignment 5

The goal of this assignment is to train a Word2Vec skip-gram model over Text8 data.


Reading Material

Some reading material to get familiarised with the word2vec approach.


Short Theory Introduction

Word embeddings

When you're dealing with words in text, you end up with tens of thousands of classes to predict, one for each word. Trying to one-hot encode these words is massively inefficient: you'll have one element set to 1 and the other 50,000 set to 0. The matrix multiplication going into the first hidden layer will have almost all of the resulting values be zero. This is a huge waste of computation.

To solve this problem and greatly increase the efficiency of our networks, we use what are called embeddings. Embeddings are just a fully connected layer like you've seen before. We call this layer the embedding layer and the weights are the embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding to the index of the "on" input unit.

Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you just take the 958th row of the embedding matrix. This process is called an embedding lookup and the number of hidden units is the embedding dimension.
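
To see concretely that the lookup gives the same result as the one-hot matrix multiplication, here is a small NumPy sketch (purely illustrative; the vocabulary size, embedding size and word index are made up):

import numpy as np

vocab_size, embed_dim = 10, 4                 # toy sizes, for illustration only
embedding = np.random.rand(vocab_size, embed_dim)

word_idx = 7                                  # pretend some word was encoded as 7
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1

via_matmul = np.dot(one_hot, embedding)       # the expensive way
via_lookup = embedding[word_idx]              # the lookup shortcut

print(np.allclose(via_matmul, via_lookup))    # True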

There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix as well.

Embeddings aren't only used for words of course. You can use them for any model where you have a massive number of classes. A particular type of model called Word2Vec uses the embedding layer to find vector representations of words that contain semantic meaning.

Word2Vec

The word2vec algorithm finds much more efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words. Words that show up in similar contexts, such as "black", "white", and "red" will have vectors near each other.

Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model (Section 3.1 and 3.2 in Mikolov et al.). Algorithmically, these models are similar, except that CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'), while the skip-gram does the inverse and predicts source context-words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.
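
To make the difference concrete, the toy sketch below (not part of the assignment code; the sentence and variable names are made up) builds the training pairs each model would see with a one-word context window on each side:

sentence = ['the', 'cat', 'sits', 'on', 'the', 'mat']
window = 1  # one context word on each side

cbow_pairs, skipgram_pairs = [], []
for i in range(window, len(sentence) - window):
    context = sentence[i - window:i] + sentence[i + 1:i + window + 1]
    target = sentence[i]
    # CBOW: the whole context predicts the target word.
    cbow_pairs.append((context, target))
    # Skip-gram: the target word predicts each context word separately.
    skipgram_pairs.extend((target, c) for c in context)

print(cbow_pairs[0])       # (['the', 'sits'], 'cat')
print(skipgram_pairs[:2])  # [('cat', 'the'), ('cat', 'sits')]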

The two architectures for implementing word2vec, CBOW (Continuous Bag-of-Words) and skip-gram, are depicted in the figure below:

We will first work with the skip-gram model and later with the CBOW. In the skip-gram model, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.


In [1]:
# These are all the modules we'll be using later. 
# Make sure you can import them before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE
import shutil # high level file operations

Download the data from the source website if necessary.


In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    # Go to parent directory and then go to data directory
    fpath = os.getcwd()
    cpath = os.path.abspath(os.path.join(fpath, os.pardir))
    cpath = os.path.join(cpath, 'data')
    cpath = os.path.join(cpath, filename)
    # check whether the file already exists in the data directory
    cpathl = os.path.exists(cpath)
    if not cpathl:
        filename, _ = urlretrieve(url + filename, filename)
        statinfo = os.stat(filename)
    else:
        statinfo = os.stat(cpath)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
        
    if not cpathl:
        # After file has been verified move it to data folder
        fpath = os.path.join(fpath, filename)
        shutil.move(fpath, cpath)
    
    return filename

filename = maybe_download('text8.zip', 31344016)


Found and verified text8.zip

Read the data into a string.


In [3]:
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words."""
    # Go to parent directory and then go to data directory
    fpath = os.getcwd()
    cpath = os.path.abspath(os.path.join(fpath, os.pardir))
    cpath = os.path.join(cpath, 'data')
    cpath = os.path.join(cpath, filename)
    with zipfile.ZipFile(cpath) as f:
        # ZipFile.namelist() returns a list of archive members by name.
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data
  
words = read_data(filename)
print('Data size %d' % len(words))


Data size 17005207

Build the dictionary and replace rare words with UNK token.

Note: The UNK token is a special token used to capture out-of-vocabulary (OOV) words.


In [4]:
vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    # Add the most common words and their counts as (word, count) tuples.
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        # dictionary[word] is the index of the word in the dictionary, i.e. the order
        # in which it was added, so entries run from the most frequent word downwards.
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        # data collects the dictionary indices: each element of data is the
        # dictionary index of the corresponding element of words.
        data.append(index)
    count[0][1] = unk_count
    # Invert the dictionary: map indices back to words.
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return (data, count, dictionary, reverse_dictionary)

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del(words) # Hint to reduce memory.


Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5242, 3082, 12, 6, 195, 2, 3135, 46, 59, 156]

Let's explore the data structures created by the previous function.


In [5]:
print('data type is {} with length {}'.format(type(data), len(data)))
print(data[:24])

print('\ncount type is {} with length {}'.format(type(count), len(count)))
print(count[:6])

print('\ndictionary type is {} with length {}'.format(type(dictionary), len(dictionary)))
print(list(dictionary.items())[:6])
    
print('\nreverse_dictionary type is {} with length {}'.format(
    type(reverse_dictionary), len(reverse_dictionary)))
print(list(reverse_dictionary.items())[:6])


data type is <class 'list'> with length 17005207
[5242, 3082, 12, 6, 195, 2, 3135, 46, 59, 156, 128, 742, 477, 10603, 134, 1, 27559, 2, 1, 103, 855, 3, 1, 15192]

count type is <class 'list'> with length 50000
[['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201)]

dictionary type is <class 'dict'> with length 50000
[('gabaa', 47135), ('steppenwolf', 45225), ('nullified', 30107), ('totalling', 19614), ('siemens', 22245), ('olympians', 32333)]

reverse_dictionary type is <class 'dict'> with length 50000
[(0, 'UNK'), (1, 'the'), (2, 'of'), (3, 'and'), (4, 'one'), (5, 'in')]
  • Why do we need to return the data variable? It is used later to generate data batches.

    data is a list of dictionary indices, one for every word in our corpus. Each element points either to UNK or to one of the most common words in the dictionary.
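
As an optional sanity check (not a cell from the original assignment), you can map a word to its index and back, and confirm that an out-of-vocabulary string ('qwertyuiop' below is just an arbitrary example) falls back to index 0, i.e. UNK:

In [ ]:
word = 'anarchism'
idx = dictionary[word]
print(word, '->', idx, '->', reverse_dictionary[idx])

# Out-of-vocabulary words are missing from the dictionary and are encoded as 0 (UNK).
print(dictionary.get('qwertyuiop', 0), '->', reverse_dictionary[0])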

Function to generate a training batch for the skip-gram model.


In [6]:
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    """
    Generate a training batch for the skip-gram model.
    
    Args:
        batch_size: number of (word, label) pairs in the batch
        num_skips: how many times to reuse an input word to generate a label
        skip_window: radius of the context window (words to consider left and right)
    """
    global data_index
    # batch_size must be a multiple of num_skips because we later loop over
    # batch_size // num_skips center words.
    assert batch_size % num_skips == 0
    # num_skips <= 2 * skip_window, otherwise there are not enough context words
    # in the window to create num_skips distinct (word, label) pairs.
    assert num_skips <= 2 * skip_window
    
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    # https://docs.python.org/3.5/library/collections.html#collections.deque
    # deque: list-like container with fast appends and pops on either end
    buffer = collections.deque(maxlen=span)
    # Fill the buffer with the first `span` word indices from data.
    for __ in range(span):
        buffer.append(data[data_index])
        # data_index is a global variable; it advances through the dataset
        # and wraps around to the start when it reaches the end.
        data_index = (data_index + 1) % len(data)
    # Loop to create the batch data.
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        # For each center word, draw num_skips distinct context words.
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            # Remember this context word so it is not picked again for this center word.
            targets_to_avoid.append(target)
            # Create the (word, label) pair: the input is always the center word
            # (position skip_window in the buffer), the label the chosen context word.
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        # Slide the window: appending a new word pushes the oldest one out of the deque.
        buffer.append(data[data_index])
        # data_index moves together with the buffer, just as in the initialisation loop above.
        data_index = (data_index + 1) % len(data)
    return (batch, labels)

print('data:', [reverse_dictionary[di] for di in data[:8]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])


data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']

with num_skips = 2 and skip_window = 1:
    batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term']
    labels: ['as', 'anarchism', 'a', 'originated', 'term', 'as', 'of', 'a']

with num_skips = 4 and skip_window = 2:
    batch: ['as', 'as', 'as', 'as', 'a', 'a', 'a', 'a']
    labels: ['anarchism', 'term', 'a', 'originated', 'originated', 'term', 'of', 'as']

In [7]:
for num_skips, skip_window in [(1, 1), (1, 3), (1, 4)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
    print(batch)
    print(labels)


with num_skips = 1 and skip_window = 1:
    batch: ['originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used']
    labels: ['as', 'originated', 'term', 'a', 'abuse', 'of', 'used', 'against']
[3082   12    6  195    2 3135   46   59]
[[  12]
 [3082]
 [ 195]
 [   6]
 [3135]
 [   2]
 [  59]
 [ 156]]

with num_skips = 1 and skip_window = 3:
    batch: ['a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early']
    labels: ['originated', 'of', 'as', 'used', 'of', 'abuse', 'working', 'used']
[   6  195    2 3135   46   59  156  128]
[[3082]
 [   2]
 [  12]
 [  59]
 [   2]
 [3135]
 [ 742]
 [  59]]

with num_skips = 1 and skip_window = 4:
    batch: ['term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working']
    labels: ['first', 'used', 'of', 'term', 'early', 'first', 'first', 'early']
[ 195    2 3135   46   59  156  128  742]
[[ 46]
 [ 59]
 [  2]
 [195]
 [128]
 [ 46]
 [ 46]
 [128]]

Train a skip-gram model.


In [8]:
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

# Alternative: with graph.as_default(), tf.device('/cpu:0'):
# The reference word2vec examples pin these ops to the CPU explicitly, reportedly because
# some of the embedding/sampling ops lacked GPU kernels in older TensorFlow versions.
with graph.as_default():

    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
    # Variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Model.
    # Look up embeddings for inputs.
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights,
                                   biases=softmax_biases,
                                   inputs=embed,
                                   labels=train_labels,
                                   num_sampled=num_sampled,
                                   num_classes=vocabulary_size))

    # Optimizer.
    # Note: The optimizer will optimize the softmax_weights AND the embeddings.
    # This is because the embeddings are defined as a variable quantity and the
    # optimizer's `minimize` method will by default modify all variable quantities 
    # that contribute to the tensor it is passed.
    # See docs on `tf.train.Optimizer.minimize()` for more details.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
    # Compute the similarity between minibatch examples and all embeddings.
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

In [9]:
num_steps = 100001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()


Initialized
Average loss at step 0: 8.145626
Nearest to system: fishermen, suicide, occultist, walls, sufis, mannered, chalcedonian, downside,
Nearest to there: keel, taoiseach, bae, poul, ollie, hmac, nihilism, astrologers,
Nearest to as: bua, wisconsin, patriarch, ergodic, partaking, halting, atom, jacobites,
Nearest to more: blyth, mikhail, circumnavigate, habitus, unaided, toomer, watched, pejoratively,
Nearest to most: wizardry, doings, customers, engelbert, kammu, faced, growing, hinders,
Nearest to from: immunoglobulin, brigade, ramparts, mexico, hottest, traversed, darren, alomari,
Nearest to eight: morphemes, population, spears, cascades, mesmer, compagnie, finalists, professor,
Nearest to four: combustion, prized, precludes, schleicher, romulus, fort, bionic, pivot,
Nearest to use: lamarr, chandragupta, evidences, claudian, vacuum, grab, obliged, ger,
Nearest to would: madrigals, taraza, fresh, intestine, discerning, pfa, warm, disputes,
Nearest to some: mcfarland, harmonious, acquire, employment, antony, madelyne, unsuitable, orinoco,
Nearest to be: deansgate, benzodiazepine, administratives, shameful, steiner, churchmen, magnetite, periodic,
Nearest to these: basically, incredible, gave, alpine, dividend, richard, egg, unchanging,
Nearest to an: durations, parables, abscess, navigators, polio, creator, puranas, earliest,
Nearest to is: worries, marvel, robeson, horthy, stamp, groceries, fantasyland, line,
Nearest to only: impossible, esperantists, alligator, money, coronal, ogdoad, counterattack, waco,
Average loss at step 2000: 4.371076
Average loss at step 4000: 3.867749
Average loss at step 6000: 3.795359
Average loss at step 8000: 3.683756
Average loss at step 10000: 3.615238
Nearest to system: fishermen, walls, suicide, chalcedonian, organise, gay, kosher, funicular,
Nearest to there: it, he, tireless, summoning, poul, gestapo, ollie, instants,
Nearest to as: bilinear, saltpeter, hermaphroditus, josiah, hbox, fraenkel, ergodic, by,
Nearest to more: toomer, watched, chances, bruford, stenella, thuringia, severing, toxicology,
Nearest to most: adequate, ezra, historia, hing, styx, examiners, intransitive, doings,
Nearest to from: in, on, by, at, popularizing, between, through, hottest,
Nearest to eight: nine, six, seven, five, four, three, zero, two,
Nearest to four: six, five, eight, seven, three, two, nine, zero,
Nearest to use: lamarr, startled, hagi, liberalize, fascinated, chandragupta, angelina, claudian,
Nearest to would: will, can, should, may, could, had, privately, must,
Nearest to some: many, these, orinoco, working, thrived, rogers, zr, all,
Nearest to be: have, is, become, by, was, showing, shameful, magnetite,
Nearest to these: many, some, ckel, cdna, mudd, foggy, infinitesimal, infraorder,
Nearest to an: the, disadvantageous, essequibo, cappadocia, durations, puranas, price, interoperable,
Nearest to is: was, are, has, be, byproduct, telephony, happens, groceries,
Nearest to only: impossible, present, inventions, money, overlapped, brides, you, boomers,
Average loss at step 12000: 3.603018
Average loss at step 14000: 3.569268
Average loss at step 16000: 3.413002
Average loss at step 18000: 3.457578
Average loss at step 20000: 3.540574
Nearest to system: suicide, walls, fishermen, organise, nonprofit, elijah, traditions, listening,
Nearest to there: they, it, which, he, since, said, instants, often,
Nearest to as: josiah, hermaphroditus, bertram, tiny, jarmusch, hellenic, monadic, defense,
Nearest to more: less, rather, very, thuringia, watched, most, assessments, salt,
Nearest to most: some, more, adequate, many, intransitive, historia, past, examiners,
Nearest to from: between, in, after, into, burmese, through, brigade, hottest,
Nearest to eight: seven, nine, six, four, five, three, two, zero,
Nearest to four: six, seven, eight, three, five, two, nine, zero,
Nearest to use: hagi, startled, liberalize, idea, lamarr, beatification, chandragupta, size,
Nearest to would: will, could, can, may, should, must, to, had,
Nearest to some: many, these, all, several, their, most, other, this,
Nearest to be: have, by, been, was, is, become, buy, showing,
Nearest to these: some, many, all, cdna, other, such, both, ckel,
Nearest to an: the, essequibo, disadvantageous, cappadocia, wedge, durations, parables, puranas,
Nearest to is: was, are, has, be, humps, were, marvel, virtually,
Nearest to only: impossible, drill, present, indeed, it, overlapped, sketches, feasible,
Average loss at step 22000: 3.502873
Average loss at step 24000: 3.489870
Average loss at step 26000: 3.478944
Average loss at step 28000: 3.478009
Average loss at step 30000: 3.506219
Nearest to system: suicide, systems, organise, nonprofit, walls, traditions, fishermen, passos,
Nearest to there: they, it, he, often, still, this, these, also,
Nearest to as: became, hbox, breeze, before, xu, bilinear, josiah, interpretive,
Nearest to more: less, very, rather, most, longer, watched, assessments, thuringia,
Nearest to most: some, many, more, domestic, intransitive, examiners, adequate, usaf,
Nearest to from: into, through, between, popularizing, on, after, by, in,
Nearest to eight: nine, seven, five, six, four, three, zero, two,
Nearest to four: five, six, three, eight, seven, nine, two, zero,
Nearest to use: beatification, practitioners, form, hb, limit, cause, idea, startled,
Nearest to would: will, can, could, may, should, must, cannot, to,
Nearest to some: many, these, several, all, most, their, each, this,
Nearest to be: have, been, is, aquarius, were, appear, give, was,
Nearest to these: some, many, such, they, both, all, several, ckel,
Nearest to an: parables, undercover, essequibo, gunned, ringed, car, wavefunctions, java,
Nearest to is: was, has, are, does, be, became, although, had,
Nearest to only: impossible, present, drill, feasible, it, generally, indeed, hurting,
Average loss at step 32000: 3.501713
Average loss at step 34000: 3.494517
Average loss at step 36000: 3.460312
Average loss at step 38000: 3.299490
Average loss at step 40000: 3.433991
Nearest to system: systems, suicide, traditions, listening, nonprofit, passos, body, carefully,
Nearest to there: it, they, he, still, often, which, also, usually,
Nearest to as: when, unexplored, hermaphroditus, mourning, cartilage, bilinear, xu, became,
Nearest to more: less, very, most, longer, rather, assessments, greater, actually,
Nearest to most: more, some, very, domestic, many, particularly, intransitive, examiners,
Nearest to from: through, into, in, between, toward, of, after, to,
Nearest to eight: nine, seven, six, five, four, three, zero, one,
Nearest to four: six, five, seven, three, two, eight, nine, one,
Nearest to use: cause, limit, beatification, form, give, hb, practitioners, size,
Nearest to would: will, could, may, can, should, must, cannot, did,
Nearest to some: many, these, several, both, each, various, most, any,
Nearest to be: have, been, become, is, being, are, were, refer,
Nearest to these: many, some, both, such, several, they, their, all,
Nearest to an: essequibo, undercover, deeps, prerogatives, durations, the, druze, bookmakers,
Nearest to is: was, are, has, be, if, flaw, when, became,
Nearest to only: impossible, present, another, indeed, already, drill, sketches, absolutist,
Average loss at step 42000: 3.435207
Average loss at step 44000: 3.450014
Average loss at step 46000: 3.447944
Average loss at step 48000: 3.361730
Average loss at step 50000: 3.382464
Nearest to system: systems, suicide, listening, hexafluoride, smallpox, nonprofit, ripening, carefully,
Nearest to there: they, it, he, still, now, paulus, often, said,
Nearest to as: breeze, capricious, very, hermaphroditus, capitalization, became, pauli, detonated,
Nearest to more: less, most, very, rather, longer, greater, assessments, thuringia,
Nearest to most: more, less, some, domestic, many, roles, detonation, hing,
Nearest to from: through, in, into, since, at, after, between, careful,
Nearest to eight: seven, six, nine, four, five, three, zero, one,
Nearest to four: six, eight, seven, five, three, nine, two, one,
Nearest to use: cause, size, limit, startled, form, hb, beatification, liberalize,
Nearest to would: will, could, may, can, should, must, cannot, might,
Nearest to some: many, these, several, both, the, those, various, each,
Nearest to be: have, being, been, become, were, was, refer, give,
Nearest to these: some, both, many, several, such, various, they, other,
Nearest to an: parables, almohad, essequibo, durations, scholarships, druze, abijah, disadvantageous,
Nearest to is: was, are, has, became, hazy, siouxsie, does, if,
Nearest to only: already, absolutist, initially, nexgen, indeed, present, impossible, another,
Average loss at step 52000: 3.437006
Average loss at step 54000: 3.430390
Average loss at step 56000: 3.440542
Average loss at step 58000: 3.393925
Average loss at step 60000: 3.391485
Nearest to system: systems, suicide, listening, anchorage, network, greensboro, gladiatorial, nonprofit,
Nearest to there: they, it, he, now, still, this, often, also,
Nearest to as: when, aiken, hbox, became, saigon, before, saltpeter, capricious,
Nearest to more: less, most, very, rather, longer, greater, homage, assessments,
Nearest to most: more, many, some, particularly, domestic, roles, examiners, less,
Nearest to from: through, into, since, labiodental, hottest, emirs, after, sleek,
Nearest to eight: nine, six, seven, four, five, three, zero, one,
Nearest to four: six, five, eight, seven, three, nine, zero, two,
Nearest to use: size, cause, limit, most, swastika, form, alma, wilhelmina,
Nearest to would: will, could, may, can, must, should, might, cannot,
Nearest to some: many, several, these, each, most, any, this, various,
Nearest to be: been, become, have, refer, being, were, is, was,
Nearest to these: many, some, several, both, such, those, which, various,
Nearest to an: almohad, parables, cappadocia, durations, essequibo, disadvantageous, undercover, lucia,
Nearest to is: was, are, has, remains, be, but, does, telephony,
Nearest to only: already, nexgen, initially, absolutist, present, another, impossible, until,
Average loss at step 62000: 3.241324
Average loss at step 64000: 3.259750
Average loss at step 66000: 3.403643
Average loss at step 68000: 3.394207
Average loss at step 70000: 3.357083
Nearest to system: systems, suicide, anchorage, listening, carefully, nonprofit, greensboro, advocacy,
Nearest to there: they, it, he, still, we, sometimes, now, usually,
Nearest to as: is, slapped, before, bilinear, xu, anglicized, fatah, shalom,
Nearest to more: less, most, very, rather, longer, greater, quite, aubier,
Nearest to most: more, many, less, particularly, some, roles, use, handicap,
Nearest to from: through, into, hottest, between, labiodental, via, in, careful,
Nearest to eight: nine, six, seven, four, five, zero, three, one,
Nearest to four: six, five, seven, three, eight, two, nine, zero,
Nearest to use: size, cause, limit, most, hb, swastika, form, consent,
Nearest to would: will, could, may, can, should, must, might, cannot,
Nearest to some: many, several, these, all, any, various, most, both,
Nearest to be: been, become, is, have, being, were, was, remain,
Nearest to these: such, many, some, several, those, various, both, were,
Nearest to an: almohad, durations, parables, scholarships, piccard, essequibo, windward, the,
Nearest to is: was, has, are, be, although, makes, does, becomes,
Nearest to only: absolutist, impossible, last, best, already, necessary, compound, rancher,
Average loss at step 72000: 3.376311
Average loss at step 74000: 3.350815
Average loss at step 76000: 3.313397
Average loss at step 78000: 3.350235
Average loss at step 80000: 3.373963
Nearest to system: systems, suicide, listening, anchorage, verity, advocacy, greensboro, ecowas,
Nearest to there: it, they, he, we, still, she, now, usually,
Nearest to as: slapped, before, staring, uvs, loa, acorns, breeze, after,
Nearest to more: less, most, very, rather, longer, quite, greater, fairly,
Nearest to most: more, many, some, less, roles, particularly, handicap, past,
Nearest to from: through, into, via, after, remind, microchip, lynx, hottest,
Nearest to eight: six, seven, nine, four, five, three, zero, one,
Nearest to four: five, six, seven, eight, three, nine, two, zero,
Nearest to use: cause, limit, because, size, form, consent, mean, make,
Nearest to would: could, will, may, can, should, might, must, cannot,
Nearest to some: many, several, these, various, most, both, any, those,
Nearest to be: been, become, being, have, is, refer, remain, appear,
Nearest to these: several, those, many, such, some, various, both, are,
Nearest to an: almohad, durations, parables, scholarships, druze, essequibo, piccard, roam,
Nearest to is: was, has, are, be, became, hazy, although, humps,
Nearest to only: last, absolutist, impossible, ask, best, present, necessary, averaging,
Average loss at step 82000: 3.406620
Average loss at step 84000: 3.411817
Average loss at step 86000: 3.391557
Average loss at step 88000: 3.355118
Average loss at step 90000: 3.365197
Nearest to system: systems, suicide, anchorage, process, method, listening, area, line,
Nearest to there: they, it, he, now, still, we, she, usually,
Nearest to as: pauli, fatah, bottoms, counterbalance, bodhi, staring, saigon, after,
Nearest to more: less, most, greater, very, rather, quite, longer, extremely,
Nearest to most: more, many, less, roles, some, particularly, lad, past,
Nearest to from: through, into, via, during, lynx, toward, remind, sleek,
Nearest to eight: seven, nine, six, five, four, three, one, zero,
Nearest to four: seven, five, six, eight, three, nine, two, one,
Nearest to use: cause, because, limit, patriotism, swastika, achieve, size, most,
Nearest to would: could, will, may, might, should, can, must, cannot,
Nearest to some: many, several, these, any, both, most, all, those,
Nearest to be: been, become, was, being, have, is, refer, remain,
Nearest to these: several, many, some, those, such, are, various, both,
Nearest to an: almohad, parables, piccard, essequibo, lucia, durations, escapades, covenanters,
Nearest to is: was, has, are, humps, be, remains, makes, hazy,
Nearest to only: no, absolutist, even, rarely, initially, every, inventions, last,
Average loss at step 92000: 3.400678
Average loss at step 94000: 3.255582
Average loss at step 96000: 3.357955
Average loss at step 98000: 3.242479
Average loss at step 100000: 3.359282
Nearest to system: systems, listening, suicide, process, anchorage, samsara, hexafluoride, advocacy,
Nearest to there: they, he, it, still, we, she, now, sometimes,
Nearest to as: unexplored, slapped, saigon, bilinear, breeze, how, when, astigmatism,
Nearest to more: less, most, very, greater, rather, quite, longer, extremely,
Nearest to most: more, less, roles, today, past, particularly, many, especially,
Nearest to from: into, through, in, at, within, during, via, across,
Nearest to eight: seven, nine, six, four, five, three, zero, two,
Nearest to four: seven, six, eight, five, three, two, nine, zero,
Nearest to use: cause, limit, inactive, swastika, form, size, because, hb,
Nearest to would: could, will, may, can, might, should, must, cannot,
Nearest to some: many, several, these, any, various, numerous, all, each,
Nearest to be: been, become, have, being, is, refer, are, mulligan,
Nearest to these: several, some, many, those, various, such, certain, are,
Nearest to an: almohad, piccard, durations, essequibo, lucia, parables, covenanters, another,
Nearest to is: was, has, be, are, becomes, seems, remains, became,
Nearest to only: even, rarely, goebbels, no, salford, guildhall, actually, trampoline,
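
As an optional sanity check on the trained embeddings (not part of the original notebook), you can query the nearest neighbours of any in-vocabulary word directly from final_embeddings with NumPy; the query word below is just an example:

In [ ]:
def nearest(word, k=8):
    """Return the k nearest words by cosine similarity (rows of final_embeddings are unit-norm)."""
    vec = final_embeddings[dictionary[word]]
    sims = np.dot(final_embeddings, vec)   # cosine similarity, since the rows are normalised
    order = (-sims).argsort()[1:k + 1]     # skip position 0, which is the word itself
    return [reverse_dictionary[i] for i in order]

print(nearest('three'))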

In [10]:
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])

In [11]:
def plot(embeddings, labels):
    assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
    pylab.figure(figsize=(15,15))  # in inches
    for i, label in enumerate(labels):
        x, y = embeddings[i,:]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)



Problem

An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.



In [12]:
data_index = 0

def generate_batch(batch_size, bag_window):
    global data_index
    span = 2 * bag_window + 1 # [ bag_window target bag_window ]
    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size):
        buffer_list = list(buffer)
        # The center word is the label; the remaining 2 * bag_window words are the inputs.
        labels[i, 0] = buffer_list.pop(bag_window)
        batch[i] = buffer_list
        # Slide the window forward by one word.
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:16]])

for bag_window in [1, 2]:
    data_index = 0
    batch, labels = generate_batch(batch_size=4, bag_window=bag_window)
    print('\nwith bag_window = %d:' % (bag_window))  
    print('    batch:', [[reverse_dictionary[w] for w in bi] for bi in batch])  
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(4)])


data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the']

with bag_window = 1:
    batch: [['anarchism', 'as'], ['originated', 'a'], ['as', 'term'], ['a', 'of']]
    labels: ['originated', 'as', 'a', 'term']

with bag_window = 2:
    batch: [['anarchism', 'originated', 'a', 'term'], ['originated', 'as', 'term', 'of'], ['as', 'a', 'of', 'abuse'], ['a', 'term', 'abuse', 'first']]
    labels: ['as', 'a', 'term', 'of']

In [13]:
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
###skip_window = 1 # How many words to consider left and right.
###num_skips = 2 # How many times to reuse an input to generate a label.
bag_window = 2 # How many words to consider left and right.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default():

    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
    # Variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Model.
    # Look up embeddings for inputs.
    embeds = tf.nn.embedding_lookup(embeddings, train_dataset)
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(
            weights=softmax_weights, 
            biases=softmax_biases,
            inputs=tf.reduce_sum(embeds, 1),
            labels=train_labels, 
            num_sampled=num_sampled,
            num_classes=vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
    # Compute the similarity between minibatch examples and all embeddings.
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
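
The only structural change from the skip-gram graph is in the loss input: the embedding lookup now returns a tensor of shape [batch_size, 2 * bag_window, embedding_size], and tf.reduce_sum(embeds, 1) collapses the context axis so each example is represented by the sum of its context vectors. A minimal NumPy sketch of that reduction (shapes chosen to match the hyperparameters above):

In [ ]:
context_vectors = np.random.rand(128, 4, 128)  # [batch_size, 2 * bag_window, embedding_size]
summed = context_vectors.sum(axis=1)           # what tf.reduce_sum(embeds, 1) computes
print(summed.shape)                            # (128, 128)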

In [14]:
num_steps = 100001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = generate_batch(
            batch_size, bag_window)
        feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()


Initialized
Average loss at step 0: 7.601776
Nearest to state: accrington, fin, syndicated, reconciled, zach, combatants, damme, montju,
Nearest to a: productive, madison, eliminating, prowess, lewiston, servlet, erik, adenauer,
Nearest to been: tice, freedmen, cruel, oversee, rumford, milos, runs, earthquakes,
Nearest to not: parishioners, ieung, tribal, mortis, fantastical, indonesian, irbms, sauce,
Nearest to called: olof, informant, qc, almost, jingle, slate, atom, tswana,
Nearest to years: bigger, mira, alleviating, whim, synods, xxxiii, athos, crag,
Nearest to see: jerky, fondation, oeuvre, usmc, exp, medline, underway, deposition,
Nearest to new: flourishing, aerobatic, ying, oscillators, paladin, bosom, damping, silverstein,
Nearest to states: unauthorized, pital, censuses, thabit, reorganization, mayer, mic, ms,
Nearest to other: business, differently, photographic, replay, khalq, brainchild, dam, blackstone,
Nearest to up: samuelson, tortuous, lucky, nagano, affectionately, livestock, schmitz, mri,
Nearest to it: hbp, heated, tetrameter, mystique, slapping, lincoln, caps, maxine,
Nearest to these: mares, celtics, temperaments, adriatic, carcinogens, syd, fn, pkn,
Nearest to use: klaas, slots, adventurous, fork, bas, hunyadi, hare, faroese,
Nearest to many: litigants, contingent, humble, cellist, lipid, enrolled, wound, patched,
Nearest to to: eureka, noteworthy, chemotherapy, minkowski, dictionary, reefs, chemically, odd,
Average loss at step 2000: 4.614977
Average loss at step 4000: 3.934441
Average loss at step 6000: 3.723376
Average loss at step 8000: 3.536821
Average loss at step 10000: 3.470661
Nearest to state: facilitated, embalming, basilica, anticipation, hastened, hizbullah, reconciled, disrupts,
Nearest to a: any, the, an, another, this, no, merkle, filename,
Nearest to been: become, smash, remixing, statist, ravens, iwama, reformist, appeared,
Nearest to not: bosch, khoi, venter, strangers, also, leanings, tribal, hava,
Nearest to called: informant, beggar, used, replaced, himmler, shimura, acetate, symbolizes,
Nearest to years: yards, chickie, times, transcribe, year, richey, adopters, gab,
Nearest to see: usmc, textures, epicurus, oeuvre, cheerleader, davidic, amathus, entertainments,
Nearest to new: katsura, flourishing, puente, riddles, nsh, aquariums, few, bonneville,
Nearest to states: kingdom, terrace, nations, langdon, reorganization, freezing, dye, sacerdotal,
Nearest to other: these, those, argumentative, neburg, incremented, nicotine, dab, innocent,
Nearest to up: resistant, back, lucky, tortuous, colloquial, livestock, periphery, workmanship,
Nearest to it: he, this, there, which, what, she, they, iuds,
Nearest to these: other, there, some, canarian, unreadable, tsuba, many, reconstructionism,
Nearest to use: adventurous, example, marquise, spoleto, volcanic, wasted, steele, numbers,
Nearest to many: some, several, all, those, various, both, juanita, these,
Nearest to to: would, may, can, could, will, cannot, barbed, kuznetsov,
Average loss at step 12000: 3.492411
Average loss at step 14000: 3.440343
Average loss at step 16000: 3.437531
Average loss at step 18000: 3.397088
Average loss at step 20000: 3.217505
Nearest to state: government, democracy, facilitated, mauritania, embalming, gamble, sheehan, progesterone,
Nearest to a: the, another, every, any, cruzi, no, each, redesign,
Nearest to been: become, rumford, technetium, come, pwnage, reformist, backer, renfrew,
Nearest to not: nothing, never, tribal, no, still, babysitter, detectives, depot,
Nearest to called: used, considered, known, beggar, there, driftwood, constantius, acetate,
Nearest to years: days, centuries, yards, times, consecrations, minors, sexist, months,
Nearest to see: usmc, conflated, kat, roddick, cem, cunningham, gerhard, melanoleuca,
Nearest to new: alva, lecoq, katsura, recorder, few, flourishing, wahhabism, carnivorous,
Nearest to states: kingdom, nations, terrace, rooks, langdon, mehmet, atsc, mayer,
Nearest to other: more, less, hickok, different, magnetometer, ilyich, ev, these,
Nearest to up: off, back, them, him, out, primo, workmanship, sound,
Nearest to it: he, she, this, reincarnation, there, forged, they, compressibility,
Nearest to these: several, many, various, lazar, some, certain, imminent, there,
Nearest to use: form, marquise, steele, risings, be, fish, develop, incarnate,
Nearest to many: some, several, both, all, various, those, plaster, these,
Nearest to to: will, may, would, could, must, can, might, should,
Average loss at step 22000: 3.357614
Average loss at step 24000: 3.283038
Average loss at step 26000: 3.260876
Average loss at step 28000: 3.284067
Average loss at step 30000: 3.223785
Nearest to state: government, climbs, democracy, basilica, bilirubin, hada, kinetic, divorce,
Nearest to a: another, the, jeet, this, lorica, any, berkman, greenspan,
Nearest to been: become, come, existed, technetium, extraterrestrial, pwnage, lineker, statist,
Nearest to not: still, never, boisbaudran, nothing, fog, legislated, furioso, suarez,
Nearest to called: considered, used, named, said, described, enterprising, deduced, known,
Nearest to years: days, year, months, centuries, yards, minors, times, miles,
Nearest to see: cem, but, roddick, includes, melanoleuca, bode, include, annabel,
Nearest to new: particular, katsura, different, greats, lecoq, defenders, resale, balloon,
Nearest to states: kingdom, nations, terrace, spliced, tribes, atsc, antoninus, countries,
Nearest to other: different, these, various, all, empowered, those, both, rather,
Nearest to up: off, back, them, out, him, away, surpluses, mutiny,
Nearest to it: she, he, this, what, there, fathom, still, forged,
Nearest to these: other, many, different, several, they, there, various, some,
Nearest to use: form, production, farewell, steele, release, purpose, sense, cases,
Nearest to many: several, some, various, most, all, both, these, those,
Nearest to to: would, will, could, must, might, may, should, can,
Average loss at step 32000: 3.014785
Average loss at step 34000: 3.199903
Average loss at step 36000: 3.188485
Average loss at step 38000: 3.134698
Average loss at step 40000: 3.154567
Nearest to state: government, devotee, dirks, divorce, engages, mischievous, atheromatous, acclaimed,
Nearest to a: any, another, every, the, vance, each, sutcliffe, mesozoic,
Nearest to been: become, come, begun, existed, extraterrestrial, be, hers, backer,
Nearest to not: still, never, fog, parishioners, poisonous, now, orator, suarez,
Nearest to called: considered, named, circling, used, known, postpartum, formed, referred,
Nearest to years: days, months, centuries, decades, year, times, miles, yards,
Nearest to see: includes, known, usmc, roddick, melanoleuca, aesthetic, but, lev,
Nearest to new: particular, different, kau, nl, defenders, large, greats, cola,
Nearest to states: kingdom, nations, spliced, terrace, jana, mired, provinces, countries,
Nearest to other: various, these, others, pomp, including, legislatures, different, empowered,
Nearest to up: off, out, him, them, down, away, kibo, rise,
Nearest to it: he, she, this, there, fein, coven, heated, only,
Nearest to these: various, different, those, several, some, other, certain, many,
Nearest to use: form, release, production, share, develop, lack, types, postpositions,
Nearest to many: several, some, various, numerous, all, both, most, few,
Nearest to to: will, would, must, might, towards, could, should, cannot,
Average loss at step 42000: 3.208073
Average loss at step 44000: 3.134486
Average loss at step 46000: 3.139427
Average loss at step 48000: 3.059514
Average loss at step 50000: 3.065282
Nearest to state: government, climbs, divorce, base, engages, lothair, minicomputers, mischievous,
Nearest to a: any, the, another, every, this, glider, no, sutcliffe,
Nearest to been: become, existed, come, remained, occurred, begun, fragrance, was,
Nearest to not: never, still, almost, nothing, kappa, fog, elected, deformation,
Nearest to called: named, considered, used, sold, known, formed, referred, cerberus,
Nearest to years: days, months, year, decades, times, yards, hours, centuries,
Nearest to see: includes, known, cem, theseus, roddick, include, epicurus, ary,
Nearest to new: particular, different, specific, katsura, trilemma, certain, rebekah, foreigner,
Nearest to states: kingdom, nations, jana, spliced, tribes, terrace, countries, dst,
Nearest to other: others, rather, fewer, empowered, various, different, older, individual,
Nearest to up: off, out, down, him, road, back, holt, them,
Nearest to it: he, she, this, there, officially, hiding, they, reintegrated,
Nearest to these: some, various, certain, several, different, many, multiple, expletive,
Nearest to use: support, purpose, importance, used, conduct, result, multiplayer, develop,
Nearest to many: some, several, various, numerous, all, most, these, those,
Nearest to to: will, must, towards, could, can, cannot, kuznetsov, fluctuated,
Average loss at step 52000: 3.089522
Average loss at step 54000: 3.087258
Average loss at step 56000: 2.916493
Average loss at step 58000: 3.015235
Average loss at step 60000: 3.052750
Nearest to state: government, base, climbs, congress, engages, chills, fon, governor,
Nearest to a: another, the, any, every, no, jeet, alvarez, legalizing,
Nearest to been: become, existed, occurred, come, remained, ford, begun, grown,
Nearest to not: never, still, nothing, elected, without, detectives, billie, unable,
Nearest to called: considered, used, referred, named, known, included, introduced, produced,
Nearest to years: months, days, decades, times, centuries, minutes, year, versions,
Nearest to see: includes, cem, roddick, known, maga, include, usmc, aafc,
Nearest to new: specific, particular, oa, single, medieval, old, katsura, different,
Nearest to states: kingdom, nations, countries, tribes, provinces, mariners, terrace, organizations,
Nearest to other: various, rather, fewer, others, different, certain, many, individual,
Nearest to up: off, out, back, rise, down, mutiny, road, open,
Nearest to it: he, this, she, what, wala, there, reintegrated, recognise,
Nearest to these: several, different, various, certain, multiple, numerous, many, expletive,
Nearest to use: allow, produce, form, refer, importance, develop, banknote, paley,
Nearest to many: some, several, various, numerous, all, most, certain, both,
Nearest to to: must, could, might, would, cannot, will, should, fgth,
Average loss at step 62000: 3.014585
Average loss at step 64000: 2.920011
Average loss at step 66000: 2.940597
Average loss at step 68000: 2.952805
Average loss at step 70000: 3.011108
Nearest to state: government, climbs, county, divorce, perfect, holdings, uniquely, geometrical,
Nearest to a: the, another, every, any, jeet, sutcliffe, mgm, predisposition,
Nearest to been: become, existed, occurred, remained, begun, come, evolved, pwnage,
Nearest to not: never, almost, still, rarely, detectives, nothing, billie, tribal,
Nearest to called: considered, named, known, used, referred, rasmussen, sold, portrayed,
Nearest to years: months, days, decades, minutes, year, times, centuries, weeks,
Nearest to see: includes, cem, known, provides, usmc, maga, melanoleuca, roddick,
Nearest to new: specific, katsura, old, skirt, different, gettysburg, neighbouring, cola,
Nearest to states: kingdom, nations, sps, airlines, displeasure, overman, provinces, paperbacks,
Nearest to other: various, different, fewer, smaller, others, certain, older, local,
Nearest to up: off, down, out, back, toe, anthropic, rise, away,
Nearest to it: he, she, this, they, stabilization, recognise, fein, always,
Nearest to these: various, those, different, certain, their, several, all, numerous,
Nearest to use: importance, release, prefer, usage, allow, return, types, produce,
Nearest to many: some, several, numerous, various, all, most, both, those,
Nearest to to: will, must, could, might, cannot, can, would, may,
Average loss at step 72000: 2.950280
Average loss at step 74000: 2.864707
Average loss at step 76000: 2.993262
Average loss at step 78000: 3.006166
Average loss at step 80000: 2.842763
Nearest to state: government, gradients, jtc, climbs, holdings, orange, divorce, flaps,
Nearest to a: another, any, the, jeet, every, each, this, mgm,
Nearest to been: become, occurred, come, existed, evolved, grown, remained, appeared,
Nearest to not: never, almost, defaulted, either, nothing, also, still, giraffe,
Nearest to called: named, considered, used, referred, known, featured, included, portrayed,
Nearest to years: months, days, decades, year, minutes, times, centuries, hours,
Nearest to see: cem, includes, melanoleuca, resumed, roddick, provides, danubian, known,
Nearest to new: old, rebekah, belgian, jungingen, specific, few, single, katsura,
Nearest to states: kingdom, nations, provinces, terrace, salinas, manchester, countries, airlines,
Nearest to other: various, fewer, rather, smaller, individual, others, certain, different,
Nearest to up: off, down, back, out, away, together, toe, pows,
Nearest to it: he, she, this, there, they, itself, today, what,
Nearest to these: various, whose, different, which, certain, some, other, numerous,
Nearest to use: usage, form, release, list, sense, kinds, return, importance,
Nearest to many: some, several, numerous, various, all, both, most, thousands,
Nearest to to: will, should, might, would, could, cannot, must, may,
Average loss at step 82000: 2.935587
Average loss at step 84000: 2.903268
Average loss at step 86000: 2.929577
Average loss at step 88000: 2.937466
Average loss at step 90000: 2.833315
Nearest to state: government, holdings, perfect, junction, county, incubus, descendant, minicomputers,
Nearest to a: another, any, the, every, jeet, this, warr, an,
Nearest to been: become, occurred, grown, existed, evolved, come, remained, begun,
Nearest to not: never, defaulted, occasionally, almost, irvin, nothing, billie, detectives,
Nearest to called: named, considered, known, used, referred, termed, described, portrayed,
Nearest to years: days, decades, months, centuries, minutes, weeks, year, versions,
Nearest to see: includes, roddick, contains, provides, allows, danubian, maga, include,
Nearest to new: old, previous, sub, local, specific, trilemma, rigidity, single,
Nearest to states: nations, kingdom, felt, countries, provinces, organizations, salinas, sc,
Nearest to other: others, fewer, interspersed, various, both, rather, pancakes, these,
Nearest to up: off, out, back, down, road, toe, him, rise,
Nearest to it: she, he, this, there, aloes, increasingly, damien, they,
Nearest to these: multiple, certain, some, different, separate, several, various, whose,
Nearest to use: usage, amount, importance, because, release, share, exposures, version,
Nearest to many: some, several, numerous, various, both, thousands, most, all,
Nearest to to: must, might, will, towards, should, may, would, cannot,
Average loss at step 92000: 2.894689
Average loss at step 94000: 2.887201
Average loss at step 96000: 2.712174
Average loss at step 98000: 2.448248
Average loss at step 100000: 2.703996
Nearest to state: government, holdings, hashem, nobel, federal, divorce, chairman, chrysler,
Nearest to a: another, the, any, jeet, every, bois, qua, greenspan,
Nearest to been: become, occurred, grown, evolved, existed, remained, begun, come,
Nearest to not: never, occasionally, nothing, indeed, almost, now, also, still,
Nearest to called: named, considered, formed, termed, referred, used, described, attributed,
Nearest to years: decades, days, months, minutes, hours, year, weeks, centuries,
Nearest to see: barisan, include, volga, allow, refer, melanoleuca, consider, contains,
Nearest to new: old, sub, trilemma, rebekah, zebulun, steelers, rigidity, shamanistic,
Nearest to states: kingdom, nations, provinces, outsider, countries, displeasure, felt, paperbacks,
Nearest to other: others, fewer, various, both, these, interspersed, different, individual,
Nearest to up: off, out, together, kicks, wheelchair, toe, him, down,
Nearest to it: she, he, there, transmembrane, garten, itself, fein, tiresias,
Nearest to these: certain, several, various, those, many, their, other, numerous,
Nearest to use: support, usage, version, application, amount, production, form, development,
Nearest to many: some, several, numerous, various, both, thousands, all, multiple,
Nearest to to: must, would, will, might, towards, could, should, cannot,

Re-use code to visualise embeddings


In [15]:
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)


Further work:

  • Explore the GloVe technique (Global Vectors for Word Representation). A good implementation is here, as described here; a minimal co-occurrence counting sketch follows this list.
  • Explore whether representing punctuation marks as tokens helps.
  • Explore better model performance metrics (e.g. evaluating on separate training and validation datasets).
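
For the GloVe suggestion above, the core ingredient is a word-word co-occurrence matrix, which GloVe then factorises with a weighted least-squares objective. Below is a minimal, purely illustrative sketch of counting symmetric co-occurrences over the already-encoded data list from this notebook (the window size is an arbitrary choice, and GloVe's usual 1/distance down-weighting of distant pairs is omitted):

In [ ]:
def cooccurrence_counts(encoded, window=2):
    """Count symmetric word-word co-occurrences within `window` positions."""
    counts = collections.Counter()
    for i, center in enumerate(encoded):
        lo, hi = max(0, i - window), min(len(encoded), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, encoded[j])] += 1
    return counts

# Keep it cheap by counting over a small slice of the corpus.
cooc = cooccurrence_counts(data[:100000], window=2)
print(len(cooc), 'nonzero co-occurrence cells in the first 100,000 tokens')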