Word2Vec

This notebook lays out one possible method for generating word embeddings.

Word embeddings provide meaningful representations of words as they are used in context. They embed words in a high-dimensional space, mapping each word or phrase to its own vector. Word embeddings are trained to help with a specific task, and that task determines how meaningful they are as representations.

For example, the relationships between vectors can be meaningful (image from the TensorFlow documentation):

Arithmetic with vectors can also be meaningful. Perhaps the most well-known example of this is:

$$ \text{king} - \text{man} + \text{woman} = \text{queen} $$

The positioning of these vectors in this space actually tells us something about how these words are used.

This allows us to do things like find words similar to a target word by looking for its nearest neighbors in the embedding space.
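
To make "closeness" and vector arithmetic concrete, here's a toy sketch using made-up 2-D vectors (real embeddings have hundreds of dimensions, but the operations are the same); cosine similarity is one common way to measure how close two vectors are:

import numpy as np

# made-up 2-D "embeddings", purely for illustration
vectors = {
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, 0.2]),
    'man':   np.array([0.1, 0.8]),
    'woman': np.array([0.1, 0.2]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman lands closest to queen (in this toy example)
target = vectors['king'] - vectors['man'] + vectors['woman']
for word, vec in vectors.items():
    print('%s: %0.2f' % (word, cosine_similarity(target, vec)))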

As mentioned earlier, word embeddings are trained to help with a particular task; the embeddings themselves are learned by a neural network trained on that task. Two tasks developed for training embeddings are CBOW ("Continuous Bag Of Words") and skip-grams; together these methods of learning word embeddings are called "Word2Vec".

For the CBOW task, we give the network the context words (the words around the target word) along with a candidate target word, and we want it to predict whether or not the target word belongs to that context.

The skip-gram task is basically the inverse: we give the network the target word (also called the "pivot") along with a candidate context, and we want it to predict whether or not the context belongs to that word.

They are quite similar but have different properties, e.g. skip-grams tend to work better with smaller datasets and rarer words, while CBOW trains faster and does well on frequent words in larger corpora. In any case, the idea with word embeddings is that they can be trained to help with almost any task.

For this example, we're going to be working on the skip-gram task.

Corpus

We need a reasonably-sized text corpus to learn from. Here we'll use State of the Union addresses retrieved from The American Presidency Project. These addresses tend to use similar patterns so we should be able to learn some decent word embeddings. Since the skip-gram task looks at context, texts that use words in a consistent way (i.e. in consistent contexts) are easier to learn from.

The corpus is available here as a compressed archive of .txt files. Download and un-zip this file and place it in the same directory as this notebook. The texts were preprocessed a bit (mainly removing URL-encoded characters). (nb: this isn't the complete collection of texts but enough to work with here).
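
If you'd rather unzip from within the notebook, here's a minimal sketch; it assumes the downloaded archive is saved next to this notebook as 'sotu.zip' (a hypothetical filename -- use whatever name your download actually has), and that extracting it produces a folder of .txt files:

import zipfile

# extract the archive into the current directory;
# adjust the paths so the .txt files end up where the next step expects them
with zipfile.ZipFile('sotu.zip') as archive:
    archive.extractall('.')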

Skip-grams

Before we go any further, let's get a bit more concrete about what the skip-gram task is.

Let's consider the sentence "I think cats are cool".

The skip-gram task is as follows:

  • We take a word, e.g. 'cats', which we'll represent as $w_i$. We feed this as input into our neural network.
  • We take the word's context, e.g. ['I', 'think', 'are', 'cool']. We'll represent this as $\{w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}\}$ and we also feed this into our neural network.
  • Then we just want our network to predict (i.e. classify) whether or not $\{w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}\}$ is the true context of $w_i$.

For this particular example we'd want the network to output 1 (i.e. yes, that is the true context).

If we set $w_i$ to 'frogs', then we'd want the network to output 0. In our one-sentence corpus, ['I', 'think', 'are', 'cool'] is not the true context for 'frogs'. Sorry frogs šŸø.
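
To make this concrete, here's a rough sketch (plain Python, nothing to install) of the kind of (pivot, context, label) examples we want; the actual training examples will be generated later by keras' skipgrams helper:

# our one-sentence corpus
sentence = ['I', 'think', 'cats', 'are', 'cool']

# a positive example: the pivot 'cats' with its true context, labeled 1
pivot = 'cats'
context = [w for w in sentence if w != pivot]
print((pivot, context, 1))

# a negative example: a pivot that never appeared in this context, labeled 0
print(('frogs', context, 0))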

Step 1: Import dependencies.

For this example, we'll use keras to build the neural network that will learn the embeddings. keras is a high-level library that can use either a tensorflow or theano backend to handle low-level tasks. To switch between the tensorflow and theano backends, edit the keras configuration file at $HOME/.keras/keras.json.


InĀ [1]:
import sklearn
import matplotlib.pyplot as plt
import scipy
import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Flatten, Activation, Merge
from keras.preprocessing.text import Tokenizer, base_filter
from keras.preprocessing.sequence import skipgrams, make_sampling_table


Using TensorFlow backend.

Depending on which environment you're running this from, you may find yourself needing to upgrade one of these libraries, which you can do by opening a terminal and typing pip install [name of library] --upgrade

Step 2: Load data for training.

Change 'sotu/' to the path of the folder containing the .txt files that you want to train the model on.

If the folder is in a directory other than the one this notebook is saved in, use an absolute path (e.g. '/sharedfolder/datasets/folder/*.txt').


InĀ [2]:
from glob import glob
text_files = glob('sotu/*.txt') 
# locates all .txt files in the folder 'sotu' 
# (in the same directory as this notebook)

# define a `text_generator` here, 
# so that data can be loaded on-demand, 
# avoiding having all data in memory unless it's needed:
def text_generator():
    for path in text_files:
        with open(path, 'r') as f:
            yield f.read()
   
n_files = len(text_files)
if n_files == 0:
    print "something's not right - check the file path?"
else:
    print "successfully located", n_files, "text files to use as training data!"


successfully located 84 text files to use as training data!

Before we go any further, we need to map each word in our corpus to a number, so that we have a consistent way of referring to them. To do this, we'll fit a tokenizer to the corpus:


InĀ [3]:
max_vocab_size = 50000
# setting an upper limit just in case

# `filters` specifies which characters to strip out;
# `base_filter()` includes basic punctuation 
tokenizer = Tokenizer(nb_words=max_vocab_size, filters=base_filter())

# fit the tokenizer
tokenizer.fit_on_texts(text_generator())

# we also want to keep track of the actual vocab size:
vocab_size = len(tokenizer.word_index) + 1
# note: we add one because `0` is a reserved index in keras' tokenizer

print "found",vocab_size,"unique words"


found 15288 unique words

Now the tokenizer knows what tokens (words) are in our corpus and has mapped them to numbers. The keras tokenizer also indexes them in order of frequency (most common first, i.e. index 1 is usually a word like "the"), which will come in handy later.
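
As a quick, optional check of that ordering, we can peek at the lowest indices; the exact words depend on your corpus, but index 1 should be a very common word:

# the five most frequent tokens, i.e. the ones with the lowest indices
most_common = sorted(tokenizer.word_index.items(), key=lambda kv: kv[1])[:5]
print(most_common)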

Step 3: Define the model.

First, let's set the hyperparameters (higher-level settings that determine how the model trains):

You may need to play around with these to get results that suit the task at hand.


InĀ [4]:
embedding_dim = 256
n_epochs = 60
# Higher numbers will add time to the training, 
# lower numbers may give wildly inaccurate (er, abstract?) results.

loss_function = 'binary_crossentropy'
activation_function = 'sigmoid'
optimizer_type = 'adam'
# the task as we are framing it is to answer the question: 
# "do the context words match the target word or not?"
# because this is a binary classification, 
# we want the output to be normalized to [0,1] (sigmoid will work),
# and we can use 'binary crossentropy' as our loss

For the "skip-gram" tast, build two separate models (one for the target word (also called the "pivot"), and one for the context words), and then merge them into one:


InĀ [5]:
pivot_model = Sequential()
pivot_model.add(Embedding(vocab_size, embedding_dim, input_length=1))

context_model = Sequential()
context_model.add(Embedding(vocab_size, embedding_dim, input_length=1))

# merge the pivot and context models
model = Sequential()
model.add(Merge([pivot_model, context_model], mode='dot', dot_axes=2))
model.add(Flatten())

model.add(Activation(activation_function))
model.compile(optimizer=optimizer_type, loss=loss_function)
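
If you'd like to inspect the merged architecture before training (optional), keras can print a layer-by-layer summary:

model.summary()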

Step 4: Train the model.

Run the code below to load pre-trained weights. You can download weights learned on this model over 60 epochs here.

OR skip to the following step to train the model from scratch (takes a few minutes):


InĀ [6]:
# skip this if you want to train the model from scratch!

# run this to load pre-trained weights!
# change the filepath 'weights-60epochs.hdf5' to another path if necessary

model.load_weights('weights-60epochs.hdf5')
print"successfully loaded pre-trained weights"

# then skip to "Step 5: Extract the Embeddings"


successfully loaded pre-trained weights

InĀ [12]:
# skip this if you want to use pre-trained weights!

# run this to train the model from scratch!

# `make_sampling_table` builds a table of sampling probabilities,
# used to down-sample very frequent words (indices) during training
sampling_table = make_sampling_table(vocab_size)

for i in range(n_epochs):
    loss = 0
    for seq in tokenizer.texts_to_sequences_generator(text_generator()):
        # generate skip-gram training examples
        # - `couples` consists of the pivots (i.e. target words) and surrounding contexts
        # - `labels` indicate whether each couple is a true context (1) or a negative sample (0)
        # - `window_size` determines how many words on either side of the pivot count as context
        # - `negative_samples` specifies the ratio of negative couples
        #    (i.e. couples where the context is false)
        #    to generate with respect to the positive couples;
        #    i.e. `negative_samples=4` means "generate 4 times as many negative samples"
        couples, labels = skipgrams(seq, vocab_size, window_size=5, negative_samples=4, sampling_table=sampling_table)
        if couples:
            pivot, context = zip(*couples)
            pivot = np.array(pivot, dtype='int32')
            context = np.array(context, dtype='int32')
            labels = np.array(labels, dtype='int32')
            loss += model.train_on_batch([pivot, context], labels)
    print('epoch %d, %0.02f'%(i, loss))


epoch 0, 17.30
epoch 1, 17.17
epoch 2, 17.03
epoch 3, 16.90
epoch 4, 16.72
epoch 5, 16.54
epoch 6, 16.32
epoch 7, 16.13
epoch 8, 15.90
epoch 9, 15.65
epoch 10, 15.43
epoch 11, 15.16
epoch 12, 14.94
epoch 13, 14.62
epoch 14, 14.33
epoch 15, 14.10
epoch 16, 13.80
epoch 17, 13.55
epoch 18, 13.26
epoch 19, 13.02
epoch 20, 12.73
epoch 21, 12.48
epoch 22, 12.27
epoch 23, 12.01
epoch 24, 11.78
epoch 25, 11.59
epoch 26, 11.33
epoch 27, 11.15
epoch 28, 11.01
epoch 29, 10.81
epoch 30, 10.66
epoch 31, 10.45
epoch 32, 10.29
epoch 33, 10.16
epoch 34, 9.99
epoch 35, 9.89
epoch 36, 9.74
epoch 37, 9.64
epoch 38, 9.49
epoch 39, 9.33
epoch 40, 9.26
epoch 41, 9.10
epoch 42, 9.07
epoch 43, 8.95
epoch 44, 8.84
epoch 45, 8.77
epoch 46, 8.66
epoch 47, 8.61
epoch 48, 8.49
epoch 49, 8.43
epoch 50, 8.39
epoch 51, 8.29
epoch 52, 8.18
epoch 53, 8.18
epoch 54, 8.06
epoch 55, 8.01
epoch 56, 7.94
epoch 57, 7.94
epoch 58, 7.88
epoch 59, 7.77

Wait a few minutes for training to finish... then save the trained weights to re-use later:


InĀ [13]:
model.save_weights('weights2.hdf5')
print"successfully saved trained weights"


successfully saved trained weights

Step 5: Extract the embeddings

That is, the weights of the pivot embedding layer:


InĀ [14]:
embeddings = model.get_weights()[0]
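
As an optional sanity check, the embedding matrix should have one row per vocabulary index and one column per embedding dimension:

print(embeddings.shape)
# expect (vocab_size, embedding_dim), i.e. (15288, 256) for this corpus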

We also want to set aside the tokenizer's word_index and reverse_word_index (so we can look up indices for words and words from indices):


InĀ [15]:
word_index = tokenizer.word_index
reverse_word_index = {v: k for k, v in word_index.items()}
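
A quick, optional round-trip check that the two mappings agree (the exact index you see will depend on your corpus):

idx = word_index['freedom']        # word -> index
print(idx)
print(reverse_word_index[idx])     # index -> word; should print 'freedom'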

That's it for learning the embeddings. Now we can try using them:

Getting similar words

Each word embedding is a mapping of a specific word to a point in space. If we want to find words that are similar (in terms of how they are used) to some target word, we look for embeddings that lie close to the target word's embedding in that space.

An example will make this clearer.

First, let's write a simple function to retrieve an embedding for a word:


InĀ [16]:
def get_embedding(word):
    idx = word_index[word]
    # make it 2d
    return embeddings[idx][:,np.newaxis].T

Then we can define a function to get a most similar word for an input word:


InĀ [17]:
from scipy.spatial.distance import cdist

ignore_n_most_common = 50

def get_closest(word):
    embedding = get_embedding(word)

    # get the distance from the embedding
    # to every other embedding
    distances = cdist(embedding, embeddings)[0]

    # pair each embedding index and its distance
    distances = list(enumerate(distances))

    # sort from closest to furthest
    distances = sorted(distances, key=lambda d: d[1])

    # skip the first one; it's the target word
    for idx, dist in distances[1:]:
        # ignore the n most common words;
        # they can get in the way.
        # because the tokenizer organized indices
        # from most common to least, we can just do this
        if idx > ignore_n_most_common:
            return reverse_word_index[idx]
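
Note that cdist defaults to Euclidean distance. Cosine distance is also widely used for comparing word embeddings; if you want to experiment, here's a sketch of the same lookup with metric='cosine' (everything else is unchanged):

def get_closest_cosine(word):
    embedding = get_embedding(word)
    # same as `get_closest`, but measuring cosine distance instead
    distances = cdist(embedding, embeddings, metric='cosine')[0]
    distances = sorted(enumerate(distances), key=lambda d: d[1])
    for idx, dist in distances[1:]:
        if idx > ignore_n_most_common:
            return reverse_word_index[idx]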

Now let's give it a try (you may get different results):


InĀ [18]:
print 'freedom ~', get_closest('freedom')
print 'justice ~', get_closest('justice')
print 'america ~', get_closest('america')
print 'history ~', get_closest('history')
print 'citizen ~', get_closest('citizen')


freedom ~ peace
justice ~ infused
america ~ country
history ~ nation
citizen ~ every

Do words have relations?