This notebook lays out one possible method for generating word embeddings.
Word embeddings provide meaningful representations of words as they are used in context. They embed words in a high-dimensional space, mapping each word or phrase to its own vector. Word embeddings are trained to help with a specific task, which determines how meaningful they are as a representation.
For example, the relationships between vectors can be meaningful (image from the TensorFlow documentation):
Arithmetic with vectors can also be meaningful. Perhaps the most well-known example of this is:
$$ \text{king} - \text{man} + \text{woman} = \text{queen} $$
The positioning of these vectors in this space actually tells us something about how these words are used.
This allows us to do things like find the most similar words by looking at the closest words.
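To make the vector arithmetic concrete, here is a minimal sketch using hand-picked 2-d vectors. The `toy_vectors` values and the `nearest_word` helper are assumptions made up for illustration; they are not learned embeddings and are not part of this notebook's model.

```python
import numpy as np

# Hypothetical 2-d embeddings, hand-picked for illustration only:
# axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}

def nearest_word(vector, exclude=()):
    """Return the word whose toy vector is closest (Euclidean) to `vector`."""
    candidates = [w for w in toy_vectors if w not in exclude]
    return min(candidates, key=lambda w: np.linalg.norm(toy_vectors[w] - vector))

# king - man + woman, then look up the closest remaining word
result = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']
analogy = nearest_word(result, exclude=('king', 'man', 'woman'))
print(analogy)  # queen
```

With real embeddings the result of the arithmetic rarely lands exactly on a word's vector, which is why the lookup is a nearest-neighbor search rather than an equality test.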
As mentioned earlier, these word embeddings are trained to help with a particular task, which is learned through a neural network. Two tasks developed for training embeddings are CBOW ("Continuous Bag Of Words") and skip-grams; together these methods of learning word embeddings are called "Word2Vec".
For the CBOW task, we take the context words (the words around the target word) and give the target word. We want to predict whether or not the target word belongs to the context.
A skip-gram is basically the inverse: we take the target word (also called the "pivot"), then give the context. We want to predict whether or not the context belongs to the word.
They are quite similar but have different properties, e.g. CBOW works better on small datasets, while skip-grams work better for larger ones. In any case, the idea with word embeddings is that they can be trained to help with any task.
For this example, we're going to be working on the skip-gram task.
We need a reasonably-sized text corpus to learn from. Here we'll use State of the Union addresses retrieved from The American Presidency Project. These addresses tend to use similar patterns so we should be able to learn some decent word embeddings. Since the skip-gram task looks at context, texts that use words in a consistent way (i.e. in consistent contexts) are easier to learn from.
The corpus is available here as a compressed archive of .txt files. Download and un-zip this file and place it in the same directory as this notebook. The texts were preprocessed a bit (mainly removing URL-encoded characters). (nb: this isn't the complete collection of texts but enough to work with here).
Before we go any further, let's get a bit more concrete about what the skip-gram task is.
Let's consider the sentence "I think cats are cool".
The skip-gram task is as follows:
We take a word, e.g. 'cats', which we'll represent as $w_i$. We feed this as input into our neural network.
We take its surrounding context, e.g. ['I', 'think', 'are', 'cool']. We'll represent this as $\{w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}\}$ and we also feed this into our neural network.
For this particular example we'd want the network to output 1 (i.e. yes, that is the true context).
If we set $w_i$ to 'frogs', then we'd want the network to output 0. In our one-sentence corpus, ['I', 'think', 'are', 'cool'] is not the true context for 'frogs'. Sorry, frogs.
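The example above can be sketched in plain Python. The helper names `context_window` and `label` are hypothetical and exist only for this illustration; the notebook itself relies on keras' `skipgrams` function later on, which works on word pairs rather than whole context lists.

```python
def context_window(tokens, i, window_size=2):
    """Return the words within `window_size` positions of tokens[i], excluding tokens[i]."""
    start = max(0, i - window_size)
    return tokens[start:i] + tokens[i + 1:i + 1 + window_size]

sentence = ['I', 'think', 'cats', 'are', 'cool']

# the true context of 'cats' with a window of 2 words on each side
true_context = context_window(sentence, sentence.index('cats'))

def label(target, context):
    """1 if `context` is the true context of `target` in our one-sentence corpus, else 0."""
    if target not in sentence:
        return 0
    return int(context_window(sentence, sentence.index(target)) == context)

print(true_context)                  # ['I', 'think', 'are', 'cool']
print(label('cats', true_context))   # 1
print(label('frogs', true_context))  # 0
```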
For this example, we'll use keras to build the neural network that we'll use to learn the embeddings. keras is a high-level library that can use either a tensorflow or theano backend to handle low-level tasks. To switch between the tensorflow and theano backends, edit the keras configuration file at $HOME/.keras/keras.json.
In [1]:
import sklearn
import matplotlib.pyplot as plt
import scipy
import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Flatten, Activation, Merge
from keras.preprocessing.text import Tokenizer, base_filter
from keras.preprocessing.sequence import skipgrams, make_sampling_table
...Depending on which environment you're running this from, you may need to upgrade one of these libraries, which you can do by opening a terminal and running pip install [name of library] --upgrade
In [2]:
from glob import glob

text_files = glob('sotu/*.txt')
# locates all .txt files in the folder 'sotu'
# (in the same directory as this notebook)

# define a `text_generator` here,
# so that data can be loaded on-demand,
# avoiding having all data in memory unless it's needed:
def text_generator():
    for path in text_files:
        with open(path, 'r') as f:
            yield f.read()

n_files = len(text_files)
if n_files == 0:
    print("something's not right - check the file path?")
else:
    print("successfully located %d text files to use as training data!" % n_files)
Before we go any further, we need to map each word in our corpus to a number, so that we have a consistent way of referring to them. To do this, we'll fit a tokenizer to the corpus:
In [3]:
max_vocab_size = 50000
# setting an upper limit just in case
# `filters` specify what characters to get rid of
# `base_filter()` includes basic punctuation
tokenizer = Tokenizer(nb_words=max_vocab_size, filters=base_filter()+'')
# fit the tokenizer
tokenizer.fit_on_texts(text_generator())
# we also want to keep track of the actual vocab size:
vocab_size = len(tokenizer.word_index) + 1
# note: we add one because `0` is a reserved index in keras' tokenizer
print("found %d unique words" % vocab_size)
Now the tokenizer knows what tokens (words) are in our corpus and has mapped them to numbers. The keras tokenizer also indexes them in order of frequency (most common first, i.e. index 1 is usually a word like "the"), which will come in handy later.
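As a rough picture of what a frequency-ordered word index looks like, here is a simplified sketch in plain Python. `build_word_index` is a hypothetical helper, not the keras implementation: it only lowercases and splits on whitespace, and it breaks ties alphabetically, which keras does not promise to do.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sort by descending count; break ties alphabetically so the result is deterministic
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    # index 0 is left unused, mirroring the reserved index mentioned above
    return {word: i for i, word in enumerate(ordered, start=1)}

toy_texts = ['the cat saw the dog', 'the dog ran']
toy_index = build_word_index(toy_texts)
print(toy_index['the'], toy_index['dog'])  # 1 2

# turning a text into a sequence of indices
toy_sequence = [toy_index[w] for w in 'the dog ran'.split()]
print(toy_sequence)  # [1, 2, 4]
```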
First, let's set the hyperparameters (higher-level settings that determine how the model trains):
You may just need to play around to get results that fit well to the task at hand.
In [4]:
embedding_dim = 256
n_epochs = 60
# Higher numbers will add time to the training,
# lower numbers may give wildly inaccurate (er, abstract?) results.
loss_function = 'binary_crossentropy'
activation_function = 'sigmoid'
optimizer_type = 'adam'
# the task as we are framing it is to answer the question:
# "do the context words match the target word or not?"
# because this is a binary classification,
# we want the output to be normalized to [0,1] (sigmoid will work),
# and we can use 'binary crossentropy' as our loss
For the "skip-gram" task, build two separate models (one for the target word (also called the "pivot"), and one for the context words), and then merge them into one:
In [5]:
pivot_model = Sequential()
pivot_model.add(Embedding(vocab_size, embedding_dim, input_length=1))
context_model = Sequential()
context_model.add(Embedding(vocab_size, embedding_dim, input_length=1))
# merge the pivot and context models
model = Sequential()
model.add(Merge([pivot_model, context_model], mode='dot', dot_axes=2))
model.add(Flatten())
model.add(Activation(activation_function))
model.compile(optimizer=optimizer_type, loss=loss_function)
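To see what the merged model computes for a batch of (pivot, context) index pairs, here is a NumPy sketch under stated assumptions: two small random embedding tables stand in for the two `Embedding` layers, and the names `predict` and `binary_crossentropy` are written out by hand for illustration rather than taken from keras.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab_size, toy_dim = 10, 4

# stand-ins for the weights of the pivot and context Embedding layers
pivot_embeddings = rng.normal(size=(toy_vocab_size, toy_dim))
context_embeddings = rng.normal(size=(toy_vocab_size, toy_dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(pivot_ids, context_ids):
    """Look up both embeddings, take their dot product, squash to (0, 1)."""
    p = pivot_embeddings[pivot_ids]        # shape (batch, dim)
    c = context_embeddings[context_ids]    # shape (batch, dim)
    return sigmoid(np.sum(p * c, axis=1))  # shape (batch,)

def binary_crossentropy(labels, predictions):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    return -np.mean(labels * np.log(predictions)
                    + (1 - labels) * np.log(1 - predictions))

pivots = np.array([1, 2, 3])
contexts = np.array([4, 5, 6])
labels = np.array([1, 0, 0])

scores = predict(pivots, contexts)
loss = binary_crossentropy(labels, scores)
print(scores.shape, loss)
```

Training nudges both tables so that true pairs get a large dot product (score near 1) and false pairs a small one (score near 0).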
Run the code below to load pre-trained weights. You can download weights learned on this model over 60 epochs here.
OR skip to the following step to train the model from scratch (takes a few minutes):
In [6]:
# skip this if you want to train the model from scratch!
# run this to load pre-trained weights!
# change the filepath 'weights-60epochs.hdf5' to another path if necessary
model.load_weights('weights-60epochs.hdf5')
print("successfully loaded pre-trained weights")
# then skip to "Step 5: Extract the Embeddings"
In [12]:
# skip this if you want to use pre-trained weights!
# run this to train the model from scratch!

# used to sample words (indices)
sampling_table = make_sampling_table(vocab_size)

for i in range(n_epochs):
    loss = 0
    for seq in tokenizer.texts_to_sequences_generator(text_generator()):
        # generate skip-gram training examples
        # - `couples` consists of the pivots (i.e. target words) and surrounding contexts
        # - `labels` represent if the context is true or not
        # - `window_size` determines how far to look between words
        # - `negative_samples` specifies the ratio of negative couples
        #    (i.e. couples where the context is false)
        #    to generate with respect to the positive couples;
        #    i.e. `negative_samples=4` means "generate 4 times as many negative samples"
        couples, labels = skipgrams(seq, vocab_size, window_size=5, negative_samples=4, sampling_table=sampling_table)
        if couples:
            pivot, context = zip(*couples)
            pivot = np.array(pivot, dtype='int32')
            context = np.array(context, dtype='int32')
            labels = np.array(labels, dtype='int32')
            loss += model.train_on_batch([pivot, context], labels)
    print('epoch %d, %0.02f'%(i, loss))
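The positive and negative couples described in the comments above can be sketched in plain Python. `toy_skipgrams` is a hypothetical, simplified stand-in written for illustration: it omits the frequency-based subsampling done via `sampling_table` and the shuffling of the output, and its random negatives can occasionally coincide with a true context word.

```python
import random

def toy_skipgrams(sequence, vocab_size, window_size=2, negative_samples=1.0, seed=0):
    """Return (couples, labels): true (pivot, context) pairs labeled 1,
    plus randomly drawn false pairs labeled 0."""
    rng = random.Random(seed)
    couples, labels = [], []

    # positive couples: every word paired with each neighbor inside the window
    for i, pivot in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                couples.append((pivot, sequence[j]))
                labels.append(1)

    # negative couples: reuse a pivot, pair it with a random word index
    n_positive = len(couples)
    for _ in range(int(n_positive * negative_samples)):
        pivot = couples[rng.randrange(n_positive)][0]
        # index 0 is reserved, so draw from 1..vocab_size-1
        couples.append((pivot, rng.randint(1, vocab_size - 1)))
        labels.append(0)

    return couples, labels

toy_couples, toy_labels = toy_skipgrams([1, 2, 3, 4, 5], vocab_size=6,
                                        window_size=2, negative_samples=4)
print(sum(toy_labels), len(toy_labels))  # 14 70
```

With `negative_samples=4`, the 14 positive couples are joined by 56 negative ones, which is the 4-to-1 ratio used in the training loop.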
Wait a few minutes for training...
then save the trained weights to re-use later:
In [13]:
model.save_weights('weights2.hdf5')
print("successfully saved trained weights")
In [14]:
embeddings = model.get_weights()[0]
We also want to set aside the tokenizer's word_index and build a reverse_word_index from it (so we can look up indices for words and words from indices):
In [15]:
word_index = tokenizer.word_index
reverse_word_index = {v: k for k, v in word_index.items()}
That's it for learning the embeddings. Now we can try using them:
Each word embedding is a mapping of a specific word to a point in space. If we want to find words that are similar (in terms of how they are used) to some target word, we look for embeddings that are near the point in space where the target word's embedding is mapped.
An example will make this clearer.
First, let's write a simple function to retrieve an embedding for a word:
In [16]:
def get_embedding(word):
    idx = word_index[word]
    # make it 2d
    return embeddings[idx][:,np.newaxis].T
Then we can define a function to get the most similar word for an input word:
In [17]:
from scipy.spatial.distance import cdist

ignore_n_most_common = 50

def get_closest(word):
    embedding = get_embedding(word)

    # get the distance from the embedding
    # to every other embedding
    distances = cdist(embedding, embeddings)[0]

    # pair each embedding index and its distance
    distances = list(enumerate(distances))

    # sort from closest to furthest
    distances = sorted(distances, key=lambda d: d[1])

    # skip the first one; it's the target word
    for idx, dist in distances[1:]:
        # ignore the n most common words;
        # they can get in the way.
        # because the tokenizer organized indices
        # from most common to least, we can just do this
        if idx > ignore_n_most_common:
            return reverse_word_index[idx]
Now let's give it a try (you may get different results):
In [18]:
print('freedom ~ %s' % get_closest('freedom'))
print('justice ~ %s' % get_closest('justice'))
print('america ~ %s' % get_closest('america'))
print('history ~ %s' % get_closest('history'))
print('citizen ~ %s' % get_closest('citizen'))
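If more than one neighbor is wanted, the same idea extends to returning the `n` closest words. The sketch below is self-contained and uses a tiny hand-made embedding matrix; `toy_words`, `toy_embeddings` and `closest_words` are assumptions for illustration and do not touch the trained `embeddings` above.

```python
import numpy as np

# row 0 plays the role of the reserved index and is never returned
toy_words = ['<reserved>', 'freedom', 'liberty', 'tax', 'budget']
toy_embeddings = np.array([
    [0.0, 0.0],
    [1.0, 0.9],
    [0.9, 1.0],
    [-1.0, 0.2],
    [-0.9, 0.1],
])

def closest_words(word, n=2):
    """Return the `n` words whose toy embeddings are nearest (Euclidean) to `word`."""
    idx = toy_words.index(word)
    distances = np.linalg.norm(toy_embeddings - toy_embeddings[idx], axis=1)
    order = np.argsort(distances)  # indices sorted from closest to furthest
    return [toy_words[i] for i in order if i != idx and i != 0][:n]

print(closest_words('freedom', n=1))  # ['liberty']
print(closest_words('tax', n=1))      # ['budget']
```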
Do words have relations?