Word2Vec Example

(C) 2018 by Damir Cavar

Version: 1.1, November 2018

This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Spring and Fall 2018 at Indiana University. This material is based on Chris Manning's lecture 2 Word Vector Representations: word2vec and additional sources with extended annotations and explanations.


Here we will discuss briefly the necessary methods to understand the Word2Vec algorithm. We will use Numpy for the basic computations.

In [1]:
import numpy as np

Using One-Hot Vectors

We can create a one-hot vector that selects the 3rd row:

In [2]:
x = np.array([0, 0, 1, 0])

array([0, 0, 1, 0])

Let us create a matrix $A$ of four rows:

In [3]:
A = np.array([[1,   2,  3,  4],
              [5,   6,  7,  8],
              [9,  10, 11, 12],
              [13, 14, 15, 16]])

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])

We can use the column vector $x$ to select a row in matrix $A$:

In [4]:

array([ 9, 10, 11, 12])

Computing the Dot-Product

Let us simplify and assume that the dot-product of two vectors is the sum of the products of the scalars in the particular dimension of each vector, e.g.:

$$ u^T v = u \cdot v = \sum_{i=1}^n{u_i v_i} $$

The more similar two vectors $u$ and $v$ are in terms of directionality, the larger the dot-product is.

We could think of the dot product $u \cdot v$ as projecting one vector on the other. If the two vectors point into the same direction, the dot-product will be the largest. If one vector is orthogonal to the other, it will be $0$. If the vectors are pointing into opposite directions, the dot-product will be negative.

Computing Softmax

Assume that we have some data or results of mutually exclusive variables that represent scores for an observation being of class $C = [ c_1, c_2, c_3 ]$, as represented by the columns in the vector $y$ below:

In [5]:
y = np.array([4.0, 2.5, 1.1])

If we want to convert the scores into a probability distribution that represents the likelihood that on the basis of these scores the observation belongs to one of the classes, we can compute the Softmax of the vector:

$$ p(C_n) = \frac{\exp(\theta \cdot X_n)}{\sum_{i=1}^N{\exp(\theta \cdot X_i)}} $$

The parameter $\theta$ allows us to scale the results to increase the probabilities of lower scalars in the vector. The exponentiation of $X$ makes larger values much larger. If we include a parameter like $\theta$, we can scale the effect and increase the probabilities assigned to lower values. See for more details the implementation of softmax below.

In Python we can write this using Numpy's exp and sum functions. The axis parameter determines that the some is performed row-wise:

In [6]:
def softmax1(y):
    return np.exp(y) / np.sum(np.exp(y), axis=0)

In [7]:
softmax1([4.0, 4.0, 2.0])

array([0.46831053, 0.46831053, 0.06337894])

We can provide a parameter $\theta$ to the function, to be able to scale the probability for low values up. The larger $\theta$, the higher the probability assigned to lower values. We set the default for $\theta$ to $1.0$ in the softmax definition:

In [8]:
def softmax(y, t=1.0):
    return np.exp(y / t) / np.sum(np.exp(y / t), axis=0)

For a vector of values $[ 4.0, 4.0, 2.0 ]$, we get the following probability distribution given a default $\theta$ of $1.0$:

In [9]:
softmax(np.array([4.0, 4.0, 2.0]))

array([0.46831053, 0.46831053, 0.06337894])

If we double $\theta$, the probability assigned to the third scalar increases significantly:

In [10]:
softmax(np.array([4.0, 4.0, 2.0]), 2.0)

array([0.4223188, 0.4223188, 0.1553624])

Computing Word2Vec

Let us first focus on the skip-gram model. For every word $w$ at position $t$ with some window size or radius of $m$ we want to predict the context of $w$. Our objective function is to maximize the probability of any context word given the current center word:

$$J'(\theta) = \prod_{t=1}^T \prod_{\substack{-m \leq j \leq m\\j \neq 0}} P(w_{t+j} | w_t; \theta)$$

Note that $\theta$ here is a hyper-parameter. We will get back to it later. There is also another hidden hyper-parameter, that is the radius $m$.

We can reformulate this equation as the sum of the log-likelihoods:

$$ J(\theta) = - \frac{1}{T} \sum_{t=1}^T{ \sum_{\substack{-m \leq j \leq m\\ j \neq 0}} \log(P(w_{t+j}| w_t)) } $$

In the equation above we are averaging over all words in the text by dividing with $T$. We take the minus of the sums to get a positive score, since the log of a probability will be always negative. This way our goal is also to minimize the loss function $J(\theta)$, that is to minimize the negative log-likelihood.

The task can be described as: with a specific center word somewhere in the middle of a sentence, pick one random word in a specific radius of $n$ words left and right, we will get as output the probability for every word in the vocabulary of being the selected context word.

The model is trained using word pairs extracted from a text. We go word by word through some text, selecting this word as the center word, and pick the context words from a windows of $n$ size. For example, here for $n=3$ (left and right of the center word):

Our model will be based on the distributional properties of the word pairs, that is the frequency of their cooccurrence. The left word in the example pairs above is the center word (red) and the right word is one of the context words.

Assume that we have learned 300 features represented in vectors for center words and their context respectively, that is, we have two independent vectors for each word, one that represents its features when it is a context word, and one that represents its features as a center word. This is our model $p(w_{t+j}|w_t)$ for a word at position $t$ in the text, and words in $t+j$ positions in the text relative to $t$. For the purpose of explanation here, we could assume that we have such vector pairs of 300 features for 100,000 such words.

In the following equation we will assume $c$ and $o$ to be indices in the space of the word types that we model. These indices now refer to the list of vocabulary, not to positions in the text, as $t$ does above.

We will compute the Softmax of two word vectors for a context word (index $o$ in the vocabulary) and a center word (index $c$ in the vocabulary) by taking the dot-product of their 300-dimensional feature vectors. That is, one is the vector of a word in the context $u_o$, and the other is the vector of the center word $v_c$. The $u$ vectors are in the context word vectors and the $v$ vectors are the center word vectors.

$$ p(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{W=1}^V{\exp(u_w^T v_c)}} $$

We compute the dot-product between the one-hot vector representing one center word $w_t$ in the center word matrix $W$. This is the lookup matrix (looking up column) of word embedding matrix as representation of center word. (See Lecture 2, Word Vector Representations: word2vec, Stanford presentation by Chris Manning). As explained above, this will result in picking a concrete center word feature vector from the corresponding matrix:

$$\begin{bmatrix} 0 \\[0.3em] 0 \\[0.3em] 0 \\[0.3em] 0 \\[0.3em] 1 \\[0.3em] 0 \\[0.3em] 0 \\[0.3em] 0 \end{bmatrix} \begin{bmatrix} -- & 0.2 & -- \\[0.3em] -- & -1.4 & -- \\[0.3em] -- & 0.3 & -- \\[0.3em] -- & -0.1 & -- \\[0.3em] -- & 0.1 & -- \\[0.3em] -- & -0.3 & -- \end{bmatrix} = \begin{bmatrix} 0.2 \\[0.3em] -1.4 \\[0.3em] 0.3 \\[0.3em] -0.1 \\[0.3em] 0.1 \\[0.3em] -0.3 \end{bmatrix} = V_c $$

Take this center word matrix to be the hidden layer of a simple neural network.

We then take the dot-product of this vector of the center word with the matrix of the context vectors (here the scalars dashed out), or output word representation, for each word in the context (or the $n$-sized radius) around the center word, which gives us a 100,000 dimensional vector of weights for each word from the vocabulary in the context of the center vector:

$$V_c \cdot \begin{bmatrix} -- & -- & -- \\[0.3em] -- & -- & -- \\[0.3em] -- & -- & -- \\[0.3em] -- & -- & -- \\[0.3em] -- & -- & -- \\[0.3em] -- & -- & -- \end{bmatrix} = u_o \cdot v_c$$

Take this context vector matrix to be the output matrix.

In the next step we use Softmax to compute the probability distribution of this vector:

$$\mbox{Softmax}(u_o \cdot v_c) = \mbox{Softmax}( \begin{bmatrix} 1.7 \\[0.3em] 0.3 \\[0.3em] 0.1 \\[0.3em] -0.7 \\[0.3em] -0.2 \\[0.3em] 0.1 \\[0.3em] 0.7 \end{bmatrix}) = \begin{bmatrix} 0.44 \\[0.3em] 0.11 \\[0.3em] 0.09 \\[0.3em] 0.04 \\[0.3em] 0.07 \\[0.3em] 0.09 \\[0.3em] 0.16 \end{bmatrix} $$

Here once more the softmax function over the vector $u_o \cdot v_c$:

In [11]:
softmax(np.array([1.7, 0.3, 0.1, -0.7, -0.2, 0.1, 0.7]))

array([0.44276077, 0.10918346, 0.08939186, 0.04016635, 0.06622312,
       0.08939186, 0.16288258])

The rounded sum results in $1.0$:

In [12]:
sum([0.44, 0.11, 0.09, 0.04, 0.07, 0.09, 0.16])


We could now compare the resulting probabilities with some ground truth and compute the error rate for example. If the ground truth for the above result would be:

$$\begin{bmatrix} 0 \\[0.3em] 0 \\[0.3em] 0 \\[0.3em] 0 \\[0.3em] 0 \\[0.3em] 0 \\[0.3em] 1 \end{bmatrix}$$

this would mean that the model assigned a probability of $0.16$ to this word, then we could compute the loss.

How do we learn the parameters for each word that would maximize the prediction of a context word (or vice versa)?

Training the Model

Assume that all parameters of the model are defined in a long vector $\theta$. This vector will have twice the length of the vocabulary size, containing a weight for every word as a center word, and every word as a context word:

$$ \theta = \begin{bmatrix} v_{a} \\[0.3em] v_{ant} \\[0.3em] \vdots \\[0.3em] v_{zero} \\[0.3em] u_{a} \\[0.3em] u_{ant} \\[0.3em] \vdots \\[0.3em] u_{zero} \end{bmatrix} \in \mathbb{R}^{2dV} $$

We repeat here again the objective function in which we want to minimize the log-likelihood:

$$ J(\theta) = - \frac{1}{T} \sum_{t=1}^T{ \sum_{\substack{-m \leq j \leq m\\ j \neq 0}} \log(P(w_{t+j}| w_t)) } $$

Our softmax function discussed above fits into this equation:

$$ p(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{W=1}^V{\exp(u_w^T v_c)}} $$
$$ J(\theta) = - \frac{1}{T} \sum_{t=1}^T{ \sum_{\substack{-m \leq j \leq m\\ j \neq 0}} \log(p(o|c)) } $$

Our goal is to minimize the loss function and maximize the likelihood that we predict the right word in the context for any given center word.

Changing the parameters can be achieved using the gradient. We want to minimize the negative log of the following equation, computing the partial derivative with respect to the center vector ($v_c$):

$$ \frac{\partial }{\partial v_c} \log(\frac{\exp(u_o^T v_c)}{\sum_{W=1}^V{\exp(u_w^T v_c)}}) $$

Note a partial derivative of a function with several variables is its derivative with respect to one of these variables. Here it is $v_c$.

The log of a division can be converted into a subtraction:

$$ \frac{\partial }{\partial v_c} \ \log(\exp(u_o^T v_c)) - \log(\sum_{W=1}^V{\exp(u_w^T v_c)}) $$

We can simplify the first part of the subtraction, since $\log$ and $\exp$ cancel each other out, and the partial derivative with respect to $v_c$ of the simplified equation is simply $u_0$:

$$ \frac{\partial }{\partial v_c} \ \log(\exp(u_o^T v_c)) = \frac{\partial }{\partial v_c} \ u_o^T v_c = u_0 $$

For the second part of the subtraction above, we get:

$$ \frac{\partial }{\partial v_c} \ \log(\sum_{W=1}^V{\exp(u_w^T v_c)}) $$

Using the chain rule we can simplify this equation as well. The chain rule expresses the derivative of the composition of two functions, that is mapping $x$ onto $f(g(x))$, in terms of derivatives of $f$ and $g$.

Let us first start with derivatives of simple functions. What is the derivative of a constant $5$ for example? It is $0$. Since the derivative measures the change of a function, and a constant does not change, the derivative of a constant is $0$. The derivative of a variable $x$ is $1$. The derivative of the product of a constant and a variable is the constant. The derivative of $x^n$ is $n \cdot x^{n-1}$.

A differentiable function of a variable is a function whose derivative exists at each point in its domain (see here).The graph of a differentiable function must have a (non-vertical) tangent line at each point in its domain, and cannot contain any breaks, bends, or cusps. The following function $y = |x|$ (absolute value function) is not differentiable:

The following function is differentiable:

The chain rule allows one to find derivatives of compositions of functions by knowing the derivative of the elementary functions from the composition. Let $g(x)$ be differentiable at $x$ and $f(x)$ be differentiable at $f(g(x))$. Then, if $y=f(g(x))$ and $u=g(x)$:



see for the derivation Chris Manning's video...

The equation he derives is:

$$ u_o - \sum_{x=1}^V p(x|c) u_x $$

He labels the left part ($u_o$) as observation. This is the context word vectors that we identified in the texts or data. He labels the second part ($\sum_{x=1}^V p(x|c) u_x$) as expectation. This is the part that we want to tweak such that the loss function is minimized.

For maximizing or minimizing the cost (or objective) function using gradient descent, consider this simple example: Find the local minimum of the function $f(x) = x^4 - 3 x^3 + 2$ with derivative $f'(x) = 4 x^3 - 9 x^2$.

Subtracting a fraction of the gradient moves you to the minimum:

In [13]:
x_old = 0
x_new = 6
eps = 0.01 # step size
precision = 0.00001

def f_derivative(x):
    return 4 * x**3 - 9 * x**2

while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - eps * f_derivative(x_old)
    print("x_old:", x_old, " Local minimum occurs at", x_new)

x_old: 6  Local minimum occurs at 0.5999999999999996
x_old: 0.5999999999999996  Local minimum occurs at 0.6237599999999996
x_old: 0.6237599999999996  Local minimum occurs at 0.6490692731402646
x_old: 0.6490692731402646  Local minimum occurs at 0.6760475763767438
x_old: 0.6760475763767438  Local minimum occurs at 0.704821965498881
x_old: 0.704821965498881  Local minimum occurs at 0.7355261366038248
x_old: 0.7355261366038248  Local minimum occurs at 0.7682992721113444
x_old: 0.7682992721113444  Local minimum occurs at 0.8032842278686305
x_old: 0.8032842278686305  Local minimum occurs at 0.840624861847519
x_old: 0.840624861847519  Local minimum occurs at 0.8804622684298664
x_old: 0.8804622684298664  Local minimum occurs at 0.9229296507309586
x_old: 0.9229296507309586  Local minimum occurs at 0.9681455460305634
x_old: 0.9681455460305634  Local minimum occurs at 1.016205130521792
x_old: 1.016205130521792  Local minimum occurs at 1.067169389942697
x_old: 1.067169389942697  Local minimum occurs at 1.1210520795330405
x_old: 1.1210520795330405  Local minimum occurs at 1.1778046421472836
x_old: 1.1778046421472836  Local minimum occurs at 1.237299637824332
x_old: 1.237299637824332  Local minimum occurs at 1.2993137782331108
x_old: 1.2993137782331108  Local minimum occurs at 1.3635123370474889
x_old: 1.3635123370474889  Local minimum occurs at 1.429437442158506
x_old: 1.429437442158506  Local minimum occurs at 1.4965033788967752
x_old: 1.4965033788967752  Local minimum occurs at 1.5640022802344904
x_old: 1.5640022802344904  Local minimum occurs at 1.6311231270849003
x_old: 1.6311231270849003  Local minimum occurs at 1.6969855549473505
x_old: 1.6969855549473505  Local minimum occurs at 1.7606875094969714
x_old: 1.7606875094969714  Local minimum occurs at 1.821362659674955
x_old: 1.821362659674955  Local minimum occurs at 1.8782404675959476
x_old: 1.8782404675959476  Local minimum occurs at 1.930700009196379
x_old: 1.930700009196379  Local minimum occurs at 1.9783089472809865
x_old: 1.9783089472809865  Local minimum occurs at 2.0208417065692057
x_old: 2.0208417065692057  Local minimum occurs at 2.0582751831448975
x_old: 2.0582751831448975  Local minimum occurs at 2.090764845528107
x_old: 2.090764845528107  Local minimum occurs at 2.118607415721545
x_old: 2.118607415721545  Local minimum occurs at 2.1421976265432066
x_old: 2.1421976265432066  Local minimum occurs at 2.1619858762300224
x_old: 2.1619858762300224  Local minimum occurs at 2.178441640823547
x_old: 2.178441640823547  Local minimum occurs at 2.1920251576443675
x_old: 2.1920251576443675  Local minimum occurs at 2.203167862727841
x_old: 2.203167862727841  Local minimum occurs at 2.2122606942724694
x_old: 2.2122606942724694  Local minimum occurs at 2.2196486877629633
x_old: 2.2196486877629633  Local minimum occurs at 2.2256301304909205
x_old: 2.2256301304909205  Local minimum occurs at 2.2304587076907274
x_old: 2.2304587076907274  Local minimum occurs at 2.234347382687595
x_old: 2.234347382687595  Local minimum occurs at 2.2374730902946083
x_old: 2.2374730902946083  Local minimum occurs at 2.239981621916576
x_old: 2.239981621916576  Local minimum occurs at 2.2419923174775156
x_old: 2.2419923174775156  Local minimum occurs at 2.243602351591089
x_old: 2.243602351591089  Local minimum occurs at 2.2448905184851697
x_old: 2.2448905184851697  Local minimum occurs at 2.2459204946033684
x_old: 2.2459204946033684  Local minimum occurs at 2.2467436015363202
x_old: 2.2467436015363202  Local minimum occurs at 2.2474011148628947
x_old: 2.2474011148628947  Local minimum occurs at 2.2479261740485823
x_old: 2.2479261740485823  Local minimum occurs at 2.2483453500247714
x_old: 2.2483453500247714  Local minimum occurs at 2.2486799240099864
x_old: 2.2486799240099864  Local minimum occurs at 2.2489469258218673
x_old: 2.2489469258218673  Local minimum occurs at 2.2491599737759116
x_old: 2.2491599737759116  Local minimum occurs at 2.2493299520940697
x_old: 2.2493299520940697  Local minimum occurs at 2.249465555993498
x_old: 2.249465555993498  Local minimum occurs at 2.24957372949745
x_old: 2.24957372949745  Local minimum occurs at 2.249660016570137
x_old: 2.249660016570137  Local minimum occurs at 2.2497288424102844
x_old: 2.2497288424102844  Local minimum occurs at 2.24978373858824
x_old: 2.24978373858824  Local minimum occurs at 2.2498275231061067
x_old: 2.2498275231061067  Local minimum occurs at 2.249862444322635
x_old: 2.249862444322635  Local minimum occurs at 2.249890295941524
x_old: 2.249890295941524  Local minimum occurs at 2.2499125088471215
x_old: 2.2499125088471215  Local minimum occurs at 2.24993022442776
x_old: 2.24993022442776  Local minimum occurs at 2.249944353104799
x_old: 2.249944353104799  Local minimum occurs at 2.2499556210437
x_old: 2.2499556210437  Local minimum occurs at 2.2499646074278457

Here the idea is to identify the gradient at point $x$, we subtract a little fraction of the gradient, which moves us downhill towards the minimum. We continue with computing the gradient again at this point and walking down towards the minimum.

Gradient Descent

Minimization of an objective function $J(\theta)$ over the entire training corpus makes it necessary to compute gradients for all windows.

We would need to update for each element of $\theta$ to identify the derivatives of the objective function with respect to all the parameters:

$$ \theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial }{\partial \theta_j^{old}} J(\theta) $$

$\alpha$ is the step size in the Gradient Descent algorithm. We have some parameter values ($\theta_j^{old}$) and we identified the gradient (the $\frac{\partial}{\partial\theta_j^{old}}$ portion) at this position ($j$). We subtract a fraction of the gradient from the old parameters and get new ones. If this gives us a lower objection value, this takes us towards the minimum.

Matrix notation for all parameters:

$$ \theta^{new} = \theta^{old} - \alpha \frac{\partial}{\partial \theta^{old}} J(\theta) $$
$$ \theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta) $$

Generic Gradient Descent code with some stopping condition to be added to it:

In [ ]:
while True:
    theta_grad = evaluate_gradient(J, corpus, theta)
    theta = theta - alpha * theta_grad

The blue lines show the contour lines of the value of the objective function. Computing the Gradient we identify the direction of the steepest descent. Walking in small steps down this line of the steepest descent takes us to the minimum.

To achieve walking continuously down towards the minimum, $\alpha$ needs to be small enough so that one does not jump over the minimum to the other side.

The problem would be that, if we have a billion token corpus, this might include a lot of computations and optimizations. Computing the gradient of the objective function for a very large corpus will take very long for even the first gradient update.

Stochastic Gradient Descent

We compute the gradient and optimize at one position $t$ in the corpus ($t$ is the index of the center word).

$$ \theta^{new} = \theta^{old} - \alpha \nabla_\theta J_t(\theta) $$

In [ ]:
while True:
    window = sample_window(corpus)
    theta_grad = evaluate_gradient(J, window, theta)
    theta = theta - alpha * theta_grad

Example using Gensim

Gensim provides an API documentation for word2vec. There is a GitHub repository with the code for a high-performance similarity server using Gensim and word2vec. This short intro is based on the Rare Technologies tutorial.

The input for Gensim's word2vec has to be a list of sentences and sentences are a list of tokens. We will import gensim and train a model on some sample sentences:

In [16]:
import gensim

sentences = [['Tom', 'loves', 'pizza'], ['Peter', 'loves', 'fries']]

model = gensim.models.Word2Vec(sentences, min_count=1)

In a more efficient implementation that does not hold an entire corpus in RAM, one can specify a file-reader and process the input line by line, as shown by Radim Řehůřek in his tutorial:

In [ ]:
import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()
sentences = MySentences('examples') # load Gensim_example_1.txt from folder, a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)

The code above might not run in Jupyter Notebooks, due to restrictions over some modules. Copy the code into a Python file and run it in the command line.

Gensim's word2vec allows you to prune the internal dictionary. This can eliminate tokens that occur a minimum number of times. The min_count parameter provides this restriction.

Gensim offers an API for:

  • evaluation using standard data sets and formats (e.g. Google test set)
  • storing and loading of models
  • online or resuming of training

We can compute similarities using

In [ ]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
model.doesnt_match("breakfast cereal dinner lunch".split())
model.similarity('woman', 'man')

We can access a word vector directly:

In [ ]:

In [ ]: