(C) 2018 by Damir Cavar
Version: 1.1, November 2018
This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Spring and Fall 2018 at Indiana University. The material is based on Chris Manning's Lecture 2, Word Vector Representations: word2vec, and additional sources, with extended annotations and explanations.
Here we briefly discuss the methods necessary to understand the Word2Vec algorithm. We will use NumPy for the basic computations.
In [1]:
import numpy as np
We can create a one-hot vector that selects the 3rd row:
In [2]:
x = np.array([0, 0, 1, 0])
x
Out[2]:
Let us create a matrix $A$ of four rows:
In [3]:
A = np.array([[ 1,  2,  3,  4],
              [ 5,  6,  7,  8],
              [ 9, 10, 11, 12],
              [13, 14, 15, 16]])
A
Out[3]:
We can use the one-hot vector $x$ to select a row in matrix $A$:
In [4]:
x.dot(A)
Out[4]:
Let us simplify and recall that the dot-product of two vectors is the sum of the products of their components in each dimension: $$u \cdot v = \sum_{i=1}^{n} u_i v_i$$
The more similar two vectors $u$ and $v$ are in terms of directionality, the larger the dot-product is.
We could think of the dot-product $u \cdot v$ as projecting one vector onto the other. If the two vectors point in the same direction, the dot-product is largest; if one vector is orthogonal to the other, it is $0$; if the vectors point in opposite directions, the dot-product is negative.
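To make this concrete, here is a minimal NumPy check of the three cases; the example vectors are made up for illustration:
In [ ]:
u = np.array([1.0, 2.0])

same_dir   = np.array([2.0, 4.0])    # points in the same direction as u
orthogonal = np.array([-2.0, 1.0])   # perpendicular to u
opposite   = np.array([-1.0, -2.0])  # points in the opposite direction

print(u.dot(same_dir))    # large positive value: 10.0
print(u.dot(orthogonal))  # 0.0
print(u.dot(opposite))    # negative value: -5.0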
Assume that we have some data or results of mutually exclusive variables that represent scores for an observation being of class $C = [ c_1, c_2, c_3 ]$, as represented by the columns in the vector $y$ below:
In [5]:
y = np.array([4.0, 2.5, 1.1])
If we want to convert the scores into a probability distribution that represents how likely it is, on the basis of these scores, that the observation belongs to each of the classes, we can compute the softmax of the vector: $$\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}$$
The parameter $\theta$ allows us to scale the results so that the probabilities assigned to lower scalars in the vector are increased. The exponentiation of $y$ makes larger values much larger. If we include a parameter like $\theta$, we can scale this effect and increase the probabilities assigned to lower values. See the implementation of softmax below for more details.
In Python we can write this using NumPy's exp and sum functions. The axis parameter determines the axis along which the sum is performed (here the first axis):
In [6]:
def softmax1(y):
    return np.exp(y) / np.sum(np.exp(y), axis=0)
In [7]:
softmax1([4.0, 4.0, 2.0])
Out[7]:
We can provide a parameter $\theta$ to the function, to be able to scale the probability for low values up. The larger $\theta$, the higher the probability assigned to lower values. We set the default for $\theta$ to $1.0$ in the softmax definition:
In [8]:
def softmax(y, t=1.0):
    return np.exp(y / t) / np.sum(np.exp(y / t), axis=0)
For a vector of values $[ 4.0, 4.0, 2.0 ]$, we get the following probability distribution given a default $\theta$ of $1.0$:
In [9]:
softmax(np.array([4.0, 4.0, 2.0]))
Out[9]:
If we double $\theta$, the probability assigned to the third scalar increases significantly:
In [10]:
softmax(np.array([4.0, 4.0, 2.0]), 2.0)
Out[10]:
Let us first focus on the skip-gram model. For every word $w_t$ at position $t$, with some window size or radius $m$, we want to predict the context of $w_t$. Our objective function is to maximize the probability of any context word given the current center word: $$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} p(w_{t+j} \mid w_t; \theta)$$
Note that $\theta$ here is not the scaling parameter of the softmax above, but the vector of all parameters of the model; we will get back to it below. There is also a hidden hyper-parameter, the radius $m$.
We can reformulate this objective as the negative average of the log-likelihoods: $$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$
In the equation above we average over all words in the text by dividing by $T$. We negate the sums to get a positive score, since the log of a probability is always negative (or zero). Our goal thus becomes minimizing the loss function $J(\theta)$, that is, minimizing the negative log-likelihood.
The task can be described as follows: given a specific center word somewhere in the middle of a sentence and a randomly picked word within a radius of $m$ words to its left or right, the model outputs, for every word in the vocabulary, the probability of it being that context word.
The model is trained using word pairs extracted from a text. We go word by word through the text, select each word as the center word, and pick the context words from a window of size $m$ around it. For example, for $m=3$ (three words to the left and right of the center word), the extraction can be sketched as follows:
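This is a rough sketch of the extraction step; the sentence and the helper function below are made up for illustration, not taken from the lecture:
In [ ]:
def context_pairs(tokens, radius=3):
    # collect (center, context) pairs within the given radius around each position
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(max(0, t - radius), min(len(tokens), t + radius + 1)):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs

context_pairs("the quick brown fox jumps over the lazy dog".split())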
Our model will be based on the distributional properties of these word pairs, that is, the frequency of their co-occurrence. The first word in each pair is the center word and the second is one of its context words.
Assume that we have learned 300-dimensional feature vectors for center words and for context words respectively; that is, we have two independent vectors for each word: one representing its features when it is a context word, and one representing its features as a center word. This is our model $p(w_{t+j}|w_t)$ for a word at position $t$ in the text and the words at positions $t+j$ relative to $t$. For the purpose of explanation, we assume that we have such vector pairs of 300 features for 100,000 words.
In the following equation we will assume $c$ and $o$ to be indices in the space of the word types that we model. These indices now refer to the list of vocabulary, not to positions in the text, as $t$ does above.
We will compute the softmax over the dot-products of word vectors for a context word (index $o$ in the vocabulary) and a center word (index $c$ in the vocabulary), taking the dot-product of their 300-dimensional feature vectors. That is, one is the vector $u_o$ of a word in the context, and the other is the vector $v_c$ of the center word. The $u$ vectors are the context word vectors and the $v$ vectors are the center word vectors: $$p(o \mid c) = \frac{\exp(u_o \cdot v_c)}{\sum_{w=1}^{V} \exp(u_w \cdot v_c)}$$
We compute the dot-product between the one-hot vector representing the center word $w_t$ and the center word matrix $W$. $W$ acts as a lookup table for the word embeddings: multiplying the one-hot vector with $W$ looks up the representation of the center word (see Lecture 2, Word Vector Representations: word2vec, Stanford presentation by Chris Manning). As explained above, this results in picking one concrete center word feature vector from the matrix:
Take this center word matrix to be the hidden layer of a simple neural network.
We then take the dot-product of this center word vector with the matrix of the context vectors (the output word representations); for each word in the context (the $m$-sized radius) around the center word, this gives us a 100,000-dimensional vector of weights, one weight per vocabulary word, for being in the context of the center word:
Take this context vector matrix to be the output matrix.
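A toy forward pass through these two matrices could look as follows; the vocabulary size ($7$) and embedding dimension ($3$) are made up to keep the example readable, standing in for the 100,000 x 300 matrices described above:
In [ ]:
np.random.seed(1)

V, d = 7, 3               # toy vocabulary size and embedding dimension
W = np.random.rand(V, d)  # center word vectors (hidden layer), one row per vocabulary word
U = np.random.rand(V, d)  # context word vectors (output matrix), one row per vocabulary word

x = np.zeros(V)           # one-hot vector for the center word
x[2] = 1                  # pick the 3rd vocabulary word as the center word

v_c = x.dot(W)            # look up the center word vector (row 2 of W)
scores = U.dot(v_c)       # dot-product with every context vector: one score per vocabulary word
scores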
In the next step we use Softmax to compute the probability distribution of this vector:
Here once more the softmax function, applied to a vector of scores $u_o \cdot v_c$ (one score for each candidate context word $o$):
In [11]:
softmax(np.array([1.7, 0.3, 0.1, -0.7, -0.2, 0.1, 0.7]))
Out[11]:
The probabilities, rounded to two decimals, sum to $1.0$:
In [12]:
sum([0.44, 0.11, 0.09, 0.04, 0.07, 0.09, 0.16])
Out[12]:
We could now compare the resulting probabilities with some ground truth and compute, for example, the error. If the ground truth for the above result were the one-hot vector $[0, 0, 0, 0, 0, 0, 1]$, the model would have assigned a probability of $0.16$ to the correct word, and we could compute the loss from this probability.
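For example, with a cross-entropy loss (a common choice, used here as a sketch rather than taken from the lecture), the loss for a one-hot ground truth reduces to the negative log of the probability assigned to the correct word:
In [ ]:
p = softmax(np.array([1.7, 0.3, 0.1, -0.7, -0.2, 0.1, 0.7]))
truth = np.array([0, 0, 0, 0, 0, 0, 1])  # ground truth: the last word is the observed context word
-np.log(p[truth.argmax()])               # cross-entropy loss for a one-hot target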
How do we learn the parameters for each word that would maximize the prediction of a context word (or vice versa)?
Assume that all parameters of the model are collected in one long vector $\theta$. For $d$-dimensional word vectors and a vocabulary of $V$ words, this vector has length $2dV$: it contains the $d$-dimensional representation of every word as a center word and of every word as a context word: $$\theta \in \mathbb{R}^{2dV}$$
We repeat here the objective function, the negative log-likelihood that we want to minimize: $$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$
Our softmax function discussed above fits into this equation as the model for $p(w_{t+j} \mid w_t)$: $$p(o \mid c) = \frac{\exp(u_o \cdot v_c)}{\sum_{w=1}^{V} \exp(u_w \cdot v_c)}$$
Our goal is to minimize the loss function and maximize the likelihood that we predict the right word in the context for any given center word.
Changing the parameters can be achieved using the gradient. We want to minimize the negative log of this probability, so we compute its partial derivative with respect to the center vector $v_c$: $$\frac{\partial}{\partial v_c} \log \frac{\exp(u_o \cdot v_c)}{\sum_{w=1}^{V} \exp(u_w \cdot v_c)}$$
Note that a partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant; here that variable is $v_c$.
The log of a division can be converted into a subtraction: $$\frac{\partial}{\partial v_c} \left[ \log \exp(u_o \cdot v_c) - \log \sum_{w=1}^{V} \exp(u_w \cdot v_c) \right]$$
We can simplify the first part of the subtraction, since $\log$ and $\exp$ cancel each other out, and the partial derivative of $u_o \cdot v_c$ with respect to $v_c$ is simply $u_o$: $$\frac{\partial}{\partial v_c} (u_o \cdot v_c) = u_o$$
For the second part of the subtraction above, we get: $$\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w \cdot v_c)$$
Using the chain rule we can simplify this equation as well. The chain rule expresses the derivative of the composition of two functions, that is mapping $x$ onto $f(g(x))$, in terms of derivatives of $f$ and $g$.
Let us first start with derivatives of simple functions. What is the derivative of a constant $5$ for example? It is $0$. Since the derivative measures the change of a function, and a constant does not change, the derivative of a constant is $0$. The derivative of a variable $x$ is $1$. The derivative of the product of a constant and a variable is the constant. The derivative of $x^n$ is $n \cdot x^{n-1}$.
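In symbols, these basic rules are:
$$\frac{d}{dx} 5 = 0, \qquad \frac{d}{dx} x = 1, \qquad \frac{d}{dx} (c \cdot x) = c, \qquad \frac{d}{dx} x^n = n \cdot x^{n-1}$$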
A differentiable function of a variable is a function whose derivative exists at each point in its domain. The graph of a differentiable function must have a (non-vertical) tangent line at each point in its domain, and cannot contain any breaks, bends, or cusps. The absolute value function $y = |x|$, for example, is not differentiable at $x = 0$.
A smooth function such as $y = x^2$, by contrast, is differentiable at every point in its domain.
The chain rule allows one to find the derivative of a composition of functions by knowing the derivatives of the elementary functions in the composition. Let $g$ be differentiable at $x$ and $f$ be differentiable at $g(x)$. Then, if $y = f(g(x))$ and $u = g(x)$: $$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
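For example, for $y = e^{3x^2}$ we can take $u = g(x) = 3x^2$ and $f(u) = e^u$, so that
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = e^{u} \cdot 6x = 6x\, e^{3x^2}$$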
...
For the full derivation, see Chris Manning's video...
The equation he derives is: $$\frac{\partial}{\partial v_c} \log p(o \mid c) = u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x$$
He labels the first part ($u_o$) the observation: this is the actual context word vector that we observed in the data. He labels the second part ($\sum_{x=1}^V p(x|c) u_x$) the expectation: this is the part that we want to tweak so that the loss function is minimized.
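A small NumPy sketch of this observed-minus-expected gradient, reusing the toy matrix U and center vector v_c from the forward-pass example above (again an illustration, not code from the lecture):
In [ ]:
o = 5                    # index of the observed context word in the toy vocabulary
p = softmax(U.dot(v_c))  # p(x|c) for every word x in the vocabulary
U[o] - p.dot(U)          # observation u_o minus expectation sum_x p(x|c) u_x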
For maximizing or minimizing the cost (or objective) function using gradient descent, consider this simple example: Find the local minimum of the function $f(x) = x^4 - 3 x^3 + 2$ with derivative $f'(x) = 4 x^3 - 9 x^2$.
Repeatedly subtracting a fraction of the gradient moves you towards the minimum:
In [13]:
x_old = 0
x_new = 6            # the algorithm starts at x = 6
eps = 0.01           # step size
precision = 0.00001  # stop once the update is smaller than this

def f_derivative(x):
    return 4 * x**3 - 9 * x**2

while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - eps * f_derivative(x_old)

print("x_old:", x_old, " Local minimum occurs at", x_new)
The idea is to compute the gradient at the current point $x$ and subtract a small fraction of it, which moves us downhill towards the minimum. We then recompute the gradient at the new point and keep walking downhill until the updates become smaller than the chosen precision.
Minimization of an objective function $J(\theta)$ over the entire training corpus makes it necessary to compute gradients for all windows.
We would need to compute the derivative of the objective function with respect to each element of $\theta$ and update that parameter accordingly: $$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta)$$
$\alpha$ is the step size in the Gradient Descent algorithm. We have some parameter values ($\theta_j^{old}$) and we identify the gradient (the $\frac{\partial}{\partial\theta_j^{old}}$ portion) at this position $j$. We subtract a fraction of the gradient from the old parameters and get new ones; if this gives us a lower objective value, it takes us towards the minimum.
In matrix notation for all parameters: $$\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)$$
Generic Gradient Descent code with some stopping condition to be added to it:
In [ ]:
while True:
    theta_grad = evaluate_gradient(J, corpus, theta)
    theta = theta - alpha * theta_grad
In the usual visualization, contour lines show the value of the objective function. Computing the gradient, we identify the direction of steepest descent (the negative gradient); walking in small steps along this direction takes us to the minimum.
To keep walking steadily down towards the minimum, $\alpha$ needs to be small enough that we do not jump over the minimum to the other side.
The problem is that for a corpus of, say, a billion tokens this involves a very large number of computations; computing the gradient of the objective function over the entire corpus would take very long, even for the first gradient update.
With Stochastic Gradient Descent we instead compute the gradient and update the parameters at a single sampled position $t$ in the corpus ($t$ is the index of the center word):
In [ ]:
while True:
    window = sample_window(corpus)
    theta_grad = evaluate_gradient(J, window, theta)
    theta = theta - alpha * theta_grad
Gensim provides API documentation for word2vec. There is a GitHub repository with the code for a high-performance similarity server using Gensim and word2vec. This short intro is based on the Rare Technologies tutorial.
The input for Gensim's word2vec has to be a list of sentences, where each sentence is a list of tokens. We will import gensim and train a model on some sample sentences:
In [16]:
import gensim
sentences = [['Tom', 'loves', 'pizza'], ['Peter', 'loves', 'fries']]
model = gensim.models.Word2Vec(sentences, min_count=1)
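A quick way to inspect the result; the attribute names below follow the Gensim 3.x API that was current at the time of writing and may differ in other versions:
In [ ]:
print(list(model.wv.vocab))     # the learned vocabulary
print(model.wv['pizza'].shape)  # each word is mapped to a 100-dimensional vector by default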
In a more efficient implementation that does not hold an entire corpus in RAM, one can specify a file-reader and process the input line by line, as shown by Radim Řehůřek in his tutorial:
In [ ]:
import os
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()
sentences = MySentences('examples') # load Gensim_example_1.txt from folder, a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
The code above might not run inside a Jupyter Notebook due to restrictions on some modules. Copy the code into a Python file and run it from the command line.
Gensim's word2vec allows you to prune the internal dictionary: tokens that occur fewer than a minimum number of times can be eliminated. The min_count parameter provides this threshold.
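As a toy illustration of this parameter with the sample sentences from above (only 'loves' occurs twice, so everything else is pruned):
In [ ]:
toy_sentences = [['Tom', 'loves', 'pizza'], ['Peter', 'loves', 'fries']]
model_pruned = gensim.models.Word2Vec(toy_sentences, min_count=2)
list(model_pruned.wv.vocab)  # only words occurring at least twice are kept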
Gensim offers an API for querying the trained model. We can compute similarities and related properties using:
In [ ]:
# note: these queries assume a model trained on a large corpus; the toy model above does not contain these words
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
model.wv.similarity('woman', 'man')
We can access a word vector directly:
In [ ]:
model.wv['loves']