**(C) 2018 by Damir Cavar**

**Version:** 1.1, November 2018

In [1]:

```python
import numpy as np
```

We can create a one-hot vector that selects the 3rd row:

In [2]:

```python
x = np.array([0, 0, 1, 0])
x
```

Out[2]:

```
array([0, 0, 1, 0])
```

Let us create a matrix $A$ of four rows:

In [3]:

```python
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12],
              [13, 14, 15, 16]])
A
```

Out[3]:

```
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])
```

We can use the one-hot vector $x$ to select a row of matrix $A$:

In [4]:

```python
x.dot(A)
```

Out[4]:

```
array([ 9, 10, 11, 12])
```

$$
u^T v = u \cdot v = \sum_{i=1}^n{u_i v_i}
$$

The more similar two vectors are in magnitude and *directionality*, the larger their dot-product is.
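A quick numpy check, with made-up vectors: the dot product is large and positive for vectors pointing in the same direction, and zero for orthogonal vectors.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v_aligned = np.array([2.0, 4.0, 6.0])      # same direction as u, larger magnitude
v_orthogonal = np.array([-2.0, 1.0, 0.0])  # perpendicular to u

# u . v = sum_i u_i * v_i
print(u.dot(v_aligned))     # 28.0 -- strong shared directionality
print(u.dot(v_orthogonal))  # 0.0  -- no shared directionality
```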

In [5]:

```python
y = np.array([4.0, 2.5, 1.1])
```

$$
p(C_n) = \frac{\exp(\theta \cdot X_n)}{\sum_{i=1}^N{\exp(\theta \cdot X_i)}}
$$

The fraction on the right-hand side is the *softmax* function, which we implement below using numpy's *exp* and *sum* functions. The *axis* parameter determines that the sum is performed over the first axis:

In [6]:

```python
def softmax1(y):
    return np.exp(y) / np.sum(np.exp(y), axis=0)
```

In [7]:

```python
softmax1([4.0, 4.0, 2.0])
```

Out[7]:

```
array([0.46831053, 0.46831053, 0.06337894])
```

We can add a temperature parameter $t$ to the *softmax* definition:

In [8]:

```python
def softmax(y, t=1.0):
    return np.exp(y / t) / np.sum(np.exp(y / t), axis=0)
```

In [9]:

```python
softmax(np.array([4.0, 4.0, 2.0]))
```

Out[9]:

```
array([0.46831053, 0.46831053, 0.06337894])
```

If we set the temperature $t = 2.0$, halving the scores, the distribution flattens and the probability assigned to the third scalar increases significantly:

In [10]:

```python
softmax(np.array([4.0, 4.0, 2.0]), 2.0)
```

Out[10]:

```
array([0.4223188, 0.4223188, 0.1553624])
```

In the skip-gram model, the *objective function* is to maximize the probability of any context word given the current center word:

$$J'(\theta) = \prod_{t=1}^T \prod_{\substack{-m \leq j \leq m\\j \neq 0}} P(w_{t+j} | w_t; \theta)$$

We can reformulate this product as the negative average of the log-likelihoods, which we then minimize:

$$
J(\theta) = - \frac{1}{T} \sum_{t=1}^T{ \sum_{\substack{-m \leq j \leq m\\ j \neq 0}} \log(P(w_{t+j}| w_t)) }
$$

The probability of a context (outside) word $o$ given the center word $c$ is defined as a softmax over the vocabulary:

$$
p(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V{\exp(u_w^T v_c)}}
$$
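A small numeric sketch of this formula, using made-up random vectors (toy vocabulary size $V = 4$, embedding dimension $d = 3$):

```python
import numpy as np

V, d = 4, 3                    # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # one context ("outside") vector u_w per row
v_c = rng.normal(size=d)       # center word vector

scores = U @ v_c                                     # u_w^T v_c for every word w
p_o_given_c = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
print(p_o_given_c.sum())                             # the probabilities sum to 1
```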

To obtain the center word vector $v_c$, we multiply the matrix of word embeddings $W$ (shown with its other columns abbreviated as $--$) by the one-hot vector of the center word; the multiplication selects the corresponding column:

$$
\begin{bmatrix}
-- & 0.2 & -- \\[0.3em]
-- & -1.4 & -- \\[0.3em]
-- & 0.3 & -- \\[0.3em]
-- & -0.1 & -- \\[0.3em]
-- & 0.1 & -- \\[0.3em]
-- & -0.3 & --
\end{bmatrix}
\begin{bmatrix}
0 \\[0.3em]
0 \\[0.3em]
0 \\[0.3em]
0 \\[0.3em]
1 \\[0.3em]
0 \\[0.3em]
0 \\[0.3em]
0
\end{bmatrix} =
\begin{bmatrix}
0.2 \\[0.3em]
-1.4 \\[0.3em]
0.3 \\[0.3em]
-0.1 \\[0.3em]
0.1 \\[0.3em]
-0.3
\end{bmatrix} = v_c
$$

Take this center word vector $v_c$ to be the **hidden layer** of a simple neural network.

$$
\begin{bmatrix}
-- & -- & -- \\[0.3em]
-- & -- & -- \\[0.3em]
-- & -- & -- \\[0.3em]
-- & -- & -- \\[0.3em]
-- & -- & -- \\[0.3em]
-- & -- & --
\end{bmatrix} \cdot v_c = u_o \cdot v_c
$$

Take this context vector matrix to be the **output matrix**.

In the next step we use *Softmax* to compute the probability distribution of this vector:

$$\mbox{Softmax}(u_o \cdot v_c) = \mbox{Softmax}(
\begin{bmatrix}
1.7 \\[0.3em]
0.3 \\[0.3em]
0.1 \\[0.3em]
-0.7 \\[0.3em]
-0.2 \\[0.3em]
0.1 \\[0.3em]
0.7
\end{bmatrix}) = \begin{bmatrix}
0.44 \\[0.3em]
0.11 \\[0.3em]
0.09 \\[0.3em]
0.04 \\[0.3em]
0.07 \\[0.3em]
0.09 \\[0.3em]
0.16
\end{bmatrix}
$$

Here, once more, is the softmax function applied to the score vector $u_o \cdot v_c$:

In [11]:

```python
softmax(np.array([1.7, 0.3, 0.1, -0.7, -0.2, 0.1, 0.7]))
# rounded output: [0.44, 0.11, 0.09, 0.04, 0.07, 0.09, 0.16]
```

The rounded probabilities sum to $1.0$:

In [12]:

```python
sum([0.44, 0.11, 0.09, 0.04, 0.07, 0.09, 0.16])  # 1.0, up to floating-point rounding
```

The training target is the one-hot vector of the context word actually observed in the data:

$$\begin{bmatrix}
0 \\[0.3em]
0 \\[0.3em]
0 \\[0.3em]
0 \\[0.3em]
0 \\[0.3em]
0 \\[0.3em]
1
\end{bmatrix}$$
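Comparing the predicted distribution with this one-hot target yields the cross-entropy loss, which for a one-hot target reduces to the negative log-probability of the true context word. A sketch using the rounded softmax values from above:

```python
import numpy as np

pred = np.array([0.44, 0.11, 0.09, 0.04, 0.07, 0.09, 0.16])  # rounded softmax output
target = np.array([0., 0., 0., 0., 0., 0., 1.])              # one-hot true context word

# cross-entropy with a one-hot target picks out -log(p) of the true word
loss = -np.sum(target * np.log(pred))
print(loss)  # -log(0.16), roughly 1.83
```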

All parameters of the model, the center word vectors $v$ and the context word vectors $u$ for every word of the vocabulary of size $V$, with $d$ dimensions each, are collected in a single parameter vector $\theta$:

$$
\theta = \begin{bmatrix}
v_{a} \\[0.3em]
v_{ant} \\[0.3em]
\vdots \\[0.3em]
v_{zero} \\[0.3em]
u_{a} \\[0.3em]
u_{ant} \\[0.3em]
\vdots \\[0.3em]
u_{zero}
\end{bmatrix} \in \mathbb{R}^{2dV}
$$

We repeat the objective function, in which we want to minimize the negative log-likelihood. Our softmax function discussed above provides the probability $p(o|c)$ that fits into this equation:

$$
p(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V{\exp(u_w^T v_c)}}
$$

$$
J(\theta) = - \frac{1}{T} \sum_{t=1}^T{ \sum_{\substack{-m \leq j \leq m\\ j \neq 0}} \log(p(o|c)) }
$$

To minimize $J(\theta)$ with gradient descent, we need the partial derivative of the log-probability with respect to the center vector $v_c$:

$$
\frac{\partial }{\partial v_c} \log\left(\frac{\exp(u_o^T v_c)}{\sum_{w=1}^V{\exp(u_w^T v_c)}}\right)
$$

The log of a division can be converted into a subtraction:

$$
\frac{\partial }{\partial v_c} \left[ \log(\exp(u_o^T v_c)) - \log\left(\sum_{w=1}^V{\exp(u_w^T v_c)}\right) \right]
$$

For the first part of the subtraction, $\log$ and $\exp$ cancel:

$$
\frac{\partial }{\partial v_c} \ \log(\exp(u_o^T v_c)) = \frac{\partial }{\partial v_c} \ u_o^T v_c = u_o
$$

For the second part of the subtraction above, we get:

$$
\frac{\partial }{\partial v_c} \ \log\left(\sum_{w=1}^V{\exp(u_w^T v_c)}\right)
$$

To differentiate this composite function, we apply the chain rule:

$$\frac{dy}{dx}=\frac{dy}{du}\cdot\frac{du}{dx}$$

...
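Filling in the elided step: applying the chain rule with the outer function $\log(u)$ and the inner function $u = \sum_{w=1}^V \exp(u_w^T v_c)$ gives

$$
\frac{\partial}{\partial v_c} \log\left(\sum_{w=1}^V \exp(u_w^T v_c)\right)
= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \sum_{x=1}^V \exp(u_x^T v_c)\, u_x
= \sum_{x=1}^V p(x|c)\, u_x
$$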

See Chris Manning's lecture video for the full derivation.

The equation he derives is:

$$
u_o - \sum_{x=1}^V p(x|c) u_x
$$

Manning labels the first part ($u_o$) as the *observation*: the context word vector that we actually identified in the texts or data. He labels the second part ($\sum_{x=1}^V p(x|c) u_x$) as the *expectation*: the model's expected context vector, which we tweak such that the loss function is minimized.
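We can sanity-check this result numerically with made-up vectors: the analytic gradient $u_o - \sum_{x=1}^V p(x|c)\, u_x$ should match a finite-difference approximation of $\frac{\partial}{\partial v_c} \log p(o|c)$.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 5, 3
U = rng.normal(size=(V, d))   # context vectors u_w, one per row
v_c = rng.normal(size=d)      # center word vector
o = 2                         # index of the observed context word

def log_p(v):
    """log p(o|c) as a function of the center vector."""
    s = U @ v
    return s[o] - np.log(np.exp(s).sum())

p = np.exp(U @ v_c)
p /= p.sum()
analytic = U[o] - p @ U       # observation minus expectation

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(log_p(v_c + eps * np.eye(d)[i]) -
                     log_p(v_c - eps * np.eye(d)[i])) / (2 * eps)
                    for i in range(d)])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```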

Subtracting a fraction of the gradient moves you toward the minimum. For example, the function $f(x) = x^4 - 3x^3$ has the derivative $f'(x) = 4x^3 - 9x^2$ and a local minimum at $x = 2.25$:

In [13]:

```python
x_old = 0
x_new = 6
eps = 0.01           # step size
precision = 0.00001  # stopping threshold
def f_derivative(x):
    return 4 * x**3 - 9 * x**2
while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - eps * f_derivative(x_old)
print("Local minimum occurs at", x_new)  # approximately 2.25
```

$$
\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial }{\partial \theta_j^{old}} J(\theta)
$$

Matrix notation for all parameters:

$$
\theta^{new} = \theta^{old} - \alpha \frac{\partial}{\partial \theta^{old}} J(\theta)
$$

$$
\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)
$$

Generic gradient descent code, to which a stopping condition still needs to be added:

In [ ]:

```python
while True:
    theta_grad = evaluate_gradient(J, corpus, theta)
    theta = theta - alpha * theta_grad
```

With *stochastic* gradient descent (SGD) we update the parameters after each sampled window $t$:

$$
\theta^{new} = \theta^{old} - \alpha \nabla_\theta J_t(\theta)
$$

In [ ]:

```python
while True:
    window = sample_window(corpus)
    theta_grad = evaluate_gradient(J, window, theta)
    theta = theta - alpha * theta_grad
```
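As a concrete (hypothetical) illustration of the same update rule, here is stochastic gradient descent fitting a two-parameter least-squares model, sampling one data point per step instead of using the whole data set:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))                     # toy inputs
true_theta = np.array([3.0, -1.5])
y = X @ true_theta + 0.01 * rng.normal(size=100)  # toy targets with a little noise

theta = np.zeros(2)
alpha = 0.1                              # learning rate
for step in range(2000):
    i = rng.integers(len(X))             # sample one data point (the "window")
    grad = (X[i] @ theta - y[i]) * X[i]  # gradient of the per-sample squared error
    theta = theta - alpha * grad
print(theta)  # close to [3.0, -1.5]
```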

The following examples follow the *Rare Technologies* gensim tutorial. We import *gensim* and train a model on some sample sentences:

In [16]:

```python
import gensim
sentences = [['Tom', 'loves', 'pizza'], ['Peter', 'loves', 'fries']]
model = gensim.models.Word2Vec(sentences, min_count=1)
```

In [ ]:

```python
import os

class MySentences(object):
    """Memory-friendly iterator over all files in a folder."""
    def __init__(self, dirname):
        self.dirname = dirname
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('examples')  # e.g. loads Gensim_example_1.txt from the folder
model = gensim.models.Word2Vec(sentences)
```

Words that occur less frequently than a given threshold can be excluded from training; the *min_count* parameter provides this restriction.

Gensim offers an API for:

- evaluation using standard data sets and formats (e.g. Google test set)
- storing and loading of models
- online or resuming of training

We can compute similarities and analogies using the trained model:

In [ ]:

```python
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
model.wv.similarity('woman', 'man')
```

We can access a word vector directly:

In [ ]:

```python
model.wv['loves']
```
