This article gives a detailed explanation of the weight-vector updates in the word2vec C implementation, and points out some procedures used to train the skip-gram model in the code that are not correct.
Word2vec is a well-known word-embedding approach; see Word2vec for details.
(0)
$$ \sigma' (y) = \sigma (y) \cdot [ 1 - \sigma (y) ] \cdot y' $$This is the standard derivative of the sigmoid combined with the chain rule.
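A short derivation: the sigmoid is $\sigma(y) = 1 / (1 + e^{-y})$, so
$$ \frac{d \sigma}{d y} = \frac{e^{-y}}{(1+e^{-y})^2} = \frac{1}{1+e^{-y}} \cdot \left[ 1 - \frac{1}{1+e^{-y}} \right] = \sigma (y) \cdot [ 1 - \sigma (y) ] $$and when $y$ is itself a function of the variable we differentiate with respect to, the chain rule contributes the extra factor $y'$.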
In [70]:
%pylab inline --no-import-all
import numpy as np
# compare sigmoid(mean(x)) with mean(sigmoid(x)) over random data
x = np.random.randn(1000, 100)
m = np.mean(x, axis=1)
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
m_s = sigmoid(m)                     # sigmoid of the row means
m_x_s = np.mean(sigmoid(x), axis=1)  # mean of the sigmoids
plt.scatter(m_x_s, m_s)
plt.show()
plt.plot(m_x_s, m_s)
plt.show()
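As a quick numeric check of equation (0), in the same spirit as the cell above, the following cell (an addition for this write-up, not part of the original notebook) compares the analytic form $\sigma(y) \cdot [1 - \sigma(y)]$ with a central finite difference:
import numpy as np
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
y = np.linspace(-6, 6, 101)
eps = 1e-5
numeric = (sigmoid(y + eps) - sigmoid(y - eps)) / (2 * eps)  # finite-difference derivative
analytic = sigmoid(y) * (1 - sigmoid(y))                     # equation (0) with y' = 1
print(np.max(np.abs(numeric - analytic)))                    # should be tiny, around 1e-10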
(1) $$ p(w | w_I) = \prod_{j=1}^{L(w)-1} \sigma \left( [\![ n(w, j+1) = ch(n(w, j)) ]\!] \cdot {v'_{n(w, j)}}^T v_{w_I} \right) $$
First let
$$ hs_j = [\![ n(w, j+1) = ch(n(w, j)) ]\!] \cdot {v'_{n(w, j)}}^T v_{w_I} $$And
$$ \frac {\partial hs_j} {\partial v'_{n(w, j)}} = [\![ n(w, j+1) = ch(n(w, j)) ]\!] \cdot v_{w_I} $$$$ \frac {\partial hs_j} {\partial v_{w_I}} = [\![ n(w, j+1) = ch(n(w, j)) ]\!] \cdot v'_{n(w, j)} $$Then $$ \nabla \log p(w | w_I) = \sum_{j=1}^{L(w)-1} \frac{1} {\sigma (hs_j)} \cdot \sigma' (hs_j) \cdot hs'_j $$
According to equation (0) $$ \nabla \log p(w | w_I) = \sum_{j=1}^{L(w)-1} \frac{1} {\sigma (hs_j)} \cdot \sigma (hs_j) \cdot [ 1 - \sigma (hs_j) ] \cdot hs'_j $$
simplify $$ \nabla \log p(w | w_I) = \sum_{j=1}^{L(w)-1} [ 1 - \sigma (hs_j) ] \cdot hs'_j $$
When updating $ v_{w_I} $, the gradient is:
$$ \nabla \log p(w | w_I) = \sum_{j=1}^{L(w)-1} [ 1 - \sigma (hs_j) ] \cdot [\![ n(w, j+1) = ch(n(w, j)) ]\!] \cdot v'_{n(w, j)} $$When updating $ v'_{n(w, j)}$ for each $j \in [1, L(w)-1]$, the gradient is:
$$ \nabla \log p(w | w_I) = [ 1 - \sigma (hs_j) ] \cdot [\![ n(w, j+1) = ch(n(w, j)) ]\!] \cdot v_{w_I} $$
(2) $$ \log p(w_O | w_I) = \log \sigma( {v'_{w_O}}^T v_{w_I} ) + \sum_{i}^{k} E_{w_i \sim P_n(w)} \left[ \log \sigma( {-v'_{w_i}}^T v_{w_I} ) \right] $$
First let
$$ ns_{xI} = {v'_{w_x}}^T \cdot v_{w_I} $$$w_x \in \{ w_O, w_i \sim P_n(w) \}$
And
the gradients of $ns_{xI}$ with respect to ${v'_{w_x}}$ and $v_{w_I}$ are:
$$ \frac {\partial ns_{xI}} {\partial v'_{w_x}} = v_{w_I} $$$$ \frac {\partial ns_{xI}} {\partial v_{w_I}} = v'_{w_x} $$Then
$$ \nabla \log p(w_O | w_I) = \frac {1} { \sigma(ns_{OI}) } \cdot \sigma'(ns_{OI}) \cdot ns'_{OI} + \sum_{i}^{k} E_{w_i \sim P_n(w)} \left[ \frac {1} { \sigma(-ns_{iI}) } \cdot \sigma'( -ns_{iI} ) \cdot -ns'_{iI} \right] $$According to equation (0)
$$ \nabla \log p(w_O | w_I) = \frac {1} { \sigma(ns_{OI}) } \cdot \sigma(ns_{OI}) \cdot [ 1 - \sigma(ns_{OI}) ] \cdot ns'_{OI} + \sum_{i}^{k} E_{w_i \sim P_n(w)} \left[ \frac {1} { \sigma(-ns_{iI}) } \cdot \sigma ( -ns_{iI} ) \cdot [ 1 - \sigma(-ns_{iI}) ] \cdot -ns'_{iI} \right] $$simplify
$$ \nabla \log p(w_O | w_I) = [ 1 - \sigma(ns_{OI}) ] \cdot ns'_{OI} + \sum_{i}^{k} E_{w_i \sim P_n(w)} \left[ [ 1 - \sigma(-ns_{iI}) ] \cdot -ns'_{iI} \right] $$When updating $ v'_{w_O} $, the gradient is:
$$ \nabla \log p(w_O | w_I) = [ 1 - \sigma(ns_{OI}) ] \cdot v_{w_I} $$When updating $ v_{w_I} $, the gradient is:
$$ \nabla \log p(w_O | w_I) = [ 1 - \sigma(ns_{OI}) ] \cdot v'_{w_O} + \sum_{i}^{k} E_{w_i \sim P_n(w)} \left[ [ 1 - \sigma(-ns_{iI}) ] \cdot -v'_{w_i} \right] $$When updating $ v'_{w_i},\, w_i \sim P_n(w) $, the gradient is:
$$ \nabla \log p(w_O | w_I) = [ 1 - \sigma(-ns_{iI}) ] \cdot -v_{w_I} $$
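A minimal numpy sketch of these negative-sampling updates for a single pair $(w_I, w_O)$ with $k$ sampled negatives. The names (syn0, syn1neg, alpha) deliberately mirror the C code, but everything here is illustrative, written directly from the gradients above rather than taken from the original source:
import numpy as np
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
def sgns_update(syn0, syn1neg, w_I, w_O, negatives, alpha):
    # syn0[w] is v_w, syn1neg[w] is v'_w
    v_in = syn0[w_I]
    grad_in = np.zeros_like(v_in)                    # gradient accumulated for v_{w_I}
    # positive target w_O: coefficient is 1 - sigma(ns_OI)
    g = 1.0 - sigmoid(np.dot(syn1neg[w_O], v_in))
    grad_in += g * syn1neg[w_O]
    syn1neg[w_O] += alpha * g * v_in                 # update v'_{w_O}
    # negative samples w_i: coefficient is -[1 - sigma(-ns_iI)] = -sigma(ns_iI)
    for w_i in negatives:
        g = -sigmoid(np.dot(syn1neg[w_i], v_in))
        grad_in += g * syn1neg[w_i]
        syn1neg[w_i] += alpha * g * v_in             # update v'_{w_i}
    syn0[w_I] += alpha * grad_in                     # update v_{w_I}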
In [8]:
from IPython.display import Image
Image("./cbow.png")
Out[8]:
(3) $$ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq{j} \leq{c},j\neq{0}} \log{p(w_t | w_{t+j})} $$
I change the index $j$ to $q$ because it conflicts with the index $j$ used in hierarchical softmax.
(3) $$ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} $$
And
$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = \sum_{-c \leq{q} \leq{c},q\neq{0}} \nabla \log{p(w_t | w_{t+q})} $$Let $$ v_{mcw} = \frac{1} {cw} \sum_{-c \leq{q} \leq{c},q\neq{0}} v_{w_{t+q}} $$
// window word count
cw = 0;
for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
// sentence_position is sp
// c in [sp - (window - b), sp + (window - b)]
c = sentence_position - window + a;
if (c < 0) continue;
// could be 'break' instead of 'continue',
// because c only increases
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
// sum window words' vectors
for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size];
cw++;
}
if (cw) {
for (c = 0; c < layer1_size; c++) neu1[c] /= cw;
// ...
"neu1"
is $v_{mcw}$
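For reference, the averaging loop corresponds to a one-liner in numpy (syn0 and context_ids are illustrative names, not the original code):
import numpy as np
def context_mean(syn0, context_ids):
    # syn0: (vocab_size, layer1_size) input vectors; context_ids: window word indices
    # returns v_mcw, which is what neu1 holds after the loop above
    return np.mean(syn0[context_ids], axis=0)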
$w_t$ is $w$, $w_{t+q}$ is $w_I$ in equation (1).
And
$$ hs_j = [\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!] \cdot {v'_{n(w_t, j)}}^T v_{w_{t+q}} $$$q \in [-c, c], q\neq{0}$
Then
When updating $ v_{w_{t+q}}$ for each $q \in [-c, c], q\neq{0}$, the gradient is:
$$ \nabla \log p(w_t | w_{t+q}) = \sum_{j=1}^{L(w_t)-1} [ 1 - \sigma (hs_j) ] \cdot [\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!] \cdot v'_{n(w_t, j)} $$$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = \nabla \log p(w_t | w_{t+q}) $$When updating $ v'_{n(w_t, j)}$ for each $j \in [1, L(w_t)-1]$, the gradient is:
$$ \nabla \log p(w_t | w_{t+q}) = [ 1 - \sigma (hs_j) ] \cdot [\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!] \cdot v_{w_{t+q}} $$$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = \sum_{-c \leq{q} \leq{c},q\neq{0}} [ 1 - \sigma (hs_j) ] \cdot [\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!] \cdot v_{w_{t+q}} $$
if (hs) for (d = 0; d < vocab[word].codelen; d++) {
f = 0;
l2 = vocab[word].point[d] * layer1_size;
// l2 is the offset of inner node n(w_t, d) in syn1; d plays the role of j in the equations
// Propagate hidden -> output
for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2];
// f is +- hs_j
if (f <= -MAX_EXP) continue;
else if (f >= MAX_EXP) continue;
// approximation using the precomputed expTable
// maybe rounding would be better than truncation
else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
// 'g' is the gradient multiplied by the learning rate
// why subtract code[d]? another sigmoid property:
// \sigma(x) = 1 - \sigma(-x), and code[d] is 0 or 1
g = (1 - vocab[word].code[d] - f) * alpha;
// Propagate errors output -> hidden
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
}
From the code we can see that the author uses $v_{mcw}$ in place of $v_{w_{t+q}}$:
$$ hs_j = [\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!] \cdot {v'_{n(w_t, j)}}^T v_{mcw} $$
"syn1[l2]" is $v'_{n(w_t, j)}$, so "f" in the code above, before the "expTable" lookup, is $\pm hs_j$.
And $[ 1 - \sigma (hs_j) ] \cdot [\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!]$ is "1 - vocab[word].code[d] - f" after the "expTable" lookup, because of the identity $\sigma(x) = 1 - \sigma(-x)$: "vocab[word].code[d] == 0" corresponds to $[\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!] = +1$, and "vocab[word].code[d] == 1" corresponds to $[\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!] = -1$.
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
calculates the gradient of $ v_{w_{t+q}} $ for each $q \in [-c, c], q\neq{0}$. After the loop, "neu1e" is the accumulated update vector for each $v_{w_{t+q}}$.
And the gradient of $ v'_{n(w_t, j)}$ for each $j \in [1, L(w_t)-1]$ is calculated as:
$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = cw \cdot [ 1 - \sigma (hs_j) ] \cdot [\![ n(w_t, j+1) = ch(n(w_t, j)) ]\!] \cdot v_{mcw} $$The code does not multiply by "cw".
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
updates the $v'_{n(w_t, j)}$ weight vector.
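Putting the pieces together, here is a numpy sketch of this hierarchical-softmax CBOW step. It paraphrases the C code above (point and code stand for vocab[word].point and vocab[word].code), and it omits the MAX_EXP clipping and the expTable lookup, so it is a sketch rather than a faithful port:
import numpy as np
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
def cbow_hs_step(syn0, syn1, context_ids, point, code, alpha):
    # syn0: input vectors; syn1: inner-node vectors v'_{n(w_t, j)}
    # point[j]: index of tree node n(w_t, j); code[j]: 0/1 Huffman code bit
    neu1 = np.mean(syn0[context_ids], axis=0)       # v_mcw
    neu1e = np.zeros_like(neu1)                     # accumulated update for the context words
    for j in range(len(code)):
        f = sigmoid(np.dot(neu1, syn1[point[j]]))   # sigma(+-hs_j)
        g = (1 - code[j] - f) * alpha               # gradient times the learning rate
        neu1e += g * syn1[point[j]]                 # gradient w.r.t. the context words
        syn1[point[j]] += g * neu1                  # update v'_{n(w_t, j)}
    for c in context_ids:
        syn0[c] += neu1e                            # every v_{w_{t+q}} gets the same update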
$w_t$ is $w_O$, $w_{t+q}$ is $w_I$ in equation (2).
And
$$ ns_{x(t+q)} = {v'_{w_x}}^T \cdot v_{w_{t+q}} $$$w_x \in \{ w_t, w_i \sim P_n(w) \}$, $q \in [-c, c], q\neq{0}$
Then
When updating $ v'_{w_t} $, the gradient is:
$$ \nabla \log p(w_t | w_{t+q}) = [ 1 - \sigma(ns_{t(t+q)}) ] \cdot v_{w_{t+q}} $$$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = \sum_{-c \leq{q} \leq{c},q\neq{0}} [ 1 - \sigma(ns_{t(t+q)}) ] \cdot v_{w_{t+q}} $$When updating $ v'_{w_i},\, w_i \sim P_n(w) $, the gradient is:
$$ \nabla \log p(w_t | w_{t+q}) = [ 1 - \sigma(-ns_{i(t+q)}) ] \cdot -v_{w_{t+q}} $$$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = \sum_{-c \leq{q} \leq{c},q\neq{0}} [ 1 - \sigma(-ns_{i(t+q)}) ] \cdot -v_{w_{t+q}} $$When updating $ v_{w_{t+q}} $ for each $q \in [-c, c], q\neq{0}$, the gradient is:
$$ \nabla \log p(w_t | w_{t+q}) = [ 1 - \sigma(ns_{t(t+q)}) ] \cdot v'_{w_t} + \sum_{i}^{k} E_{w_i \sim P_n(w)} \left[ [ 1 - \sigma(-ns_{i(t+q)}) ] \cdot -v'_{w_i} \right] $$$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = \nabla \log p(w_t | w_{t+q}) $$
if (negative > 0) for (d = 0; d < negative + 1; d++) {
// target is w_O, w_i
if (d == 0) {
target = word;
label = 1;
} else { // sample a negative target word from the unigram table
next_random = next_random * (unsigned long long)25214903917 + 11;
target = table[(next_random >> 16) % table_size];
if (target == 0) target = next_random % (vocab_size - 1) + 1;
if (target == word) continue;
label = 0;
}
l2 = target * layer1_size;
f = 0;
// calculate ns
// syn1neg[l2] is v', neu1 is v_mcw
for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2];
// calculate the gradient
if (f > MAX_EXP) g = (label - 1) * alpha;
else if (f < -MAX_EXP) g = (label - 0) * alpha;
// \sigma(x) = 1 - \sigma(-x), and 0 label for -ns
else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
// neu1e accumulates the update vector shared by every w_{t+q}
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
// update v'
for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c];
}
The code again uses $v_{mcw}$ in place of $v_{w_{t+q}}$:
$$ ns_{x(t+q)} = {v'_{w_x}}^T \cdot v_{mcw} $$$w_x \in \{ w_t, w_i \sim P_n(w) \}$, $q \in [-c, c], q\neq{0}$
The gradient of ${v'_{w_x}},\, w_x \in \{ w_t, w_i \sim P_n(w) \}$ is calculated as:
$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = cw \cdot [ 1 - \sigma(\pm ns_{x(mcw)} ) ] \cdot \pm v_{mcw} $$The code does not multiply by "cw".
Because of the identity $ \sigma(x) = 1 - \sigma(-x)$,
the gradient of ${v'_{w_x}},\, w_x \in \{ w_t, w_i \sim P_n(w) \}$ is calculated as:
$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log{p(w_t | w_{t+q})} = \begin{cases} [ 1 - \sigma(ns_{x(mcw)} ) ] \cdot v_{mcw} & \quad \text{if } w_x = w_t \text{ (positive target)} \\ - \sigma( ns_{x(mcw)} ) \cdot v_{mcw} & \quad \text{if } w_x = w_i \text{ (negative sample)} \\ \end{cases} $$The code above uses "label = 1" to calculate the gradient of $v'_{w_t}$ and "label = 0" to calculate the gradient of $v'_{w_i}$.
The gradient of $ v_{w_{t+q}} $ is calculated as:
$$ \nabla \sum_{-c \leq{q} \leq{c},q\neq{0}} \log p(w_t | w_{t+q}) = [ 1 - \sigma(ns_{t(mcw)} ) ] \cdot v_{mcw} + \sum_{i}^{k} E_{w_i \sim P_n(w)} \left[ - \sigma( ns_{i(mcw)} ) \cdot v_{mcw} \right] $$
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
calculates the gradient of $ v_{w_{t+q}} $ for each $q \in [-c, c], q\neq{0}$. After the loop, "neu1e" is the accumulated update vector for each $v_{w_{t+q}}$.
// hidden -> in
for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
c = sentence_position - window + a;
if (c < 0) continue;
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c];
}
From the code above we can see that when updating each $v_{w_{t+q}}$, the same accumulated vector "neu1e" is added to every context word vector.
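For comparison, a numpy sketch of the negative-sampling CBOW step just described. The negatives argument stands in for the unigram-table sampling, and the MAX_EXP clipping and the target == word check are omitted, so this is only an illustration of the update rule, not a faithful port:
import numpy as np
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
def cbow_ns_step(syn0, syn1neg, context_ids, word, negatives, alpha):
    # word: index of w_t; negatives: k sampled indices w_i ~ P_n(w)
    neu1 = np.mean(syn0[context_ids], axis=0)       # v_mcw
    neu1e = np.zeros_like(neu1)
    for target, label in [(word, 1)] + [(w_i, 0) for w_i in negatives]:
        f = sigmoid(np.dot(neu1, syn1neg[target]))
        g = (label - f) * alpha                     # (1 - sigma) for w_t, (-sigma) for w_i
        neu1e += g * syn1neg[target]                # accumulate the update for the context words
        syn1neg[target] += g * neu1                 # update v'_{w_t} or v'_{w_i}
    for c in context_ids:
        syn0[c] += neu1e                            # the same neu1e is added to every v_{w_{t+q}}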
In [10]:
from IPython.display import Image
Image("./skip-gram.png")
Out[10]:
(4) $$ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq{j} \leq{c},j\neq{0}} \log{p(w_{t+j} | w_t)} $$
For the same reason, I change the index $j$ to $q$ because it conflicts with the index $j$ used in hierarchical softmax.
(4) $$ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq{q} \leq{c}, q\neq{0}} \log{p(w_{t+q} | w_t)} $$
The main difference in the code between skip-gram and continuous bag of words is that when training skip-gram, the weight vectors are updated separately for every window word.
Actually, the update equations in the code exchange $w_{t+q}$ with $w_t$, and do not use $\sum_{-c \leq{q} \leq{c}, q\neq{0}} \log{p(w_{t+q} | w_t)}$ as the objective.
$w_{t+q}$ is $w$, $w_{t}$ is $w_I$ in equation (1).
And
$$ hs_j = [\![ n(w_{t+q}, j+1) = ch(n(w_{t+q}, j)) ]\!] \cdot {v'_{n(w_{t+q}, j)}}^T v_{w_t} $$$q \in [-c, c], q\neq{0}$
Then
When updating $ v_{w_t}$, the gradient is:
$$ \nabla \log p(w_{t+q} | w_t) = \sum_{j=1}^{L(w_{t+q})-1} [ 1 - \sigma (hs_j) ] \cdot [\![ n(w_{t+q}, j+1) = ch(n(w_{t+q}, j)) ]\!] \cdot v'_{n(w_{t+q}, j)} $$When updating $ v'_{n(w_{t+q}, j)}$ for each $j \in [1, L(w_{t+q})-1]$, the gradient is:
$$ \nabla \log p(w_{t+q} | w_t) = [ 1 - \sigma (hs_j) ] \cdot [\![ n(w_{t+q}, j+1) = ch(n(w_{t+q}, j)) ]\!] \cdot v_{w_t} $$$q \in [-c, c], q\neq{0}$
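Before looking at the C code, here is a numpy sketch of the update that these skip-gram gradients prescribe: the Huffman path belongs to the output word $w_{t+q}$, the input vector is $v_{w_t}$, and $v_{w_t}$ receives the accumulated update. The names are illustrative; this reflects the derivation above, not the C code that follows:
import numpy as np
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
def skipgram_hs_step(syn0, syn1, w_t, out_point, out_code, alpha):
    # out_point / out_code: Huffman path of the context word w_{t+q}
    v_in = syn0[w_t]                                # v_{w_t}
    neu1e = np.zeros_like(v_in)                     # accumulated gradient for v_{w_t}
    for j in range(len(out_code)):
        f = sigmoid(np.dot(v_in, syn1[out_point[j]]))
        g = (1 - out_code[j] - f) * alpha
        neu1e += g * syn1[out_point[j]]             # gradient w.r.t. v_{w_t}
        syn1[out_point[j]] += g * v_in              # update v'_{n(w_{t+q}, j)}
    syn0[w_t] += neu1e                              # update v_{w_t} once per (t, q) pair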
if (hs) for (d = 0; d < vocab[word].codelen; d++) {
f = 0;
l2 = vocab[word].point[d] * layer1_size;
// Propagate hidden -> output
// l1 is w_{t+q}, l2 is n(w_t, j)
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];
if (f <= -MAX_EXP) continue;
else if (f >= MAX_EXP) continue;
else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Propagate errors output -> hidden
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1];
}
The code above is the same as the continuous-bag-of-words training code, except that it replaces "neu1[c]" with "syn0[c + l1]". The gradients used are those of the hierarchical-softmax CBOW derivation, so it actually trains a CBOW-style model.
$w_{t+q}$ is $w_O$, $w_t$ is $w_I$ in equation (2).
And
$$ ns_{xt} = {v'_{w_x}}^T \cdot v_{w_t} $$$w_x \in \{ w_{t+q}, w_i \sim P_n(w) \}$, $q \in [-c, c], q\neq{0}$
Then
When updating $ v'_{w_{t+q}}$ for each $q \in [-c, c], q\neq{0}$, the gradient is:
$$ \nabla \log p(w_{t+q} | w_t) = [ 1 - \sigma(ns_{(t+q)t}) ] \cdot v_{w_t} $$When updating $ v'_{w_i},\, w_i \sim P_n(w) $, the gradient is:
$$ \nabla \log p(w_{t+q} | w_t) = [ 1 - \sigma(-ns_{it}) ] \cdot -v_{w_t} $$When updating $ v_{w_t}$, the gradient is:
$$ \nabla \log p(w_{t+q} | w_t) = [ 1 - \sigma(ns_{(t+q)t}) ] \cdot v'_{w_{t+q}} + \sum_{i}^{k} E_{w_i \sim P_n(w)} \left[ [ 1 - \sigma(-ns_{it}) ] \cdot -v'_{w_i} \right] $$
if (negative > 0) for (d = 0; d < negative + 1; d++) {
if (d == 0) {
target = word;
// should target = last_word;
label = 1;
} else {
next_random = next_random * (unsigned long long)25214903917 + 11;
target = table[(next_random >> 16) % table_size];
if (target == 0) target = next_random % (vocab_size - 1) + 1;
if (target == word) continue;
label = 0;
}
l2 = target * layer1_size;
f = 0;
// l1 is w_{t+q}, l2 is in [w_t, w_i \sim P_n(w)]
// l1 should be w_t
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
if (f > MAX_EXP) g = (label - 1) * alpha;
else if (f < -MAX_EXP) g = (label - 0) * alpha;
else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
Again, the code above is the same as the continuous-bag-of-words training code, except that it replaces "neu1[c]" with "syn0[c + l1]". The gradients used are those of the negative-sampling CBOW derivation, so it actually trains a CBOW-style model.
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
But it should update $v_{w_t}$ instead.
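Under this reading, a corrected negative-sampling skip-gram step would look roughly like the following numpy sketch: the positive target is the context word $w_{t+q}$, the input vector is $v_{w_t}$, and $v_{w_t}$ receives the accumulated update. As before the names are illustrative and the sampling, clipping, and table-lookup details are omitted:
import numpy as np
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
def skipgram_ns_step(syn0, syn1neg, w_t, w_ctx, negatives, alpha):
    # w_t: center word; w_ctx: the context word w_{t+q}; negatives: w_i ~ P_n(w)
    v_in = syn0[w_t]                                # v_{w_t}
    neu1e = np.zeros_like(v_in)
    for target, label in [(w_ctx, 1)] + [(w_i, 0) for w_i in negatives]:
        f = sigmoid(np.dot(v_in, syn1neg[target]))
        g = (label - f) * alpha
        neu1e += g * syn1neg[target]                # gradient w.r.t. v_{w_t}
        syn1neg[target] += g * v_in                 # update v'_{w_{t+q}} or v'_{w_i}
    syn0[w_t] += neu1e                              # update v_{w_t}, not v_{w_{t+q}}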
Famous implementations share this behavior, so I plan to implement a C++ version of word2vec using the correct gradients in the coming weeks.
Word2vec is a powerful way to model language, but the original C implementation doesn't use the correct gradients for the skip-gram model, and neither does Gensim. I have not examined the word2vec implementation in Deeplearning4j, so I can't draw a conclusion about it.