In [1]:
%load_ext autoreload
%autoreload 2

This is equivalent to the demo-word.sh, demo-analogy.sh and demo-classes.sh scripts from Google.

Training

Download some data, for example: http://mattmahoney.net/dc/text8.zip
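The corpus can also be fetched and unpacked directly from Python. A minimal sketch (the paths here are just examples, and any download method works):

import os
import urllib
import zipfile

url = 'http://mattmahoney.net/dc/text8.zip'
dest = os.path.expanduser('~/Downloads/text8.zip')
urllib.urlretrieve(url, dest)                             # Python 2; on Python 3 use urllib.request.urlretrieve
zipfile.ZipFile(dest).extractall(os.path.dirname(dest))   # the archive contains a single file named text8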


In [2]:
import word2vec

Note that this could take a long time depending on the parameters


In [3]:
word2vec.word2vec('/Users/danielfrg/Downloads/text8', '/Users/danielfrg/Downloads/text8.bin', size=100, verbose=True)


Starting training using file /Users/danielfrg/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 270.54k  

That generated a text8.bin file containing the word vectors in a binary format.
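The .bin file follows the original word2vec binary format: a text header with the vocabulary size and the vector dimension, followed by each word and its raw float vector. A quick, purely illustrative way to peek at the header:

with open('/Users/danielfrg/Downloads/text8.bin', 'rb') as f:
    print(f.readline())    # header line: "<vocab size> <vector size>"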


In [4]:
word2vec.word2clusters('/Users/danielfrg/Downloads/text8', '/Users/danielfrg/Downloads/text8-clusters.txt', 100, verbose=True)


Starting training using file /Users/danielfrg/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.02%  Words/thread/sec: 273.11k  

That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
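The file is plain text, with one word per line followed by its cluster number. A quick, purely illustrative peek at the first few lines:

with open('/Users/danielfrg/Downloads/text8-clusters.txt') as f:
    for line in f.readlines()[:5]:
        print(line.strip())    # each line: "<word> <cluster number>"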

Word2vec


In [1]:
import word2vec

Load the binary file created above


In [2]:
model = word2vec.load('/Users/danielfrg/Downloads/text8.bin')

We can take a look at the vocabulary as a numpy array


In [3]:
model.vocab


Out[3]:
array([u'</s>', u'the', u'of', ..., u'bredon', u'skirting', u'santamaria'], 
      dtype='<U78')

Or take a look at the whole matrix


In [4]:
model.vectors.shape


Out[4]:
(71291, 100)

In [5]:
model.vectors


Out[5]:
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.19530475,  0.06706266, -0.06676513, ..., -0.02816086,
        -0.14898917,  0.09727648],
       [-0.01193852,  0.13753659, -0.00024625, ..., -0.00940192,
        -0.14928788, -0.00186248],
       ..., 
       [ 0.03080524,  0.04359109, -0.02770138, ...,  0.02682899,
        -0.04904861,  0.00400822],
       [ 0.12868328, -0.09444693,  0.00022423, ...,  0.10692056,
        -0.11976643,  0.01290878],
       [-0.13700521,  0.02007952,  0.00595254, ...,  0.09005585,
         0.08282874,  0.09130787]])

We can retrieve the vectors of individual words


In [6]:
model['dog'].shape


Out[6]:
(100,)

In [7]:
model['dog'][:10]


Out[7]:
array([-0.18659274,  0.27234063, -0.21508452,  0.01970069, -0.06842735,
        0.04821981,  0.05422455,  0.11214764,  0.08051528, -0.01919792])

We can do simple queries to retrieve words similar to "socks":


In [8]:
indexes, metrics = model.cosine('socks')
indexes, metrics


Out[8]:
(array([19427, 13879, 13940, 21250, 27485, 19011, 14017, 27316, 28569, 25292]),
 array([ 0.79034326,  0.77553328,  0.76619852,  0.7653784 ,  0.76218141,
         0.75970452,  0.75899984,  0.7544875 ,  0.750367  ,  0.74552157]))

This returned a tuple with two items:

  1. a numpy array with the indexes of the similar words in the vocabulary
  2. a numpy array with the cosine similarity to each of those words

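For intuition, essentially the same ranking can be computed by hand from the vector matrix. A minimal numpy sketch (illustrative only, not the library's implementation):

import numpy as np

target = model['socks']
sims = model.vectors.dot(target) / (
    np.linalg.norm(model.vectors, axis=1) * np.linalg.norm(target))
order = np.argsort(sims)[::-1]      # most similar first
order[1:11], sims[order[1:11]]      # skip "socks" itself, keep the top 10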
It's possible to get the words for those indexes


In [9]:
model.vocab[indexes]


Out[9]:
array([u'curly', u'pants', u'fin', u'skirt', u'straps', u'sleeves',
       u'hats', u'sandals', u'paw', u'fists'], 
      dtype='<U78')

There is a helper function to create a combined response: a numpy record array


In [10]:
model.generate_response(indexes, metrics)


Out[10]:
rec.array([(u'curly', 0.7903432572899406), (u'pants', 0.7755332828881559),
       (u'fin', 0.7661985156103499), (u'skirt', 0.7653784049761758),
       (u'straps', 0.7621814070595871), (u'sleeves', 0.7597045154700648),
       (u'hats', 0.7589998379430011), (u'sandals', 0.7544875003008115),
       (u'paw', 0.750366995120531), (u'fists', 0.7455215743986232)], 
      dtype=[(u'word', '<U312'), (u'metric', '<f8')])
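
Since this is a numpy record array, the two fields can also be pulled out by name:

response = model.generate_response(indexes, metrics)
response['word'], response['metric']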

With that numpy array it is easy to make a pure Python response:


In [11]:
model.generate_response(indexes, metrics).tolist()


Out[11]:
[(u'curly', 0.7903432572899406),
 (u'pants', 0.7755332828881559),
 (u'fin', 0.7661985156103499),
 (u'skirt', 0.7653784049761758),
 (u'straps', 0.7621814070595871),
 (u'sleeves', 0.7597045154700648),
 (u'hats', 0.7589998379430011),
 (u'sandals', 0.7544875003008115),
 (u'paw', 0.750366995120531),
 (u'fists', 0.7455215743986232)]

Analogies

It's possible to do more complex queries, like analogies such as: king - man + woman = queen. This method returns the same as cosine: the indexes of the words in the vocabulary and the metrics.


In [12]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics


Out[12]:
(array([  903,  2032,  3419,  6839,  1061,  1151, 10708,   525,  2074,   387]),
 array([ 0.29142041,  0.25738584,  0.25242398,  0.25085615,  0.25043206,
         0.24834467,  0.24692792,  0.23686954,  0.23649967,  0.23527399]))

In [13]:
model.generate_response(indexes, metrics).tolist()


Out[13]:
[(u'queen', 0.29142041319188594),
 (u'elizabeth', 0.2573858437509287),
 (u'princess', 0.2524239753609573),
 (u'empress', 0.25085614765702235),
 (u'prince', 0.2504320556591395),
 (u'daughter', 0.24834467154521817),
 (u'isabella', 0.2469279243593665),
 (u'emperor', 0.23686953710844988),
 (u'throne', 0.23649966580355197),
 (u'son', 0.23527399371736316)]
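
Under the hood an analogy like this is essentially vector arithmetic followed by a cosine ranking. A rough numpy sketch (illustrative only; the metric values will not match model.analogy exactly):

import numpy as np

# king - man + woman, then rank every word by cosine similarity to the result
target = model['king'] - model['man'] + model['woman']
sims = model.vectors.dot(target) / (
    np.linalg.norm(model.vectors, axis=1) * np.linalg.norm(target))
exclude = set(['king', 'man', 'woman'])    # drop the query words themselves
ranked = [i for i in np.argsort(sims)[::-1] if model.vocab[i] not in exclude]
model.vocab[ranked[:10]]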

Clusters


In [14]:
clusters = word2vec.load_clusters('/Users/danielfrg/Downloads/text8-clusters.txt')

We can get the cluster number for individual words


In [15]:
clusters['dog']


Out[15]:
11

We can get all the words grouped in a specific cluster


In [16]:
clusters.get_words_on_cluster(90).shape


Out[16]:
(2396,)

In [17]:
clusters.get_words_on_cluster(90)[:10]


Out[17]:
array(['paired', 'stranded', 'stained', 'casts', 'cleaned', 'filtered',
       'boring', 'disappears', 'engraved', 'jar'], dtype=object)

We can add the clusters to the word2vec model and generate a response that includes the cluster number for each word


In [18]:
model.clusters = clusters

In [19]:
indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)

In [20]:
model.generate_response(indexes, metrics).tolist()


Out[20]:
[(u'berlin', 0.2972392934316237, 33),
 (u'munich', 0.28325824633538604, 33),
 (u'vienna', 0.2807684777840865, 33),
 (u'leipzig', 0.2618531952118039, 72),
 (u'heidelberg', 0.24388784263661334, 72),
 (u'moscow', 0.23368751705624013, 81),
 (u'bonn', 0.23262011080843972, 33),
 (u'budapest', 0.23008734564402614, 33),
 (u'stuttgart', 0.2270129408196096, 23),
 (u'freiburg', 0.22626312626968229, 72)]
