In [1]:

    
%load_ext autoreload
%autoreload 2

This notebook reproduces the demo-word.sh, demo-analogy.sh and demo-classes.sh scripts from Google's word2vec distribution.

Training

Download some data, for example: http://mattmahoney.net/dc/text8.zip



In [2]:

    
import word2vec

Note that training can take a long time, depending on the parameters.



In [3]:

    
word2vec.word2vec('/Users/danielfrg/Downloads/text8', '/Users/danielfrg/Downloads/text8.bin', size=100, verbose=True)









    



Starting training using file /Users/danielfrg/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 270.54k

That generated a text8.bin file containing the word vectors in a binary format.
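
As a rough illustration of what "binary format" means here: the original word2vec C tool writes a text header with the vocabulary size and the vector size, followed by one record per word (the word, a space, then the vector as raw 32-bit floats). The sketch below parses that layout from a small in-memory stream. It assumes little-endian floats and that the file follows this layout; `read_word2vec_binary` and the toy data are illustrative names, not part of the word2vec package. In practice `word2vec.load` handles this for you.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Parse word vectors from a binary stream in the assumed word2vec layout."""
    # Header line: "<vocab_size> <vector_size>\n" in ASCII.
    vocab_size, vector_size = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # The word is stored as bytes terminated by a single space.
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that ends the previous record
                chars.append(ch)
        word = b"".join(chars).decode("utf-8")
        # The vector follows as vector_size float32 values (assumed little-endian).
        raw = stream.read(4 * vector_size)
        vectors[word] = struct.unpack("<%df" % vector_size, raw)
    return vectors


# Build a tiny two-word example in memory instead of reading a real file.
toy = io.BytesIO()
toy.write(b"2 3\n")
toy.write(b"dog " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n")
toy.write(b"cat " + struct.pack("<3f", 0.5, 0.25, 0.0) + b"\n")
toy.seek(0)

toy_vectors = read_word2vec_binary(toy)
print(toy_vectors["dog"])
```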



In [4]:

    
word2vec.word2clusters('/Users/danielfrg/Downloads/text8', '/Users/danielfrg/Downloads/text8-clusters.txt', 100, verbose=True)









    



Starting training using file /Users/danielfrg/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.02%  Words/thread/sec: 273.11k

That created a text8-clusters.txt file with the cluster assigned to every word in the vocabulary.

Word2vec



In [1]:

    
import word2vec

Load the binary file created above:



In [2]:

    
model = word2vec.load('/Users/danielfrg/Downloads/text8.bin')

We can take a look at the vocabulary as a numpy array:



In [3]:

    
model.vocab









    Out[3]:





array([u'</s>', u'the', u'of', ..., u'bredon', u'skirting', u'santamaria'], 
      dtype='<U78')

Or take a look at the whole matrix of word vectors:



In [4]:

    
model.vectors.shape









    Out[4]:





(71291, 100)



In [5]:

    
model.vectors









    Out[5]:





array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.19530475,  0.06706266, -0.06676513, ..., -0.02816086,
        -0.14898917,  0.09727648],
       [-0.01193852,  0.13753659, -0.00024625, ..., -0.00940192,
        -0.14928788, -0.00186248],
       ..., 
       [ 0.03080524,  0.04359109, -0.02770138, ...,  0.02682899,
        -0.04904861,  0.00400822],
       [ 0.12868328, -0.09444693,  0.00022423, ...,  0.10692056,
        -0.11976643,  0.01290878],
       [-0.13700521,  0.02007952,  0.00595254, ...,  0.09005585,
         0.08282874,  0.09130787]])

We can retrieve the vector of an individual word:



In [6]:

    
model['dog'].shape









    Out[6]:





(100,)



In [7]:

    
model['dog'][:10]









    Out[7]:





array([-0.18659274,  0.27234063, -0.21508452,  0.01970069, -0.06842735,
        0.04821981,  0.05422455,  0.11214764,  0.08051528, -0.01919792])

We can do simple queries to retrieve words similar to "socks":



In [8]:

    
indexes, metrics = model.cosine('socks')
indexes, metrics









    Out[8]:





(array([19427, 13879, 13940, 21250, 27485, 19011, 14017, 27316, 28569, 25292]),
 array([ 0.79034326,  0.77553328,  0.76619852,  0.7653784 ,  0.76218141,
         0.75970452,  0.75899984,  0.7544875 ,  0.750367  ,  0.74552157]))

This returned a tuple with 2 items:

a numpy array with the indexes of the similar words in the vocabulary
a numpy array with the cosine similarity to each of those words

It's possible to look up the words at those indexes:



In [9]:

    
model.vocab[indexes]









    Out[9]:





array([u'curly', u'pants', u'fin', u'skirt', u'straps', u'sleeves',
       u'hats', u'sandals', u'paw', u'fists'], 
      dtype='<U78')
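
To make the idea behind this kind of query concrete, here is a minimal sketch in plain Python: score every other word by cosine similarity to the query vector, keep the top n, and map the indexes back to words. The vocabulary, the 2-dimensional vectors, and the helper names (`cosine_similarity`, `most_similar`) are made up for illustration; this is not the package's implementation, only the general technique.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vocab, vectors, n=2):
    """Return (indexes, metrics) of the n words closest to `word`."""
    query = vectors[vocab.index(word)]
    scored = [
        (cosine_similarity(query, vec), idx)
        for idx, vec in enumerate(vectors)
        if vocab[idx] != word  # do not return the query word itself
    ]
    scored.sort(reverse=True)  # highest similarity first
    top = scored[:n]
    indexes = [idx for _, idx in top]
    metrics = [score for score, _ in top]
    return indexes, metrics


# Hypothetical toy data; real vectors have 100 dimensions in this notebook.
toy_vocab = ["socks", "pants", "river", "skirt"]
toy_vectors = [
    [1.0, 0.0],  # socks
    [0.8, 0.6],  # pants
    [0.0, 1.0],  # river
    [0.6, 0.8],  # skirt
]

toy_indexes, toy_metrics = most_similar("socks", toy_vocab, toy_vectors, n=2)
toy_words = [toy_vocab[i] for i in toy_indexes]
print(toy_words, toy_metrics)
```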

There is a helper function that creates a combined response, a numpy record array:



In [10]:

    
model.generate_response(indexes, metrics)









    Out[10]:





rec.array([(u'curly', 0.7903432572899406), (u'pants', 0.7755332828881559),
       (u'fin', 0.7661985156103499), (u'skirt', 0.7653784049761758),
       (u'straps', 0.7621814070595871), (u'sleeves', 0.7597045154700648),
       (u'hats', 0.7589998379430011), (u'sandals', 0.7544875003008115),
       (u'paw', 0.750366995120531), (u'fists', 0.7455215743986232)], 
      dtype=[(u'word', '<U312'), (u'metric', '<f8')])

With that record array, it is easy to produce a pure Python response:



In [11]:

    
model.generate_response(indexes, metrics).tolist()









    Out[11]:





[(u'curly', 0.7903432572899406),
 (u'pants', 0.7755332828881559),
 (u'fin', 0.7661985156103499),
 (u'skirt', 0.7653784049761758),
 (u'straps', 0.7621814070595871),
 (u'sleeves', 0.7597045154700648),
 (u'hats', 0.7589998379430011),
 (u'sandals', 0.7544875003008115),
 (u'paw', 0.750366995120531),
 (u'fists', 0.7455215743986232)]

Analogies

It's possible to run more complex queries such as analogies, for example: king - man + woman = queen. Like cosine, this method returns the indexes of the words in the vocab and the metric.



In [12]:

    
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics









    Out[12]:





(array([  903,  2032,  3419,  6839,  1061,  1151, 10708,   525,  2074,   387]),
 array([ 0.29142041,  0.25738584,  0.25242398,  0.25085615,  0.25043206,
         0.24834467,  0.24692792,  0.23686954,  0.23649967,  0.23527399]))



In [13]:

    
model.generate_response(indexes, metrics).tolist()









    Out[13]:





[(u'queen', 0.29142041319188594),
 (u'elizabeth', 0.2573858437509287),
 (u'princess', 0.2524239753609573),
 (u'empress', 0.25085614765702235),
 (u'prince', 0.2504320556591395),
 (u'daughter', 0.24834467154521817),
 (u'isabella', 0.2469279243593665),
 (u'emperor', 0.23686953710844988),
 (u'throne', 0.23649966580355197),
 (u'son', 0.23527399371736316)]
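
One common way to picture the king - man + woman query is as vector arithmetic followed by a nearest-neighbour search. The sketch below does that on hand-made 2-dimensional vectors. It is an illustration of the general idea only: `toy_analogy` and the toy vectors are hypothetical, and `model.analogy` may compute its metric differently, so the scores here should not be compared with the ones above.

```python
import math

# Hypothetical vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "princess": [0.7, -1.0],
    "apple": [-1.0, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_analogy(pos, neg, vectors, n=2):
    """Add the `pos` vectors, subtract the `neg` vectors, rank the rest."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in pos:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in neg:
        target = [t - v for t, v in zip(target, vectors[word])]
    exclude = set(pos) | set(neg)  # never return the input words
    scored = [
        (cosine_similarity(target, vec), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:n]]


toy_result = toy_analogy(["king", "woman"], ["man"], toy, n=2)
print(toy_result)
```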

Clusters



In [14]:

    
clusters = word2vec.load_clusters('/Users/danielfrg/Downloads/text8-clusters.txt')

We can get the cluster number for individual words:



In [15]:

    
clusters['dog']









    Out[15]:





11

We can get all the words grouped in a specific cluster:



In [16]:

    
clusters.get_words_on_cluster(90).shape









    Out[16]:





(2396,)



In [17]:

    
clusters.get_words_on_cluster(90)[:10]









    Out[17]:





array(['paired', 'stranded', 'stained', 'casts', 'cleaned', 'filtered',
       'boring', 'disappears', 'engraved', 'jar'], dtype=object)
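
Conceptually, the two lookups above are the two directions of one mapping: word to cluster, and cluster to words. A minimal sketch of inverting a word-to-cluster dictionary follows. The dictionary contents and `group_by_cluster` are illustrative assumptions, not the package's internals.

```python
from collections import defaultdict

# Hypothetical word -> cluster assignments for illustration.
toy_word_to_cluster = {
    "dog": 11,
    "cat": 11,
    "paired": 90,
    "stained": 90,
    "jar": 90,
}


def group_by_cluster(word_to_cluster):
    """Invert a word -> cluster mapping into cluster -> list of words."""
    groups = defaultdict(list)
    for word, cluster_id in word_to_cluster.items():
        groups[cluster_id].append(word)
    return dict(groups)


toy_groups = group_by_cluster(toy_word_to_cluster)
print(toy_word_to_cluster["dog"])  # cluster of a single word
print(toy_groups[90])              # all words in one cluster
```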

We can add the clusters to the word2vec model and generate a response that includes the cluster of each word:



In [18]:

    
model.clusters = clusters



In [19]:

    
indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)



In [20]:

    
model.generate_response(indexes, metrics).tolist()









    Out[20]:





[(u'berlin', 0.2972392934316237, 33),
 (u'munich', 0.28325824633538604, 33),
 (u'vienna', 0.2807684777840865, 33),
 (u'leipzig', 0.2618531952118039, 72),
 (u'heidelberg', 0.24388784263661334, 72),
 (u'moscow', 0.23368751705624013, 81),
 (u'bonn', 0.23262011080843972, 33),
 (u'budapest', 0.23008734564402614, 33),
 (u'stuttgart', 0.2270129408196096, 23),
 (u'freiburg', 0.22626312626968229, 72)]
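
The shape of this response, a list of (word, metric, cluster) tuples, can be sketched in plain Python as a zip over words and metrics plus a cluster lookup. The helper `combine` and the rounded values below are assumptions for illustration, not how `generate_response` is implemented.

```python
# Hypothetical inputs, loosely modelled on the output above.
toy_words = ["berlin", "leipzig", "moscow"]
toy_metrics = [0.297, 0.262, 0.234]
toy_clusters = {"berlin": 33, "leipzig": 72, "moscow": 81}


def combine(words, metrics, clusters=None):
    """Pair each word with its metric, and with its cluster if available."""
    if clusters is None:
        return list(zip(words, metrics))
    return [(w, m, clusters[w]) for w, m in zip(words, metrics)]


toy_response = combine(toy_words, toy_metrics, toy_clusters)
print(toy_response)
```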


