In [1]:
%load_ext autoreload
%autoreload 2
This is equivalent to the demo-word.sh, demo-analogy.sh and demo-classes.sh from Google.
Download some data, for example: http://mattmahoney.net/dc/text8.zip
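One way to fetch and unpack the archive from Python is sketched below. The helper names `download` and `extract` and the local paths are made up for illustration; only the URL comes from the text above.

```python
import os
import urllib.request
import zipfile

TEXT8_URL = "http://mattmahoney.net/dc/text8.zip"  # URL given above


def download(url, zip_path):
    """Download url to zip_path unless the file already exists."""
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    return zip_path


def extract(zip_path, dest_dir):
    """Unpack every member of the archive and return the extracted paths."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return [os.path.join(dest_dir, name) for name in archive.namelist()]


# Usage (hypothetical local paths, not run here):
# extract(download(TEXT8_URL, "text8.zip"), ".")
```

The download is skipped when the zip file is already on disk, so re-running the cell does not fetch the archive again.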
In [2]:
import word2vec
Note that training could take a long time depending on the parameters.
In [3]:
word2vec.word2vec('/Users/danielfrg/Downloads/text8', '/Users/danielfrg/Downloads/text8.bin', size=100, verbose=True)
That generated a text8.bin file containing the word vectors in a binary format.
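To give an idea of what such a binary file can look like, here is a minimal round-trip sketch. It assumes the layout written by the original word2vec C tool (a text header with the vocabulary size and vector length, then each word followed by a space and its float32 values) and little-endian floats. `write_bin` and `read_bin` are hypothetical helpers, not part of the word2vec package.

```python
import io
import struct


def write_bin(stream, vectors):
    """Write {word: [floats]} using the layout assumed above."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for word, vec in vectors.items():
        stream.write(word.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dim}f", *vec))  # assumption: little-endian float32
        stream.write(b"\n")


def read_bin(stream):
    """Read the same layout back into {word: tuple_of_floats}."""
    vocab_size, dim = (int(n) for n in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that ends the previous record
                chars += ch
        vectors[chars.decode("utf-8")] = struct.unpack(f"<{dim}f", stream.read(4 * dim))
    return vectors


buffer = io.BytesIO()
write_bin(buffer, {"dog": [0.5, -1.0, 2.0], "cat": [0.25, 0.0, -4.0]})
buffer.seek(0)
print(read_bin(buffer))
```

In practice word2vec.load does this work; the sketch only shows why the file is compact and not human readable.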
In [4]:
word2vec.word2clusters('/Users/danielfrg/Downloads/text8', '/Users/danielfrg/Downloads/text8-clusters.txt', 100, verbose=True)
That created a text8-clusters.txt file with the cluster for every word in the vocabulary.
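As a rough picture of that file, the sketch below assumes each line holds a word and its cluster number separated by a space, which is what the original C tool writes in its classes mode. `parse_clusters` and the sample lines are made up for illustration.

```python
from collections import defaultdict


def parse_clusters(lines):
    """Return (word -> cluster id, cluster id -> list of words)."""
    word_to_cluster = {}
    cluster_to_words = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, cluster = parts[0], int(parts[1])
        word_to_cluster[word] = cluster
        cluster_to_words[cluster].append(word)
    return word_to_cluster, dict(cluster_to_words)


sample = ["dog 11", "cat 11", "paris 90", ""]  # made-up sample lines
by_word, by_cluster = parse_clusters(sample)
print(by_word["dog"], by_cluster[11])
```

Keeping both mappings makes it cheap to look up the cluster of a word and to list every word in a cluster.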
In [1]:
import word2vec
Load the binary file created above
In [2]:
model = word2vec.load('/Users/danielfrg/Downloads/text8.bin')
We can take a look at the vocabulary as a numpy array
In [3]:
model.vocab
Out[3]:
Or take a look at the whole matrix
In [4]:
model.vectors.shape
Out[4]:
In [5]:
model.vectors
Out[5]:
We can retrieve the vector of individual words
In [6]:
model['dog'].shape
Out[6]:
In [7]:
model['dog'][:10]
Out[7]:
We can do simple queries to retrieve words similar to "socks":
In [8]:
indexes, metrics = model.cosine('socks')
indexes, metrics
Out[8]:
This returned a tuple with 2 items: the indexes of the similar words in the vocab and the metric (cosine similarity) for each of them.
It's possible to get the words at those indexes
In [9]:
model.vocab[indexes]
Out[9]:
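The similarity query can be pictured with a small pure Python sketch. The vocabulary and the 2-dimensional vectors are toy values invented for the example, and `most_similar` is a hypothetical helper, not the package's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(vocab, vectors, word, n=2):
    """Return (indexes, metrics) of the n words closest to word."""
    target = vectors[vocab.index(word)]
    scored = [
        (cosine(target, vec), i)
        for i, vec in enumerate(vectors)
        if vocab[i] != word  # leave the query word out of the results
    ]
    scored.sort(reverse=True)
    top = scored[:n]
    return [i for _, i in top], [score for score, _ in top]


toy_vocab = ["socks", "shoes", "gloves", "river"]
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
toy_indexes, toy_metrics = most_similar(toy_vocab, toy_vectors, "socks")
print([toy_vocab[i] for i in toy_indexes], toy_metrics)
```

The indexes point back into the vocabulary, which is why indexing the vocab with them gives the matching words.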
There is a helper function to create a combined response: a numpy record array
In [10]:
model.generate_response(indexes, metrics)
Out[10]:
That numpy array is easy to convert into a pure Python response:
In [11]:
model.generate_response(indexes, metrics).tolist()
Out[11]:
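Conceptually, the combined response pairs every matched word with its metric. A minimal sketch with made-up values (`build_response` is a hypothetical stand-in, not the helper from the package):

```python
def build_response(vocab, indexes, metrics):
    """Pair each matched word with its metric as plain Python tuples."""
    return [(vocab[i], float(m)) for i, m in zip(indexes, metrics)]


toy_vocab = ["socks", "shoes", "gloves"]
print(build_response(toy_vocab, [1, 2], [0.8, 0.6]))
```

A list of plain tuples like this is easy to serialize, for example as JSON.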
It's possible to do more complex queries, like analogies such as: king - man + woman = queen
This method returns the same as cosine: the indexes of the words in the vocab and the metric.
In [12]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics
Out[12]:
In [13]:
model.generate_response(indexes, metrics).tolist()
Out[13]:
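One common way to answer an analogy query is to add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The sketch below follows that recipe on invented 3-dimensional vectors; the package's own implementation may differ in its details.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def analogy(vocab, vectors, pos, neg, n=1):
    """Return (indexes, metrics) for the words nearest to sum(pos) - sum(neg)."""
    unit = [normalize(v) for v in vectors]
    query = [0.0] * len(unit[0])
    for word in pos:
        query = [q + x for q, x in zip(query, unit[vocab.index(word)])]
    for word in neg:
        query = [q - x for q, x in zip(query, unit[vocab.index(word)])]
    query = normalize(query)
    exclude = set(pos) | set(neg)  # query words are not valid answers
    scored = [
        (sum(q * x for q, x in zip(query, vec)), i)
        for i, vec in enumerate(unit)
        if vocab[i] not in exclude
    ]
    scored.sort(reverse=True)
    top = scored[:n]
    return [i for _, i in top], [score for score, _ in top]


# Toy axes (made up): royal, male, female.
toy_vocab = ["king", "queen", "man", "woman", "apple"]
toy_vectors = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1], [0, 0.5, 0.5]]
toy_indexes, toy_metrics = analogy(toy_vocab, toy_vectors, pos=["king", "woman"], neg=["man"])
print(toy_vocab[toy_indexes[0]])
```

Excluding the query words matters: without that step the nearest neighbour of the combined vector is often one of the inputs.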
In [14]:
clusters = word2vec.load_clusters('/Users/danielfrg/Downloads/text8-clusters.txt')
We can get the cluster number for individual words
In [15]:
clusters['dog']
Out[15]:
We can get all the words grouped in a specific cluster
In [16]:
clusters.get_words_on_cluster(90).shape
Out[16]:
In [17]:
clusters.get_words_on_cluster(90)[:10]
Out[17]:
We can add the clusters to the word2vec model and generate a response that includes the clusters
In [18]:
model.clusters = clusters
In [19]:
indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)
In [20]:
model.generate_response(indexes, metrics).tolist()
Out[20]: