In [ ]:
%load_ext autoreload
%autoreload 2
Download some data, for example: http://mattmahoney.net/dc/text8.zip. You could also run make test-data from the root of the repo.
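If you prefer fetching the data from Python, a minimal sketch could look like this (the ../data destination simply mirrors the paths used in the cells below):

```python
# Sketch: fetch and unpack the text8 corpus (roughly what `make test-data` does).
import os
import urllib.request
import zipfile

URL = "http://mattmahoney.net/dc/text8.zip"

def download_text8(dest_dir="../data"):
    """Download text8.zip and extract it into dest_dir, skipping if present."""
    os.makedirs(dest_dir, exist_ok=True)
    if os.path.exists(os.path.join(dest_dir, "text8")):
        return  # already downloaded and extracted
    archive = os.path.join(dest_dir, "text8.zip")
    urllib.request.urlretrieve(URL, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
```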
In [2]:
import word2vec
Run word2phrase
to group frequently co-occurring words, turning for example "Los Angeles" into "Los_Angeles"
In [3]:
word2vec.word2phrase('../data/text8', '../data/text8-phrases', verbose=True)
This created a text8-phrases file that we can use as a better input for word2vec.
Note that you could easily skip this step and use the original text data as input for word2vec directly.
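For intuition, word2phrase scores each bigram by how much more often it appears than chance; a toy pure-Python sketch of that scoring (the real tool is the C implementation, and the min_count/threshold values here are illustrative, not its defaults):

```python
# Toy sketch of the bigram scoring behind word2phrase:
#   score(a, b) = (count(a,b) - min_count) / (count(a) * count(b)) * total_words
# Bigrams scoring above a threshold get joined with "_" in the output text.
from collections import Counter

def find_phrases(tokens, min_count=5, threshold=100.0):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * total
        if score > threshold:
            phrases.add((a, b))
    return phrases
```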
Now train the word2vec model.
In [4]:
word2vec.word2vec('../data/text8-phrases', '../data/text8.bin', size=100, binary=True, verbose=True)
That created a text8.bin file containing the word vectors in a binary format.
Generate the clusters of the vectors based on the trained model.
In [5]:
word2vec.word2clusters('../data/text8', '../data/text8-clusters.txt', 100, verbose=True)
That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
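word2clusters assigns those cluster numbers by running k-means over the trained vectors (the 100 above is the number of clusters). A toy pure-Python sketch of that assignment, with a naive first-k initialization instead of the C code's:

```python
# Toy k-means sketch: label each point with the index of its nearest center.
def kmeans(points, k, iters=10):
    centers = [tuple(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
        # update step: move each center to the mean of its members
        for j in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == j]
            if members:
                centers[j] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels
```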
In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
import word2vec
Load the word2vec
binary file created above
In [3]:
model = word2vec.load('../data/text8.bin')
We can take a look at the vocabulary as a numpy array
In [4]:
model.vocab
Out[4]:
Or take a look at the whole matrix
In [5]:
model.vectors.shape
Out[5]:
In [6]:
model.vectors
Out[6]:
We can retrieve the vector of individual words
In [7]:
model['dog'].shape
Out[7]:
In [8]:
model['dog'][:10]
Out[8]:
We can calculate the distance between two or more words (all pairwise combinations).
In [9]:
model.distance("dog", "cat", "fish")
Out[9]:
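A sketch of what a call like that computes, assuming cosine similarity over every pair of the given words (toy 2-d vectors stand in for the trained ones):

```python
# Toy sketch: cosine similarity for every pair of the requested words.
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def distances(vectors, *words):
    """Return (word1, word2, cosine) for every pair of the given words."""
    return [(a, b, cosine(vectors[a], vectors[b]))
            for a, b in combinations(words, 2)]
```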
In [10]:
indexes, metrics = model.similar("dog")
indexes, metrics
Out[10]:
This returned a tuple with 2 items: a numpy array with the indexes of the similar words in the vocabulary, and a numpy array with the cosine similarity of each one.
We can get the words for those indexes
In [11]:
model.vocab[indexes]
Out[11]:
There is a helper function to create a combined response as a numpy record array
In [12]:
model.generate_response(indexes, metrics)
Out[12]:
It is easy to turn that numpy array into a pure Python response:
In [13]:
model.generate_response(indexes, metrics).tolist()
Out[13]:
Since we trained the model with the output of word2phrase,
we can ask for the similarity of "phrases", that is, combined words such as "Los Angeles"
In [14]:
indexes, metrics = model.similar('los_angeles')
model.generate_response(indexes, metrics).tolist()
Out[14]:
It is possible to do more complex queries, like analogies such as: king - man + woman = queen.
This method returns the same as similar:
the indexes of the words in the vocab and the metric
In [15]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
indexes, metrics
Out[15]:
In [16]:
model.generate_response(indexes, metrics).tolist()
Out[16]:
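The analogy arithmetic above can be sketched in pure Python: add the positive vectors, subtract the negative ones, then rank the remaining vocabulary by cosine similarity to the combined vector (toy 2-d vectors stand in for the trained model):

```python
# Toy sketch of an analogy query: pos - neg, then rank by cosine similarity.
import math

def analogy(vectors, pos, neg, topn=1):
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in pos:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in neg:
        target = [t - x for t, x in zip(target, vectors[w])]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    # exclude the query words themselves, as the library does
    ranked = sorted(
        (w for w in vectors if w not in pos and w not in neg),
        key=lambda w: cos(vectors[w], target),
        reverse=True,
    )
    return ranked[:topn]
```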
In [17]:
clusters = word2vec.load_clusters('../data/text8-clusters.txt')
We can get the cluster number for individual words
In [18]:
clusters.vocab
Out[18]:
We can get all the words grouped in a specific cluster
In [19]:
clusters.get_words_on_cluster(90).shape
Out[19]:
In [20]:
clusters.get_words_on_cluster(90)[:10]
Out[20]:
We can add the clusters to the word2vec model and generate a response that includes the clusters
In [21]:
model.clusters = clusters
In [22]:
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])
In [23]:
model.generate_response(indexes, metrics).tolist()
Out[23]: