In [1]:
using Word2Vec
We first download a text corpus: http://mattmahoney.net/dc/text8.zip and unzip it.
All functions are documented, i.e., we can type ?functionname to check input options.
In [2]:
?word2vec
Out[2]:
In [4]:
word2vec("Downloads/text8", "text8-vec.txt", verbose=true)
This creates a text file, text8-vec.txt, in which each word in text8
is represented by a vector. In certain applications, we want a vector
representation of larger pieces of text. For example, instead of treating "san" and "francisco" as two separate words, we want a single vector that represents "san francisco". This can be achieved by pre-processing the text corpus with the function word2phrase.
In [29]:
word2phrase("Downloads/text8", "text8phrase")
word2vec("text8phrase", "text8phrase-vec.txt", verbose=true)
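To make the idea of phrase pre-processing concrete, here is a minimal pure-Julia sketch that joins frequent adjacent word pairs with an underscore. It is a deliberately simplified, count-based illustration and not the scoring rule that word2phrase itself applies; the sample sentence, the helper name `merge_bigrams`, and the `min_count` threshold are all hypothetical.

```julia
# Toy corpus (hypothetical); the real corpus here would be text8.
tokens = String.(split("we flew to san francisco and san francisco was foggy"))

# Count how often each adjacent pair of words occurs.
counts = Dict{Tuple{String,String},Int}()
for i in 1:length(tokens)-1
    pair = (tokens[i], tokens[i+1])
    counts[pair] = get(counts, pair, 0) + 1
end

# Hypothetical helper: join every pair seen at least `min_count` times.
function merge_bigrams(tokens, counts, min_count)
    out = String[]
    i = 1
    while i <= length(tokens)
        if i < length(tokens) && get(counts, (tokens[i], tokens[i+1]), 0) >= min_count
            push!(out, tokens[i] * "_" * tokens[i+1])  # treat the pair as one token
            i += 2
        else
            push!(out, tokens[i])
            i += 1
        end
    end
    return out
end

merged = merge_bigrams(tokens, counts, 2)
println(join(merged, " "))
```

After a step like this, a phrase such as san_francisco is a single token, so training on the processed corpus gives it its own vector.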
word2clusters assigns each word a cluster (class) ID number; the third argument, 100 here, is the number of clusters.
In [22]:
word2clusters("Downloads/text8", "text8-class.txt", 100)
In [25]:
;ls
In [4]:
model = wordvectors("text8-vec.txt")
Out[4]:
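For orientation, the sketch below parses a tiny in-memory example laid out the way word2vec text output is usually written: a header line with the vocabulary size and the vector dimension, followed by one line per word. This layout is an assumption about the file and the numbers are made up; `read_toy_vectors` is a hypothetical helper, not a Word2Vec.jl function.

```julia
# Toy stand-in for a word2vec text file (values are made up).
toy_file = """
3 4
the 0.1 0.2 0.3 0.4
of 0.5 0.6 0.7 0.8
book 0.9 1.0 1.1 1.2
"""

# Hypothetical helper: read the header, then one word and its vector per line.
function read_toy_vectors(io::IO)
    header = split(readline(io))
    vocab_size = parse(Int, header[1])
    dim = parse(Int, header[2])
    vocab = Vector{String}(undef, vocab_size)
    vectors = Matrix{Float64}(undef, dim, vocab_size)  # one column per word
    for i in 1:vocab_size
        fields = split(readline(io))
        vocab[i] = String(fields[1])
        vectors[:, i] = parse.(Float64, fields[2:end])
    end
    return vocab, vectors
end

vocab, vectors = read_toy_vectors(IOBuffer(toy_file))
println(vocab)
println(size(vectors))
```

Storing one column per word makes it cheap to pull out a single word vector as `vectors[:, i]`.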
Here are some basic operations on the loaded model.
In [26]:
size(model)
Out[26]:
In [7]:
words = vocabulary(model)
Out[7]:
In [36]:
idx = index(model, "book")
Out[36]:
In [37]:
words[idx]
Out[37]:
We can retrieve the vector representation of an individual word and compute the cosine similarity between two words.
In [6]:
get_vector(model, "one")
Out[6]:
In [7]:
similarity(model, "one", "two")
Out[7]:
In [8]:
similarity(model, "one", "hello")
Out[8]:
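As a reminder of what such a score measures, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it for three made-up 3-dimensional vectors; the names `v_one`, `v_two`, `v_hello` and their values are hypothetical and unrelated to the trained model.

```julia
using LinearAlgebra

# Cosine of the angle between two vectors.
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# Made-up vectors: the first two point in similar directions, the third does not.
v_one   = [0.9, 0.1, 0.3]
v_two   = [0.8, 0.2, 0.4]
v_hello = [-0.2, 0.9, 0.1]

println(cosine_similarity(v_one, v_two))    # close to 1
println(cosine_similarity(v_one, v_hello))  # much smaller
```

Values near 1 indicate vectors pointing in nearly the same direction; values near 0 or below indicate unrelated or opposed directions.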
The function cosine(model, word, n) returns the indices and cosine similarities
of the n nearest neighbors of word.
In [5]:
idxs, dists = cosine(model, "paris", 10)
Out[5]:
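A nearest-neighbor lookup of this kind can be sketched in plain Julia: normalize every word vector, take dot products with the query vector, and keep the n largest. The vocabulary, the vectors, and the helper `toy_neighbors` below are hypothetical and only illustrate the idea; they are not how the package is necessarily implemented.

```julia
using LinearAlgebra

# Made-up vocabulary and vectors; each column is one word vector.
toy_vocab = ["paris", "london", "berlin", "banana"]
toy_vectors = [0.9 0.8 0.7 -0.1;
               0.3 0.4 0.5  0.9;
               0.1 0.2 0.1  0.4]

# Hypothetical helper: indices and cosine similarities of the n closest words.
function toy_neighbors(vectors, vocab, word, n)
    unit = vectors ./ sqrt.(sum(abs2, vectors; dims=1))  # scale columns to unit length
    query = unit[:, findfirst(==(word), vocab)]
    sims = unit' * query                                 # cosine similarity to every word
    order = sortperm(sims; rev=true)[1:n]
    return order, sims[order]
end

toy_idxs, toy_sims = toy_neighbors(toy_vectors, toy_vocab, "paris", 3)
println(toy_vocab[toy_idxs])
println(toy_sims)
```

In this sketch the query word itself ranks first, since its similarity to itself is 1.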
We can use Gadfly to plot the 10 words most similar to "paris".
In [3]:
using Gadfly
In [8]:
plot(x=words[idxs], y=dists)
Out[8]:
In [12]:
?analogy
Out[12]:
In [10]:
indxs, dists = analogy(model, ["king", "woman"], ["man"], 8)
Out[10]:
In [11]:
plot(x=words[indxs], y=dists)
Out[11]:
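The positive and negative word lists in an analogy query are commonly combined by vector arithmetic: add the vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The sketch below shows this on made-up 2-dimensional vectors; the dictionary `toy`, the helper `toy_analogy`, and the choice to exclude the query words from the candidates are assumptions of this illustration.

```julia
using LinearAlgebra

# Made-up vectors: first coordinate loosely "royalty", second loosely "gender".
toy = Dict(
    "king"  => [1.0, 1.0],
    "queen" => [1.0, -1.0],
    "man"   => [0.0, 1.0],
    "woman" => [0.0, -1.0],
    "apple" => [-1.0, 0.1],
)

unit(v) = v / norm(v)

# Hypothetical helper: rank words by similarity to sum(positive) - sum(negative).
function toy_analogy(vecs, pos, neg, n)
    target = sum(unit(vecs[w]) for w in pos) - sum(unit(vecs[w]) for w in neg)
    candidates = [w for w in keys(vecs) if !(w in pos) && !(w in neg)]
    scores = [dot(unit(vecs[w]), unit(target)) for w in candidates]
    order = sortperm(scores; rev=true)[1:min(n, length(candidates))]
    return candidates[order], scores[order]
end

best, scores = toy_analogy(toy, ["king", "woman"], ["man"], 2)
println(best)
```

With these toy values, "queen" is the closest remaining word to king + woman - man.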
analogy_words is a wrapper around analogy that returns the matching words themselves instead of their indices and distances.
In [13]:
?analogy_words
Out[13]:
In [23]:
analogy_words(model, ["paris", "germany"], ["france"], 10)
Out[23]:
In [30]:
model2 = wordvectors("text8phrase-vec.txt")
Out[30]:
model2 was trained on the corpus pre-processed by word2phrase, so we can also look up the words and phrases most similar to a phrase.
In [32]:
cosine_similar_words(model2, "los_angeles", 13)
Out[32]:
In [61]:
model3 = wordclusters("text8-class.txt")
Out[61]:
The function clusters returns all the clusters in a model.
In [62]:
clusters(model3)
Out[62]:
We can use get_cluster to retrieve the cluster ID of a given word and use get_words to retrieve all the words
of a given cluster ID.
In [65]:
get_cluster(model3, "two")
Out[65]:
In [66]:
get_words(model3, 39)
Out[66]:
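The two lookups above amount to a word-to-cluster map and its inverse. The pure-Julia sketch below builds both from a handful of made-up assignments; the words, the cluster IDs, and the names `cluster_of` and `words_in` are hypothetical and do not reflect the contents of text8-class.txt.

```julia
# Made-up (word => cluster ID) assignments.
toy_assignments = ["one" => 39, "two" => 39, "paris" => 7, "london" => 7, "book" => 12]

# Word -> cluster ID.
cluster_of = Dict(toy_assignments)

# Cluster ID -> all words assigned to it.
words_in = Dict{Int,Vector{String}}()
for (word, id) in toy_assignments
    push!(get!(words_in, id, String[]), word)
end

println(cluster_of["two"])               # cluster ID of a word
println(words_in[39])                    # all words in that cluster
println(sort(collect(keys(words_in))))   # all cluster IDs
```

Keeping both maps trades a little memory for constant-time lookups in either direction.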