Word2Vec.jl



In [1]:

    
using Word2Vec

Training

Three functions are available for traning:

word2vec
word2phrase
word2cluster

We first download a text corpus: http://mattmahoney.net/dc/text8.zip and unzip it.

All functions are documented, i.e., we can type ?functionname to check input options.



In [2]:

    
?word2vec









    



search: 





    Out[2]:




word2vec(train, output; size=100, window=5, sample=1e-3, hs=0,  negative=5, threads=12, iter=5, min_count=5, alpha=0.025, debug=2, binary=1, cbow=1, save_vocal=Void(), read_vocab=Void(), verbose=false,)

Parameters for training:
    train <file>
        Use text data from <file> to train the model
    output <file>
        Use <file> to save the resulting word vectors / word clusters
    size <Int>
        Set size of word vectors; default is 100
    window <Int>
        Set max skip length between words; default is 5
    sample <AbstractFloat>
        Set threshold for occurrence of words. Those that appear with
        higher frequency in the training data will be randomly
        down-sampled; default is 1e-5.
    hs <Int>
        Use Hierarchical Softmax; default is 1 (0 = not used)
    negative <Int>
        Number of negative examples; default is 0, common values are 
        5 - 10 (0 = not used)
    threads <Int>
        Use <Int> threads (default 12)
    iter <Int>
        Run more training iterations (default 5)
    min_count <Int>
        This will discard words that appear less than <Int> times; default
        is 5
    alpha <AbstractFloat>
        Set the starting learning rate; default is 0.025
    debug <Int>
        Set the debug mode (default = 2 = more info during training)
    binary <Int>
        Save the resulting vectors in binary moded; default is 0 (off)
    cbow <Int>
        Use the continuous back of words model; default is 1 (skip-gram
        model)
    save_vocab <file>
        The vocabulary will be saved to <file>
    read_vocab <file>
        The vocabulary will be read from <file>, not constructed from the
        training data
    verbose <Bool>
        Print output from training







    



word2vec Word2Vec



In [4]:

    
word2vec("Downloads/text8", "text8-vec.txt", verbose=true)









    



Starting training using file Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 356.11k

This will create a text file text8-vec.txt where each word in text8 is represented by a vector. In certain applications, we want to have vector representation of larger piece of text. For example, instead of considering "san" and "francisco" as two words, we want to have a vector to represent "san francisco". This can be achieved by pre-processing the text corpus with the function word2phrase.



In [29]:

    
word2phrase("Downloads/text8", "text8phrase")
word2vec("text8phrase", "text8phrase-vec.txt", verbose=true)









    



Starting training using file Downloads/text8

Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Starting training using file text8phrase
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 341.90k

word2clusters gives each word a class ID number.



In [22]:

    
word2clusters("text8", "text8-class.txt", 100)









    



Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000006  Progress: 99.99%  Words/thread/sec: 382.50k

Modelling



In [25]:

    
;ls









    



text8
text8-class.txt
text8phrase
text8phrase-vec.txt
text8-vec.txt
text8.zip



In [4]:

    
model = wordvectors("text8-vec.txt")









    Out[4]:





WordVectors 71291 words, 100-element Float64 vectors

Here are some basic functionalities.



In [26]:

    
size(model)









    Out[26]:





(100,71291)



In [7]:

    
words = vocabulary(model)









    Out[7]:





71291-element Array{AbstractString,1}:
 "</s>"        
 "the"         
 "of"          
 "and"         
 "one"         
 "in"          
 "a"           
 "to"          
 "zero"        
 "nine"        
 "two"         
 "is"          
 "as"          
 ⋮             
 "raam"        
 "barad"       
 "baume"       
 "mothmen"     
 "gallopin"    
 "horsecollar" 
 "mojitos"     
 "snaggletooth"
 "introvigne"  
 "denishawn"   
 "tamiris"     
 "dolophine"



In [36]:

    
idx = index(model, "book")









    Out[36]:





199



In [37]:

    
words[idx]









    Out[37]:





"book"

We can retrieve the vector representation of individual words and compute the cosine distance between two words.



In [6]:

    
get_vector(model, "one")









    Out[6]:





100-element Array{Float64,1}:
 -0.00124171 
 -0.153338   
  0.102503   
  0.0189016  
  0.0481557  
 -0.017203   
 -0.0345992  
 -0.143795   
  0.13964    
  0.10404    
  0.0987664  
  0.000247274
 -0.0294016  
  ⋮          
 -0.0729129  
  0.00609002 
 -0.115113   
 -0.1635     
  0.104623   
 -0.0815325  
 -0.0979441  
 -0.0522775  
 -0.0893822  
  0.121403   
 -0.0100501  
  0.100918



In [7]:

    
similarity(model, "one", "two")









    Out[7]:





1-element Array{Float64,1}:
 0.795706



In [8]:

    
similarity(model, "one", "hello")









    Out[8]:





1-element Array{Float64,1}:
 0.000406437

The funciton cosine(model, word, n) return the indices and distances of n neighbors of word.



In [5]:

    
idxs, dists = cosine(model, "paris", 10)









    Out[5]:





([1056,5356,3222,6964,4611,9122,4219,4218,12359,15749],[0.9999999999999999,0.7663500088541577,0.754464599846764,0.7269210839899815,0.6949826015235541,0.6906811525014919,0.6899940820154246,0.6874193547947338,0.6820260362215733,0.6817421670970778])

We can use Gadfly to plot the top 10 similar words to "paris"



In [3]:

    
using Gadfly



In [8]:

    
plot(x=words[idxs], y=dists)









    Out[8]:



In [12]:

    
?analogy









    



search: 





    Out[12]:




analogy(wv, pos, neg, n=5)
Compute the analogy similarity between two lists of words. The positions and the similarity values of the top n similar words will be returned.  For example,  king - man + woman = queen will be  pos=["king", "woman"], neg=["man"].







    



analogy analogy_words



In [10]:

    
indxs, dists = analogy(model, ["king", "woman"], ["man"], 8)









    Out[10]:





([904,6854,1062,2033,527,12269,2076,3422],[0.29024117726721255,0.26277586168028433,0.253278904324895,0.25208853175214935,0.24775773633691375,0.24558402441677105,0.24309061947916874,0.2418303817434562])



In [11]:

    
plot(x=words[indxs], y=dists)









    Out[11]:

analogy_words is a wrapper of analogy.



In [13]:

    
?analogy_words









    



search: 





    Out[13]:




analogy_words(wv, pos, neg, n=5)
Return the top n words computed by analogy similarity between positive words pos and negaive words neg. from the WordVectors wv.







    



analogy_words



In [23]:

    
analogy_words(model, ["paris", "germany"], ["france"], 10)









    Out[23]:





10-element Array{AbstractString,1}:
 "berlin"    
 "munich"    
 "leipzig"   
 "vienna"    
 "bonn"      
 "dresden"   
 "hamburg"   
 "stuttgart" 
 "frankfurt" 
 "heidelberg"



In [30]:

    
model2 = wordvectors("text8phrase-vec.txt")









    Out[30]:





WordVectors 98331 words, 100-element Float64 vectors

model2 is pre-processed by word2phrase, so we can compute the similar words of phrases.



In [32]:

    
cosine_similar_words(model2, "los_angeles", 13)









    Out[32]:





13-element Array{AbstractString,1}:
 "los_angeles"  
 "san_francisco"
 "san_diego"    
 "miami"        
 "las_vegas"    
 "seattle"      
 "cincinnati"   
 "cleveland"    
 "st_louis"     
 "california"   
 "chicago"      
 "dallas"       
 "atlanta"

Clustering



In [61]:

    
model3 = wordclusters("text8-class.txt")









    Out[61]:





WordClusters 71291 words, 100 clusters

The function clusters returns all the clusters in a model.



In [62]:

    
clusters(model3)









    Out[62]:





100-element Array{Integer,1}:
  0
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
  ⋮
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99

We can use get_cluster to retrieve the cluster ID of a given word and use get_words to retrieve all the words of a given cluster ID.



In [65]:

    
get_cluster(model3, "two")









    Out[65]:





39



In [66]:

    
get_words(model3, 39)









    Out[66]:





116-element Array{AbstractString,1}:
 "one"          
 "zero"         
 "nine"         
 "two"          
 "eight"        
 "five"         
 "three"        
 "four"         
 "six"          
 "seven"        
 "years"        
 "th"           
 "century"      
 ⋮              
 "interceptions"
 "nisan"        
 "weekday"      
 "ramadan"      
 "weekdays"     
 "workday"      
 "thirtieth"    
 "lunations"    
 "graders"      
 "goodwrench"   
 "spoked"       
 "rublei"