Word2Vec Tutorial

This tutorial follows a blog post written by the creator of gensim.

Preparing the Input

Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence a list of words (utf8 strings):



In [1]:

    
# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



In [2]:

    
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)









    



2016-11-10 19:48:07,037 : INFO : collecting all words and their counts
2016-11-10 19:48:07,043 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-11-10 19:48:07,045 : INFO : collected 3 word types from a corpus of 4 raw words and 2 sentences
2016-11-10 19:48:07,049 : INFO : min_count=1 retains 3 unique words (drops 0)
2016-11-10 19:48:07,054 : INFO : min_count leaves 4 word corpus (100% of original 4)
2016-11-10 19:48:07,059 : INFO : deleting the raw counts dictionary of 3 items
2016-11-10 19:48:07,062 : INFO : sample=0.001 downsamples 3 most-common words
2016-11-10 19:48:07,065 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2016-11-10 19:48:07,068 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2016-11-10 19:48:07,073 : INFO : resetting layer weights
2016-11-10 19:48:07,076 : INFO : training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-11-10 19:48:07,079 : INFO : expecting 2 sentences, matching count from corpus used for vocabulary survey
2016-11-10 19:48:07,099 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-11-10 19:48:07,100 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-11-10 19:48:07,102 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-11-10 19:48:07,104 : INFO : training on 20 raw words (0 effective words) took 0.0s, 0 effective words/s
2016-11-10 19:48:07,108 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay

Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.

Gensim only requires that the input must provide sentences sequentially, when iterated over. No need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…

For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:



In [3]:

    
# create some toy data to use with the following example
import smart_open, os

if not os.path.exists('./data/'):
    os.makedirs('./data/')

filenames = ['./data/f1.txt', './data/f2.txt']

for i, fname in enumerate(filenames):
    with smart_open.smart_open(fname, 'w') as fout:
        for line in sentences[i]:
            fout.write(line + '\n')



In [4]:

    
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()



In [5]:

    
sentences = MySentences('./data/') # a memory-friendly iterator
print(list(sentences))









    



[['first'], ['sentence'], ['second'], ['sentence']]



In [6]:

    
# generate the Word2Vec model
model = gensim.models.Word2Vec(sentences, min_count=1)
print(model)
print(model.vocab)









    



2016-11-10 19:48:07,625 : INFO : collecting all words and their counts
2016-11-10 19:48:07,625 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-11-10 19:48:07,634 : INFO : collected 3 word types from a corpus of 4 raw words and 4 sentences
2016-11-10 19:48:07,634 : INFO : min_count=1 retains 3 unique words (drops 0)
2016-11-10 19:48:07,638 : INFO : min_count leaves 4 word corpus (100% of original 4)
2016-11-10 19:48:07,642 : INFO : deleting the raw counts dictionary of 3 items
2016-11-10 19:48:07,646 : INFO : sample=0.001 downsamples 3 most-common words
2016-11-10 19:48:07,646 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2016-11-10 19:48:07,650 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2016-11-10 19:48:07,654 : INFO : resetting layer weights
2016-11-10 19:48:07,658 : INFO : training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-11-10 19:48:07,662 : INFO : expecting 4 sentences, matching count from corpus used for vocabulary survey
2016-11-10 19:48:07,690 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-11-10 19:48:07,690 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-11-10 19:48:07,694 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-11-10 19:48:07,696 : INFO : training on 20 raw words (0 effective words) took 0.0s, 0 effective words/s
2016-11-10 19:48:07,698 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay






    



Word2Vec(vocab=3, size=100, alpha=0.025)
{'second': <gensim.models.word2vec.Vocab object at 0x00000216B9898470>, 'first': <gensim.models.word2vec.Vocab object at 0x00000216B9898400>, 'sentence': <gensim.models.word2vec.Vocab object at 0x00000216B98984E0>}

Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the MySentences iterator and word2vec doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.

Note to advanced users: calling Word2Vec(sentences) will run two passes over the sentences iterator.

The first pass collects words and their frequencies to build an internal dictionary tree structure.
The second pass trains the neural model.

These two passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:



In [7]:

    
# build the same model, making the 2 steps explicit
new_model = gensim.models.Word2Vec(min_count=1)  # an empty model, no training
new_model.build_vocab(sentences)                 # can be a non-repeatable, 1-pass generator     
new_model.train(sentences)                       # can be a non-repeatable, 1-pass generator
print(new_model)
print(model.vocab)









    



2016-11-10 19:48:07,742 : INFO : collecting all words and their counts
2016-11-10 19:48:07,746 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-11-10 19:48:07,750 : INFO : collected 3 word types from a corpus of 4 raw words and 4 sentences
2016-11-10 19:48:07,754 : INFO : min_count=1 retains 3 unique words (drops 0)
2016-11-10 19:48:07,758 : INFO : min_count leaves 4 word corpus (100% of original 4)
2016-11-10 19:48:07,758 : INFO : deleting the raw counts dictionary of 3 items
2016-11-10 19:48:07,762 : INFO : sample=0.001 downsamples 3 most-common words
2016-11-10 19:48:07,762 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2016-11-10 19:48:07,766 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2016-11-10 19:48:07,766 : INFO : resetting layer weights
2016-11-10 19:48:07,770 : INFO : training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-11-10 19:48:07,770 : INFO : expecting 4 sentences, matching count from corpus used for vocabulary survey
2016-11-10 19:48:07,782 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-11-10 19:48:07,786 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-11-10 19:48:07,786 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-11-10 19:48:07,786 : INFO : training on 20 raw words (0 effective words) took 0.0s, 0 effective words/s
2016-11-10 19:48:07,790 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay






    



Word2Vec(vocab=3, size=100, alpha=0.025)
{'second': <gensim.models.word2vec.Vocab object at 0x00000216B9898470>, 'first': <gensim.models.word2vec.Vocab object at 0x00000216B9898400>, 'sentence': <gensim.models.word2vec.Vocab object at 0x00000216B98984E0>}

More data would be nice

For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim):



In [8]:

    
# Set file names for train and test data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep
lee_train_file = test_data_dir + 'lee_background.cor'



In [9]:

    
class MyText(object):
    def __iter__(self):
        for line in open(lee_train_file):
            # assume there's one document per line, tokens separated by whitespace
            yield line.lower().split()

sentences = MyText()

print(sentences)









    



<__main__.MyText object at 0x00000216B9890940>

Training

Word2Vec accepts several parameters that affect both training speed and quality.

One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to make any meaningful training on those words, so it’s best to ignore them:



In [10]:

    
# default value of min_count=5
model = gensim.models.Word2Vec(sentences, min_count=10)









    



2016-11-10 19:48:08,108 : INFO : collecting all words and their counts
2016-11-10 19:48:08,110 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-11-10 19:48:08,155 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2016-11-10 19:48:08,175 : INFO : min_count=10 retains 806 unique words (drops 9380)
2016-11-10 19:48:08,175 : INFO : min_count leaves 40964 word corpus (68% of original 59890)
2016-11-10 19:48:08,186 : INFO : deleting the raw counts dictionary of 10186 items
2016-11-10 19:48:08,194 : INFO : sample=0.001 downsamples 54 most-common words
2016-11-10 19:48:08,206 : INFO : downsampling leaves estimated 26224 word corpus (64.0% of prior 40964)
2016-11-10 19:48:08,210 : INFO : estimated required memory for 806 words and 100 dimensions: 1047800 bytes
2016-11-10 19:48:08,225 : INFO : resetting layer weights
2016-11-10 19:48:08,298 : INFO : training model with 3 workers on 806 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-11-10 19:48:08,299 : INFO : expecting 300 sentences, matching count from corpus used for vocabulary survey
2016-11-10 19:48:08,623 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-11-10 19:48:08,629 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-11-10 19:48:08,647 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-11-10 19:48:08,650 : INFO : training on 299450 raw words (131035 effective words) took 0.3s, 378385 effective words/s



In [11]:

    
# default value of size=100
model = gensim.models.Word2Vec(sentences, size=200)









    



2016-11-10 19:48:08,667 : INFO : collecting all words and their counts
2016-11-10 19:48:08,674 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-11-10 19:48:08,735 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2016-11-10 19:48:08,755 : INFO : min_count=5 retains 1723 unique words (drops 8463)
2016-11-10 19:48:08,766 : INFO : min_count leaves 46858 word corpus (78% of original 59890)
2016-11-10 19:48:08,795 : INFO : deleting the raw counts dictionary of 10186 items
2016-11-10 19:48:08,800 : INFO : sample=0.001 downsamples 49 most-common words
2016-11-10 19:48:08,803 : INFO : downsampling leaves estimated 32849 word corpus (70.1% of prior 46858)
2016-11-10 19:48:08,805 : INFO : estimated required memory for 1723 words and 200 dimensions: 3618300 bytes
2016-11-10 19:48:08,820 : INFO : resetting layer weights
2016-11-10 19:48:08,910 : INFO : training model with 3 workers on 1723 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5
2016-11-10 19:48:08,911 : INFO : expecting 300 sentences, matching count from corpus used for vocabulary survey
2016-11-10 19:48:09,378 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-11-10 19:48:09,388 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-11-10 19:48:09,404 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-11-10 19:48:09,411 : INFO : training on 299450 raw words (164316 effective words) took 0.5s, 331532 effective words/s

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

The last of the major parameters (full list here) is for training parallelization, to speed up training:



In [12]:

    
# default value of workers=3 (tutorial says 1...)
model = gensim.models.Word2Vec(sentences, workers=4)









    



2016-11-10 19:48:09,429 : INFO : collecting all words and their counts
2016-11-10 19:48:09,434 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-11-10 19:48:09,486 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2016-11-10 19:48:09,507 : INFO : min_count=5 retains 1723 unique words (drops 8463)
2016-11-10 19:48:09,511 : INFO : min_count leaves 46858 word corpus (78% of original 59890)
2016-11-10 19:48:09,528 : INFO : deleting the raw counts dictionary of 10186 items
2016-11-10 19:48:09,535 : INFO : sample=0.001 downsamples 49 most-common words
2016-11-10 19:48:09,538 : INFO : downsampling leaves estimated 32849 word corpus (70.1% of prior 46858)
2016-11-10 19:48:09,541 : INFO : estimated required memory for 1723 words and 100 dimensions: 2239900 bytes
2016-11-10 19:48:09,570 : INFO : resetting layer weights
2016-11-10 19:48:09,648 : INFO : training model with 4 workers on 1723 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-11-10 19:48:09,649 : INFO : expecting 300 sentences, matching count from corpus used for vocabulary survey
2016-11-10 19:48:10,029 : INFO : worker thread finished; awaiting finish of 3 more threads
2016-11-10 19:48:10,034 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-11-10 19:48:10,071 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-11-10 19:48:10,090 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-11-10 19:48:10,095 : INFO : training on 299450 raw words (164298 effective words) took 0.4s, 373031 effective words/s
2016-11-10 19:48:10,100 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay

The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).

Memory

At its core, word2vec model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by min_count parameter) times #size (size parameter) of floats (single precision aka 4 bytes).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.

Evaluating

Word2Vec training is an unsupervised task, there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google have released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the 'datasets' folder.

Gensim support the same evaluation set, in exactly the same format:



In [20]:

    
model.accuracy('./datasets/questions-words.txt')









    



2016-11-10 19:48:43,879 : INFO : family: 0.0% (0/2)
2016-11-10 19:48:43,947 : INFO : gram3-comparative: 0.0% (0/12)
2016-11-10 19:48:43,978 : INFO : gram4-superlative: 0.0% (0/12)
2016-11-10 19:48:44,011 : INFO : gram5-present-participle: 0.0% (0/20)
2016-11-10 19:48:44,061 : INFO : gram6-nationality-adjective: 0.0% (0/20)
2016-11-10 19:48:44,105 : INFO : gram7-past-tense: 0.0% (0/20)
2016-11-10 19:48:44,138 : INFO : gram8-plural: 0.0% (0/12)
2016-11-10 19:48:44,152 : INFO : total: 0.0% (0/98)






    Out[20]:





[{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},
 {'correct': [], 'incorrect': [], 'section': 'capital-world'},
 {'correct': [], 'incorrect': [], 'section': 'currency'},
 {'correct': [], 'incorrect': [], 'section': 'city-in-state'},
 {'correct': [],
  'incorrect': [('he', 'she', 'his', 'her'), ('his', 'her', 'he', 'she')],
  'section': 'family'},
 {'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},
 {'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},
 {'correct': [],
  'incorrect': [('good', 'better', 'great', 'greater'),
   ('good', 'better', 'long', 'longer'),
   ('good', 'better', 'low', 'lower'),
   ('great', 'greater', 'long', 'longer'),
   ('great', 'greater', 'low', 'lower'),
   ('great', 'greater', 'good', 'better'),
   ('long', 'longer', 'low', 'lower'),
   ('long', 'longer', 'good', 'better'),
   ('long', 'longer', 'great', 'greater'),
   ('low', 'lower', 'good', 'better'),
   ('low', 'lower', 'great', 'greater'),
   ('low', 'lower', 'long', 'longer')],
  'section': 'gram3-comparative'},
 {'correct': [],
  'incorrect': [('big', 'biggest', 'good', 'best'),
   ('big', 'biggest', 'great', 'greatest'),
   ('big', 'biggest', 'large', 'largest'),
   ('good', 'best', 'great', 'greatest'),
   ('good', 'best', 'large', 'largest'),
   ('good', 'best', 'big', 'biggest'),
   ('great', 'greatest', 'large', 'largest'),
   ('great', 'greatest', 'big', 'biggest'),
   ('great', 'greatest', 'good', 'best'),
   ('large', 'largest', 'big', 'biggest'),
   ('large', 'largest', 'good', 'best'),
   ('large', 'largest', 'great', 'greatest')],
  'section': 'gram4-superlative'},
 {'correct': [],
  'incorrect': [('go', 'going', 'look', 'looking'),
   ('go', 'going', 'play', 'playing'),
   ('go', 'going', 'run', 'running'),
   ('go', 'going', 'say', 'saying'),
   ('look', 'looking', 'play', 'playing'),
   ('look', 'looking', 'run', 'running'),
   ('look', 'looking', 'say', 'saying'),
   ('look', 'looking', 'go', 'going'),
   ('play', 'playing', 'run', 'running'),
   ('play', 'playing', 'say', 'saying'),
   ('play', 'playing', 'go', 'going'),
   ('play', 'playing', 'look', 'looking'),
   ('run', 'running', 'say', 'saying'),
   ('run', 'running', 'go', 'going'),
   ('run', 'running', 'look', 'looking'),
   ('run', 'running', 'play', 'playing'),
   ('say', 'saying', 'go', 'going'),
   ('say', 'saying', 'look', 'looking'),
   ('say', 'saying', 'play', 'playing'),
   ('say', 'saying', 'run', 'running')],
  'section': 'gram5-present-participle'},
 {'correct': [],
  'incorrect': [('australia', 'australian', 'france', 'french'),
   ('australia', 'australian', 'india', 'indian'),
   ('australia', 'australian', 'israel', 'israeli'),
   ('australia', 'australian', 'switzerland', 'swiss'),
   ('france', 'french', 'india', 'indian'),
   ('france', 'french', 'israel', 'israeli'),
   ('france', 'french', 'switzerland', 'swiss'),
   ('france', 'french', 'australia', 'australian'),
   ('india', 'indian', 'israel', 'israeli'),
   ('india', 'indian', 'switzerland', 'swiss'),
   ('india', 'indian', 'australia', 'australian'),
   ('india', 'indian', 'france', 'french'),
   ('israel', 'israeli', 'switzerland', 'swiss'),
   ('israel', 'israeli', 'australia', 'australian'),
   ('israel', 'israeli', 'france', 'french'),
   ('israel', 'israeli', 'india', 'indian'),
   ('switzerland', 'swiss', 'australia', 'australian'),
   ('switzerland', 'swiss', 'france', 'french'),
   ('switzerland', 'swiss', 'india', 'indian'),
   ('switzerland', 'swiss', 'israel', 'israeli')],
  'section': 'gram6-nationality-adjective'},
 {'correct': [],
  'incorrect': [('going', 'went', 'paying', 'paid'),
   ('going', 'went', 'playing', 'played'),
   ('going', 'went', 'saying', 'said'),
   ('going', 'went', 'taking', 'took'),
   ('paying', 'paid', 'playing', 'played'),
   ('paying', 'paid', 'saying', 'said'),
   ('paying', 'paid', 'taking', 'took'),
   ('paying', 'paid', 'going', 'went'),
   ('playing', 'played', 'saying', 'said'),
   ('playing', 'played', 'taking', 'took'),
   ('playing', 'played', 'going', 'went'),
   ('playing', 'played', 'paying', 'paid'),
   ('saying', 'said', 'taking', 'took'),
   ('saying', 'said', 'going', 'went'),
   ('saying', 'said', 'paying', 'paid'),
   ('saying', 'said', 'playing', 'played'),
   ('taking', 'took', 'going', 'went'),
   ('taking', 'took', 'paying', 'paid'),
   ('taking', 'took', 'playing', 'played'),
   ('taking', 'took', 'saying', 'said')],
  'section': 'gram7-past-tense'},
 {'correct': [],
  'incorrect': [('building', 'buildings', 'car', 'cars'),
   ('building', 'buildings', 'child', 'children'),
   ('building', 'buildings', 'man', 'men'),
   ('car', 'cars', 'child', 'children'),
   ('car', 'cars', 'man', 'men'),
   ('car', 'cars', 'building', 'buildings'),
   ('child', 'children', 'man', 'men'),
   ('child', 'children', 'building', 'buildings'),
   ('child', 'children', 'car', 'cars'),
   ('man', 'men', 'building', 'buildings'),
   ('man', 'men', 'car', 'cars'),
   ('man', 'men', 'child', 'children')],
  'section': 'gram8-plural'},
 {'correct': [], 'incorrect': [], 'section': 'gram9-plural-verbs'},
 {'correct': [],
  'incorrect': [('he', 'she', 'his', 'her'),
   ('his', 'her', 'he', 'she'),
   ('good', 'better', 'great', 'greater'),
   ('good', 'better', 'long', 'longer'),
   ('good', 'better', 'low', 'lower'),
   ('great', 'greater', 'long', 'longer'),
   ('great', 'greater', 'low', 'lower'),
   ('great', 'greater', 'good', 'better'),
   ('long', 'longer', 'low', 'lower'),
   ('long', 'longer', 'good', 'better'),
   ('long', 'longer', 'great', 'greater'),
   ('low', 'lower', 'good', 'better'),
   ('low', 'lower', 'great', 'greater'),
   ('low', 'lower', 'long', 'longer'),
   ('big', 'biggest', 'good', 'best'),
   ('big', 'biggest', 'great', 'greatest'),
   ('big', 'biggest', 'large', 'largest'),
   ('good', 'best', 'great', 'greatest'),
   ('good', 'best', 'large', 'largest'),
   ('good', 'best', 'big', 'biggest'),
   ('great', 'greatest', 'large', 'largest'),
   ('great', 'greatest', 'big', 'biggest'),
   ('great', 'greatest', 'good', 'best'),
   ('large', 'largest', 'big', 'biggest'),
   ('large', 'largest', 'good', 'best'),
   ('large', 'largest', 'great', 'greatest'),
   ('go', 'going', 'look', 'looking'),
   ('go', 'going', 'play', 'playing'),
   ('go', 'going', 'run', 'running'),
   ('go', 'going', 'say', 'saying'),
   ('look', 'looking', 'play', 'playing'),
   ('look', 'looking', 'run', 'running'),
   ('look', 'looking', 'say', 'saying'),
   ('look', 'looking', 'go', 'going'),
   ('play', 'playing', 'run', 'running'),
   ('play', 'playing', 'say', 'saying'),
   ('play', 'playing', 'go', 'going'),
   ('play', 'playing', 'look', 'looking'),
   ('run', 'running', 'say', 'saying'),
   ('run', 'running', 'go', 'going'),
   ('run', 'running', 'look', 'looking'),
   ('run', 'running', 'play', 'playing'),
   ('say', 'saying', 'go', 'going'),
   ('say', 'saying', 'look', 'looking'),
   ('say', 'saying', 'play', 'playing'),
   ('say', 'saying', 'run', 'running'),
   ('australia', 'australian', 'france', 'french'),
   ('australia', 'australian', 'india', 'indian'),
   ('australia', 'australian', 'israel', 'israeli'),
   ('australia', 'australian', 'switzerland', 'swiss'),
   ('france', 'french', 'india', 'indian'),
   ('france', 'french', 'israel', 'israeli'),
   ('france', 'french', 'switzerland', 'swiss'),
   ('france', 'french', 'australia', 'australian'),
   ('india', 'indian', 'israel', 'israeli'),
   ('india', 'indian', 'switzerland', 'swiss'),
   ('india', 'indian', 'australia', 'australian'),
   ('india', 'indian', 'france', 'french'),
   ('israel', 'israeli', 'switzerland', 'swiss'),
   ('israel', 'israeli', 'australia', 'australian'),
   ('israel', 'israeli', 'france', 'french'),
   ('israel', 'israeli', 'india', 'indian'),
   ('switzerland', 'swiss', 'australia', 'australian'),
   ('switzerland', 'swiss', 'france', 'french'),
   ('switzerland', 'swiss', 'india', 'indian'),
   ('switzerland', 'swiss', 'israel', 'israeli'),
   ('going', 'went', 'paying', 'paid'),
   ('going', 'went', 'playing', 'played'),
   ('going', 'went', 'saying', 'said'),
   ('going', 'went', 'taking', 'took'),
   ('paying', 'paid', 'playing', 'played'),
   ('paying', 'paid', 'saying', 'said'),
   ('paying', 'paid', 'taking', 'took'),
   ('paying', 'paid', 'going', 'went'),
   ('playing', 'played', 'saying', 'said'),
   ('playing', 'played', 'taking', 'took'),
   ('playing', 'played', 'going', 'went'),
   ('playing', 'played', 'paying', 'paid'),
   ('saying', 'said', 'taking', 'took'),
   ('saying', 'said', 'going', 'went'),
   ('saying', 'said', 'paying', 'paid'),
   ('saying', 'said', 'playing', 'played'),
   ('taking', 'took', 'going', 'went'),
   ('taking', 'took', 'paying', 'paid'),
   ('taking', 'took', 'playing', 'played'),
   ('taking', 'took', 'saying', 'said'),
   ('building', 'buildings', 'car', 'cars'),
   ('building', 'buildings', 'child', 'children'),
   ('building', 'buildings', 'man', 'men'),
   ('car', 'cars', 'child', 'children'),
   ('car', 'cars', 'man', 'men'),
   ('car', 'cars', 'building', 'buildings'),
   ('child', 'children', 'man', 'men'),
   ('child', 'children', 'building', 'buildings'),
   ('child', 'children', 'car', 'cars'),
   ('man', 'men', 'building', 'buildings'),
   ('man', 'men', 'car', 'cars'),
   ('man', 'men', 'child', 'children')],
  'section': 'total'}]

This accuracy takes an optional parameter restrict_vocab which limits which test examples are to be considered.

Once again, good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa. It’s always best to evaluate directly on your intended task.

Storing and loading models

You can store/load models using the standard gensim methods:



In [14]:

    
from tempfile import mkstemp

fs, temp_path = mkstemp("gensim_temp")  # creates a temp file

model.save(temp_path)  # save the model
new_model = gensim.models.Word2Vec.load(temp_path)  # open the model









    



2016-11-10 19:48:10,552 : INFO : saving Word2Vec object under C:\Users\vtush\AppData\Local\Temp\tmp2ifnei6xgensim_temp, separately None
2016-11-10 19:48:10,556 : INFO : not storing attribute syn0norm
2016-11-10 19:48:10,558 : INFO : not storing attribute cum_table
2016-11-10 19:48:10,601 : INFO : loading Word2Vec object from C:\Users\vtush\AppData\Local\Temp\tmp2ifnei6xgensim_temp
2016-11-10 19:48:10,630 : INFO : setting ignored attribute syn0norm to None
2016-11-10 19:48:10,632 : INFO : setting ignored attribute cum_table to None

which uses pickle internally, optionally mmap‘ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

In addition, you can load models created by the original C tool, both using its text and binary formats:

model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip:
model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)

Online training / Resuming training

Advanced users can load a model and continue training it with more sentences:



In [15]:

    
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = ['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 
                  'training', 'it', 'with', 'more', 'sentences']
model.train(more_sentences)

# cleaning up temp
os.close(fs)
os.remove(temp_path)









    



2016-11-10 19:48:10,675 : INFO : loading Word2Vec object from C:\Users\vtush\AppData\Local\Temp\tmp2ifnei6xgensim_temp
2016-11-10 19:48:10,747 : INFO : setting ignored attribute syn0norm to None
2016-11-10 19:48:10,748 : INFO : setting ignored attribute cum_table to None
2016-11-10 19:48:10,760 : INFO : training model with 4 workers on 1723 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-11-10 19:48:10,768 : INFO : expecting 300 sentences, matching count from corpus used for vocabulary survey
2016-11-10 19:48:10,793 : INFO : worker thread finished; awaiting finish of 3 more threads
2016-11-10 19:48:10,797 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-11-10 19:48:10,801 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-11-10 19:48:10,804 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-11-10 19:48:10,813 : INFO : training on 320 raw words (37 effective words) took 0.0s, 1679 effective words/s
2016-11-10 19:48:10,816 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2016-11-10 19:48:10,819 : WARNING : supplied example count (65) did not equal expected count (1500)

You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.

Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

Using the model

Word2Vec supports several word similarity tasks out of the box:



In [16]:

    
model.most_similar(positive=['human', 'crime'], negative=['party'], topn=1)









    



2016-11-10 19:48:10,851 : INFO : precomputing L2-norms of word weight vectors






    Out[16]:





[('yesterday,', 0.9939109683036804)]



In [17]:

    
model.doesnt_match("input is lunch he sentence cat".split())









    Out[17]:





'sentence'



In [18]:

    
print(model.similarity('human', 'party'))
print(model.similarity('tree', 'murder'))









    



0.998607211002
0.996326758402

If you need the raw output vectors in your application, you can access these either on a word-by-word basis:



In [19]:

    
model['tree']  # raw NumPy vector of a word









    Out[19]:





array([-0.04036779,  0.02003553, -0.04284975,  0.0091192 ,  0.02448689,
       -0.00216569,  0.01173194,  0.04932737,  0.02052016,  0.03154321,
        0.03878884, -0.04184856, -0.00124647, -0.01990931, -0.05680807,
       -0.11135258,  0.11940905, -0.02824721, -0.02524053,  0.0395379 ,
        0.02866056,  0.03140666,  0.00251211, -0.01522053, -0.01220295,
        0.06460804,  0.00598784,  0.08115971, -0.02839125, -0.05328558,
       -0.01666472,  0.05955963,  0.01684751,  0.01161456, -0.02370278,
        0.02833314,  0.0592064 ,  0.02200391, -0.01102725, -0.07582182,
       -0.00775153,  0.00349301,  0.01163415,  0.08868522,  0.01771937,
       -0.01190445, -0.06827834,  0.0019346 ,  0.02322449,  0.0387942 ,
       -0.00897879,  0.03611458, -0.03682299,  0.11514064,  0.06071609,
       -0.04949778, -0.03331855, -0.04093069, -0.05451356, -0.01999527,
       -0.06447638,  0.00418219, -0.05440477,  0.03913156,  0.01763333,
        0.04044701, -0.03877832, -0.04378604,  0.0416888 , -0.09959508,
        0.0519225 , -0.00876303,  0.00755177,  0.0199695 , -0.01945884,
        0.04657057, -0.03607963, -0.0029297 ,  0.03263984,  0.00665692,
       -0.00313392,  0.04821754, -0.0732187 ,  0.02061378, -0.02486049,
        0.02817053,  0.00889842,  0.05549937,  0.00870937, -0.02475666,
        0.06040503, -0.05282155,  0.01314425, -0.01814836, -0.02232507,
       -0.01911532, -0.03181317,  0.11537127,  0.0908583 ,  0.06476384], dtype=float32)

…or en-masse as a 2D NumPy matrix from model.syn0.

Outro

There is a Bonus App on the original blog post, which runs word2vec on the Google News dataset, of about 100 billion words.

Full word2vec API docs here; get gensim here. Original C toolkit and word2vec papers by Google here.