*2Vec File-based Training: API Tutorial

This tutorial introduces a new file-based training mode for gensim.models.{Word2Vec, FastText, Doc2Vec} which leads to (much) faster training on machines with many cores. Below we demonstrate how to use this new mode, with Python examples.

In this tutorial, we will:

  1. Show how to use the new training mode with Word2Vec, FastText and Doc2Vec.
  2. Evaluate the performance of file-based training on the English Wikipedia and compare it to the existing queue-based training.
  3. Show that model quality (analogy accuracy on questions-words.txt) is almost the same for both modes.

Motivation

The original implementation of Word2Vec training in Gensim is already super fast (covered in this blog series, see also benchmarks against other implementations in Tensorflow, DL4J, and C) and flexible, allowing you to train on arbitrary Python streams. We had to jump through some serious hoops to make it so, avoiding the Global Interpreter Lock (the dreaded GIL, the main bottleneck for any serious high performance computation in Python).

The end result worked great for modest machines (< 8 cores), but on higher-end servers the GIL reared its ugly head again. Merely managing the input stream iterators and worker queues, which must be done in Python while holding the GIL, became the bottleneck. Simply put, the Python implementation didn't scale linearly with the number of cores, the way the original C implementation by Tomáš Mikolov did.

We decided to change that. After much experimentation and benchmarking, including some pretty hardcore outlandish ideas, we figured there's no way around the GIL limitations—not at the level of fine-tuned performance needed here. Remember, we're talking >500k words (training instances) per second, using highly optimized C code. Way past the naive "vectorize with NumPy arrays" territory.

So we decided to introduce a new code path that trades some flexibility for performance. We call this code path file-based training, and it's activated by passing a new corpus_file parameter to training. The existing sentences parameter (queue-based training) is still available, and you can continue using it without any change: there's full backward compatibility.

How it works

code path                       | input parameter             | advantages                                                          | disadvantages
queue-based training (existing) | sentences (Python iterable) | Input can be generated dynamically from any storage, or on-the-fly. | Scaling plateaus after 8 cores.
file-based training (new)       | corpus_file (file on disk)  | Scales linearly with CPU cores.                                     | Training corpus must be serialized to disk in a specific format.

When you specify corpus_file, the model reads and processes different portions of the file in different workers. The entire bulk of the work is done outside of the GIL, using no Python structures at all. The workers update the same weight matrix, but otherwise there's no communication: each worker munches on its data portion completely independently. This is the same approach the original C tool uses.
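To build some intuition for how the file is partitioned among workers, here is a pure-Python sketch (illustrative only; the function name is hypothetical, and gensim's actual partitioning happens in optimized compiled code): each worker gets a contiguous byte range, with interior boundaries advanced to the next newline so that no sentence is ever split between two workers.

```python
import os

def worker_byte_ranges(path, workers):
    """Split a text file into one contiguous (start, end) byte range per
    worker, advancing each interior boundary to the next newline so that
    no sentence is split between two workers."""
    size = os.path.getsize(path)
    # naive equal-sized boundaries, before newline alignment
    bounds = [size * i // workers for i in range(1, workers)]
    starts = [0]
    with open(path, "rb") as f:
        for b in bounds:
            f.seek(b)
            f.readline()  # skip to the end of the current line
            starts.append(f.tell())
    starts.append(size)
    return [(starts[i], starts[i + 1]) for i in range(workers)]
```

Each worker then scans only its own (start, end) slice of the file, which is why no inter-worker coordination (and no GIL) is needed during training.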

Training with corpus_file yields a significant performance boost: for example, in the experiments below, training is 3.7x faster with 32 workers compared to training with the sentences argument. It even outperforms the original Word2Vec C tool in terms of words/sec processing speed on high-core machines.

The limitation of this approach is that the corpus_file argument accepts a path to your corpus file, which must be stored on disk in a specific format. The format is simply the well-known gensim.models.word2vec.LineSentence format: one sentence per line, with words separated by spaces.
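The format is trivial to produce and parse by hand; the following minimal sketch writes and reads it with plain Python (gensim.utils.save_as_line_sentence does the equivalent, with memory-efficient streaming and unicode handling):

```python
# a corpus is any iterable of token lists
sentences = [["hello", "world"], ["gensim", "is", "fast"]]

# LineSentence format: one sentence per line, tokens separated by spaces
with open("tiny_corpus.txt", "w", encoding="utf-8") as fout:
    for tokens in sentences:
        fout.write(" ".join(tokens) + "\n")

# parsing it back recovers the original token lists
with open("tiny_corpus.txt", encoding="utf-8") as fin:
    restored = [line.split() for line in fin]

assert restored == sentences
```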

How to use it

You only need to:

  1. Save your corpus in the LineSentence format to disk (you may use gensim.utils.save_as_line_sentence(your_corpus, your_corpus_file) for convenience).
  2. Change the sentences=your_corpus argument to corpus_file=your_corpus_file in your Word2Vec.__init__, Word2Vec.build_vocab and Word2Vec.train calls.

A short Word2Vec example:


In [1]:
import gensim
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.word2vec import Word2Vec

print(gensim.models.word2vec.CORPUSFILE_VERSION)  # must be >= 0, i.e. optimized compiled version

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

model = Word2Vec(corpus_file="my_corpus.txt", iter=5, size=300, workers=14)


1

Let's prepare the full Wikipedia dataset as a training corpus

We load the Wikipedia dump from gensim-data, preprocess the text with Gensim functions, and finally save the processed corpus in LineSentence format.


In [3]:
CORPUS_FILE = 'wiki-en-20171001.txt'

In [ ]:
import itertools
from gensim.parsing.preprocessing import preprocess_string

def processed_corpus():
    raw_corpus = api.load('wiki-english-20171001')
    for article in raw_corpus:
        # concatenate all section titles and texts of each Wikipedia article into a single "sentence"
        doc = '\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))
        yield preprocess_string(doc)

# serialize the preprocessed corpus into a single file on disk, using memory-efficient streaming
save_as_line_sentence(processed_corpus(), CORPUS_FILE)

Word2Vec

We train two models:

  • With sentences argument
  • With corpus_file argument

Then, we compare the timings and the accuracy on questions-words.txt.


In [ ]:
from gensim.models.word2vec import LineSentence
import time

start_time = time.time()
model_sent = Word2Vec(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)
sent_time = time.time() - start_time

start_time = time.time()
model_corp_file = Word2Vec(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)
file_time = time.time() - start_time

In [5]:
print("Training model with `sentences` took {:.3f} seconds".format(sent_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(file_time))


Training model with `sentences` took 9494.237 seconds
Training model with `corpus_file` took 2566.170 seconds

Training with corpus_file was 3.7x faster!

Now, let's compare the accuracies:


In [6]:
from gensim.test.utils import datapath

In [7]:
model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f}%".format(100.0 * model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f}%".format(100.0 * model_corp_file_accuracy))


/home/persiyanov/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
Word analogy accuracy with `sentences`: 75.4%
Word analogy accuracy with `corpus_file`: 74.8%

The accuracies are approximately the same.

FastText

Short example:


In [8]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.fasttext import FastText

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

model = FastText(corpus_file="my_corpus.txt", iter=5, size=300, workers=14)

Let's compare the timings


In [ ]:
from gensim.models.word2vec import LineSentence
from gensim.models.fasttext import FastText
import time

start_time = time.time()
model_corp_file = FastText(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)
file_time = time.time() - start_time

start_time = time.time()
model_sent = FastText(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)
sent_time = time.time() - start_time

In [10]:
print("Training model with `sentences` took {:.3f} seconds".format(sent_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(file_time))


Training model with `sentences` took 17963.283 seconds
Training model with `corpus_file` took 10725.931 seconds

We see a 1.67x performance boost!

Now, accuracies:


In [11]:
from gensim.test.utils import datapath

model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f}%".format(100.0 * model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f}%".format(100.0 * model_corp_file_accuracy))


Word analogy accuracy with `sentences`: 64.2%
Word analogy accuracy with `corpus_file`: 66.2%

Doc2Vec

Short example:


In [12]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.doc2vec import Doc2Vec

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

model = Doc2Vec(corpus_file="my_corpus.txt", epochs=5, vector_size=300, workers=14)

Let's compare the timings


In [ ]:
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
import time

start_time = time.time()
model_corp_file = Doc2Vec(corpus_file=CORPUS_FILE, epochs=5, vector_size=300, workers=32)
file_time = time.time() - start_time

start_time = time.time()
model_sent = Doc2Vec(documents=TaggedLineDocument(CORPUS_FILE), epochs=5, vector_size=300, workers=32)
sent_time = time.time() - start_time

In [15]:
print("Training model with `sentences` took {:.3f} seconds".format(sent_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(file_time))


Training model with `sentences` took 20427.949 seconds
Training model with `corpus_file` took 3085.256 seconds

A 6.6x speedup!

Accuracies:


In [16]:
from gensim.test.utils import datapath

model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f}%".format(100.0 * model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f}%".format(100.0 * model_corp_file_accuracy))


Word analogy accuracy with `sentences`: 71.7%
Word analogy accuracy with `corpus_file`: 67.8%

TL;DR: Conclusion

If your training corpus already lives on disk, you lose nothing by switching to the new corpus_file training mode: training will be much faster.

If your corpus is generated dynamically, you can either serialize it to disk first with gensim.utils.save_as_line_sentence (and then use the fast corpus_file mode), or, if that's not possible, continue using the existing sentences training mode.


This new code branch was created by @persiyanov as a Google Summer of Code 2018 project in the RARE Student Incubator.

Questions, comments? Use our Gensim mailing list and Twitter. Happy training!