The original implementation of Word2Vec training in Gensim is already super fast (covered in this blog series; see also benchmarks against other implementations in TensorFlow, DL4J, and C) and flexible, allowing you to train on arbitrary Python streams. We had to jump through some serious hoops to make it so, avoiding the Global Interpreter Lock (the dreaded GIL, the main bottleneck for any serious high-performance computation in Python).
The end result worked great for modest machines (< 8 cores), but for higher-end servers, the GIL reared its ugly head again. Simply managing the input stream iterators and worker queues, which has to be done in Python while holding the GIL, was becoming the bottleneck. In short, the Python implementation didn't scale linearly with cores the way the original C implementation by Tomáš Mikolov did.
We decided to change that. After much experimentation and benchmarking, including some pretty outlandish ideas, we figured there's no way around the GIL limitations—not at the level of fine-tuned performance needed here. Remember, we're talking >500k words (training instances) per second, using highly optimized C code. Way past the naive "vectorize with NumPy arrays" territory.
So we decided to introduce a new code path that trades some flexibility for more performance. We call this code path file-based training, and it's enabled by passing a new corpus_file parameter to training. The existing sentences parameter (queue-based training) is still available, and you can continue using it without any change: there's full backward compatibility.
code path | input parameter | advantages | disadvantages
---|---|---|---
queue-based training (existing) | sentences (a Python iterable) | Input can be generated dynamically from any storage, or even on the fly. | Scaling plateaus after 8 cores.
file-based training (new) | corpus_file (a file on disk) | Scales linearly with the number of CPU cores. | Training corpus must be serialized to disk in a specific format.
When you specify corpus_file, the model will read and process different portions of the file with different workers. The entire bulk of the work is done outside of the GIL, using no Python structures at all. The workers update the same weight matrix, but otherwise there's no communication: each worker munches on its data portion completely independently. This is the same approach the original C tool uses.
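To make that splitting strategy concrete, here is a minimal pure-Python sketch (the worker_offsets helper and the toy file are hypothetical illustrations, not Gensim's actual C implementation): each worker gets a byte range whose internal boundaries are snapped forward to the next newline, so no sentence is ever split between workers.

```python
import os
import tempfile

def worker_offsets(path, num_workers):
    # Split the file into num_workers byte ranges, snapping each internal
    # boundary forward to the next newline so no line straddles two workers.
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, num_workers):
            f.seek(i * size // num_workers)
            f.readline()  # consume the remainder of the current line
            offsets.append(f.tell())
    offsets.append(size)
    return offsets

# Toy corpus in the one-sentence-per-line format.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first sentence here\nsecond one\nthird sentence\nfourth\n")
    path = tmp.name

print(worker_offsets(path, 2))  # [0, 31, 53] for this toy file
```

Each worker then simply reads and trains on its own `[start, end)` slice of the file, which is why no inter-worker coordination (and no GIL) is needed on the input side.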
Training with corpus_file yields a significant performance boost: for example, in the experiments below, training is 3.7x faster with 32 workers in comparison to training with the sentences argument. It even outperforms the original Word2Vec C tool in terms of words/sec processing speed on high-core machines.
The limitation of this approach is that the corpus_file argument accepts a path to your corpus file, which must be stored on disk in a specific format. The format is simply the well-known gensim.models.word2vec.LineSentence format: one sentence per line, with words separated by spaces.
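To see how simple the format really is, here is a tiny round-trip sketch using only plain Python (the file is a throwaway temp file for illustration); iterating gensim.models.word2vec.LineSentence over the same file would yield the same token lists.

```python
import tempfile

# One sentence per line, tokens separated by single spaces -- that's the
# entire LineSentence format.
sentences = [["human", "machine", "interface"], ["graph", "of", "trees"]]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf8") as f:
    for tokens in sentences:
        f.write(" ".join(tokens) + "\n")
    corpus_path = f.name

# Read it back: splitting each line on whitespace recovers the tokens.
with open(corpus_path, encoding="utf8") as f:
    parsed = [line.split() for line in f]

print(parsed == sentences)  # True: the round-trip preserves the token lists
```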
You only need to change the sentences=your_corpus argument to corpus_file=your_corpus_file in your Word2Vec.__init__, Word2Vec.build_vocab, and Word2Vec.train calls.

A short Word2Vec example:
In [1]:
import gensim
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.word2vec import Word2Vec
print(gensim.models.word2vec.CORPUSFILE_VERSION) # must be >= 0, i.e. optimized compiled version
corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")
model = Word2Vec(corpus_file="my_corpus.txt", iter=5, size=300, workers=14)
In [3]:
CORPUS_FILE = 'wiki-en-20171001.txt'
In [ ]:
import itertools
from gensim.parsing.preprocessing import preprocess_string
def processed_corpus():
raw_corpus = api.load('wiki-english-20171001')
for article in raw_corpus:
# concatenate all section titles and texts of each Wikipedia article into a single "sentence"
doc = '\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))
yield preprocess_string(doc)
# serialize the preprocessed corpus into a single file on disk, using memory-efficient streaming
save_as_line_sentence(processed_corpus(), CORPUS_FILE)
In [ ]:
from gensim.models.word2vec import LineSentence
import time
start_time = time.time()
model_sent = Word2Vec(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)
sent_time = time.time() - start_time
start_time = time.time()
model_corp_file = Word2Vec(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)
file_time = time.time() - start_time
In [5]:
print("Training model with `sentences` took {:.3f} seconds".format(sent_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(file_time))
Training with corpus_file took 3.7x less time!
Now, let's compare the accuracies:
In [6]:
from gensim.test.utils import datapath
In [7]:
model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f}%".format(100.0 * model_sent_accuracy))
model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f}%".format(100.0 * model_corp_file_accuracy))
The accuracies are approximately the same.
A short FastText example:
In [8]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.fasttext import FastText
corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")
model = FastText(corpus_file="my_corpus.txt", iter=5, size=300, workers=14)
In [ ]:
from gensim.models.word2vec import LineSentence
from gensim.models.fasttext import FastText
import time
start_time = time.time()
model_corp_file = FastText(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)
file_time = time.time() - start_time
start_time = time.time()
model_sent = FastText(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)
sent_time = time.time() - start_time
In [10]:
print("Training model with `sentences` took {:.3f} seconds".format(sent_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(file_time))
We see a 1.67x performance boost!
In [11]:
from gensim.test.utils import datapath
model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f}%".format(100.0 * model_sent_accuracy))
model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f}%".format(100.0 * model_corp_file_accuracy))
A short Doc2Vec example:
In [12]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.doc2vec import Doc2Vec
corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")
model = Doc2Vec(corpus_file="my_corpus.txt", epochs=5, vector_size=300, workers=14)
In [ ]:
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
import time
start_time = time.time()
model_corp_file = Doc2Vec(corpus_file=CORPUS_FILE, epochs=5, vector_size=300, workers=32)
file_time = time.time() - start_time
start_time = time.time()
model_sent = Doc2Vec(documents=TaggedLineDocument(CORPUS_FILE), epochs=5, vector_size=300, workers=32)
sent_time = time.time() - start_time
In [15]:
print("Training model with `sentences` took {:.3f} seconds".format(sent_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(file_time))
A 6.6x speedup!
In [16]:
from gensim.test.utils import datapath
model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f}%".format(100.0 * model_sent_accuracy))
model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f}%".format(100.0 * model_corp_file_accuracy))
If your training corpus already lives on disk, you lose nothing by switching to the new corpus_file training mode. Training will be much faster.
If your corpus is generated dynamically, you can either serialize it to disk first with gensim.utils.save_as_line_sentence (and then use the fast corpus_file mode), or, if that's not possible, continue using the existing sentences training mode.
This new code branch was created by @persiyanov as a Google Summer of Code 2018 project in the RARE Student Incubator.
Questions, comments? Use our Gensim mailing list and Twitter. Happy training!