Benchmark: Implement Levenshtein term similarity matrix and fast SCM between corpora (RaRe-Technologies/gensim PR #2016)


In [1]:
!git rev-parse HEAD


d429fedf094e00c4bb5c27589d5befb53b2e4b13

In [2]:
from copy import deepcopy
from datetime import timedelta
from itertools import product
import logging
from math import floor, ceil, log10
import pickle
from random import sample, seed, shuffle
from time import time

import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

def tqdm(iterable, total=None, desc=None):
    if total is None:
        total = len(iterable)
    for num_done, element in enumerate(tqdm_notebook(iterable, total=total)):
        logger.info("%s: %d / %d", desc, num_done, total)
        yield element

from gensim.corpora import Dictionary
import gensim.downloader as api
from gensim.similarities.index import AnnoyIndexer
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import UniformTermSimilarityIndex
from gensim.similarities import LevenshteinSimilarityIndex
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.utils import simple_preprocess

RANDOM_SEED = 12345

logger = logging.getLogger()
fhandler = logging.FileHandler(filename='matrix_speed.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)

pd.set_option('display.max_rows', None, 'display.max_seq_items', None)

In [3]:
"""Repeatedly run a benchmark callable given various configurations and
get a list of results.

Return a list of results of repeatedly running a benchmark callable.

Parameters
----------
benchmark : callable tuple -> dict
    A benchmark callable that accepts a configuration and returns results.
configurations : iterable of tuple
    An iterable of configurations that are used for calling the benchmark function.
results_filename : str
    A filename of a file that will be used to persistently store the results using
    pickle. If the file exists, then the function will load the stored results
    instead of calling the benchmark callable.

Returns
-------
iterable of tuple
    The return values of the individual invocations of the benchmark callable.

"""
def benchmark_results(benchmark, configurations, results_filename):
    try:
        with open(results_filename, "rb") as file:
            results = pickle.load(file)
    except IOError:
        configurations = list(configurations)
        shuffle(configurations)
        results = list(tqdm(
            (benchmark(configuration) for configuration in configurations),
            total=len(configurations), desc="benchmark"))
        with open(results_filename, "wb") as file:
            pickle.dump(results, file)
    return results

Implement Levenshtein term similarity matrix

In Gensim PR #1827, we added a base implementation of the soft cosine measure (SCM). The base implementation would create term similarity matrices using a single complex procedure. In the Gensim PR #2016, we split the procedure into:

  • TermSimilarityIndex builder classes that produce the $k$ most similar terms for a given term $t$ that are distinct from $t$ along with the term similarities, and
  • the SparseTermSimilarityMatrix director class that constructs term similarity matrices and consumes term similarities produced by TermSimilarityIndex instances.

One of the benefits of this separation is that we can easily measure the speed at which a TermSimilarityIndex builder class produces term similarities and compare this speed with the speed at which the SparseTermSimilarityMatrix director class consumes term similarities. This allows us to see which of the classes are a bottleneck that slows down the construction of term similarity matrices.

In this notebook, we measure all the currently available builder and director classes. For the measurements, we use the Google News word embeddings distributed with the C implementation of Word2Vec. From the word embeddings, we will derive a dictionary of 2.01M terms.


In [4]:
full_model = api.load("word2vec-google-news-300")

try:
    full_dictionary = Dictionary.load("matrix_speed.dictionary")
except IOError:
    full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])
    full_dictionary.save("matrix_speed.dictionary")

Director class benchmark

SparseTermSimilarityMatrix

First, we measure the speed at which the SparseTermSimilarityMatrix director class consumes term similarities.


In [5]:
def benchmark(configuration):
    dictionary, nonzero_limit, symmetric, positive_definite, repetition = configuration
    index = UniformTermSimilarityIndex(dictionary)
    
    start_time = time()
    matrix = SparseTermSimilarityMatrix(
        index, dictionary, nonzero_limit=nonzero_limit, symmetric=symmetric,
        positive_definite=positive_definite, dtype=np.float16).matrix
    end_time = time()
    
    duration = end_time - start_time
    return {
        "dictionary_size": len(dictionary),
        "nonzero_limit": nonzero_limit,
        "matrix_nonzero": matrix.nnz,
        "repetition": repetition,
        "symmetric": symmetric,
        "positive_definite": positive_definite,
        "duration": duration, }

In [6]:
dictionary_sizes = [10**k for k in range(3, int(ceil(log10(len(full_dictionary)))))]
seed(RANDOM_SEED)
dictionaries = []
for size in tqdm(dictionary_sizes, desc="dictionaries"):
    dictionary = Dictionary([sample(list(full_dictionary.values()), size)])
    dictionaries.append(dictionary)
dictionaries.append(full_dictionary)
nonzero_limits = [1, 10, 100]
symmetry = (True, False)
positive_definiteness = (True, False)
repetitions = range(10)

configurations = product(dictionaries, nonzero_limits, symmetry, positive_definiteness, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.director_results")



The following tables show how long it takes to construct a term similarity matrix (the duration column), how many nonzero elements there are in the matrix (the matrix_nonzero column) and the mean term similarity consumption speed (the consumption_speed column) as we vary the dictionary size (the dictionary_size column) the maximum number of nonzero elements outside the diagonal in every column of the matrix (the nonzero_limit column), the matrix symmetry constraint (the symmetric column), and the matrix positive definiteness constraing (the positive_definite column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

We can see that the symmetry and positive definiteness constraints severely limit the number of nonzero elements in the resulting matrix. This in turn increases the consumption speed, since we end up throwing away most of the elements that we consume. The effects of the dictionary size on the mean term similarity consumption speed are minor to none.


In [7]:
df = pd.DataFrame(results)
df["consumption_speed"] = df.dictionary_size * df.nonzero_limit / df.duration
df = df.groupby(["dictionary_size", "nonzero_limit", "symmetric", "positive_definite"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["matrix_nonzero"] = [int(nonzero) for nonzero in df["matrix_nonzero"]]
    df["consumption_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["consumption_speed"]]
    return df

In [8]:
display(df.mean()).loc[
    [10000, len(full_dictionary)], :, :].loc[
    :, ["duration", "matrix_nonzero", "consumption_speed"]]


Out[8]:
duration matrix_nonzero consumption_speed
dictionary_size nonzero_limit symmetric positive_definite
10000 1 False False 00:00:00.435533 20000 22.96 Kword pairs / s
True 00:00:00.492606 20000 20.30 Kword pairs / s
True False 00:00:00.185563 10002 53.90 Kword pairs / s
True 00:00:00.240471 10002 41.59 Kword pairs / s
10 False False 00:00:02.687836 110000 37.21 Kword pairs / s
True 00:00:00.615492 20000 162.49 Kword pairs / s
True False 00:00:00.501188 10118 199.53 Kword pairs / s
True 00:00:01.380586 10010 72.44 Kword pairs / s
100 False False 00:00:25.262807 1010000 39.58 Kword pairs / s
True 00:00:01.132524 20000 883.02 Kword pairs / s
True False 00:00:03.595666 20198 278.13 Kword pairs / s
True 00:00:11.818912 10100 84.61 Kword pairs / s
2010000 1 False False 00:01:31.786585 4020000 21.90 Kword pairs / s
True 00:01:40.954580 4020000 19.91 Kword pairs / s
True False 00:00:39.050064 2010002 51.48 Kword pairs / s
True 00:00:49.238437 2010002 40.82 Kword pairs / s
10 False False 00:09:35.470373 22110000 34.93 Kword pairs / s
True 00:02:02.920334 4020000 163.52 Kword pairs / s
True False 00:01:39.576693 2010118 201.88 Kword pairs / s
True 00:04:35.646501 2010010 72.92 Kword pairs / s
100 False False 01:42:01.747568 203010000 32.88 Kword pairs / s
True 00:03:36.420778 4020000 928.75 Kword pairs / s
True False 00:10:58.434060 2020198 305.30 Kword pairs / s
True 00:39:40.319479 2010100 84.44 Kword pairs / s

In [9]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [10000, len(full_dictionary)], :, :].loc[
    :, ["duration", "matrix_nonzero", "consumption_speed"]]


Out[9]:
duration matrix_nonzero consumption_speed
dictionary_size nonzero_limit symmetric positive_definite
10000 1 False False 00:00:00.005334 0 0.28 Kword pairs / s
True 00:00:00.004072 0 0.17 Kword pairs / s
True False 00:00:00.003124 0 0.90 Kword pairs / s
True 00:00:00.001797 0 0.31 Kword pairs / s
10 False False 00:00:00.011986 0 0.17 Kword pairs / s
True 00:00:00.005972 0 1.59 Kword pairs / s
True False 00:00:00.002869 0 1.15 Kword pairs / s
True 00:00:00.011411 0 0.60 Kword pairs / s
100 False False 00:00:00.111118 0 0.17 Kword pairs / s
True 00:00:00.007611 0 5.94 Kword pairs / s
True False 00:00:00.030875 0 2.38 Kword pairs / s
True 00:00:00.050198 0 0.36 Kword pairs / s
2010000 1 False False 00:00:00.767305 0 0.18 Kword pairs / s
True 00:00:00.172432 0 0.03 Kword pairs / s
True False 00:00:00.346239 0 0.46 Kword pairs / s
True 00:00:00.177075 0 0.15 Kword pairs / s
10 False False 00:00:05.156655 0 0.31 Kword pairs / s
True 00:00:00.631676 0 0.83 Kword pairs / s
True False 00:00:01.216067 0 2.41 Kword pairs / s
True 00:00:00.547773 0 0.14 Kword pairs / s
100 False False 00:04:10.371035 0 1.24 Kword pairs / s
True 00:00:00.634416 0 2.73 Kword pairs / s
True False 00:00:06.586767 0 3.05 Kword pairs / s
True 00:00:09.030932 0 0.32 Kword pairs / s

Builder class benchmark

UniformTermSimilarityIndex

First, we measure the speed at which the UniformTermSimilarityIndex builder class produces term similarities. UniformTermSimilarityIndex is a dummy class that just generates a sequence of constants. It produces much more term similarities per second than the SparseTermSimilarityMatrix is capable of consuming and its results will serve as an upper limit.


In [10]:
def benchmark(configuration):
    dictionary, nonzero_limit, repetition = configuration
    
    start_time = time()
    index = UniformTermSimilarityIndex(dictionary)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in dictionary.values():
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "nonzero_limit": nonzero_limit,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }

In [11]:
nonzero_limits = [1, 10, 100, 1000]

configurations = product(dictionaries, nonzero_limits, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.uniform")

The following tables show how long it takes to retrieve the most similar terms for all terms in a dictionary (the production_duration column) and the mean term similarity production speed (the production_speed column) as we vary the dictionary size (the dictionary_size column), and the maximum number of most similar terms that will be retrieved (the nonzero_limit column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

The production_speed is proportional to nonzero_limit.


In [12]:
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size ** 2 / df.production_duration
df["production_speed"] = df.dictionary_size * df.nonzero_limit / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["production_speed"]]
    return df

In [13]:
display(df.mean()).loc[
    [1000, len(full_dictionary)], :, :].loc[
    :, ["production_duration", "production_speed"]]


Out[13]:
production_duration production_speed
dictionary_size nonzero_limit
1000 1 00:00:00.002973 336.41 Kword pairs / s
10 00:00:00.005372 1861.64 Kword pairs / s
100 00:00:00.026752 3738.79 Kword pairs / s
1000 00:00:00.290265 3449.16 Kword pairs / s
2010000 1 00:00:06.318446 318.12 Kword pairs / s
10 00:00:10.783611 1863.96 Kword pairs / s
100 00:00:53.108644 3785.04 Kword pairs / s
1000 00:09:45.103741 3437.36 Kword pairs / s

In [14]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, len(full_dictionary)], :, :].loc[
    :, ["production_duration", "production_speed"]]


Out[14]:
production_duration production_speed
dictionary_size nonzero_limit
1000 1 00:00:00.000017 1.93 Kword pairs / s
10 00:00:00.000062 21.50 Kword pairs / s
100 00:00:00.000408 56.66 Kword pairs / s
1000 00:00:00.010500 123.82 Kword pairs / s
2010000 1 00:00:00.023495 1.18 Kword pairs / s
10 00:00:00.035587 6.16 Kword pairs / s
100 00:00:00.535765 37.76 Kword pairs / s
1000 00:00:15.037816 89.56 Kword pairs / s

LevenshteinSimilarityIndex

Next, we measure the speed at which the LevenshteinSimilarityIndex builder class produces term similarities. LevenshteinSimilarityIndex is currently just a naïve implementation that produces much fewer term similarities per second than the SparseTermSimilarityMatrix class is capable of consuming.


In [15]:
def benchmark(configuration):
    dictionary, nonzero_limit, query_terms, repetition = configuration
    
    start_time = time()
    index = LevenshteinSimilarityIndex(dictionary)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in query_terms:
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "mean_query_term_length": np.mean([len(term) for term in query_terms]),
        "nonzero_limit": nonzero_limit,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }

In [16]:
nonzero_limits = [1, 10, 100]
seed(RANDOM_SEED)
min_dictionary = sorted((len(dictionary), dictionary) for dictionary in dictionaries)[0][1]
query_terms = sample(list(min_dictionary.values()), 10)

configurations = product(dictionaries, nonzero_limits, [query_terms], repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.levenshtein")

The following tables show how long it takes to retrieve the most similar terms for ten randomly sampled terms from a dictionary (the production_duration column), the mean term similarity production speed (the production_speed column) and the mean term similarity processing speed (the processing_speed column) as we vary the dictionary size (the dictionary_size column), and the maximum number of most similar terms that will be retrieved (the nonzero_limit column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

The production_speed is proportional to nonzero_limit / dictionary_size. The processing_speed is constant.


In [17]:
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size * len(query_terms) / df.production_duration
df["production_speed"] = df.nonzero_limit * len(query_terms) / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f word pairs / s" % speed for speed in df["production_speed"]]
    return df

In [18]:
display(df.mean()).loc[
    [1000, 1000000, len(full_dictionary)], :].loc[
    :, ["production_duration", "production_speed", "processing_speed"]]


Out[18]:
production_duration production_speed processing_speed
dictionary_size nonzero_limit
1000 1 00:00:00.055994 178.61 word pairs / s 178.61 Kword pairs / s
10 00:00:00.056097 1782.70 word pairs / s 178.27 Kword pairs / s
100 00:00:00.056212 17791.65 word pairs / s 177.92 Kword pairs / s
1000000 1 00:01:20.618070 0.12 word pairs / s 124.05 Kword pairs / s
10 00:01:20.048238 1.25 word pairs / s 124.92 Kword pairs / s
100 00:01:20.064999 12.49 word pairs / s 124.90 Kword pairs / s
2010000 1 00:02:44.069399 0.06 word pairs / s 122.51 Kword pairs / s
10 00:02:43.914601 0.61 word pairs / s 122.63 Kword pairs / s
100 00:02:43.892408 6.10 word pairs / s 122.64 Kword pairs / s

In [19]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 1000000, len(full_dictionary)], :].loc[
    :, ["production_duration", "production_speed", "processing_speed"]]


Out[19]:
production_duration production_speed processing_speed
dictionary_size nonzero_limit
1000 1 00:00:00.000673 2.16 word pairs / s 2.16 Kword pairs / s
10 00:00:00.000409 13.06 word pairs / s 1.31 Kword pairs / s
100 00:00:00.000621 196.80 word pairs / s 1.97 Kword pairs / s
1000000 1 00:00:00.810661 0.00 word pairs / s 1.23 Kword pairs / s
10 00:00:00.110013 0.00 word pairs / s 0.17 Kword pairs / s
100 00:00:00.164959 0.03 word pairs / s 0.26 Kword pairs / s
2010000 1 00:00:01.159273 0.00 word pairs / s 0.85 Kword pairs / s
10 00:00:00.429011 0.00 word pairs / s 0.32 Kword pairs / s
100 00:00:00.433687 0.02 word pairs / s 0.32 Kword pairs / s

WordEmbeddingSimilarityIndex

Lastly, we measure the speed at which the WordEmbeddingSimilarityIndex builder class constructs an instance and produces term similarities. Gensim currently supports slow and precise nearest neighbor search, and also approximate nearest neighbor search using ANNOY. We evaluate both options.


In [20]:
def benchmark(configuration):
    (model, dictionary), nonzero_limit, annoy_n_trees, query_terms, repetition = configuration
    use_annoy = annoy_n_trees > 0
    model.init_sims()
    
    start_time = time()
    if use_annoy:
        annoy = AnnoyIndexer(model, annoy_n_trees)
        kwargs = {"indexer": annoy}
    else:
        kwargs = {}
    index = WordEmbeddingSimilarityIndex(model, kwargs=kwargs)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in query_terms:
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "mean_query_term_length": np.mean([len(term) for term in query_terms]),
        "nonzero_limit": nonzero_limit,
        "use_annoy": use_annoy,
        "annoy_n_trees": annoy_n_trees,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }

In [21]:
models = []
for dictionary in tqdm(dictionaries, desc="models"):
    if dictionary == full_dictionary:
        models.append(full_model)
        continue
    model = full_model.__class__(full_model.vector_size)
    model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}
    model.index2entity = []
    vector_indices = []
    for index, word in enumerate(full_model.index2entity):
        if word in model.vocab.keys():
            model.index2entity.append(word)
            model.vocab[word].index = len(vector_indices)
            vector_indices.append(index)
    model.vectors = full_model.vectors[vector_indices]
    models.append(model)
annoy_n_trees = [0] + [10**k for k in range(3)]
seed(RANDOM_SEED)
query_terms = sample(list(min_dictionary.values()), 1000)

configurations = product(zip(models, dictionaries), nonzero_limits, annoy_n_trees, [query_terms], repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.wordembeddings")



The following tables show how long it takes to construct an ANNOY index and the builder class instance (the constructor_duration column), how long it takes to retrieve the most similar terms for 1,000 randomly sampled terms from a dictionary (the production_duration column), the mean term similarity production speed (the production_speed column) and the mean term similarity processing speed (the processing_speed column) as we vary the dictionary size (the dictionary_size column), the maximum number of most similar terms that will be retrieved (the nonzero_limit column), and the number of constructed ANNOY trees (the annoy_n_trees column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

If we do not use ANNOY (annoy_n_trees${}=0$), then production_speed is proportional to nonzero_limit / dictionary_size. If we do use ANNOY (annoy_n_trees${}>0$), then production_speed is proportional to nonzero_limit / (annoy_n_trees)${}^{1/2}$.


In [22]:
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size * len(query_terms) / df.production_duration
df["production_speed"] = df.nonzero_limit * len(query_terms) / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit", "annoy_n_trees"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["production_speed"]]
    return df

In [23]:
display(df.mean()).loc[
    [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[
    :, ["constructor_duration", "production_duration", "production_speed", "processing_speed"]]


Out[23]:
constructor_duration production_duration production_speed processing_speed
dictionary_size nonzero_limit annoy_n_trees
1000000 1 0 00:00:00.000007 00:00:19.962977 0.05 Kword pairs / s 50094.22 Kword pairs / s
1 00:00:30.268797 00:00:00.097011 10.32 Kword pairs / s 10320061.76 Kword pairs / s
100 00:06:23.415982 00:00:00.160870 6.24 Kword pairs / s 6236688.27 Kword pairs / s
100 0 00:00:00.000008 00:00:22.868372 4.37 Kword pairs / s 43729.34 Kword pairs / s
1 00:00:31.154876 00:00:00.156238 641.91 Kword pairs / s 6419086.99 Kword pairs / s
100 00:06:23.290572 00:00:01.297445 77.13 Kword pairs / s 771277.71 Kword pairs / s
2010000 1 0 00:00:00.000007 00:01:55.303216 0.01 Kword pairs / s 17432.79 Kword pairs / s
1 00:01:34.004196 00:00:00.190463 5.25 Kword pairs / s 10561607.14 Kword pairs / s
100 00:23:29.796006 00:00:00.339500 2.96 Kword pairs / s 5954865.50 Kword pairs / s
100 0 00:00:00.000007 00:02:11.926861 0.76 Kword pairs / s 15236.46 Kword pairs / s
1 00:01:35.813414 00:00:00.301120 332.38 Kword pairs / s 6680879.02 Kword pairs / s
100 00:23:05.155399 00:00:03.031527 33.42 Kword pairs / s 671683.05 Kword pairs / s

In [24]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[
    :, ["constructor_duration", "production_duration", "production_speed", "processing_speed"]]


Out[24]:
constructor_duration production_duration production_speed processing_speed
dictionary_size nonzero_limit annoy_n_trees
1000000 1 0 00:00:00.000002 00:00:00.115644 0.00 Kword pairs / s 286.27 Kword pairs / s
1 00:00:01.854097 00:00:00.003517 0.37 Kword pairs / s 367959.55 Kword pairs / s
100 00:00:04.702035 00:00:00.010444 0.35 Kword pairs / s 350506.05 Kword pairs / s
100 0 00:00:00.000002 00:00:00.104872 0.02 Kword pairs / s 198.86 Kword pairs / s
1 00:00:01.163678 00:00:00.008939 36.14 Kword pairs / s 361441.71 Kword pairs / s
100 00:00:06.818568 00:00:00.036979 2.07 Kword pairs / s 20741.69 Kword pairs / s
2010000 1 0 00:00:00.000001 00:00:00.653177 0.00 Kword pairs / s 97.50 Kword pairs / s
1 00:00:04.677209 00:00:00.005679 0.16 Kword pairs / s 311832.91 Kword pairs / s
100 00:01:38.562684 00:00:00.029887 0.22 Kword pairs / s 434681.25 Kword pairs / s
100 0 00:00:00.000001 00:00:00.979613 0.01 Kword pairs / s 111.85 Kword pairs / s
1 00:00:03.207474 00:00:00.009479 10.18 Kword pairs / s 204614.80 Kword pairs / s
100 00:00:55.119595 00:00:00.419531 3.46 Kword pairs / s 69543.35 Kword pairs / s

Implement fast SCM between corpora

In Gensim PR #1827, we added a base implementation of the soft cosine measure (SCM). The base implementation would compute SCM between single documents using the softcossim function. In the Gensim PR #2016, we intruduced the SparseTermSimilarityMatrix.inner_product method, which computes SCM not only between single documents, but also between a document and a corpus, and between two corpora.

For the measurements, we use the Google News word embeddings distributed with the C implementation of Word2Vec. From the word embeddings, we will derive a dictionary of 2.01m terms. As a corpus, we will use a random sample of 100K articles from the 4.92m English Wikipedia articles.


In [25]:
full_model = api.load("word2vec-google-news-300")

try:
    with open("matrix_speed.corpus", "rb") as file:
        full_corpus = pickle.load(file)        
except IOError:
    original_corpus = list(tqdm(api.load("wiki-english-20171001"), desc="original_corpus", total=4924894))
    seed(RANDOM_SEED)
    full_corpus = [
        simple_preprocess(u'\n'.join(article["section_texts"]))
        for article in tqdm(sample(original_corpus, 10**5), desc="full_corpus", total=10**5)]
    del original_corpus
    with open("matrix_speed.corpus", "wb") as file:
        pickle.dump(full_corpus, file)

try:
    full_dictionary = Dictionary.load("matrix_speed.dictionary")
except IOError:
    full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])
    full_dictionary.save("matrix_speed.dictionary")

SCM between two documents

First, we measure the speed at which the inner_product method produces term similarities between single documents.


In [26]:
def benchmark(configuration):
    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration
    corpus_size = len(corpus)
    corpus = [dictionary.doc2bow(doc) for doc in corpus]
    corpus = [vec for vec in corpus if len(vec) > 0]
    
    start_time = time()
    for vec1 in corpus:
        for vec2 in corpus:
            matrix.inner_product(vec1, vec2, normalized=normalized)
    end_time = time()
    duration = end_time - start_time
    
    return {
        "dictionary_size": matrix.matrix.shape[0],
        "matrix_nonzero": matrix.matrix.nnz,
        "nonzero_limit": nonzero_limit,
        "normalized": normalized,
        "corpus_size": corpus_size,
        "corpus_actual_size": len(corpus),
        "corpus_nonzero": sum(len(vec) for vec in corpus),
        "mean_document_length": np.mean([len(doc) for doc in corpus]),
        "repetition": repetition,
        "duration": duration, }

In [27]:
seed(RANDOM_SEED)
dictionary_sizes = [1000, 100000]
dictionaries = []
for size in tqdm(dictionary_sizes, desc="dictionaries"):
    dictionary = Dictionary([sample(list(full_dictionary.values()), size)])
    dictionaries.append(dictionary)
min_dictionary = sorted((len(dictionary), dictionary) for dictionary in dictionaries)[0][1]

corpus_sizes = [100, 1000]
corpora = []
for size in tqdm(corpus_sizes, desc="corpora"):
    corpus = sample(full_corpus, size)
    corpora.append(corpus)

models = []
for dictionary in tqdm(dictionaries, desc="models"):
    if dictionary == full_dictionary:
        models.append(full_model)
        continue
    model = full_model.__class__(full_model.vector_size)
    model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}
    model.index2entity = []
    vector_indices = []
    for index, word in enumerate(full_model.index2entity):
        if word in model.vocab.keys():
            model.index2entity.append(word)
            model.vocab[word].index = len(vector_indices)
            vector_indices.append(index)
    model.vectors = full_model.vectors[vector_indices]
    models.append(model)

nonzero_limits = [1, 10, 100]
matrices = []
for (model, dictionary), nonzero_limit in tqdm(
        list(product(zip(models, dictionaries), nonzero_limits)), desc="matrices"):
    annoy = AnnoyIndexer(model, 1)
    index = WordEmbeddingSimilarityIndex(model, kwargs={"indexer": annoy})
    matrix = SparseTermSimilarityMatrix(index, dictionary, nonzero_limit=nonzero_limit)
    matrices.append((matrix, dictionary, nonzero_limit))
    del annoy

normalization = (True, False)
repetitions = range(10)





/mnt/storage/home/novotny/.virtualenvs/gensim/lib/python3.4/site-packages/gensim/matutils.py:738: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):


In [28]:
configurations = product(matrices, corpora, normalization, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.inner-product_results.doc_doc")

The following tables show how long it takes to compute the inner_product method between all document vectors in a corpus (the duration column), how many nonzero elements there are in a corpus matrix (the corpus_nonzero column), how many nonzero elements there are in a term similarity matrix (the matrix_nonzero column) and the mean document similarity production speed (the speed column) as we vary the dictionary size (the dictionary_size column), the size of the corpus (the corpus_size column), the maximum number of nonzero elements in a single column of the matrix (the nonzero_limit column), and the matrix symmetry constraint (the symmetric column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

The speed is proportional to the square of the number of unique terms shared by the two document vectors. In our scenario as well as the standard IR scenario, this means speed is constant. Computing a normalized inner product (normalized${}={}$True) results in a constant speed decrease.


In [29]:
df = pd.DataFrame(results)
df["speed"] = df.corpus_actual_size**2 / df.duration
del df["corpus_actual_size"]
df = df.groupby(["dictionary_size", "corpus_size", "nonzero_limit", "normalized"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["speed"] = ["%.02f Kdoc pairs / s" % (speed / 1000) for speed in df["speed"]]
    return df

In [30]:
display(df.mean()).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]


Out[30]:
duration corpus_nonzero matrix_nonzero speed
dictionary_size corpus_size nonzero_limit normalized
1000 100 1 False 00:00:00.007383 3.0 1000.0 1.23 Kdoc pairs / s
True 00:00:00.009028 3.0 1000.0 1.01 Kdoc pairs / s
100 False 00:00:00.007657 3.0 84944.0 1.19 Kdoc pairs / s
True 00:00:00.008238 3.0 84944.0 1.10 Kdoc pairs / s
1000 1 False 00:00:00.414364 26.0 1000.0 1.39 Kdoc pairs / s
True 00:00:00.473789 26.0 1000.0 1.22 Kdoc pairs / s
100 False 00:00:00.430833 26.0 84944.0 1.35 Kdoc pairs / s
True 00:00:00.453477 26.0 84944.0 1.27 Kdoc pairs / s
100000 100 1 False 00:00:05.236376 423.0 101868.0 1.29 Kdoc pairs / s
True 00:00:05.623463 423.0 101868.0 1.20 Kdoc pairs / s
100 False 00:00:05.083829 423.0 8202884.0 1.33 Kdoc pairs / s
True 00:00:05.576003 423.0 8202884.0 1.21 Kdoc pairs / s
1000 1 False 00:08:59.285347 5162.0 101868.0 1.26 Kdoc pairs / s
True 00:09:57.693219 5162.0 101868.0 1.14 Kdoc pairs / s
100 False 00:09:23.213450 5162.0 8202884.0 1.21 Kdoc pairs / s
True 00:10:10.612458 5162.0 8202884.0 1.12 Kdoc pairs / s

In [31]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]


Out[31]:
duration corpus_nonzero matrix_nonzero speed
dictionary_size corpus_size nonzero_limit normalized
1000 100 1 False 00:00:00.000871 0.0 0.0 0.13 Kdoc pairs / s
True 00:00:00.001315 0.0 0.0 0.14 Kdoc pairs / s
100 False 00:00:00.000893 0.0 0.0 0.12 Kdoc pairs / s
True 00:00:00.000631 0.0 0.0 0.08 Kdoc pairs / s
1000 1 False 00:00:00.014460 0.0 0.0 0.05 Kdoc pairs / s
True 00:00:00.025250 0.0 0.0 0.07 Kdoc pairs / s
100 False 00:00:00.039088 0.0 0.0 0.11 Kdoc pairs / s
True 00:00:00.023602 0.0 0.0 0.06 Kdoc pairs / s
100000 100 1 False 00:00:00.276359 0.0 0.0 0.07 Kdoc pairs / s
True 00:00:00.278806 0.0 0.0 0.06 Kdoc pairs / s
100 False 00:00:00.286781 0.0 0.0 0.07 Kdoc pairs / s
True 00:00:00.313397 0.0 0.0 0.06 Kdoc pairs / s
1000 1 False 00:00:14.321101 0.0 0.0 0.03 Kdoc pairs / s
True 00:00:23.526104 0.0 0.0 0.05 Kdoc pairs / s
100 False 00:00:05.899527 0.0 0.0 0.01 Kdoc pairs / s
True 00:00:24.454422 0.0 0.0 0.05 Kdoc pairs / s

SCM between a document and a corpus

Next, we measure the speed at which the inner_product method produces term similarities between documents and a corpus.


In [32]:
def benchmark(configuration):
    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration
    corpus_size = len(corpus)
    corpus = [dictionary.doc2bow(doc) for doc in corpus if doc]
    
    start_time = time()
    for vec in corpus:
        matrix.inner_product(vec, corpus, normalized=normalized)
    end_time = time()
    duration = end_time - start_time
    
    return {
        "dictionary_size": matrix.matrix.shape[0],
        "matrix_nonzero": matrix.matrix.nnz,
        "nonzero_limit": nonzero_limit,
        "normalized": normalized,
        "corpus_size": corpus_size,
        "corpus_actual_size": len(corpus),
        "corpus_nonzero": sum(len(vec) for vec in corpus),
        "mean_document_length": np.mean([len(doc) for doc in corpus]),
        "repetition": repetition,
        "duration": duration, }

In [33]:
configurations = product(matrices, corpora, normalization, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.inner-product_results.doc_corpus")

The speed is inversely proportional to matrix_nonzero. Computing a normalized inner product (normalized${}={}$True) results in a constant speed decrease.


In [34]:
df = pd.DataFrame(results)
df["speed"] = df.corpus_actual_size**2 / df.duration
del df["corpus_actual_size"]
df = df.groupby(["dictionary_size", "corpus_size", "nonzero_limit", "normalized"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["speed"] = ["%.02f Kdoc pairs / s" % (speed / 1000) for speed in df["speed"]]
    return df

In [35]:
display(df.mean()).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]


Out[35]:
duration corpus_nonzero matrix_nonzero speed
dictionary_size corpus_size nonzero_limit normalized
1000 100 1 False 00:00:00.009363 3.0 1000.0 1117.12 Kdoc pairs / s
True 00:00:00.010948 3.0 1000.0 954.13 Kdoc pairs / s
100 False 00:00:00.014128 3.0 84944.0 728.91 Kdoc pairs / s
True 00:00:00.018164 3.0 84944.0 551.78 Kdoc pairs / s
1000 1 False 00:00:00.072091 26.0 1000.0 13872.12 Kdoc pairs / s
True 00:00:00.079284 26.0 1000.0 12615.36 Kdoc pairs / s
100 False 00:00:00.162483 26.0 84944.0 6188.43 Kdoc pairs / s
True 00:00:00.203081 26.0 84944.0 4924.48 Kdoc pairs / s
100000 100 1 False 00:00:00.278253 423.0 101868.0 36.05 Kdoc pairs / s
True 00:00:00.298519 423.0 101868.0 33.56 Kdoc pairs / s
100 False 00:00:36.326167 423.0 8202884.0 0.28 Kdoc pairs / s
True 00:00:36.928802 423.0 8202884.0 0.27 Kdoc pairs / s
1000 1 False 00:00:07.403301 5162.0 101868.0 135.08 Kdoc pairs / s
True 00:00:07.794943 5162.0 101868.0 128.29 Kdoc pairs / s
100 False 00:05:55.674712 5162.0 8202884.0 2.81 Kdoc pairs / s
True 00:06:05.561398 5162.0 8202884.0 2.74 Kdoc pairs / s

In [36]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]


Out[36]:
duration corpus_nonzero matrix_nonzero speed
dictionary_size corpus_size nonzero_limit normalized
1000 100 1 False 00:00:00.002120 0.0 0.0 242.09 Kdoc pairs / s
True 00:00:00.002387 0.0 0.0 207.64 Kdoc pairs / s
100 False 00:00:00.002531 0.0 0.0 130.94 Kdoc pairs / s
True 00:00:00.000911 0.0 0.0 27.68 Kdoc pairs / s
1000 1 False 00:00:00.000587 0.0 0.0 112.92 Kdoc pairs / s
True 00:00:00.001191 0.0 0.0 187.31 Kdoc pairs / s
100 False 00:00:00.011944 0.0 0.0 513.79 Kdoc pairs / s
True 00:00:00.001793 0.0 0.0 43.54 Kdoc pairs / s
100000 100 1 False 00:00:00.016156 0.0 0.0 2.06 Kdoc pairs / s
True 00:00:00.013451 0.0 0.0 1.47 Kdoc pairs / s
100 False 00:00:01.339787 0.0 0.0 0.01 Kdoc pairs / s
True 00:00:01.617340 0.0 0.0 0.01 Kdoc pairs / s
1000 1 False 00:00:00.038961 0.0 0.0 0.71 Kdoc pairs / s
True 00:00:00.024154 0.0 0.0 0.40 Kdoc pairs / s
100 False 00:00:07.604805 0.0 0.0 0.06 Kdoc pairs / s
True 00:00:14.799519 0.0 0.0 0.10 Kdoc pairs / s

SCM between two corpora

Lastly, we measure the speed at which the inner_product method produces term similarities between entire corpora.


In [37]:
def benchmark(configuration):
    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration
    corpus_size = len(corpus)
    corpus = [dictionary.doc2bow(doc) for doc in corpus]
    corpus = [vec for vec in corpus if len(vec) > 0]
    
    start_time = time()
    matrix.inner_product(corpus, corpus, normalized=normalized)
    end_time = time()
    duration = end_time - start_time
    
    return {
        "dictionary_size": matrix.matrix.shape[0],
        "matrix_nonzero": matrix.matrix.nnz,
        "nonzero_limit": nonzero_limit,
        "normalized": normalized,
        "corpus_size": corpus_size,
        "corpus_actual_size": len(corpus),
        "corpus_nonzero": sum(len(vec) for vec in corpus),
        "mean_document_length": np.mean([len(doc) for doc in corpus]),
        "repetition": repetition,
        "duration": duration, }

In [38]:
nonzero_limits = [1000]
dense_matrices = []
for (model, dictionary), nonzero_limit in tqdm(
        list(product(zip(models, dictionaries), nonzero_limits)), desc="matrices"):
    annoy = AnnoyIndexer(model, 1)
    index = WordEmbeddingSimilarityIndex(model, kwargs={"indexer": annoy})
    matrix = SparseTermSimilarityMatrix(index, dictionary, nonzero_limit=nonzero_limit)
    matrices.append((matrix, dictionary, nonzero_limit))
    del annoy


/mnt/storage/home/novotny/.virtualenvs/gensim/lib/python3.4/site-packages/gensim/matutils.py:738: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):


In [39]:
configurations = product(matrices + dense_matrices, corpora + [full_corpus], normalization, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.inner-product_results.corpus_corpus")

In [40]:
df = pd.DataFrame(results)
df["speed"] = df.corpus_actual_size**2 / df.duration
del df["corpus_actual_size"]
df = df.groupby(["dictionary_size", "corpus_size", "nonzero_limit", "normalized"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["speed"] = ["%.02f Kdoc pairs / s" % (speed / 1000) for speed in df["speed"]]
    return df

In [41]:
display(df.mean()).loc[
    [1000, 100000], :, [1, 10, 100, 1000], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]


Out[41]:
duration corpus_nonzero matrix_nonzero speed
dictionary_size corpus_size nonzero_limit normalized
1000 100 1 False 00:00:00.001403 3.0 1000.0 6.69 Kdoc pairs / s
True 00:00:00.005313 3.0 1000.0 1.70 Kdoc pairs / s
10 False 00:00:00.001565 3.0 8634.0 5.80 Kdoc pairs / s
True 00:00:00.005307 3.0 8634.0 1.70 Kdoc pairs / s
100 False 00:00:00.003172 3.0 84944.0 3.05 Kdoc pairs / s
True 00:00:00.008461 3.0 84944.0 1.07 Kdoc pairs / s
1000 False 00:00:00.021377 3.0 838588.0 0.42 Kdoc pairs / s
True 00:00:00.055234 3.0 838588.0 0.16 Kdoc pairs / s
1000 1 False 00:00:00.001376 26.0 1000.0 418.61 Kdoc pairs / s
True 00:00:00.005019 26.0 1000.0 114.78 Kdoc pairs / s
10 False 00:00:00.001511 26.0 8634.0 381.50 Kdoc pairs / s
True 00:00:00.005208 26.0 8634.0 110.60 Kdoc pairs / s
100 False 00:00:00.003539 26.0 84944.0 164.03 Kdoc pairs / s
True 00:00:00.008502 26.0 84944.0 67.81 Kdoc pairs / s
1000 False 00:00:00.021548 26.0 838588.0 26.73 Kdoc pairs / s
True 00:00:00.054425 26.0 838588.0 10.59 Kdoc pairs / s
100000 1 False 00:00:00.019915 2914.0 1000.0 391443.20 Kdoc pairs / s
True 00:00:00.026118 2914.0 1000.0 298377.75 Kdoc pairs / s
10 False 00:00:00.020152 2914.0 8634.0 386722.55 Kdoc pairs / s
True 00:00:00.026998 2914.0 8634.0 288567.14 Kdoc pairs / s
100 False 00:00:00.028345 2914.0 84944.0 274905.36 Kdoc pairs / s
True 00:00:00.041069 2914.0 84944.0 189709.57 Kdoc pairs / s
1000 False 00:00:00.089978 2914.0 838588.0 86598.15 Kdoc pairs / s
True 00:00:00.185611 2914.0 838588.0 41971.58 Kdoc pairs / s
100000 100 1 False 00:00:00.003345 423.0 101868.0 2013.92 Kdoc pairs / s
True 00:00:00.008857 423.0 101868.0 760.13 Kdoc pairs / s
10 False 00:00:00.032639 423.0 814154.0 206.66 Kdoc pairs / s
True 00:00:00.080591 423.0 814154.0 83.46 Kdoc pairs / s
100 False 00:00:00.488467 423.0 8202884.0 13.77 Kdoc pairs / s
True 00:00:01.454507 423.0 8202884.0 4.62 Kdoc pairs / s
1000 False 00:00:04.973667 423.0 89912542.0 1.35 Kdoc pairs / s
True 00:00:15.035711 423.0 89912542.0 0.45 Kdoc pairs / s
1000 1 False 00:00:00.010141 5162.0 101868.0 67139.73 Kdoc pairs / s
True 00:00:00.016685 5162.0 101868.0 40798.02 Kdoc pairs / s
10 False 00:00:00.041392 5162.0 814154.0 16444.18 Kdoc pairs / s
True 00:00:00.091686 5162.0 814154.0 7425.08 Kdoc pairs / s
100 False 00:00:00.508916 5162.0 8202884.0 1338.94 Kdoc pairs / s
True 00:00:01.497556 5162.0 8202884.0 454.49 Kdoc pairs / s
1000 False 00:00:05.101489 5162.0 89912542.0 133.44 Kdoc pairs / s
True 00:00:15.325415 5162.0 89912542.0 44.42 Kdoc pairs / s
100000 1 False 00:00:37.145526 525310.0 101868.0 192578.80 Kdoc pairs / s
True 00:00:45.729004 525310.0 101868.0 156431.36 Kdoc pairs / s
10 False 00:00:44.981806 525310.0 814154.0 159029.88 Kdoc pairs / s
True 00:00:54.245450 525310.0 814154.0 131871.88 Kdoc pairs / s
100 False 00:01:15.925860 525310.0 8202884.0 94216.21 Kdoc pairs / s
True 00:01:29.232076 525310.0 8202884.0 80177.08 Kdoc pairs / s
1000 False 00:03:17.140191 525310.0 89912542.0 36286.25 Kdoc pairs / s
True 00:04:05.865666 525310.0 89912542.0 29097.14 Kdoc pairs / s

In [42]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]


Out[42]:
duration corpus_nonzero matrix_nonzero speed
dictionary_size corpus_size nonzero_limit normalized
1000 100 1 False 00:00:00.000292 0.0 0.0 1.48 Kdoc pairs / s
True 00:00:00.000225 0.0 0.0 0.08 Kdoc pairs / s
100 False 00:00:00.000747 0.0 0.0 1.02 Kdoc pairs / s
True 00:00:00.000488 0.0 0.0 0.07 Kdoc pairs / s
1000 1 False 00:00:00.000027 0.0 0.0 8.10 Kdoc pairs / s
True 00:00:00.000069 0.0 0.0 1.56 Kdoc pairs / s
100 False 00:00:00.000309 0.0 0.0 16.26 Kdoc pairs / s
True 00:00:00.000268 0.0 0.0 2.24 Kdoc pairs / s
100000 1 False 00:00:00.000576 0.0 0.0 11256.03 Kdoc pairs / s
True 00:00:00.000574 0.0 0.0 6512.19 Kdoc pairs / s
100 False 00:00:00.000562 0.0 0.0 5233.50 Kdoc pairs / s
True 00:00:00.000609 0.0 0.0 2743.63 Kdoc pairs / s
100000 100 1 False 00:00:00.000152 0.0 0.0 98.97 Kdoc pairs / s
True 00:00:00.000322 0.0 0.0 28.10 Kdoc pairs / s
100 False 00:00:00.004997 0.0 0.0 0.14 Kdoc pairs / s
True 00:00:00.022206 0.0 0.0 0.07 Kdoc pairs / s
1000 1 False 00:00:00.000210 0.0 0.0 1420.00 Kdoc pairs / s
True 00:00:00.000192 0.0 0.0 467.23 Kdoc pairs / s
100 False 00:00:00.019022 0.0 0.0 45.91 Kdoc pairs / s
True 00:00:00.004431 0.0 0.0 1.35 Kdoc pairs / s
100000 1 False 00:00:00.024466 0.0 0.0 126.77 Kdoc pairs / s
True 00:00:00.062447 0.0 0.0 213.64 Kdoc pairs / s
100 False 00:00:00.087692 0.0 0.0 108.55 Kdoc pairs / s
True 00:00:01.065889 0.0 0.0 968.80 Kdoc pairs / s