Benchmark: Implement Levenshtein term similarity matrix and fast SCM between corpora (RaRe-Technologies/gensim PR #2016)



In [1]:

    
!git rev-parse HEAD









    



d429fedf094e00c4bb5c27589d5befb53b2e4b13



In [2]:

    
from copy import deepcopy
from datetime import timedelta
from itertools import product
import logging
from math import floor, ceil, log10
import pickle
from random import sample, seed, shuffle
from time import time

import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

def tqdm(iterable, total=None, desc=None):
    if total is None:
        total = len(iterable)
    for num_done, element in enumerate(tqdm_notebook(iterable, total=total)):
        logger.info("%s: %d / %d", desc, num_done, total)
        yield element

from gensim.corpora import Dictionary
import gensim.downloader as api
from gensim.similarities.index import AnnoyIndexer
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import UniformTermSimilarityIndex
from gensim.similarities import LevenshteinSimilarityIndex
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.utils import simple_preprocess

RANDOM_SEED = 12345

logger = logging.getLogger()
fhandler = logging.FileHandler(filename='matrix_speed.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)

pd.set_option('display.max_rows', None, 'display.max_seq_items', None)



In [3]:

    
"""Repeatedly run a benchmark callable given various configurations and
get a list of results.

Return a list of results of repeatedly running a benchmark callable.

Parameters
----------
benchmark : callable tuple -> dict
    A benchmark callable that accepts a configuration and returns results.
configurations : iterable of tuple
    An iterable of configurations that are used for calling the benchmark function.
results_filename : str
    A filename of a file that will be used to persistently store the results using
    pickle. If the file exists, then the function will load the stored results
    instead of calling the benchmark callable.

Returns
-------
iterable of tuple
    The return values of the individual invocations of the benchmark callable.

"""
def benchmark_results(benchmark, configurations, results_filename):
    try:
        with open(results_filename, "rb") as file:
            results = pickle.load(file)
    except IOError:
        configurations = list(configurations)
        shuffle(configurations)
        results = list(tqdm(
            (benchmark(configuration) for configuration in configurations),
            total=len(configurations), desc="benchmark"))
        with open(results_filename, "wb") as file:
            pickle.dump(results, file)
    return results

Implement Levenshtein term similarity matrix

In Gensim PR #1827, we added a base implementation of the soft cosine measure (SCM). The base implementation would create term similarity matrices using a single complex procedure. In the Gensim PR #2016, we split the procedure into:

TermSimilarityIndex builder classes that produce the $k$ most similar terms for a given term $t$ that are distinct from $t$ along with the term similarities, and
the SparseTermSimilarityMatrix director class that constructs term similarity matrices and consumes term similarities produced by TermSimilarityIndex instances.

One of the benefits of this separation is that we can easily measure the speed at which a TermSimilarityIndex builder class produces term similarities and compare this speed with the speed at which the SparseTermSimilarityMatrix director class consumes term similarities. This allows us to see which of the classes are a bottleneck that slows down the construction of term similarity matrices.

In this notebook, we measure all the currently available builder and director classes. For the measurements, we use the Google News word embeddings distributed with the C implementation of Word2Vec. From the word embeddings, we will derive a dictionary of 2.01M terms.



In [4]:

    
full_model = api.load("word2vec-google-news-300")

try:
    full_dictionary = Dictionary.load("matrix_speed.dictionary")
except IOError:
    full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])
    full_dictionary.save("matrix_speed.dictionary")

Director class benchmark

SparseTermSimilarityMatrix

First, we measure the speed at which the SparseTermSimilarityMatrix director class consumes term similarities.



In [5]:

    
def benchmark(configuration):
    dictionary, nonzero_limit, symmetric, positive_definite, repetition = configuration
    index = UniformTermSimilarityIndex(dictionary)
    
    start_time = time()
    matrix = SparseTermSimilarityMatrix(
        index, dictionary, nonzero_limit=nonzero_limit, symmetric=symmetric,
        positive_definite=positive_definite, dtype=np.float16).matrix
    end_time = time()
    
    duration = end_time - start_time
    return {
        "dictionary_size": len(dictionary),
        "nonzero_limit": nonzero_limit,
        "matrix_nonzero": matrix.nnz,
        "repetition": repetition,
        "symmetric": symmetric,
        "positive_definite": positive_definite,
        "duration": duration, }



In [6]:

    
dictionary_sizes = [10**k for k in range(3, int(ceil(log10(len(full_dictionary)))))]
seed(RANDOM_SEED)
dictionaries = []
for size in tqdm(dictionary_sizes, desc="dictionaries"):
    dictionary = Dictionary([sample(list(full_dictionary.values()), size)])
    dictionaries.append(dictionary)
dictionaries.append(full_dictionary)
nonzero_limits = [1, 10, 100]
symmetry = (True, False)
positive_definiteness = (True, False)
repetitions = range(10)

configurations = product(dictionaries, nonzero_limits, symmetry, positive_definiteness, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.director_results")

The following tables show how long it takes to construct a term similarity matrix (the duration column), how many nonzero elements there are in the matrix (the matrix_nonzero column) and the mean term similarity consumption speed (the consumption_speed column) as we vary the dictionary size (the dictionary_size column) the maximum number of nonzero elements outside the diagonal in every column of the matrix (the nonzero_limit column), the matrix symmetry constraint (the symmetric column), and the matrix positive definiteness constraing (the positive_definite column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

We can see that the symmetry and positive definiteness constraints severely limit the number of nonzero elements in the resulting matrix. This in turn increases the consumption speed, since we end up throwing away most of the elements that we consume. The effects of the dictionary size on the mean term similarity consumption speed are minor to none.



In [7]:

    
df = pd.DataFrame(results)
df["consumption_speed"] = df.dictionary_size * df.nonzero_limit / df.duration
df = df.groupby(["dictionary_size", "nonzero_limit", "symmetric", "positive_definite"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["matrix_nonzero"] = [int(nonzero) for nonzero in df["matrix_nonzero"]]
    df["consumption_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["consumption_speed"]]
    return df



In [8]:

    
display(df.mean()).loc[
    [10000, len(full_dictionary)], :, :].loc[
    :, ["duration", "matrix_nonzero", "consumption_speed"]]









    Out[8]:







  
    
      
      
      
      
      duration
      matrix_nonzero
      consumption_speed
    
    
      dictionary_size
      nonzero_limit
      symmetric
      positive_definite
      
      
      
    
  
  
    
      10000
      1
      False
      False
      00:00:00.435533
      20000
      22.96 Kword pairs / s
    
    
      True
      00:00:00.492606
      20000
      20.30 Kword pairs / s
    
    
      True
      False
      00:00:00.185563
      10002
      53.90 Kword pairs / s
    
    
      True
      00:00:00.240471
      10002
      41.59 Kword pairs / s
    
    
      10
      False
      False
      00:00:02.687836
      110000
      37.21 Kword pairs / s
    
    
      True
      00:00:00.615492
      20000
      162.49 Kword pairs / s
    
    
      True
      False
      00:00:00.501188
      10118
      199.53 Kword pairs / s
    
    
      True
      00:00:01.380586
      10010
      72.44 Kword pairs / s
    
    
      100
      False
      False
      00:00:25.262807
      1010000
      39.58 Kword pairs / s
    
    
      True
      00:00:01.132524
      20000
      883.02 Kword pairs / s
    
    
      True
      False
      00:00:03.595666
      20198
      278.13 Kword pairs / s
    
    
      True
      00:00:11.818912
      10100
      84.61 Kword pairs / s
    
    
      2010000
      1
      False
      False
      00:01:31.786585
      4020000
      21.90 Kword pairs / s
    
    
      True
      00:01:40.954580
      4020000
      19.91 Kword pairs / s
    
    
      True
      False
      00:00:39.050064
      2010002
      51.48 Kword pairs / s
    
    
      True
      00:00:49.238437
      2010002
      40.82 Kword pairs / s
    
    
      10
      False
      False
      00:09:35.470373
      22110000
      34.93 Kword pairs / s
    
    
      True
      00:02:02.920334
      4020000
      163.52 Kword pairs / s
    
    
      True
      False
      00:01:39.576693
      2010118
      201.88 Kword pairs / s
    
    
      True
      00:04:35.646501
      2010010
      72.92 Kword pairs / s
    
    
      100
      False
      False
      01:42:01.747568
      203010000
      32.88 Kword pairs / s
    
    
      True
      00:03:36.420778
      4020000
      928.75 Kword pairs / s
    
    
      True
      False
      00:10:58.434060
      2020198
      305.30 Kword pairs / s
    
    
      True
      00:39:40.319479
      2010100
      84.44 Kword pairs / s



In [9]:

    
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [10000, len(full_dictionary)], :, :].loc[
    :, ["duration", "matrix_nonzero", "consumption_speed"]]









    Out[9]:







  
    
      
      
      
      
      duration
      matrix_nonzero
      consumption_speed
    
    
      dictionary_size
      nonzero_limit
      symmetric
      positive_definite
      
      
      
    
  
  
    
      10000
      1
      False
      False
      00:00:00.005334
      0
      0.28 Kword pairs / s
    
    
      True
      00:00:00.004072
      0
      0.17 Kword pairs / s
    
    
      True
      False
      00:00:00.003124
      0
      0.90 Kword pairs / s
    
    
      True
      00:00:00.001797
      0
      0.31 Kword pairs / s
    
    
      10
      False
      False
      00:00:00.011986
      0
      0.17 Kword pairs / s
    
    
      True
      00:00:00.005972
      0
      1.59 Kword pairs / s
    
    
      True
      False
      00:00:00.002869
      0
      1.15 Kword pairs / s
    
    
      True
      00:00:00.011411
      0
      0.60 Kword pairs / s
    
    
      100
      False
      False
      00:00:00.111118
      0
      0.17 Kword pairs / s
    
    
      True
      00:00:00.007611
      0
      5.94 Kword pairs / s
    
    
      True
      False
      00:00:00.030875
      0
      2.38 Kword pairs / s
    
    
      True
      00:00:00.050198
      0
      0.36 Kword pairs / s
    
    
      2010000
      1
      False
      False
      00:00:00.767305
      0
      0.18 Kword pairs / s
    
    
      True
      00:00:00.172432
      0
      0.03 Kword pairs / s
    
    
      True
      False
      00:00:00.346239
      0
      0.46 Kword pairs / s
    
    
      True
      00:00:00.177075
      0
      0.15 Kword pairs / s
    
    
      10
      False
      False
      00:00:05.156655
      0
      0.31 Kword pairs / s
    
    
      True
      00:00:00.631676
      0
      0.83 Kword pairs / s
    
    
      True
      False
      00:00:01.216067
      0
      2.41 Kword pairs / s
    
    
      True
      00:00:00.547773
      0
      0.14 Kword pairs / s
    
    
      100
      False
      False
      00:04:10.371035
      0
      1.24 Kword pairs / s
    
    
      True
      00:00:00.634416
      0
      2.73 Kword pairs / s
    
    
      True
      False
      00:00:06.586767
      0
      3.05 Kword pairs / s
    
    
      True
      00:00:09.030932
      0
      0.32 Kword pairs / s

Builder class benchmark

UniformTermSimilarityIndex

First, we measure the speed at which the UniformTermSimilarityIndex builder class produces term similarities. UniformTermSimilarityIndex is a dummy class that just generates a sequence of constants. It produces much more term similarities per second than the SparseTermSimilarityMatrix is capable of consuming and its results will serve as an upper limit.



In [10]:

    
def benchmark(configuration):
    dictionary, nonzero_limit, repetition = configuration
    
    start_time = time()
    index = UniformTermSimilarityIndex(dictionary)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in dictionary.values():
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "nonzero_limit": nonzero_limit,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }



In [11]:

    
nonzero_limits = [1, 10, 100, 1000]

configurations = product(dictionaries, nonzero_limits, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.uniform")

The following tables show how long it takes to retrieve the most similar terms for all terms in a dictionary (the production_duration column) and the mean term similarity production speed (the production_speed column) as we vary the dictionary size (the dictionary_size column), and the maximum number of most similar terms that will be retrieved (the nonzero_limit column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

The production_speed is proportional to nonzero_limit.



In [12]:

    
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size ** 2 / df.production_duration
df["production_speed"] = df.dictionary_size * df.nonzero_limit / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["production_speed"]]
    return df



In [13]:

    
display(df.mean()).loc[
    [1000, len(full_dictionary)], :, :].loc[
    :, ["production_duration", "production_speed"]]









    Out[13]:







  
    
      
      
      production_duration
      production_speed
    
    
      dictionary_size
      nonzero_limit
      
      
    
  
  
    
      1000
      1
      00:00:00.002973
      336.41 Kword pairs / s
    
    
      10
      00:00:00.005372
      1861.64 Kword pairs / s
    
    
      100
      00:00:00.026752
      3738.79 Kword pairs / s
    
    
      1000
      00:00:00.290265
      3449.16 Kword pairs / s
    
    
      2010000
      1
      00:00:06.318446
      318.12 Kword pairs / s
    
    
      10
      00:00:10.783611
      1863.96 Kword pairs / s
    
    
      100
      00:00:53.108644
      3785.04 Kword pairs / s
    
    
      1000
      00:09:45.103741
      3437.36 Kword pairs / s



In [14]:

    
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, len(full_dictionary)], :, :].loc[
    :, ["production_duration", "production_speed"]]









    Out[14]:







  
    
      
      
      production_duration
      production_speed
    
    
      dictionary_size
      nonzero_limit
      
      
    
  
  
    
      1000
      1
      00:00:00.000017
      1.93 Kword pairs / s
    
    
      10
      00:00:00.000062
      21.50 Kword pairs / s
    
    
      100
      00:00:00.000408
      56.66 Kword pairs / s
    
    
      1000
      00:00:00.010500
      123.82 Kword pairs / s
    
    
      2010000
      1
      00:00:00.023495
      1.18 Kword pairs / s
    
    
      10
      00:00:00.035587
      6.16 Kword pairs / s
    
    
      100
      00:00:00.535765
      37.76 Kword pairs / s
    
    
      1000
      00:00:15.037816
      89.56 Kword pairs / s

LevenshteinSimilarityIndex

Next, we measure the speed at which the LevenshteinSimilarityIndex builder class produces term similarities. LevenshteinSimilarityIndex is currently just a naïve implementation that produces much fewer term similarities per second than the SparseTermSimilarityMatrix class is capable of consuming.



In [15]:

    
def benchmark(configuration):
    dictionary, nonzero_limit, query_terms, repetition = configuration
    
    start_time = time()
    index = LevenshteinSimilarityIndex(dictionary)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in query_terms:
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "mean_query_term_length": np.mean([len(term) for term in query_terms]),
        "nonzero_limit": nonzero_limit,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }



In [16]:

    
nonzero_limits = [1, 10, 100]
seed(RANDOM_SEED)
min_dictionary = sorted((len(dictionary), dictionary) for dictionary in dictionaries)[0][1]
query_terms = sample(list(min_dictionary.values()), 10)

configurations = product(dictionaries, nonzero_limits, [query_terms], repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.levenshtein")

The following tables show how long it takes to retrieve the most similar terms for ten randomly sampled terms from a dictionary (the production_duration column), the mean term similarity production speed (the production_speed column) and the mean term similarity processing speed (the processing_speed column) as we vary the dictionary size (the dictionary_size column), and the maximum number of most similar terms that will be retrieved (the nonzero_limit column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

The production_speed is proportional to nonzero_limit / dictionary_size. The processing_speed is constant.



In [17]:

    
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size * len(query_terms) / df.production_duration
df["production_speed"] = df.nonzero_limit * len(query_terms) / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f word pairs / s" % speed for speed in df["production_speed"]]
    return df



In [18]:

    
display(df.mean()).loc[
    [1000, 1000000, len(full_dictionary)], :].loc[
    :, ["production_duration", "production_speed", "processing_speed"]]









    Out[18]:







  
    
      
      
      production_duration
      production_speed
      processing_speed
    
    
      dictionary_size
      nonzero_limit
      
      
      
    
  
  
    
      1000
      1
      00:00:00.055994
      178.61 word pairs / s
      178.61 Kword pairs / s
    
    
      10
      00:00:00.056097
      1782.70 word pairs / s
      178.27 Kword pairs / s
    
    
      100
      00:00:00.056212
      17791.65 word pairs / s
      177.92 Kword pairs / s
    
    
      1000000
      1
      00:01:20.618070
      0.12 word pairs / s
      124.05 Kword pairs / s
    
    
      10
      00:01:20.048238
      1.25 word pairs / s
      124.92 Kword pairs / s
    
    
      100
      00:01:20.064999
      12.49 word pairs / s
      124.90 Kword pairs / s
    
    
      2010000
      1
      00:02:44.069399
      0.06 word pairs / s
      122.51 Kword pairs / s
    
    
      10
      00:02:43.914601
      0.61 word pairs / s
      122.63 Kword pairs / s
    
    
      100
      00:02:43.892408
      6.10 word pairs / s
      122.64 Kword pairs / s



In [19]:

    
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 1000000, len(full_dictionary)], :].loc[
    :, ["production_duration", "production_speed", "processing_speed"]]









    Out[19]:







  
    
      
      
      production_duration
      production_speed
      processing_speed
    
    
      dictionary_size
      nonzero_limit
      
      
      
    
  
  
    
      1000
      1
      00:00:00.000673
      2.16 word pairs / s
      2.16 Kword pairs / s
    
    
      10
      00:00:00.000409
      13.06 word pairs / s
      1.31 Kword pairs / s
    
    
      100
      00:00:00.000621
      196.80 word pairs / s
      1.97 Kword pairs / s
    
    
      1000000
      1
      00:00:00.810661
      0.00 word pairs / s
      1.23 Kword pairs / s
    
    
      10
      00:00:00.110013
      0.00 word pairs / s
      0.17 Kword pairs / s
    
    
      100
      00:00:00.164959
      0.03 word pairs / s
      0.26 Kword pairs / s
    
    
      2010000
      1
      00:00:01.159273
      0.00 word pairs / s
      0.85 Kword pairs / s
    
    
      10
      00:00:00.429011
      0.00 word pairs / s
      0.32 Kword pairs / s
    
    
      100
      00:00:00.433687
      0.02 word pairs / s
      0.32 Kword pairs / s

WordEmbeddingSimilarityIndex

Lastly, we measure the speed at which the WordEmbeddingSimilarityIndex builder class constructs an instance and produces term similarities. Gensim currently supports slow and precise nearest neighbor search, and also approximate nearest neighbor search using ANNOY. We evaluate both options.



In [20]:

    
def benchmark(configuration):
    (model, dictionary), nonzero_limit, annoy_n_trees, query_terms, repetition = configuration
    use_annoy = annoy_n_trees > 0
    model.init_sims()
    
    start_time = time()
    if use_annoy:
        annoy = AnnoyIndexer(model, annoy_n_trees)
        kwargs = {"indexer": annoy}
    else:
        kwargs = {}
    index = WordEmbeddingSimilarityIndex(model, kwargs=kwargs)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in query_terms:
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "mean_query_term_length": np.mean([len(term) for term in query_terms]),
        "nonzero_limit": nonzero_limit,
        "use_annoy": use_annoy,
        "annoy_n_trees": annoy_n_trees,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }



In [21]:

    
models = []
for dictionary in tqdm(dictionaries, desc="models"):
    if dictionary == full_dictionary:
        models.append(full_model)
        continue
    model = full_model.__class__(full_model.vector_size)
    model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}
    model.index2entity = []
    vector_indices = []
    for index, word in enumerate(full_model.index2entity):
        if word in model.vocab.keys():
            model.index2entity.append(word)
            model.vocab[word].index = len(vector_indices)
            vector_indices.append(index)
    model.vectors = full_model.vectors[vector_indices]
    models.append(model)
annoy_n_trees = [0] + [10**k for k in range(3)]
seed(RANDOM_SEED)
query_terms = sample(list(min_dictionary.values()), 1000)

configurations = product(zip(models, dictionaries), nonzero_limits, annoy_n_trees, [query_terms], repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.wordembeddings")

The following tables show how long it takes to construct an ANNOY index and the builder class instance (the constructor_duration column), how long it takes to retrieve the most similar terms for 1,000 randomly sampled terms from a dictionary (the production_duration column), the mean term similarity production speed (the production_speed column) and the mean term similarity processing speed (the processing_speed column) as we vary the dictionary size (the dictionary_size column), the maximum number of most similar terms that will be retrieved (the nonzero_limit column), and the number of constructed ANNOY trees (the annoy_n_trees column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

If we do not use ANNOY (annoy_n_trees${}=0$), then production_speed is proportional to nonzero_limit / dictionary_size. If we do use ANNOY (annoy_n_trees${}>0$), then production_speed is proportional to nonzero_limit / (annoy_n_trees)${}^{1/2}$.



In [22]:

    
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size * len(query_terms) / df.production_duration
df["production_speed"] = df.nonzero_limit * len(query_terms) / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit", "annoy_n_trees"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["production_speed"]]
    return df



In [23]:

    
display(df.mean()).loc[
    [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[
    :, ["constructor_duration", "production_duration", "production_speed", "processing_speed"]]









    Out[23]:







  
    
      
      
      
      constructor_duration
      production_duration
      production_speed
      processing_speed
    
    
      dictionary_size
      nonzero_limit
      annoy_n_trees
      
      
      
      
    
  
  
    
      1000000
      1
      0
      00:00:00.000007
      00:00:19.962977
      0.05 Kword pairs / s
      50094.22 Kword pairs / s
    
    
      1
      00:00:30.268797
      00:00:00.097011
      10.32 Kword pairs / s
      10320061.76 Kword pairs / s
    
    
      100
      00:06:23.415982
      00:00:00.160870
      6.24 Kword pairs / s
      6236688.27 Kword pairs / s
    
    
      100
      0
      00:00:00.000008
      00:00:22.868372
      4.37 Kword pairs / s
      43729.34 Kword pairs / s
    
    
      1
      00:00:31.154876
      00:00:00.156238
      641.91 Kword pairs / s
      6419086.99 Kword pairs / s
    
    
      100
      00:06:23.290572
      00:00:01.297445
      77.13 Kword pairs / s
      771277.71 Kword pairs / s
    
    
      2010000
      1
      0
      00:00:00.000007
      00:01:55.303216
      0.01 Kword pairs / s
      17432.79 Kword pairs / s
    
    
      1
      00:01:34.004196
      00:00:00.190463
      5.25 Kword pairs / s
      10561607.14 Kword pairs / s
    
    
      100
      00:23:29.796006
      00:00:00.339500
      2.96 Kword pairs / s
      5954865.50 Kword pairs / s
    
    
      100
      0
      00:00:00.000007
      00:02:11.926861
      0.76 Kword pairs / s
      15236.46 Kword pairs / s
    
    
      1
      00:01:35.813414
      00:00:00.301120
      332.38 Kword pairs / s
      6680879.02 Kword pairs / s
    
    
      100
      00:23:05.155399
      00:00:03.031527
      33.42 Kword pairs / s
      671683.05 Kword pairs / s



In [24]:

    
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[
    :, ["constructor_duration", "production_duration", "production_speed", "processing_speed"]]









    Out[24]:







  
    
      
      
      
      constructor_duration
      production_duration
      production_speed
      processing_speed
    
    
      dictionary_size
      nonzero_limit
      annoy_n_trees
      
      
      
      
    
  
  
    
      1000000
      1
      0
      00:00:00.000002
      00:00:00.115644
      0.00 Kword pairs / s
      286.27 Kword pairs / s
    
    
      1
      00:00:01.854097
      00:00:00.003517
      0.37 Kword pairs / s
      367959.55 Kword pairs / s
    
    
      100
      00:00:04.702035
      00:00:00.010444
      0.35 Kword pairs / s
      350506.05 Kword pairs / s
    
    
      100
      0
      00:00:00.000002
      00:00:00.104872
      0.02 Kword pairs / s
      198.86 Kword pairs / s
    
    
      1
      00:00:01.163678
      00:00:00.008939
      36.14 Kword pairs / s
      361441.71 Kword pairs / s
    
    
      100
      00:00:06.818568
      00:00:00.036979
      2.07 Kword pairs / s
      20741.69 Kword pairs / s
    
    
      2010000
      1
      0
      00:00:00.000001
      00:00:00.653177
      0.00 Kword pairs / s
      97.50 Kword pairs / s
    
    
      1
      00:00:04.677209
      00:00:00.005679
      0.16 Kword pairs / s
      311832.91 Kword pairs / s
    
    
      100
      00:01:38.562684
      00:00:00.029887
      0.22 Kword pairs / s
      434681.25 Kword pairs / s
    
    
      100
      0
      00:00:00.000001
      00:00:00.979613
      0.01 Kword pairs / s
      111.85 Kword pairs / s
    
    
      1
      00:00:03.207474
      00:00:00.009479
      10.18 Kword pairs / s
      204614.80 Kword pairs / s
    
    
      100
      00:00:55.119595
      00:00:00.419531
      3.46 Kword pairs / s
      69543.35 Kword pairs / s

Implement fast SCM between corpora

In Gensim PR #1827, we added a base implementation of the soft cosine measure (SCM). The base implementation would compute SCM between single documents using the softcossim function. In the Gensim PR #2016, we intruduced the SparseTermSimilarityMatrix.inner_product method, which computes SCM not only between single documents, but also between a document and a corpus, and between two corpora.

For the measurements, we use the Google News word embeddings distributed with the C implementation of Word2Vec. From the word embeddings, we will derive a dictionary of 2.01m terms. As a corpus, we will use a random sample of 100K articles from the 4.92m English Wikipedia articles.



In [25]:

    
full_model = api.load("word2vec-google-news-300")

try:
    with open("matrix_speed.corpus", "rb") as file:
        full_corpus = pickle.load(file)        
except IOError:
    original_corpus = list(tqdm(api.load("wiki-english-20171001"), desc="original_corpus", total=4924894))
    seed(RANDOM_SEED)
    full_corpus = [
        simple_preprocess(u'\n'.join(article["section_texts"]))
        for article in tqdm(sample(original_corpus, 10**5), desc="full_corpus", total=10**5)]
    del original_corpus
    with open("matrix_speed.corpus", "wb") as file:
        pickle.dump(full_corpus, file)

try:
    full_dictionary = Dictionary.load("matrix_speed.dictionary")
except IOError:
    full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])
    full_dictionary.save("matrix_speed.dictionary")

SCM between two documents

First, we measure the speed at which the inner_product method produces term similarities between single documents.



In [26]:

    
def benchmark(configuration):
    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration
    corpus_size = len(corpus)
    corpus = [dictionary.doc2bow(doc) for doc in corpus]
    corpus = [vec for vec in corpus if len(vec) > 0]
    
    start_time = time()
    for vec1 in corpus:
        for vec2 in corpus:
            matrix.inner_product(vec1, vec2, normalized=normalized)
    end_time = time()
    duration = end_time - start_time
    
    return {
        "dictionary_size": matrix.matrix.shape[0],
        "matrix_nonzero": matrix.matrix.nnz,
        "nonzero_limit": nonzero_limit,
        "normalized": normalized,
        "corpus_size": corpus_size,
        "corpus_actual_size": len(corpus),
        "corpus_nonzero": sum(len(vec) for vec in corpus),
        "mean_document_length": np.mean([len(doc) for doc in corpus]),
        "repetition": repetition,
        "duration": duration, }



In [27]:

    
seed(RANDOM_SEED)
dictionary_sizes = [1000, 100000]
dictionaries = []
for size in tqdm(dictionary_sizes, desc="dictionaries"):
    dictionary = Dictionary([sample(list(full_dictionary.values()), size)])
    dictionaries.append(dictionary)
min_dictionary = sorted((len(dictionary), dictionary) for dictionary in dictionaries)[0][1]

corpus_sizes = [100, 1000]
corpora = []
for size in tqdm(corpus_sizes, desc="corpora"):
    corpus = sample(full_corpus, size)
    corpora.append(corpus)

models = []
for dictionary in tqdm(dictionaries, desc="models"):
    if dictionary == full_dictionary:
        models.append(full_model)
        continue
    model = full_model.__class__(full_model.vector_size)
    model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}
    model.index2entity = []
    vector_indices = []
    for index, word in enumerate(full_model.index2entity):
        if word in model.vocab.keys():
            model.index2entity.append(word)
            model.vocab[word].index = len(vector_indices)
            vector_indices.append(index)
    model.vectors = full_model.vectors[vector_indices]
    models.append(model)

nonzero_limits = [1, 10, 100]
matrices = []
for (model, dictionary), nonzero_limit in tqdm(
        list(product(zip(models, dictionaries), nonzero_limits)), desc="matrices"):
    annoy = AnnoyIndexer(model, 1)
    index = WordEmbeddingSimilarityIndex(model, kwargs={"indexer": annoy})
    matrix = SparseTermSimilarityMatrix(index, dictionary, nonzero_limit=nonzero_limit)
    matrices.append((matrix, dictionary, nonzero_limit))
    del annoy

normalization = (True, False)
repetitions = range(10)









    





 
 










    









    





 
 










    









    





 
 










    









    





 
 










    



/mnt/storage/home/novotny/.virtualenvs/gensim/lib/python3.4/site-packages/gensim/matutils.py:738: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):



In [28]:

    
configurations = product(matrices, corpora, normalization, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.inner-product_results.doc_doc")

The following tables show how long it takes to compute the inner_product method between all document vectors in a corpus (the duration column), how many nonzero elements there are in a corpus matrix (the corpus_nonzero column), how many nonzero elements there are in a term similarity matrix (the matrix_nonzero column) and the mean document similarity production speed (the speed column) as we vary the dictionary size (the dictionary_size column), the size of the corpus (the corpus_size column), the maximum number of nonzero elements in a single column of the matrix (the nonzero_limit column), and the matrix symmetry constraint (the symmetric column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

The speed is proportional to the square of the number of unique terms shared by the two document vectors. In our scenario as well as the standard IR scenario, this means speed is constant. Computing a normalized inner product (normalized${}={}$True) results in a constant speed decrease.



In [29]:

    
df = pd.DataFrame(results)
df["speed"] = df.corpus_actual_size**2 / df.duration
del df["corpus_actual_size"]
df = df.groupby(["dictionary_size", "corpus_size", "nonzero_limit", "normalized"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["speed"] = ["%.02f Kdoc pairs / s" % (speed / 1000) for speed in df["speed"]]
    return df



In [30]:

    
display(df.mean()).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]









    Out[30]:







  
    
      
      
      
      
      duration
      corpus_nonzero
      matrix_nonzero
      speed
    
    
      dictionary_size
      corpus_size
      nonzero_limit
      normalized
      
      
      
      
    
  
  
    
      1000
      100
      1
      False
      00:00:00.007383
      3.0
      1000.0
      1.23 Kdoc pairs / s
    
    
      True
      00:00:00.009028
      3.0
      1000.0
      1.01 Kdoc pairs / s
    
    
      100
      False
      00:00:00.007657
      3.0
      84944.0
      1.19 Kdoc pairs / s
    
    
      True
      00:00:00.008238
      3.0
      84944.0
      1.10 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.414364
      26.0
      1000.0
      1.39 Kdoc pairs / s
    
    
      True
      00:00:00.473789
      26.0
      1000.0
      1.22 Kdoc pairs / s
    
    
      100
      False
      00:00:00.430833
      26.0
      84944.0
      1.35 Kdoc pairs / s
    
    
      True
      00:00:00.453477
      26.0
      84944.0
      1.27 Kdoc pairs / s
    
    
      100000
      100
      1
      False
      00:00:05.236376
      423.0
      101868.0
      1.29 Kdoc pairs / s
    
    
      True
      00:00:05.623463
      423.0
      101868.0
      1.20 Kdoc pairs / s
    
    
      100
      False
      00:00:05.083829
      423.0
      8202884.0
      1.33 Kdoc pairs / s
    
    
      True
      00:00:05.576003
      423.0
      8202884.0
      1.21 Kdoc pairs / s
    
    
      1000
      1
      False
      00:08:59.285347
      5162.0
      101868.0
      1.26 Kdoc pairs / s
    
    
      True
      00:09:57.693219
      5162.0
      101868.0
      1.14 Kdoc pairs / s
    
    
      100
      False
      00:09:23.213450
      5162.0
      8202884.0
      1.21 Kdoc pairs / s
    
    
      True
      00:10:10.612458
      5162.0
      8202884.0
      1.12 Kdoc pairs / s



In [31]:

    
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]









    Out[31]:







  
    
      
      
      
      
      duration
      corpus_nonzero
      matrix_nonzero
      speed
    
    
      dictionary_size
      corpus_size
      nonzero_limit
      normalized
      
      
      
      
    
  
  
    
      1000
      100
      1
      False
      00:00:00.000871
      0.0
      0.0
      0.13 Kdoc pairs / s
    
    
      True
      00:00:00.001315
      0.0
      0.0
      0.14 Kdoc pairs / s
    
    
      100
      False
      00:00:00.000893
      0.0
      0.0
      0.12 Kdoc pairs / s
    
    
      True
      00:00:00.000631
      0.0
      0.0
      0.08 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.014460
      0.0
      0.0
      0.05 Kdoc pairs / s
    
    
      True
      00:00:00.025250
      0.0
      0.0
      0.07 Kdoc pairs / s
    
    
      100
      False
      00:00:00.039088
      0.0
      0.0
      0.11 Kdoc pairs / s
    
    
      True
      00:00:00.023602
      0.0
      0.0
      0.06 Kdoc pairs / s
    
    
      100000
      100
      1
      False
      00:00:00.276359
      0.0
      0.0
      0.07 Kdoc pairs / s
    
    
      True
      00:00:00.278806
      0.0
      0.0
      0.06 Kdoc pairs / s
    
    
      100
      False
      00:00:00.286781
      0.0
      0.0
      0.07 Kdoc pairs / s
    
    
      True
      00:00:00.313397
      0.0
      0.0
      0.06 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:14.321101
      0.0
      0.0
      0.03 Kdoc pairs / s
    
    
      True
      00:00:23.526104
      0.0
      0.0
      0.05 Kdoc pairs / s
    
    
      100
      False
      00:00:05.899527
      0.0
      0.0
      0.01 Kdoc pairs / s
    
    
      True
      00:00:24.454422
      0.0
      0.0
      0.05 Kdoc pairs / s

SCM between a document and a corpus

Next, we measure the speed at which the inner_product method produces term similarities between documents and a corpus.



In [32]:

    
def benchmark(configuration):
    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration
    corpus_size = len(corpus)
    corpus = [dictionary.doc2bow(doc) for doc in corpus if doc]
    
    start_time = time()
    for vec in corpus:
        matrix.inner_product(vec, corpus, normalized=normalized)
    end_time = time()
    duration = end_time - start_time
    
    return {
        "dictionary_size": matrix.matrix.shape[0],
        "matrix_nonzero": matrix.matrix.nnz,
        "nonzero_limit": nonzero_limit,
        "normalized": normalized,
        "corpus_size": corpus_size,
        "corpus_actual_size": len(corpus),
        "corpus_nonzero": sum(len(vec) for vec in corpus),
        "mean_document_length": np.mean([len(doc) for doc in corpus]),
        "repetition": repetition,
        "duration": duration, }



In [33]:

    
configurations = product(matrices, corpora, normalization, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.inner-product_results.doc_corpus")

The speed is inversely proportional to matrix_nonzero. Computing a normalized inner product (normalized${}={}$True) results in a constant speed decrease.



In [34]:

    
df = pd.DataFrame(results)
df["speed"] = df.corpus_actual_size**2 / df.duration
del df["corpus_actual_size"]
df = df.groupby(["dictionary_size", "corpus_size", "nonzero_limit", "normalized"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["speed"] = ["%.02f Kdoc pairs / s" % (speed / 1000) for speed in df["speed"]]
    return df



In [35]:

    
display(df.mean()).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]









    Out[35]:







  
    
      
      
      
      
      duration
      corpus_nonzero
      matrix_nonzero
      speed
    
    
      dictionary_size
      corpus_size
      nonzero_limit
      normalized
      
      
      
      
    
  
  
    
      1000
      100
      1
      False
      00:00:00.009363
      3.0
      1000.0
      1117.12 Kdoc pairs / s
    
    
      True
      00:00:00.010948
      3.0
      1000.0
      954.13 Kdoc pairs / s
    
    
      100
      False
      00:00:00.014128
      3.0
      84944.0
      728.91 Kdoc pairs / s
    
    
      True
      00:00:00.018164
      3.0
      84944.0
      551.78 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.072091
      26.0
      1000.0
      13872.12 Kdoc pairs / s
    
    
      True
      00:00:00.079284
      26.0
      1000.0
      12615.36 Kdoc pairs / s
    
    
      100
      False
      00:00:00.162483
      26.0
      84944.0
      6188.43 Kdoc pairs / s
    
    
      True
      00:00:00.203081
      26.0
      84944.0
      4924.48 Kdoc pairs / s
    
    
      100000
      100
      1
      False
      00:00:00.278253
      423.0
      101868.0
      36.05 Kdoc pairs / s
    
    
      True
      00:00:00.298519
      423.0
      101868.0
      33.56 Kdoc pairs / s
    
    
      100
      False
      00:00:36.326167
      423.0
      8202884.0
      0.28 Kdoc pairs / s
    
    
      True
      00:00:36.928802
      423.0
      8202884.0
      0.27 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:07.403301
      5162.0
      101868.0
      135.08 Kdoc pairs / s
    
    
      True
      00:00:07.794943
      5162.0
      101868.0
      128.29 Kdoc pairs / s
    
    
      100
      False
      00:05:55.674712
      5162.0
      8202884.0
      2.81 Kdoc pairs / s
    
    
      True
      00:06:05.561398
      5162.0
      8202884.0
      2.74 Kdoc pairs / s



In [36]:

    
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]









    Out[36]:







  
    
      
      
      
      
      duration
      corpus_nonzero
      matrix_nonzero
      speed
    
    
      dictionary_size
      corpus_size
      nonzero_limit
      normalized
      
      
      
      
    
  
  
    
      1000
      100
      1
      False
      00:00:00.002120
      0.0
      0.0
      242.09 Kdoc pairs / s
    
    
      True
      00:00:00.002387
      0.0
      0.0
      207.64 Kdoc pairs / s
    
    
      100
      False
      00:00:00.002531
      0.0
      0.0
      130.94 Kdoc pairs / s
    
    
      True
      00:00:00.000911
      0.0
      0.0
      27.68 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.000587
      0.0
      0.0
      112.92 Kdoc pairs / s
    
    
      True
      00:00:00.001191
      0.0
      0.0
      187.31 Kdoc pairs / s
    
    
      100
      False
      00:00:00.011944
      0.0
      0.0
      513.79 Kdoc pairs / s
    
    
      True
      00:00:00.001793
      0.0
      0.0
      43.54 Kdoc pairs / s
    
    
      100000
      100
      1
      False
      00:00:00.016156
      0.0
      0.0
      2.06 Kdoc pairs / s
    
    
      True
      00:00:00.013451
      0.0
      0.0
      1.47 Kdoc pairs / s
    
    
      100
      False
      00:00:01.339787
      0.0
      0.0
      0.01 Kdoc pairs / s
    
    
      True
      00:00:01.617340
      0.0
      0.0
      0.01 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.038961
      0.0
      0.0
      0.71 Kdoc pairs / s
    
    
      True
      00:00:00.024154
      0.0
      0.0
      0.40 Kdoc pairs / s
    
    
      100
      False
      00:00:07.604805
      0.0
      0.0
      0.06 Kdoc pairs / s
    
    
      True
      00:00:14.799519
      0.0
      0.0
      0.10 Kdoc pairs / s

SCM between two corpora

Lastly, we measure the speed at which the inner_product method produces term similarities between entire corpora.



In [37]:

    
def benchmark(configuration):
    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration
    corpus_size = len(corpus)
    corpus = [dictionary.doc2bow(doc) for doc in corpus]
    corpus = [vec for vec in corpus if len(vec) > 0]
    
    start_time = time()
    matrix.inner_product(corpus, corpus, normalized=normalized)
    end_time = time()
    duration = end_time - start_time
    
    return {
        "dictionary_size": matrix.matrix.shape[0],
        "matrix_nonzero": matrix.matrix.nnz,
        "nonzero_limit": nonzero_limit,
        "normalized": normalized,
        "corpus_size": corpus_size,
        "corpus_actual_size": len(corpus),
        "corpus_nonzero": sum(len(vec) for vec in corpus),
        "mean_document_length": np.mean([len(doc) for doc in corpus]),
        "repetition": repetition,
        "duration": duration, }



In [38]:

    
nonzero_limits = [1000]
dense_matrices = []
for (model, dictionary), nonzero_limit in tqdm(
        list(product(zip(models, dictionaries), nonzero_limits)), desc="matrices"):
    annoy = AnnoyIndexer(model, 1)
    index = WordEmbeddingSimilarityIndex(model, kwargs={"indexer": annoy})
    matrix = SparseTermSimilarityMatrix(index, dictionary, nonzero_limit=nonzero_limit)
    matrices.append((matrix, dictionary, nonzero_limit))
    del annoy









    





 
 










    



/mnt/storage/home/novotny/.virtualenvs/gensim/lib/python3.4/site-packages/gensim/matutils.py:738: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):



In [39]:

    
configurations = product(matrices + dense_matrices, corpora + [full_corpus], normalization, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.inner-product_results.corpus_corpus")



In [40]:

    
df = pd.DataFrame(results)
df["speed"] = df.corpus_actual_size**2 / df.duration
del df["corpus_actual_size"]
df = df.groupby(["dictionary_size", "corpus_size", "nonzero_limit", "normalized"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["speed"] = ["%.02f Kdoc pairs / s" % (speed / 1000) for speed in df["speed"]]
    return df



In [41]:

    
display(df.mean()).loc[
    [1000, 100000], :, [1, 10, 100, 1000], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]









    Out[41]:







  
    
      
      
      
      
      duration
      corpus_nonzero
      matrix_nonzero
      speed
    
    
      dictionary_size
      corpus_size
      nonzero_limit
      normalized
      
      
      
      
    
  
  
    
      1000
      100
      1
      False
      00:00:00.001403
      3.0
      1000.0
      6.69 Kdoc pairs / s
    
    
      True
      00:00:00.005313
      3.0
      1000.0
      1.70 Kdoc pairs / s
    
    
      10
      False
      00:00:00.001565
      3.0
      8634.0
      5.80 Kdoc pairs / s
    
    
      True
      00:00:00.005307
      3.0
      8634.0
      1.70 Kdoc pairs / s
    
    
      100
      False
      00:00:00.003172
      3.0
      84944.0
      3.05 Kdoc pairs / s
    
    
      True
      00:00:00.008461
      3.0
      84944.0
      1.07 Kdoc pairs / s
    
    
      1000
      False
      00:00:00.021377
      3.0
      838588.0
      0.42 Kdoc pairs / s
    
    
      True
      00:00:00.055234
      3.0
      838588.0
      0.16 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.001376
      26.0
      1000.0
      418.61 Kdoc pairs / s
    
    
      True
      00:00:00.005019
      26.0
      1000.0
      114.78 Kdoc pairs / s
    
    
      10
      False
      00:00:00.001511
      26.0
      8634.0
      381.50 Kdoc pairs / s
    
    
      True
      00:00:00.005208
      26.0
      8634.0
      110.60 Kdoc pairs / s
    
    
      100
      False
      00:00:00.003539
      26.0
      84944.0
      164.03 Kdoc pairs / s
    
    
      True
      00:00:00.008502
      26.0
      84944.0
      67.81 Kdoc pairs / s
    
    
      1000
      False
      00:00:00.021548
      26.0
      838588.0
      26.73 Kdoc pairs / s
    
    
      True
      00:00:00.054425
      26.0
      838588.0
      10.59 Kdoc pairs / s
    
    
      100000
      1
      False
      00:00:00.019915
      2914.0
      1000.0
      391443.20 Kdoc pairs / s
    
    
      True
      00:00:00.026118
      2914.0
      1000.0
      298377.75 Kdoc pairs / s
    
    
      10
      False
      00:00:00.020152
      2914.0
      8634.0
      386722.55 Kdoc pairs / s
    
    
      True
      00:00:00.026998
      2914.0
      8634.0
      288567.14 Kdoc pairs / s
    
    
      100
      False
      00:00:00.028345
      2914.0
      84944.0
      274905.36 Kdoc pairs / s
    
    
      True
      00:00:00.041069
      2914.0
      84944.0
      189709.57 Kdoc pairs / s
    
    
      1000
      False
      00:00:00.089978
      2914.0
      838588.0
      86598.15 Kdoc pairs / s
    
    
      True
      00:00:00.185611
      2914.0
      838588.0
      41971.58 Kdoc pairs / s
    
    
      100000
      100
      1
      False
      00:00:00.003345
      423.0
      101868.0
      2013.92 Kdoc pairs / s
    
    
      True
      00:00:00.008857
      423.0
      101868.0
      760.13 Kdoc pairs / s
    
    
      10
      False
      00:00:00.032639
      423.0
      814154.0
      206.66 Kdoc pairs / s
    
    
      True
      00:00:00.080591
      423.0
      814154.0
      83.46 Kdoc pairs / s
    
    
      100
      False
      00:00:00.488467
      423.0
      8202884.0
      13.77 Kdoc pairs / s
    
    
      True
      00:00:01.454507
      423.0
      8202884.0
      4.62 Kdoc pairs / s
    
    
      1000
      False
      00:00:04.973667
      423.0
      89912542.0
      1.35 Kdoc pairs / s
    
    
      True
      00:00:15.035711
      423.0
      89912542.0
      0.45 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.010141
      5162.0
      101868.0
      67139.73 Kdoc pairs / s
    
    
      True
      00:00:00.016685
      5162.0
      101868.0
      40798.02 Kdoc pairs / s
    
    
      10
      False
      00:00:00.041392
      5162.0
      814154.0
      16444.18 Kdoc pairs / s
    
    
      True
      00:00:00.091686
      5162.0
      814154.0
      7425.08 Kdoc pairs / s
    
    
      100
      False
      00:00:00.508916
      5162.0
      8202884.0
      1338.94 Kdoc pairs / s
    
    
      True
      00:00:01.497556
      5162.0
      8202884.0
      454.49 Kdoc pairs / s
    
    
      1000
      False
      00:00:05.101489
      5162.0
      89912542.0
      133.44 Kdoc pairs / s
    
    
      True
      00:00:15.325415
      5162.0
      89912542.0
      44.42 Kdoc pairs / s
    
    
      100000
      1
      False
      00:00:37.145526
      525310.0
      101868.0
      192578.80 Kdoc pairs / s
    
    
      True
      00:00:45.729004
      525310.0
      101868.0
      156431.36 Kdoc pairs / s
    
    
      10
      False
      00:00:44.981806
      525310.0
      814154.0
      159029.88 Kdoc pairs / s
    
    
      True
      00:00:54.245450
      525310.0
      814154.0
      131871.88 Kdoc pairs / s
    
    
      100
      False
      00:01:15.925860
      525310.0
      8202884.0
      94216.21 Kdoc pairs / s
    
    
      True
      00:01:29.232076
      525310.0
      8202884.0
      80177.08 Kdoc pairs / s
    
    
      1000
      False
      00:03:17.140191
      525310.0
      89912542.0
      36286.25 Kdoc pairs / s
    
    
      True
      00:04:05.865666
      525310.0
      89912542.0
      29097.14 Kdoc pairs / s



In [42]:

    
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 100000], :, [1, 100], :].loc[
    :, ["duration", "corpus_nonzero", "matrix_nonzero", "speed"]]









    Out[42]:







  
    
      
      
      
      
      duration
      corpus_nonzero
      matrix_nonzero
      speed
    
    
      dictionary_size
      corpus_size
      nonzero_limit
      normalized
      
      
      
      
    
  
  
    
      1000
      100
      1
      False
      00:00:00.000292
      0.0
      0.0
      1.48 Kdoc pairs / s
    
    
      True
      00:00:00.000225
      0.0
      0.0
      0.08 Kdoc pairs / s
    
    
      100
      False
      00:00:00.000747
      0.0
      0.0
      1.02 Kdoc pairs / s
    
    
      True
      00:00:00.000488
      0.0
      0.0
      0.07 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.000027
      0.0
      0.0
      8.10 Kdoc pairs / s
    
    
      True
      00:00:00.000069
      0.0
      0.0
      1.56 Kdoc pairs / s
    
    
      100
      False
      00:00:00.000309
      0.0
      0.0
      16.26 Kdoc pairs / s
    
    
      True
      00:00:00.000268
      0.0
      0.0
      2.24 Kdoc pairs / s
    
    
      100000
      1
      False
      00:00:00.000576
      0.0
      0.0
      11256.03 Kdoc pairs / s
    
    
      True
      00:00:00.000574
      0.0
      0.0
      6512.19 Kdoc pairs / s
    
    
      100
      False
      00:00:00.000562
      0.0
      0.0
      5233.50 Kdoc pairs / s
    
    
      True
      00:00:00.000609
      0.0
      0.0
      2743.63 Kdoc pairs / s
    
    
      100000
      100
      1
      False
      00:00:00.000152
      0.0
      0.0
      98.97 Kdoc pairs / s
    
    
      True
      00:00:00.000322
      0.0
      0.0
      28.10 Kdoc pairs / s
    
    
      100
      False
      00:00:00.004997
      0.0
      0.0
      0.14 Kdoc pairs / s
    
    
      True
      00:00:00.022206
      0.0
      0.0
      0.07 Kdoc pairs / s
    
    
      1000
      1
      False
      00:00:00.000210
      0.0
      0.0
      1420.00 Kdoc pairs / s
    
    
      True
      00:00:00.000192
      0.0
      0.0
      467.23 Kdoc pairs / s
    
    
      100
      False
      00:00:00.019022
      0.0
      0.0
      45.91 Kdoc pairs / s
    
    
      True
      00:00:00.004431
      0.0
      0.0
      1.35 Kdoc pairs / s
    
    
      100000
      1
      False
      00:00:00.024466
      0.0
      0.0
      126.77 Kdoc pairs / s
    
    
      True
      00:00:00.062447
      0.0
      0.0
      213.64 Kdoc pairs / s
    
    
      100
      False
      00:00:00.087692
      0.0
      0.0
      108.55 Kdoc pairs / s
    
    
      True
      00:00:01.065889
      0.0
      0.0
      968.80 Kdoc pairs / s

				duration	matrix_nonzero	consumption_speed
dictionary_size	nonzero_limit	symmetric	positive_definite
10000	1	False	False	00:00:00.435533	20000	22.96 Kword pairs / s
		False	True	00:00:00.492606	20000	20.30 Kword pairs / s
		True	False	00:00:00.185563	10002	53.90 Kword pairs / s
		True	True	00:00:00.240471	10002	41.59 Kword pairs / s
	10	False	False	00:00:02.687836	110000	37.21 Kword pairs / s
		False	True	00:00:00.615492	20000	162.49 Kword pairs / s
		True	False	00:00:00.501188	10118	199.53 Kword pairs / s
		True	True	00:00:01.380586	10010	72.44 Kword pairs / s
	100	False	False	00:00:25.262807	1010000	39.58 Kword pairs / s
		False	True	00:00:01.132524	20000	883.02 Kword pairs / s
		True	False	00:00:03.595666	20198	278.13 Kword pairs / s
		True	True	00:00:11.818912	10100	84.61 Kword pairs / s
2010000	1	False	False	00:01:31.786585	4020000	21.90 Kword pairs / s
		False	True	00:01:40.954580	4020000	19.91 Kword pairs / s
		True	False	00:00:39.050064	2010002	51.48 Kword pairs / s
		True	True	00:00:49.238437	2010002	40.82 Kword pairs / s
	10	False	False	00:09:35.470373	22110000	34.93 Kword pairs / s
		False	True	00:02:02.920334	4020000	163.52 Kword pairs / s
		True	False	00:01:39.576693	2010118	201.88 Kword pairs / s
		True	True	00:04:35.646501	2010010	72.92 Kword pairs / s
	100	False	False	01:42:01.747568	203010000	32.88 Kword pairs / s
		False	True	00:03:36.420778	4020000	928.75 Kword pairs / s
		True	False	00:10:58.434060	2020198	305.30 Kword pairs / s
		True	True	00:39:40.319479	2010100	84.44 Kword pairs / s

		production_duration	production_speed
dictionary_size	nonzero_limit
1000	1	00:00:00.002973	336.41 Kword pairs / s
	10	00:00:00.005372	1861.64 Kword pairs / s
	100	00:00:00.026752	3738.79 Kword pairs / s
	1000	00:00:00.290265	3449.16 Kword pairs / s
2010000	1	00:00:06.318446	318.12 Kword pairs / s
	10	00:00:10.783611	1863.96 Kword pairs / s
	100	00:00:53.108644	3785.04 Kword pairs / s
	1000	00:09:45.103741	3437.36 Kword pairs / s

			constructor_duration	production_duration	production_speed	processing_speed
dictionary_size	nonzero_limit	annoy_n_trees
1000000	1	0	00:00:00.000007	00:00:19.962977	0.05 Kword pairs / s	50094.22 Kword pairs / s
		1	00:00:30.268797	00:00:00.097011	10.32 Kword pairs / s	10320061.76 Kword pairs / s
		100	00:06:23.415982	00:00:00.160870	6.24 Kword pairs / s	6236688.27 Kword pairs / s
	100	0	00:00:00.000008	00:00:22.868372	4.37 Kword pairs / s	43729.34 Kword pairs / s
		1	00:00:31.154876	00:00:00.156238	641.91 Kword pairs / s	6419086.99 Kword pairs / s
		100	00:06:23.290572	00:00:01.297445	77.13 Kword pairs / s	771277.71 Kword pairs / s
2010000	1	0	00:00:00.000007	00:01:55.303216	0.01 Kword pairs / s	17432.79 Kword pairs / s
		1	00:01:34.004196	00:00:00.190463	5.25 Kword pairs / s	10561607.14 Kword pairs / s
		100	00:23:29.796006	00:00:00.339500	2.96 Kword pairs / s	5954865.50 Kword pairs / s
	100	0	00:00:00.000007	00:02:11.926861	0.76 Kword pairs / s	15236.46 Kword pairs / s
		1	00:01:35.813414	00:00:00.301120	332.38 Kword pairs / s	6680879.02 Kword pairs / s
		100	00:23:05.155399	00:00:03.031527	33.42 Kword pairs / s	671683.05 Kword pairs / s

				duration	corpus_nonzero	matrix_nonzero	speed
dictionary_size	corpus_size	nonzero_limit	normalized
1000	100	1	False	00:00:00.007383	3.0	1000.0	1.23 Kdoc pairs / s
		1	True	00:00:00.009028	3.0	1000.0	1.01 Kdoc pairs / s
		100	False	00:00:00.007657	3.0	84944.0	1.19 Kdoc pairs / s
		100	True	00:00:00.008238	3.0	84944.0	1.10 Kdoc pairs / s
	1000	1	False	00:00:00.414364	26.0	1000.0	1.39 Kdoc pairs / s
		1	True	00:00:00.473789	26.0	1000.0	1.22 Kdoc pairs / s
		100	False	00:00:00.430833	26.0	84944.0	1.35 Kdoc pairs / s
		100	True	00:00:00.453477	26.0	84944.0	1.27 Kdoc pairs / s
100000	100	1	False	00:00:05.236376	423.0	101868.0	1.29 Kdoc pairs / s
		1	True	00:00:05.623463	423.0	101868.0	1.20 Kdoc pairs / s
		100	False	00:00:05.083829	423.0	8202884.0	1.33 Kdoc pairs / s
		100	True	00:00:05.576003	423.0	8202884.0	1.21 Kdoc pairs / s
	1000	1	False	00:08:59.285347	5162.0	101868.0	1.26 Kdoc pairs / s
		1	True	00:09:57.693219	5162.0	101868.0	1.14 Kdoc pairs / s
		100	False	00:09:23.213450	5162.0	8202884.0	1.21 Kdoc pairs / s
		100	True	00:10:10.612458	5162.0	8202884.0	1.12 Kdoc pairs / s