Benchmarks for Exdir

This notebook contains a number of benchmarks for Exdir. They compare the performance of Exdir with h5py.

Warning: Please make sure the files are not created in a folder managed by Syncthing, Dropbox, or any other file synchronization system. We will be making a large number of changes to the files, and a file synchronization system will reduce performance and may fall out of sync in the process.

Note: You may experience unreliable results on some systems, with the numbers varying greatly between runs. This can be caused by the large number of I/O operations performed by the benchmarks. We have tried to improve reliability by adding a call to time.sleep between setting up a benchmark and running it, which should let the system flush the setup changes to disk before the timed run starts. If you still experience unreliable results, you may want to set up a RAM disk and change the paths below to /tmp/ramdisk/test.exdir and /tmp/ramdisk/test.h5:

mkdir /tmp/ramdisk/
sudo mount -t tmpfs -o size=2048M tmpfs /tmp/ramdisk/
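
For example, the testpath variables in the setup functions below would then point at the RAM disk instead (a minimal sketch of the change; the file names themselves are unchanged):

testpath = "/tmp/ramdisk/test.exdir"  # in setup_exdir and setup_exdir_no_validation
testpath = "/tmp/ramdisk/test.h5"     # in setup_h5py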

Helper functions

The following functions are used to set up an Exdir or HDF5 file for benchmarking:


In [ ]:
import exdir
import os
import shutil
import h5py

def setup_exdir():
    testpath = "test.exdir"
    if os.path.exists(testpath):
        shutil.rmtree(testpath)
    f = exdir.File(testpath)
    return f, testpath

def setup_exdir_no_validation():
    testpath = "test.exdir"
    if os.path.exists(testpath):
        shutil.rmtree(testpath)
    f = exdir.File(testpath, name_validation=exdir.validation.minimal)
    return f, testpath

def teardown_exdir(f, testpath):
    f.close()
    shutil.rmtree(testpath)

def setup_h5py():
    testpath = "test.h5"
    if os.path.exists(testpath):
        os.remove(testpath)
    f = h5py.File(testpath, "w")  # explicit write mode; newer h5py versions no longer default to append
    return f, testpath

    
def teardown_h5py(f, testpath):
    f.close()
    os.remove(testpath)

The following function is used to run the different benchmarks. It takes a target function to benchmark, optional setup and teardown functions that create and remove the file, and the number of iterations the target should be run to get a decent average:


In [ ]:
import time

def benchmark(target, setup=None, teardown=None, iterations=10):
    total_time = 0
    for i in range(iterations):
        data = tuple()
        if setup is not None:
            data = setup()
        time.sleep(1)  # allow changes to be flushed to disk
        start_time = time.time()
        target(*data)
        end_time = time.time()
        total_time += end_time - start_time
        if teardown is not None:
            teardown(*data)

    mean = total_time / iterations

    return mean

The following functions are used as wrappers to make it easy to run a benchmark of Exdir or h5py:


In [ ]:
import pandas as pd
import numpy as np

all_results = []

def benchmark_both(function, iterations=10, name_validation=True):
    if name_validation:
        setup_exdir_ = setup_exdir
        name = function.__name__
    else:
        setup_exdir_ = setup_exdir_no_validation
        name = function.__name__ + " (minimal name validation)"
    
    exdir_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_exdir_,
        teardown=teardown_exdir,
        iterations=iterations
    )
    hdf5_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_h5py,
        teardown=teardown_h5py,
        iterations=iterations
    )
    
    result = pd.DataFrame(
        [(name, hdf5_mean, exdir_mean, hdf5_mean/exdir_mean)],
        columns=["Test", "h5py", "Exdir", "Ratio"]
    )
    all_results.append(result)
    return result

def benchmark_exdir(function, iterations=10):
    exdir_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_exdir,
        teardown=teardown_exdir,
        iterations=iterations
    )
    result = pd.DataFrame(
        [(function.__name__, np.nan, exdir_mean, np.nan)],
        columns=["Test", "h5py", "Exdir", "Ratio"]
    )
    all_results.append(result)
    return result

We are now ready to start running the different benchmarks.

Benchmark functions

The following benchmark creates a small number of attributes. This should be very fast with both h5py and Exdir:


In [ ]:
def add_few_attributes(obj):
    for i in range(5):
        obj.attrs["hello" + str(i)] = "world"

benchmark_both(add_few_attributes)

The following benchmark adds a larger number of attributes one by one. Because Exdir needs to read back and rewrite the entire attribute file on every write (in case someone else changed it in the meantime), this is significantly slower in Exdir than in h5py:


In [ ]:
def add_many_attributes(obj):
    for i in range(200):
        obj.attrs["hello" + str(i)] = "world"

benchmark_both(add_many_attributes, 10)

However, Exdir is capable of writing all attributes in one operation. This makes writing the same attributes about as fast as h5py, or even faster. Writing a large number of attributes in a single operation is not possible with h5py, so we run this benchmark only with Exdir:


In [ ]:
def add_many_attributes_single_operation(obj):
    attributes = {}
    for i in range(200):
        attributes["hello" + str(i)] = "world"
    obj.attrs = attributes
    
benchmark_exdir(add_many_attributes_single_operation)

Exdir also supports adding nested attributes, such as Python dictionaries, which is not supported by h5py:


In [ ]:
def add_attribute_tree(obj):
    tree = {}
    for i in range(100):
        tree["hello" + str(i)] = "world"
    tree["intermediate"] = {}
    intermediate = tree["intermediate"]
    for level in range(10):
        level_str = "level" + str(level)
        intermediate[level_str] = {}
        intermediate = intermediate[level_str]
    intermediate["leaf"] = 42  # store a value at the deepest level ("leaf" is an arbitrary key)
    obj.attrs["test"] = tree
    
benchmark_exdir(add_attribute_tree)

The following benchmarks create a small, a medium, and a large dataset:


In [ ]:
def add_small_dataset(obj):
    data = np.zeros((100, 100, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_small_dataset)

In [ ]:
def add_medium_dataset(obj):
    data = np.zeros((1000, 100, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_medium_dataset, 10)

In [ ]:
def add_large_dataset(obj):
    data = np.zeros((1000, 1000, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_large_dataset, 3)

There is some overhead in creating the objects themselves. This is rather small in h5py, but can be high in Exdir with name validation enabled. This is because the name of every created object must be checked against all the existing objects in the same group:


In [ ]:
def create_many_objects(obj):
    for i in range(5000):
        group = obj.create_group("group{}".format(i))

benchmark_both(create_many_objects, 3)

With minimal name validation, this is almost as fast in Exdir as it is in h5py. Minimal name validation only checks whether a file with the exact same name already exists in the folder:


In [ ]:
benchmark_both(create_many_objects, 3, name_validation=False)

It is not only the number of created objects that matters: creating them in a tree structure can also incur a performance penalty. The following benchmark creates an object tree:


In [ ]:
def create_large_tree(obj, level=0):
    if level > 4:
        return
    for i in range(3):
        group = obj.create_group("group_{}_{}".format(i, level))
        data = np.zeros((10, 10, 10))
        group.create_dataset("dataset_{}_{}".format(i, level), data=data)
        create_large_tree(group, level + 1)
        
benchmark_both(create_large_tree)

The final benchmark tests writing a "slice" of a dataset, meaning that only part of the entire dataset is modified. This is typically fast in both h5py and Exdir thanks to memory mapping:


In [ ]:
def write_slice(dataset):
    dataset[320:420, 0:300, 0:100] = np.ones((100, 300, 100))

def create_setup_dataset(setup_function):
    def setup():
        f, path = setup_function()
        data = np.zeros((1000, 500, 100))
        dataset = f.create_dataset("foo", data=data)
        time.sleep(1) # allow changes to get flushed to disk
        return dataset, f, path
    return setup

exdir_mean = benchmark(
    target=lambda dataset, f, path: write_slice(dataset),
    setup=create_setup_dataset(setup_exdir),
    teardown=lambda dataset, f, path: teardown_exdir(f, path),
    iterations=3
)

hdf5_mean = benchmark(
    target=lambda dataset, f, path: write_slice(dataset),
    setup=create_setup_dataset(setup_h5py),
    teardown=lambda dataset, f, path: teardown_h5py(f, path),
    iterations=3
)
result = pd.DataFrame(
    [("write_slice", hdf5_mean, exdir_mean, hdf5_mean/exdir_mean)],
    columns=["Test", "h5py", "Exdir", "Ratio"]
)
all_results.append(result)

result

Benchmark summary

The results are summarized in the following table:


In [ ]:
pd.concat(all_results)
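
To quickly see where Exdir lags the most, the combined table can also be sorted by the Ratio column (h5py time divided by Exdir time), so that the cases where Exdir is slowest relative to h5py come first:


In [ ]:
pd.concat(all_results).sort_values("Ratio")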

Profiling the largest differences

While the performance of Exdir is in many cases close to that of h5py, a few cases are worth investigating further.

For instance, it might be interesting to know what takes the most time in create_large_tree, which is about 2-3 times slower in Exdir than in h5py:


In [ ]:
import cProfile

f, path = setup_exdir()
cProfile.run('create_large_tree(f)', sort="cumtime")
teardown_exdir(f, path)

Here we see that create_dataset and create_group take up about 2/3 and 1/3 of the total run time, respectively. Some of the time in both is spent on building paths using pathlib and on name validation. The remaining time is mostly spent on writing the array headers of the NumPy files; only a small amount of time is spent on actually writing the files. Increasing performance in this case would likely mean outperforming pathlib in building paths and NumPy in writing files. While that might be possible, it is also beneficial to stick with the existing, well-tested implementations of both libraries.
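
To narrow the output down further, the profile can be saved to a file and filtered with pstats, for example restricting the listing to functions whose path contains "exdir". The snippet below is a small sketch of this approach; the profile file name create_large_tree.prof is an arbitrary choice:


In [ ]:
import cProfile
import pstats

f, path = setup_exdir()
# save the profile to a file so it can be filtered afterwards
cProfile.run('create_large_tree(f)', 'create_large_tree.prof')
teardown_exdir(f, path)

# sort by cumulative time and show only entries whose path contains "exdir"
stats = pstats.Stats('create_large_tree.prof')
stats.sort_stats('cumtime').print_stats('exdir')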

