Quick HDF5 benchmarks

We compare the performance of reading a subset of a large array:

  • in memory with NumPy
  • with h5py
  • with memmap using an HDF5 file
  • with memmap using an NPY file

This illustrates the performance issues we hit with HDF5 in a very particular use case: accessing a small number of rows in a large "vertical" rectangular array (many more rows than columns).


In [4]:
import h5py
import numpy as np

In [5]:
np.random.seed(2016)

We'll use this function to bypass h5py's slow data access with a faster memory map (this only works on uncompressed, contiguous datasets):


In [6]:
def _mmap_h5(path, h5path):
    with h5py.File(path, 'r') as f:
        ds = f[h5path]
        # We get the dataset address in the HDF5 file.
        offset = ds.id.get_offset()
        # We ensure we have a non-compressed contiguous array.
        assert ds.chunks is None
        assert ds.compression is None
        assert offset > 0
        dtype = ds.dtype
        shape = ds.shape
    arr = np.memmap(path, mode='r', shape=shape, offset=offset, dtype=dtype)
    return arr

Number of rows in our test array:


In [7]:
shape = (100000, 1000)
n, ncols = shape

We generate a random array:


In [8]:
arr = np.random.rand(n, ncols).astype(np.float32)

We write it to a file:


In [12]:
%timeit with h5py.File('test.h5', 'w') as f: f['/test'] = arr


1 loops, best of 3: 413 ms per loop

We open the file once in read mode:


In [7]:
f = h5py.File('test.h5', 'r')
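As a quick sanity check (illustrative, not part of the benchmark), we can verify on a small throwaway file that the memmap view returned by the trick above is byte-for-byte identical to what h5py reads. The file name `check.h5` and the sizes here are assumptions for the sketch:

```python
# Sanity check: the raw memmap at the dataset's file offset should
# contain exactly the same data as h5py's own read.
import numpy as np
import h5py

def mmap_h5(path, h5path):
    with h5py.File(path, 'r') as g:
        ds = g[h5path]
        # Only valid for uncompressed, contiguous datasets.
        assert ds.chunks is None and ds.compression is None
        offset = ds.id.get_offset()
        dtype, shape = ds.dtype, ds.shape
    return np.memmap(path, mode='r', shape=shape, offset=offset, dtype=dtype)

small = np.random.rand(100, 10).astype(np.float32)
with h5py.File('check.h5', 'w') as g:
    g['/test'] = small  # default layout: contiguous, uncompressed

with h5py.File('check.h5', 'r') as g:
    via_h5py = g['/test'][:]
via_mmap = np.asarray(mmap_h5('check.h5', '/test'))

print(np.array_equal(small, via_h5py) and np.array_equal(small, via_mmap))
```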

In [13]:
%timeit np.save('test.npy', arr)


1 loops, best of 3: 628 ms per loop

For reference, we also time the creation of a raw memory-mapped file:

In [ ]:
%timeit arr = np.memmap('test.map', mode='w+', shape=shape, dtype=np.float32)

Slices


In [10]:
ind = slice(None, None, 100)

In [11]:
print('in memory')
%timeit arr[ind, :] * 1
print()
print('h5py')
%timeit f['/test'][ind, :] * 1
print()
print('memmap of HDF5 file')
%timeit _mmap_h5('test.h5', '/test')[ind, :] * 1
print()
print('memmap of NPY file')
%timeit np.load('test.npy', mmap_mode='r')[ind, :] * 1


in memory
1000 loops, best of 3: 741 µs per loop

h5py
100 loops, best of 3: 9.65 ms per loop

memmap of HDF5 file
100 loops, best of 3: 3.95 ms per loop

memmap of NPY file
100 loops, best of 3: 3.75 ms per loop

Fancy indexing

Fancy indexing is what we have to use in our particular use case.


In [13]:
ind = np.unique(np.random.randint(0, n, n // 100))

In [15]:
len(ind)


Out[15]:
999

In [16]:
print('in memory')
%timeit arr[ind, :] * 1
print()
print('h5py')
%timeit f['/test'][ind, :] * 1
print()
print('memmap of HDF5 file')
%timeit _mmap_h5('test.h5', '/test')[ind, :] * 1
print()
print('memmap of NPY file')
%timeit np.load('test.npy', mmap_mode='r')[ind, :] * 1


in memory
100 loops, best of 3: 2.05 ms per loop

h5py
10 loops, best of 3: 53.3 ms per loop

memmap of HDF5 file
100 loops, best of 3: 5.62 ms per loop

memmap of NPY file
100 loops, best of 3: 5.12 ms per loop

Note that h5py uses a slow algorithm for fancy indexing, so the HDF5 format itself is not the only cause of the slowdown.
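
This is why both memmap variants above beat h5py by an order of magnitude: the gather happens in NumPy's fast fancy-indexing code, and only the pages holding the selected rows are read from disk. A minimal sketch of that access pattern (file name `demo.npy` and smaller sizes are illustrative, not from the benchmark):

```python
# Sketch: fancy-index rows through a memmap so the gather happens in
# NumPy rather than in h5py's slower selection code.
import numpy as np

shape = (10000, 100)
arr = np.random.rand(*shape).astype(np.float32)
np.save('demo.npy', arr)

# Sorted, unique row indices, as in the benchmark above.
ind = np.unique(np.random.randint(0, shape[0], shape[0] // 100))

# NumPy's fancy indexing on the memmap touches only the needed pages.
rows = np.load('demo.npy', mmap_mode='r')[ind, :]

print(np.array_equal(rows, arr[ind, :]))  # the gather matches the in-memory result
```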