We compare the performance of several ways of reading a subset of a large array.
This illustrates the performance issues we encounter with HDF5 in a very particular use case: accessing a small number of rows in a large "vertical" rectangular array.
In [4]:
import h5py
import numpy as np
In [5]:
np.random.seed(2016)
We'll use this function to bypass h5py's slow data access with a faster memory mapping (this only works on uncompressed, contiguous datasets):
In [6]:
def _mmap_h5(path, h5path):
    with h5py.File(path, 'r') as f:
        ds = f[h5path]
        # We get the dataset address in the HDF5 file.
        offset = ds.id.get_offset()
        # We ensure we have a non-compressed contiguous array.
        assert ds.chunks is None
        assert ds.compression is None
        assert offset > 0
        dtype = ds.dtype
        shape = ds.shape
    arr = np.memmap(path, mode='r', shape=shape, offset=offset, dtype=dtype)
    return arr
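The same offset-based mapping can be illustrated with a plain binary file, independently of HDF5 (a minimal sketch; the file name is arbitrary, and the offset is 0 here because tofile() writes the raw buffer with no header, unlike HDF5 where get_offset() is needed):

```python
import os
import tempfile

import numpy as np

# Write a small array as a raw binary file: no header, so the data
# starts at byte offset 0.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
path = os.path.join(tempfile.mkdtemp(), 'raw.bin')
a.tofile(path)

# Map it back without reading it into memory up front.
view = np.memmap(path, mode='r', shape=a.shape, offset=0, dtype=a.dtype)
assert np.array_equal(view, a)
```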
Size of our test array (number of rows and columns):
In [7]:
shape = (100000, 1000)
n, ncols = shape
We generate a random array:
In [8]:
arr = np.random.rand(n, ncols).astype(np.float32)
We write it to a file:
In [12]:
%timeit with h5py.File('test.h5', 'w') as f: f['/test'] = arr
We open the file once in read mode.
In [7]:
f = h5py.File('test.h5', 'r')
In [13]:
%timeit np.save('test.npy', arr)
Note that this only allocates the memory-mapped file on disk; it does not copy the array's contents into it.
In [ ]:
%timeit arr = np.memmap('test.map', mode='w+', shape=shape, dtype=np.float32)
We first select every 100th row with a regular slice:
In [10]:
ind = slice(None, None, 100)
In [11]:
print('in memory')
%timeit arr[ind, :] * 1
print()
print('h5py')
%timeit f['/test'][ind, :] * 1
print()
print('memmap of HDF5 file')
%timeit _mmap_h5('test.h5', '/test')[ind, :] * 1
print()
print('memmap of NPY file')
%timeit np.load('test.npy', mmap_mode='r')[ind, :] * 1
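Outside a notebook, the same kind of measurement can be scripted with the standard timeit module (a sketch of the in-memory case only, on a smaller array; the sizes and loop count are arbitrary):

```python
import timeit

import numpy as np

# Time the in-memory strided read as a plain script instead of %timeit.
a = np.random.rand(10000, 100).astype(np.float32)
ind = slice(None, None, 100)

n_loops = 1000
total = timeit.timeit(lambda: a[ind, :] * 1, number=n_loops)
print('in memory: %.1f us per loop' % (total / n_loops * 1e6))
```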
In our particular use case, however, we have to use fancy indexing.
In [13]:
ind = np.unique(np.random.randint(0, n, n // 100))
In [15]:
len(ind)
Out[15]:
In [16]:
print('in memory')
%timeit arr[ind, :] * 1
print()
print('h5py')
%timeit f['/test'][ind, :] * 1
print()
print('memmap of HDF5 file')
%timeit _mmap_h5('test.h5', '/test')[ind, :] * 1
print()
print('memmap of NPY file')
%timeit np.load('test.npy', mmap_mode='r')[ind, :] * 1
Note that h5py itself uses a slow algorithm for fancy indexing, so the HDF5 format is not the only cause of the slowdown.
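One common workaround for slow fancy indexing in the storage layer (not used in this notebook) is to read the contiguous range spanning the requested rows in one go, then fancy-index in memory. A minimal sketch, simulated here on a plain NumPy array rather than an h5py dataset:

```python
import numpy as np

# Simulate the workaround: one contiguous read, then in-memory fancy indexing.
rng = np.random.RandomState(2016)
a = rng.rand(10000, 100).astype(np.float32)
ind = np.unique(rng.randint(0, 10000, 100))

lo, hi = ind[0], ind[-1] + 1
block = a[lo:hi, :]            # one contiguous read from the storage layer
sel = block[ind - lo, :]       # fancy indexing done in memory
assert np.array_equal(sel, a[ind, :])
```

Whether this pays off depends on how densely the requested rows cover the spanned range; for very sparse selections the contiguous read can transfer far more data than needed.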