Advanced indexing



In [1]:

    
import sys
sys.path.insert(0, '..')
import zarr
import numpy as np
np.random.seed(42)
import cProfile
zarr.__version__









    Out[1]:





'2.1.5.dev144'

Functionality and API

Indexing a 1D array with a Boolean (mask) array

Supported via get/set_mask_selection() and .vindex[]. Also supported via get/set_orthogonal_selection() and .oindex[].



In [2]:

    
a = np.arange(10)
za = zarr.array(a, chunks=2)
ix = [False,  True,  False,  True, False, True, False,  True,  False,  True]



In [3]:

    
# get items
za.vindex[ix]









    Out[3]:





array([1, 3, 5, 7, 9])



In [4]:

    
# get items
za.oindex[ix]









    Out[4]:





array([1, 3, 5, 7, 9])



In [5]:

    
# set items
za.vindex[ix] = a[ix] * 10
za[:]









    Out[5]:





array([ 0, 10,  2, 30,  4, 50,  6, 70,  8, 90])



In [6]:

    
# set items
za.oindex[ix] = a[ix] * 100
za[:]









    Out[6]:





array([  0, 100,   2, 300,   4, 500,   6, 700,   8, 900])



In [7]:

    
# if using .oindex, indexing array can be any array-like, e.g., Zarr array
zix = zarr.array(ix, chunks=2)
za = zarr.array(a, chunks=2)
za.oindex[zix]  # will not load all zix into memory









    Out[7]:





array([1, 3, 5, 7, 9])

Indexing a 1D array with a 1D integer (coordinate) array

Supported via get/set_coordinate_selection() and .vindex[]. Also supported via get/set_orthogonal_selection() and .oindex[].



In [8]:

    
a = np.arange(10)
za = zarr.array(a, chunks=2)
ix = [1, 3, 5, 7, 9]



In [9]:

    
# get items
za.vindex[ix]









    Out[9]:





array([1, 3, 5, 7, 9])



In [10]:

    
# get items
za.oindex[ix]









    Out[10]:





array([1, 3, 5, 7, 9])



In [11]:

    
# set items
za.vindex[ix] = a[ix] * 10
za[:]









    Out[11]:





array([ 0, 10,  2, 30,  4, 50,  6, 70,  8, 90])



In [12]:

    
# set items
za.oindex[ix] = a[ix] * 100
za[:]









    Out[12]:





array([  0, 100,   2, 300,   4, 500,   6, 700,   8, 900])

Indexing a 1D array with a multi-dimensional integer (coordinate) array

Supported via get/set_coordinate_selection() and .vindex[].



In [13]:

    
a = np.arange(10)
za = zarr.array(a, chunks=2)
ix = np.array([[1, 3, 5], [2, 4, 6]])



In [14]:

    
# get items
za.vindex[ix]









    Out[14]:





array([[1, 3, 5],
       [2, 4, 6]])



In [15]:

    
# set items
za.vindex[ix] = a[ix] * 10
za[:]









    Out[15]:





array([ 0, 10, 20, 30, 40, 50, 60,  7,  8,  9])

Slicing a 1D array with step > 1

Slices with step > 1 are supported via get/set_basic_selection(), get/set_orthogonal_selection(), __getitem__ and .oindex[]. Negative steps are not supported.



In [16]:

    
a = np.arange(10)
za = zarr.array(a, chunks=2)



In [17]:

    
# get items
za[1::2]









    Out[17]:





array([1, 3, 5, 7, 9])



In [18]:

    
# set items
za.oindex[1::2] = a[1::2] * 10
za[:]









    Out[18]:





array([ 0, 10,  2, 30,  4, 50,  6, 70,  8, 90])

Orthogonal (outer) indexing of multi-dimensional arrays

Orthogonal (a.k.a. outer) indexing is supported with either Boolean or integer arrays, in combination with integers and slices. This functionality is provided via the get/set_orthogonal_selection() methods. For convenience, this functionality is also available via the .oindex[] property.



In [19]:

    
a = np.arange(15).reshape(5, 3)
za = zarr.array(a, chunks=(3, 2))
za[:]









    Out[19]:





array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])



In [20]:

    
# orthogonal indexing with Boolean arrays
ix0 = [False, True, False, True, False]
ix1 = [True, False, True]
za.get_orthogonal_selection((ix0, ix1))









    Out[20]:





array([[ 3,  5],
       [ 9, 11]])



In [21]:

    
# alternative API
za.oindex[ix0, ix1]









    Out[21]:





array([[ 3,  5],
       [ 9, 11]])



In [22]:

    
# orthogonal indexing with integer arrays
ix0 = [1, 3]
ix1 = [0, 2]
za.get_orthogonal_selection((ix0, ix1))









    Out[22]:





array([[ 3,  5],
       [ 9, 11]])



In [23]:

    
# alternative API
za.oindex[ix0, ix1]









    Out[23]:





array([[ 3,  5],
       [ 9, 11]])



In [24]:

    
# combine with slice
za.oindex[[1,  3], :]









    Out[24]:





array([[ 3,  4,  5],
       [ 9, 10, 11]])



In [25]:

    
# combine with slice
za.oindex[:, [0, 2]]









    Out[25]:





array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14]])



In [26]:

    
# set items via Boolean selection
ix0 = [False, True, False, True, False]
ix1 = [True, False, True]
selection = ix0, ix1
value = 42
za.set_orthogonal_selection(selection, value)
za[:]









    Out[26]:





array([[ 0,  1,  2],
       [42,  4, 42],
       [ 6,  7,  8],
       [42, 10, 42],
       [12, 13, 14]])



In [27]:

    
# alternative API
za.oindex[ix0, ix1] = 44
za[:]









    Out[27]:





array([[ 0,  1,  2],
       [44,  4, 44],
       [ 6,  7,  8],
       [44, 10, 44],
       [12, 13, 14]])



In [28]:

    
# set items via integer selection
ix0 = [1, 3]
ix1 = [0, 2]
selection = ix0, ix1
value = 46
za.set_orthogonal_selection(selection, value)
za[:]









    Out[28]:





array([[ 0,  1,  2],
       [46,  4, 46],
       [ 6,  7,  8],
       [46, 10, 46],
       [12, 13, 14]])



In [29]:

    
# alternative API
za.oindex[ix0, ix1] = 48
za[:]









    Out[29]:





array([[ 0,  1,  2],
       [48,  4, 48],
       [ 6,  7,  8],
       [48, 10, 48],
       [12, 13, 14]])

Coordinate indexing of multi-dimensional arrays

Selecting arbitrary points from a multi-dimensional array by indexing with integer (coordinate) arrays is supported. This functionality is provided via the get/set_coordinate_selection() methods. For convenience, this functionality is also available via the .vindex[] property.



In [30]:

    
a = np.arange(15).reshape(5, 3)
za = zarr.array(a, chunks=(3, 2))
za[:]









    Out[30]:





array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])



In [31]:

    
# get items
ix0 = [1, 3]
ix1 = [0, 2]
za.get_coordinate_selection((ix0, ix1))









    Out[31]:





array([ 3, 11])



In [32]:

    
# alternative API
za.vindex[ix0, ix1]









    Out[32]:





array([ 3, 11])



In [33]:

    
# set items
za.set_coordinate_selection((ix0, ix1), 42)
za[:]









    Out[33]:





array([[ 0,  1,  2],
       [42,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 42],
       [12, 13, 14]])



In [34]:

    
# alternative API
za.vindex[ix0, ix1] = 44
za[:]









    Out[34]:





array([[ 0,  1,  2],
       [44,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 44],
       [12, 13, 14]])

Mask indexing of multi-dimensional arrays

Selecting arbitrary points from a multi-dimensional array by a Boolean array is supported. This functionality is provided via the get/set_mask_selection() methods. For convenience, this functionality is also available via the .vindex[] property.



In [35]:

    
a = np.arange(15).reshape(5, 3)
za = zarr.array(a, chunks=(3, 2))
za[:]









    Out[35]:





array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])



In [36]:

    
ix = np.zeros_like(a, dtype=bool)
ix[1, 0] = True
ix[3, 2] = True
za.get_mask_selection(ix)









    Out[36]:





array([ 3, 11])



In [37]:

    
za.vindex[ix]









    Out[37]:





array([ 3, 11])



In [38]:

    
za.set_mask_selection(ix, 42)
za[:]









    Out[38]:





array([[ 0,  1,  2],
       [42,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 42],
       [12, 13, 14]])



In [39]:

    
za.vindex[ix] = 44
za[:]









    Out[39]:





array([[ 0,  1,  2],
       [44,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 44],
       [12, 13, 14]])

Selecting fields from arrays with a structured dtype

All get/set_selection_...() methods support a fields argument which allows retrieving/replacing data for a specific field or fields. Also h5py-like API is supported where fields can be provided within __getitem__, .oindex[] and .vindex[].



In [42]:

    
a = np.array([(b'aaa', 1, 4.2),
              (b'bbb', 2, 8.4),
              (b'ccc', 3, 12.6)], 
             dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')])
za = zarr.array(a, chunks=2, fill_value=None)
za[:]









    Out[42]:





array([(b'aaa', 1,   4.2), (b'bbb', 2,   8.4), (b'ccc', 3,  12.6)],
      dtype=[('foo', 'S3'), ('bar', '<i4'), ('baz', '<f8')])



In [43]:

    
za['foo']









    Out[43]:





array([b'aaa', b'bbb', b'ccc'],
      dtype='|S3')



In [44]:

    
za['foo', 'baz']









    Out[44]:





array([(b'aaa',   4.2), (b'bbb',   8.4), (b'ccc',  12.6)],
      dtype=[('foo', 'S3'), ('baz', '<f8')])



In [45]:

    
za[:2, 'foo']









    Out[45]:





array([b'aaa', b'bbb'],
      dtype='|S3')



In [46]:

    
za[:2, 'foo', 'baz']









    Out[46]:





array([(b'aaa',  4.2), (b'bbb',  8.4)],
      dtype=[('foo', 'S3'), ('baz', '<f8')])



In [47]:

    
za.oindex[[0, 2], 'foo']









    Out[47]:





array([b'aaa', b'ccc'],
      dtype='|S3')



In [48]:

    
za.vindex[[0, 2], 'foo']









    Out[48]:





array([b'aaa', b'ccc'],
      dtype='|S3')



In [49]:

    
za['bar'] = 42
za[:]









    Out[49]:





array([(b'aaa', 42,   4.2), (b'bbb', 42,   8.4), (b'ccc', 42,  12.6)],
      dtype=[('foo', 'S3'), ('bar', '<i4'), ('baz', '<f8')])



In [50]:

    
za[:2, 'bar'] = 84
za[:]









    Out[50]:





array([(b'aaa', 84,   4.2), (b'bbb', 84,   8.4), (b'ccc', 42,  12.6)],
      dtype=[('foo', 'S3'), ('bar', '<i4'), ('baz', '<f8')])

Note that this API differs from numpy when selecting multiple fields. E.g.:



In [51]:

    
a['foo', 'baz']









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-51-ad64b4a52de3> in <module>()
----> 1 a['foo', 'baz']

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices



In [52]:

    
a[['foo', 'baz']]









    Out[52]:





array([(b'aaa',   4.2), (b'bbb',   8.4), (b'ccc',  12.6)],
      dtype=[('foo', 'S3'), ('baz', '<f8')])



In [53]:

    
za['foo', 'baz']









    Out[53]:





array([(b'aaa',   4.2), (b'bbb',   8.4), (b'ccc',  12.6)],
      dtype=[('foo', 'S3'), ('baz', '<f8')])



In [54]:

    
za[['foo', 'baz']]









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-54-2394dac3f711> in <module>()
----> 1 za[['foo', 'baz']]

~/src/github/alimanfoo/zarr/zarr/core.py in __getitem__(self, selection)
    537 
    538         fields, selection = pop_fields(selection)
--> 539         return self.get_basic_selection(selection, fields=fields)
    540 
    541     def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in get_basic_selection(self, selection, out, fields)
    661             return self._get_basic_selection_zd(selection=selection, out=out, fields=fields)
    662         else:
--> 663             return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)
    664 
    665     def _get_basic_selection_zd(self, selection, out=None, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in _get_basic_selection_nd(self, selection, out, fields)
    701 
    702         # setup indexer
--> 703         indexer = BasicIndexer(selection, self)
    704 
    705         return self._get_selection(indexer=indexer, out=out, fields=fields)

~/src/github/alimanfoo/zarr/zarr/indexing.py in __init__(self, selection, array)
    275             else:
    276                 raise IndexError('unsupported selection item for basic indexing; expected integer '
--> 277                                  'or slice, got {!r}'.format(type(dim_sel)))
    278 
    279             dim_indexers.append(dim_indexer)

IndexError: unsupported selection item for basic indexing; expected integer or slice, got <class 'list'>

1D Benchmarking



In [53]:

    
c = np.arange(100000000)
c.nbytes









    Out[53]:





800000000



In [54]:

    
%time zc = zarr.array(c)
zc.info









    



CPU times: user 480 ms, sys: 16 ms, total: 496 ms
Wall time: 141 ms






    Out[54]:




Type zarr.core.Array
Data type int64
Shape (100000000,)
Chunk shape (97657,)
Order C
Read-only False
Compressor Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type builtins.dict
No. bytes 800000000 (762.9M)
No. bytes stored 11854081 (11.3M)
Storage ratio 67.5
Chunks initialized 1024/1024



In [55]:

    
%timeit c.copy()









    



121 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [56]:

    
%timeit zc[:]









    



254 ms ± 942 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

bool dense selection



In [57]:

    
# relatively dense selection - 10%
ix_dense_bool = np.random.binomial(1, 0.1, size=c.shape[0]).astype(bool)
np.count_nonzero(ix_dense_bool)









    Out[57]:





9997476



In [58]:

    
%timeit c[ix_dense_bool]









    



243 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [59]:

    
%timeit zc.oindex[ix_dense_bool]









    



433 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [60]:

    
%timeit zc.vindex[ix_dense_bool]









    



548 ms ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [61]:

    
import tempfile
import cProfile
import pstats

def profile(statement, sort='time', restrictions=(7,)):
    with tempfile.NamedTemporaryFile() as f:
        cProfile.run(statement, filename=f.name)
        pstats.Stats(f.name).sort_stats(sort).print_stats(*restrictions)



In [62]:

    
profile('zc.oindex[ix_dense_bool]')









    



Wed Nov  8 17:17:48 2017    /tmp/tmpruua2rs_

         98386 function calls in 0.483 seconds

   Ordered by: internal time
   List reduced from 83 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1025    0.197    0.000    0.197    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
     1024    0.149    0.000    0.159    0.000 ../zarr/core.py:1028(_decode_chunk)
     1024    0.044    0.000    0.231    0.000 ../zarr/core.py:849(_chunk_getitem)
     1024    0.009    0.000    0.009    0.000 {built-in method numpy.core.multiarray.count_nonzero}
     1025    0.007    0.000    0.238    0.000 ../zarr/indexing.py:541(__iter__)
     1024    0.006    0.000    0.207    0.000 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/lib/index_tricks.py:26(ix_)
     2048    0.005    0.000    0.005    0.000 ../zarr/core.py:337(<genexpr>)

Method nonzero is being called internally within numpy to convert bool to int selections, no way to avoid.



In [63]:

    
profile('zc.vindex[ix_dense_bool]')









    



Wed Nov  8 17:18:06 2017    /tmp/tmp7_bautep

         52382 function calls in 0.592 seconds

   Ordered by: internal time
   List reduced from 88 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.219    0.110    0.219    0.110 {method 'nonzero' of 'numpy.ndarray' objects}
     1024    0.096    0.000    0.101    0.000 ../zarr/core.py:1028(_decode_chunk)
        2    0.094    0.047    0.094    0.047 ../zarr/indexing.py:630(<genexpr>)
     1024    0.044    0.000    0.167    0.000 ../zarr/core.py:849(_chunk_getitem)
        1    0.029    0.029    0.029    0.029 {built-in method numpy.core.multiarray.ravel_multi_index}
        1    0.023    0.023    0.023    0.023 {built-in method numpy.core.multiarray.bincount}
        1    0.021    0.021    0.181    0.181 ../zarr/indexing.py:603(__init__)

.vindex[] is a bit slower, possibly because internally it converts to a coordinate array first.

int dense selection



In [64]:

    
ix_dense_int = np.random.choice(c.shape[0], size=c.shape[0]//10, replace=True)
ix_dense_int_sorted = ix_dense_int.copy()
ix_dense_int_sorted.sort()
len(ix_dense_int)









    Out[64]:





10000000



In [65]:

    
%timeit c[ix_dense_int_sorted]









    



62.2 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [66]:

    
%timeit zc.oindex[ix_dense_int_sorted]









    



355 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [67]:

    
%timeit zc.vindex[ix_dense_int_sorted]









    



351 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [68]:

    
%timeit c[ix_dense_int]









    



128 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [69]:

    
%timeit zc.oindex[ix_dense_int]









    



1.71 s ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [70]:

    
%timeit zc.vindex[ix_dense_int]









    



1.68 s ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [71]:

    
profile('zc.oindex[ix_dense_int_sorted]')









    



Wed Nov  8 17:19:09 2017    /tmp/tmpgmu5btr_

         95338 function calls in 0.424 seconds

   Ordered by: internal time
   List reduced from 89 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.141    0.141    0.184    0.184 ../zarr/indexing.py:369(__init__)
     1024    0.099    0.000    0.106    0.000 ../zarr/core.py:1028(_decode_chunk)
     1024    0.046    0.000    0.175    0.000 ../zarr/core.py:849(_chunk_getitem)
     1025    0.027    0.000    0.027    0.000 ../zarr/indexing.py:424(__iter__)
        1    0.023    0.023    0.023    0.023 {built-in method numpy.core.multiarray.bincount}
        1    0.010    0.010    0.010    0.010 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/lib/function_base.py:1848(diff)
     1025    0.006    0.000    0.059    0.000 ../zarr/indexing.py:541(__iter__)



In [72]:

    
profile('zc.vindex[ix_dense_int_sorted]')









    



Wed Nov  8 17:19:13 2017    /tmp/tmpay1gvnx8

         52362 function calls in 0.398 seconds

   Ordered by: internal time
   List reduced from 85 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.107    0.054    0.107    0.054 ../zarr/indexing.py:630(<genexpr>)
     1024    0.091    0.000    0.096    0.000 ../zarr/core.py:1028(_decode_chunk)
     1024    0.041    0.000    0.160    0.000 ../zarr/core.py:849(_chunk_getitem)
        1    0.040    0.040    0.213    0.213 ../zarr/indexing.py:603(__init__)
        1    0.029    0.029    0.029    0.029 {built-in method numpy.core.multiarray.ravel_multi_index}
        1    0.023    0.023    0.023    0.023 {built-in method numpy.core.multiarray.bincount}
     2048    0.011    0.000    0.011    0.000 ../zarr/indexing.py:695(<genexpr>)



In [73]:

    
profile('zc.oindex[ix_dense_int]')









    



Wed Nov  8 17:19:20 2017    /tmp/tmpngsf6zpp

         120946 function calls in 1.793 seconds

   Ordered by: internal time
   List reduced from 92 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.128    1.128    1.128    1.128 {method 'argsort' of 'numpy.ndarray' objects}
     1024    0.139    0.000    0.285    0.000 ../zarr/core.py:849(_chunk_getitem)
        1    0.132    0.132    1.422    1.422 ../zarr/indexing.py:369(__init__)
        1    0.120    0.120    0.120    0.120 {method 'take' of 'numpy.ndarray' objects}
     1024    0.116    0.000    0.123    0.000 ../zarr/core.py:1028(_decode_chunk)
     1025    0.034    0.000    0.034    0.000 ../zarr/indexing.py:424(__iter__)
        1    0.023    0.023    0.023    0.023 {built-in method numpy.core.multiarray.bincount}



In [74]:

    
profile('zc.vindex[ix_dense_int]')









    



Wed Nov  8 17:19:22 2017    /tmp/tmpbskhj8de

         50320 function calls in 1.730 seconds

   Ordered by: internal time
   List reduced from 86 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.116    1.116    1.116    1.116 {method 'argsort' of 'numpy.ndarray' objects}
     1024    0.133    0.000    0.275    0.000 ../zarr/core.py:849(_chunk_getitem)
        2    0.121    0.060    0.121    0.060 ../zarr/indexing.py:654(<genexpr>)
     1024    0.113    0.000    0.119    0.000 ../zarr/core.py:1028(_decode_chunk)
        2    0.100    0.050    0.100    0.050 ../zarr/indexing.py:630(<genexpr>)
        1    0.030    0.030    0.030    0.030 {built-in method numpy.core.multiarray.ravel_multi_index}
        1    0.024    0.024    1.427    1.427 ../zarr/indexing.py:603(__init__)

When indices are not sorted, zarr needs to partially sort them so the occur in chunk order, so we only have to visit each chunk once. This sorting dominates the processing time and is unavoidable AFAIK.

bool sparse selection



In [75]:

    
# relatively sparse selection
ix_sparse_bool = np.random.binomial(1, 0.0001, size=c.shape[0]).astype(bool)
np.count_nonzero(ix_sparse_bool)









    Out[75]:





9932



In [76]:

    
%timeit c[ix_sparse_bool]









    



15.7 ms ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



In [77]:

    
%timeit zc.oindex[ix_sparse_bool]









    



156 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [78]:

    
%timeit zc.vindex[ix_sparse_bool]









    



133 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [79]:

    
profile('zc.oindex[ix_sparse_bool]')









    



Wed Nov  8 17:20:09 2017    /tmp/tmpb7nqc9ax

         98386 function calls in 0.191 seconds

   Ordered by: internal time
   List reduced from 83 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1024    0.093    0.000    0.098    0.000 ../zarr/core.py:1028(_decode_chunk)
     1025    0.017    0.000    0.017    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
     1024    0.007    0.000    0.007    0.000 {built-in method numpy.core.multiarray.count_nonzero}
     1024    0.007    0.000    0.129    0.000 ../zarr/core.py:849(_chunk_getitem)
     1025    0.005    0.000    0.052    0.000 ../zarr/indexing.py:541(__iter__)
     1024    0.005    0.000    0.025    0.000 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/lib/index_tricks.py:26(ix_)
     2048    0.004    0.000    0.004    0.000 ../zarr/core.py:337(<genexpr>)



In [80]:

    
profile('zc.vindex[ix_sparse_bool]')









    



Wed Nov  8 17:20:09 2017    /tmp/tmphsko8nvh

         52382 function calls in 0.160 seconds

   Ordered by: internal time
   List reduced from 88 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1024    0.093    0.000    0.098    0.000 ../zarr/core.py:1028(_decode_chunk)
        2    0.017    0.008    0.017    0.008 {method 'nonzero' of 'numpy.ndarray' objects}
     1025    0.008    0.000    0.014    0.000 ../zarr/indexing.py:674(__iter__)
     1024    0.006    0.000    0.127    0.000 ../zarr/core.py:849(_chunk_getitem)
     2048    0.004    0.000    0.004    0.000 ../zarr/indexing.py:695(<genexpr>)
     2054    0.003    0.000    0.003    0.000 ../zarr/core.py:337(<genexpr>)
     1024    0.002    0.000    0.005    0.000 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/core/arrayprint.py:381(wrapper)

int sparse selection



In [81]:

    
ix_sparse_int = np.random.choice(c.shape[0], size=c.shape[0]//10000, replace=True)
ix_sparse_int_sorted = ix_sparse_int.copy()
ix_sparse_int_sorted.sort()
len(ix_sparse_int)









    Out[81]:





10000



In [82]:

    
%timeit c[ix_sparse_int_sorted]









    



18.9 µs ± 392 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)



In [83]:

    
%timeit c[ix_sparse_int]









    



20.3 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)



In [84]:

    
%timeit zc.oindex[ix_sparse_int_sorted]









    



125 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [85]:

    
%timeit zc.vindex[ix_sparse_int_sorted]









    



109 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [86]:

    
%timeit zc.oindex[ix_sparse_int]









    



132 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [87]:

    
%timeit zc.vindex[ix_sparse_int]









    



108 ms ± 579 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [88]:

    
profile('zc.oindex[ix_sparse_int]')









    



Wed Nov  8 17:21:12 2017    /tmp/tmp0b0o2quo

         120946 function calls in 0.196 seconds

   Ordered by: internal time
   List reduced from 92 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1024    0.105    0.000    0.111    0.000 ../zarr/core.py:1028(_decode_chunk)
     2048    0.006    0.000    0.013    0.000 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/lib/index_tricks.py:26(ix_)
     1025    0.006    0.000    0.051    0.000 ../zarr/indexing.py:541(__iter__)
     1024    0.006    0.000    0.141    0.000 ../zarr/core.py:849(_chunk_getitem)
     2048    0.005    0.000    0.005    0.000 ../zarr/core.py:337(<genexpr>)
    15373    0.004    0.000    0.010    0.000 {built-in method builtins.isinstance}
     1025    0.004    0.000    0.005    0.000 ../zarr/indexing.py:424(__iter__)



In [89]:

    
profile('zc.vindex[ix_sparse_int]')









    



Wed Nov  8 17:21:19 2017    /tmp/tmpdwju98kn

         50320 function calls in 0.167 seconds

   Ordered by: internal time
   List reduced from 86 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1024    0.105    0.000    0.111    0.000 ../zarr/core.py:1028(_decode_chunk)
     1025    0.009    0.000    0.017    0.000 ../zarr/indexing.py:674(__iter__)
     1024    0.006    0.000    0.142    0.000 ../zarr/core.py:849(_chunk_getitem)
     2048    0.005    0.000    0.005    0.000 ../zarr/indexing.py:695(<genexpr>)
     2054    0.004    0.000    0.004    0.000 ../zarr/core.py:337(<genexpr>)
        1    0.003    0.003    0.162    0.162 ../zarr/core.py:591(_get_selection)
     1027    0.003    0.000    0.003    0.000 {method 'reshape' of 'numpy.ndarray' objects}

For sparse selections, processing time is dominated by decompression, so we can't do any better.

sparse bool selection as zarr array



In [90]:

    
zix_sparse_bool = zarr.array(ix_sparse_bool)
zix_sparse_bool.info









    Out[90]:




Type zarr.core.Array
Data type bool
Shape (100000000,)
Chunk shape (390625,)
Order C
Read-only False
Compressor Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type builtins.dict
No. bytes 100000000 (95.4M)
No. bytes stored 507131 (495.2K)
Storage ratio 197.2
Chunks initialized 256/256



In [91]:

    
%timeit zc.oindex[zix_sparse_bool]









    



387 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

slice with step



In [92]:

    
%timeit np.array(c[::2])









    



80.3 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [93]:

    
%timeit zc[::2]









    



168 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [94]:

    
%timeit zc[::10]









    



136 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [95]:

    
%timeit zc[::100]









    



104 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [96]:

    
%timeit zc[::1000]









    



100 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [97]:

    
profile('zc[::2]')









    



Wed Nov  8 17:22:44 2017    /tmp/tmpg9dxqcpg

         49193 function calls in 0.211 seconds

   Ordered by: internal time
   List reduced from 55 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1024    0.104    0.000    0.110    0.000 ../zarr/core.py:1028(_decode_chunk)
     1024    0.067    0.000    0.195    0.000 ../zarr/core.py:849(_chunk_getitem)
     1025    0.005    0.000    0.013    0.000 ../zarr/indexing.py:278(__iter__)
     2048    0.004    0.000    0.004    0.000 ../zarr/core.py:337(<genexpr>)
     2050    0.003    0.000    0.003    0.000 ../zarr/indexing.py:90(ceildiv)
     1025    0.003    0.000    0.006    0.000 ../zarr/indexing.py:109(__iter__)
     1024    0.003    0.000    0.003    0.000 {method 'reshape' of 'numpy.ndarray' objects}

2D Benchmarking



In [99]:

    
c.shape









    Out[99]:





(100000000,)



In [100]:

    
d = c.reshape(-1, 1000)
d.shape









    Out[100]:





(100000, 1000)



In [101]:

    
zd = zarr.array(d)
zd.info









    Out[101]:




Type zarr.core.Array
Data type int64
Shape (100000, 1000)
Chunk shape (3125, 32)
Order C
Read-only False
Compressor Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type builtins.dict
No. bytes 800000000 (762.9M)
No. bytes stored 39228864 (37.4M)
Storage ratio 20.4
Chunks initialized 1024/1024

bool orthogonal selection



In [102]:

    
ix0 = np.random.binomial(1, 0.5, size=d.shape[0]).astype(bool)
ix1 = np.random.binomial(1, 0.5, size=d.shape[1]).astype(bool)



In [103]:

    
%timeit d[np.ix_(ix0, ix1)]









    



101 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [104]:

    
%timeit zd.oindex[ix0, ix1]









    



373 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

int orthogonal selection



In [105]:

    
ix0 = np.random.choice(d.shape[0], size=int(d.shape[0] * .5), replace=True)
ix1 = np.random.choice(d.shape[1], size=int(d.shape[1] * .5), replace=True)



In [106]:

    
%timeit d[np.ix_(ix0, ix1)]









    



174 ms ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [107]:

    
%timeit zd.oindex[ix0, ix1]









    



566 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

coordinate (point) selection



In [108]:

    
n = int(d.size * .1)
ix0 = np.random.choice(d.shape[0], size=n, replace=True)
ix1 = np.random.choice(d.shape[1], size=n, replace=True)
n









    Out[108]:





10000000



In [109]:

    
%timeit d[ix0, ix1]









    



243 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [110]:

    
%timeit zd.vindex[ix0, ix1]









    



2.03 s ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [111]:

    
profile('zd.vindex[ix0, ix1]')









    



Wed Nov  8 17:24:31 2017    /tmp/tmp7c68z70p

         62673 function calls in 2.065 seconds

   Ordered by: internal time
   List reduced from 88 to 7 due to restriction <7>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.112    1.112    1.112    1.112 {method 'argsort' of 'numpy.ndarray' objects}
        3    0.244    0.081    0.244    0.081 ../zarr/indexing.py:654(<genexpr>)
        3    0.193    0.064    0.193    0.064 ../zarr/indexing.py:630(<genexpr>)
     1024    0.170    0.000    0.350    0.000 ../zarr/core.py:849(_chunk_getitem)
     1024    0.142    0.000    0.151    0.000 ../zarr/core.py:1028(_decode_chunk)
        1    0.044    0.044    0.044    0.044 {built-in method numpy.core.multiarray.ravel_multi_index}
        1    0.043    0.043    1.676    1.676 ../zarr/indexing.py:603(__init__)

Points need to be partially sorted so all points in the same chunk are grouped and processed together. This requires argsort which dominates time.

h5py comparison

N.B., not really fair because using slower compressor, but for interest...



In [65]:

    
import h5py
import tempfile



In [78]:

    
h5f = h5py.File(tempfile.mktemp(), driver='core', backing_store=False)



In [79]:

    
hc = h5f.create_dataset('c', data=c, compression='gzip', compression_opts=1, chunks=zc.chunks, shuffle=True)
hc









    Out[79]:





<HDF5 dataset "c": shape (100000000,), type "<i8">



In [80]:

    
%time hc[:]









    



CPU times: user 1.16 s, sys: 172 ms, total: 1.33 s
Wall time: 1.32 s






    Out[80]:





array([       0,        1,        2, ..., 99999997, 99999998, 99999999])



In [81]:

    
%time hc[ix_sparse_bool]









    



CPU times: user 1.11 s, sys: 0 ns, total: 1.11 s
Wall time: 1.11 s






    Out[81]:





array([    1063,    28396,    37229, ..., 99955875, 99979354, 99995791])



In [82]:

    
# # this is pathological, takes minutes 
# %time hc[ix_dense_bool]



In [83]:

    
# this is pretty slow
%time hc[::1000]









    



CPU times: user 38.3 s, sys: 136 ms, total: 38.4 s
Wall time: 38.1 s






    Out[83]:





array([       0,     1000,     2000, ..., 99997000, 99998000, 99999000])



In [ ]:

Type	zarr.core.Array
Data type	int64
Shape	(100000000,)
Chunk shape	(97657,)
Order	C
Read-only	False
Compressor	Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type	builtins.dict
No. bytes	800000000 (762.9M)
No. bytes stored	11854081 (11.3M)
Storage ratio	67.5
Chunks initialized	1024/1024