 Problem

  • Need
    • Very large NumPy arrays that don't fit into memory.
    • Lazy loading.
    • Indexing and combination with other arrays (see the sketch after this list).
  • Don't need
    • Ragged arrays (biggus sticks to regular, rectangular arrays).
    • Heterogeneous dtypes.
    • Better-than-NumPy performance.
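
A minimal sketch of the indexing behaviour we're after, using biggus itself (introduced properly in the next section); the shapes here are arbitrary:

import numpy as np
import biggus

# wrap an in-memory array; biggus defers all data access
lazy = biggus.NumpyArrayAdapter(np.zeros((1000, 200), dtype=np.int32))

# slicing is lazy too: it returns another biggus array, and no
# elements are read until .ndarray() is called
sub = lazy[:10, :50]
print sub.shape  # (10, 50)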

 Prior art


In [5]:
import numpy as np
import biggus

np_array = np.empty((700, 200), dtype=np.int32)
arr = biggus.NumpyArrayAdapter(np_array)
print arr


<NumpyArrayAdapter shape=(700, 200) dtype=dtype('int32')>

In [6]:
# like np.concatenate, but lazy
bigger_arr = biggus.LinearMosaic([arr, arr], axis=0)
print bigger_arr


<LinearMosaic shape=(1400, 200) dtype=dtype('int32')>
  • But np.concatenate allocates new memory, then copies.
    • NumPy arrays are also hard to subclass, so extending them directly is awkward.
  • Biggus does no copying (see the sketch below).
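
One way to see the no-copy behaviour (a sketch; it relies on the adapter holding a reference to the underlying NumPy array rather than a copy):

base = np.zeros((2, 3), dtype=np.int32)
mosaic = biggus.LinearMosaic([biggus.NumpyArrayAdapter(base)] * 2, axis=0)

# mutate the original *after* building the mosaic
base[0, 0] = 99

# the change is visible when the mosaic is realized, because the
# adapter kept a reference to `base` instead of copying it
print mosaic.ndarray()[0, 0]  # 99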

In [7]:
# no memory copying
print biggus.LinearMosaic([arr, arr] * 20, axis=0)


<LinearMosaic shape=(28000, 200) dtype=dtype('int32')>

In [8]:
# new dimension
biggus.ArrayStack(np.array([arr, arr]))


Out[8]:
<ArrayStack shape=(2, 700, 200) dtype=dtype('int32')>

In [ ]:
import h5py

hdf_dataset = h5py.File('data.hdf5')['arange']

# this is lazy; no data is loaded
arr_hdf = biggus.NumpyArrayAdapter(hdf_dataset)

print arr_hdf

  • HDF5 already supports lazy loading of arrays, but biggus is more generic.
  • Can combine (LinearMosaic) HDF5-backed and in-memory arrays.

In [ ]:
bigger_arr = biggus.LinearMosaic([bigger_arr, arr_hdf], axis=0)
print bigger_arr

The ndarray method realizes an array, bringing its data into memory.


In [11]:
type(bigger_arr.ndarray()), bigger_arr.ndarray().shape


Out[11]:
(numpy.ndarray, (1400, 200))

You can do basic processing on massive arrays in chunks.


In [12]:
# these operations are deferred; nothing is computed yet
mean = biggus.mean(bigger_arr, axis=0)
std = biggus.std(bigger_arr, axis=0)

In [13]:
print mean


<_Aggregation shape=(200,) dtype=dtype('float64')>

In [14]:
# _now_ we realize it: mean.ndarray() triggers the calculation,
# done in chunks, so the data is never all in memory
print np.all(mean.ndarray() == bigger_arr.ndarray().mean(axis=0))


True

Really, though, when going chunk by chunk you want to perform many operations in a single pass.


In [15]:
# this realizes both results in a single pass: biggus chunks the
# array into sub-arrays and aggregates the partial results
mean_np, std_np = biggus.ndarrays([mean, std])
print type(mean_np)


<type 'numpy.ndarray'>

 Current limitations

  • Limited to axis 0; a pull request to generalize this is in the works.
  • Threading is coming, but the tasks are I/O-bound, so it won't help much.

 Array results

  • What if the result is itself a big array?
  • You can never realize the full result in memory.
  • You can chunk the result directly into an HDF5 variable.

In [18]:
import h5py
with h5py.File('result.hdf5', mode='w') as f_out:
    df = f_out.create_dataset('my_result', mean.shape, mean.dtype)
    biggus.save([mean], [df])
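
Once streamed to disk, the result can itself be wrapped lazily again; a sketch using the file and dataset names from above:

with h5py.File('result.hdf5', mode='r') as f_in:
    result = biggus.NumpyArrayAdapter(f_in['my_result'])
    print result                # still lazy
    print result[:5].ndarray()  # realize only the slice we inspect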

 Toy example of taking OpenStreetMap tiles

(Demo not captured in these notes; see the video. A hypothetical sketch follows below.)

  • Eventually dealing with a 48 GB uncompressed array, comfortably on a MacBook Air.
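
Everything in the sketch below is invented for illustration: the LazyTile proxy, its 256x256 RGB tile shape, and the zero-filled placeholder download. It only shows how LinearMosaic-style lazy stitching could reach the quoted ~48 GB:

class LazyTile(object):
    """Hypothetical proxy for one 256x256 RGB OpenStreetMap tile."""
    shape = (256, 256, 3)
    dtype = np.dtype('uint8')

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getitem__(self, keys):
        # placeholder for a real tile download; biggus only calls this
        # when the enclosing array is realized
        return np.zeros(self.shape, dtype=self.dtype)[keys]

# lazily stitch 512 x 512 tiles into one huge image:
# (131072, 131072, 3) uint8 ~ 48 GB uncompressed, never all in memory
rows = [biggus.LinearMosaic([biggus.NumpyArrayAdapter(LazyTile(x, y))
                             for x in range(512)], axis=1)
        for y in range(512)]
mosaic = biggus.LinearMosaic(rows, axis=0)
print mosaic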

 Summary

  • Simple (~1,200 lines of code).
  • Array-like class.
  • Array-like indexing, aggregation, and concatenation over any indexable object (NumPy, HDF5).
  • Supports streaming.
  • Conceptually no size limit on the arrays you can manipulate.
  • TODO
    • Efficient chunking for complex chains of evaluations.
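
As a closing recap, a minimal end-to-end sketch of the workflow covered above (the file and dataset names are illustrative):

import numpy as np
import h5py
import biggus

# wrap, combine, aggregate: all lazy, no copies
a = biggus.NumpyArrayAdapter(np.random.rand(1000, 200))
big = biggus.LinearMosaic([a] * 50, axis=0)  # shape (50000, 200)
col_mean = biggus.mean(big, axis=0)          # deferred aggregation

# stream the result straight to disk, chunk by chunk
with h5py.File('recap.hdf5', mode='w') as f_out:
    target = f_out.create_dataset('col_mean', col_mean.shape, col_mean.dtype)
    biggus.save([col_mean], [target])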