 Problem

  • Need
    • Very large NumPy arrays that don't fit into memory.
    • Lazy loading.
    • Indexing and combination with other arrays (see the sketch after this list).
  • Don't need
    • Ragged arrays (biggus sticks to regular, rectangular arrays).
    • Heterogeneous dtypes.
    • Better-than-NumPy performance.
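
A minimal sketch of the indexing behaviour we're after, using biggus itself (introduced properly in the next section); the shapes here are arbitrary:

import numpy as np
import biggus

# wrap an in-memory array; biggus defers all data access
lazy = biggus.NumpyArrayAdapter(np.zeros((1000, 200), dtype=np.int32))

# slicing is lazy too: it returns another biggus array, and no
# elements are read until .ndarray() is called
sub = lazy[:10, :50]
print sub.shape  # (10, 50)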

 Prior art


In [5]:
import numpy as np
import biggus

np_array = np.empty((700, 200), dtype=np.int32)
arr = biggus.NumpyArrayAdapter(np_array)
print arr


<NumpyArrayAdapter shape=(700, 200) dtype=dtype('int32')>

In [6]:
# like np.concatenate, but lazy
bigger_arr = biggus.LinearMosaic([arr, arr], axis=0)
print bigger_arr


<LinearMosaic shape=(1400, 200) dtype=dtype('int32')>
  • But np.concatenate allocates new memory, then copies.
    • NumPy arrays are also hard to subclass, so extending them directly is awkward.
  • Biggus does no copying (see the sketch below).
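
One way to see the no-copy behaviour (a sketch; it relies on the adapter holding a reference to the underlying NumPy array rather than a copy):

base = np.zeros((2, 3), dtype=np.int32)
mosaic = biggus.LinearMosaic([biggus.NumpyArrayAdapter(base)] * 2, axis=0)

# mutate the original *after* building the mosaic
base[0, 0] = 99

# the change is visible when the mosaic is realized, because the
# adapter kept a reference to `base` instead of copying it
print mosaic.ndarray()[0, 0]  # 99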

In [7]:
# no memory copying
print biggus.LinearMosaic([arr, arr] * 20, axis=0)


<LinearMosaic shape=(28000, 200) dtype=dtype('int32')>

In [8]:
# new dimension
biggus.ArrayStack(np.array([arr, arr]))


Out[8]:
<ArrayStack shape=(2, 700, 200) dtype=dtype('int32')>

In [ ]:
import h5py

hdf_dataset = h5py.File('data.hdf5')['arange']

# this is lazy; no data is loaded
arr_hdf = biggus.NumpyArrayAdapter(hdf_dataset)

print arr_hdf

  • HDF5 already supports lazy loading of arrays, but biggus is more generic.
  • Can combine (LinearMosaic) HDF5-backed and in-memory arrays.

In [ ]:
bigger_arr = biggus.LinearMosaic([bigger_arr, arr_hdf], axis=0)
print bigger_arr

The ndarray method realizes an array, bringing its data into memory.


In [11]:
type(bigger_arr.ndarray()), bigger_arr.ndarray().shape


Out[11]:
(numpy.ndarray, (1400, 200))

You can do basic processing on massive arrays in chunks.


In [12]:
# these operations are deferred; nothing is computed yet
mean = biggus.mean(bigger_arr, axis=0)
std = biggus.std(bigger_arr, axis=0)

In [13]:
print mean


<_Aggregation shape=(200,) dtype=dtype('float64')>

In [14]:
# _now_ we realize it: mean.ndarray() triggers the calculation,
# done in chunks, so the data is never all in memory
print np.all(mean.ndarray() == bigger_arr.ndarray().mean(axis=0))


True

Really, though, when going chunk by chunk you want to perform many operations in a single pass.


In [15]:
# this realizes both results in a single pass: biggus chunks the
# array into sub-arrays and aggregates the partial results
mean_np, std_np = biggus.ndarrays([mean, std])
print type(mean_np)


<type 'numpy.ndarray'>

 Current limitations

  • Limited to axis 0; a pull request to generalize this is in the works.
  • Threading is coming, but the tasks are I/O-bound, so it won't help much.

 Array results

  • What if the result is itself a big array?
  • You can never realize the full result in memory.
  • You can chunk the result directly into an HDF5 variable.

In [18]:
import h5py
with h5py.File('result.hdf5', mode='w') as f_out:
    df = f_out.create_dataset('my_result', mean.shape, mean.dtype)
    biggus.save([mean], [df])
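
Once streamed to disk, the result can itself be wrapped lazily again; a sketch using the file and dataset names from above:

with h5py.File('result.hdf5', mode='r') as f_in:
    result = biggus.NumpyArrayAdapter(f_in['my_result'])
    print result                # still lazy
    print result[:5].ndarray()  # realize only the slice we inspect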

 Toy example of taking OpenStreetMap tiles

(Demo not captured in these notes; see the video. A hypothetical sketch follows below.)

  • Eventually dealing with a 48 GB uncompressed array, comfortably on a MacBook Air.
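
Everything in the sketch below is invented for illustration: the LazyTile proxy, its 256x256 RGB tile shape, and the zero-filled placeholder download. It only shows how LinearMosaic-style lazy stitching could reach the quoted ~48 GB:

class LazyTile(object):
    """Hypothetical proxy for one 256x256 RGB OpenStreetMap tile."""
    shape = (256, 256, 3)
    dtype = np.dtype('uint8')

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getitem__(self, keys):
        # placeholder for a real tile download; biggus only calls this
        # when the enclosing array is realized
        return np.zeros(self.shape, dtype=self.dtype)[keys]

# lazily stitch 512 x 512 tiles into one huge image:
# (131072, 131072, 3) uint8 ~ 48 GB uncompressed, never all in memory
rows = [biggus.LinearMosaic([biggus.NumpyArrayAdapter(LazyTile(x, y))
                             for x in range(512)], axis=1)
        for y in range(512)]
mosaic = biggus.LinearMosaic(rows, axis=0)
print mosaic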

 Summary

  • Simple (~1,200 lines of code).
  • Array-like class.
  • Array-like indexing, aggregation, and concatenation over any indexable object (NumPy, HDF5).
  • Supports streaming.
  • Conceptually no size limit on the arrays you can manipulate.
  • TODO
    • Efficient chunking for complex chains of evaluations.
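
As a closing recap, a minimal end-to-end sketch of the workflow covered above (the file and dataset names are illustrative):

import numpy as np
import h5py
import biggus

# wrap, combine, aggregate: all lazy, no copies
a = biggus.NumpyArrayAdapter(np.random.rand(1000, 200))
big = biggus.LinearMosaic([a] * 50, axis=0)  # shape (50000, 200)
col_mean = biggus.mean(big, axis=0)          # deferred aggregation

# stream the result straight to disk, chunk by chunk
with h5py.File('recap.hdf5', mode='w') as f_out:
    target = f_out.create_dataset('col_mean', col_mean.shape, col_mean.dtype)
    biggus.save([col_mean], [target])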