Efficient large data operations with Biggus

Aug 30th 2014

Patrick Peglar - UK Met Office, Exeter

Summary

Biggus is a lightweight pure-Python package which implements lazy operations on numpy array-like objects. This provides dramatically improved efficiency in analysing large datasets, for minimal additional effort in the client code.

From the module docstring...

"""

Virtual arrays of arbitrary size, with arithmetic and statistical operations, and conversion to NumPy ndarrays.  
Virtual arrays can be stacked to increase their dimensionality, or tiled to increase their extent.  
Includes support for easily wrapping data sources which produce NumPy ndarray objects via slicing.  
    For example: netcdf4python Variable instances, and NumPy ndarray instances.  
All operations are performed in a lazy fashion to avoid overloading system resources.  
Conversion to a concrete NumPy ndarray requires an explicit method call.  

"""

(That's pretty much it, so we can go home.)


The longest journey ...


In [9]:
import biggus
print biggus.__version__


0.7.0-alpha

A simple operation and the biggus equivalent


In [10]:
import numpy as np
array_1 = np.array([[1., 5., 2.], [7., 6., 5.]])
print 'simple array :'
print array_1
print
print 'shape :', array_1.shape
mean_a1 = array_1.mean(axis=1)
print
print 'mean over axis 1 :'
print mean_a1


simple array :
[[ 1.  5.  2.]
 [ 7.  6.  5.]]

shape : (2, 3)

mean over axis 1 :
[ 2.66666667  6.        ]

In [11]:
lazy_1 = biggus.NumpyArrayAdapter(array_1)
print 'a lazy array : ', lazy_1


a lazy array :  <NumpyArrayAdapter shape=(2, 3) dtype=dtype('float64')>

In [12]:
lazy_mean = biggus.mean(lazy_1, axis=1)
print 'lazy mean :', lazy_mean
print 'source shape :', array_1.shape
print
print 'lazy mean *result* :'
print lazy_mean.ndarray()


lazy mean : <_Aggregation shape=(2,) dtype=dtype('float64')>
source shape : (2, 3)

lazy mean *result* :
[ 2.66666667  6.        ]

Same as before...

But this time, change the source data between forming the mean and evaluating it.


In [13]:
lazy_mean = biggus.mean(lazy_1, axis=1)
print lazy_mean
array_1[0, :] = -1
print lazy_mean.ndarray()


<_Aggregation shape=(2,) dtype=dtype('float64')>
[-1.  6.]
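Statistics are not the only lazy operations: elementwise arithmetic is deferred in just the same way. A minimal sketch, assuming the biggus.add element-wise function (this version also provides biggus.sub):

# combine two lazy arrays elementwise -- nothing is computed yet
lazy_sum = biggus.add(lazy_1, lazy_1)
print lazy_sum

# evaluation only happens at the explicit conversion
print lazy_sum.ndarray()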

Anything that 'looks like' an array can be wrapped as a biggus.Array.

It just needs to support the minimal numpy-array-like interface : "shape", "dtype" and "__getitem__".
Here's a tiny example that "looks like" an array full of a constant, and indicates clearly when it is actually accessed.


In [14]:
class constant_array(object):
    """A minimal array-like : just "shape", "dtype" and "__getitem__"."""
    def __init__(self, shape, value=0.0):
        self.shape = shape
        self.dtype = np.float
        self._value = value

    def __getitem__(self, indices):
        # Announce each read, to show exactly when data is fetched.
        print 'accessing :', indices
        return self._value * np.ones(self.shape)[indices]

lazy_const = biggus.NumpyArrayAdapter(constant_array((2, 3, 4), value=3.5))
print lazy_const
one_element = lazy_const[0, 1]
print one_element
print one_element.ndarray()


<NumpyArrayAdapter shape=(2, 3, 4) dtype=<type 'float'>>
<NumpyArrayAdapter shape=(4,) dtype=<type 'float'>>
accessing : (0, 1)
[ 3.5  3.5  3.5  3.5]

In fact, this functionality is already provided in biggus

(without the evaluation debug).

It is called a ConstantArray.


In [15]:
lazy_const = biggus.ConstantArray((2, 3, 4), 3.77)
print lazy_const
one_element = lazy_const[0, 1]
print one_element
print one_element.ndarray()


<ConstantArray shape=(2, 3, 4) dtype=dtype('float64')>
<ConstantArray shape=(4,) dtype=dtype('float64')>
[ 3.77  3.77  3.77  3.77]
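The docstring also promised stacking and tiling. A minimal sketch, assuming the ArrayStack and LinearMosaic classes as they appear in this version of biggus:

part = biggus.NumpyArrayAdapter(np.arange(6.).reshape(2, 3))

# stack two 2x3 arrays along a new leading axis : shape (2, 2, 3)
parts = np.empty(2, dtype=object)
parts[0] = part
parts[1] = part
stacked = biggus.ArrayStack(parts)
print stacked

# tile the same two arrays along an existing axis : shape (2, 6)
tiled = biggus.LinearMosaic([part, part], axis=1)
print tiled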

But what is the point?

SPACE
and
TIME

See : https://github.com/SciTools/biggus

Rationale

As scientific datasets continue to grow exponentially in size, the resource requirements of even simple analyses can quickly become a problem -- e.g. the job takes an unreasonably long time, or simply runs out of space.

Existing solutions to this may be powerful, but can also come with a large complexity overhead, especially for the non-expert user. This makes adapting an existing analysis to the needs of larger datasets potentially very costly.

Biggus provides simple abstractions of data access and calculations which provide lazy evaluation. It exposes this as a simple virtual array object which mimics a numpy array. Thus, it does not require the user to re-cast an operation in unfamiliar terms, or to specify unrelated details of data storage or concurrency.

The lazy evaluation approach allows optimised resource usage for both storage accesses and the parallelisation of calculations. Pure Python is well suited to describing and implementing these techniques, and the resulting implementation is easily accessible to the average user.
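In practice, this means an analysis can be written against data that never fits in memory. A sketch of the typical pattern, assuming the netCDF4 library and a hypothetical file 'big_file.nc' holding a large variable 'tas':

import biggus
import netCDF4

# a netCDF4 Variable supports shape, dtype and slicing, so it wraps directly
nc = netCDF4.Dataset('big_file.nc')
lazy_data = biggus.NumpyArrayAdapter(nc.variables['tas'])

# describe a time-mean over the first axis -- nothing is read yet
lazy_mean = biggus.mean(lazy_data, axis=0)

# evaluation streams the source in sections,
# so the whole variable never has to be loaded at once
result = lazy_mean.ndarray()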

At the Met Office, we develop data analysis tools for a large community of research scientists. We developed Biggus as a resource for the Iris project, our next-generation analysis library for meteorological and oceanographic data (see: http://scitools.org.uk/iris/). While Biggus is still work-in-progress, within Iris it is already delivering significant benefit in tasks such as cataloguing large datasets and accelerating statistical calculations, where performance already exceeds that of other standard software toolsets.

Schedule

the problems Biggus is aiming to solve, and techniques employed
an overview of the architecture and code of the current implementation
a demonstration of current performance, in ease-of-use and efficiency benefits
suggestions for future developments; how to get involved; questions

Supporting URLs

Iris project (core developer) : http://scitools.org.uk/iris/
Biggus project (own project : object wrapping for netcdf manipulations) : https://github.com/SciTools/biggus
