Aug 30th 2014
Patrick Peglar - UK Met Office, Exeter
Biggus is a lightweight pure-Python package which implements lazy operations on numpy array-like objects. This provides dramatically improved efficiency in analysing large datasets, for minimal additional effort in the client code.
"""
Virtual arrays of arbitrary size, with arithmetic and statistical operations, and conversion to NumPy ndarrays.
Virtual arrays can be stacked to increase their dimensionality, or tiled to increase their extent.
Includes support for easily wrapping data sources which produce NumPy ndarray objects via slicing.
For example: netcdf4python Variable instances, and NumPy ndarray instances.
All operations are performed in a lazy fashion to avoid overloading system resources.
Conversion to a concrete NumPy ndarray requires an explicit method call.
"""
!
In [9]:
import biggus
print biggus.__version__
In [10]:
import numpy as np
array_1 = np.array([[1., 5., 2.], [7., 6., 5.]])
print 'simple array :'
print array_1
print
print 'shape :', array_1.shape
mean_a1 = array_1.mean(axis=1)
print
print 'mean over axis 1 :'
print mean_a1
In [11]:
lazy_1 = biggus.NumpyArrayAdapter(array_1)
print 'a lazy array : ', lazy_1
In [12]:
lazy_mean = biggus.mean(lazy_1, axis=1)
print 'lazy mean :', lazy_mean
print 'shape :', array_1.shape
print
print 'lazy mean *result* :'
print lazy_mean.ndarray()
In [13]:
lazy_mean = biggus.mean(lazy_1, axis=1)
print lazy_mean
array_1[0,:] = -1
print lazy_mean.ndarray()
In [14]:
class constant_array(object):
def __init__(self, shape, value=0.0):
self.shape = shape
self.dtype = np.float
self._value = value
def __getitem__(self, indices):
print 'accessing :', indices
return self._value * np.ones(self.shape)[indices]
lazy_const_2x3 = biggus.NumpyArrayAdapter(constant_array((2, 3, 4), value=3.5))
print lazy_const_2x3
one_element = lazy_const_2x3[0, 1]
print one_element
print one_element.ndarray()
In [15]:
lazy_const_2x3 = biggus.ConstantArray((2, 3, 4), 3.77)
print lazy_const_2x3
one_element = lazy_const_2x3[0, 1]
print one_element
print one_element.ndarray()
See : https://github.com/SciTools/biggus
Rationale
As scientific datasets continue to grow exponentially in size, the resource requirements of even simple analyses can quickly grow to become a problem -- e.g. the job takes an unreasonably long time, or simply runs out of space.
Existing solutions to this may be powerful, but can also come with a large complexity overhead, especially for the non-expert user. This makes adapting an existing analysis to the needs of larger datasets potentially very costly.
Biggus provides simple abstractions of data access and calculations which provide lazy evaluation. It exposes this as simple virtual array object which mimics a numpy array. Thus, it does not require the user to re-cast an operation in unfamiliar terms, or specify unrelated details of data storage or concurrency factors.
The lazy evaluation approach allows optimised resource usage for both storage accesses and the parallelisation of calculations. Pure Python is well suited to describing and implementing these techniques, and the resulting implementation is easily accessible to the average user.
At the Met Office, we develop data analysis tools for a large community of research scientists. We developed Biggus as a resource for the Iris project, our next-generation analysis library for meteorological and oceanographic data (see: http://scitools.org.uk/iris/). While Biggus is still work-in-progress, within Iris it is already delivering significant benefit, in tasks such as catalogueing large datasets and accelerating statistical calculations. Here, performance already exceeds that of other standard software toolsets. Schedule
the problems Biggus is aiming to solve, and techniques employed
an overview of the architecture and code of the current implementation
a demonstration of current performance, in ease-of-use and efficiency benefits
suggestions for future developments; how to get involved; questions
Supporting URLs Iris project (core developer) Biggus project (own project : object wrapping for netcdf manipulations)
In [15]: