In [1]:
from IPython.display import Image
import matplotlib.pyplot as plt
%run talktools


Efficient large data operations with Biggus


Aug 30th 2014

<img src="MO_Master_W.jpg", width=150, align="right">

Patrick Peglar - UK Met Office, Exeter

Biggus: https://github.com/SciTools/biggus




( this talk: https://github.com/pp-mo/biggus_talk )

Overview

Biggus is a lightweight pure-Python package which implements lazy operations on numpy array-like objects. This provides dramatically improved efficiency in analysing large datasets, for minimal additional effort in the client code.

Biggus makes data analysis on large datasets easier

  • avoids scalability problems
  • makes code clearer and simpler
  • uses less space and time

The key technique is "lazy evaluation"

Aka "don't do it until you have to"

This goes by several near-interchangeable names:

  • "VIRTUAL" arrays
  • "LAZY" evaluation
  • "DEFERRED" evaluation

Biggus provides 'virtual' arrays.

A virtual array does not need the actual values of its data content ...

  • ... but it knows all about it:
    • size and shape
    • datatype
    • how to load (any given part of) it

Virtual arrays can be processed in the usual ways

  • these manipulations don't process actual data
  • actual results are obtained from the result arrays, but only on request

The 'deferred' operation is the source of all the benefits ...

  • copes with arbitrary data sizes
  • optimises data access
  • separates data access from analysis concerns

The key features Biggus provides

  • virtual arrays of arbitrary size
  • combine and extract
  • statistics
  • arithmetic
  • efficient evaluation

What's to come:

  • dead-simple example
  • what it's good for
  • summary of current features
  • how it works (in outline)
  • current work and future directions

The longest journey ...


In [2]:
import biggus
print biggus.__version__


0.7.0-alpha

A simple 'mean' calculation ...

(and so, later, the biggus equivalent)


In [3]:
import numpy as np
array_1 = np.array([[1., 5., 2.], [7., 6., 5.]])
print 'simple array :\n', array_1
mean_a1 = array_1.mean(axis=1)
print
print 'mean over axis 1 :\n',  mean_a1


simple array :
[[ 1.  5.  2.]
 [ 7.  6.  5.]]

mean over axis 1 :
[ 2.66666667  6.        ]

Simple mean calculation: The Biggus equivalent


In [4]:
from biggus import NumpyArrayAdapter as npwrap
lazy_1 = npwrap(array_1)
print 'a lazy array : ', lazy_1


a lazy array :  <NumpyArrayAdapter shape=(2, 3) dtype=dtype('float64')>

... lazy mean and result


In [5]:
lazy_mean = biggus.mean(lazy_1, axis=1)
print 'lazy mean :', lazy_mean
print
print 'lazy mean *result* :\n', lazy_mean.ndarray()
print
print 'same as original ...:\n', mean_a1


lazy mean : <_Aggregation shape=(2,) dtype=dtype('float64')>

lazy mean *result* :
[ 2.66666667  6.        ]

same as original ...:
[ 2.66666667  6.        ]

Repeat that ...

But this time, change the source data between forming the mean and evaluating it.


In [6]:
lazy_mean2 = biggus.mean(lazy_1, axis=1)
print lazy_mean2
array_1[0,:] = -1
print lazy_mean2.ndarray()


<_Aggregation shape=(2,) dtype=dtype('float64')>
[-1.  6.]

So... what is the point?

  • SPACE (memory usage)
  • TIME (optimised operation, especially data i/o)
  • CODE (convenience; clarity; separation of concerns)

SPACE (resource efficiency)

  • data grows
  • eventually you can't fit all the data

The space problem - a historical example

<img src="Um_History.png", align='left', width=600>

Which means...

  • A single forecast datafield now occupies over 200 MB
  • there are dozens of stored parameter fields, and dozens of timesteps per run
  • we do a new run every 6 hours

TIME

  • avoid re-reading anything
  • pass sections of data to all consumer processes "in parallel"
  • eventually / potentially this leads to distributed processing
    • (i.e. processes not co-hosted)

CODE

The easy way to define a data analysis process

  • grab all the data
  • write analysis code
  • don't worry about resources / efficiency

Code using biggus looks very similar, but

  • deferred execution means it scales with data size
  • no need to confuse the code with data access concerns

Aside: the use of biggus in Iris

Features Summary:

  1. arrays and indexing
  2. combining arrays : stack + mosaic
  3. statistics
  4. arithmetic

The uniformity here is key:

  • all operations produce an 'Array' object
  • all operations can be combined freely
  • so... the lazy representations allow efficient evaluation

Features #1: Arrays and Indexing

Anything that 'looks like' an array can be cast as a biggus.Array

All types of Biggus array support the major indexing operations... <img src='IndexingStyles.png' align='left'>
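
As a small added sketch (not from the original talk): indexing a wrapped array stays lazy, and nothing is read until a result is requested.

# Each indexing operation yields a new virtual array; no data values are
# read until .ndarray() is requested.
big = npwrap(np.arange(24.0).reshape(4, 6))
print big[2]               # integer indexing -> shape (6,)
print big[1:3]             # slicing          -> shape (2, 6)
print big[1:3].ndarray()   # only now is any data actually accessed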

So, what is "an array" ?

It just needs to support the minimum numpy-array-like properties :

  • shape
  • dtype
  • __getitem__ (i.e. indexing)

Here's a tiny example that "looks like" an array full of a constant value

(and indicates clearly when it is actually accessed)


In [7]:
class constant_array(object):
    """Emulate an array filled with a single constant value."""
    def __init__(self, shape, value=0.0):
        self.shape = shape
        self.dtype = np.array(value).dtype
        self._value = value
    def __getitem__(self, indices):
        # Announce each actual data access.
        print '    !!! accessing :', indices
        return self._value * np.ones(self.shape)[indices]

lazy_const_234 = npwrap(constant_array((2, 3, 4), value=3.5))
print 'lazy_234:', lazy_const_234
const_section = lazy_const_234[0, 1]
print 'section:', const_section
print '\nresult:\n', const_section.ndarray()


lazy_234: <NumpyArrayAdapter shape=(2, 3, 4) dtype=dtype('float64')>
section: <NumpyArrayAdapter shape=(4,) dtype=dtype('float64')>

result:
    !!! accessing : (0, 1)
[ 3.5  3.5  3.5  3.5]

In fact, this functionality is already provided in biggus

(without the evaluation debug).

It is called a ConstantArray.


In [8]:
lazy_const = biggus.ConstantArray((2, 3, 4), 3.77)
print lazy_const
const_section = lazy_const[0, 1]
print const_section
print const_section.ndarray()


<ConstantArray shape=(2, 3, 4) dtype=dtype('float64')>
<ConstantArray shape=(4,) dtype=dtype('float64')>
[ 3.77  3.77  3.77  3.77]
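
Since no values exist until they are requested, a ConstantArray of any notional size is essentially free to create. A small added illustration (not from the original talk):

# This array would occupy ~8 GB as a real float64 ndarray, but the virtual
# version costs almost nothing to create.
huge = biggus.ConstantArray((1000, 1000, 1000), 1.0)
print huge
section = huge[0, 0]             # still virtual: shape (1000,)
print section.ndarray().sum()    # only these 1000 values are ever generated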

Features #2: Combining Arrays

Arrays can be combined into larger arrays

Two methods:

  1. "stack" combines Arrays along a new dimension
  2. "mosaic" joins arrays edge-to-edge

ArrayStack

Analogous to numpy "stack" operations ('hstack', 'vstack' etc.).

Given multiple same-shaped Arrays, it creates a new dimension to index over them.

<img src='ArrayStacking.png' align='left'>

ArrayStack code example:


In [9]:
arrays = [num * np.ones(3) for num in range(1, 5)]
print 'array#2:', arrays[2]
print
lazy_arrays = np.array([npwrap(array) for array in arrays])
stack = biggus.ArrayStack(lazy_arrays)
print stack
print stack.ndarray()


array#2: [ 3.  3.  3.]

<ArrayStack shape=(4, 3) dtype=dtype('float64')>
[[ 1.  1.  1.]
 [ 2.  2.  2.]
 [ 3.  3.  3.]
 [ 4.  4.  4.]]

LinearMosaic

Analogous to numpy "concatenate" operations.

  • joins "edge to edge"
  • does not create a new dimension

LinearMosaic code example:


In [10]:
# Make one of the component arrays smaller than the others
lazy_arrays[2] = lazy_arrays[2][:1]
# Combine into one
mosaic = biggus.LinearMosaic(lazy_arrays, axis=0)
print mosaic
print mosaic.ndarray()


<LinearMosaic shape=(10,) dtype=dtype('float64')>
[ 1.  1.  1.  2.  2.  2.  3.  4.  4.  4.]

Note:

  • this handles irregular sizes (along the joining axis)
  • the join is seamless

Combined usage: code example

With a nested construction, we can make a complex patchwork + extract 2d regions seamlessly ...

<img src='TiledIndexing.png' width=600>

The pink cells show separate data sources, joined together with nested 'LinearMosaic's.

The extracted area is a new virtual array, seamlessly extracted across the original boundaries. (The code that builds this figure is in the final cell of this notebook.)

Features #3: Statistics

There are basic functions to perform statistics on an Array ('mean', 'std', 'count', 'var').

We've already seen an example of this:


In [11]:
mean = biggus.mean(lazy_arrays[0], axis=0)
print mean


<_Aggregation shape=() dtype=dtype('float64')>

The other operations simply follow the same pattern.

Statistical functions must be individually implemented:

  • the algorithm is recast in a standard form (subclass '_Aggregation') ...
  • data is processed in sections known as "chunks", with the simple pattern (sketched below):
    • initialise
    • process chunks, as we are given them
    • finalise

Chunking is managed by an "evaluation engine" (see below).

However, a statistical operation can specify constraints on its chunking process.

  • for instance, a (future) median function would need all the data on the aggregation axis
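
To make the initialise / process-chunks / finalise pattern concrete, here is a minimal sketch for a streaming mean. (Illustrative only: 'ChunkedMean' and its methods are invented for this sketch, and are not part of biggus's actual '_Aggregation' API.)

# Illustrative only -- NOT the real biggus '_Aggregation' internals.
class ChunkedMean(object):
    def __init__(self):
        # initialise: empty running totals
        self._total = 0.0
        self._count = 0
    def process_chunk(self, chunk):
        # process chunks as we are given them: one pass, bounded memory
        self._total += chunk.sum()
        self._count += chunk.size
    def finalise(self):
        # finalise: combine the partial results
        return self._total / self._count

agg = ChunkedMean()
for chunk in np.array_split(np.arange(10.0), 3):
    agg.process_chunk(chunk)
print agg.finalise()    # 4.5, the same as np.arange(10.0).mean()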

Features #4: Array Arithmetic

Basic lazy arithmetic functions are provided ('add', 'sub').

Code example ...


In [12]:
values = npwrap(np.array([1.0, 2.0, 3.0]))
constants = biggus.ConstantArray((3,), 1000.0)
total = biggus.add(values, constants)
print 'sum:', total
print 'sum results:\n', total.ndarray()


sum: <_Elementwise shape=(3,) dtype=dtype('float64')>
sum results:
[ 1001.  1002.  1003.]

The implementation of these is much more straightforward than statistics:

  • the function is an '_Elementwise' instance, wrapping the required (numpy) function
  • it doesn't care how the source data is chunked, so chunk processing is generalised (see the sketch below)
  • so, they are somewhat like numpy 'ufuncs'
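
A quick illustration of why that generalisation is safe: applying the wrapped numpy function chunk-by-chunk gives the same answer as applying it to the whole arrays, so any chunking scheme will do. ('apply_in_chunks' is invented here purely for illustration; it is not a biggus function.)

# Hypothetical illustration: elementwise results don't depend on chunking.
def apply_in_chunks(func, a, b, n_chunks):
    pairs = zip(np.array_split(a, n_chunks), np.array_split(b, n_chunks))
    return np.concatenate([func(part_a, part_b) for part_a, part_b in pairs])

a = np.arange(6.0)
b = 10.0 * np.ones(6)
print apply_in_chunks(np.add, a, b, 3)    # identical to np.add(a, b)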

How it works

in brief: Parallel Evaluation

The point is to avoid the need to make multiple passes over large datasets, while keeping memory usage within sensible limits.

The abstraction of operations provided by the lazy constructs means we can also take advantage of parallelism.

Evaluation example:

a possible processing scheme ...

<img src="ExpressionGraph.png", align='right', width=600>

B_mean = biggus.mean(B, axis=0)
diff = biggus.sub(A, B_mean)
X = biggus.count(diff)
Y = biggus.mean(diff)
Z = biggus.std(diff)    
x, y, z = \
  biggus.ndarrays((X, Y, Z))

Parallel Evaluation code example ...


In [13]:
# Make test data (ordinary numpy arrays)
src_a = np.arange(10.0) % 3
src_b = np.arange(40.0).reshape(10, 4) % 7
print src_a


[ 0.  1.  2.  0.  1.  2.  0.  1.  2.  0.]

In [14]:
# Process in biggus to make 3 lazy results. 
A = npwrap(src_a)
B = npwrap(src_b)
B_mean = biggus.mean(B, axis=1)
diff = biggus.sub(A, B_mean)
X = biggus.count(diff, axis=0)
Y = biggus.mean(diff, axis=0)
Z = biggus.std(diff, axis=0)
print X


<_Aggregation shape=() dtype=dtype('int32')>

In [15]:
# Evaluate.
x, y, z = biggus.ndarrays((X, Y, Z))
print x, y, z


10 -1.975 1.27205542332

  • As evaluation produces a stream of "chunks", each consumer must process every chunk.
  • To gain speed, the consumers of the data can also operate concurrently, as sketched below.
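
A hypothetical single-pass sketch of that idea (not biggus's actual evaluation engine, which additionally runs the consumers concurrently; a production streaming 'std' would also use a numerically stabler update):

# One stream of chunks feeding three consumers: the source is read only once.
data = np.arange(12.0)
total, count, total_sq = 0.0, 0, 0.0
for chunk in np.array_split(data, 4):    # the "stream of chunks"
    total += chunk.sum()                 # consumer feeding the mean
    count += chunk.size                  # consumer feeding the count
    total_sq += (chunk ** 2).sum()       # consumer feeding the std
mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)
print count, mean, std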

Content Review

Current status

  • existing features are immediately useful (e.g. Iris)
    • virtual arrays
    • combine and extract
    • arithmetic and statistics
    • efficient evaluation
  • missing features - still lots

Future directions

  • direct-to-file saving, with parallel evaluation
  • more intelligent chunking strategies
  • more statistics
  • whatever gets asked for ...

Winding up:

"Forget all your memory worries; calculate faster and easier with ..."
https://github.com/SciTools/biggus

Biggus is:

  • a fully open-source independent project
  • licensed under LGPL
  • developed by contributors around the world
  • ready for real-world uses
  • a key dependency of Iris (https://github.com/SciTools/iris)

Just using it is a contribution

We look forward to your contribution!




( this talk : https://github.com/pp-mo/biggus_talk )

END


In [16]:
# (Appendix) Build the tiled-mosaic example pictured in "Combined usage" above.
def ta(value, dims):
    return npwrap(value * np.ones(dims))

row1 = biggus.LinearMosaic(np.array([ta(1, (3, 5)), ta(2, (3, 2)), ta(3, (3, 2))]), axis=1)
row2 = biggus.LinearMosaic(np.array([ta(5, (5, 2)), ta(6, (5, 2)), ta(7, (5, 5))]), axis=1)
tiles2d = biggus.LinearMosaic(np.array([row1, row2]), axis=0)

part2d = tiles2d[1:6, 1:7]
whole, part = tiles2d.ndarray(), part2d.ndarray()

print 'whole:', whole
print 'part:', part
plt.figure(figsize=(12,4))
plt.subplot(1, 2, 1); plt.pcolormesh(whole, edgecolor='black');
plt.plot([1, 7, 7, 1, 1], [1, 1, 6, 6, 1], color='red', linewidth=5)
plt.subplot(1, 2, 2); plt.pcolormesh(part, edgecolor='black')


whole: [[ 1.  1.  1.  1.  1.  2.  2.  3.  3.]
 [ 1.  1.  1.  1.  1.  2.  2.  3.  3.]
 [ 1.  1.  1.  1.  1.  2.  2.  3.  3.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]]
part: [[ 1.  1.  1.  1.  2.  2.]
 [ 1.  1.  1.  1.  2.  2.]
 [ 5.  6.  6.  7.  7.  7.]
 [ 5.  6.  6.  7.  7.  7.]
 [ 5.  6.  6.  7.  7.  7.]]
Out[16]:
<matplotlib.collections.QuadMesh at 0xb114deac>