In [1]:
from IPython.display import Image
import matplotlib.pyplot as plt
%run talktools


Efficient large data operations with Biggus


Aug 30th 2014

<img src="MO_Master_W.jpg", width=150, align="right">

Patrick Peglar - UK Met Office, Exeter

Biggus: https://github.com/SciTools/biggus




( this talk: https://github.com/pp-mo/biggus_talk )

Overview

Biggus is a lightweight pure-Python package which implements lazy operations on numpy array-like objects. This provides dramatically improved efficiency in analysing large datasets, for minimal additional effort in the client code.

Biggus makes data analysis on large datasets easier

  • avoids scalability problems
  • makes code clearer and simpler
  • uses less space and time

The key technique is "lazy evaluation"

Aka "don't do it until you have to"

This goes by several near-interchangeable names:

  • "VIRTUAL" arrays
  • "LAZY" evaluation
  • "DEFERRED" evaluation

Biggus provides 'virtual' arrays.

A virtual array does not need the actual values of its data content ...

  • ... but it knows all about it:
    • size and shape
    • datatype
    • how to load (any given part of) it

Virtual arrays can be processed in the usual ways

  • these manipulations don't process actual data
  • actual results are obtained from the result arrays, but only on request

The 'deferred' operation is the source of all the benefits ...

  • copes with arbitrary data sizes
  • optimises data access
  • separates data access from analysis concerns

The key features Biggus provides

  • virtual arrays of arbitrary size
  • combine and extract
  • statistics
  • arithmetic
  • efficient evaluation

What's to come:

  • dead-simple example
  • what it's good for
  • summary of current features
  • how it works (in outline)
  • current work and future directions

The longest journey ...


In [2]:
import biggus
print biggus.__version__


0.7.0-alpha

A simple 'mean' calculation ...

(and so, later, the biggus equivalent)


In [3]:
import numpy as np
array_1 = np.array([[1., 5., 2.], [7., 6., 5.]])
print 'simple array :\n', array_1
mean_a1 = array_1.mean(axis=1)
print
print 'mean over axis 1 :\n',  mean_a1


simple array :
[[ 1.  5.  2.]
 [ 7.  6.  5.]]

mean over axis 1 :
[ 2.66666667  6.        ]

Simple mean calculation: The Biggus equivalent


In [4]:
from biggus import NumpyArrayAdapter as npwrap
lazy_1 = npwrap(array_1)
print 'a lazy array : ', lazy_1


a lazy array :  <NumpyArrayAdapter shape=(2, 3) dtype=dtype('float64')>

... lazy mean and result


In [5]:
lazy_mean = biggus.mean(lazy_1, axis=1)
print 'lazy mean :', lazy_mean
print
print 'lazy mean *result* :\n', lazy_mean.ndarray()
print
print 'same as original ...:\n', mean_a1


lazy mean : <_Aggregation shape=(2,) dtype=dtype('float64')>

lazy mean *result* :
[ 2.66666667  6.        ]

same as original ...:
[ 2.66666667  6.        ]

Repeat that ...

But this time, change the source data between forming the mean and evaluating it.


In [6]:
lazy_mean2 = biggus.mean(lazy_1, axis=1)
print lazy_mean2
array_1[0,:] = -1
print lazy_mean2.ndarray()


<_Aggregation shape=(2,) dtype=dtype('float64')>
[-1.  6.]

So... what is the point?

  • SPACE (memory usage)
  • TIME (optimised operation, especially data i/o)
  • CODE (convenience; clarity; separation of concerns)

SPACE (resource efficiency)

  • data grows
  • eventually you can't fit all the data

The space problem - a historical example

<img src="Um_History.png", align='left', width=600>

Which means...

  • A single forecast datafield now occupies over 200 MB
  • there are dozens of stored parameter fields, and dozens of timesteps per run
  • we do a new run every 6 hours

TIME

  • avoid re-reading anything
  • pass sections of data to all consumer processes "in parallel"
  • eventually / potentially this leads to distributed processing
    • (i.e. processes not co-hosted)

CODE

The easy way to define a data analysis process

  • grab all the data
  • write analysis code
  • don't worry about resources / efficiency

Code using biggus looks very similar, but

  • deferred execution means it scales with data size
  • no need to confuse the code with data access concerns

Aside: the use of biggus in Iris

Features Summary:

  1. arrays and indexing
  2. combining arrays : stack + mosaic
  3. statistics
  4. arithmetic

The uniformity here is key:

  • all operations produce an 'Array' object
  • all operations can be combined freely
  • so... the lazy representations allow efficient evaluation

Features #1: Arrays and Indexing

Anything that 'looks like' an array can be cast as a biggus.Array

All types of Biggus array support the major indexing operations... <img src='IndexingStyles.png' align='left'>
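
As a small added sketch (not from the original talk): indexing a wrapped array stays lazy, and nothing is read until a result is requested.

# Each indexing operation yields a new virtual array; no data values are
# read until .ndarray() is requested.
big = npwrap(np.arange(24.0).reshape(4, 6))
print big[2]               # integer indexing -> shape (6,)
print big[1:3]             # slicing          -> shape (2, 6)
print big[1:3].ndarray()   # only now is any data actually accessed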

So, what is "an array" ?

It just needs to support the minimum numpy-array-like properties :

  • shape
  • dtype
  • __getitem__ (i.e. indexing)

Here's a tiny example that "looks like" an array full of a constant value

(and indicates clearly when it is actually accessed)


In [7]:
class constant_array(object):
    """Emulate an array filled with a single constant value."""
    def __init__(self, shape, value=0.0):
        self.shape = shape
        self.dtype = np.array(value).dtype
        self._value = value
    def __getitem__(self, indices):
        # Announce each actual data access.
        print '    !!! accessing :', indices
        return self._value * np.ones(self.shape)[indices]

lazy_const_234 = npwrap(constant_array((2, 3, 4), value=3.5))
print 'lazy_234:', lazy_const_234
const_section = lazy_const_234[0, 1]
print 'section:', const_section
print '\nresult:\n', const_section.ndarray()


lazy_234: <NumpyArrayAdapter shape=(2, 3, 4) dtype=dtype('float64')>
section: <NumpyArrayAdapter shape=(4,) dtype=dtype('float64')>

result:
    !!! accessing : (0, 1)
[ 3.5  3.5  3.5  3.5]

In fact, this functionality is already provided in biggus

(without the evaluation debug).

It is called a ConstantArray.


In [8]:
lazy_const = biggus.ConstantArray((2, 3, 4), 3.77)
print lazy_const
const_section = lazy_const[0, 1]
print const_section
print const_section.ndarray()


<ConstantArray shape=(2, 3, 4) dtype=dtype('float64')>
<ConstantArray shape=(4,) dtype=dtype('float64')>
[ 3.77  3.77  3.77  3.77]
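
Since no values exist until they are requested, a ConstantArray of any notional size is essentially free to create. A small added illustration (not from the original talk):

# This array would occupy ~8 GB as a real float64 ndarray, but the virtual
# version costs almost nothing to create.
huge = biggus.ConstantArray((1000, 1000, 1000), 1.0)
print huge
section = huge[0, 0]             # still virtual: shape (1000,)
print section.ndarray().sum()    # only these 1000 values are ever generated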

Features #2: Combining Arrays

Arrays can be combined into larger arrays

Two methods:

  1. "stack" combines Arrays along a new dimension
  2. "mosaic" joins arrays edge-to-edge

ArrayStack

Analogous to numpy "stack" operations ('hstack', 'vstack' etc.).

Given multiple same-shaped Arrays, it creates a new dimension to index over them.

<img src='ArrayStacking.png' align='left'>

ArrayStack code example:


In [9]:
arrays = [num * np.ones(3) for num in range(1, 5)]
print 'array#2:', arrays[2]
print
lazy_arrays = np.array([npwrap(array) for array in arrays])
stack = biggus.ArrayStack(lazy_arrays)
print stack
print stack.ndarray()


array#2: [ 3.  3.  3.]

<ArrayStack shape=(4, 3) dtype=dtype('float64')>
[[ 1.  1.  1.]
 [ 2.  2.  2.]
 [ 3.  3.  3.]
 [ 4.  4.  4.]]

LinearMosaic

Analogous to numpy "concatenate" operations.

  • joins "edge to edge"
  • does not create a new dimension

LinearMosaic code example:


In [10]:
# Make one of the component arrays smaller than the others
lazy_arrays[2] = lazy_arrays[2][:1]
# Combine into one
mosaic = biggus.LinearMosaic(lazy_arrays, axis=0)
print mosaic
print mosaic.ndarray()


<LinearMosaic shape=(10,) dtype=dtype('float64')>
[ 1.  1.  1.  2.  2.  2.  3.  4.  4.  4.]

Note:

  • this handles irregular sizes (along the joining axis)
  • the join is seamless

Combined usage: code example

With a nested construction, we can make a complex patchwork + extract 2d regions seamlessly ...

<img src='TiledIndexing.png' width=600>

The pink cells show separate data sources, joined together with nested 'LinearMosaic's.

The extracted area is a new virtual array, seamlessly extracted across the original boundaries. (The code that builds this figure is in the final cell of this notebook.)

Features #3: Statistics

There are basic functions to perform statistics on an Array ('mean', 'std', 'count', 'var').

We've already seen an example of this:


In [11]:
mean = biggus.mean(lazy_arrays[0], axis=0)
print mean


<_Aggregation shape=() dtype=dtype('float64')>

The other operations simply follow the same pattern.

Statistical functions must be individually implemented:

  • the algorithm is recast in a standard form (subclass '_Aggregation') ...
  • data is processed in sections known as "chunks", with the simple pattern (sketched below):
    • initialise
    • process chunks, as we are given them
    • finalise

Chunking is managed by an "evaluation engine" (see below).

However, a statistical operation can specify constraints on its chunking process.

  • for instance, a (future) median function would need all the data on the aggregation axis
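
To make the initialise / process-chunks / finalise pattern concrete, here is a minimal sketch for a streaming mean. (Illustrative only: 'ChunkedMean' and its methods are invented for this sketch, and are not part of biggus's actual '_Aggregation' API.)

# Illustrative only -- NOT the real biggus '_Aggregation' internals.
class ChunkedMean(object):
    def __init__(self):
        # initialise: empty running totals
        self._total = 0.0
        self._count = 0
    def process_chunk(self, chunk):
        # process chunks as we are given them: one pass, bounded memory
        self._total += chunk.sum()
        self._count += chunk.size
    def finalise(self):
        # finalise: combine the partial results
        return self._total / self._count

agg = ChunkedMean()
for chunk in np.array_split(np.arange(10.0), 3):
    agg.process_chunk(chunk)
print agg.finalise()    # 4.5, the same as np.arange(10.0).mean()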

Features #4: Array Arithmetic

Basic lazy arithmetic functions are provided ('add', 'sub').

Code example ...


In [12]:
values = npwrap(np.array([1.0, 2.0, 3.0]))
constants = biggus.ConstantArray((3,), 1000.0)
total = biggus.add(values, constants)
print 'sum:', total
print 'sum results:\n', total.ndarray()


sum: <_Elementwise shape=(3,) dtype=dtype('float64')>
sum results:
[ 1001.  1002.  1003.]

The implementation of these is much more straightforward than statistics:

  • the function is an '_Elementwise' instance, wrapping the required (numpy) function
  • it doesn't care how the source data is chunked, so chunk processing is generalised (see the sketch below)
  • so, they are somewhat like numpy 'ufuncs'
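
A quick illustration of why that generalisation is safe: applying the wrapped numpy function chunk-by-chunk gives the same answer as applying it to the whole arrays, so any chunking scheme will do. ('apply_in_chunks' is invented here purely for illustration; it is not a biggus function.)

# Hypothetical illustration: elementwise results don't depend on chunking.
def apply_in_chunks(func, a, b, n_chunks):
    pairs = zip(np.array_split(a, n_chunks), np.array_split(b, n_chunks))
    return np.concatenate([func(part_a, part_b) for part_a, part_b in pairs])

a = np.arange(6.0)
b = 10.0 * np.ones(6)
print apply_in_chunks(np.add, a, b, 3)    # identical to np.add(a, b)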

How it works

in brief: Parallel Evaluation

The point is to avoid the need to make multiple passes over large datasets, while keeping memory usage within sensible limits.

The abstraction of operations provided by the lazy constructs means we can also take advantage of parallelism.

Evaluation example:

a possible processing scheme ...

<img src="ExpressionGraph.png", align='right', width=600>

B_mean = biggus.mean(B, axis=0)
diff = biggus.sub(A, B_mean)
X = biggus.count(diff)
Y = biggus.mean(diff)
Z = biggus.std(diff)    
x, y, z = \
  biggus.ndarrays((X, Y, Z))

Parallel Evaluation code example ...


In [13]:
# Make test data (ordinary numpy arrays)
src_a = np.arange(10.0) % 3
src_b = np.arange(40.0).reshape(10, 4) % 7
print src_a


[ 0.  1.  2.  0.  1.  2.  0.  1.  2.  0.]

In [14]:
# Process in biggus to make 3 lazy results. 
A = npwrap(src_a)
B = npwrap(src_b)
B_mean = biggus.mean(B, axis=1)
diff = biggus.sub(A, B_mean)
X = biggus.count(diff, axis=0)
Y = biggus.mean(diff, axis=0)
Z = biggus.std(diff, axis=0)
print X


<_Aggregation shape=() dtype=dtype('int32')>

In [15]:
# Evaluate.
x, y, z = biggus.ndarrays((X, Y, Z))
print x, y, z


10 -1.975 1.27205542332

  • As evaluation produces a stream of "chunks", each consumer must process every chunk.
  • To gain speed, the consumers of the data can also operate concurrently, as sketched below.
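
A hypothetical single-pass sketch of that idea (not biggus's actual evaluation engine, which additionally runs the consumers concurrently; a production streaming 'std' would also use a numerically stabler update):

# One stream of chunks feeding three consumers: the source is read only once.
data = np.arange(12.0)
total, count, total_sq = 0.0, 0, 0.0
for chunk in np.array_split(data, 4):    # the "stream of chunks"
    total += chunk.sum()                 # consumer feeding the mean
    count += chunk.size                  # consumer feeding the count
    total_sq += (chunk ** 2).sum()       # consumer feeding the std
mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)
print count, mean, std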

Content Review

Current status

  • existing features are immediately useful (e.g. Iris)
    • virtual arrays
    • combine and extract
    • arithmetic and statistics
    • efficient evaluation
  • missing features - still lots

Future directions

  • direct-to-file saving, with parallel evaluation
  • more intelligent chunking strategies
  • more statistics
  • whatever gets asked for ...

Winding up:

"Forget all your memory worries; calculate faster and easier with ..."
https://github.com/SciTools/biggus

Biggus is:

  • a fully open-source independent project
  • licensed under LGPL
  • developed by contributors around the world
  • ready for real-world uses
  • a key dependency of Iris (https://github.com/SciTools/iris)

Just using it is a contribution

We look forward to your contribution!




( this talk : https://github.com/pp-mo/biggus_talk )

END


In [16]:
# (Appendix) Build the tiled-mosaic example pictured in "Combined usage" above.
def ta(value, dims):
    return npwrap(value * np.ones(dims))

row1 = biggus.LinearMosaic(np.array([ta(1, (3, 5)), ta(2, (3, 2)), ta(3, (3, 2))]), axis=1)
row2 = biggus.LinearMosaic(np.array([ta(5, (5, 2)), ta(6, (5, 2)), ta(7, (5, 5))]), axis=1)
tiles2d = biggus.LinearMosaic(np.array([row1, row2]), axis=0)

part2d = tiles2d[1:6, 1:7]
whole, part = tiles2d.ndarray(), part2d.ndarray()

print 'whole:', whole
print 'part:', part
plt.figure(figsize=(12,4))
plt.subplot(1, 2, 1); plt.pcolormesh(whole, edgecolor='black');
plt.plot([1, 7, 7, 1, 1], [1, 1, 6, 6, 1], color='red', linewidth=5)
plt.subplot(1, 2, 2); plt.pcolormesh(part, edgecolor='black')


whole: [[ 1.  1.  1.  1.  1.  2.  2.  3.  3.]
 [ 1.  1.  1.  1.  1.  2.  2.  3.  3.]
 [ 1.  1.  1.  1.  1.  2.  2.  3.  3.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]
 [ 5.  5.  6.  6.  7.  7.  7.  7.  7.]]
part: [[ 1.  1.  1.  1.  2.  2.]
 [ 1.  1.  1.  1.  2.  2.]
 [ 5.  6.  6.  7.  7.  7.]
 [ 5.  6.  6.  7.  7.  7.]
 [ 5.  6.  6.  7.  7.  7.]]
Out[16]:
<matplotlib.collections.QuadMesh at 0xb114deac>