Author: Lehman Garrison (lgarrison.github.io)
Last updated: 7/16/2017
An introduction to a few modern Python libraries for high-performance numerical operations.
https://github.com/pydata/numexpr
Numexpr helps accelerate simple, one-line NumPy expressions. It mainly does so by avoiding temporary arrays, but it also has good parallelism and vectorization support.
Unfortunately, only a subset of NumPy is presently supported. The feature set will be greatly expanded in Numexpr 3.0, but it's still in development.
In [1]:
import numpy as np
import numexpr as ne
ne.set_num_threads(10);
In [2]:
rho = np.empty((512,512,512), dtype=np.float32)
rho[:] = np.random.random(rho.shape)
rho_mean = rho.mean(dtype=np.float64).astype(np.float32) # Use double precision for intermediate accumulations
The NumPy way:
In [3]:
%%timeit
delta = np.exp((rho/rho_mean - 1.)**2.)
The Numexpr way:
In [4]:
%%timeit
delta = ne.evaluate('exp((rho/rho_mean - 1.)**2.)')
We were using 10 cores. Did our speedup come from multi-threading or loop-blocking/vectorization?
In [5]:
ne.set_num_threads(10)
%timeit ne.evaluate('exp((rho/rho_mean - 1.)**2.)')
ne.set_num_threads(1)
%timeit ne.evaluate('exp((rho/rho_mean - 1.)**2.)')
Wait, what happened? Why is single-threaded NumExpr slower than NumPy?
In [6]:
np_delta = np.exp((rho/rho_mean - 1.)**2.)
ne_delta = ne.evaluate('exp((rho/rho_mean - 1.)**2.)')
print np_delta.dtype
print ne_delta.dtype
NumExpr changed the computation to double precision while we weren't looking! Float literals like 1. are always interpreted as doubles (in Python and in NumExpr), and NumExpr uses the highest precision of any operand (in symmetric binary operators, but not **, for some reason). This is in contrast to NumPy, which respects the dtype of the array operands.
There are two solutions: define the constant with the desired precision, e.g. one = np.float32(1.), and use it in the expression; or write the constant as the integer 1, which doesn't trigger the upcast.
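A quick sketch of the first approach (one is just a local variable here; ne.evaluate picks up names from the calling scope):

one = np.float32(1.)
ne_delta = ne.evaluate('exp((rho/rho_mean - one)**2.)')
print(ne_delta.dtype)  # float32, now that the subtracted constant is single precision

The next cell takes the second approach, with an integer literal: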
In [7]:
ne_delta = ne.evaluate('exp((rho/rho_mean - 1)**2.)')
print ne_delta.dtype
In [8]:
ne.set_num_threads(10)
%timeit ne.evaluate('exp((rho/rho_mean - 1)**2.)')
ne.set_num_threads(1)
%timeit ne.evaluate('exp((rho/rho_mean - 1)**2.)')
So even in the single-threaded case, we got a 10% speedup. NumExpr achieves this by doing the division and subtraction on small blocks of the array, rather than doing the division over the whole array and then making a second pass for the subtraction.
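Conceptually, the blocked evaluation looks something like this pure-NumPy sketch (the block size here is arbitrary; NumExpr picks one tuned to the CPU cache):

# Illustration only: evaluate the expression block-by-block so the intermediate
# arrays stay small and cache-resident, the way NumExpr's virtual machine does.
flat = rho.reshape(-1)
out = np.empty_like(flat)
block = 2**16
for start in range(0, flat.size, block):
    chunk = flat[start:start + block]
    out[start:start + block] = np.exp((chunk/rho_mean - 1)**2)
delta = out.reshape(rho.shape)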
NumExpr also supports an out keyword argument for true in-place operations, which can be useful for large arrays:
In [9]:
rho_double = np.random.random((768,768,768))
rho = rho_double.copy()
print 'NumPy:\n\t',
%timeit np_rho2 = rho**2
rho = rho_double.copy()
print 'NumPy inplace:\n\t',
%timeit global rho; rho **= 2
rho = rho_double.copy()
print 'NumExpr:\n\t',
ne.set_num_threads(1)
%timeit ne_rho2 = ne.evaluate('rho**2')
rho = rho_double.copy()
print 'NumExpr inplace:\n\t',
ne_inplace_rho2 = rho
%timeit ne.evaluate('rho**2', out=ne_inplace_rho2)
NumExpr is actually slower than NumPy here; the NumExpr overhead probably isn't worth it for such a simple operation. But if we switch to multi-threaded evaluation, we do see a speedup:
In [10]:
ne.set_num_threads(10)
rho = rho_double.copy()
print 'NumExpr inplace, multi-threaded:\n\t',
ne_inplace_rho2 = rho
%timeit ne.evaluate('rho**2', out=ne_inplace_rho2)
In [11]:
rho = np.empty((512,512,512), dtype=np.float32)
rho[:] = np.random.random(rho.shape)
In [12]:
%timeit (np.sin(rho**2) + np.cos(rho**3))**.5
ne.set_num_threads(1)
%timeit ne.evaluate('(sin(rho**2) + cos(rho**3))**.5')
ne.set_num_threads(10)
%timeit ne.evaluate('(sin(rho**2) + cos(rho**3))**.5')
Numba compiles Python functions into native code at the moment you call the function --- this is called "just-in-time" compilation. The resulting code can often match native C in speed, while still being written in pure, flexible Python.
Numba is much more heavy-weight than NumExpr and is best seen as an alternative to something like Cython. There's no new language to learn, but you do have to write some funny-looking Python.
Numba has much more developer support than NumExpr (and has a Moore Foundation grant for its development). It's likely going to continue to mature in the coming years.
First, an annoyance: you have to set the number of threads before importing Numba for the first time.
In [13]:
import os
os.environ['NUMBA_NUM_THREADS'] = '10'
import numba as nb
import numpy as np
In [14]:
arr = np.empty((4096,4096), dtype=np.float64)
arr[:] = np.random.random(arr.shape)
Numba has one main function: nb.jit. It's used as a decorator:
In [15]:
def py_sum2d(arr):
M, N = arr.shape
result = 0.0
for i in range(M):
for j in range(N):
result += arr[i,j]
return result
@nb.jit
def nb_sum2d(arr):
M, N = arr.shape
result = 0.0
for i in range(M):
for j in range(N):
result += arr[i,j]
return result
In [16]:
%timeit py_sum2d(arr)
%timeit nb_sum2d(arr)
The first time we run the timing test, we get a message that one of the Numba timing tests took way longer than the others. That's because the first time we call the jitted function, it has to be compiled for the dtype of the arguments. The compiled function is cached, though, making subsequent calls faster.
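If the first-call compilation cost matters, you can pass an explicit signature to nb.jit so the compilation happens at decoration time instead. A minimal sketch, reusing the same summation:

# Eagerly compiled for 2D float64 input; the compile cost is paid when the
# decorator runs, not on the first call.
@nb.jit('float64(float64[:,:])', nopython=True)
def nb_sum2d_eager(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result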
It's a good idea to always pass the nopython=True kwarg. This prevents Numba from falling back into object mode if it finds something it can't compile to machine code, e.g. a Python function call:
In [17]:
import scipy.special
@nb.jit(nopython=True)
def nb_sum2d_j1(arr):
M, N = arr.shape
result = 0.
for i in range(M):
for j in range(N):
result += scipy.special.j1(arr[i,j])
return result
In [18]:
nb_sum2d_j1(arr)
Numba does let you call ctypes and CFFI functions in nopython mode, as we'll see later.
One of Numba's main drawbacks is its lack of broad support for parallelism. jit has a nogil option to release the global interpreter lock, which is useful if you have a higher-level parallelism scheme in place (e.g. Python threading), but it gives no speedup on its own.
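Here's a minimal sketch of that pattern: a nogil-jitted kernel fed array chunks from ordinary Python threads (the chunking and thread count are arbitrary, just for illustration):

import threading

@nb.jit(nopython=True, nogil=True)
def chunk_sum(a, out, idx):
    # With nogil=True, the GIL is released while this runs, so the threads can overlap.
    s = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            s += a[i,j]
    out[idx] = s

nthreads = 4
partial = np.zeros(nthreads)
threads = [threading.Thread(target=chunk_sum, args=(chunk, partial, i))
           for i, chunk in enumerate(np.array_split(arr, nthreads))]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = partial.sum()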
However, by using the less-flexible vectorize or guvectorize decorators, we can get excellent parallelization in some specialized cases:
In [19]:
@nb.vectorize([nb.float64(nb.float64, nb.float64)], nopython=True, target='cpu')
def vec_op(a, b):
return np.sin(a**b)**2.
@nb.vectorize([nb.float64(nb.float64, nb.float64)], nopython=True, target='parallel')
def parallel_vec_op(a, b):
return np.sin(a**b)**2.
In [20]:
%timeit vec_op(arr, 2*arr)
%timeit parallel_vec_op(arr, 2*arr)
So we got an excellent speedup, but we also had to specify the input dtypes. There are dynamic ufuncs that give more flexibility, but sometimes they don't behave as expected with the parallel target.
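For reference, a dynamic ufunc is just vectorize with no signature list; it compiles a new inner loop the first time it sees each input dtype (and, as far as I know, only supports the cpu target):

@nb.vectorize   # no signatures: a dynamic ufunc, compiled lazily per input dtype
def dyn_vec_op(a, b):
    return np.sin(a**b)**2.

dyn_vec_op(arr, 2*arr)                                        # compiles a float64 loop
dyn_vec_op(arr.astype(np.float32), 2*arr.astype(np.float32))  # adds a float32 loop on demand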
Also note that vectorize only lets us operate element-wise on the input arrays. guvectorize gives us access to the whole array if we need it, and has a parallel target. It's the closest thing to a parallel version of jit, but it has more programmer overhead. Here's an example:
In [21]:
@nb.guvectorize([(nb.float64[:,:], nb.float64[:,:], nb.float64[:,:])], '(nx,ny),(nx,ny)->(nx,ny)', target='cpu', nopython=True)
def guvec(a, b, c):
M, N = a.shape
for i in range(M):
for j in range(N):
c[i,j] = a[i,j]**b[i,j]
@nb.guvectorize([(nb.float64[:,:], nb.float64[:,:], nb.float64[:,:])], '(nx,ny),(nx,ny)->(nx,ny)', target='parallel', nopython=True)
def parallel_guvec(a, b, c):
M, N = a.shape
for i in range(M):
for j in range(N):
c[i,j] = a[i,j]**b[i,j]
%timeit guvec(arr, 2*arr)
%timeit parallel_guvec(arr, 2*arr)
Wait, why didn't we get any parallel speedup? With guvectorize, Numba only parallelizes over the broadcast (loop) dimensions, and here the signature consumes the entire array as core dimensions, leaving nothing to broadcast over. Let's rewrite it so it broadcasts over the first dimension:
In [22]:
@nb.guvectorize([(nb.float64[:], nb.float64[:], nb.float64[:])], '(ny),(ny)->(ny)', target='parallel', nopython=True)
def parallel_guvec(a, b, c):
N, = a.shape
for j in range(N):
c[j] = a[j]**b[j]
%timeit parallel_guvec(arr, 2*arr)
I've hacked together a fairly general solution that achieves parallelism by broadcasting over an array of indices (I call it parallel_bcast). I expect it will be superseded by Numba functionality within a few years, though.
Update: it's here! Numba now offers numba.prange as a replacement for any range statement. You must also pass parallel=True to the jit decorator. For example:
In [23]:
@nb.jit(nopython=True, parallel=True)
def nb_prange(a, b):
M, N = a.shape
c = np.empty_like(a)
for i in nb.prange(M):
for j in range(N):
c[i,j] = a[i,j]**b[i,j]
%timeit nb_prange(arr, 2*arr)
See below for a non-trivial example where prange fails to live up to its promise, however.
Numba also offers gpu and cuda targets for transparent offloading to the GPU, but I haven't tried it myself.
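For completeness, here's roughly what the CUDA version of the earlier ufunc would look like (untested here; it needs the CUDA toolkit and a supported GPU):

import math

# Untested sketch: the same element-wise operation as vec_op, offloaded to the GPU.
@nb.vectorize([nb.float64(nb.float64, nb.float64)], target='cuda')
def cuda_vec_op(a, b):
    return math.sin(a**b)**2.

# cuda_vec_op(arr, 2*arr)  # NumPy arrays are copied to and from the device automatically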
The above examples were pretty trivial. Numba really shines when you can use explicit looping to calculate intermediate values that you would otherwise have to store all at once in a big NumPy array. In this example, we'll bin values on a rectangular lattice into spherical shells, so we need to know the radial distance to each lattice site; this is what we'll calculate on the fly with Numba.
In [24]:
def py_radial_hist(values, box, bin_edges):
nx,ny,nz = values.shape
X,Y,Z = np.ogrid[0:box:box/nx, 0:box:box/ny, 0:box:box/nz]
radii = X**2 + Y**2 + Z**2
return np.histogram(radii, bins=bin_edges**2, weights=values)
@nb.jit(nopython=True)
def nb_radial_hist(values, boxsize, bin_edges):
nx,ny,nz = values.shape
histogram = np.zeros(len(bin_edges)-1)
# Do binning with squared distances
bin_edges = bin_edges**2
nbp1 = len(bin_edges)
for i in range(nx):
dx2 = (boxsize/nx*i)**2
for j in range(ny):
dy2 = (boxsize/ny*j)**2
for k in range(nz):
dz2 = (boxsize/nz*k)**2
dist = dx2 + dy2 + dz2
if dist < bin_edges[0] or dist > bin_edges[-1]:
continue
for b in range(1,nbp1):
if dist < bin_edges[b]:
histogram[b-1] += values[i,j,k]
break
else: # last bin is closed
histogram[-1] += values[i,j,k]
return histogram
In [25]:
box = 1.
bin_edges = np.linspace(0,box,100)
values = np.random.random((512,512,512))
In [26]:
%timeit py_radial_hist(values, box, bin_edges)
%timeit nb_radial_hist(values, box, bin_edges)
So we cut the runtime of the NumPy implementation by a third, and we haven't even considered the memory usage: the NumPy implementation doubles the memory footprint of the input array, since it has to construct the radii array. This can be a big problem when dealing with $2048^3$ FFT meshes, for example.
In practice, I use a parallel_bcast-based parallel implementation, which gives a huge additional speedup. In theory, we could achieve this parallelism with nb.prange. Let's explore this below.
nb.prange for histogramming
nb.prange has a lot of promise, but as an "experimental feature" it's often clunky to use in practice. Let's try to apply it to this histogramming example and see what happens.
First, let's write a naive implementation with a race condition. Note that to compile, the prange must be in the innermost loop.
In [27]:
@nb.jit(nopython=True, parallel=True)
def BAD_nb_parallel_radial_hist(values, boxsize, bin_edges):
nx,ny,nz = values.shape
histogram = np.zeros(len(bin_edges)-1)
# parallel=True quirk: some versions wouldn't compile without squaring in-place
bin_edges = bin_edges**2
nbp1 = len(bin_edges)
for i in range(nx):
for j in range(ny):
# another quirk: prange must be in inner loop
for k in nb.prange(nz):
dx2 = (boxsize/nx*i)**2
dy2 = (boxsize/ny*j)**2
dz2 = (boxsize/nz*k)**2
dist = dx2 + dy2 + dz2
if dist < bin_edges[0] or dist > bin_edges[-1]:
continue
for b in range(1,nbp1):
if dist < bin_edges[b]:
histogram[b-1] += values[i,j,k]
break
else: # last bin is closed
histogram[-1] += values[i,j,k]
# also some versions wouldn't compile without a return!
return histogram
In [28]:
py_answer = py_radial_hist(values, box, bin_edges)[0]
nb_answer = nb_radial_hist(values, box, bin_edges)
BAD_parallel_answer = BAD_nb_parallel_radial_hist(values, box, bin_edges)
In [29]:
print np.allclose(py_answer, nb_answer)
print np.allclose(py_answer, BAD_parallel_answer)
As we expected, the race condition caused the parallel version to give the wrong answer. In fact, it doesn't even give a consistent answer from run to run:
In [30]:
BAD_parallel_answer2 = BAD_nb_parallel_radial_hist(values, box, bin_edges)
print np.allclose(BAD_parallel_answer, BAD_parallel_answer2)
We can write a parallel-safe version by giving each k its own temporary histogram space to write into:
In [31]:
@nb.jit(nopython=True, parallel=True)
def nb_parallel_radial_hist(values, boxsize, bin_edges):
nx,ny,nz = values.shape
histogram = np.zeros((nz, len(bin_edges)-1))
bin_edges = bin_edges**2
nbp1 = len(bin_edges)
for i in range(nx):
for j in range(ny):
# prange only works on inner loop
for k in nb.prange(nz):
dx2 = (boxsize/nx*i)**2
dy2 = (boxsize/ny*j)**2
dz2 = (boxsize/nz*k)**2
dist = dx2 + dy2 + dz2
if dist < bin_edges[0] or dist > bin_edges[-1]:
continue
for b in range(1,nbp1):
if dist < bin_edges[b]:
histogram[k, b-1] += values[i,j,k]
break
else: # last bin is closed
histogram[k, -1] += values[i,j,k]
# Silly! This could be written as
# reduced_hist = histogram.sum(axis=0)
# but Numba auto-parallelization doesn't support axis reductions
reduced_hist = np.zeros(len(bin_edges)-1)
for b in range(len(reduced_hist)):
for k in range(nz):
reduced_hist[b] += histogram[k, b]
return reduced_hist
In [32]:
parallel_answer = nb_parallel_radial_hist(values, box, bin_edges)
print np.allclose(py_answer, parallel_answer)
Finally, how do the timings stack up?
In [33]:
%timeit nb_radial_hist(values, box, bin_edges)
%timeit BAD_nb_parallel_radial_hist(values, box, bin_edges)
%timeit nb_parallel_radial_hist(values, box, bin_edges)
That didn't go as expected. If each thread is only operating on one element at a time, that could explain the inefficiency. Hopefully it's implemented internally to operate on nz/nthreads elements at a time, though.
We could try flattening the nested loops, but then we'd need nx*ny*nz temporary space! We actually only need nthreads temporary space, but the local thread ID is completely hidden from us. Let's write an egregiously wrong version just to test the speed:
In [34]:
@nb.jit(nopython=True, parallel=True)
def BAD_nb_parallel_radial_hist_flat(values, boxsize, bin_edges):
nx,ny,nz = values.shape
histogram = np.zeros(len(bin_edges)-1)
bin_edges = bin_edges**2
nbp1 = len(bin_edges)
for x in nb.prange(nx*ny*nz):
i = x / (ny*nz)
j = (x % (ny*nz)) / nz
k = x % nz
dx2 = (boxsize/nx*i)**2
dy2 = (boxsize/ny*j)**2
dz2 = (boxsize/nz*k)**2
dist = dx2 + dy2 + dz2
if dist < bin_edges[0] or dist > bin_edges[-1]:
continue
for b in range(1,nbp1):
if dist < bin_edges[b]:
histogram[b-1] += values[i,j,k]
break
else: # last bin is closed
histogram[-1] += values[i,j,k]
return histogram
In [35]:
%timeit BAD_nb_parallel_radial_hist_flat(values, box, bin_edges)
That was about as fast as we expected, but it still gives the wrong answer. Maybe when prange supports nested loops it will be possible to write a correct version of this.
For fun, let's time the implementation with my custom parallelization scheme ("parallel_bcast"). It's also doing a periodic wrap, unlike the above, so it's doing more work.
In [36]:
import Abacus.Analysis.PowerSpectrum.Histogram as AbacusHistogram
# Pre-compute the mean radius of the bin and the counts per bin
bin_info = AbacusHistogram.RadialBinGrid(box, values, bin_edges)
%timeit AbacusHistogram.RadialBinGrid(box, values, bin_edges, bin_info=bin_info)
CFFI (C Foreign Function Interface) is a way to call C code from Python. It's designed as a replacement for tools like SWIG, ctypes, or scipy.weave (now deprecated).
SWIG is probably still appropriate if you want to call your C code from many languages, not just Python. But ctypes or CFFI are the preferred Python solutions, with a preference for CFFI. We'll see why below.
Let's go back to the Numba example where we tried to call scipy.special.j1 but Numba wouldn't allow it in nopython mode. We'll use the recommended "API-level, out-of-line" mode. This means the C compiler is invoked, and we have to run a build script beforehand.
We're going to write the script to disk and run it, because this would probably be its own script in a real application.
In [37]:
%%writefile build_ffilib.py
if __name__ == '__main__':
# Compile the FFI lib
import cffi
ffibuilder = cffi.FFI()
ffibuilder.set_source('_gslffilib',
r"""
#include <gsl/gsl_sf_bessel.h> // This gets compiled
""", libraries=['gsl', 'gslcblas'])
ffibuilder.cdef("""
double gsl_sf_bessel_j1 (double x); // This can be copied straight from the man page and is parsed by CFFI
""")
ffibuilder.compile(verbose=True)
In [38]:
!python build_ffilib.py
In [39]:
%ls *ffi*
We can see that we just compiled _gslffilib.so. Here's how we use it:
In [40]:
import _gslffilib
_gslffilib.lib.gsl_sf_bessel_j1(1.)
Out[40]:
And here's how we use it with Numba:
In [41]:
import numpy as np
import numba as nb
import numba.cffi_support
numba.cffi_support.register_module(_gslffilib)
gsl_sf_bessel_j1 = _gslffilib.lib.gsl_sf_bessel_j1
In [42]:
import scipy.special
@nb.jit(nopython=True)
def nb_sum2d_cffi(arr):
M, N = arr.shape
result = 0.
for i in range(M):
for j in range(N):
result += gsl_sf_bessel_j1(arr[i,j])
return result
In [43]:
nb_sum2d_cffi(np.random.random((128,128)))
Out[43]:
There are other ways to use CFFI (ABI vs API, in-line vs out-of-line), but the above is the preferred approach for most applications.
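For comparison, the ABI-level, in-line mode skips the compiler entirely and looks like this sketch (it shares the fragility of ctypes, discussed below):

import cffi

ffi = cffi.FFI()
ffi.cdef("double gsl_sf_bessel_j1 (double x);")  # declaration copied from the header, as before
lib = ffi.dlopen("libgsl.so")                    # ABI mode: just open the shared library
lib.gsl_sf_bessel_j1(1.)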
Here's how we would call gsl_sf_bessel_j1 with ctypes:
In [44]:
import ctypes as ct
libgsl = ct.cdll.LoadLibrary("libgsl.so")
# Set the argument and return types
libgsl.gsl_sf_bessel_j1.restype = ct.c_double
libgsl.gsl_sf_bessel_j1.argtypes = [ct.c_double]
In [45]:
libgsl.gsl_sf_bessel_j1(1.)
Out[45]:
That was a lot easier, so why not use ctypes all the time? First of all, ctypes is ABI-only --- it never invokes the C compiler. This is error-prone and can easily cause segfaults if the wrong arguments are passed, the wrong signatures are specified, etc. It's also hard to keep updated if the C call signature changes. CFFI, on the other hand, just requires that you copy/paste the C call signature (or load it programmatically). Most modern Python projects are moving away from ctypes towards CFFI.
Finally, CFFI can compile your raw C code for you, if you don't already have it in a shared library. We didn't demonstrate that here, but it's a much more portable and low-maintenance solution than trying to detect what compilers are available on the target platform, for example.
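As a sketch of that last point, set_source can be given an actual C definition rather than just an #include, and CFFI will compile it into the extension module (square_plus_one is a made-up function for illustration):

import cffi

ffibuilder = cffi.FFI()
ffibuilder.set_source('_myclib', r"""
    /* Raw C, compiled by CFFI into the _myclib extension module: */
    static double square_plus_one(double x) { return x*x + 1.0; }
""")
ffibuilder.cdef("double square_plus_one(double x);")
ffibuilder.compile(verbose=True)
# After building:  from _myclib import lib;  lib.square_plus_one(3.0)  ->  10.0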
A final note: CFFI was developed as part of the PyPy project, which is a performance-oriented alternative to CPython. However, I wouldn't recommend including PyPy in your workflow, as their approach requires re-implementing basic libraries like Numpy to work with PyPy ("NumPyPy"). This is ultimately an unsustainable approach.
Dask is a library for out-of-core, asynchronous, or distributed jobs on big data. It has a lot of promise, but it hasn't quite lived up to its potential in my experience. It's funded by the same grant as Numba, so I'm expecting a lot of maturation in the next few years.
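For a taste of the dask.array interface, here's a minimal sketch (the sizes and chunking are arbitrary):

import dask.array as da

# A lazily evaluated ~20 GB array split into ~200 MB chunks; nothing is allocated
# up front, and the reduction is evaluated chunk-by-chunk, in parallel.
x = da.random.random((50000, 50000), chunks=(5000, 5000))
print(x.mean().compute())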