High-Performance Python, or When to Write For Loops in Python

Author: Lehman Garrison (lgarrison.github.io)

Last updated: 7/16/2017

An introduction to a few modern Python libraries for high-performance numerical operations.

Numexpr

https://github.com/pydata/numexpr

Numexpr helps accelerate simple, one-line NumPy expressions. It mainly does so by avoiding temporary arrays, but it also has great parallelism and vectorization support.

Unfortunately, only a subset of NumPy operations is presently supported. The feature set will be greatly expanded in Numexpr 3.0, but that version is still in development.

Example: converting a density field to a density contrast


In [1]:
import numpy as np
import numexpr as ne
ne.set_num_threads(10);

In [2]:
rho = np.empty((512,512,512), dtype=np.float32)
rho[:] = np.random.random(rho.shape)
rho_mean = rho.mean(dtype=np.float64).astype(np.float32)  # Use double precision for intermediate accumulations

The NumPy way:


In [3]:
%%timeit
delta = np.exp((rho/rho_mean - 1.)**2.)


1 loop, best of 3: 1.61 s per loop

The Numexpr way:


In [4]:
%%timeit
delta = ne.evaluate('exp((rho/rho_mean - 1.)**2.)')


1 loop, best of 3: 403 ms per loop

We were using 10 cores. Did our speedup come from multi-threading or loop-blocking/vectorization?


In [5]:
ne.set_num_threads(10)
%timeit ne.evaluate('exp((rho/rho_mean - 1.)**2.)')
ne.set_num_threads(1)
%timeit ne.evaluate('exp((rho/rho_mean - 1.)**2.)')


1 loop, best of 3: 329 ms per loop
1 loop, best of 3: 2.63 s per loop

Wait, what happened? Why is single-threaded NumExpr slower than NumPy?


In [6]:
np_delta = np.exp((rho/rho_mean - 1.)**2.)
ne_delta = ne.evaluate('exp((rho/rho_mean - 1.)**2.)')
print np_delta.dtype
print ne_delta.dtype


float32
float64

NumExpr changed the computation to double precision while we weren't looking! Floats like 1. are always interpreted as doubles (in Python and NumExpr), and NumExpr uses the highest precision of any operand (in symmetric binary operators, but not ** for some reason). This is in contrast to NumPy, which respects the dtype of the array operands.

There are two solutions:

  • define an intermediate variable like one = np.float32(1.) (see the sketch just after this list)
  • use an integer 1
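
For example, here's a minimal sketch of the first option (not run here). ne.evaluate picks up one from the calling frame just like it does rho and rho_mean:

one = np.float32(1.)
ne_delta = ne.evaluate('exp((rho/rho_mean - one)**2.)')
# every float operand is now single precision, so ne_delta.dtype should be float32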

In [7]:
ne_delta = ne.evaluate('exp((rho/rho_mean - 1)**2.)')
print ne_delta.dtype


float32

In [8]:
ne.set_num_threads(10)
%timeit ne.evaluate('exp((rho/rho_mean - 1)**2.)')
ne.set_num_threads(1)
%timeit ne.evaluate('exp((rho/rho_mean - 1)**2.)')


1 loop, best of 3: 198 ms per loop
1 loop, best of 3: 1.5 s per loop

So even in the single-threaded case we got a modest (~7%) speedup over NumPy. NumExpr achieves this by doing the division and subtraction on small blocks of the array, rather than doing the division over the whole array and then making a second pass for the subtraction.
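
To make the idea concrete, here's a rough, pure-NumPy sketch of the blocking (illustration only; NumExpr's virtual machine does this in compiled code, with tuned block sizes):

flat = rho.reshape(-1)                 # flat views of the input and output
out = np.empty_like(rho).reshape(-1)
block = 4096                           # a cache-sized chunk of elements
for start in range(0, flat.size, block):
    s = slice(start, start + block)
    # each chunk goes through the whole expression before moving on,
    # so the intermediates stay small and cache-resident
    out[s] = np.exp((flat[s]/rho_mean - 1)**2)
delta = out.reshape(rho.shape)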

NumExpr also supports an out keyword argument for true in-place operations, which can be useful for large arrays:


In [9]:
rho_double = np.random.random((768,768,768))

rho = rho_double.copy()
print 'NumPy:\n\t',
%timeit np_rho2 = rho**2

rho = rho_double.copy()
print 'NumPy inplace:\n\t',
%timeit global rho; rho **= 2

rho = rho_double.copy()
print 'NumExpr:\n\t',
ne.set_num_threads(1)
%timeit ne_rho2 = ne.evaluate('rho**2')

rho = rho_double.copy()
print 'NumExpr inplace:\n\t',
ne_inplace_rho2 = rho
%timeit ne.evaluate('rho**2', out=ne_inplace_rho2)


NumPy:
	1 loop, best of 3: 1.18 s per loop
NumPy inplace:
	1 loop, best of 3: 401 ms per loop
NumExpr:
	1 loop, best of 3: 1.23 s per loop
NumExpr inplace:
	1 loop, best of 3: 541 ms per loop

NumExpr is actually slower than NumPy in this simple case; the NumExpr overhead probably isn't worth it for such a simple operation. But if we switch to multiple threads, we do see a speedup:


In [10]:
ne.set_num_threads(10)
rho = rho_double.copy()
print 'NumExpr inplace, multi-threaded:\n\t',
ne_inplace_rho2 = rho
%timeit ne.evaluate('rho**2', out=ne_inplace_rho2)


NumExpr inplace, multi-threaded:
	10 loops, best of 3: 167 ms per loop

Example: many operations

A quick example of an extreme speedup on math-heavy operations using Numexpr:


In [11]:
rho = np.empty((512,512,512), dtype=np.float32)
rho[:] = np.random.random(rho.shape)

In [12]:
%timeit (np.sin(rho**2) + np.cos(rho**3))**.5
ne.set_num_threads(1)
%timeit ne.evaluate('(sin(rho**2) + cos(rho**3))**.5')
ne.set_num_threads(10)
%timeit ne.evaluate('(sin(rho**2) + cos(rho**3))**.5')


1 loop, best of 3: 11.9 s per loop
1 loop, best of 3: 2.51 s per loop
1 loop, best of 3: 318 ms per loop

Numba

https://github.com/numba/numba

Numba compiles Python functions into native code at the moment you call the function --- this is called "just-in-time" (JIT) compilation. The resulting code can often match the performance of native C while being written in pure Python and retaining Python's flexibility.

Numba is much more heavy-weight than NumExpr and is best seen as an alternative to something like Cython. There's no new language to learn, but you do have to write some funny-looking Python.

Numba has much more developer support than NumExpr (and has a Moore Foundation grant for its development). It's likely going to continue to mature in the coming years.

First, an annoyance: you have to set the number of threads before importing Numba for the first time.


In [13]:
import os
os.environ['NUMBA_NUM_THREADS'] = '10'
import numba as nb
import numpy as np

In [14]:
arr = np.empty((4096,4096), dtype=np.float64)
arr[:] = np.random.random(arr.shape)

Numba has one main function: nb.jit. It's used as a decorator:


In [15]:
def py_sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

@nb.jit
def nb_sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [16]:
%timeit py_sum2d(arr)
%timeit nb_sum2d(arr)


1 loop, best of 3: 3.2 s per loop
The slowest run took 7.74 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 17 ms per loop

The first time we run the timing test, we get a message that one of the Numba timing tests took way longer than the others. That's because the first time we call the jitted function, it has to be compiled for the dtype of the arguments. The compiled function is cached, though, making subsequent calls faster.
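
If you'd rather pay the compilation cost up front, you can pass an explicit signature to jit, which compiles eagerly when the decorator runs (a sketch, assuming the float64 2-D input used above):

@nb.jit('float64(float64[:,:])', nopython=True)
def nb_sum2d_eager(arr):
    # identical to nb_sum2d, but compiled for float64 2-D input at decoration
    # time rather than on the first call
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result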

It's a good idea to always pass the nopython=True kwarg. This prevents Numba from falling back into object mode if it finds something it can't compile to machine code, e.g. a Python function call:


In [17]:
import scipy.special

@nb.jit(nopython=True)
def nb_sum2d_j1(arr):
    M, N = arr.shape
    result = 0.
    for i in range(M):
        for j in range(N):
            result += scipy.special.j1(arr[i,j])
    return result

In [18]:
nb_sum2d_j1(arr)



UntypedAttributeErrorTraceback (most recent call last)
<ipython-input-18-994225f75ce8> in <module>()
----> 1 nb_sum2d_j1(arr)

/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/dispatcher.pyc in _compile_for_args(self, *args, **kws)
    328                                 for i, err in failed_args))
    329                 e.patch_message(msg)
--> 330             raise e
    331 
    332     def inspect_llvm(self, signature=None):

UntypedAttributeError: Caused By:
Traceback (most recent call last):
  File "/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 238, in run
    stage()
  File "/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 452, in stage_nopython_frontend
    self.locals)
  File "/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 841, in type_inference_stage
    infer.propagate()
  File "/home/lgarrison/anaconda2/lib/python2.7/site-packages/numba/typeinfer.py", line 773, in propagate
    raise errors[0]
UntypedAttributeError: Unknown attribute 'j1' of type Module(<module 'scipy.special' from '/home/lgarrison/anaconda2/lib/python2.7/site-packages/scipy/special/__init__.pyc'>)
File "<ipython-input-17-a93836d82e7b>", line 9
[1] During: typing of get attribute at <ipython-input-17-a93836d82e7b> (9)

Failed at nopython (nopython frontend)
Unknown attribute 'j1' of type Module(<module 'scipy.special' from '/home/lgarrison/anaconda2/lib/python2.7/site-packages/scipy/special/__init__.pyc'>)
File "<ipython-input-17-a93836d82e7b>", line 9
[1] During: typing of get attribute at <ipython-input-17-a93836d82e7b> (9)

Numba does let you call ctypes and CFFI functions in nopython mode, as we'll see later.

One of Numba's main drawbacks is its lack of broad support for parallelism. jit has a nogil option to release the global interpreter lock, which is useful if you have a higher-level parallelism scheme in place (e.g. Python threading), but it gives no speedup on its own.
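
For instance, here's a minimal sketch of the nogil route with Python threads (illustrative only; the helper function and the two-thread split are made up for this example):

import threading

@nb.jit(nopython=True, nogil=True)
def sum_slab(a):
    # plain serial sum; nogil=True releases the GIL while this runs
    total = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            total += a[i,j]
    return total

# two Python threads, each summing every other row of arr
results = [0., 0.]
def worker(tid):
    results[tid] = sum_slab(arr[tid::2])

threads = [threading.Thread(target=worker, args=(t,)) for t in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = sum(results)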

However, by using the less-flexible vectorize or guvectorize decorators, we can get excellent parallelization in some specialized cases:


In [19]:
@nb.vectorize([nb.float64(nb.float64, nb.float64)], nopython=True, target='cpu')
def vec_op(a, b):
    return np.sin(a**b)**2.

@nb.vectorize([nb.float64(nb.float64, nb.float64)], nopython=True, target='parallel')
def parallel_vec_op(a, b):
    return np.sin(a**b)**2.

In [20]:
%timeit vec_op(arr, 2*arr)
%timeit parallel_vec_op(arr, 2*arr)


1 loop, best of 3: 1.32 s per loop
1 loop, best of 3: 215 ms per loop

So we got an excellent speedup, but we also had to specify the input dtypes. There are dynamic ufuncs that give more flexibility, but sometimes they don't behave as expected with the parallel target.
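
For reference, a dynamic ufunc looks like this (a sketch; the signature list is simply omitted, and the dtypes are inferred from the arguments on the first call):

@nb.vectorize
def dyn_vec_op(a, b):
    return np.sin(a**b)**2.

dyn_vec_op(arr, 2*arr)  # compiles a float64 loop the first time it's called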

Also note that vectorize only lets us operate element-wise on the input arrays. guvectorize gives us access to whole subarrays rather than single elements, and it also has a parallel target. It's the closest thing to a parallel version of jit, but it comes with more programmer overhead. Here's an example:


In [21]:
@nb.guvectorize([(nb.float64[:,:], nb.float64[:,:], nb.float64[:,:])], '(nx,ny),(nx,ny)->(nx,ny)', target='cpu', nopython=True)
def guvec(a, b, c):
    M, N = a.shape
    for i in range(M):
        for j in range(N):
            c[i,j] = a[i,j]**b[i,j]

    
@nb.guvectorize([(nb.float64[:,:], nb.float64[:,:], nb.float64[:,:])], '(nx,ny),(nx,ny)->(nx,ny)', target='parallel', nopython=True)
def parallel_guvec(a, b, c):
    M, N = a.shape
    for i in range(M):
        for j in range(N):
            c[i,j] = a[i,j]**b[i,j]

%timeit guvec(arr, 2*arr)
%timeit parallel_guvec(arr, 2*arr)


1 loop, best of 3: 1.1 s per loop
1 loop, best of 3: 1.07 s per loop

Wait, why didn't we get any parallel speedup? With guvectorize, parallelization only happens over the broadcast dimensions, and here there are none, so let's write it in a way that broadcasts over the first dimension:


In [22]:
@nb.guvectorize([(nb.float64[:], nb.float64[:], nb.float64[:])], '(ny),(ny)->(ny)', target='parallel', nopython=True)
def parallel_guvec(a, b, c):
    N, = a.shape
    for j in range(N):
        c[j] = a[j]**b[j]

%timeit parallel_guvec(arr, 2*arr)


10 loops, best of 3: 181 ms per loop

I've hacked together a fairly general solution that achieves parallelism by broadcasting over an array of indices (I call it parallel_bcast). I'm expecting it will be superseded by built-in Numba functionality within a few years, though.

Update: it's here! Numba now offers numba.prange as a replacement for any range statement. You must also pass parallel=True to the jit decorator. For example:


In [23]:
@nb.jit(nopython=True, parallel=True)
def nb_prange(a, b):
    M, N = a.shape
    c = np.empty_like(a)
    for i in nb.prange(M):
        for j in range(N):
            c[i,j] = a[i,j]**b[i,j]
    return c

%timeit nb_prange(arr, 2*arr)


1 loop, best of 3: 182 ms per loop

See below for a non-trivial example where prange fails to live up to its promise, however.

Numba also offers gpu and cuda targets for transparent offloading to the GPU, but I haven't tried it myself.
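
For completeness, the vectorize example above would look roughly like this for the GPU (an untested sketch; it assumes a CUDA-capable GPU and the CUDA toolkit are available):

@nb.vectorize([nb.float64(nb.float64, nb.float64)], target='cuda')
def cuda_vec_op(a, b):
    # the array arguments are transferred to the GPU automatically
    return np.sin(a**b)**2.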

Example: radial histogramming

The above examples were pretty trivial. Numba really shines when you can use explicit looping to calculate intermediate values that you would otherwise have to store all at once in a big Numpy array. In this example, we'll bin values on a rectangular lattice into spherical shells. Thus, we need to know the radial distance to each lattice site. This is what we'll calculate on-the-fly with Numba.


In [24]:
def py_radial_hist(values, box, bin_edges):
    nx,ny,nz = values.shape
    X,Y,Z = np.ogrid[0:box:box/nx, 0:box:box/ny, 0:box:box/nz]
    radii = X**2 + Y**2 + Z**2
    
    return np.histogram(radii, bins=bin_edges**2, weights=values)


@nb.jit(nopython=True)
def nb_radial_hist(values, boxsize, bin_edges):
    nx,ny,nz = values.shape
    
    histogram = np.zeros(len(bin_edges)-1)

    # Do binning with squared distances
    bin_edges = bin_edges**2
    nbp1 = len(bin_edges)

    for i in range(nx):
        dx2 = (boxsize/nx*i)**2
        for j in range(ny):
            dy2 = (boxsize/ny*j)**2
            for k in range(nz):
                dz2 = (boxsize/nz*k)**2
                dist = dx2 + dy2 + dz2
                if dist < bin_edges[0] or dist > bin_edges[-1]:
                    continue
                for b in range(1,nbp1):
                    if dist < bin_edges[b]:
                        histogram[b-1] += values[i,j,k]
                        break
                else:  # last bin is closed
                    histogram[-1] += values[i,j,k]
                    
    return histogram

In [25]:
box = 1.
bin_edges = np.linspace(0,box,100)
values = np.random.random((512,512,512))

In [26]:
%timeit py_radial_hist(values, box, bin_edges)
%timeit nb_radial_hist(values, box, bin_edges)


1 loop, best of 3: 9.73 s per loop
1 loop, best of 3: 5.98 s per loop

So we cut the runtime of the NumPy implementation by nearly 40%, and we didn't even consider the memory usage. The NumPy implementation doubles the memory usage of the input array, since it has to construct the radii array. This can be a big problem when dealing with $2048^3$ FFT meshes, for example.

In practice, I use a parallel_bcast-based parallel implementation, which gives a huge additional speedup. In theory, we could achieve this parallelism with nb.prange. Let's explore this below.

Experimental: Using nb.prange for histogramming

nb.prange has a lot of promise, but as an "experimental feature" it's often clunky to use in practice. Let's try to apply it to this histogramming example and see what happens.

First, let's write a naive implementation with a race condition. Note that, to compile, the prange must be on the innermost loop.


In [27]:
@nb.jit(nopython=True, parallel=True)
def BAD_nb_parallel_radial_hist(values, boxsize, bin_edges):
    nx,ny,nz = values.shape
    
    histogram = np.zeros(len(bin_edges)-1)
    
    # parallel=True quirk: some versions wouldn't compile without squaring in-place
    bin_edges = bin_edges**2
    nbp1 = len(bin_edges)
    
    for i in range(nx):
        for j in range(ny):
            # another quirk: prange must be in inner loop
            for k in nb.prange(nz):
                dx2 = (boxsize/nx*i)**2
                dy2 = (boxsize/ny*j)**2
                dz2 = (boxsize/nz*k)**2
                dist = dx2 + dy2 + dz2
                if dist < bin_edges[0] or dist > bin_edges[-1]:
                    continue
                for b in range(1,nbp1):
                    if dist < bin_edges[b]:
                        histogram[b-1] += values[i,j,k]
                        break
                else:  # last bin is closed
                    histogram[-1] += values[i,j,k]
            
    # also some versions wouldn't compile without a return!
    return histogram

In [28]:
py_answer = py_radial_hist(values, box, bin_edges)[0]
nb_answer = nb_radial_hist(values, box, bin_edges)
BAD_parallel_answer = BAD_nb_parallel_radial_hist(values, box, bin_edges)

In [29]:
print np.allclose(py_answer, nb_answer)
print np.allclose(py_answer, BAD_parallel_answer)


True
False

As we expected, the race condition caused the parallel version to give the wrong answer. In fact, it doesn't even give a consistent answer from run to run:


In [30]:
BAD_parallel_answer2 = BAD_nb_parallel_radial_hist(values, box, bin_edges)
print np.allclose(BAD_parallel_answer, BAD_parallel_answer2)


False

We can write a parallel-safe version by giving each k its own temporary histogram space to write into:


In [31]:
@nb.jit(nopython=True, parallel=True)
def nb_parallel_radial_hist(values, boxsize, bin_edges):
    nx,ny,nz = values.shape
    
    histogram = np.zeros((nz, len(bin_edges)-1))
    
    bin_edges = bin_edges**2
    nbp1 = len(bin_edges)
    
    for i in range(nx):
        for j in range(ny):
            # prange only works on inner loop
            for k in nb.prange(nz):
                dx2 = (boxsize/nx*i)**2
                dy2 = (boxsize/ny*j)**2
                dz2 = (boxsize/nz*k)**2
                dist = dx2 + dy2 + dz2
                if dist < bin_edges[0] or dist > bin_edges[-1]:
                    continue
                for b in range(1,nbp1):
                    if dist < bin_edges[b]:
                        histogram[k, b-1] += values[i,j,k]
                        break
                else:  # last bin is closed
                    histogram[k, -1] += values[i,j,k]
    
    # Silly! This could be written as
    # reduced_hist = histogram.sum(axis=0)
    # but Numba auto-parallelization doesn't support axis reductions
    reduced_hist = np.zeros(len(bin_edges)-1)
    for b in range(len(reduced_hist)):
        for k in range(nz):
            reduced_hist[b] += histogram[k, b]
            
    return reduced_hist

In [32]:
parallel_answer = nb_parallel_radial_hist(values, box, bin_edges)
print np.allclose(py_answer, parallel_answer)


True

Finally, how do the timings stack up?


In [33]:
%timeit nb_radial_hist(values, box, bin_edges)
%timeit BAD_nb_parallel_radial_hist(values, box, bin_edges)
%timeit nb_parallel_radial_hist(values, box, bin_edges)


1 loop, best of 3: 5.95 s per loop
1 loop, best of 3: 10 s per loop
1 loop, best of 3: 10 s per loop

That didn't go as expected. If each thread is only operating on one element at a time, that could explain the inefficiency. Hopefully it's implemented internally to operate on nz/nthreads elements at a time, though.

We could try flattening the nested loops, but then we'd need nx*ny*nz temporary space! We actually only need nthreads temporary space, but the local thread ID is completely hidden from us. Let's write an egregiously wrong version just to test the speed:


In [34]:
@nb.jit(nopython=True, parallel=True)
def BAD_nb_parallel_radial_hist_flat(values, boxsize, bin_edges):
    nx,ny,nz = values.shape
    
    histogram = np.zeros(len(bin_edges)-1)
    
    bin_edges = bin_edges**2
    nbp1 = len(bin_edges)
    
    for x in nb.prange(nx*ny*nz):
        i = x / (ny*nz)
        j = (x % (ny*nz)) / nz
        k = x % nz
        dx2 = (boxsize/nx*i)**2
        dy2 = (boxsize/ny*j)**2
        dz2 = (boxsize/nz*k)**2
        dist = dx2 + dy2 + dz2
        if dist < bin_edges[0] or dist > bin_edges[-1]:
            continue
        for b in range(1,nbp1):
            if dist < bin_edges[b]:
                histogram[b-1] += values[i,j,k]
                break
        else:  # last bin is closed
            histogram[-1] += values[i,j,k]
            
    return histogram

In [35]:
%timeit BAD_nb_parallel_radial_hist_flat(values, box, bin_edges)


1 loop, best of 3: 1.57 s per loop

That was about as fast as we expected, but it still gives the wrong answer. Maybe when prange supports nested loops it will be possible to write a correct version of this.
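
One possible workaround is to prange over a modest number of chunks of the flattened index, so each chunk gets its own temporary histogram row. This is only a sketch (untimed, and it may hit the same prange nesting quirks noted above in older Numba versions):

@nb.jit(nopython=True, parallel=True)
def nb_chunked_radial_hist(values, boxsize, bin_edges, nchunks):
    nx,ny,nz = values.shape
    ntot = nx*ny*nz

    # one temporary histogram row per chunk instead of per k (or per element)
    histogram = np.zeros((nchunks, len(bin_edges)-1))

    sq_edges = bin_edges**2
    nbp1 = len(sq_edges)

    for c in nb.prange(nchunks):
        start = c*ntot//nchunks
        stop = (c+1)*ntot//nchunks
        for x in range(start, stop):
            i = x//(ny*nz)
            j = (x % (ny*nz))//nz
            k = x % nz
            dist = (boxsize/nx*i)**2 + (boxsize/ny*j)**2 + (boxsize/nz*k)**2
            if dist < sq_edges[0] or dist > sq_edges[-1]:
                continue
            for b in range(1,nbp1):
                if dist < sq_edges[b]:
                    histogram[c, b-1] += values[i,j,k]
                    break
            else:  # last bin is closed
                histogram[c, -1] += values[i,j,k]

    # manual reduction over the chunk axis, as above
    reduced_hist = np.zeros(len(bin_edges)-1)
    for b in range(len(reduced_hist)):
        for c in range(nchunks):
            reduced_hist[b] += histogram[c, b]

    return reduced_hist

# e.g. nb_chunked_radial_hist(values, box, bin_edges, 16)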

For fun, let's time the implementation with my custom parallelization scheme ("parallel_bcast"). It's also doing a periodic wrap, unlike the above, so it's doing more work.


In [36]:
import Abacus.Analysis.PowerSpectrum.Histogram as AbacusHistogram
# Pre-compute the mean radius of the bin and the counts per bin
bin_info = AbacusHistogram.RadialBinGrid(box, values, bin_edges)
%timeit AbacusHistogram.RadialBinGrid(box, values, bin_edges, bin_info=bin_info)


/home/lgarrison/abacus/python/Abacus/Analysis/PowerSpectrum/Histogram.py:215: RuntimeWarning: invalid value encountered in divide
  radii /= counts
/home/lgarrison/abacus/python/Abacus/Analysis/PowerSpectrum/Histogram.py:222: RuntimeWarning: invalid value encountered in divide
  weights /= counts
1 loop, best of 3: 1.16 s per loop

CFFI

https://cffi.readthedocs.io/

CFFI (C Foreign Function Interface) is a way to call C code from Python. It's designed as a replacement for tools like SWIG, ctypes, or scipy.weave (now deprecated).

SWIG is probably still appropriate if you want to call your C code from many languages, not just Python. For Python-only wrapping, though, ctypes and CFFI are the preferred solutions, with CFFI generally favored. We'll see why below.

Example: calling a GSL function from Numba

Let's go back to the Numba example where we tried to call scipy.special.j1 but Numba wouldn't allow it in nopython mode. We'll use the recommended "API-level, out-of-line" mode. This means the C compiler is invoked, and we have to run a build script beforehand.

We're going to write the script to disk and run it, because this would probably be its own script in a real application.


In [37]:
%%writefile build_ffilib.py
if __name__ == '__main__':
    # Compile the FFI lib
    import cffi
    ffibuilder = cffi.FFI()
    ffibuilder.set_source('_gslffilib',
        r"""
            #include <gsl/gsl_sf_bessel.h>  // This gets compiled
         """, libraries=['gsl', 'gslcblas'])
    ffibuilder.cdef("""
            double gsl_sf_bessel_j1 (double x);  // This can be copied straight from the man page and is parsed by CFFI
        """)
        
    ffibuilder.compile(verbose=True)


Overwriting build_ffilib.py

In [38]:
!python build_ffilib.py


generating ./_gslffilib.c
(already up-to-date)
running build_ext
building '_gslffilib' extension
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/lgarrison/anaconda2/include/python2.7 -c _gslffilib.c -o ./_gslffilib.o
gcc -pthread -shared -L/home/lgarrison/anaconda2/lib -Wl,-rpath=/home/lgarrison/anaconda2/lib,--no-as-needed ./_gslffilib.o -L/home/lgarrison/anaconda2/lib -lgsl -lgslcblas -lpython2.7 -o ./_gslffilib.so

In [39]:
%ls *ffi*


build_ffilib.py  _gslffilib.c  _gslffilib.o  _gslffilib.so*

We can see that we just compiled _gslffilib.so. Here's how we use it:


In [40]:
import _gslffilib
_gslffilib.lib.gsl_sf_bessel_j1(1.)


Out[40]:
0.30116867893975674

And here's how we use it with Numba:


In [41]:
import numpy as np
import numba as nb

import numba.cffi_support
numba.cffi_support.register_module(_gslffilib)
gsl_sf_bessel_j1 = _gslffilib.lib.gsl_sf_bessel_j1

In [42]:
import scipy.special

@nb.jit(nopython=True)
def nb_sum2d_cffi(arr):
    M, N = arr.shape
    result = 0.
    for i in range(M):
        for j in range(N):
            result += gsl_sf_bessel_j1(arr[i,j])
    return result

In [43]:
nb_sum2d_cffi(np.random.random((128,128)))


Out[43]:
2593.4050247737537

There are other ways to use CFFI (ABI vs API, in-line vs out-of-line), but the above is the preferred approach for most applications.
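
For comparison, the ABI-level, in-line mode looks like this (a sketch; no compiler is invoked, the shared library is simply opened at runtime):

import cffi

ffi = cffi.FFI()
ffi.cdef("double gsl_sf_bessel_j1 (double x);")  # same declaration as before
libgsl = ffi.dlopen("libgsl.so")                 # ABI mode: no compilation step
libgsl.gsl_sf_bessel_j1(1.)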

CFFI vs ctypes

Here's how we would call gsl_sf_bessel_j1 with ctypes.


In [44]:
import ctypes as ct
libgsl = ct.cdll.LoadLibrary("libgsl.so")

# Set the argument and return types
libgsl.gsl_sf_bessel_j1.restype = ct.c_double
libgsl.gsl_sf_bessel_j1.argtypes = [ct.c_double]

In [45]:
libgsl.gsl_sf_bessel_j1(1.)


Out[45]:
0.30116867893975674

That was a lot easier, so why not use ctypes all the time? First of all, ctypes is ABI-only --- it never invokes the C compiler. This is error prone and can easily cause segfaults if the wrong arguments are passed, the wrong signatures are specified, etc. It's also hard to keep up to date if the C call signature changes. CFFI, on the other hand, just requires that you copy and paste the C call signature (or load it programmatically). Most modern Python projects are moving away from ctypes towards CFFI.

Finally, CFFI can compile your raw C code for you, if you don't already have it in a shared library. We didn't demonstrate that here, but it's a much more portable and low-maintenance solution than trying to detect what compilers are available on the target platform, for example.
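
As a sketch of that last point, set_source can take the C source itself rather than just an #include. The function my_square below is made up for the example:

import cffi

ffibuilder = cffi.FFI()
ffibuilder.set_source('_myffilib',
    r"""
        /* This C source is compiled into the extension module */
        double my_square(double x) { return x*x; }
    """)
ffibuilder.cdef("double my_square(double x);")
ffibuilder.compile(verbose=True)

# then, from Python:
# import _myffilib
# _myffilib.lib.my_square(3.)  # -> 9.0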

A final note: CFFI was developed as part of the PyPy project, a performance-oriented alternative to CPython. However, I wouldn't recommend including PyPy in your workflow, as its approach requires re-implementing basic libraries like NumPy ("NumPyPy"), which is ultimately unsustainable.

Dask

https://github.com/dask/dask

A library for out-of-core, asynchronous, or distributed jobs on big data. It has a lot of promise, but hasn't quite lived up to its potential in my experience. It's funded by the same Moore Foundation grant as Numba, so I'm expecting a lot of maturation in the next few years.
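
As a taste of the idea, here's a minimal sketch with dask.array (assuming Dask is installed): the array is split into chunks, the expression builds a task graph lazily, and compute() runs it, potentially in parallel or out of core.

import dask.array as da

# a chunked random array; each 64-row slab can be processed independently
x = da.random.random((512, 512, 512), chunks=(64, 512, 512))
delta = (x / x.mean() - 1)**2    # lazy: builds a task graph, nothing computed yet
result = delta.mean().compute()  # executes the graph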