Extending Python with Compiled Code

In a previous session, we talked about ways to get performance out of Python using vectorization strategies within NumPy. Today we're going to talk about approaches which involve interfacing Python to external compiled code, or converting Python code to compilable code. There are several different approaches. We'll give examples of solving a problem by several means:

  • Ctypes: the C-linking utilities included in the Python standard library
  • F2Py: the Fortran to Python utility which is bundled with NumPy
  • Cython: a module that allows conversion of Python and Python-like code to C
  • Numba: a just-in-time compiler that translates Python functions to machine code via LLVM

There's no way to go into depth on all of these, but I mainly want to show some examples so that you know the types of problems they solve. Each project's documentation offers more in-depth tutorials.

A Note on CFFI and PyPy

A newer option that I've not personally used (but is probably the way of the future for wrapping compiled code in Python) is the C Foreign Function Interface (CFFI). One huge advantage of this is that it works with PyPy, where the other options don't.

PyPy is a JIT compiler for Python written in Python. PyPy doesn't fully support the CPython C API that most scientific tools rely on, so it's not (yet) very useful for scientific computing. If the functionality of numpy, scipy, etc. is rewritten on top of CFFI in the future, though, PyPy may become a competitive option.
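
To give a flavor of CFFI, here's a minimal sketch in "ABI mode", assuming a shared library like the my_sum.so we'll build by hand in the ctypes section below. I haven't tested this extensively, so treat it as an illustration rather than a recipe:


In [ ]:
import cffi

ffi = cffi.FFI()
ffi.cdef("int sum(int *x, int n);")    # declare the C signature
lib = ffi.dlopen("./my_sum.so")        # load the shared library

x = ffi.new("int[]", list(range(10)))  # allocate and fill a C int array
lib.sum(x, 10)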

Other Options

There are other options we won't talk about here, because they have mostly been superseded by the above tools:

  • Python C API: Python is implemented in C, and you can actually write C code which will compile to a Python module! This is extremely low-level, and I would only recommend it if you enjoy pain (see Dan Foreman-Mackey's blog post for a concise-as-possible introduction)
  • SWIG: the Simplified Wrapper and Interface Generator, which can generate wrappers from a variety of low-level languages to a variety of high-level languages. It's used heavily by LSST and other projects.
  • Weave: included in SciPy, weave is a method of putting C snippets within a Python program. It's largely been superseded by Cython in practice.

The optimal approach depends on the desired use-case. We'll go through a few examples, from lowest to highest level.

Why Extend Python?

There are a few reasons you might wish to use the tools outlined below:

  1. Sometimes vectorization of algorithms can lead to excessive memory usage, because it relies on temporary arrays to hold intermediate results (see the short illustration after this list). Directly implementing the routine in compiled code can lead to more efficient computation.

  2. Sometimes problems cannot be implemented as vectorized NumPy operations. One example is tree-based algorithms. This is why I used Cython when I wrote the k-neighbors and kernel density estimation code within sklearn.neighbors.

  3. Sometimes there are legacy code-bases that you don't want to have to re-implement: you'd much rather just write some sort of bindings to call the old code from within Python.
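
As a quick illustration of the first point, consider how many hidden temporary arrays a single vectorized expression allocates:


In [ ]:
import numpy as np

x = np.random.random(1000000)

# Behind the scenes, NumPy allocates full-size intermediate arrays
# for x ** 2, np.sin(...), and np.cos(x) before y is ever created
y = np.sin(x ** 2) + np.cos(x)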

We'll do a brief demonstration of some of these tools. They do have a bit of a learning curve, so I'll provide references for where you might look to go deeper. The purpose of this lecture is to give you a taste of some of the better options for addressing these problems.

Ctypes

Ctypes is part of the standard library, and offers tools to call routines from compiled libraries with Python. These can either be routines in your system libraries, or in shared libraries you compile yourself. We'll see examples of both.

One weakness of ctypes is that it can be very platform-specific. For example, shared-library extensions are different from platform to platform (e.g. *.so on Linux, *.dll on Windows, *.dylib on OSX). Other platform-by-platform issues like 32/64 bit can also make things difficult.
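
One way to soften the platform-dependence is ctypes.util.find_library, which searches for a library in a platform-appropriate way. A quick sketch (the exact string returned varies from system to system):


In [ ]:
from ctypes.util import find_library

# Returns e.g. 'libc.so.6' on Linux or '/usr/lib/libc.dylib' on OSX,
# or None if the library cannot be located
print(find_library('c'))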

Interfacing to System Libraries

On Linux and Mac, you'll usually find the standard libraries here (remember that ! lets us execute shell commands within IPython):


In [ ]:
!ls /usr/lib/

ctypes lets you link to any of these libraries and call their functions directly:


In [ ]:
from ctypes import CDLL

libc_name = 'libc.dylib'  # OSX
# libc_name = 'libc.so.6'  # Linux
# libc_name = 'msvcrt.dll'  # Windows (the MS C runtime)

libc = CDLL(libc_name)

In [ ]:
libc.time

In [ ]:
print("Seconds since January 1, 1970:", libc.time())

Wrapping your own functions

You can do more than simply wrap the functionality of system libraries: you can also create your own C functions and wrap them with ctypes (this will require having a C compiler installed):


In [ ]:
%%file my_sum.c
#include <stdio.h>

// sum all the values in the array x
// x is a pointer to a memory block 
// of length n
int sum(int *x, int n)
{
  int i, counter;
  counter = 0;
  for(i=0; i<n; i++)
    {
      counter += x[i];
    }
  return counter;
}

In [ ]:
%%bash
gcc -fPIC -c my_sum.c
gcc -shared -o my_sum.so my_sum.o

In [ ]:
!ls my_sum.*

Now we'll use the ctypes API to create Python objects that can be passed to this function:


In [ ]:
from ctypes import CDLL, c_void_p
import numpy as np

my_sum = CDLL('./my_sum.so')

a = np.arange(10, dtype=np.int32)
adata = a.ctypes.data_as(c_void_p)
asize = a.size

my_sum.sum(adata, asize)

In [ ]:
a.sum()
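
Passing raw pointers like this is error-prone: nothing stops us from handing sum an array of the wrong dtype. NumPy ships helpers for this in np.ctypeslib; the sketch below uses ndpointer to declare the expected argument types once, after which ctypes accepts the array directly and checks it on every call:


In [ ]:
from ctypes import c_int
from numpy.ctypeslib import ndpointer

# Declare the signature: ctypes will verify the dtype and
# contiguity of the array argument for us
my_sum.sum.argtypes = [ndpointer(dtype=np.int32, flags='C_CONTIGUOUS'),
                       c_int]
my_sum.sum.restype = c_int

my_sum.sum(a, a.size)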

User-defined Types

Here we've used just simple built-in numerical types: we can also do more complicated things like defining C Structures within Python:


In [ ]:
%%file my_sum2.c
#include <stdio.h>

// Define a simple array struct, containing
// a pointer and a length
struct Array{
   int *x;
   int n;
};

// Sum the values in the array struct
int sum(struct Array a)
{
  int counter, i;
  counter = 0;
  for(i=0; i<a.n; i++)
    {
      counter += a.x[i];
    }
  return counter;
}

In [ ]:
%%bash
gcc -fPIC -c my_sum2.c
gcc -shared -o my_sum2.so my_sum2.o

Now the function requires us to pass a structure as an argument. To handle this, we need to create a mirror of this structure within our Python script. This can be done via the ctypes API:


In [ ]:
from ctypes import Structure, c_void_p, c_int

# Here's our C struct mirrored in Python
class Array(Structure):
    _fields_ = [("x", c_void_p),
                ("n", c_int)]

my_sum2 = CDLL('./my_sum2.so')

a = np.arange(10, dtype=np.int32)
arr = Array(a.ctypes.data_as(c_void_p), a.size)

my_sum2.sum(arr)

In [ ]:
a.sum()

We're starting to see how this can be helpful, but what if we want something more involved?

Interfacing to LAPACK functions

Let's interface to the system LAPACK (Linear Algebra Package) routines and solve a linear equation. This is the type of function you might write in order to interface with a code someone else has written.

We'll use the DGESV routine from LAPACK: the Double precision GEneral SolVer for linear equations. This is a standard routine found on most computers. See http://www.netlib.no/netlib/lapack/double/dgesv.f for a description of the call signature.


In [ ]:
from ctypes import CDLL, c_int, c_void_p, byref
import numpy as np

def solve(M, b):
    """Solve the linear equation M x = b using LAPACK's DGESV"""
    # Interface to the LAPACK system library
    # This may be different depending on your operating system
    lapack = CDLL('liblapack.dylib')
    
    # Make a double precision, fortran-ordered copy of M
    # (dgesv overwrites it with the LU factorization, so we
    # shouldn't modify the caller's array in-place)
    M = np.array(M, dtype=np.float64, order='F', copy=True)
    
    # Make a copy of b.  It will be overwritten with the result
    b = np.array(b, dtype=np.float64, copy=True)
    n = b.size
    
    # We could generalize this, but we'll do a simple problem for now
    assert M.shape == (n, n)
    assert b.shape == (n,)
    
    # prepare the variables for dgesv.
    size = c_int(n)
    one = c_int(1)
    info = c_int(0)
    lapack.dgesv(byref(size), # size of the problem
                 byref(one), # number of columns of b
                 M.ctypes.data_as(c_void_p), # M data array
                 byref(size), # size of the problem
                 (c_int * n)(), # integer pivot array (ipiv)
                 b.ctypes.data_as(c_void_p), # b data array
                 byref(size), # size of b
                 byref(info)) # exit flag
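    # dgesv reports status through info: 0 means success,
    # < 0 an illegal argument, > 0 a singular matrix
    assert info.value == 0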
    return b

In [ ]:
X = np.random.random((5, 5))
b = np.random.random(5)

solve(X, b)

In [ ]:
np.linalg.solve(X, b)

The routine np.linalg.solve actually uses this same routine (depending on your install options), so it should give precisely the same results. You'd never want to do this for LAPACK routines (they're already in numpy, after all), but this sort of thing can come in handy to call legacy C/Fortran code from Python.

F2Py

F2Py is a submodule of NumPy, and is designed to create easy interfaces to Fortran code (though it will also work with C). We can do something similar to above, but use f2py rather than gcc to compile it (note that this will require you to have a Fortran compiler installed).

A simple example

We'll start with a simple example that computes the first N Fibonacci numbers. Here's a Python version of the function:


In [ ]:
import numpy as np

def fib_py(N):
    x = np.zeros(N, dtype=float)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x

In [ ]:
fib_py(10)

In [ ]:
%timeit fib_py(10000)

Now we'll write the same routine in Fortran:


In [ ]:
%%file fib.f

      subroutine fib(A,N)
      integer N
      double precision A(N)
      do i=1,N
         if (i.EQ.1) then
            A(i) = 0.0D0
         elseif (i.EQ.2) then
            A(i) = 1.0D0
         else
            A(i) = A(i-1) + A(i-2)
         endif
      enddo
      end

We'll compile this from the command line using f2py3 (or f2py in Python 2):


In [ ]:
!f2py3 -c -m fib fib.f

In [ ]:
import numpy as np
import fib
a = np.zeros(10, dtype='d')
fib.fib(a) 
a

In [ ]:
a = np.zeros(10000, dtype='d')
%timeit fib.fib(a)

The Fortran version is about 300x faster!

Using an interface file

But it's a bit awkward that we have to create, in advance, the array that will hold the results... It would be nice if this happened automatically. We can make it happen through the use of interface files, which have a .pyf extension.

An interface template can be generated automatically with f2py. We'll tell it that we want a module called fib2:


In [ ]:
!rm -f _fib2.pyf
!f2py3 fib.f -m fib2 -h _fib2.pyf

In [ ]:
!cat _fib2.pyf

Take a look at the default settings. We have our subroutine called fib(a, n), which lists the following:

double precision dimension(n) :: a

This lists a variable which is an input into the routine: an array of length n

integer, optional,check(len(a)>=n),depend(a) :: n=len(a)

This tells us what the routine expects n to be. It is optionally specified, with a default value which is the length of a. Also, there is a check in here that makes sure a is long enough. Let's check this out:


In [ ]:
from fib import fib
a = np.zeros(10)
fib(a, 10)  # specify the optional keyword
a

In [ ]:
a = np.zeros(10)
fib(a, 5)  # specify only part of the array
a

In [ ]:
a = np.zeros(10)
fib(a, 20)  # ERROR: length is too long!

What we want to do is to modify this interface so that the length n is an input, and the array a is an (automatically constructed) output. We'll do this below; for more details on the options available in F2Py interface files, see the F2Py documentation.


In [ ]:
%%file fib2.pyf
!    -*- f90 -*-
! Note: the context of this file is case sensitive.

python module fib2 ! in 
    interface  ! in :fib2
        subroutine fib(a,n) ! in :fib2:fib.f
            double precision dimension(n), intent(out), depend(n) :: a
            integer intent(in) :: n
        end subroutine fib
    end interface 
end python module fib2

We've specified the intent and the dependencies of each variable using the f2py specifications.

Here we compile fib2 using the interface file we've created:


In [ ]:
!f2py3 -c -m fib2 fib2.pyf fib.f

Now we can import and call the function like this:


In [ ]:
import fib2
fib2.fib(10)

In [ ]:
%timeit fib2.fib(10000)

From our simple Fortran function, along with an interface file and a compilation with f2py, we now have a Python module which is callable in a very intuitive way.

If you look into the source code of SciPy, you'll see that this is how much of its functionality is implemented, through wrapping pieces of the NETLIB repository.

Cython

ctypes and F2Py provide the ability to wrap compiled C and Fortran code so that it can be called from Python. Cython enables this as well, though we won't focus on that aspect here. The biggest draw of Cython is that it lets you convert Python and Python-like code into compiled C code, which can run many times faster than the original.

Let's see a quick example. Here's a Python function which computes the Nth Fibonacci number:


In [ ]:
def nth_fib(n):
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a

In [ ]:
[nth_fib(i) for i in range(10)]

In [ ]:
%timeit nth_fib(10000)

Now we'll take the exact same code, and compile it with Cython. In general, this will be done by saving the code to file, and running cython on the command line. You can read about that in the documentation. Here we'll use IPython's Cython magic to streamline the process:


In [ ]:
%load_ext cython

In [ ]:
%%cython
def nth_fib2(n):
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a

In [ ]:
[nth_fib2(i) for i in range(10)]

In [ ]:
%timeit nth_fib2(10000)

Just compiling the code in Cython gave us a ~10% speedup. But we can do better by adding type annotations.

The main reason Python is slow is that it has to do dynamic type checking every time it evaluates an expression. If we tell Cython the types up front, that step can be skipped, yielding large time savings. We do this through the cdef statement. We also write out the temporary assignment explicitly, to avoid the Python-level tuple packing and unpacking:


In [ ]:
%%cython
def nth_fib3(int n):
    cdef int a = 0
    cdef int b = 1
    cdef int tmp
    for i in range(n):
        tmp = b
        b = a + b
        a = tmp
    return a

In [ ]:
[nth_fib3(i) for i in range(10)]

In [ ]:
print("Python only:")
%timeit nth_fib(10000)
print("Bare Cython:")
%timeit nth_fib2(10000)
print("Typed Cython:")
%timeit nth_fib3(10000)

By adding some type information to our script, we sped up the execution by several orders of magnitude! This shows how easy it can be to accelerate simple problems with Cython.

One very useful trick to be aware of is cython -a. This will produce an annotated html document showing which lines of the program are causing problems. You can run it like this:


In [ ]:
%%cython --annotate
# Slow version:
def nth_fib2(n):
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a

# Fast Version:
def nth_fib3(int n):
    cdef int a = 0
    cdef int b = 1
    cdef int tmp
    for i in range(n):
        tmp = b
        b = a + b
        a = tmp
    return a

Here we see a highlighted version of our code. The darkness of the highlight shows how many lines of C code were generated by the line of Python. More yellow lines generally means slower code. This can be very helpful when your code is not running as quickly as you'd expected.

Using Cython with NumPy

Cython provides a really nice interface to NumPy arrays via the typed memoryview syntax. Let's implement the same array-based fib function from the F2Py section, this time in Cython.

First we'll simply compile our Python function again:


In [ ]:
%%cython
import numpy as np

def fib_cy(N):
    x = np.zeros(N, dtype=float)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x

In [ ]:
print(np.allclose(fib_py(10000), fib_cy(10000)))
%timeit fib_py(10000)
%timeit fib_cy(10000)

Again, a very small improvement. Let's add some type information and see how we do:


In [ ]:
%%cython
import numpy as np
from numpy cimport float_t

def fib_cy2(int N):
    cdef int i
    cdef float_t[::1] x = np.zeros(N, dtype=float)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return np.asarray(x)  # convert the memoryview back to an ndarray

In [ ]:
%timeit fib_cy(10000)
%timeit fib_cy2(10000)

Wow! Simply adding some type information for the Cython compiler made our code orders of magnitude faster! This is because the array can now be treated as a contiguous memory block rather than a Python object. This makes each of the indexing operations much more efficient, because they no longer have to go through the Python interface layer.
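
We can often squeeze out a bit more by turning off the Python-style safety checks on indexing. Cython exposes these as compiler directives; the sketch below (same algorithm, plus a directive comment) disables bounds checking and negative-index wraparound, which is only safe once you're confident the indices stay in range:


In [ ]:
%%cython
# cython: boundscheck=False, wraparound=False
import numpy as np
from numpy cimport float_t

def fib_cy3(int N):
    cdef int i
    cdef float_t[::1] x = np.zeros(N, dtype=float)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return np.asarray(x)

In [ ]:
%timeit fib_cy3(10000)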

Learning More

There's a whole lot more to know about Cython, especially the details of working with NumPy arrays. The documentation and tutorials at cython.org are the place to go deeper.

Numba

You might wonder whether some of that type information could be inferred by the compiler, or whether the compilation phase is really necessary. This is where Numba comes in. Numba is an LLVM-based Just In Time (JIT) compiler for Python. It allows simple annotations of Python code which can lead to huge speedups for certain operations.

Numba comes from Continuum Analytics, the company founded by Travis Oliphant (the creator of NumPy and one of the original authors of SciPy), and the same people who are behind the Anaconda Python distribution. Numba is still relatively young, and can be a little finicky for some problems, but the approach is really clean.

You can install numba from scratch, but because of the dependencies on LLVM and other difficult-to-install tools, I'd highly recommend just going with Anaconda.

Let's go back to our Python fib() function:


In [ ]:
def fib(N):
    x = np.zeros(N, dtype=np.float64)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x

To speed this up with Numba, we can simply pass the function through autojit, which infers argument types at call time (in more recent Numba releases, the jit decorator with no signature plays this role):


In [ ]:
from numba import autojit
fib_nb = autojit(fib)

In [ ]:
fib_nb(10)

In [ ]:
%timeit fib(10000)
%timeit fib_nb(10000)

Usually, numba is used as a decorator, which essentially does the same as we saw above. You'd write a decorated function this way:


In [ ]:
@autojit
def fib_nb(N):
    x = np.zeros(N, dtype=np.float64)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x

In [ ]:
%timeit fib_nb(10000)

You can see that simply by passing our fib function through the autojit decorator, we've attained execution which is a few percent faster than what we got with Cython - and there's no fancy compilation stage required! The function is compiled on demand through LLVM, which makes numba very convenient to use.
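
If you want more control, jit also accepts an explicit type signature, in which case the function is compiled eagerly at definition time rather than on the first call. A minimal sketch, assuming a reasonably recent numba ('i8(i8)' means int64 in, int64 out):


In [ ]:
from numba import jit

@jit('i8(i8)')
def nth_fib_nb(n):
    # the same iterative Fibonacci as in the Cython section,
    # compiled for 64-bit integers when the function is defined
    a, b = 0, 1
    for i in range(n):
        a, b = b, a + b
    return a

nth_fib_nb(10)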

Now for the bad news: Numba is still young, and there are a lot of corner cases where the compiler will just break without any useful error message. For more complicated problems and functions, the output code is not always faster than Python+NumPy, and can sometimes even be slower!

My recommendation right now is to keep Numba in mind, to try it in your problems, but not necessarily depend on it -- yet.

More Information on Numba

There's still not a lot of information out there, but the official Numba documentation and the examples distributed with the package are a good place to start.