This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for UW's [Astro 599](http://www.astro.washington.edu/users/vanderplas/Astr599/) course. Source and license info is on [GitHub](https://github.com/jakevdp/2013_fall_ASTR599/).
In a previous session, we talked about ways to get performance out of Python using vectorization strategies within NumPy. Today we're going to talk about approaches which involve interfacing Python to external compiled code, or converting Python code to compilable code. There are several different approaches. We'll give examples of solving a problem by several means:
There is no way to go into depth on all of these, but I mainly just want to show some examples so that you know the types of problems they solve. At the above links, you'll find more in-depth documentation and tutorials.
There are other options we won't talk about here, because they're not as useful for our purposes:
The optimal approach depends on the desired use-case. We'll go through a few examples, from lowest to highest level:
There are a few reasons you might wish to use the tools outlined below:
Sometimes vectorization of algorithms can lead to excessive memory usage, because it relies on temporary arrays to hold intermediate results. Directly implementing the routine in compiled code can lead to more efficient computation.
Sometimes problems cannot be implemented as vectorized NumPy operations. An example is in tree-based algorithms; this is why I used Cython when I wrote the k-neighbors and kernel density estimation code within sklearn.neighbors.
Sometimes there are legacy code-bases that you don't want to have to re-implement: you'd much rather just write some sort of bindings to call the old code from within Python.
We'll do a brief demonstration of some of these tools. These tools do have a bit of a learning curve, so I'll provide references for where you might look to go deeper. The purpose of this lecture is to get a taste of some of the better options for addressing these problems.
Ctypes is part of the standard library, and offers tools to call routines from compiled libraries with Python. These can either be routines in your system libraries, or in shared libraries you compile yourself. We'll see examples of both.
One weakness of ctypes is that it can be very platform-specific. For example, shared-library extensions differ from platform to platform (e.g. *.so on Linux, *.dll on Windows, *.dylib on OSX). Other platform-by-platform issues like 32/64-bit builds can also make things difficult.
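If you'd rather not hard-code the platform-specific library name, the standard library's ctypes.util.find_library can often locate it for you. A minimal sketch (the exact string returned varies by platform, and it may return None if the library can't be found):
In [ ]:
from ctypes import CDLL
from ctypes.util import find_library

# Locate the C library in a platform-independent way
libc_path = find_library('c')   # e.g. 'libc.so.6' on Linux, '/usr/lib/libc.dylib' on OSX
print(libc_path)

libc = CDLL(libc_path)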
In Linux & Mac, you'll usually find the standard libraries here (remember that ! lets us execute shell commands within IPython):
In [1]:
!ls /usr/lib/
CTypes lets you link to any of these libraries and call the functions directly:
In [2]:
from ctypes import CDLL
libc_name = 'libc.dylib' # OSX
# libc_name = 'libc.so.6' # Linux
# libc_name = 'libc.dll' # Windows
libc = CDLL(libc_name)
In [3]:
libc.time
Out[3]:
In [ ]:
print "Seconds since January 1, 1970:"
print libc.time()
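By default, ctypes assumes every function takes and returns plain C ints. For anything else you can declare the prototype explicitly through restype and argtypes. Here is a small sketch for time(), whose C signature is time_t time(time_t *tloc) (we assume time_t fits in a C long here):
In [ ]:
from ctypes import c_long, c_void_p

# declare the prototype so ctypes knows how to convert arguments and the return value
libc.time.restype = c_long
libc.time.argtypes = [c_void_p]
print(libc.time(None))   # pass NULL for the output pointer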
In [1]:
%%file my_sum.c
#include <stdio.h>
// sum all the values in the array x
// x is a pointer to a memory block
// of length n
int sum(int *x, int n)
{
int i, counter;
counter = 0;
for(i=0; i<n; i++)
{
counter += x[i];
}
return counter;
}
In [2]:
%%bash
gcc -c my_sum.c
gcc -shared -o my_sum.so my_sum.o
In [3]:
!ls my_sum.*
Now we'll use the CTypes API to create Python objects that can be passed to this function:
In [4]:
from ctypes import CDLL, c_void_p
import numpy as np
my_sum=CDLL('my_sum.so')
a = np.arange(10, dtype=np.int32)
adata = a.ctypes.data_as(c_void_p)
asize = a.size
print my_sum.sum(adata, asize)
print a.sum()
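The raw data_as(c_void_p) approach works, but it offers no protection if you pass an array of the wrong dtype or memory layout. NumPy's np.ctypeslib.ndpointer can serve as an argtypes entry that checks this for you; a sketch, reusing the my_sum library loaded above:
In [ ]:
from ctypes import c_int
import numpy as np

# Declare the prototype: a contiguous int32 array plus its length
my_sum.sum.restype = c_int
my_sum.sum.argtypes = [np.ctypeslib.ndpointer(dtype=np.int32, flags='C_CONTIGUOUS'),
                       c_int]

a = np.arange(10, dtype=np.int32)
print(my_sum.sum(a, a.size))   # the array itself can now be passed directly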
In [5]:
%%file my_sum2.c
#include <stdio.h>
// Define a simple array struct, containing
// a pointer and a length
struct Array{
int *x;
int n;
};
// Sum the values in the array struct
int sum(struct Array a)
{
int counter, i;
counter = 0;
for(i=0; i<a.n; i++)
{
counter += a.x[i];
}
return counter;
}
In [6]:
%%bash
gcc -c my_sum2.c
gcc -shared -o my_sum2.so my_sum2.o
Now the function requires us to pass a structure as an argument. To handle this, we need to create a mirror of this structure within our Python script. This can be done via the CTypes API:
In [7]:
from ctypes import Structure, POINTER, c_int
# Here's our C structure defined in Python
class Array(Structure):
_fields_ = ("x", c_void_p), ("n", c_int)
my_sum2=CDLL('my_sum2.so')
a = np.arange(10, dtype=np.int32)
arr = Array(a.ctypes.data_as(c_void_p), a.size)
print my_sum2.sum(arr)
print a.sum()
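Since we imported POINTER above, here is an equivalent sketch using a typed pointer field rather than c_void_p; either works for handing the raw buffer to C, but the typed version documents the intent:
In [ ]:
from ctypes import Structure, POINTER, c_int

class ArrayTyped(Structure):
    _fields_ = [("x", POINTER(c_int)), ("n", c_int)]

a = np.arange(10, dtype=np.int32)
arr = ArrayTyped(a.ctypes.data_as(POINTER(c_int)), a.size)
print(my_sum2.sum(arr))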
We're starting to see how this can be helpful, but what if we want something more involved?
Let's interface to the system LAPACK (Linear Algebra Package) routines and solve a linear equation. This is the type of function you might write in order to interface with a code someone else has written.
We'll use the DGESV routine from LAPACK: the Double precision GEneral SolVer for linear equations. This is a standard routine found on most computers. See http://www.netlib.no/netlib/lapack/double/dgesv.f for a description of the call signature.
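For reference, the documented call signature (copied from the netlib page above) is shown below; every argument is passed by reference, which is why the Python wrapper that follows uses byref() for the scalars.

      SUBROUTINE DGESV( N, NRHS, A, LDA, IPIV, B, LDB, INFO )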
In [8]:
from ctypes import CDLL, c_int, c_void_p, byref
import numpy as np
def solve(M, b):
"""Solve the linear equation M x = b using LAPACK's DGESV"""
# Interface to the LAPACK system library
# This may be different depending on your operating system
lapack = CDLL('liblapack.dylib')
# Make sure M is double precision fortran-ordered
M = np.asarray(M, dtype=np.float64, order='F')
# Make a copy of b. It will be overwritten with the result
b = np.array(b, dtype=np.float64, copy=True)
n = b.size
# We could generalize this, but we'll do a simple problem for now
assert M.shape == (n, n)
assert b.shape == (n,)
# prepare the variables for dgesv.
size = c_int(n)
one = c_int(1)
info = c_int(0)
lapack.dgesv(byref(size), # size of the problem
byref(one), # number of columns of b
M.ctypes.data_as(c_void_p), # M data array
byref(size), # size of the problem
(n * c_int)(), # integer work array
b.ctypes.data_as(c_void_p), # b data array
byref(size), # size of b
byref(info)) # exit flag
return b
In [9]:
X = np.random.random((5, 5))
b = np.random.random(5)
print solve(X, b)
In [10]:
print np.linalg.solve(X, b)
The routine np.linalg.solve actually uses this same routine (depending on your install options), so it should give precisely the same results. You'd never want to do this for LAPACK routines (they're already in NumPy, after all), but this sort of thing can come in handy to call legacy C/Fortran code from Python.
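As an aside, if you do need a low-level LAPACK call from Python, SciPy already ships wrappers for most of them, so in practice you'd reach for those rather than ctypes. A hedged sketch (the four-value return layout follows my reading of the scipy.linalg.lapack conventions):
In [ ]:
from scipy.linalg import lapack

# X and b are the arrays defined in the cell above
lu, piv, x, info = lapack.dgesv(X, b)
print(x)      # solution vector; should match the results above
print(info)   # 0 indicates success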
In [45]:
import numpy as np
def fib_py(N):
x = np.zeros(N, dtype=float)
for i in range(N):
if i == 0:
x[i] = 0
elif i == 1:
x[i] = 1
else:
x[i] = x[i - 1] + x[i - 2]
return x
In [46]:
print fib_py(10)
In [47]:
%timeit fib_py(10000)
Now we'll write the same routine in Fortran:
In [48]:
%%file fib.f
subroutine fib(A,N)
integer N
double precision A(N)
do i=1,N
if (i.EQ.1) then
A(i) = 0.0D0
elseif (i.EQ.2) then
A(i) = 1.0D0
else
A(i) = A(i-1) + A(i-2)
endif
enddo
end
We'll compile this from the command line using f2py:
In [49]:
!f2py -c -m fib fib.f
In [51]:
import numpy as np
import fib
a = np.zeros(10, dtype='d')
fib.fib(a)
print a
In [52]:
a = np.zeros(10000, dtype='d')
%timeit fib.fib(a)
The Fortran version is about 300x faster!
But it is a bit awkward that we have to create the array that will hold the results... It would be nice if this could happen automatically. We can make this happen through the use of interface files, which have a .pyf extension.
An interface template can be generated automatically with f2py. We'll tell it that we want a module called fib2:
In [18]:
!rm -f _fib2.pyf
!f2py fib.f -m fib2 -h _fib2.pyf
In [19]:
!cat _fib2.pyf
Take a look at the default settings. We have our subroutine fib(a, n), which lists the following:

double precision dimension(n) :: a

This lists a variable which is an input into the routine: an array of length n.

integer, optional,check(len(a)>=n),depend(a) :: n=len(a)

This tells us what the routine expects n to be. It is optionally specified, with a default value which is the length of a. Also, there is a check in here that makes sure a is long enough. Let's check this out:
In [20]:
from fib import fib
a = np.zeros(10)
fib(a, 10) # specify the optional keyword
print a
In [21]:
a = np.zeros(10)
fib(a, 5) # specify only part of the array
print a
In [22]:
a = np.zeros(10)
fib(a, 20) # specify a length which is too long
What we want to do is to modify this interface so that the length n is an input, and the array a is an (automatically constructed) output. We'll do this below; for more details on the options available in F2Py interface files, see the F2Py documentation.
In [ ]:
%%file fib2.pyf
! -*- f90 -*-
! Note: the context of this file is case sensitive.
python module fib2 ! in
interface ! in :fib2
subroutine fib(a,n) ! in :fib2:fib.f
double precision dimension(n), intent(out), depend(n) :: a
integer intent(in) :: n
end subroutine fib
end interface
end python module fib2
We've specified the intent and the dependencies of each variable using the f2py specifications.
Here we compile fib2 using the interface file we've created:
In [ ]:
!f2py -c -m fib2 fib2.pyf fib.f
Now we can import and call the function like this:
In [ ]:
import fib2
print fib2.fib(10)
In [ ]:
fib2.fib?
In [ ]:
%timeit fib2.fib(10000)
From our simple Fortran function, along with an interface file and a compilation with f2py, we now have a Python module which is callable in a very intuitive way.
If you look into the source code of SciPy, you'll see that this is how much of its functionality is implemented, through wrapping pieces of the NETLIB repository.
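If you want to distribute an f2py-wrapped extension rather than building it by hand, the usual route (at least with the NumPy tooling of this era) is numpy.distutils, which understands .pyf interface files directly. A minimal sketch of a setup.py, reusing the file names from this lecture:
In [ ]:
%%file setup.py
# A sketch: build the fib2 extension from the interface file plus the Fortran source
from numpy.distutils.core import setup, Extension

ext = Extension(name='fib2', sources=['fib2.pyf', 'fib.f'])

setup(name='fib2', ext_modules=[ext])

You would then build it in place with something like python setup.py build_ext --inplace.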
CTypes and F2Py provide the ability to wrap Fortran, C, and C++ code so that it can be imported into Python. Cython enables this as well, though we will not focus on that part of it here. The biggest draw of Cython is that it lets you convert Python code and Python-like code into compiled C code, which can run many times faster than the original.
Let's see a quick example. Here's a Python function which computes the nth Fibonacci number:
In [53]:
def nth_fib(n):
a, b = 0, 1
for i in range(n):
b, a = a + b, b
return a
In [54]:
[nth_fib(i) for i in range(10)]
Out[54]:
In [55]:
%timeit nth_fib(10000)
Now we'll take the exact same code, and compile it with Cython. In general, this will be done by saving the code to a file and running cython on the command line. You can read about that in the documentation. Here we'll use IPython's Cython magic to streamline the process:
In [56]:
%load_ext cythonmagic
In [57]:
%%cython
def nth_fib2(n):
a, b = 0, 1
for i in range(n):
b, a = a + b, b
return a
In [25]:
[nth_fib2(i) for i in range(10)]
Out[25]:
In [58]:
%timeit nth_fib2(10000)
Just compiling the code in Cython gave us a ~10% speedup. But we can do better by adding type annotations.
The main reason Python is slow is that it has to do dynamic type checking each time it evaluates an expression. If we can tell Cython what the types are from the beginning, this step can be skipped, and we get large time savings. We do this through the cdef statement. We also do the temporary assignment explicitly to remove the Python tuple assignment:
In [59]:
%%cython
def nth_fib3(int n):
cdef int a = 0
cdef int b = 1
cdef int tmp
for i in range(n):
tmp = b
b = a + b
a = tmp
return a
In [60]:
[nth_fib3(i) for i in range(10)]
Out[60]:
In [61]:
print "Python only:"
%timeit nth_fib(10000)
print "Bare Cython:"
%timeit nth_fib2(10000)
print "Typed Cython:"
%timeit nth_fib3(10000)
By adding some type information to our script, we sped up the execution by several orders of magnitude! This shows how easy Cython is for simple problems.
One very useful trick to be aware of is cython -a. This produces an annotated HTML document showing which lines of the program incur the most Python overhead. You can run it like this:
In [62]:
%%file nth_fib.pyx
# Slow version:
def nth_fib2(n):
a, b = 0, 1
for i in range(n):
b, a = a + b, b
return a
# Fast Version:
def nth_fib3(int n):
cdef int a = 0
cdef int b = 1
cdef int tmp
for i in range(n):
tmp = b
b = a + b
a = tmp
return a
In [63]:
!cython -a nth_fib.pyx
If we open the resulting HTML file, we'll see a highlighted version of our code. The darkness of the highlight shows how many lines of C code were generated by the line of Python. More yellow lines generally mean slower code. This can be very helpful when your code is not running as quickly as you'd expected. To see the results, open nth_fib.html after running the above code.
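Depending on your IPython and Cython versions, the cell magic may accept the same annotation flag and display the highlighted report inline, saving the round-trip through an HTML file. A sketch (the --annotate option is an assumption about your installed cythonmagic):
In [ ]:
%%cython --annotate
def nth_fib2(n):
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a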
Cython provides a really nice interface to numpy arrays via the Typed Memoryview syntax. Let's implement the same fib function as above, but using Cython.
First we'll simply compile our Python function again:
In [32]:
%%cython
import numpy as np
def fib_cy(N):
x = np.zeros(N, dtype=float)
for i in range(N):
if i == 0:
x[i] = 0
elif i == 1:
x[i] = 1
else:
x[i] = x[i - 1] + x[i - 2]
return x
In [33]:
print np.allclose(fib_py(10000), fib_cy(10000))
%timeit fib_py(10000)
%timeit fib_cy(10000)
Again, a very small improvement. Let's add some type information and see how we do:
In [34]:
%%cython
import numpy as np
from numpy cimport float_t
def fib_cy2(int N):
cdef int i
cdef float_t[::1] x = np.zeros(N, dtype=float)
for i in range(N):
if i == 0:
x[i] = 0
elif i == 1:
x[i] = 1
else:
x[i] = x[i - 1] + x[i - 2]
return x
In [35]:
%timeit fib_cy(10000)
%timeit fib_cy2(10000)
Wow! Simply adding some type information for the Cython compiler made our code orders of magnitude faster! This is because the array can now be treated as a contiguous memory block rather than a Python object. This makes each of the indexing operations much more efficient, because they no longer have to go through the Python interface layer.
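If you want to push a bit further, Cython's compiler directives can turn off the bounds and negative-index checks that each memoryview access still performs; this is safe here only because the loop indices are guaranteed to stay in range. A sketch building on fib_cy2 above:
In [ ]:
%%cython
import numpy as np
cimport cython
from numpy cimport float_t

@cython.boundscheck(False)   # skip bounds checking on each x[i]
@cython.wraparound(False)    # disallow negative indexing, removing that check too
def fib_cy3(int N):
    cdef int i
    cdef float_t[::1] x = np.zeros(N, dtype=float)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x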
There's a whole lot more to know about Cython, especially the details of working with numpy arrays. Here are some more resources:
You might wonder whether some of that type information could be inferred by the compiler, or whether the compilation phase is really necessary. This is where Numba comes in. Numba is an LLVM-based Just In Time (JIT) compiler for Python. It allows simple annotations of Python code which can lead to huge speedups for certain operations.
Numba comes from Continuum Analytics, a company founded by the creator of NumPy and SciPy, and the same people who are behind the Anaconda Python distribution. Numba is still relatively young and can be a little finicky for some problems, but the approach is really clean.
You can install numba from scratch, but because of the dependencies on LLVM and other difficult-to-install tools, I'd highly recommend just going with Anaconda.
Let's go back to our Python fib() function:
In [65]:
def fib(N):
x = np.zeros(N, dtype=np.float64)
for i in range(N):
if i == 0:
x[i] = 0
elif i == 1:
x[i] = 1
else:
x[i] = x[i - 1] + x[i - 2]
return x
To speed this up with Numba, we simply use the autojit decorator:
In [67]:
from numba import autojit
fib_nb = autojit(fib)
In [68]:
fib_nb(10)
Out[68]:
In [69]:
%timeit fib(10000)
%timeit fib_nb(10000)
Usually, numba is used as a decorator, which essentially does the same as we saw above. You'd write a decorated function this way:
In [70]:
@autojit
def fib_nb(N):
x = np.zeros(N, dtype=np.float64)
for i in range(N):
if i == 0:
x[i] = 0
elif i == 1:
x[i] = 1
else:
x[i] = x[i - 1] + x[i - 2]
return x
In [71]:
%timeit fib_nb(10000)
You can see that simply by passing our fib function through the autojit decorator, we've attained execution which is a few percent faster than what we got with Cython, and there's no fancy compilation stage required! The function is compiled on demand through LLVM, which makes Numba very convenient to use.
Now for the bad news: Numba is still young, and there are a lot of corner cases where the compiler will just break without any useful error message. For more complicated problems and functions, the output code is not always faster than Python+NumPy, and can sometimes even be slower!
My recommendation right now is to keep Numba in mind, to try it in your problems, but not necessarily depend on it -- yet.
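If you do want more control than autojit gives you, Numba's jit decorator also accepts an explicit type signature, which locks in the types up front instead of inferring them at the first call. A hedged sketch for the scalar nth_fib function (the signature-string syntax and the overflow behavior depend on your Numba version):
In [ ]:
from numba import jit

@jit('int64(int64)')
def nth_fib_nb(n):
    # same logic as nth_fib above; note that values are now fixed-width
    # 64-bit integers, so this will overflow for large n
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a

print(nth_fib_nb(10))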
One of the more interesting and beautiful calculations you can do with Python is the Mandelbrot set, a straightforwardly-generated set of points whose boundary is a fractal. A point c = x + iy belongs to the set if the iteration z -> z*z + c, starting from z = 0, stays bounded; in practice we iterate a fixed number of times and bail out once |z|^2 >= 4. The computation of whether a point is in the Mandelbrot set can be done as follows:
In [72]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [76]:
@autojit
def mandel(x, y, max_iter):
"""
Given z = x + iy and max_iter, determine whether the candidate
is in the mandelbrot set for the given number of iterations
"""
c = complex(x, y)
z = 0.0j
for i in range(max_iter):
z = z*z + c
if (z.real*z.real + z.imag*z.imag) >= 4:
return i
return max_iter
@autojit
def create_fractal(Nx, Ny, xmin, xmax, ymin, ymax, max_iter):
"""Create and return a fractal image"""
image = np.zeros((Ny, Nx), dtype=float)
dx = (xmax - xmin) * 1. / Nx
dy = (ymax - ymin) * 1. / Ny
for x in range(Nx):
rpart = xmin + x * dx
for y in range(Ny):
ipart = ymin + y * dy
color = mandel(rpart, ipart, max_iter)
image[y, x] = color
return image
In [77]:
image = create_fractal(300, 200, -2, 1, -1, 1, 20)
plt.imshow(image, cmap=plt.cm.jet)
Using this simple code, you can generate some absolutely stunning images. The problem is, to go deeper you need very high resolution and very many iterations, both of which are extremely slow in Python!
Your assignment is to make these faster:
Speed up this example using either CTypes, F2Py, or Cython. With CTypes/F2Py, you'll have to write the above functionality as a C/Fortran routine, and use the tools above to wrap that functionality.
For Cython, it will require adding appropriate type information to each of the variables used in the computation. Also, check the Cython documentation and make sure to make mandel() a cdef-ed function. Once you're finished, use the cython -a trick to generate annotated code, and make sure there are no yellow-highlighted lines in the core of your algorithm: the loops should look completely clean!
Speed up this example using Numba. Numba is much more of a black box, but you should be able to do it very quickly. You'll likely find that simply adding the decorator to the function doesn't work. I'll give you a hint: Numba doesn't like an array specified with dtype=float. Look back to our previous example for clues about how to fix this.
(Also, you might find this notebook to be a useful resource: it is all about doing the mandelbrot set with Numba).
Use %timeit to compare your three implementations side-by-side for the values passed above.
Once you have these working, choose one of the methods and use it to zoom-in on a portion of the above plot. Go to high resolution and a large number of iterations (say 256). Experiment with different matplotlib color schemes (see some of the options here) and produce your most beautiful visualization of the Mandelbrot set.