Extensions and compiled code

High-performance python typically means using extension modules (read: other languages)

Why extend?

  • Execution speed, obviously
  • Vectorizing an algorithm can be difficult, time-consuming, or inefficient
  • Vectorization may be memory-limiting (unless you use memmap, but that's another ball of wax)
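
The memory point is easy to demonstrate with a sketch (numpy assumed; the sizes are illustrative): a fully vectorized reduction over an outer product materializes an n-by-n temporary, while looping over one axis keeps the peak temporary down to a single row.

```python
import numpy as np

n = 2000
a = np.random.rand(n)
b = np.random.rand(n)

# fully vectorized: materializes an (n, n) float64 temporary,
# 8 * n * n bytes = 32 MB at n = 2000
outer = np.add.outer(a, b)
total_vec = outer.sum()

# looped over one axis: the peak temporary is a single length-n row (~16 KB)
total_loop = 0.0
for i in range(n):
    total_loop += (a[i] + b).sum()

assert np.isclose(total_vec, total_loop)
print(outer.nbytes)  # 8 * n * n = 32000000
```

As n grows, the vectorized temporary grows quadratically while the loop's working set grows linearly -- which is exactly when a compiled loop (numba, cython) can beat both.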

Why should I not extend?

  • It's not python, so you typically have to deal with compiling code and linking to shared libraries. Yuck. Installing and redistributing the code just got hard.
  • It's not python. Python is sooo much easier.

What choices do I have?

  • Direct calls to compiled libraries: ctypes
  • Just-in-time compilation: numba, PyPy, and the now defunct psyco
  • Building python bindings: hand-binding, SWIG, f2py, and cython

Using just-in-time compilation: numba

  • jit

In [4]:
%%file numba_mandel.py
"""
Compute and plot the Mandelbrot set using matplotlib.
"""

import numpy as np
import matplotlib.pylab as mpl

from numba import jit

@jit
def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    c = complex(x,y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters

    return 255

@jit(nopython=True)
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]

    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color

    return image

image = np.zeros((700, 1400), dtype=np.uint8)

import time
start = time.time()
create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)
print "took %s" % (time.time() - start)

# mpl.imshow(image)
# mpl.gray()
# mpl.show()


Overwriting numba_mandel.py

In [5]:
!python2.7 numba_mandel.py


took 0.172127962112

In [8]:
%%file bubblesort.py
"""
demonstrate compiled extensions using bubblesort
"""

import numpy as np
from numba import jit

def bubblesort(items):
    length = len(items)
    swapped = 1
    for i in range(0, length):
        if swapped:
            swapped = 0
            for ele in range(0, length-i-1):
                if items[ele] > items[ele + 1]:
                    temp = items[ele + 1]
                    items[ele + 1] = items[ele]
                    items[ele] = temp
                    swapped = 1
    return items

jitbubblesort = jit(bubblesort)


import time
# repeated sizes: the first numba call includes JIT compilation overhead
for N in (100,100,1000,1000,10000):

    randoms = np.random.randint(0,1000,(N)).tolist()

    print "For N=%s" % N
    start = time.time()
    x = bubblesort(randoms)
    print "%s: python" % (time.time() - start)
    assert np.all(sorted(randoms) == x)

    start = time.time()
    x = jitbubblesort(randoms)
    print "%s: numba" % (time.time() - start)
    assert np.all(sorted(randoms) == x)
    
    print ''


# EOF


Overwriting bubblesort.py

In [9]:
!python2.7 bubblesort.py


For N=100
0.00119304656982: python
0.191457986832: numba

For N=100
0.00129985809326: python
8.01086425781e-05: numba

For N=1000
0.120245933533: python
0.000295877456665: numba

For N=1000
0.120244026184: python
0.000308990478516: numba

For N=10000
11.8938641548: python
0.00219893455505: numba

  • jit with type annotations
  • autojit
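
jit also accepts an explicit type signature, which compiles the function eagerly for those types rather than lazily on first call. A minimal sketch (the try/except fallback is only there so the snippet still runs, uncompiled, when numba is not installed):

```python
try:
    from numba import jit
except ImportError:
    # numba unavailable: substitute a no-op decorator so the code still runs
    def jit(signature=None, **kwargs):
        def wrap(func):
            return func
        return wrap

# compile eagerly for int64 -> int64 instead of inferring types on first call
@jit("int64(int64)", nopython=True)
def typed_fib(n):
    a, b = 0, 1
    for i in range(n):
        a, b = b, a + b
    return a

print(typed_fib(10))  # -> 55
```

With an explicit signature there is no compilation pause on the first call, but passing arguments of any other type raises an error instead of triggering a new specialization.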

In [12]:
import numpy as np
from numba import autojit
import time

def fib(N):
    x = np.zeros(N, dtype=np.float64)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x

jitfib = autojit(fib)


for N in (100,1000,10000):

    print "For N=%s" % N
    start = time.time()
    x = fib(N)
    print "%s: python" % (time.time() - start)

    start = time.time()
    x = jitfib(N)
    print "%s: numba" % (time.time() - start)
    
    print ''


For N=100
7.39097595215e-05: python
0.119683027267: numba

For N=1000
0.000465154647827: python
1.4066696167e-05: numba

For N=10000
0.00473999977112: python
4.91142272949e-05: numba

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/kernel/__main__.py:14: RuntimeWarning: overflow encountered in double_scalars

Using python bindings: cython

  • Write python code, compile it as C -- from the commandline with cython

In [14]:
%%file pi.pyx

def pied( int num ) :
    return num * 3.14159265359

def vec_pied( int r ) :
    retList = []
    cdef unsigned int i
    for i in range(r):
        retList.append( 3.14159 * i )
    return retList


Writing pi.pyx

In [18]:
%%file setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy as np
    
extensions=[
  Extension("pi", ["pi.pyx"], include_dirs=[np.get_include()]),
]   

setup(
  ext_modules=cythonize(extensions)
)


Overwriting setup.py

In [19]:
!python2.7 setup.py build_ext --inplace


Compiling pi.pyx because it changed.
Cythonizing pi.pyx
running build_ext
building 'pi' extension
creating build
creating build/temp.macosx-10.8-x86_64-2.7
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -pipe -Os -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/include -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c pi.c -o build/temp.macosx-10.8-x86_64-2.7/pi.o
/usr/bin/clang -bundle -undefined dynamic_lookup -L/opt/local/lib -Wl,-headerpad_max_install_names -L/opt/local/lib/db48 build/temp.macosx-10.8-x86_64-2.7/pi.o -o /Users/mmckerns/notebooks/mystic_pathos_tutorials/parallel/pi.so

In [22]:
import pi
print pi.pied(1)
print pi.vec_pied(4)


3.14159265359
[0.0, 3.14159, 6.28318, 9.424769999999999]

  • Write python code, compile it as C -- interactively in IPython

In [24]:
%load_ext Cython

In [29]:
%%cython
def cy_bubblesort(items):
    length = len(items)
    swapped = 1
    for i in range(0, length):
        if swapped:
            swapped = 0
            for ele in range(0, length-i-1):
                if items[ele] > items[ele + 1]:
                    temp = items[ele + 1]
                    items[ele + 1] = items[ele]
                    items[ele] = temp
                    swapped = 1
    return items

In [30]:
import time
for N in (100,100,1000,1000,10000):

    randoms = np.random.randint(0,1000,(N)).tolist()

    print "For N=%s" % N
    start = time.time()
    x = cy_bubblesort(randoms)
    print "%s: Cython" % (time.time() - start)
    assert np.all(sorted(randoms) == x)

    print ''


For N=100
0.000560998916626: Cython

For N=100
0.000540971755981: Cython

For N=1000
0.0564270019531: Cython

For N=1000
0.0560281276703: Cython

For N=10000
5.55099987984: Cython

  • Python code with type annotations

In [32]:
def py_fib(n):
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a

%timeit py_fib(10000)


100 loops, best of 3: 3.29 ms per loop

In [33]:
%%cython
def cy_fib(n):
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a

In [34]:
%timeit cy_fib(10000)


100 loops, best of 3: 2.03 ms per loop

In [40]:
%%file fib.pyx

def cy_fib(int n):
    cdef int a = 0
    cdef int b = 1
    cdef int tmp
    for i in range(n):
        tmp = b
        b = a + b
        a = tmp
    return a


Overwriting fib.pyx

In [41]:
%%file setup2.py

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name = "cython fib",
    ext_modules = cythonize('fib.pyx'),  # accepts a glob pattern
)


Overwriting setup2.py

In [42]:
!python2.7 setup2.py build_ext --inplace


Compiling fib.pyx because it changed.
Cythonizing fib.pyx
running build_ext
building 'fib' extension
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -pipe -Os -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c fib.c -o build/temp.macosx-10.8-x86_64-2.7/fib.o
/usr/bin/clang -bundle -undefined dynamic_lookup -L/opt/local/lib -Wl,-headerpad_max_install_names -L/opt/local/lib/db48 build/temp.macosx-10.8-x86_64-2.7/fib.o -o /Users/mmckerns/notebooks/mystic_pathos_tutorials/parallel/fib.so

In [43]:
import fib
%timeit fib.cy_fib(10000)


100000 loops, best of 3: 5.78 µs per loop

  • Aside: line profiling with cython -a

In [46]:
!cython-2.7 -a fib.pyx

See the generated annotation file: fib.html

  • Working with contiguous memory: cython with numpy

In [47]:
%%cython
import numpy as np
from numpy cimport float_t

def numcy_fib(int N):
    cdef int i
    cdef float_t[::1] x = np.zeros(N, dtype=float)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x

In [49]:
%timeit numcy_fib(10000)


The slowest run took 4.08 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 31.4 µs per loop

  • Not covered: working with native parallelism: prange and nogil
  • Not covered: leveraging external functions, structures, and classes

EXERCISE: See if cython or numba can help speed up our Monte Carlo example. Start with roll.py. What are potential issues with using either within a parallel map that may distribute code and/or objects to another process or computer?

Using direct calls to native libraries: ctypes

  • Accessing shared memory (we've seen this already)
  • Direct access to shared libraries
  • Writing C code and making direct calls to your C functions
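
A minimal sketch of the direct-access bullets, assuming a POSIX system (Linux or macOS), where passing None to CDLL yields the symbols already linked into the current process, including libc:

```python
import ctypes

# load the symbols already linked into this process (POSIX only);
# on Windows you would load a specific DLL by name instead
libc = ctypes.CDLL(None)

# declare argument/return types -- ctypes assumes C int for both otherwise
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.abs(-42))           # -> 42
print(libc.strlen(b"ctypes"))  # -> 6
```

The same pattern works for your own C code: compile it to a shared library (e.g. `cc -shared -fPIC mycode.c -o mycode.so`), load it with `ctypes.CDLL("./mycode.so")`, declare the argtypes/restype, and call the functions directly -- no binding layer required.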