Extensions and compiled code

High-performance python typically means using extension modules (read: other languages)

Why extend?

  • Execution speed, obviously
  • Vectorizing an algorithm can be difficult, time-consuming, or inefficient
  • Vectorization may be memory-limiting (unless you use memmap, but that's another ball of wax)
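
The memory point is easy to demonstrate with a sketch (numpy assumed; the sizes are illustrative): a fully vectorized reduction over an outer product materializes an n-by-n temporary, while looping over one axis keeps the peak temporary down to a single row.

```python
import numpy as np

n = 2000
a = np.random.rand(n)
b = np.random.rand(n)

# fully vectorized: materializes an (n, n) float64 temporary,
# 8 * n * n bytes = 32 MB at n = 2000
outer = np.add.outer(a, b)
total_vec = outer.sum()

# looped over one axis: the peak temporary is a single length-n row (~16 KB)
total_loop = 0.0
for i in range(n):
    total_loop += (a[i] + b).sum()

assert np.isclose(total_vec, total_loop)
print(outer.nbytes)  # 8 * n * n = 32000000
```

As n grows, the vectorized temporary grows quadratically while the loop's working set grows linearly -- which is exactly when a compiled loop (numba, cython) can beat both.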

Why should I not extend?

  • It's not python, so you typically have to deal with compiling code and linking to shared libraries. Yuck. Installing and redistributing the code just got hard.
  • It's not python. Python is sooo much easier.

What choices do I have?

  • Direct calls to compiled libraries: ctypes
  • Just-in-time compilation: numba, PyPy, and the now defunct psyco
  • Building python bindings: hand-binding, SWIG, f2py, and cython

Using just-in-time compilation: numba

  • jit

In [4]:
%%file numba_mandel.py
"""
Compute and plot the Mandelbrot set using matplotlib.
"""

import numpy as np
import matplotlib.pylab as mpl

from numba import jit

@jit
def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    c = complex(x,y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters

    return 255

@jit(nopython=True)
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]

    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color

    return image

image = np.zeros((700, 1400), dtype=np.uint8)

import time
start = time.time()
create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)
print "took %s" % (time.time() - start)

# mpl.imshow(image)
# mpl.gray()
# mpl.show()


Overwriting numba_mandel.py

In [5]:
!python2.7 numba_mandel.py


took 0.172127962112

In [8]:
%%file bubblesort.py
"""
demonstrate compiled extensions using bubblesort
"""

import numpy as np
from numba import jit

def bubblesort(items):
    length = len(items)
    swapped = 1
    for i in range(0, length):
        if swapped:
            swapped = 0
            for ele in range(0, length-i-1):
                if items[ele] > items[ele + 1]:
                    temp = items[ele + 1]
                    items[ele + 1] = items[ele]
                    items[ele] = temp
                    swapped = 1
    return items

jitbubblesort = jit(bubblesort)


import time
# repeated sizes: the first numba call includes JIT compilation overhead
for N in (100,100,1000,1000,10000):

    randoms = np.random.randint(0,1000,(N)).tolist()

    print "For N=%s" % N
    start = time.time()
    x = bubblesort(randoms)
    print "%s: python" % (time.time() - start)
    assert np.all(sorted(randoms) == x)

    start = time.time()
    x = jitbubblesort(randoms)
    print "%s: numba" % (time.time() - start)
    assert np.all(sorted(randoms) == x)
    
    print ''


# EOF


Overwriting bubblesort.py

In [9]:
!python2.7 bubblesort.py


For N=100
0.00119304656982: python
0.191457986832: numba

For N=100
0.00129985809326: python
8.01086425781e-05: numba

For N=1000
0.120245933533: python
0.000295877456665: numba

For N=1000
0.120244026184: python
0.000308990478516: numba

For N=10000
11.8938641548: python
0.00219893455505: numba

  • jit with type annotations
  • autojit
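
jit also accepts an explicit type signature, which compiles the function eagerly for those types rather than lazily on first call. A minimal sketch (the try/except fallback is only there so the snippet still runs, uncompiled, when numba is not installed):

```python
try:
    from numba import jit
except ImportError:
    # numba unavailable: substitute a no-op decorator so the code still runs
    def jit(signature=None, **kwargs):
        def wrap(func):
            return func
        return wrap

# compile eagerly for int64 -> int64 instead of inferring types on first call
@jit("int64(int64)", nopython=True)
def typed_fib(n):
    a, b = 0, 1
    for i in range(n):
        a, b = b, a + b
    return a

print(typed_fib(10))  # -> 55
```

With an explicit signature there is no compilation pause on the first call, but passing arguments of any other type raises an error instead of triggering a new specialization.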

In [12]:
import numpy as np
from numba import autojit
import time

def fib(N):
    x = np.zeros(N, dtype=np.float64)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x

jitfib = autojit(fib)


for N in (100,1000,10000):

    print "For N=%s" % N
    start = time.time()
    x = fib(N)
    print "%s: python" % (time.time() - start)

    start = time.time()
    x = jitfib(N)
    print "%s: numba" % (time.time() - start)
    
    print ''


For N=100
7.39097595215e-05: python
0.119683027267: numba

For N=1000
0.000465154647827: python
1.4066696167e-05: numba

For N=10000
0.00473999977112: python
4.91142272949e-05: numba

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/kernel/__main__.py:14: RuntimeWarning: overflow encountered in double_scalars

Using python bindings: cython

  • Write python code, compile it as C -- from the commandline with cython

In [14]:
%%file pi.pyx

def pied( int num ) :
    return num * 3.14159265359

def vec_pied( int r ) :
    retList = []
    cdef unsigned int i
    for i in range(r):
        retList.append( 3.14159 * i )
    return retList


Writing pi.pyx

In [18]:
%%file setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy as np
    
extensions=[
  Extension("pi", ["pi.pyx"], include_dirs=[np.get_include()]),
]   

setup(
  ext_modules=cythonize(extensions)
)


Overwriting setup.py

In [19]:
!python2.7 setup.py build_ext --inplace


Compiling pi.pyx because it changed.
Cythonizing pi.pyx
running build_ext
building 'pi' extension
creating build
creating build/temp.macosx-10.8-x86_64-2.7
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -pipe -Os -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/include -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c pi.c -o build/temp.macosx-10.8-x86_64-2.7/pi.o
/usr/bin/clang -bundle -undefined dynamic_lookup -L/opt/local/lib -Wl,-headerpad_max_install_names -L/opt/local/lib/db48 build/temp.macosx-10.8-x86_64-2.7/pi.o -o /Users/mmckerns/notebooks/mystic_pathos_tutorials/parallel/pi.so

In [22]:
import pi
print pi.pied(1)
print pi.vec_pied(4)


3.14159265359
[0.0, 3.14159, 6.28318, 9.424769999999999]

  • Write python code, compile it as C -- interactively in IPython

In [24]:
%load_ext Cython

In [29]:
%%cython
def cy_bubblesort(items):
    length = len(items)
    swapped = 1
    for i in range(0, length):
        if swapped:
            swapped = 0
            for ele in range(0, length-i-1):
                if items[ele] > items[ele + 1]:
                    temp = items[ele + 1]
                    items[ele + 1] = items[ele]
                    items[ele] = temp
                    swapped = 1
    return items

In [30]:
import time
for N in (100,100,1000,1000,10000):

    randoms = np.random.randint(0,1000,(N)).tolist()

    print "For N=%s" % N
    start = time.time()
    x = cy_bubblesort(randoms)
    print "%s: Cython" % (time.time() - start)
    assert np.all(sorted(randoms) == x)

    print ''


For N=100
0.000560998916626: Cython

For N=100
0.000540971755981: Cython

For N=1000
0.0564270019531: Cython

For N=1000
0.0560281276703: Cython

For N=10000
5.55099987984: Cython

  • Python code with type annotations

In [32]:
def py_fib(n):
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a

%timeit py_fib(10000)


100 loops, best of 3: 3.29 ms per loop

In [33]:
%%cython
def cy_fib(n):
    a, b = 0, 1
    for i in range(n):
        b, a = a + b, b
    return a

In [34]:
%timeit cy_fib(10000)


100 loops, best of 3: 2.03 ms per loop

In [40]:
%%file fib.pyx

def cy_fib(int n):
    cdef int a = 0
    cdef int b = 1
    cdef int tmp
    for i in range(n):
        tmp = b
        b = a + b
        a = tmp
    return a


Overwriting fib.pyx

In [41]:
%%file setup2.py

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name = "cython fib",
    ext_modules = cythonize('fib.pyx'),  # accepts a glob pattern
)


Overwriting setup2.py

In [42]:
!python2.7 setup2.py build_ext --inplace


Compiling fib.pyx because it changed.
Cythonizing fib.pyx
running build_ext
building 'fib' extension
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -pipe -Os -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c fib.c -o build/temp.macosx-10.8-x86_64-2.7/fib.o
/usr/bin/clang -bundle -undefined dynamic_lookup -L/opt/local/lib -Wl,-headerpad_max_install_names -L/opt/local/lib/db48 build/temp.macosx-10.8-x86_64-2.7/fib.o -o /Users/mmckerns/notebooks/mystic_pathos_tutorials/parallel/fib.so

In [43]:
import fib
%timeit fib.cy_fib(10000)


100000 loops, best of 3: 5.78 µs per loop

  • Aside: line profiling with cython -a

In [46]:
!cython-2.7 -a fib.pyx

See the generated annotation file: fib.html

  • Working with contiguous memory: cython with numpy

In [47]:
%%cython
import numpy as np
from numpy cimport float_t

def numcy_fib(int N):
    cdef int i
    cdef float_t[::1] x = np.zeros(N, dtype=float)
    for i in range(N):
        if i == 0:
            x[i] = 0
        elif i == 1:
            x[i] = 1
        else:
            x[i] = x[i - 1] + x[i - 2]
    return x

In [49]:
%timeit numcy_fib(10000)


The slowest run took 4.08 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 31.4 µs per loop

  • Not covered: working with native parallelism: prange and nogil
  • Not covered: leveraging external functions, structures, and classes

EXERCISE: See if cython or numba can help speed up our Monte Carlo example. Start with roll.py. What are potential issues with using either within a parallel map that may distribute code and/or objects to another process or computer?

Using direct calls to native libraries: ctypes

  • Accessing shared memory (we've seen this already)
  • Direct access to shared libraries
  • Writing C code and making direct calls to your C functions
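
A minimal sketch of the direct-access bullets, assuming a POSIX system (Linux or macOS), where passing None to CDLL yields the symbols already linked into the current process, including libc:

```python
import ctypes

# load the symbols already linked into this process (POSIX only);
# on Windows you would load a specific DLL by name instead
libc = ctypes.CDLL(None)

# declare argument/return types -- ctypes assumes C int for both otherwise
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.abs(-42))           # -> 42
print(libc.strlen(b"ctypes"))  # -> 6
```

The same pattern works for your own C code: compile it to a shared library (e.g. `cc -shared -fPIC mycode.c -o mycode.so`), load it with `ctypes.CDLL("./mycode.so")`, declare the argtypes/restype, and call the functions directly -- no binding layer required.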