Cython Speed-up notes

While I started coding on Cython I found a number of tips and tricks of what to (not) do. This is a collection of those things...

Basic tutorials and tips

For the basis, this is a list of documentation that I found useful:

Some basics tips that will speed up your code significantly:

Type your variables : all variables, functions inputs, local variables, global variables, etc.
Minimize functions called from other python libraries (avoid overheads)
Try defining all local function as inline
Learn the difference between cdef, def, and pcdef
If not necessary, release the GIL and make it explicit (ie. use nogil)
Use function decorators (e.g. @cython.boundscheck(False))

Some examples

Typing variables

An easy way to see if you are typing (correctly) all variables, is to see the annotated version of your source code. Let us compare the following three functions.



In [1]:

    
%load_ext Cython



In [2]:

    
%%cython
#--annotate
import time
import sys

cimport cython
cimport numpy as np
import numpy as np


def untyped_func(tab, tab_len, scalar):
    res = 0.
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

def somewhat_typed_func(np.ndarray[double, ndim=1, mode="c"] tab not None, int tab_len, double scalar):
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef double typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

We can already see that the third function typed_func, has much less yellow, which generally means less C code behind it, thus faster code. Let's benchmark them.



In [3]:

    
%%cython
import time
import sys

cimport cython
cimport numpy as np
import numpy as np


def untyped_func(tab, tab_len, scalar):
    res = 0.
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

def somewhat_typed_func(np.ndarray[double, ndim=1, mode="c"] tab not None, int tab_len, double scalar):
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef double typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline double inline_typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

cdef int L, i, loops = 1000
cdef double start, end, res
for L in [1000, 10000, 100000]:
    np_array = np.ones(L)
    print("For L = ", L)
    start = time.clock()
    res = untyped_func(np_array, L, 2.)
    end = time.clock()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the untyped_func")
    # ..................................................
    start = time.clock()
    res = somewhat_typed_func(np_array, L, 2.)
    end = time.clock()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the somewhat_typed_func")
    # ..................................................
    start = time.clock()
    res = typed_func(np_array, L, 2.)
    end = time.clock()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the typed_func")
    # ..................................................
    start = time.clock()
    res = inline_typed_func(np_array, L, 2.)
    end = time.clock()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the inline_typed_func")









    



For L =  1000
0.798000 μs, using the untyped_func
0.027000 μs, using the somewhat_typed_func
0.017000 μs, using the typed_func
0.008000 μs, using the inline_typed_func
For L =  10000
8.871000 μs, using the untyped_func
0.027000 μs, using the somewhat_typed_func
0.014000 μs, using the typed_func
0.014000 μs, using the inline_typed_func
For L =  100000
81.221000 μs, using the untyped_func
0.148000 μs, using the somewhat_typed_func
0.010000 μs, using the typed_func
0.015000 μs, using the inline_typed_func

Working with numpy arrays in I/O

The first challenge I was confronted to, was handling Numpy arrays. The cython part of our code takes as inputs numpy arrays, and should give as output numpy arrays as well. However, reading and writing from numpy arrays can be slow in cython. Some tutorials mentioned using memory views, other mention that C array give a clear improvement, and overall several different solutions are mentioned. A StackOverflow answer makes a good benchmark between these solutions for a code that only need to create arrays (not taking any inputs) and giving back a numpy array: https://stackoverflow.com/questions/18462785/what-is-the-recommended-way-of-allocating-memory-for-a-typed-memory-view

However, here we need to focus on the copying and accessing the data from the numpy array.



In [4]:

    
%%cython
import time
import sys

from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free
import numpy as np
cimport numpy as np

cdef int loops

def timefunc(name):
    def timedecorator(f):
        cdef int L, i

        print("Running", name)
        for L in [1, 10, 100, 1000, 10000, 100000, 1000000]:
            np_array = np.ones(L)
            start = time.clock()
            res_array = f(L, np_array)
            end = time.clock()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()

        print("μs")
    return timedecorator

print()
print("-------- TESTS -------")
loops = 3000


@timefunc("numpy buffers")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    for i in range(loops):
        for j in range(L):
            d = np_array[j]
            np_array[j] = d*0.
    # Prevents dead code elimination
    str(np_array[0])
    return np_array
    
@timefunc("cpython.array buffer")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef array[double] arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)
        for j in range(L):
            # initialization
            arr[j] = np_array[j]
            # access
            d = arr[j]
            arr[j] = d*2.
    # Prevents dead code elimination
    return np.asarray(arr)


@timefunc("cpython.array memoryview")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef double[::1] arr

    for i in range(loops):
        arr = np_array
        for j in range(L):
            # usage
            d = arr[j]
            arr[j] = d*0.
    # Prevents dead code elimination
    return np_array
    

@timefunc("cpython.array raw C type with trick")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i
    cdef array arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)
        for j in range(L):
            # initialization
            arr.data.as_doubles[j] = np_array[j]
            # usage
            d = arr.data.as_doubles[j]
            arr.data.as_doubles[j] = d*2.
    # Prevents dead code elimination
    return np.asarray(arr)


@timefunc("C pointers")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i
    cdef double* arrptr

    for i in range(loops):
        arrptr = <double*> np_array.data
        for j in range(L):
            d = arrptr[j]
            arrptr[j] = d*0.

    return np_array

@timefunc("malloc memoryview")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i
    cdef double* arrptr
    cdef double[::1] arr

    for i in range(loops):
        arrptr = <double*> np_array.data
        arr = <double[:L]>arrptr
        for j in range(L):
            d = arrptr[j]
            arrptr[j] = d*0.

    return np_array

@timefunc("argument memoryview")
def _(int L, double[::1] np_array not None):
    cdef int i, j
    cdef double d

    for i in range(loops):
        for j in range(L):
            # usage
            d = np_array[j]
            np_array[j] = d*0.
    # Prevents dead code elimination
    return np_array









    



-------- TESTS -------
Running numpy buffers
0.010667 0.027000 0.108667 0.926667 11.027667 95.586333 1118.653333 μs
Running cpython.array buffer
0.074667 0.091667 0.351667 3.606667 25.803333 230.891000 2605.097000 μs
Running cpython.array memoryview
0.325333 0.398667 1.079000 1.887667 7.430000 79.003667 1005.866000 μs
Running cpython.array raw C type with trick
0.035000 0.041667 0.119667 0.787667 8.991000 93.519667 1401.235667 μs
Running C pointers
0.005000 0.006667 0.057333 0.563333 5.081000 35.753000 887.650667 μs
Running malloc memoryview
0.482333 1.043333 0.778000 1.127333 6.685000 53.277667 798.593333 μs
Running argument memoryview
0.008000 0.019000 0.156000 1.217000 10.740000 96.713667 1305.644333 μs

In conclusion: For all cases, you will gain a 2x factor speed up by using a C pointer. Since the memory is already allocated for the numpy array, it is not necessary to use malloc. We will adopt the following declaration:

cdef double* arrptr
arrptr = <double*> np_array.data

Note that for all functions we declared the numpy array in the function header.

Parallelization and arrays

After optimizing the code, the obvious step to speed-up the code is to parallelize. From the documentation it seems that this should be quite easy, but I discovered a few things to keep in mind. Let's start with a simple loop example



In [5]:

    
import Cython.Compiler.Options as CO
CO.extra_compile_args = ["-O3", "-ffast-math", "-march=native", "-fopenmp" ]
CO.extra_link_args = ['-fopenmp']



In [6]:

    
%%cython --compile=-fopenmp --link-args=-fopenmp

cimport cython

from cython.parallel cimport parallel, prange
from cython.parallel cimport threadid
from libc.stdio cimport stdout, fprintf
import time
import sys

from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free


@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline void seq_func(int L, double* arrptr):
    cdef int j

    for j in range(L):
        arrptr[j] = 2.0*arrptr[j]
    return

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline void bad_par_func(int L, double* arrptr):
    cdef Py_ssize_t j
    cdef double d

    with nogil, parallel():
        arrptr[0] = 0
        for j in prange(1, L-1):
            # or any other operation that doesn't allow to the code parallelized
            arrptr[j+1] = 2.0*arrptr[j]-arrptr[j-1]
        arrptr[L-1] = 0
    return

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline void good_par_func(int L, double* arrptr) nogil:
    cdef Py_ssize_t j

    for j in prange(L, nogil=True):
        arrptr[j] = 2.0*arrptr[j]
    return


cdef int L, i, loops = 1000, ilps
cdef double start, end, res
cdef double t0=0.0, t1=0.0, t2=0.0
cdef double* tab
for L in [1000, 10000, 100000]:
    tab = <double *> malloc(sizeof(double) * L)
    print("For L = ", L)
    for ilps in range(loops):
        # ..................................................
        start = time.clock()
        seq_func(L, tab)
        end = time.clock()
        t0 += (end - start) / loops
        # ..................................................
        start = time.clock()
        bad_par_func(L, tab)
        end = time.clock()
        t1 += (end - start) / loops
        # ..................................................
        start = time.clock()
        good_par_func(L, tab)
        end = time.clock()
        t2 += (end - start) / loops

    print(format(t0 * 1e6, "2f"), "μs, using the sequential loop")
    print(format(t1 * 1e6, "2f"), "μs, using the parallel 1 loop")
    print(format(t2 * 1e6, "2f"), "μs, using the parallel 2 loop")
    
    free(tab)









    



For L =  1000
10.211000 μs, using the sequential loop
58.846000 μs, using the parallel 1 loop
25.343000 μs, using the parallel 2 loop
For L =  10000
53.134000 μs, using the sequential loop
180.918000 μs, using the parallel 1 loop
49.014000 μs, using the parallel 2 loop
For L =  100000
827.005000 μs, using the sequential loop
601.007000 μs, using the parallel 1 loop
132.242000 μs, using the parallel 2 loop

Other errors to avoid is to add variables incrementation on the parallel part, e.g. i += 1



In [18]:

    
%%cython --compile-args=-openmp --link-args=-openmp -a

cimport cython

from cython.parallel import parallel, prange
from libc.stdlib cimport abort, malloc, free
import time, sys
import numpy as np
cimport numpy as np


cdef int loops

def timefunc(name):
    def timedecorator(f):
        cdef int L, i
        cdef np.ndarray np_array
        cdef np.ndarray[double] global_buf

        print("Running", name)
        for L in [10000, 1000000]:
            np_array = np.ones(L)
            global_buf = np_array
            start = time.clock()
            f(global_buf, L, <int>(L/2))
            end = time.clock()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()

        print("μs")
    return timedecorator

print()
print("-------- TESTS -------")
loops = 1000

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@timefunc("Static allocation for n=2")
def _(double[::1] global_buf not None, int n, int n2):
    cdef double[2] local_buf
    cdef int idx, i

    with nogil, parallel():
        for i in range(loops):
            for idx in prange(n2, schedule='guided'):
                local_buf[0] = global_buf[idx*2]
                local_buf[1] = global_buf[idx*2+1]
                func(local_buf)
    return

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@timefunc("Dynamic allocation for n=2")
def _(double[::1] global_buf not None, int n, int n2):
    cdef double* local_buf
    cdef int idx, i

    with nogil, parallel():
        for i in range(loops):
            local_buf = <double *> malloc(sizeof(double) * 2)
            for idx in prange(n2, schedule='guided'):
                local_buf[0] = global_buf[idx*2]
                local_buf[1] = global_buf[idx*2+1]
                func(local_buf)
            free(local_buf)

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@timefunc("Static allocation for n=4")
def _(double[::1] global_buf not None, int n, int n2):
    cdef double[4] local_buf
    cdef int idx, i, n4 = <int> (n2/2)

    with nogil, parallel():
        for i in range(loops):
            for idx in prange(n4, schedule='guided'):
                local_buf[0] = global_buf[idx*4]
                local_buf[1] = global_buf[idx*4+1]
                local_buf[2] = global_buf[idx*4+2]
                local_buf[3] = global_buf[idx*4+3]
                func(local_buf)
    return

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@timefunc("Dynamic allocation for n=4")
def _(double[::1] global_buf not None, int n, int n2):
    cdef double* local_buf
    cdef int idx, i, n4 = <int> (n2/2)

    with nogil, parallel():
        for i in range(loops):
            local_buf = <double *> malloc(sizeof(double) * 4)
            for idx in prange(n4, schedule='guided'):
                local_buf[0] = global_buf[idx*4]
                local_buf[1] = global_buf[idx*4+1]
                local_buf[2] = global_buf[idx*4+2]
                local_buf[3] = global_buf[idx*4+3]
                func(local_buf)
            free(local_buf)
        
# ==============================================================================
# test function
cdef void func(double* local_buf) nogil:
    cdef int i=0
    return









    



-------- TESTS -------
Running Static allocation for n=2
24.429000 1510.157000 μs
Running Dynamic allocation for n=2
25.680000 257.589000 μs
Running Static allocation for n=4
29.224000 1419.650000 μs
Running Dynamic allocation for n=4
49.522000 162.180000 μs

It might seem counter-intuitive but using a dynamic malloc (and free-ing accordingly) instead of declaring an array statically, will improve the performance of your code.