We're going to start with a (compressed) Cython tutorial

http://docs.cython.org/src/tutorial/cython_tutorial.html

Cython is a superset of python which adds typing and additional features to the python language to help the cython interpreter convert the cython code to C/C++. This can allow users to convert slow Python code to fast C/C++ code with minimal changes. It can also be used to wrap existing C/C++/Fortran code.

lets look at a simple prime number finder in python and cython:

python simplesetup.py build_ext --inplace

In [1]:
import simplecy, simplepy
simplepy.primes(30) == simplecy.primes(30)


Out[1]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True], dtype=bool)

In [2]:
%timeit simplecy.primes(30)


100000 loops, best of 3: 3.32 µs per loop

In [3]:
%timeit simplepy.primes(30)


1000 loops, best of 3: 907 µs per loop

A simple example with a 300x speedup!

Let's take a quick look at the C source. The interesting part starts around line 600 (on Cython 21.1) reproduced in part below:

static PyObject *__pyx_pf_8simplecy_primes(CYTHON_UNUSED PyObject *__pyx_self, int __pyx_v_kmax) {
  int __pyx_v_n;
  int __pyx_v_k;
  int __pyx_v_i;
  int __pyx_v_p[1000];
  PyObject *__pyx_v_result = NULL;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  PyObject *__pyx_t_1 = NULL;
  int __pyx_t_2;
  int __pyx_t_3;
  int __pyx_t_4;
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;
  __Pyx_RefNannySetupContext("primes", 0);

  /* "simplecy.pyx":7
 *     cdef int n, k, i
 *     cdef int p[1000]
 *     result = []             # <<<<<<<<<<<<<<
 *     if kmax > 1000:
 *         kmax = 1000
 */
  __pyx_t_1 = PyList_New(0); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 7; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_v_result = ((PyObject*)__pyx_t_1);
  __pyx_t_1 = 0;

  /* "simplecy.pyx":8
 *     cdef int p[1000]
 *     result = []
 *     if kmax > 1000:             # <<<<<<<<<<<<<<
 *         kmax = 1000
 *     k = 0
 */
  __pyx_t_2 = ((__pyx_v_kmax > 1000) != 0);
  if (__pyx_t_2) {

    /* "simplecy.pyx":9
 *     result = []
 *     if kmax > 1000:
 *         kmax = 1000             # <<<<<<<<<<<<<<
 *     k = 0
 *     n = 2
 */
    __pyx_v_kmax = 1000;
    goto __pyx_L3;
  }
  __pyx_L3:;

  /* "simplecy.pyx":10
 *     if kmax > 1000:
 *         kmax = 1000
 *     k = 0             # <<<<<<<<<<<<<<
 *     n = 2
 *     while k < kmax:
 */
  __pyx_v_k = 0;

  /* "simplecy.pyx":11
 *         kmax = 1000
 *     k = 0
 *     n = 2             # <<<<<<<<<<<<<<
 *     while k < kmax:
 *         i = 0
 */
  __pyx_v_n = 2;

  /* "simplecy.pyx":12
 *     k = 0
 *     n = 2
 *     while k < kmax:             # <<<<<<<<<<<<<<
 *         i = 0
 *         while i < k and n % p[i] != 0:
 */
  while (1) {
    __pyx_t_2 = ((__pyx_v_k < __pyx_v_kmax) != 0);
    if (!__pyx_t_2) break;

    /* "simplecy.pyx":13
 *     n = 2
 *     while k < kmax:
 *         i = 0             # <<<<<<<<<<<<<<
 *         while i < k and n % p[i] != 0:
 *             i = i + 1
 */
    __pyx_v_i = 0;

    /* "simplecy.pyx":14
 *     while k < kmax:
 *         i = 0
 *         while i < k and n % p[i] != 0:             # <<<<<<<<<<<<<<
 *             i = i + 1
 *         if i == k:
 */


  /* "simplecy.pyx":21
 *             result.append(n)
 *         n = n + 1
 *     return result             # <<<<<<<<<<<<<<
 */
  __Pyx_XDECREF(__pyx_r);
  __Pyx_INCREF(__pyx_v_result);
  __pyx_r = __pyx_v_result;
  goto __pyx_L0;

Writing good (ie. fast) Cython code can be non-trivial. Sometimes the general rule of just add static typing everywhere can make things SLOWER!

We'll leave the details of fast cython code for another time (Ross mentioned briefly the basic tools to let you know how much C vs. Python code is being called)

Wrapping Existing (or new) C++ and Fortran

Often the simplest strategy for writing fast code is to write the compute intensive functions in C++ or Fortran and then interface to them in Python. For some simple tasks where the inputs are simple and the number of calls are limited the python ctypes module is great (though it has many problems and haters). For more complex inputs or situations where making many calls to a function is needed Cython offers a great alternative to writing your own C/C++ interface.

We will look at wrapping a C++ and a Fortran function (from PyNE http://pyne.io)

These two functions implement the same functionality - a conversion from string to float. We will wrap the C++ directly with Cython and wrap the Fortran version by first implementing a C++ API and then wrapping that (see f2py for a more automated solution).


In [1]:
import endftod

In [2]:
%timeit endftod.endftod_cpp(' 2.040000+2')


1000000 loops, best of 3: 545 ns per loop

In [3]:
%timeit endftod.endftod_f(' 2.040000+2')


1000000 loops, best of 3: 1.59 µs per loop

In [4]:
endftod.endftod_f(' 2.040000+2') == endftod.endftod_f('2.040000E+2')


Out[4]:
True

In [5]:
endftod.endftod_cpp(' 2.040000+2') == endftod.endftod_cpp('2.040000E+2')


Out[5]:
False

Now that we've gone over cython in a bit more detail lets look at Memoryviews - a better way of accessing numpy arrays in cython

from the docs :

Typed memoryviews allow efficient access to memory buffers, such as those underlying NumPy arrays, without incurring any Python overhead. Memoryviews are similar to the current NumPy array buffer support (np.ndarray[np.float64_t, ndim=2]), but they have more features and cleaner syntax.

Memoryviews are more general than the old NumPy array buffer support, because they can handle a wider variety of sources of array data. For example, they can handle C arrays and the Cython array type (Cython arrays).

A memoryview can be used in any context (function parameters, module-level, cdef class attribute, etc) and can be obtained from nearly any object that exposes writable buffer through the PEP 3118 buffer interface

Now lets look at what this can do for cython (via some notebooks by Jake Vanderplas)

http://docs.cython.org/src/userguide/memoryviews.html

We can put all this together

from __future__ import print_function
from cython.parallel import prange
import cython
import numpy as np
cimport numpy as np
from libc.stdlib cimport free

cdef extern double trapfilter(int *, int, int, int, int, int, double, int, int) nogil

@cython.boundscheck(False)
@cython.wraparound(False)
def trapfilterlots(np.ndarray inputarr not None, int peaking,
               int gap, int M, int len_in, double N):
    """
    Filter a 2-d array of pulses. return heights and polarities. currently uses
    first 1200 points for baseline subtraction
    """
    cdef int [:] avg = np.average(inputarr[:, :1200], 1).astype(np.int32)
    cdef int [:, :] inarr = np.ascontiguousarray(inputarr,'i4')
    cdef double [:] final = np.zeros(inarr.shape[0])
    cdef int [:] pol = np.zeros(inarr.shape[0], dtype=np.int32)
    cdef Py_ssize_t i
    cdef int m = inarr.shape[1]
    for i in prange(inarr.shape[0], nogil=True):
        pol[i] = 1
        if  inarr[i][2200] - avg[i] < 0:
            pol[i] = -1
        final[i] = trapfilter(&inarr[i][0], peaking,
                        gap, M, m, len_in, N, avg[i], pol[i])
    return np.asarray(final), np.asarray(pol)

In [ ]: