Parallel/efficient coding in Python

Python (more presicely: the CPython interpreter) uses a Global interpreter lock (GIL) which prohibits simultaneous execution of Python code.

Overview

In this tutorial, we'll cover these common problems/situations/solutions:

  • If your code is I/O bound, threads is a simple solution
  • If your code is CPU bound, multiprocessing is the way to go
  • If your code can be vectorised: use numpy
  • If your code is still too slow: write your own C extensions with cython

This list is by no means complete and is just a starting point for getting you into parallel/efficient coding with Python. There are plenty of other very nice solutions for all kinds of problems out there, but this is beyond the scope of this tutorial.

A very short Python primer

If you are not familiar with Python, I'd suggest learning it the (not so) hard way. Or continue reading, it describes very briefly the needed basics.

This is how we define functions:


In [1]:
def function(argument):
    """
    This function just prints the given argument.
    
    :param argument: what to print
    :returns:        also returns the argument
    
    """
    print("Calling function with argument %s" % argument)
    return argument

In [2]:
function(42)


Calling function with argument 42
Out[2]:
42

This is how we define classes:


In [3]:
class Class(object):
    """A simple meaningless Class"""
    def __init__(self, attribute):
        """
        This method get's called during object initialisation.
        
        :param attribute: attribute of the class to save
        
        """
        self.attribute = attribute
    
    def tell(self):
        """Tell the world about us."""
        print("I'm a Class with a very nice attribute: %s" % self.attribute)

In [4]:
A = Class('smart')
A.tell()


I'm a Class with a very nice attribute: smart

Since all (our) functions and classes are well documented, we can always ask how to use them:


In [5]:
function?

In [6]:
Class?

Python also has lists which can be accessed by index (indices always start at 0):


In [7]:
x = [1, 2, 4.5, 'bla']

In [8]:
x[1]


Out[8]:
2

In [9]:
x[3]


Out[9]:
'bla'

In [10]:
x.index('bla')


Out[10]:
3

One of the more fancy stuff we can do in Python is list comprehensions.

Here's a simple example of how to calculate the squares of some numbers:


In [11]:
[x ** 2 for x in range(10)]


Out[11]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Ok, that's it for now, let's speed things up :)

Threads

From the Python threading documentation (https://docs.python.org/2/library/threading.html):

In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

All my examples are CPU bound, so let's move on to multiprocessing -- it also has a simpler interface ;)

Multiprocessing

The multiprocessing module can start multiple Python interpreters and thus can run code in parallel.

Inside each of these interpreters, the GIL still applies, but they do not interfere with each other.


In [12]:
import multiprocessing as mp

A simple CPU bound example:

We consider a function which sums all primes below a given number. Source: http://www.parallelpython.com/content/view/17/31/#SUM_PRIMES


In [13]:
import math

def isprime(n):
    """Returns True if n is prime and False otherwise"""
    if not isinstance(n, int):
        raise TypeError("argument passed to is_prime is not of 'int' type")
    if n < 2:
        return False
    if n == 2:
        return True
    max = int(math.ceil(math.sqrt(n)))
    i = 2
    while i <= max:
        if n % i == 0:
            return False
        i += 1
    return True

def sum_primes(n):
    """Calculates sum of all primes below given integer n"""
    return sum([x for x in xrange(2,n) if isprime(x)])

If we simply use the list comprehension without the sum, we get a list of primes smaller than the given one:


In [14]:
[x for x in xrange(10) if isprime(x)]


Out[14]:
[2, 3, 5, 7]

In [15]:
sum_primes(10)


Out[15]:
17

In [16]:
%timeit sum_primes(10)


100000 loops, best of 3: 6.23 µs per loop

Not bad, why should we speed up something like this?

Let's see what happens if we ask for the sum of primes below a larger number:


In [17]:
%timeit sum_primes(10000)


10 loops, best of 3: 16.5 ms per loop

What if we do this for a bunch of numbers, e.g.:


In [18]:
%timeit [sum_primes(n) for n in xrange(1000)]


1 loops, best of 3: 537 ms per loop

Ok, this is definitely too slow, but we should be able to run this in parallel easily, since we are asking for the same thing (summing all prime numbers below the given number) for a lot of of numbers.

We do this by calling this very same function a thousand times (inside the list comprehension).

Python has a map function, which maps a list of arguments to a single function.


In [19]:
map(sum_primes, xrange(10))


Out[19]:
[0, 0, 0, 2, 5, 5, 10, 10, 17, 17]

This is basically the same as what we had before with our list comprehension.


In [20]:
%timeit map(sum_primes, xrange(1000))


1 loops, best of 3: 504 ms per loop

multiprocessing offers the same map function in it's Pool class. Let's create a Pool with 2 simultaneous threads (i.e. processes).


In [21]:
pool = mp.Pool(2)

Now we can use the pool's map function to do the same as before but in parallel.


In [22]:
pool.map(sum_primes, xrange(10))


Out[22]:
[0, 0, 0, 2, 5, 5, 10, 10, 17, 17]

In [23]:
%timeit pool.map(sum_primes, xrange(1000))


1 loops, best of 3: 359 ms per loop

This is a speed-up of almost 2x, which is exactly what we expected (using 2 processes minus some overhead).

Vectorisation

We can solve a lot of stuff in almost no time if we avoid loops and vectorise the expressions instead.

Numpy does an excellent job here and automatically does some things in parallel (e.g. matrix multiplication).

It uses highly optimised linear algebra packages to do things really fast.


In [24]:
import numpy as np

Some Numpy basics:

We can define arrays simple as that:


In [25]:
x = np.array([1, 2, 5, 15])

In [26]:
x


Out[26]:
array([ 1,  2,  5, 15])

As long as it can be casted to an array, we can use almost everything as input for an array:


In [27]:
np.array(map(sum_primes, xrange(10)))


Out[27]:
array([ 0,  0,  0,  2,  5,  5, 10, 10, 17, 17])

Or we define arrays with some special functions:


In [28]:
np.zeros(10)


Out[28]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [29]:
np.arange(10.)


Out[29]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

Numpy supports indexing and slicing:


In [30]:
x = np.arange(10)

Get a single item of the array:


In [31]:
x[3]


Out[31]:
3

Get a slice of the array:


In [32]:
x[1:5]


Out[32]:
array([1, 2, 3, 4])

Get everything starting from index 4 to the end:


In [33]:
x[4:]


Out[33]:
array([4, 5, 6, 7, 8, 9])

Negative indices are counted backwards from the end. Get everything before the last element:


In [34]:
x[:-1]


Out[34]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8])

Let's define another problem: comb filters.

In signal processing, a comb filter adds a delayed version of a signal to itself, causing constructive and destructive interference [Wikipedia].

These filters can be either feed forward or backward, depending on wheter the signal itself or the output of the filter is delayed and added to the signal.

Feed forward:

$y[n] = x[n] + \alpha * x[n - \tau]$

Feed backward:

$y[n] = x[n] + \alpha * y[n - \tau]$

Again, we start with a Python only solution (it has some Numpy stuff in there already, but that's not the point. It uses a loop for adding the delayed portion of the signal to the output).


In [35]:
def feed_forward_comb_filter_loop(signal, tau, alpha):
    """
    Filter the signal with a feed forward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * x[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # init the output array 
    y = np.zeros(len(signal))
    # iterate over the signal
    for i in range(len(signal)):
        # add the delayed version of the signal to itself
        if i < tau:
            y[i] = signal[i]
        else:
            y[i] = signal[i] + alpha * signal[i - tau]
    # return the filtered signal
    return y

In [36]:
feed_forward_comb_filter_loop(np.arange(10.), 4, 0.5)


Out[36]:
array([  0. ,   1. ,   2. ,   3. ,   4. ,   5.5,   7. ,   8.5,  10. ,  11.5])

In [37]:
%timeit feed_forward_comb_filter_loop(np.arange(100000.), 4, 0.5)


10 loops, best of 3: 34.9 ms per loop

Let's vectorise this by removig the loop (the for i in range(len(signal)) stuff):


In [38]:
def feed_forward_comb_filter(signal, tau, alpha):
    """
    Filter the signal with a feed forward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * x[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # init the output array as a copy of the input signal, since
    # the output is the input signal plus a delayed version of it
    y = signal.copy()
    # add the delayed signal, starting at index tau
    y[tau:] += alpha * signal[:-tau]
    # return the filtered signal
    return y

In [39]:
%timeit feed_forward_comb_filter(np.arange(100000.), 4, 0.5)


1000 loops, best of 3: 643 µs per loop

This is a nice ~67x speed-up (523 µs vs. 35.1 ms). Continue with the feed backward example...

The feed backward variant comb filter function containing a loop:


In [40]:
def feed_backward_comb_filter_loop(signal, tau, alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # init the output array as a copy of the input signal
    y = signal.copy()
    # loop over the signal, starting at tau
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y

In [41]:
%timeit feed_backward_comb_filter_loop(np.arange(100000.), 4, 0.5)


10 loops, best of 3: 38.3 ms per loop

The backward variant has basically the same runtime as the forward (loop-)version, but unfortunately, we cannot speed this up further with a vectorised expression, since the output depends on the output of a previous step.

And this is where Cython comes in.

Cython

Cython can be used to write C-extensions in Python.

The Cython compiler converts everything into C code, compiles it to a shared object which can be loaded/imported like any other python module.

First we need to import Cython via cimport cython in normal Cython files or load the Cython extension (in IPython):


In [42]:
%load_ext Cython

To be able to use Cython code within IPython, we need to add the magic %%cython handler as a first line into a cell. Then we can start writing normal Python code.


In [43]:
%%cython
# magic cython handler for IPython (must be first line of a cell)

def sum_two_numbers(a, b):
    return a + b

Cython then compiles and loads everything transparently.


In [44]:
sum_two_numbers(10, 5)


Out[44]:
15

Cython gains most of its tremendous speed gains from static typing.

Let's try with the isprime example from before again:


In [45]:
%%cython

# import some C functions to avoid calls to the Python interpreter
from libc.math cimport sqrt, ceil

# define type n as integer
def isprime_cython(int n):
    """Returns True if n is prime and False otherwise"""
    # we can skip the instance check since we have strong typing now
    if n < 2:
        return False
    if n == 2:
        return True
    # define type max as integer
    cdef int max
    # use the C ceil & sqrt functions instead of Python's own
    max = int(ceil(sqrt(n)))
    # define type n as integer
    cdef int i = 2
    while i <= max:
        if n % i == 0:
            return False
        i += 1
    return True

def sum_primes_cython(n):
    """Calculates sum of all primes below given integer n"""
    return sum([x for x in xrange(2,n) if isprime_cython(x)])

Let's see if (and how much) it helps:


In [46]:
%timeit sum_primes(10000)
%timeit sum_primes_cython(10000)


10 loops, best of 3: 16.5 ms per loop
1000 loops, best of 3: 686 µs per loop

speed-up ~24x (686 µs vs. 16.5 ms)


In [47]:
%timeit -n 1 [sum_primes(n) for n in xrange(10000)]
%timeit -n 1 [sum_primes_cython(n) for n in xrange(10000)]


1 loops, best of 3: 1min 16s per loop
1 loops, best of 3: 3.1 s per loop

speed-up ~25x (3.1 s vs. 76 s)

What if we use this faster version of the function with multiple processes?


In [48]:
%timeit mp.Pool(2).map(sum_primes_cython, (n for n in xrange(10000)))
%timeit mp.Pool(4).map(sum_primes_cython, (n for n in xrange(10000)))


1 loops, best of 3: 1.92 s per loop
1 loops, best of 3: 1.56 s per loop

Starting more threads than physical CPU cores gives some more performance, but does not scale as good because of hyper-threading. Total speed-up compared to the single-process pure Python variant is ~49x (1.56 s vs. 76 s)

Still, the isprime_cython is a Python function, which adds some overhead. Since we call this function only from sum_primes_cython, we can make it a C-only function by using the cdef statement instead of def and also define the return type.


In [49]:
%%cython

from libc.math cimport sqrt, ceil

# make this a C function
cdef int isprime_cython_nogil(int n) nogil:
    """Returns True if n is prime and False otherwise"""
    if n < 2:
        return 0
    if n == 2:
        return 1
    cdef int max
    max = int(ceil(sqrt(n)))
    cdef int i = 2
    while i <= max:
        if n % i == 0:
            return 0
        i += 1
    return 1

def sum_primes_cython_nogil(n):
    """Calculates sum of all primes below given integer n"""
    return sum([x for x in xrange(2,n) if isprime_cython_nogil(x)])

In [50]:
%timeit [sum_primes_cython_nogil(n) for n in xrange(10000)]
%timeit mp.Pool(4).map(sum_primes_cython_nogil, (n for n in xrange(10000)))


1 loops, best of 3: 2.59 s per loop
1 loops, best of 3: 1.18 s per loop

Again, a bit faster; total speed-up compared to the single-process pure Python variant is ~64x (1.18 s vs. 76 s)

These are the Cython basics. Let's apply them to the other example, the backward comb filter which we were not able to vectorise:


In [51]:
%%cython

import numpy as np
# we also want to load the C-bindings of numpy with cimport
cimport numpy as np

# statically type the obvious variables (tau, alpha, n)
def feed_backward_comb_filter(signal, unsigned int tau, float alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y

In [52]:
%timeit feed_backward_comb_filter(np.arange(100000.), 4, 0.5)


10 loops, best of 3: 16.3 ms per loop

A bit better (roughly half the time), but still far away from the feed forward variant.

Let's see, what kills performance and fix it. Cython has this nice -a switch to highlight calls to Python in yellow.


In [53]:
%%cython -a

import numpy as np
cimport cython
cimport numpy as np

def feed_backward_comb_filter(signal, unsigned int tau, float alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] = signal[n] + alpha * y[n - tau]
    # return the filtered signal
    return y


Out[53]:

Generated by Cython 0.22

 01: 
+02: import numpy as np
  __pyx_t_1 = __Pyx_Import(__pyx_n_s_numpy, 0, -1); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_1) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
/* … */
  __pyx_t_1 = PyDict_New(); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_1) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 03: cimport cython
 04: cimport numpy as np
 05: 
+06: def feed_backward_comb_filter(signal, unsigned int tau, float alpha):
/* Python wrapper */
static PyObject *__pyx_pw_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_1feed_backward_comb_filter(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static char __pyx_doc_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_feed_backward_comb_filter[] = "\n    Filter the signal with a feed backward comb filter.\n\n    :param signal: signal\n    :param tau:    delay length\n    :param alpha:  scaling factor\n    :return:       comb filtered signal\n\n    ";
static PyMethodDef __pyx_mdef_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_1feed_backward_comb_filter = {"feed_backward_comb_filter", (PyCFunction)__pyx_pw_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_1feed_backward_comb_filter, METH_VARARGS|METH_KEYWORDS, __pyx_doc_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_feed_backward_comb_filter};
static PyObject *__pyx_pw_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_1feed_backward_comb_filter(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  PyObject *__pyx_v_signal = 0;
  unsigned int __pyx_v_tau;
  float __pyx_v_alpha;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("feed_backward_comb_filter (wrapper)", 0);
  {
    static PyObject **__pyx_pyargnames[] = {&__pyx_n_s_signal,&__pyx_n_s_tau,&__pyx_n_s_alpha,0};
    PyObject* values[3] = {0,0,0};
    if (unlikely(__pyx_kwds)) {
      Py_ssize_t kw_args;
      const Py_ssize_t pos_args = PyTuple_GET_SIZE(__pyx_args);
      switch (pos_args) {
        case  3: values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
        case  2: values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
        case  1: values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = PyDict_Size(__pyx_kwds);
      switch (pos_args) {
        case  0:
        if (likely((values[0] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_signal)) != 0)) kw_args--;
        else goto __pyx_L5_argtuple_error;
        case  1:
        if (likely((values[1] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_tau)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("feed_backward_comb_filter", 1, 3, 3, 1); {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
        }
        case  2:
        if (likely((values[2] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_alpha)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("feed_backward_comb_filter", 1, 3, 3, 2); {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
        }
      }
      if (unlikely(kw_args > 0)) {
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_pyargnames, 0, values, pos_args, "feed_backward_comb_filter") < 0)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
      }
    } else if (PyTuple_GET_SIZE(__pyx_args) != 3) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
      values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
      values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
    }
    __pyx_v_signal = values[0];
    __pyx_v_tau = __Pyx_PyInt_As_unsigned_int(values[1]); if (unlikely((__pyx_v_tau == (unsigned int)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
    __pyx_v_alpha = __pyx_PyFloat_AsFloat(values[2]); if (unlikely((__pyx_v_alpha == (float)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("feed_backward_comb_filter", 1, 3, 3, PyTuple_GET_SIZE(__pyx_args)); {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
  __pyx_L3_error:;
  __Pyx_AddTraceback("_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e.feed_backward_comb_filter", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_feed_backward_comb_filter(__pyx_self, __pyx_v_signal, __pyx_v_tau, __pyx_v_alpha);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_feed_backward_comb_filter(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_signal, unsigned int __pyx_v_tau, float __pyx_v_alpha) {
  PyObject *__pyx_v_y = NULL;
  unsigned int __pyx_v_n;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("feed_backward_comb_filter", 0);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __Pyx_AddTraceback("_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e.feed_backward_comb_filter", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XDECREF(__pyx_v_y);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__8 = PyTuple_Pack(5, __pyx_n_s_signal, __pyx_n_s_tau, __pyx_n_s_alpha, __pyx_n_s_y, __pyx_n_s_n); if (unlikely(!__pyx_tuple__8)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_tuple__8);
  __Pyx_GIVEREF(__pyx_tuple__8);
/* … */
  __pyx_t_1 = PyCFunction_NewEx(&__pyx_mdef_46_cython_magic_87f0ac483e9d9eaa1e6f6615f53eac0e_1feed_backward_comb_filter, NULL, __pyx_n_s_cython_magic_87f0ac483e9d9eaa1e); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_feed_backward_comb_filter, __pyx_t_1) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 07:     """
 08:     Filter the signal with a feed backward comb filter.
 09: 
 10:     :param signal: signal
 11:     :param tau:    delay length
 12:     :param alpha:  scaling factor
 13:     :return:       comb filtered signal
 14: 
 15:     """
 16:     # y[n] = x[n] + α * y[n - τ]
+17:     if tau <= 0:
  __pyx_t_1 = ((__pyx_v_tau <= 0) != 0);
  if (__pyx_t_1) {
+18:         raise ValueError('tau must be greater than 0')
    __pyx_t_2 = __Pyx_PyObject_Call(__pyx_builtin_ValueError, __pyx_tuple_, NULL); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_Raise(__pyx_t_2, 0, 0, 0);
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
    {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  }
/* … */
  __pyx_tuple_ = PyTuple_Pack(1, __pyx_kp_s_tau_must_be_greater_than_0); if (unlikely(!__pyx_tuple_)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_tuple_);
  __Pyx_GIVEREF(__pyx_tuple_);
 19:     # type definitions
+20:     y = np.copy(signal)
  __pyx_t_3 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 20; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_4 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_copy); if (unlikely(!__pyx_t_4)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 20; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = NULL;
  if (CYTHON_COMPILING_IN_CPYTHON && unlikely(PyMethod_Check(__pyx_t_4))) {
    __pyx_t_3 = PyMethod_GET_SELF(__pyx_t_4);
    if (likely(__pyx_t_3)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4);
      __Pyx_INCREF(__pyx_t_3);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_4, function);
    }
  }
  if (!__pyx_t_3) {
    __pyx_t_2 = __Pyx_PyObject_CallOneArg(__pyx_t_4, __pyx_v_signal); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 20; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_2);
  } else {
    __pyx_t_5 = PyTuple_New(1+1); if (unlikely(!__pyx_t_5)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 20; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_5);
    PyTuple_SET_ITEM(__pyx_t_5, 0, __pyx_t_3); __Pyx_GIVEREF(__pyx_t_3); __pyx_t_3 = NULL;
    __Pyx_INCREF(__pyx_v_signal);
    PyTuple_SET_ITEM(__pyx_t_5, 0+1, __pyx_v_signal);
    __Pyx_GIVEREF(__pyx_v_signal);
    __pyx_t_2 = __Pyx_PyObject_Call(__pyx_t_4, __pyx_t_5, NULL); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 20; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
  }
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __pyx_v_y = __pyx_t_2;
  __pyx_t_2 = 0;
 21:     cdef unsigned int n
 22:     # loop over the complete signal
+23:     for n in range(tau, len(signal)):
  __pyx_t_6 = PyObject_Length(__pyx_v_signal); if (unlikely(__pyx_t_6 == -1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 23; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  for (__pyx_t_7 = __pyx_v_tau; __pyx_t_7 < __pyx_t_6; __pyx_t_7+=1) {
    __pyx_v_n = __pyx_t_7;
 24:         # add a delayed version of the output signal
+25:         y[n] = signal[n] + alpha * y[n - tau]
    __pyx_t_2 = __Pyx_GetItemInt(__pyx_v_signal, __pyx_v_n, unsigned int, 0, __Pyx_PyInt_From_unsigned_int, 0, 0, 1); if (unlikely(__pyx_t_2 == NULL)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 25; __pyx_clineno = __LINE__; goto __pyx_L1_error;};
    __Pyx_GOTREF(__pyx_t_2);
    __pyx_t_4 = PyFloat_FromDouble(__pyx_v_alpha); if (unlikely(!__pyx_t_4)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 25; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_4);
    __pyx_t_8 = (__pyx_v_n - __pyx_v_tau);
    __pyx_t_5 = __Pyx_GetItemInt(__pyx_v_y, __pyx_t_8, unsigned int, 0, __Pyx_PyInt_From_unsigned_int, 0, 0, 1); if (unlikely(__pyx_t_5 == NULL)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 25; __pyx_clineno = __LINE__; goto __pyx_L1_error;};
    __Pyx_GOTREF(__pyx_t_5);
    __pyx_t_3 = PyNumber_Multiply(__pyx_t_4, __pyx_t_5); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 25; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_3);
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
    __pyx_t_5 = PyNumber_Add(__pyx_t_2, __pyx_t_3); if (unlikely(!__pyx_t_5)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 25; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_5);
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
    if (unlikely(__Pyx_SetItemInt(__pyx_v_y, __pyx_v_n, __pyx_t_5, unsigned int, 0, __Pyx_PyInt_From_unsigned_int, 0, 0, 1) < 0)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 25; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
  }
 26:     # return the filtered signal
+27:     return y
  __Pyx_XDECREF(__pyx_r);
  __Pyx_INCREF(__pyx_v_y);
  __pyx_r = __pyx_v_y;
  goto __pyx_L0;

In line 25, we still have calls to Python within the loop (e.g. PyNumber_Multiply and PyNumber_Add).

We can get rid of these by statically typing the signal as well. Unfortunately, we lose the ability to call the filter function with a signal of arbitrary dimensions, since Cython needs to know the dimensions of the signal beforehand.


In [54]:
%%cython

import numpy as np
cimport cython
cimport numpy as np

def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y

In [56]:
%timeit feed_backward_comb_filter_1d(np.arange(100000.), 4, 0.5)


1000 loops, best of 3: 522 µs per loop

Much better, let's check again.


In [57]:
%%cython -a

import numpy as np
cimport cython
cimport numpy as np

def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y


Out[57]:

Generated by Cython 0.22

 01: 
+02: import numpy as np
  __pyx_t_1 = __Pyx_Import(__pyx_n_s_numpy, 0, -1); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_1) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
/* … */
  __pyx_t_1 = PyDict_New(); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_1) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 03: cimport cython
 04: cimport numpy as np
 05: 
+06: def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
/* Python wrapper */
static PyObject *__pyx_pw_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_1feed_backward_comb_filter_1d(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static char __pyx_doc_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_feed_backward_comb_filter_1d[] = "\n    Filter the signal with a feed backward comb filter.\n\n    :param signal: signal\n    :param tau:    delay length\n    :param alpha:  scaling factor\n    :return:       comb filtered signal\n\n    ";
static PyMethodDef __pyx_mdef_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_1feed_backward_comb_filter_1d = {"feed_backward_comb_filter_1d", (PyCFunction)__pyx_pw_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_1feed_backward_comb_filter_1d, METH_VARARGS|METH_KEYWORDS, __pyx_doc_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_feed_backward_comb_filter_1d};
static PyObject *__pyx_pw_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_1feed_backward_comb_filter_1d(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  PyArrayObject *__pyx_v_signal = 0;
  unsigned int __pyx_v_tau;
  float __pyx_v_alpha;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("feed_backward_comb_filter_1d (wrapper)", 0);
  {
    static PyObject **__pyx_pyargnames[] = {&__pyx_n_s_signal,&__pyx_n_s_tau,&__pyx_n_s_alpha,0};
    PyObject* values[3] = {0,0,0};
    if (unlikely(__pyx_kwds)) {
      Py_ssize_t kw_args;
      const Py_ssize_t pos_args = PyTuple_GET_SIZE(__pyx_args);
      switch (pos_args) {
        case  3: values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
        case  2: values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
        case  1: values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = PyDict_Size(__pyx_kwds);
      switch (pos_args) {
        case  0:
        if (likely((values[0] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_signal)) != 0)) kw_args--;
        else goto __pyx_L5_argtuple_error;
        case  1:
        if (likely((values[1] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_tau)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("feed_backward_comb_filter_1d", 1, 3, 3, 1); {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
        }
        case  2:
        if (likely((values[2] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_alpha)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("feed_backward_comb_filter_1d", 1, 3, 3, 2); {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
        }
      }
      if (unlikely(kw_args > 0)) {
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_pyargnames, 0, values, pos_args, "feed_backward_comb_filter_1d") < 0)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
      }
    } else if (PyTuple_GET_SIZE(__pyx_args) != 3) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
      values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
      values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
    }
    __pyx_v_signal = ((PyArrayObject *)values[0]);
    __pyx_v_tau = __Pyx_PyInt_As_unsigned_int(values[1]); if (unlikely((__pyx_v_tau == (unsigned int)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 7; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
    __pyx_v_alpha = __pyx_PyFloat_AsFloat(values[2]); if (unlikely((__pyx_v_alpha == (float)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 8; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("feed_backward_comb_filter_1d", 1, 3, 3, PyTuple_GET_SIZE(__pyx_args)); {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L3_error;}
  __pyx_L3_error:;
  __Pyx_AddTraceback("_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c.feed_backward_comb_filter_1d", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  if (unlikely(!__Pyx_ArgTypeTest(((PyObject *)__pyx_v_signal), __pyx_ptype_5numpy_ndarray, 1, "signal", 0))) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __pyx_r = __pyx_pf_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_feed_backward_comb_filter_1d(__pyx_self, __pyx_v_signal, __pyx_v_tau, __pyx_v_alpha);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  goto __pyx_L0;
  __pyx_L1_error:;
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_feed_backward_comb_filter_1d(CYTHON_UNUSED PyObject *__pyx_self, PyArrayObject *__pyx_v_signal, unsigned int __pyx_v_tau, float __pyx_v_alpha) {
  PyArrayObject *__pyx_v_y = 0;
  unsigned int __pyx_v_n;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_signal;
  __Pyx_Buffer __pyx_pybuffer_signal;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_y;
  __Pyx_Buffer __pyx_pybuffer_y;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("feed_backward_comb_filter_1d", 0);
  __pyx_pybuffer_y.pybuffer.buf = NULL;
  __pyx_pybuffer_y.refcount = 0;
  __pyx_pybuffernd_y.data = NULL;
  __pyx_pybuffernd_y.rcbuffer = &__pyx_pybuffer_y;
  __pyx_pybuffer_signal.pybuffer.buf = NULL;
  __pyx_pybuffer_signal.refcount = 0;
  __pyx_pybuffernd_signal.data = NULL;
  __pyx_pybuffernd_signal.rcbuffer = &__pyx_pybuffer_signal;
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_signal.rcbuffer->pybuffer, (PyObject*)__pyx_v_signal, &__Pyx_TypeInfo_nn___pyx_t_5numpy_float_t, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  }
  __pyx_pybuffernd_signal.diminfo[0].strides = __pyx_pybuffernd_signal.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_signal.diminfo[0].shape = __pyx_pybuffernd_signal.rcbuffer->pybuffer.shape[0];
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  { PyObject *__pyx_type, *__pyx_value, *__pyx_tb;
    __Pyx_ErrFetch(&__pyx_type, &__pyx_value, &__pyx_tb);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_signal.rcbuffer->pybuffer);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_y.rcbuffer->pybuffer);
  __Pyx_ErrRestore(__pyx_type, __pyx_value, __pyx_tb);}
  __Pyx_AddTraceback("_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c.feed_backward_comb_filter_1d", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  goto __pyx_L2;
  __pyx_L0:;
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_signal.rcbuffer->pybuffer);
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_y.rcbuffer->pybuffer);
  __pyx_L2:;
  __Pyx_XDECREF((PyObject *)__pyx_v_y);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__8 = PyTuple_Pack(5, __pyx_n_s_signal, __pyx_n_s_tau, __pyx_n_s_alpha, __pyx_n_s_y, __pyx_n_s_n); if (unlikely(!__pyx_tuple__8)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_tuple__8);
  __Pyx_GIVEREF(__pyx_tuple__8);
/* … */
  __pyx_t_1 = PyCFunction_NewEx(&__pyx_mdef_46_cython_magic_9d78bae6fa67307cade2c9dc19d94d0c_1feed_backward_comb_filter_1d, NULL, __pyx_n_s_cython_magic_9d78bae6fa67307cad); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_feed_backward_comb_filter_1d, __pyx_t_1) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 07:                                  unsigned int tau,
 08:                                  float alpha):
 09:     """
 10:     Filter the signal with a feed backward comb filter.
 11: 
 12:     :param signal: signal
 13:     :param tau:    delay length
 14:     :param alpha:  scaling factor
 15:     :return:       comb filtered signal
 16: 
 17:     """
 18:     # y[n] = x[n] + α * y[n - τ]
+19:     if tau <= 0:
  __pyx_t_1 = ((__pyx_v_tau <= 0) != 0);
  if (__pyx_t_1) {
+20:         raise ValueError('tau must be greater than 0')
    __pyx_t_2 = __Pyx_PyObject_Call(__pyx_builtin_ValueError, __pyx_tuple_, NULL); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 20; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_Raise(__pyx_t_2, 0, 0, 0);
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
    {__pyx_filename = __pyx_f[0]; __pyx_lineno = 20; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  }
/* … */
  __pyx_tuple_ = PyTuple_Pack(1, __pyx_kp_s_tau_must_be_greater_than_0); if (unlikely(!__pyx_tuple_)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 20; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_tuple_);
  __Pyx_GIVEREF(__pyx_tuple_);
 21:     # type definitions
+22:     cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
  __pyx_t_3 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 22; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_4 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_copy); if (unlikely(!__pyx_t_4)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 22; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = NULL;
  if (CYTHON_COMPILING_IN_CPYTHON && unlikely(PyMethod_Check(__pyx_t_4))) {
    __pyx_t_3 = PyMethod_GET_SELF(__pyx_t_4);
    if (likely(__pyx_t_3)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4);
      __Pyx_INCREF(__pyx_t_3);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_4, function);
    }
  }
  if (!__pyx_t_3) {
    __pyx_t_2 = __Pyx_PyObject_CallOneArg(__pyx_t_4, ((PyObject *)__pyx_v_signal)); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 22; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_2);
  } else {
    __pyx_t_5 = PyTuple_New(1+1); if (unlikely(!__pyx_t_5)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 22; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_5);
    PyTuple_SET_ITEM(__pyx_t_5, 0, __pyx_t_3); __Pyx_GIVEREF(__pyx_t_3); __pyx_t_3 = NULL;
    __Pyx_INCREF(((PyObject *)__pyx_v_signal));
    PyTuple_SET_ITEM(__pyx_t_5, 0+1, ((PyObject *)__pyx_v_signal));
    __Pyx_GIVEREF(((PyObject *)__pyx_v_signal));
    __pyx_t_2 = __Pyx_PyObject_Call(__pyx_t_4, __pyx_t_5, NULL); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 22; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_2);
    __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
  }
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  if (!(likely(((__pyx_t_2) == Py_None) || likely(__Pyx_TypeTest(__pyx_t_2, __pyx_ptype_5numpy_ndarray))))) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 22; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __pyx_t_6 = ((PyArrayObject *)__pyx_t_2);
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_y.rcbuffer->pybuffer, (PyObject*)__pyx_t_6, &__Pyx_TypeInfo_nn___pyx_t_5numpy_float_t, PyBUF_FORMAT| PyBUF_STRIDES| PyBUF_WRITABLE, 1, 0, __pyx_stack) == -1)) {
      __pyx_v_y = ((PyArrayObject *)Py_None); __Pyx_INCREF(Py_None); __pyx_pybuffernd_y.rcbuffer->pybuffer.buf = NULL;
      {__pyx_filename = __pyx_f[0]; __pyx_lineno = 22; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    } else {__pyx_pybuffernd_y.diminfo[0].strides = __pyx_pybuffernd_y.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_y.diminfo[0].shape = __pyx_pybuffernd_y.rcbuffer->pybuffer.shape[0];
    }
  }
  __pyx_t_6 = 0;
  __pyx_v_y = ((PyArrayObject *)__pyx_t_2);
  __pyx_t_2 = 0;
 23:     cdef unsigned int n
 24:     # loop over the complete signal
+25:     for n in range(tau, len(signal)):
  __pyx_t_7 = PyObject_Length(((PyObject *)__pyx_v_signal)); if (unlikely(__pyx_t_7 == -1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 25; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  for (__pyx_t_8 = __pyx_v_tau; __pyx_t_8 < __pyx_t_7; __pyx_t_8+=1) {
    __pyx_v_n = __pyx_t_8;
 26:         # add a delayed version of the output signal
+27:         y[n] += alpha * y[n - tau]
    __pyx_t_9 = (__pyx_v_n - __pyx_v_tau);
    __pyx_t_10 = -1;
    if (unlikely(__pyx_t_9 >= (size_t)__pyx_pybuffernd_y.diminfo[0].shape)) __pyx_t_10 = 0;
    if (unlikely(__pyx_t_10 != -1)) {
      __Pyx_RaiseBufferIndexError(__pyx_t_10);
      {__pyx_filename = __pyx_f[0]; __pyx_lineno = 27; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    }
    __pyx_t_11 = __pyx_v_n;
    __pyx_t_10 = -1;
    if (unlikely(__pyx_t_11 >= (size_t)__pyx_pybuffernd_y.diminfo[0].shape)) __pyx_t_10 = 0;
    if (unlikely(__pyx_t_10 != -1)) {
      __Pyx_RaiseBufferIndexError(__pyx_t_10);
      {__pyx_filename = __pyx_f[0]; __pyx_lineno = 27; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    }
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float_t *, __pyx_pybuffernd_y.rcbuffer->pybuffer.buf, __pyx_t_11, __pyx_pybuffernd_y.diminfo[0].strides) += (__pyx_v_alpha * (*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float_t *, __pyx_pybuffernd_y.rcbuffer->pybuffer.buf, __pyx_t_9, __pyx_pybuffernd_y.diminfo[0].strides)));
  }
 28:     # return the filtered signal
+29:     return y
  __Pyx_XDECREF(__pyx_r);
  __Pyx_INCREF(((PyObject *)__pyx_v_y));
  __pyx_r = ((PyObject *)__pyx_v_y);
  goto __pyx_L0;

For the sake of completeness, let's get rid of these Pyx_RaiseBufferIndexError in line 27 as well. We tell Cython that it does not need to check for bounds by adding a @cython.boundscheck(False) decorator.


In [58]:
%%cython

import numpy as np
cimport cython
cimport numpy as np

@cython.boundscheck(False)
def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y

In [59]:
%timeit feed_backward_comb_filter_1d(np.arange(100000.), 4, 0.5)


1000 loops, best of 3: 518 µs per loop

Ok, we are now on par with the feed forward Numpy variant -- or even a bit better :)

To get back the flexibility of the Python/Numpy solution to be able to handle signals of arbitrary dimension, we need to define a wrapper function (in pure Python):


In [60]:
%%cython

import numpy as np

cimport cython
cimport numpy as np

def feed_backward_comb_filter(signal, tau, alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    if signal.ndim == 1:
        return feed_backward_comb_filter_1d(signal, tau, alpha)
    elif signal.ndim == 2:
        return feed_backward_comb_filter_2d(signal, tau, alpha)
    else:
        raise ValueError('signal must be 1d or 2d')

@cython.boundscheck(False)
def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y

@cython.boundscheck(False)
def feed_backward_comb_filter_2d(np.ndarray[np.float_t, ndim=2] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.

    :param signal: signal
    :param tau:    delay length
    :param alpha:  scaling factor
    :return:       comb filtered signal

    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=2] y = np.copy(signal)
    cdef unsigned int d, n
    # loop over the dimensions
    for d in range(2):
        # loop over the complete signal
        for n in range(tau, len(signal)):
            # add a delayed version of the output signal
            y[n, d] += alpha * y[n - tau, d]
    # return
    return y

In [61]:
%timeit feed_backward_comb_filter(np.arange(100000.), 4, 0.5)
%timeit feed_backward_comb_filter(np.arange(100000.).reshape(-1, 2), 4, 0.5)


1000 loops, best of 3: 498 µs per loop
1000 loops, best of 3: 531 µs per loop