Python (more precisely: the CPython interpreter) uses a Global Interpreter Lock (GIL), which prevents multiple threads from executing Python code at the same time.
In this tutorial, we'll cover these common problems/situations/solutions:
- CPU bound pure Python code → run it in parallel with multiprocessing
- loops over arrays → vectorise them with Numpy
- loops that cannot be vectorised → compile them with Cython
This list is by no means complete and is just a starting point for getting you into parallel/efficient coding with Python. There are plenty of other very nice solutions for all kinds of problems out there, but they are beyond the scope of this tutorial.
If you are not familiar with Python, I'd suggest learning it the (not so) hard way. Or continue reading, it describes very briefly the needed basics.
This is how we define functions:
In [1]:
def function(argument):
    """
    This function just prints the given argument.
    :param argument: what to print
    :returns: also returns the argument
    """
    print("Calling function with argument %s" % argument)
    return argument
In [2]:
function(42)
Calling function with argument 42
Out[2]:
42
This is how we define classes:
In [3]:
class Class(object):
    """A simple meaningless Class"""

    def __init__(self, attribute):
        """
        This method gets called during object initialisation.
        :param attribute: attribute of the class to save
        """
        self.attribute = attribute

    def tell(self):
        """Tell the world about us."""
        print("I'm a Class with a very nice attribute: %s" % self.attribute)
In [4]:
A = Class('smart')
A.tell()
I'm a Class with a very nice attribute: smart
Since all (our) functions and classes are well documented, we can always ask how to use them:
In [5]:
function?
In [6]:
Class?
Python also has lists which can be accessed by index (indices always start at 0):
In [7]:
x = [1, 2, 4.5, 'bla']
In [8]:
x[1]
Out[8]:
2
In [9]:
x[3]
Out[9]:
'bla'
In [10]:
x.index('bla')
Out[10]:
3
One of the fancier things we can do in Python is list comprehensions.
Here's a simple example of how to calculate the squares of some numbers:
In [11]:
[x ** 2 for x in range(10)]
Out[11]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Ok, that's it for now, let's speed things up :)
From the Python threading documentation (https://docs.python.org/2/library/threading.html):
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
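To illustrate the I/O-bound case, here's a minimal threading sketch (not part of the original examples; the URLs are just placeholders) which downloads several pages concurrently:
import threading
import urllib2

def download(url):
    # I/O-bound work: the GIL is released while waiting for the network
    return urllib2.urlopen(url).read()

# placeholder URLs, purely for illustration
urls = ['http://example.com/a', 'http://example.com/b']
threads = [threading.Thread(target=download, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()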
All my examples are CPU bound, so let's move on to multiprocessing -- it also has a simpler interface ;)
In [12]:
import multiprocessing as mp
We consider a function which sums all primes below a given number. Source: http://www.parallelpython.com/content/view/17/31/#SUM_PRIMES
In [13]:
import math
def isprime(n):
    """Returns True if n is prime and False otherwise"""
    if not isinstance(n, int):
        raise TypeError("argument passed to isprime is not of 'int' type")
    if n < 2:
        return False
    if n == 2:
        return True
    max = int(math.ceil(math.sqrt(n)))
    i = 2
    while i <= max:
        if n % i == 0:
            return False
        i += 1
    return True

def sum_primes(n):
    """Calculates sum of all primes below given integer n"""
    return sum([x for x in xrange(2, n) if isprime(x)])
If we simply use the list comprehension without the sum, we get a list of all primes below the given number:
In [14]:
[x for x in xrange(10) if isprime(x)]
Out[14]:
[2, 3, 5, 7]
In [15]:
sum_primes(10)
Out[15]:
17
In [16]:
%timeit sum_primes(10)
Not bad, why should we speed up something like this?
Let's see what happens if we ask for the sum of primes below a larger number:
In [17]:
%timeit sum_primes(10000)
What if we do this for a bunch of numbers, e.g.:
In [18]:
%timeit [sum_primes(n) for n in xrange(1000)]
Ok, this is definitely too slow, but we should be able to run this in parallel easily, since we are asking for the same thing (summing all prime numbers below the given number) for a lot of numbers.
We do this by calling this very same function a thousand times (inside the list comprehension).
Python has a map function, which applies a function to every item of a list.
In [19]:
map(sum_primes, xrange(10))
Out[19]:
[0, 0, 0, 2, 5, 5, 10, 10, 17, 17]
This is basically the same as what we had before with our list comprehension.
In [20]:
%timeit map(sum_primes, xrange(1000))
multiprocessing offers the same map function in its Pool class. Let's create a Pool with 2 worker processes.
In [21]:
pool = mp.Pool(2)
Now we can use the pool's map function to do the same as before, but in parallel.
In [22]:
pool.map(sum_primes, xrange(10))
Out[22]:
[0, 0, 0, 2, 5, 5, 10, 10, 17, 17]
In [23]:
%timeit pool.map(sum_primes, xrange(1000))
This is a speed-up of almost 2x, which is exactly what we expected (using 2 processes minus some overhead).
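When we are done with a pool, it is good practice to release its worker processes again (a small side note, not part of the original timing cells):
# shut the pool down and wait for the workers to terminate
pool.close()
pool.join()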
Numpy does an excellent job here and automatically does some things in parallel (e.g. matrix multiplication).
It uses highly optimised linear algebra packages to do things really fast.
In [24]:
import numpy as np
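For instance, a matrix multiplication like the following is handed off to the underlying BLAS library, which (depending on the implementation) may run multi-threaded -- a quick sketch:
# create two random matrices and multiply them;
# np.dot dispatches to the optimised BLAS backend
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = np.dot(a, b)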
Some Numpy basics:
We can define arrays as simply as that:
In [25]:
x = np.array([1, 2, 5, 15])
In [26]:
x
Out[26]:
array([ 1,  2,  5, 15])
As long as it can be cast to an array, we can use almost anything as input for an array:
In [27]:
np.array(map(sum_primes, xrange(10)))
Out[27]:
array([ 0,  0,  0,  2,  5,  5, 10, 10, 17, 17])
Or we define arrays with some special functions:
In [28]:
np.zeros(10)
Out[28]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
In [29]:
np.arange(10.)
Out[29]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
Numpy supports indexing and slicing:
In [30]:
x = np.arange(10)
Get a single item of the array:
In [31]:
x[3]
Out[31]:
3
Get a slice of the array:
In [32]:
x[1:5]
Out[32]:
array([1, 2, 3, 4])
Get everything starting from index 4 to the end:
In [33]:
x[4:]
Out[33]:
array([4, 5, 6, 7, 8, 9])
Negative indices are counted backwards from the end. Get everything before the last element:
In [34]:
x[:-1]
Out[34]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
In signal processing, a comb filter adds a delayed version of a signal to itself, causing constructive and destructive interference [Wikipedia].
These filters can be either feed forward or feed backward, depending on whether the signal itself or the output of the filter is delayed and added to the signal.
Feed forward:
$y[n] = x[n] + \alpha \cdot x[n - \tau]$
Feed backward:
$y[n] = x[n] + \alpha \cdot y[n - \tau]$
Again, we start with a Python-only solution (it already has some Numpy in there, but that's not the point; it uses a loop for adding the delayed portion of the signal to the output).
In [35]:
def feed_forward_comb_filter_loop(signal, tau, alpha):
    """
    Filter the signal with a feed forward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * x[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # init the output array
    y = np.zeros(len(signal))
    # iterate over the signal
    for i in range(len(signal)):
        # add the delayed version of the signal to itself
        if i < tau:
            y[i] = signal[i]
        else:
            y[i] = signal[i] + alpha * signal[i - tau]
    # return the filtered signal
    return y
In [36]:
feed_forward_comb_filter_loop(np.arange(10.), 4, 0.5)
Out[36]:
array([  0. ,   1. ,   2. ,   3. ,   4. ,   5.5,   7. ,   8.5,  10. ,  11.5])
In [37]:
%timeit feed_forward_comb_filter_loop(np.arange(100000.), 4, 0.5)
Let's vectorise this by removing the loop (the for i in range(len(signal)) stuff):
In [38]:
def feed_forward_comb_filter(signal, tau, alpha):
    """
    Filter the signal with a feed forward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * x[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # init the output array as a copy of the input signal, since
    # the output is the input signal plus a delayed version of it
    y = signal.copy()
    # add the delayed signal, starting at index tau
    y[tau:] += alpha * signal[:-tau]
    # return the filtered signal
    return y
In [39]:
%timeit feed_forward_comb_filter(np.arange(100000.), 4, 0.5)
This is a nice ~67x speed-up (523 µs vs. 35.1 ms). Continue with the feed backward example...
The feed backward comb filter variant, implemented with a loop:
In [40]:
def feed_backward_comb_filter_loop(signal, tau, alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # init the output array as a copy of the input signal
    y = signal.copy()
    # loop over the signal, starting at tau
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y
In [41]:
%timeit feed_backward_comb_filter_loop(np.arange(100000.), 4, 0.5)
The backward variant has basically the same runtime as the forward (loop-)version, but unfortunately, we cannot speed this up further with a vectorised expression, since the output depends on the output of a previous step.
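As an aside, this kind of recursion can also be expressed as an IIR filter, e.g. with scipy.signal.lfilter -- a minimal sketch, assuming SciPy is available (we stick with the loop here, since it motivates Cython nicely):
from scipy.signal import lfilter

def feed_backward_comb_filter_lfilter(signal, tau, alpha):
    # y[n] - alpha * y[n - tau] = x[n], i.e. an IIR filter with
    # denominator coefficients a[0] = 1 and a[tau] = -alpha
    a = np.zeros(tau + 1)
    a[0] = 1.
    a[tau] = -alpha
    return lfilter([1.], a, signal)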
And this is where Cython comes in.
Cython can be used to write C extensions for Python.
The Cython compiler converts the code into C and compiles it to a shared object, which can be loaded/imported like any other Python module.
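Outside of IPython, a standalone .pyx file is usually built with a small setup script -- a minimal sketch (the file name comb_filter.pyx is hypothetical):
# setup.py -- minimal sketch for compiling a standalone Cython module
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize('comb_filter.pyx'))
Running python setup.py build_ext --inplace then produces the shared object, which can be imported as a normal module.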
First we need to make Cython available: in normal Cython files via cimport cython, in IPython by loading the Cython extension:
In [42]:
%load_ext Cython
To be able to use Cython code within IPython, we need to put the %%cython cell magic in the first line of a cell. Then we can start writing normal Python code.
In [43]:
%%cython
# magic cython handler for IPython (must be first line of a cell)
def sum_two_numbers(a, b):
    return a + b
Cython then compiles and loads everything transparently.
In [44]:
sum_two_numbers(10, 5)
Out[44]:
15
Cython gains most of its tremendous speed from static typing.
Let's try the isprime example from before again:
In [45]:
%%cython
# import some C functions to avoid calls to the Python interpreter
from libc.math cimport sqrt, ceil

# define type n as integer
def isprime_cython(int n):
    """Returns True if n is prime and False otherwise"""
    # we can skip the instance check since we have strong typing now
    if n < 2:
        return False
    if n == 2:
        return True
    # define type max as integer
    cdef int max
    # use the C ceil & sqrt functions instead of Python's own
    max = int(ceil(sqrt(n)))
    # define type i as integer
    cdef int i = 2
    while i <= max:
        if n % i == 0:
            return False
        i += 1
    return True

def sum_primes_cython(n):
    """Calculates sum of all primes below given integer n"""
    return sum([x for x in xrange(2, n) if isprime_cython(x)])
Let's see if (and how much) it helps:
In [46]:
%timeit sum_primes(10000)
%timeit sum_primes_cython(10000)
speed-up ~24x (686 µs vs. 16.5 ms)
In [47]:
%timeit -n 1 [sum_primes(n) for n in xrange(10000)]
%timeit -n 1 [sum_primes_cython(n) for n in xrange(10000)]
speed-up ~25x (3.1 s vs. 76 s)
What if we use this faster version of the function with multiple processes?
In [48]:
%timeit mp.Pool(2).map(sum_primes_cython, (n for n in xrange(10000)))
%timeit mp.Pool(4).map(sum_primes_cython, (n for n in xrange(10000)))
Starting more processes than physical CPU cores gives some more performance, but it does not scale as well because of hyper-threading. The total speed-up compared to the single-process pure Python variant is ~49x (1.56 s vs. 76 s).
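To check how many cores (including hyper-threads) are available, multiprocessing provides cpu_count() -- a quick check:
# number of CPUs the OS reports (physical cores x hyper-threads)
mp.cpu_count()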
Still, isprime_cython is a Python function, which adds some overhead. Since we call this function only from sum_primes_cython, we can make it a C-only function by using the cdef statement instead of def and also defining the return type.
In [49]:
%%cython
from libc.math cimport sqrt, ceil

# make this a C function
cdef int isprime_cython_nogil(int n) nogil:
    """Returns True if n is prime and False otherwise"""
    if n < 2:
        return 0
    if n == 2:
        return 1
    cdef int max
    max = int(ceil(sqrt(n)))
    cdef int i = 2
    while i <= max:
        if n % i == 0:
            return 0
        i += 1
    return 1

def sum_primes_cython_nogil(n):
    """Calculates sum of all primes below given integer n"""
    return sum([x for x in xrange(2, n) if isprime_cython_nogil(x)])
In [50]:
%timeit [sum_primes_cython_nogil(n) for n in xrange(10000)]
%timeit mp.Pool(4).map(sum_primes_cython_nogil, (n for n in xrange(10000)))
Again, a bit faster; total speed-up compared to the single-process pure Python variant is ~64x (1.18 s vs. 76 s)
These are the Cython basics. Let's apply them to the other example, the backward comb filter which we were not able to vectorise:
In [51]:
%%cython
import numpy as np
# we also want to load the C-bindings of numpy with cimport
cimport numpy as np

# statically type the obvious variables (tau, alpha, n)
def feed_backward_comb_filter(signal, unsigned int tau, float alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y
In [52]:
%timeit feed_backward_comb_filter(np.arange(100000.), 4, 0.5)
A bit better (roughly half the time), but still far away from the feed forward variant.
Let's see what kills performance and fix it. Cython has a nice -a switch which highlights all calls into Python in yellow.
In [53]:
%%cython -a
import numpy as np
cimport cython
cimport numpy as np

def feed_backward_comb_filter(signal, unsigned int tau, float alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] = signal[n] + alpha * y[n - tau]
    # return the filtered signal
    return y
Out[53]:
(annotated code listing; lines that interact with Python are highlighted in yellow)
In line 25, we still have calls to Python within the loop (e.g. PyNumber_Multiply and PyNumber_Add).
We can get rid of these by statically typing the signal as well. Unfortunately, we lose the ability to call the filter function with a signal of arbitrary dimensions, since Cython needs to know the dimensions of the signal beforehand.
In [54]:
%%cython
import numpy as np
cimport cython
cimport numpy as np

def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y
In [56]:
%timeit feed_backward_comb_filter_1d(np.arange(100000.), 4, 0.5)
Much better, let's check again.
In [57]:
%%cython -a
import numpy as np
cimport cython
cimport numpy as np

def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y
Out[57]:
(annotated code listing; only the buffer bounds checks still interact with Python)
For the sake of completeness, let's get rid of these Pyx_RaiseBufferIndexError calls in line 27 as well. We tell Cython that it does not need to check for array bounds by adding a @cython.boundscheck(False) decorator.
In [58]:
%%cython
import numpy as np
cimport cython
cimport numpy as np

@cython.boundscheck(False)
def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y
In [59]:
%timeit feed_backward_comb_filter_1d(np.arange(100000.), 4, 0.5)
Ok, we are now on par with the feed forward Numpy variant -- or even a bit better :)
To regain the flexibility of the Python/Numpy solution, i.e. being able to handle signals of arbitrary dimension, we need to define a wrapper function (in pure Python):
In [60]:
%%cython
import numpy as np
cimport cython
cimport numpy as np

def feed_backward_comb_filter(signal, tau, alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    if signal.ndim == 1:
        return feed_backward_comb_filter_1d(signal, tau, alpha)
    elif signal.ndim == 2:
        return feed_backward_comb_filter_2d(signal, tau, alpha)
    else:
        raise ValueError('signal must be 1d or 2d')

@cython.boundscheck(False)
def feed_backward_comb_filter_1d(np.ndarray[np.float_t, ndim=1] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=1] y = np.copy(signal)
    cdef unsigned int n
    # loop over the complete signal
    for n in range(tau, len(signal)):
        # add a delayed version of the output signal
        y[n] += alpha * y[n - tau]
    # return the filtered signal
    return y

@cython.boundscheck(False)
def feed_backward_comb_filter_2d(np.ndarray[np.float_t, ndim=2] signal,
                                 unsigned int tau,
                                 float alpha):
    """
    Filter the signal with a feed backward comb filter.
    :param signal: signal
    :param tau: delay length
    :param alpha: scaling factor
    :return: comb filtered signal
    """
    # y[n] = x[n] + α * y[n - τ]
    if tau <= 0:
        raise ValueError('tau must be greater than 0')
    # type definitions
    cdef np.ndarray[np.float_t, ndim=2] y = np.copy(signal)
    cdef unsigned int d, n
    # loop over the columns of the signal
    for d in range(signal.shape[1]):
        # loop over the complete signal
        for n in range(tau, len(signal)):
            # add a delayed version of the output signal
            y[n, d] += alpha * y[n - tau, d]
    # return the filtered signal
    return y
In [61]:
%timeit feed_backward_comb_filter(np.arange(100000.), 4, 0.5)
%timeit feed_backward_comb_filter(np.arange(100000.).reshape(-1, 2), 4, 0.5)