While writing (and maintaining) Python code is often much easier than writing similar code in traditional high performance computing languages such as C, C++, or Fortran, Python programs generally run slower than similar programs written in those compiled languages. In cases where end-to-end time (i.e., concept to execution) matters less than raw execution speed, perhaps because an application will be run many times, a programmer will need to consider new approaches to increase the performance of a Python program.
Before proceeding further, however, some strong words of caution. Many programmers spend an inordinate amount of time on unneeded optimizations. To quote Donald Knuth (1974 Turing Award Lecture):
> Premature optimization is the root of all evil (or at least most of it) in programming.
Put simply, one should not worry about optimization until it has been shown to be necessary. And then one needs to decide very carefully what can and should be optimized. This follows from Amdahl's law, which quantifies the maximum speed-up possible by improving the execution speed of only part of a program. For example, if only half of a program can be optimized, then the maximum possible speed-up is two times the original version. While modern multi- and many-core systems offer new performance benefits, they also come at an increased cost in code development, maintenance, and readability.
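In its standard form, Amdahl's law states that if a fraction $P$ of a program's execution time can be sped up by a factor $N$, the overall speed-up is

$$S = \frac{1}{(1 - P) + \frac{P}{N}}$$

so with $P = 0.5$, $S$ approaches two even as $N$ grows without bound, matching the example above.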
Python, however, does provide both standard modules (i.e., included with the standard Python distribution) to leverage threading and multi-processing, as well as additional libraries and tools that can significantly improve the performance of specific types of applications. One major limitation that must be overcome when employing these techniques is the Global Interpreter Lock, or GIL. The GIL is used by the standard Python interpreter (known as CPython) to allow only one thread to execute Python code at a time. This is done to simplify the implementation of the Python object model and to safeguard against concurrent access. In other words, the entire Python interpreter is locked, and only one thread at a time is allowed access.
While this simplifies the development of the Python interpreter, it diminishes the ability of Python programs to leverage the inherent parallelism of multi-processor machines. Two caveats to the GIL are that the global lock is always released when performing IO operations (which might otherwise block or take a lengthy period) and that either standard or third-party extension modules can explicitly release the global lock when doing computationally intensive tasks.
In the rest of this Notebook, we will first explore standard Python modules for improving program performance. Next, we will explore the use of the IPython parallel programming capabilities. We will then discuss some non-standard extensions that can be used to improve application performance. We will finish with a quick introduction to several Python projects that target high performance programming paradigms.
The Python interpreter comes with a number of standard modules that collectively form the [Python Standard Library][sl]. The Python3 standard library contains a set of related modules for concurrent execution that includes the `threading`, `multiprocessing`, `concurrent.futures`, `subprocess`, `sched`, and `queue` modules. In this section, we will quickly introduce the first two modules, although the `concurrent.futures` module also looks promising as a way to employ either threads or processes in a similar manner, as the sketch below suggests.
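The following is a minimal sketch, not drawn from the discussion above, showing how `concurrent.futures` exposes the same interface for thread- and process-based execution; the `square` function is just an illustrative placeholder:

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def square(x):
        return x * x

    # ThreadPoolExecutor and ProcessPoolExecutor share the same interface,
    # so swapping the executor class switches between threads and processes
    # (ProcessPoolExecutor also needs the usual __main__ guard).
    with ThreadPoolExecutor(max_workers=4) as executor:
        print(list(executor.map(square, range(8))))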
Threads are lightweight processing elements that are often used to improve code performance by allowing multiple threads of program execution to occur within a single process. In Python, however, threads do not in general offer the same level of performance improvement seen in other programming languages, since Python employs the global interpreter lock. Yet the `threading` module can still offer some improvement to IO-intensive applications, and it can also provide an easier path to learning how to effectively employ parallel programming (which will subsequently be useful when using other techniques such as the `multiprocessing` module or HPC constructs like MPI or OpenCL).
The `threading` module is built on the `Thread` object, which encapsulates the creation, interaction, and destruction of threads in a Python program. In this Notebook we will simply introduce the basic concepts; a number of other resources exist to provide additional details.
To use a thread in Python, we first must create a `Thread` object, to which we can assign a name, a function to execute, and the arguments that should be passed to that function. For example, given a function `my_func` that takes a single integer value, we could create a new thread by executing the following Python statement:
    t = threading.Thread(target=my_func, args=(10,))
We build on this simple example in the following code cell to demonstrate how to create and use a worker thread.
In [2]:
import threading
import time

# Generic worker thread
def worker(num):
    # Get this thread's name
    name = threading.current_thread().name
    # Print starting message
    print('{0:s} starting.\n'.format(name))
    # We sleep for five seconds
    time.sleep(5)
    # Print computation
    print('Computation = {0:d}\n'.format(10**num))
    # Print exiting message
    print('{0:s} exiting.\n'.format(name))

# We will spawn several threads.
for i in range(5):
    t = threading.Thread(name='Thread #{0:d}'.format(i), target=worker, args=(i,))
    t.start()
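These threads run independently once started. A common next step, shown in this minimal sketch building on the `worker` function above, is to keep a reference to each `Thread` and call `join()` so the main program blocks until every worker has finished:

    threads = []
    for i in range(5):
        t = threading.Thread(name='Thread #{0:d}'.format(i), target=worker, args=(i,))
        threads.append(t)
        t.start()

    # Wait for every worker thread to exit before continuing.
    for t in threads:
        t.join()

    print('All threads have finished.')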
One way to circumvent the GIL is to use multiple Python interpreters, each running in its own process. This can be accomplished by using the `multiprocessing` module. In this module, processes essentially take the place of threads, but since each process will read the same Python code file, we need to ensure that only one process (the main process) creates the other processes, or else we can create an infinite loop that quickly consumes all hardware resources. This is done by placing the following statement before the main program body:
    if __name__ == '__main__':
Inside the main program code, we can create `Process` objects and start them in a manner similar to the threads we used earlier.
In [3]:
import multiprocessing
import time

# Generic worker process
def worker(num):
    # Get this process' name
    name = multiprocessing.current_process().name
    # Print starting message
    print('{0:s} starting.\n'.format(name))
    # We sleep for five seconds
    time.sleep(5)
    # Print computation
    print('Computation = {0:d}\n'.format(num**10))
    # Print exiting message
    print('{0:s} exiting.\n'.format(name))

if __name__ == '__main__':
    # We will spawn several processes.
    for i in range(5):
        p = multiprocessing.Process(name='Process #{0:d}'.format(i), target=worker, args=(i,))
        p.start()
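For data-parallel tasks, the `multiprocessing` module also provides a `Pool` object that maps a function over a sequence of inputs and gathers the results. The following minimal sketch uses an illustrative `cube` function:

    import multiprocessing

    def cube(x):
        return x**3

    if __name__ == '__main__':
        # Create four worker processes and map the function over the
        # inputs; results come back in the original order.
        with multiprocessing.Pool(processes=4) as pool:
            results = pool.map(cube, range(10))
        print(results)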
The IPython server has built-in support for parallel processing. This can be initialized in an automated manner by using the `ipcluster` command, or in a manual approach by using the `ipcontroller` and `ipengine` commands. The first approach simply automates the process of using the controller and engines, and requires the creation of an IPython profile, which is done by using the `ipython profile create` command. `ipcluster` works with both MPI and batch processing clusters (e.g., via PBS), and can be made to work with other schedulers such as Condor.
If necessary, you can also manually control the process by directly instantiating the IPython controller and engines. The controller must be started first, after which you can create as many engines as necessary, given your hardware constraints. IPython clustering works best on multi-processor machines or compute clusters.
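Once a cluster is running (for example, via `ipcluster start -n 4`), a client can connect to the engines and distribute work among them. The following is a minimal sketch; depending on your IPython version, the `Client` class is importable from `IPython.parallel` or from the separate `ipyparallel` package:

    from ipyparallel import Client  # older IPython: from IPython.parallel import Client

    # Connect to the running controller and obtain a view of all engines.
    rc = Client()
    dview = rc[:]

    # Apply a function across the engines, blocking until all results return.
    results = dview.map_sync(lambda x: x**2, range(16))
    print(results)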
There are a number of third-party Python modules or packages that can be used to improve the performance of a Python application.
Numba is a just-in-time (JIT) compiler from Continuum Analytics that can increase the performance of certain functions (e.g., numerical work); a short sketch follows these descriptions.
PyPy is an alternative implementation of the Python language that includes a just-in-time compiler that speeds up many Python programs.
Cython is a static optimizing compiler for Python and also provides a method for easily including C or C++ code in a Python program.
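As an illustration of the Numba approach, the following minimal sketch (the `sum_sqrt` function is a hypothetical example) uses the `@jit` decorator to compile a numerical loop to machine code on its first call:

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def sum_sqrt(values):
        # A simple numerical loop that Numba compiles to machine code.
        total = 0.0
        for v in values:
            total += np.sqrt(v)
        return total

    data = np.arange(1000000, dtype=np.float64)
    print(sum_sqrt(data))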
While Python programs can easily be used for embarrassingly parallel programming on high performance compute systems, and Python is also used to glue advanced computation programs together for batch processing, there are also projects underway that enable Python code to directly leverage high performance programming paradigms:
MPI, the Message Passing Interface, is a protocol used to communicate messages (or data) between compute nodes in a large, distributed compute cluster. mpi4py is a Python module that brings a significant part of the MPI specification to Python programs, as the sketch below suggests.
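The following minimal mpi4py sketch (assuming mpi4py and an MPI runtime are installed) reports each process's rank; it would be launched with something like `mpiexec -n 4 python script.py`:

    from mpi4py import MPI

    # Each MPI process discovers its rank within the global communicator.
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    print('Hello from process {0:d} of {1:d}'.format(rank, size))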
OpenCL is a framework that enables programs to run on heterogeneous platforms including CPUs, GPUs, DSPs, and FPGAs. The PyOpenCL package enables Python programs to use OpenCL to write code that runs on these different processor types efficiently and effectively, as in the sketch below.
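The following minimal PyOpenCL sketch (assuming PyOpenCL and a working OpenCL platform are installed) doubles a NumPy array on whatever device the context selects:

    import numpy as np
    import pyopencl as cl

    # Create a context and command queue on an available OpenCL device.
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a = np.arange(1024, dtype=np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # A simple kernel that doubles each element of the input array.
    program = cl.Program(ctx, """
    __kernel void double_it(__global const float *a, __global float *out)
    {
        int gid = get_global_id(0);
        out[gid] = 2.0f * a[gid];
    }
    """).build()

    program.double_it(queue, a.shape, None, a_buf, out_buf)

    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, out_buf)
    print(result[:8])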