Optimizing Python Performance

Professor Robert J. Brunner



Introduction

While writing (and maintaining) Python code is often much easier than writing similar code in more traditional high performance computing languages such as C, C++, or Fortran, Python programs generally run more slowly than their counterparts written in those compiled languages. In cases where end-to-end performance (i.e., from concept to execution) is less important than raw execution speed, perhaps because an application will be run many times, a programmer will need to consider new approaches to increase the performance of a Python program.

Before proceeding further, however, some strong words of caution. Many programmers spend an inordinate amount of time on unneeded optimizations. To quote Donald Knuth (1974 Turing Award Lecture):

Premature optimization is the root of all evil (or at least most of it) in programming.

Put simply, one should not worry about optimization until it has been shown to be necessary, and then one needs to decide very carefully what can and should be optimized. This follows Amdahl's law, which quantifies the maximum speed-up possible by improving the execution speed of only part of a program. For example, if only half of a program can be optimized, then the maximum possible speed-up is two times the original version. While modern multi- and many-core systems offer new performance benefits, they also come at an increased cost of code development, maintenance, and readability.
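
Amdahl's law can be stated concisely: if a fraction P of a program's execution time can be sped up by a factor S, the overall speed-up is bounded by 1 / ((1 - P) + P / S). The following minimal sketch (the function name is illustrative) computes this bound:

def amdahl_speedup(p, s):
    # p: fraction of the runtime that benefits from the optimization
    # s: speed-up factor applied to that fraction
    return 1.0 / ((1.0 - p) + p / s)

# Optimizing half of a program, even infinitely well,
# yields at most a two times overall speed-up.
print(amdahl_speedup(0.5, 1e9))  # approximately 2.0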

Python, however, does provide both standard (i.e., included with the standard Python distribution) modules to leverage threading and multiprocessing, as well as additional libraries and tools that can significantly improve the performance of specific types of applications. One major limitation that must be overcome when employing these techniques is the Global Interpreter Lock, or GIL. The GIL is used by the standard Python interpreter (known as CPython) to allow only one thread at a time to execute Python code. This is done to simplify the implementation of the Python object model and to safeguard against concurrent access. In other words, the entire Python interpreter is locked, and only one thread at a time is allowed access.

While this simplifies the development of the Python interpreter, it diminishes the ability of Python programs to leverage the inherent parallelism that is available with multi-processor machines. Two caveats to the GIL are that the global lock is always released when doing IO operations (which might otherwise block or consume a lengthy period) and that either standard or third-party extension modules can explicitly release the global lock when doing computationally intensive tasks.

In the rest of this Notebook, we will first explore standard Python modules for improving program performance. Next, we will explore the use of the IPython parallel programming capabilities. We will then discuss some non-standard extensions that can be used to improve application performance. We will finish with a quick introduction to several Python high performance computing projects.


Standard Python Modules

The Python interpreter comes with a number of standard modules that collectively form the [Python Standard Library][sl]. The Python3 standard library contains a set of related modules for concurrent execution that includes the threading, multiprocessing, concurrent.futures, subprocess, sched, and queue modules. In this section, we will quickly introduce the first two modules, although the concurrent.futures module also looks promising as a way to employ either threads or processes through a common interface, as the brief sketch below illustrates.
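
The following sketch (the function and pool size are illustrative) uses the ThreadPoolExecutor class from concurrent.futures to map a function over several inputs with a pool of threads; substituting ProcessPoolExecutor would perform the same work with processes instead:

from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Map the function over the inputs by using a small pool of worker threads.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(square, range(10)))

print(results)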


Python Threads

Threads are lightweight execution units that are often used to improve code performance by allowing multiple threads of program execution to occur within a single process. In Python, however, threads do not in general offer the same level of performance improvement seen in other programming languages, since Python employs the global interpreter lock. Yet the threading module can still offer some improvement to IO-intensive applications, and it also provides an easier path to learning how to effectively employ parallel programming (which will subsequently be useful when using other techniques, such as the multiprocessing module or HPC constructs like MPI or OpenCL).

The threading module is built on the Thread object, which encapsulates the creation, interaction, and destruction of threads in a Python program. In this Notebook we will simply introduce the basic concepts; a number of other resources exist to provide additional details.

To use a thread in Python, we first must create a Thread object, to which we can assign a name, a function to execute, and the arguments that should be passed to that function. For example, given a function my_func that takes a single integer value, we could create a new thread by executing the following Python statement:

t = threading.Thread(target=my_func, args=(10,))

We build on this simple example in the following code cell to demonstrate how to create and use a worker thread.



In [2]:
import threading
import time

# Generic worker thread
def worker(num):
        
    # Get this Thread's name
    name = threading.current_thread().name
    
    # Print Starting Message
    print('{0:s} starting.\n'.format(name))
    
    # We sleep for five seconds
    time.sleep(5)
    
    # Print computation
    print('Computation = {0:d}\n'.format(10**num))
    
    # Print Exiting Message
    print('{0:s} exiting.\n'.format(name))

# We will spawn several threads.
for i in range(5):
    t = threading.Thread(name='Thread #{0:d}'.format(i), target=worker, args=(i,))
    t.start()


Computation = 10

Thread #1 exiting.

Computation = 1

Thread #0 exiting.

Computation = 100

Thread #2 exiting.

Computation = 10000

Thread #4 exiting.

Computation = 1000

Thread #3 exiting.

Thread #0 starting.

Thread #1 starting.

Thread #2 starting.

Thread #3 starting.

Thread #4 starting.
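
Notice that the main program above does not wait for the worker threads to finish. If we need to block until all workers are done, we can keep a reference to each Thread and call its join method; a minimal sketch:

import threading
import time

def worker(num):
    time.sleep(1)
    print('Computation = {0:d}'.format(10**num))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]

for t in threads:
    t.start()

# join blocks until the corresponding thread has finished.
for t in threads:
    t.join()

print('All worker threads have finished.')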

Multiprocessing

One way to circumvent the GIL is to use multiple Python interpreters that each run in their own process. This can be accomplished by using the multiprocessing module. In this module, processes essentially take the place of threads, but since each process will read the same Python code file, we need to ensure that only one process (the main process) creates the other processes; otherwise, each new process could spawn children of its own, quickly consuming all hardware resources. This is done by using the following statement prior to the main program body:

if __name__ == '__main__':

Inside the main program code, we can create Processes and start them in a similar manner as we did with threads earlier.



In [3]:
import multiprocessing 
import time

# Generic worker process
def worker(num):
        
    # Get this Process' name
    name = multiprocessing.current_process().name
    
    # Print Starting Message
    print('{0:s} starting.\n'.format(name))
    
    # We sleep for five seconds
    time.sleep(5)
    
    # Print computation
    print('Computation = {0:d}\n'.format(num**10))
    
    # Print Exiting Message
    print('{0:s} exiting.\n'.format(name))

if __name__ == '__main__':

    # We will spawn several processes.
    for i in range(5):
        p = multiprocessing.Process(name='Process #{0:d}'.format(i), target=worker, args=(i,))
        p.start()



Process #0 starting.

Process #1 starting.
Process #2 starting.
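
Since each process runs in its own interpreter, CPU-bound work can proceed in parallel without contending for the GIL. One convenient way to distribute such work is the Pool class from the multiprocessing module; a minimal sketch (the function and pool size are illustrative):

import multiprocessing

def compute(num):
    # A stand-in for a CPU-intensive computation
    return num**10

if __name__ == '__main__':

    # Distribute the work across four worker processes.
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(compute, range(5))
    pool.close()
    pool.join()

    print(results)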


IPython Cluster

The IPython server has built-in support for parallel processing. This can be initialized in an automated manner by using the ipcluster command, or in a manual approach by using the ipcontroller and ipengine commands. The first approach simply automates the process of using the controller and engines, and requires the creation of an IPython profile, which is done by using the ipython profile create command. ipcluster works with both MPI and batch processing clusters (e.g., via PBS), and can be made to work with other schedulers such as Condor.

If necessary, you can also manually control the process by directly instantiating the IPython controller and engines. The controller must be started first, after which you can create as many engines as necessary, given your hardware constraints. IPython clustering works best on multi-processing machines or compute clusters.
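
Once a cluster is running (for example, after ipcluster start -n 4), a program can connect to the controller and dispatch work to the engines. The following sketch assumes the IPython.parallel interface (later split out into the separate ipyparallel package):

from IPython.parallel import Client

# Connect to the running controller.
rc = Client()

# A direct view across all available engines.
dview = rc[:]

# Apply a function on every engine in parallel and gather the results.
results = dview.map_sync(lambda x: x**10, range(16))
print(results)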


Third-Party Python Tools

There are a number of third-party Python modules or packages that can be used to improve the performance of a Python application.

  1. Numba is a just-in-time compiler from Continuum Analytics that can increase the performance of certain functions (e.g., numerical work); a brief sketch follows this list.

  2. PyPy is an alternative implementation of the Python language that includes a just-in-time compiler that speeds up many Python programs.

  3. Cython is a static optimizing compiler for Python and also provides a method for easily including C or C++ code in a Python program.
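
As a quick illustration of the first item, Numba's jit decorator compiles a numerical function to machine code the first time it is called; the sketch below is minimal, and the actual speed-up depends heavily on the function being compiled:

from numba import jit

@jit
def summation(n):
    # A simple numerical loop that benefits from compilation.
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

print(summation(10000000))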


Python and HPC

While Python programs can easily be used for embarrassingly parallel programming on high performance compute systems, and Python is also used to glue advanced computation programs together for batch processing, there are also projects underway that enable Python code to directly leverage high performance programming paradigms:

  • MPI, the Message Passing Interface, is a protocol used to communicate messages (or data) between compute nodes in a large, distributed compute cluster. mpi4py is a Python module that brings a significant part of the MPI specification to Python programs (a brief sketch follows this list).

  • OpenCL is a framework that enables programs to run on heterogeneous platforms including CPUs, GPUs, DSPs, and FPGAs. The PyOpenCL package enables Python programs to use OpenCL to write code that runs efficiently and effectively on these different processor types.
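
As a small example of the first bullet, the canonical mpi4py program below reports the rank of each process; it would typically be launched with a command such as mpiexec -n 4 python script.py:

from mpi4py import MPI

# The default communicator, which contains all processes.
comm = MPI.COMM_WORLD

rank = comm.Get_rank()
size = comm.Get_size()

print('Hello from process {0:d} of {1:d}'.format(rank, size))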