Make My Python Code Faster

John Parejko, Lia Corrales, Phil Marshall, Andrew Hearin, and Your Name Here

This notebook demonstrates some ways to make your Python code run faster.

Step 1: Profile and improve your code

Because you can't optimize what you haven't first measured.

Step 2: Parallelize your code

Because your machine almost certainly has more than one CPU core.

Profiling


In [2]:
import numpy as np

In [5]:
x = np.random.randn(1000)

Inline Timing

Use the %timeit line magic to time a single statement in the notebook; the %%timeit cell magic times a whole cell, and the standard-library timeit module does the same job from plain scripts.
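For code that lives in a script rather than a notebook, the timeit module gives the same measurement. A minimal sketch (the setup string and loop count here are illustrative choices):

```python
import timeit

# Everything the timed statement needs goes in the setup string,
# which runs once and is excluded from the measurement.
setup = "import numpy as np; x = np.random.randn(1000)"

n = 10000
per_loop = timeit.timeit("x**2", setup=setup, number=n) / n
print("%.2f microseconds per loop" % (per_loop * 1e6))
```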


In [6]:
%timeit np.power(x,2)


10000 loops, best of 3: 32 µs per loop

In [8]:
%timeit x**2


The slowest run took 59.22 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.44 µs per loop

Profiling with cProfile


In [11]:
import cProfile
import pstats

In [23]:
def square(x):
    # Compare three equivalent ways of squaring an array;
    # sq is deliberately overwritten each time.
    for k in range(1000):
        sq = np.power(x,2)
        sq = x**2
        sq = x*x
    return

In [24]:
log = 'square.profile'
cProfile.run('square(x)',filename=log)

stats = pstats.Stats(log)
stats.strip_dirs()

stats.sort_stats('cumtime').print_stats(20)
  • OK - so all of the time is charged to the function "square," as expected: cProfile reports per-function totals, not per-line ones.
  • To see which of the three squaring methods dominates, we need to rewrite with the lines separated into functions - which is a better way to code anyway.

In [30]:
def bettersquare(x):
    
    def powersquare(x):
        return np.power(x,2)
    def justsquare(x):
        return x**2
    def selfmultiply(x):
        return x*x
    
    for k in range(1000):
        sq = powersquare(x)
        sq = justsquare(x)
        sq = selfmultiply(x)
    
    return

In [31]:
log = 'bettersquare.profile'
cProfile.run('bettersquare(x)',filename=log)

stats = pstats.Stats(log)
stats.strip_dirs()

stats.sort_stats('cumtime').print_stats(20)


Tue Sep 29 12:22:48 2015    bettersquare.profile

         3004 function calls in 0.063 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.063    0.063 <string>:1(<module>)
        1    0.004    0.004    0.063    0.063 <ipython-input-30-a566590efda6>:1(bettersquare)
     1000    0.052    0.000    0.052    0.000 <ipython-input-30-a566590efda6>:3(powersquare)
     1000    0.005    0.000    0.005    0.000 <ipython-input-30-a566590efda6>:5(justsquare)
     1000    0.003    0.000    0.003    0.000 <ipython-input-30-a566590efda6>:7(selfmultiply)
        1    0.000    0.000    0.000    0.000 {range}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Out[31]:
<pstats.Stats instance at 0x105229c20>

Much better - you can see the cumulative time spent in each function.
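If passing your code to cProfile.run as a string feels awkward, a cProfile.Profile object can be switched on and off around ordinary statements instead. A minimal sketch (the work being profiled here is just a stand-in):

```python
import cProfile
import io
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Any ordinary code can run while the profiler is enabled.
total = sum(i * i for i in range(100000))

profiler.disable()

# Send the report to a string buffer instead of stdout.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats('tottime').print_stats(5)
print(buffer.getvalue())
```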

Another useful tool is the line_profiler, from rkern on GitHub.


In [70]:
!pip install --upgrade line_profiler


Requirement already up-to-date: line-profiler in /Users/pjm/lsst/DarwinX86/anaconda/2.1.0-4-g35ca374/lib/python2.7/site-packages
  • We could also run the line_profiler from the command line...

  • Which means the square function needs writing out to a file...

  • Can we do this from this notebook?

Cythonization

  • This is something of a last resort: don't reach for Cython unless profiling tells you it's going to help.
  • Cython lets us keep writing Python-like code while the hot loops get compiled down to the equivalent C.
  • On the command line,
    cython -a file.pyx
    makes file.c, but also file.html. The HTML file highlights which lines were translated into plain C and which still call back into the Python interpreter.

Can we demo this process from this notebook? Yes - with a cell magic, as the next section shows.

Compiling cython with IPython Notebook magic functions

Here's a simple example of a double for loop that Cython speeds up tremendously, and a %magic trick for compiling Cython within a notebook. First, our simple, slow, pure-Python function:


In [52]:
def my_expensive_loop(n):
    x = 0
    for i in range(int(n)):
        for j in range(int(n)):
            x += i + j

In [53]:
%timeit my_expensive_loop(1000)


10 loops, best of 3: 65.9 ms per loop

Let's write the exact same function in Cython syntax:


In [63]:
%load_ext cython

In [64]:
%%cython 

def my_cythonized_loop(int n):
    cdef int i, j, x
    x = 0
    for i in range(int(n)):
        for j in range(int(n)):
            x += i + j


/Users/pjm/lsst/DarwinX86/anaconda/2.1.0-4-g35ca374/lib/python2.7/site-packages/IPython/utils/path.py:264: UserWarning: get_ipython_cache_dir has moved to the IPython.paths module
  warn("get_ipython_cache_dir has moved to the IPython.paths module")

In [65]:
%timeit my_cythonized_loop(1000)


1000 loops, best of 3: 752 µs per loop
  • What's happening here is that in the pure Python code, at each step of these tight nested loops the interpreter does a bunch of type-checking on i, j and x. All that the cdef declaration does is tell the Cython compiler to declare these variables as C ints, so the compiled code skips that type-checking entirely.
  • Even if the above pattern is the only one you ever learn in Cython, it comes up so often that it's worth taking the time to pick up.

Parallelization

Multiprocessing

John's example:


In [66]:
"""
The multiprocessing joke.
"""
from __future__ import print_function

import multiprocessing

def print_word(word):
    print(word, end=' ')

def tell_the_joke():
    print()
    print('Why did the parallel chicken cross the road?')
    answer = 'To get to the other side.'
    print()

    # Summon a pool of worker processes.
    # Think of N as the number of processors you have.
    
    N = 2
    
    pool = multiprocessing.Pool(processes=N)

    # Prepare a list of function inputs: 
    args = answer.split()

    # Pass the function, and its arguments, to the pool: 
    pool.map(print_word, args)
    
    # Tell the pool members to finish their work.
    pool.close()
    
    # Wait for all the workers to report that they are done.
    pool.join()
    print()
    print()
    
    return

In [67]:
tell_the_joke()


Why did the parallel chicken cross the road?



To to the other side. get 
  • The processes print their output words at semi-random times - in general, the ordering of side effects is not guaranteed when dealing with a simple pool of processes.
  • If we make our function return a word, rather than just print it, then we can collect the outputs and display them in the correct order.

In [68]:
def new_function(word):
    return word+' '

def tell_the_joke_better():
    print()
    print('Why did the parallel chicken cross the road?')
    answer = 'To get to the other side.'
    print()

    # Summon a pool of worker processes.
    # Omit the processes argument and Pool() defaults to cpu_count().
    # Or measure it yourself:
    
    N = multiprocessing.cpu_count()
    
    pool = multiprocessing.Pool(processes=N)

    # Prepare a list of function inputs: 
    args = answer.split()

    # Pass the function, and its arguments, to the pool: 
    punchline = pool.map(new_function, args)
        
    # Tell the pool members to finish their work.
    pool.close()
    
    # Wait for all the workers to report that they are done.
    pool.join()

    # Use the outputs of the function, which are accessible via the map() method:
    print(punchline)
    print()
    print()
    
    return

In [69]:
tell_the_joke_better()


Why did the parallel chicken cross the road?

['To ', 'get ', 'to ', 'the ', 'other ', 'side. ']
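Note that map() hands back its results in input order, no matter which worker finished first. A minimal sketch of the same pattern - here using multiprocessing.dummy, whose thread-backed Pool shares the same interface and is convenient for quick experiments like this:

```python
from multiprocessing.dummy import Pool  # same Pool API, backed by threads

def add_space(word):
    return word + ' '

# The with-statement closes and joins the pool for us.
with Pool(4) as pool:
    punchline = pool.map(add_space, 'To get to the other side.'.split())

print(''.join(punchline))
```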


