Code Optimization in Python

"Premature optimization is the root of all evil." - Donald Knuth

  1. Get it right.
  2. Test that it's right.
  3. Profile if slow.
  4. Optimize.
  5. Repeat from 2.

Measuring

The %%time magic reports CPU and wall-clock time for a single run of a cell, while %%timeit runs the cell several times and reports the best result.

In [1]:
import time

def dummy1():
    time.sleep(0.01)

def dummy2():
    time.sleep(0.02)

In [2]:
%%time 
dummy1()


CPU times: user 439 µs, sys: 761 µs, total: 1.2 ms
Wall time: 10.2 ms

In [3]:
%%time
dummy2()


CPU times: user 513 µs, sys: 871 µs, total: 1.38 ms
Wall time: 24.6 ms

In [4]:
%%timeit -n 10
dummy1()


10 loops, best of 3: 10.8 ms per loop

In [5]:
%%timeit -n 10
dummy2()


10 loops, best of 3: 23.6 ms per loop
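
Outside IPython, the same measurements can be taken with the standard timeit module; a minimal sketch (assuming dummy1 from above is defined in the current namespace):

import timeit

# total elapsed time in seconds for 10 calls of dummy1()
total = timeit.timeit("dummy1()", globals=globals(), number=10)
print(total / 10)                      # average time per call

# repeat the measurement 3 times and keep the best run, like %timeit does
best = min(timeit.repeat("dummy1()", globals=globals(), number=10, repeat=3))
print(best / 10)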

Profiling

The standard library provides deterministic profiling of Python programs through two modules: cProfile and profile. The former is faster to run, while the latter is easier to extend.


In [6]:
import cProfile
cProfile.run('dummy1()')


         5 function calls in 0.012 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.012    0.012 <ipython-input-1-ca49f98e8237>:3(dummy1)
        1    0.000    0.000    0.012    0.012 <string>:1(<module>)
        1    0.000    0.000    0.012    0.012 {built-in method builtins.exec}
        1    0.012    0.012    0.012    0.012 {built-in method time.sleep}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
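
cProfile.run also accepts a sort argument, and the pstats module lets you post-process saved profile data; a minimal sketch:

import cProfile
import pstats

# sort the report by cumulative time instead of by function name
cProfile.run('dummy1()', sort='cumtime')

# or dump the raw data to a file and inspect it later
cProfile.run('dummy1()', filename='dummy.prof')
pstats.Stats('dummy.prof').sort_stats('cumtime').print_stats(5)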


For a more advanced profiling tool, check pycallgraph: http://pycallgraph.slowchop.com/en/master/

Tips and tricks

Use the standard library

Many modules and functions in the standard library are implemented efficiently in C.


In [7]:
import random
random_list = [random.randint(0, 10000) for i in range(1000000)]

In [8]:
%%timeit -n 10
sorted(random_list)


10 loops, best of 3: 473 ms per loop
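
Another example: if a list has to stay sorted while you keep inserting into it, the bisect module (also written in C) avoids re-sorting the whole list every time; a minimal sketch:

import bisect

sorted_list = sorted(random_list)
# insert a new value at its correct position without re-sorting
bisect.insort(sorted_list, 4242)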

Use available pre-optimized libraries


In [9]:
%%timeit -n 10

sum(random_list) / len(random_list)


10 loops, best of 3: 8.82 ms per loop

In [10]:
import numpy as np

random_array = np.array(random_list)

In [11]:
%%timeit -n 10

np.mean(random_array)


10 loops, best of 3: 1.22 ms per loop
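
The speedup is not specific to mean: whole-array (vectorized) NumPy operations generally beat the equivalent Python-level loop; a minimal sketch (the names are illustrative):

# elementwise arithmetic runs in compiled NumPy code ...
doubled_array = random_array * 2 + 1

# ... instead of a Python loop over a million elements
doubled_list = [x * 2 + 1 for x in random_list]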

Loops


In [12]:
oldlist = ["alma"]*10000

In [13]:
%%timeit -n 10

newlist = []
for word in oldlist:
    newlist.append(word.upper())


10 loops, best of 3: 1.71 ms per loop

In [14]:
%%timeit -n 10

newlist = map(str.upper, oldlist)


10 loops, best of 3: 444 ns per loop

Wait, is it really that much faster? In Python 3, map returns a lazy iterator, so the cell above only measured constructing the iterator, not doing the work. Forcing the result into a list gives a fair comparison:


In [15]:
%%timeit -n 10

newlist = list(map(str.upper, oldlist))


10 loops, best of 3: 1.31 ms per loop

In [16]:
%%timeit -n 10

newlist = [w.upper() for w in oldlist]


10 loops, best of 3: 1.09 ms per loop

Check these builtin functions:

  • map
  • filter
  • zip
  • range
  • sum
  • enumerate
  • all
  • any

... and these packages (a short sketch of a few of them follows the list):

  • functools
  • itertools
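
A few of these in action; a minimal sketch (the data here is just illustrative):

import itertools

words = ["alma", "korte", "szilva"]
lengths = [4, 5, 6]

# pair up two sequences lazily
pairs = list(zip(words, lengths))

# check a predicate over a whole sequence without an explicit loop
has_long_word = any(len(w) > 5 for w in words)

# loop with an index without maintaining a counter by hand
for i, w in enumerate(words):
    print(i, w)

# lazily chain several iterables together
for w in itertools.chain(words, ["barack"]):
    print(w)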

Generators

Generators are a way of declaring a function that behaves like an iterator, evaluating its values lazily. They can save you both running time and memory.


In [17]:
def firstn1(n):
    num, nums = 0, []
    while num < n:
        nums.append(num)
        num += 1
    return nums
sum_of_first_n = sum(firstn1(1000000))

In [18]:
def firstn2(n):
    num = 0
    while num < n:
        yield num
        num += 1
sum_of_first_n = sum(firstn2(1000000))
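
The same laziness is available inline as a generator expression, so a dedicated function is often not even needed; a minimal sketch:

# list comprehension: materializes the full million-element list first
sum_of_first_n = sum([num for num in range(1000000)])

# generator expression: values are produced one at a time and never stored
sum_of_first_n = sum(num for num in range(1000000))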

Caching


In [19]:
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

In [20]:
%%time

x = [fib(n) for n in range(20)]


CPU times: user 7.1 ms, sys: 470 µs, total: 7.57 ms
Wall time: 7.36 ms

In [21]:
from functools import lru_cache

@lru_cache(maxsize=None)
def fib2(n):
    if n < 2:
        return n
    return fib2(n-1) + fib2(n-2)

In [22]:
%%time

x = [fib2(n) for n in range(30)]


CPU times: user 30 µs, sys: 0 ns, total: 30 µs
Wall time: 32.9 µs

In [23]:
fib2.cache_info()


Out[23]:
CacheInfo(hits=56, misses=30, maxsize=None, currsize=30)
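
On Python 3.9 and newer, functools.cache is a shorthand for lru_cache(maxsize=None):

from functools import cache

@cache
def fib3(n):
    if n < 2:
        return n
    return fib3(n-1) + fib3(n-2)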

Name resolution speedups


In [24]:
%%timeit

newlist = []
for word in oldlist:
    newlist.append(str.upper(word))


100 loops, best of 3: 2.61 ms per loop

Both newlist.append and str.upper are name lookups that are re-evaluated each time through the loop. Binding them to local names once, before the loop, avoids the repeated lookups:


In [25]:
%%timeit

upper = str.upper
newlist = []
append = newlist.append
for word in oldlist:
    append(upper(word))


1000 loops, best of 3: 1.87 ms per loop

Python accesses local variables much more efficiently than global variables: locals are resolved by position in a fixed array, while globals require a dictionary lookup on every access.


In [26]:
def func():
    upper = str.upper
    newlist = []
    append = newlist.append
    for word in oldlist:
        append(upper(word))
    #return newlist

In [27]:
%%timeit

func()


100 loops, best of 3: 1.8 ms per loop

if vs try

try is cheap, if is expensive (when the condition is rarely true): the if test is paid on every iteration, while the except branch only costs something when the exception is actually raised.


In [28]:
%%timeit

wdict = {}
for word in oldlist:
    if word not in wdict:
        wdict[word] = 0
    wdict[word] += 1


1000 loops, best of 3: 1.29 ms per loop

In [29]:
%%timeit

wdict = {}
for word in oldlist:
    try:
        wdict[word] += 1
    except KeyError:
        wdict[word] = 1


1000 loops, best of 3: 1.22 ms per loop

But for the example above, you might again want to reach for the standard library.


In [30]:
from collections import defaultdict

In [31]:
%%timeit

wdict = defaultdict(int)
for word in oldlist:
    wdict[word] += 1


1000 loops, best of 3: 951 µs per loop
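
For counting specifically, the standard library goes one step further: collections.Counter does the whole counting loop in one call; a minimal sketch:

from collections import Counter

# count every distinct word in a single call
wdict = Counter(oldlist)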

Imports

Import statements are relatively expensive: a module is only loaded once, but an import inside a function is still re-executed (with a sys.modules lookup and a name binding) on every call.


In [32]:
def doit1():
    import os.path # import statement inside function
    os.path.split('python.py')

import os.path as p # import statement outside function
def doit2():
    p.split('python.py')

In [33]:
%%timeit

for _ in range(100000):
    doit1()


1 loop, best of 3: 194 ms per loop

In [34]:
%%timeit

for _ in range(100000):
    doit2()


10 loops, best of 3: 88.4 ms per loop

Strings are immutable...


In [35]:
def concat1(strings):
    res = ""
    for s in strings:
        res+=s
    return res

def concat2(strings):
    return "".join(strings)

str_list = ["alma"]*10000000

In [36]:
%%timeit

concat1(str_list)


1 loop, best of 3: 1.43 s per loop

In [37]:
%%timeit

concat2(str_list)


10 loops, best of 3: 93 ms per loop

So naive concatenation is very expensive, since every "+" creates a brand-new string. Instead of "+"-ing pieces together, use str.join (as above) or string formatting.
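
A minimal sketch of the alternatives for building one string out of a few parts (the variable names are just illustrative):

name, ext = "python", "py"

# repeated "+" creates an intermediate string at every step
filename = name + "." + ext

# str.join builds the result in a single pass
filename = ".".join([name, ext])

# an f-string is usually the most readable option
filename = f"{name}.{ext}"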

Interpreters

PyPy

"... is a Python interpreter and a just-in-time compiler. It focuses on speed, efficiency and compatibility with CPython. It currently support language versions 2.7.10 and 3.2.5."

Features:

  • Usually faster than CPython.
  • Uses less memory.

Cython

"...is an optimistic static compiler for both the Python programming language and the extended Cython langeuage. It makes C extensions for Python as easy as Python itself."

def add(int a, int b):
    cdef int result = a + b   # statically typed C local variable
    return result
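
To call such a function from Python, the Cython source has to be compiled first. A minimal setup.py sketch, assuming the code above is saved as fastsum.pyx (the file name is an assumption):

# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fastsum.pyx"))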

Numba

Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.


In [38]:
from numpy import arange

def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

a = arange(9).reshape(3,3)

In [39]:
from numba import jit


# jit decorator tells Numba to compile this function.
# The argument types will be inferred by Numba when function is called.
@jit
def sum2d_numba(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [40]:
%%time

sum2d(a)


CPU times: user 35 µs, sys: 11 µs, total: 46 µs
Wall time: 48.9 µs
Out[40]:
36.0

In [41]:
%%time

sum2d_numba(a)


CPU times: user 99.3 ms, sys: 15 ms, total: 114 ms
Wall time: 131 ms
Out[41]:
36.0
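
The Numba version looks slower here only because this first call included the just-in-time compilation of the function; subsequent calls run the already-compiled machine code. A minimal sketch (to be run in a notebook cell) to see the difference on a larger input; results will vary by machine:

big = arange(1000000).reshape(1000, 1000)

# sum2d_numba is already compiled for this dtype by the call above,
# so only the actual computation is measured from here on
%time sum2d(big)         # pure-Python double loop over a million elements
%time sum2d_numba(big)   # compiled code, should be much faster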

Exercises

  1. Calculate the sum of the first n prime numbers efficiently!
  2. Implement an optimized merge sort algorithm.
    • How does it compare to the builtin sort?
    • Can you do it faster by utilizing the full power of your CPU? (Assuming that you have a huge amount of data.)

In [ ]: