Python idioms performance characteristics

I've been meaning to check this out for a long time.

The setup is simple. We write a monstrous decorator which modifies a simple function that iterates n times, to a function that runs the former on a number of inputs and returns how long each run took.

I'm not in the business of 4% performance improvements here. I'm after quadratic behavior!


In [1]:
from time import time
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import random
%matplotlib inline

In [2]:
def benchmark(counts):
    def times(f):
        def ret():
            timings = []
            for c in counts:
                start = time()
                f(c)
                timings.append(time() - start)
            return timings
        return ret
    return times

Let's start with the obvious: repeated string concatenation. Everyone knows about that one, right?


In [3]:
counts = [int(x) for x in np.logspace(4, 7, num=40)]

@benchmark(counts)
def accumulate_strings(n):
    s = ''
    for i in range(n):
        s += 'a'

@benchmark(counts)
def list_join(n):
    l = []
    for i in range(n):
        l.append('a')
    ''.join(l)

In [4]:
plt.plot(counts, accumulate_strings())
plt.plot(counts, list_join())
plt.legend(['s+=', "''.join()"], loc='best')


Out[4]:
<matplotlib.legend.Legend at 0x7fba66d57978>

Looks pretty linear to me. Apparently string += runs in O(1). That was honestly a shocking discovery to me. I've been telling people to avoid string += for years!

Let's test another common truth - that you should use a deque for repeated append and pop operations.


In [5]:
counts = [int(x) for x in np.logspace(4, 6)]

@benchmark(counts)
def add_remove_deque(n):
    q = deque()
    for i in range(n):
        q.append(i)
    for i in range(n):
        q.pop()
        
@benchmark(counts)
def add_remove_list(n):
    l = []
    for i in range(n):
        l.append(i)
    for i in range(n):
        l.pop()

In [6]:
plt.plot(counts, add_remove_deque())
plt.plot(counts, add_remove_list())
plt.legend(['deque', 'list'], loc='best')


Out[6]:
<matplotlib.legend.Legend at 0x7fba66c90b38>

Seems like they're both linear time. Hey, how much is that is overhead?


In [7]:
@benchmark(counts)
def just_pass(n):
    l = []
    for i in range(n):
        pass
    for i in range(n):
        pass
    
plt.plot(counts, add_remove_deque())
plt.plot(counts, add_remove_list())
plt.plot(counts, just_pass())
plt.legend(['deque', 'list', 'pass'], loc='best')


Out[7]:
<matplotlib.legend.Legend at 0x7fba66c0beb8>

Not that much. What if we insert to the end and take from the beginning?


In [8]:
counts = [int(x) for x in np.logspace(3, 5)]

@benchmark(counts)
def append_pop0_deque(n):
    q = deque()
    for i in range(n):
        q.append(i)
    for i in range(n):
        q.popleft()
        
@benchmark(counts)
def append_pop0_list(n):
    l = []
    for i in range(n):
        l.append(i)
    for i in range(n):
        l.pop(0)
        
plt.plot(counts, append_pop0_deque())
plt.plot(counts, append_pop0_list())
plt.legend(['deque', 'list'], loc='best')


Out[8]:
<matplotlib.legend.Legend at 0x7fba66b7fe10>

Finally we see a quadratic function. Seems that lists are fine with adding and removing from one end, but they're really bad when you need to pop(0) repeatedly.

Are strings the same?


In [9]:
counts = [int(x) for x in np.logspace(3, 5)]

@benchmark(counts)
def accumulate_right(n):
    s = ''
    for i in range(n):
        s = s + 'a'

@benchmark(counts)
def accumulate_left(n):
    s = ''
    for i in range(n):
        s = 'a' + s
        
plt.plot(counts, accumulate_right())
plt.plot(counts, accumulate_left())
plt.legend(["s = s + 'a'", "s = 'a' + s"], loc='upper left')


Out[9]:
<matplotlib.legend.Legend at 0x7fba66af26a0>

Certainly seems so. That one is a little strange. If x and y are strings, x + y runs in O(len(y)).

Out of curiosity, let's check the bytes type.


In [10]:
counts = [int(x) for x in np.logspace(3, 5)]

@benchmark(counts)
def accumulate_bytes(n):
    s = b''
    for i in range(n):
        s += b'a'
        
@benchmark(counts)
def join_bytes(n):
    l = []
    for i in range(n):
        l.append(b'a')
    b''.join(l)

plt.plot(counts, accumulate_bytes())
plt.plot(counts, join_bytes())
plt.legend(['accumulate', 'join'], loc='upper left')


Out[10]:
<matplotlib.legend.Legend at 0x7fba66ad8cc0>

Let's recap:

  • There's actually nothing wrong with a += loop for strings. We may have to re-educate ourselves!
  • ''.join() is still preferable when you need to accumulate bytes.
  • If you find yourself write .pop(0) or .insert(0) in a loop, you really want a collections.deque.
  • Postulate reaffirmed: Common truths should be tested!

-- Noam Mor, 2015