Sebastian Raschka
last updated: 06/02/2014

I would be happy to hear your comments and suggestions.
Please feel free to drop me a note via twitter, email, or google+.

Day 5 - One Python Benchmark per Day

Comparing 9 ways of flattening a list of sublists

-> skip to the results

Here, we are only looking of ways to flatten 2D-lists, i.e., lists of sublists. Flattening of lists with more than 2 dimensions will be tested separately.
The desired behavior of the different approaches shall be:

[[1],[2,3,4],[5],[6]]    ->   [1, 2, 3, 4, 5, 6]  
[[1],[2,[3],4],[5],[6]]  ->   [1, 2, [3], 4, 5, 6]

Code implementations

Classic approaches



In [1]:

    
def append_in_for_loop(list_2d):
    flattened = []
    for sublist in list_2d:
        for item in sublist:
                flattened.append(item)
    return flattened



In [2]:

    
def extend_in_for_loop(list_2d):
    flattened = []
    for sublist in list_2d:
        flattened.extend(sublist)
    return flattened



In [3]:

    
def list_comprehension(list_2d):
    return [item for sublist in list_2d for item in sublist]



In [4]:

    
def flatten_via_sum(list_2d):
    return sum(list_2d, [])

`itertools` approaches



In [5]:

    
import itertools



In [6]:

    
def itertools_chain_from_iterable(list_2d):
    return list(itertools.chain.from_iterable(list_2d))



In [7]:

    
def itertools_chain(list_2d):
    return list(itertools.chain(*list_2d))

`functools` approaches



In [8]:

    
import functools
import operator



In [9]:

    
def functools_reduce_lambda(list_2d):
    return functools.reduce(lambda x,y: x+y, list_2d)



In [10]:

    
def functools_reduce_operator(list_2d):
    return functools.reduce(operator.add, map(list, list_2d))



In [11]:

    
def functools_reduce_list_add(list_2d):
    return functools.reduce(list.__add__, (list(sub) for sub in list_2d))

Assertion tests



In [12]:

    
funcs = [
     append_in_for_loop, list_comprehension, flatten_via_sum, 
     itertools_chain_from_iterable, itertools_chain, 
     functools_reduce_lambda, extend_in_for_loop,
     functools_reduce_operator, functools_reduce_list_add
    ]



In [13]:

    
for f in funcs:
    assert (f([[1],[2,3,4],[5],[6]]) == [1, 2, 3, 4, 5, 6]),\
    '%s failed test 1'% f.__name__
    assert (f([[1],[2,[3],4],[5],[6]]) == [1, 2, [3], 4, 5, 6]),\
    '%s failed test 2'% f.__name__
print('ok')

ok

Timing



In [14]:

    
import timeit
import random
import copy
import numpy as np
random.seed(12345)

def make_copy(l):
    return copy.deepcopy(l)

func_names = [f.__name__ for f in funcs]

orders_n = [10**n for n in range(1, 5)]
timings = {f:[] for f in func_names}

for n in orders_n:
    l = [[i for i in range(random.randint(1,10))] for j in range(n)]
    for f in func_names:
        timings[f].append(min(timeit.Timer('%s(make_copy(l))' %f, 
                      'from __main__ import %s, l, make_copy' %f)
                              .repeat(repeat=3, number=10)))

Plotting



In [15]:

    
%matplotlib inline



In [19]:

    
import platform
import multiprocessing

def print_sysinfo():
    
    print('\nPython version  :', platform.python_version())
    print('compiler        :', platform.python_compiler())
    
    print('\nsystem     :', platform.system())
    print('release    :', platform.release())
    print('machine    :', platform.machine())
    print('processor  :', platform.processor())
    print('CPU count  :', multiprocessing.cpu_count())
    print('interpreter:', platform.architecture()[0])
    print('\n\n')



In [20]:

    
import matplotlib.pyplot as plt

def plot():
    labels_fast = {'append_in_for_loop':'.append() in for-loop',
          'extend_in_for_loop':'.extend() in for-loop', 
          'list_comprehension': 'List comprehension', 
          'itertools_chain_from_iterable': 'list(itertools.chain.from_iterable(list_2d))',
          'itertools_chain': 'list(itertools.chain(*list_2d))',
          }
    labels_slow = {'functools_reduce_lambda': 'functools.reduce(lambda x,y: x+y, list_2d)',
              'functools_reduce_operator': 'functools.reduce(operator.add, map(list, list_2d))',
              'flatten_via_sum': 'sum(list_2d, [])',
              'functools_reduce_list_add': 'functools.reduce(list.__add__, (list(sub) for sub in list_2d))'
              }
    plt.rcParams.update({'font.size': 9})
    fig = plt.figure(figsize=(10,8))
    plt.title('Performance of different methods for flattening a list of sublists', fontsize=18)

    reds = ['#F87217', '#F75D59', '#FF0000', '#C24641']
    for lb,col in zip(sorted(labels_slow.keys()), reds):
        plt.plot(orders_n, [t**2 for t in timings[lb]], alpha=0.3, label=labels_slow[lb], 
                 marker='o', lw=2, c=col, markersize=8)
    
    blues = ['#000080', '#2B60DE', '#736AFF', '#659EC7', '#46C7C7']
    for lb,col in zip(sorted(labels_fast.keys()), blues):
        plt.plot(orders_n, [t**2 for t in timings[lb]], alpha=0.3, label=labels_fast[lb], 
                 marker='^', lw=2, c=col, markersize=10)
    max_perf = max( s/c for s,c in zip(timings['flatten_via_sum'],
                                   timings['list_comprehension']) )
    ftext = 'A list comprehension \n  >>[item for sublist '\
        'in list_2d for item in sublist]\n'\
        'is {:.2f}x faster for flattening a 2D list than \n  >>sum(list_2d, [])\nat n = 10000'\
        .format(max_perf)
    plt.figtext(.4,.15, ftext, fontsize=13, ha='left')    
    
    plt.legend(loc='upper left')
    plt.grid()
    plt.ylabel('Time per computation for 10 loops (in seconds)', fontsize=12)
    plt.xlabel('Number of sublists', fontsize=12)
    plt.xscale('log')
    plt.yscale('log')
    plt.show()

Results



In [21]:

    
print_sysinfo()
plot()









    



Python version  : 3.4.0
compiler        : GCC 4.2.1 (Apple Inc. build 5577)

system     : Darwin
release    : 13.2.0
machine    : x86_64
processor  : i386
CPU count  : 4
interpreter: 64bit

Conclusion

The line plot above indicates that we have 2 different classes of list flattening methods: the first tier is taking linear time (blue shades), where the four other approaches (in shades of red) which have quadratic growth rates and therefore are significantly slower.

Of course, the sum(list_2d, []) has a very short syntax and might be just convenient to use it for list flattening, but the reason why it is so slow is that it will create a new "intermediate" list for every element (Note that all operations in that are shown in red shades in the plot above follow this same principle).
I would like to suggest to you the nice article "How Not to Flatten a List of Lists in Python" by Mathieu Larose, who described this mechanism very nicely.

Bonus - Which of the methods is actually the fastest?

Looking at the line plot above, it is actually really hard to tell which of the methods with linear growth rate $O(n)$ (blue shades) is the fastest.
Let us plot timings for $n=10^6$ for the fastest methods only.



In [26]:

    
import timeit
import random
import copy
import numpy as np
random.seed(12345)

def make_copy(l):
    return copy.deepcopy(l)

labels_fast = {
          'append_in_for_loop':'.append() in for-loop',
          'extend_in_for_loop':'.extend() in for-loop', 
          'list_comprehension': 'List comprehension', 
          'itertools_chain_from_iterable': 'list(itertools.chain.from_iterable(list_2d))',
          'itertools_chain': 'list(itertools.chain(*list_2d))',
          }

fast_names = [f for f in labels_fast.keys()]
fast_times = []

l = [[i for i in range(random.randint(1,10))] for j in range(10**5)]
for f in fast_names:
    fast_times.append(min(timeit.Timer('%s(make_copy(l))' %f, 
        'from __main__ import %s, l, make_copy' %f)
                    .repeat(repeat=3, number=10)))



In [29]:

    
plt.rcParams.update({'font.size': 12})

fast_labels = [labels_fast[l] for l in fast_names]
x_pos = np.arange(len(fast_labels))

plt.figure(figsize=(8,7))
plt.bar(x_pos, fast_times, align='center', alpha=0.5)
plt.xticks(x_pos, fast_labels, rotation=90)
plt.axhline(y=min(fast_times), linestyle='--', color='black')
plt.ylabel('Time per computation for 10 loops in seconds for n=10000')
plt.title('List flattening methods with linear growth rates')
plt.ylim(8,9)
plt.show()



In [ ]: