This notebook explores the speed of searching for values in sets and lists.

After reading this notebook, watch Brandon Rhodes' videos All Your Ducks In A Row: Data Structures in the Standard Library and Beyond and The Mighty Dictionary.


In [1]:
def make_list(n):
    if True:
        return list(range(n))
    else:
        return list(str(i) for i in range(n))

In [2]:
n = int(25e6)
m = int(n*.9)
m = n // 2
a_list = make_list(n)
a_set = set(a_list)

In [3]:
# Finding something that is in a set is fast.
%timeit m in a_set
# Finding something that is _not_ in a set is also fast.
%timeit n+1 in a_set


The slowest run took 38.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 122 ns per loop
The slowest run took 31.83 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 145 ns per loop

In [4]:
# Finding something in a list can be slow.
# It starts at the beginning and compares each value.
# The search time depends on where the value is in the list.
%timeit m in a_list
# Finding something that is not is a list is the worst case.
# It has to be compared to all values of the list.
%timeit n+1 in a_list


1 loop, best of 3: 326 ms per loop
1 loop, best of 3: 645 ms per loop

In [5]:
max_exponent = 6

In [6]:
for n in (10 ** i for i in range(1, max_exponent+1)):
    m = n // 2
    print('%d:' % n)
    a_list = make_list(n)
    a_set = set(a_list)

    %timeit m in a_set


10:
The slowest run took 51.62 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 92 ns per loop
100:
The slowest run took 49.91 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 91 ns per loop
1000:
The slowest run took 39.22 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 118 ns per loop
10000:
The slowest run took 40.17 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 118 ns per loop
100000:
The slowest run took 36.67 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 124 ns per loop
1000000:
The slowest run took 40.67 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 122 ns per loop

Notice that the difference between searching small sets and large sets in not large. This is the magic of Python sets and dictionaries. Read the hash table Wikipedia article for an explanation of how this works.


In [7]:
for n in (10 ** i for i in range(1, max_exponent+1)):
    m = n // 2
    print('%d:' % n)
    a_list = make_list(n)
    a_set = set(a_list)

    %timeit m in a_list


10:
The slowest run took 23.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 202 ns per loop
100:
1000000 loops, best of 3: 1.32 µs per loop
1000:
100000 loops, best of 3: 12.7 µs per loop
10000:
10000 loops, best of 3: 126 µs per loop
100000:
1000 loops, best of 3: 1.21 ms per loop
1000000:
100 loops, best of 3: 12.8 ms per loop

Notice that the search time for finding something in the middle of a list is proportional to the size of the list.


In [8]:
def set_foo(n):
    for i in range(n):
        i in a_set

def list_foo(n):
    for i in range(n):
        i in a_list

In [9]:
for n in (10 ** i for i in range(1, max_exponent+1)):
    print('%d:' % n)
    a_list = list(range(n))
    a_set = set(a_list)

    %timeit set_foo(n)


10:
1000000 loops, best of 3: 1.91 µs per loop
100:
100000 loops, best of 3: 9.23 µs per loop
1000:
10000 loops, best of 3: 119 µs per loop
10000:
1000 loops, best of 3: 1.27 ms per loop
100000:
100 loops, best of 3: 13.2 ms per loop
1000000:
10 loops, best of 3: 133 ms per loop

In [ ]:
for n in (10 ** i for i in range(1, max_exponent+1)):
    print('%d:' % n)
    a_list = list(range(n))
    a_set = set(a_list)

    %timeit list_foo(n)


10:
100000 loops, best of 3: 2.62 µs per loop
100:
10000 loops, best of 3: 132 µs per loop
1000:
100 loops, best of 3: 12.6 ms per loop
10000:
1 loop, best of 3: 1.26 s per loop
100000:
1 loop, best of 3: 1min 59s per loop
1000000:

The above cell does not finish overnight, so what you see above is what was saved before it finished.