This notebook explores the speed of searching for values in sets and lists.

After reading this notebook, watch Brandon Rhodes' videos All Your Ducks In A Row: Data Structures in the Standard Library and Beyond and The Mighty Dictionary.


In [1]:
def make_list(n):
    if True:
        return list(range(n))
    else:
        return list(str(i) for i in range(n))

In [2]:
n = int(25e6)
# n = 5
m = (0, n // 2, n-1, n)
a_list = make_list(n)
a_set = set(a_list)

In [3]:
n, m


Out[3]:
(25000000, (0, 12500000, 24999999, 25000000))

In [4]:
# Finding something that is in a set is fast.
# The key one is looking for has little effect on the speed.
beginning = 0
middle = n//2
end = n-1
%timeit beginning in a_set
%timeit middle in a_set
%timeit end in a_set
# Finding something that is _not_ in a set is also fast.
%timeit n in a_set


The slowest run took 31.03 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 94.5 ns per loop
The slowest run took 36.07 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 124 ns per loop
The slowest run took 36.07 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 128 ns per loop
The slowest run took 49.88 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 91 ns per loop

In [5]:
# Searching for something in a list
# starts at the beginning and compares each value.
# The search time depends on where the value is in the list.
# That can be slow.
beginning = 0
middle = n//2
end = n-1
%timeit beginning in a_list
%timeit middle in a_list
%timeit end in a_list
# Finding something that is not is a list is the worst case.
# It has to be compared to all values of the list.
%timeit n in a_list


The slowest run took 55.34 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 82 ns per loop
1 loop, best of 3: 318 ms per loop
1 loop, best of 3: 717 ms per loop
1 loop, best of 3: 641 ms per loop

In [6]:
max_exponent = 6

In [7]:
for n in (10 ** i for i in range(1, max_exponent+1)):
    a_list = make_list(n)
    a_set = set(a_list)

    m = (0, n // 2, n-1, n)
    for j in m:
        print('length is %s, looking for %s' % (n, j))
        %timeit j in a_set


length is 10, looking for 0
The slowest run took 43.58 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 107 ns per loop
length is 10, looking for 5
The slowest run took 44.00 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 105 ns per loop
length is 10, looking for 9
The slowest run took 19.51 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 104 ns per loop
length is 10, looking for 10
The slowest run took 20.81 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 104 ns per loop
length is 100, looking for 0
The slowest run took 44.31 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 104 ns per loop
length is 100, looking for 50
The slowest run took 18.72 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 104 ns per loop
length is 100, looking for 99
The slowest run took 43.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 105 ns per loop
length is 100, looking for 100
The slowest run took 45.36 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 102 ns per loop
length is 1000, looking for 0
The slowest run took 43.49 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 104 ns per loop
length is 1000, looking for 500
The slowest run took 33.52 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 133 ns per loop
length is 1000, looking for 999
The slowest run took 34.87 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 132 ns per loop
length is 1000, looking for 1000
The slowest run took 45.38 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 102 ns per loop
length is 10000, looking for 0
The slowest run took 307.19 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 109 ns per loop
length is 10000, looking for 5000
The slowest run took 34.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 133 ns per loop
length is 10000, looking for 9999
The slowest run took 34.74 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 133 ns per loop
length is 10000, looking for 10000
The slowest run took 45.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 102 ns per loop
length is 100000, looking for 0
The slowest run took 42.87 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 106 ns per loop
length is 100000, looking for 50000
The slowest run took 31.87 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 149 ns per loop
length is 100000, looking for 99999
The slowest run took 34.94 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 136 ns per loop
length is 100000, looking for 100000
The slowest run took 45.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 104 ns per loop
length is 1000000, looking for 0
The slowest run took 48.98 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 106 ns per loop
length is 1000000, looking for 500000
The slowest run took 35.03 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 136 ns per loop
length is 1000000, looking for 999999
The slowest run took 34.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 135 ns per loop
length is 1000000, looking for 1000000
The slowest run took 41.36 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 111 ns per loop

Notice that the difference between searching small sets and large sets in not large. This is the magic of Python sets and dictionaries. Read the hash table Wikipedia article for an explanation of how this works.


In [8]:
for n in (10 ** i for i in range(1, max_exponent+1)):
    a_list = make_list(n)
    a_set = set(a_list)

    m = (0, n // 2, n-1, n)
    for j in m:
        print('length is %s, looking for %s' % (n, j))
        %timeit j in a_list


length is 10, looking for 0
The slowest run took 50.46 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 92.7 ns per loop
length is 10, looking for 5
The slowest run took 20.69 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 219 ns per loop
length is 10, looking for 9
The slowest run took 8.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 317 ns per loop
length is 10, looking for 10
The slowest run took 13.85 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 338 ns per loop
length is 100, looking for 0
The slowest run took 22.01 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 92 ns per loop
length is 100, looking for 50
1000000 loops, best of 3: 1.33 µs per loop
length is 100, looking for 99
100000 loops, best of 3: 2.56 µs per loop
length is 100, looking for 100
100000 loops, best of 3: 2.58 µs per loop
length is 1000, looking for 0
The slowest run took 50.02 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 90.8 ns per loop
length is 1000, looking for 500
100000 loops, best of 3: 12.5 µs per loop
length is 1000, looking for 999
10000 loops, best of 3: 25 µs per loop
length is 1000, looking for 1000
10000 loops, best of 3: 25 µs per loop
length is 10000, looking for 0
The slowest run took 22.14 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 91.5 ns per loop
length is 10000, looking for 5000
10000 loops, best of 3: 125 µs per loop
length is 10000, looking for 9999
1000 loops, best of 3: 249 µs per loop
length is 10000, looking for 10000
1000 loops, best of 3: 250 µs per loop
length is 100000, looking for 0
The slowest run took 22.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 90.8 ns per loop
length is 100000, looking for 50000
1000 loops, best of 3: 1.43 ms per loop
length is 100000, looking for 99999
100 loops, best of 3: 2.8 ms per loop
length is 100000, looking for 100000
100 loops, best of 3: 2.41 ms per loop
length is 1000000, looking for 0
The slowest run took 49.94 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 90.9 ns per loop
length is 1000000, looking for 500000
100 loops, best of 3: 12.7 ms per loop
length is 1000000, looking for 999999
10 loops, best of 3: 25.5 ms per loop
length is 1000000, looking for 1000000
10 loops, best of 3: 25.6 ms per loop

Notice that the search time for finding something in a list is proportional to the length of the list and how far the item is from the beginning of the list.