The text and code are released under the CC0 license; see also the companion project, the Python Data Science Handbook.
Here we'll take a deeper dive into Python generators, including generator expressions and generator functions.
The difference between list comprehensions and generator expressions is sometimes confusing; here we'll quickly outline the differences between them:
In [1]:
[n ** 2 for n in range(12)]
Out[1]:
While this is a representative generator expression:
In [2]:
(n ** 2 for n in range(12))
Out[2]:
Notice that printing the generator expression does not print the contents; one way to print the contents of a generator expression is to pass it to the list
constructor:
In [3]:
G = (n ** 2 for n in range(12))
list(G)
Out[3]:
When you create a list, you are actually building a collection of values, and there is some memory cost associated with that. When you create a generator, you are not building a collection of values, but a recipe for producing those values. Both expose the same iterator interface, as we can see here:
In [4]:
L = [n ** 2 for n in range(12)]
for val in L:
print(val, end=' ')
In [5]:
G = (n ** 2 for n in range(12))
for val in G:
print(val, end=' ')
The difference is that a generator expression does not actually compute the values until they are needed. This not only leads to memory efficiency, but to computational efficiency as well! This also means that while the size of a list is limited by available memory, the size of a generator expression is unlimited!
An example of an infinite generator expression can be created using the count
iterator defined in itertools
:
In [6]:
from itertools import count
count()
Out[6]:
In [7]:
for i in count():
print(i, end=' ')
if i >= 10: break
The count
iterator will go on happily counting forever until you tell it to stop; this makes it convenient to create generators that will also go on forever:
In [8]:
factors = [2, 3, 5, 7]
G = (i for i in count() if all(i % n > 0 for n in factors))
for val in G:
print(val, end=' ')
if val > 40: break
You might see what we're getting at here: if we were to expand the list of factors appropriately, what we would have the beginnings of is a prime number generator, using the Sieve of Eratosthenes algorithm. We'll explore this more momentarily.
In [9]:
L = [n ** 2 for n in range(12)]
for val in L:
print(val, end=' ')
print()
for val in L:
print(val, end=' ')
A generator expression, on the other hand, is used-up after one iteration:
In [10]:
G = (n ** 2 for n in range(12))
list(G)
Out[10]:
In [11]:
list(G)
Out[11]:
This can be very useful because it means iteration can be stopped and started:
In [12]:
G = (n**2 for n in range(12))
for n in G:
print(n, end=' ')
if n > 30: break
print("\ndoing something in between")
for n in G:
print(n, end=' ')
One place I've found this useful is when working with collections of data files on disk; it means that you can quite easily analyze them in batches, letting the generator keep track of which ones you have yet to see.
yield
We saw in the previous section that list comprehensions are best used to create relatively simple lists, while using a normal for
loop can be better in more complicated situations.
The same is true of generator expressions: we can make more complicated generators using generator functions, which make use of the yield
statement.
Here we have two ways of constructing the same list:
In [13]:
L1 = [n ** 2 for n in range(12)]
L2 = []
for n in range(12):
L2.append(n ** 2)
print(L1)
print(L2)
Similarly, here we have two ways of constructing equivalent generators:
In [14]:
G1 = (n ** 2 for n in range(12))
def gen():
for n in range(12):
yield n ** 2
G2 = gen()
print(*G1)
print(*G2)
A generator function is a function that, rather than using return
to return a value once, uses yield
to yield a (potentially infinite) sequence of values.
Just as in generator expressions, the state of the generator is preserved between partial iterations, but if we want a fresh copy of the generator we can simply call the function again.
In [15]:
# Generate a list of candidates
L = [n for n in range(2, 40)]
print(L)
In [16]:
# Remove all multiples of the first value
L = [n for n in L if n == L[0] or n % L[0] > 0]
print(L)
In [17]:
# Remove all multiples of the second value
L = [n for n in L if n == L[1] or n % L[1] > 0]
print(L)
In [18]:
# Remove all multiples of the third value
L = [n for n in L if n == L[2] or n % L[2] > 0]
print(L)
If we repeat this procedure enough times on a large enough list, we can generate as many primes as we wish.
Let's encapsulate this logic in a generator function:
In [19]:
def gen_primes(N):
"""Generate primes up to N"""
primes = set()
for n in range(2, N):
if all(n % p > 0 for p in primes):
primes.add(n)
yield n
print(*gen_primes(100))
That's all there is to it! While this is certainly not the most computationally efficient implementation of the Sieve of Eratosthenes, it illustrates how convenient the generator function syntax can be for building more complicated sequences.