This is a follow-up post to the iteration post, which covered what I consider the basics of iteration. This post covers generators, which are a special type of iterator. I'm going to try to avoid, for the most part, using generators to calculate infinite streams of integers (Fibonacci, et al).
I also need to admit being a bit of a David Beazley fan boy. His tutorials, writings and talks on generators were extremely influential on my understanding of generators, their uses and composition. As a consequence, I tend to quote him a lot.
The biggest, most obvious difference between generators and other things that make iterators is that generators yield values rather than return them. Consider the "canonical" implementation of the Fibonacci sequence:
In [1]:
def gen_fib():
    a, b = 0, 1
    yield a
    yield b
    while True:
        yield a+b
        a, b = b, a+b
The way a lot of Fibonacci sequences are implemented is with a list, essentially saying, "I want the first ten Fibonacci numbers."
In [2]:
def list_fib(n):
    fibs = []
    if n < 1:
        pass
    elif n == 1:
        fibs.append(0)
    elif n == 2:
        fibs.extend([0, 1])
    else:
        a, b = 0, 1
        fibs.extend([a, b])
        while len(fibs) < n:
            fibs.append(a+b)
            a, b = b, a+b
    return fibs

print(list_fib(0))
print(list_fib(1))
print(list_fib(2))
print(list_fib(10))
list_fib is a mess: we have to check what was passed in, we need to monitor the size of the list (even using collections.deque doesn't quite solve this problem), and there's a while loop we might hit. But all of this is needed to make sure we correctly construct the list of Fibonacci numbers.
By contrast, the gen_fib function is simple and clean; there's only one form of flow control. Well, it looks like there's only one form of flow control, but the logic inside of it is much more complicated.
So what happens when we call gen_fib?
In [3]:
print(gen_fib())
A generator object pops out. Odd. I don't see any object instantiation going on inside of the function. Let's take a closer look using the inspect module; in particular, there are two functions in there of interest: isgenerator and isgeneratorfunction. Of course, there are also the usual tools of type and hasattr.
In [4]:
from inspect import isgenerator, isgeneratorfunction
f = gen_fib()
print("isgeneratorfunction(gen_fib):", isgeneratorfunction(gen_fib))
print("isgenerator(f):", isgenerator(gen_fib()))
print("type(f):", type(f))
print("type(iter(gen_fib())):", type(iter(f)))
print("Are generators their own iterators?", f is iter(f))
print('f hasattr __iter__:', hasattr(f, '__iter__'))
print('f hasattr __next__:', hasattr(f, '__next__'))
Under the hood, when Python sees yield inside of a function, it converts that function into a factory for generator objects built from the logic inside of it. Calling gen_fib is very similar to what happens when you call MyFibClass: out pops an object. The actual implementation of generators is beyond the scope of this post, however.
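To make the analogy concrete, here's a rough, hypothetical sketch of what a hand-rolled class version might look like; this is just an illustration, not the actual machinery CPython uses, and MyFibClass here stands in for the class from the iteration post:

class MyFibClass:
    '''A hypothetical, hand-rolled analog of gen_fib:
    calling it just builds an object that is its own iterator.'''
    def __init__(self):
        self.a, self.b = 0, 1

    def __iter__(self):
        return self

    def __next__(self):
        value = self.a
        self.a, self.b = self.b, self.a + self.b
        return value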
So, now we have this object, how can we get values out of it? The obvious answer is iteration! However, if you notice, gen_fib's while loop never exits; it's infinite. Attempting to consume the "whole" sequence will exhaust time (but honestly, it'll probably consume all your memory first; Fibonacci numbers get really big really quick). But just like with other iterators, next can be used to manually pull values out of it.
In [5]:
f = gen_fib()
print(next(f))
print(next(f))
print(next(f))
print(next(f))
However, I mentioned that the flow control is actually more complicated than list_fib's. Here's why, and where the biggest difference between yield and return becomes apparent:
If it's not obvious, a generator function is a function that "pauses" its execution when it hits yield, which also causes it to spit out a value. Regular functions get one chance to run and return a value: they take some arguments, do something and then return a value. A generator takes some arguments, does something, yields a value and waits around to be called again. By not building a list that is potentially massive, generators can be incredibly memory efficient.
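Here's a tiny sketch of my own that makes the pause-and-resume behavior visible:

def pausing():
    print('started')
    yield 1
    print('resumed')
    yield 2

g = pausing()    # nothing is printed yet; the body hasn't started running
print(next(g))   # prints 'started', then 1
print(next(g))   # prints 'resumed', then 2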
This opens up a lot of possibilities.
And what's really cool: if you have a simple generator -- something like "for each of these lines, yield one if it meets this simple requirement" -- you can write a generator expression instead. Here's a very contrived example.
In [6]:
nums = (num for num in range(1, 100) if not num%3)
print(list(nums))
The biggest, most important gotcha is that generators are forgetful. What I mean by that is when a generator hits yield, it sends a value out. Unless the internal state that created that value is reached again, you can't get it back from the generator. Unlike lists and other types of iterables where you can iterate over the same object multiple times, generators are one-time use. You can create multiple generator objects from the same function and iterate over each (and each will maintain its own state).
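As a quick illustration of that, reusing gen_fib from above:

f = gen_fib()
print([next(f) for _ in range(5)])  # [0, 1, 1, 2, 3]
print([next(f) for _ in range(5)])  # [5, 8, 13, 21, 34] -- the earlier values are gone
g = gen_fib()                       # a fresh generator object starts over with its own state
print(next(g))                      # 0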
As a consequence of this, searching a generator with in or by explicitly hitting __contains__ will partially or wholly consume the generator. This is because in asks the generator, "Hey, do you ever yield this value?" The generator gets to work yielding values until one matches. All those values it outputs are gone. I suppose this could potentially be helpful in some situations, but I do want this caveat to be known.
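For example, again with gen_fib:

f = gen_fib()
print(5 in f)    # True, but the generator ran until it yielded 5 to find out
print(next(f))   # 8 -- everything up to and including 5 has been consumed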
Another thing that will catch people off guard the first few times is that generators don't have a __len__ method. Essentially, a generator has no idea how many values it will yield; it can't tell you that. Generators are also non-indexable for the same reason.
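Both limitations show up as TypeErrors if you try:

f = gen_fib()
try:
    len(f)
except TypeError as e:
    print(e)   # generators don't implement __len__
try:
    f[3]
except TypeError as e:
    print(e)   # ...and they aren't subscriptable either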
Up until now, these examples have been compatible between Python 2 and 3. However, in 3.3 a few changes were made to yield that took it from awesome to amazing: yield gained an optional keyword, from. yield from delegates access from the current generator to one it's calling. The simplest way to understand this is to know that the next two code snippets output the same thing:
In [7]:
def my_chain_1(*iters):
    '''This is actually the example code in the
    itertools module for chain
    '''
    for it in iters:
        for item in it:
            yield item

def my_chain_2(*iters):
    '''This example completely removes a loop
    '''
    for it in iters:
        yield from it

a = range(2)
b = range(5, 7)
print(list(my_chain_1(a, b)))
print(list(my_chain_2(a, b)))
The true power of yield from is that when you call my_chain_2 you aren't simply being fed values from an inner generator; you are interacting directly with the inner generator. The impact of this is profound. However, you don't need to construct an event loop to make use of it.
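Here's a small sketch of my own showing what "interacting directly" means: values you send into the outer generator land inside the inner one, because yield from forwards next and send straight through.

def inner():
    while True:
        received = yield
        print('inner got:', received)

def outer():
    yield from inner()   # delegates next() and send() to inner()

g = outer()
next(g)            # prime the generator
g.send('hello')    # prints "inner got: hello" -- the value skipped outer entirely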
My canonical example is walking my ~/Music directory and doing something with...
In [8]:
%%bash
find ~/Music -type f -name '*.mp3' | wc -l
...that number of files. To be honest, I'm not concerned about creating a list in memory of 19k file paths (typically a one-time operation that I run every now and then). What I'm concerned with is processing 19k file paths in a timely fashion, never mind opening up the files, pulling information out of them and handling them. For the time being, I'm only going to operate on a very small subset of my library. I'm also going to show off how to build "generator pipelines", as Dave Beazley calls them.
I do use a third party library here, mutagenx, a Python 3 implementation of mutagen.
In [9]:
import os
from pprint import pprint

from mutagenx import File

valid_types = ('m4a', 'flac', 'mp3', 'ogg', 'oga')

def find(basedir, valid_types=valid_types):
    '''Utilize os.walk to only select out the files we'd like to potentially
    parse and yield them one at a time.'''
    basedir = os.path.abspath(basedir)
    for current, dirs, files in os.walk(basedir):
        files = filter(lambda f: f.endswith(valid_types), sorted(files))
        files = [os.path.join(current, f) for f in files]
        if files:
            yield from files

def adaptor(track):
    '''Take in a Mutagen track object and
    parse it into a dictionary.
    '''
    return dict(
        artist=track['artist'][0],
        album=track['album'][0],
        position=int(track['tracknumber'][0].split('/')[0]),
        length=int(track.info.length),
        location=track.filename,
        name=track['title'][0],
    )

def process_directory(basedir, valid_types=valid_types):
    '''Hook up find and adaptor into a pipeline'''
    files = find(basedir, valid_types)
    tracks = (File(f, easy=True) for f in files)
    yield from (adaptor(t) for t in tracks)

tracks = process_directory('/home/justanr//Music/At the Drive-In/Relationship of Command')
pprint(next(tracks))
Because of the nature of generators, only one item gets pulled down the line at a time. So, instead of processing the whole directory at once, we can process the directory one file at a time.
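If I only wanted a handful of tracks, something like itertools.islice would pull just those few through the pipeline without touching the rest of the directory. A sketch, assuming the same process_directory and a real music directory on disk:

from itertools import islice

music_dir = os.path.expanduser('~/Music')   # hypothetical path; adjust to taste
for track in islice(process_directory(music_dir), 3):
    pprint(track)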
yield and yield from also make processing trees easy. In the post on iteration, I showed a code example from the Python Cookbook that used a very complex iterator to maintain the state of iterating depth first over a tree. This is the same class built with generators:
In [10]:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

    def depth_first(self):
        yield self
        for c in self:
            yield from c.depth_first()

root = Node(0)
child1 = Node(1)
child2 = Node(2)
root.add_child(child1)
root.add_child(child2)
child1.add_child(Node(3))
child1.add_child(Node(4))
child2.add_child(Node(5))

for ch in root.depth_first():
    print(ch)
The entire Node class is 18 lines. The DepthFirstIterator alone is 30 lines. And the logic is less complex.
And of course, you can do bad things, too.
In [11]:
from itertools import chain

class IterInt(int):
    def __iter__(self):
        yield self

def countdown(n):
    n = IterInt(n)
    if n < 0:
        # wait... why are we returning here?
        # I thought generators never returned!
        return "Countdown finished"
    else:
        yield from chain(n, countdown(n-1))

print(list(countdown(10)))

try:
    next(countdown(-1))
except StopIteration as wut:
    # no seriously...wut
    print(wut)
Consider it incentive for me to write more about generators.
Generators are a really broad topic. For something that ranges from creating infinite sequences to being the backbone of event loops, there's quite a bit of ground to cover. There's a lot of really smart people out there writing about them.