2014-08-14, Josh Montague
In this notebook, we'll walk through some of the details behind how Python lets us easily iterate over containers, the guidelines to follow when creating a custom iterable, and how the iterator concept extends to generators. We'll get just a taste of generators (which are very powerful!), and then see them in action with a simple example of using the MapReduce framework.
This session was built using Python 2.
To warm up with something familiar: you already know how to iterate over a list:
In [ ]:
l = [1, 2, 3]
for x in l:
    print x
Similarly, you're almost certainly familiar with iterating over dicts, strs, tuples, and files:
In [ ]:
s = 'josh'
d = {'j':1, 'o':2, 's':3, 'h':4}
t = (1, 2, 3)
f = open('moby.txt')
print '*'*10 + ' str ' + '*'*10
for x in s:
    # iterate over characters, x is a str
    print x
print

print '*'*10 + ' dict ' + '*'*10
for x in d:  # remember: not in any particular order!
    # iterate over dict keys, x is a str
    print x
print

print '*'*10 + ' tuple ' + '*'*10
for x in t:
    # iterate over members, x is an int
    print x
print

print '*'*10 + ' file ' + '*'*10
for x in f:
    # iterate over lines, x is a str
    print '> ' + x[:50] + '...'
There are also a bunch of constructors and built-in functions that take iterables as arguments:
In [ ]:
# strings are iterables
print list("hello")
print
# range() returns an iterable
y = [ 2*x for x in range(10) ]
# lists are iterables
print sum(y)
The reason all of these common usages work so nicely is that the built-in objects all obey the "iterator protocol". The terminology can be a little confusing, so here are some relevant links to the official glossary in the docs. The exact definitions are not the point, but they're helpful for starting to wrap your head around the ideas. Here are some of the key points:
iterable: an object capable of returning its members one at a time; has internal methods like __getitem__() and __iter__() that deal with usage; think list, dict, tuple, str.
iterator: an object representing a stream of data; really a subset of iterables that require a next() method, and raise a StopIteration exception when the object is "out".
These internal methods and exception handling mean that as long as an object is built properly, using it in something like a for loop will be transparently easy. Python does an amazing job of hiding all of this from you when you're getting started. All of the built-in types obviously have these things, prebaked.
We write code that looks like this:
for x in obj:
    # do stuff with x
And Python does something like this:
_i = obj.__iter__()
while True:
    try:
        x = _i.next()
    except StopIteration:
        break
    # do stuff with x
By combining those abbreviated definitions above with the example code just above, we can see that our typical for loop takes an iterable obj and creates a temporary iterator _i from it, which is used within the loop.
So, iterables and iterators will have different attributes. Let's verify this with the awesome dir() function.
In [ ]:
# l is a list, an iterable
l = [1,2,3]
# create an iterator object from l
li = iter(l)
In [ ]:
# compare the methods and attributes
print "**** iterable ****"
print
print l.__class__
print
print dir(l)
print
print
print "**** iterator ****"
print
print li.__class__
print
print dir(li)
Now we can also look a little closer at how the .next() method works by explicitly calling it on an iterator that we've created:
In [ ]:
# l is an iterable
l = ["ready", "set", "go"]
# i is an iterator on the list iterable
i = iter(l)
i
For this cell, where we'll use the next() method, use in-place evaluation (Ctrl+Return) and see what the iterator returns each time.
In [ ]:
i.next()
So, now you can better appreciate how the for loop works so nicely... it's built to handle exactly this behavior. Here's that code snippet from above again:
for x in obj:
    # do stuff with x
Under the hood:
_i = obj.__iter__()   # this is what iter(obj) actually does
while True:
    try:
        x = _i.next()
    except StopIteration:
        break
    # do stuff with x
Ok, so now hopefully you have a deeper appreciation for how the built-in objects in Python let you use loops and other constructs for iterating.
There are (at least) two situations in which you can run into trouble with containers that step through a sequence of members:
In the first case, imagine you want to create a special iterator object that represents a "countdown" (this example is all over the place online). To do this, you'd need to write a class with (at a minimum) __init__(), __iter__(), and next() methods, such that you can drop your object into a for loop: for x in countdown(5): ...
That's a lot of overhead for something that doesn't seem very complicated. Hold that thought.
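To make that overhead concrete, here's a rough sketch of what such a class might look like; this Countdown class is just for illustration (it's not code from this repo):
In [ ]:
# an illustrative class-based countdown -- note all the boilerplate
# the iterator protocol requires
class Countdown(object):
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # an iterator simply returns itself from __iter__()
        return self

    def next(self):
        if self.n <= 0:
            # signal that we're "out"
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

for x in Countdown(5):
    print x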
In the second case, imagine writing a function so I end up with a list of all the integers below 99. No problem; keep assembling the list (the_list.append()) while incrementing a counter somewhere, and then at the end you return the_list. Now, please extend the function so I end up with a list of all integers below 10^20, or how about a never-ending list of integers, like a stream? Trouble. Most likely trouble due to lack of memory; we can't build a list that large in memory and then return it.
This brings us to...
It seems that people most often refer to an instance of "a generator function" as "a generator", so be aware of the potential for naming ambiguity.
A generator is a nicer and more convenient way to solve the problems outlined above, and more! Generators look like ordinary functions but include the yield keyword, and they generally behave quite differently. In fact, they're functions that return an iterator.
Importantly, generators yield processing control while maintaining state and position. This is in contrast to the manner in which ordinary functions enter, execute their code all the way to either an exception or a return statement, and subsequently lose all local variables and values (the "state").
Some other key things about generators:
on each next() call, the generator function executes all the way to the next yield, returns the appropriate value, and then pauses until the next call to next()
when the generator is exhausted, it raises a StopIteration exception, as expected (as we saw for an iterator)
Conceptually, you can think of these things as being somewhat like subsets (but not exactly):
[ iterables [ iterators [ generators ] ] ]
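One quick (if imperfect) way to see that nesting from the interpreter:
In [ ]:
# a list is an iterable but not an iterator; a generator is both
l = [1, 2, 3]
g = (x for x in l)
print hasattr(l, '__iter__'), hasattr(l, 'next')   # True False
print hasattr(g, '__iter__'), hasattr(g, 'next')   # True True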
Let's go through some of the characteristics of these things, by implementing the countdown object we talked about earlier. Note that there are none of the class methods we said we'd need to do this as a pure iterator!
In [ ]:
# countdown is a generator function. note the "yield"
def countdown(n):
    print "starting countdown from %s...." % n
    while n > 0:
        yield n
        print "decrementing count..."
        n -= 1
No code executes within the generator function when we make the object:
In [ ]:
x = countdown(10)
x
Remember that the generator function returns a generator, which behaves like an iterator. So we already know how to get items out of an iterator using the next() method. Importantly: note the flow of execution by looking for the print statements:
In [ ]:
x.next()
Note that on each subsequent call to next(), the first thing that executes is the print statement that immediately follows the yield in the generator function; execution picks up at that point every time. When the generator is empty, it raises the expected StopIteration exception.
In [ ]:
# use the CTL-return execution on this cell
x.next()
Clearly this object is set up properly to use in a for loop. I'll remove the print statements this time:
In [ ]:
def quiet_countdown(n):
    while n > 0:
        yield n
        n -= 1

for x in quiet_countdown(10):
    print x
The other example we gave, where building a full list in memory falls short, is when the end result (the series) is very long. Imagine that we'd like a way to return a long series of numbers, such that we can sum them for a final answer. Compare the following two ways of accumulating the series, one using a list and one using a generator:
In [ ]:
def first_n_list(n):
    current, result = 0, list()
    while current < n:
        result.append(current)  # result will have to be n items long
        current += 1
    return result

def first_n_gen(n):
    current = 0
    while current < n:
        yield current  # there is only one number, current
        current += 1
In both cases, the result can be used by sum to find the answer:
answer = sum( first_n_list(100000000) )
answer = sum( first_n_gen(100000000) )
The reason that both of these expressions look the same is that the sum function takes care of asking each iterator for its next item, incrementing its internal result, and proceeding until the end of the iterator. However, the difference is in the flow of control. For the list-based function, this is approximately:
sum expects an iterator, asks first_n_list to evaluate
first_n_list evaluates in its entirety, returns a 100-million-item list
sum iterates through the list, incrementing its internal result each time
the list's iterator raises a StopIteration exception to sum, which now knows it's done and returns the result to answer
But for the generator function version:
sum expects an iterator, asks first_n_gen to evaluate
first_n_gen begins execution until yield, returns a single value (and control) to sum, pauses
sum receives the first value, increments its internal counter, asks the generator for next()
first_n_gen resumes execution until yield, returns a value and control ...
eventually, first_n_gen raises its StopIteration exception to sum, which now knows it's done and returns the result to answer
So both versions needed to keep approximately three important things in memory: the current result within sum, the current result from the first_n_* function, and then one more thing. However, for the list-based approach, that extra thing was a 100-million-item list; for the generator, it was a single integer.
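If you want a rough feel for that difference, sys.getsizeof gives a hint. This is only an illustration: it measures just the container object itself, and the exact byte counts vary by platform and Python build.
In [ ]:
import sys

# rough illustration only: the list object grows with n,
# while the generator object stays tiny no matter what n is
print sys.getsizeof(first_n_list(100000))
print sys.getsizeof(first_n_gen(100000))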
Imagine that we have a module called word_count.py that looks something like the code below. Note the yields in the three generator functions!
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")   # the word-tokenizing regex used by the mapper

class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
Ok, now you can stop imagining, because this code is actually here in the repo! This is the prototypical introduction to the MapReduce framework. It's certainly one of the most common snippets of code in existence (maybe second to hello world).
Specifically, this code is from the Python library mrjob (em-arr-job). This is one of a handful of Python wrappers for using MapReduce; mrjob has proven useful for the ability to write your code once, run the code locally to look for errors or unexpected behaviors, and then seamlessly launch a distributed job on AWS EC2 instances with only a couple of command-line parameter changes, no changes to your code needed. That said, when it comes to debugging errors that come from Hadoop... you're on your own :)
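For a sense of what that looks like: mrjob selects the runner with a command-line flag, so (assuming your AWS credentials are already configured for mrjob, and depending on your mrjob version) switching from a local run to Elastic MapReduce is roughly:
# run locally while developing
python word_count.py moby.txt

# same script, same code -- send it to Elastic MapReduce with the runner flag
python word_count.py -r emr moby.txt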
If you don't have it installed and want to do this last part, go ahead and
pip install mrjob
in your favorite Python environment.
Let's drop to the shell and run this simple job, locally, on the first few chapters of Moby Dick, and see all those generators in action!
In [ ]:
# A one-liner that leaves an IPython notebook to run
# a local MapReduce job on your laptop. Awesome!
!python word_count.py moby.txt 2>/dev/null
mrjob is doing a lot of work for you, mostly behind the scenes. And I'll be honest: there are still a few corners of this process that seem like magic to me. If you want to start digging, you can always check the docs. However, we can still talk about how this process uses generators.
For each line in the input file, the mapper tokenizes the line with a regular expression and yields (for example) ("whale", 1) for each word. The part that's "invisible" here is that the counts for each key are collected into a generator, such that we end up with an input to the combiner that looks like ("whale", <generator of counts for each line>). The combiner then sums all the values (from one mapper) and yields ("whale", 8). Similarly, each of these counts is added to a generator such that the reducer does the same task as the combiner. In this locally-run example, the combiner and the reducer are redundant. But in a distributed process, there will be many individual combiners that yield the same keys, so the final reducer will aggregate all of these tallies on the same key. Finally, the reducer yields the final tally ("whale", 14).
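To make that "invisible" grouping a little less magical, here's a toy sketch of the idea; this is not mrjob's actual implementation, and toy_mapper and the sample lines are made up for illustration. The point is just that sorting the mapper's (key, value) pairs puts equal keys next to each other, so each key's counts can be handed to the "reducer" as a generator.
In [ ]:
# a toy sketch of the shuffle/group step -- NOT mrjob's implementation
from itertools import groupby
from operator import itemgetter
import re

WORD_RE = re.compile(r"[\w']+")

def toy_mapper(line):
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

lines = ["Call me Ishmael.", "The whale, the white whale!"]

# sort all (word, count) pairs so that equal keys end up adjacent
pairs = sorted(pair for line in lines for pair in toy_mapper(line))

for word, group in groupby(pairs, key=itemgetter(0)):
    counts = (count for _, count in group)   # a generator of counts for this word
    print word, sum(counts)                  # the "reducer" step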
At each step, the control passing between these processes means that each new line of the input file becomes a new task for the mapper-combiner-reducer sequence.
Though definitely out of the scope of this session on Python, MapReduce is a powerful framework with which it's helpful to be familiar if you're in the data space. The Wikipedia article is a good place to start.
Oh, you didn't get enough of iterators and generators? Well, here are some sneak previews of concepts that didn't get included in the main session. Consider these your starting points for googling away at your own pace...
In [ ]:
def knock_knock():
    name = yield "Who's there?"
    print type(name)
    yield "%s who?" % name
    yield "The end."
k = knock_knock()
In [ ]:
k.next()
In [ ]:
k.send('Josh')
In [ ]:
k.next()
In [ ]:
k.next()
It turns out that yield is actually an expression, and it evaluates to None when the generator is resumed with a plain next() call. So, that value (or any other) can be stored in another variable. Above, name captures the value of the yield expression. The send() generator method resumes execution of the generator and allows the yield expression to take on a different value.
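As one more (made-up) illustration of send(), here's a tiny running-total coroutine; the running_total name is just for this sketch:
In [ ]:
# a made-up running-total coroutine, just to show send() in action
def running_total():
    total = 0
    while True:
        value = yield total   # hand back the total, then wait for the next send()
        total += value

r = running_total()
r.next()          # "prime" it: run to the first yield (returns 0)
print r.send(5)   # 5
print r.send(3)   # 8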
Just like list comprehensions allow us to execute the equivalent of a for loop,
for x in i:
    foo(x)
# ^ gives the same results as...
[ foo(x) for x in i ]
generator expressions can be used to replace similar loops:
for x in i:
    yield x
# ^ gives the same result as...
( x for x in i )
In a generator expression, the yield is implicit. And just like a generator, the expression is evaluated in a lazy manner: one item at a time, as needed.
In [ ]:
gen_exp = ( x for x in range(1000))
gen_exp
In [ ]:
sum(gen_exp)
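One caveat worth remembering: a generator (or generator expression) is exhausted after a single pass, so summing the same object a second time quietly gives 0.
In [ ]:
gen_exp = ( x for x in range(1000) )
print sum(gen_exp)   # 499500
print sum(gen_exp)   # 0 -- the generator was exhausted by the first sum()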
There are many great references online for learning more about these concepts. Below are a few of the links where I found examples and some of the thoughts shared above:
mrjob
Concepts