2014-08-14, Josh Montague
In this notebook, we'll walk through some of the details behind how Python lets us easily iterate over containers, the guidelines to follow when creating a custom iterable, and how the iterator concept extends to generators. We'll get just a taste of generators (which are very powerful!), and then see them in action with a simple example of using the MapReduce framework.
This session was built using Python 2.
To warm up with something familiar: you already know how to iterate over a list:
In [ ]:
l = [1, 2, 3]
for x in l:
    print x
Similarly, you're almost certainly familiar with iterating over dicts, strs, tuples, and files:
In [ ]:
s = 'josh'
d = {'j':1, 'o':2, 's':3, 'h':4}
t = (1, 2, 3)
f = open('moby.txt')
print '*'*10 + ' str ' + '*'*10
for x in s:
    # iterate over characters, x is a str
    print x
print

print '*'*10 + ' dict ' + '*'*10
for x in d:  # remember: not in any particular order!
    # iterate over dict keys, x is a str
    print x
print

print '*'*10 + ' tuple ' + '*'*10
for x in t:
    # iterate over members, x is an int
    print x
print

print '*'*10 + ' file ' + '*'*10
for x in f:
    # iterate over lines, x is a str
    print '> ' + x[:50] + '...'
There are also a bunch of constructors and built-in functions that take iterables as arguments:
In [ ]:
# strings are iterables
print list("hello")
print
# range() returns an iterable
y = [ 2*x for x in range(10) ]
# lists are iterables
print sum(y)
The reason all of these common usages work so nicely is that the built-in objects all obey the "iterator protocol". The terminology can be a little confusing, so here are some relevant links to the official glossary in the docs. The exact definitions are not the point, but they're helpful for starting to wrap your head around the ideas. Here are some of the key points:
iterable: an object capable of returning its members one at a time; has internal methods like __getitem__() and __iter__() that deal with usage; think list, dict, tuple, str.
iterator: an object representing a stream of data; really a subset of iterables that require a next() method, and raise a StopIteration exception when the object is "out".
These internal methods and exception handling mean that as long as an object is built properly, using it in something like a for loop will be transparently easy. Python does an amazing job of hiding all of this from you when you're getting started. All of the built-in types obviously have these things, prebaked.
We write code that looks like this:
for x in obj:
    # do stuff with x
And Python does something like this:
_i = obj.__iter__()
while True:
    try:
        x = _i.next()
    except StopIteration:
        break
    # do stuff with x
By combining those abbreviated definitions above with the example code just above, we can see that our typical for loop takes an iterable obj and creates a temporary iterator _i from it, which is used within the loop.
So, iterables and iterators will have different attributes. Let's verify this with the awesome dir() function.
In [ ]:
# l is a list, an iterable
l = [1,2,3]
# create an iterator object from l
li = iter(l)
In [ ]:
# compare the methods and attributes
print "**** iterable ****"
print
print l.__class__
print
print dir(l)
print
print
print "**** iterator ****"
print
print li.__class__
print
print dir(li)
Now we can also look a little closer at how the .next() method works by explicitly calling it on an iterator that we've created:
In [ ]:
# l is an iterable
l = ["ready", "set", "go"]
# i is an iterator on the list iterable
i = iter(l)
i
For this cell, where we'll use the next() method, use in-place evaluation (Ctrl+Return) and see what the iterator returns each time.
In [ ]:
i.next()
So, now you can better appreciate how the for loop works so nicely... it's built to handle exactly this behavior. Here's that code snippet from above again:
for x in obj:
    # do stuff with x
Under the hood:
_i = obj.__iter__()   # this is what iter(obj) actually does
while True:
    try:
        x = _i.next()
    except StopIteration:
        break
    # do stuff with x
Ok, so now hopefully you have a deeper appreciation for how the built-in objects in Python let you use loops and other constructs for iterating.
There are (at least) two situations in which you can run into trouble with containers that step through a sequence of members:
In the first case, imagine you want to create a special iterator object that represents a "countdown" (this example is all over the place online). To do this, you'd need to write a class with (at a minimum) __init__(), __iter__(), and next() methods, such that you can drop your object into a for loop: for x in countdown(5): ...
That's a lot of overhead for something that doesn't seem very complicated. Hold that thought.
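To make that overhead concrete, here's a rough sketch of what such a class might look like; this Countdown class is just for illustration (it's not code from this repo):
In [ ]:
# an illustrative class-based countdown -- note all the boilerplate
# the iterator protocol requires
class Countdown(object):
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # an iterator simply returns itself from __iter__()
        return self

    def next(self):
        if self.n <= 0:
            # signal that we're "out"
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

for x in Countdown(5):
    print x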
In the second case, imagine writing a function so I end up with a list of all the integers below 99. No problem; keep assembling the list (the_list.append()) while incrementing a counter somewhere, and then at the end you return the_list. Now, please extend the function so I end up with a list of all integers below 10^20, or how about a never-ending list of integers, like a stream? Trouble. Most likely trouble due to lack of memory; we can't build a list that large in memory and then return it.
This brings us to...
It seems that people most often refer to an instance of "a generator function" as "a generator", so be aware of the potential for naming ambiguity.
A generator is a nicer and more convenient way to solve the problems outlined above, and more! Generators look like ordinary functions but include the yield keyword, and they generally behave quite differently. In fact, they're functions that return an iterator.
Importantly, generators yield processing control while maintaining state and position. This is in contrast to the manner in which ordinary functions enter, execute their code all the way to either an exception or a return statement, and subsequently lose all local variables and values (the "state").
Some other key things about generators:
on each next() call, the generator function executes all the way to the next yield, returns the appropriate value, and then pauses until the next call to next()
when the generator is exhausted, it raises a StopIteration exception, as expected (as we saw for an iterator)
Conceptually, you can think of these things as being somewhat like subsets (but not exactly):
[ iterables [ iterators [ generators ] ] ]
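One quick (if imperfect) way to see that nesting from the interpreter:
In [ ]:
# a list is an iterable but not an iterator; a generator is both
l = [1, 2, 3]
g = (x for x in l)
print hasattr(l, '__iter__'), hasattr(l, 'next')   # True False
print hasattr(g, '__iter__'), hasattr(g, 'next')   # True True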
Let's go through some of the characteristics of these things, by implementing the countdown object we talked about earlier. Note that there are none of the class methods we said we'd need to do this as a pure iterator!
In [ ]:
# countdown is a generator function. note the "yield"
def countdown(n):
    print "starting countdown from %s...." % n
    while n > 0:
        yield n
        print "decrementing count..."
        n -= 1
No code executes within the generator function when we make the object:
In [ ]:
x = countdown(10)
x
Remember that the generator function returns a generator, which behaves like an iterator. So we already know how to get items out of an iterator using the next() method. Importantly: note the flow of execution by looking for the print statements:
In [ ]:
x.next()
Note that on each subsequent call to next(), the first thing that executes is the print statement that immediately follows the yield in the generator function; execution picks up at that point every time. When the generator is empty, it raises the expected StopIteration exception.
In [ ]:
# use the CTL-return execution on this cell
x.next()
Clearly this object is set up properly to use in a for loop. I'll remove the print statements this time:
In [ ]:
def quiet_countdown(n):
    while n > 0:
        yield n
        n -= 1

for x in quiet_countdown(10):
    print x
The other example we gave, where building a full list in memory falls short, is when the end result (the series) is very long. Imagine that we'd like a way to return a long series of numbers, such that we can sum them for a final answer. Compare the following two ways of accumulating the series, one using a list and one using a generator:
In [ ]:
def first_n_list(n):
    current, result = 0, list()
    while current < n:
        result.append(current)  # result will have to be n items long
        current += 1
    return result

def first_n_gen(n):
    current = 0
    while current < n:
        yield current  # there is only one number, current
        current += 1
In both cases, the result can be used by sum to find the answer:
answer = sum( first_n_list(100000000) )
answer = sum( first_n_gen(100000000) )
The reason that both of these expressions look the same is that the sum function takes care of asking each iterator for its next item, incrementing its internal result, and proceeding until the end of the iterator. However, the difference is in the flow of control. For the list-based function, this is approximately:
sum expects an iterator, asks first_n_list to evaluate
first_n_list evaluates in its entirety, returns a 100-million-item list
sum iterates through the list, incrementing its internal result each time
the list's iterator raises a StopIteration exception to sum, which now knows it's done and returns the result to answer
But for the generator function version:
sum expects an iterator, asks first_n_gen to evaluate
first_n_gen begins execution until yield, returns a single value (and control) to sum, pauses
sum receives the first value, increments its internal counter, asks the generator for next()
first_n_gen resumes execution until yield, returns a value and control ...
eventually, first_n_gen raises its StopIteration exception to sum, which now knows it's done and returns the result to answer
So both versions needed to keep approximately three important things in memory: the current result within sum, the current result from the first_n_* function, and then one more thing. However, for the list-based approach, that extra thing was a 100-million-item list; for the generator, it was a single integer.
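If you want a rough feel for that difference, sys.getsizeof gives a hint. This is only an illustration: it measures just the container object itself, and the exact byte counts vary by platform and Python build.
In [ ]:
import sys

# rough illustration only: the list object grows with n,
# while the generator object stays tiny no matter what n is
print sys.getsizeof(first_n_list(100000))
print sys.getsizeof(first_n_gen(100000))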
Imagine that we have a module called word_count.py that looks something like the code below. Note the yields in the three generator functions!
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")   # the word-tokenizing regex used by the mapper

class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
Ok, now you can stop imagining, because this code is actually here in the repo! This is the prototypical introduction to the MapReduce framework. It's certainly one of the most common snippets of code in existence (maybe second to hello world).
Specifically, this code is from the Python library mrjob (em-arr-job). This is one of a handful of Python wrappers for using MapReduce; mrjob has proven useful for the ability to write your code once, run the code locally to look for errors or unexpected behaviors, and then seamlessly launch a distributed job on AWS EC2 instances with only a couple of command-line parameter changes, no changes to your code needed. That said, when it comes to debugging errors that come from Hadoop... you're on your own :)
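For a sense of what that looks like: mrjob selects the runner with a command-line flag, so (assuming your AWS credentials are already configured for mrjob, and depending on your mrjob version) switching from a local run to Elastic MapReduce is roughly:
# run locally while developing
python word_count.py moby.txt

# same script, same code -- send it to Elastic MapReduce with the runner flag
python word_count.py -r emr moby.txt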
If you don't have it installed and want to do this last part, go ahead and
pip install mrjob
in your favorite Python environment.
Let's drop to the shell and run this simple job, locally, on the first few chapters of Moby Dick, and see all those generators in action!
In [ ]:
# A one-liner that leaves an IPython notebook to run
# a local MapReduce job on your laptop. Awesome!
!python word_count.py moby.txt 2>/dev/null
mrjob is doing a lot of work for you, mostly behind the scenes. And I'll be honest: there are still a few corners of this process that seem like magic to me. If you want to start digging, you can always check the docs. However, we can still talk about how this process uses generators.
For each line in the input file, the mapper tokenizes the line with a regular expression and yields (for example) ("whale", 1) for each word. The part that's "invisible" here is that the counts for each key are collected into a generator, such that we end up with an input to the combiner that looks like ("whale", <generator of counts for each line>). The combiner then sums all the values (from one mapper) and yields ("whale", 8). Similarly, each of these counts is added to a generator such that the reducer does the same task as the combiner. In this locally-run example, the combiner and the reducer are redundant. But in a distributed process, there will be many individual combiners that yield the same keys, so the final reducer will aggregate all of these tallies on the same key. Finally, the reducer yields the final tally ("whale", 14).
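To make that "invisible" grouping a little less magical, here's a toy sketch of the idea; this is not mrjob's actual implementation, and toy_mapper and the sample lines are made up for illustration. The point is just that sorting the mapper's (key, value) pairs puts equal keys next to each other, so each key's counts can be handed to the "reducer" as a generator.
In [ ]:
# a toy sketch of the shuffle/group step -- NOT mrjob's implementation
from itertools import groupby
from operator import itemgetter
import re

WORD_RE = re.compile(r"[\w']+")

def toy_mapper(line):
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

lines = ["Call me Ishmael.", "The whale, the white whale!"]

# sort all (word, count) pairs so that equal keys end up adjacent
pairs = sorted(pair for line in lines for pair in toy_mapper(line))

for word, group in groupby(pairs, key=itemgetter(0)):
    counts = (count for _, count in group)   # a generator of counts for this word
    print word, sum(counts)                  # the "reducer" step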
At each step, the control passing between these processes means that each new line of the input file becomes a new task for the mapper-combiner-reducer sequence.
Though definitely out of the scope of this session on Python, MapReduce is a powerful framework with which it's helpful to be familiar if you're in the data space. The Wikipedia article is a good place to start.
Oh, you didn't get enough of iterators and generators? Well, here are some sneak previews of concepts that didn't get included in the main session. Consider these your starting points for googling away at your own pace...
In [ ]:
def knock_knock():
    name = yield "Who's there?"
    print type(name)
    yield "%s who?" % name
    yield "The end."
k = knock_knock()
In [ ]:
k.next()
In [ ]:
k.send('Josh')
In [ ]:
k.next()
In [ ]:
k.next()
It turns out that yield is actually an expression, and it evaluates to None when the generator is resumed with a plain next() call. So, that value (or any other) can be stored in another variable. Above, name captures the value of the yield expression. The send() generator method resumes execution of the generator and allows the yield expression to take on a different value.
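As one more (made-up) illustration of send(), here's a tiny running-total coroutine; the running_total name is just for this sketch:
In [ ]:
# a made-up running-total coroutine, just to show send() in action
def running_total():
    total = 0
    while True:
        value = yield total   # hand back the total, then wait for the next send()
        total += value

r = running_total()
r.next()          # "prime" it: run to the first yield (returns 0)
print r.send(5)   # 5
print r.send(3)   # 8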
Just like list comprehensions allow us to execute the equivalent of a for loop,
for x in i:
    foo(x)
# ^ gives the same results as...
[ foo(x) for x in i ]
generator expressions can be used to replace similar loops:
for x in i:
    yield x
# ^ gives the same result as...
( x for x in i )
In a generator expression, the yield is implicit. And just like a generator, the expression is evaluated in a lazy manner: one item at a time, as needed.
In [ ]:
gen_exp = ( x for x in range(1000))
gen_exp
In [ ]:
sum(gen_exp)
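One caveat worth remembering: a generator (or generator expression) is exhausted after a single pass, so summing the same object a second time quietly gives 0.
In [ ]:
gen_exp = ( x for x in range(1000) )
print sum(gen_exp)   # 499500
print sum(gen_exp)   # 0 -- the generator was exhausted by the first sum()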
There are many great references online for learning more about these concepts. Below are a few of the links where I found examples and some of the thoughts shared above:
mrjob
Concepts