Sets and Dictionaries in Python: Aggregation (Instructor Version)

Objectives

  • Recognize problems that can be solved by aggregating values.
  • Use dictionaries to aggregate values.
  • Explain why actual data values should be used as initializers rather than "impossible" values.

Lesson

To see how useful dictionaries can be, let's switch tracks and do some birdwatching. We'll start by asking how early in the day we saw each kind of bird? Our data consists of the date and time of the observation, the bird's name, and an optional comment:


In [1]:
!cat some_birds.txt


2010-07-03    05:38    loon
2010-07-03    06:02    goose
2010-07-03    06:07    loon
2010-07-04    05:09    ostrich   # hallucinating?
2010-07-04    05:29    loon

Rephrasing our problem, we want the minimum of all the times associated with each bird name. If our data was stored in memory like this:

loon = ['05:38', '06:07', '05:20', ...]

the solution would simply be min(loon), and similarly for the other birds. However, we have to work with the data we have, so let's start by reading our data file and creating a list of tuples, each of which contains a date, time, and bird name as strings:


In [3]:
def read_observations(filename):
    '''Read data, returning [(date, time, bird)...].'''

    reader = open(filename, 'r')
    result = []

    for line in reader:
        fields = line.split('#')[0].strip().split()
        assert len(fields) == 3, 'Bad line "%s"' % line
        result.append(fields)

    return result

This function follows the pattern we've seen many times before. We set up by opening the input file and creating an empty list that we'll append records to. We then process each line of the file in turn. Splitting the line on the '#' character and taking the first part of the result gets rid of any comment that might be present; stripping off whitespace and then splitting breaks the remainder into fields.

To prevent trouble later on, we check that there actually are three fields before going on. (An industrial-strength version of this function would also check that the date and time were properly formatted, but we'll skip that for now.) Once we've done our check, we append the triple containing the date, time, and bird name to the list we're going to return.

Here's the function that turns that list of tuples into a dictionary:


In [4]:
def earliest_observation(data):
    '''How early did we see each bird?'''

    result = {}
    for (date, time, bird) in data:
        if bird not in result:
            result[bird] = time
        else:
            result[bird] = min(result[bird], time)

    return result

Once again, the pattern should by now be familiar. We start by creating an empty dictionary, then use a loop to inspect each tuple in turn. The loop explodes the tuple into separate variables for the date, time and bird. If the bird's name is not already a key in our dictionary, this must be the first time we've seen it, so we store the time we saw it in the dictionary. If the bird's name is already there, on the other hand, we keep the minimum of the stored time and the new time. This is almost exactly the same as our earlier counting example, but instead of either storing 1 or adding 1 to the count so far, we're either storing the time or taking the minimum of it and the least time so far.

Let's test what we've written so far:


In [5]:
entries = read_observations('some_birds.txt')
result = earliest_observation(entries)
print result


{'loon': '05:29', 'goose': '06:02', 'ostrich': '05:09'}

Now, what if we want to find out which birds were seen on particular days? Once again, we are aggregating values, i.e., combining many separate values to create one new one. However, since we probably saw more than one kind of bird each day, that "new value" needs to be a collection of some kind. We're only interested in which birds we saw, so the right kind of collection is a set. Here's our function:


In [6]:
def birds_by_date(data):
    '''Which birds were seen on each day?'''

    result = {}
    for (date, time, bird) in data:
        if date not in result:
            result[date] = {bird}
        else:
            result[date].add(bird)

    return result

print birds_by_date(entries)


{'2010-07-04': set(['loon', 'ostrich']), '2010-07-03': set(['loon', 'goose'])}

Again, we start by creating an empty dictionary, and then process each tuple in turn. Since we're recording birds by date, the keys in our dictionary are dates rather than bird names. If the current date isn't already a key in the dictionary, we create a set containing only this bird, and store it in the dictionary with the date as the key. Otherwise, we add this bird to the set associated with the date. (As always, we don't need to check whether the bird is already in that set, since the set will automatically eliminate any duplication.)

Let's watch this function in action for the first few records from our data:

Input Dictionary
start {}
2010-07-03  05:38  loon {'2010-07-03' : {'loon'}}
2010-07-03  06:02  goose {'2010-07-03' : {'goose', 'loon'}}
2010-07-03  06:07  loon {'2010-07-03' : {'goose', 'loon'}}
2010-07-04  05:09  ostrich {'2010-07-03' : {'goose', 'loon'}, '2010-07-04' : {'ostrich'}}
2010-07-04  05:29  loon {'2010-07-03' : {'goose', 'loon'}, '2010-07-04' : {'ostrich', 'loon'}}

For our last example, we'll figure out which bird we saw least frequently—or rather, which birds, since two or more may be tied for the low score. Forgetting that values may not be unique is a common mistake in data crunching, and often a hard one to track down.

Our first strategy is simple: figure out how many times we've seen each bird, then find the minimum of those counts and get the set of birds we've seen that many times. The function below implements this fairly directly:


In [7]:
def least_common_birds(data):
    '''Which bird or birds have been seen least frequently?'''
    
    counts = count_by_bird(data) # need to write this
    least = min(counts.values())
    result = set()
    for bird in counts:
        if counts[bird] == least:
            result.add(bird)
    return result

least_common_birds depends on a function count_by_bird, but this is yet another example of using a dictionary to aggregate values (in this case, to sum the number of birds we have seen). Just for variety's sake, we'll use a slightly different strategy that we've used before: whenever we see a new kind of bird, we'll set its count to zero, and then always add one to the stored count:


In [8]:
def count_by_bird(data):
    '''How many times was each bird seen?'''
    result = {}
    for (date, time, bird) in data:
        if bird not in result:
            result[bird] = 0
        result[bird] += 1
    return result

Finally, we'll test our function:


In [9]:
print least_common_birds(entries)


set(['goose', 'ostrich'])

This does the job, but is somewhat inefficient: we do one pass through all the data while counting birds, then another pass through all the birds to find those that we've seen the least number of times. We can actually do the whole job with a single pass through the data, but as we'll see in the challenges, the resulting code is significantly more complex than what we have written so far. Unless we're sure that the second pass is really a performance bottleneck, we should stick with this simple implementation.

Key Points

  • Use dictionaries to count things.
  • Initialize values from actual data instead of trying to guess what values could "never" occur.