Sets and Dictionaries in Python: Aggregation (Learner Version)

Objectives

  • Recognize problems that can be solved by aggregating values.
  • Use dictionaries to aggregate values.
  • Explain why actual data values should be used as initializers rather than "impossible" values.

Lesson

  • How early in the day did we see each kind of bird?

In [2]:
!cat some_birds.txt


2010-07-03    05:38    loon
2010-07-03    06:02    goose
2010-07-03    06:07    loon
2010-07-04    05:09    ostrich   # hallucinating?
2010-07-04    05:29    loon

  • Read the data, returning a list of (date, time, bird) tuples

In [5]:
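def read_observations(filename):
    '''Read data, returning [(date, time, bird)...].'''

    result = []

    # 'with' closes the file even if the assertion below fails.
    with open(filename, 'r') as reader:
        for line in reader:
            # Drop any trailing '#' comment, then split on whitespace.
            fields = line.split('#')[0].strip().split()
            assert len(fields) == 3, 'Bad line "%s"' % line
            # Store a tuple so the result matches the docstring.
            result.append(tuple(fields))

    return result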
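
The parsing step can be tried on a single line without opening a file. The sketch below is only an illustration: parse_line is a hypothetical helper name that is not part of the lesson, and it simply mirrors the split-and-check logic used in read_observations.

```python
def parse_line(line):
    '''Split one observation line into (date, time, bird).

    Hypothetical helper: mirrors the parsing in read_observations.
    '''
    # Everything after '#' is a comment and is discarded.
    data_part = line.split('#')[0]
    # split() with no argument splits on any run of whitespace.
    fields = data_part.strip().split()
    assert len(fields) == 3, 'Bad line "%s"' % line
    return tuple(fields)

# One line copied from the some_birds.txt listing.
sample = '2010-07-04    05:09    ostrich   # hallucinating?\n'
print(parse_line(sample))  # ('2010-07-04', '05:09', 'ostrich')
```

Because the assertion requires exactly three fields, a blank line or a comment-only line would also be reported as a bad line.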

  • Turn the list of tuples into a dictionary that maps each bird to its earliest time

In [6]:
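def earliest_observation(data):
    '''How early did we see each bird?'''

    result = {}
    for (date, time, bird) in data:
        if bird not in result:
            # First sighting: initialize from the actual data value.
            result[bird] = time
        else:
            # Zero-padded 'HH:MM' strings compare correctly as text.
            result[bird] = min(result[bird], time)

    return result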

  • Test both functions on the sample file

In [7]:
entries = read_observations('some_birds.txt')
result = earliest_observation(entries)
print(result)


{'loon': '05:29', 'goose': '06:02', 'ostrich': '05:09'}
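
To see why the first real value is the safer initializer, it helps to compare it with a guessed starting value. The sketch below is an illustration under stated assumptions: both function names are hypothetical, the guess of '12:00' is an arbitrary choice, and the afternoon observations are made up for the example (they are not in some_birds.txt).

```python
def earliest_with_guess(data, guess='12:00'):
    '''Flawed variant: starts every bird at a guessed "late" time.'''
    result = {}
    for (date, time, bird) in data:
        result[bird] = min(result.get(bird, guess), time)
    return result

def earliest_from_data(data):
    '''Variant that starts each bird at its first real value.'''
    result = {}
    for (date, time, bird) in data:
        # result.get(bird, time) falls back to the current value itself.
        result[bird] = min(result.get(bird, time), time)
    return result

# Made-up observations for illustration only.
afternoon = [('2010-07-05', '14:30', 'owl'),
             ('2010-07-05', '16:10', 'owl')]

print(earliest_with_guess(afternoon))  # {'owl': '12:00'}
print(earliest_from_data(afternoon))   # {'owl': '14:30'}
```

The guessed version reports a time that nobody recorded, because the guess turned out to be earlier than every real value. Starting from the data cannot fail this way.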

  • Which birds did we see on a particular day?

In [8]:
def birds_by_date(data):
    '''Which birds were seen on each day?'''

    result = {}
    for (date, time, bird) in data:
        if date not in result:
            result[date] = {bird}
        else:
            result[date].add(bird)

    return result

print(birds_by_date(entries))


{'2010-07-04': set(['loon', 'ostrich']), '2010-07-03': set(['loon', 'goose'])}
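
The "create the set if the key is missing, otherwise add to it" pattern can also be written with dict.setdefault. This is a sketch of an alternative, not a replacement for the lesson's version; the function name birds_by_date_setdefault is hypothetical, and the entries list is typed in by hand from the sample file listing.

```python
def birds_by_date_setdefault(data):
    '''Which birds were seen on each day? (setdefault variant)'''
    result = {}
    for (date, time, bird) in data:
        # setdefault returns the set already stored under date, or
        # stores a new empty set there first and returns that.
        result.setdefault(date, set()).add(bird)
    return result

# The five observations shown in the some_birds.txt listing.
entries = [('2010-07-03', '05:38', 'loon'),
           ('2010-07-03', '06:02', 'goose'),
           ('2010-07-03', '06:07', 'loon'),
           ('2010-07-04', '05:09', 'ostrich'),
           ('2010-07-04', '05:29', 'loon')]

by_date = birds_by_date_setdefault(entries)
print(sorted(by_date['2010-07-03']))  # ['goose', 'loon']
print(sorted(by_date['2010-07-04']))  # ['loon', 'ostrich']
```

The explicit if/else form makes the two cases easier to see; setdefault trades that for brevity.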

  • Which bird (or birds, in case of ties) did we see least frequently?
  • Try to do it without making two passes through the data (one of them just to find the least value)
  • And without choosing some arbitrary large initial value for least
  • One solution: initialize least to None and handle its replacement as a special case
  • The version below takes a simpler route: count the sightings once, then let min() find the least count

In [9]:
def least_common_birds(data):
    '''Which bird or birds have been seen least frequently?'''
    
    counts = count_by_bird(data) # helper function defined below
    least = min(counts.values())
    result = set()
    for bird in counts:
        if counts[bird] == least:
            result.add(bird)
    return result

# Helper function.
def count_by_bird(data):
    '''How many times was each bird seen?'''
    result = {}
    for (date, time, bird) in data:
        if bird not in result:
            result[bird] = 0
        result[bird] += 1
    return result

# Test.
print(least_common_birds(entries))


set(['goose', 'ostrich'])
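
The bullets above mention initializing least to None. A minimal sketch of that idea follows, assuming the counts dictionary has already been built; the function name least_common_from_counts is hypothetical. It scans the counts once, tracking the smallest count seen so far and the birds that share it.

```python
def least_common_from_counts(counts):
    '''Return the set of keys with the smallest count, in one scan.'''
    least = None      # no count seen yet
    result = set()
    for bird, count in counts.items():
        if least is None or count < least:
            # First value, or a new minimum: start the result set over.
            least = count
            result = {bird}
        elif count == least:
            # Tie with the current minimum.
            result.add(bird)
    return result

# Counts for the sample file: loon 3, goose 1, ostrich 1.
sample_counts = {'loon': 3, 'goose': 1, 'ostrich': 1}
print(least_common_from_counts(sample_counts) == {'goose', 'ostrich'})  # True
```

One side effect of this design is that an empty dictionary yields an empty set, whereas calling min() on an empty sequence raises ValueError.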

Key Points

  • Use dictionaries to count things.
  • Initialize values from actual data instead of trying to guess what values could "never" occur.
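
For the counting case specifically, the standard library's collections.Counter packages the same dictionary pattern. The sketch below is an optional alternative to count_by_bird; the function name count_by_bird_counter is hypothetical, and the entries list is typed in by hand from the sample file listing.

```python
from collections import Counter

def count_by_bird_counter(data):
    '''How many times was each bird seen? (Counter variant)'''
    # Counter tallies each item produced by the generator expression.
    return Counter(bird for (date, time, bird) in data)

# The five observations shown in the some_birds.txt listing.
entries = [('2010-07-03', '05:38', 'loon'),
           ('2010-07-03', '06:02', 'goose'),
           ('2010-07-03', '06:07', 'loon'),
           ('2010-07-04', '05:09', 'ostrich'),
           ('2010-07-04', '05:29', 'loon')]

counts = count_by_bird_counter(entries)
print(counts['loon'])   # 3
print(counts['goose'])  # 1
```

A Counter is a dictionary subclass, so it can be passed to code that expects a plain dictionary of counts.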