Sets and Dictionaries in Python

Motivation

Fan Fullerene has just joined Molecules'R'Us, a nanotechnology startup that fabricates molecules using only the highest quality atoms. His first job is to build a simple inventory management system that compares incoming orders for molecules to the stock of atoms in the company's supercooled warehouse to see how many of those molecules we can build. For example, if the warehouse holds 20 hydrogen atoms, 5 oxygen atoms, and 11 nitrogen atoms, Fan could make 10 water molecules (H2O) or 6 ammonia molecules (NH3), but could not make any methane (CH4) because there isn't any carbon.

Fan could solve this problem using the tools we've seen so far. As we'll see, though, it's a lot more efficient to do it using a different data structure. And "efficient" means both "takes less programmer time to create" and "takes less computer time to execute": the data structures introduced in this chapter are both simpler to use and faster than the lists most programmers are introduced to first.

For Instructors

The ostensible goal of this set of lessons is to introduce learners to non-linear data structures. Most have ony ever seen arrays or lists, i.e., things that are accessed using sequential numeric indices. Sets and dictionaries are usually their first exposure to accessing content by value rather than by location, and to the bigger idea that there are lots of other data structures they might want to learn about. (Unfortunately, there still isn't a good data structure handbook for Python programmers that we can point them at.)

These lessons also introduce JSON as a general-purpose data format that requires less effort to work with than flat text or CSV. We discuss its shortcomings as well as its benefits to help learners see what forces are at play when designing and/or choosing data representations.

Finally, these lessons are also our first chance to introduce the idea of computational complexity via back-of-the-envelope calculations of how the number of steps required to look things up in an unordered list grows with the number of things being looked up. We return to this idea in the extended example of invasion percolation, and to the notion that algorithmic improvements help more than tuning code, but this is a chance to touch on the idea in classes that don't get to that example. The discussion of hash tables is also good preparation for the discussion of relational databases, but isn't required.

Everything in this lesson except the final example on phylogenetic trees can be covered in two hours, assuming that only three short exercises are given (one for sets, one for basic dictionary operations, and one related to aggregation).

  • Start with sets: they're a familiar concept, there's no confusion between keys and values, and they are enough to motivate discussion of hash tables.
  • Python's requirement that entries in hash-based data structures be immutable only makes sense once the mechanics of hash tables are explained. Terms like "hash codes" and "hash function" also come up in error messages and Python's documentation, so learners are likely to become confused without some kind of hand-waving overview. Tuples are also easy to explain as "how to create immutable multi-part keys", and it's easy to explain why entries can't be looked up by parts (e.g., why a tuple containing a first and a last name can't be looked up by last name only) in terms of hash functions.
  • Finally, explaining why hash tables are fast is a good first encounter with the idea of "big oh" complexity.
  • Once sets have been mastered, dictionaries can be explained as "sets with extra information attached to each entry". The canonical example—counting things—shows why that "extra information" is useful. The original motivating problem then uses both a dictionary and a dictionary of dictionaries; when introducing the latter, compare it to a list of lists.
  • Use the nanotechnology inventory example to re-emphasize how code is build top-down by writing code as if desired functions existed, then filling them in.
  • Only tackle the phylogenetic tree example with very advanced learners. The algorithm is usually presented as a table, which makes an array a natural representation. Showing how and why to use dictionaries instead is as important as showing vector operations when introducing NumPy, but the example is hard to follow (and debug) without a graphical representation of the evolving tree.

Prerequisites

Basic data types (strings, numbers, lists); loops; file I/O; conditionals; string operations; references and aliasing; creating functions; top-down development.

Lessons

FIXME: add links

Summary

Every programmer meets lists (or arrays or matrices) early in her career. Many in science never meet sets and dictionaries, and that's a shame: they often make programs easier to write and faster to run at the same time.

Before we leave this topic, try running the function globals at an interactive Python prompt:


In [1]:
globals()


Out[1]:
{'In': ['', u'globals()'],
 'Out': {},
 '_': '',
 '__': '',
 '___': '',
 '__builtin__': <module '__builtin__' (built-in)>,
 '__builtins__': <module '__builtin__' (built-in)>,
 '__name__': '__main__',
 '_dh': [u'/Users/gwilson/bc/lessons/guide-setdict'],
 '_i': u'',
 '_i1': u'globals()',
 '_ih': ['', u'globals()'],
 '_ii': u'',
 '_iii': u'',
 '_oh': {},
 '_sh': <module 'IPython.core.shadowns' from '/Users/gwilson/anaconda/lib/python2.7/site-packages/IPython/core/shadowns.pyc'>,
 'exit': <IPython.core.autocall.ZMQExitAutocall at 0x10238b8d0>,
 'get_ipython': <bound method ZMQInteractiveShell.get_ipython of <IPython.zmq.zmqshell.ZMQInteractiveShell object at 0x10238b2d0>>,
 'help': Type help() for interactive help, or help(object) for help about object.,
 'quit': <IPython.core.autocall.ZMQExitAutocall at 0x10238b8d0>}

That's right—Python actually stores the program's variables in a dictionary. In fact, it uses one dictionary for the global variables and one for each currently-active function call:


In [3]:
def example(first, second):
    print 'locals in example:', locals()
example(22, 33)


locals in example: {'second': 33, 'first': 22}

You now know everything you need to know in order to build a programming language of your own. But please don't: the world will be much better off if you keep doing science instead.