Sets and Dictionaries in Python: Nanotech Inventory (Instructor Version)

Objectives

  • Create and manipulate nested dictionaries.
  • Explain the similarities and differences between nested dictionaries and nested lists.

Lesson

We can now solve Fan's original nanotech inventory problem. As explained in the introduction, our goal is to find out how many molecules of various kinds we can make using the atoms in our warehouse. The number of molecules of any particular type we can make is limited by the scarcest atom that molecule requires. For example, if we have five nitrogen atoms and ten hydrogen atoms, we can only make three ammonia molecules, because we need three hydrogen atoms for each.

The formulas for the molecules we know how to make are stored in a file like this:


In [11]:
!cat formulas-03.txt


# Molecular formula file

helium : He 1
water : H 2 O 1
hydrogen : H 2

and our inventory is stored in a file like this:


In [12]:
!cat inventory-03.txt


# Atomic inventory file
He 1
H 4
O 3

Let's start by reading in our inventory. It consists of pairs of strings and numbers, which by now should suggest using a dictionary for storage. The keys will be atomic symbols, and the values will be the number of atoms of that kind we currently have. If an atom isn't listed in our inventory, we'll assume that we don't have any.

What about the formulas for the molecules we know how to make? Once again, we want to use strings—the names of molecules—as indices, which suggests a dictionary. Each of its values will be something storing atomic symbols and the number of atoms of that type in the molecule—the same structure, in fact, that we're using for our inventory. The diagram below shows what this looks like in memory if the only molecules we know how to make are water and ammonia.

Finally, we'll store the results of our calculation in yet another dictionary, this one mapping the names of molecules to how many molecules of that kind we can make, and display like this:

helium 1
hydrogen 2
water 2

The main body of the program is straightforward. It reads in the input files, does our calculation, and prints the result:


In [13]:
def main(inventory_file, formula_file):
    inventory = read_inventory(inventory_file)
    formulas = read_formulas(formula_file)
    counts = calculate_counts(inventory, formulas)
    show_counts(counts)

Reading the inventory file is simple. We take each interesting line in the file, split it to get an atomic symbol and a count, and store them together in a dictionary:


In [14]:
def read_inventory(inventory_file):
    result = {}
    with open(inventory_file, 'r') as reader:
        for line in reader:
            name, count = line.strip().split()
            result[name] = int(count)
    return result

Let's test it:


In [15]:
print read_inventory('inventory-03.txt')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-c05b7b912bfb> in <module>()
----> 1 print read_inventory('inventory-03.txt')

<ipython-input-14-d5dd028eb45b> in read_inventory(inventory_file)
      3     with open(inventory_file, 'r') as reader:
      4         for line in reader:
----> 5             name, count = line.strip().split()
      6             result[name] = int(count)
      7     return result

ValueError: too many values to unpack

Our mistake was to forget that files can contain blank lines and comments. It's easy enough to modify the function to handle them, though it complicates the logic:


In [16]:
def read_inventory(inventory_file):
    result = {}
    with open(inventory_file, 'r') as reader:
        for line in reader:
            line = line.strip()
            if (not line) or line.startswith('#'):
                continue
            name, count = line.split()
            result[name] = int(count)
    return result

print read_inventory('inventory-03.txt')


{'H': 4, 'O': 3, 'He': 1}

The next step is to read the files containing formulas. Since the file format is more complicated, the function is as well. In fact, it's complicated enough that we'll come back later and simplify it.

FIXME: diagram


In [17]:
def read_formulas(formula_file):
    result = {}
    with open(formula_file, 'r') as reader:
        for line in reader:
            line = line.strip()
            if (not line) or line.startswith('#'):
                continue
            name, atoms = line.split(':')
            name = name.strip()
            atoms = atoms.strip().split()
    
            formula = {}
            for i in range(0, len(atoms), 2):
                symbol = atoms[i].strip()
                count = int(atoms[i+1])
                formula[symbol] = count
            result[name] = formula

    return result

We start by creating a dictionary to hold our results. We then split each interesting line in the data file on the colon ':' to separate the molecule's name (which may contain spaces) from its formula. We then split the formulas into a list of strings. These alternate between atomic symbols and numbers, so in the inner loop, we move forward through those values two elements at a time, storing the atomic symbol and count in a dictionary. Once we're done, we store that dictionary as the value for the molecule name in the main dictionary. When we've processed all the lines, we return the final result. Here's a simple test:


In [18]:
print read_formulas('formulas-03.txt')


{'water': {'H': 2, 'O': 1}, 'hydrogen': {'H': 2}, 'helium': {'He': 1}}

Now that we have all our data, it's time to calculate how many molecules of each kind we can make. inventory maps atomic symbols to counts, and so does formulas[name], so let's loop over all the molecules we know how to make and "divide" the inventory by each one:


In [19]:
def calculate_counts(inventory, formulas):
    counts = {}
    for name in formulas:
        counts[name] = dict_divide(inventory, formulas[name])
    return counts

We say we're "dividing" the inventory by each molecule because we're trying to find out how many of that molecule we can make without requiring more of any particular atom than we actually have. (By analogy, when we divide 11 by 3, we're trying to find out how many 3's we can make from 11.) The function that does the division is:


In [20]:
def dict_divide(inventory, molecule):
    number = None
    for atom in molecule:
        required = molecule[atom]
        available = inventory.get(atom, 0)
        limit = available / required
        if (number is None) or (limit < number):
            number = limit
    return number

This function loops over all the atoms in the molecule we're trying to build, see what limit the available inventory puts on us, and return the minimum of all those results. This function uses a few patterns that come up frequently in many kinds of programs:

  1. The first pattern is to initialize the value we're going to return to None, then test for that value inside the loop to make sure we re-set it to a legal value the first time we have real data. In this case, we could just as easily use -1 or some other impossible value as an "uninitialized" flag for number.
  2. Since we're looping over the keys of molecule, we know that we can get the value stored in molecule[atom]. However, that atom might not be a key in inventory, so we use inventory.get(atom, 0) to get either the stored value or a sensible default. In this case zero, the sensible default is 0, because if the atom's symbol isn't in the dictionary, we don't have any of it. This is our second pattern.
  3. The third is using calculate, test, and store to find a single value—in this case, the minimum—from a set of calculated values. We could calculate the list of available over required values, then find the minimum of the list, but doing the minimum test as we go along saves us having to store the list of intermediate values. It's probably not a noticeable time saving in this case, but it would be with larger data sets.

The last step in building our program is to show how many molecules of each kind we can make. We could just loop over our result dictionary, printing each molecule's name and the number of times we could make it, but let's put the results in alphabetical order to make it easier to read:


In [21]:
def show_counts(counts):
    names = counts.keys()
    names.sort()
    for name in names:
        print name, counts[name]

It's time to test our code. Let's start by using an empty inventory and a single formula:


In [22]:
!cat inventory-00.txt

In [23]:
!cat formulas-00.txt

In [24]:
main('inventory-00.txt', 'formulas-00.txt')

There's no output, which is what we expect. Let's add a formula but no atoms:


In [26]:
!cat formulas-01.txt


helium : He 1

In [27]:
main('inventory-00.txt', 'formulas-01.txt')


helium 0

and now an atom:


In [29]:
!cat inventory-01.txt


He 1

In [30]:
main('inventory-01.txt', 'formulas-01.txt')


helium 1

That seems right as well. Let's add some hydrogen and another formula:


In [31]:
!cat inventory-02.txt


He 1
H 4

In [33]:
!cat formulas-02.txt


helium : He 1
water : H 2 O 1

In [34]:
main('inventory-02.txt', 'formulas-02.txt')


helium 1
water 0

The output doesn't change, which is correct. Our final test adds some oxygen:


In [35]:
!cat inventory-03.txt


# Atomic inventory file
He 1
H 4
O 3

In [36]:
main('inventory-03.txt', 'formulas-02.txt')


helium 1
water 2

That's right too: we can make two water molecules (because we don't have enough hydrogen to pair with our three oxygen atoms).

There are quite a few other interesting tests still to run, but before we do that, we should clean up our code. Both of our input functions handle comments and blank lines the same way; let's put that code in a helper function:


In [37]:
def readlines(filename):
    result = []
    with open(filename, 'r') as reader:
        for line in reader:
            line = line.strip()
            if line and (not line.startswith('#')):
                result.append(line)
    return result

If we convert read_inventory to use it, the result is six lines long instead of ten. More importantly, the logic of what we're doing is much clearer:


In [38]:
def read_inventory(inventory_file):
    result = {}
    for line in readlines(inventory_file):
        name, count = line.split()
        result[name] = int(count)
    return result

The converted version of read_formulas is 15 lines instead of 19:


In [ ]:
def read_formulas(formula_file):
    result = {}
    for line in readlines(formula_file):
        name, atoms = line.split(':')
        name = name.strip()
        atoms = atoms.strip().split()

        formula = {}
        for i in range(0, len(atoms), 2):
            symbol = atoms[i].strip()
            count = int(atoms[i+1])
            formula[symbol] = count
        result[name] = formula

    return result

We can do better still by putting the code that handles atom/count pairs in a helper function of its own:


In [39]:
def read_formulas(formula_file):
    result = {}
    for line in readlines(formula_file):
        name, atoms = line.split(':')
        name = name.strip()
        result[name] = make_formula(atoms)
    return result

def make_formula(atoms):
    formula = {}
    atoms = atoms.strip().split()
    for i in range(0, len(atoms), 2):
        symbol = atoms[i].strip()
        count = int(atoms[i+1])
        formula[symbol] = count
    return formula

This change has actually made the code slightly longer, but each function now does one small job, and as a bonus, the code in make_formula (which is moderately complex) can now be tested on its own. We're not done until we test our changes, though:


In [40]:
main('inventory-00.txt', 'formulas-00.txt')

In [41]:
main('inventory-01.txt', 'formulas-01.txt')


helium 1

In [42]:
main('inventory-02.txt', 'formulas-02.txt')


helium 1
water 0

In [43]:
main('inventory-03.txt', 'formulas-03.txt')


helium 1
hydrogen 2
water 2

In fact, we could (and should) have tested right after refactoring read_inventory, and only then refactored read_formulas.

Key Points

  • Whenever names are used to label things, consider using dictionaries to store them.
  • Use nested dictionaries to store hierarchical values (like molecule names and atomic counts).
  • Get it right, then refactor to make each part simple.
  • Test after each refactoring step.