The example above used two data file formats: one for storing molecular formulas, the other for storing inventory. Both formats were specific to this application, which means we needed to write, debug, document, and maintain functions to handle them. Those functions weren't particularly difficult to create, but they still took time to create, and if anyone ever wants to read our files in Java, MATLAB, or Perl, they'll have to write equivalent functions themselves.
A growing number of programs avoid these problems by using a flexible
data format called JSON, which stands for
"JavaScript Object Notation". Despite the name, it is a
language-independent way to store nested data structures made up of
strings, numbers, Booleans, lists, dictionaries, and the special value
null (equivalent to Python's None)—in short, the basic data types
that almost every language supports. For example, let's convert a
dictionary of scientists' birthdays to a string:
In [1]:
import json
birthdays = {'Curie' : 1867, 'Hopper' : 1906, 'Franklin' : 1920}
as_string = json.dumps(birthdays)
print as_string
print type(as_string)
json.dumps doesn't seem to do much, but that's kind of the point: the
textual representation of the data structure looks pretty much like what
a programmer would type in to re-create it. The advantage is that this
representation can be saved in a file:
In [2]:
writer = open('/tmp/example.json', 'w')
json.dump(birthdays, writer)
writer.close()
reader = open('/tmp/example.json', 'r')
duplicate = json.load(reader)
reader.close()
print 'original:', birthdays
In [5]:
print 'duplicate:', duplicate
(Note that strings are stored as Unicode.)
The data read in is the same as the original:
In [6]:
print 'original == duplicate:', birthdays == duplicate
But it is not the same object in memory:
In [7]:
print 'original is duplicate:', birthdays is duplicate
The data file holds what we'd type in to create the data in a program, which makes it easy to edit by hand:
In [3]:
!cat /tmp/example.json
How is this different in practice from what we had? First, our inventory file now looks like this:
In [4]:
!cat inventory-03.json
while our formula files are:
In [6]:
!cat formulas-03.json
Those aren't as intuitive for non-programmers as the original flat text files, but they're not too bad. The worst thing is the lack of comments: unfortunately—very unfortunately—the JSON format doesn't support them. (And note that JSON requires us to use a double-quote for strings: unlike Python, we cannot substitute single quotes.)
The good news is that given files like these, we can rewrite our program as:
In [7]:
def main(inventory_file, formula_file):
with open(inventory_file, 'r') as reader:
inventory = json.load(reader)
with open(formula_file, 'r') as reader:
formulas = json.load(reader)
counts = calculate_counts(inventory, formulas)
show_counts(counts)
The two functions that read formula and inventory files have been replaced with a couple of lines each. Nothing else has to change, because the data structures loaded from the data files are exactly what we had before. The end result is 51 lines long compared to the 80 we started with, a reduction of more than a third.
JSON's greatest weakness isn't its lack of support for comments, but the fact that it doesn't recognize and manage aliases. Instead, each occurrence of an aliased structure is treated as something brand new when data is being saved. For example:
In [9]:
inner = ['name']
outer = [inner, inner] # Creating an alias
print outer
print outer[0] is outer[1]
In [10]:
as_string = json.dumps(outer)
duplicate = json.loads(as_string)
print duplicate
print duplicate[0] is duplicate[1]
The diagram below shows the difference between the original data
structure (referred to by outer) and what winds up in duplicate. If
aliases might be present in our data, and it's important to preserve
their structure, we must either record the aliasing ourself (which is
tricky), or use some other format. Luckily, a lot of data either doesn't
contain aliases, or the aliasing in it isn't important.
None.