Sets and Dictionaries in Python: JSON (Instructor Version)

Objectives

Correctly define "JSON" and give simple examples of valid JSON structures.
Describe JSON's strengths and weaknesses as a storage format.
Write code to read and write JSON-formatted data files using standard libraries.

Lesson

The example above used two data file formats: one for storing molecular formulas, the other for storing inventory. Both formats were specific to this application, which means we needed to write, debug, document, and maintain functions to handle them. Those functions weren't particularly difficult to write, but they still took time, and anyone who wants to read our files in Java, MATLAB, or Perl will have to write equivalent functions themselves.

A growing number of programs avoid these problems by using a flexible data format called JSON, which stands for "JavaScript Object Notation". Despite the name, it is a language-independent way to store nested data structures made up of strings, numbers, Booleans, lists, dictionaries, and the special value null (equivalent to Python's None). In short, it covers the basic data types that almost every language supports. (JSON itself calls lists "arrays" and dictionaries "objects".) For example, let's convert a dictionary of scientists' birthdays to a string:



In [1]:

    
import json
birthdays = {'Curie' : 1867, 'Hopper' : 1906, 'Franklin' : 1920}
as_string = json.dumps(birthdays)
print as_string
print type(as_string)

{"Curie": 1867, "Hopper": 1906, "Franklin": 1920}
<type 'str'>

json.dumps doesn't seem to do much, but that's kind of the point: the textual representation of the data structure looks pretty much like what a programmer would type in to re-create it. The advantage is that this representation can be saved in a file:



In [2]:

    
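# Write the dictionary to a file; 'with' closes the file automatically.
with open('/tmp/example.json', 'w') as writer:
    json.dump(birthdays, writer)

# Read the file back into a new dictionary.
with open('/tmp/example.json', 'r') as reader:
    duplicate = json.load(reader)

print 'original:', birthdays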

original: {'Curie': 1867, 'Hopper': 1906, 'Franklin': 1920}



In [5]:

    
print 'duplicate:', duplicate

duplicate: {u'Curie': 1867, u'Hopper': 1906, u'Franklin': 1920}

(Note that the strings come back as Unicode, which Python 2 displays with a u prefix.)

The data read in is the same as the original:



In [6]:

    
print 'original == duplicate:', birthdays == duplicate

original == duplicate: True

But it is not the same object in memory:



In [7]:

    
print 'original is duplicate:', birthdays is duplicate

original is duplicate: False
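
The same round trip works for the other value types JSON supports. The sketch below uses a made-up record (its keys and values are purely illustrative, not part of the lesson's data) to show how JSON's true and null correspond to Python's True and None. print is called with a single parenthesized argument so the code should run unchanged under both Python 2 and Python 3.

```python
import json

# Hypothetical record that exercises each JSON value type.
record = {
    'label': 'sample',       # string
    'count': 3,              # integer number
    'ratio': 0.5,            # floating-point number
    'checked': True,         # Boolean -> JSON true
    'note': None,            # None -> JSON null
    'tags': ['a', 'b'],      # list -> JSON array
    'nested': {'depth': 1},  # dictionary -> JSON object
}

# sort_keys=True makes the key order in the output predictable.
as_string = json.dumps(record, sort_keys=True)
print(as_string)

# Reading the string back gives an equal (but separate) structure.
restored = json.loads(as_string)
print(restored == record)
print(restored['note'] is None)
```

In the printed string, True appears as true and None as null; reading it back turns them into Python values again.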

The data file holds what we'd type in to create the data in a program, which makes it easy to edit by hand:



In [3]:

    
!cat /tmp/example.json

{"Curie": 1867, "Hopper": 1906, "Franklin": 1920}
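
If a file is meant to be edited by hand, it can help to spread it over several lines. The following is a minimal sketch using the indent and sort_keys parameters of json.dumps (json.dump accepts the same parameters); the variable name pretty is ours, not part of the lesson's code.

```python
import json

birthdays = {'Curie': 1867, 'Hopper': 1906, 'Franklin': 1920}

# indent=4 puts each key on its own line;
# sort_keys=True writes the keys in alphabetical order.
pretty = json.dumps(birthdays, indent=4, sort_keys=True)
print(pretty)

# The extra whitespace does not change the data that is read back.
print(json.loads(pretty) == birthdays)
```

Sorted keys also tend to make differences between two versions of a file easier to spot.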

How is this different in practice from what we had? First, our inventory file now looks like this:



In [4]:

    
!cat inventory-03.json

{"He" : 1, "H" : 4, "O" : 3}

while our formulas file looks like this:



In [6]:

    
!cat formulas-03.json

{"helium"   : {"He" : 1},
 "water"    : {"H" : 2, "O" : 1},
 "hydrogen" : {"H" : 2}}

Those aren't as intuitive for non-programmers as the original flat text files, but they're not too bad. The worst thing is the lack of comments: unfortunately, the JSON format doesn't support them. (Note also that JSON requires double quotes around strings: unlike Python, it does not accept single quotes.)
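
Both restrictions are easy to check. In the sketch below, is_valid_json is a hypothetical helper of ours, not part of the json module; it relies on json.loads raising ValueError when the text is not valid JSON.

```python
import json

def is_valid_json(text):
    '''Return True if text parses as JSON, False otherwise.'''
    try:
        json.loads(text)
        return True
    except ValueError:
        # In Python 3 the error is json.JSONDecodeError,
        # which is a subclass of ValueError.
        return False

print(is_valid_json('{"H": 2, "O": 1}'))   # double-quoted strings
print(is_valid_json("{'H': 2, 'O': 1}"))   # single-quoted strings
print(is_valid_json('{"H": 2} # water'))   # trailing comment
```

Only the first string is accepted; the single-quoted and commented versions are rejected.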

The good news is that given files like these, we can rewrite our program as:



In [7]:

    
def main(inventory_file, formula_file):
    with open(inventory_file, 'r') as reader:
        inventory = json.load(reader)
    with open(formula_file, 'r') as reader:
        formulas = json.load(reader)
    counts = calculate_counts(inventory, formulas)
    show_counts(counts)

The two functions that read formula and inventory files have been replaced with a couple of lines each. Nothing else has to change, because the data structures loaded from the data files are exactly what we had before. The end result is 51 lines long compared to the 80 we started with, a reduction of more than a third.

Nothing's Perfekt

JSON's greatest weakness isn't its lack of support for comments, but the fact that it doesn't recognize and manage aliases. Instead, each occurrence of an aliased structure is treated as something brand new when data is being saved. For example:



In [9]:

    
inner = ['name']
outer = [inner, inner] # Creating an alias
print outer
print outer[0] is outer[1]

[['name'], ['name']]
True



In [10]:

    
as_string = json.dumps(outer)
duplicate = json.loads(as_string)
print duplicate
print duplicate[0] is duplicate[1]

[[u'name'], [u'name']]
False

The diagram below shows the difference between the original data structure (referred to by outer) and what winds up in duplicate. If aliases might be present in our data, and it's important to preserve their structure, we must either record the aliasing ourselves (which is tricky), or use some other format. Luckily, a lot of data either doesn't contain aliases, or the aliasing in it isn't important.
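
As one illustration of "some other format", Python's standard pickle module keeps track of objects it has already written, so aliases within a single saved structure survive a round trip. This is a sketch rather than a recommendation: pickle data is Python-specific, is not human-readable, and should never be loaded from an untrusted source. The names via_json and via_pickle are ours.

```python
import json
import pickle

inner = ['name']
outer = [inner, inner]  # both elements refer to the same list

# Round trip through JSON: the alias is lost.
via_json = json.loads(json.dumps(outer))

# Round trip through pickle: the alias is preserved.
via_pickle = pickle.loads(pickle.dumps(outer))

print(via_json[0] is via_json[1])      # False
print(via_pickle[0] is via_pickle[1])  # True
```

Because the alias survives, appending to via_pickle[0] also changes via_pickle[1], just as it would for outer.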

Key Points

The JSON data format can represent arbitrarily nested lists and dictionaries containing strings, numbers, Booleans, and null (Python's None).
Using JSON reduces the code we have to write ourselves and improves interoperability with other programming languages.
JSON doesn't allow for comments, and doesn't handle aliasing.