Python default dictionary vs dictionary

This notebook motivates and explains why Python has default dictionaries.

Read more here: https://docs.python.org/3/library/collections.html#collections.defaultdict

Suppose you have a list of tuples, each with a string key and an integer value. Your task is to sum all the values that share the same key.


In [1]:
data = [
    ('california', 1),
    ('california', 3),
    ('colorado', 0),
    ('colorado', 10),
    ('washington', 2),
    ('washington', 4)
]

With an ordinary dictionary, I need to check whether the key exists and, if it doesn't, initialize it with a value. For instructional purposes I will call the int() function, which returns the default value for an integer: 0. First, here is what happens without that check.


In [5]:
# This won't work because I haven't initialized keys

summed = dict()
for row in data:
    key, value = row # destructure the tuple
    summed[key] = summed[key] + value


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-5-4e6a9bdea217> in <module>()
      4 for row in data:
      5     key, value = row # destructure the tuple
----> 6     summed[key] = summed[key] + value

KeyError: 'california'

As expected, the first time we try to update the value for california, the key doesn't exist in the dictionary, so reading summed[key] on the right-hand side of the equals sign raises a KeyError. That's easy to fix:


In [22]:
summed = dict()
for row in data:
    key, value = row
    if key not in summed:
        summed[key] = int()
        
    summed[key] = summed[key] + value

In [23]:
summed


Out[23]:
{'california': 4, 'colorado': 10, 'washington': 6}
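The fix above relies on the fact that calling a built-in type with no arguments returns its "empty" value. A minimal sketch to confirm this (the variable names here are illustrative, not part of the original notebook):

```python
# Calling a built-in type with no arguments returns its "empty" value
zero = int()          # 0
empty_list = list()   # []
empty_string = str()  # ''

print(zero, empty_list, repr(empty_string))

# Each call builds a new object, so two empty lists are never shared
first = list()
second = list()
first.append(1)
print(first, second)
```

The last two lines matter when collecting values: because every call returns a fresh list, appending to one key's list cannot affect another key.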

Let's see one more example where, instead of summing the numbers, we want to collect them into a list. We replace int() with list(), since we want to start from an empty list, and change the summing step to use append instead.


In [24]:
merged = dict()
for row in data:
    key, value = row
    if key not in merged:
        merged[key] = list()
    
    merged[key].append(value)

In [25]:
merged


Out[25]:
{'california': [1, 3], 'colorado': [0, 10], 'washington': [2, 4]}

It's inconvenient to do this check every time, so Python has a nice way to make this pattern simpler. This is what collections.defaultdict was designed for. It does the following:

  1. Takes a function as its first argument, which we will call func
  2. When a key is accessed (for example with merged[key]), checks whether it exists. If it doesn't, instead of raising a KeyError it initializes the key to the return value of func, then proceeds as normal
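The two steps above can be imitated in a few lines to make the mechanism concrete. This is only a teaching sketch, not how collections.defaultdict is actually implemented: the class name SketchDefaultDict and its func attribute are hypothetical. It relies on the `__missing__` hook, which the built-in dict calls from square-bracket access when a subclass defines it and the key is absent.

```python
class SketchDefaultDict(dict):
    """Hypothetical teaching sketch; not the real collections.defaultdict."""

    def __init__(self, func):
        super().__init__()
        self.func = func  # step 1: remember the function

    def __missing__(self, key):
        # step 2: called by dict's square-bracket lookup only when key is absent
        value = self.func()
        self[key] = value  # store the freshly created default
        return value


data = [
    ('california', 1),
    ('california', 3),
    ('colorado', 0),
    ('colorado', 10),
    ('washington', 2),
    ('washington', 4),
]

merged = SketchDefaultDict(list)
for key, value in data:
    merged[key].append(value)

print(dict(merged))
# {'california': [1, 3], 'colorado': [0, 10], 'washington': [2, 4]}
```

The rest of this notebook uses the real collections.defaultdict, which behaves the same way for this purpose.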

Let's see both examples from above using defaultdict:


In [26]:
from collections import defaultdict

In [27]:
summed = defaultdict(int)
for row in data:
    key, value = row
    summed[key] = summed[key] + value

In [28]:
summed


Out[28]:
defaultdict(int, {'california': 4, 'colorado': 10, 'washington': 6})

In [29]:
merged = defaultdict(list)
for row in data:
    key, value = row
    merged[key].append(value)

In [30]:
merged


Out[30]:
defaultdict(list,
            {'california': [1, 3], 'colorado': [0, 10], 'washington': [2, 4]})

In [31]:
def myinit():
    return -100

summed = defaultdict(myinit)
for row in data:
    key, value = row
    summed[key] += value

In [32]:
summed


Out[32]:
defaultdict(<function __main__.myinit>,
            {'california': -96, 'colorado': -90, 'washington': -94})

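As expected, the first two results are exactly the same as with an ordinary dictionary. The third shows that the starting value depends on the function you pass: with myinit every key starts at -100, so california ends at -100 + 1 + 3 = -96. This function is called a factory function (the defaultdict documentation names the argument default_factory), since each time a key needs to be initialized you can imagine that the function acts as a factory which creates a new value.

Let's cover one of the common mistakes with default dictionaries before concluding. The source of this mistake is that any time a non-existent key is accessed with square brackets, it is initialized.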


In [52]:
d = defaultdict(str)

# initially this is empty so all of these should be false
print('pedro in dictionary:', 'pedro' in d)
print('jordan in dictionary:', 'jordan' in d)


pedro in dictionary: False
jordan in dictionary: False

In [53]:
# Let's set something in the dictionary now and check that again

d['jordan'] = 'professor'

print('jordan is in dictionary:', 'jordan' in d)
print('pedro is in dictionary:', 'pedro' in d)


jordan is in dictionary: True
pedro is in dictionary: False

In [54]:
# Let's accidentally access 'pedro' before setting it, then see what happens

pedro_job = d['pedro']

print('pedro is in dictionary:', 'pedro' in d)
print(d)
print('-->', d['pedro'], '<--', type(d['pedro']))


pedro is in dictionary: True
defaultdict(<class 'str'>, {'jordan': 'professor', 'pedro': ''})
-->  <-- <class 'str'>

So this is odd! You never set the key (you only accessed it), but pedro is nonetheless in the dictionary. This is because when the 'pedro' key was accessed and found missing, Python set it to the return value of str(), which is an empty string. Let's set it to the real value and be done:


In [55]:
d['pedro'] = 'PhD Student'

print('pedro is in dictionary:', 'pedro' in d)
print(d)
print('-->', d['pedro'], '<--', type(d['pedro']))


pedro is in dictionary: True
defaultdict(<class 'str'>, {'jordan': 'professor', 'pedro': 'PhD Student'})
--> PhD Student <-- <class 'str'>
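One way to avoid the accidental insertion shown above, sketched here under the assumption that a lookup should leave the dictionary untouched: membership tests with `in` and the `.get()` method do not call the factory function, so they never create keys. Only square-bracket access does.

```python
from collections import defaultdict

d = defaultdict(str)
d['jordan'] = 'professor'

# Neither `in` nor .get() triggers the factory, so no key is created
has_pedro = 'pedro' in d
pedro_job = d.get('pedro')                 # None, like an ordinary dict
pedro_job_or_default = d.get('pedro', 'unknown')
size_after_safe_lookups = len(d)

print(has_pedro, pedro_job, pedro_job_or_default, size_after_safe_lookups)

# Square-bracket access on a missing key does create it
d['pedro']
size_after_indexing = len(d)
print(size_after_indexing)
```

If you only want to read a value that may be absent, `.get()` with an explicit fallback keeps the dictionary's contents unchanged.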
