This notebook motivates and explains why Python has default dictionaries.
Read more here: https://docs.python.org/3/library/collections.html#collections.defaultdict
Suppose you have a list of tuples where each one has a string key and an integer value. Your task is to sum all the values that share the same key.
In [1]:
data = [
    ('california', 1),
    ('california', 3),
    ('colorado', 0),
    ('colorado', 10),
    ('washington', 2),
    ('washington', 4)
]
With an ordinary dictionary, I need to check whether the key exists. If it doesn't, I need to initialize it with a value. For instructional purposes I will call the int() function, which returns the default value for an integer, 0.
In [5]:
# This won't work because I haven't initialized keys
summed = dict()
for row in data:
    key, value = row  # destructure the tuple
    summed[key] = summed[key] + value
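To see the failure without halting the notebook, here is a minimal sketch. The names mirror the cell above; the try/except wrapper and the missing_key variable are additions for illustration only. It catches the exception and reports which key was missing:

```python
data = [
    ('california', 1),
    ('california', 3),
    ('colorado', 0),
    ('colorado', 10),
    ('washington', 2),
    ('washington', 4),
]

summed = dict()
missing_key = None  # hypothetical helper variable, not part of the original cell
try:
    for key, value in data:
        # Reading summed[key] fails before anything is assigned to it
        summed[key] = summed[key] + value
except KeyError as err:
    # KeyError stores the offending key as its first argument
    missing_key = err.args[0]

print('missing key:', missing_key)
print('summed is still empty:', summed == {})
```

The loop stops on the very first row, so nothing is ever written to the dictionary.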
As expected, the first time we try to update the value for california, the key doesn't exist in the dictionary, so the right-hand side of the equals sign raises a KeyError. That's easy to fix like this:
In [22]:
summed = dict()
for row in data:
    key, value = row
    if key not in summed:
        summed[key] = int()
    summed[key] = summed[key] + value
In [23]:
summed
Out[23]:
Let's see one more example where, instead of summing the numbers, we want to collect everything into a list. So let's replace int() with list(), since we want to start from an empty list. We also need to change the summing step to use append instead.
In [24]:
merged = dict()
for row in data:
    key, value = row
    if key not in merged:
        merged[key] = list()
    merged[key].append(value)
In [25]:
merged
Out[25]:
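The two loops above differ only in the function that builds the starting value and in the update step. As a rough sketch of that observation, the check can be pulled into a helper; the name accumulate and its parameters init and combine are hypothetical and not part of the standard library:

```python
def accumulate(rows, init, combine):
    """Group rows by key: init() builds the starting value, combine() folds in each value."""
    result = dict()
    for key, value in rows:
        if key not in result:
            # Same existence check as in the cells above
            result[key] = init()
        result[key] = combine(result[key], value)
    return result


data = [
    ('california', 1),
    ('california', 3),
    ('colorado', 0),
    ('colorado', 10),
    ('washington', 2),
    ('washington', 4),
]

summed = accumulate(data, int, lambda total, v: total + v)
merged = accumulate(data, list, lambda items, v: items + [v])
print(summed)
print(merged)
```

The only piece that changes between the two calls is the function that creates the initial value, which is the idea the next section builds on.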
It's inconvenient to do this check every time, so Python has a nice way to make this pattern simpler. This is what collections.defaultdict was designed for. It does the following:
- When you create it, you pass in a function, func.
- When you access a key, e.g. merged[key], it checks whether the key exists. If it doesn't, instead of erroring it initializes the key to the return value of func,
- then proceeds as normal.
Let's see both examples from above using this.
In [26]:
from collections import defaultdict
In [27]:
summed = defaultdict(int)
for row in data:
    key, value = row
    summed[key] = summed[key] + value
In [28]:
summed
Out[28]:
In [29]:
merged = defaultdict(list)
for row in data:
    key, value = row
    merged[key].append(value)
In [30]:
merged
Out[30]:
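To make the lookup behaviour used in the two cells above concrete, here is a rough sketch of it as a dict subclass. The class name SketchDefaultDict and its factory attribute are hypothetical; this is not how collections.defaultdict is actually implemented, only an illustration built on the __missing__ hook that dict calls when a key lookup fails:

```python
class SketchDefaultDict(dict):
    """Illustrative stand-in for defaultdict, not the real implementation."""

    def __init__(self, factory):
        super().__init__()
        self.factory = factory  # function that builds a value for a missing key

    def __missing__(self, key):
        # dict calls this from d[key] only when key is absent
        value = self.factory()
        self[key] = value  # store it, so the key now exists
        return value


data = [
    ('california', 1),
    ('california', 3),
    ('colorado', 0),
    ('colorado', 10),
    ('washington', 2),
    ('washington', 4),
]

summed = SketchDefaultDict(int)
merged = SketchDefaultDict(list)
for key, value in data:
    summed[key] = summed[key] + value
    merged[key].append(value)

print(dict(summed))
print(dict(merged))
```

The existence check has not disappeared; it has moved out of the loop and into the dictionary itself.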
In [31]:
def myinit():
    return -100

summed = defaultdict(myinit)
for row in data:
    key, value = row
    summed[key] += value
In [32]:
summed
Out[32]:
As expected, defaultdict(int) and defaultdict(list) give exactly the same results as the manual checks, and the starting value depends on the function you pass in: with myinit, every sum starts from -100 instead of 0. This function is called a factory function, since each time a key needs to be initialized you can imagine that the function acts as a factory which creates new values. Let's cover one of the common mistakes with default dictionaries before concluding. The source of this mistake is that any time a non-existent key is accessed, it is initialized.
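Before turning to that mistake, here is a small sketch of the factory idea. The function counting_factory and the calls list are hypothetical names used only to make the calls visible; the point is that the function runs once per missing key and builds a fresh value each time:

```python
from collections import defaultdict

calls = []  # records every time the factory runs


def counting_factory():
    calls.append('called')
    return []  # a brand-new list on every call


d = defaultdict(counting_factory)
d['a'].append(1)  # 'a' is missing, so the factory runs
d['a'].append(2)  # 'a' already exists, so the factory does not run
d['b'].append(3)  # 'b' is missing, so the factory runs again

print('factory calls:', len(calls))
print('same list object:', d['a'] is d['b'])
```

Because each missing key gets its own list, appending to one key never affects another.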
In [52]:
d = defaultdict(str)
# initially this is empty so all of these should be false
print('pedro in dictionary:', 'pedro' in d)
print('jordan in dictionary:', 'jordan' in d)
In [53]:
# Let's set something in the dictionary now and check that again
d['jordan'] = 'professor'
print('jordan is in dictionary:', 'jordan' in d)
print('pedro is in dictionary:', 'pedro' in d)
In [54]:
# Let's accidentally access 'pedro' before setting it, then see what happens
pedro_job = d['pedro']
print('pedro is in dictionary:', 'pedro' in d)
print(d)
print('-->', d['pedro'], '<--', type(d['pedro']))
So this is odd! You never set the key (you only accessed it), but nonetheless pedro is in the dictionary. This is because when the 'pedro' key was accessed and wasn't there, Python set it to the return value of str, which is an empty string. Let's set this to the real value and be done.
In [55]:
d['pedro'] = 'PhD Student'
print('pedro is in dictionary:', 'pedro' in d)
print(d)
print('-->', d['pedro'], '<--', type(d['pedro']))
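One way to avoid the accidental insertion shown above is to read with the in operator or with .get(), neither of which calls the factory function. A minimal sketch, reusing the names from the cells above (the 'unknown' fallback string is an arbitrary choice for illustration):

```python
from collections import defaultdict

d = defaultdict(str)
d['jordan'] = 'professor'

# Neither of these lookups inserts 'pedro' into the dictionary
has_pedro = 'pedro' in d
pedro_job = d.get('pedro')                 # None when the key is missing
pedro_job_or_unknown = d.get('pedro', 'unknown')

print('pedro is in dictionary:', has_pedro)
print('pedro job via get:', pedro_job)
print('pedro job with fallback:', pedro_job_or_unknown)
print('keys after the lookups:', list(d))
```

Only square-bracket access such as d['pedro'] triggers the default, so it is best saved for the places where creating the key is what you want.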