Sets and Dictionaries in Python: Sets (Learner Version)

Objectives

Explain why some programs that use lists become proportionally slower as data sizes increase.
Explain the three adjectives in "unordered collection of distinct values".
Use a set to eliminate duplicate values from data.

Lesson

Keeping track of unique atoms using a list:



In [1]:

    
def another_atom(seen, atom):
    for i in range(len(seen)):
        if seen[i] == atom:
            return # atom is already present, so do not re-add
    seen.append(atom)

Average time to check or insert grows in proportion to the number of distinct values V
So with N insertions, running time is NV/2
If most values are distinct, this is N^2/2

Better way to solve the problem is to use a set
An unordered collection of distinct values
Create a set:



In [1]:

    
primes = {3, 5, 7}

Must use set() to create an empty set instead of {}



In [2]:

    
even_primes = set()

Create some sets for examples



In [4]:

    
ten = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
lows = {0, 1, 2, 3, 4}
odds = {1, 3, 5, 7, 9}

Display values



In [6]:

    
print lows









    



set([0, 1, 2, 3, 4])

Usual set operations work as expected



In [7]:

    
print lows.union(odds)









    



set([0, 1, 2, 3, 4, 5, 7, 9])



In [8]:

    
print lows.intersection(odds)









    



set([1, 3])



In [9]:

    
print lows.difference(odds)









    



set([0, 2, 4])



In [10]:

    
print lows.symmetric_difference(odds)









    



set([0, 2, 4, 5, 7, 9])



In [11]:

    
print lows.issubset(ten)









    



True



In [12]:

    
print lows.issuperset(odds)









    



False



In [13]:

    
print len(odds)



In [15]:

    
print 6 in odds









    



False



In [16]:

    
lows.add(9)
print lows









    



set([0, 1, 2, 3, 4, 9])



In [17]:

    
lows.remove(0)
print lows









    



set([1, 2, 3, 4, 9])



In [18]:

    
lows.clear()
print lows









    



set([])

Re-initialize example sets



In [19]:

    
ten = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
lows = {0, 1, 2, 3, 4}
odds = {1, 3, 5, 7, 9}

Use operators instead of methods



In [20]:

    
print lows - odds # difference









    



set([0, 2, 4])



In [21]:

    
print lows & odds # intersection









    



set([1, 3])



In [22]:

    
print lows <= ten # subset









    



True



In [23]:

    
print lows < ten # strict subset









    



True



In [24]:

    
print lows >= ten # superset









    



False



In [25]:

    
print lows > ten # strict superset









    



False



In [26]:

    
print lows ^ odds # symmetric difference









    



set([0, 2, 4, 5, 7, 9])



In [27]:

    
print lows | odds # union









    



set([0, 1, 2, 3, 4, 5, 7, 9])



In [28]:

    
def unique_atoms(filename):
    atoms = set()
    with open(filename, 'r') as source:
        for line in source:
            name = line.strip()
            atoms.add(name)
    return atoms



In [2]:

    
!cat some_atoms.txt









    



Na
Fe
Na
Si
Pd
Na



In [30]:

    
print unique_atoms('some_atoms.txt')









    



set(['Na', 'Si', 'Fe', 'Pd'])

Can construct a set from anything that can be looped over



In [31]:

    
print set('lithium')









    



set(['i', 'h', 'm', 'l', 'u', 't'])

But why are letters out of order?

Key Points

Use sets to store distinct unique values.
Create sets using set() or {v1, v2, ...}.
Sets are mutable, i.e., they can be updated in place like lists.
A loop over a set produces each element once, in arbitrary order.
Use sets to find unique things.