Let's start with something simpler than our actual inventory problem. Suppose we have a list of all the atoms in the warehouse, and we want to know which different kinds we have—not how many, but just their types. We could solve this problem using a list to store the unique atomic symbols we have seen. Here's a function to add a new atom to the list:
In [1]:
def another_atom(seen, atom):
for i in range(len(seen)):
if seen[i] == atom:
return # atom is already present, so do not re-add
seen.append(atom)
another_atom
's arguments are a list of the unique atoms we've already
seen, and the symbol of the atom we're adding. Inside the function, we
loop over the atoms that are already in the list. If we find the one
we're trying to add, we exit the function immediately: we aren't
supposed to have duplicates in our list, so there's nothing to add. If
we reach the end of the list without finding this symbol, though, we
append it. This is a common design pattern: either we find pre-existing data
in a loop and return right away, or take some default action if we
finish the loop without finding a match.
Let's watch this function in action. We start with an empty list. If the
first atomic symbol is 'Na'
, we find no match (since the list is
empty), so we add it. The next symbol is 'Fe'
; it doesn't match
'Na'
, so we add it as well. Our third symbol is 'Na'
again. It
matches the first entry in the list, so we exit the function
immediately.
Before | Adding | After |
---|---|---|
[] |
'Na' |
['Na'] |
['Na'] |
'Fe' |
['Na', 'Fe'] |
['Na', 'Fe'] |
'Na' |
['Na', 'Fe'] |
This code works, but it is inefficient. Suppose there are V distinct atomic symbols in our data, and N symbols in total. Each time we add an observation to our list, we have to look through an average of V/2 entries. The total running time for our program is therefore approximately NV/2. If V is small, this is only a few times larger than N, but what happens if we're keeping track of something like patient records rather than atoms? In that case, most values are distinct, so V is approximately the same as N, which means that our running time is proportional to N^2^/2. That's bad news: if we double the size of our data set, our program runs four times slower, and if we double it again, our program will have slowed down by a factor of 16.
There's a better way to solve this problem that is simpler to use and runs much faster. The trick is to use a set to store the symbols. A set is an unordered collection of distinct items. The word "collection" means that a set can hold zero or more values. The word "distinct" means that any particular value is either in the set or not: a set can't store two or more copies of the same thing. And finally, "unordered" means that values are simply "in" the set. They're not in any particular order, and there's no first value or last value. (They actually are stored in some order, but as we'll discuss in the next section, that order is as random as the computer can make it.)
To create a set, we simply write down its elements inside curly braces:
In [2]:
primes = {3, 5, 7}
The set stores references to the three values in no particular order:
However, we have to use set()
to create an empty set, because the
symbol {}
was already being used for something else when sets were
added to Python:
In [3]:
even_primes = set() # not '{}' as in math
We'll meet that "something else" later in this chapter.
To see what we can do with sets, let's create three holding the integers 0 through 9, the first half of that same range of numbers (0 through 4), and the odd values 1, 3, 5, 7, and 9:
In [4]:
ten = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
lows = {0, 1, 2, 3, 4}
odds = {1, 3, 5, 7, 9}
If we ask Python to display one of our sets, it shows us this:
In [5]:
print lows
rather than using the curly-bracket notation. I personally regard this as a design flaw, but it does remind us that we can create always create a set from a list.
Sets have methods just like strings and lists, and, like the methods of strings and lists, most of those methods create new sets instead of modifying the set they are called for. These three come straight from mathematics:
In [6]:
print lows.union(odds)
In [7]:
print lows.intersection(odds)
In [8]:
print lows.difference(odds)
Another method that creates a new set is symmetric_difference
, which
is sometimes called "exclusive or":
In [9]:
print lows.symmetric_difference(odds)
It returns the values that are in one set or another, but not in both.
Not all set methods return new sets. For example, issubset
returns
True
or False
depending on whether all the elements in one set are
present in another:
In [10]:
print lows.issubset(ten)
A complementary method called issuperset
also exists, and does the
obvious thing:
In [11]:
print lows.issuperset(odds)
We can count how many values are in a set using len
(just as we would
to find the length of a list or string), and check whether a particular
value is in the set or not using in
:
In [12]:
print len(odds)
In [13]:
print 6 in odds
Some methods modify the sets they are called for. The most commonly used
is add
, which adds an element to the set:
In [14]:
lows.add(9)
print lows
If the thing being added is already in the set, add
has no effect,
because any specific thing can appear in a set at most once:
In [15]:
lows.add(9)
print lows
This behavior is different from that of list.append
, which always adds
a new element to a list.
Finally, we can remove individual elements from the set:
In [16]:
lows.remove(0)
print lows
or clear it entirely:
In [17]:
lows.clear()
print lows
Removing elements is similar to deleting things from a list, but there's
an important difference. When we delete something from a list, we
specify its location. When we delete something from a set, though, we
must specify the value that we want to take out, because sets are not
ordered. If that value isn't in the set, remove
does nothing.
To help make programs easier to type and read, most of the methods we've
just seen can be written using arithmetic operators as well. For
example, instead of lows.issubset(ten)
, we can write lows <= ten
,
just as if we were using pen and paper. There are even a couple of
operators, like the strict subset test <
, that don't have long-winded
equivalents.
In [18]:
ten = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
lows = {0, 1, 2, 3, 4}
odds = {1, 3, 5, 7, 9}
Operation | As Method | Using Operator |
---|---|---|
difference | lows.difference(odds) |
lows - odds |
intersection | lows.intersection(odds) |
lows & odds |
subset | lows.issubset(ten) |
lows <= ten |
strict subset | lows < ten |
|
superset | lows.issuperset(ten) |
lows >= odds |
strict superset | lows >= odds |
|
exclusive or | lows.symmetric_difference(odds) |
lows ^ odds |
union | lows.union(odds) |
lows | odds |
The fact that the values in a set are distinct makes them a convenient way to get rid of duplicate values, like the "unique atoms" problem at the start of this section. Suppose we have a file containing the names of all the atoms in our warehouse, and our task is to produce a list of the their types. Here's how simple that code is:
In [19]:
def unique_atoms(filename):
atoms = set()
with open(filename, 'r') as source:
for line in source:
name = line.strip()
atoms.add(name)
return atoms
We start by creating an empty set which we will fill with atomic symbols and opening the file containing our data. As we read the lines in the file, we strip off any whitespace (such as the newline character at the end of the line) and put the resulting strings in the set. When we're done, we print the set. If our input is the file:
In [23]:
!cat some_atoms.txt
then our output is:
In [24]:
print unique_atoms('some_atoms.txt')
The right atoms are there, but what are those extra square brackets for?
The answer is that if we want to construct a set with values using
set()
, we have to pass those values in a single object, such as a
list. This syntax does not work:
In [25]:
set('Na', 'Fe', 'Si', 'Pd')
On the other hand,
this means that we can construct a set from almost anything that a for
loop can iterate over:
In [26]:
set('lithium')
Out[26]:
But hang on: if we're adding characters to the set in the order 'l'
,
'i'
, 't'
, 'h'
, 'i'
, 'u'
, 'm'
, why does Python show them in
the order 'i'
, 'h'
, 'm'
, 'l'
, 'u'
, 't'
? To answer that
question, we need to look at how sets are actually stored, and why
they're stored that way.