Source: This materials is adapted from Python for Biologists and Learn Python 3 in Y Minutes.
You can read more about dictionaries and tuples in the Python for Everyone book.
Suppose we want to count the number of As in a DNA sequence. Carrying out the calculation is quite straightforward – in fact it’s one of the first things we did in talking about strings.
In [15]:
dna = "ATCGATCGATCGTACGCTGA"
a_count = dna.count("A")
How will our code change if we want to generate a complete list of base counts for the sequence? We’ll add a new variable for each base:
In [16]:
dna = "ATCGATCGATCGTACGCTGA"
a_count = dna.count("A")
t_count = dna.count("T")
g_count = dna.count("G")
c_count = dna.count("C")
and now our code is starting to look rather repetitive. It’s not too bad for the four individual bases, but what if we want to generate counts for the 16 dinucleotides:
In [17]:
dna = "ATCGATCGATCGTACGCTGA"
aa_count = dna.count("AA")
at_count = dna.count("AT")
ag_count = dna.count("AG")
# ...etc...
or the 64 trinucleotides:
In [18]:
dna = "ATCGATCGATCGTACGCTGA"
aaa_count = dna.count("AAA")
aat_count = dna.count("AAT")
aag_count = dna.count("AAG")
# ...etc...
For trinucleotides and longer, the situation is particularly bad. The DNA sequence is 20 bases long, so it only contains 18 overlapping trinucleotides in total. This means that we’ll end up with 64 different variables, at least 46 of which will hold the value zero.
One possible way round this is to store the values in a list. If we use three nested loops, we can generate all possible trinucleotides, calculate the count for each one, and store all the counts in a list:
In [19]:
dna = "AATGATCGATCGTACGCTGA"
all_counts = []
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
trinucleotide = base1 + base2 + base3
count = dna.count(trinucleotide)
print("count is " + str(count) + " for " + trinucleotide)
all_counts.append(count)
print(all_counts)
Although the code is above is quite compact, and doesn’t require huge numbers of variables, the output shows two problems with this approach:
Firstly, the data are still incredibly sparse – the vast majority of the counts are zero. Secondly, the counts themselves are now disconnected from the trinucleotides. If we want to look up the count for a single trinucleotide – for example, TGA – we first have to figure out that TGA was the 25th trinucleotide generated by our loops. Only then can we get the element at the correct index:
In [20]:
print("count for TGA is " + str(all_counts[24]))
We can try various tricks to get round this problem. What if we generated two lists – one of counts, and one of the trinucleotides themselves?
In [21]:
dna = "AATGATCGATCGTACGCTGA"
all_trinucleotides = []
all_counts = []
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
trinucleotide = base1 + base2 + base3
count = dna.count(trinucleotide)
all_trinucleotides.append(trinucleotide)
all_counts.append(count)
print(all_counts)
print(all_trinucleotides)
Now we have two lists of the same length, with a one-to-one correspondence between the elements:
This allows us to look up the count for a given trinucleotide in a slightly more appealing way – we can look up the index of the trinucleotide in the all_trinucleotides list, then get the count at the same index in the all_counts list:
In [22]:
i = all_trinucleotides.index('TGA')
c = all_counts[i]
print('count for TGA is ' + str(c))
This is a little bit nicer, but still has major drawbacks. We’re still storing all those zeros, and now we have two lists to keep track of. We need to be incredibly careful when manipulating either of the two lists to make sure that they stay perfectly synchronized – if we make any change to one list but not the other, then there will no longer be a one-to-one correspondence between elements and we’ll get the wrong answer when we try to look up a count.
This approach is also slow1 . To find the index of a given trinucleotide in the all_trinucleotides list, Python has to look at each element one at a time until it finds the one we’re looking for. This means that as the size of the list grows2 , the time taken to look up the count for a given element will grow alongside it.
If we take a step back and think about the problem in more general terms, what we need is a way of storing pairs of data (in this case, trinucleotides and their counts) in a way that allows us to efficiently look up the count for any given trinucleotide. This problem of storing paired data is incredibly common in programming. We might want to store:
protein sequence names and their sequences DNA restriction enzyme names and their motifs codons and their associated amino acid residues colleagues’ names and their email addresses sample names and their co-ordinates words and their definitions
All these are examples of what we call key-value pairs. In each case we have pairs of keys and values:
Key Value trinucleotide count name protein sequence name restriction enzyme motif codon amino acid residue name email address sample coordinates word definition
The last example in this table – words and their definitions – is an interesting one because we have a tool in the physical world for storing this type of data – a dictionary. Python’s tool for solving this type of problem is also called a dictionary (usually abbreviated to dict) and in this section we’ll see how to create and use them.
The syntax for creating a dictionary is similar to that for creating a list, but we use curly brackets rather than square ones. Each pair of data, consisting of a key and a value, is called an item. When storing items in a dictionary, we separate them with commas. Within an individual item, we separate the key and the value with a colon. Here’s a bit of code that creates a dictionary of restriction enzymes (using data from the previous section) with three items:
In [23]:
enzymes = { 'EcoRI':r'GAATTC', 'AvaII':r'GG(A|T)CC', 'BisI':'GC[ATGC]GC' }
In this case, the keys and values are both strings3 . Splitting the dictionary definition over several lines makes it easier to read:
In [24]:
enzymes = {
'EcoRI' : r'GAATTC',
'AvaII' : r'GG(A|T)CC',
'BisI' : r'GC[ATGC]GC'
}
but doesn’t affect the code at all. To retrieve a bit of data from the dictionary – i.e. to look up the motif for a particular enzyme – we write the name of the dictionary, followed by the key in square brackets:
In [25]:
print(enzymes['BisI'])
The code looks very similar to using a list, but instead of giving the index of the element we want, we’re giving the key for the value that we want to retrieve.
Dictionaries are a very useful way to store data, but they come with some restrictions. The only types of data we are allowed to use as keys are strings and numbers4 , so we can’t, for example, create a dictionary where the keys are file objects. Values can be whatever type of data we like. Also, keys must be unique – we can’t store multiple values for the same key. You might think that this makes dicts less useful, but there are ways round the problem of storing multiple values – we won’t need them for the examples in this chapter, but the chapter on complex data structures in Advanced Python for Biologists gives details.
In real-life programs, it’s relatively rare that we’ll want to create a dictionary all in one go like in the example above. More often, we’ll want to create an empty dictionary, then add key/value pairs to it (just as we often create an empty list and then add elements to it).
To create an empty dictionary we simply write a pair of curly brackets on their own, and to add elements, we use the square-brackets notation on the left-hand side of an assignment. Here’s a bit of code that stores the restriction enzyme data one item at a time:
In [26]:
enzymes = {}
enzymes['EcoRI'] = r'GAATTC'
enzymes['AvaII'] = r'GG(A|T)CC'
enzymes['BisI'] = r'GC[ATGC]GC'
We can delete a key from a dictionary using the pop method. pop actually returns the value and deletes the key at the same time:
In [27]:
enzymes = {
'EcoRI' : r'GAATTC',
'AvaII' : r'GG(A|T)CC',
'BisI' : r'GC[ATGC]GC'
}
# remove the EcoRI enzyme from the dict
enzymes.pop('EcoRI')
Out[27]:
Let’s take another look at the trinucleotide count example from the start of the section. Here’s how we store the trinucleotides and their counts in a dictionary:
In [28]:
dna = "AATGATCGATCGTACGCTGA"
counts = {}
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
trinucleotide = base1 + base2 + base3
count = dna.count(trinucleotide)
counts[trinucleotide] = count
print(counts)
We can see from the output that the trinucleotides and their counts are stored together in one variable:
{'ACC': 0, 'ATG': 1, 'AAG': 0, 'AAA': 0, 'ATC': 2, 'AAC': 0, 'ATA': 0, 'AGG': 0, 'CCT': 0, 'CTC': 0, 'AGC': 0, 'ACA': 0, 'AGA': 0, 'CAT': 0, 'AAT': 1, 'ATT': 0, 'CTG': 1, 'CTA': 0, 'ACT': 0, 'CAC': 0, 'ACG': 1, 'CAA': 0, 'AGT': 0, 'CAG': 0, 'CCG': 0, 'CCC': 0, 'CTT': 0, 'TAT': 0, 'GGT': 0, 'TGT': 0, 'CGA': 1, 'CCA': 0, 'TCT': 0, 'GAT': 2, 'CGG': 0, 'TTT': 0, 'TGC': 0, 'GGG': 0, 'TAG': 0, 'GGA': 0, 'TAA': 0, 'GGC': 0, 'TAC': 1, 'TTC': 0, 'TCG': 2, 'TTA': 0, 'TTG': 0, 'TCC': 0, 'GAA': 0, 'TGG': 0, 'GCA': 0, 'GTA': 1, 'GCC': 0, 'GTC': 0, 'GCG': 0, 'GTG': 0, 'GAG': 0, 'GTT': 0, 'GCT': 1, 'TGA': 2, 'GAC': 0, 'CGT': 1, 'TCA': 0, 'CGC': 1}
We still have a lot of repetitive counts of zero, but looking up the count for a particular trinucleotide is now very straightforward:
In [29]:
print(counts['TGA'])
We no longer have to worry about either “memorizing” the order of the counts or maintaining two separate lists.
Let’s now see if we can find a way of avoiding storing all those zero counts. We can add an if statement that ensures that we only store a count if it’s greater than zero:
In [30]:
dna = "AATGATCGATCGTACGCTGA"
counts = {}
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
trinucleotide = base1 + base2 + base3
count = dna.count(trinucleotide)
if count > 0:
counts[trinucleotide] = count
print(counts)
When we look at the output from the above code, we can see that the amount of data we’re storing is much smaller – just the counts for the trinucleotides that are greater than zero:
{'ATG': 1, 'ACG': 1, 'ATC': 2, 'GTA': 1, 'CTG': 1, 'CGC': 1, 'GAT': 2, 'CGA': 1, 'AAT': 1, 'TGA': 2, 'GCT': 1, 'TAC': 1, 'TCG': 2, 'CGT': 1}
Now we have a new problem to deal with. Looking up the count for a given trinucleotide works fine when the count is positive:
In [31]:
print(counts['TGA'])
But when the count is zero, the trinucleotide doesn’t appear as a key in the dictionary:
In [32]:
print(counts['AAA'])
so we will get a KeyError when we try to look it up:
There are two possible ways to fix this. We can check for the existence of a key in a dictionary (just like we can check for the existence of an element in a list), and only try to retrieve it once we know it exists:
In [33]:
if 'AAA' in counts:
print(counts('AAA'))
Alternatively, we can use the dictionary’s get method. get usually works just like using square brackets: the following two lines do exactly the same thing:
In [34]:
print(counts['TGA'])
print(counts.get('TGA'))
The thing that makes get really useful, however, is that it can take an optional second argument, which is the default value to be returned if the key isn’t present in the dictionary. In this case, we know that if a given trinucleotide doesn’t appear in the dictionary then its count is zero, so we can give zero as the default value and use get to print out the count for any trinucleotide:
In [35]:
print("count for TGA is " + str(counts.get('TGA', 0)))
print("count for AAA is " + str(counts.get('AAA', 0)))
print("count for GTA is " + str(counts.get('GTA', 0)))
print("count for TTT is " + str(counts.get('TTT', 0)))
As we can see from the output, we now don’t have to worry about whether or not each trinucleotide appears in the dictionary – get takes care of everything and returns zero when appropriate:
What if, instead of looking up a single item from a dictionary, we want to do something for all items? For example, imagine that we wanted to take our counts dictionary variable from the code above and print out all trinucleotides where the count was 2. One way to do it would be to use our three nested loops again to generate all possible trinucleotides, then look up the count for each one and decide whether or not to print it:
In [37]:
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
trinucleotide = base1 + base2 + base3
if counts.get(trinucleotide, 0) == 2:
print(trinucleotide)
As we can see from the output, this works perfectly well:
But it seems inefficient to go through the whole process of generating all possible trinucleotides again, when the information we want – the list of trinucleotides – is already in the dictionary. A better approach would be to read the list of keys directly from the dictionary, which is what the keys method does.
When used on a dictionary, the keys method returns a list of all the keys in the dictionary:
In [38]:
print(counts.keys())
Looking at the output5 confirms that this is the list of trinucleotides we want to consider (remember that we’re looking for trinucleotides with a count of two, so we don’t need to consider ones that aren’t in the dictionary as we already know that they have a count of zero):
['ATG', 'ACG', 'ATC', 'GTA', 'CTG', 'CGC', 'GAT', 'CGA', 'AAT', 'TGA', 'GCT', 'TAC', 'TCG', 'CGT']
Using keys, our code for printing out all the trinucleotides that appear twice in the DNA sequence becomes a lot more concise:
In [39]:
for trinucleotide in counts.keys():
if counts.get(trinucleotide) == 2:
print(trinucleotide)
This version prints exactly the same set of trinucleotides as the more verbose method:
Before we move on, take a moment to compare the output immediately above this paragraph with the output from the three-loop version from earlier in this section. You’ll notice that while the set of trinucleotides is the same, the order in which they appear is different. This illustrates an important point about dictionaries – they are inherently unordered. That means that when we use the keys method to iterate over a dictionary, we can’t rely on processing the items in the same order that we added them. This is in contrast to lists, which always maintain the same order when looping. If we want to control the order in which keys are printed we can use the sorted method to sort the list before processing it:
for trinucleotide in sorted(counts.keys()): if counts.get(trinucleotide) == 2: print(trinucleotide)
In the example code above, the first thing we need to do inside the loop is to look up the value for the current key. This is a very common pattern when iterating over dictionaries – so common, in fact, that Python has a special shorthand for it. Instead of doing this:
In [40]:
for key in my_dict.keys():
value = my_dict.get(key)
# do something with key and value
We can use the items method to iterate over pairs of data, rather than just keys:
In [ ]:
for key, value in my_dict.items():
# do something with key and value
The items method does something slightly different from all the other methods we’ve seen so far in this book; rather than returning a single value, or a list of values, it returns a list of pairs of values. That’s why we have to give two variable names at the start of the loop. Here’s how we can use the items method to process our dictionary of trinucleotide counts just like before:
In [ ]:
for trinucleotide, count in counts.items():
if count == 2:
print(trinucleotide)
This method is generally preferred for iterating over items in a dictionary, as it makes the intention of the code very clear. Many of the problems that we solve by iterating over dicts can also be solved using comprehensions – there’s a whole chapter devoted to comprehensions in Advanced Python for Biologists, so take a look.
In [41]:
# Tuples are like lists but are immutable.
tup = (1, 2, 3)
tup[0] # => 1
tup[0] = 3 # Raises a TypeError
In [43]:
# Note that a tuple of length one has to have a comma after the last element but
# tuples of other lengths, even zero, do not.
type((1)) # => <class 'int'>
type((1,)) # => <class 'tuple'>
type(()) # => <class 'tuple'>
Out[43]:
In [42]:
# You can do most of the list operations on tuples too
len(tup) # => 3
tup + (4, 5, 6) # => (1, 2, 3, 4, 5, 6)
tup[:2] # => (1, 2)
2 in tup # => True
Out[42]:
In [44]:
# You can unpack tuples (or lists) into variables
a, b, c = (1, 2, 3) # a is now 1, b is now 2 and c is now 3
# You can also do extended unpacking
a, *b, c = (1, 2, 3, 4) # a is now 1, b is now [2, 3] and c is now 4
# Tuples are created by default if you leave out the parentheses
d, e, f = 4, 5, 6
# Now look how easy it is to swap two values
e, d = d, e # d is now 5 and e is now 4
In [45]:
# Dictionaries store mappings
empty_dict = {}
# Here is a prefilled dictionary
filled_dict = {"one": 1, "two": 2, "three": 3}
In [46]:
# Note keys for dictionaries have to be immutable types. This is to ensure that
# the key can be converted to a constant hash value for quick look-ups.
# Immutable types include ints, floats, strings, tuples.
invalid_dict = {[1,2,3]: "123"} # => Raises a TypeError: unhashable type: 'list'
valid_dict = {(1,2,3):[1,2,3]} # Values can be of any type, however.
In [47]:
# Look up values with []
filled_dict["one"] # => 1
Out[47]:
In [48]:
# Get all keys as an iterable with "keys()". We need to wrap the call in list()
# to turn it into a list. We'll talk about those later. Note - Dictionary key
# ordering is not guaranteed. Your results might not match this exactly.
list(filled_dict.keys()) # => ["three", "two", "one"]
Out[48]:
In [49]:
# Get all values as an iterable with "values()". Once again we need to wrap it
# in list() to get it out of the iterable. Note - Same as above regarding key
# ordering.
list(filled_dict.values()) # => [3, 2, 1]
Out[49]:
In [50]:
# Check for existence of keys in a dictionary with "in"
"one" in filled_dict # => True
1 in filled_dict # => False
Out[50]:
In [51]:
# Looking up a non-existing key is a KeyError
filled_dict["four"] # KeyError
In [52]:
# Use "get()" method to avoid the KeyError
filled_dict.get("one") # => 1
filled_dict.get("four") # => None
# The get method supports a default argument when the value is missing
filled_dict.get("one", 4) # => 1
filled_dict.get("four", 4) # => 4
Out[52]:
In [53]:
# "setdefault()" inserts into a dictionary only if the given key isn't present
filled_dict.setdefault("five", 5) # filled_dict["five"] is set to 5
filled_dict.setdefault("five", 6) # filled_dict["five"] is still 5
Out[53]:
In [54]:
# Adding to a dictionary
filled_dict.update({"four":4}) # => {"one": 1, "two": 2, "three": 3, "four": 4}
#filled_dict["four"] = 4 #another way to add to dict
In [55]:
# Remove keys from a dictionary with del
del filled_dict["one"] # Removes the key "one" from filled dict
In [56]:
# Sets store ... well sets
empty_set = set()
# Initialize a set with a bunch of values. Yeah, it looks a bit like a dict. Sorry.
some_set = {1, 1, 2, 2, 3, 4} # some_set is now {1, 2, 3, 4}
In [57]:
# Similar to keys of a dictionary, elements of a set have to be immutable.
invalid_set = {[1], 1} # => Raises a TypeError: unhashable type: 'list'
valid_set = {(1,), 1}
In [58]:
# Can set new variables to a set
filled_set = some_set
In [59]:
# Add one more item to the set
filled_set.add(5) # filled_set is now {1, 2, 3, 4, 5}
In [60]:
# Do set intersection with &
other_set = {3, 4, 5, 6}
filled_set & other_set # => {3, 4, 5}
Out[60]:
In [61]:
# Do set union with |
filled_set | other_set # => {1, 2, 3, 4, 5, 6}
Out[61]:
In [62]:
# Do set difference with -
{1, 2, 3, 4} - {2, 3, 5} # => {1, 4}
Out[62]:
In [63]:
# Do set symmetric difference with ^
{1, 2, 3, 4} ^ {2, 3, 5} # => {1, 4, 5}
Out[63]:
In [64]:
# Check if set on the left is a superset of set on the right
{1, 2} >= {1, 2, 3} # => False
Out[64]:
In [65]:
# Check if set on the left is a subset of set on the right
{1, 2} <= {1, 2, 3} # => True
Out[65]:
In [66]:
# Check for existence in a set with in
2 in filled_set # => True
10 in filled_set # => False
Out[66]:
In [ ]: