This lesson is adapted from Martin Jones's "Python for Biologists".
Why do we need lists and loops?
Think back over the exercises that we’ve seen in the previous two sections – they’ve all involved dealing with one bit of information at a time. In section 2, we used string manipulation tools to process single sequences, and in section 3, we practised reading and writing files one at a time. The closest we got to using multiple pieces of data was during the final exercise in section 3, where we were dealing with three DNA sequences.
If that’s all that Python allowed us to do, it wouldn’t be a very helpful tool for biology. In fact, there’s a good chance that you’re working through this course because you want to be able to write programs to help you deal with large datasets. A very common situation in biological research is to have a large collection of data (DNA sequences, SNP positions, gene expression measurements) that all need to be processed in the same way. In this section, we’ll learn about the fundamental programming tools that will allow our programs to do this.
So far we have learned about several different data types (strings, numbers, and file objects), all of which store a single bit of information. When we’ve needed to store multiple bits of information (for example, the three DNA sequences in the section 3 exercises) we have simply created more variables to hold them:
In [10]:
# set the values of all the sequence variables
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"
Types of collections
string - immutable (cannot be changed) - ordered
list - mutable (can be changed) - ordered
dictionary - mutable - accessed by key rather than by position (insertion order is kept since Python 3.7)
The limitations of this approach became clear quite quickly as we looked at the solution code – it only worked because the number of sequences was small, and we knew the number in advance. If we were to repeat the exercise with three hundred or three thousand sequences, the vast majority of the code would be given over to storing variables and it would become completely unmanageable. And if we were to try to write a program that could process an unknown number of input sequences (for instance, by reading them from a file), we wouldn’t be able to do it. To make our programs able to process multiple pieces of data, we need an entirely new type of structure which can hold many pieces of information at the same time – a list.
We’ve also dealt exclusively with programs whose statements are executed from top to bottom in a very straightforward way. This has great advantages when first starting to think about programming – it makes it very easy to follow the flow of a program. The downside of this sequential style of programming, however, is that it leads to very redundant code, like the code we saw at the end of the previous section:
In [11]:
# make three files to hold the output
output_1 = open(header_1 + ".fasta", "w")
output_2 = open(header_2 + ".fasta", "w")
output_3 = open(header_3 + ".fasta", "w")
Again, it was only possible to solve the exercise in this manner because we knew in advance the number of output files we were going to need. Looking at the code, it’s clear that these three lines consist of essentially the same statement being executed multiple times, with some slight variations. This idea of repetition-with-variation is incredibly common in programming problems, and Python has built-in tools for expressing it – loops.
To make a new list, we put several strings or numbers inside square brackets, separated by commas:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]
Each individual item in a list is called an element. To get a single element from the list, write the variable name followed by the index of the element you want in square brackets:
In [1]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]
In [2]:
print(apes[0])
In [4]:
# index 2 is the third element, because counting starts at zero
third_site = conserved_sites[2]
third_site
Out[4]:
132
If we want to go in the other direction – i.e. we know which element we want but we don’t know the index – we can use the index method:
In [5]:
type(apes)
Out[5]:
list
In [6]:
# apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
chimp_index = apes.index("Pan troglodytes")
chimp_index
Out[6]:
1
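As a small sketch of lookups in both directions, the cell below goes from an index to an element and back again. It also hints at what happens when the element is missing: index raises a ValueError in that case, so one cautious option is to test membership with in first. The name Pongo abelii is only an illustrative value that is deliberately absent from the list.

```python
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

# index -> element
second_ape = apes[1]

# element -> index
chimp_index = apes.index("Pan troglodytes")
print(second_ape + " is at index " + str(chimp_index))

# illustrative name that is not in the list
query = "Pongo abelii"

# "in" gives True or False; calling apes.index(query) here would raise ValueError
query_found = query in apes
print(query + " found: " + str(query_found))
```

Checking with in before calling index avoids the error at the cost of searching the list twice, which is rarely a concern for short lists like this one.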
Remember that in Python we start counting from zero rather than one, so the first element of a list is always at index zero. If we give a negative number, Python starts counting from the end of the list rather than the beginning – so it’s easy to get the last element from a list:
In [8]:
last_ape = apes[-1]
last_ape
Out[8]:
'Gorilla gorilla'
What if we want to get more than one element from a list? We can give a start and stop position, separated by a colon, to specify a range of elements:
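In [11]:
ranks = ["kingdom","phylum", "class", "order", "family"]
# the stop index (3) is excluded, so this slice holds only the element at index 2
lower_ranks = ranks[2:3]
lower_ranks
Out[11]:
['class']
In [12]:
# a single index returns the element itself rather than a list
ranks[2]
Out[12]:
'class'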
Does this look familiar? It’s the exact same notation that we used to get substrings back in section 2, and it works in exactly the same way – numbers are inclusive at the start and exclusive at the end. The fact that we use the same notation for strings and lists hints at a deeper relationship between the two types. In fact, what we were doing when extracting substrings in section 2 was treating a string as though it were a list of characters. This idea – that we can treat a variable as though it were a list when it’s not – is a powerful one in Python and we’ll come back to it later in this section (and also in the chapter on iterators in Advanced Python for Biologists).
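To make the shared notation concrete, here is a short sketch that applies the same slices to a list and to a string. The dna value is made-up example data; leaving out the start or stop position is standard Python slice behaviour rather than something specific to this lesson.

```python
ranks = ["kingdom", "phylum", "class", "order", "family"]
dna = "ATCGTACG"  # made-up example sequence

# start is inclusive, stop is exclusive
lower_ranks = ranks[2:5]   # elements at index 2, 3 and 4
first_codon = dna[0:3]     # characters at index 0, 1 and 2

# leaving out start or stop means "from the beginning" or "to the end"
higher_ranks = ranks[:2]
last_two = ranks[-2:]

print(lower_ranks)
print(first_codon)
print(higher_ranks)
print(last_two)
```

Slicing a list gives back a list and slicing a string gives back a string, so the result always has the same type as the thing being sliced.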
To add another element onto the end of an existing list, we can use the append method:
In [14]:
ape_list = []
In [15]:
type(ape_list)
Out[15]:
list
In [16]:
apes = []
apes.append("Homo sapiens")
apes.append("Pan troglodytes")
apes.append("Gorilla gorilla")
apes.append("Pan paniscus")
apes
Out[16]:
['Homo sapiens', 'Pan troglodytes', 'Gorilla gorilla', 'Pan paniscus']
append is an interesting method because it actually changes the variable on which it’s used – in the above example, the apes list starts out empty and gains one element with each call, ending up with four. We can get the length of a list by using the len function, just like we did for strings:
In [18]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
# this line raises a TypeError: len returns an integer, which + cannot join to a string
print("There are " + len(apes) + " apes")
In [19]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
print("There are " + str(len(apes)) + " apes") # cast the integer to a string
In [20]:
apes.append("Pan paniscus")
print("Now there are " + str(len(apes)) + " apes")
The output shows that the number of elements in apes really has changed:
There are 3 apes
Now there are 4 apes
We can concatenate two lists just as we did with strings, by using the plus symbol:
In [21]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]
primates = apes + monkeys
print(str(len(apes)) + " apes")
print(str(len(monkeys)) + " monkeys")
print(str(len(primates)) + " primates")
In [22]:
primates
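Out[22]:
['Homo sapiens', 'Pan troglodytes', 'Gorilla gorilla', 'Papio ursinus', 'Macaca mulatta']
As we can see from the output, this doesn’t change either of the two original lists – it makes a brand new list which contains the elements from both. apes still has three elements, monkeys still has two, and primates has five.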
If we want to add elements from a list onto the end of an existing list, changing it in the process, we can use the extend method. extend behaves like append but takes a list as its argument rather than a single element.
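The difference between the two methods is easiest to see side by side. The following is a minimal sketch; the variable names primates and nested are chosen here for illustration, and list(apes) is used only to make a copy so that the original list stays untouched.

```python
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]

# extend adds each element of monkeys onto the end of the list
primates = list(apes)   # copy, so that apes itself is left unchanged
primates.extend(monkeys)
print(str(len(primates)) + " elements after extend")

# append adds its argument as ONE element, even when that argument is a list
nested = list(apes)
nested.append(monkeys)
print(str(len(nested)) + " elements after append")
print(nested[-1])
```

After extend the list has five elements, while after append it has four, the last of which is itself a list of two monkeys.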
Here are two more list methods that change the variable they’re used on: reverse and sort. Both reverse and sort work by changing the order of the elements in the list. If we want to print out a list to see how this works, we need to use str (just as we did when printing out numbers):
In [1]:
ranks = ["kingdom","phylum", "class", "order", "family"]
print("at the start : " + str(ranks))
In [2]:
help(ranks.insert)
In [24]:
ranks.reverse()
print("after reversing : " + str(ranks))
In [25]:
ranks.sort()
print("after sorting : " + str(ranks))
If we take a look at the output, we can see how the order of the elements in the list is changed by these two methods:
In [ ]:
at the start : ['kingdom', 'phylum', 'class', 'order', 'family']
after reversing : ['family', 'order', 'class', 'phylum', 'kingdom']
after sorting : ['class', 'family', 'kingdom', 'order', 'phylum']
By default, Python sorts strings in alphabetical order and numbers in ascending numerical order.
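A brief sketch of the default ordering and two common variations follows. The sorted function and the reverse=True argument are standard Python but are not covered in the text above, so treat them as additions. Note also that "alphabetical" is a simplification: strings are compared character by character, and uppercase letters sort before lowercase ones.

```python
conserved_sites = [132, 24, 56]
ranks = ["kingdom", "phylum", "class", "order", "family"]

# sort() changes the list in place
conserved_sites.sort()
print(conserved_sites)

# sorted() leaves the original list alone and returns a new, sorted list
sorted_ranks = sorted(ranks)
print(sorted_ranks)
print(ranks)

# reverse=True sorts in descending order
conserved_sites.sort(reverse=True)
print(conserved_sites)
```

Using sorted is a reasonable choice whenever the original order still matters later in the program.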
Imagine we wanted to take our list of apes:
In [ ]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
and print out each element on a separate line, like this:
Homo sapiens is an ape
Pan troglodytes is an ape
Gorilla gorilla is an ape
One way to do it would be to just print each element separately:
In [ ]:
print(apes[0] + " is an ape")
print(apes[1] + " is an ape")
print(apes[2] + " is an ape")
but this is very repetitive and relies on us knowing the number of elements in the list. What we need is a way to say something along the lines of “for each element in the list of apes, print out the element, followed by the words ‘ is an ape’”. Python’s loop syntax allows us to express those instructions like this:
In [28]:
for ape in apes:
    print(ape + " is an ape")
In [29]:
ape
Out[29]:
'Gorilla gorilla'
Let’s take a moment to look at the different parts of this loop. We start by writing for x in y, where y is the name of the list we want to process and x is the name we want to use for the current element each time round the loop.
x is just a variable name (so it follows all the rules that we’ve already learned about variable names), but it behaves slightly differently to all the other variables we’ve seen so far. In all previous examples, we create a variable and store something in it, and then the value of that variable doesn’t change unless we change it ourselves. In contrast, when we create a variable to be used in a loop, we don’t set its value – the value of the variable will be automatically set to each element of the list in turn, and it will be different each time round the loop.
Importantly, the loop variable x does not disappear when the loop ends. Python has no separate scope for the body of a loop, so once the loop has finished running for the last time, the variable still exists and holds the last element that was assigned to it (as the cell above shows for ape). It is best not to rely on this, and to treat the loop variable as meaningful only inside the loop. The region of code in which a variable can be used is called the variable’s scope – we will see several more examples later in the book.
This first line of the loop ends with a colon, and all the subsequent lines (just one, in this case) are indented. Indented lines can start with any number of tab or space characters, but they must all be indented in the same way. This pattern – a line which ends with a colon, followed by some indented lines – is very common in Python, and we’ll see it in several more places throughout this book. A group of indented lines is often called a block of code.
In this case, we refer to the indented block as the body of the loop, and the lines inside it will be executed once for each element in the list. To refer to the current element, we use the variable name that we wrote in the first line. The body of the loop can contain as many lines as we like, and can include all the functions and methods that we’ve learned about, with one important exception: we should not change the list while inside the body of the loop. Python will not stop us from doing so, but adding or removing elements during iteration can cause elements to be skipped or processed twice.
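One way to stay clear of that exception is to leave the original list alone and collect results in a second list. The sketch below assumes we want upper-case versions of the names; the variable name upper_apes is chosen here for illustration.

```python
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla", "Pan paniscus"]

# leave apes untouched during the loop and build up a new list instead
upper_apes = []
for ape in apes:
    upper_apes.append(ape.upper())

print(upper_apes)
print(str(len(apes)) + " apes in the original list")
```

Because only upper_apes is modified inside the body, the list being iterated over keeps the same length and order from the first iteration to the last.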
Here’s an example of a loop with a more complicated body:
In [34]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")
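The body of the loop in the code above has four statements, two of which are print statements, so each time round the loop we’ll get two lines of output. If we look at the output we can see all six lines:
Homo sapiens is an ape. Its name starts with H
Its name has 12 letters
Pan troglodytes is an ape. Its name starts with P
Its name has 15 letters
Gorilla gorilla is an ape. Its name starts with G
Its name has 15 letters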
Why is the above approach better than printing out these six lines in six separate statements? Well, for one thing, there’s much less redundancy – here we only needed to write two print statements. This also means that if we need to make a change to the code, we only have to make it once rather than three separate times. Another benefit of using a loop here is that if we want to add some elements to the list, we don’t have to touch the loop code at all. Consequently, it doesn’t matter how many elements are in the list, and it’s not a problem if we don’t know how many are going to be in it at the time when we write the code. Many problems that can be solved with loops can also be solved using a tool called list comprehensions – see the chapter on comprehensions in Advanced Python for Biologists.
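As a rough illustration of that last point, the sketch below computes the length of every sequence in a list twice: once with a loop and append, and once with a list comprehension. The sequences are made-up example data, and the comprehension syntax is shown only as a preview.

```python
sequences = ["ATCGTACGATCG", "ACTGATCG", "ACTGACACTGT"]  # made-up example data

# loop version: start with an empty list and append one result per element
lengths = []
for seq in sequences:
    lengths.append(len(seq))

# the same result written as a list comprehension
lengths_again = [len(seq) for seq in sequences]

print(lengths)
print(lengths == lengths_again)
```

Neither version mentions how many sequences there are, so both keep working if elements are added to or removed from sequences.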
Unfortunately, introducing tools like loops that require an indented block of code also introduces the possibility of a new type of error – an IndentationError. Notice what happens when the indentation of one of the lines in the block does not match the others:
In [ ]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
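    name_length = len(ape)
   first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")
Here the first_letter line is indented by three spaces while the other lines in the block use four. When we run this code, we get an error message before the program even starts to run:
IndentationError: unindent does not match any outer indentation level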
When you encounter an IndentationError, go back to your code and double-check that all the lines in the block match up. Also double-check that you are using either tabs or spaces for indentation, not both. The easiest way to do this, as mentioned in section 1, is to enable tab emulation in your text editor.
Using a string as a list
We’ve already seen how a string can pretend to be a list – we can use list index notation to get individual characters or substrings from inside a string. Can we also use loop notation to process a string as though it were a list? Yes – if we write a loop statement with a string in the position where we’d normally find a list, Python treats each character in the string as a separate element. This allows us to very easily process a string one character at a time:
In [ ]:
name = "martin"
for character in name:
    print("one character is " + character)
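In this case, we’re just printing each individual character:
one character is m
one character is a
one character is r
one character is t
one character is i
one character is n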
The process of repeating a set of instructions for each element of a list (or character in a string) is called iteration, and we often talk about iterating over a list or string.
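To tie iteration over strings back to lists, here is a small sketch that iterates over a DNA string and collects each character into a list. The dna value is made-up example data, and the final comparison with list(dna) is included only to show that the built-in list function performs the same iteration for us.

```python
dna = "ATCGTA"  # made-up example sequence

# iterate over the string: base takes the value of each character in turn
bases = []
for base in dna:
    bases.append(base)

print(bases)

# list() iterates over its argument in the same way
print(bases == list(dna))
```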
So far in this section, all our lists have been written manually. However, there are plenty of functions and methods in Python that produce lists as their output. One such method that is particularly interesting to biologists is the split method which works on strings. split takes a single argument, called the delimiter, and splits the original string wherever it sees the delimiter, producing a list. Here’s an example:
In [ ]:
names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(str(species))
We can see from the output that the string has been split wherever there was a comma, leaving us with a list of strings:
['melanogaster', 'simulans', 'yakuba', 'ananassae']
Of course, once we’ve created a list in this way we can iterate over it using a loop, just like any other list.
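A minimal sketch of that combination: split the string, then loop over the resulting list. Prefixing each name with the genus Drosophila is an assumption made purely for illustration, as is the variable name full_names.

```python
names = "melanogaster,simulans,yakuba,ananassae"

# split produces a list, which we can iterate over straight away
full_names = []
for species in names.split(","):
    full_name = "Drosophila " + species  # genus prefix added for illustration
    full_names.append(full_name)
    print(full_name)
```

The loop never refers to the number of species, so the same code handles a string containing four names or four hundred.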
Another very useful thing that we can iterate over is a file. Just as a string can pretend to be a list for the purposes of looping, a file object can do the same trick. When we treat a string as a list, each character becomes an individual element, but when we treat a file as a list, each line becomes an individual element. This makes processing a file line-by-line very easy:
In [ ]:
file = open("some_input.txt")
for line in file:
    # do something with the line; here we simply print it
    print(line)
file.close()
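Because some_input.txt may not exist on your computer, here is a self-contained sketch that first writes a small file and then iterates over it line by line. The file contents are made-up example data, and the with statement and tempfile module are standard Python that this lesson has not introduced; they are used here only so that the example can run on its own and close its files automatically.

```python
import os
import tempfile

# create a small input file so that the example can run on its own
folder = tempfile.mkdtemp()
path = os.path.join(folder, "some_input.txt")
with open(path, "w") as output:
    output.write("ATCGTACG\nACTGATCG\nACTGAC\n")

# iterate over the file: each line becomes one element
line_lengths = []
with open(path) as input_file:
    for line in input_file:
        sequence = line.rstrip("\n")  # remove the newline at the end of the line
        line_lengths.append(len(sequence))

print(line_lengths)
```

Each line read from a file still carries its trailing newline, which is why the sketch strips it before measuring the length.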
A quick warning: when you’re writing a program