An introduction to solving biological problems with Python

Session 2.2: Loops

The for loop
Exercises 2.2.1
The while loop
Exercises 2.2.2
Skipping and breaking loops
More looping using range() and enumerate()
Filtering in loops
Exercises 2.2.3

Loops

When an operation needs to be repeated multiple times, for example on all of the items in a list, we avoid having to type (or copy and paste) repetitive code by creating a loop. There are two ways of creating loops in Python: the for loop and the while loop.

The `for` loop

The for loop in Python iterates over each item in a sequence (such as a list or tuple) in the order in which the items appear. This means that a variable (code in the example below) is set to each item of the sequence in turn, and each time this happens the indented block of code is executed again.



In [ ]:

    
codeList = ['NA06984', 'NA06985', 'NA06986', 'NA06989', 'NA06991']

for code in codeList:
    print(code)

A for loop can iterate over the individual characters in a string:



In [ ]:

    
dnaSequence = 'ATGGTGTTGCC'

for base in dnaSequence:
    print(base)

And also over the keys of a dictionary:



In [ ]:

    
rnaMassDict = {"G":345.21, "C":305.18, "A":329.21, "U":302.16}

for x in rnaMassDict:
    print(x, rnaMassDict[x])
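When both the key and the value are needed, the dictionary's items() method yields them as pairs, which avoids the extra lookup inside the loop. The following is a minimal sketch reusing rnaMassDict from above; the list named pairs is an assumption introduced only to collect the results for illustration.

```python
rnaMassDict = {"G": 345.21, "C": 305.18, "A": 329.21, "U": 302.16}

pairs = []  # hypothetical helper list, not part of the original example
for base, mass in rnaMassDict.items():
    # items() yields (key, value) tuples, unpacked here into two variables
    print(base, mass)
    pairs.append((base, mass))
```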

Any variables that are defined before the loop can be accessed from inside the loop. For example, to calculate the sum of the items in a list of values, we could set the total to zero before the loop and add each value to it inside the loop:



In [ ]:

    
total = 0
values = [1, 2, 4, 8, 16]

for v in values:
    total = total + v
    # equivalent shorthand: total += v
    print(total)

print(total)

Naturally we can combine a for loop with an if statement, noting that we need two indentation levels, one for the outer loop and another for the conditional blocks:



In [ ]:

    
geneExpression = {
    'Beta-Catenin': 2.5, 
    'Beta-Actin': 1.7, 
    'Pax6': 0, 
    'HoxA2': -3.2
}

for gene in geneExpression:
    if geneExpression[gene] < 0:
        print(gene, "is downregulated")
        
    elif geneExpression[gene] > 0:
        print(gene, "is upregulated")
        
    else:
        print("No change in expression of", gene)

Exercises 2.2.1

Create a sequence where each element is an individual base of DNA. Make the sequence 15 bases long.
Print the length of the sequence.
Create a for loop to output every base of the sequence on a new line.

The `while` loop

In addition to the for loop that operates on a collection of items, there is a while loop that simply repeats while some expression evaluates to True and stops when it evaluates to False. Note that if the tested expression never evaluates to False then you have an “infinite loop”: the program keeps running until it is interrupted by hand.

In this example we generate a series of numbers by doubling a value after each iteration, until a limit is reached:



In [ ]:

    
value = 0.25
while value < 8:
    value = value * 2
    print(value)

print("final value:", value)

What's going on here is that the value is doubled in each iteration, and once it reaches 8 the while test fails (8 is not less than 8), so that last value is preserved. Note that if the test were instead value <= 8 then we would get one more doubling and the value would reach 16.
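The effect of changing the test can be checked directly. This is a small sketch of the same loop with <= in place of <; nothing else is changed from the example above.

```python
value = 0.25
while value <= 8:       # <= lets the loop body run once more when value is exactly 8
    value = value * 2
    print(value)

print("final value:", value)   # the last doubling takes the value from 8 to 16
```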

Exercises 2.2.2

Reuse the 15-base sequence created in the previous exercise, where each element is an individual base of DNA.
Create a while loop similar to the one above that starts at the third base in the sequence and outputs every third base until the 12th.

Skipping and breaking loops

Python has two ways of affecting the flow of the for or while loop inside the block. The continue statement means that the rest of the code in the block is skipped for this particular item in the collection, i.e. jump to the next iteration. In this example negative numbers are left out of a summation:



In [ ]:

    
values = [10, -5, 3, -1, 7]

total = 0
for v in values:
    if v < 0:
        continue # Skip this iteration   
    total += v

print(total)

The other way of affecting a loop is with the break statement. In contrast to the continue statement, this immediately ends the loop that contains it (the innermost one, if loops are nested), and execution resumes at the next statement after that loop.



In [ ]:

    
geneticCode = {'TAT': 'Tyrosine',  'TAC': 'Tyrosine',
               'CAA': 'Glutamine', 'CAG': 'Glutamine',
               'TAG': 'STOP'}

sequence = ['CAG','TAC','CAA','TAG','TAC','CAG','CAA']

for codon in sequence:
    if geneticCode[codon] == 'STOP':
        break            # Quit looping at this point
    else:
        print(geneticCode[codon])
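When one loop sits inside another, break only leaves the loop it is written in. The sketch below illustrates this with the same geneticCode dictionary; the two short codon lists in sequences and the translated list are made up for this illustration and are not part of the original example.

```python
geneticCode = {'TAT': 'Tyrosine',  'TAC': 'Tyrosine',
               'CAA': 'Glutamine', 'CAG': 'Glutamine',
               'TAG': 'STOP'}

# Hypothetical input: two codon lists, each containing a stop codon
sequences = [['CAG', 'TAG', 'TAC'],
             ['TAT', 'CAA', 'TAG']]

translated = []
for sequence in sequences:
    for codon in sequence:
        if geneticCode[codon] == 'STOP':
            break                      # leaves the inner loop only
        translated.append(geneticCode[codon])
    # execution resumes here and the outer loop moves on to the next list

print(translated)
```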

Looping gotchas

When a for loop iterates over a list, an internal counter keeps track of which item is used next, and this counter is incremented on each iteration. When the counter reaches the length of the list, the loop terminates. This means that if you delete the current item from the list, the next item will be skipped (since it takes over the index of the current item, which has already been treated). Likewise, if you insert an item before the current item, the current item will be treated again the next time through the loop. This can lead to nasty bugs, which can be avoided by looping over a temporary copy made with a slice of the whole list.

**When looping, never modify the collection!** Always create a copy of it first.
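The skipping behaviour is easiest to see on a small list. The following sketch uses a made-up list of bases (the names bases, buggy and safe are assumptions for illustration) and removes every 'G', first while looping over the list itself and then while looping over a slice copy.

```python
bases = ['A', 'G', 'G', 'C', 'G']

# Problematic: items are removed from the very list being looped over
buggy = list(bases)
for base in buggy:
    if base == 'G':
        buggy.remove(base)   # the list shrinks, so the following item is skipped
print(buggy)                 # a 'G' is left behind

# Safer: loop over a slice copy and modify the original list
safe = list(bases)
for base in safe[:]:         # safe[:] is a temporary copy of the whole list
    if base == 'G':
        safe.remove(base)
print(safe)                  # all 'G' items are gone
```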

More looping

Using `range()`

If you would like to iterate over a numeric sequence then this is possible by combining the range() function and a for loop.



In [ ]:

    
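print(list(range(10)))         # stop only: 0 up to, but not including, 10

print(list(range(5, 10)))      # start and stop: 5, 6, 7, 8, 9

print(list(range(0, 10, 3)))   # start, stop and step: 0, 3, 6, 9

print(list(range(7, 2, -2)))   # a negative step counts down: 7, 5, 3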

Looping through ranges



In [ ]:

    
for x in range(8):
    print(x*x)



In [ ]:

    
squares = []
for x in range(8):
    s = x*x
    squares.append(s)
    
print(squares)

Using `enumerate()`

Given a sequence, enumerate() allows you to iterate over it while generating, for each item, a tuple containing the item's index followed by its value.



In [ ]:

    
letters = ['A','C','G','T']
for index, letter in enumerate(letters):
    print(index, letter)



In [ ]:

    
numbered_letters = list(enumerate(letters))
print(numbered_letters)
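By default the index starts at 0. enumerate() also accepts an optional start argument, which can be convenient when positions should be counted from 1, as is common for sequence positions. A brief sketch; the positions list is an assumption added only to record the indices that were produced.

```python
letters = ['A', 'C', 'G', 'T']

positions = []  # hypothetical helper list for illustration
for position, letter in enumerate(letters, start=1):
    # numbering begins at 1 instead of the default 0
    print(position, letter)
    positions.append(position)
```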

Filtering in loops



In [ ]:

    
city_pops = {
    'London': 8200000,
    'Cambridge': 130000,
    'Edinburgh': 420000,
    'Glasgow': 1200000
}

big_cities = []
for city in city_pops:
    if city_pops[city] >= 1000000:
         big_cities.append(city)

print(big_cities)



In [ ]:

    
total = 0
for city in city_pops:
    total += city_pops[city]
print("total population:", total)



In [ ]:

    
pops = list(city_pops.values())
print("total population:", sum(pops))

Formatting strings

Constructing more complex strings from a mix of variables of different types can be cumbersome, and sometimes you want more control over how values are interpolated into a string. Python provides a powerful mechanism for this in the built-in string method .format(). It uses "replacement fields" surrounded by curly braces {}, each of which starts with an optional field name, followed by a colon : and a format specification.

There are lots of these specifiers, but here are 3 useful ones:

d: decimal integer
f: floating point number
s: string

You can specify the number of decimal places to use for a floating point number with, e.g., .2f to show 2 decimal places, or +.2f to show 2 decimal places and always display the sign.



In [ ]:

    
print('{:.2f}'.format(0.4567))
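The three specifiers listed above can be combined in a single format string. This is a minimal sketch; the variable names and values (sample, count, ratio) are made up for illustration.

```python
sample = 'NA06984'   # hypothetical values, chosen only for this example
count = 42
ratio = 0.4567

# s for the string, d for the integer, f for the float (with and without a forced sign)
line = '{:s} {:d} {:.2f} {:+.2f}'.format(sample, count, ratio, ratio)
print(line)
```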



In [ ]:

    
geneExpression = {
    'Beta-Catenin': 2.5, 
    'Beta-Actin': 1.7, 
    'Pax6': 0, 
    'HoxA2': -3.2
}

for gene in geneExpression:
    print('{:s}\t{:+.2f}'.format(gene, geneExpression[gene])) # s is optional
    # could also be written using variable names
    #print('{gene:s}\t{exp:+.2f}'.format(gene=gene, exp=geneExpression[gene]))

Exercises 2.2.3

Let's calculate the GC content of a DNA sequence. Use the 15-base sequence you created for the exercises above. Create a variable, gc, which we will use to count the number of Gs or Cs in our sequence.
Output every base of the sequence alongside its index on a new line.
Create a loop to iterate over the bases in your sequence. If the base is a G or the base is a C, add one to your gc variable.
When the loop is done, divide the number of GC bases by the length of the sequence and multiply by 100 to get the GC percentage. Format the result to only display 2 decimal places.

Next session

Go to our next notebook: python_basic_2_3