An introduction to solving biological problems with Python

Session 1.3: Collections Lists and Strings

In addition to the basic data types introduced above, you will very often want to store and operate on collections of values, and Python has several data structures for this. The general idea is that you place several items into a single collection and then refer to that collection as a whole. Which one you use depends on the problem you are trying to solve.

Tuples

  • Can contain any number of items
  • Can contain different types of items
  • Cannot be altered once created (they are immutable)
  • Items have a defined order

A tuple is created by using round brackets around the items it contains, with commas separating the individual elements.


In [ ]:
a = (123, 54, 92) # tuple of 3 integers
b = () # empty tuple
c = ("Ala",) # tuple of a single string (note the trailing ",")
d = (2, 3, False, "Arg", None) # a tuple of mixed types

print(a)
print(b)
print(c)
print(d)

You can of course use variables in tuples and other data structures:


In [ ]:
x = 1.2
y = -0.3
z = 0.9
t = (x, y, z)

print(t)

Tuples can be packed and unpacked with a convenient syntax. The number of variables used to unpack the tuple must match the number of elements in the tuple.


In [ ]:
t = 2, 3, 4 # tuple packing
print('t is', t)
x, y, z = t # tuple unpacking
print('x is', x)
print('y is', y)
print('z is', z)

Lists

  • Can contain any number of items
  • Can contain different types of items
  • Can be altered once created (they are mutable)
  • Items have a particular order

Lists are created with square brackets around their items:


In [ ]:
a = [1, 3, 9]
b = ["ATG"]
c = []

print(a)
print(b)
print(c)

Lists and tuples can contain other lists and tuples, or any other type of collection:


In [ ]:
matrix = [[1, 0], [0, 2]]
print(matrix)

You can convert between tuples and lists with the tuple and list functions. Note that these create a new collection with the same items, and leave the original unaffected.


In [ ]:
a = (1, 4, 9, 16)     # A tuple of numbers
b = ['G','C','A','T'] # A list of characters

print(a)
print(b)

l = list(a)   # Make a list based on a tuple 
print(l)

t = tuple(b)  # Make a tuple based on a list
print(t)

Manipulating tuples and lists

Once your data is in a list or tuple, Python supports a number of ways to access its elements and manipulate it in useful ways, such as sorting the data.

Tuples and lists can generally be used in very similar ways.

Index access

You can access individual elements of the collection using their index. Note that the first element is at index 0. Negative indices count backwards from the end, so -1 refers to the last element.


In [ ]:
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]

print('t is', t)
print('t[0] is', t[0])
print('t[2] is', t[2])

print('x is', x)
print('x[-1] is', x[-1])
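
Index access can also be chained to reach items inside nested collections, such as the matrix list shown earlier. A small sketch, in which row and value are illustrative names:

```python
matrix = [[1, 0], [0, 2]]

row = matrix[1]          # the second inner list
value = matrix[1][1]     # second item of the second inner list

print('matrix[1] is', row)
print('matrix[1][1] is', value)
print('matrix[-1][0] is', matrix[-1][0])  # first item of the last inner list
```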

Slices

You can also access a range of items, known as slices, from inside lists and tuples using a colon : to indicate the beginning and end of the slice inside the square brackets. Note that the slice notation [a:b] includes positions from a up to but not including b.


In [ ]:
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]
print('t[1:3] is', t[1:3])
print('x[2:] is', x[2:])
print('x[:-1] is', x[:-1])
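
A slice is a new collection of the same type as the original, and taking it leaves the original unchanged. The sketch below illustrates this with the same data; part is an illustrative name.

```python
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]

part = x[1:4]            # items at positions 1, 2 and 3
print('x[1:4] is', part)
print('x is still', x)   # slicing does not change the original

# A slice of a tuple is a tuple, and a slice of a list is a list
print('t[1:3] is a tuple:', t[1:3] == (54, 92))
print('x[1:3] is a list:', x[1:3] == [54, 92])
```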

in operator

You can check if a value is in a tuple or list with the in operator, and you can negate this with not:


In [ ]:
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]
print('123 in', x, 123 in x)
print('234 in', t, 234 in t)
print('999 not in', x, 999 not in x)

len() and count() functions

You can get the length of a list or tuple with the built-in len() function, and you can count how many times a particular element occurs in a list or tuple with the .count() method.


In [ ]:
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]
print("length of t is", len(t))
print("number of 33s in x is", x.count(33))
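
As a further sketch with made-up data (bases and codons are illustrative names), .count() is available on tuples as well as lists, and it returns 0 when the value does not occur at all:

```python
bases = ['A', 'T', 'G', 'A', 'A']   # illustrative data
codons = ('ATG', 'TGA', 'ATG')

print("number of As in bases is", bases.count('A'))
print("number of ATGs in codons is", codons.count('ATG'))
print("number of Cs in bases is", bases.count('C'))   # value not present
print("length of an empty list is", len([]))
```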

Modifying lists

You can alter lists in place, but not tuples:


In [ ]:
x = [123, 54, 92, 87, 33]
print(x)
x[2] = 33
print(x)

Tuples cannot be altered once they have been created. If you try to do so, you'll get an error.


In [ ]:
t = (123, 54, 92, 87, 33)
print(t)
t[1] = 4 # raises a TypeError because tuples are immutable
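
When a changed version of a tuple is needed, one option is to build a new tuple rather than alter the existing one. The sketch below does this with the list and tuple functions introduced earlier; tmp and t2 are illustrative names.

```python
t = (123, 54, 92, 87, 33)

# Convert to a list, change the item, then convert back to a tuple
tmp = list(t)
tmp[1] = 4
t2 = tuple(tmp)

print(t)    # the original tuple is unchanged
print(t2)   # the new tuple carries the change
```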

You can add elements to the end of a list with append()


In [ ]:
x = [123, 54, 92, 87, 33]
x.append(101)
print(x)

or insert values at a certain position with insert(), by supplying the desired position as well as the new value


In [ ]:
x = [123, 54, 92, 87, 33]
x.insert(3, 1111)
print(x)

You can remove values with remove()


In [ ]:
x = [123, 54, 92, 87, 33]
x.remove(123)
print(x)
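
Two details of remove() are worth knowing: it removes only the first matching value, and it raises a ValueError if the value is not in the list. A minimal sketch with made-up data; remove_failed is an illustrative name, and try/except (not covered in this session) is used only so that the example runs to completion.

```python
x = [33, 54, 33, 87]
x.remove(33)          # removes the first 33 only
print(x)

try:
    x.remove(999)     # 999 is not in the list
    remove_failed = False
except ValueError as err:
    remove_failed = True
    print('Could not remove 999:', err)
```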

and delete values by index with del


In [ ]:
x = [123, 54, 92, 87, 33]
print(x)
del x[0]
print(x)

It's often useful to be able to combine lists, which can be done with extend(). Using append() instead would add the whole second list as a single element:


In [ ]:
a = [1,2,3]
b = [4,5,6]
a.extend(b)
print(a)
a.append(b)
print(a)

The plus symbol + concatenates two lists into a new list, leaving both originals unchanged. Assigning the result back to a gives the same outcome as extend():


In [ ]:
a = [1, 2, 3]
b = [4, 5, 6]
a = a + b
print(a)
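
The difference between + and extend() only shows when the result is stored under a different name. A short sketch, with c as an illustrative name:

```python
a = [1, 2, 3]
b = [4, 5, 6]

c = a + b        # builds a new list
print(a)         # a is unchanged
print(c)

a.extend(b)      # changes a in place
print(a)
```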

Slice syntax can be used on the left hand side of an assignment to replace a subregion of a list. The replacement does not need to be the same length as the slice:


In [ ]:
a = [1, 2, 3, 4, 5, 6]
a[1:3] = [9, 9, 9, 9]
print(a)

You can change the order of elements in a list with reverse() and sort():


In [ ]:
a = [1, 3, 5, 4, 2]
a.reverse()
print(a)
a.sort()
print(a)

Note that both of these change the list in place. If you want a sorted copy of the list while leaving the original untouched, use sorted():


In [ ]:
a = [2, 5, 7, 1]
b = sorted(a)
print(a)
print(b)
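
A common slip is to assign the result of .sort() to a variable: the method sorts the list in place and returns None. sorted(), by contrast, also accepts a tuple and always returns a new list. A minimal sketch, in which result, t and s are illustrative names:

```python
a = [2, 5, 7, 1]
result = a.sort()   # sorts a in place; the method itself returns None
print(a)
print(result)

t = (3, 1, 2)
s = sorted(t)       # works on a tuple and returns a new list
print(t)
print(s)
```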

Getting help from the official Python documentation

The most useful information is online at https://www.python.org/ and should be used as a reference guide.

Getting help directly from within Python using help()


In [ ]:
help(len)

In [ ]:
help(list)

In [ ]:
help(list.insert)

In [ ]:
help(list.count)

Exercise 1.3.1

  1. Create a list of DNA codons for the protein sequence CLYSY based on the codon variables you defined previously.
  2. Print the DNA sequence of the protein to the screen.
  3. Print the DNA codon of the last amino acid in the protein sequence.
  4. Create two more variables containing the DNA sequence of a stop codon and a start codon, and replace the first element of the DNA sequence with the start codon and append the stop codon to the end of the DNA sequence. Print out the resulting DNA sequence.

String manipulations

Strings are a lot like tuples of characters, and individual characters and substrings can be accessed using operations similar to those introduced above.


In [ ]:
text = "ATGTCATTTGT"
print(text[0])
print(text[-2])
print(text[0:6])
print("ATG" in text)
print("TGA" in text)
print(len(text))

Just as with tuples, trying to assign a value to an element or slice of a string results in an error, because strings are immutable:


In [ ]:
text = "ATGTCATTTGT"
text[0:2] = "CCC"

Python provides a number of useful methods that let you manipulate strings.

The in operator lets you check if a substring is contained within a larger string, but it does not tell you where the substring is located. This is often useful to know, and Python provides the .find() method, which returns the index of the first occurrence of the search string, and the .rfind() method, which returns the index of the last occurrence.

If the search string is not found, both of these methods return -1.


In [ ]:
dna = "ATGTCACCGTTT"
index = dna.find("TCA")
print("TCA is at position:", index)
index = dna.rfind('C')
print("The last Cytosine is at position:", index)
print("Position of a stop codon:", dna.find("TGA"))
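
Both methods also accept an optional start position as a second argument, which makes it possible to look for later occurrences. The sketch below uses the same sequence; first_t and second_t are illustrative names.

```python
dna = "ATGTCACCGTTT"

first_t = dna.find("T")                 # search from the start
second_t = dna.find("T", first_t + 1)   # resume just after the first hit
print("First T is at position:", first_t)
print("Second T is at position:", second_t)

# Checking with in as well avoids mistaking -1 for a real position
print("TGA" in dna, dna.find("TGA"))
```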

When we are reading text from files (which we will see later on), often there is unwanted whitespace at the start or end of the string. We can remove leading whitespace with the .lstrip() method, trailing whitespace with .rstrip(), and whitespace from both ends with .strip().

All of these methods return a copy of the changed string, so if you want to replace the original you can assign the result of the method call to the original variable.


In [ ]:
s = "    Chromosome Start End                     "
print(len(s), s)
s = s.lstrip()
print(len(s), s)
s = s.rstrip()
print(len(s), s)
s = "    Chromosome Start End                     "
s = s.strip()
print(len(s), s)
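
Whitespace here includes tabs and newline characters as well as spaces, which is why these methods are handy for lines read from a file. A small sketch with a made-up line (line and cleaned are illustrative names), which also shows that the original string is untouched unless the result is assigned back to it:

```python
line = "  chr1 100 200\n"    # made-up line ending in a newline character
cleaned = line.strip()

print(len(line), len(cleaned))
print(cleaned)
print(line == "  chr1 100 200\n")   # the original string is unchanged
```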

You can split a string into a list of substrings using the .split() method, supplying the delimiter as an argument to the method. If you don't supply any delimiter the method will split the string on whitespace by default (which is very often what you want!)

To split a string into its component characters you can simply convert the string to a list with list():


In [ ]:
seq = "ATG TCA CCG GGC"
codons = seq.split(" ")
print(codons)

bases = list(seq) # a string converted into a list of characters (spaces included)
print(bases)
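
Splitting on a single space and splitting with no argument behave differently when the text contains runs of spaces. The following sketch uses a made-up line to show the difference; line, by_space and by_whitespace are illustrative names.

```python
line = "chr1  100   200"     # made-up line with uneven spacing

by_space = line.split(" ")   # every single space is a delimiter
by_whitespace = line.split() # runs of whitespace count as one delimiter

print(by_space)              # contains empty strings between adjacent spaces
print(by_whitespace)
```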

.join() is the counterpart to .split(): it joins the elements of a list into a single string, using the string it is called on as the separator. It only works if all the elements are strings:


In [ ]:
seq = "ATG TCA CCG GGC"
codons = seq.split(" ")
print(codons)
print("|".join(codons))

We also saw earlier that the + operator lets you concatenate strings together into a larger string.

Note that + only concatenates a string with another string. If you want to concatenate a string with an integer (or some other type), you first have to convert the integer to a string with the str() function.


In [ ]:
s = "chr"
chrom_number = 2
print(s + str(chrom_number))

To get more information about the split() and join() methods, look them up in the online Python documentation, starting from www.python.org, or use the built-in help() function.


In [ ]:
help(str.split)
help(str.join)

Exercise 1.3.2

  1. Create a string variable with your full name in it, with your first and last name (and any middle names) separated by a space. Split the string into a list, and print out your surname.
  2. Check if your surname contains the letter "E", and print out the position of this letter in the string. Try a few other letters.

Next session

Go to our next notebook: python_basic_1_4