An introduction to solving biological problems with Python

Session 1.4: Collections (sets and dictionaries)

Sets

  • Sets contain unique elements, i.e. no repeats are allowed
  • The elements in a set do not have an order
  • Sets cannot contain elements which can be internally modified (e.g. lists and dictionaries)

In [ ]:
l = [1, 2, 3, 2, 3] # list of 5 values
s = set(l) # set of 3 unique values
print(s)
e = set() # empty set
print(e)

Sets are similar to lists and tuples, and you can use many of the same operators and functions. However, sets are inherently unordered, so they have no index, and they can only contain unique values, so adding a value that is already in the set has no effect.


In [ ]:
s = set([1, 2, 3, 2, 3])
print(s)
print("number in set:", len(s))
s.add(4)
print(s)
s.add(3)
print(s)
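
Adding elements is also where the restriction on modifiable elements, listed earlier, becomes visible. The following is a minimal sketch; the name coords is illustrative, and try/except is used only so that the example keeps running after the error. A tuple can be added to a set, whereas adding a list raises a TypeError:

```python
# Tuples cannot be modified internally, so they are allowed in a set
coords = set()
coords.add((1, 2))
print(coords)

# Lists can be modified internally, so Python refuses to store them in a set
try:
    coords.add([3, 4])
except TypeError as err:
    list_error = str(err)
    print("Cannot add a list:", list_error)
```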

You can remove specific elements from the set.


In [ ]:
s = set([1, 2, 3, 2, 3])
print(s)
s.remove(3)
print(s)
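
Note that remove raises a KeyError if the element is not in the set. As a short sketch (try/except is used here only to keep the example running), the discard method removes an element if it is present and does nothing otherwise:

```python
s = set([1, 2, 3])

# remove() raises a KeyError when the element is not in the set
try:
    s.remove(10)
except KeyError:
    print("10 is not in the set")

# discard() removes the element if present and otherwise does nothing
s.discard(10)
s.discard(3)
print(s)
```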

You can perform the standard set operations, such as taking the union or intersection of two sets with the | and & operators:


In [ ]:
s1 = set([2, 4, 6, 8, 10])
s2 = set([4, 5, 6, 7])

print("Union:", s1 | s2)
print("Intersection:", s1 & s2)
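
Two further operators follow the same pattern. As a brief sketch using the same two sets, - gives the elements that are only in the first set, and ^ gives the elements that are in exactly one of the two sets (the names only_in_s1 and in_one_only are illustrative):

```python
s1 = set([2, 4, 6, 8, 10])
s2 = set([4, 5, 6, 7])

# Elements of s1 that are not in s2
only_in_s1 = s1 - s2
print("Difference:", only_in_s1)

# Elements that are in s1 or s2, but not in both
in_one_only = s1 ^ s2
print("Symmetric difference:", in_one_only)
```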

Exercise 1.4.1

  1. Given the protein sequence "MPISEPTFFEIF", split the sequence into its component amino acid codes and use a set to establish the unique amino acids in the protein and print out the result.

Dictionaries

Lists are useful in many contexts, but often we have some data that has no inherent order and that we want to access by some useful name rather than an index. For example, as a result of some experiment we may have a set of genes and corresponding expression values. We could put the expression values in a list, but then we'd have to remember which index in the list corresponded to which gene and this would quickly get complicated.

For these situations a dictionary is a very useful data structure.

Dictionaries:

  • Contain a mapping of keys to values (like a word and its corresponding definition in a dictionary)
  • The keys of a dictionary are unique, i.e. they cannot repeat
  • The values of a dictionary can be of any data type
  • The keys of a dictionary cannot be an internally modifiable type (e.g. lists, but you can use tuples)
  • Dictionaries have no positional index; entries are looked up by key (since Python 3.7 they do preserve the order in which entries were inserted)

In [ ]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print(dna)

You can access values in a dictionary using the key inside square brackets:


In [ ]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print("A represents", dna["A"])
print("G represents", dna["G"])
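
The rule that keys cannot be an internally modifiable type can be illustrated with a small sketch. The dictionary hydrogen_bonds is a hypothetical example, not part of the course data; it uses tuples of paired bases as keys, and try/except is used only to keep the example running when a list is tried as a key:

```python
# Tuples are allowed as keys because they cannot be modified
hydrogen_bonds = {("A", "T"): 2, ("C", "G"): 3}
print("C-G hydrogen bonds:", hydrogen_bonds[("C", "G")])

# A list cannot be used as a key and raises a TypeError
try:
    bad = {["A", "T"]: 2}
except TypeError as err:
    key_error_message = str(err)
    print("Cannot use a list as a key:", key_error_message)
```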

An error is triggered if a key is absent from the dictionary:


In [ ]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print("What about N?", dna["N"])
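
The error raised in this case is a KeyError. As a minimal sketch, assuming you want the program to continue rather than stop, the error can be caught with try/except (the name missing_key is illustrative):

```python
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}

try:
    print("What about N?", dna["N"])
except KeyError as err:
    # The missing key is stored as the first argument of the error
    missing_key = err.args[0]
    print("Key not found:", missing_key)
```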

You can access values safely with the get method, which gives back None if the key is absent. You can also supply a default value:


In [ ]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print("What about N?", dna.get("N"))
print("With a default value:", dna.get("N", "unknown"))

You can check if a key is in a dictionary with the in operator, and you can negate this with not:


In [ ]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
"T" in dna

In [ ]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
"Y" not in dna

The len() function gives back the number of (key, value) pairs in the dictionary:


In [ ]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print(len(dna))

You can introduce new entries in the dictionary by assigning a value with a new key:


In [ ]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
dna['Y'] = 'Pyrimidine'
print(dna)

You can change the value for an existing key by reassigning it:


In [ ]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
dna['Y'] = 'Cytosine or Thymine'
print(dna)

You can delete entries from the dictionary:


In [ ]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
del dna['Y']
print(dna)

You can get a list of all the keys using the built-in .keys() method (in Python 3.7 and later the keys come back in insertion order):


In [ ]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
print(list(dna.keys()))

Similarly, you can get a list of the values:


In [ ]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
print(list(dna.values()))

And a list of tuples containing (key, value) pairs:


In [ ]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
print(list(dna.items()))
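
If you need the keys or the (key, value) pairs in alphabetical order rather than insertion order, one option is the built-in sorted() function. This is a short sketch with the entries deliberately entered out of order; the names sorted_keys and sorted_items are illustrative:

```python
dna = {'T': 'Thymine', 'G': 'Guanine', 'A': 'Adenine', 'C': 'Cytosine'}

# sorted() on a dictionary gives back a sorted list of its keys
sorted_keys = sorted(dna)
print(sorted_keys)

# sorted() on the items gives back (key, value) tuples ordered by key
sorted_items = sorted(dna.items())
print(sorted_items)
```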

Exercises 1.4.2

  1. Print out the names of the amino acids that would be produced by the DNA sequence "GTT GCA CCA CAA CCG" (See the DNA codon table). Split this string into the individual codons and then use a dictionary to map between codon sequences and the amino acids they encode.
  2. Print each codon and its corresponding amino acid.
  3. Why couldn't we build a dictionary where the keys are names of amino acids and the values are the DNA codons?

Advanced exercise 1.4.3

  • Starting with an empty dictionary, count the abundance of different residue types present in the 1-letter lysozyme protein sequence (http://www.uniprot.org/uniprot/B2R4C5.fasta) and print the results to the screen in alphabetical key order.

Congratulations! You have reached the end of day 1!

Go to our next notebook: python_basic_2_intro