Exercise 1.1 Trim sequence to multiples of three characters

Write a function trim(s) that trims the end of the sequence s (a Seq object) so that its length is a multiple of three characters and it translates without a partial-codon warning or error. Use it to translate the given sequence.


In [ ]:
def trim(s):
    # implement this function
    pass

# test case
import Bio.Seq as BS
s = BS.Seq("ACGCGGCGTG")
print(s, "has length", len(s))
# write a piece of code here which will
# print the translated sequence 'TRR'
# without any errors
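
One possible approach is sketched below. It is a minimal sketch, not a reference solution: it assumes that trimming means dropping the trailing one or two characters, and it is demonstrated on plain strings so that it runs without Biopython. Since Seq objects also support len() and slicing, the same function should work on them unchanged.

```python
def trim(s):
    """Return s shortened to the largest multiple of three characters."""
    # Number of characters that fit into complete codons.
    usable = len(s) - len(s) % 3
    # Slicing keeps the type of s (str here, Seq in the exercise).
    return s[:usable]


# Demonstration on a plain string (assumption: Seq behaves the same way).
print(trim("ACGCGGCGTG"))
```

With a Seq object, calling trim(s).translate() would then be the natural next step.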

Exercise 1.2 GC content of a Seq sequence

Write a function GC(s) that calculates the GC content of a Seq sequence s and returns it as a real number in the range 0-1. The GC content is the proportion of "G" and "C" characters in the sequence. Make sure that your function also works correctly for lower-case sequences.


In [ ]:
def GC(s):
    # implement this function
    pass

# test case
import Bio.Seq as BS
s = BS.Seq("ACGATTAA")
print("GC content of", s, "is", GC(s))
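
The definition above could be sketched as follows. This is only one way to do it: the input is converted with str() so the code also runs on plain strings, and returning 0.0 for an empty sequence is an assumption made here, since the exercise does not say what should happen in that case.

```python
def GC(s):
    """Return the fraction of G and C characters in s (0.0 to 1.0)."""
    # Upper-case a string copy so lower-case input is counted too.
    text = str(s).upper()
    if not text:
        # Assumption: an empty sequence has GC content 0.0.
        return 0.0
    return (text.count("G") + text.count("C")) / len(text)


print(GC("ACGATTAA"))
print(GC("acgattaa"))
```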

Exercise 1.3 Hamming distance of two sequences

Write a function hamming(s1, s2) that calculates the Hamming distance of the two sequences s1 and s2. The Hamming distance is the number of positions in which the two sequences differ. The distance is undefined if the sequences have unequal length.


In [ ]:
def hamming(s1, s2):
    # implement this function
    pass

# test case
import Bio.Seq as BS
s1 = BS.Seq("ACGCAGTTGCAGTAG")
s2 = BS.Seq("ACGCACTTGCAGAAG")
s3 = BS.Seq("AAAAAAAAAA")
print("Hamming distance of", s1, "and", s2, "is", hamming(s1,s2))
print("Hamming distance of", s1, "and", s3, "is", hamming(s1,s3))
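
A minimal sketch of the position-by-position comparison is given below. Returning None for sequences of unequal length is an assumption chosen so that the second print in the test case still runs; raising a ValueError would be an equally reasonable design. Plain strings are used here, but iterating over Seq objects yields single letters in the same way.

```python
def hamming(s1, s2):
    """Return the number of differing positions, or None if lengths differ."""
    if len(s1) != len(s2):
        # Assumption: signal the undefined case with None.
        return None
    # zip pairs up the characters at the same position in both sequences.
    return sum(a != b for a, b in zip(s1, s2))


print(hamming("ACGCAGTTGCAGTAG", "ACGCACTTGCAGAAG"))
print(hamming("ACGCAGTTGCAGTAG", "AAAAAAAAAA"))
```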

Exercise 1.4 Find possible coding tables

DNA sequences are translated into protein (amino acid) sequences three letters at a time, so every three letters of the DNA sequence (a codon) produce a single letter of the protein sequence. The translation is governed by a coding table, of which there are many; see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi .

Let us say you have a DNA sequence and its translation, and you would like to find out which coding table was used in the translation. Given a sequence s_dna and its translation s_protein, find the coding table(s) which could have produced it. Print the number of the table (or one number per line if there are many possible tables).

The tables are defined in Bio.Data.CodonTable, and they are numbered. In particular, generic_by_id is a dictionary that maps these numbers to the table objects.


In [ ]:
import Bio.Seq as BS
import Bio.Data.CodonTable

s_dna = BS.Seq("ATGGTCGATGACCTGTGAACTTAA")
# No alphabet argument: Bio.Alphabet was removed in Biopython 1.78
s_protein = BS.Seq("MVDDLCT*")

# Hint: Bio.Data.CodonTable.generic_by_id

# print the number(s) of the table(s) which
# could have produced s_protein from s_dna
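
The codon-by-codon check behind this exercise can be sketched without Biopython. In the sketch below, is_compatible and toy_tables are hypothetical names, and the two toy tables are hand-written excerpts containing only the codons of this example; they merely imitate the forward_table (codon to amino acid) and stop_codons attributes that Biopython's CodonTable objects provide.

```python
def is_compatible(dna, protein, forward_table, stop_codons):
    """Check whether the table translates dna into protein, codon by codon."""
    dna = str(dna).upper()
    protein = str(protein)
    if len(dna) != 3 * len(protein):
        return False
    for i, amino in enumerate(protein):
        codon = dna[3 * i:3 * i + 3]
        if amino == "*":
            # A '*' in the protein must come from a stop codon.
            if codon not in stop_codons:
                return False
        elif forward_table.get(codon) != amino:
            return False
    return True


# Hypothetical toy excerpts: only the codons used below are listed.
base = {"ATG": "M", "GTC": "V", "GAT": "D", "GAC": "D", "CTG": "L", "ACT": "T"}
toy_tables = {
    1: (base, ["TAA", "TAG", "TGA"]),           # TGA is a stop codon
    10: ({**base, "TGA": "C"}, ["TAA", "TAG"]),  # TGA codes for C
}

for table_id, (forward, stops) in sorted(toy_tables.items()):
    if is_compatible("ATGGTCGATGACCTGTGAACTTAA", "MVDDLCT*", forward, stops):
        print(table_id)
```

With Biopython available, the same loop could presumably run over Bio.Data.CodonTable.generic_by_id.items(), passing each table's forward_table and stop_codons instead of the toy data.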

Exercise 1.5 Remove ambiguous alphabet letters

Write a function clean(s), where s is a string, that returns s with every character other than A, C, G, or T removed.


In [ ]:
def clean(s):
    # implement this function
    pass

# test case
print("ACGTHABJHHBAGATGATB")
print(clean("ACGTHABJHHBAGATGATB"))
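
One way to filter the string is sketched below. Treating only the upper-case letters A, C, G, and T as valid is an assumption; the exercise does not say whether lower-case bases should be kept.

```python
# Assumption: only upper-case A, C, G, and T count as valid bases.
VALID_BASES = set("ACGT")


def clean(s):
    """Return s with every character outside A, C, G, T removed."""
    return "".join(ch for ch in s if ch in VALID_BASES)


print(clean("ACGTHABJHHBAGATGATB"))
```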

Exercise 1.6 Sort sequences

Write a function sort_by_unknown(s) that sorts the given DNA sequences in ascending order by the number of unknown bases (written as N) and returns them. Here, s is an iterable containing Seq objects.


In [ ]:
def sort_by_unknown(s):
    # implement this function
    pass

# test case
import Bio.Seq as BS
s = [BS.Seq('NGTACCTTGCTACTC'),
     BS.Seq('NCGTGNN'),
     BS.Seq('NNNNN'),
     BS.Seq('ACGGT'),
     BS.Seq('ANNTGGT'),
     BS.Seq('ACGNGT'),
     BS.Seq('AACGTCCGTNNN'),
    ]
print(s)
print(sort_by_unknown(s))
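
The sorting can be sketched with the built-in sorted and a key function. The sketch assumes that an unknown base is written as N (in either case), which matches the test data, and it uses plain strings so it runs without Biopython; str() is applied so that Seq objects would be handled the same way.

```python
def sort_by_unknown(s):
    """Return the sequences as a list, fewest unknown bases (N) first."""
    # sorted is stable, so sequences with equal counts keep their input order.
    return sorted(s, key=lambda seq: str(seq).upper().count("N"))


example = ["NGTACCTTGCTACTC", "NCGTGNN", "NNNNN", "ACGGT",
           "ANNTGGT", "ACGNGT", "AACGTCCGTNNN"]
print(sort_by_unknown(example))
```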

Exercise 1.7 Project design

Consider the following hypothetical experiment and sketch an analysis to reach the given objectives. At this stage of the course, it is enough to outline the overall flow of the analysis and propose methods that could be used in the analysis. The more details you can write down, the better.

Researchers have found a new prokaryote species that can thrive in unexpectedly harsh environmental conditions. To understand how the species can survive, the researchers have obtained RNA molecules expressed by the cells. Given a list of RNA sequences, find out what functions the proteins (possibly) have and how the proteins differ from the corresponding proteins in other species.