Biological sequences

Instructions

Read Chapter 3 of the Biopython tutorial. Pay attention to how biological sequences are represented and handled.

Objectives

  • create and manipulate Seq objects
  • understand how Alphabet objects work
  • understand how CodonTable objects work

Summary

Biological sequences are modelled by the Seq class in the Bio.Seq module.


In [ ]:
import Bio.Seq as BS

Alphabets contain information on interpretation

Alphabet objects specify how the letters of a sequence should be interpreted. There are generic alphabets for DNA, RNA, and proteins, as well as IUPAC-compatible alphabets. In addition to the normal "unambiguous" alphabets, IUPAC defines "ambiguous" and "extended" alphabets.

See the Bio.Alphabet module for details. Bio.Alphabet was removed in Biopython 1.78, so the cells in this notebook need an earlier release.


In [ ]:
import Bio.Alphabet as BA
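
To illustrate what an alphabet contributes, the following plain-Python sketch treats an alphabet as nothing more than a set of allowed letters. It does not use Biopython; the names UNAMBIGUOUS_DNA, AMBIGUOUS_DNA, and fits_alphabet are made up for this example, and the letter sets follow the IUPAC nucleotide codes.

```python
# Plain-Python sketch: an "alphabet" reduced to a set of allowed letters.
# The names below are hypothetical and not part of Biopython.
UNAMBIGUOUS_DNA = set("GATC")
AMBIGUOUS_DNA = set("GATCRYWSMKHBVDN")


def fits_alphabet(sequence, letters):
    """Return True if every character of the sequence is an allowed letter."""
    return set(sequence) <= letters


print(fits_alphabet("ACGT", UNAMBIGUOUS_DNA))  # True
print(fits_alphabet("ACGN", UNAMBIGUOUS_DNA))  # False, N is an ambiguity code
print(fits_alphabet("ACGN", AMBIGUOUS_DNA))    # True
```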

Sequences are like strings

Seq objects are like Python strings but are aware of the alphabet in use (i.e. what the letters actually mean).


In [ ]:
# Python string
seq = "AAAA"
print(seq)
print(repr(seq))

In [ ]:
# DNA sequence
seq = BS.Seq("AAAA", BA.generic_dna)
print(seq)
print(repr(seq))

In [ ]:
# RNA sequence
seq = BS.Seq("AAAA", BA.generic_rna)
print(seq)
print(repr(seq))

In [ ]:
# protein sequence
seq = BS.Seq("AAAA", BA.generic_protein)
print(seq)
print(repr(seq))

In [ ]:
# protein sequence
seq = BS.Seq("AAAA", BA.generic_protein)
# back to Python string
seq = str(seq)
print(seq)
print(repr(seq))

The alphabet of a sequence is available through its alphabet attribute.


In [ ]:
# protein sequence
seq = BS.Seq("AAAA", BA.generic_protein)
print(seq.alphabet)

Sequences act like strings


In [ ]:
# DNA sequence
seq = BS.Seq("ACGCGGCGTG", BA.generic_dna)

In [ ]:
# length of sequence
print(len(seq))

In [ ]:
# access to a single element
print(seq[4])

In [ ]:
# slicing
print(seq[2:5])

In [ ]:
# iteration
for e in seq:
    print(e)

Study the Seq class API to learn what methods it has. Some of them will be discussed below.

Sequences can be concatenated

Seq objects can be concatenated (like Python strings). The alphabets must be compatible; otherwise a TypeError is raised.


In [ ]:
# DNA sequences
seq1 = BS.Seq("ACGCGGCGTG", BA.generic_dna)
seq2 = BS.Seq("AAAGGGTAAA", BA.generic_dna)

seq = seq1+seq2
print(repr(seq))

In [ ]:
# a DNA sequence and a protein sequence
seq1 = BS.Seq("ACGCGGCGTG", BA.generic_dna)
seq2 = BS.Seq("AAAGGGTAAA", BA.generic_protein)

# incompatible sequences cannot be concatenated
try:
    seq = seq1+seq2
    print(repr(seq))
except TypeError as e:
    print(e)
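
The idea of refusing to join incompatible sequences can be sketched without Biopython. The class TaggedSeq below is hypothetical and far simpler than Seq: it stores a molecule-type tag and treats two sequences as compatible only when their tags are equal, whereas Biopython's own compatibility rules are richer (for example, a generic and an IUPAC DNA alphabet can be combined).

```python
class TaggedSeq:
    """Hypothetical string wrapper that carries a molecule-type tag."""

    def __init__(self, data, kind):
        self.data = data
        self.kind = kind  # e.g. "dna", "rna" or "protein"

    def __add__(self, other):
        if not isinstance(other, TaggedSeq):
            return NotImplemented
        # Assumption of this sketch: only identical tags are compatible.
        if self.kind != other.kind:
            raise TypeError(
                f"incompatible kinds: {self.kind!r} and {other.kind!r}"
            )
        return TaggedSeq(self.data + other.data, self.kind)

    def __repr__(self):
        return f"TaggedSeq({self.data!r}, {self.kind!r})"


dna1 = TaggedSeq("ACGCGGCGTG", "dna")
dna2 = TaggedSeq("AAAGGGTAAA", "dna")
protein = TaggedSeq("AAAGGGTAAA", "protein")

print(dna1 + dna2)

try:
    dna1 + protein
except TypeError as e:
    print(e)
```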

Character case matters

Seq objects are case-sensitive. Unless you have a good reason to do otherwise, use upper-case characters because the IUPAC alphabets expect them.


In [ ]:
# sequence with IUPAC alphabet
seq1 = BS.Seq("ACGCGGCGTG", BA.IUPAC.unambiguous_dna)
print(repr(seq1))

In [ ]:
# lower-case version has a generic alphabet
seq2 = seq1.lower()
print(repr(seq2))

In [ ]:
# an upper-case substring is not found in the lower-case version
print("ACGC" in seq1)
print("ACGC" in seq2)
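
Plain Python strings show the same case-sensitive behaviour. The sketch below uses ordinary strings only, and the helper contains_ignoring_case is a hypothetical name; it shows one way to search regardless of case by normalising both sides to upper case first.

```python
upper = "ACGCGGCGTG"
lower = upper.lower()

# Substring search is case-sensitive.
print("ACGC" in upper)  # True
print("ACGC" in lower)  # False


def contains_ignoring_case(sequence, pattern):
    """Search after converting both arguments to upper case."""
    return pattern.upper() in sequence.upper()


print(contains_ignoring_case(lower, "ACGC"))  # True
```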

Sequence comparison is not trivial

Comparing two sequences is not trivial because, strictly speaking, the alphabets should be taken into account. Biopython sidesteps the problem: it compares the letters only and issues a warning if the alphabets are incompatible.


In [ ]:
# DNA sequences
seq1 = BS.Seq("ACGCGGCGTG", BA.IUPAC.unambiguous_dna)
seq2 = BS.Seq("ACGCGGCGTG", BA.IUPAC.ambiguous_dna)
seq3 = BS.Seq("NNNCGGCGTG", BA.IUPAC.ambiguous_dna)
seq4 = BS.Seq("NNNCGGCGTG", BA.IUPAC.ambiguous_dna)

# protein sequence
seq5 = BS.Seq("ACGCGGCGTG", BA.IUPAC.protein)

In [ ]:
# compatible alphabets have no problems
print(seq1 == seq2)

In [ ]:
# ambiguous nucleotides are compared as such (i.e. no "intelligent" matching)
print(seq2 == seq4)
print(seq3 == seq4)
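
For contrast, here is what an "intelligent" comparison of ambiguous nucleotides could look like. This is a plain-Python sketch, not something Seq does: the function could_match is a hypothetical name, and it treats two letters as compatible when the sets of bases they stand for (per the IUPAC nucleotide codes) overlap.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def could_match(seq_a, seq_b):
    """Return True if the sequences agree at every position, allowing
    ambiguity codes to stand for any of their bases (hypothetical helper)."""
    if len(seq_a) != len(seq_b):
        return False
    return all(
        set(IUPAC_DNA[a]) & set(IUPAC_DNA[b])
        for a, b in zip(seq_a, seq_b)
    )


print("ACGCGGCGTG" == "NNNCGGCGTG")             # False, letters differ
print(could_match("ACGCGGCGTG", "NNNCGGCGTG"))  # True, N covers any base
```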

In [ ]:
# turn warnings into exceptions
import warnings as W
W.simplefilter("error")

import Bio as B

# incompatible alphabets are compared but with warning
try:
    print(seq1 == seq5)
except B.BiopythonWarning as e:
    print(e)
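
Calling simplefilter("error") at the top level changes warning handling for every later cell. A possible alternative, sketched here with the standard library only, is to limit the change to a with block using warnings.catch_warnings. DemoWarning and compare are hypothetical stand-ins for a library warning and a function that emits it.

```python
import warnings


class DemoWarning(UserWarning):
    """Hypothetical warning class standing in for a library warning."""


def compare(a, b):
    """Compare two strings, warning when their lengths differ."""
    if len(a) != len(b):
        warnings.warn("comparing strings of different length", DemoWarning)
    return a == b


# Inside this block DemoWarning is raised as an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error", DemoWarning)
    try:
        compare("ACG", "ACGT")
        caught = None
    except DemoWarning as e:
        caught = str(e)
print(caught)

# With the "always" filter the same call only records a warning
# and returns normally.
with warnings.catch_warnings(record=True) as recorded:
    warnings.simplefilter("always")
    result = compare("ACG", "ACGT")
print(result, len(recorded))
```

Both filter changes are undone when the respective with block ends.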

Complements, transcription, and translation

Seq objects have methods to create complements as well as to transcribe and translate them. The sequence alphabets are respected in these operations.

Note that transcription and translation always operate on the coding strand (even though, biologically, transcription reads the template strand).


In [ ]:
# DNA sequence
seq1 = BS.Seq("ACGCGGCGTGAA", BA.IUPAC.unambiguous_dna)
print(repr(seq1))

In [ ]:
# complement
seq = seq1.complement()
print(repr(seq))

In [ ]:
# reverse complement
seq = seq1.reverse_complement()
print(repr(seq))

In [ ]:
# transcribed sequence
seq2 = seq1.transcribe()
print(repr(seq2))

In [ ]:
# translated sequence
seq3 = seq2.translate()
print(repr(seq3))

In [ ]:
# DNA can be translated directly, too
seq3 = seq1.translate()
print(repr(seq3))
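
What these operations do to the letters can be sketched with plain strings. This is an illustration for unambiguous upper-case DNA only, not Biopython's implementation; the function names are chosen for this example. Transcription is modelled on the coding strand, so it amounts to replacing T with U.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Complement each base, keeping the order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement each base and reverse the result."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Coding-strand transcription: replace T with U."""
    return dna.replace("T", "U")


coding = "ACGCGGCGTGAA"
print(complement(coding))          # TGCGCCGCACTT
print(reverse_complement(coding))  # TTCACGCCGCGT
print(transcribe(coding))          # ACGCGGCGUGAA
```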

The translation method issues a warning if the sequence contains a partial codon at the end.


In [ ]:
# turn warnings into exceptions
import warnings as W
W.simplefilter("error")

import Bio as B

# DNA sequence
seq = BS.Seq("ACGCGGCGTGA", BA.IUPAC.unambiguous_dna)
try:
    seq.translate()
except B.BiopythonWarning as e:
    print(e)
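
A partial codon arises whenever the sequence length is not a multiple of three. The following plain-Python sketch (split_codons is a hypothetical helper, not a Biopython function) splits a sequence into complete codons and reports the leftover letters.

```python
def split_codons(sequence):
    """Return (complete codons, leftover letters) for reading frame 0."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    return codons, sequence[usable:]


codons, leftover = split_codons("ACGCGGCGTGA")
print(codons)    # ['ACG', 'CGG', 'CGT']
print(leftover)  # GA
```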

Translated sequences can contain stop codons, which are shown as "*". Translation can also be terminated at the first stop codon.


In [ ]:
# DNA sequence with two stop codons
seq = BS.Seq("ACGAGGGCGTAGGTGCCTCGATAG", BA.IUPAC.unambiguous_dna)
seq = seq.translate()
print(repr(seq))

In [ ]:
# translate until the first stop codon
seq = BS.Seq("ACGAGGGCGTAGGTGCCTCGATAG", BA.IUPAC.unambiguous_dna)
seq = seq.translate(to_stop=True)
print(repr(seq))
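
The effect of stopping at the first stop codon can be sketched in plain Python. The dictionary below is deliberately partial: it contains only the standard-code entries for the codons that occur in the example sequence, and the function name translate_sketch is hypothetical.

```python
# Partial codon table: only the codons used in the example sequence,
# with their meaning in the standard genetic code ("*" marks a stop).
PARTIAL_TABLE = {
    "ACG": "T", "AGG": "R", "GCG": "A", "TAG": "*",
    "GTG": "V", "CCT": "P", "CGA": "R",
}


def translate_sketch(dna, to_stop=False):
    """Translate codon by codon; optionally stop before the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = PARTIAL_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ACGAGGGCGTAGGTGCCTCGATAG"
print(translate_sketch(dna))                # TRA*VPR*
print(translate_sketch(dna, to_stop=True))  # TRA
```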

The translation method can be given a specific codon table. Biopython uses the codon information provided by NCBI. See the Bio.Data.CodonTable module for details.


In [ ]:
import Bio.Data.CodonTable as BDCT

In [ ]:
# DNA sequence
seq1 = BS.Seq("ACGAGGGCGTAGGTGCCTCGATAG", BA.IUPAC.unambiguous_dna)

In [ ]:
# translation using the standard codon table (the default)
seq2 = seq1.translate(table="Standard")
print(repr(seq2))

In [ ]:
# translation using a mitochondrial codon table
seq2 = seq1.translate(table="Vertebrate Mitochondrial")
print(repr(seq2))
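
To see why the choice of table matters, the sketch below builds the standard genetic code from its usual compact listing (bases in the order T, C, A, G) and derives the vertebrate mitochondrial code by overriding the four codons that differ. It is plain Python rather than the CodonTable machinery, and all names in it are chosen for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

standard = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code: the codons that differ from the standard code.
vertebrate_mito = dict(standard, AGA="*", AGG="*", ATA="M", TGA="W")


def translate_with(dna, table):
    """Translate complete codons of dna using the given codon dictionary."""
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


dna = "ACGAGGGCGTAGGTGCCTCGATAG"
print(translate_with(dna, standard))         # TRA*VPR*
print(translate_with(dna, vertebrate_mito))  # T*A*VPR*
```

The second codon, AGG, codes for arginine in the standard code but is a stop codon in the vertebrate mitochondrial code, which is why the two results differ.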

The codon tables are accessible via various dictionaries in the CodonTable module.


In [ ]:
# iterate over table names
for name in BDCT.generic_by_name:
    print(name)

In [ ]:
# access a specific table
table = BDCT.generic_by_name["Vertebrate Mitochondrial"]
print(table)

In [ ]:
# start codons
print(table.start_codons)

In [ ]:
# stop codons
print(table.stop_codons)
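
Start and stop codon lists are useful, for instance, for a quick sanity check of a coding sequence. The plain-Python sketch below hard-codes the start and stop codons that NCBI lists for the standard code (table 1); looks_like_cds is a hypothetical helper and performs only a rough check.

```python
# Start and stop codons of the standard code (NCBI table 1).
START_CODONS = ["TTG", "CTG", "ATG"]
STOP_CODONS = ["TAA", "TAG", "TGA"]


def looks_like_cds(dna):
    """Rough check: whole codons, a start codon first, a stop codon last."""
    return (
        len(dna) % 3 == 0
        and dna[:3] in START_CODONS
        and dna[-3:] in STOP_CODONS
    )


print(looks_like_cds("ATGGCGTAG"))     # True
print(looks_like_cds("ACGAGGGCGTAG"))  # False, ACG is not a start codon
```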