Read chapter 3 of the Biopython tutorial. Pay attention to how biological sequences are represented and handled.
Biological sequences are modelled by the Seq class in the Bio.Seq module.
In [ ]:
import Bio.Seq as BS
Alphabet objects specify how the sequence of letters should be interpreted. There are generic alphabets for DNA, RNA, and proteins but also IUPAC-compatible alphabets. In addition to the normal "unambiguous" alphabets, IUPAC defines "ambiguous" and "extended" alphabets.
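See the Bio.Alphabet module for details. Note that Bio.Alphabet was removed in Biopython 1.78, so the examples in this notebook require an earlier release.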
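To make the idea of an alphabet concrete, the following plain-Python sketch checks a string against a set of permitted letters. It does not use Biopython: the names `LETTERS` and `fits_alphabet` are hypothetical, and the letter sets are simply the usual IUPAC one-letter codes.

```python
# Hypothetical letter sets for three IUPAC alphabets (not Biopython objects).
LETTERS = {
    "unambiguous_dna": set("GATC"),
    "ambiguous_dna": set("GATCRYWSMKHBVDN"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def fits_alphabet(text, alphabet):
    """Return True if every letter of text belongs to the named alphabet."""
    return set(text.upper()) <= LETTERS[alphabet]


print(fits_alphabet("ACGCGGCGTG", "unambiguous_dna"))  # True
print(fits_alphabet("NNNCGGCGTG", "unambiguous_dna"))  # False
print(fits_alphabet("NNNCGGCGTG", "ambiguous_dna"))    # True
print(fits_alphabet("ACGCGGCGTG", "protein"))          # True
```

The last line shows why the interpretation has to be stated explicitly: A, C, G, and T are also valid amino-acid codes, so the letters alone cannot tell DNA from protein.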
In [ ]:
import Bio.Alphabet as BA
In [ ]:
# Python string
seq = "AAAA"
print(seq)
print(repr(seq))
In [ ]:
# DNA sequence
seq = BS.Seq("AAAA", BA.generic_dna)
print(seq)
print(repr(seq))
In [ ]:
# RNA sequence
seq = BS.Seq("AAAA", BA.generic_rna)
print(seq)
print(repr(seq))
In [ ]:
# protein sequence
seq = BS.Seq("AAAA", BA.generic_protein)
print(seq)
print(repr(seq))
In [ ]:
# protein sequence
seq = BS.Seq("AAAA", BA.generic_protein)
# back to Python string
seq = str(seq)
print(seq)
print(repr(seq))
The alphabet can also be accessed if needed.
In [ ]:
# protein sequence
seq = BS.Seq("AAAA", BA.generic_protein)
print(seq.alphabet)
In [ ]:
# DNA sequence
seq = BS.Seq("ACGCGGCGTG", BA.generic_dna)
In [ ]:
# length of sequence
print(len(seq))
In [ ]:
# access to a single element
print(seq[4])
In [ ]:
# slicing
print(seq[2:5])
In [ ]:
# iteration
for e in seq:
    print(e)
Study the Seq class API to learn what methods it has. Some of them will be discussed below.
In [ ]:
# DNA sequences
seq1 = BS.Seq("ACGCGGCGTG", BA.generic_dna)
seq2 = BS.Seq("AAAGGGTAAA", BA.generic_dna)
seq = seq1+seq2
print(repr(seq))
In [ ]:
# a DNA sequence and a protein sequence
seq1 = BS.Seq("ACGCGGCGTG", BA.generic_dna)
seq2 = BS.Seq("AAAGGGTAAA", BA.generic_protein)
# incompatible sequences cannot be concatenated
try:
    seq = seq1+seq2
    print(repr(seq))
except TypeError as e:
    print(e)
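The mechanism behind this error can be sketched in a few lines of plain Python. The class `TypedSeq` below is hypothetical and simplifies the rule to strict equality of an alphabet tag; Biopython's own compatibility check is more permissive (for example, generic and IUPAC DNA alphabets can be combined).

```python
class TypedSeq:
    """Hypothetical stand-in for a sequence that carries an alphabet tag."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet

    def __add__(self, other):
        # Refuse to mix sequence kinds instead of silently joining the letters.
        if self.alphabet != other.alphabet:
            raise TypeError(
                f"Incompatible alphabets: {self.alphabet} and {other.alphabet}"
            )
        return TypedSeq(self.data + other.data, self.alphabet)

    def __repr__(self):
        return f"TypedSeq({self.data!r}, {self.alphabet!r})"


dna = TypedSeq("ACGCGGCGTG", "dna")
protein = TypedSeq("AAAGGGTAAA", "protein")

print(dna + dna)  # TypedSeq('ACGCGGCGTGACGCGGCGTG', 'dna')
try:
    dna + protein
except TypeError as e:
    print(e)      # Incompatible alphabets: dna and protein
```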
In [ ]:
# sequence with IUPAC alphabet
seq1 = BS.Seq("ACGCGGCGTG", BA.IUPAC.unambiguous_dna)
print(repr(seq1))
In [ ]:
# lower-case version has a generic alphabet
seq2 = seq1.lower()
print(repr(seq2))
In [ ]:
# an upper-case substring is not found in the lower-case version
print("ACGC" in seq1)
print("ACGC" in seq2)
In [ ]:
# DNA sequences
seq1 = BS.Seq("ACGCGGCGTG", BA.IUPAC.unambiguous_dna)
seq2 = BS.Seq("ACGCGGCGTG", BA.IUPAC.ambiguous_dna)
seq3 = BS.Seq("NNNCGGCGTG", BA.IUPAC.ambiguous_dna)
seq4 = BS.Seq("NNNCGGCGTG", BA.IUPAC.ambiguous_dna)
# protein sequence
seq5 = BS.Seq("ACGCGGCGTG", BA.IUPAC.protein)
In [ ]:
# compatible alphabets have no problems
print(seq1 == seq2)
In [ ]:
# ambiguous nucleotides are compared as such (i.e. no "intelligent" matching)
print(seq2 == seq4)
print(seq3 == seq4)
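For contrast, here is a sketch of what "intelligent" matching could look like: two positions are compatible if the sets of bases their IUPAC codes stand for overlap. This is not something Biopython's `==` does; `IUPAC_DNA` and `could_match` are hypothetical names used only for this illustration.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def could_match(seq_a, seq_b):
    """Return True if the two sequences are compatible position by position."""
    if len(seq_a) != len(seq_b):
        return False
    return all(
        set(IUPAC_DNA[a]) & set(IUPAC_DNA[b])  # empty intersection is falsy
        for a, b in zip(seq_a, seq_b)
    )


print("ACGCGGCGTG" == "NNNCGGCGTG")             # False: literal comparison
print(could_match("ACGCGGCGTG", "NNNCGGCGTG"))  # True: N covers any base
print(could_match("ACGCGGCGTG", "RRRCGGCGTG"))  # False: R is A or G, not C
```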
In [ ]:
# turn warnings into exceptions
import warnings as W
W.simplefilter("error")
import Bio as B
# incompatible alphabets are compared, but with a warning
try:
    print(seq1 == seq5)
except B.BiopythonWarning as e:
    print(e)
Seq objects have methods to create complements as well as to transcribe and translate them. The sequence alphabets are respected in these operations.
Note that transcription and translation always operate on the coding strand, even though biologically the RNA is synthesized from the template strand.
In [ ]:
# DNA sequence
seq1 = BS.Seq("ACGCGGCGTGAA", BA.IUPAC.unambiguous_dna)
print(repr(seq1))
In [ ]:
# complement
seq = seq1.complement()
print(repr(seq))
In [ ]:
# reverse complement
seq = seq1.reverse_complement()
print(repr(seq))
In [ ]:
# transcribed sequence
seq2 = seq1.transcribe()
print(repr(seq2))
In [ ]:
# translated sequence
seq3 = seq2.translate()
print(repr(seq3))
In [ ]:
# DNA can be translated directly, too
seq3 = seq1.translate()
print(repr(seq3))
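The operations above are simple string transformations, which the following plain-Python sketch reproduces for unambiguous, upper-case DNA. The function names mirror the Seq methods but are defined here for illustration only. The final lines show why working on the coding strand gives the same RNA as reading the template strand.

```python
# Complement pairs for unambiguous, upper-case DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    return complement(dna)[::-1]


def transcribe(dna):
    # Coding strand to mRNA: same letters, with T replaced by U.
    return dna.replace("T", "U")


coding = "ACGCGGCGTGAA"
print(complement(coding))          # TGCGCCGCACTT
print(reverse_complement(coding))  # TTCACGCCGCGT
print(transcribe(coding))          # ACGCGGCGUGAA

# The template strand is the reverse complement of the coding strand.
# Building the RNA against the template yields the same result.
template = reverse_complement(coding)
rna_from_template = reverse_complement(template).replace("T", "U")
print(rna_from_template == transcribe(coding))  # True
```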
The translation method issues a warning if the sequence contains a partial codon at the end.
In [ ]:
# turn warnings into exceptions
import warnings as W
W.simplefilter("error")
import Bio as B
# DNA sequence
seq = BS.Seq("ACGCGGCGTGA", BA.IUPAC.unambiguous_dna)
try:
    seq.translate()
except B.BiopythonWarning as e:
    print(e)
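Calling `simplefilter("error")` at the top level changes warning handling for the rest of the session. A possible alternative is to scope the change with `warnings.catch_warnings()`. The sketch below uses only the standard library; `split_codons` and its warning text are hypothetical and are not Biopython's own implementation or message.

```python
import warnings


def split_codons(dna):
    """Split dna into triplets, warning about a trailing partial codon."""
    remainder = len(dna) % 3
    if remainder:
        warnings.warn(f"Partial codon: {remainder} trailing base(s) ignored")
        dna = dna[:-remainder]
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


# Escalate warnings to exceptions only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        split_codons("ACGCGGCGTGA")
    except UserWarning as e:
        print(e)  # Partial codon: 2 trailing base(s) ignored

# Here warnings are recorded instead of raised, so the call completes.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(split_codons("ACGCGGCGTGA"))  # ['ACG', 'CGG', 'CGT']
print(len(caught))                      # 1
```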
The translated sequences can contain stop codons, which are shown as *. With to_stop=True, translation ends at the first stop codon, and the stop symbol itself is not included.
In [ ]:
# DNA sequence with two stop codons
seq = BS.Seq("ACGAGGGCGTAGGTGCCTCGATAG", BA.IUPAC.unambiguous_dna)
seq = seq.translate()
print(repr(seq))
In [ ]:
# translate until the first stop codon
seq = BS.Seq("ACGAGGGCGTAGGTGCCTCGATAG", BA.IUPAC.unambiguous_dna)
seq = seq.translate(to_stop=True)
print(repr(seq))
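The effect of `to_stop` can be sketched without Biopython. The code below builds the standard genetic code from its compact NCBI-style string and translates codon by codon. The `translate` function here is a simplified illustration, not the Seq method: it assumes unambiguous upper-case DNA and silently drops a trailing partial codon.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code in NCBI order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, to_stop=False):
    """Translate complete codons; optionally end before the first stop."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = STANDARD[dna[i:i + 3]]
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


dna = "ACGAGGGCGTAGGTGCCTCGATAG"
print(translate(dna))                # TRA*VPR*
print(translate(dna, to_stop=True))  # TRA
```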
The translation method accepts a specific codon table through its table argument. Biopython uses the codon tables provided by NCBI. See the Bio.Data.CodonTable module for details.
In [ ]:
import Bio.Data.CodonTable as BDCT
In [ ]:
# DNA sequence
seq1 = BS.Seq("ACGAGGGCGTAGGTGCCTCGATAG", BA.IUPAC.unambiguous_dna)
In [ ]:
# translation using the default codon table
seq2 = seq1.translate(table="Standard")
print(repr(seq2))
In [ ]:
# translation using a mitochondrial codon table
seq2 = seq1.translate(table="Vertebrate Mitochondrial")
print(repr(seq2))
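The two translations differ because the vertebrate mitochondrial code (NCBI table 2) reassigns four codons of the standard code. The plain-Python sketch below expresses that table as overrides on the standard one. The dictionaries and the `translate` helper are hypothetical illustrations and do not use Bio.Data.CodonTable.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code in NCBI order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Vertebrate mitochondrial code: AGA and AGG are stops, ATA is Met, TGA is Trp.
VERTEBRATE_MITO = {**STANDARD, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}


def translate(dna, table):
    """Translate the complete codons of dna with the given codon dictionary."""
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


dna = "ACGAGGGCGTAGGTGCCTCGATAG"
print(translate(dna, STANDARD))         # TRA*VPR*
print(translate(dna, VERTEBRATE_MITO))  # T*A*VPR*

# Stop codons can be read off a table directly.
print(sorted(c for c, aa in VERTEBRATE_MITO.items() if aa == "*"))
# ['AGA', 'AGG', 'TAA', 'TAG']
```

The second codon of the example, AGG, codes for arginine in the standard code but is a stop codon in the mitochondrial one, which is why the second translation contains an extra stop symbol.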
The codon tables are accessible via various dictionaries in the CodonTable module.
In [ ]:
# iterate over table names
for name in BDCT.generic_by_name.keys():
    print(name)
In [ ]:
# access a specific table
table = BDCT.generic_by_name["Vertebrate Mitochondrial"]
print(table)
In [ ]:
# start codons
print(table.start_codons)
In [ ]:
# stop codons
print(table.stop_codons)