BioPython is a set of free Python modules for working with genomic data, as well as other tools that are commonly used for sequence analysis, such as BLAST and Clustalw. Or as the BioPython developers put it
Basically, the goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.
In terms of getting help, BioPython has a very large cookbook full of examples, and as always, Google is your best friend. This very brief tutorial covers some of the basic bioinformatics tools that will be useful throughout the course.
Biopython includes support for interfacing with or parsing the output from a number of third party command line tools. These tools are not required to install Biopython, but may be of interest. This includes:
See also the listing on http://biopython.org/wiki/Download which should include URLs for these tools, and may also be more up to date.
If everything has been installed correctly, you should be able to import the module and print out the version that you have installed. Note that I am using version 1.64 and I cannot guarantee that the code below will work with different versions of BioPython.
In [1]:
import Bio as bio
print "BioPython Version: "+str(bio.__version__)
Sequences in bioinformatics can simply be interpreted as being strings. BioPython represents sequences with the Seq
object, which contains properties beyond that of a simple string object. For example, the Alphabet of the sequence can be specified (and also obtained). More about the Seq
class can be found on BioPython's API. It is possible to convert the sequence to a string using Python's built in str
command (e.g., str(my_sequence)
). Some useful methods of Seq
object include:
find
: Find method, like that of a python stringstartswith
: Does the Seq start with the given prefix? endswith
: Does the Seq end with the given suffix? complement
: Returns the complement sequence reverse_complement
: Returns the reverse complement sequence. transcribe
: Returns the RNA sequence from a DNA sequence back_transcribe
: Returns the DNA sequence from a RNA sequence translate
: Turns a nucleotide sequence into a protein sequence. Now let us look at a few simple sequence operations.
In [2]:
from Bio.Seq import Seq
# working with sequences
my_seq = Seq("AGTACACTGGT")
print "sequence: AGTACACTGGT"
print "complement: " + my_seq.complement()
print "reverse complement: " + my_seq.reverse_complement()
print "transcribe: " + my_seq.transcribe() # my_rna = my_seq.transcribe()
print "my_seq[2:4]: " + my_seq[2:4]
my_rna = my_seq.transcribe()
my_dna = my_rna.back_transcribe()
In the above example, we converted a simple string in Python to a sequence object, which allowed used to directly call built in functions of the Seq
class. We could call these methods directly as well (e.g., transcribe("AGTACACTGGT")
). Also, we did not specify an Alphabet. BioPython provides generic alphabets in Bio.Alphabet
. Specifying an alphabet can be useful for catching errors in a sequence, or specifying a particular type of sequence.
In [3]:
from Bio.Alphabet import generic_rna,generic_dna
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", generic_rna)
messenger_rna.translate()
Out[3]:
GC content is a commonly used statistic to quantify a sequence. It is measured as the percentage of nucleotides from either guanine or cytosine. More formally, the GC-content of a sequence is given by:
$$
\textrm{GC-content} = 100\times\frac{N_G + N_C}{N_G + N_C + N_A + N_T}
$$
where $N_G$, $N_C$, $N_A$, and $N_T$ are the counts for the nucleotides G, C, A and T, respectively. BioPython allows you to perform this calcuation using either GC
or GC123
. Note that GC
simply calculates the GC-content; however, CG123
calculates total GC-content plus first, second and third positions' GC-content. Let us also take a moment to examine how to create plots with IPython and Python. If you're using the Python interpreter than you must use from matplotlib.pylab import *
.
In [4]:
from Bio.Seq import Seq
from Bio.SeqUtils import GC123,GC,GC_skew
seq = "ataccaggctgaggcccattaatgatgcaatttgctgggcttctctattttctccgtgcttccatcctcttctccgtcggcggggagaagtgaaatgccgtggagatgggcggcggcggcggcgacggcggcgacgagaaagctcaccgggatctctcagtcgcgagtttcagtagcctttaccggccgtcttctctaccgctcgttcggaagcgactccagtgaaagccgcaagaggtcactgccacggggggtcgtatcgatcggggccatcagccttgctggaggtctcgtgctcagcgccgtcaacgacctcgccatcttcaatggatgcacaacgaaggcaattgagcatgctgctgacaaccctgctgttgtggaagcaattggagtgcctatagtcagaggaccgtggtatgatgcttctcttgaggtgggccatcgacggcggtctgtgtcatgcacattccctgtatctgggccacatgggtcaggatttctccagattaaggcaacccgagatggagaggatggtctgctttcgtttctgcggcatcacgactggaagatcctattgctggaggctcatcttgaagcaccatcagatgatgaggaccagagaaagctggttaaggtgaatcttgcaagcagtggccgtggggaagatggggatccagagagtggttaatcttttgtactgaattccatggtgagtggaagatcgtgtcatctgaatggactccaaatattaaatgacatggagatctagggaagcaaaaaaaaaaaaaaaa"
print GC123(seq)
print GC(seq)
plot(GC_skew(seq, window=100),c="r")
xlabel("Window")
ylabel("(G-C)/(G+C)")
title("GC-skew")
Out[4]:
We can also go directly from DNA to amino acids using the translate
command.
In [6]:
print "translate: " + Seq(seq).translate() # my_rna = my_seq.transcribe()
Refer to the documentation and cookbook for BioPython to view more of the sequence tools.
Different tools in bioinformatics use different standards for containing thier data. For example, GenBank, Fasta, and Biom are all file formats used to storge biological data. BioPython contains several builtin modules for interfacing with theses different file formats using the SeqIO
class. In this section, we examine converting between file formats as well as manipulating the variables in the SeqIO
class.
Let us take a moment to talk about the SeqRecord
object in BioPython. We learned that sequences in BioPython are stored in a a Seq
object, which contained some handy functions such as translation
. We can create a more detailed record that contains a seqence ID, name, the raw sequence, etc. Such a more detailed stucture can be found in the SeqRecord
class. SeqRecord
contains a sequence (as a Seq
object) with identifiers (ID and name), description and optionally annotation and sub-features. The SeqIO
system will only return SeqRecord
objects.
Quick Python Note
A = []
for x in range(5):
A.append(x**2)
B = [x**2 for x in range(5)]
# A and B are are identical (I use this condensed notation)
SeqIO
contains a parse
function that can readily handle GenBank and Fasta formats. We have a fasta file in ../data
that looks like:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA
...
>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTG
AATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGG
...
which contains 94 sequences, or records. We can read in the iterable set of sequence records into a dictionary with the key being the id
and the value being the sequence. In code this translates to:
In [7]:
from Bio import SeqIO
# we can parse the fasta file and read the sequences in by looping over
# the iterator returned by SeqIO.parse()
sequences = {} # empty dictionary
for seq_record in SeqIO.parse("../data/ls_orchid.fasta", "fasta"):
sequences[seq_record.id] = seq_record.seq
# an alternative way to do this can be done with one line of code
sequences2 = {s.id:s.seq for s in SeqIO.parse("../data/ls_orchid.fasta", "fasta")}
print sequences2.keys()[0]
print sequences2[sequences2.keys()[0]] # print the first sequence
We can also read in Genbank files as well.
In [9]:
# simple conversion from GB to a fasta file
records = SeqIO.parse("../data/ls_orchid_full.gbk", "genbank")
count = SeqIO.write(records, "../data/my_example.fasta", "fasta")
# read in the sequences from the GB file. note that this is nearly identical to when we
# performed this task with the fasta file in the previous code block
sequences = {s.id:s.seq for s in SeqIO.parse("../data/ls_orchid_full.gbk", "genbank")}
record = SeqIO.read("../data/ls_orchid.gbk", 'genbank')
print "locus = " + record.name
print "definition = " + record.description
print "version = " + record.id
print "seq = " + record.seq
In [10]:
%%bash
cat ../data/my_example.fasta | head -20
In [11]:
from Bio import Entrez
from Bio.SeqRecord import SeqRecord
terms = ["cytochrome oxidase i AND viridiplantae[Organism] and 450:2000[Sequence Length]",
"cytochrome oxidase i AND fungi[Organism] and 450:2000[Sequence Length]",
"cytochrome oxidase i AND animalia[Organism] and 450:2000[Sequence Length]"
]
Entrez.email = "gregscodeisdone@gmail.com" # you should put your email in here!
In [12]:
top_records = []
for term in terms: # I could do this in 1 line if I wanted to!
handle = Entrez.esearch(db="nucleotide", term=term)
top_records.append(Entrez.read(handle)["IdList"][0])
handle = Entrez.efetch(db="nucleotide", id=",".join(top_records), rettype="gb", retmode="xml")
genbank_records = Entrez.read(handle)
print "First Genbank record keys: " + str(genbank_records[0].keys())
print ""
print "First Genbank sequence (" + genbank_records[0]["GBSeq_primary-accession"] + "): " + genbank_records[0]["GBSeq_sequence"]
In [13]:
fasta_records = []
for record in genbank_records:
seq_record = SeqRecord(Seq(record["GBSeq_sequence"]), id=record["GBSeq_primary-accession"], description=record["GBSeq_definition"])
fasta_records.append(seq_record)
In [14]:
with open("../data/my_seqs.fa","w") as fasta_file:
SeqIO.write(fasta_records, fasta_file, "fasta")
In [15]:
%%bash
cat ../data/my_seqs.fa
BLAST can be run with BioPython either from using the command line tools -- provided they are installed -- or through the web. In this section, we use the qblast
from the Bio.Blast.NCBIWWW
module to call the online version of BLAST. Note that the results would be the same if we were to use the command line tools. As pointed out from the qblast
documentation, there are three required arguments:
qblast
only works with blastn, blastp, blastx, tblast and tblastx.The default output format for qblast
is XML; however, we shall use other methods in BioPython to convert the results in the XML file to a more interpretable and managable output.
In [16]:
from Bio.Blast import NCBIWWW, NCBIXML
E_VALUE_THRESH = 0.01
s_len = 100
# search against the nucleotide database (nt) using BLASTN, and you know the GI number
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116") # 449020131, 8332116
blast_records = NCBIXML.read(result_handle)
for alignment in blast_records.alignments[:5]:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print "****Alignment****"
print "sequence:", alignment.title
print "length:", alignment.length
print "e value:", hsp.expect
print hsp.query[0:s_len] + "..."
print hsp.match[0:s_len] + "..."
print hsp.sbjct[0:s_len] + "..."
Similarly, we can directly BLAST our sequence if we have it in the Python environment. BioPython is smart enough to pick up on how the third argument should be interpreted for BLASTing.
In [17]:
seq = "ataccaggctgaggcccattaatgatgcaatttgctgggcttctctattttctccgtgcttccatcctcttctccgtcggcggggagaagtgaaatgccgtggagatgggcggcggcggcggcgacggcggcgacgagaaagctcaccgggatctctcagtcgcgagtttcagtagcctttaccggccgtcttctctaccgctcgttcggaagcgactccagtgaaagccgcaagaggtcactgccacggggggtcgtatcgatcggggccatcagccttgctggaggtctcgtgctcagcgccgtcaacgacctcgccatcttcaatggatgcacaacgaaggcaattgagcatgctgctgacaaccctgctgttgtggaagcaattggagtgcctatagtcagaggaccgtggtatgatgcttctcttgaggtgggccatcgacggcggtctgtgtcatgcacattccctgtatctgggccacatgggtcaggatttctccagattaaggcaacccgagatggagaggatggtctgctttcgtttctgcggcatcacgactggaagatcctattgctggaggctcatcttgaagcaccatcagatgatgaggaccagagaaagctggttaaggtgaatcttgcaagcagtggccgtggggaagatggggatccagagagtggttaatcttttgtactgaattccatggtgagtggaagatcgtgtcatctgaatggactccaaatattaaatgacatggagatctagggaagcaaaaaaaaaaaaaaaa "
result_handle = NCBIWWW.qblast("blastn", "nt", seq)
blast_records = NCBIXML.read(result_handle)
for alignment in blast_records.alignments[:5]:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print "****Alignment****"
print "sequence:", alignment.title
print "length:", alignment.length
print "e value:", hsp.expect
print hsp.query[0:s_len] + "..."
print hsp.match[0:s_len] + "..."
print hsp.sbjct[0:s_len] + "..."
In [18]:
from Bio import Phylo
tree = Phylo.read("../data/simple.dnd", "newick")
print tree
Phylo.draw_ascii(tree)
tree.rooted = True
Phylo.draw(tree)
In [19]:
from Bio.Phylo.PhyloXML import Phylogeny
tree = tree.as_phyloxml()
tree = Phylogeny.from_tree(tree)
tree.root.color = (128, 128, 128)
mrca = tree.common_ancestor({"name": "E"}, {"name": "F"})
mrca.color = "salmon"
tree.clade[0, 1].color = "blue"
Phylo.draw(tree, branch_labels=lambda c: c.branch_length)
Phylo.draw(tree)
In [ ]: