What is BioPython

BioPython is a set of free Python modules for working with genomic data, as well as other tools that are commonly used for sequence analysis, such as BLAST and Clustalw. Or as the BioPython developers put it

Basically, the goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.

In terms of getting help, BioPython has a very large cookbook full of examples, and as always, Google is your best friend. This very brief tutorial covers some of the basic bioinformatics tools that will be useful throughout the course.

BioPython and Support from External Tools

Biopython includes support for interfacing with or parsing the output from a number of third party command line tools. These tools are not required to install Biopython, but may be of interest. This includes:

  • NCBI Standalone BLAST, which can used with the Bio.Blast module and parsed with the Bio.SearchIO module.
  • EMBOSS tools, which can be invoked using the Bio.Emboss module. The Bio.AlignIO module can also parse some alignment formats output by the EMBOSS suite.
  • ClustalW, which can be parsed using Bio.AlignIO and invoked using the Bio.Align.Applications module.
  • SIMCOAL2 and FDist tools for population genetics can be used via the Bio.PopGen module.
  • Bill Pearson’s FASTA tools output can be parsed using the Bio.AlignIO and Bio.SearchIO module.
  • Wise2 includes the useful tool dnal.

See also the listing on http://biopython.org/wiki/Download which should include URLs for these tools, and may also be more up to date.

If everything has been installed correctly, you should be able to import the module and print out the version that you have installed. Note that I am using version 1.64 and I cannot guarantee that the code below will work with different versions of BioPython.


In [1]:
import Bio as bio
print "BioPython Version: "+str(bio.__version__)


BioPython Version: 1.64

Getting Started with Python

If you are new, or uncomfortable with Python, you are encouraged to watch the videos below.

Basic Sequence Tools

Sequences in bioinformatics can simply be interpreted as being strings. BioPython represents sequences with the Seq object, which contains properties beyond that of a simple string object. For example, the Alphabet of the sequence can be specified (and also obtained). More about the Seq class can be found on BioPython's API. It is possible to convert the sequence to a string using Python's built in str command (e.g., str(my_sequence)). Some useful methods of Seq object include:

  • find: Find method, like that of a python string
  • startswith: Does the Seq start with the given prefix?
  • endswith: Does the Seq end with the given suffix?
  • complement: Returns the complement sequence
  • reverse_complement: Returns the reverse complement sequence.
  • transcribe: Returns the RNA sequence from a DNA sequence
  • back_transcribe: Returns the DNA sequence from a RNA sequence
  • translate: Turns a nucleotide sequence into a protein sequence.

Now let us look at a few simple sequence operations.


In [2]:
from Bio.Seq import Seq 

# working with sequences 
my_seq = Seq("AGTACACTGGT") 
print "sequence: AGTACACTGGT" 
print "complement: " + my_seq.complement() 
print "reverse complement: " + my_seq.reverse_complement()
print "transcribe: " + my_seq.transcribe() # my_rna = my_seq.transcribe()
print "my_seq[2:4]: " + my_seq[2:4]
my_rna = my_seq.transcribe()
my_dna = my_rna.back_transcribe()


sequence: AGTACACTGGT
complement: TCATGTGACCA
reverse complement: ACCAGTGTACT
transcribe: AGUACACUGGU
my_seq[2:4]: TA

In the above example, we converted a simple string in Python to a sequence object, which allowed used to directly call built in functions of the Seq class. We could call these methods directly as well (e.g., transcribe("AGTACACTGGT")). Also, we did not specify an Alphabet. BioPython provides generic alphabets in Bio.Alphabet. Specifying an alphabet can be useful for catching errors in a sequence, or specifying a particular type of sequence.


In [3]:
from Bio.Alphabet import generic_rna,generic_dna 
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", generic_rna) 
messenger_rna.translate()


Out[3]:
Seq('MAIVMGR*KGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

GC content is a commonly used statistic to quantify a sequence. It is measured as the percentage of nucleotides from either guanine or cytosine. More formally, the GC-content of a sequence is given by: $$ \textrm{GC-content} = 100\times\frac{N_G + N_C}{N_G + N_C + N_A + N_T} $$ where $N_G$, $N_C$, $N_A$, and $N_T$ are the counts for the nucleotides G, C, A and T, respectively. BioPython allows you to perform this calcuation using either GC or GC123. Note that GC simply calculates the GC-content; however, CG123 calculates total GC-content plus first, second and third positions' GC-content. Let us also take a moment to examine how to create plots with IPython and Python. If you're using the Python interpreter than you must use from matplotlib.pylab import *.


In [4]:
from Bio.Seq import Seq 
from Bio.SeqUtils import GC123,GC,GC_skew
seq = "ataccaggctgaggcccattaatgatgcaatttgctgggcttctctattttctccgtgcttccatcctcttctccgtcggcggggagaagtgaaatgccgtggagatgggcggcggcggcggcgacggcggcgacgagaaagctcaccgggatctctcagtcgcgagtttcagtagcctttaccggccgtcttctctaccgctcgttcggaagcgactccagtgaaagccgcaagaggtcactgccacggggggtcgtatcgatcggggccatcagccttgctggaggtctcgtgctcagcgccgtcaacgacctcgccatcttcaatggatgcacaacgaaggcaattgagcatgctgctgacaaccctgctgttgtggaagcaattggagtgcctatagtcagaggaccgtggtatgatgcttctcttgaggtgggccatcgacggcggtctgtgtcatgcacattccctgtatctgggccacatgggtcaggatttctccagattaaggcaacccgagatggagaggatggtctgctttcgtttctgcggcatcacgactggaagatcctattgctggaggctcatcttgaagcaccatcagatgatgaggaccagagaaagctggttaaggtgaatcttgcaagcagtggccgtggggaagatggggatccagagagtggttaatcttttgtactgaattccatggtgagtggaagatcgtgtcatctgaatggactccaaatattaaatgacatggagatctagggaagcaaaaaaaaaaaaaaaa"
print GC123(seq)
print GC(seq)

plot(GC_skew(seq, window=100),c="r")
xlabel("Window")
ylabel("(G-C)/(G+C)")
title("GC-skew")


(52.97092288242731, 51.89393939393939, 57.57575757575758, 49.42965779467681)
52.9709228824
Out[4]:
<matplotlib.text.Text at 0x10c62c7d0>

We can also go directly from DNA to amino acids using the translate command.


In [6]:
print "translate: " + Seq(seq).translate() # my_rna = my_seq.transcribe()


translate: IPG*GPLMMQFAGLLYFLRASILFSVGGEK*NAVEMGGGGGDGGDEKAHRDLSVASFSSLYRPSSLPLVRKRLQ*KPQEVTATGGRIDRGHQPCWRSRAQRRQRPRHLQWMHNEGN*ACC*QPCCCGSNWSAYSQRTVV*CFS*GGPSTAVCVMHIPCIWATWVRISPD*GNPRWRGWSAFVSAASRLEDPIAGGSS*STIR**GPEKAG*GESCKQWPWGRWGSREWLIFCTEFHGEWKIVSSEWTPNIK*HGDLGKQKKKK

Refer to the documentation and cookbook for BioPython to view more of the sequence tools.

Basic FileIO

Different tools in bioinformatics use different standards for containing thier data. For example, GenBank, Fasta, and Biom are all file formats used to storge biological data. BioPython contains several builtin modules for interfacing with theses different file formats using the SeqIO class. In this section, we examine converting between file formats as well as manipulating the variables in the SeqIO class.

Let us take a moment to talk about the SeqRecord object in BioPython. We learned that sequences in BioPython are stored in a a Seq object, which contained some handy functions such as translation. We can create a more detailed record that contains a seqence ID, name, the raw sequence, etc. Such a more detailed stucture can be found in the SeqRecord class. SeqRecord contains a sequence (as a Seq object) with identifiers (ID and name), description and optionally annotation and sub-features. The SeqIO system will only return SeqRecord objects.

Quick Python Note

A = []
for x in range(5):
  A.append(x**2)

B = [x**2 for x in range(5)]
# A and B are are identical (I use this condensed notation)

SeqIO contains a parse function that can readily handle GenBank and Fasta formats. We have a fasta file in ../data that looks like:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA
...

>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTG
AATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGG
...

which contains 94 sequences, or records. We can read in the iterable set of sequence records into a dictionary with the key being the id and the value being the sequence. In code this translates to:


In [7]:
from Bio import SeqIO

# we can parse the fasta file and read the sequences in by looping over 
# the iterator returned by SeqIO.parse()
sequences = {}  # empty dictionary 
for seq_record in SeqIO.parse("../data/ls_orchid.fasta", "fasta"):
    sequences[seq_record.id] = seq_record.seq

# an alternative way to do this can be done with one line of code
sequences2 = {s.id:s.seq for s in SeqIO.parse("../data/ls_orchid.fasta", "fasta")}

print sequences2.keys()[0]
print sequences2[sequences2.keys()[0]]  # print the first sequence


gi|2765596|emb|Z78471.1|PDZ78471
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAATTGATCGAGTTAATCTAGAGGATCGGTTTACTTTGGTCACCCATGGGCATCTGCTCTTACAGTGACCTGGATTTGCCATCGAGCCTCCTTGGGAGCTGTCTTGCTGGCGATCTAAATCGTTGCCCGACGCAGCCTTGCGTCAAGTCACCCCGACACATAATGGAAGGGGGCGGCATGCTGCCTTGACCCTTCCCCAAATTAATTTTTTGACAACTCTCAACATCGGATATCTCGGCTCTTGCATCGATGAAGAACGCAGCGAAATGCGATAAATGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCCAAGGGCACGCCTGCTTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTGTCCATACATACTGTTCAGCCAGTGCGGATGTGAGTTTGGCCCCTTGTTCTTTAGTACGGGGGGTCTAAAAGCTGCATGGGCTTTTGCTGGTCCTAAATACGGCAAGAGGTGGACAAAGTATGCTACAACAAAATTGTAGTGCGAATGCCCCGGGTTGTCGTATTAGATGGGCCAGCATAATTTAAAGACCCTTTTGAACCCCATTAGAGGCCCATCAACCCTGATCAGTTGATGGCCATTTGGTTGCGACCCCAAGTCAGGTGAGGCAACCCGCTGAGTTTAAGC

We can also read in Genbank files as well.


In [9]:
# simple conversion from GB to a fasta file
records = SeqIO.parse("../data/ls_orchid_full.gbk", "genbank")
count = SeqIO.write(records, "../data/my_example.fasta", "fasta")

# read in the sequences from the GB file. note that this is nearly identical to when we 
# performed this task with the fasta file in the previous code block
sequences = {s.id:s.seq for s in SeqIO.parse("../data/ls_orchid_full.gbk", "genbank")}

record = SeqIO.read("../data/ls_orchid.gbk", 'genbank')
print "locus = " + record.name
print "definition = " + record.description
print "version = " + record.id
print "seq = " + record.seq


locus = Z78533
definition = C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
version = Z78533.1
seq = CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC

In [10]:
%%bash 
cat ../data/my_example.fasta | head -20


>Z78533.1 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA
CGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGT
GACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCC
CGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCC
CAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAA
CGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTG
AATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA
GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCG
GCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCG
GCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTG
GCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCC
TTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGG
GGCACCCGCTGAGTTTACGC
>Z78532.1 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA.
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATA
TGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCG
TGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTA
CGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAA
AAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGG

Accessing the Entrez Database

We can also search....


In [11]:
from Bio import Entrez 
from Bio.SeqRecord import SeqRecord

terms = ["cytochrome oxidase i AND viridiplantae[Organism] and 450:2000[Sequence Length]",
         "cytochrome oxidase i AND fungi[Organism] and 450:2000[Sequence Length]",
         "cytochrome oxidase i AND animalia[Organism] and 450:2000[Sequence Length]"
         ]
Entrez.email = "gregscodeisdone@gmail.com"   # you should put your email in here!

In [12]:
top_records = []
for term in terms:  # I could do this in 1 line if I wanted to!
    handle = Entrez.esearch(db="nucleotide", term=term)
    top_records.append(Entrez.read(handle)["IdList"][0])

handle = Entrez.efetch(db="nucleotide", id=",".join(top_records), rettype="gb", retmode="xml")
genbank_records = Entrez.read(handle)
print "First Genbank record keys: " + str(genbank_records[0].keys())
print "" 
print "First Genbank sequence (" + genbank_records[0]["GBSeq_primary-accession"] + "): " + genbank_records[0]["GBSeq_sequence"]


First Genbank record keys: [u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence', u'GBSeq_primary-accession', u'GBSeq_definition', u'GBSeq_accession-version', u'GBSeq_topology', u'GBSeq_length', u'GBSeq_feature-table', u'GBSeq_create-date', u'GBSeq_other-seqids', u'GBSeq_division', u'GBSeq_taxonomy', u'GBSeq_comment', u'GBSeq_references', u'GBSeq_update-date', u'GBSeq_organism', u'GBSeq_locus', u'GBSeq_keywords', u'GBSeq_strandedness']

First Genbank sequence (KC016094): ctttattcccacattagataccagaatagttattgcacgtatgtgtaacgccagtatattaattgcgaaggttgctaagctaatcctaatcaggcaacgagtctttcaggagttctggacacctacgggagttgaggaagtaatcatggctgaacaaagggaggaatgttgccaataaggtctctgtctgctgtatgcatatggctatccttctttctctgggatgcgggatacgtttcactgtccatgaaggaagaaaccactcgaccaagggagaggtggtttgcccggcatggtactgattgttaacgagattgattattcacgctctgcttattgagatcttgggtttcttccccagtatattaattgccggattttccggtgttttacttagctgaacagcactagctccttctctcatgccaaatccaagtcatttagcctagatagcacggggaagaggatcagaccattcattaactccgctaattcaagttctgttcactctcgctacgcacctgagaactataagaaaataattcgctacctcccgctcacaattaatatactggcgccttgttctccacgccattcattcccctggtaaataattctacgagatcgatgtgttaagcatagtgtctctgtctccaaatccgtatcaaccactgcgactctcgctaggatttctattctgaactaatatactgggagctgcgatattgagaatgtgagtcttgctgcaaagcagatgaactacctaactaaactaagggaaggggaagaagggtcaaacacctgacatgactaagcgctagggagactcgctacgcacctgcaccctctgctgcttaccaaattcttagggccgtggtgaactggtcgtcctataccttctgcctatctcatgtgtgtggaaccaggtctttttcggttccagcttaactttcagtaggggtgagaaacgccaacctcccctgttgcctattagtaagcggccccaatagtattactcaaacccagtataaacagaaatataaaaaatagaaaaatatagcaaatataaaaaataaaaaactctgagaactacctaactaaagaagaatagtgctttttctatgattattcagctagtaggcgtggagagctttttgcggggaaacttgcaagtcaagtttggggggaggcgggcgtcgacccaaccttatgagtattcagactataacagtttcgatgaccagtcactcacttttgacagttatatgattccataggatgatccagaattgggtcaatcatgtttattagaagtggacaatagagtggctgtaccagccaaaactcatgtatgtatgattgtaacacccgctgatgtacctcatagttgggctgtaccttctt

In [13]:
fasta_records = []
for record in genbank_records:
    seq_record = SeqRecord(Seq(record["GBSeq_sequence"]), id=record["GBSeq_primary-accession"], description=record["GBSeq_definition"])
    fasta_records.append(seq_record)

In [14]:
with open("../data/my_seqs.fa","w") as fasta_file:
    SeqIO.write(fasta_records, fasta_file, "fasta")

In [15]:
%%bash 
cat ../data/my_seqs.fa


>KC016094 UNVERIFIED: Allium sphaerocephalon cytochrome oxidase subunit 2-like (cox2) gene, partial sequence; mitochondrial
ctttattcccacattagataccagaatagttattgcacgtatgtgtaacgccagtatatt
aattgcgaaggttgctaagctaatcctaatcaggcaacgagtctttcaggagttctggac
acctacgggagttgaggaagtaatcatggctgaacaaagggaggaatgttgccaataagg
tctctgtctgctgtatgcatatggctatccttctttctctgggatgcgggatacgtttca
ctgtccatgaaggaagaaaccactcgaccaagggagaggtggtttgcccggcatggtact
gattgttaacgagattgattattcacgctctgcttattgagatcttgggtttcttcccca
gtatattaattgccggattttccggtgttttacttagctgaacagcactagctccttctc
tcatgccaaatccaagtcatttagcctagatagcacggggaagaggatcagaccattcat
taactccgctaattcaagttctgttcactctcgctacgcacctgagaactataagaaaat
aattcgctacctcccgctcacaattaatatactggcgccttgttctccacgccattcatt
cccctggtaaataattctacgagatcgatgtgttaagcatagtgtctctgtctccaaatc
cgtatcaaccactgcgactctcgctaggatttctattctgaactaatatactgggagctg
cgatattgagaatgtgagtcttgctgcaaagcagatgaactacctaactaaactaaggga
aggggaagaagggtcaaacacctgacatgactaagcgctagggagactcgctacgcacct
gcaccctctgctgcttaccaaattcttagggccgtggtgaactggtcgtcctataccttc
tgcctatctcatgtgtgtggaaccaggtctttttcggttccagcttaactttcagtaggg
gtgagaaacgccaacctcccctgttgcctattagtaagcggccccaatagtattactcaa
acccagtataaacagaaatataaaaaatagaaaaatatagcaaatataaaaaataaaaaa
ctctgagaactacctaactaaagaagaatagtgctttttctatgattattcagctagtag
gcgtggagagctttttgcggggaaacttgcaagtcaagtttggggggaggcgggcgtcga
cccaaccttatgagtattcagactataacagtttcgatgaccagtcactcacttttgaca
gttatatgattccataggatgatccagaattgggtcaatcatgtttattagaagtggaca
atagagtggctgtaccagccaaaactcatgtatgtatgattgtaacacccgctgatgtac
ctcatagttgggctgtaccttctt
>KF528054 Uredo rolliniae isolate 189 cytochrome oxidase subunit 3 gene, partial cds; mitochondrial
gtagagggtggtggttatggggtagcactaggttttgtaagcacagtaggtgtaataagt
ctatgatttagggatgtaagtgcagaggggtcgctaggggggtatcatacgtttgatgtt
caacgttctctaaatatgggggtagtactatttattgtaagtgagatctttatttttgta
tcaattttttgagcttattttcattcagctctaagtcctacggtagagctaggttctcag
tggccggcaccaggtgtagagccactgaatgcatttgagattccacttctaaatacgatt
ctgcttctaacctcagcgagctcactaacttatgcacaccatgctctaattaatgggaat
aggaagtcatgtctaattggttttgcggtgactctagcactggcagtaacttttacaggg
tttcaagcactagagtatattgaggcaccttttacaatttcagatggggcatttggtagt
acttttttctttagtacaggttctcatgggatgcacgtaattattgggacgatttttcta
acagtagcactagctcgaattattagttaccaactaacagaccaccaccatctaggtttt
gaggctgcagcactgtattgacattttgttgatgttgtttga
>KF802856 Aspidiotus nerii isolate D0530E cytochrome oxidase subunit 1 (CO1) and cytochrome oxidase subunit 2 (CO2) genes, partial cds; mitochondrial
aataatttatcaactttaggatcaataataacaattatatttatttttatatttttttat
tcaattattgaacttttaatttcaaaacgaaaaattattattattattaaatctaataat
aatgaatgaaaaaataatttaccaactctttctcattcaaatattgaaaataattttata
ttcaataaaaattaaattatatcatgaataaacctaaattttcaaaaccctaattcaatc
aatataattaaaattagaatatttcataattttatattaattattttaattaatattatt
ataattttaattttaataattaatttttcttttattaataaatttaataataaaataatt
ttacaaaatcaaaaattagaaatcttatgaacaattattcccataattatcattattata
attagaataatttcaattaatttattattttcaaataatgaaataaaaaataatatttta
aatattaaaatcattggtaatcaatgattttgaaattatgaatactcattatttaacaaa
aactttaattcatatttattaataaataataattttaatttttttatattagaaacagat
aataattttattattccttttaattatcaattaatacttattttttcatctttagatgtt
attcattcttgaaccattccatcaataaatattaaaatag

BLAST

BLAST can be run with BioPython either from using the command line tools -- provided they are installed -- or through the web. In this section, we use the qblast from the Bio.Blast.NCBIWWW module to call the online version of BLAST. Note that the results would be the same if we were to use the command line tools. As pointed out from the qblast documentation, there are three required arguments:

  • The first argument is the blast program to use for the search, as a lower case string. The options and descriptions of the programs are available at http://www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml. Currently qblast only works with blastn, blastp, blastx, tblast and tblastx.
  • The second argument specifies the database in which to perform the search. Again, the options for this are available on the NCBI web pages at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml.
  • The third argument is a string containing your query sequence. This can either be the sequence itself, the sequence in fasta format, or an identifier like a GI number.

The default output format for qblast is XML; however, we shall use other methods in BioPython to convert the results in the XML file to a more interpretable and managable output.


In [16]:
from Bio.Blast import NCBIWWW, NCBIXML

E_VALUE_THRESH = 0.01
s_len = 100

# search against the nucleotide database (nt) using BLASTN, and you know the GI number 
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116") # 449020131, 8332116
blast_records = NCBIXML.read(result_handle) 

for alignment in blast_records.alignments[:5]:
    for hsp in alignment.hsps: 
        if hsp.expect < E_VALUE_THRESH: 
            print "****Alignment****"
            print "sequence:", alignment.title 
            print "length:", alignment.length 
            print "e value:", hsp.expect 
            print hsp.query[0:s_len] + "..." 
            print hsp.match[0:s_len] + "..." 
            print hsp.sbjct[0:s_len] + "..."


****Alignment****
sequence: gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA
length: 878
e value: 4.20976e-99
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAAT...
|||||||||||    |||| || ||||| |||||||||||| |   ||  |  ||  |  |||||   || ||| ||||||||||||| || ||  ||| ...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGATTGCTGCAAA...
****Alignment****
sequence: gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X4, mRNA
length: 911
e value: 4.20976e-99
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAAT...
|||||||||||    |||| || ||||| |||||||||||| |   ||  |  ||  |  |||||   || ||| ||||||||||||| || ||  ||| ...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGATTGCTGCAAA...
****Alignment****
sequence: gi|568824603|ref|XM_006466624.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X3, mRNA
length: 894
e value: 4.20976e-99
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAAT...
|||||||||||    |||| || ||||| |||||||||||| |   ||  |  ||  |  |||||   || ||| ||||||||||||| || ||  ||| ...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGATTGCTGCAAA...
****Alignment****
sequence: gi|568824601|ref|XM_006466623.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X2, mRNA
length: 922
e value: 4.20976e-99
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAAT...
|||||||||||    |||| || ||||| |||||||||||| |   ||  |  ||  |  |||||   || ||| ||||||||||||| || ||  ||| ...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGATTGCTGCAAA...
****Alignment****
sequence: gi|568824599|ref|XM_006466622.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X1, mRNA
length: 966
e value: 4.20976e-99
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAAT...
|||||||||||    |||| || ||||| |||||||||||| |   ||  |  ||  |  |||||   || ||| ||||||||||||| || ||  ||| ...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGATTGCTGCAAA...

Similarly, we can directly BLAST our sequence if we have it in the Python environment. BioPython is smart enough to pick up on how the third argument should be interpreted for BLASTing.


In [17]:
seq = "ataccaggctgaggcccattaatgatgcaatttgctgggcttctctattttctccgtgcttccatcctcttctccgtcggcggggagaagtgaaatgccgtggagatgggcggcggcggcggcgacggcggcgacgagaaagctcaccgggatctctcagtcgcgagtttcagtagcctttaccggccgtcttctctaccgctcgttcggaagcgactccagtgaaagccgcaagaggtcactgccacggggggtcgtatcgatcggggccatcagccttgctggaggtctcgtgctcagcgccgtcaacgacctcgccatcttcaatggatgcacaacgaaggcaattgagcatgctgctgacaaccctgctgttgtggaagcaattggagtgcctatagtcagaggaccgtggtatgatgcttctcttgaggtgggccatcgacggcggtctgtgtcatgcacattccctgtatctgggccacatgggtcaggatttctccagattaaggcaacccgagatggagaggatggtctgctttcgtttctgcggcatcacgactggaagatcctattgctggaggctcatcttgaagcaccatcagatgatgaggaccagagaaagctggttaaggtgaatcttgcaagcagtggccgtggggaagatggggatccagagagtggttaatcttttgtactgaattccatggtgagtggaagatcgtgtcatctgaatggactccaaatattaaatgacatggagatctagggaagcaaaaaaaaaaaaaaaa "
result_handle = NCBIWWW.qblast("blastn", "nt", seq) 
blast_records = NCBIXML.read(result_handle) 

for alignment in blast_records.alignments[:5]:
    for hsp in alignment.hsps: 
        if hsp.expect < E_VALUE_THRESH: 
            print "****Alignment****"
            print "sequence:", alignment.title 
            print "length:", alignment.length 
            print "e value:", hsp.expect 
            print hsp.query[0:s_len] + "..." 
            print hsp.match[0:s_len] + "..." 
            print hsp.sbjct[0:s_len] + "..."


****Alignment****
sequence: gi|226505045|ref|NM_001150016.1| Zea mays uncharacterized LOC100276165 (LOC100276165), mRNA >gi|195621403|gb|EU960414.1| Zea mays clone 224719 hypothetical protein mRNA, complete cds
length: 791
e value: 0.0
ATACCAGGCTGAGGCCCATTAATGATGCAATTTGCTGGGCTTCTCTATTTTCTCCGTGCTTCCATCCTCTTCTCCGTCGGCGGGGAGAAGTGAAATGCCG...
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
ATACCAGGCTGAGGCCCATTAATGATGCAATTTGCTGGGCTTCTCTATTTTCTCCGTGCTTCCATCCTCTTCTCCGTCGGCGGGGAGAAGTGAAATGCCG...
****Alignment****
sequence: gi|242044007|ref|XM_002459830.1| Sorghum bicolor hypothetical protein, mRNA
length: 582
e value: 0.0
CGGCGGCGACGAGAAAGCTCACCGGGATCTCTCAGTCGCGAGTTTCAGTAGCCTTTACCGGCCGTCTTCTCTACCGCTCGTTCGGAAGCGACTCCAGTGA...
|||||||| ||||||||||| |||| ||  | |||| |||   | ||| ||||||||||||| |||||||||||||||||||||||||||||||||  | ...
CGGCGGCGGCGAGAAAGCTCGCCGGAATGCCCCAGTTGCGCACTCCAGGAGCCTTTACCGGCGGTCTTCTCTACCGCTCGTTCGGAAGCGACTCCATCGG...
****Alignment****
sequence: gi|514727640|ref|XM_004956214.1| PREDICTED: Setaria italica serine/threonine-protein kinase PBS1-like (LOC101783701), mRNA
length: 2391
e value: 7.92359e-113
GCTCAGCGCCGTCAACGACCTCGCCATCTTCAATGGATGCACAACGAAGGCAATTGAGCATGCTGCTGACAACCCTGCTGTTGTGGAAGCAATTGGAGTG...
|||||||||||| |||||||||||||||||| ||||||| ||||| |||||||||||| | || |||||||||||    ||||| |||||||||||||||...
GCTCAGCGCCGTTAACGACCTCGCCATCTTCCATGGATGTACAACTAAGGCAATTGAGAAGGCGGCTGACAACCCAAAGGTTGTCGAAGCAATTGGAGTG...
****Alignment****
sequence: gi|357160261|ref|XM_003578660.1| PREDICTED: Brachypodium distachyon uncharacterized LOC100836817 (LOC100836817), mRNA
length: 933
e value: 1.43388e-90
AAGAGGTCACTGCCACGGGGGGTCGTATCGATCGGGGCCATCAGCCTTGCTGGAGGTCTCGTGCTCAGCGCCGTCAACGACCTCGCCATCTTCAATGGAT...
|||||||| | |  |||| ||||| | ||||| || | ||||||||| ||||| ||  |||  ||||| ||| |||||||||||||||||||| ||||||...
AAGAGGTCGCCGGTACGGAGGGTCCTGTCGATTGGTGTCATCAGCCTCGCTGGTGGAGTCGCCCTCAGTGCCCTCAACGACCTCGCCATCTTCCATGGAT...
****Alignment****
sequence: gi|149391396|gb|EF576128.1| Oryza sativa (indica cultivar-group) clone V-E12 unknown mRNA
length: 752
e value: 1.2578e-78
AAGAGGTCACTGCCACGGGGGGTCGTATCGATCGGGGCCATCAGCCTTGCTGGAGGTCTCGTGCTCAGCGCCGTCAACGACCTCGCCATCTTCAATGGAT...
|||||||||| |   ||| || |||| ||||| ||||  |||||| | ||||| ||  ||| ||| |||||| |||||||||| || || ||| ||||||...
AAGAGGTCACGGATCCGGAGGATCGTGTCGATTGGGGTTATCAGCATCGCTGGCGGCGTCGCGCTTAGCGCCCTCAACGACCTTGCTATATTCCATGGAT...

Phylogenetic Trees


In [18]:
from Bio import Phylo 
tree = Phylo.read("../data/simple.dnd", "newick")
print tree
Phylo.draw_ascii(tree)
tree.rooted = True 
Phylo.draw(tree)


Tree(rooted=False, weight=1.0)
    Clade(branch_length=1.0)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='A')
                Clade(branch_length=1.0, name='B')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='C')
                Clade(branch_length=1.0, name='D')
        Clade(branch_length=1.0)
            Clade(branch_length=1.0, name='E')
            Clade(branch_length=1.0, name='F')
            Clade(branch_length=1.0, name='G')
                                                          __________________ A
                                       __________________|
                                      |                  |__________________ B
                    __________________|
                   |                  |                   __________________ C
                   |                  |__________________|
___________________|                                     |__________________ D
                   |
                   |                   __________________ E
                   |                  |
                   |__________________|__________________ F
                                      |
                                      |__________________ G


In [19]:
from Bio.Phylo.PhyloXML import Phylogeny 
tree = tree.as_phyloxml() 
tree = Phylogeny.from_tree(tree)
tree.root.color = (128, 128, 128) 
mrca = tree.common_ancestor({"name": "E"}, {"name": "F"}) 
mrca.color = "salmon" 
tree.clade[0, 1].color = "blue" 
Phylo.draw(tree, branch_labels=lambda c: c.branch_length) 
Phylo.draw(tree)



In [ ]: