BioPython is a versatile Python package for computational biology, particularly if your interest is in sequence analysis. In addition to performing multiple alignments, it can download sequences from the internet, construct phylogenetic trees, and even run population genetics simulations. By the end of this lecture, you should be able to:
Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.
Other modules that might be of interest:
Recall from lectures gone by: http://www.ncbi.nlm.nih.gov/gene/5216
First line is description of sequence and starts with >
All lines up to the next > are part of the same sequence
Usually fewer than 80 characters per line
>gi|568815581:c4949086-4945650 Homo sapiens chromosome 17, GRCh38.p2 Primary Assembly
CCCGCAGGGTCCACACGGGTCGGGCCGGGCGCGCTCCCGTGCAGCCGGCTCCGGCCCCGACCGCCCCATG
CACTCCCGGCCCCGGCGCAGGCGCAGGCGCGGGCACACGCGCCGCCGCCCGCCGGTCCTTCCCTTCGGCG
GAGGTGGGGGAAGGAGGAGTCATCCCGTTTAACCCTGGGCTCCCCGAACTCTCCTTAATTTGCTAAATTT
GCAGCTTGCTAATTCCTCCTGCTTTCTCCTTCCTTCCTTCTTCTGGCTCACTCCCTGCCCCGATACCAAA
GTCTGGTTTATATTCAGTGCAAATTGGAGCAAACCCTACCCTTCACCTCTCTCCCGCCACCCCCCATCCT
TCTGCATTGCTTTCCATCGAACTCTGCAAATTTTGCAATAGGGGGAGGGATTTTTAAAATTGCATTTGCA
LOCUS CAG28598 140 aa linear PRI 16-OCT-2008
DEFINITION PFN1, partial [Homo sapiens].
ACCESSION CAG28598
VERSION CAG28598.1 GI:47115277
DBSOURCE embl accession CR407670.1
KEYWORDS .
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
ORIGIN
1 magwnayidn lmadgtcqda aivgykdsps vwaavpgktf vnitpaevgv lvgkdrssfy
61 vngltlggqk csvirdsllq dgefsmdlrt kstggaptfn vtvtktdktl vllmgkegvh
121 gglinkkcye mashlrrsqy
//
Let's dive into BioPython, shall we?
In [1]:
from Bio.Seq import Seq # the submodule structure of biopython is a little awkward
s = Seq("GATTACA")
print(s)
Sequences act a lot like strings, but have an associated alphabet and additional methods.
Methods shared with str
: count
, endswith
, find
, lower
, lstrip
, rfind
, split
, startswith
, strip
,upper
Seq
-specific methods: back_transcribe
, complement
, reverse_complement
, tomutable
, tostring
, transcribe
, translate
, ungap
In [2]:
s.alphabet
Out[2]:
The alphabet of a sequence defines what the characters mean
In [3]:
from Bio.Alphabet import IUPAC
# Who is IUPAC?
The sequence objects in BioPython, or Seq
objects, act a lot like Python strings.
In [4]:
s[0] # This returns a string
Out[4]:
In [5]:
s[2:4] # This returns sequence
Out[5]:
In [6]:
s.lower()
Out[6]:
In [7]:
s + s
Out[7]:
Another blast from the past. But now we can do this using Python.
DNA coding strand (aka Crick strand, strand +1)
5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’
|||||||||||||||||||||||||||||||||||||||
3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’
DNA template strand (aka Watson strand, strand −1)
|
Transcription
↓
5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’
Single stranded messenger RNA
|
Translation
↓
MAIVMGR*KGAR*
amino acid sequence (w/stop codons)
In [8]:
dna = Seq('GATTACAGATTACAGATTACA')
print(dna.complement(), dna.reverse_complement())
Seq
objects may largely behave like strings, but they have all sorts of additional methods that make them extremely useful and convenient for sequence analysis.
In [9]:
print(dna)
In [10]:
rna = dna.transcribe() # No need for a transcriptase!
print(rna)
In [11]:
protein = rna.translate()
print(protein)
In [12]:
print(dna.translate()) # Can go straight from the DNA to amino acid sequences.
Yes, plural. Did you think there was only one codon table?
By default the standard translation table is used, but others can be provided to the translate
method.
In [13]:
from Bio.Data import CodonTable
print(sorted(CodonTable.unambiguous_dna_by_name.keys()))
In [14]:
print(CodonTable.unambiguous_dna_by_name['Standard']) # What does this structure look like?
Sequence data is read/written as SeqRecord
objects.
These objects store additional information about the sequence (name, id, description, features).
SeqIO
reads sequence records:
read
to read a file with a single recordparse
to iterate over file with mulitple recordsPython file I/O!
In [15]:
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
seq = SeqIO.read('p53.gb','genbank')
seqs = []
# http://eds-uga.github.io/cbio4835-sp17/files/p53.gb
# http://eds-uga.github.io/cbio4835-sp17/files/hydra.fasta
for s in SeqIO.parse('hydra.fasta','fasta'):
seqs.append(s)
In [16]:
print(len(seqs))
In [17]:
print(seq)
These files are available through the course website and repository if you'd like to use them.
However, BioPython lets you fetch sequences automatically directly off the internet, through an interface to the NCBI "Entrez" search engine.
The results of these queries are returned as file-like objects.
Just import the Entrez
interface:
In [18]:
from Bio import Entrez
Entrez.email = "squinn@cs.uga.edu" # BioPython forces you to provide your email
res = Entrez.read(Entrez.einfo()) # The names of all available databases
print(res)
In [19]:
print(sorted(res['DbList']))
You can search any of the available databases for a given term, and it will return the IDs of all the relevant records.
This is done through the esearch
method in the Entrez
submodule.
In [20]:
result = Entrez.esearch(db = 'nucleotide', term = 'tp53')
# The result is a file-like object of the raw XML data
records = Entrez.read(result)
# This puts the object into a more palatable form (dictionary)
print(records)
In [21]:
print(records['IdList'])
print(len(records['IdList']))
In the previous slide, only 20 results were actually returned. However, there were considerably more hits than that--2,923 to be exact.
They weren't returned [immediately] because of the online nature of the query. Luckily, this is something we can tweak.
In [22]:
records = Entrez.read(Entrez.esearch(db = 'nucleotide', term = 'tp53', retmax = 50))
print(records)
print(len(records['IdList']))
To get the full record for a given ID, we'll use the efetch
function.
To use this function, we have to provide rettype
(available options include FASTA and gb).
We also have to specify retmode
, which can be plan text or XML.
In [23]:
# Fetch the genbank file for the 10th ID from our search
result = Entrez.efetch(db = "nucleotide", id = records['IdList'][10], rettype = "gb", retmode = 'text')
# Parse into a seqrecord
p53 = SeqIO.read(result,'gb')
In [24]:
print(p53)
Genbank files are typically annotated with a ton of additional features that refer to portions of the full sequence.
SeqRecord
objects track these features, allowing you to extract the corresponding subsequence.
In [25]:
p53.features
Out[25]:
(CDS is a coding sequence!)
You can then extract these subsequences like you would an element from a list:
In [26]:
cdsfeature = p53.features[2]
print(cdsfeature)
Using the feature, you can extract the subsequence from the original full sequence.
In [27]:
coding = cdsfeature.extract(p53) # Pass the full record (p53) to the feature
print(coding)
BioPython also allows you to use NCBI's BLAST to search for similar sequences with the qblast
function.
This function has three required arguments:
A couple of examples:
In [28]:
from Bio.Blast import NCBIWWW
result = NCBIWWW.qblast("blastn", "nt", coding.seq)
# The variable "result" is a file-like object with XML in it
# BE WARNED: it can take a while to get results!
# Note whether there is a NUMBER or a STAR next to your JupyterHub cell!
In [29]:
from Bio.Blast import NCBIXML # For parsing xmls
blast_records = NCBIXML.read(result)
In [30]:
print(len(blast_records.alignments), len(blast_records.descriptions))
In [31]:
alignment = blast_records.alignments[1]
print(len(alignment.hsps))
In [32]:
hsp = alignment.hsps[0] #high scoring segment pairs
print('****Alignment****')
print('sequence:', alignment.title)
print('length:', alignment.length)
print('e value:', hsp.expect)
print(hsp.query[0:75] + '...') # what we searched with
print(hsp.match[0:75] + '...')
print(hsp.sbjct[0:75] + '...') # what we matched to
Biopython can use lots of different alignment algorithms, but it includes a pre-packaged that works decently well.
In [33]:
from Bio import pairwise2
alignments = pairwise2.align.globalxx("ACCGT", "ACG")
This will return the list of alignments for the two sequences.
In [34]:
print(pairwise2.format_alignment(*alignments[0]))
print(pairwise2.format_alignment(*alignments[1]))
BioPython has a bunch of other alignment tools built-in, but they rely on external programs (some are HMM-based!).
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities—their phylogeny—based upon similarities and differences in their physical or genetic characteristics. --Wikipedia
Biopython can read a variety of tree formats: Newick (clustal), NEXUS, phyloXML, NeXML, and CDAO.
In [35]:
from Bio import Phylo
tree = Phylo.read('hydra179.dnd','newick') # Must specify format.
tree.rooted = True
tree
Out[35]:
Displaying the full phylogenetic trees can get a little interesting.
First, we can visualize them as pure ASCII, as the early internet would have intended:
In [36]:
Phylo.draw_ascii(tree)
With some plotting tools, however, we can make wonderful image visualizations of the tree.
In [37]:
%matplotlib inline
Phylo.draw(tree)
...maybe more "wonderful" than wonderful at first, given how much information there is.
Let's re-draw the previous tree, this time without tree labels, just to see how the branching looks.
In [38]:
Phylo.draw(tree, label_func = lambda x: None) # Don't worry about what this does
Now we're getting somewhere! We can clearly see some structure in the respective evolutionary offshoots.
In [39]:
from Bio import motifs # Lowercase for some reason...
m = motifs.create(["TACAA","CATGC","TACTA","CCCAA"])
Remember the "profile" from profile HMMs? It looked something like this:
In [40]:
print(m.counts)
We can visualize this information in different ways.
For instance, we can look at the consensus sequence, which is just the sequence formed by picking the nucleotide at each position that is most frequent.
In [41]:
print(m.counts)
print(m.consensus)
We can also generate logos!
In [42]:
m.weblogo('logo.png', stack_width = 'large')
In [43]:
from IPython.display import Image
Image(filename = 'logo.png')
Out[43]:
Remember this visualization from the last lecture? The relative heights of each letter are a measure of how frequently those nucleotides show up at those positions in the aligned sequence.
Biopython supports a number of motif formats: JASPAR, MEME, TRANSFAC.
These formats are associated with databases (JASPAR, TRANSFAC) and tools (MEME).
Of particular interest are sequence motifs for transcription factor binding.
In [44]:
f = open('MA0004.1.sites') # Unlike other parts of Biopython, can't just provide filename to open.
arnt = motifs.read(f,'sites') # JASPAR sites
print(arnt)
Now we can examine all sorts of properties associated with the motif.
First, what motifs are even in this file?
In [45]:
print(arnt.instances)
Next, what is the profile matrix of the motifs?
In [46]:
print(arnt.counts)
And the consensus sequence?
In [47]:
print(arnt.consensus)
We've touched on this idea in previous lectures--the idea that not all mismatches in sequence alignments are created equal--but BioPython has this concept built-in.
For one, we can normalize the counts to be probabilities instead, allowing you to compare multiple motifs:
In [48]:
print(arnt.counts.normalize())
An interesting question with normalized counts: what happens if the count is 0? Is the probability of seeing that nucleotide at that position also 0?
(Hint as to where this is going: how do you compute the probabilities of a bunch of statistically independent events? what happens if one of those events has a probability of 0?)
It probably doesn't have a probability of 0, but it may be close to 0. In machine learning, there's a concept known as "pseudocounting" in which a very small number of counts are given to variables whose probabilities we want to keep close to 0, but not exactly 0.
We can do that in BioPython, too:
In [49]:
print(arnt.counts.normalize(pseudocounts = 0.8))
Finally, we can use BioPython to search for motifs in larger sequences.
In [50]:
from Bio import SeqIO
from Bio.Seq import Seq
largeseq = SeqIO.read('bnip3.fasta', 'fasta', arnt.alphabet).seq # Load with same alphabet as motif
smallseq = Seq('AAACCCACGTGACTATATA')
We can search for exact matches:
In [51]:
exact_matches = arnt.instances.search(smallseq)
for match in exact_matches:
position = match[0]
sequence = match[1]
print(position, sequence)
In [52]:
exact_matches = arnt.instances.search(largeseq)
for match in exact_matches:
position = match[0]
sequence = match[1]
print(position, sequence)
We can also use a PSSM to search for "fuzzy" matches--instances where we don't find an exact match of the thing we're looking for, but rather something that's "close" (for some definition of "close").
Basically, this makes use of the counts matrix we saw before.
If we're looking for a specific motif that isn't found exactly, we can specify a count threshold at which mismatches in certain positions are allowed to count toward search hits.
Here's an example:
In [53]:
# First, normalize to probabilities and use pseudocounts.
pwm = arnt.counts.normalize(pseudocounts = 0.8)
# Next, take the log of all the numbers.
pssm = pwm.log_odds()
# Now, run a search for the motif, using the scoring matrix and a threshold.
results = []
threshold = 4
fuzzy_matches = pssm.search(largeseq, threshold)
for match in fuzzy_matches:
results.append(match)
print(len(results), "fuzzy matches found with threshold of", threshold)
The score is a $log_2$ likelihood, so a score of 4 is $2^4=16$ times more likely to occur as part of the motif than as part of the (assumed uniform random) background.