Module pairwise alignment in Biopython uses dynamic programming algorithm.
Match scores and gap penalties should be specified for any alignment. Compatible elements (not neccessarily the same character) should be given higher score. Gap or incompatibles should be given lower or negative scores, signifying the mismatch, though in some case 0 is used.
Bio.pairwise2 contains essentially the same algorithms as water
for local alignment
and needle
for global alignment
used in EMBOSS.
In [ ]:
from Bio import pairwise2
## load the module
## globalxx
## use global alignment function which only score 1
## for each match (0 for both penalty and mismatch)
alignments = pairwise2.align.globalxx("ACCGT", "ACG")
## perform global alignments (xx) between two sequences.
for alignment in alignments:
## Each alignment is a tuple consisting of the two aligned sequences,
## the score, the start and the end positions of the alignment
## (in global alignments the start is always 0 and the end the length of the alignment).
print(alignment)
## print the alignment in a nicer format
from Bio.pairwise2 import format_alignment
print(format_alignment(*alignment))
print(repr(alignment))
You need to specify the match parameters
and gap penalty parameters
to control the scoring output. globalxx
basically sets only match score = 1 and gap penalty score = 0. Setting scoring parameters is easy using the list below.
x
No parameters. Identical characters have score of 1, otherwise 0.m
A match score is the score of identical chars, otherwise mismatch score.d
A dictionary returns the score of any pair of characters.c
A callback function returns scores.x
No gap penalties.s
Same open and extend gap penalties for both sequences.d
The sequences have different open and extend gap penalties.c
A callback function returns the gap penalties.
In [ ]:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
## m match score = 2, mismatch = 0
## x no gap penalty
for alignment in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1):
print(format_alignment(*alignment))
## score = 6, since only matching scores
## match score = 2, mismatch = -1
## gap opening = 0.5, gap extension = 0.1
for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1):
print(format_alignment(*a))
## score = 5, 2*3 (matchings) + 2*-0.5 (gap opening)
For proteins, scoring only identical amino acids on both sequences as match is biologically incorrect. To score alignments correctly, you need to know what are the compatible amino acids. This can be done by using substitution matrix. The table is stored as dictionary and can be directly given to parameter d
in match parameters. The alignment scores are directly influenced by selected matrix.
More details on types, names and scores are described on this following page. https://biopython.org/DIST/docs/api/Bio.SubsMat.MatrixInfo-module.html
In [ ]:
from Bio.SubsMat import MatrixInfo as matlist
print(matlist.available_matrices)
## print list of available matrices
matrix = matlist.blosum62
## set the substitution matrix to be used
## BLOSUM62 is more stringent than BLOSUM45,
## thus the alignment score is lower
for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix):
print(format_alignment(*a))
## change to use BLOSUM45 for distantly related sequences
## this allows less identical sequence to score above threshold
for a in pairwise2.align.globaldx("KEVLA", "EVL", matlist.blosum45):
print(format_alignment(*a))
In [ ]:
from Bio import pairwise2
from Bio import SeqIO
## use SeqIO to read in input files
seq1 = SeqIO.read("alpha.faa", "fasta")
seq2 = SeqIO.read("beta.faa", "fasta")
alignments = pairwise2.align.globalds(seq1.seq, seq2.seq)
## global alignment
## d : A dictionary of the score of any pair of characters.
## s : Same open and extend gap penalties for both sequences.
## should return error since not enough parameters specified
## have to specify substitution matrix dictionary (1 parameter)
## also 2 parameters for penalties of gap openning and extension
In [ ]:
from Bio.SubsMat.MatrixInfo import blosum62
alignments = pairwise2.align.globalds(seq1.seq, seq2.seq, blosum62, -10, -1)
## supply enough parameters supplied so the code runs
print(format_alignment(*alignments[0]))
## have a look at the first alignment
## compare with the globalxx below
alignments = pairwise2.align.localxx(seq1.seq, seq2.seq)
print(format_alignment(*alignments[1]))
## globalxx give different scores from global ds
## there can be many alignments, just randomly pick the second one here
In [ ]:
alignments = pairwise2.align.localxx(seq1.seq, seq2.seq, one_alignment_only=True)
## will return only the best alignment
## this takes shorter time
print(format_alignment(*alignments[0]))
alignments = pairwise2.align.localxx(seq1.seq, seq2.seq, score_only=True)
## speed gain
## only print the score
print('score of alignment', alignments)
The implementation or calculation for both pairwise alignments and multiple sequence alignments can be slow. It is thus recommended to use better optimized alignment programs. Unfortunately, the algorithms are not implemented in Biopython directly. So accessing these tools is done by running programs inside python via the command-line wrapper provided by Biopython.
With Biopython, this only takes 4 steps;
In [ ]:
import Bio.Align.Applications
dir(Bio.Align.Applications)
## available tools with command line wrappers
In [ ]:
import os
from Bio.Align.Applications import ClustalwCommandline
## help(ClustalwCommandline)
## print help
clustalw_exe = '/usr/bin/clustalw'
## for Windows user, change this to your installed
## e.g. r"C:\Program Files\new clustal\clustalw2.exe"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile="new_opuntia.fasta")
## clustalw_cline = ClustalwCommandline("clustalw2", infile="opuntia.fasta")
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
stdout, stderr = clustalw_cline()
In [ ]:
print(stdout)
In [ ]:
print(stderr)
# if there is no error, it should be empty string
In [ ]:
from Bio import AlignIO
align = AlignIO.read("new_opuntia.aln", "clustal")
## specify the format of alignment file
print(align)
In [ ]:
from Bio import Phylo
tree = Phylo.read("new_opuntia.dnd", "newick")
Phylo.draw_ascii(tree)
## the result from clustal allows the tree view
In [ ]:
from Bio.Align.Applications import MuscleCommandline
help(MuscleCommandline)
In [ ]:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.clw", clw=True)
stdout, stderr = cline()
print(stderr)
## alignment has only one sequence in fasta format
## error if try to open in clustal format
AlignIO.read(open("opuntia.clw"), "clustal")
In [ ]:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.txt")
## default format for MUSCLE is fasta
stdout, stderr = cline()
## alignment has only one sequence,
## no error if read in fasta format
print(AlignIO.read(open("opuntia.txt"), "fasta"))
In [ ]:
from Bio.Align.Applications import MuscleCommandline
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
muscle_cline = MuscleCommandline(input="new_opuntia.fasta")
## run command line without writing the output file
## default format of MUSCLE is fasta
stdout, stderr = muscle_cline()
## the result is in stdout
try:
align = AlignIO.read(StringIO(stderr), "fasta")
## check if alignment is fine
except:
align = AlignIO.read(StringIO(stdout), "fasta")
print(align)
In [ ]:
from Bio.Emboss.Applications import NeedleCommandline
## for ubuntu system, giving command
## without specifying the installed path of EMBOSS seems to work
## both gapopen and gapextend need to be set
needle_cline = NeedleCommandline(asequence="alpha.faa", bsequence="beta.faa",
gapopen=10, gapextend=0.5, outfile="needle.txt")
print(needle_cline)
# how the command line looks like
stdout, stderr = needle_cline()
# run the program and save result in stdout & stderr
print(stdout)
print(stderr)
In [ ]:
from Bio.Emboss.Applications import NeedleCommandline
# specify the location where EMBOSS program is installed
needle_cline = NeedleCommandline(r"C:\EMBOSS\needle.exe",
asequence="alpha.faa", bsequence="beta.faa",
gapopen=10, gapextend=0.5, outfile="needle.txt")
needle_cline()
# result in error if the path is incorrect
In [ ]:
from Bio.Emboss.Applications import WaterCommandline
needle_cline = WaterCommandline()
## provide the file name for each sequence
needle_cline.asequence="alpha.faa"
needle_cline.bsequence="beta.faa"
## specify the gap open and gap extend cost
needle_cline.gapopen=10
needle_cline.gapextend=0.5
## save output file
needle_cline.outfile="needle.txt"
## how the command line look like
print(needle_cline)
## run the program and combine output with error
stdout, stderr = needle_cline(stdout=True, stderr=True)
print(stdout + stderr)
from Bio import AlignIO
## use alignio to parse the result written in EMBOSS format
align = AlignIO.read("needle.txt", "emboss")
## get alignment length
print(align.get_alignment_length())
In [ ]:
from Bio import AlignIO
## use parse here even though there is only one alignment
alignment = AlignIO.parse(open("PF18225_seed.sth"), "stockholm")
## AlignIO allows you to access information of each sequence
## similar to SeqIO
for i, align in enumerate(alignment):
print('alignment length', align.get_alignment_length())
print('')
for seqi in align:
print(seqi.seq)
print(seqi.name)
print(seqi.dbxrefs)
print(seqi.annotations)
print(seqi.description)
In [ ]:
from Bio import AlignIO
## open input file in read mode
input_handle = open("PF18225_seed.sth", "r")
## open output file in write mode
output_handle = open("PF18225_seed.phy", "w")
## uses parse here if there is more than one alignment
## parse will also work if there is only one alignment
alignments = AlignIO.parse(input_handle, "stockholm")
## write out the alignment in phylip format
AlignIO.write(alignments, output_handle, "phylip")
## close both file handles
output_handle.close()
input_handle.close()
Instead of opening input file in read-mode and output file in write-mode, convert function in AlignIO can be called directly to change the alignment file type.
Arguments:
in_file
- an input handle or filenamein_format
- input file format, lower case stringoutput
- an output handle or filenameout_file
- output file format, lower case stringalphabet
- optional alphabet to assume, default=NoneThe formats allowed for conversion are listed in the link. https://biopython.org/DIST/docs/api/Bio.AlignIO-module.html#convert
In [ ]:
from Bio import AlignIO
## original phylip format limits sequence id to be only 10 characters
AlignIO.convert("PF18225_seed.sth", "stockholm", "PF18225_seed_strict.phy", "phylip")
## relaxed phylip allows longer names to be written
AlignIO.convert("PF18225_seed.sth", "stockholm", "PF18225_seed_relaxed.phy", "phylip-relaxed")
## it returns the number of the alignments
It goes through the sequence residue by residue and count the number of each type of residue (ie. A, G, T and C for DNA) in all sequences in the alignment. If the percentage of the most common residue type is greater then the specified threshold, that residue will be added to the consensus sequence, otherwise an ambiguous character will be added.
Arguments: (taken from https://biopython.org/DIST/docs/api/Bio.Align.AlignInfo.SummaryInfo-class.html#dumb_consensus)
threshold
- The threshold value that is required to decide whether to add a particular atom.ambiguous
- The ambiguous character to be added when the threshold is not reached.consensus_alpha
- The alphabet to return for the consensus sequence. If this is None, then we will try to guess the alphabet.require_multiple
- If set as 1, this will require that more than 1 sequence be part of an alignment to put it in the consensus (ie. not just 1 sequence and gaps).
In [ ]:
from Bio import AlignIO
align = AlignIO.read("PF18225_full.fasta", "fasta")
## specify the format of alignment file
print(align)
from Bio.Align import AlignInfo
from Bio.Align.AlignInfo import SummaryInfo
## load AlignInfo modules and SummaryInfo submodules
summary = SummaryInfo(align)
## create summary object
print('')
dumb_7 = summary.dumb_consensus()
## create consensus sequence with the default threshold of 0.7
dumb_4 = summary.dumb_consensus(threshold=0.4)
## create consensus sequence with lower (specified) threshold of 0.4
## compare the results from difference threshold
print('default threshold = 0.7', str(dumb_7))
print('consensus threshold = 0.4', str(dumb_4))
In [ ]:
from Bio import AlignIO
align = AlignIO.read("PF18225_full.fasta", "fasta")
## specify the format of alignment file
print(align)
from Bio.Align import AlignInfo
from Bio.Align.AlignInfo import SummaryInfo
## load AlignInfo modules and SummaryInfo submodules
summary = SummaryInfo(align)
## create summary object
## create sumb consensus sequence
dumb_cons = summary.dumb_consensus()
print('dumb oncensus', str(dumb_cons))
print('')
## create gap consensus sequence
gap_cons = summary.gap_consensus()
print('gap oncensus', str(gap_cons))
## compare the difference between dumb and gap consensus sequences
## when using the default threshold (0.7)
summary.gap_consensus() == summary.dumb_consensus()
In [ ]:
from Bio import AlignIO
align = AlignIO.read("new_opuntia.aln", "clustal")
## load our MSA alignment file
print(align)
print('')
from Bio.Align import AlignInfo
from Bio.Align.AlignInfo import SummaryInfo
summary = SummaryInfo(align)
## create summary of the alignments
## check frequency of nucleotide at position 7
print(summary._get_letter_freqs(residue_num=7, all_records=align,
letters=['A', 'C', 'G', 'T'], to_ignore=['N', '-']))
print('')
## check frequency throughout the sequence length
## by going through each position at a time
for i in range(align.get_alignment_length()):
freq_dict = summary._get_letter_freqs(residue_num=i, all_records=align,
letters=['A', 'C', 'G', 'T'], to_ignore=['N', '-'])
print(i, freq_dict)
In [ ]:
## consensus sequence
consensus = summary.dumb_consensus()
## using consensus sequence
print(summary.pos_specific_score_matrix(axis_seq=consensus, chars_to_ignore=['N', '-']))
In [ ]:
## first sequence in the alignment
print(summary.pos_specific_score_matrix(axis_seq=align[0]))#, chars_to_ignore=['N', '-']))