Sequence Alignment

Pairwise alignment

Module pairwise alignment in Biopython uses dynamic programming algorithm.

  • a global alignment finds the best alignment of all characters between 2 sequences
  • a local alignment finds the subsequences that align best between 2 sequences

Match scores and gap penalties should be specified for any alignment. Compatible elements (not neccessarily the same character) should be given higher score. Gap or incompatibles should be given lower or negative scores, signifying the mismatch, though in some case 0 is used.

Bio.pairwise2 contains essentially the same algorithms as water for local alignment and needle for global alignment used in EMBOSS.


In [ ]:
from Bio import pairwise2 
## load the module

## globalxx
## use global alignment function which only score 1 
## for each match (0 for both penalty and mismatch)

alignments = pairwise2.align.globalxx("ACCGT", "ACG")
## perform global alignments (xx) between two sequences. 


for alignment in alignments:
    ## Each alignment is a tuple consisting of the two aligned sequences, 
    ## the score, the start and the end positions of the alignment 
    ## (in global alignments the start is always 0 and the end the length of the alignment). 
    print(alignment)
    
    ## print the alignment in a nicer format
    from Bio.pairwise2 import format_alignment
    print(format_alignment(*alignment))
    print(repr(alignment))

CODE DESCRIPTION

You need to specify the match parameters and gap penalty parameters to control the scoring output. globalxx basically sets only match score = 1 and gap penalty score = 0. Setting scoring parameters is easy using the list below.

match parameters
  • x No parameters. Identical characters have score of 1, otherwise 0.
  • m A match score is the score of identical chars, otherwise mismatch score.
  • d A dictionary returns the score of any pair of characters.
  • c A callback function returns scores.
gap penalty parameters
  • x No gap penalties.
  • s Same open and extend gap penalties for both sequences.
  • d The sequences have different open and extend gap penalties.
  • c A callback function returns the gap penalties.

In [ ]:
from Bio import pairwise2 
from Bio.pairwise2 import format_alignment


## m match score = 2, mismatch = 0
## x no gap penalty
for alignment in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1):
    print(format_alignment(*alignment))
    ## score = 6, since only matching scores

    
## match score = 2, mismatch = -1
## gap opening = 0.5, gap extension = 0.1
for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1):
    print(format_alignment(*a))
    ## score = 5, 2*3 (matchings) + 2*-0.5 (gap opening)

Substitution matrices

For proteins, scoring only identical amino acids on both sequences as match is biologically incorrect. To score alignments correctly, you need to know what are the compatible amino acids. This can be done by using substitution matrix. The table is stored as dictionary and can be directly given to parameter d in match parameters. The alignment scores are directly influenced by selected matrix.

More details on types, names and scores are described on this following page. https://biopython.org/DIST/docs/api/Bio.SubsMat.MatrixInfo-module.html


In [ ]:
from Bio.SubsMat import MatrixInfo as matlist

print(matlist.available_matrices)
## print list of available matrices

matrix = matlist.blosum62
## set the substitution matrix to be used


## BLOSUM62 is more stringent than BLOSUM45, 
## thus the alignment score is lower
for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix):
    print(format_alignment(*a))
    
    
## change to use BLOSUM45 for distantly related sequences
## this allows less identical sequence to score above threshold
for a in pairwise2.align.globaldx("KEVLA", "EVL", matlist.blosum45):
    print(format_alignment(*a))

In [ ]:
from Bio import pairwise2
from Bio import SeqIO


## use SeqIO to read in input files
seq1 = SeqIO.read("alpha.faa", "fasta")
seq2 = SeqIO.read("beta.faa", "fasta")


alignments = pairwise2.align.globalds(seq1.seq, seq2.seq)
## global alignment 
## d : A dictionary of the score of any pair of characters. 
## s : Same open and extend gap penalties for both sequences.
## should return error since not enough parameters specified 
## have to specify substitution matrix dictionary (1 parameter)
## also 2 parameters for penalties of gap openning and extension

In [ ]:
from Bio.SubsMat.MatrixInfo import blosum62

alignments = pairwise2.align.globalds(seq1.seq, seq2.seq, blosum62, -10, -1)
## supply enough parameters supplied so the code runs
print(format_alignment(*alignments[0]))
## have a look at the first alignment
## compare with the globalxx below

alignments = pairwise2.align.localxx(seq1.seq, seq2.seq)
print(format_alignment(*alignments[1]))
## globalxx give different scores from global ds 
## there can be many alignments, just randomly pick the second one here

In [ ]:
alignments = pairwise2.align.localxx(seq1.seq, seq2.seq, one_alignment_only=True)
## will return only the best alignment
## this takes shorter time
print(format_alignment(*alignments[0]))


alignments = pairwise2.align.localxx(seq1.seq, seq2.seq, score_only=True)
##  speed gain
## only print the score
print('score of alignment', alignments)

Alignment tools for multiple sequence alignment

The implementation or calculation for both pairwise alignments and multiple sequence alignments can be slow. It is thus recommended to use better optimized alignment programs. Unfortunately, the algorithms are not implemented in Biopython directly. So accessing these tools is done by running programs inside python via the command-line wrapper provided by Biopython.

With Biopython, this only takes 4 steps;

  1. Install the tools you want to use, e.g. MUSCLE, EMBOSS or CLUSTALW.
  2. Prepare an input file of your unaligned sequences in FASTA format.
  3. Call the corresponding command line wrapper, different command for each tool, to process this input file.
  4. Read or process the output from the tool, i.e. your aligned sequences.

In [ ]:
import Bio.Align.Applications
dir(Bio.Align.Applications)
## available tools with command line wrappers

ClustalW

The wrapper uses subprocess module to run another program inside python. The program prints text on the screen which is piped via standard output and standard error. The input file is standard input.


In [ ]:
import os

from Bio.Align.Applications import ClustalwCommandline
## help(ClustalwCommandline)
## print help 

clustalw_exe = '/usr/bin/clustalw' 
## for Windows user, change this to your installed
## e.g. r"C:\Program Files\new clustal\clustalw2.exe"
    
clustalw_cline = ClustalwCommandline(clustalw_exe, infile="new_opuntia.fasta")
## clustalw_cline = ClustalwCommandline("clustalw2", infile="opuntia.fasta")

assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
stdout, stderr = clustalw_cline()

In [ ]:
print(stdout)

In [ ]:
print(stderr)
# if there is no error, it should be empty string

In [ ]:
from Bio import AlignIO

align = AlignIO.read("new_opuntia.aln", "clustal")
## specify the format of alignment file

print(align)

In [ ]:
from Bio import Phylo
tree = Phylo.read("new_opuntia.dnd", "newick")
Phylo.draw_ascii(tree)
## the result from clustal allows the tree view

MUSCLE

Input file format is fasta and it saves output file in either fasta or clustal format which are compatible with Biopython, using AlignIO for reading or parsing.

It can also output in GCG MSF or HTML format (not covered) as it is not supported by Biopython in parsing it.


In [ ]:
from Bio.Align.Applications import MuscleCommandline
help(MuscleCommandline)

In [ ]:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.clw", clw=True)
stdout, stderr = cline()

print(stderr)


## alignment has only one sequence in fasta format
## error if try to open in clustal format
AlignIO.read(open("opuntia.clw"), "clustal")

In [ ]:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.txt")
## default format for MUSCLE is fasta
stdout, stderr = cline()

## alignment has only one sequence, 
## no error if read in fasta format
print(AlignIO.read(open("opuntia.txt"), "fasta"))

In [ ]:
from Bio.Align.Applications import MuscleCommandline

try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
    
    
muscle_cline = MuscleCommandline(input="new_opuntia.fasta")
## run command line without writing the output file
## default format of MUSCLE is fasta

stdout, stderr = muscle_cline()
## the result is in stdout


try:
    align = AlignIO.read(StringIO(stderr), "fasta")
    ## check if alignment is fine
except:
    align = AlignIO.read(StringIO(stdout), "fasta")
    
print(align)

EMBOSS program

The program includes algorithms for both local (Smith-Waterman) and global (Needleman-Wunch) alignments.


In [ ]:
from Bio.Emboss.Applications import NeedleCommandline

## for ubuntu system, giving command 
## without specifying the installed path of EMBOSS seems to work
## both gapopen and gapextend need to be set 
needle_cline = NeedleCommandline(asequence="alpha.faa", bsequence="beta.faa", 
                                 gapopen=10, gapextend=0.5, outfile="needle.txt")


print(needle_cline)
# how the command line looks like

stdout, stderr = needle_cline()
# run the program and save result in stdout & stderr

print(stdout)
print(stderr)

In [ ]:
from Bio.Emboss.Applications import NeedleCommandline

# specify the location where EMBOSS program is installed
needle_cline = NeedleCommandline(r"C:\EMBOSS\needle.exe",
                                 asequence="alpha.faa", bsequence="beta.faa",
                                 gapopen=10, gapextend=0.5, outfile="needle.txt")
needle_cline()
# result in error if the path is incorrect

In [ ]:
from Bio.Emboss.Applications import WaterCommandline

needle_cline = WaterCommandline()

## provide the file name for each sequence 
needle_cline.asequence="alpha.faa"
needle_cline.bsequence="beta.faa"

## specify the gap open and gap extend cost
needle_cline.gapopen=10
needle_cline.gapextend=0.5

## save output file
needle_cline.outfile="needle.txt"

## how the command line look like
print(needle_cline)

## run the program and combine output with error
stdout, stderr = needle_cline(stdout=True, stderr=True)
print(stdout + stderr)



from Bio import AlignIO
## use alignio to parse the result written in EMBOSS format
align = AlignIO.read("needle.txt", "emboss")

## get alignment length
print(align.get_alignment_length())

Parse the result of alignments

AlignIO module is used for read and write sequence alignment. The functionality of the module is quite similar to SeqIO.


In [ ]:
from Bio import AlignIO

## use parse here even though there is only one alignment
alignment = AlignIO.parse(open("PF18225_seed.sth"), "stockholm")


## AlignIO allows you to access information of each sequence
## similar to SeqIO
for i, align in enumerate(alignment):
    print('alignment length', align.get_alignment_length())
    print('')
    for seqi in align:
        print(seqi.seq)
        print(seqi.name)
        print(seqi.dbxrefs)
        print(seqi.annotations)
        print(seqi.description)

Change file format

This is very simple using AlignIO.


In [ ]:
from Bio import AlignIO

## open input file in read mode
input_handle = open("PF18225_seed.sth", "r")

## open output file in write mode 
output_handle = open("PF18225_seed.phy", "w")

## uses parse here if there is more than one alignment
## parse will also work if there is only one alignment
alignments = AlignIO.parse(input_handle, "stockholm")


## write out the alignment in phylip format
AlignIO.write(alignments, output_handle, "phylip")

## close both file handles
output_handle.close()
input_handle.close()

Change file format (alternative)

Instead of opening input file in read-mode and output file in write-mode, convert function in AlignIO can be called directly to change the alignment file type.

Arguments:

  • in_file - an input handle or filename
  • in_format - input file format, lower case string
  • output - an output handle or filename
  • out_file - output file format, lower case string
  • alphabet - optional alphabet to assume, default=None

The formats allowed for conversion are listed in the link. https://biopython.org/DIST/docs/api/Bio.AlignIO-module.html#convert


In [ ]:
from Bio import AlignIO

## original phylip format limits sequence id to be only 10 characters
AlignIO.convert("PF18225_seed.sth", "stockholm", "PF18225_seed_strict.phy", "phylip")

## relaxed phylip allows longer names to be written
AlignIO.convert("PF18225_seed.sth", "stockholm", "PF18225_seed_relaxed.phy", "phylip-relaxed")

## it returns the number of the alignments

SummaryInfo

dumb_consensus

It goes through the sequence residue by residue and count the number of each type of residue (ie. A, G, T and C for DNA) in all sequences in the alignment. If the percentage of the most common residue type is greater then the specified threshold, that residue will be added to the consensus sequence, otherwise an ambiguous character will be added.

Arguments: (taken from https://biopython.org/DIST/docs/api/Bio.Align.AlignInfo.SummaryInfo-class.html#dumb_consensus)

  • threshold - The threshold value that is required to decide whether to add a particular atom.
  • ambiguous - The ambiguous character to be added when the threshold is not reached.
  • consensus_alpha - The alphabet to return for the consensus sequence. If this is None, then we will try to guess the alphabet.
  • require_multiple - If set as 1, this will require that more than 1 sequence be part of an alignment to put it in the consensus (ie. not just 1 sequence and gaps).

In [ ]:
from Bio import AlignIO

align = AlignIO.read("PF18225_full.fasta", "fasta")
## specify the format of alignment file

print(align)

from Bio.Align import AlignInfo
from Bio.Align.AlignInfo import SummaryInfo
## load AlignInfo modules and SummaryInfo submodules

summary = SummaryInfo(align)
## create summary object

print('')

dumb_7 = summary.dumb_consensus()
## create consensus sequence with the default threshold of 0.7
dumb_4 = summary.dumb_consensus(threshold=0.4)
## create consensus sequence with lower (specified) threshold of 0.4


## compare the results from difference threshold
print('default threshold = 0.7', str(dumb_7))
print('consensus threshold = 0.4', str(dumb_4))

gap_consensus

The method is similar to dumb_consensus, but allows gaps on the output.


In [ ]:
from Bio import AlignIO

align = AlignIO.read("PF18225_full.fasta", "fasta")
## specify the format of alignment file

print(align)

from Bio.Align import AlignInfo
from Bio.Align.AlignInfo import SummaryInfo
## load AlignInfo modules and SummaryInfo submodules

summary = SummaryInfo(align)
## create summary object

## create sumb consensus sequence
dumb_cons = summary.dumb_consensus()
print('dumb oncensus', str(dumb_cons))

print('')

## create gap consensus sequence
gap_cons = summary.gap_consensus()
print('gap oncensus', str(gap_cons))

## compare the difference between dumb and gap consensus sequences
## when using the default threshold (0.7)
summary.gap_consensus() == summary.dumb_consensus()

Letter frequency

Another way to look at the alignment and consensus sequence is to look at the letter frequency, the count of each letter at certain position.


In [ ]:
from Bio import AlignIO

align = AlignIO.read("new_opuntia.aln", "clustal")
## load our MSA alignment file

print(align)
print('')

from Bio.Align import AlignInfo
from Bio.Align.AlignInfo import SummaryInfo

summary = SummaryInfo(align)
## create summary of the alignments

## check frequency of nucleotide at position 7
print(summary._get_letter_freqs(residue_num=7, all_records=align, 
                          letters=['A', 'C', 'G', 'T'], to_ignore=['N', '-']))

print('')
## check frequency throughout the sequence length
## by going through each position at a time
for i in range(align.get_alignment_length()):
    freq_dict = summary._get_letter_freqs(residue_num=i, all_records=align, 
                                    letters=['A', 'C', 'G', 'T'], to_ignore=['N', '-'])
    print(i, freq_dict)

Position specific score matrices (PSSMs)

PSSM is a count matrix. For each column in the alignment, it displays the sum of each character. The input sequence can be either the consensus sequence or any sequence in the alignment.


In [ ]:
## consensus sequence

consensus = summary.dumb_consensus()

## using consensus sequence
print(summary.pos_specific_score_matrix(axis_seq=consensus, chars_to_ignore=['N', '-']))

In [ ]:
## first sequence in the alignment
print(summary.pos_specific_score_matrix(axis_seq=align[0]))#, chars_to_ignore=['N', '-']))