Source of the materials: Biopython cookbook (adapted) Status: Draft

Consider looking at the Substitution Matrices example in [Chapter 19 - Cookbook](19 - Cookbook - Cool things to do with it.ipynb#Substitution-Matrices)

Advanced

Parser Design

Substitution Matrices

FreqTable

Parser Design

Many of the older Biopython parsers were built around an event-oriented design that includes Scanner and Consumer objects.

Scanners take input from a data source and analyze it line by line, emitting an event whenever they recognize some information in the data. For example, if the data includes information about an organism name, the scanner may generate an organism_name event whenever it encounters a line containing the name.

Consumers are objects that receive the events generated by Scanners. Following the previous example, the consumer receives the organism_name event and then processes it in whatever manner the current application requires.

This is a very flexible framework, which is advantageous if you want to be able to parse a file format into more than one representation. For example, the Bio.GenBank module uses this to construct either SeqRecord objects or file-format-specific record objects.
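The Scanner/Consumer split can be illustrated with a minimal pure-Python sketch. The class names, the `ORGANISM`/`SEQ` line prefixes and the `sequence` event are hypothetical and chosen only for illustration; they are not taken from any Biopython parser.

```python
class Scanner:
    """Reads lines and emits events to a consumer (hypothetical sketch)."""

    def feed(self, handle, consumer):
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("ORGANISM"):
                # Emit an organism_name event with the text after the keyword.
                consumer.organism_name(line[len("ORGANISM"):].strip())
            elif line.startswith("SEQ"):
                # Emit a sequence event with the text after the keyword.
                consumer.sequence(line[len("SEQ"):].strip())


class DictConsumer:
    """Collects every event into a plain dictionary."""

    def __init__(self):
        self.record = {}

    def organism_name(self, name):
        self.record["organism"] = name

    def sequence(self, seq):
        self.record["sequence"] = seq


class NameListConsumer:
    """Keeps only organism names and ignores every other event."""

    def __init__(self):
        self.names = []

    def organism_name(self, name):
        self.names.append(name)

    def sequence(self, seq):
        pass


lines = ["ORGANISM  Homo sapiens", "SEQ  ACGT"]

dict_consumer = DictConsumer()
Scanner().feed(lines, dict_consumer)

name_consumer = NameListConsumer()
Scanner().feed(lines, name_consumer)
```

The same scanner drives both consumers, which is how one file format can be parsed into more than one representation.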

More recently, many of the parsers added for Bio.SeqIO and Bio.AlignIO take a much simpler approach, but only generate a single object representation (SeqRecord and MultipleSeqAlignment objects respectively). In some cases the Bio.SeqIO parsers actually wrap another Biopython parser - for example, the Bio.SwissProt parser produces SwissProt format specific record objects, which get converted into SeqRecord objects.

Substitution Matrices

SubsMat

This module provides a class and a few routines for generating substitution matrices, similar to BLOSUM or PAM matrices, but based on user-provided data. Additionally, you may select a matrix from MatrixInfo.py, a collection of established substitution matrices. The SeqMat class derives from a dictionary:

class SeqMat(dict)

The dictionary is of the form {(i1,j1):n1, (i1,j2):n2,...,(ik,jk):nk} where i, j are alphabet letters, and n is a value.

  1. Attributes

    1. self.alphabet: a class as defined in Bio.Alphabet

    2. self.ab_list: a list of the alphabet’s letters, sorted. Needed mainly for internal purposes

  2. Methods

__init__(self,data=None,alphabet=None, mat_name='', build_later=0):
    1.  `data`: can be either a dictionary, or another
        SeqMat instance.

    2.  `alphabet`: a Bio.Alphabet instance. If not provided,
        construct an alphabet from data.

    3.  `mat_name`: matrix name, such as “BLOSUM62” or “PAM250”

    4.  `build_later`: default false. If true, the user may supply
        only an alphabet and an empty dictionary, intending to build
        the matrix later. This skips the sanity check of alphabet
        size vs. matrix size.
entropy(self,obs_freq_mat)
    1.  `obs_freq_mat`: an observed frequency matrix. Returns the
        matrix’s entropy, based on the frequency in `obs_freq_mat`.
        The matrix instance should be LO or SUBS.
sum(self)
    Calculates the sum of values for each letter in the matrix’s
    alphabet, and returns it as a dictionary of the form
    `{i1: s1, i2: s2,...,in:sn}`, where:

    -   i: an alphabet letter;

    -   s: sum of all values in a half-matrix for that letter;

    -   n: number of letters in alphabet.
print_mat(self,f,format="%4d",bottomformat="%4s",alphabet=None)
    Prints the matrix to file handle f. `format` is the format field
    for the matrix values; `bottomformat` is the format field for
    the bottom row, containing matrix letters. Example output for a
    3-letter alphabet matrix:
A 23
B 12 34
C 7  22  27
  A   B   C
    The `alphabet` optional argument is a string of all characters
    in the alphabet. If supplied, the order of letters along the
    axes is taken from the string, rather than by
    alphabetical order.
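As a rough illustration of the layout that `print_mat` describes, the following pure-Python sketch writes a half matrix as a lower triangle with a bottom row of letters. The function name `print_half_matrix` and its key-lookup logic are assumptions made for this example, not Biopython's implementation. With the `%4d` field used here, the columns come out wider than in the example output above.

```python
import io


def print_half_matrix(mat, handle, fmt="%4d", bottomfmt="%4s", alphabet=None):
    """Write a half matrix as a lower triangle with a bottom row of letters."""
    if alphabet is None:
        # Default to alphabetical order of the letters found in the keys.
        alphabet = "".join(sorted({letter for pair in mat for letter in pair}))
    for row, i in enumerate(alphabet):
        cells = []
        for j in alphabet[: row + 1]:
            # Accept either key orientation for the same pair.
            value = mat[(j, i)] if (j, i) in mat else mat[(i, j)]
            cells.append(fmt % value)
        handle.write(i + "".join(cells) + "\n")
    handle.write(" " + "".join(bottomfmt % letter for letter in alphabet) + "\n")


half = {("A", "A"): 23, ("A", "B"): 12, ("B", "B"): 34,
        ("A", "C"): 7, ("B", "C"): 22, ("C", "C"): 27}
out = io.StringIO()
print_half_matrix(half, out)
print(out.getvalue(), end="")
```

Passing `alphabet="CBA"` would reverse the order of the rows and columns, mirroring the optional `alphabet` argument described above.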

  3. Usage

    This section follows the order in which a log-odds matrix is usually built. The interim matrices can also be generated and inspected on their own, although most users only need the final log-odds matrix.

    1. Generating an Accepted Replacement Matrix

      Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in ARM are the counted number of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be:

('A','C'): 10, ('C','A'): 12
    Because the order doesn't matter, you can instead provide a
    single entry:
('A','C'): 22
    A SeqMat instance may be initialized with either a full matrix
    (the first method of counting: 10, 12) or a half matrix (the
    latter method: 22). A full protein alphabet matrix has
    20x20 = 400 entries. A half matrix of that alphabet has
    20x20/2 + 20/2 = 210 entries, because the same-letter entries
    (the matrix diagonal) have no mirrored counterpart and are all
    kept. Given an alphabet size of N:

    1.  Full matrix size: N\*N

    2.  Half matrix size: N(N+1)/2

    The SeqMat constructor automatically generates a half-matrix, if
    a full matrix is passed. If a half matrix is passed, letters in
    the key should be provided in alphabetical order: ('A','C') and
    not ('C','A').
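The folding of a full matrix into a half matrix, and the N(N+1)/2 size rule, can be sketched in plain Python as follows. The helper names `to_half_matrix` and `half_matrix_size` are hypothetical; this is an illustration of the idea, not the SeqMat constructor itself.

```python
def to_half_matrix(full):
    """Fold a full replacement-count matrix into a half matrix."""
    half = {}
    for (i, j), count in full.items():
        # Store each pair under its alphabetically ordered key.
        key = (i, j) if i <= j else (j, i)
        half[key] = half.get(key, 0) + count
    return half


def half_matrix_size(n):
    """Number of entries in a half matrix over an n-letter alphabet."""
    return n * (n + 1) // 2


full = {("A", "C"): 10, ("C", "A"): 12, ("A", "A"): 5, ("C", "C"): 8}
half = to_half_matrix(full)
print(half[("A", "C")])      # 22, the two off-diagonal counts combined
print(half_matrix_size(20))  # 210 for a 20-letter protein alphabet
```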

    At this point, if all you wish to do is generate a log-odds
    matrix, please go to the section titled Example of Use. The
    following text describes the nitty-gritty of internal functions,
    to be used by people who wish to investigate their
    nucleotide/amino-acid frequency data more thoroughly.

2.  Generating the observed frequency matrix (OFM)

    Use:
OFM = SubsMat._build_obs_freq_mat(ARM)
    The OFM is generated from the ARM, only instead of replacement
    counts, it contains replacement frequencies.
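A minimal sketch of the count-to-frequency step, assuming the observed frequency of each pair is simply its count divided by the total of all counts in the half matrix. The function name `observed_frequencies` and the sample counts are made up for illustration.

```python
def observed_frequencies(arm):
    """Turn replacement counts into replacement frequencies."""
    total = float(sum(arm.values()))
    return {pair: count / total for pair, count in arm.items()}


# Hypothetical half-matrix ARM over a 2-letter alphabet.
arm = {("A", "A"): 50, ("A", "C"): 22, ("C", "C"): 28}
ofm = observed_frequencies(arm)
print(ofm)
```

The resulting frequencies sum to 1 over the half matrix.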

3.  Generating an expected frequency matrix (EFM)

    Use:
EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
    1.  `exp_freq_table`: should be a FreqTable instance. See
        the FreqTable section below for detailed information
        on FreqTable. Briefly, the expected frequency table has the
        frequencies of appearance for each member of the alphabet.
        It is implemented as a dictionary with the alphabet letters
        as keys, and each letter's frequency as a value. Values sum
        to 1.

    The expected frequency table can (and generally should) be
    generated from the observed frequency matrix. So in most cases
    you will generate `exp_freq_table` using:
from Bio import SubsMat

# ARM: the accepted replacement matrix built in step 1
OFM = SubsMat._build_obs_freq_mat(ARM)
# Derive the per-letter expected frequencies from the observed matrix
exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
EFM = SubsMat._build_exp_freq_mat(OFM, exp_freq_table)
    You can also supply your own `exp_freq_table` if you wish.
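The relationship between the observed matrix, the expected frequency table and the expected matrix can be sketched in plain Python. This sketch assumes one common convention for half matrices: an off-diagonal observed frequency is split evenly between its two letters, and an off-diagonal expected frequency is twice the product of the two letter frequencies. The function names are hypothetical, and the convention is an assumption rather than a statement about Biopython's internals.

```python
def letter_frequencies(ofm):
    """Per-letter frequencies derived from a half observed-frequency matrix."""
    table = {}
    for (i, j), freq in ofm.items():
        if i == j:
            table[i] = table.get(i, 0.0) + freq
        else:
            # Assumption: an off-diagonal entry contributes half to each letter.
            table[i] = table.get(i, 0.0) + freq / 2.0
            table[j] = table.get(j, 0.0) + freq / 2.0
    return table


def expected_frequencies(ofm, table):
    """Expected pair frequencies under independence of the two letters."""
    efm = {}
    for (i, j) in ofm:
        if i == j:
            efm[(i, j)] = table[i] ** 2
        else:
            # Assumption: the half-matrix entry covers both (i, j) and (j, i).
            efm[(i, j)] = 2.0 * table[i] * table[j]
    return efm


ofm = {("A", "A"): 0.5, ("A", "C"): 0.22, ("C", "C"): 0.28}
table = letter_frequencies(ofm)
efm = expected_frequencies(ofm, table)
print(table)
print(efm)
```

Under this convention both the letter frequencies and the expected pair frequencies sum to 1.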

4.  Generating a substitution frequency matrix (SFM)

    Use:
SFM = SubsMat._build_subs_mat(OFM,EFM)
    Accepts an OFM and an EFM. Returns a matrix in which each entry
    is the OFM value divided by the corresponding EFM value.
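The entry-by-entry division can be sketched as below. The function name `substitution_frequencies` and the sample values are hypothetical, and the key check is an added safeguard for the example rather than documented behaviour.

```python
def substitution_frequencies(ofm, efm):
    """Divide each observed frequency by the matching expected frequency."""
    if ofm.keys() != efm.keys():
        raise ValueError("OFM and EFM must cover the same letter pairs")
    return {pair: ofm[pair] / efm[pair] for pair in ofm}


ofm = {("A", "A"): 0.5, ("A", "C"): 0.22}
efm = {("A", "A"): 0.25, ("A", "C"): 0.44}
sfm = substitution_frequencies(ofm, efm)
print(sfm)
```

A ratio above 1 means the pair is observed more often than expected by chance; a ratio below 1 means it is observed less often.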

5.  Generating a log-odds matrix (LOM)

    Use:
LOM=SubsMat._build_log_odds_mat(SFM[,logbase=10,factor=10.0,round_digit=1])
    1.  Accepts an SFM.

    2.  `logbase`: base of the logarithm used to generate the
        log-odds values.

    3.  `factor`: factor used to multiply the log-odds values. Each
        entry is generated by `log(SFM[key]) * factor`, with the
        logarithm taken in base `logbase`, and rounded to
        the `round_digit` place after the decimal point,
        if required.
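The log-odds transformation described above can be sketched with the standard library alone. The function name `log_odds` is hypothetical; the defaults simply mirror the signature shown in this step.

```python
import math


def log_odds(sfm, logbase=10, factor=10.0, round_digit=1):
    """Apply factor * log_base(ratio) to every entry and round the result."""
    lom = {}
    for pair, ratio in sfm.items():
        # Ratios must be positive; math.log raises ValueError otherwise.
        lom[pair] = round(factor * math.log(ratio, logbase), round_digit)
    return lom


sfm = {("A", "A"): 2.0, ("A", "C"): 0.5}
lom = log_odds(sfm)
print(lom[("A", "A")])  # 3.0
print(lom[("A", "C")])  # -3.0
```

Pairs seen more often than expected get positive scores and pairs seen less often get negative scores.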

  4. Example of use

    Because most users simply want a log-odds matrix with minimal effort, SubsMat provides one function that does it all:

make_log_odds_matrix(acc_rep_mat,exp_freq_table=None,logbase=10,
                          factor=10.0,round_digit=0):
1.  `acc_rep_mat`: user provided accepted replacements matrix

2.  `exp_freq_table`: expected frequencies table. Used if provided,
    if not, generated from the `acc_rep_mat`.

3.  `logbase`: base of logarithm for the log-odds matrix. Default
    base 10.

4.  `factor`: factor used to multiply the log-odds values. Default
    10.0.

5.  `round_digit`: number of digits after the decimal point to which
    the result should be rounded. Default zero.
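To make the whole chain concrete, here is a compact pure-Python sketch that goes from replacement counts to log-odds scores in one function. It is not Biopython's `make_log_odds_matrix`: the name `log_odds_from_counts`, the sample counts and the half-matrix conventions (off-diagonal frequencies split evenly between letters, off-diagonal expectations doubled) are all assumptions made for illustration.

```python
import math


def log_odds_from_counts(acc_rep_mat, exp_freq_table=None,
                         logbase=10, factor=10.0, round_digit=0):
    """Counts -> observed frequencies -> expected frequencies -> log-odds."""
    total = float(sum(acc_rep_mat.values()))
    obs = {pair: n / total for pair, n in acc_rep_mat.items()}

    if exp_freq_table is None:
        # Each entry gives half its frequency to each of its two letters;
        # a diagonal entry therefore gives its full frequency to one letter.
        exp_freq_table = {}
        for (i, j), freq in obs.items():
            for letter in (i, j):
                exp_freq_table[letter] = exp_freq_table.get(letter, 0.0) + freq / 2.0

    lom = {}
    for (i, j), freq in obs.items():
        expected = exp_freq_table[i] * exp_freq_table[j]
        if i != j:
            expected *= 2.0  # half-matrix entry stands for (i, j) and (j, i)
        lom[(i, j)] = round(factor * math.log(freq / expected, logbase), round_digit)
    return lom


# Hypothetical half-matrix counts over a 2-letter alphabet.
arm = {("A", "A"): 50, ("A", "C"): 22, ("C", "C"): 28}
lom = log_odds_from_counts(arm)
print(lom)
```

With these counts the identical pairs score positive and the A/C replacement scores negative, since it occurs less often than the letter frequencies alone would predict.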

FreqTable

FreqTable.FreqTable(UserDict.UserDict)
  1. Attributes:

    1. alphabet: A Bio.Alphabet instance.

    2. data: frequency dictionary

    3. count: count dictionary (in case counts are provided).

  2. Functions:

    1. read_count(f): read a count file from stream f, then convert the counts to frequencies.

    2. read_freq(f): read a frequency data file from stream f. Of course, we then don’t have the counts, but it is usually the letter frequencies which are interesting.

  3. Example of use: The expected count of the residues in the database is stored in a whitespace-delimited file, in the following format (example given for a 3-letter alphabet):

A   35
B   65
C   100
This file is read using the
`FreqTable.read_count(file_handle)` function.

An equivalent frequency file:
A  0.175
B  0.325
C  0.5
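How a count file in this format turns into frequencies can be sketched with the standard library. The function `parse_count_stream` is a hypothetical stand-in written for this example; it is not the `FreqTable.read_count` implementation, and it returns plain dictionaries rather than a FreqTable instance.

```python
import io


def parse_count_stream(handle):
    """Read 'letter count' lines and return (counts, frequencies)."""
    counts = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[0]] = int(fields[1])
    total = float(sum(counts.values()))
    freqs = {letter: n / total for letter, n in counts.items()}
    return counts, freqs


# An in-memory stream stands in for an open count file.
handle = io.StringIO("A   35\nB   65\nC   100\n")
counts, freqs = parse_count_stream(handle)
print(freqs)
```

The frequencies obtained this way match the equivalent frequency file shown above.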
Alternatively, the residue frequencies or counts can be passed as
a dictionary. Example of a count dictionary (3-letter alphabet):
{'A': 35, 'B': 65, 'C': 100}
This means that the expected counts give a frequency of 0.5
for 'C', 0.325 for 'B' and 0.175 for 'A'
(out of a total of 200, the sum of A, B and C).

A frequency dictionary for the same data would be:
{'A': 0.175, 'B': 0.325, 'C': 0.5}
The values sum to 1.

When passing a dictionary as an argument, you should indicate
whether it is a count or a frequency dictionary. Therefore the
FreqTable class constructor requires two arguments: the dictionary
itself, and FreqTable.COUNT or FreqTable.FREQ indicating counts or
frequencies, respectively.
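The role of the second constructor argument can be illustrated with a small stand-in class. `SimpleFreqTable`, the numeric values of `COUNT` and `FREQ`, and the sum-to-one check are all assumptions made for this sketch; they are not the Biopython FreqTable class.

```python
COUNT = 1  # hypothetical flag: the dictionary holds raw counts
FREQ = 2   # hypothetical flag: the dictionary holds frequencies


class SimpleFreqTable(dict):
    """Frequency dictionary built from either counts or frequencies."""

    def __init__(self, in_dict, dict_type):
        super().__init__()
        self.count = {}
        if dict_type == COUNT:
            # Keep the counts and derive the frequencies from them.
            self.count = dict(in_dict)
            total = float(sum(in_dict.values()))
            self.update({k: v / total for k, v in in_dict.items()})
        elif dict_type == FREQ:
            if abs(sum(in_dict.values()) - 1.0) > 1e-6:
                raise ValueError("frequencies must sum to 1")
            self.update(in_dict)
        else:
            raise ValueError("dict_type must be COUNT or FREQ")


from_counts = SimpleFreqTable({"A": 35, "B": 65, "C": 100}, COUNT)
from_freqs = SimpleFreqTable({"A": 0.175, "B": 0.325, "C": 0.5}, FREQ)
print(from_counts == from_freqs)
```

Without the flag, a dictionary such as `{'A': 1, 'B': 0}` would be ambiguous, which is why the type has to be stated explicitly.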

Any one of the following may be done to generate the frequency
table (ftab). When expected counts are read, `read_count` also
generates the frequencies:
from Bio.SubsMat import FreqTable

ftab = FreqTable.FreqTable(my_frequency_dictionary, FreqTable.FREQ)
ftab = FreqTable.FreqTable(my_count_dictionary, FreqTable.COUNT)
ftab = FreqTable.read_count(open('myCountFile'))
ftab = FreqTable.read_freq(open('myFrequencyFile'))
