Structural Bioinformatic Algorithms

About the course

The development of algorithms in structural bioinformatics requires knowledge of biology, physics, chemistry, mathematics and computing. Concepts from linear algebra, numerical methods and programming are usually absent from life-sciences curricula. This course is intended as an introduction to those topics, motivated by the implementation of common/classical Structural Bioinformatics algorithms.

The course material consists of IPython notebooks, and the course is mainly hands-on. I hope this format will help students engage with the topics and, most importantly, continue studying on their own.

During the course we will visit several classic algorithms in bioinformatics. Each algorithm will be explained in conceptual terms, followed by its implementation using the Python programming language. Familiarity with this programming language is a prerequisite for the course. A good introduction to Python can be found here. Prior knowledge of scientific libraries like NumPy, SciPy, Matplotlib, etc. is not required; these libraries will be presented during the course.

Computing in science

Historically, science has been divided into experimental and theoretical knowledge (or empiricism and rationalism). Consequently, a great number of studies in the philosophy of science have been dedicated to the inter-relationship of these two modes of doing science. During the last few decades computing has emerged as a very important part of science, and by doing so it has disrupted this two-sided view of science.

Computer simulation has emerged as a third scientific practice: while it is related to theoretical knowledge, it is not just theory, as it involves elements from more than one theory or elements that are not part of any theory (like approximations, domain-knowledge intuition and even fictions); at the same time it also shares many characteristics with experiments. After all, simulations are generally used when experimental or observational data about a phenomenon are sparse.

Computers have also enabled the storage, processing and analysis of large amounts of data, in what could be regarded in general terms as computational observation (although people prefer names like data science).

Nowadays an increasing number of papers include experiments along with numerical calculations, simulations or computer modeling, and almost every field of science now has a computational branch.

Reproducible research

Reproducible Research is the idea that scientific papers should be published not only with the results and methods, but also with the data (raw and processed) and the code used for analysis and/or simulation/modeling. This facilitates the analysis and verification of results, and also facilitates further progress.

Scientific data and analyses are increasingly complex, so simple descriptions in the style of a materials and methods section become increasingly difficult to write and to understand, making them unsuitable. Reducing the barriers for third parties to check the data would, first, reduce fraud; it would also contribute to the validation of data and published results; and finally it would facilitate the construction of new ideas and analyses from data and results already published.

It is important to note that in much of the literature the terms replication and reproducibility are used interchangeably; to add to the confusion, some authors do distinguish between the terms, but what some call replication others call reproducibility.

Speaking specifically about the computing disciplines, we can say that:

  • Replication: The author of a scientific article that involves numerical calculations should be able to run the simulations and analyses repeatedly, always getting the same (or equivalent) results. Another scientist should also be able to perform the same calculations and get the same results, given the information provided in the publication.

  • Reproducibility: Using an independent implementation of the method, or a totally different method altogether, the same results as in the original publication are obtained.

There are still no well-established guidelines for how the source code and the generated (and raw) data should be handled. For example, it is relatively rare for the source code used in simulations to be made available to readers along with the published article. Furthermore, it is common for the source code to be regarded as a competitive advantage and therefore not made public.

However, this view has begun to be questioned, and some scientific publishers and individual scientists have called for greater transparency in the computational sciences. Some journals have begun to require authors to make the source code and/or raw data used in publications available upon request.

It is also under discussion how to facilitate the distribution of scientific software as supplementary material.

Bottom line: a solid scientific study must be reproducible and replicable.

To achieve these objectives, it is necessary to:

  • Keep track of exactly which source code, and which version of it, was used to produce the data and figures in published papers.

  • Record information about the environment and which versions of external software were used.

  • Make sure that old codes and notes are backed up and kept for future reference.

  • Be ready to give additional information about the methods used, and perhaps also the simulation codes, to an interested reader who requests it (even years after the paper was published!).

  • Ideally, code should be published online (e.g. on GitHub [https://github.com]) to make it easier for other scientists to access it.

Python in science

Python is a modern, general-purpose, object-oriented, high-level programming language.

  • Simple language: Easy to learn, write, read and maintain.

  • It is open-source and free!

  • Documentation tightly integrated with the code.

  • Good performance (through specialized libraries) due to close integration with time-tested and highly optimized code written in C and Fortran:

    • BLAS, ATLAS, LAPACK, ARPACK, Intel MKL, ...
  • Large community of users; easy to find help and documentation, e.g. on StackOverflow.

  • Python already has a strong position in scientific computing

  • Extensive ecosystem of scientific libraries and environments:

    • NumPy: Numerical Python
    • SciPy: Scientific Python
    • Matplotlib: Scientific plots
    • Seaborn: Attractive statistical graphics with a simple syntax (based on Matplotlib)
    • IPython: Interactive computing
    • Pandas: Data analysis
    • PyMC: Bayesian statistics
    • Scikit-learn: Machine learning
    • Statsmodels: Statistics (complements SciPy and has a more classical flavor than Scikit-learn)
    • Insert your favorite package here

IPython

IPython is a command shell for interactive computing. It was originally developed for the Python programming language (hence the name), but it now supports multiple programming languages, including for example Julia. Furthermore, IPython makes it easy to combine Python code with Cython, R, Octave, Bash, Perl or Ruby. Recently the Jupyter Project has been proposed as the next step in the development of IPython.

IPython notebook

The IPython notebook is what you are reading right now: an HTML/JavaScript-based environment. An IPython notebook is a JSON document containing an ordered list of input/output cells which can contain code, text, mathematical formulas, plots and rich media. It is based on the IPython shell and provides an interactive environment for coding and for documenting all calculations, facilitating reproducible research.

Some of the most important features of IPython notebooks are:

  • Tab completion
  • Support for Markdown
  • Support for Mathjax
  • Magic functions
  • Enhanced introspection
  • Support for rich media
  • Access to the system shell

Tab completion

Tab completion works for Python commands (including imported libraries), variable names, objects' attributes and file/directory names. If there is more than one option, a list of possible completions is offered.


In [0]:
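
For example (a minimal sketch; any imported module will do), run the following cell and then type the incomplete name and press <TAB>:

In [ ]:
import numpy
# now type  numpy.li  and press <TAB>:
# IPython offers completions such as numpy.linalg and numpy.linspace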

Support for Markdown

Markdown is a popular markup language that is a superset of HTML. If you pay attention to the toolbar you will see that some of the cells are marked as code and others as markdown (like this one). Just double-click on any of the markdown cells and you will see that they are written using Markdown.

Mathjax

MathJax is a JavaScript implementation of the $\LaTeX$ math syntax that allows equations to be embedded into HTML.

$$ \int_{a}^{b} f(x)\, dx \approx \frac{1}{2} \sum_{k=1}^{N} \left( x_{k} - x_{k-1} \right) \left( f(x_{k}) + f(x_{k-1}) \right) $$

Magic functions

IPython has a set of predefined magic functions that you can call with a command line style syntax.

There are two types of magics:

  • Line magics: commands prepended by one % character, whose arguments only extend to the end of the current line.
  • Cell magics: called using %%, they receive as arguments both the current line where they are declared and the whole cell. Cell magics can only be used as the first line in a cell, and as a general principle you can only use one cell magic per cell.

As an example, the %lsmagic magic lists all the available magics.


In [3]:
%lsmagic


Out[3]:
Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %install_default_config  %install_ext  %install_profiles  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%latex  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

Sometimes it is necessary to measure the execution time of a piece of code; for that, the timeit magic can be used. This magic exists in both line and cell versions.


In [5]:
%timeit range(0, 100)


1000000 loops, best of 3: 720 ns per loop

As already mentioned, IPython allows mixing Python code with other languages, like Bash:


In [6]:
%%bash
for i in {3..1}; do
    echo $i
done
echo "hello from $BASH"


3
2
1
hello from /bin/bash

In [7]:
# Provides a quick reference to IPython
%quickref

Enhanced introspection

If you need information about any command, you can get it by adding a ? after the name of the command. If you need even more information, including access to the source code, you can try adding ??. The command can be a function from the Python standard library, an IPython magic, or a command from any imported library.


In [9]:
range?

In [10]:
%timeit??

In [11]:
?

Support for rich media

IPython uses the web browser as a GUI, hence it can easily display images, videos, etc.

One way to include media files in a notebook is to use pure HTML; another way is to use IPython's rich display system, as follows.


In [13]:
from IPython.display import Image, YouTubeVideo

In [14]:
Image(filename='img/logo.png')


Out[14]:

In [10]:
YouTubeVideo('zJM4EBuL82o')


Out[10]:

Access to the system shell

It is possible to have access to the system shell by prepending an exclamation point ! to a shell command.


In [17]:
!ls -lh


total 2,6M
-rw------- 1 osvaldo osvaldo  98K feb 16 12:34 00_Introduction.ipynb
-rw------- 1 osvaldo osvaldo  55K feb 14 15:56 01_NumPy_introduction.ipynb
-rw------- 1 osvaldo osvaldo 186K feb 14 15:56 02_Primer_on_Linear_Algebra.ipynb
-rw------- 1 osvaldo osvaldo 216K feb 14 15:56 03_Distances_angles_and_geometry.ipynb
-rw------- 1 osvaldo osvaldo 122K feb 14 15:56 04_Comparing_structures.ipynb
-rw------- 1 osvaldo osvaldo 140K feb 14 15:56 05_Sequence_alignment_and_clustering.ipynb
-rw------- 1 osvaldo osvaldo 115K feb 14 15:56 06_Protein_and_RNA_structure_from_sequence.ipynb
-rw------- 1 osvaldo osvaldo 123K feb 15 21:40 07_Local_and_global_optimization_methods.ipynb
-rw------- 1 osvaldo osvaldo 990K feb 14 15:56 1D3Z_full.pdb
-rw------- 1 osvaldo osvaldo  98K feb 14 15:56 1D3Z.pdb
-rw------- 1 osvaldo osvaldo  78K feb 14 15:56 1UBQ.pdb
-rw------- 1 osvaldo osvaldo 1,8K feb 14 15:56 ala_blocked2.pdb
-rw------- 1 osvaldo osvaldo 1,8K feb 14 15:56 ala_blocked_min.pdb
-rw------- 1 osvaldo osvaldo 1,8K feb 14 15:56 ala_blocked.pdb
drwxrwxr-x 2 osvaldo osvaldo 4,0K feb 14 17:35 bk
-rw------- 1 osvaldo osvaldo 1,1K feb 14 15:56 easy_clust.txt
drwx------ 2 osvaldo osvaldo 4,0K feb 14 15:56 img
-rw------- 1 osvaldo osvaldo  129 feb 14 15:56 Notes
-rw------- 1 osvaldo osvaldo 4,0K feb 14 15:56 params2.py
-rw------- 1 osvaldo osvaldo 343K feb 14 15:56 params.py
drwxrwxr-x 2 osvaldo osvaldo 4,0K feb 14 17:35 PDFs
-rw------- 1 osvaldo osvaldo   45 feb 14 15:56 random_matrix.csv
-rw------- 1 osvaldo osvaldo  152 feb 14 15:56 random_matrix.npy
-rw------- 1 osvaldo osvaldo  489 feb 14 15:56 sample.dat
-rw------- 1 osvaldo osvaldo 7,3K feb 14 15:56 test_ff.py
-rw------- 1 osvaldo osvaldo  16K feb 14 15:56 Untitled0.ipynb

In [18]:
!pwd


/home/osvaldo/Documentos/Proyectos/0_Bio_struc/SBioA/English

In [19]:
! head 1D3Z.pdb


ATOM      1  N   MET A   1      52.923 -90.016   8.509  1.00  9.67           N  
ATOM      2  CA  MET A   1      51.653 -89.304   8.833  1.00 10.38           C  
ATOM      3  C   MET A   1      50.851 -89.086   7.556  1.00  9.62           C  
ATOM      4  O   MET A   1      51.414 -89.033   6.462  1.00  9.62           O  
ATOM      5  CB  MET A   1      51.976 -87.958   9.485  1.00 13.77           C  
ATOM      6  CG  MET A   1      52.864 -87.131   8.557  1.00 16.29           C  
ATOM      7  SD  MET A   1      53.355 -85.606   9.403  1.00 17.17           S  
ATOM      8  CE  MET A   1      53.359 -84.524   7.954  1.00 16.11           C  
ATOM      9  H1  MET A   1      52.721 -91.019   8.328  1.00  0.00           H  
ATOM     10  H2  MET A   1      53.581 -89.933   9.311  1.00  0.00           H  

Algorithms

An algorithm is a precise and finite sequence of steps designed to solve a problem. According to this definition, everyday things like recipes are algorithms. While this is true, it is important to note that it is possible to formally describe and study algorithms, and that is one of the jobs of computer science.

Many books on algorithms use pseudocode to describe them; pseudocode is useful because it provides a rigorous language to describe algorithms while omitting details that are necessary to write working code in some specific programming language. In this course we will not use pseudocode; we will use Python. After all, this is not an algorithms book, and as some people say: Python is executable pseudocode!

How to write an algorithm

Probably most problems can be solved by following five steps (easier said than done!):

  1. Identify a problem and formulate it in a precise way.
  2. Draft a solution/algorithm.
  3. Implement the solution/algorithm.
  4. Evaluate the results. This generally involves answering the questions: Does it work correctly? And how long does it take?
  5. Iterate from 2 through 4 until a correct (or acceptable) answer is obtained. Sometimes it will be necessary to go back to 1 and redefine the problem.

Regarding point 4, sometimes it is also worth asking yourself whether the code is clear and well documented; while this will not necessarily affect the results of the computation, it will help others to understand your code (and maybe to find bugs or improve it). Often, that other person will be you in the future, so try to be clean and clear (comments are good, and good comments are even better!). A toy walk-through of the cycle is sketched below.
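
As a toy illustration of steps 1 to 4 (a hypothetical example, not part of the course material): the problem is to compute the Euclidean distance between two atoms given their coordinates; we draft the usual formula, implement it, and evaluate it on a case with a known answer.

In [ ]:
import math

def distance(a, b):
    """Euclidean distance between two 3D points (step 3: implementation)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Step 4: evaluate on a case with a known answer (a 3-4-5 right triangle)
print(distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))  # should print 5.0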

And remember: when in doubt, be Zen!


In [20]:
import this


The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Code Optimization

Spending time on optimization is, in general, only a good idea if we really (really) need fast computations. A good rule of thumb is to always start with a simple function that is easy to understand and write, test the function to be sure it works as expected, and then, only if the function is too slow for our purposes, invest time in optimization.

Most of the algorithms we will discuss in this course are implemented in a way that is easy to understand; for most of them there is probably room for optimization.
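
As a minimal sketch of this workflow (an invented example, not part of the original material): first a plain Python version of a function that is easy to read and verify, then a NumPy version to reach for only if speed becomes a problem.

In [ ]:
import numpy as np

def ssd_python(xs, ys):
    """Sum of squared differences; plain, easy to verify."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += (x - y) ** 2
    return total

def ssd_numpy(xs, ys):
    """Vectorized version: same result, much faster on large arrays."""
    return np.sum((np.asarray(xs) - np.asarray(ys)) ** 2)

a, b = np.random.rand(1000), np.random.rand(1000)
print(np.allclose(ssd_python(a, b), ssd_numpy(a, b)))  # check correctness first
%timeit ssd_python(a, b)
%timeit ssd_numpy(a, b)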

Big-O Notation

Algorithms are supposed to be platform-independent, i.e. they can be implemented in any programming language on an arbitrary computer running whatever operating system you prefer. Hence, how do you measure and compare, for example, the speed of algorithms if the speed depends on external factors?

One useful and practical solution is to measure the running time in terms of the input size; this can be done using the Big-O notation. For example, if the running time of an algorithm is $\mathcal{O}(n^2)$, it means that the running time grows quadratically with the input size: doubling the input roughly quadruples the running time. Notice that this notation does not indicate the exact increase in time, because all constants have been removed (you don't know if the function is $42n^2$ or $1000n^2$).

Additionally, the Big-O notation describes an upper bound (worst-case scenario). Following the above example, the only thing you know is that the running time does not grow faster than $n^2$. As an example, the sorting algorithm Bubblesort has complexity $\mathcal{O}(n^2)$ (there are faster sorting algorithms), but if the input is already sorted the running time is $\mathcal{O}(n)$. Since we generally want to sort unsorted items, it is a better idea to report the upper bound.
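
To make the growth concrete, here is a small timing experiment (invented for illustration): a function that visits every pair of points is $\mathcal{O}(n^2)$, so doubling $n$ roughly quadruples its running time.

In [ ]:
import time

def all_pairs(points):
    """O(n**2): compares every point against every other point."""
    total = 0
    for x in points:
        for y in points:
            total += x * y
    return total

for n in (200, 400, 800):
    data = list(range(n))
    start = time.time()
    all_pairs(data)
    print("n = %4d   time = %.4f s" % (n, time.time() - start))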

P vs NP

The P versus NP problem is a major unsolved problem in computer science. Informally, it asks whether every problem whose solution can be efficiently verified by a computer can also be efficiently solved by a computer.

In this context, efficiently means in polynomial time ($\mathcal{O}(n^k)$, $k$ being a natural number), and problems with polynomial-time solutions are known as the class $P$, or simply $P$.

For some problems there is no known way to find an answer in polynomial time, but if we are given a solution, it may be possible to verify in polynomial time that the solution is correct. This class of problems is called $NP$.

So $P = NP$ means that for every problem that has an efficiently verifiable solution, we can also find that solution efficiently.

A third important category of problems is called $NP$-hard: problems that are at least as hard as the hardest problems in $NP$. Not all $NP$-hard problems are elements of $NP$.

If a problem is $NP$-hard and in $NP$, then it belongs to the class called $NP$-complete. These are the very hardest $NP$ problems, and no one has found a polynomial-time algorithm that solves any of them. The most famous $NP$-complete problem is the Traveling Salesman problem. An efficient solution to any $NP$-complete problem would imply $P = NP$ and an efficient solution to every $NP$-complete problem. Notice that, since $NP$-hard problems are not necessarily in $NP$, $P = NP$ does not necessarily imply that $NP$-hard problems can be solved in polynomial time.

Whether $P = NP$ or $P \neq NP$ is a problem that goes beyond theoretical academic issues. As science (and other aspects of our lives) relies more and more on computers, the $P$ versus $NP$ question has become important for a lot of disciplines.

Some $NP$-complete problems in the bioinformatics field:

  • Protein folding (at least when formulated using a 2D or 3D HP model)
  • Protein design
  • Finding optimal protein threading procedures
  • Multiple sequence alignment problem
  • Assembling a collection of DNA fragments into a single sequence

It is worth noting that proving that a problem is $NP$-hard, and not just superficially difficult, is not a trivial task. In fact, for a lot of bioinformatics problems we don't know if they are $NP$-hard.

What would it imply if $P = NP$? Well, from a practical point of view, nothing, unless we also find an efficient algorithm for solving $NP$-complete problems. In that case cryptography could become impossible (and with it secure Internet transactions). On the bright side, all the $NP$-complete optimization problems would become tractable; a lot of bioinformatics problems would be solved much more efficiently, and so would a lot of problems in our daily lives, from the optimization of transportation to the design of new materials and weather and climate prediction.

Unfortunately (or fortunately, that depends!) it is widely believed that $P \neq NP$.

So, if a lot of interesting and useful problems are $NP$-complete (or we suspect they are), what can be done to solve them? Should we wait (and hope) for someone to prove $P = NP$? Well, no. First of all, the theory behind $NP$-completeness describes how algorithms perform in the worst-case scenario; the instances of the problems that we find in practice may be easier to solve. Also, for a lot of problems we can use approximate solutions, or we can use previous knowledge and various heuristics to attack $NP$-complete problems.

Here I have very briefly discussed the subject of $P$ vs $NP$. But it is important to keep in mind that this is a really difficult problem, and even computer scientists and mathematicians who have studied it for years think they do not fully understand it. So, if you really want to understand this subject, it would be wise to keep reading about it.

Algorithm Design Techniques

Different algorithms applied to very different problems often show recurrent patterns. It seems that there are only a few useful basic techniques for solving problems. Here I follow the classification made by Jones and Pevzner in their book An Introduction to Bioinformatics Algorithms, but with examples adapted to Structural Bioinformatics.

Brute force

This type of algorithm is the simplest one and consists of examining all the possibilities. If you want to use brute force to find the lowest energy conformation of a molecule (like in protein folding), you list every possible conformation, compute its energy and choose as the solution the one with the lowest energy. This method guarantees that you will find the correct solution (i.e. the lowest energy conformation); of course, you will need a very fast computer (or a longer-lasting universe), as the Levinthal paradox warns us.
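
A minimal sketch of the idea, with a made-up one-dimensional energy function standing in for a real force field: enumerate every conformation on a grid of dihedral angles and keep the one with the lowest energy. Note how the number of conformations grows exponentially with the number of angles.

In [ ]:
import itertools
import math

def energy(angles):
    """Toy energy function; a real application would use a force field."""
    return sum(math.cos(3 * a) + 0.1 * a ** 2 for a in angles)

grid = [i * math.pi / 6 for i in range(-6, 7)]  # 13 values per dihedral
# exhaustively examine all 13**3 = 2197 conformations of 3 dihedrals
best = min(itertools.product(grid, repeat=3), key=energy)
print(best, energy(best))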

Pruning (or Branch-and-Bound) Algorithms

This type of algorithm is a modification of the brute force one, and it is based on the observation that for many problems it is possible to rule out wrong solutions even without computing them.

Instead of enumerating all possible conformations of a molecule, we eliminate those that are unrealistic. For example, the moment we get atoms that overlap, we discard that conformation (and, implicitly, all conformations that derive from it). This can dramatically reduce the search space.
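
A sketch of such a pruned enumeration (toy code; the clash test is hypothetical): conformations are built one angle at a time, and a partial conformation that already "clashes" is discarded together with the whole branch of conformations deriving from it.

In [ ]:
def clashes(partial):
    """Hypothetical early test; here two consecutive equal angles
    stand in for an atomic overlap."""
    return len(partial) > 1 and partial[-1] == partial[-2]

def enumerate_pruned(grid, n, partial=()):
    if clashes(partial):
        return                      # prune this whole branch
    if len(partial) == n:
        yield partial
        return
    for angle in grid:
        for conf in enumerate_pruned(grid, n, partial + (angle,)):
            yield conf

grid = [0, 60, 120, 180, 240, 300]
print(len(list(enumerate_pruned(grid, 3))))  # 150 instead of 6**3 = 216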

Greedy Algorithms

Algorithms, like persons, can be greedy, but unlike in persons, greediness in algorithms can sometimes be a good thing. In our quest for the lowest energy conformation, we start from a conformation and measure its energy, then perturb that conformation and measure the energy again. We accept the new conformation only if its energy is lower than the previous one. Of course, one problem with this approach is that each solution is sensitive to the previously accepted one. Nevertheless, for some problems a greedy algorithm provides the optimal solution, or at least a very good (and fast) approximation.
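
A sketch of this greedy search on a made-up one-dimensional energy landscape: perturb the current conformation at random and accept the move only when it lowers the energy. Note that the run typically ends in a nearby local minimum, not necessarily the global one.

In [ ]:
import math
import random

def energy(x):
    """Toy energy landscape with several local minima."""
    return math.cos(3 * x) + 0.1 * x ** 2

x = 4.0                                        # starting conformation
for _ in range(10000):
    candidate = x + random.uniform(-0.1, 0.1)  # perturb the conformation
    if energy(candidate) < energy(x):          # greedy: accept only if lower
        x = candidate
print(x, energy(x))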

Divide-and-Conquer Algorithms

These algorithms rely on the observation that sometimes a hard problem is easier if we first divide it into subproblems, solve the subproblems individually, and then combine the solutions of the subproblems into a solution of the original problem. A divide-and-conquer approach to protein folding could be to recursively fold small regions of the protein first (e.g. secondary structure elements) and then try to obtain the 3D structure by combining the folded small regions.
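
The textbook instance of this pattern (not specific to structural bioinformatics) is merge sort: split the list in half, sort each half recursively, then combine the two sorted halves by merging them.

In [ ]:
def merge_sort(items):
    if len(items) <= 1:                 # base case: trivially sorted
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])      # divide: solve each half
    right = merge_sort(items[mid:])
    merged, i, j = [], 0, 0             # combine: merge the sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 8, 1, 9, 3]))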

Dynamic Programming

As in the case of divide-and-conquer algorithms, dynamic programming is based on the insight that some problems can be solved by breaking them into subproblems; if this is done in a naive way, it is often the case that we end up solving the same subproblem many times. This is, of course, a waste of time and resources. Dynamic programming is a clever way to solve problems by breaking them down into simpler subproblems, solving each subproblem only once, and then using the answers to the subproblems to build the solution of the original problem. We will discuss this technique in chapter 5, when solving the sequence alignment problem.
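
A standard small example of the insight (the alignment problem itself will wait until chapter 5): computing Fibonacci numbers by naive recursion solves the same subproblems over and over, while the dynamic-programming version solves each subproblem exactly once and reuses the answers.

In [ ]:
def fib_naive(n):
    """Solves the same subproblems exponentially many times."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

def fib_dp(n):
    """Solves each subproblem once, building up from the smallest."""
    table = [0, 1]
    for i in range(2, n + 1):
        table.append(table[i - 1] + table[i - 2])
    return table[n]

print(fib_naive(20), fib_dp(20))   # same answer, very different cost
%timeit fib_naive(20)
%timeit fib_dp(20)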

Knowledge-based algorithms

In this category I am including any method that uses statistical knowledge to solve a problem. This technique has been extensively used in bioinformatics. For example, to fold a protein we could first collect the probability of each amino acid having a particular combination of $\phi$ and $\psi$ angles (i.e. the Ramachandran plot) and then use those probabilities to sample the conformational space. In chapter 7 we will briefly discuss this approach in order to build force fields (aka scoring functions).
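
A sketch of this idea with invented numbers: given hypothetical probabilities for a few $\phi$/$\psi$ bins, we sample conformations according to those probabilities instead of uniformly. Real probabilities would be collected from a database of known structures.

In [ ]:
import random

# hypothetical probabilities for three (phi, psi) bins:
# alpha-helix, beta-sheet and left-handed helix regions
bins = [(-60, -45), (-120, 130), (60, 45)]
probs = [0.55, 0.40, 0.05]

def sample_phi_psi():
    """Draw one (phi, psi) bin with probability proportional to probs."""
    r, acc = random.random(), 0.0
    for b, p in zip(bins, probs):
        acc += p
        if r < acc:
            return b
    return bins[-1]

print([sample_phi_psi() for _ in range(5)])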

Randomized Algorithms

The basic idea of this type of algorithm is to use random numbers (or, more precisely, pseudo-random numbers) to solve problems. The well-known and extensively used Monte Carlo methods belong to this class of algorithms. Another type of non-deterministic algorithm is the genetic algorithm. We will discuss both of them in chapter 7.
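
The textbook example of a Monte Carlo computation (a sketch, independent of chapter 7): estimate $\pi$ by drawing random points in the unit square and counting the fraction that falls inside the quarter circle.

In [ ]:
import random

def estimate_pi(n):
    """Monte Carlo estimate of pi from n random points."""
    inside = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y < 1.0:      # inside the quarter circle
            inside += 1
    return 4.0 * inside / n

print(estimate_pi(100000))   # approaches 3.14159... as n grows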