Unlike other blog posts where we actually build some model, here we are just going to motivate an idea. The idea is synthetic evolution, in this case, designing proteins for specific biological purposes.
Protein design is not a new concept. Rational protein design dates back at least to the 1970's, https://en.wikipedia.org/wiki/Protein_design.
Proteins and peptides, I will generally use them interchangably, can be used for a variety of purposes. The earliest notable use of a synthetic peptide for disease treatment, and some would say the beginning of biotechnology was the artificial production of insulin. Since then a variety of peptides have been used for disease treatment in addition to the more widely used small-molecules (http://www.sciencedirect.com/science/article/pii/S1359644614003997). Because peptides represent much more cominatorial design space, they could eventually provide much moresignificant treatments than small molecule treatments.
The problem with such a large combinatorial design space, is how to identify good candidates out of such a large space. The naive combinatorial screening requires exponential numbers of tests in the length of the peptide and this requires many costly experiments.
The field of Molecular Dynamics (MD) simulations, attempts to simulate protein-protein interactions and other molecular interactions, with some success. The problem though is that while MD simulations are often cheaper than wet-lab experiments, MD simulations are less accurate, and we still have no systematic way of deciding on what mutation-changes should be made to produce a "better" protein/peptide. We have just "kicked the can" in terms of evaluating our proteins in a simulation instead of in solution.
We should also consider why we would want to design proteins at all, since nature provides a natural way to optimize proteins according to some (often indirect) fitness function. I remember in sophmore year of undergrad in a Molecular Biology Lab we took a special type of E. coli that had it's mutation correction machinery disabled. We inserted a couple of plasmids and let them grow. I forget the exact biology of the experiment, but in the wild-type case (i.e. the system before mutation) two proteins would find each other, bind and produce a colored by-product. After running the E. coli cell culture for a week or so, we came back and found that in many cases the mutants did not show and color; the mutation had broken the binding. However, in a small number of cases, the mutants showed MORE color than the wild types; the mutation had increased the binding affinity between the two proteins, or increased the catylitic rate of the active site (I forget). The point is that in many cases we can naturally evolve existing proteins to have BETTER function in as little as a week. This is helped when the protein in question can be transformed into a fast growing E. coli with it's mutation correction machinery removed. However, the real limitation with this natural form of evolution is that in many cases, in order to improve "fitness" a series of say 3 or more simultaneous mutations has to occur. It's hard for so many simultaneous changes to happen in a natural LIVING organism, because if that many mutations are allowed to occur, regarless of the mutation source (unless it is very targeted mutation) the organism will quickly die due to a mutation in a critical part of its genome. So mutation in living cells can only proceed so quickly.
The above leads us to a question. Whether we are evaluating our protein's "fitness" in silico, or in vitro, or in vivo, how do we choose what peptide to synthesize to test out? We've assumed here that we are going to do something along the lines of an iterative optimization (like natural evolution) where we synthesize a given peptide, evaluate its "fitness" and then come up with a new peptide in order to try and find a better peptide.
The algorithm as I've described it seems like a typical optimization problem. One simple and popular optimization technique is gradient decent, but unfortunatly it requires our "fitness function" to be differentiable. A differentiable fitness function is absolutely unlikely to occur period. Furthermore though think of the domain of our fitness function. It maps a variable string of amino acids on to, lets say a real number corresponding to fitness. How would you differentiate according to a variable string??
Ok at the other end of the optimization spectrum is Monte Carlo. In Monte Carlo we basically just uniformely sample the input space and test each sample in the output space. There are no assumptions on the nature of the fitness function, but Monte Carlo also ussually isn't as efficent as other optimization methods since it isn't taking advantage of any specific properties of the fitness function.
Ok, now I am about to propose my solution to the problem. The solution is........ MCMC, Markov Chain Monte Carlo, specifically lets say the Metropolis Hastings algorithm. In brief, we start with a sample of the input space, i.e. a protein sequence with a known fitness. Then we sample from a distribution of "simmilar proteins" to our original protein, and evaluate the fitness of that simmilar protein. If the fitness increases, we "accept" the new protein and repeat the process. If the fitness decreases we accept the new protein with some probability. If we don't accept the new protein then we draw again from the distribution of "simmilar proteins". Metropolis Hastings is an important algorithm because it ussually lets us approach an optimal solution faster than pure Monte Carlo, but it makes no hard assumptions on the underlying fitness function (i.e. doesn't have to be differentiable). The only problem here is that we need a distribution that will allow us to "sample simmilar proteins to a given protein", that distibution is what we are building towards in these blogs. You might be able to see where I am heading with generative models and seq2seq models. We are going to use deep learning to do synthetic evolution!!
The ultimate goal is a cost effective way of designing/evolving proteins/peptides against some fitness function.
Also by the way the Metropolis Hastings is also good because it can let us "sample from the fitness function" as if the fitness function were a probability distribution. I'm not sure if that has any uses in the future, but maybe it will??
Edit: Also I want to add, that through this scheme, especially by using VAE we get latent embeddings for each protein/peptide "for free". Sort of like word2vec type embeddings.
In [ ]: