
Laidebeure Stéphane

Context of the project
What is protein design?
I The algorithms
    A Dead-end elimination procedure
    B Monte-Carlo simulation
II The model
    A The molecular model
    B The energy model
    C Generating problems
III Results
Conclusion

Context of the project

CS273 is a class which introduces the computational approach to structure and motion in molecular biology. The class gives an overview of kinematic models of molecules, algorithms concerning structure and sequence similarity, and structure prediction. For the project, I wanted to cover a field which was not really covered in class, and for that reason I decided to work on protein design.

To make that work interesting, I first wanted to build it inside Gromacs, which is a molecular dynamics package; in this way, my system would have been able to use all the functions of Gromacs, especially actual energy functions, pre-built models for molecules and amino-acids, and the ability to read from large molecule databases to extract useful features. Unfortunately, this turned out to be a bad idea: most of the Gromacs code has no comments, and the comments that exist assume you already know what the function does overall. I could not find any good documentation about the code itself, only tutorials on how to use Gromacs. Since I could not build my project on code that I did not fully understand, I had to switch to a simpler project, still involving protein design algorithms, but applied to smaller and simpler problems, using a toy model instead of real molecules.

What is protein design?

The aim of protein design is to obtain proteins matching a certain structure, and therefore having certain properties. More precisely, we are given a certain structure, namely a backbone, and we want to find a sequence of amino-acids which would fit into that specific structure. To find the sequence of such proteins, we look for molecules that could be stable in this configuration. To do that, the idea is to build the sequence which, in that configuration, has the lowest energy.
That is, we try to get a protein $P$ such that $C = \arg\min_{C'} E(P, C')$, where $E(P, C)$ is the energy of protein $P$ in configuration $C$; but we do that by computing $P_{\min}(C) = \arg\min_{P} E(P, C)$. Consequently, this model makes the assumption that the protein having the lowest energy in that configuration must be stable in this configuration.

I The algorithms

As we have seen before, the problem is to find a sequence of amino-acids having a certain property. The naïve method for such a computation would be to apply backtracking; but for a molecule of length $n$, this leads to $20^n$ steps of backtracking, which for reasonable values of $n$ (at least 20, which would be a rather small protein) would take far too much time. As a consequence, we need to use either approximate algorithms, which lead to a good solution but not necessarily to the best one, or algorithms which reduce the search without changing the accuracy. To reduce the search space, the most common approach is the dead-end elimination procedure.
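To make the cost of the naïve search concrete, here is a minimal sketch of an exhaustive enumeration over all sequences. The `ToyEnergy` structure, the function names and the table layout (unary terms plus pairwise terms) are assumptions made for this illustration and are not taken from the project code.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical toy energy: unary terms self[i][a] plus pairwise terms
// pair[i][a][l][b], summed over every pair of positions i < l.
struct ToyEnergy {
    std::vector<std::vector<double>> self;
    std::vector<std::vector<std::vector<std::vector<double>>>> pair;
};

// Total energy of one complete sequence.
double totalEnergy(const ToyEnergy& e, const std::vector<int>& seq) {
    double sum = 0.0;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        sum += e.self[i][seq[i]];
        for (std::size_t l = i + 1; l < seq.size(); ++l)
            sum += e.pair[i][seq[i]][l][seq[l]];
    }
    return sum;
}

// Visits all k^n sequences like an odometer, stores the lowest-energy one
// in `best`, and returns the number of sequences visited.
unsigned long long bruteForce(const ToyEnergy& e, int k, std::vector<int>& best) {
    const std::size_t n = e.self.size();
    std::vector<int> seq(n, 0);
    double bestEnergy = std::numeric_limits<double>::infinity();
    unsigned long long visited = 0;
    while (true) {
        ++visited;
        const double en = totalEnergy(e, seq);
        if (en < bestEnergy) { bestEnergy = en; best = seq; }
        // Increment the sequence as a base-k counter.
        std::size_t pos = 0;
        while (pos < n && ++seq[pos] == k) { seq[pos] = 0; ++pos; }
        if (pos == n) break;  // the counter wrapped around: all sequences seen
    }
    return visited;
}
```

With k = 20 and n = 20, the loop above would run about 10^26 times, which is why such a search is only usable on very small instances.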

A Dead-end elimination procedure

The idea of the dead-end elimination procedure is to find amino-acids which cannot possibly be part of an optimal solution, and to eliminate them.

Notation: a sequence $P$ is a shortcut for $P_0, P_1, \ldots, P_{n-2}, P_{n-1}$, and $P|_{P_i = A}$ denotes the sequence $P$ with amino-acid $A$ at position $i$.

The optimal criterion to eliminate a candidate amino-acid $A_j$ at position $i$ in the chain is the following: if

$$\exists A_k \neq A_j,\ \forall P_0, \ldots, P_{i-1}, P_{i+1}, \ldots, P_{n-1}: \quad E(P|_{P_i = A_j}, C) > E(P|_{P_i = A_k}, C),$$

then we can be sure that the amino-acid $A_j$ cannot be at position $i$, since for all possible combinations over the rest of the molecule, there is another amino-acid which would reduce the energy. But computing this criterion has a cost of the order of $O(20^n)$ as well, so we cannot expect it to be a better choice than backtracking.

However, with an energy function which can be expressed as a sum of unary terms and pair-wise terms, we can reduce that formula to a weaker form which happens to be efficient:

$$\exists A_k \neq A_j: \quad E(P_i = A_j) - E(P_i = A_k) + \sum_{l \neq i} \min_{m} \left[ E(P_i = A_j, P_l = A_m) - E(P_i = A_k, P_l = A_m) \right] > 0$$

which means that we check whether the amino-acid $A_j$, in the configuration which is the most advantageous for it against $A_k$, is still worse than this other amino-acid. If it is the case, then we can be sure that $A_j$ in position $i$ cannot be part of the optimal solution.

In the code, the function running one step of dead-end elimination is as follows:

    // nb, allowed, precomputedselfenergy, precomputedcoupleenergy and infinity
    // are defined elsewhere in the program.
    bool deadendeliminate()
    {
        bool res = false; // The result variable, indicates if there is an update.
        for (int i = 0; i < nb; i++) // For all positions in the chain
        {
            vector<int> tmp = allowed[i];
            allowed[i].clear();
            for (size_t j = 0; j < tmp.size(); j++) // for all rotamers possibly at that place
            {
                bool test = false;
                for (size_t k = 0; !test && k < tmp.size(); k++) // for any other rotamer at that place
                {
                    // We first compute the difference in the unary terms
                    double minidiff = precomputedselfenergy[i][tmp[j]]
                                    - precomputedselfenergy[i][tmp[k]];
                    for (int l = 0; l < nb; l++)
                        if (l != i) // for all other positions in the chain
                        {
                            double mindiff = infinity;
                            for (size_t m = 0; m < allowed[l].size(); m++) // for all rotamers there
                            {
                                double v = precomputedcoupleenergy[i][tmp[j]][l][allowed[l][m]]
                                         - precomputedcoupleenergy[i][tmp[k]][l][allowed[l][m]];
                                if (v < mindiff) mindiff = v;
                            }
                            minidiff += mindiff;
                        }
                    if (minidiff > 0) test = true; // the other rotamer is always better
                }
                if (!test) allowed[i].push_back(tmp[j]); // This rotamer is still allowed
                else res = true;                         // or not
            }
        }
        return res;
    }
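Since eliminating a rotamer at one position can make further eliminations possible elsewhere, a single pass is usually repeated until nothing changes. The following is a minimal self-contained sketch of that loop, using the pairwise criterion above; the names `Table2`, `Table4`, `dominates` and `eliminateToFixpoint`, and passing the tables as parameters instead of globals, are assumptions made for the example and not the project's actual interface.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Table2 = std::vector<std::vector<double>>;  // [position][rotamer]
using Table4 = std::vector<std::vector<std::vector<std::vector<double>>>>;  // [i][a][l][b]

// True if rotamer k is always better than rotamer j at position i,
// according to the pairwise dead-end elimination criterion.
bool dominates(std::size_t i, int j, int k, const Table2& self, const Table4& pair,
               const std::vector<std::vector<int>>& allowed) {
    double diff = self[i][j] - self[i][k];
    for (std::size_t l = 0; l < allowed.size(); ++l) {
        if (l == i) continue;
        // Most favourable case for j against k at position l.
        double best = std::numeric_limits<double>::infinity();
        for (int m : allowed[l]) {
            const double v = pair[i][j][l][m] - pair[i][k][l][m];
            if (v < best) best = v;
        }
        diff += best;
    }
    return diff > 0.0;
}

// Repeats elimination passes until a pass removes nothing.
// Returns the number of passes performed (including the last, idle one).
int eliminateToFixpoint(const Table2& self, const Table4& pair,
                        std::vector<std::vector<int>>& allowed) {
    int passes = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        ++passes;
        for (std::size_t i = 0; i < allowed.size(); ++i) {
            std::vector<int> kept;
            for (int j : allowed[i]) {
                bool dead = false;
                for (int k : allowed[i])
                    if (k != j && dominates(i, j, k, self, pair, allowed)) {
                        dead = true;
                        break;
                    }
                if (dead) changed = true;
                else kept.push_back(j);
            }
            allowed[i] = kept;
        }
    }
    return passes;
}
```

On a two-position toy instance where rotamer 1 at position 0 is only attractive when paired with rotamer 1 at position 1, the first pass can remove the latter and only the second pass can then remove the former, which is the situation the outer loop is meant to catch.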

Several variants of that heuristic are also used in real-life problems, in particular pairwise heuristics on the choice of amino-acids (keeping track of the fact that some amino-acid would not fit well with another one when running dead-end elimination).

B Monte-Carlo simulation

The Monte-Carlo method is, in many situations, the best way to get an approximate solution when minimizing an energy function over a large space. The general idea is to build a first solution (at random), and to try to minimize the energy function by repeatedly making small changes (changing one amino-acid, or changing two amino-acids). If done in a deterministic way, this would lead to a greedy algorithm, equivalent to a coordinate descent; but if we add some kind of simulated annealing to such a method, it leads to a procedure which is fairly good at leaving local minima, by allowing changes that increase the total energy, with a probability that decreases over time. The procedure can be expressed as follows, where rand() returns a uniform random number in [0, 1]:

    Procedure MCAlgorithm
        Create random assignment;
        Compute total energy TE;
        Store as best assignment, with best energy BE = TE;
        T = MAX_TEMP;
        While (T > MIN_TEMP)
            Select small change;
            Compute updated energy E;
            If (rand() > min(1, exp((TE - E) / T)))
                reject change;
            else
                accept change; TE = E;
                If (TE < BE) Store as best assignment; BE = TE;
            Update T;
        End While
    End Procedure

The initial and final values of T, as well as the way it is updated, lead to diverse behaviors of the program, from a fully greedy algorithm (if T is too small) to a purely random algorithm (if T is too big). The type of update (multiplicative or additive) determines the tendency to be greedier at the end (which is better when we do not store the best solution found so far), and the number of steps is chosen according to the expected speed of the program. In my code, the temperature values were not very big (starting at 1000, down to 0.001) and the update was multiplicative (leading to more greediness).
This choice was made because bad sequences tend to have an extremely large energy, and I therefore wanted to get rid of them early by having a T value which is not too big; and at the end, we want to make sure that small conformational changes cannot improve the result, because we are more interested in a stable sequence. I also added a restart feature: after every 100 update steps without a new optimum, the program may, with a low probability, reset the current configuration to the best assignment found so far. This lets us stay mostly in interesting areas while still having a chance to visit the whole space; for that, the probability must be low, so that any point in the space can still be visited. Adding random restart procedures on top of this allows us to search the space of solutions in a more random way, and avoids getting stuck in a deep local minimum as could happen on a single run.
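The procedure described above can be sketched as a small C++ function. This is only an illustration under stated assumptions: the temperature range (1000 down to 0.001), the multiplicative update and the check every 100 steps come from the text, while the function name `anneal`, the cooling factor 0.999, the restart probability 0.05, the fixed seed and the single-position move are hypothetical choices.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

// Simulated annealing over sequences of n positions with k choices each.
// `energy` is any caller-supplied function; the default parameters are
// illustrative values, not the ones used in the project.
std::vector<int> anneal(int n, int k,
                        const std::function<double(const std::vector<int>&)>& energy,
                        unsigned seed = 1, double tMax = 1000.0, double tMin = 0.001,
                        double cooling = 0.999, double restartProb = 0.05) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uniform_int_distribution<int> pickPos(0, n - 1), pickAa(0, k - 1);

    std::vector<int> cur(n);
    for (int& a : cur) a = pickAa(rng);  // random initial assignment
    double curE = energy(cur);
    std::vector<int> best = cur;
    double bestE = curE;
    int sinceImprovement = 0;

    for (double t = tMax; t > tMin; t *= cooling) {  // multiplicative cooling
        std::vector<int> cand = cur;
        cand[pickPos(rng)] = pickAa(rng);            // small change: one position
        const double candE = energy(cand);
        // Metropolis rule: improvements are always accepted, worse moves
        // are accepted with probability exp((curE - candE) / t).
        if (unit(rng) < std::min(1.0, std::exp((curE - candE) / t))) {
            cur = cand;
            curE = candE;
        }
        if (curE < bestE) {
            best = cur;
            bestE = curE;
            sinceImprovement = 0;
        } else if (++sinceImprovement % 100 == 0 && unit(rng) < restartProb) {
            cur = best;                              // occasional jump back to the best
            curE = bestE;
        }
    }
    return best;
}
```

Called with an energy function built from the unary and pairwise terms of the model, `anneal` returns the best assignment seen during the run; running it several times with different seeds corresponds to the random restarts mentioned above.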

II The model

A The molecular model

The toy model I designed consists of a system of rigid sticks: an amino-acid is a sequence of a certain number of main sticks (the backbone), which are allowed to rotate freely around one another, but not around the knees (the joints between consecutive sticks). Therefore, this model has no notion of torsion angles (this assumption has been made to simplify the computation, but torsion angles would not be difficult to add to the code if needed), but it does have a notion of bond angles. I will refer to the knees as atoms, because that is the equivalent notion in real-life problems. The lengths of the sticks are also fixed, being equivalent to bond distances.

In addition to the backbone, the molecular system also defines a notion of side chains, by adding new sticks associated with a position on the chain, and two angles giving the orientation of that stick. When the chain is deformed at an atom, the side chains are automatically moved to the position which makes the global structure (in terms of relative positions) as similar as possible to the original one. For example, if a bond angle is diminished, the angle in that same direction with the side chain is multiplied by the same factor, so that the side chain keeps the same position relative to those two atoms.

At each run, the program reads a library containing a certain number of variants of those amino-acids, and tries to find an optimal sequence fitting the target carbon chain, given by the set of 3D coordinates of its atoms.

B The energy model

The energy model is meant to be very similar to actual energy models, to make the computation closer to real-life situations, so that the results should be reasonable.
The energy model can be divided into two parts:

a) Bound atoms energy model

In the molecular model, bound atoms have rigid sticks between them, therefore their distance should be fixed; yet, we allow a certain variability in that factor, for a certain cost of energy: if $d_0$ is the default length of the stick, the energy for having those two atoms at distance $d$ will be

$$E = A \left[ \left( \frac{d_0}{d} \right)^{12} - 2 \left( \frac{d_0}{d} \right)^{6} \right];$$

this corresponds to van der Waals binding energy. The bond angles are also an important part of the model. For them, in the same way, we have a default angle $a_0$, and the cost for having an angle $a$ is therefore

$$E = K (a - a_0)^2.$$

We can notice that both those energy functions have their minimum value for the default values, leading to a better stability when the amino-acids are in their standard configuration.

b) Non-bound atoms model

For non-bound atoms, the energy model contains only a term concerning their distances. In the same way as before, the energy depends on a default value $d_0 = 1.55$, and the energy function is

$$E = A \left[ \left( \frac{d_0}{d} \right)^{12} - 2 \left( \frac{d_0}{d} \right)^{6} \right].$$

More generally, the final energy function for the model is the sum of all those terms.

C Generating problems

To generate problems to solve, and because I could not build them from PDB files, I wrote scripts that generate chains and amino-acids independently. For the chain, starting from the origin with the first atom, I set the distance to the next atom with a normal distribution centered on 1; then, for each new atom, I generate a new random unit vector, average it with the preceding one (to have some continuity in the backbone), and scale it to the randomly chosen length. The library of amino-acids is generated by randomly choosing the bond distances between each pair of bound atoms, as well as the bond angle for each three consecutive atoms. The side chains are generated by randomly setting a parent atom, an angle (in the direction of the bond), and a second angle (torsion relative to the bond).

III Results

Average number of remaining rotamers after dead-end elimination (20 at the beginning), depending on the folding of the chain and on its length (in number of amino-acids):

    Folding \ Length    5      10     20     40
    Almost straight     1      1.2    1.3    1.3
    Average             12.2   16.3   17.8   19.2
    Lots of folding     15.6   17.2   19     20

Dead-end elimination allows us to reduce the amount of search in the process of protein design without losing accuracy. When used separately, both dead-end elimination and Monte-Carlo simulation lead to perfect and fast results on small problems; but when the size of the domain grows, Monte-Carlo simulation remains fast but lacks accuracy, while the backtracking step after dead-end elimination becomes too slow to be usable in practice.
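As a rough illustration of what these averages imply for the backtracking step: if position $i$ keeps $r_i$ rotamers, backtracking has to consider $\prod_i r_i$ sequences, which by the arithmetic-geometric mean inequality is at most $\bar r^{\,n}$, where $\bar r$ is the average number of remaining rotamers. Treating the table entries as $\bar r$ for a single problem is an assumption (the reported values may also be averaged over several runs), so the figures below are only orders of magnitude.

```latex
\begin{align*}
\prod_{i=0}^{n-1} r_i \;&\le\; \Bigl( \frac{1}{n} \sum_{i=0}^{n-1} r_i \Bigr)^{n} = \bar r^{\,n} \\
\text{almost straight, } n = 40:&\quad 1.3^{40} \approx 3.6 \times 10^{4} \\
\text{average folding, } n = 10:&\quad 16.3^{10} \approx 1.3 \times 10^{12} \\
\text{average folding, } n = 40:&\quad 19.2^{40} \approx 2.1 \times 10^{51} \\
\text{no elimination, } n = 40:&\quad 20^{40} \approx 1.1 \times 10^{52}
\end{align*}
```

Under this reading, an almost straight chain of length 40 leaves a search space that backtracking can exhaust easily, while an averagely folded chain of length 10 already leaves on the order of $10^{12}$ sequences.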
The use of both techniques together, even if it does not theoretically lead to an optimal solution, tends to give extremely good results when dead-end elimination works well, combining the advantages of both techniques in most situations: where dead-end elimination performs well, the Monte-Carlo simulation with dead-end elimination gives the optimal solution; and in the other cases, it performs, quite logically, just like the standard Monte-Carlo simulation.

The dead-end elimination procedure does not perform equally well on all problems. It is extremely efficient when the protein has almost no folds, leading to a greedy algorithm most of the time; on the other hand, it seems to perform extremely poorly on proteins having lots of folds, probably because in this situation all rotamers tend to have at least one advantage over the others.

In terms of speed, the margin between the domain where the algorithm leads to a greedy search and the domain where it becomes intractable is really tight. For straight chains, I could run it on fairly long chains without any trouble (most of the computational time was due to the dead-end elimination rather than to the backtracking); on the other hand, even for chains of length 10 (amino-acids), in situations where the dead-end elimination was not extremely efficient, the backtracking would take far too long.

Conclusion

In this project, once the stage of finding documents and defining the subject was passed, the most difficult part has, surprisingly, been to generate interesting sets of sticks (coherent, but different enough to produce different behaviors), together with the chain. Because the chain could not be generated from the sticks themselves (otherwise the solution would have had an energy so much lower than the other combinations that the algorithm would necessarily have found it), the difficulty was to generate a chain which could still be built with them.

The poor results of dead-end elimination on my data can probably be attributed to two factors. First, the absence of torsion angles makes the shape of the sticks much easier to change, so that they can be made to fit reasonably well to any part of the molecule. Second, the sticks are single long entities, while amino-acid side chains can be considered more or less independently, leading to more local changes.

One way to save a lot of computational time on this model would be to develop a forward-checking step in the backtracking, by calling the dead-end elimination at each new labeling; this would take advantage of the reduced domains of all the instantiated variables to find more rotamers to eliminate for the next step.