EVOLVING LOCAL MINIMA IN THE PROTEIN ENERGY SURFACE

EVOLVING LOCAL MINIMA IN THE PROTEIN ENERGY SURFACE
by
Brian S Olson

A Dissertation Submitted to the Graduate Faculty of George Mason University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy, Computer Science

Committee:
Dr. Amarda Shehu, Dissertation Director
Dr. Estela Blaisten-Barojas, Committee Member
Dr. Kenneth De Jong, Committee Member
Dr. Jana Kosecka, Committee Member
Dr. Jyh-Ming Lien, Committee Member
Dr. Sanjeev Setia, Department Chair
Dr. Kenneth Ball, Dean, The Volgenau School of Engineering

Date: Summer Semester 2013
George Mason University, Fairfax, VA

Evolving Local Minima in the Protein Energy Surface

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at George Mason University

By Brian S Olson
Master of Science, George Mason University, 2011
Bachelor of Science in Engineering, Princeton University, 2005

Director: Dr. Amarda Shehu, Professor, Department of Computer Science

Summer Semester 2013
George Mason University, Fairfax, VA

DEDICATION

I dedicate this dissertation to my wife Sarah M. Richardson.

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Amarda Shehu for all the help and support she has given me in my graduate studies. She took a chance on me early in both of our careers, and it is her guidance that has made me successful in my studies. I would also like to thank the other members of the Shehu lab, especially Kevin Molloy, for collaboration on countless projects. I would like to thank the Hydra cluster for its tireless efforts on my behalf and the thousands of hours of time it has put into this thesis. I also want to acknowledge the efforts of Alastair Neil in keeping everything running smoothly on Hydra and thank the other users of the cluster for being patient when they see TestSearch processes running on every node. Finally, I would also like to thank my dog Berkeley ROJCA, who did absolutely nothing to help with this thesis, but he gets blamed whenever anything goes wrong, so it is finally time for him to get some credit.

This material is based upon work supported by the National Science Foundation under Grant numbers and

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract
1 Introduction
Contribution: A novel Evolutionary-inspired Framework for Protein Structure Modeling
Background and Related Work
Modeling the Protein Conformational Space and Energy Surface
Protein Geometry and Protein Conformational Space
Protein Energy: Mapping of the Protein Conformational Space
Related Work on Protein Structure Modeling
Computational Biology Approaches
Robotics Motion-Planning Algorithms
Evolutionary Approaches for Exploration
Integration of Domain-specific Expertise and Evaluation of Proposed Framework
Employed Coarse-grained Representation
Employed Coarse-grained Energy Functions
Associative Memory Hamiltonian with Water (AMW)
Rosetta Coarse-grained Energy Function
Molecular Fragment Replacement
Rosetta Fragment Library
Semi-metrics for Comparing Conformations
Least Root Mean Square Deviation (RMSD)
Global Distance Test (GDT)
Implementation Details
Starting Points for Conformational Search
3.5.2 Analysis and Performance Measurements
Target Systems of Study
Sampling Low-energy Conformations
Sampling Near-native Conformations
Hybrid Global-Local Search: Explicit Sampling of Local Minima
Effective Sampling of Local Minima: Protein Local Optima Walk (PLOW)
Fragment-based Minimization for Greedy Local Search
Perturbation for Global Search
Acceptance Criterion for two Consecutively-sampled Local Minima
Effectiveness of PLOW
Analysis of Local Minima Sampling
PLOW versus Multistart: Importance of Adjacency Relationship
Controlling Temperature of Local Search
Combining Global and Local Search
A Population-based Hybrid Evolutionary Algorithm (HEA)
Analysis of HEA
Conclusions
Guiding Sampling through Perturbation
Controlling Perturbation Distance of Fragment Replacement
Controlling Perturbation
Combining Conformational Features with Crossover
Hybrid Genetic Evolutionary Algorithm (GA)
Analysis of Crossover Operators
Navigating the Protein Energy Surface
Fitness Improvement
Sampling Near-native Conformations
Conclusions
Guiding Sampling with Multiple Objectives
A Multi-Objective (hybrid) Evolutionary Algorithm (MOEA)
Decomposing an Energy Function for Multiple Objectives
Pareto Dominance and Multi-objective Scoring Metrics
Pareto Archive in MOEA
Population Selection
6.2 Analysis of MOEA
Sampling low-energy conformations
Sampling near-native conformations
Conclusions
Bringing it all Together with External Validation
Algorithmic Realizations of Proposed Framework for Comparison to ClassicRosetta
ClassicRosetta: Coarse-grained Sampling in the Rosetta Ab-initio Protocol
Experimental Setup for Comparison of Proposed Framework Against ClassicRosetta
Analysis of Sampling of Low-energy Conformations
Analysis of Sampling Near-native Conformations
Unbiased Comparison of Proposed Framework to ClassicRosetta on Testing Set of Protein Sequences
A Modular Software Platform for Proposed Evolutionary Search Framework
Conclusions and Future Work
A Additional Result Data
References

LIST OF TABLES

3.1 The weights for each of the 13 Rosetta energy terms are given for each of the 5 coarse-grained Rosetta energy functions.

The native PDB id, length, and fold are given for each of the 20 target protein systems used to conduct experiments in chapters 4-7. Columns 5 and 6 represent the percentage of amino acids which form α helices and β sheets, respectively, for each target.

Column 5 gives the RMSD between the native structure and the closest local minimum found when performing multiple greedy local searches starting from the native structure.

The lowest RMSD to native structure achieved is shown for both PLOW and FeLTr. The RMSDs given are the average of five runs, with the minimum of the five runs shown in parentheses. Column 5 shows the average number of iterations of each PLOW LocalSearch function. FeLTr* represents the FeLTr framework using the value from column 5 as its MMC search length.

Columns 5-8 report the minimum energy achieved for each temperature T of the minimization component of the BH framework. Columns 9-12 then report the corresponding lowest RMSD to the native structure achieved for each T.

The lowest energy level reached by PLOW and the HEA is given for each target protein system. Results are given as the average over 5 independent runs, with the minimum over the runs given in parentheses. HEA T = 0 employs a greedy local search, while HEA T 0 employs a low-temperature MMC for the local search.

4.5 Columns 5-7 give the minimum RMSD sampled across all independent runs of PLOW and the HEA for T = 0 and T 0 local search temperatures. Columns 8-9 give the percentage of near-native conformations sampled for each algorithm, with values above 1% highlighted in bold.

Columns 2-4 show the PDB id of the native structure, number of amino acids, and native fold topology for each target protein, respectively. The remaining columns show the lowest energy sampled during each experiment, averaged over 30 runs. The minimum lowest energy over all runs is shown in parentheses. Results for which the mean difference between crossover and mutation only is statistically significant are given in bold (95% confidence according to the Mann-Whitney U test).

The lowest Cα-RMSDs to the known native structure over conformations sampled during each experiment are given. Columns 5-9 report results for experiments performed in this work, while Column 10 shows the lowest Cα-RMSD reported for each protein by the ItFix algorithm [1]. Column 11 shows the Pearson correlation coefficient between energies of sampled conformations and their Cα-RMSD across all experiments.

The average minimum Rosetta score4 energy sampled across all independent runs is given for the HEA, MOEA, and MOEA-PC (the minimum across all five runs is given in parentheses). Energies highlighted in bold represent a significant improvement in minimum energy sampled by MOEA and MOEA-PC over the HEA.

Columns 5-7 give the minimum RMSD sampled across all independent runs of the HEA, MOEA, and MOEA-PC. Columns 8-9 give the percentage of near-native conformations sampled for each algorithm, with values above 1% highlighted in bold.

The parameters for each of the 4 sub-stages of the Rosetta coarse-grained ab-initio protocol are given.

The minimum Rosetta score4 energy sampled across all independent runs is given for ClassicRosetta, MOGA, and HEA*. Energies highlighted in bold represent a significant improvement in minimum energy sampled by MOGA and HEA* over ClassicRosetta.

7.3 Columns 5-7 give the minimum RMSD sampled across all independent runs of ClassicRosetta, MOGA, and HEA*. Values highlighted in bold represent a significant improvement over ClassicRosetta with respect to minimum RMSD to native sampled.

The minimum Rosetta score4 energy and RMSD to native sampled across all independent runs are given for ClassicRosetta and MOGA on an unbiased set of proteins which have not previously been employed to test the techniques explored in this thesis.

A.1 The lowest energy level and RMSD to native reached by the HEA with T=0 are given for each target protein system. Results are given as the minimum and average over 5 independent runs with the 90% confidence intervals given for the average.

A.2 The lowest energy level and RMSD to native reached by the GA with 1pt crossover are given for each target protein system. Results are given as the minimum and average over 5 independent runs with the 90% confidence intervals given for the average.

A.3 The lowest energy level and RMSD to native reached by the MOEA are given for each target protein system. Results are given as the minimum and average over 5 independent runs with the 90% confidence intervals given for the average.

A.4 The lowest energy level and RMSD to native reached by the MOEA-PC are given for each target protein system. Results are given as the minimum and average over 5 independent runs with the 90% confidence intervals given for the average.

A.5 The lowest energy level and RMSD to native reached by the MOGA are given for each target protein system. Results are given as the minimum and average over 5 independent runs with the 90% confidence intervals given for the average.

A.6 The lowest energy level and RMSD to native reached by the MOGA* are given for each target protein system. Results are given as the minimum and average over 5 independent runs with the 90% confidence intervals given for the average.

LIST OF FIGURES

2.1 (a) A structure corresponding to the native state of ubiquitin rendered by VMD [2]. The solid grey tube traces the atoms making up the backbone chain of the protein. (b) A section of atoms from amino acid position 47 to position 52 in the protein chain. Backbone atoms are drawn in dark gray and annotated with atom type, and side chains are drawn in white. Amino acids are separated by dashed lines. (c) The longest amino acid, arginine, is drawn showing all of the dihedral angles. (d) The internal coordinates are illustrated on four backbone atoms. (e) The dihedral angle defines the angle between two planes formed between the first and second bond and the second and third bond.

The FeLTr search tree is initialized with the extended conformation at the root. Each iteration of the search selects a vertex from the tree for expansion via a short MMC trajectory. The result of this MMC trajectory is then added to the search tree as a new vertex. FeLTr employs a two-level selection process to bias selection towards both low-energy and geometrically diverse conformations. In this example, first the energy level highlighted in green is selected. Then one of the three vertices within that energy level is selected for expansion based on the geometric projection layer. The geometric projection layer consists of a three-dimensional grid using the 1st, 4th, and 7th moments from the USR projection method [3]. (Note that only two dimensions are illustrated here.)

3.1 (a) The native structure under id 1dtdB in the Protein Data Bank (PDB) [4] of deposited native structures is shown using the New Cartoon graphical representation (rendered by VMD [2]). (b) A short portion (amino acid positions 11-13) of this structure is shown in greater detail. The backbone atoms of these amino acids are drawn in dark blue, and the corresponding side-chain atoms are in light grey. The backbone dihedral angles are annotated for these three amino acids, and their values in the native structure are shown below in degrees. An angular coarse-grained representation would store only these angles for each conformation of the amino-acid chain. Forward kinematics would be used to obtain the Cartesian coordinates.

A library of fragment configurations extracted from the PDB is defined at the beginning of a search. When a position i (shown in red) in the conformation is to be modified, a corresponding fragment configuration is selected at random from the library. The dihedral angles of the selected fragment in the conformation are then replaced with those of the fragment configuration selected from the library, beginning at position i (shown in green).

The figure illustrates (a) greedy local search, (b) naive sampling, and (c) PLOW on a simplified energy surface. (a) A sampled conformation C_i(sampled) (empty blue circle) is mapped to the nearest local minimum C_i(minimum) (solid purple circle) by a greedy local search (series of short purple arrows). (b) 5 points sampled at random (empty blue circles) by the naive sampling approach are each mapped to a nearby local minimum (solid purple circles) by a greedy local search (series of short purple arrows). (c) PLOW begins at C_0 (leftmost empty blue circle). Through a series of perturbations (long orange arrows) and greedy local searches (short purple arrows), PLOW samples conformations representative of local minima (C_1 through C_4) in the energy surface.

The distribution of energies obtained by BH is superimposed over that obtained by the multistart method for the proteins with native PDB ids 2ezk (a) and 1hhp (b).

4.3 The energy surface sampled for the protein with native PDB id 1fwp is shown for each temperature T. The x- and y-axes represent projection coordinates based on interatomic distances within each conformation, and the z-axis represents the energy of each sampled local minimum. The white x indicates the location of the native structure in the energy surface.

The mean perturbation distance between C_i and C_i(perturb) is plotted against the lowest RMSD from the native structure obtained by PLOW on each of 15 protein systems. The strong linear correlation (the identity line is drawn in red) suggests that the efficacy of the perturbation function is directly related to the efficacy of the search in PLOW.

The distribution of perturbation distances between C_i and C_i(perturb) is shown for two selected proteins with PDB ids 3gwl in (a) and 1hhp in (b). The area shaded in red represents the cases where the perturbation distance between C_i and C_i(perturb) is less than 1Å RMSD and is thus deemed an insignificant change from the conformation C_i.

The mean µ_MM is shown for a given target D.

The frequencies of µ_MM sampled during the search for proteins with native structure PDB ids 1ail and 1isuA are shown in (a) and (c), respectively. Frequencies of RMSDs to the native structure for each protein are given in (b) and (d), respectively. The solid red line represents PLOW employing the unbiased perturbation method. The dashed lines represent PLOW with median perturbation distances D = 1Å to D = 5Å.

Flowchart of the hybrid genetic algorithm. In each generation, the chosen crossover operator is followed by a mutation operator and a local search to optimize a conformation to a nearby local minimum. The new population of local minima competes with elite members of the parent population through truncation selection.

The mean fitness improvement between parents and children is given for each experiment versus the average lowest energy reached. A strong linear correlation is noted, suggesting that a more explorative variation operator with lower mean fitness improvement will allow an EA to maintain breadth in search and access lower energies.

6.1 Conformations are plotted with respect to two energy terms E_1 and E_2. Conformations represented by empty blue circles are non-dominated and form the Pareto front. C_2 strongly dominates 4 conformations and weakly dominates 1 additional conformation; thus the Pareto count of C_2 is 4 for strong Pareto dominance and 5 for weak Pareto dominance (strong Pareto dominance is employed in this thesis).

The frequencies of RMSD to native sampling are shown for the 12 protein systems where near-native conformations (below 5Å RMSD) are achieved. The HEA, MOEA, and MOEA-PC are represented as a solid black line, dotted blue line, and solid purple line, respectively.

The frequencies of RMSD to native sampling are shown for the 12 protein systems where near-native conformations (below 5Å RMSD) are achieved. The HEA, MOEA, and MOEA-PC are represented as a solid black line, dotted blue line, and solid purple line, respectively.

On the left side, each conformation sampled across all independent runs by the HEA (top), MOEA (middle), and MOEA-PC (bottom) is plotted with respect to total energy and RMSD to the native structure. On the right, only conformations actually retained in the population are shown for a single run. Here a 3rd dimension (generation) and 4th dimension (age of the conformation) provide a more detailed view of how the conformational space is explored.

On the left side, each conformation sampled across all independent runs by the HEA (top), MOEA (middle), and MOEA-PC (bottom) is plotted with respect to total energy and RMSD to the native structure. On the right, only conformations actually retained in the population are shown for a single run. Here a 3rd dimension (generation) and 4th dimension (age of the conformation) provide a more detailed view of how the conformational space is explored.

6.6 On the left side, each conformation sampled across all independent runs by the HEA (top), MOEA (middle), and MOEA-PC (bottom) is plotted with respect to total energy and RMSD to the native structure. On the right, only conformations actually retained in the population are shown for a single run. Here a 3rd dimension (generation) and 4th dimension (age of the conformation) provide a more detailed view of how the conformational space is explored.

The ensemble of sampled decoy conformations is plotted for MOGA (on the left) and ClassicRosetta (on the right) for three representative protein systems. Points are plotted with respect to the Cα-RMSD to native and Rosetta score4 energy.

ABSTRACT

EVOLVING LOCAL MINIMA IN THE PROTEIN ENERGY SURFACE
Brian S Olson, PhD
George Mason University, 2013
Dissertation Director: Dr. Amarda Shehu

Proteins are the molecular tools of living cells, and the path to unraveling their function is through modeling and understanding their structure. Many diseases occur when a protein loses its intended function due to its inability to form the appropriate structure with which it binds to other molecules. A holistic approach to protein modeling would characterize all possible structural states accessible by a protein under native conditions. However, this task is infeasible. The question then becomes: how can we model the subset of these structural states most relevant to the function or dysfunction of a protein?

This thesis proposes a novel computational framework to obtain an expansive view of the protein conformational space relevant for function while controlling computational cost. The framework complements experimental and high-resolution computational methods, which limit their focus to a single region of the conformational space. The framework employs the knowledge that functionally-relevant conformations are those low in energy, and it incorporates the latest understanding of protein structure and energy from biophysics. Specifically, this thesis proposes a novel stochastic search framework for exploring a diverse ensemble of conformations which capture low-energy basins in the protein energy surface.

The proposed search framework employs a hybrid, or memetic, approach for explicit sampling of local minima in the protein energy surface. This hybrid search framework combines a global evolutionary search approach with a local search component to take advantage of the latest advances from the computational biology community. Specifically, the following questions are addressed to effectively model the protein conformational space: (1) How to balance limited computational resources between exploration of the conformational space in global search and exploitation of local minima in local search? The hybrid search framework combines a global evolutionary search to explore the breadth of the conformational space with a local search for efficiently exploiting local minima in the underlying energy surface. (2) How to sample new conformations at the global level? Two complementary approaches are investigated. One approach proposes an enhanced fragment selection method for sampling a new conformation based on an existing structure. The other approach employs a genetic algorithm to combine features from multiple existing structures to sample a new conformation. (3) How to employ energy to better discriminate between interesting conformations and noise in the conformational search space? A multi-objective decomposition of the energy function is employed to guide the search towards more biologically relevant, low-energy conformations by focusing on the energy terms with the most discriminatory power.

Work in this thesis shows that, by combining advanced algorithmic components with the latest understanding of protein biophysics, the proposed search framework is able to more effectively model functionally-relevant conformational states. A direct comparison between the proposed framework and a state-of-the-art coarse-grained sampling algorithm shows that the enhanced sampling strategies lead to a more comprehensive picture of the underlying protein energy surface. By taking this more comprehensive view, the framework is able to capture the protein native state as well as or better than methods relying primarily on protein-specific sampling strategies.

CHAPTER 1: INTRODUCTION

Proteins are the molecular tools that cells use to carry out the functions of life. Indeed, many diseases, referred to as proteinopathies, occur when a protein loses its intended function due to its inability to assume an appropriate structure in the cell [5, 6]. Examples include cancers and many neurodegenerative disorders, such as Alzheimer's disease, prion diseases, and Huntington's disease [7]. Given that proteins carry out their biological function by binding with other molecules in the cell, it is not surprising that the path to unraveling protein function goes through understanding and modeling protein structure. This thesis focuses on the particular problem of characterizing protein structure as a means to understanding protein function, within the large subarea of computational biology known as protein modeling.

An integrated model of protein structure and function should find all possible structural states accessible by a protein molecule under physiologic conditions [8, 9]. This approach takes into account the dynamic nature of protein molecules. Experiment, theory, and computation have shown that proteins exploit anywhere from small-scale fast fluctuations to large-scale slow concerted motions of their amino-acid chains to populate different spatial arrangements, or conformations, through which to modulate biological function [8, 10-17]. Indeed, even the classic picture of a unique functional/structural state of a simple unimodal protein hides the fact that this native state is a potentially large, albeit often homogeneous, ensemble of conformations under physiologic conditions.

While it is certainly appealing to obtain a comprehensive view of the protein conformational space relevant for biological activity, the holistic approach remains infeasible. Neither experiment nor computation alone can currently map all functionally-relevant structures of a protein molecule. Examples exist of particular studies that combine experimental data, theory, and computational techniques to obtain a close-to-complete view of special classes of proteins with limited structural flexibility [12, 18, 19]. In the general setting, the infeasibility is not surprising, as even the problem of obtaining just one conformation of lowest potential energy for a protein chain has been shown to be NP-hard [20].

Experimental techniques are limited to modeling just a few of the structures a protein might access under native conditions. By structure we refer to a particular three-dimensional placement of the atoms that make up a protein molecule. While structure is often used interchangeably with conformation, the distinction between the two is made clearer in the context of computational techniques such as the ones presented in this thesis, and will be discussed in chapter 2. Among experimental techniques, X-ray crystallography captures a single representative/average structure of the native state associated with lowest free energy according to the thermodynamic hypothesis [21]. Another experimental technique is NMR spectroscopy, which is more powerful than X-ray crystallography because NMR can yield not just one structural model but an ensemble of structural models consistent with macroscopic distance-based measurements; however, NMR is limited to small protein systems (typically fewer than 100 amino acids), can reveal no more than a few dozen models, and often these models are limited to a single functional state. The transitions that a protein can undergo to hop between functional states are often too slow to probe through experimental techniques [22]. Experimental techniques capable of yielding significantly different structural models representing different functional states include FRET and cryo-electron microscopy, but these techniques are either limited to very low resolution, resulting in rather coarse models, or limited to a few structural states [23, 24].

This thesis proposes to address the current inadequacy of experimental techniques through a complementary computational framework aimed at obtaining a more expansive view of the protein conformational space relevant for function. The theoretical foundation of the proposed computational framework is that of thermodynamics, under which functionally-relevant conformations are those available to a protein molecule under physiologic/equilibrium conditions and are thus associated with low energies. Specifically, the thesis proposes a stochastic search framework for exploring a diverse ensemble of conformations which capture low-energy basins in the protein energy surface.

A detailed treatment of protein structure and energetics in chapter 2 makes the case that a stochastic search framework is critical. Exploration with deterministic algorithms is infeasible due to the high dimensionality of the protein conformational space. Moreover, as chapter 2 relates, the energy surface underlying the protein conformational space that is accessible in computation is rich in local minima. These two aspects, the dimensionality of the search space and the nonlinearity and multimodality of the protein energy surface, have attracted many researchers from diverse fields with either a direct or an indirect interest in protein modeling. Specifically, this thesis draws inspiration from three communities: the computational biology and biophysics community, the sampling-based motion planning subcommunity in algorithmic robotics, and the evolutionary computation community. The work presented in this thesis combines, adapts, and builds on ideas originating in each of these communities on how to conduct stochastic search in complex search spaces such as those characterizing protein systems.

The contributions of this thesis over existing work on stochastic search algorithms for modeling protein structure become clear when such an algorithm is reduced to its two main tasks. At its core, a stochastic search algorithm must be able to:

(1) Iterate through different protein conformations.
(2) Discriminate among computed conformations so that the search is biased towards lower-energy conformations over time.
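
The two subtasks can be made concrete with a minimal Metropolis Monte Carlo sketch. It is an illustration only, not the algorithm developed in this thesis: the energy function `toy_energy`, the angle-vector representation, and all parameter values are hypothetical stand-ins for a real coarse-grained representation and energy function.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical rugged energy over a vector of dihedral-like angles (radians)."""
    return sum(math.cos(3.0 * a) + 0.5 * (a - 1.0) ** 2 for a in angles)


def metropolis_search(energy, start, steps, temperature, step_size, rng):
    """Run a Metropolis walk and return the lowest-energy conformation seen."""
    current = list(start)
    current_e = energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        # Subtask 1 (iterate): propose a new conformation by perturbing one angle.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.uniform(-step_size, step_size)
        candidate_e = energy(candidate)
        # Subtask 2 (discriminate): always accept downhill moves, and accept
        # uphill moves with probability exp(-delta / temperature).
        delta = candidate_e - current_e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


rng = random.Random(0)  # fixed seed so the sketch is reproducible
start = [0.0] * 5
best, best_e = metropolis_search(toy_energy, start, 2000, 0.5, 0.3, rng)
print(f"start energy {toy_energy(start):.3f}, best energy found {best_e:.3f}")
```

The proposal step and the acceptance rule are deliberately separate in the loop, since the communities discussed next differ mainly in which of the two they refine.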

Different communities elect to make contributions to one or both subtasks with different sets of techniques. For instance, researchers in computational biology and biophysics with a direct interest in modeling protein structure address both subtasks by applying domain-specific knowledge obtained through a deep and detailed understanding of proteins and protein structure. For example, coarse-grained representations of protein structure are designed to lower the dimensionality of the protein conformational space and so simplify the task of computing new conformations. Detailed research into energy functions capable of interfacing with such representations balances efficiency with accuracy concerns in discriminating among computed conformations. Other techniques in this community, such as molecular fragment replacement, discretize the search space and bias sampling toward identifiable structural motifs. Most stochastic search algorithms in this community build largely on Monte Carlo, and many techniques are proposed to enhance exploration capability. These domain-specific approaches are discussed in greater detail in chapter 3.

The other two communities, robotics sampling-based motion planning and evolutionary computation, largely focus on the first subtask; that is, how to effectively iterate through conformations. Researchers in these communities are mainly users of coarse-grained representations and energy functions that the computational biology and biophysics community has shown to be effective and realistic. As the overview in section 2.2 summarizes, in many cases, incorporation of the latest knowledge is neglected in the interest of focusing on issues of search. Specifically, the focus in these two communities is how to enhance sampling capability in the context of some chosen representation and energy function.
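
The molecular fragment replacement move mentioned above can be sketched on an angular coarse-grained representation, in which a conformation is a list of backbone dihedral triples. This is a simplified illustration under stated assumptions: the fragment length of 3, the toy library built by `build_toy_library`, and the idealized angle values are all hypothetical, whereas a real library would hold configurations extracted from known structures.

```python
import random

FRAGMENT_LENGTH = 3  # assumed; the fragment length is a free choice in this sketch

# Illustrative backbone dihedral triples (phi, psi, omega) in degrees.
HELIX = (-60.0, -45.0, 180.0)
STRAND = (-120.0, 130.0, 180.0)
EXTENDED = (180.0, 180.0, 180.0)


def build_toy_library(chain_length):
    """Hypothetical library: each start position offers a helical and a strand fragment."""
    return {
        i: [[HELIX] * FRAGMENT_LENGTH, [STRAND] * FRAGMENT_LENGTH]
        for i in range(chain_length - FRAGMENT_LENGTH + 1)
    }


def replace_fragment(conformation, library, position, rng):
    """Return a copy of the conformation with one library fragment swapped in."""
    fragment = rng.choice(library[position])
    updated = list(conformation)
    # Overwrite the dihedral angles of the residues covered by the fragment.
    updated[position:position + len(fragment)] = fragment
    return updated


rng = random.Random(1)
start = [EXTENDED] * 10                  # fully extended 10-residue chain
library = build_toy_library(len(start))
position = rng.choice(sorted(library))   # pick a valid start position at random
moved = replace_fragment(start, library, position, rng)
print(position, moved[position:position + FRAGMENT_LENGTH])
```

Because every move copies a whole fragment of angles at once, the search space is effectively discretized to combinations of library entries, which is the bias toward recognizable structural motifs described in the text.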

23 1.1 Contribution: A novel Evolutionary-inspired Framework for Protein Structure Modeling The contribution of this thesis is in enhancing the exploration capability of stochastic search algorithms for modeling protein structure. This thesis focuses on improving algorithmic components of stochastic search while simultaneously incorporating the latest biophysical understanding of protein molecules. Specifically, this thesis addresses how to explicitly sample local minima of relevance for biological activity in the protein energy surface through hybrid or memetic stochastic search algorithms. Employing primarily an evolutionary computation-based approach, this thesis specifically addresses the following questions: How can one balance limited computational resources between exploration of the conformational space in global search with exploitation of local minima in local search? Chapter 4 initially addresses this question with a simple hybrid trajectory search for explicitly sampling local minima and a detailed investigation into the role of the depth of the local search. The trajectory-based search is then generalized into a population-based Hybrid Evolutionary Algorithm (HEA) which forms the basis of a more detailed investigation into various aspects of global search in chapters 5 and 6. How can one effectively sample new energetically-relevant conformations at the global level? Fragment replacement is a powerful tool for effectively iterating through a series of conformations; however, fragment replacement techniques are typically designed 5

with a Monte Carlo (MC) search framework in mind, where moves are immediately accepted or rejected based on a selected scoring function. Chapter 5 investigates two complementary approaches to sampling conformations at the global level. Section 5.1 investigates how to bias fragment replacement to improve its efficacy in global search, while a later section of the chapter investigates how best to combine features from multiple previously-sampled conformations to better explore the search space.

How can one employ energy to better discriminate between interesting conformations and noise in the conformational search space? The extent to which energy should be trusted to reach the native structure is currently under debate, particularly given that many enhanced sampling techniques have shown current coarse-grained energy functions to have significant inaccuracies. Chapter 6 proposes an alternative form of guidance toward the native state. A multi-objective approach is proposed to effectively guide the search by decomposing a given energy function into individual terms. This approach is shown to guide the search towards more biologically-relevant, low-energy conformations by focusing on the energy terms with the highest discriminatory power.

The overall goal of addressing each of these questions is to improve exploration of the protein conformational space and the underlying potential energy surface with the aim of better modeling protein structure. The approaches proposed to address the posed questions are implemented in a novel unifying software platform, which allows experimentation on elements of stochastic search while incorporating the latest domain-specific components for protein conformational search. An overview of this software platform is given in chapter 8. The ultimate goal is to allow other researchers to experiment with the power of different algorithmic and domain-specific components.

The experimental framework in which the stochastic search elements proposed in this thesis are evaluated is the specific protein modeling challenge of determining the biologically-active or native structural state of a protein from its amino acid sequence alone. This is known as the ab-initio protein structure prediction problem, and approaches to address it mainly focus on proteins where at least one native structure is known for the purposes of evaluation. It is worth emphasizing that the framework proposed in this thesis can be applied to study proteins with potentially different functional states, as the ultimate output of this framework is a discrete representation of the energy surface relevant for function in terms of an ensemble of conformations representing low-energy minima. However, for the purpose of a rigorous validation, we focus on the specific setting of ab-initio structure prediction due to the maturity of this field and the richness of both computational and experimental data in it.

The approach taken in the presentation of the work is to add algorithmic components, and thus algorithmic complexity, gradually in order to fully understand the contribution of each before reaching a more powerful all-encompassing algorithmic realization of the proposed framework. Once the contribution of these components is understood, the most effective ones are put together in two algorithmic realizations of the proposed framework, which are shown to outperform the currently-accepted standard in computational structural biology for ab-initio structure prediction. This is a particularly exciting result, and it is expected to draw the attention of the modeling community to methods with better exploration capability than deeply fine-tuned, but essentially multistart, MMC-based approaches.

An additional contribution of this thesis is its demonstration of the effectiveness of evolutionary computation-based approaches for stochastic search in the domain of protein modeling. Though not treated in this thesis, the approach proposed here for explicit sampling of local minima has additionally been incorporated into studies investigating transition pathways between protein functional states as well as assembly of multi-domain

protein complexes [25, 26]. A software platform is associated with the framework proposed in this thesis. All proposed techniques are modular pieces that can be combined to make it easy to explore different algorithmic realizations of the proposed framework. This not only facilitates reproducibility of the results presented in this thesis, but will also make it easier for other researchers, starting with students in the Shehu lab, either to continue this line of research and investigate further pertinent questions or to explore realizations more suitable for other applications revolving around protein structure and function.

While this thesis focuses on protein systems, the work and ideas described here are relevant in a larger context. Many other systems beyond proteins are complex heterogeneous systems with strongly coupled dofs or parameters. Such systems include organic and inorganic polymers, n-body systems in particle physics, high-dof robot articulated chains, and other more derived systems defined to characterize some problem with possibly a combination of explicit and implicit constraints. On these systems, interesting problems typically seek to characterize specific system states that satisfy constraints. This involves search, often stochastic. The ideas presented in this thesis on the combination of local and global search, the incorporation of domain-specific expertise to facilitate satisfaction of explicit constraints, the integration of scoring functions to guide the search to states satisfying implicit constraints, and the biasing of the search by possibly multiple conflicting criteria are expected to be useful in this general setting, which goes well beyond protein modeling.

CHAPTER 2: BACKGROUND AND RELATED WORK

2.1 Modeling the Protein Conformational Space and Energy Surface

Protein Geometry and Protein Conformational Space

A protein molecule is made up of a set of building blocks known as amino acids. Each amino acid consists of an α carbon atom (Cα) connected to a hydrogen atom (H), an amino group (NH2), a carboxylic group (COOH), and an R group which varies between different amino acids. In a protein, a series of amino acids come together to form a chain, bonding the carboxylic group of one amino acid to the amino group of the next amino acid in the chain. The repeating series of N, Cα, C, and O heavy atoms is typically referred to as the backbone chain of the protein. The R group, or side chain, of each amino acid dangles off this backbone chain. Figure 2.1a illustrates the atomic structure of a protein in three dimensions with a line tracing out the backbone atoms. It also illustrates the fact that, while side chains comprise a majority of the atoms in the protein, the overall three-dimensional structure of the protein is largely determined by the backbone atoms.

The three-dimensional protein structure illustrated in figure 2.1a corresponds to a particular spatial arrangement, or conformation, of the protein chain of amino acids. There are different methods for representing a protein conformation which capture varying degrees of structural detail and result in different computational complexities for the space of possible conformations populated by a protein chain. The most straightforward representation consists of the cartesian coordinates of each atom in the protein. In this cartesian representation, each conformation, C, consists of a vector of N sets of x, y, and z


Figure 2.1: Panels: (a) Backbone and side chains; (b) Local geometry and kinematics; (c) Amino acid with dihedral angles; (d) Internal coordinates; (e) Dihedral angle. (a) A structure corresponding to the native state of ubiquitin rendered by VMD [2]. The solid grey tube traces the atoms making up the backbone chain of the protein. (b) A section of atoms from amino acid position 47 to position 52 in the protein chain. Backbone atoms are drawn in dark gray and annotated with atom type, and side chains are drawn in white. Amino acids are separated by dashed lines. (c) The longest amino acid, arginine, is drawn showing all of its dihedral angles. (d) The internal coordinates are illustrated on four backbone atoms. (e) The dihedral angle is the angle between two planes, one formed by the first and second bonds and the other by the second and third bonds.

cartesian coordinates, where N is the number of atoms in the protein. The atomic coordinates maintained to represent C comprise the parameters or degrees of freedom (dofs). For this particular cartesian representation, the space of conformations is then of 3N dimensions.

The cartesian coordinate representation is desirable when considering energy functions, as most energy functions employ interatomic distances to estimate potential energy. However, modeling the position of every atom results in a very high-dimensional search space of 3N dimensions. As even very small proteins contain N > 100 atoms, this representation is infeasible for most search algorithms. As a result, Molecular Dynamics (MD) simulations, which favor the cartesian representation, model a protein with a high degree of accuracy but are limited in scope to exploring a very local portion of the conformational space.

Many approaches now employ reduced conformational representations to tackle the dimensionality issue. These representations range from all-atom, where every atom is explicitly modeled, to coarse-grained, where only a few representative atoms (typically backbone atoms) in each amino acid are explicitly modeled. In addition to reducing the granularity at which a protein is modeled, another approach to reducing the dimensionality of the conformational space is to decouple the representation employed for search from the representation used to compute energy. Borrowing terminology from robotics, we refer to the representations employed for search, that is, to compute conformations, as kinematic representations. The next section summarizes a few kinematic representations and the one that is used in the framework proposed in this thesis.

Protein Kinematics: Internal Coordinate Representations

Employing an internal coordinate representation as opposed to the cartesian coordinate representation is more appealing from a kinematics point of view. In the internal coordinate representation, the only dofs explicitly modeled are bond lengths, angles between

two consecutive bonds, and dihedral angles defined over three consecutive bonds. The representation is illustrated in Figure 2.1d (the dihedral angle is defined in Figure 2.1e). The internal coordinate representation not only reduces the size of the search space, but also allows making efficient moves in conformational space while satisfying geometric constraints observed over active protein structures. The representation implicitly encodes constraints that the cartesian representation does not. For instance, assigning random values to cartesian coordinates is bound to break bond lengths and result in atypical bond angles in protein chains. This is rather inefficient, particularly when the focus is on computing functionally-relevant conformations (where high energies due to bond breakage need to be avoided). Instead, the internal coordinate representation makes it easier to place bounds on the values sampled for its explicitly represented dofs in order to obtain typical values observed over known active protein structures.

Since energy functions typically operate on cartesian representations, the internal coordinate representation requires a technique for recovering the cartesian coordinates. Once a new set of internal coordinates has been sampled (thus, a conformation has been obtained), the cartesian coordinates for the atoms explicitly modeled can be recovered through a process known as forward kinematics [27-29]. A series of transformation matrices (defined by the change in internal coordinates) is applied down the protein chain to update the position of each atom and obtain cartesian coordinates over which an energy function can operate to associate a score with the sampled conformation.

Protein Kinematics: Dihedral Angle Representations

The internal coordinate representation can be further simplified to significantly reduce the dimensionality of the conformational space. Analysis of experimentally-determined protein structures shows that bond lengths and (non-dihedral) bond angles are constrained to characteristic values [30]. By assuming idealized values for these internal coordinates, a protein conformation can be represented as only a series of dihedral bond angles while still

retaining a high degree of resolution in the model. Each amino acid contains two backbone dihedral angles, denoted φ and ψ, as well as up to five side chain dihedral angles, denoted χ1 to χ5, illustrated on the arginine amino acid in Figure 2.1c. A third backbone dihedral angle, ω, connects consecutive amino acids.

The dihedral angle representation is computationally appealing to search algorithms, as the number of dofs needed to represent a protein chain is reduced by a factor of 7, on average, over a cartesian coordinate representation [31]. While a dihedral angle representation does sacrifice a degree of granularity, large variations over internal coordinate values are very improbable due to energetic constraints, unless sampled values are explicitly constrained. Furthermore, deviations from idealized values typically represent small fluctuations around an equilibrium value. Dihedral angle representations are therefore appropriate for exploring different equilibrium states in the conformational space, while more detailed MD simulations can be applied to capture the small fluctuations within a particular equilibrium state.

If the goal is to model only the backbone structure of a protein at a coarse-grained level of detail, then the dimensionality of the search space can be further reduced by representing only the φ, ψ, and ω backbone dihedral angles. While the amino acid side chains are important in determining the overall structure of the backbone chain, a particular backbone conformation can be represented through only these three angular/revolute dofs per amino acid. This results in a 3n-dimensional search space, where n is the number of amino acids in the protein. While this coarse-grained dihedral representation sacrifices significant atomic-level detail, it is often desirable when modeling larger portions of the conformational space where so-called all-atom representations are computationally infeasible. A coarse-grained representation must be coupled with a specially designed energy function which is able to discriminate among conformations without explicit modeling of side chain atoms. An overview of two state-of-the-art coarse-grained energy functions is given in chapter 3.
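The forward kinematics step and the dihedral angle of Figure 2.1e can be illustrated with a small sketch. It is not the implementation used in this thesis: it places each new atom directly from a bond length, a bond angle, and a dihedral angle rather than composing transformation matrices, and it assumes a single idealized bond length (1.5 Å) and bond angle (111°) for every bond, values chosen only for illustration. The function names `dihedral`, `place_atom`, and `build_chain` are hypothetical.

```python
import math

# Minimal 3-vector helpers on plain tuples.
def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def add(a, b): return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
def scale(a, s): return (a[0] * s, a[1] * s, a[2] * s)
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])
def unit(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) about the bond p1-p2."""
    axis = unit(sub(p2, p1))
    b0, b2 = sub(p0, p1), sub(p3, p2)
    # Project the outer bonds onto the plane perpendicular to the axis.
    v = sub(b0, scale(axis, dot(b0, axis)))
    w = sub(b2, scale(axis, dot(b2, axis)))
    return math.atan2(dot(cross(axis, v), w), dot(v, w))

def place_atom(a, b, c, length, angle, torsion):
    """Cartesian position of atom d from bond length |cd|, bond angle
    b-c-d, and dihedral a-b-c-d (all angles in radians)."""
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))   # normal of the a-b-c plane
    m = cross(n, bc)                 # completes the local frame
    local = (-length * math.cos(angle),
             length * math.sin(angle) * math.cos(torsion),
             length * math.sin(angle) * math.sin(torsion))
    d = add(c, scale(bc, local[0]))
    d = add(d, scale(m, local[1]))
    return add(d, scale(n, local[2]))

def build_chain(torsions, length=1.5, angle=math.radians(111.0)):
    """Build a chain with one atom per dihedral, after three seed atoms.
    The idealized length and angle are illustrative assumptions."""
    a = (0.0, 0.0, 0.0)
    b = (length, 0.0, 0.0)
    c = (length - length * math.cos(angle), length * math.sin(angle), 0.0)
    chain = [a, b, c]
    for t in torsions:
        chain.append(place_atom(chain[-3], chain[-2], chain[-1],
                                length, angle, t))
    return chain
```

Calling `build_chain` on a list of dihedral values and then measuring `dihedral` on every four consecutive atoms should recover the input angles, which is a convenient sanity check for any forward kinematics routine.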

The idealized model of protein geometry allows for analogies between protein chains and mechanical systems consisting of robotic kinematic chains with rotational joints. Rotation of a robotic joint produces a downstream effect on a mechanical system similar to the effect that rotation of a dihedral bond angle has on the atoms of a protein chain [27]. This analogy has been successfully exploited by researchers in the field of robotics to apply robotic motion planning algorithms to modeling the protein conformational space [32-43]. This inter-disciplinary research has led to powerful new robotics-inspired methods for exploring the protein conformational space, one of which, summarized in section 2.2.2, resulted from efforts of the author of this thesis.

Resolution of Protein Chain Deformations: Restricted Moves

Further reductions in the size of the conformational search space can be achieved by restricting the range of dihedral bond angles to experimentally-observed values. Analysis of experimentally-determined protein structures reveals that the dihedral bond angles in protein conformations which actually carry out biological functions do not span the entire range from -π to π. Rather, these angles can be probabilistically restricted to specific observed amino-acid dependent ranges. For backbone dihedral angles φ and ψ, these ranges are captured by the Ramachandran maps [44], whereas ranges for side-chain dihedral angles are typically much more restricted and discretized and can be obtained from compiled rotamer libraries [45].

A recent advance, known as molecular fragment replacement, further simplifies the protein conformational space by discretizing deformations of the protein chain. Dihedral angles for consecutive amino acids are bundled together into fragments, and new conformations are sampled by replacing the set of bond angles for the entire fragment at once. The key to successfully employing fragment replacement is constructing a library of available fragments. The use of fragment replacement is now a central component in ab-initio
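A fragment replacement move can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: a conformation is a list of (φ, ψ, ω) triples in degrees, one per amino acid, and the fragment library is a hypothetical dictionary mapping each start position to a list of candidate fragments. The toy library holds only two made-up candidates per position, with roughly helix-like and strand-like angles, and is not the fragment library used in this thesis.

```python
import random

# Illustrative (phi, psi, omega) triples in degrees; placeholder values.
HELIX = (-60.0, -45.0, 180.0)
STRAND = (-120.0, 130.0, 180.0)

def toy_library(n_residues, frag_len):
    """Hypothetical library: two candidate fragments per start position."""
    return {start: [[HELIX] * frag_len, [STRAND] * frag_len]
            for start in range(n_residues - frag_len + 1)}

def fragment_replacement(conformation, library, frag_len, rng):
    """Return a copy of the conformation in which the dihedral angles of
    frag_len consecutive residues are replaced at once by a library fragment."""
    start = rng.randrange(len(conformation) - frag_len + 1)
    fragment = rng.choice(library[start])
    new_conf = list(conformation)          # leave the input untouched
    new_conf[start:start + frag_len] = fragment
    return new_conf, start

# Usage: one move on a 10-residue chain with fragments of length 3.
rng = random.Random(1)
extended = [(-180.0, 180.0, 180.0)] * 10
library = toy_library(len(extended), 3)
moved, where = fragment_replacement(extended, library, 3, rng)
```

Because the whole window of angles is swapped in one step, the move space is discrete: the number of distinct moves is bounded by the number of positions times the number of fragments stored per position.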

protein structure prediction protocols [1, 46-55] and, more recently, has even been employed to coarsely model transition pathways between protein functional states and the diversity of structures within a functional state [56, 57]. Chapter 3 provides a more detailed overview of how fragment replacement is employed for structure prediction and how the state-of-the-art fragment library employed by this thesis is constructed.

Protein Energy: Mapping of the Protein Conformational Space

As biological systems, protein molecules are characterized by physics-based energetic interactions among the atoms that comprise them. The sum of these interactions in a given protein conformation gives the potential energy or the internal free energy of that conformation [58]. The thermodynamic hypothesis states that the physiological (native) state of a protein system is the state of lowest free energy [21]. Free energy combines potential energy with the entropy and temperature at which the system is observed. Entropy essentially measures the variability of the degrees of freedom that allow a protein chain to flex while maintaining a similar potential energy. That is, the native state is an ensemble of conformations that have similar potential energy.

Free energy is difficult to calculate, mainly due to entropy. Measuring entropy requires computing the range of values for each degree of freedom in conformations with the same or similar potential energy. Modeling entropy thus requires extensive free-energy sampling, making direct free-energy modeling impractical. For this reason, many methods forego entropy considerations and focus instead on probing the potential energy surface of a protein one conformation at a time.

A common working assumption in computational methods is that the protein native state (minimum free energy state) will correspond to conformations with lowest potential energies. This is largely in agreement with the classic view of the protein energy surface as a deep single-basin energy funnel. The pioneering work of Scheraga was one of the first to introduce the idea that structural characterization of the native state can be posed as

an optimization problem with the goal of finding the global minimum energy conformation (GMEC) [59]. While this classic view is shown to hold on many protein systems, we increasingly see a more nuanced view of the correlation between the GMEC and conformations actually corresponding to the native state. These more complex potential energy mappings require more advanced search algorithms to effectively model the conformational space.

Multi-basin energy surface: Many proteins contain multiple functional states corresponding to multiple basins in the potential energy surface. To accurately model these systems, a stochastic search algorithm must be able not only to sample many energy minima, but also to capture the energetic barriers which must be overcome for the protein to transition between the different functional states.

Flat-basin energy surface: The classic assumption is that the native state of a protein will be highly constrained (low in entropy) and thus correspond to a basin which is both narrow in structural diversity and deep in potential energy. A class of proteins, however, contains a weakly-constrained and so rather structurally-degenerate native state. These proteins contain flat energy surfaces, and effective search algorithms must be able to sample a wide range of low-energy conformations rather than converge to some arbitrary structural state.

Kinetic barriers: In some systems, the lowest energy state is inaccessible due to kinetic barriers in the folding process, and thus the actual functional state may have a much higher energy than the GMEC. Such systems present a challenge to search algorithms which are designed specifically to optimize low-energy structures. In these systems, the ability to capture multiple basins in the energy surface is key, though these algorithms need to be followed by techniques that incorporate kinetics, as in Molecular Dynamics, to go from an available view to an accessible view.

This more complex view of the potential energy surface is further complicated by inaccuracies inherent in the energy functions available to map conformations to the potential energy surface. Energy functions are a linear combination of multiple energy terms; however, biophysical theory tells us that some of these energy terms will be in conflict. Summing conflicting terms together will result in a rugged energy surface, rich in local minima, due to artifacts or noise in the energy function. The minima may not be true basins. The highly-constrained nature of the conformational space is another common source of noise, leading to a highly rugged energy surface, where a small change in structure can lead to a very large change in energy. This is further complicated by the fact that, when using kinematic representations, a small change in conformational space (parameter space) can result in a large structural change. Chapter 3 discusses the challenge of dealing with noise in a protein energy function in greater detail and proposes solutions based on multi-objective analysis in the field of evolutionary computation.

2.2 Related Work on Protein Structure Modeling

We now proceed to briefly summarize related work in the three different research communities from which this thesis draws ideas to propose a unique and novel framework for conformational search.

Computational Biology Approaches

Modern structure prediction strategies enhance the sampling ability of trajectory-based exploration methods with parallel execution, varying temperature, and exchanging the seed conformation from which new trajectories are launched. Some of the recent approaches

which have been successful are importance sampling, simulated annealing, umbrella sampling, genetic algorithms, replica exchange (also known as parallel tempering), local elevation, activation relaxation, local energy flattening, jump walking, multicanonical ensemble, conformational flooding, Markov state models, discrete time-step MD, and Fragment-based Assembly (FA) [60]. This section briefly describes the MD and MC approaches that form the basis of this field and outlines several recent approaches against which we benchmark results in chapter 7.

Molecular Dynamics (MD)

MD approaches attempt to simulate the atomic forces at work within a protein molecule by applying the principles of Newtonian physics [61]. An MD simulation calculates the forces exerted by each atom in a protein on every other atom. An MD trajectory simulates a specific period of time, calculating the interatomic forces at each time step and updating the position and momentum of each atom accordingly. MD has the advantage of modeling the actual folding pathway of a protein. However, as the number of amino acids in the target protein grows, the number of atomic interactions which must be computed at each time step increases quadratically. Early milestones in protein structure prediction were achieved using MD approaches [62]. However, given the computational complexity, MD is typically only applied to very small proteins or to carry out fine-grained refinements of existing conformations.

Monte Carlo (MC)

Monte Carlo (MC) based methods sample the conformational space by making a series of modifications or moves to a conformation. Each resulting conformation is evaluated with a potential energy function, and a determination is made as to whether to accept or reject the move based on this function. The goal is to drive an MC trajectory towards lower energy conformations which are, in theory, closer to the native structure. The decision of

whether or not to accept a move is typically made using the Metropolis criterion [63]; MC methods using the Metropolis criterion are referred to as MMC. MMC-based exploration typically has larger sampling capability than MD-based exploration because the moves in MMC can make large jumps in conformational space.

Enhanced Sampling Strategies

Many of the strategies to enhance the exploration capability of the classic MD or MC search framework have originated in the context of ab-initio structure prediction. The problem, sometimes also known as de-novo structure prediction, involves finding the active or native structure of a protein from its amino-acid sequence. The focus is mainly on proteins with a unique deep basin, circumventing the issue of possibly multiple energetically-similar structures co-existing under native conditions. However, even in this controlled setting, the current state of modeling does not guarantee that the native structure can be found, for two main intertwined reasons. First, the energy function guiding the search may be inaccurate. Second, the search may have low exploration capability.

As the above summary of protein energy lays out, current energy functions that are practical for estimating the potential energy of a protein conformation are inherently inaccurate. Their inaccuracies are higher when employing coarse-grained or reduced energy functions designed to interface with reduced representations that simplify the protein conformational space. Inaccuracies in protein energy functions are well studied and documented [64-68]. They are the primary reason why the global minimum of some selected energy function (what Scheraga referred to as the GMEC) often does not correspond to the known native structure. Deviations can be anywhere from 2 to 4 Å [64], even in simple proteins characterized in experiment to have a unique deep basin. It has also been demonstrated that current energy functions produce weakly-funneled energy surfaces, weakening the working assumption that the GMEC computed using a particular energy function is the structural representative of the native state.
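The MMC search described at the start of this section can be summarized in a short sketch. It is a generic illustration, not a protocol from this thesis: temperature is expressed in energy units (the Boltzmann constant is folded in), and `toy_energy` and `perturb` are hypothetical stand-ins for a real energy function and move set.

```python
import math
import random

def metropolis_accept(e_old, e_new, temperature, rng):
    """Metropolis criterion: always accept downhill moves; accept uphill
    moves with probability exp(-(e_new - e_old) / temperature)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / temperature)

def mmc_trajectory(start, energy, propose, temperature, n_steps, rng):
    """Run one MMC trajectory and return the lowest-energy conformation seen."""
    current, e_cur = start, energy(start)
    best, e_best = current, e_cur
    for _ in range(n_steps):
        candidate = propose(current, rng)
        e_cand = energy(candidate)
        if metropolis_accept(e_cur, e_cand, temperature, rng):
            current, e_cur = candidate, e_cand
            if e_cur < e_best:
                best, e_best = current, e_cur
    return best, e_best

def toy_energy(angles):
    # Hypothetical energy with its minimum when every angle equals -1.0 rad.
    return sum(1.0 - math.cos(a + 1.0) for a in angles)

def perturb(angles, rng):
    # Hypothetical move: nudge one randomly chosen angle.
    i = rng.randrange(len(angles))
    new = list(angles)
    new[i] += rng.uniform(-0.5, 0.5)
    return new

# Usage: a short trajectory on five angles.
rng = random.Random(0)
start = [2.0] * 5
best, e_best = mmc_trajectory(start, toy_energy, perturb, 0.5, 2000, rng)
```

Replacing `perturb` with a move that changes many angles at once, such as a fragment replacement, is what allows MMC to make the large jumps in conformational space mentioned above.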

There is a common template among ab-initio protein structure prediction protocols [1, 46, 47, 49, 56, 69] to enhance sampling capability. The protocol is split into two stages, each of which uses different representational detail and achieves different objectives. Stage one employs a coarse-grained (low-resolution) representation where only a subset of atoms, typically backbone atoms, are explicitly modeled. A cartesian representation is employed for energetic calculations, and predominantly a backbone dihedral angle representation is employed for a rapid exploration of the simplified conformational space through MMC-based search. Typically, the most successful protocols launch many MMC trajectories from different seed conformations and engage in detailed switching of energy functions and/or temperature in order to introduce bias gradually while maintaining some level of structural diversity. The objective is to sample a large number of structurally-diverse low-energy decoy conformations.

The decoys are then grouped by structural similarity to reveal local minima that are worth optimizing further through some finer-grained model and energy function in the second stage of the exploration. The objective in stage two becomes obtaining convergence of the exploration to the native basin. Specifically, the decoy conformations are clustered by geometrical similarity in order to highlight centroids that represent a broad range of low-energy local minima [50, 70]. Centroids of top-populated clusters are then passed on to the second stage of the protocol. The working assumption is that the native structure is more likely to have been captured in a top-populated cluster. This assumption is based on both theory and experiment. As Scheraga first laid out in his seminal work in [71], "A structure determined by energy minimization alone, which is stable only against small distortions, is very likely to be thermally unstable and hence cannot be admitted as a candidate for the native structure."

If the first stage finds enough near-native coarse-grained conformations in the top-populated clusters, the second stage has a higher chance of recovering the native structure
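The grouping of decoys by structural similarity can be sketched with a simple leader clustering pass. This is one possible clustering scheme chosen for brevity, not necessarily the one used by the cited protocols. The sketch also assumes that decoys are already superimposed, so that a plain coordinate RMSD without optimal alignment is meaningful; real protocols typically align structures first. The names `rmsd` and `leader_cluster` are hypothetical.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) coordinates. Assumes the structures are pre-aligned."""
    total = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(total / len(a))

def leader_cluster(decoys, cutoff):
    """Assign each decoy to the first cluster whose leader lies within the
    cutoff, or start a new cluster. Returns lists of decoy indices,
    most populated cluster first; the first index in each list is the leader."""
    clusters = []
    for i, decoy in enumerate(decoys):
        for members in clusters:
            if rmsd(decoy, decoys[members[0]]) <= cutoff:
                members.append(i)
                break
        else:
            clusters.append([i])
    clusters.sort(key=len, reverse=True)
    return clusters

# Usage on four toy two-atom "decoys": three similar, one outlier.
decoys = [
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)],
]
clusters = leader_cluster(decoys, cutoff=0.5)
```

Under the working assumption described above, the leaders (or centroids) of the first few clusters in the returned list would be the candidates handed to the second stage.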

with further energetic optimization at a finer representation that includes side chains (side-chain packing techniques are used to find energetically-optimal configurations for side chains on the backbone of a decoy conformation [72, 73]).

Running many independent MMC trajectories, which is often the framework employed for stage one in state-of-the-art protocols, has the advantage of being highly parallelizable; however, there is no guarantee that the independent trajectories will not all converge to the same region of the search space. This is a common issue known as revisitation. In order for the all-atom refinement to be successful, the coarse-grained search must sample a broad enough range of local minima so that the native structure may eventually be reached from one of them.

Brunette and Brock propose an iterative approach that uses the results of short all-atom refinements to regularly guide the coarse-grained search towards relevant minima [49]. The method allows the algorithm to dynamically re-apportion computational resources to more promising areas of the energy surface.

The Sosnick group also employs an iterative approach that focuses resources based on an increasingly refined view of the search space [1, 74]. Their algorithm, as in FA, employs a biased move set of dihedral bond angles consisting of a probability distribution corresponding to angle frequencies in the PDB. The algorithm performs an iterative set of coarse-grained MMC trajectories using these biased move sets. After each iteration, the probabilities for each move set are updated based on the population of moves in the search. In practice, this approach allows the algorithm to accurately predict the local secondary structural motifs of the target protein, and thus re-apportion computational resources to the regions of the conformational search space that correspond to the predicted secondary structure.

Both of these methods use iteration to more efficiently direct independent MMC trajectories. However, neither method directly addresses the fundamental issue of geometric or structural diversity among the conformations obtained by the coarse-grained search. Furthermore, neither approach explicitly samples local minima in the energy surface; both rely on a post-processing clustering analysis to approximate local minima.

Robotics Motion-Planning Algorithms

Motion-planning algorithms developed by the robotics community are designed to tackle high-dimensional search spaces such as the protein conformational space. These approaches make analogies between the dihedral bond angles in a protein chain and the joints in a robotic manipulator. Motion-planning algorithms, such as Probabilistic RoadMap (PRM), Expansive-Spaces Tree (EST), and Rapidly-exploring Random Tree (RRT), are most directly applicable to the protein folding problem, where the native structure is known and the goal is to find one or more folding pathways to reach the native structure [36, 75]. These approaches grow a graph or a tree across the conformational space. Each node represents a sampled conformation and each edge represents a local path between two neighboring conformations. The remainder of this section outlines a framework developed by our lab which is inspired by the tree-based approach of the EST algorithm and extends it for use in structure prediction methods where the native structure is not known a priori.

Tree-based Exploration

Our previously published FeLTr framework attempts to ensure a geometrically-diverse conformational sampling at the coarse-grained level by employing a geometric projection layer [50, 76]. The algorithm, which can be considered an adaptation of EST, grows a search tree in the conformational space by expanding selected conformations with short MMC trajectories, and maintains a representative ensemble of previously visited conformations

in memory. Selection from this ensemble is biased towards low-energy conformations and towards under-explored regions of the conformational space. FeLTr is thus able to dynamically redirect computational resources at the global level to ensure a degree of geometric diversity in its conformational sampling. This section briefly describes the key components of FeLTr. A detailed description is provided in recent publications [50, 76].

FeLTr explores the protein conformational space with a tree-based search shown in Figure 2.2. Algo. 1 provides pseudo-code for the framework. FeLTr executes on a target protein sequence $\alpha$ and produces an output ensemble $\Omega_\alpha$ of low-energy decoy conformations. The search tree is initialized with the extended conformation at the root (Algo. 1, lines 1-2). Each iteration of the search selects a vertex from the tree for expansion via a short MMC trajectory. The result of this MMC trajectory is then added to the search tree as a new vertex. FeLTr employs a two-level selection process to bias selection towards both low-energy and geometrically diverse conformations. This allows FeLTr to combine multiple MMC trajectories into a single search, which is a more effective allocation of computational resources.

Selection of a vertex for expansion is a two-step process, starting with selection of an energy level $l$ (Algo. 1, line 4). Each decoy conformation $C$ is projected onto a one-dimensional grid of energy levels with increments of 2 kcal/mol. Energy levels are given a weight $w(l) = E_{avg}(l) \cdot E_{avg}(l)$. A level $l$ is then selected at random with probability $w(l) / \sum_{l' \in Layer_E} w(l')$. The second step chooses a geometric cell within the selected $l$ (Algo. 1, line 5). Conformations are projected onto grid cells based on geometric shape using three selected coordinates from the Ultrafast Shape Recognition (USR) method [3, 77]. A second weighting function ranks each cell according to the formula $w(cell) = 1.0 / [(1.0 + nsel) \cdot nconfs]$.
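The two weighting functions can be sketched in a few lines of Python. This is a minimal illustration, not the FeLTr implementation: it assumes that $E_{avg}(l)$ is the mean energy of the conformations currently stored in level $l$, and the function and variable names are hypothetical. Here `nsel` is the number of times a cell has been selected and `nconfs` is the number of conformations that project into it.

```python
import random

ENERGY_STEP = 2.0  # kcal/mol per energy level, as stated in the text


def energy_level(energy):
    """Index of the 2 kcal/mol energy level that contains `energy`."""
    return int(energy // ENERGY_STEP)


def level_weights(levels):
    """w(l) = E_avg(l) * E_avg(l) for every level.

    Assumption: E_avg(l) is the mean energy of the conformations in level l.
    `levels` maps a level index to the list of energies stored in that level.
    """
    weights = {}
    for level, energies in levels.items():
        e_avg = sum(energies) / len(energies)
        weights[level] = e_avg * e_avg
    return weights


def cell_weight(nsel, nconfs):
    """w(cell) = 1.0 / [(1.0 + nsel) * nconfs]."""
    return 1.0 / ((1.0 + nsel) * nconfs)


def select_level(levels, rng=random):
    """Pick a level with probability w(l) / sum of w(l') over all levels."""
    weights = level_weights(levels)
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def select_cell(cells, rng=random):
    """Pick a cell; `cells` maps a cell id to its (nsel, nconfs) pair."""
    keys = list(cells)
    return rng.choices(keys, weights=[cell_weight(*cells[k]) for k in keys], k=1)[0]
```

Because the level weight is a square, it grows with the magnitude of the average energy; the sketch therefore presumes that the levels of interest have negative average energies, so that lower-energy levels receive larger weights.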
The variable nsel represents the number of times a cell has been previously selected, and nconfs represents the number of conformations discovered that project into that cell. Finally, a conformation $C$ is selected uniformly at random from the set of conformations which lie in

[Figure 2.2; diagram labels: Extended Conformation, Potential Energy, Geometric Projection Layer, Native State]

Figure 2.2: The FeLTr search tree is initialized with the extended conformation at the root. Each iteration of the search selects a vertex from the tree for expansion via a short MMC trajectory. The result of this MMC trajectory is then added to the search tree as a new vertex. FeLTr employs a two-level selection process to bias selection towards both low-energy and geometrically diverse conformations. In this example, first the energy level highlighted in green is selected. Then one of the three vertices within that energy level is selected for expansion based on the geometric projection layer. The geometric projection layer consists of a three-dimensional grid using the 1st, 4th, and 7th moments from the USR projection method [3] (note that only two dimensions are illustrated here).

both $l$ and the selected geometric grid cell (Algo. 1, line 6). This selection process allows FeLTr to bias sampling of the conformational space towards low-energy conformations in unexplored regions of the conformational space. The selected conformation $C$ is expanded via a short MMC trajectory, resulting in a new conformation $C_{new}$ (Algo. 1, line 7). The trajectory length is n 2 moves, where n is the number of amino acids in the target protein. Each MMC move consists of a random trimer fragment replacement, as described in chapter 3. The energy function used to evaluate each move is a modified version of the AMW function described in chapter 3. The resulting $C_{new}$ is then added as a new vertex in the search tree (Algo. 1, line 8) and to the output ensemble $\Omega_\alpha$ (Algo. 1, line 9).

Algo. 1: A high-level description of the FeLTr framework is given as pseudo-code.
Input: α, amino-acid sequence
Output: ensemble Ω_α of conformations
1: C_init ← extended coarse-grained conf from α
2: ADDCONF(C_init, Layer_E, Layer_Proj)
3: while TIME AND Ω_α do not exceed limits do
4:     l ← SELECTENERGYLEVEL(Layer_E)
5:     cell ← SELECTGEOMCELL(l.Layer_Proj.cells)
6:     C ← SELECTCONF(cell.confs)
7:     C_new ← EXPANDCONF(C)
8:     ADDCONF(C_new, Layer_E, Layer_Proj)
9:     Ω_α ← Ω_α ∪ {C_new}

Recent work shows that the FeLTr framework samples near-native conformations more effectively than MMC-based methods [50, 76, 78, 79]. Like other coarse-grained sampling methods, FeLTr does not explicitly sample local minima, but rather relies on clustering

analysis to filter its results down to a subset of conformations which will hopefully correspond to local minima. Cluster centroids, however, are only approximations of true local minima, and analysis shows that promising conformations are frequently discarded during clustering.

Evolutionary Approaches for Exploration

Protein structure prediction has been shown to be NP-hard [20]. When approached through the thermodynamic hypothesis, the problem is fundamentally a very challenging global optimization problem over a multivariable, multimodal function, thus making metaheuristic and evolutionary computation approaches applicable. Many studies have favorably described the use of evolutionary frameworks for navigating the highly rugged energy surface presented by the conformational search space [80-84]. Techniques adopted from the evolutionary computation community, however, have failed to compete with MMC-based approaches, as they often rely on simplistic representations and energy functions, fail to use widely accepted techniques such as fragment replacement, or demonstrate results only on toy protein systems. For instance, work in [85] evaluates the use of a canonical evolutionary framework using realistic physics-based energy functions. This work shows that an ab-initio evolutionary algorithm can effectively recreate the native structure for a single short protein. The protein modeled, however, is only 5 amino acids in length and thus does not represent a significant computational challenge.

Memetic Algorithms

Memetic algorithms combine a global search technique with short local optimizations. This approach allows an algorithm to explicitly probe local minima in a rugged energy surface by projecting each move at the global level to a nearby local minimum. Studies using lattice models are able to successfully recreate the native structure using a memetic

Genetic Algorithm (GA) where the offspring from crossover are refined with either gradient descent or MMC [82, 83, 86]. Lattice models, however, oversimplify protein structure, making them unsuitable for real-world applications. A subsequent study employs a memetic GA with the physically realistic CHARMM [87] energy function [84]. The authors show that the memetic GA consistently finds conformations of lower energy than either a standard GA or MMC search. More recent studies show similar effectiveness from incorporating memetic GAs to find conformations both low in energy and with low RMSD to experimentally determined native structures [88-91]. However, these studies are all limited to very short proteins, typically fewer than 30 amino acids in length. MMC-based approaches, which incorporate state-of-the-art enhanced sampling techniques such as coarse-grained representations and fragment replacement, are able to accurately predict native structures for proteins with 100 amino acids or more [1, 47, 49, 74]. While memetic GAs are effective at optimizing a given energy function, enhanced sampling strategies such as fragment replacement are necessary to scale structure prediction algorithms to larger proteins, and detailed investigations are required to determine how best to incorporate these enhanced sampling strategies into an evolutionary search framework.

Memetic algorithms are especially useful for highly constrained spaces like the protein energy surface. The protein conformational space contains many regions that are energetically infeasible, and many conformations allowed by a dihedral bond representation result in steric clashes and are thus physically unrealistic. Even a small change to a low-energy conformation can easily result in an infeasible structure with a very high energy.
A memetic approach deals with these infeasible regions by efficiently moving a conformation in a constrained region to a nearby unconstrained region of the search space.

Multi-Objective Algorithms

Classic evolutionary algorithms attempt to optimize a single objective function; however, many real-world problems, such as protein structure prediction, contain more than one value which must be optimized. Multi-objective algorithms are the class of algorithms which deal with search/optimization problems for which there is more than one objective or goal. Protein structure prediction algorithms typically try to minimize the value of a potential energy function as a single objective function. However, in practice, most physically realistic energy functions contain several energy terms which are summed together to produce a single energy function. Simply summing objective functions, however, can be problematic because different energy terms have different scales and differ in their importance to the overall structure of the protein. Most energy functions employ a set of weights to scale each energy term, and, indeed, this is a common approach in general to convert a problem with multiple objectives into a single-objective problem. However, the selection of these weights is a difficult problem in and of itself and can be a source of significant error.

The most common multi-objective decomposition splits a potential energy function into two separate objective functions based on local and non-local interactions. It has been shown that the use of this multi-objective decomposition reduces the complexity of the energy landscape in short polypeptides by reducing the number of local minima [92]. Multi-objective functions will tend to create a smoother fitness landscape because, in order to reach a local minimum, both energy terms must converge to a local minimum simultaneously, an event which is particularly unlikely when the energy terms are in conflict. Many studies have successfully applied Multi-Objective Evolutionary Algorithms (MOEA) to the protein structure prediction problem [93-97].
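The sensitivity to weights can be illustrated with a small sketch. The energy values below are invented for illustration, and the Pareto-dominance comparison is an assumption: it is the comparison commonly used by multi-objective evolutionary algorithms, but the text above does not name it.

```python
def weighted_sum(objectives, weights):
    """Collapse several energy terms into a single objective value."""
    return sum(w * e for w, e in zip(weights, objectives))


def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better
    in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (local energy, non-local energy) pairs for three conformations.
confs = {"A": (-10.0, -2.0), "B": (-4.0, -6.0), "C": (-3.0, -1.0)}

best_equal = min(confs, key=lambda k: weighted_sum(confs[k], (1.0, 1.0)))
best_skewed = min(confs, key=lambda k: weighted_sum(confs[k], (1.0, 3.0)))
front = non_dominated(list(confs.values()))
```

With equal weights the lowest weighted sum belongs to A, while tripling the non-local weight makes B the winner; treated as separate objectives, A and B are both non-dominated and only C is discarded, so no choice of weights is required.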
MOEAs are found to compare favorably to their single-objective counterparts when employing physically realistic models of the protein systems. These studies, however, fail to incorporate state-of-the-art enhanced sampling techniques, such as coarse-grained representations and FA, and are therefore,

like their GA counterparts, limited to applications on smaller protein systems. Research from the evolutionary computation community suggests that memetic and multi-objective approaches to the protein structure prediction problem hold promise. However, further work is needed to apply these initial studies to real-world structure prediction problems. This dissertation combines cutting-edge stochastic optimization strategies from the evolutionary computation community with established procedures for assembly of coarse-grained structures and analysis of results.

Basin Hopping

A simple evolutionary search algorithm that has been shown to be effective for protein structure modeling, when combined with domain-specific expertise, is Basin Hopping (BH). BH is a suitable framework to sample relevant local minima in the protein energy surface. It was originally introduced to obtain the Lennard-Jones minima of small atomic clusters [98]. BH was inspired by evolutionary computation approaches and can be considered a special case of a memetic algorithm. However, unlike traditional evolutionary algorithms, BH is trajectory-based and does not employ a population of solutions. Procedurally, the framework consists of repeated applications of a structural perturbation followed by an energy minimization. A Metropolis-like criterion is often employed to bias the sampling of local minima towards lower-energy ones. The result is a trajectory of consecutively sampled local minima in the energy surface. The appeal of the BH framework is that it transforms the energy surface into a collection of interpenetrating staircases. A succinct, discrete, coarse-grained representation of the energy surface is obtained in terms of local minima.

Recently, the BH framework has gained new attention for protein structure prediction. In [99], the perturbation changes Cartesian coordinates by values sampled uniformly at random over a small range.
The minimization is implemented through a gradient descent

of a selected coarse-grained energy function. The resulting BH algorithm succeeds in locating both lower-energy minima and conformations closer to the experimentally determined native structure than MD with Simulated Annealing (SA) on small proteins. However, the algorithm does not scale to proteins longer than 70 amino acids, and it fails to approach the native structure closer than 5 Å on smaller chains [99]. One of the contributions of this thesis is in enhancing the exploration power of BH to address these issues.
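The BH procedure described in this section (perturbation, minimization, Metropolis-like acceptance) can be sketched on a toy problem. The one-dimensional energy function, the step sizes, and the temperature below are hypothetical stand-ins chosen only to make the example run; they are not taken from [98] or [99].

```python
import math
import random


def energy(x):
    """Toy rugged 1-D energy surface (stand-in for a protein energy function)."""
    return 0.1 * x * x + math.sin(3.0 * x)


def gradient(x):
    """Analytic derivative of the toy energy function."""
    return 0.2 * x + 3.0 * math.cos(3.0 * x)


def minimize(x, step=0.01, iters=2000):
    """Plain gradient descent towards the nearest local minimum."""
    for _ in range(iters):
        x -= step * gradient(x)
    return x


def basin_hopping(x0, n_steps=200, jump=1.5, temperature=0.5, seed=0):
    """Return the trajectory of local minima visited by the search."""
    rng = random.Random(seed)
    current = minimize(x0)
    trajectory = [current]
    for _ in range(n_steps):
        # Structural perturbation followed by an energy minimization.
        candidate = minimize(current + rng.uniform(-jump, jump))
        delta = energy(candidate) - energy(current)
        # Metropolis-like criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        trajectory.append(current)
    return trajectory
```

Every entry of the returned trajectory is a local minimum of the toy surface, which is the sense in which BH yields a discrete, minima-only view of the energy surface.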

CHAPTER 3: INTEGRATION OF DOMAIN-SPECIFIC EXPERTISE AND EVALUATION OF PROPOSED FRAMEWORK

One of the contributions of this thesis is the integration of state-of-the-art domain-specific components in an evolutionary search framework. It is the opinion of the author of this thesis that if the complexity of protein systems is ignored by employing oversimplified representations and energy functions, then the true demands that these systems pose on stochastic search techniques will not be self-evident, missing an opportunity to develop more powerful search strategies to deal with their true complexity. While much of the work in the evolutionary computation community is conducted on overly simplified lattice models and rudimentary energy functions at the time of the writing of this thesis, the computational biology community has long moved to more complex and detailed representations and energy functions.

The domain-specific expertise that is detailed in this thesis and integrated in the framework proposed here consists of: (1) representation(s) employed for a protein conformation; (2) energy function(s) employed to interface with the representation; and (3) deformations employed to obtain new conformations. This thesis focuses on reduced representations, as its goal is to obtain an ensemble of conformations representing local minima in the protein energy surface relevant for function. The latest state-of-the-art reduced representations are put forth in the ab-initio structure prediction community, together with energy functions and deformations operating on these representations. For this purpose, this thesis incorporates domain-specific expertise adopted in this community rather than, for instance, all-atom representations or $C_\alpha$ representations often used to study in detail other processes such as protein folding or conformational switching.

The framework proposed in this thesis produces a broad view of the conformational

space relevant for function and can potentially reveal information on more than one structural state available to a protein for activity. However, the rigorous validation performed in this thesis compares the proposed framework with state-of-the-art protocols that focus on capturing only a single native structural state. The problem of protein native structure prediction represents a mature field with substantial experimental and computational data with which to perform validations. The metrics employed for these validations are outlined in this chapter.

This chapter is organized as follows. Section 3.1 lays out the coarse-grained representation of a protein conformation employed in this thesis. Section 3.2 presents two coarse-grained pseudo energy functions employed and compared in this thesis and presents a simple experiment to compare their effectiveness at sampling near-native conformations. Section 3.3 provides an overview of the molecular fragment replacement technique and briefly describes the construction of the fragment library employed for fragment-based moves or chain deformations in this work. Section 3.5 lays out implementation details common among the methods proposed in this thesis and the experimental setup employed to compare the effectiveness of different sampling methods in the context of protein structure prediction. A further section describes a protocol for creating initial seed conformations for stochastic search algorithms, adapted from the Rosetta structure prediction protocol and employed in this thesis.

3.1 Employed Coarse-grained Representation

This thesis employs the backbone dihedral angle representation summarized in chapter 2. Two slightly different backbone representations are employed and supported in the proposed framework, based on demands placed by two different state-of-the-art energy functions. In both representations, the only backbone atoms modeled for the purpose of energetic calculations are the heavy atoms N, $C_\alpha$, C, and O.
In the representation that interfaces

[Figure 3.1; panel labels: (a) Native Structure, (b) Amino Acids 11 to 13, with side chain and backbone marked for amino acids i, i+1, i+2. Angle values printed in panel (b), in degrees: -122, 129, -178, -123, 138, 176, -105, 152, -179]

Figure 3.1: (a) The native structure under id 1dtdB in the Protein Data Bank (PDB) [4] of deposited native structures is shown using the New Cartoon graphical representation (rendered by VMD [2]). (b) A short portion (amino acid positions 11-13) of this structure is shown in greater detail. The backbone atoms of these amino acids are drawn in dark blue, and the corresponding side-chain atoms are in light grey. The backbone dihedral angles are annotated for these three amino acids, and their values in the native structure are shown below in degrees. An angular coarse-grained representation would store only these angles for each conformation of the amino-acid chain. Forward kinematics would be used to obtain the Cartesian coordinates.

with the Associative Memory Hamiltonian with Water (AMW) energy function, the only side-chain atom modeled is the $C_\beta$ atom (with the exception of the glycine amino acid, which has no side-chain heavy atoms). In the other representation, which interfaces with the Rosetta energy function, a pseudo-atom is used to track the location of the side chain of an amino acid. The location of the pseudo-atom is interpreted as the centroid over the unmodeled side-chain atoms.

For the purposes of kinematics, the representation only includes the backbone dihedral angles. Again, two slightly different representations are employed based on the coarse-grained energy functions used. When interfacing with the Rosetta energy function, all φ, ψ, and ω angles are degrees of freedom (dofs). When interfacing with AMW, ω is fixed at an idealized value, and only the φ and ψ angles are used as dofs. The angles are illustrated in Figure 3.1(b).

3.2 Employed Coarse-grained Energy Functions

The software platform that accompanies the framework proposed in this thesis is designed to support arbitrary energy functions, either natively implemented or linked from an external library. The remainder of this section provides a brief overview of the two energy functions employed in the experiments conducted for this thesis: a native implementation of a modified AMW-based energy function, and the external Rosetta energy function.

Associative Memory Hamiltonian with Water (AMW)

What we refer to as AMW in this thesis is an in-house modification of the Associative Memory Hamiltonian with Water energy function proposed and employed for protein folding in [100]. AMW sums six terms: $Energy_{AMW} = E_{Lennard-Jones} + E_{H-Bond} + E_{contact} + E_{burial} + E_{water} + E_{Rg}$. $E_{Lennard-Jones}$ is implemented after the 12-6 Lennard-Jones potential in AMBER9 [101] but modified to allow a soft penetration of van der Waals spheres. The $E_{H-Bond}$ term

accounts for local and non-local hydrogen bond formation. The terms $E_{contact}$, $E_{burial}$, and $E_{water}$ allow for non-local contacts, a hydrophobic core, and water-mediated interactions, respectively. The $E_{Rg}$ term is our own addition and measures the difference between the radius of gyration (Rg) of a conformation and the Rg value predicted for its sequence, given its length [102]. The $E_{Rg}$ term rewards conformations which are more compact, since near-native conformations tend to be compact with a dense and hydrophobic core.

Rosetta Coarse-grained Energy Function

The Rosetta coarse-grained energy function is a weighted linear combination of 13 terms combining bio-physical, geometric, and statistical potentials. Table 3.1 lists all 13 terms. In practice, the Rosetta energy function is divided into 5 sets of energy term weights referred to as score0, score1, score2, score3, and score4. Each successive set of weights employs a larger subset of the 13 terms, where score4 includes all 13 terms. The score0 Rosetta energy function consists of only a soft steric repulsion and is used in the Rosetta ab-initio structure prediction protocol to sample random initial conformations while avoiding steric clashes between atoms. The application of score1 allows formation of secondary structure when employing large fragments of 9 amino acids to introduce deformations or moves and obtain new conformations. The score2 and score3 weights are employed for further coarse-grained exploration of the protein conformational space. The additional terms added in score4 are designed to help select near-native conformations from an ensemble of decoy conformations sampled using score3. Additional details on the Rosetta energy function and its uses can be found in [48].
The evolutionary search framework experiments conducted in this work primarily make use of the score3 weights for minimizing/optimizing new child conformations (referred to as local search in the proposed framework) and either score3 or score4 weights as the fitness function for selecting which members of the population survive to the next generation (referred to as global search in the proposed framework). score0 and score1

are used to initialize the population of conformations in the proposed framework.

Table 3.1: The weights for each of the 13 Rosetta energy terms are given for each of the 5 coarse-grained Rosetta energy functions.

Energy term | score0 | score1 | score2 | score3 | score4
[weight values are missing from this copy; the energy terms listed are: environment, pair, cbeta, Van der Waals forces, radius of gyration, cenpack, hs pair, ss pair, rsigma, beta sheet formation, short-range hydrogen bonding, long-range hydrogen bonding, Ramachandran score, chain break]

3.3 Molecular Fragment Replacement

Extensive research has shown that the use of bond angles found in nature significantly improves sampling of near-native conformations over rotating angles by values sampled uniformly at random [103]. Furthermore, the process of generating physically realistic conformations is made much simpler when sampling values for a fragment of k consecutive backbone dihedral angles together, rather than sampling for one angle at a time. This

is known as molecular fragment replacement. In fragment replacement, realistic configurations for fragments of consecutive amino acids are compiled over a database of known native protein structures, typically a non-redundant subset of the PDB [4]. A conformation can be modified to obtain a new one by essentially replacing a configuration in it for a fragment starting at some randomly sampled amino acid i in the protein chain. The configuration is replaced with one that is sampled at random over those available for that fragment in the fragment configuration library. The process is illustrated in Figure 3.2.

[Figure 3.2; diagram labels: Fragment-Based Assembly, Fragment Library, conformations Cx and Cx+1, steps Select position, Select new fragment, Replace fragment]

Figure 3.2: A library of fragment configurations extracted from the PDB is defined at the beginning of a search. When a position i (shown in red) in the conformation is to be modified, a corresponding fragment configuration is selected at random from the library. The dihedral angles of the selected fragment in the conformation are then replaced with those of the fragment configuration selected from the library, beginning at position i (shown in green).

Fragment replacement essentially discretizes the conformational space and reduces its dimensionality. It also allows directing sampling towards local structural motifs observed in nature. While the goal of a fragment configuration library is to bias search towards structures seen in the PDB, a sufficiently diverse library will allow the generation of novel structures. The use of fragment replacement as the move set in MMC has been shown

to greatly improve exploration capability in MMC. This is the reason MMC and fragment replacement form the basis of most modern ab-initio protein structure prediction protocols [47, 49, 69, 74]. Ab-initio structure prediction protocols use fragments varying anywhere from 3 to 19 amino acids in length. It is generally accepted that a minimal fragment length of 3 is necessary to make fine adjustments in order to reach a protein's native structure [56, 79]. However, longer fragments allow an algorithm to benefit from larger repeating motifs which are common in the PDB and make larger moves in conformational space. A common approach uses longer fragment lengths in an initial phase to quickly obtain conformations with well-formed secondary structures, followed by one or more phases which employ shorter fragment lengths for more detailed refinements of conformations. Work has begun to investigate the role of the fragment length in the efficiency and accuracy of the modeling process [54].

The evolutionary search framework allows the employment of different fragment lengths from an arbitrary fragment library, though the employment in this thesis is limited to two fragment lengths rather than a comparison of the role of fragment length for search. This is important, because the fragment library is a key component of a coarse-grained sampling algorithm. This is demonstrated by a preliminary study which shows a dramatic improvement in sampling of near-native structures when switching from a classic fragment library implementation to a state-of-the-art fragment library similar to the one employed in this dissertation [104]. The experiments in this dissertation employ the Rosetta fragment library, employing both fragments of length 9 and length 3.
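A single fragment replacement move can be sketched as follows. This is a simplified illustration under stated assumptions: a conformation is a list of (φ, ψ) pairs, and the library is a plain dictionary from a start position to candidate configurations. The function name and the data layout are hypothetical and do not reflect the Rosetta fragment file format.

```python
import random


def replace_fragment(conformation, library, rng=random):
    """Return a new conformation with one fragment configuration replaced.

    conformation: list of (phi, psi) tuples, one per amino acid.
    library: dict mapping a start position i to a list of candidate fragment
             configurations, each a list of k (phi, psi) tuples that fits
             inside the chain when placed at position i.
    """
    i = rng.choice(sorted(library))      # randomly sampled start position
    fragment = rng.choice(library[i])    # random configuration for position i
    new_conf = list(conformation)        # copy, so the input is left untouched
    new_conf[i:i + len(fragment)] = fragment
    return new_conf
```

Returning a copy rather than editing in place lets a caller such as an MMC loop evaluate the energy of the proposed conformation and discard it if the move is rejected.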
A brief description of how Rosetta fragment libraries are constructed is given below.

Rosetta Fragment Library

Considering configurations of similar sequences but identical secondary structures has become very popular in ab-initio structure prediction methods that employ fragment-based

assembly [47]. Local features predicted from the given sequence $\alpha$ are used to design a structurally diverse, high-quality library of configurations. The candidate trimer configurations in the library are dependent on the sequence $\alpha$. We refer to a specific library instance designed from a given $\alpha$ as $L_\alpha$. The construction of $L_\alpha$ biases towards fragment configurations that share features with those predicted from $\alpha$. Essentially, $L_\alpha$, whose construction is detailed below, allows selecting configurations that have similar (not necessarily identical) sequences to a fragment configuration selected for replacement. To maintain a concise set of configurations for each position in the sequence, the configurations are limited to those that share secondary structure annotations with the annotation predicted on $\alpha$.

$L_\alpha$ is constructed as follows. A multiple sequence alignment (MSA) lists proteins that have similar sequences to the given $\alpha$ sequence. PSI-BLAST [105] is then employed to analyze the MSA and yield for each position i in $\alpha$ a list of amino acids that can replace the amino acid at position i. The resulting position-specific profile for $\alpha$ reveals what alternative fragment sequences can be considered as similar to a fragment from position i to i + 2 in $\alpha$. The configurations of these fragments, extracted from a non-redundant database of protein structures, can be added as candidate configurations to those extracted for the fragment sequence from i to i + 2. A filtering step improves the quality of the retained fragment configurations. Only fragment configurations with the same secondary structure (as present in the known protein structures from which the fragment configurations are extracted) as that predicted for $\alpha$ with PSI-PRED [106] are added as candidate configurations for a fragment in the library.
3.4 Semi-metrics for Comparing Conformations

The ability to compare two different conformations of the same protein sequence is key to many of the metrics employed in this thesis to evaluate sampled conformations. Specifically, in the context of ab-initio structure prediction, conformations need to be compared to

the known native structure to determine whether any of them are close enough to be considered to have captured or reproduced the native structure. Many semi-metrics have been proposed for this purpose [ ]. Design of effective semi-metrics is an open area in computational structural biology because comparison of instances in a high-dimensional space is very challenging. All proposed semi-metrics, whether similarity measures or distance functions, suffer from various shortcomings. This thesis primarily employs least Root Mean Square Deviation (RMSD) to compare conformations generated during a search to an experimentally determined native structure downloaded from the Protein Data Bank (PDB) [4].

Least Root Mean Square Deviation (RMSD)

The RMSD between two conformations measures the root-mean-square distance in Å between corresponding atoms after the conformations have been aligned or superimposed. The two conformations are aligned by center of mass (removing trivial differences due to translation), and a rotation matrix is then applied to minimize the Euclidean distance between corresponding atoms. The root mean square is then calculated for the distance between corresponding atoms of the aligned structures. When employing coarse-grained models, RMSD is typically computed over only the $C_\alpha$ atoms, referred to as $C_\alpha$-RMSD. RMSD can also be calculated over all of the heavy backbone atoms, N, $C_\alpha$, C, and O (referred to as bb-RMSD); however, in practice there is typically little difference between $C_\alpha$-RMSD and bb-RMSD.

While not a Euclidean metric, RMSD is a dissimilarity metric. A lower RMSD value between a conformation and the sought native structure means that the conformation is closer to the native structure. A higher RMSD, however, does not necessarily mean significant structural differences, as RMSD is not able to recognize when structural differences are limited to a particular segment in the protein chain. Additionally, RMSD grows with chain size.
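Written out, the quantity described in this subsection takes the following form. The symbols are introduced here only for illustration: $x_i$ and $y_i$ denote the positions of the $N$ corresponding atoms in the two conformations after both have been translated so that their centers coincide, and $R$ ranges over rotation matrices.

```latex
\mathrm{RMSD}(X, Y) \;=\; \min_{R \in SO(3)} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert R\,x_i - y_i \right\rVert^{2}}
```

For $C_\alpha$-RMSD the sum runs over the $C_\alpha$ atoms only, and for bb-RMSD over the N, $C_\alpha$, C, and O atoms.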
When measuring the distance between two compact protein chains, which is

how RMSD is employed in this thesis, an upper bound of 8 Å on the accuracy of RMSD is typical. Therefore, in this thesis all RMSD values above 8 Å are considered comparable. Similarly, for RMSD values below 8 Å, the difference between RMSD values must be at least 0.5 Å to be considered significant. Lower-RMSD conformations are more likely to be near-native than higher-RMSD ones. When using coarse-grained representations and energy functions, a conformation with an RMSD of 4-5 Å from the native structure is considered to have captured that structure in sufficient detail.

It is important to note that, in ab-initio structure prediction, it is assumed that the native structure is unknown. Therefore, the RMSD to a native structure is employed only at the end of a search as a performance measure to compare different search algorithms on proteins for which experimental data is available. When employing RMSD for this purpose, this thesis will use the short-hand "RMSD to native" to refer to the RMSD between the structure of a computed conformation and an experimentally determined native structure downloaded from the PDB.

Global Distance Test (GDT)

A recently proposed method for comparing conformations, GDT, overcomes some limitations in RMSD [109]. To perform a GDT, two conformations are aligned as in the RMSD calculation. However, rather than calculating the total RMS distance, the GDT measures the number of $C_\alpha$ atoms which are within a threshold distance of each other. Typically, the GDT is computed for several threshold values, and the average result from each threshold is reported as a percentage of $C_\alpha$ atoms under the threshold. The most common set of thresholds, which is known as GDT_TS, is: 1 Å, 2 Å, 4 Å, and 8 Å. Calculating the optimal alignment for GDT is a much more computationally difficult problem than for RMSD [111].
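The scoring step of GDT_TS, for a single superposition that has already been chosen, can be sketched as below. The function name is hypothetical, and treating a distance exactly equal to the threshold as "within" it is an assumption. Because the search over superpositions is omitted, the value returned for any one alignment is only a lower bound on the true GDT_TS.

```python
import math

GDT_TS_THRESHOLDS = (1.0, 2.0, 4.0, 8.0)  # in Å, as listed in the text


def gdt_for_alignment(coords_a, coords_b, thresholds=GDT_TS_THRESHOLDS):
    """Average percentage of C-alpha pairs within each threshold.

    coords_a, coords_b: equal-length lists of (x, y, z) C-alpha positions
    that are ALREADY superimposed; no alignment is performed here.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(a, b) for a, b in zip(coords_a, coords_b)]
    # Fraction of atom pairs within each threshold distance.
    fractions = [
        sum(d <= t for d in distances) / len(distances) for t in thresholds
    ]
    return 100.0 * sum(fractions) / len(fractions)
```

Unlike RMSD, a few badly placed atoms cannot dominate this score: each atom pair contributes at most once per threshold, no matter how far apart the two atoms are.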
An exhaustive approach requires aligning all possible subsets of Cα atoms between both structures in order to determine which alignment places the most Cα atoms within the given threshold. An approximation of the true optimal alignment is implemented in this thesis as described in [109]. In the approximation approach, the two structures are initially aligned based on a short sequence of Cα atoms. The algorithm then calculates which Cα atoms fall within the given threshold for the given alignment and re-aligns the two structures based on this new subset of atoms. This process is repeated until the set of atoms employed in the alignment does not change (or a pre-defined limit is met). Note that this approximation will never overestimate the GDT score, but it makes no guarantee about reaching the maximum possible GDT score. The approximate version of this algorithm is widely used and is implemented in the official program for computing GDT described in [109].

3.5 Implementation Details

All experiments in this thesis are run for a fixed budget of 10,000,000 energy function evaluations. Since over 90% of CPU time is spent on such evaluations, the limit ensures a fair comparison between different parameter selections on a diverse set of proteins. Additionally, the implementation of the Rosetta energy function is significantly more efficient than that of the AMW energy function. While the speedup is at least a factor of 2, this improvement is not inherent to Rosetta's energy calculations, but is rather due to more efficient implementation and caching. Since the goal of this thesis is to evaluate algorithmic components of stochastic search, employing a fixed budget of energy evaluations allows for a fair comparison across a range of energy function implementations. Computing 10,000,000 energy function evaluations takes about 1 to 4 days of CPU time for AMW and 12 to 24 hours for Rosetta on a 2.4GHz Core i7 processor, depending on protein length. In this thesis, each experiment is typically repeated 5 independent times to account for stochastic variation between runs. In Chapter 5, a more rigorous evaluation is done by repeating experiments 30 times and testing for statistical significance.

3.5.1 Starting Points for Conformational Search

The algorithms in this thesis make use of an initialization function, adapted from the Rosetta protocol, that selects random starting conformations with good secondary structure. Starting from a fully-extended conformation, the procedure performs two Monte Carlo searches in series with the following parameters. The first search samples fragments of length 9 from the fragment library, rejecting only fragments which cause steric clashes, until all dihedral angles from the extended conformation have been replaced. This effectively samples a random initial conformation. The second search is a low-temperature MMC search employing the Rosetta score1 energy function, and it is run with fragments of length 9 until k fragments in a row have been rejected. Fragments of length 9 are more likely to preserve complex secondary structural elements (especially for β-sheets) than the higher-resolution fragments of length 3 employed in the evolutionary search algorithms. The fragments available in the fragment library are chosen based on predicted secondary structure; therefore, even a short MMC search employing fragments of length 9 can produce structures with good secondary structural elements, even if the overall structure of the protein is not accurately modeled. Here k is set to the length of the target protein system, which is the same stopping criterion employed in the local search described in chapter 4.

3.5.2 Analysis and Performance Measurements

The output of each experiment is an ensemble of conformations sampled during the search. The coarse-grained energy, RMSD to native, and GDT_TS to native are computed for each conformation so that analysis can be performed on the entire ensemble with respect to these metrics. This section motivates the performance measures employed in chapters 4 to 7 for comparing the effectiveness of stochastic sampling.

3.5.3 Target Systems of Study

Experiments in this thesis are conducted on a diverse set of 20 target protein systems which range in length from 53 to 146 amino acids and represent a diverse set of α, β, and α/β native fold topologies. This list of proteins is drawn from targets investigated in three separate studies which employ a range of energy functions and sampling algorithms [1, 49, 53]. Table 3.2 gives the native PDB id and length of each protein, as well as the percentage of amino acids which form α-helices and β-sheets in the native state of each protein sequence. In this thesis, only a single chain is considered for multi-domain proteins. In these cases, the chain letter is indicated after the PDB id.

Sampling Low-energy Conformations

When posed as an optimization problem, the goal of ab-initio protein structure prediction is to locate the GMEC in the underlying energy surface of the protein conformational space. Errors inherent in coarse-grained energy functions typically distort the true potential energy surface, creating an alternate coarse-grained energy surface. The multi-modality and ruggedness inherent in coarse-grained energy functions can make discovery of the GMEC difficult; however, analysis, including that in this thesis, suggests that, for the majority of the target proteins when employing the Rosetta score4 energy function, the GMEC is in the vicinity of the experimentally-determined native structure. For this reason, the ability to sample low-energy conformations is an important attribute of stochastic search algorithms. Comparing the lowest energy level reached by each experiment provides a concise metric for comparing the effectiveness of different sampling strategies. Unlike the lowest-RMSD conformation, the lowest-energy conformation is less likely to be an outlier, and thus a focus on the single lowest energy is appropriate. Additionally, when comparing sampling algorithms on difficult protein systems in which near-native conformations are not sampled, energy provides a means of comparing two algorithms when RMSD may not. If the lowest RMSD to native structure reached by either approach is over 8Å RMSD

Table 3.2: The native PDB id, length, and fold are given for each of the 20 target protein systems used to conduct experiments in chapters 4-7. Columns 5 and 6 give the percentage of amino acids which form α-helices and β-sheets, respectively, for each target.

      PDB id   Length   Fold   % α Helix   % β Sheet
  1   1bq9     53       αβ
  2   1dtdB    61       αβ
  3   1isuA    62       αβ
  4   1c8cA    64       αβ
  5   1sap     66       αβ
  6   1hz6A    67       αβ
  7   1wapA    68       β
  8   1fwp     69       αβ
  9   1ail     70       α
 10   dtjA     76       β
 11   1aoy     78       αβ
 12   ci2      83       αβ
 13   1cc5     83       α
 14   tig      88       αβ
 15   2ezk     93       α
 16   1hhp     99       β
 17   3gwl     106      α
 18   2hg6     106      αβ
 19   2h5nD    123      α
 20   aly      146      β

to native, then it is difficult to say which method is more effective at modeling the native state; however, the method that finds lower energies can be said to have had higher exploration capability.

Sampling Near-native Conformations

The goal of a coarse-grained structure prediction algorithm is to sample a diverse set of low-energy conformations in the vicinity of the protein native state. Examining the single lowest-RMSD structure provides a concise measurement of how well the native state is captured; however, in practice, the lowest-RMSD conformation is often an outlier and is difficult to identify with the clustering methods which are employed to select a subset of coarse-grained decoy conformations. Recall that the role of a coarse-grained structure prediction protocol is to sample a diverse set of decoy conformations, rather than to find a single native structure. For this reason, measuring the number of near-native conformations sampled during a search can be a more effective measure of breadth of sampling. Here we define as near-native any conformation within a threshold of 5Å RMSD from the experimentally-determined native structure. A threshold of 5Å is considered the upper limit at which the second-stage, higher-resolution refinements described in section are able to effectively recover the native state. That is, structures discovered in the coarse-grained stage with an RMSD greater than 5Å are likely too dissimilar to the native structure to be useful in the high-resolution second stage.
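The performance measures motivated in this section (lowest energy, lowest RMSD, and number of near-native conformations) can be computed from an ensemble as sketched below. This is an illustrative sketch only: the function name `summarize_ensemble` and the representation of an ensemble as (energy, RMSD-to-native) pairs are assumptions made here, and "within 5Å" is taken to include conformations at exactly 5Å.

```python
from typing import Dict, Sequence, Tuple

NEAR_NATIVE_CUTOFF = 5.0  # angstroms, the near-native threshold given in the text


def summarize_ensemble(ensemble: Sequence[Tuple[float, float]],
                       cutoff: float = NEAR_NATIVE_CUTOFF) -> Dict[str, float]:
    """Summarize (energy, rmsd_to_native) pairs from one search run."""
    if not ensemble:
        raise ValueError("ensemble must not be empty")
    energies = [energy for energy, _ in ensemble]
    rmsds = [rmsd for _, rmsd in ensemble]
    return {
        "lowest_energy": min(energies),
        "lowest_rmsd": min(rmsds),
        # Breadth of sampling: how many conformations are near-native.
        "near_native_count": sum(1 for rmsd in rmsds if rmsd <= cutoff),
    }


# Made-up values purely for illustration (energy, RMSD to native in angstroms).
toy_ensemble = [(-10.0, 6.2), (-25.5, 4.8), (-3.0, 9.1), (-18.0, 5.0)]
print(summarize_ensemble(toy_ensemble))
```

Reporting the count alongside the two minima reflects the point made above: a single lowest-RMSD conformation may be an outlier, whereas the count describes how much of the ensemble lies near the native state.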

CHAPTER 4: HYBRID GLOBAL-LOCAL SEARCH: EXPLICIT SAMPLING OF LOCAL MINIMA

A fundamental challenge in stochastic search and optimization is how to balance limited computational resources between exploration of a complex space through global search and exploitation of local minima in the space through local search. In the particular context of protein structure modeling, the common approach is essentially disjointed; exploitation is achieved through intensive localized MMC search, while global search is achieved through a multi-start or random-restart approach. This categorization is somewhat simplistic, as some algorithmic realizations of this approach in ab-initio structure prediction tune the temperature parameter in the Metropolis criterion to add the ability to jump to possibly far-away regions in conformational space. Exploitation in these methods is achieved by increasing the length of the MMC search, while exploration is achieved by increasing the number of independent MMC searches and/or changing the temperatures used at various points during the search. This disjointed approach has been highly successful for small protein systems (under 100 amino acids); however, even as advances in fragment libraries and energy functions improve accuracy on these small systems, accurately modeling larger systems eludes current capabilities. The exponential size of the protein conformational space (with respect to the number of amino acids) makes scaling of multi-start approaches difficult.

This chapter addresses the challenge of balancing exploration and exploitation with a hybrid global-local evolutionary search framework. The framework introduced in this chapter explicitly samples local minima in the protein energy surface through a synergy of global and local search techniques. The goal is to focus sampling of the conformational space on low-energy local minima, essentially obtaining a discrete representation of the protein conformational space relevant for function through a set of conformations that map to low-energy local minima in the underlying energy surface. This subspace of local minima, while dramatically reduced in size compared to the entire protein conformational space, is still too large to capture through uniform random sampling. Moreover, not all local minima are interesting. Therefore, a local search technique for sampling local minima must be combined with a powerful global search technique to effectively explore both the breadth and depth of the search space.

This chapter is divided into three sections. Section 4.1 first proposes a proof-of-concept combination of global and local search through a trajectory-based Basin Hopping (BH) algorithm. The algorithm is referred to as Protein Local Optima Walk (PLOW). PLOW is a hybrid MMC algorithm which explicitly samples local minima by mapping conformations sampled by an MMC move to their corresponding local minima through a short greedy local search. Comparing PLOW to an MMC-based algorithm which does not explicitly sample local minima shows that there is a distinct advantage to the explicit local minima sampling in PLOW. Section 4.2 then investigates the effect of varying the depth of the local search component employed to map conformations sampled at the global level to local minima. The greedy local search introduced in PLOW is generalized as a low-temperature MMC search, and experiments compare the effectiveness of employing different temperatures. Section 4.3 generalizes PLOW into a novel population-based hybrid evolutionary algorithm (HEA), which forms the basis of a more detailed investigation into various aspects of global search in chapters 5 and 6 of this thesis.

4.1 Effective Sampling of Local Minima: Protein Local Optima Walk (PLOW)

PLOW employs a BH/Iterated Local Search (ILS) approach of iteratively hopping between local minima in a trajectory-based exploration of the subspace of local minima that is progressively steered towards lower-energy local minima. In BH, an energy surface is explored by making a series of hops between local minima. Each hop consists of a perturbation step to jump out of the current local minimum, followed by a minimization step to optimize the perturbed conformation down to a neighboring local minimum. The classic BH framework, however, is typically applied only to very small molecules or to local explorations of the energy surface [59, 98, 112, 113]. PLOW extends BH to the ab-initio structure prediction problem by essentially increasing the reach of the search trajectory. In PLOW, perturbation does not necessarily jump to a neighboring local minimum, but merely uses the previous local minimum as a guide to its next position (as a mutation operator might do in an evolutionary computation algorithm). Similarly, the goal of minimization is not to fully optimize the perturbed conformation to the nearest local minimum. Rather, minimization is performed by short greedy local searches employing molecular fragment replacement. This allows PLOW to rapidly sample a nearby local minimum, even if it is not the nearest local minimum. In practice, this means that PLOW makes much larger hops than a traditional BH approach, but the hops are still guided towards increasingly lower-energy regions of the conformational space.

Figure 4.1(c) illustrates the essential process in PLOW on a simple two-dimensional energy surface. In the illustration, PLOW begins at a fixed point, $C_0$ (shown as the empty blue circle on the far left), which is mapped to a local minimum, $C_1$, by the greedy search (shown as a series of short purple arrows). PLOW then escapes its current local minimum, $C_1$, through a perturbation move (shown as a long orange arrow). The resulting conformation, $C_{1(perturb)}$, is again mapped onto a nearby local minimum, $C_2$, through the greedy search. Now a decision must be made whether to accept $C_2$ as the new state of the search trajectory. In Figure 4.1(c), both $C_2$ and $C_3$ are accepted; C* is rejected, however, because it has a much higher energy and fails the Metropolis criterion. In this case, PLOW remains at $C_3$ and performs a second perturbation followed by greedy search to reach $C_4$. The essential components of PLOW (minimization, perturbation, and acceptance criterion) are now described in detail in sections 4.1.1, 4.1.2, and 4.1.3, respectively.

4.1.1 Fragment-based Minimization for Greedy Local Search

A greedy search maps a conformation onto a nearby local minimum in the energy surface through a series of small modifications. A modification consists of replacing the configuration (9 backbone dihedral angles) of a fragment of three consecutive amino acids (trimer) in the current conformation with a configuration sampled from a configuration library. This is known as molecular fragment replacement, and a description can be found in chapter 3. A modification that does not result in a lower-energy conformation is discarded, and another modification is performed. The greedy search stops when k consecutive modifications fail to result in a lower energy, indicating the presence of a local minimum. The value of k is set to the length of the target protein (number of amino acids). This process is illustrated in Figure 4.1(a).

The greedy search encapsulates the working definition of a local minimum. In essence, the value of k defines how deeply a local minimum is probed. Exhaustively testing for the presence of a local minimum is prohibitive, as it requires thousands of energy evaluations. Our approximation of a local minimum here does not waste resources by unnecessarily probing down to the true local minimum. Moreover, this approach is sufficient when empirical, coarse-grained energy functions are employed to probe an effective, rather than the true, energy surface. Table 4.1 shows, for instance, that the native structure of a protein is often found somewhere above the basin of the energy surface that can be probed with a coarse-grained energy function.
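The stopping rule of the greedy search (accept only strict improvements, stop after k consecutive failures with k equal to the chain length) can be sketched as follows. This is a toy illustration under stated assumptions, not the thesis implementation: a conformation is reduced to one integer "angle" per residue instead of three dihedral angles, and `TOY_TRIMER_LIBRARY`, `toy_energy`, and `replace_trimer` are hypothetical stand-ins for the fragment library, the coarse-grained energy function, and molecular fragment replacement.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical stand-in for a trimer configuration library (one value per residue).
TOY_TRIMER_LIBRARY = (
    (-60, -60, -60),
    (-120, 120, -120),
    (-60, -45, -60),
    (60, 60, 60),
    (-90, 150, -90),
)


def toy_energy(conf: Sequence[int]) -> float:
    """Made-up energy that favors -60 at every position (illustrative only)."""
    return float(sum((angle + 60) ** 2 for angle in conf))


def replace_trimer(conf: Sequence[int], rng: random.Random) -> List[int]:
    """Replace three consecutive positions with a configuration from the library."""
    start = rng.randrange(len(conf) - 2)
    new_conf = list(conf)
    new_conf[start:start + 3] = rng.choice(TOY_TRIMER_LIBRARY)
    return new_conf


def greedy_local_search(conf: Sequence[int],
                        energy: Callable[[Sequence[int]], float],
                        propose: Callable[[Sequence[int], random.Random], List[int]],
                        rng: random.Random) -> List[int]:
    """Keep strictly improving moves; stop after k consecutive failures."""
    k = len(conf)  # k is set to the chain length, as in the text
    current = list(conf)
    current_energy = energy(current)
    failures = 0
    while failures < k:
        candidate = propose(current, rng)
        candidate_energy = energy(candidate)
        if candidate_energy < current_energy:
            current, current_energy = candidate, candidate_energy
            failures = 0  # an improvement resets the failure counter
        else:
            failures += 1  # the modification is discarded
    return current


start = [180] * 12
minimum = greedy_local_search(start, toy_energy, replace_trimer, random.Random(0))
print(toy_energy(start), toy_energy(minimum))
```

Because the failure counter resets on every improvement, the length of the search is not fixed in advance; it adapts to how much room for improvement remains, which is the property later contrasted with the fixed-length MMC trajectory in FeLTr.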

Figure 4.1: The figure illustrates (a) greedy local search, (b) naive sampling, and (c) PLOW on a simplified energy surface. (a) A sampled conformation $C_{i(sampled)}$ (empty blue circle) is mapped to the nearest local minimum $C_{i(minimum)}$ (solid purple circle) by a greedy local search (series of short purple arrows). (b) 5 points sampled at random (empty blue circles) by the naive sampling approach are each mapped to a nearby local minimum (solid purple circles) by a greedy local search (series of short purple arrows). (c) PLOW begins at $C_0$ (leftmost empty blue circle). Through a series of perturbations (long orange arrows) and greedy local searches (short purple arrows), PLOW samples conformations representative of local minima ($C_1$ through $C_4$) in the energy surface. Legend: greedy local search; perturbation move; energy surface; high-energy conformation; local minimum conformation; rejected local minimum conformation (C*).

Table 4.1: Column 5 gives the RMSD between the native structure and the closest local minimum found when performing multiple greedy local searches starting from the native structure.

      PDB id   Length   Fold   RMSD of nearest local minimum to native (Å)
  1   1DTDB    61       αβ
  2   1ISUA    62       αβ
  3   1C8CA    64       αβ
  4   1SAP     66       αβ
  5   1HZ6A    67       αβ
  6   1WAPA    68       β
  7   1FWP     69       αβ
  8   1AIL     70       α
  9   1AOY     78       αβ
 10   1CC5     83       α
 11   2EZK     93       α
 12   1HHP     99       αβ
 13   2HG6     106      αβ
 14   3GWL     106      α
 15   2H5ND    123      α

4.1.2 Perturbation for Global Search

The goal of a perturbation is to allow PLOW to jump out of the current local minimum so that another nearby local minimum can be sampled. The perturbation needs to make a move that is not too small, so that PLOW can jump out of a local minimum, but also not too large, so that PLOW can still benefit from knowledge of the previous local minimum and not devolve into random search. The perturbation move is implemented by replacing the configuration of a trimer selected over the protein chain, in the conformation representing the currently sampled local minimum, with a configuration sampled from the trimer configuration library. This implementation is sufficient to obtain a high-energy conformation that takes PLOW out of the current local minimum. The reason is that low-energy conformations tend to be compact and leave little room for movement in their backbone chain without raising potential energy. The conformation obtained after the perturbation move will share nearly all of its local structural features with its parent conformation (the one residing in the local minimum), but the new conformation will have a much higher energy and a significantly altered overall global structure.

Given that the perturbation move results in a high energy, the greedy search described above can then optimize the perturbed conformation $C_{i(perturb)}$ and map it to one of many distinct local minima $C_{i+1}$, leaving little chance that the mapping will return PLOW to its previously sampled local minimum $C_i$ (see Figure 4.1(c) for an illustration). However, because most of the local structural features of $C_i$ are maintained in the perturbed conformation $C_{i(perturb)}$, the greedy search will benefit from such knowledge and be able to map $C_{i(perturb)}$ to a nearby local minimum $C_{i+1}$. For these reasons, a single trimer configuration replacement serves as an effective perturbation move.
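The interplay of perturbation, minimization, and acceptance in a PLOW-like walk can be sketched on a toy problem as below. Everything concrete here is an assumption made for illustration: the state is a single integer rather than a protein conformation, `toy_energy`, `toy_minimize`, and `toy_perturb` are hypothetical stand-ins for the coarse-grained energy, the greedy fragment-based search, and the single-trimer perturbation, and the value `beta=0.23` is simply a plausible scaling parameter for a Metropolis-style acceptance test.

```python
import math
import random
from typing import Callable, List, Tuple


def toy_energy(x: int) -> float:
    """Made-up rugged 1-D energy: a broad bowl with ripples (illustrative only)."""
    return 0.02 * (x - 30) ** 2 + 2.0 * math.cos(x)


def toy_minimize(x: int) -> int:
    """Greedy descent on the integer lattice until neither neighbor is lower."""
    while True:
        best = min((x - 1, x + 1), key=toy_energy)
        if toy_energy(best) >= toy_energy(x):
            return x
        x = best


def toy_perturb(x: int, rng: random.Random) -> int:
    """Medium-sized jump away from the current minimum (hypothetical step sizes)."""
    return x + rng.choice((-1, 1)) * rng.randint(3, 8)


def plow_like_walk(start: int,
                   energy: Callable[[int], float],
                   minimize: Callable[[int], int],
                   perturb: Callable[[int, random.Random], int],
                   beta: float,
                   n_hops: int,
                   rng: random.Random) -> List[Tuple[int, float]]:
    """Return the accepted local minima (state, energy) along the trajectory."""
    current = minimize(start)
    current_e = energy(current)
    trajectory = [(current, current_e)]
    for _ in range(n_hops):
        # Perturb out of the current minimum, then map to a nearby minimum.
        candidate = minimize(perturb(current, rng))
        candidate_e = energy(candidate)
        delta = candidate_e - current_e
        # Metropolis-style test: always accept downhill, sometimes accept uphill.
        if delta < 0 or rng.random() < math.exp(-beta * delta):
            current, current_e = candidate, candidate_e
            trajectory.append((current, current_e))
    return trajectory


walk = plow_like_walk(0, toy_energy, toy_minimize, toy_perturb,
                      beta=0.23, n_hops=50, rng=random.Random(0))
print(len(walk), walk[-1])
```

The sketch keeps the three components as separate callables so that each can be swapped independently, mirroring the way minimization, perturbation, and acceptance are treated as separate components of PLOW.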

4.1.3 Acceptance Criterion for Two Consecutively-sampled Local Minima

After each $C_{i(perturb)}$ has been mapped to a nearby local minimum $C_{i+1}$ by the greedy search, PLOW decides whether to accept $C_{i+1}$ and add it to its trajectory or to remain at $C_i$. PLOW employs the Metropolis criterion to make this decision [63]. According to the Metropolis criterion, $C_{i+1}$ will be accepted if it has lower energy than $C_i$. Otherwise, it will be accepted with probability $e^{-\Delta E \beta}$, where $\Delta E$ is the energy increase from $C_i$ to $C_{i+1}$, and $\beta$ is a scaling parameter that depends on an employed effective temperature. In this implementation, this parameter is set so that 10 kcal/mol energetic increases are accepted with probability 0.1. (This parameter value is based on previous work [50].)

Effectiveness of PLOW

The effectiveness of PLOW is demonstrated by comparing its ability to sample conformations near the protein native state to that of FeLTr, a tree-based search framework developed in the Shehu lab and briefly described in chapter 2 [76]. Since FeLTr is a state-of-the-art probabilistic search framework that does not explicitly sample local minima, this comparison allows investigating the extent to which the explicit sampling of local minima in PLOW is more effective than the exploration in FeLTr. Both PLOW and FeLTr employ the same energy function (AMW) and fragment library to allow the most direct comparison. The minimum and average lowest RMSDs achieved over five runs of each framework are reported and compared in Table 4.2 (PLOW in column 6 and FeLTr in column 7). Table 4.2 shows that PLOW outperforms FeLTr in every case except the systems with native structure PDB ids 1aoy and 1c8cA. In 10 systems, the difference is greater than 0.5Å RMSD.
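As an aside on the acceptance rule of Section 4.1.3, the calibration "10 kcal/mol increases are accepted with probability 0.1" pins down the scaling parameter, since solving $e^{-\Delta E \beta} = 0.1$ for $\Delta E = 10$ gives $\beta = \ln(10)/10$. The sketch below works through that arithmetic; the function names `beta_for_acceptance` and `metropolis_accept` are hypothetical and are not taken from the thesis code.

```python
import math
import random


def beta_for_acceptance(delta_e: float, probability: float) -> float:
    """Solve exp(-beta * delta_e) = probability for beta."""
    if delta_e <= 0 or not 0.0 < probability < 1.0:
        raise ValueError("need delta_e > 0 and 0 < probability < 1")
    return -math.log(probability) / delta_e


def metropolis_accept(e_current: float, e_candidate: float,
                      beta: float, rng: random.Random) -> bool:
    """Accept lower energies outright; accept increases with prob exp(-dE * beta)."""
    delta_e = e_candidate - e_current
    if delta_e < 0:
        return True
    return rng.random() < math.exp(-delta_e * beta)


# Calibration stated in the text: +10 kcal/mol accepted with probability 0.1.
beta = beta_for_acceptance(10.0, 0.1)  # ln(10) / 10, roughly 0.23 per kcal/mol
print(beta)
```

The same helper applies to any other calibration expressed as "an increase of a given size is accepted with a given probability".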
PLOW outperforms FeLTr by at least 1.0Å in key cases, including the longer proteins (PDB ids 2ezk, 1hhp, 2hg6, 3gwl, and 2h5nD); in the case of the protein with native structure PDB id 1ail, PLOW finds a local minimum that is within 2Å RMSD

Table 4.2: The lowest RMSD to native structure achieved is shown for both PLOW and FeLTr. The RMSDs given are the average of five runs, with the minimum of the five runs shown in parentheses. Column 5 shows the average number of iterations of each PLOW LocalSearch function. FeLTr* represents the FeLTr framework using the value from column 5 as its MMC search length.

                                local        avg (min) lowest lRMSD to native in Å
      PDB id   len   fold   search len   PLOW     FeLTr        FeLTr*
  1   1DTDB    61    αβ                  (6.3)    7.7(6.8)     7.5(7.0)
  2   1ISUA    62    αβ                  (6.0)    6.6(6.3)     6.5(5.7)
  3   1C8CA    64    αβ                  (6.9)    6.8(6.0)     7.2(5.8)
  4   1SAP     66    αβ                  (6.2)    7.1(6.5)     7.3(6.8)
  5   1HZ6A    67    αβ                  (6.1)    6.7(6.6)     6.6(6.1)
  6   1WAPA    68    β                   (6.9)    8.1(7.3)     7.3(6.5)
  7   1FWP     69    αβ                  (5.2)    7.3(6.4)     7.1(6.8)
  8   1AIL     70    α                   (2.0)    4.8(4.5)     4.0(3.4)
  9   1AOY     78    αβ                  (5.3)    5.2(4.6)     5.8(5.2)
 10   1CC5     83    α                   (5.4)    6.2(5.6)     5.8(4.9)
 11   2EZK     93    α                   (4.3)    6.5(6.0)     6.0(4.7)
 12   1HHP     99    αβ                  (9.7)    11.2(10.0)   11.0(9.7)
 13   2HG6     106   αβ                  (8.1)    10.0(9.6)    9.7(9.0)
 14   3GWL     106   α                   (3.7)    6.6(5.7)     6.3(4.4)
 15   2H5ND    123   α                   (6.8)    9.0(8.5)     8.6(7.8)

from the native. It is worth emphasizing that this is an impressive result. This protein is not small, at 70 amino acids in length. Moreover, RMSDs of 1-2Å are often obtained only by protocols that apply some form of all-atom energetic refinement to selected conformations, whereas search algorithms that employ coarse-grained energy functions often saturate at 4-5Å from the native structure. This result suggests that the focus on local minima in PLOW allows effectively locating conformations very near the native structure.

It is worth noting that, in PLOW, the length of the greedy search is not determined a priori and can vary. In FeLTr, instead, the inner MMC trajectory that obtains a new conformation from a selected conformation in the FeLTr tree has a fixed length. In order to rule out the possibility that PLOW is merely benefiting from longer greedy searches, we modify FeLTr and obtain FeLTr* by extending the length of the MMC trajectory to the average greedy search length in PLOW. These average lengths are shown in Table 4.2, column 5. The results for FeLTr* in column 8 show that FeLTr* performs slightly better than FeLTr and is even comparable to PLOW in a few cases (proteins with native structure PDB ids 1isuA, 1c8cA, 1hz6A, 1wapA, 1aoy, and 1cc5). On average, however, PLOW still outperforms FeLTr*, especially in the case of the five longer proteins with native structure PDB ids 2ezk, 1hhp, 2hg6, 3gwl, and 2h5nD. This additional comparison confirms that there is a distinct advantage that the greedy search confers to PLOW. While the average length of the local search is the same between PLOW and FeLTr*, PLOW is able to vary this length as necessary to reach a local minimum.

4.2 Analysis of Local Minima Sampling

This section provides a detailed analysis of the local minima sampling approach employed in PLOW. The evolutionary framework proposed in this dissertation makes use of a similar local minima sampling strategy; thus it is important that this process be well understood.

Section then demonstrates the complexity of the subspace containing only local minima, making the case for smart exploration.

4.2.1 PLOW versus Multistart: Importance of Adjacency Relationship

Section demonstrates that PLOW is more effective than an MMC-based algorithm that does not explicitly sample local minima. Here we conduct an additional experiment to compare PLOW to a naive local minima sampling approach, which effectively samples local minima in the protein energy surface at random. This multistart approach simply samples a point in the conformational space uniformly at random and maps it to a nearby local minimum through the same greedy local search employed by PLOW. Figure 4.1(b) illustrates this process with 5 randomly sampled conformations ($C_{1(rand)}$ to $C_{5(rand)}$), which are mapped to corresponding local minima ($C_1$ to $C_5$). This multistart approach is akin to a classic random search over the subspace of local minima.

Figure 4.2 plots the distribution of the potential energy of the conformations representative of local minima on two representative protein systems. The results obtained by multistart are superimposed over those obtained by PLOW. Figure 4.2 shows that PLOW is able to reach significantly lower-energy minima than multistart sampling on both protein systems. A similar result is obtained on all of the systems studied in this paper (data not shown). Comparison to this naive approach shows that, even if only local minima are considered, a powerful search technique is still required to effectively sample low-energy minima near the protein native structure. Just as PLOW is a significant improvement over random search, the HEA proposed in section 4.3 is a significant improvement over PLOW.

4.2.2 Controlling Temperature of Local Search

The greedy search employed in section can be regarded as a special case of MMC where the effective temperature is set to 0; hence, no higher-energy moves are allowed.

Figure 4.2: The distribution of energies obtained by BH is superimposed over that obtained by the multistart method for the proteins with native PDB ids 2ezk (a) and 1hhp (b). Each panel plots sample frequency against energy (kcal/mol).

Controlling the effective temperature of the local search allows controlling the height of the barriers crossed during the minimization. We conduct an experiment to compare the effectiveness of greedy vs. MMC search in minimization. Three different effective temperatures are studied in the context of the MMC search. A very low one, $T_0$, corresponds to accepting a 1.4 kcal/mol energy increase with probability 0.1, and two slightly higher ones, $T_1$ and $T_2$, accept energy increases of 1.7 and 2.6 kcal/mol, respectively, with probability 0.1. Table 4.3 compares the greedy search ($T = 0$) to MMC searches with $T_0$, $T_1$, and $T_2$. Columns 5-8 show the lowest energy achieved under each setting. Three observations can be made: (i) Lower energies are obtained by MMC than by the greedy search. (ii) Overall, on proteins with fewer than 80 amino acids, the lowest energy is achieved by MMC with $T_0$. (iii) On longer proteins, the slightly higher $T_1$ achieves lower energies, possibly because in more complex, rugged surfaces, small uphill moves allow reaching deeper minima. The energy surface sampled by the PLOW framework for each given value of T is

Table 4.3: Columns 5-8 report the minimum energy achieved for each temperature T of the minimization component of the BH framework. Columns 9-12 then report the corresponding lowest RMSD to the native structure achieved for each T.

                              Lowest Energy (kcal/mol)              Lowest RMSD (Å)
      PDB ID   Size   Native fold   $T = 0$  $T_0$  $T_1$  $T_2$   $T = 0$  $T_0$  $T_1$  $T_2$
  1   1dtdB    61     α/β
  2   1isuA    62     α/β
  3   1c8cA    64     α/β
  4   1sap     66     α/β
  5   1hz6A    67     α/β
  6   1wapA    68     β
  7   1fwp     69     α/β
  8   1ail     70     α
  9   1aoy     78     α/β
 10   1cc5     83     α
 11   2ezk     93     α
 12   1hhp     99     β
 13   2hg6     106    α/β
 14   3gwl     106    α
 15   2h5nD    123    α

illustrated in Figure 4.3. The x- and y-axes represent geometric projections of the conformations based on interatomic distances [50], and the z-axis represents the energy of each sampled local minimum. A large white x represents the location of the experimentally-determined native structure. Figure 4.3 illustrates that coarse-grained energy functions are noisy and result in surfaces that can deviate from the true protein energy surface. For this reason, the small additional reductions in potential energy obtained by MMC do not improve PLOW's ability to sample local minima near the native structure. Columns 9-12 in Table 4.3 show, for each value of T, the lowest RMSD to the native structure obtained in each experiment. Comparable lowest RMSDs are obtained whether greedy or MMC search is employed in the minimization. Probing deeper into minima with the MMC-based minimization does not necessarily bring PLOW closer to the native structure.

4.3 Combining Global and Local Search

This section generalizes the trajectory-based BH approach in PLOW (detailed in section 4.1) to a population-based Hybrid Evolutionary Algorithm (HEA). Here we combine the greedy local search found to be effective in PLOW with a canonical evolutionary algorithm to create a true hybrid between global and local search. In this context, PLOW can be described as an HEA with a population size of 1 and an elitism rate of 100% (commonly referred to as a 1+1 HEA). The details of the HEA are outlined in section. Section then compares the ability of the HEA and PLOW to sample conformations which are low in energy and similar to an experimentally-determined native structure.

A Population-based Hybrid Evolutionary Algorithm (HEA)

The HEA combines a canonical population-based evolutionary algorithm with the greedy local search described in section. A population of p decoy conformations is evolved through a series of generations. An initial population, $P_0$, is constructed as

Figure 4.3: The energy surface sampled for the protein with native PDB ID 1fwp is shown for each temperature T: (a) $T = 0$, (b) $T_0$, (c) $T_1$, and (d) $T_2$. The x- and y-axes represent projection coordinates based on interatomic distances within each conformation, and the z-axis represents the energy (kcal/mol) of each sampled local minimum. The white x indicates the location of the native structure in the energy surface.

p copies of an extended conformation subjected to the domain-specific initialization function described in chapter 3, which aims to create a randomized conformation with good secondary structural elements. The population $P_i$ in each subsequent generation i is obtained as follows. All conformations of the previous population $P_{i-1}$ are first duplicated, then subjected to mutation and projected to a nearby local minimum through the greedy local search outlined in section. The result of this process is p child conformations that are added to population $P_i$. The k% of conformations with highest fitness in $P_{i-1}$ are also added to $P_i$ in order to maintain low-energy conformations captured in previous generations. The resulting population is reduced down to the same constant size of p individuals through elitist truncation selection based on potential energy. For the HEA analyzed in this chapter, p = 100 and an elitism rate of k = 25% based on fitness are used. The Rosetta score4 energy function is used to judge the fitness of conformations in the population, while the score3 energy function is used for minimization in the greedy local search.

Analysis of HEA

This section analyzes the effectiveness of the HEA in sampling low-energy near-native conformations. The HEA described in section is compared to PLOW (described in section 4.1). In order to perform a fair comparison, PLOW is run using the Rosetta score4 energy function for the acceptance criterion and the score3 energy function for minimization. Additionally, the start state of PLOW is initialized with the same initialization function as in the HEA. To examine the effects of the depth of the local search, the HEA is run with both $T = 0$ and $T_0$ as local search temperatures (as described in section 4.2.2). Each experiment is run for a fixed budget of 10,000,000 energy function evaluations and repeated five times (as described in chapter 3).
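One generation of the HEA, as described above (copy and mutate every parent, map each child to a local minimum, add the best k% of parents, then truncate back to p by energy), can be sketched as follows. This is a toy illustration under stated assumptions: individuals are integers on a made-up 1-D energy surface, and `toy_energy`, `toy_mutate`, and `toy_local_search` are hypothetical stand-ins for the fitness energy function, the mutation operator, and the greedy local search. A single energy function is used for both fitness and minimization here, whereas the HEA in this chapter uses score4 for fitness and score3 for minimization.

```python
import math
import random
from typing import Callable, List


def toy_energy(x: int) -> float:
    """Made-up rugged 1-D energy standing in for the fitness function."""
    return 0.02 * (x - 30) ** 2 + 2.0 * math.cos(x)


def toy_local_search(x: int) -> int:
    """Greedy descent on the integer lattice until neither neighbor is lower."""
    while True:
        best = min((x - 1, x + 1), key=toy_energy)
        if toy_energy(best) >= toy_energy(x):
            return x
        x = best


def toy_mutate(x: int, rng: random.Random) -> int:
    """Hypothetical mutation: a bounded random displacement."""
    return x + rng.randint(-8, 8)


def hea_generation(population: List[int],
                   fitness_energy: Callable[[int], float],
                   mutate: Callable[[int, random.Random], int],
                   local_search: Callable[[int], int],
                   elitism_rate: float,
                   rng: random.Random) -> List[int]:
    """Produce the next population from the previous one."""
    p = len(population)
    # Every parent is copied, mutated, and mapped to a nearby local minimum.
    children = [local_search(mutate(parent, rng)) for parent in population]
    # The lowest-energy fraction of parents is carried over as elites.
    n_elite = int(round(elitism_rate * p))
    elites = sorted(population, key=fitness_energy)[:n_elite]
    # Truncation selection by energy restores the constant size p.
    return sorted(children + elites, key=fitness_energy)[:p]


rng = random.Random(0)
population = [rng.randint(-100, 100) for _ in range(20)]
for _ in range(10):
    population = hea_generation(population, toy_energy, toy_mutate,
                                toy_local_search, elitism_rate=0.25, rng=rng)
print(min(toy_energy(x) for x in population))
```

With p children and at most k% of p elites competing for p slots, at least (100 - k)% of each new population consists of children, which is why most newly sampled minima continue to be refined in the next generation.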

Sampling Low-energy Conformations with the HEA

Table 4.4 shows the lowest energy level reached by each algorithm for each of 20 target protein systems. The average energy over 5 runs is given with the minimum in parentheses. Comparing columns 5 and 6 reveals that the HEA (with T = 0) finds significantly lower energy conformations than PLOW for 12 out of the 20 target protein systems (highlighted in bold), while PLOW only finds significantly lower energy conformations for 4 of the protein systems (also highlighted in bold). Of particular note is that the HEA finds lower energy conformations for the six largest proteins, suggesting that while PLOW and the HEA perform similarly on smaller proteins, the HEA is particularly effective at navigating more complex energy surfaces.

Column 7 in Table 4.4 shows the lowest energy level reached for the HEA with temperature T 0 for the greedy local search. Comparison of columns 6 and 7 reveals that employing T 0 only results in significantly lower energy conformations for 8 out of 20 protein systems (highlighted in bold), while employing T = 0 also results in significantly lower energy conformations for 8 of 20 proteins (results are equivalent on the remaining 4 proteins). In contrast to results for PLOW in section 4.2.2, increasing the depth of the local search does not necessarily improve sampling of low-energy conformations for the HEA. These results are not unexpected, as extensive research in the evolutionary computation community has demonstrated the effectiveness of population-based search. The elitist population-based nature of the HEA allows at least 75 of the new child conformations to continue on to the next generation for further refinement while still retaining the 25 lowest-energy conformations sampled during the search. Retaining the 25 lowest-energy conformations allows the HEA to simultaneously focus on multiple low-energy regions of the search space, while PLOW must give up on exploring one minimum before it can jump to another region of the search space. Additionally, the need for deeper local searches in PLOW is negated since at least 75% of child conformations will continue to be refined in the next generation. PLOW, on the other hand, only maintains a single conformation in

Table 4.4: The lowest energy level reached by PLOW and the HEA is given for each target protein system. Results are given as the average over 5 independent runs with the minimum over the runs given in parentheses. HEA T = 0 employs a greedy local search, while HEA T 0 employs a low-temperature MMC for the local search.

                                      lowest Rosetta Score4 Energy
     PDB Id  Size  Fold Topology  PLOW T = 0      HEA T = 0       HEA T 0
 1   1bq9     53   α/β            -49.0(-57.3)    -45.6(-50.5)    -39.5(-42.9)
 2   1dtdB    61   α/β            -42.4(-62.8)    -49.0(-55.0)    -44.3(-68.0)
 3   1isuA    62   α/β            -43.2(-51.0)    -40.0(-46.5)    -40.2(-46.5)
 4   1c8cA    64   α/β            -76.0(-101.6)   -78.0(-86.4)    (-119.1)
 5   1sap     66   α/β            (-117.2)        (-121.4)        (-124.9)
 6   1hz6A    67   α/β            (-136.8)        (-130.9)        (-127.3)
 7   1wapA    68   β              -93.9(-115.4)   (-132.5)        (-132.8)
 8   1fwp     69   α/β            -64.7(-73.5)    -70.9(-84.4)    -72.7(-85.2)
 9   1ail     70   α              -57.4(-73.5)    -52.1(-56.1)    -59.9(-63.8)
10   1dtjA    76   β              -70.1(-87.7)    -71.5(-82.2)    -70.8(-81.8)
11   1aoy     78   α/β            -91.7(-102.2)   -92.1(-98.1)    -98.5(-115.8)
12   2ci2     83   α/β            -75.9(-98.6)    -95.7(-109.8)   -85.6(-99.2)
13   1cc5     83   α              -62.3(-67.9)    -62.3(-68.6)    -56.7(-59.9)
14   1tig     88   α/β            (-145.8)        (-128.0)        (-156.6)
15   2ezk     93   α              -87.3(-91.7)    -94.6(-100.7)   -90.1(-96.1)
16   1hhp     99   β              -55.1(-79.3)    -82.6(-104.5)   -76.5(-95.0)
17   2hg6    106   α/β            -77.7(-92.5)    -90.9(-102.6)   -84.9(-98.8)
18   3gwl    106   α              -82.4(-87.8)    -88.4(-100.0)   -90.5(-99.2)
19   2h5nD   123   α              (-119.2)        (-129.0)        (-121.7)
20   1aly    146   β              -71.8(-113.6)   -75.0(-81.1)    -95.5(-128.1)

memory and will only accept a child conformation with higher energy with a low probability.

Sampling Near-native Conformations with the HEA

Table 4.5 shows the lowest RMSD conformation reached by each algorithm as well as the percentage of conformations below 5Å RMSD across all 5 runs. Comparison of columns 5 and 6 reveals that the HEA finds significantly lower RMSD conformations for 9 out of 20 proteins, while PLOW finds significantly lower RMSD conformations for 4 out of 20 proteins. While this shows some improvement by the HEA, the results are not conclusive. Examining columns 8 and 9 gives a more interesting result. The HEA samples conformations below 5Å RMSD for 13 out of the 15 smallest proteins (highlighted in bold), while PLOW only samples conformations below 5Å RMSD for 8 of these 15 proteins. However, when PLOW does manage to sample conformations below 5Å, it typically samples a greater percentage than the HEA. This suggests that the HEA not only finds lower energy conformations, but that its more explorative nature also samples a diverse set of conformations, including those in the proximity of the native state. However, when PLOW does get lucky and reach a region near the native structure, its more exploitative nature is able to sample this region in greater detail.

4.4 Conclusions

This chapter proposes an HEA which effectively combines global search for exploration of the search space with local search for exploitation of local minima in the space. This approach provides a discrete representation of the search space as a set of conformations which map to low-energy local minima. A BH algorithm is first introduced to show that

Table 4.5: Columns 5-7 give the minimum RMSD sampled across all independent runs of PLOW and the HEA for T = 0 and T 0 local search temperatures. Columns 8-9 give the percentage of near-native conformations sampled for each algorithm, with values above 1% highlighted in bold.

Column headers: PDB Id; min Cα-RMSD (Å): PLOW, HEA T = 0, HEA T 0; % less than 5 Å Cα-RMSD: PLOW, HEA T = 0, HEA T 0.

this explicit local minima sampling is more effective than a method based on MMC sampling, which then forms the basis for the HEA described above. Detailed analysis is performed of the effect the depth of the local search has on conformational sampling. The BH algorithm is then generalized as a special case of a population-based HEA using state-of-the-art domain-specific components from the computational biology community. Comparison of the BH algorithm (HEA with population size 1) to the full HEA shows that the population-based approach is able to sample significantly lower-energy conformations than the trajectory-based BH algorithm. When examining the ability to sample near-native conformations, however, the picture is less straightforward. The population-based HEA is able to sample near-native conformations for more protein systems; however, when the BH approach does find a near-native region of the search space, it is able to better exploit it.

The canonical HEA proposed in this chapter serves as a proof-of-concept that a hybrid global-local search is effective at navigating the rugged energy surface presented by state-of-the-art coarse-grained energy functions. However, this first step does not effectively leverage the population of conformations to guide the search towards near-native conformations. Chapters 5 and 6 examine different aspects of the HEA which can more effectively leverage the population of conformations maintained by the HEA to improve sampling of not only low-energy conformations but also conformations near the native state.

CHAPTER 5: GUIDING SAMPLING THROUGH PERTURBATION: BIASED MUTATION AND CROSSOVER

This chapter addresses the question of how one can effectively sample new energetically relevant conformations at the global level. Fragment replacement is a powerful tool for effectively iterating through a series of conformations; however, fragment replacement techniques are typically designed with a Monte Carlo (MC) search framework in mind, where moves are immediately accepted or rejected based on a selected scoring function. This chapter investigates two complementary approaches to sampling conformations at the global level. Section 5.1 investigates how to bias fragment replacement to improve its efficacy in global search, while section 5.2 investigates how to best combine features from multiple previously-sampled conformations to better explore the search space.

5.1 Controlling Perturbation Distance of Fragment Replacement

This section builds on existing fragment replacement methods to enhance protein conformational sampling. The most common approach to fragment replacement is to select both the position in the protein and the fragment to employ uniformly at random. There are two drawbacks to this approach: while each fragment replaces a fixed number of dihedral bond angles, the change to the overall structure of the protein can vary widely depending on which position is selected for replacement and how different the selected fragment is from the existing structure. In the context of the evolutionary search framework, this means that some mutations have a very large effect, while others have a very small effect. This section explores the effect of explicitly controlling the perturbation distance of each mutation,

rather than simply selecting a fragment at random. Analysis suggests that making primarily smaller moves in the search space tends to produce a more effective search algorithm. Therefore we show that by biasing perturbation distance towards smaller moves, we are able to enhance sampling for proteins which have relatively large moves when fragment selection is done uniformly at random. Experiments in this section are performed in the context of the PLOW algorithm outlined in chapter 4.

An interesting correlation is shown in Figure 5.1 between the mean RMSD between C_i and C_i(perturb) and the lowest RMSD between conformations sampled by PLOW and the known native structure for 15 protein systems. The correlation between these two quantities in Figure 5.1 is about 80%. A lower RMSD from the native structure corresponds to a smaller jump on average (in terms of RMSD) made by the perturbation move. This result suggests that the protein systems where PLOW is able to find low RMSDs to the native structure are also the systems where the perturbation move is able to jump out of the current minimum without jumping to a far away region of the conformational space. A similar result is obtained when correlating the lowest RMSD to the native structure obtained by PLOW with the mean RMSD between consecutive local minima in PLOW (the RMSD between C_i and C_{i+1}).

A more detailed picture of what the perturbation move is doing is provided in Figure 5.2, which shows the distribution of RMSDs between C_i and C_i(perturb) for two selected protein systems. The area of the curve shaded in red represents the portion of perturbation moves where this RMSD is less than 1Å (the move is deemed not to have escaped the current local minimum). The system in Figure 5.2(a) with native structure PDB id 3gwl is an example of a protein system where PLOW is very effective at finding conformations near the native structure. In this case, the distribution contains a large area of short-to-medium moves with RMSDs in the 1-8Å range. In contrast, the system in Figure 5.2(b) with native structure PDB id 1hhp is an example of a system where PLOW does not find conformations near the native structure. Correspondingly, the distribution in Figure 5.2(b)

is weighted towards much higher RMSDs, with much of the area under the curve above 8Å. This suggests that, in this case, the perturbation move is approaching a random restart. Taken together, these results suggest that PLOW performs best when it makes jumps of no more than 6Å in terms of mean RMSD between nearby local minima in the energy surface.

Figure 5.1: The mean perturbation distance between C_i and C_i(perturb) is plotted against the lowest RMSD from the native structure obtained by PLOW on each of 15 protein systems. The strong linear correlation (the identity line is drawn in red) suggests that the efficacy of the perturbation function is directly related to the efficacy of the search in PLOW. Axes: Minimum RMSD from the native structure discovered (Å) and Mean perturbation distance (Å); points are labeled by PDB ID.

Figure 5.2: The distribution of perturbation distances, between C_i and C_i(perturb), is shown for two selected proteins with PDB ids 3gwl in (a) and 1hhp in (b). The area shaded in red represents the cases where the perturbation distance between C_i and C_i(perturb) is less than 1Å RMSD and is thus deemed an insignificant change from the conformation C_i. Axes: Perturbation Distance (Å) and Sample Frequency. Panels: (a) 3gwl (length 106, α), mean 4.2 Å; (b) 1hhp (length 99, β), mean 7.2 Å.

Controlling Perturbation

The analysis above suggests that there is a benefit in biasing perturbation to shorter jumps. To this end, the following technique is employed to control the magnitude of each perturbation jump to a configured distance D (the magnitude is measured as the RMSD between

C_i and C_perturb,i). A target distance d is sampled from a Gaussian distribution centered at D with a standard deviation of 1. A new perturbed conformation C_perturb is sampled using a single trimer configuration replacement. C_perturb is accepted if the RMSD between C_i and C_perturb is within a tolerance, t, of the target distance d. The process is repeated until a C_perturb that satisfies the RMSD criterion is obtained or a maximum of n attempts is reached. If no candidate satisfies the criterion, the ensuing minimization uses as C_perturb,i the C_perturb conformation whose RMSD from C_i is closest to d among the n candidates obtained in this process. The value of n is set to 20, which is large enough to find an accepted C_perturb within a tolerance t = 0.5Å in most cases. Since candidates for C_perturb,i are not evaluated for energy, this process adds insignificant additional computation to PLOW.

Figure 5.3: The mean µ_MM is shown for a given target D. Axes: Median Perturbation Distance (Å) and Mean Local Minima Distance (Å); one curve per protein (1dtdB, 1isuA, 1c8cA, 1sap, 1hz6A, 1wapA, 1fwp, 1ail, 1aoy, 1cc5, 2ezk, 1hhp, 2hg6, 3gwl, 2h5nD).

Figure 5.3 shows that the distance between consecutively sampled local minima, referred to as µ_MM, can be effectively controlled by biasing the magnitude of the perturbation jump through a target perturbation distance D; as D is increased, there is a corresponding increase in µ_MM.
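The distance-controlled perturbation described above can be sketched as follows. This is a hedged illustration, not the code used in PLOW: the `perturb` and `distance` callables are placeholders for trimer fragment replacement and RMSD, and `toy_distance` and `make_toy_perturb` are hypothetical stand-ins that operate on plain lists of floats in arbitrary units.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

Conformation = List[float]  # hypothetical stand-in for a protein conformation


def controlled_perturbation(
    current: Conformation,
    perturb: Callable[[Conformation], Conformation],
    distance: Callable[[Conformation, Conformation], float],
    D: float,
    rng: random.Random,
    tolerance: float = 0.5,
    max_attempts: int = 20,
    sigma: float = 1.0,
) -> Tuple[Optional[Conformation], float, bool]:
    """Return (candidate, target distance d, whether it met the tolerance)."""
    # Sample the target distance d from a Gaussian centered at D.
    d = rng.gauss(D, sigma)
    closest, closest_gap = None, math.inf
    for _ in range(max_attempts):
        candidate = perturb(current)
        gap = abs(distance(current, candidate) - d)
        if gap <= tolerance:
            return candidate, d, True  # within tolerance t of d: accept
        if gap < closest_gap:
            closest, closest_gap = candidate, gap
    # No candidate met the tolerance: fall back to the one closest to d.
    return closest, d, False


def toy_distance(a: Conformation, b: Conformation) -> float:
    # Root-mean-square difference between two vectors (no superposition).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def make_toy_perturb(rng: random.Random, window: int = 3, spread: float = 10.0):
    def perturb(c: Conformation) -> Conformation:
        # Overwrite a random window of 3 values, loosely mimicking a trimer replacement.
        out = list(c)
        start = rng.randrange(len(out) - window + 1)
        for i in range(start, start + window):
            out[i] = rng.uniform(-spread, spread)
        return out

    return perturb
```

No energy evaluation appears inside the loop, which mirrors the observation that candidate perturbations are screened by distance alone and therefore add little computation.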

Tuning D does not have any significant effect on the single lowest RMSD obtained (RMSD is computed over the heavy backbone atoms and measures the proximity of a conformation to the native structure). However, D affects the frequency with which near-native conformations are obtained (that is, the distribution of sampled minima) in cases where unbiased perturbation results in large µ_MM values. Figure 5.4 illustrates this for two representative systems by plotting, for different values of D, the distribution of µ_MM values and the resulting distribution of RMSD values. These results show that there is a distinct advantage to biasing the perturbation distance to D = 1Å or D = 2Å. Figures 5.4(a) and 5.4(c) show that the frequency of small µ_MM is larger when D ∈ {1, 2}Å than with an unbiased perturbation. Figures 5.4(b) and 5.4(d) show that the resulting ensembles contain more low-RMSD conformations than the unbiased approach.

The effect of controlling D shown in Figure 5.4 is strongest on more heavily β-sheet proteins (those with native PDB IDs 1dtdB, 1isuA, 1wapA, and 1hhp). On these proteins, an unbiased perturbation results in few small consecutive local minima distances. More near-native conformations are also obtained (though to a lesser extent) when D ∈ {1, 2}Å for other proteins (with native PDB IDs 1ail, 1sap, and 2h5nD). On these proteins, unbiased perturbation results in larger numbers of small consecutive local minima distances, but these proteins still benefit from enhanced sampling of neighboring local minima.

This enhanced sampling of near-native conformations can correspond to PLOW remaining in the same near-native region of the space; low D values could potentially cause the minimization to return to the previous minimum. In practice, this does occur for D = 1Å; however, when D > 1Å, the search returns to previous local minima the same or less frequently than the unbiased approach.

Figure 5.4: The frequencies of µ_MM sampled during the search for proteins with native structure PDB IDs 1ail and 1isuA are shown in (a) and (c), respectively. Frequency of RMSDs to the native structure for each protein are given in (b) and (d), respectively. The solid red line represents PLOW employing the unbiased perturbation method. The dashed lines represent PLOW with median perturbation distances D = 1Å to D = 5Å. Axes: lRMSD between consecutive local minima (Å) in (a) and (c), lRMSD to the native structure (Å) in (b) and (d), and Sample Frequency. Panels: (a), (b) 1ail (topology α); (c), (d) 1isuA (topology α/β, size 62).

5.2 Combining Conformational Features with Crossover

The ensemble of explicitly sampled local minima conformations provides a rich array of structural diversity and, in particular, the presence of multiple low-energy basins suggests multiple locally optimal sub-structures. This section addresses how to best leverage this ensemble of sampled structures to make more effective moves at the global level than simply employing fragment replacement as a mutation operator. In particular, fragment replacement is highly effective at forming α-helices, which are held together through local contacts between amino acids close in sequence. β-sheets, on the other hand, form through non-local contacts between amino acids which can be far apart in the sequence. For this reason it is difficult for short fragments to capture β-sheets. This chapter explores the use of genetic crossover to select features from two parent conformations to generate a new child conformation. While crossover is a commonly employed approach in the field of evolutionary computation, it is known to have limitations in highly constrained search spaces such as the protein conformational space. Therefore the local search component of the HEA is critical for making the HEA with crossover effective.

Hybrid Genetic Evolutionary Algorithm (GA)

This work investigates the effectiveness of different crossover operators, comparing standard 1-point and 2-point crossover operators to a homologous 1-point crossover operator. The crossover operator is applied to the HEA described in chapter 4. The combined hybrid Genetic EA (GA) is outlined in Figure 5.5. Mutation, local search, and selection are performed as described in chapter 4 for the HEA, and the Rosetta score3 energy function is employed for evaluation. To perform crossover on each member of the parent population, a second parent is selected uniformly at random, and each pair of parents produces a single offspring. In 1-point and 2-point crossover, the crossover points are selected uniformly at random over

Figure 5.5: Flowchart of the hybrid genetic algorithm. In each generation, the chosen crossover operator is followed by a mutation operator and a local search to optimize a conformation to a nearby local minimum. The new population of local minima competes with elite members of the parent population through truncation selection. Flowchart boxes: Random Initial Population; Crossover; Mutation; Local Search; Evaluate; Elitist-based Truncation Selection.

amino acid positions in the target protein sequence. In homologous 1-point crossover, the crossover point is selected uniformly at random over the set of amino acids for which both parents share the same φ and ψ dihedral bond angles. This effectively creates a bias towards selecting a crossover point based on the number of consecutive amino acids with matching φ and ψ angles. In the case that there are no matching pairs of φ and ψ angles, a standard 1-point crossover is performed instead. To provide baseline comparisons, two additional experiments have been included: one with just the fragment mutation operator (i.e., no crossover), and a second experiment using only a random replacement operator that performs n fragment mutations (n is the number of amino acids), effectively restarting at a conformation randomly sampled from the fragment configuration library.

Analysis of Crossover Operators

Experimental setup: The parameters for each experiment are a particular crossover operator (1-point crossover, 2-point crossover, homologous 1-point crossover, random replacement, or mutation only) and a target protein system. Each experiment is an execution of the HEA with the selected parameters for a fixed budget of 10,000,000 energy function evaluations. Each run is repeated 30 times, with means and best runs reported. Additional details on experimental setup and protein systems can be found in chapter 3.

Types of analyses and performance measurements: Section 5.2.3 compares reproductive operators on the lowest energy values reached in order to determine which operator has higher sampling capability. The section on fitness improvement provides further details on how the fitness improvement of each operator relates to the sampling of low-energy conformations. Finally, the section on sampling near-native conformations reports the lowest RMSD reached to the known native structure of a protein.
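The three crossover operators described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the dissertation's implementation: a conformation is modeled as a list of (φ, ψ) pairs, "same angles" is taken to mean exact equality, and the crossover points in the standard operators are restricted to interior positions so that both parents contribute (an assumption not stated in the text).

```python
import random
from typing import List, Tuple

Residue = Tuple[float, float]   # (phi, psi) dihedral angles of one amino acid
Conformation = List[Residue]


def one_point(a: Conformation, b: Conformation, rng: random.Random) -> Conformation:
    # Prefix from parent a, suffix from parent b.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def two_point(a: Conformation, b: Conformation, rng: random.Random) -> Conformation:
    # Middle segment from parent b, flanks from parent a.
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:]


def homologous_one_point(a: Conformation, b: Conformation, rng: random.Random) -> Conformation:
    # Candidate crossover points: positions where both parents share phi and psi.
    matches = [i for i, (ra, rb) in enumerate(zip(a, b)) if ra == rb]
    if not matches:
        # No matching residues: fall back to standard 1-point crossover.
        return one_point(a, b, rng)
    cut = rng.choice(matches)
    return a[:cut] + b[cut:]
```

Choosing uniformly over matching positions means that long runs of matching residues are proportionally more likely to contain the crossover point, which is the bias noted above.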

5.2.3 Navigating the Protein Energy Surface

Table 5.1 shows the lowest energy across all the conformations sampled during each experiment. The value shown is the mean over 30 runs, with the minimum of 30 shown in parentheses. Values shown for the random replacement operator in column 5 provide a baseline of comparison. Column 6 shows values reached by the mutation-only HEA with no crossover. Columns 7 to 9 show values for the 1-point, 2-point, and homologous 1-point crossover, respectively (recall that mutation and local search follow after each crossover operator). Examination of these values shows that in nearly every instance, the addition of crossover results, on average, in lower energies. In only the single case of 2ezk does employment of mutation alone yield a slightly lower energy than 2-point crossover. These results make the case that use of a crossover operator allows our hybrid EA to more effectively navigate the protein energy surface and access lower-energy conformations.

A downside to employing crossover over mutation alone is that the larger moves in dihedral angle space tend to require a longer local search to return to a local minimum. As a result, our HEA with mutation only is able to sample significantly more conformations for the same number of energy evaluations (data not shown here). While 1-point and 2-point crossover sample on average 11% and 21% fewer conformations than mutation-only, respectively, homologous 1-point crossover averages only a 7% reduction and never more than 11% across all proteins. This suggests that homologous crossover is less disruptive to local structure than other crossover operators and thus increases sampling diversity without as much of an increase in the time required to map a conformation to a local minimum.

The effect of this disruption is particularly evident between the 1-point and 2-point crossover methods. As indicated in Table 5.1, improvements in energy over mutation alone are found to be statistically significant with 95% confidence by the Mann-Whitney U test for 6 of the proteins when employing 1-point and homologous 1-point crossover. However, statistical significance is only achieved for 4 proteins using 2-point crossover. This suggests

that the increased disruption found in 2-point crossover makes it less suitable for use in protein structure prediction.

Table 5.1: Columns 2-4 show PDB Id of the native structure, number of amino acids, and native fold topology for each target protein, respectively. The remaining columns show the lowest energy sampled during each experiment, averaged over 30 runs. The minimum lowest energy over all runs is shown in parentheses. Results for which the mean difference between crossover and mutation only are statistically significant are given in bold (95% confidence according to the Mann-Whitney U test).

                           Lowest Energy Value Sampled
             Random        Mutation       Crossover and Mutation
     PDB Id  Replacement   Only           1-point        2-point        Homologous
 1   1dtdB   43.2(36.2)    -11.3(-38.8)   -19.1(-38.8)   -14.3(-37.7)   -12.7(-34.3)
 2   1isuA   35.9(28.3)    1.1(-13.4)     -0.4(-13.4)    -1.5(-19.2)    -2.9(-15.3)
 3   1c8cA   14.3(2.2)     -40.1(-67.0)   -45.6(-66.5)   -45.6(-64.2)   -45.9(-68.3)
 4   1sap    5.6(-5.2)     -53.7(-76.5)   -58.6(-86.2)   -58.2(-83.2)   -57.8(-78.7)
 5   1hz6A   21.3(-5.8)    -56.9(-81.3)   -63.0(-99.8)   -64.3(-84.8)   -64.7(-89.0)
 6   1wapA   25.8(11.2)    -51.0(-93.3)   -60.8(-86.3)   -55.9(-74.0)   -59.7(-81.5)
 7   1ail    30.7(26.0)    -6.2(-23.3)    -12.6(-25.8)   -13.6(-22.7)   -13.3(-27.3)
 8   1aoy    19.3(5.8)     -28.6(-51.3)   -37.4(-52.5)   -36.4(-48.8)   -41.7(-56.9)
 9   2ezk    22.8(16.9)    -18.7(-29.1)   -21.2(-33.8)   -18.4(-31.3)   -20.8(-31.6)
10   2h5nD   45.2(31.4)    -20.5(-40.1)   -30.4(-48.9)   -23.1(-39.7)   -29.5(-53.0)

Fitness Improvement

Figure 5.6 shows the mean fitness improvement (x-axis) versus the lowest energy (y-axis) reached for each experiment on all proteins. Fitness improvement is defined as the difference between parent energy and child energy (energy is averaged over the two parents in crossover). Figure 5.6 shows a strong linear correlation, particularly at lower energies

reached. This suggests that a more explorative perturbation operator, such as crossover, which is likely to have lower fitness improvement, will ultimately reach lower energy levels than a more exploitative operator, such as mutation.

Figure 5.6: The mean fitness improvement between parents and children is given for each experiment versus the average lowest energy reached. A strong linear correlation is noted, suggesting a more explorative variation operator with lower mean fitness improvement will allow an EA to maintain breadth in search and access lower energies. Axes: Mean Fitness Improvement and Lowest Energy Reached.

Sampling Near-native Conformations

Table 5.2 shows the lowest Cα-RMSD to the known native structure over all conformations sampled during each experiment. The value given is the best result over all 30 runs. Examination of columns 6-9 in Table 5.2 shows no consistent improvement in Cα-RMSD between crossover methods and mutation alone; a difference of less than 0.5Å Cα-RMSD is not considered structurally significant. This suggests that while crossover is able to reach lower energy regions of the conformational space, inaccuracies in the Rosetta score3 energy function prevent it from fully capitalizing on this success with respect to getting

closer to the known native structure.

Table 5.2: The lowest Cα-RMSDs to the known native structure over conformations sampled during each experiment are given. Columns 5-9 report results for experiments performed in this work, while Column 10 shows the lowest Cα-RMSD reported for each protein by the ItFix algorithm [1]. Column 11 shows the Pearson correlation coefficient between energies of sampled conformations and their Cα-RMSD across all experiments.

Column headers: PDB Id; Lowest Cα-RMSD Conformation Found (Å): Random Replacement, Mutation Only, Crossover and Mutation (1 point, 2 point, homologous); Energy to RMSD Correlation.

Column 11 in Table 5.2 quantifies energy function inaccuracies for each protein in terms of the Pearson correlation coefficient between the energy of each conformation and its Cα-RMSD across all experiments for a given protein. In only a single protein (1hz6A) is this correlation above 52%, indicating the difficulty inherent in protein structure prediction as an optimization problem. On 1isuA, 1ail, and 2ezk, where the correlation is below 40%, the random replacement method is actually competitive with crossover and mutation only. The fact that random replacement is able to find low RMSD structures for 1ail and 2ezk

underscores the advantage of employing molecular fragment replacement.

5.3 Conclusions

This chapter addresses the problem of effective sampling at the global level by investigating two complementary approaches common in the evolutionary computation community: mutation and crossover. Both of these operators are challenged by the fact that, in the chosen dihedral representation, a change in parameter space does not necessarily correspond to a similar magnitude change in protein structure or energy. Section 5.1 shows that fragment replacement can be more effectively employed as a mutation operator by biasing fragment selection based on RMSD. Section 5.2 then demonstrates that crossover can enhance sampling of the energy surface when combined with an effective local search. Furthermore, a method for selecting crossover points is proposed to lessen the impact of crossover on local structural features of the sampled conformation. While crossover is a widely employed practice in the evolutionary computation community, the work presented in this chapter represents the first successful employment of crossover for sampling near-native conformations when employing a realistic model of protein structure. The difficulty in successfully employing crossover when modeling proteins underscores the need to combine state-of-the-art domain-specific knowledge with advanced stochastic sampling techniques.

CHAPTER 6: GUIDING SAMPLING WITH MULTIPLE OBJECTIVES

This chapter addresses the question of how to effectively employ energy to discriminate between interesting conformations and noise in the conformational search space. The extent to which energy should be trusted to reach the native structure is currently under debate, and inaccuracies in energy functions are considered primary reasons why ab-initio structure prediction remains challenging [114]. Recent work shows that even state-of-the-art coarse-grained energy functions, including the Rosetta energy function, have non-native energy minima that are lower than the one containing the experimentally-known native structure [55,115]. This is not surprising, as energy functions, particularly those that interface with coarse-grained representations, are known to be inaccurate due to energy terms which are in opposition to each other. For instance, a structural change resulting in gaining a hydrogen bond in a conformation may also place two atoms in positions that are not favorable for their van der Waals interaction.

Some recent studies advocate sacrificing efficiency and doing away with coarse-grained energy functions [116], effectively proposing using all-atom energy functions instead for all components of structure modeling. However, while hypothesized to be more accurate than coarse-grained energy functions, studies that have used all-atom energy functions have shown the presence of significant inaccuracies even among them [64]. The challenge of dealing with errors inherent in energy functions with conflicting terms seems to represent an impasse in computational structural biology.

This chapter proposes to change the approach in which energy is employed to guide the search at a global level. I propose decomposing the energy function into individual terms through a multi-objective approach which is designed to deal with conflicting energy terms without sacrificing discriminatory power. Preliminary investigations in the context of the author's Ph.D. work

102 have demonstrated the suitability of multi-objective analysis as post-processing filtering of sampled decoy conformations to select a broad range of interesting conformations for further refinement by ab-initio protocols [117, 118]. In this thesis the multi-objective approach is integrated in the search itself, and is presented here in the context of the HEA proposed and described in chapter 4. This chapter refers to this algorithmic realization of the proposed framework as Multi-Objective (hybrid) Evolutionary Algorithm (MOEA). The MOEA proposed here is guided by the Pareto scoring metrics, such as Pareto rank and Pareto count, rather than the total energy of a conformation. Essentially, prior to adding a conformation to the population, the algorithm decomposes the energy of a conformation into various terms. The values of these terms are compared to those of other conformations maintained in an archive, and then a decision is made on whether to add the conformation to the population. The MOEA is described in further detail in section 6.1. Section 6.2 compares the MOEA to the HEA described in chapter 4 which is guided by energy instead. Section 6.2 explores different techniques to employing the Pareto scoring metrics in the proposed MOEA, including a novel approach to employing the Pareto count to increase conformational diversity. Results show that the use of a multi-objective approach is able not only to sample conformations with energy as low or lower than the HEA on nearly all target protein systems, but for many of these systems the MOEA also samples significantly more near-native conformations than the HEA. 6.1 A Multi-Objective (hybrid) Evolutionary Algorithm (MOEA) As indicated above, the MOEA employed in this chapter extends the HEA described in chapter 4. All details of the MOEA are identical to that of the HEA except for how conformations are retained in the population from one generation to the next. 
The MOEA makes use of a Pareto archive to score conformations according to Pareto rank and Pareto count.

Pareto rank and Pareto count are then used to select which parent and child conformations are retained in the population.

Decomposing an Energy Function for Multiple Objectives

The Rosetta score4 energy function is the weighted sum of 13 distinct energy terms which measure a diverse set of properties including electrostatic interactions, geometry, and similarity to known protein structures (see chapter 3 for more details on the Rosetta energy function). Multi-objective analysis is most effective when employing only a few separate objectives. Therefore, the score4 energy function is decomposed into three separate terms: short-range hydrogen bonding, long-range hydrogen bonding, and a third term which sums together the remaining 11 terms. This categorization of energy terms is based on studies suggesting that hydrogen bonding is especially important in identifying near-native conformations [55]; however, the additional energy terms are also important in guiding the search towards near-native regions of the search space. The proposed decomposition allows the search to simultaneously optimize for conformations which are low in total energy and contain favorable local and non-local hydrogen bonding.

Pareto Dominance and Multi-objective Scoring Metrics

The multi-objective scoring metrics employed in this chapter are based on the concept of Pareto dominance. A conformation C_i is said to dominate a conformation C_j when every energy term in C_i is lower than the corresponding term in C_j. If there is no conformation in ensemble Ω that dominates C_j, then C_j is said to be non-dominated. Conformations in the non-dominated ensemble, referred to as the Pareto front, are considered equivalent with respect to a multi-objective analysis. Figure 6.1 illustrates Pareto dominance with two objectives. Membership in the Pareto front is a binary state, so additional metrics are necessary for a more granular ranking of conformations.
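The decomposition and the strong-dominance test described above can be sketched as follows. This is only an illustrative sketch, not the thesis implementation: the function names and the term labels `hbond_sr_bb` and `hbond_lr_bb` are assumptions chosen for the example, and the input is assumed to be a dictionary of already-weighted energy terms.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed labels for the two hydrogen-bonding terms (hypothetical for this sketch).
SHORT_RANGE_HB = "hbond_sr_bb"
LONG_RANGE_HB = "hbond_lr_bb"


def decompose(weighted_terms: Dict[str, float]) -> Tuple[float, float, float]:
    """Collapse weighted energy terms into three objectives:
    short-range H-bonding, long-range H-bonding, and the sum of all other terms."""
    short_range = weighted_terms[SHORT_RANGE_HB]
    long_range = weighted_terms[LONG_RANGE_HB]
    rest = sum(value for name, value in weighted_terms.items()
               if name not in (SHORT_RANGE_HB, LONG_RANGE_HB))
    return short_range, long_range, rest


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Strong Pareto dominance: every term of a is strictly lower than in b."""
    if len(a) != len(b):
        raise ValueError("conformations must have the same number of objectives")
    return all(x < y for x, y in zip(a, b))


def pareto_front(ensemble: List[Sequence[float]]) -> List[int]:
    """Indices of the conformations that no other conformation dominates."""
    return [j for j, c_j in enumerate(ensemble)
            if not any(dominates(c_i, c_j)
                       for i, c_i in enumerate(ensemble) if i != j)]


if __name__ == "__main__":
    ensemble = [(-3.0, -1.0, -10.0), (-2.0, -2.0, -12.0),
                (-1.0, -0.5, -8.0), (-3.5, -1.5, -11.0)]
    print(pareto_front(ensemble))
```

Because `pareto_front` only answers whether a conformation is in the front or not, it cannot order conformations any further; finer-grained metrics are needed for that.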
Two such metrics are the Pareto rank and Pareto count of a conformation. The Pareto rank of C_i measures the number of other conformations that dominate C_i, where a Pareto rank of 0 indicates membership in the Pareto front. The Pareto count of C_i measures the number of other conformations that C_i dominates.

Figure 6.1: Conformations are plotted with respect to two energy terms, E_1 and E_2. Conformations represented by empty blue circles are non-dominated and form the Pareto front. C_2 strongly dominates 4 conformations and weakly dominates 1 additional conformation; thus the Pareto count of C_2 is 4 for strong Pareto dominance and 5 for weak Pareto dominance (strong Pareto dominance is employed in this thesis). [Plot labels: conformations C_1 to C_5; regions marked "strongly dominated by C_2" and "weakly dominated by C_2".]

Pareto Archive in MOEA

The MOEA maintains an archive of every conformation sampled during the search in order to compute the multi-objective metrics described above. The archive stores the current Pareto rank and count for each conformation as well as the energy terms employed in the multi-objective analysis. When a new child conformation is sampled, it is compared to every existing member of the archive to compute its Pareto rank and count and to update these metrics for all existing members of the archive. This approach allows the MOEA to maintain a more global view of the search space through the multi-objective archive than the view provided by only looking at the current members of the population.

Maintaining the multi-objective archive does add computational time to the MOEA; however, this overhead is small compared to the total runtime of the search algorithm. For the current implementation, the time complexity of maintaining the archive is quadratic with respect to the number of conformations sampled during the search. In practice, when run for 10 million energy function evaluations, this adds less than 10% to the total runtime of the MOEA, as the vast majority of computational time is still consumed in the local search component. Furthermore, more advanced implementations of such an archive are able to lower the time complexity of maintaining the archive. Implementing this more advanced archive is out of the scope of this thesis.

Population Selection

The MOEA employs a truncation selection method similar to that of the HEA described in chapter 4. However, the MOEA ranks conformations according to Pareto rank and Pareto count as well as total energy. Recall that in each generation, the parent conformations, P, in the current population produce a set of child conformations, C, where |P| = |C|. The parent conformations are then combined with the child conformations, and the set P ∪ C is sorted in ascending order. The top |P| conformations from the sorted list are then retained as parents for the next generation. In the HEA, the set P ∪ C is sorted by total energy. In the MOEA, sorting is done by Pareto rank, Pareto count, and total energy, in that order.

The highest rank in the population is given to conformations with a Pareto rank of 0, which are members of the Pareto front. However, it is likely that in a given generation the majority of P ∪ C will be in the Pareto front. In this case, the Pareto count is used as a secondary ranking to distinguish between two conformations with the same Pareto rank. Pareto count is used to bias selection away from heavily sampled regions of the Pareto front. The idea is that if a conformation in the Pareto front has a low Pareto count, then it is more likely that the search will be able to improve upon this conformation by finding a new conformation that dominates it. On the other hand, if the conformation has a high Pareto count, then this suggests that many attempts have already been made to improve upon this conformation.

Another difference between the HEA and the MOEA is that, while the HEA uses an elitism rate of 25% to encourage exploration in the search, the MOEA uses an elitism rate of 100%, where all of the parent population is allowed to compete against the child population. This is done to ensure that the greatest portion of the population will be taken from the Pareto front, since it is unlikely that many of the child conformations will have a Pareto rank of 0. Since parents can fall out of the Pareto front as new children are added, the MOEA does not suffer from the same problem of convergence that the HEA does with an elitism rate of 100%.
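The archive update and the truncation selection described in this section can be sketched as below. This is a minimal sketch under stated assumptions, not the thesis code: the names `Entry`, `ParetoArchive`, and `select` are hypothetical, strong dominance is assumed as in the rest of the chapter, and the `use_count` flag is an assumed way to switch between the variant that includes Pareto count and the one that ranks by Pareto rank and total energy only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Strong Pareto dominance: every term of a is strictly lower than in b."""
    return all(x < y for x, y in zip(a, b))


@dataclass
class Entry:
    terms: Tuple[float, ...]  # decomposed energy terms (the objectives)
    total: float              # total energy of the conformation
    rank: int = 0             # how many archived conformations dominate this one
    count: int = 0            # how many archived conformations this one dominates


class ParetoArchive:
    """Keeps every sampled conformation and its current Pareto rank and count."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def add(self, terms: Sequence[float], total: float) -> Entry:
        new = Entry(tuple(terms), total)
        # One pass over the archive per insertion, hence quadratic cost overall.
        for old in self.entries:
            if dominates(old.terms, new.terms):
                new.rank += 1
                old.count += 1
            elif dominates(new.terms, old.terms):
                new.count += 1
                old.rank += 1
        self.entries.append(new)
        return new


def select(parents: List[Entry], children: List[Entry],
           use_count: bool = True) -> List[Entry]:
    """Truncation selection with 100% elitism: sort parents and children together
    in ascending order and keep as many as there were parents."""
    pool = parents + children
    if use_count:
        key = lambda e: (e.rank, e.count, e.total)
    else:
        key = lambda e: (e.rank, e.total)
    return sorted(pool, key=key)[:len(parents)]
```

Sorting on the tuple `(rank, count, total)` means Pareto count only matters between conformations of equal Pareto rank, and total energy only breaks the remaining ties, which mirrors the ordering stated in the text.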

6.2 Analysis of MOEA

This section compares the ability of the MOEA to sample low-energy, near-native conformations to that of the HEA described in chapter 4. In addition, the effect of including the Pareto count to rank the population is explored. The MOEA-PC method uses Pareto count in the ranking, while the MOEA uses only Pareto rank and total energy. The set of target proteins and experimental procedures are as described in chapter 3. As detailed above in section 6.1, the MOEA employs an elitism rate of 100%, while the HEA employs an elitism rate of 25%. As laid out in chapter 3, analysis is done both on the ability to sample low-energy conformations and on the ability to sample near-native conformations.

Sampling low-energy conformations

Table 6.1 shows the lowest energy sampled for the HEA, MOEA, and MOEA-PC on each of the target protein systems. Examining columns 5 to 7 reveals that, for the majority of proteins (highlighted in bold), the MOEA and MOEA-PC, on average, reach a significantly lower energy level (lower by more than 2 energy units) than the HEA. The HEA, on the other hand, only reaches a lower energy than both of the multi-objective methods on two proteins (also highlighted in bold). These results suggest that the use of the multi-objective archive is able to more effectively guide the search towards lower energy levels than employing total energy alone. However, these results do not show a clear improvement from including the Pareto count term in ranking members of the population.

Sampling near-native conformations

Table 6.2 shows the minimum RMSD to native sampled and the percentage of conformations sampled which are below 5Å RMSD to native for the HEA, MOEA, and MOEA-PC across all five independent runs. Examination of columns 5 to 7 reveals that none of the

Table 6.1: The average minimum Rosetta score4 energy sampled across all independent runs is given for the HEA, MOEA, and MOEA-PC (the minimum across all five runs is given in parentheses). Energies highlighted in bold represent a significant improvement in minimum energy sampled by MOEA and MOEA-PC over the HEA.

                                  Lowest Rosetta Score4 Energy
     PDB Id  Size  Fold Topology  HEA             MOEA            MOEA-PC
 1   1bq9    53    α/β            -45.6(-50.5)    -43.3(-45.8)    -47.8(-55.1)
 2   1dtdB   61    α/β            -49.0(-55.0)    -67.3(-74.5)    -63.2(-76.6)
 3   1isuA   62    α/β            -40.0(-46.5)    -42.6(-48.4)    -55.0(-76.7)
 4   1c8cA   64    α/β            -78.0(-86.4)    -89.9(-98.4)    -86.7(-101.5)
 5   1sap    66    α/β            (-121.4)        (-120.1)        (-109.8)
 6   1hz6A   67    α/β            (-130.9)        (-135.6)        (-134.8)
 7   1wapA   68    β              (-132.5)        (-117.5)        (-121.0)
 8   1fwp    69    α/β            -70.9(-84.4)    -85.0(-92.8)    -74.8(-81.7)
 9   1ail    70    α              -52.1(-56.1)    -57.5(-67.1)    -54.8(-71.1)
10   1dtjA   76    β              -71.5(-82.2)    -80.1(-97.4)    -74.1(-89.8)
11   1aoy    78    α/β            -92.1(-98.1)    -95.4(-102.0)   -94.8(-102.3)
12   1cc5    83    α              -62.3(-68.6)    -61.1(-67.8)    -58.8(-67.5)
13   2ci2    83    α/β            -95.7(-109.8)   -92.6(-105.7)   -91.5(-102.4)
14   1tig    88    α/β            (-128.0)        (-151.7)        (-136.1)
15   2ezk    93    α              -94.6(-100.7)   -91.0(-93.4)    -95.3(-101.1)
16   1hhp    99    β              -82.6(-104.5)   -80.2(-97.3)    -74.0(-96.0)
17   3gwl    106   α              -88.4(-100.0)   -90.2(-95.3)    -82.7(-85.2)
18   2hg6    106   α/β            -90.9(-102.6)   -89.4(-95.7)    -93.7(-107.5)
19   2h5nD   123   α              (-129.0)        (-126.6)        (-131.8)
20   1aly    146   β              -75.0(-81.1)    -99.9(-117.1)   -88.4(-103.6)

algorithms show a consistent improvement in terms of the single lowest-RMSD conformation sampled. However, the single lowest-RMSD structure is typically an outlier, and it is difficult to effectively identify such outliers for later-stage refinement. Examination of columns 8 to 10 reveals that at least 1% of the conformations sampled by the MOEA-PC are near-native (below 5Å RMSD) for 6 proteins, while the MOEA and HEA only meet this threshold for 4 and 3 proteins, respectively. This suggests that the MOEA-PC is able to more effectively steer the search towards near-native regions of the search space.

Figures 6.2 and 6.3 show the frequency of sampling of near-native conformations in greater detail for the 12 proteins where conformations below 5Å RMSD to native are sampled (PDB id 1cc5 is omitted since the number of samples below 5Å RMSD is minimal). Examination of figures 6.2 and 6.3 reveals that, in all but two cases, both the MOEA and MOEA-PC sample near-native conformations as often as or more often than the HEA, and in many of these cases the difference is dramatic.

Figures 6.4, 6.5, and 6.6 examine three representative proteins in greater detail. The left side of each figure plots the energy versus RMSD to native for each conformation sampled across all five independent runs. The right side plots the energy versus RMSD to native for each member of the population for a given generation. On the right, conformations are colored based on how many generations they have remained in the population, and only the results for a single run are shown (the run which achieved the lowest RMSD to native).

Figure 6.4 shows PDB id 1ail, an example where there is little correlation between total energy and RMSD to the native structure. All three algorithms achieve very low minimum RMSDs, below 2Å. However, the MOEA-PC more effectively steers its population towards low-RMSD structures, allowing it to sample significantly more near-native structures than the HEA or MOEA.
While the HEA samples lower-RMSD structures early on in the search, these are soon discarded for lower-energy structures. The MOEA and MOEA-PC, on the other hand, are able to effectively use the decomposed energy function to steer

Table 6.2: Columns 5-7 give the minimum RMSD sampled across all independent runs of the HEA, MOEA, and MOEA-PC. Columns 8-10 give the percentage of near-native conformations sampled for each algorithm, with values above 1% highlighted in bold.

Column headers: PDB Id | Size | min Cα-RMSD (Å): HEA, MOEA, MOEA-PC | % less than 5 Å Cα-RMSD: HEA, MOEA, MOEA-PC
[Table body: one row per target protein, 1bq9 through 1aly; numeric entries not preserved.]

Figure 6.2: The frequencies of RMSD to native sampling are shown for the 12 protein systems where near-native conformations (below 5Å RMSD) are achieved. The HEA, MOEA, and MOEA-PC are represented as a solid black line, dotted blue line, and solid purple line, respectively. Each panel plots Frequency against Cα RMSD to the native structure (Å). Panels: (a) 1bq9, 53 aas, α/β; (b) 1dtdB, 61 aas, α/β; (c) 1c8cA, 64 aas, α/β; (d) 1sap, 66 aas, α/β; (e) 1hz6A, 67 aas, α/β; (f) 1fwp, 69 aas, α/β.

Figure 6.3: The frequencies of RMSD to native sampling are shown for the 12 protein systems where near-native conformations (below 5Å RMSD) are achieved. The HEA, MOEA, and MOEA-PC are represented as a solid black line, dotted blue line, and solid purple line, respectively. Each panel plots Frequency against Cα RMSD to the native structure (Å). Panels: (a) 1ail, 70 aas, α; (b) 1dtjA, 76 aas, β; (c) 1aoy, 78 aas, α/β; (d) 2ci2, 83 aas, α/β; (e) 1tig, 88 aas, α/β; (f) 2ezk, 93 aas, α.

the search towards near-native structures. In the case of the MOEA-PC, the focus on sampling a more diverse set of structures allows it to quickly identify the near-native portion of the search space.

In the case of PDB id 1hz6A, the energy surface is well funneled, and figure 6.5 shows that all three algorithms steer their population towards low-RMSD regions of the energy surface. However, it is interesting to note that in this case the MOEA and MOEA-PC reach near-native structures much more rapidly than the HEA.

The third case examined in detail, 1dtjA, represents a common case where there are many deep local minima which make it difficult to narrow in on the global minimum corresponding to the native state. Figure 6.6 shows that, using only total energy, the HEA is not able to find the global minimum, while the MOEA is able to follow multiple search paths and more effectively narrow in on the native basin. In this case, the use of the Pareto count seems to hinder the MOEA-PC, as it quickly narrows in on a low-RMSD basin but does not find the lower-energy, lower-RMSD basin discovered by the MOEA.

6.3 Conclusions

This chapter provides an answer to the question of how energy can be better employed to guide conformational search. These results show that there is a distinct advantage to decomposing an energy function and employing multi-objective analysis to guide the search. The proposed MOEA is able not only to find lower-energy conformations but also to sample significantly more near-native conformations than the HEA. The use of the Pareto count to encourage further diversity in the search can provide a dramatic improvement in sampling of near-native conformations in specific cases; however, further investigation is needed to determine when applying Pareto count is most effective.

This chapter represents an initial foray into a multi-objective characterization of the protein energy surface for conformational search.
The field of evolutionary computation is rich in techniques to enhance multi-objective optimization. Approaches from popular


Figure 6.4: On the left side, each conformation sampled across all independent runs by the HEA (top), MOEA (middle), and MOEA-PC (bottom) is plotted with respect to total energy and RMSD to the native structure. On the right, only conformations actually retained in the population are shown for a single run. Here a 3rd dimension (generation) and 4th dimension (age of the conformation) provide a more detailed view of how the conformational space is explored. Panels: (a), (b) HEA; (c), (d) MOEA; (e), (f) MOEA-PC; all for 1ail, 70 aas, α.


Figure 6.5: On the left side, each conformation sampled across all independent runs by the HEA (top), MOEA (middle), and MOEA-PC (bottom) is plotted with respect to total energy and RMSD to the native structure. On the right, only conformations actually retained in the population are shown for a single run. Here a 3rd dimension (generation) and 4th dimension (age of the conformation) provide a more detailed view of how the conformational space is explored. Panels: (a), (b) HEA; (c), (d) MOEA; (e), (f) MOEA-PC; all for 1hz6A, 67 aas, α/β.


Figure 6.6: On the left side, each conformation sampled across all independent runs by the HEA (top), MOEA (middle), and MOEA-PC (bottom) is plotted with respect to total energy and RMSD to the native structure. On the right, only conformations actually retained in the population are shown for a single run. Here a 3rd dimension (generation) and 4th dimension (age of the conformation) provide a more detailed view of how the conformational space is explored. Panels: (a), (b) HEA; (c), (d) MOEA; (e), (f) MOEA-PC; all for 1dtjA, 76 aas, β.

algorithms, such as NSGA2 and PAES [119, 120], for more explicitly maintaining diversity in the Pareto front can be investigated and compared to the Pareto-count-based method proposed here.

The greatest challenge in applying multi-objective analysis to the energy surface produced by the Rosetta score4 energy function is selecting how to most effectively decompose the 13 terms in the energy function. Initial investigations showed that, while the hydrogen bonding terms contain the most discriminatory power between low-energy decoy conformations, inclusion of the other energy terms as a third aggregate term was more effective than employing hydrogen bonding alone. Further study may be needed to investigate which terms are most useful at the global level and which terms are useful only during local search. For example, the van der Waals term is important in filtering out physically unrealistic atomic structures (commonly generated during fragment replacement) but has limited discriminatory power between low-energy decoy conformations.

CHAPTER 7: BRINGING IT ALL TOGETHER WITH EXTERNAL VALIDATION

Many studies investigating EAs for ab-initio protein structure prediction show that these algorithms fail to be competitive with state-of-the-art trajectory-based methods in the computational biology community. In light of this, I choose a representative method, currently considered among the top-performing ones for ab-initio structure prediction in the computational biology community, against which to benchmark the techniques proposed in chapters 4-6. I refer to this representative method as ClassicRosetta and detail it below.

7.1 Algorithmic Realizations of Proposed Framework for Comparison to ClassicRosetta

I propose a novel EA that combines the techniques and findings laid out in chapters 4-6 into a hybrid multi-objective genetic algorithm (MOGA), all the while incorporating a state-of-the-art energy function (Rosetta) and fragment library. The combined algorithm is benchmarked against ClassicRosetta, which employs the same energy function and fragment library, thus allowing for an appropriate comparison. MOGA combines the MOEA-PC algorithm presented in chapter 6 with one-point genetic crossover (this move was introduced in chapter 5). Other than the additional crossover operator, all details of MOGA are identical to those of the MOEA-PC in chapter 6.

Analysis of MOGA suggests some simple enhancements to the HEA proposed in chapter 4 to improve its ability to sample coarse-grained decoy conformations near the protein native state; the enhanced algorithm is referred to as HEA* from now on. To create more diversity in the search space, the population size in HEA* is increased to 500, and the greedy local search is extended such that each local search first converges to a minimum using fragments of length 9 before further refining with fragments of length 3 until convergence is reached again.

The results presented here show that MOGA is able to reach significantly lower-energy conformations on a diverse set of target protein systems than Rosetta, demonstrating the ability of the hybrid evolutionary framework to effectively navigate the protein energy surface. The enhanced sampling in MOGA results not only in finding many near-native conformations, but also in identifying native and non-native basins in the protein energy surface. In addition, I show that HEA* can sample near-native conformations as effectively as Rosetta after some simple parameter tuning of the HEA. Indeed, the results show that the tuned HEA often finds structures with lower minimum RMSD to native.

These results are significant, as this dissertation represents only an initial investigation into how approaches from evolutionary computation can best leverage domain-specific components to effectively sample the protein conformational space. The development of a modular evolutionary search framework for protein structure prediction allows future studies to conduct more detailed investigations into how specific components can be best optimized. Chapter 8 lays out the basic components of the evolutionary search framework and how it can be easily modified for future investigations, even in different application settings.

7.2 ClassicRosetta: Coarse-grained Sampling in the Rosetta Abinitio Protocol

The Rosetta protocol is consistently a top contender in the biennial CASP competition [121] and is the focus of many independent studies [47, 49, 53]. The protocol consists of a multistage trajectory-based MMC search which begins at a fully extended conformation and produces a single decoy conformation as the proposed native structure. In practice, the Rosetta protocol is run thousands of times as part of a multi-start algorithm.
This results in an ensemble of decoy conformations similar to the ensemble sampled by the evolutionary search framework proposed in this dissertation.

For the purpose of benchmarking, the classic Rosetta protocol is employed instead of the downloadable version of Rosetta. The primary reason for doing so is that the downloadable version is not well documented and is still in a testing phase, whereas the classic version is mature, studied by many researchers in a comparative setting, and well understood in terms of its capabilities in ab-initio structure prediction. ClassicRosetta represents a state-of-the-art coarse-grained sampling algorithm which takes advantage of the same representation, fragment libraries, and energy functions which are available and incorporated in the evolutionary search framework proposed in this thesis.

What I refer to as ClassicRosetta is actually just the classic implementation of the coarse-grained sampling in the full Rosetta protocol, which I have chosen for comparison in this chapter. ClassicRosetta is an adaptive MMC search split into 4 substages, where the lowest-energy conformation sampled in a substage is used as the starting conformation for the following substage. The lowest-energy conformation sampled in substage 4 is the one ranked with the score4 energy function and returned as a proposed decoy conformation. The parameters that differ among the substages are the energy function weights, fragment length, and number of MMC iterations. Table 7.1 gives the specific parameters for each of the four substages. The weights for each of the Rosetta scoring functions are given in chapter 3.

Each substage employs an MMC search with an adaptive temperature schedule. The adaptive temperature schedule allows Rosetta to probe deep into minima while preventing it from getting stuck in a particular local minimum; temperature is increased after a number of successive failed move attempts.
As in a standard MMC search, each move is accepted with a probability given by the Metropolis criterion, $p = \exp(-\Delta E / T)$. Here $p$ is the probability of acceptance, $\Delta E$ is the difference in energy (proposed minus current), and $T$ is a unit-less measure of effective temperature, which serves to scale the change in energy. The value of $T$ starts at 2 and is increased by 1 after 150 consecutive failed moves. Once a move has been accepted, $T$ is reset to 2. This is a rather simple temperature schedule compared to the more sophisticated adaptive temperature schedules in the literature [57, ].

Substage 1 selects a random initial conformation to ensure a diverse ensemble of decoy conformations when employing a multi-start approach to produce thousands of decoy conformations. By employing score0, substage 1 effectively samples a random conformation from the fragment library (fragments of length 9 are employed) that is collision-free. The energy function employed in substage 2 then biases fragment replacements to form secondary structural motifs. Substage 3 is the longest, and its purpose is to narrow in on an energy basin. Finally, substage 4 switches to using shorter fragments of length 3 to optimize the decoy conformation, though still at a coarse-grained level of detail. Together, these four substages perform 36,000 individual MMC moves, corresponding to the same number of energy function evaluations.

Table 7.1: The parameters for each of the 4 substages of the Rosetta coarse-grained ab-initio protocol are given.

Protocol   Energy    Fragment   Nr. MMC Moves/
Substage   Weights   Length     Energy Evaluation
1          score0    9          2,000
2          score1    9          2,000
3          score2    9          20,000
4          score3    3          12,
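The Metropolis criterion and the simple adaptive temperature schedule described in this section can be sketched as follows. This is an illustrative sketch rather than Rosetta's own code: the class and function names are hypothetical, and the choice to restart the failure counter after each temperature increase is an assumption, since the text only states that the temperature is raised by 1 after 150 consecutive failed moves and reset to 2 on acceptance.

```python
import math
import random


def metropolis_accept(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Accept with probability exp(-delta_e / T); downhill moves are always accepted.
    delta_e is the proposed energy minus the current energy."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


class AdaptiveTemperature:
    """Temperature starts at t_init, rises by t_step after `patience` consecutive
    rejected moves, and is reset to t_init as soon as a move is accepted."""

    def __init__(self, t_init: float = 2.0, t_step: float = 1.0, patience: int = 150):
        self.t_init = t_init
        self.t_step = t_step
        self.patience = patience
        self.value = t_init
        self.failures = 0

    def update(self, accepted: bool) -> None:
        if accepted:
            self.value = self.t_init
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.patience:
            self.value += self.t_step
            self.failures = 0  # assumption: the counter restarts after each increase


def mmc_search(energy, propose, start, n_moves: int, rng: random.Random):
    """Generic MMC loop; `energy` and `propose` are caller-supplied callables."""
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    temperature = AdaptiveTemperature()
    for _ in range(n_moves):
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        accepted = metropolis_accept(e_candidate - e_current, temperature.value, rng)
        if accepted:
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
        temperature.update(accepted)
    return best, e_best
```

In a real protocol, `propose` would perform a fragment replacement and `energy` would evaluate the scoring function for the current substage; here both are left as parameters so that the schedule itself stays in focus.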

7.3 Experimental Setup for Comparison of Proposed Framework Against ClassicRosetta

MOGA and an optimized version of the HEA (HEA*) are benchmarked against ClassicRosetta in the same software framework. MOGA and HEA* are each run 5 independent times, each time conducting a total of 10,000,000 energy evaluations. ClassicRosetta is run as a multi-start algorithm for 1500 independent runs, resulting in 54,000,000 total energy evaluations. The slightly higher number of energy evaluations for Rosetta is warranted, as substage 1 in ClassicRosetta can typically be terminated early after only a few hundred evaluations.

7.4 Analysis of Sampling of Low-energy Conformations

The first comparison of MOGA, HEA*, and ClassicRosetta focuses on energies. Table 7.2 lists the lowest energy level sampled for each protein system by ClassicRosetta, MOGA, and HEA* in columns 5, 6, and 7, respectively. Comparison of columns 5 and 6 reveals that MOGA reaches significantly lower-energy conformations for 17 out of the 20 target protein systems (highlighted in bold). HEA*, while not optimized for finding low-energy structures, still reaches lower-energy conformations than ClassicRosetta on 15 out of 20 proteins. These results confirm that the enhanced sampling capability in MOGA and HEA* results in probing lower-energy regions of the protein conformational space. This result makes the case that, as optimization techniques, MOGA and HEA* are more powerful than ClassicRosetta.

Figure 7.1 illustrates the ability of MOGA not only to reach lower energy levels but also to capture non-native energy basins. Figure 7.1 shows the ensemble of local minima sampled by MOGA on the left and the ensemble of decoy conformations sampled by ClassicRosetta on the right. The ensembles are shown in terms of the Cα-RMSD to native of each conformation plotted against its Rosetta score4 energy value. The use of short greedy local search not only allows MOGA to sample more densely than ClassicRosetta, but the combination of global

Table 7.2: The minimum Rosetta score4 energy sampled across all independent runs is given for ClassicRosetta, MOGA, and HEA*. Energies highlighted in bold represent a significant improvement in minimum energy sampled by MOGA and HEA* over ClassicRosetta.

Column headers: PDB Id | Size | Fold Topology | Lowest Rosetta Score4 Energy: ClassicRosetta, MOGA, HEA*
[Table body: one row per target protein, 1bq9 through 1aly; numeric entries not preserved.]

and local search together allows MOGA to more effectively probe native and non-native energy basins.

7.5 Analysis of Sampling Near-native Conformations

Table 7.3 further compares the coarse-grained decoy sampling ability of MOGA and HEA* to that of Rosetta. ClassicRosetta is highly optimized for coarse-grained decoy generation and, as shown in table 7.3, has impressive near-native sampling on proteins under 100 amino acids in length. While MOGA achieves a similar minimum RMSD to the native structure as ClassicRosetta, it does not have the same near-native decoy sampling capability. The simple optimizations in HEA*, however, do make the evolutionary framework more competitive against ClassicRosetta with respect to coarse-grained decoy sampling. Comparing columns 5 and 7 in table 7.3 reveals that HEA* is actually able to reach significantly lower-RMSD conformations for 8 of the 20 target proteins (highlighted in bold). These results emphasize the ease with which evolutionary search frameworks can be adapted to specific problem domains.

7.6 Unbiased Comparison of Proposed Framework to ClassicRosetta on Testing Set of Protein Sequences

As a final validation, MOGA is compared to ClassicRosetta on an additional set of proteins which have not previously been employed to test the techniques, or particular effective combinations of them, during the design of the proposed evolutionary search framework. Examination of Table 7.4 shows that MOGA is able to reach significantly lower energy levels on the largest 4 (out of 6) proteins in the comparison set. Furthermore, MOGA achieves near-native conformations (below 5Å RMSD) on all but the largest protein, including two proteins greater than 100 amino acids in length.
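The two near-native sampling statistics used throughout these comparisons, the minimum RMSD to native and the percentage of sampled conformations below 5Å RMSD, can be computed from a list of per-conformation RMSD values as in the sketch below. The function name and the strict "below threshold" comparison are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def near_native_stats(rmsds: Sequence[float], threshold: float = 5.0) -> Tuple[float, float]:
    """Return (minimum RMSD, percentage of conformations with RMSD below threshold)."""
    if not rmsds:
        raise ValueError("at least one sampled conformation is required")
    below = sum(1 for r in rmsds if r < threshold)
    return min(rmsds), 100.0 * below / len(rmsds)


if __name__ == "__main__":
    lowest, percent = near_native_stats([2.0, 4.9, 5.0, 7.5])
    print(lowest, percent)
```

Reporting the percentage alongside the minimum matters because, as noted in chapter 6, the single lowest-RMSD conformation is typically an outlier.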


Figure 7.1: The ensemble of sampled decoy conformations is plotted for MOGA (on the left) and ClassicRosetta (on the right) for three representative protein systems. Points are plotted with respect to the Cα-RMSD to native and Rosetta score4 energy. Panels: (a) MOGA and (b) Rosetta for 1hz6A, 67 aas, α/β; (c) MOGA and (d) Rosetta for 1dtjA, 76 aas, β; (e) MOGA and (f) Rosetta for 2ezk, 93 aas, α.
