Prediction of the structures of proteins with the UNRES force field, including dynamic formation and breaking of disulfide bonds


Protein Engineering, Design & Selection vol. 17 no. 1 pp. 29–36, 2004
DOI: /protein/gzh003

Cezary Czaplewski 1,2, Stanisław Ołdziej 1,2, Adam Liwo 1,2,3 and Harold A. Scheraga 1,4

1 Baker Laboratory of Chemistry, Cornell University, Ithaca, NY, USA, 2 Faculty of Chemistry, University of Gdańsk, ul. Sobieskiego 18, Gdańsk and 3 Academic Computer Center in Gdańsk TASK, ul. Narutowicza 11/12, Gdańsk, Poland

4 To whom correspondence should be addressed. has5@cornell.edu

The presence of disulfide bonds is essential for maintaining the structure and function of many proteins. The disulfide bonds are usually formed dynamically during folding. This process is not accounted for in present algorithms for protein-structure prediction, which either deduce the possible positions of disulfide bonds only after the structure is formed or assume fixed disulfide bonds during the course of simulated folding. In this work, the conformational space annealing (CSA) method and the UNRES united-residue force field were extended to treat dynamic formation of disulfide bonds. A harmonic potential is imposed on the distance between disulfide-bonded cysteine side-chain centroids to describe the energetics of bond distortion and an energy gain of 5.5 kcal/mol is added for disulfide-bond formation. Formation, breaking and rearrangement of disulfide bonds are included in the CSA search by introducing appropriate operations; the search can also be carried out with a fixed disulfide-bond arrangement. The algorithm was applied to four proteins: 1EI0 (α), 1NKL (α), 1L1I (β-helix) and 1ED0 (α + β). For 1EI0, a low-energy structure with the correct fold was obtained in runs both without and with disulfide bonds; however, it was obtained as the lowest in energy only with the native disulfide-bond arrangement.
For the other proteins studied, structures with the correct fold were obtained as the lowest-energy (1NKL and 1L1I) or low-energy (1ED0) structures only in runs with disulfide bonds, although the final disulfide-bond arrangement was non-native. The results demonstrate that, by including the possibility of formation of disulfide bonds, the predictive power of the UNRES force field is enhanced, even though the disulfide-bond potential introduced here rarely produces disulfide bonds in native positions. To the best of our knowledge, this is the first algorithm for energy-based prediction of the structure of disulfide-bonded proteins without any assumption as to the positions of native disulfides or human intervention. Directions for improving the potentials and the search method are suggested.

Keywords: conformational-space annealing/disulfide bond/global optimization/protein structure prediction/united residue

Introduction

Disulfide bonds occur frequently in many proteins, especially in extracellular soluble globular proteins. These bonds provide stability to the native structure of a protein and may compensate for the absence of a significant hydrophobic core in small proteins (Betz, 1993; Petersen et al., 1999). It is still an unsettled question whether the disulfide bonds are formed before or after the secondary structure elements are formed (Welker et al., 2001). Some disulfide bonds are essential for maintaining the structure and function of the protein, while others can be broken without radically changing the properties of the protein. The addition of a single disulfide bond can cause cooperative, global folding of the entire protein. For example, bovine pancreatic ribonuclease A with two (26–84, 58–110) out of four of the native disulfide bonds has no conformational order, but species with one additional disulfide (65–72 or 40–95) have native-like structure (Lester et al., 1997; Wedemeyer et al., 2000).
Protein folding simulations show that inclusion of disulfide bonds as constraints reduces the conformational space that must be searched (Skolnick et al., 1997; Huang et al., 1999; Abkevich and Shakhnovich, 2000). Most protein structure prediction methods and protein folding simulations do not take into account dynamic disulfide-bond formation during folding. Instead, a particular arrangement of disulfide bonds is applied as a fixed set of constraints. Some studies examine the disruption of native disulfides and the incorporation of novel disulfides and their effects on stability, but use separate simulations, each with a different, fixed arrangement of disulfide bonds (Rey and Skolnick, 1994). Computer modeling of disulfide bonds which can be introduced by protein engineering into various proteins of known structure to increase their stability relative to that of the wild type is fairly common (Zhou et al., 1993; Burton et al., 2000; Dani et al., 2003). A few studies have addressed the important problem of predicting the disulfide bonding state of cysteines in proteins; these include the use of statistical methods (Fiser et al., 1992), a specially optimized threading potential (Dombkowski and Crippen, 2000), neural networks and hidden Markov models (Muskal et al., 1990; Martelli et al., 2002) and methods that combine local context and global information about protein sequences (Fiser and Simon, 2000; Mucchielli-Giorgi et al., 2002).

Saitô and co-workers (Watanabe et al., 1991; Kobayashi et al., 1992) developed a protein folding simulation method with dynamic formation/breaking of disulfide bonds. Their procedure is based on an assumption that folding starts with the formation of secondary structures (α-helices and β-sheets) and then proceeds to assemble them into the tertiary structure. Consequently, the simulation starts from the conformation in which the secondary structure is already formed and other regions are extended.
The search for the conformation of minimum energy is carried out by changing the dihedral angles only in regions other than the secondary structures. Packing of the secondary structures of a polypeptide chain is guided by introduction of appropriate hydrophobic interactions which are responsible for the construction of short-distance local structure. A strict algebraic relation involving the geometry that must be satisfied for two cysteine residues to form a disulfide bond cannot be applied for distances too great to yield a disulfide bond. Consequently, Watanabe et al. used a geometrical graphic representation for the locus of the hydrogen atom of the SH group in the cysteine residue to draw the distributions of the cysteines at the folding stage in which they come close during simulation. They introduced a bonding potential (Equation 1 in Watanabe et al., 1991) between selected pairs of cysteines provided only that the circles representing the possible positions of the hydrogen atoms of the SH groups are face-to-face, thereby making it easy for them to intersect.

Protein Engineering, Design & Selection vol. 17 no. 1 © Oxford University Press 2004; all rights reserved

In this paper, we report an extension of our hierarchical procedure for protein-structure prediction (Liwo et al., 1999a; Pillardy et al., 2001a) to proteins containing disulfide bonds. This extended procedure allows for dynamic formation and breaking of disulfide bonds during the simulations. As in our previous work, an extensive search is carried out at the united-residue level with the UNRES force field and the use of the conformational space annealing (CSA) search method (Lee et al., 1997, 1999; Czaplewski et al., 2003). However, both the UNRES force field and the CSA procedure have been modified in order to treat the possible formation of disulfide bonds.
The method was applied in the following sequence to proteins with all the basic types of secondary structure: an α-helical hairpin stabilized by two disulfide bonds [a fragment of the p8MTCP1 protein; Protein Data Bank (PDB) code 1EI0, 38 residues], whose three-dimensional structure was determined by NMR spectroscopy (Barthe et al., 2000); NK-lysin (PDB code 1NKL, 78 residues), whose structure was also determined from NMR data (Liepinsh et al., 1997) and contains three disulfides and four α-helices; the thermal hysteresis protein isoform YL-1 (PDB code 1L1I, 84 residues), whose three-dimensional structure (in a β-helical form) was determined recently from NMR data (Daley et al., 2002) and contains eight disulfides; and Viscotoxin A3 (PDB code 1ED0, 46 residues), whose structure was determined from NMR data (Romagnoli et al., 2000) as an α/β type protein with three disulfide bonds, two α-helices and two short anti-parallel strands stabilized by one of the disulfide bonds.

Materials and methods

The UNRES force field

In the UNRES model (Liwo et al., 1997a,b, 2001, 2002; Lee et al., 2001; Pillardy et al., 2001b), a polypeptide chain is represented by a sequence of α-carbon (Cα) atoms linked by virtual bonds with attached united side chains (SC) and united peptide groups (p). Each united peptide group is located in the middle between two consecutive α-carbons, with peptide group p_i being located between Cα_i and Cα_i+1. Only these united peptide groups and the united side chains serve as interaction sites, the α-carbons serving only to define the chain geometry (see figure 1 of Liwo et al., 1997a). All virtual bond lengths (i.e. Cα–Cα and Cα–SC) are fixed; the distance between neighboring Cα atoms is 3.8 Å, corresponding to trans peptide groups, while the side-chain angles (α_SC and β_SC) and virtual-bond (θ) and dihedral (γ) angles can vary.
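The chain geometry described above can be illustrated with a minimal Python sketch. It assumes only what the text states (peptide-group sites at the midpoint of consecutive Cα atoms, a fixed Cα–Cα virtual-bond length of 3.8 Å); the function names and the three-residue coordinate trace are hypothetical and are not taken from the UNRES code.

```python
import math

CA_CA_BOND = 3.8  # Angstrom; fixed C-alpha to C-alpha virtual-bond length quoted in the text


def peptide_group_centers(ca_coords):
    """Place each united peptide group p_i midway between C-alpha i and C-alpha i+1."""
    centers = []
    for a, b in zip(ca_coords, ca_coords[1:]):
        centers.append(tuple((x + y) / 2.0 for x, y in zip(a, b)))
    return centers


def virtual_bond_lengths(ca_coords):
    """Distances between consecutive C-alpha atoms (should all equal CA_CA_BOND)."""
    return [math.dist(a, b) for a, b in zip(ca_coords, ca_coords[1:])]


# Hypothetical, fully extended three-residue trace along the x axis (illustration only).
ca_trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
centers = peptide_group_centers(ca_trace)
lengths = virtual_bond_lengths(ca_trace)
```

A chain of N residues therefore carries N − 1 peptide-group sites; the side-chain sites, which depend on the angles α_SC and β_SC, are left out of this sketch.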
The energy of the virtual-bond chain is expressed by the equation:

$$U = \sum_{i<j} U_{SC_i SC_j} + w_{SCp} \sum_{i \neq j} U_{SC_i p_j} + w_{el} \sum_{i<j-1} U_{p_i p_j} + w_{tor} \sum_i U_{tor}(\gamma_i) + w_{tord} \sum_i U_{tord}(\gamma_i, \gamma_{i+1}) + w_b \sum_i U_b(\theta_i) + w_{rot} \sum_i U_{rot}(\alpha_{SC_i}, \beta_{SC_i}) + \sum_{m=2}^{N_{corr}} w_{corr}^{(m)} U_{corr}^{(m)} \qquad (1)$$

The term U_SCiSCj represents the mean free energy of the hydrophobic (hydrophilic) interactions between the side chains, which implicitly contains the contributions from the interactions of the side chain with the solvent (potential of mean force). The term U_SCipj denotes the excluded-volume potential of the side chain–peptide group interactions. The peptide group interaction potential (U_pipj) accounts mainly for the electrostatic interactions (i.e. the tendency to form backbone hydrogen bonds) between peptide groups p_i and p_j. U_tor, U_tord, U_b and U_rot represent the energies of virtual-dihedral angle torsions, double torsions, virtual-bond angle bending and side-chain rotamers, respectively; these terms account for the local propensities of the polypeptide chain. Details of the parameterization of all of these terms are provided in earlier publications (Liwo et al., 1997a,b). Finally, the terms U_corr^(m), m = 1, 2, …, N_corr, are the correlation or multibody contributions from a cumulant expansion (Liwo et al., 2001, 2003) of the restricted free energy (RFE) and the w's are the weights of the energy terms. The multibody terms are indispensable for reproduction of regular α-helical and β-sheet structures.

The UNRES force field has been derived as an RFE function of an all-atom polypeptide chain plus the surrounding solvent, where the all-atom energy function is averaged over the degrees of freedom that are lost when passing from the all-atom to the simplified system. This approach enabled us to derive the U_corr^(m), m = 1, 2, …, N_corr, multibody terms by a generalized cumulant expansion of the RFE developed by Kubo (1962).
The internal parameters of the individual U's were derived by fitting the analytical expressions to the RFE surfaces of model systems (Liwo et al., 2001) or by fitting the calculated distribution functions to those determined from the PDB (Liwo et al., 1997b). The w's (the weights of the energy terms), the internal parameters of the U_corr^(m) energy terms and the mean free energies of side-chain interactions of the U_SCiSCj energy term were optimized by a hierarchical design of the potential-energy landscape (Liwo et al., 2002). The optimization method assumes a hierarchical structure of the energy landscape, which means that the energy decreases as the number of native-like elements in a structure increases, being lowest for structures from the native family and highest for structures with no native-like element. A level of the hierarchy is defined as a family of structures with the same number of native-like elements (or degree of native likeness). Optimization of a potential-energy function is aimed at achieving such a hierarchical structure of the energy landscape by forcing appropriate free-energy gaps between hierarchy levels to place their energies in ascending order from the native to the most unfolded structure. This procedure is different from the method used earlier, in which the energy gap and/or the Z score between the native structure and all non-native structures were maximized, regardless of the degree of native-likeness of the non-native structures (Liwo et al., 1997b; Lee et al., 2001; Pillardy et al., 2001b). 1IGD, an (α+β)-type protein, was used as a training protein for optimization of the internal parameters of the U_corr^(m) energy terms (A. Liwo et al., unpublished data), while the w's and the internal parameters of U_SCiSCj were optimized by using a set of four proteins (PDB codes 1E0G, 1E0L, 1GAB and 1IGD) (S. Ołdziej et al., unpublished data).

The UNRES force field is able to predict the structures of proteins containing both α-helical and β-sheet structures with a reasonable degree of accuracy, as assessed by tests on model proteins (Lee et al., 1999; Liwo et al., 1999b; Pillardy et al., 2001a) and also in the CASP3 (Lee et al., 1999, 2000; Orengo et al., 1999), CASP4 (Pillardy et al., 2001a) and CASP5 (Czaplewski et al., 2002) blind prediction experiments.

In order to describe the energetics of disulfide bonds, for the pair of half-cystines that forms a disulfide bond, we replace the U_SCiSCj energy term of Equation 1 by the following function:

$$E_{Cys_i Cys_j} = E_{\mathrm{Cys-Cys}} + 0.5\, k_{\mathrm{Cys-Cys}} \left( d_{Cys_i Cys_j} - d_{\mathrm{Cys-Cys}} \right)^2 \qquad (2)$$

It should be noted that d_CysiCysj is the distance between the centers of cysteine side chains and not the distance between the sulfur atoms of the bond. E_Cys–Cys = −5.5 kcal/mol is the energy of formation of a non-strained disulfide bond from two half-cystine residues. This energy has been estimated on the basis of the energy of formation of a single disulfide bond in proteins that has been measured experimentally to be −3.5 kcal/mol (Doig and Williams, 1991) and the energy of non-bonded interactions between cysteine side chains estimated from the Miyazawa–Jernigan cysteine–cysteine contact energy (Miyazawa and Jernigan, 1985) on the basis of Equation 3 of Liwo et al. (1993) to be −2.0 kcal/mol.
The values of d_Cys–Cys = 4.2 Å and k_Cys–Cys = 6.6 kcal/(mol Å²) were estimated on the basis of the average distance between cysteine side-chain centroids in disulfide bonds calculated from ECEPP/3 geometry (Némethy et al., 1992) and the ECEPP/3 torsional constants of the Cβ–Sγ–Sγ–Cβ dihedral angle (Némethy et al., 1992).

The conformational space annealing method

CSA (Lee et al., 1997, 1999, 2000; Czaplewski et al., 2003) is a hybrid method which combines genetic algorithms, essential aspects of the build-up method and a local gradient-based minimization. The method is based on the idea of conformational space annealing: in the early stages, it enforces a broad conformational search and then gradually focuses the search into smaller regions with low energy. The CSA searching method allows one to focus on many different groups of low-energy protein structures, one of which is presumably the native structure. The CSA method begins with a randomly generated population of conformations which are energy minimized to generate the first bank of conformations. From the initial population, a number of conformations (called seeds) are selected as parents for the trial population. These 'seed' conformations are altered in a non-random fashion to create new trial conformations. As in any genetic algorithm, the trial population is generated by the use of genetic operators: mutations and crossovers. Attention is paid to ensure that all trial conformations are significantly different from each other and from the parent conformations. After generation, all trial conformations are energy minimized. The next step of the CSA algorithm is the update of the current population (the bank) without increasing its size. Each trial conformation is compared with each existing conformation of the bank. If the trial conformation is similar to an existing conformation of the bank, only the lower energy conformation of these two is preserved.
If the trial conformation is not similar to any existing conformation in the bank, it represents a new distinct region of conformational space. It then replaces the highest energy conformation in the bank if its energy is lower than the highest energy in the bank; otherwise it is discarded. The distance between conformations i and j is defined in terms of the differences of their virtual-bond angles and virtual-bond dihedral angles (Equation 9 of Lee et al., 2000). If the distance, D_ij, is less than or equal to some predefined cutoff value, D_cut, conformations i and j are considered similar; otherwise they are considered different. CSA achieves its efficiency by beginning with a large value of D_cut, essentially to search all possible structures, and then gradually reducing ('annealing') D_cut, thereby reducing the minimum distance between the conformations of the bank and focusing the search in low-energy regions of conformational space. After updating the current population, the seed conformations are selected from the set of conformations not selected as seeds previously.

Introducing new operators for dynamic disulfide-bond formation and breaking into CSA

The CSA run with dynamic disulfide-bond arrangement allows for changing the positions of disulfide bonds. The only information supplied to the procedure is the positions of the cysteines; the links between them are unknown. In other words, any two cysteines are allowed to be in a bonded state. The following new genetic operators have been introduced into CSA to treat the formation and breaking of disulfide links; these processes are well known to influence folding pathways considerably (Staley and Kim, 1992; Weissman and Kim, 1992).

1. Formation of new bonds. All 'seed' conformations are analyzed for the presence of pairs of non-bonded cysteines whose side-chain centers are closer than 7 Å and which are more than l residues apart in the amino acid sequence. In this work l = 1, unless stated otherwise.
For each seed and each pair satisfying this criterion, a trial conformation is generated and, for the pair of cysteines designated as bonded, the hydrophobic-interaction potential U_SCiSCj is replaced by the disulfide-bond potential given by Equation 2.

2. Breaking of existing bonds. For each seed conformation with disulfide-bonded cysteine pairs, a pair (a disulfide bond) is selected at random. Then a new trial conformation is generated in which the selected bond is broken. Consequently, for the selected pair of cysteines, the disulfide-bond energy given by Equation 2 is replaced by the respective hydrophobic-interaction energy U_SCiSCj (Equation 1).

3. Exchange of links between disulfide bonds and free cysteines. If a free cysteine (of number i in the sequence) in a 'seed' conformation is found close to a disulfide bond (formed by cysteines j and k), i.e. if its distance from Cys_j or Cys_k is less than 7 Å, trial conformations with the links Cys_i–Cys_j and Cys_i–Cys_k, each formed instead of Cys_j–Cys_k, are generated. An additional criterion on the distance in the amino acid sequence between bonded cysteines is applied as in point 1.

4. Exchange of links between pairs of disulfide bonds. If a disulfide bond Cys_i–Cys_j is found close to disulfide bond Cys_k–Cys_l in a 'seed' conformation (i.e. if one of the distances Cys_i···Cys_k, Cys_i···Cys_l, Cys_j···Cys_k or Cys_j···Cys_l is less than 7 Å), trial conformations with the exchange of the current disulfide links into two alternative possibilities (Cys_i–Cys_k, Cys_j–Cys_l or Cys_i–Cys_l, Cys_j–Cys_k) are generated. An additional criterion on the distance in the amino acid sequence between bonded cysteines is applied as in point 1.

Table I. Energies, r.m.s.d.s with respect to native structure and number of disulfide bonds for representative structures found in the CSA simulations
Columns: Protein/run type; Relative energy (kcal/mol); R.m.s.d. (Å); No. of disulfide bonds; No. of native disulfide bonds; Figure.
Rows (run type, with figure panel where given):
1EI0: Native (1A); No disulfides (F); Fixed, native (B); Dynamic (F, D, C); Dynamic, native start (F, E).
1EI0, dedicated force field: Native (2A); No disulfides; Fixed, native (D); Dynamic (C); Dynamic, native start (B).
1NKL: Native (3A); No disulfides; Fixed, native (C); Dynamic (D); Dynamic, native start (B).
1L1I: Native (4A); No disulfides; Fixed, native (B); Dynamic; Dynamic, native start (C); Dynamic, native start, Cys_i–Cys_j (j − i > 3) (D).
1ED0: Native (5A); No disulfides (B); Fixed, native (D); Dynamic; Dynamic, native start (C).

Trial conformations generated with existing CSA operators inherit the arrangement of disulfide bonds from the parent 'seed' conformation, but only if the distance between the bonded cysteine side-chain centroids is smaller than 7 Å after the mutation or crossover operation. The CSA run with variable disulfide-bond arrangement can start with no disulfide bonds in the randomly generated first bank or with any fixed disulfide-bond arrangement for the first randomly generated conformations. In this paper, only runs with no disulfide bonds and runs with native disulfide-bond arrangement in the first bank were used in CSA with variable disulfide-bond arrangement.

The CSA simulation with fixed disulfide-bond arrangement assumes that all the pairs of cysteines between which the bonds are to be formed have been specified. Each of these pairs is in a 'bonded' ('oxidized') state and the energy of interactions between the cysteine residues is expressed by Equation 2.
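The disulfide-bond energy of Equation 2, the distance and sequence-separation criterion of the bond-formation operator, and the 7 Å inheritance rule can be sketched in Python as follows. The parameter values (−5.5 kcal/mol, 4.2 Å, 6.6 kcal/(mol Å²), 7 Å, l = 1) are those quoted in the text; the function names, the data layout and the centroid coordinates are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
import math
from itertools import combinations

E_CYS_CYS = -5.5    # kcal/mol, energy of a non-strained disulfide bond (text value)
D_CYS_CYS = 4.2     # Angstrom, equilibrium centroid-centroid distance (text value)
K_CYS_CYS = 6.6     # kcal/(mol A^2), harmonic force constant (text value)
BOND_CUTOFF = 7.0   # Angstrom, centroid distance cutoff used by the operators


def disulfide_energy(d):
    """Equation 2: constant bonding term plus harmonic penalty on centroid distance d."""
    return E_CYS_CYS + 0.5 * K_CYS_CYS * (d - D_CYS_CYS) ** 2


def formation_candidates(cys_centroids, bonded, min_sep=1):
    """Pairs of free cysteines closer than the cutoff and more than min_sep residues apart.

    cys_centroids maps residue number -> (x, y, z); bonded is a set of frozenset pairs.
    """
    used = {res for pair in bonded for res in pair}
    free = sorted(res for res in cys_centroids if res not in used)
    pairs = []
    for i, j in combinations(free, 2):
        close = math.dist(cys_centroids[i], cys_centroids[j]) < BOND_CUTOFF
        if j - i > min_sep and close:
            pairs.append((i, j))
    return pairs


def inherited_bonds(cys_centroids, parent_bonds):
    """Keep a parent's bond only if its centroids are still within the cutoff."""
    kept = set()
    for pair in parent_bonds:
        i, j = sorted(pair)
        if math.dist(cys_centroids[i], cys_centroids[j]) < BOND_CUTOFF:
            kept.add(pair)
    return kept


# Hypothetical centroid positions; residue numbers echo the four cysteines of 1EI0.
centroids = {3: (0.0, 0.0, 0.0), 13: (20.0, 0.0, 0.0),
             24: (24.2, 0.0, 0.0), 34: (5.0, 0.0, 0.0)}
candidates = formation_candidates(centroids, bonded=set())
```

In this sketch each candidate pair would give rise to one trial conformation in which the side-chain term for that pair is replaced by `disulfide_energy`; the bond-breaking and link-exchange operators are not shown.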
New operators for dynamic disulfide-bond formation and breaking are not used in the CSA runs with fixed disulfide-bond arrangement.

For each protein, four types of simulations were carried out: (i) with dynamic disulfide-bond arrangement starting with a random bank of conformations with no disulfides; (ii) with fixed (native) disulfide-bond arrangement; (iii) with dynamic disulfide-bond arrangement starting with a random bank of conformations with native disulfide-bond arrangement; and (iv) assuming that no disulfide bonds can be formed. For smaller proteins (1EI0, 1ED0), minimizations per CSA run were used whereas for larger proteins (1NKL, 1L1I), more than one run of each kind with around minimizations was carried out.

Results

Energies, r.m.s.d. values from the experimental structures and the numbers of native disulfide bonds for all types of CSA runs for the proteins studied are summarized in Table I, while representative structures are shown in Figures 1–5.

Fig. 1. Native structure of 1EI0 (A); the lowest energy structure from the run with fixed (native) disulfide-bond arrangement, with an energy 8.4 kcal/mol higher than the GM and an r.m.s.d. of 3.9 Å with respect to native (B); the closest to native structure from the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, with an energy 13.7 kcal/mol higher than the GM and an r.m.s.d. of 2.2 Å (C); the structure from the run with dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, with one native disulfide bond formed, an energy 2.9 kcal/mol higher than the GM and an r.m.s.d. of 3.4 Å (D); the structure with both native disulfide bonds formed in the run with a dynamic disulfide-bond arrangement starting with a random bank of conformations with the native disulfide-bond arrangement, with an energy 36 kcal/mol higher than the GM and an r.m.s.d. of 4.4 Å (E); the GM is a straight α-helix (F).

Modifications of the UNRES force field and of the CSA global optimization procedure which allow formation of disulfide bonds were first tested on a small α-helical protein with a helix–turn–helix fold and two disulfides linking both helices together, identified in the PDB as 1EI0 (Barthe et al., 2000) (see Figure 1A). It should be stressed that 1EI0 was not used in the force-field optimization. The CSA global optimization runs, assuming that no disulfide bonds can be formed, led to a global-minimum (GM) structure in the form of a long α-helix with an r.m.s.d. of 13.6 Å from the native structure (see Figure 1F) and also to a structure with the correct fold (not shown), with an r.m.s.d. for the Cα atoms of 4.1 Å from the average NMR structure in the PDB and an energy only 0.7 kcal/mol higher than the GM. However, in this native-like structure (r.m.s.d. 4.1 Å), the distances between cysteine centroids are fairly large: 11.2 Å for Cys3–Cys34 and 12.6 Å for Cys13–Cys24. A CSA run with a fixed (native) disulfide-bond arrangement produced only native-like structures; the lowest energy was 8.4 kcal/mol higher than the GM from the previous run and the lowest energy structure has an r.m.s.d. of 3.9 Å (see Figure 1B) with both native disulfide bonds. A CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, found the same GM as the run in which no disulfides were allowed (Figure 1F).
No low-energy structures with both native disulfides were present in the final population. A large number of native-like structures were found, but with only one native disulfide bond. The structure closest to native in this run has an r.m.s.d. of 2.2 Å and an energy 13.7 kcal/mol higher than the GM and only the Cys13–Cys24 disulfide bond was formed (see Figure 1C). In the same run, structures with the correct fold with only the Cys3–Cys34 disulfide bond formed were found; the structure with an energy 2.9 kcal/mol higher than the GM and an r.m.s.d. of 3.4 Å is shown in Figure 1D. The CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, gave similar results; the only difference is that a high-energy structure (36 kcal/mol higher than the GM) with both native disulfide bonds formed and an r.m.s.d. of 4.4 Å was present in the final population (see Figure 1E).

Fig. 2. Native structure of 1EI0 (A) and three structures from CSA runs with the re-optimized UNRES force field (including the 1EI0 protein in the set of training proteins): the lowest energy structure, the GM, with an r.m.s.d. of 3.5 Å with respect to native, found in the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement (B); the structure with both native disulfides formed in the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, with an energy 17.1 kcal/mol higher than the GM and an r.m.s.d. of 2.5 Å (C); the structure with the lowest energy from the run with a fixed (native) disulfide-bond arrangement, with an energy 10.5 kcal/mol higher than the GM and an r.m.s.d. of 3.6 Å (D).
To check the influence of the force field used in the CSA runs which allow dynamic formation of disulfide bonds, we re-optimized the UNRES force field by including the 1EI0 protein in the set of training proteins (PDB codes 1E0G, 1E0L, 1GAB, 1IGD and 1EI0). The disulfide-bond energy parameters of Equation 2 were not optimized; only the internal parameters of U_SCiSCj and the weights of the energy terms were changed. In the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, the GM for 1EI0 was found with the new force field; it had the correct fold, one disulfide (Cys3–Cys34) formed and an r.m.s.d. of 3.5 Å with respect to the native (see Figure 2B). The lowest energy structure obtained by CSA, assuming that no disulfide bonds can be formed, has an r.m.s.d. of 3.6 Å, an energy 3.8 kcal/mol higher than the GM and inter-cysteine distances of 5.6 and 8.8 Å for Cys3–Cys34 and Cys13–Cys24, respectively. In the CSA run with a dynamic disulfide-bond arrangement starting with a random bank of conformations with no disulfides, the native-like structure with both disulfides was found as the structure with an energy 17.1 kcal/mol higher than the GM and an r.m.s.d. of 2.5 Å with respect to native (see Figure 2C). The lowest energy structure in the CSA run with fixed (native) disulfide-bond arrangement has an r.m.s.d. of 3.6 Å and was 10.5 kcal/mol higher than the GM (see Figure 2D). Even the force field optimized for the 1EI0 protein does not lead to a structure of 1EI0 with two disulfides formed as one of very low energy; the reason is the wrong inter-helical angle in the low-energy structures obtained with this version of the UNRES force field.

Fig. 3. Native structure of 1NKL (A); the lowest energy structure, GM, with an r.m.s.d. of 5.4 Å with respect to native and one non-native disulfide formed, found in the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement (B); the lowest energy structure from a run with a fixed (native) disulfide-bond arrangement, with an energy 17.5 kcal/mol higher than the GM and an r.m.s.d. of 5.1 Å (C); the structure from the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, with one native disulfide formed, an energy 40.6 kcal/mol higher than the GM and an r.m.s.d. of 4.9 Å (D).

The next test case was the four α-helix bundle 1NKL protein with three disulfides (Liepinsh et al., 1997) (see Figure 3A). Only the original force field, optimized on four training proteins (PDB codes 1E0G, 1E0L, 1GAB, 1IGD), was used. Assuming that no disulfide bonds can be formed, all CSA runs found structures with the correct fold, but they were 49.6 kcal/mol higher in energy than the non-native three-helix cyclic-like structure which had the lowest energy in these runs (9.4 kcal/mol above the GM). The GM, which has the correct fold, was found by the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement (Figure 3B). It has an r.m.s.d. of 5.4 Å with respect to the native structure and has 9.4 kcal/mol lower energy than the low-energy non-native structures found in the former CSA runs without disulfides. Only one disulfide bond is present and it is non-native. The lowest energy structure in the CSA run with a fixed (native) disulfide-bond arrangement has an r.m.s.d. of 5.1 Å and was 17.5 kcal/mol higher than the GM (see Figure 3C).
An example of a structure from the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, is shown in Figure 3D. This structure is 40.6 kcal/mol higher in energy than the GM and has an r.m.s.d. of 4.9 Å. Only one native disulfide bond, Cys35–Cys45, forms easily. It should be noted that, although the lowest energy structure of 1NKL does not have the native disulfide-bond arrangement, the very possibility of disulfide-bond formation resulted in location of the structure with the correct fold as the lowest energy structure, as opposed to runs without the possibility of disulfide-bond formation. From Figure 3B, it can be seen that, although the six cysteine residues do not form the native bonds, they are qualitatively positioned as in the native structure. This suggests that dynamic formation, breaking and rearrangement of disulfide bonds guided the CSA search to find the correct fold, although the disulfide-bond potential introduced in this work cannot predict the correct disulfide-bond arrangement. An analysis of the history of the CSA search indicated that a single native disulfide bond was formed in low-energy structures during the course of the run (data not shown), which explains why it helped to find the native fold. The population of intermediate structures contained all native disulfide bonds, but only one was present in a particular structure. However, because of the imperfection of the force field, after the native-like fold was reached a lower energy was achieved with non-native disulfide bonds.

Fig. 4. Native structure of 1L1I (A); the lowest energy structure from the run with a fixed (native) disulfide-bond arrangement, with an energy 25.7 kcal/mol higher than the GM and an r.m.s.d. of 5.7 Å with respect to the native structure (B); the lowest energy structure, GM, from the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, with an r.m.s.d. of 6.1 Å and six non-native disulfide bonds (C); the low-energy structure with four native disulfides found in the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, with the additional restriction of at least three residues in the loop between cysteines forming a disulfide bond, an energy 46.8 kcal/mol higher than the GM and an r.m.s.d. of 7.7 Å (D).

Fig. 5. Native structure of 1ED0 (A) and the lowest energy, GM, non-native structure found in the run assuming that no disulfide bonds exist (B); the structure with the correct fold from the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, with an energy 14.3 kcal/mol higher than the GM and an r.m.s.d. of 4.8 Å with respect to native (C); the native-like structure from the run with the fixed (native) disulfide-bond arrangement, with an energy 17.0 kcal/mol higher than the GM and an r.m.s.d. of 4.9 Å (D).

The β-helical protein with eight disulfides, identified in the PDB with code 1L1I, was chosen to test the algorithm on proteins with many disulfide bonds (Daley et al., 2002) (see Figure 4A). The original force field, optimized on four training proteins (PDB codes 1E0G, 1E0L, 1GAB, 1IGD), was used. It should be stressed that no similar fold is present in any of the training proteins. Assuming that no disulfide bonds can be formed, all CSA runs found only non-native structures, the lowest energy structure of these being a flat six-stranded β-sheet built out of consecutive hairpins (not shown). However, the GM, which has the correct fold, was found by the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement (see Figure 4C). It has an r.m.s.d. of 6.1 Å with respect to native and has 32.9 kcal/mol lower energy than the low-energy non-native structures found in the former CSA runs without disulfides. Six disulfide bonds are present but all of them are non-native. It should be noted that the energy gain due to the formation of a single disulfide bond is equal to 5.5 kcal/mol (see the section The UNRES force field, in Materials and methods). Therefore, the energy gain is greater than that due to the formation of six bonds.

The CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with a native disulfide-bond arrangement, was repeated with an additional restriction of at least three residues in the loop between cysteines forming a disulfide bond. Consequently, the Cys18–Cys21 bond was forbidden to form. The lowest energy structure found by the new run has the correct fold with an r.m.s.d. of 7.7 Å, four native disulfides and an energy 46.8 kcal/mol higher than the GM (see Figure 4D).
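The loop-length restriction described above can be illustrated with a minimal Python sketch. It assumes that "at least three residues in the loop" counts the residues strictly between the two cysteines, an interpretation that is consistent with Cys18–Cys21 (two intervening residues) being forbidden. The function names and the third cysteine position are hypothetical and are not taken from the UNRES or CSA code.

```python
from itertools import combinations


def loop_length(i: int, j: int) -> int:
    """Number of residues strictly between sequence positions i and j."""
    return abs(j - i) - 1


def allowed_pairs(cys_positions, min_loop=3):
    """Return cysteine pairs whose intervening loop has at least min_loop residues.

    Assumption: the restriction is applied purely on sequence separation.
    """
    ordered = sorted(cys_positions)
    return [
        (i, j)
        for i, j in combinations(ordered, 2)
        if loop_length(i, j) >= min_loop
    ]


# Positions 18 and 21 come from the text; 27 is a hypothetical third cysteine.
cys = [18, 21, 27]
print(allowed_pairs(cys))  # the (18, 21) pair is filtered out
```

With these inputs the 18–21 pair is rejected because only residues 19 and 20 lie between the two cysteines, while the pairs involving the hypothetical position 27 remain candidates for bond formation.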
The lowest energy structure in the CSA run with the fixed (native) disulfide-bond arrangement has an r.m.s.d. of 5.7 Å and is 25.7 kcal/mol higher in energy than the GM (see Figure 4B). As in the case of 1NKL, although the lowest energy structure does not have a native disulfide-bond arrangement, the structure has a largely correct fold and the cysteines are arranged as in the native structure. Unlike 1NKL, where the formation of disulfide bonds only guided the search and the lowest energy structure has only one (non-native) disulfide bond, here the native β-helical structure is entirely stabilized by disulfide bonds, although they are not native.

The next test case was a small α/β protein identified in the PDB with code 1ED0 (Romagnoli et al., 2000) (see Figure 5A). The original force field, optimized on four training proteins (PDB codes 1E0G, 1E0L, 1GAB, 1IGD), was used. Assuming that no disulfide bonds can be formed, the CSA run found only non-native structures. One of them is the GM, which is a flat five-stranded β-sheet built out of consecutive hairpins with a sixth strand packed to the surface of this β-sheet (see Figure 5B). The structures with the correct fold were found by CSA runs with a dynamic disulfide-bond arrangement. A representative structure found by the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, is shown in Figure 5C. It has an r.m.s.d. of 4.8 Å with respect to native and is 14.3 kcal/mol higher in energy than the low-energy non-native structures found in the former CSA runs without disulfides. Only one disulfide bond is present and it is non-native. The lowest energy structure in the CSA run with a fixed (native) disulfide-bond arrangement has an r.m.s.d. of 6.8 Å, an energy 13.2 kcal/mol higher than the GM and poor packing of the secondary-structure elements. The closest-to-native structure from this run has an r.m.s.d. of 4.9 Å and an energy 17.0 kcal/mol higher than the non-native GM (see Figure 5D).

Discussion

The results presented in this paper show that it is possible to extend the united-residue force field and search procedure (CSA) developed earlier to study proteins containing disulfides. To the best of our knowledge, this is the first algorithm for energy-based prediction of the structure of disulfide-bonded proteins without any assumption as to the positions of native disulfides or human intervention. The earlier work of Watanabe et al. (1991) concerned packing of fixed secondary-structure elements with a limited search of loop conformations. Moreover, formation of disulfide bonds was not guided by energy as in our approach, but the decision was arbitrarily based on a geometrical graphic representation, as described in the Introduction.

For 1NKL and 1L1I, the lowest energy structures obtained with inclusion of dynamic disulfide-bond formation had the correct fold, as opposed to runs in which the formation of disulfides was ignored. These structures did not have all native disulfide bonds; however, the positions of the cysteine residues were qualitatively the same as in the native structure. This suggests that the role of formation of disulfide bonds as a factor stabilizing the native structure or guiding the folding process towards the native structure is reasonably well accounted for by the modified CSA search procedure proposed in this work, even though the disulfide-bond potential is not perfect. As indicated in the Results section, the search is guided towards the native structure probably because some of the native disulfide bonds are formed during the run, although they can be broken or rearranged in the final structure. It must be stressed that the examples studied in this work show that it is not possible to use only the hydrophobic side-chain potential in united-residue simulations on proteins with disulfide bonds.
Only in the case of 1EI0 is the structure with the correct fold low in energy without including disulfide-bond formation. If the formation of disulfide bonds is not assumed, the simulation results not only in incorrect packing, but also in incorrect secondary structures.

Based on the results presented in this paper, we propose the following future modifications. The energy term describing the formation of a disulfide bond should include an angular dependence in addition to the distance dependence used in the current version of the algorithm. The lack of an angular dependence led to overpopulation of the structures in which the cysteines are very close in the sequence (see Results). The formation of short-range disulfides is easy from an entropic point of view, which is represented by a distance dependence, but bonds (like disulfides) leading to short-range loops are more restricted by the stiffness of the peptide backbone and the chemical nature of the disulfide bond itself and their formation can be described by an angular and a distance dependence.

The case of the 1L1I protein shows the deficiency in the current mutation operators for formation/breaking of disulfide bonds in CSA. Generally, to increase the probability of formation of a disulfide bond, we plan to introduce more complicated genetic operators. Based on our experience, the following new genetic operators are necessary: (i) copy the disulfide bond present in one conformation to another one, (ii) global perturbation of a conformation without disrupting the disulfide bond(s) that it already contains, (iii) small local perturbation of the backbone and side chains around cysteine moieties which are geometrically close to each other but still too far to be considered as bonded. Additionally, the possibility of formation of a disulfide bond between two free cysteines should be based not only on the distance between their centroids, but also on the orientation of the respective Cα–SC vectors. All these changes in the search procedure should speed up the search and increase the probability of generating structures with correct disulfide bonds.

Acknowledgements

We thank Jarosław Pillardy, Daniel Ripoll and Jorge Vila for helpful comments on this paper. This work was supported by grants from the National Institutes of Health (GM-14312), the National Science Foundation (MCB ), the Fogarty Foundation (TW1064) and grant BW/ from the Polish State Committee for Scientific Research (KBN). Support was also received from the National Foundation for Cancer Research. This research was conducted by using the resources of (a) the National Science Foundation Terascale Computing System at the Pittsburgh Supercomputer Center, (b) our 392-processor Beowulf cluster at the Baker Laboratory of Chemistry and Chemical Biology, Cornell University and (c) our 45-processor Beowulf cluster at the Faculty of Chemistry, University of Gdańsk.

References

Abkevich,V.I. and Shakhnovich,E.I. (2000) J. Mol. Biol., 300, 975–985.
Barthe,P., Rochette,S., Vita,C. and Roumestand,C. (2000) Protein Sci., 9, 942–955.
Betz,S.F. (1993) Protein Sci., 2, 1551–1558.
Burton,R.E., Hunt,J.A., Fierke,C.A. and Oas,T.G. (2000) Protein Sci., 9, 776–785.
Czaplewski,C. et al. (2002) In Fifth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction. predictioncenter.llnl.gov/casp5/casp5.html
Czaplewski,C., Liwo,A., Pillardy,J., Ołdziej,S. and Scheraga,H.A. (2003) Polymer, in press.
Daley,M.E., Spyracopoulos,L., Jia,Z., Davies,P.L. and Sykes,B.D. (2002) Biochemistry, 41, 5515–5525.
Dani,V.S., Ramakrishnan,C. and Varadarajan,R. (2003) Protein Eng., 16, 187–193.
Doig,A.J. and Williams,D.H. (1991) J. Mol. Biol., 217, 389–398.
Dombkowski,A.A. and Crippen,G.M. (2000) Protein Eng., 13, 679–689.
Fiser,A. and Simon,I. (2000) Bioinformatics, 16, 251–256.
Fiser,A., Cserzö,M., Tüdös,E. and Simon,I. (1992) FEBS Lett., 302, 117–120.
Huang,E.S., Samudrala,R. and Ponder,J.W. (1999) J. Mol. Biol., 290, 267–281.
Kobayashi,Y., Sasabe,H., Akutsu,T. and Saitô,N. (1992) Biophys. Chem., 44, 113–127.
Kubo,R. (1962) J. Phys. Soc. Jpn., 17, 1100–1120.
Lee,J., Scheraga,H.A. and Rackovsky,S. (1997) J. Comput. Chem., 18, 1222–
Lee,J., Liwo,A. and Scheraga,H.A. (1999) Proc. Natl Acad. Sci. USA, 96, 2025–2030.
Lee,J., Liwo,A., Ripoll,D.R., Pillardy,J., Saunders,J.A., Gibson,K.D. and Scheraga,H.A. (2000) Int. J. Quantum Chem., 71, 90–117.
Lee,J., Ripoll,D.R., Czaplewski,C., Pillardy,J., Wedemeyer,W.J. and Scheraga,H.A. (2001) J. Phys. Chem. B, 105, 7291–7298.
Lester,C.C., u,., Laity,J.H., Shimotakahara,S. and Scheraga,H.A. (1997) Biochemistry, 36, 13068–
Liepinsh,E., Andersson,M., Ruysschaert,J.M. and Otting,G. (1997) Nat. Struct. Biol., 4, 793–795.
Liwo,A., Pincus,M.R., Wawak,R.J., Rackovsky,S. and Scheraga,H.A. (1993) Protein Sci., 2, 1715–1731.
Liwo,A., Ołdziej,S., Pincus,M.R., Wawak,R.J., Rackovsky,S. and Scheraga,H.A. (1997a) J. Comput. Chem., 18, 849–873.
Liwo,A., Pincus,M.R., Wawak,R.J., Rackovsky,S., Ołdziej,S. and Scheraga,H.A. (1997b) J. Comput. Chem., 18, 874–887.
Liwo,A., Lee,J., Ripoll,D.R., Pillardy,J. and Scheraga,H.A. (1999a) Proc. Natl Acad. Sci. USA, 96, 5482–5485.
Liwo,A., Pillardy,J., Kazmierkiewicz,R., Wawak,R.J., Groth,M., Czaplewski,C., Ołdziej,S. and Scheraga,H.A. (1999b) Theor. Chem. Acc., 101, 16–20.
Liwo,A., Czaplewski,C., Pillardy,J. and Scheraga,H.A. (2001) J. Chem. Phys., 115, 2323–2347.
Liwo,A., Arłukowicz,P., Czaplewski,C., Ołdziej,S., Pillardy,J. and Scheraga,H.A. (2002) Proc. Natl Acad. Sci. USA, 99, 1937–1942.
Liwo,A., Ołdziej,S., Czaplewski,C., Kozłowska,U. and Scheraga,H.A. (2003) J. Phys. Chem. B, in press.
Martelli,P.L., Fariselli,P., Malaguti,L. and Casadio,R. (2002) Protein Eng., 15, 951–953.
Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534–552.
Mucchielli-Giorgi,M.H., Hazout,S. and Tuffery,P. (2002) Proteins: Struct. Funct. Genet., 46, 243–249.
Muskal,S.M., Holbrook,S.R. and Kim,S.H. (1990) Protein Eng., 3, 667–672.
Némethy,G., Gibson,K.D., Palmer,K.A., Yoon,C.N., Paterlini,G., Zagari,A., Rumsey,S. and Scheraga,H.A. (1992) J. Phys. Chem., 96, 6472–6484.
Orengo,C.A., Bray,J.E., Hubbard,T., LoConte,L. and Sillitoe,I. (1999) Proteins: Struct. Funct. Genet., Suppl. 3, 149–170.
Petersen,M.T.N., Jonson,P.H. and Petersen,S.B. (1999) Protein Eng., 12, 535–548.
Pillardy,J. et al. (2001a) Proc. Natl Acad. Sci. USA, 98, 2329–2333.
Pillardy,J., Czaplewski,C., Liwo,A., Wedemeyer,W.J., Lee,J., Ripoll,D.R., Arłukowicz,P., Ołdziej,S., Arnautova,Y.A. and Scheraga,H.A. (2001b) J. Phys. Chem. B, 105, 7299–7311.
Rey,A. and Skolnick,J. (1994) J. Chem. Phys., 100, 2267–2276.
Romagnoli,S., Ugolini,R., Fogolari,F., Schaller,G., Urech,K., Giannattasio,M., Ragona,L. and Molinari,H. (2000) Biochem. J., 350, 569–577.
Skolnick,J., Kolinski,A. and Ortiz,A.R. (1997) J. Mol. Biol., 265, 217–241.
Staley,J.P. and Kim,P.S. (1992) Proc. Natl Acad. Sci. USA, 89, 1519–1523.
Watanabe,K., Nakamura,A., Fukuda,Y. and Saitô,N. (1991) Biophys. Chem., 40, 293–301.
Wedemeyer,W.J., Welker,E., Narayan,M. and Scheraga,H.A. (2000) Biochemistry, 39, 4207–4216.
Weissman,J.S. and Kim,P.S. (1992) Science, 256, 112–114.
Welker,E., Wedemeyer,W.J., Narayan,M. and Scheraga,H.A. (2001) Biochemistry, 40, 9059–9064.
Zhou,N.E., Kay,C.M. and Hodges,R.S. (1993) Biochemistry, 32, 3178–3187.
Accepted October 16, 2003 Edited by Valerie Daggett
