Computer Modeling of Protein Folding: Conformational and Energetic Analysis of Reduced and Detailed Protein Models


J. Mol. Biol. (1995) 247, 995-1012

Alessandro Monge, Elizabeth J. P. Lathrop, John R. Gunn, Peter S. Shenkin and Richard A. Friesner*

Department of Chemistry and Center for Biomolecular Simulation, Columbia University, New York NY 10027, U.S.A.

*Corresponding author

Recently we developed methods to generate low-resolution protein tertiary structures using a reduced model of the protein where secondary structure is specified and a simple potential based on a statistical analysis of the Protein Data Bank is employed. Here we present the results of an extensive analysis of a large number of detailed, all-atom structures generated from these reduced model structures. Following side-chain addition, minimization and simulated annealing simulations are carried out with a molecular mechanics potential including an approximate continuum solvent treatment. By combining reduced model simulations with molecular modeling calculations we generate energetically competitive, plausible misfolded structures which provide a more significant test of the potential function than current misfolded models based on superimposing the native sequence on the folded structures of completely different proteins. The various contributions to the total energy and their interdependence are analyzed in detail for many conformations of three proteins (myoglobin, the C-terminal fragment of the L7/L12 ribosomal protein, and the N-terminal domain of phage 434 repressor). Our analysis indicates that the all-atom potential performs reasonably well in distinguishing the native structure. It also reveals inadequacies in the reduced model potential, which suggests how this potential can be improved to yield greater accuracy. Preliminary results with an improved potential are presented.
Keywords: protein folding; computer modeling; potential functions

Abbreviations used: r.m.s., root-mean-square; PDB, Protein Data Bank; RSA, rotamer simulated annealing; MBO, myoglobin; CTF, C-terminal fragment of the L7/L12 ribosomal protein; R69, N-terminal domain of phage 434 repressor; CPU, central processor unit.

Introduction

The thermodynamics of protein folding has been the subject of intense experimental and theoretical study for many years (Privalov, 1989; Yang et al., 1992). From a theoretical point of view, the problem is extraordinarily difficult because the free energy of folding is a small fraction of the total free energy of the protein; consequently, one needs to calculate a small energy difference with empirical potential functions whose quality for this purpose is not known. Because globally different protein conformations must be compared, the cancellation of error that is often relied upon in free energy perturbation calculations is problematic. An adequate evaluation of the total energy of the protein is therefore necessary. Furthermore, the size and complexity of the molecule and associated solvent make the usual procedures of molecular mechanics computationally expensive; this difficulty is compounded by the existence of a huge number of local minima (even in the neighborhood of the global minimum) which impedes rapid conformational sampling of phase space.

Over the past several years, a few papers have appeared which have examined the total molecular mechanics energy of a small number of different protein conformations (Levitt & Sharon, 1988; Daggett & Levitt, 1991; Mark & van Gunsteren, 1992; Novotny et al., 1984, 1988; Bryant & Lawrence, 1993). While many of these results have been interesting, the amount of data obtained is not really sufficient to fully address the problem described above. It is relatively straightforward to generate large numbers of plausible conformations that are very close (e.g. 1 to 2 Å r.m.s.
deviation) to the native structure, for example starting from the X-ray structure and using molecular dynamics or simulated annealing algorithms (Daggett & Levitt, 1991; Mark & van Gunsteren, 1992). It is also straightforward to thread the sequence of one protein through the structure of another, as was done by Karplus and co-workers in a pioneering study of this type (Novotny et al., 1984) and more recently, for instance, by Bryant & Lawrence (1993). However, in order to design a methodology for protein folding by computer, it is crucial to examine a large number of low energy structures with conformations significantly different from the native (at least in the 4 to 6 Å r.m.s. deviation range, which is estimated to be in the molten globule regime). Such structures can only be produced via simulations that are capable of rapidly traversing configuration space and that at the same time utilize a potential function that is a plausible approximation to the actual potential.

Many studies have also been carried out using reduced protein models and approximate potentials (Friedrichs et al., 1991; Hinds & Levitt, 1992; Sun, 1993; Skolnick & Kolinski, 1990; Kolinski et al., 1993; Lau & Dill, 1989; Shakhnovich et al., 1991; Covell & Jernigan, 1990; Covell, 1992). Most of this work has been concerned with sequence alignment and homology modeling, i.e. identifying the native structure from the set of structures in the Protein Data Bank (PDB; Bernstein et al., 1977), given the sequence. Again, however, the restriction to PDB structures is qualitatively inadequate if one wishes to understand how the native conformation is selected as compared with alternatives that are actually suitable to the sequence. To begin with, a realistic treatment of excluded volume constraints is required, and most of the database potentials in the literature simply ignore such constraints, as PDB structures have them built in automatically.
Furthermore, the quality of these reduced model potentials is even more of an issue than that of the molecular mechanics potentials, which at least have a performance record that can be evaluated for small molecules.

In our earlier work on computer modeling of protein folding (Monge et al., 1994; Gunn et al., 1994), we used a reduced model of the protein where we fix secondary structure and employ a simple potential based on a statistical analysis of PDB structures. The idea of fixing secondary structure was proposed in a number of previous works (Ptitsyn & Rashin, 1975; Warshel & Levitt, 1976; Cohen et al., 1979), but while promising results were reported, the methods used have not been of general applicability. Our efforts have been toward developing a genuinely automated algorithm to fold proteins of arbitrary complexity using the secondary structure as a starting point. Such an algorithm could be applied to proteins of unknown structure when combined with NMR spectroscopy. In fact, the NMR method can provide a precise characterization of the protein secondary structure at an early stage of a structure determination and quite independently of the complete structure calculation (Wüthrich et al., 1984, 1991; Wishart et al., 1992).

In this paper, we combine an extensive set of reduced model simulations, using a potential with a primitive but reasonably effective set of excluded volume constraints, with molecular modeling calculations using the AMBER* force field (Weiner et al., 1984; McDonald & Still, 1992) for the protein and the generalized Born (GB) continuum solvent model of Still and co-workers (Still et al., 1990) to represent the aqueous environment.
Large numbers of energetically competitive reduced model structures are generated, as described in previous papers (Monge et al., 1994; Gunn et al., 1994), for three proteins: myoglobin, an eight α-helix protein (PDB code 1MBO); the C-terminal fragment of the L7/L12 ribosomal protein, a small mixed α/β protein (PDB code 1CTF); and the amino-terminal domain of phage 434 repressor, a small helical protein (PDB code 1R69). A subset of structures is then selected for further study: side-chains are added via the rotamer simulated annealing (RSA) program of Shenkin and co-workers (Farid et al., 1992), and minimization and simulated annealing runs are carried out with the AMBER*/GB potential using the MacroModel/BatchMin molecular modeling program (Mohamadi et al., 1990).

Recently, Vieth et al. (1994) studied the GCN4 leucine zipper (a dimer of two helices each containing 33 residues) using a hierarchical approach similar to ours, where a lattice model is used first and then all-atom structures are generated. They report very good agreement with the crystal structure and their results appear to be promising. However, a validation of their method must come in the context of larger and more complex proteins.

Our results on three proteins allow us, for the first time, to systematically investigate the major issues described above. Can a molecular mechanics potential with a continuum solvent pick out the native structure when compared with a set of genuinely competitive alternatives, as opposed to structures of a completely different protein? What sort of energy gap is there (if any) between the native and other structures, and which terms in the potential contribute to it most substantially? How good is the correlation between reduced model and molecular mechanics potential for the total energy and for each of the component parts of the energy?
While the conclusions that emerge from this investigation are in accordance with several previous speculations, the systematic trends, which can readily be observed in all three test proteins, are quite striking. The results suggest a strategy for constructing a significantly improved reduced model potential by identifying critical flaws in the potentials that have been produced to date. Work in this direction is currently in progress and a preliminary result is presented.

This paper is organized into four sections. In section two, we review our reduced model and associated computational algorithms, and present new results for CTF and R69 (results for MBO can be found in Gunn et al., 1994). CTF is the first β-strand containing protein that we have studied; the results

are quite satisfactory with the addition of a hydrogen-bonding potential to generate strand pairing (we do not otherwise bias how the strands pair). R69 is the first protein that we have studied for which our current model potential is grossly inadequate. As is often the case, one can learn as much or more from failure as from success; here, the difficulties in the potential, which are to some extent reflected even in the AMBER*/GB model, provide a key to understanding significant problems with the underlying physics of the approximate model. Section three describes the AMBER*/GB molecular mechanics calculations, including technical details of the simulations, statistical summaries of the results and discussion of the implications of these results for protein folding. Finally, section four contains conclusions and directions for future work.

The Reduced Model

Overview

In our studies of protein folding we use a hierarchical approach in which we represent the polypeptide chain at different levels of detail. The crudest level consists of cylinders connected by spheres. The cylinders contain either α-helices or β-strands and the spheres enclose loop regions. The next level of detail incorporates explicitly the backbone atoms and represents side-chains as spheres centered at the β-carbon atomic positions. The use of ideal geometries for helices and strands and of precalculated loop lists for loop segments results in a one-to-one correspondence between the two levels; in particular, each sphere at the coarse level corresponds to a possible loop at the more detailed level. Fixing the protein secondary structure provides a simplification of the problem and can be viewed as a computational technique with no implications for protein folding kinetics. In practice, secondary structure might be specified from NMR experimental data, as suggested by Wüthrich et al. (1991). In this section we describe the two levels of representation which define our reduced model.
This model was introduced in our earlier work on myoglobin (Gunn et al., 1994). Here we emphasize recent algorithmic developments and extensions of the model to include β-strands, and report new results for CTF and R69.

Model representation and algorithms

The geometric representation of the molecule is based on the assignment of each residue to one of 18 possible states which specify the backbone dihedral angles, with all other internal coordinates assuming standard values. These states are chosen to span the allowed regions of the Ramachandran map, but are otherwise not weighted according to local energy and are not residue-specific. This is to allow maximum flexibility for the loops by eliminating only impossible conformations. Each segment of repeated dihedral angle state (α-helices or β-strands) can be represented by a cylinder described by the axis and the radial vectors of the terminal residues. Each loop can be represented by a vector connecting the end-points of adjacent cylinders, with its geometry specified by the internal coordinates (angles and dihedral angles) formed with the axes and radii of the cylinders. The first level of representation (cylinders and spheres) thus consists of the segmented chain formed by the cylinder axes and radii with the connecting loops. The second level consists of the sequence of dihedral angle states which uniquely specifies the positions of all Cα and Cβ atoms.

Trial moves are carried out by replacing the loop segments with new values selected from a pre-calculated list. The remainder of the structure can be pivoted into the new conformation simply by making use of the stored values of the internal coordinates corresponding to each loop. This allows for very fast construction and evaluation of trial structures using the cylinder-sphere representation.
The more detailed representation of the molecule can then be constructed at periodic intervals by using the sequence of dihedral angle states which were used to construct each loop and are stored along with the corresponding geometry in the loop list. The entire chain can thus be rebuilt from the sequence of dihedral angles when required.

The minimization procedure consists of an inner loop of trial moves in which the loops are randomly replaced by loops from the loop list. These structures are checked for self-avoidance and rejected if any of the secondary structure elements, modeled as hard cylinders and spheres, are closer than a minimum allowed distance. This effective radius is a parameter which describes an impenetrable core, but which does not enclose all atoms. Rejection at this level rules out grossly self-overlapping structures. After a number of iterations the resulting structure is used as the trial move for evaluation with the complete residue-residue potential function. In this way, the more expensive potential, which is the quantity to be minimized by simulated annealing, is only evaluated for structures which have been selected by what is effectively a short minimization of a simpler model. The structures at this level are checked for self-avoidance with a cutoff distance for each pair of Cα or Cβ atoms. The minimum distance for each possible pair of residues is determined by taking the shortest distance observed in a survey of the PDB. For each overlap in the structure a constant penalty is added to the total energy used to accept or reject the structure. If the structure is rejected, the previously accepted structure is used for the next cycle of the inner loop.

The success of the algorithm depends significantly on the choice of the overlap penalty. If it is too high at the start, most trial moves would be rejected, since the cylinder-sphere potential does not completely prevent atomic overlaps from occurring.
This is because hard spheres and cylinders, which eliminate possible overlaps by enclosing all Cα and Cβ atoms, would grossly over-estimate the excluded volume and prevent the formation of compact structures. However, since there is a hard core in the simple model which does prevent impossible folding topologies, most overlaps can be alleviated with relatively small changes in the structure. This is achieved by gradually increasing the value of the penalty during the simulation so that those structures with fewer overlaps are progressively selected out. This method generates final structures with all overlaps removed without a significant increase in the energy.

The minimization is carried out simultaneously for a large number of structures and includes periodic implementation of a genetic algorithm. In this step, a loop is chosen as a splice point and a number of hybrids are created by taking parts of different structures and connecting them together at the splice point with a new loop selected from the loop list. Each hybrid undergoes a minimization cycle using the simple model to select a reasonable loop to connect the two parts. For each parent, defined as the structure contributing the larger segment, the lowest energy hybrid is selected and used as a new trial move for the complete potential, following the procedure described above for the mutation steps. Further refinement is carried out by selecting the lowest energy structures in the ensemble, replicating them, and continuing the simulation.

The potential function used in these simulations is based on a statistical analysis of the PDB (Casari & Sippl, 1992). Only pairs of residues far apart in the sequence are considered, so that the potential does not depend on the local geometry, but rather represents the overall packing of hydrophobic and hydrophilic residues and the formation of a hydrophobic core. In addition, the potential is long-ranged so that it can be used to evaluate non-compact structures.
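The overlap-penalty scheme described above can be illustrated with a minimal Python sketch. It is not the authors' code: the single distance cutoff `min_dist`, the linear ramp in `overlap_penalty`, its `start` and `end` values, and the Metropolis acceptance rule are all assumptions chosen for illustration (the paper uses residue-pair-specific minimum distances taken from a PDB survey and does not state the ramp shape).

```python
import math
import random


def count_overlaps(coords, min_dist):
    """Count atom pairs closer than min_dist.

    Assumption: one cutoff for all pairs; the paper uses a separate
    minimum distance for each residue pair type.
    """
    n = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < min_dist:
                n += 1
    return n


def overlap_penalty(step, n_steps, start=0.1, end=10.0):
    """Penalty per overlap, raised gradually during the run.

    Assumption: a linear ramp between hypothetical start and end values.
    """
    frac = step / max(n_steps - 1, 1)
    return start + frac * (end - start)


def penalized_energy(energy, n_overlaps, penalty):
    """Add a constant penalty for each overlap to the model energy."""
    return energy + penalty * n_overlaps


def metropolis_accept(e_old, e_new, temperature, rng=random):
    """Standard Metropolis test on the penalized energies (assumed rule)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / temperature)
```

Because the penalty starts small, overlapping but otherwise favorable structures survive early in the run; as the penalty grows, the same number of overlaps costs more and such structures are progressively rejected.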
The potential has the form

$$E = \sum_{|i-j| \geq 20}^{N} (h_i + h_j + 2h_0)\,|\mathbf{r}_i - \mathbf{r}_j| \qquad (1)$$

where the coefficients $h_i$ correspond to the relative hydrophobicities of the residues and $h_0$ is a net hydrophobicity of the molecule, which provides a driving force for compactness. For use with the cylinder-sphere representation, the potential for a pair of secondary structure segments can be expanded around the center-center distance. The first two terms of this expansion can be interpreted as a net hydrophobicity interaction and a hydrophobic dipole interaction. This approximation is sufficiently accurate to provide a useful estimate of the total energy in the inner loop of the minimization.

To improve the performance of the potential in differentiating similar compact structures, the all-residue potential also contains a contact term which consists of a residue dependent constant energy for each pair of C atoms within a cutoff distance (Maiorov & Crippen, 1992). This contact potential is added to the hydrophobic potential with a coefficient that is treated as an adjustable parameter. This allows the contact term to be smoothly turned on during the simulation. Since this term provides an additional driving force towards compactness, the net hydrophobicity $h_0$ is also reduced during the simulation. This parameter annealing, combined with the increasing overlap penalty discussed above, allows the potential to become less smooth and more rugged during minimization, along with the usual lowering of the effective temperature. It should be remarked that the same functional form of the potential is used for different proteins. The net hydrophobicity $h_0$ and the weighting for the contact potential depend solely on the sequence, and the parameter annealing can be regarded as a computational device used to achieve an effective minimization.

The above potential proved to be rather poor at generating structures with the correct pairing of β-strands.
The strands tended to clump together in bundles much like helices, rather than maintain a parallel arrangement. In order to describe the inter-strand hydrogen bonding, an additional term was added to the potential, designed to mimic an attraction between backbone O and H atoms with an appropriate geometry. It would be very computationally expensive to consider interactions among all pairs of atoms in a pair of residues, so only a very simple strand-strand potential was considered. Since the potential in this case involves short-range interactions between long extended segments, it is impossible to systematically approximate an all-atom function with an effective center-center interaction for use with the cylinders, as in the case of the hydrophobic potential. Instead, an ad hoc function was constructed which provides a crude approximation of a hydrogen-bonding potential for many relative orientations of two strands, with the antiparallel configuration being favored. This has the form

E (log R ij (R ij L i) 2 (R ij L j) 2 2.4) (1 2(L i L j) 5 )/R ij (2)

where R ij is the center-center vector and the L i are the axial vectors of the cylinders. This potential was found to provide an improved correlation between energy and r.m.s. deviation for the strand pairing, and therefore was subsequently used for both the cylinder-sphere and all-residue representations of the molecule. It should be emphasized that the form of this function is essentially arbitrary and is intended only to generate the simplest features of strand pairing. This is clearly only a first step towards developing a realistic hydrogen-bonding potential. In order to compensate for the increased attraction of the strands towards one another, the contact potential was set to zero for the strand residues.
Note that the additional strand-strand potential does not in any way specify the way in which strands must combine to form a given β-sheet, but simply requires nearby strands to assume an antiparallel configuration.
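As an illustration of how the hydrophobic and contact terms of the reduced model potential might be evaluated, the following Python sketch assumes that each pair term in equation (1) is (h_i + h_j + 2h_0) multiplied by the distance between residues i and j, summed over pairs separated by at least `min_sep` positions in the sequence. The value 20 for `min_sep` is read from the index under the sum in equation (1), and treating it as an inclusive bound is an assumption. The contact term is summed over all pairs for simplicity; the function names, the `eps` matrix layout and the `weight` argument are hypothetical and not taken from the paper.

```python
import math


def hydrophobic_energy(coords, h, h0, min_sep=20):
    """Long-range hydrophobic term: sum of (h_i + h_j + 2*h0) * r_ij.

    Only residue pairs at least min_sep apart in sequence contribute
    (assumed inclusive bound).
    """
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            r_ij = math.dist(coords[i], coords[j])
            energy += (h[i] + h[j] + 2.0 * h0) * r_ij
    return energy


def contact_energy(coords, eps, cutoff):
    """Contact term: a constant, pair-dependent energy eps[i][j]
    for every pair of sites closer than cutoff (all pairs, assumed)."""
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                energy += eps[i][j]
    return energy


def total_energy(coords, h, h0, eps, cutoff, weight, min_sep=20):
    """Hydrophobic term plus the contact term scaled by an adjustable weight.

    Raising weight while lowering h0 over a run mimics the parameter
    annealing described in the text.
    """
    return (hydrophobic_energy(coords, h, h0, min_sep)
            + weight * contact_energy(coords, eps, cutoff))
```

With `weight` set to zero the function reduces to the purely hydrophobic, long-ranged term, which is the regime in which non-compact structures can still be compared.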


Figure 3. Distribution of reduced model structures plotted with r.m.s. deviation versus total energy for CTF. Structures considered in the all-atom analysis are identified by their ID number.

Figure 5. Distribution of reduced model structures plotted with r.m.s. deviation versus total energy for R69. Structures considered in the all-atom analysis are identified by their ID number.

... of the native structure, with the exception of the details of the strand pairing, which requires additional refinement with a more detailed model to adequately represent. This structure is shown superimposed on the native in Figure 4. Both for myoglobin and for CTF, the results of the simulations indicate that, for our reduced model representation with secondary structure fixed, the number of potential minima is drastically reduced and the native-like topology is one of a small number of distinct low-energy conformations. Since the description of the protein chain is very coarse, it is not surprising that we find misfolded structures energetically competitive with the native one.

The final example discussed here is R69 (60 residues, five helices), for which the results are shown in Figure 5. Although there are a few structures generated with relatively low r.m.s. deviation from the native, they are no lower in energy than the average for the distribution. More importantly, the native structure itself exhibits a very high energy relative to misfolded ones. This implies that while further annealing of the structures shown might be expected to lower the energy of the ensemble, there is no reason to expect lower-energy structures to be any more native-like, and in fact the lowest r.m.s. deviation of the ensemble may well increase. This is the first case we have studied where the current potential function appears to make significant errors in distinguishing the native fold from reasonable compact alternatives encountered in the simulation.
It thus provides an important test for the further understanding of the potential and the evaluation of alternatives.

Figure 4. Superimposition of Cα worms for the native (yellow) and calculated (blue) structures of CTF. The r.m.s. deviation between the two structures is 5.0 Å.

The Detailed Model

Overview

Structures generated with the reduced model of the previous section can be further analyzed by introducing another level in the hierarchical framework. The detailed model is a standard united atom representation of the protein molecule which can then be simulated employing traditional molecular mechanics force fields. This model serves a twofold purpose: a detailed all-atom representation is clearly required if accurate structures at the 1 to 2 Å resolution level are to be obtained; furthermore, detailed analysis of competitive reduced model structures should help understand the strengths and limitations of the simplified potentials. Another important issue that can be addressed in the context of detailed model simulations is the quality of molecular mechanics potentials and of continuum solvent treatments. Our reduced model is capable of generating diverse protein conformations that are competitive alternatives to the native. Analysis of the energetics of these structures and comparison with the native will allow critical evaluation of the force field.

We first present the representation of the detailed all-atom model, the procedure used to map a minimized reduced structure onto an all-atom one, and the potential employed in the simulations. We then describe in detail the calculations for the detailed model, which were carried out using the MacroModel/BatchMin modeling package (Mohamadi et al., 1990). Finally, we report the results for our three test proteins.

Model representation and algorithms

All-atom structures are generated from the reduced model main-chain fold by adding explicit side-chains. This is a well known and difficult problem (Janin et al., 1978; Lee & Subbiah, 1991; Desmet et al., 1992), whose complexity is due to the astronomical number of possible structural permutations.
For our purposes it is not actually necessary to achieve an accurate prediction of side-chain conformations, but rather some reasonable initial guess that can then be manipulated via minimization and/or simulated annealing. This is even truer in view of the fact that one would like to let the main-chain relax along with the side-chains. To generate detailed atomic models we used the RSA program developed by Peter Shenkin and co-workers (Farid et al., 1992). Reduced model structures are initially dressed with planar side-chains and random χ1 torsion angles. Side-chain rotamer space is then explored with the RSA code, which uses a Monte Carlo algorithm to sample from a rotamer library. A simulated annealing scheme is used to minimize bumps between side-chains. The RSA code produces all-atom structures which might still have bad contacts. This could be due to inadequacies in the optimization procedure and/or to the restriction to rotamers in describing side-chain conformations.

In the reduced model, the φ and ψ dihedral angles of each residue are not restricted to any particular region of the Ramachandran map, i.e. each residue can sample the same discrete set of main-chain dihedral angles, regardless of amino-acid type. This choice is motivated by the fact that in this way the reduced model is more flexible, and consequently capable of traversing configuration space more efficiently. However, the all-atom structure modeling of proline residues requires the φ angle to be nearly fixed. In the present studies we have treated prolines as alanines (there are four proline residues in myoglobin, one in CTF and two in R69), judging that even at this stage of refinement added flexibility for these residues could be beneficial.

Terminal loops are not modeled in the reduced structures. In order to avoid spurious interactions that might derive from charged and/or polar ends, we capped the all-atom structures with an acetyl group at the N terminus and N-methyl amide at the C terminus.
These two groups are modeled from the two Cα atoms at either end of the reduced representation.

Calculations for the detailed model are carried out using the AMBER* force field, the original AMBER potential of Kollman and co-workers (Weiner et al., 1984) with additional parameters for organic functionality (McDonald & Still, 1992). A united-atom scheme was used, whereby hydrogen atoms are explicitly considered only for polar atoms. Solvent was treated with the GB/SA continuum solvation model (Still et al., 1990). This model is based on a continuum dielectric for solvent polarization and a solvent-accessible surface area treatment of the cavity and van der Waals solvation components.

Structures generated with the RSA scheme are subjected to minimization using the conjugate gradient method in MacroModel/BatchMin. Minimization is carried out including solvation and employing an analytical approximation of surface areas. To speed up the calculation we also employ cutoffs for non-bonded interactions: 7 Å for van der Waals interactions and 12 Å for electrostatic interactions. With these cutoffs, 27%, 65% and 70% of the non-bonded pair interactions are included in the calculation for myoglobin, CTF and R69, respectively. A convergence criterion is set by requiring that the gradient be less than or equal to 0.05 kJ mol⁻¹ Å⁻¹. In practice we have run minimization with a prefixed maximum number of iterations of 10,000, achieving convergence in most cases (typically after 7000 iterations for myoglobin, 3500 iterations for CTF and 4000 iterations for R69). The energy for the final structures is evaluated with essentially infinite cutoffs (i.e. cutoffs for which no significant change in the energy is observed by increasing them) and by using accurate numerical areas for solvation. The computational cost of conjugate gradient minimization is substantial; we ran our calculations on IBM

RS/ and 370 workstations with an average CPU time of 13.5 hours for myoglobin, two hours for CTF and 2.5 hours for R69. Nonetheless, these calculations are orders of magnitude less expensive than calculations involving explicit solvent, where hundreds or thousands of discrete solvent molecules are used to model solvent effects.

A subset of the minimized structures was further optimized using simulated annealing. Conformational sampling is performed by means of stochastic dynamics using the SHAKE protocol to constrain bonds to hydrogen atoms and a 1.5 fs time step. An initial equilibration is carried out at 300 K for 10 ps. Typically, the temperature is then lowered to 50 K in 40 ps with a linear cooling schedule. This is followed by a 3000 iteration conjugate gradient minimization. We experimented with different cooling rates and higher initial temperatures for CTF, and found that the annealing protocol described above produced structures with the lowest energy. The results of this analysis are described below in the Simulated annealing subsection of section three, where we also discuss convergence for our simulated annealing procedure. In the following, when we refer to simulated annealing calculations, we always mean the combination of equilibration, annealing and minimization described above. The CPU time for simulated annealing runs is determined by the time required for each step of stochastic dynamics (3, 0.9 and 1 seconds for MBO, CTF and R69, respectively) and the time of each conjugate gradient minimization iteration (7, 2 and 2.3 seconds for MBO, CTF and R69, respectively).

Results

Overview

The procedures described above were used to produce a large number of energetically plausible but substantially different conformations of the three proteins considered in this study.
While the complexity of the potential energy functions and the procedure itself make the analysis of the results nontrivial, it is still possible to ask and provide at least preliminary answers to a number of important questions. These are as follows:
(1) What is the performance of the AMBER*/GB potential in ranking the native structure as compared to the alternatives that we have generated? Can anything be inferred about the strengths or weaknesses of the molecular mechanics force field and solvation model used?
(2) What are the uncertainties at each step of the dressing process (addition of side-chains, conjugate gradient minimization, simulated annealing), i.e. what sort of energetic and geometrical variations are obtained in the final structure if one starts from a given reduced model structure and repeats the dressing procedure many times?
(3) Are there systematic differences in any of the components of the energy for the native structure as compared to alternative structures? Can anything be inferred about the driving forces for protein folding from such systematic behavior?
(4) How well does the reduced model potential correlate with the AMBER*/GB potential? Are particular terms in either potential good predictors of native-like structures?
(5) Can a better reduced model potential (i.e. one capable of higher resolution and reliability) be designed after the above analysis is completed?
In addition to presenting the raw data in various schematic forms, we shall attempt to address each of these questions. The computations presented here are rather expensive, so the procedures described in Model representation and algorithms, of section three, were not applied to all the structures generated with the reduced model. Instead, we have tried to mix a broad survey of many structures with an in-depth analysis of the computational procedures for a small subset of these structures.
In analyzing in detail the components of the AMBER*/GB potential energy, we focus on the van der Waals, electrostatic Coulombic, electrostatic solvation and surface area terms. The remaining terms (stretches, bends and torsions) are critical in maintaining the connectivity of the protein and its local geometry, but exhibit very small differences from structure to structure and hence are unimportant, at least at the level of resolution examined here, in the ranking of conformations.

Detailed model data sets

Detailed model calculations were performed on sets of structures selected from the reduced model distributions (Figures 1, 3 and 5) and on the corresponding X-ray structures. We typically selected structures with low reduced energy or with low r.m.s. deviation from the native, but we have also analyzed structures in the middle of the distribution or with high energy and high r.m.s. deviation. For myoglobin we have also studied two extended structures that do not appear in the reduced model distribution; these were obtained by running the reduced model code for just a few steps so that each of the loop regions, initially in helical conformation, is assigned a loop from the loop list. The results of conjugate gradient minimization and simulated annealing calculations for MBO, CTF and R69 are presented in Table 1. Each structure is identified by a code corresponding to its sequential position in the reduced model distribution (which contains 1024 not necessarily distinct structures). The r.m.s. deviations reported in the Table are based on the positions of the α-carbon atoms only and are relative to the minimized native structure. For consistency with the generated structures, the native structure was stripped of the loops at each end and capped as described in Model representation and algorithms, of section three. The prosthetic heme group of myoglobin was neglected in the calculations.
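As an illustration of the α-carbon r.m.s. deviation used to compare each structure with the minimized native, the following minimal sketch computes the deviation between two equal-length coordinate lists. It assumes the two structures have already been optimally superimposed, since the superposition procedure is not described in this section; the function name and the toy coordinates are hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """R.m.s. deviation (same units as the input, e.g. Å) between two
    equal-length lists of (x, y, z) alpha-carbon positions.

    Assumption: the structures are already superimposed; no fitting is done.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Toy three-residue example (hypothetical coordinates, not from the paper).
    native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    model = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
    print(ca_rmsd(native, model))
```

In practice a least-squares superposition would normally precede this calculation; it is omitted here to keep the sketch short.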
Figure 6 plots the reduced model energy versus the total AMBER*/GB energy for the minimized MBO structures. A triangular distribution is observed,

indicating that structures with high reduced energy also have high AMBER*/GB energy, while structures with low reduced energy do not necessarily have a good AMBER*/GB energy. The r.m.s. deviation from the minimized native structure as a function of the total AMBER*/GB energy is plotted for MBO in Figure 7. A gap is present between native and generated structures (a feature that is absent in the distribution for the reduced potential). We observe that the lowest-energy structures in the high-r.m.s. and low-r.m.s. clusters have comparable energies. Analysis of the different energetic components listed in Table 1 reveals that a correlation exists between surface area and van der Waals energies and between electrostatic Coulombic and solvation energies. This is shown in Figures 8 and 9, respectively, for the MBO structures of Table 1A. The sum of the internal electrostatic energy and of the electrostatic solvation energy has a much lower variance over structures (hundreds of kJ/mol) than the individual components, which vary by thousands of kJ/mol. The r.m.s. deviation versus the total electrostatic energy is plotted in Figure 14 for all of the structures of Table 1A.

Table 1. Minimization and simulated annealing results for MBO (A), CTF (B) and R69 (C). Columns: ID, IRG, r.m.s., RG, SA, vdw, ESC, ESS, TES, E. Uncapped structure. ID is an identifier for the structure, IRG is the radius of gyration of the starting structure in Å, r.m.s. is the Cα r.m.s. deviation from the native in Å, RG is the final radius of gyration in Å, SA is the surface area energy, vdw is the van der Waals energy, ESC is the electrostatic Coulombic energy, ESS is the electrostatic solvation energy, TES is the total electrostatic energy and E is the total energy. All energies are in kJ/mol. Each structure's ID consists of its sequential number as obtained from the reduced model distributions (compare Figures 1, 3, and 5), 0 being used for the native, a second number (1, 2,...) if different side-chain additions were considered, and the suffix SA if the structure was the result of a simulated annealing run (different simulated annealing runs with the same initial structure are indexed SA-1, SA-2,...). E1 and E2 refer to the two MBO extended structures (see text for details).

Side-chain addition

Our most detailed study of variability in the side-chain addition procedure has been carried out for the native backbone conformation of CTF. The RSA procedure of Shenkin and co-workers was run 52 times, followed by conjugate gradient minimization. The average Cα r.m.s. deviation of all runs from the native structure is 0.65 Å and the average all-atom r.m.s. is 1.65 Å; this is competitive with results from other procedures reported in the literature for side-chain addition (Lee & Subbiah, 1991). With regard to packing (as measured by the van der Waals energy), our side-chain addition method does rather well; indeed, there is little variation in this component over the 52 runs (the average difference for the van der Waals energy between the generated side-chains and the native is kJ/mol, with a standard deviation of 2.32 kJ/mol). For electrostatics (Coulombic plus solvation), the average difference of all runs from the native is kJ/mol, with a standard deviation of 5.37 kJ/mol. This is not surprising because electrostatics and solvation are not included in the approximate side-chain potential function used in the RSA procedure.
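Because the electrostatic term is reported as the sum of Coulombic and solvation contributions (TES = ESC + ESS), strong anticorrelation between the two components makes the sum vary far less than either part. The sketch below illustrates this with the identity Var(A + B) = Var(A) + Var(B) + 2 Cov(A, B); the four energy pairs are invented for illustration only and are not values from Table 1, and the helper names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def variance(values):
    """Population variance, i.e. the covariance of a sequence with itself."""
    return covariance(values, values)


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))


if __name__ == "__main__":
    # Hypothetical energies in kJ/mol for four structures (NOT data from the paper).
    coulombic = [-12000.0, -11000.0, -13000.0, -10500.0]
    solvation = [-8000.0, -9100.0, -6900.0, -9400.0]
    total = [c + s for c, s in zip(coulombic, solvation)]

    print("correlation ESC vs ESS:", pearson(coulombic, solvation))
    print("std dev ESC:", math.sqrt(variance(coulombic)))
    print("std dev ESS:", math.sqrt(variance(solvation)))
    print("std dev TES:", math.sqrt(variance(total)))
```

With anticorrelated inputs such as these, the covariance term nearly cancels the two individual variances, which is the pattern the text describes for the Coulombic and solvation energies.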
Such potentials can be added to the side-chain dressing procedure and work along these lines is currently in progress. Although the relative difference of both van der Waals and total electrostatic energies of all runs with respect to those of the native is comparable (of the order of 1 to 2%), it is the actual value of the energy that matters in the ranking of structures. The overall energetic gap observed over the 52 runs between the native and the generated structures is almost entirely due to the total electrostatic component. Further evidence of this fact is that some of the generated structures show van der Waals energies lower than the native, while this is never observed for the electrostatic energy. In addition to this extensive study employing the native backbone conformation of CTF, we have also carried out a number of experiments where the same

backbone conformation generated with the reduced model was dressed to give different initial all-atom structures. Examples of this analysis are reported in Table 1 for myoglobin structures 18, 994, 51, and 475, and CTF structures 249 and 579. The above results suggest that electrostatic stabilization of the generated structures may be underestimated by the side-chain addition/conjugate gradient minimization procedure as compared to the native structure. Again, this is no surprise, considering the overlap-only nature of the RSA method. The ability of simulated annealing to close this gap is addressed in the next subsection.

Figure 6. Plot of reduced model energy versus AMBER*/GB energy for minimized MBO structures.

Figure 8. Plot of surface area energy versus van der Waals energy for all MBO structures in Table 1A. In this Figure, as well as in the following ones, squares label structures obtained via minimization and triangles label structures obtained via minimization followed by simulated annealing.

Simulated annealing

In view of the extreme jaggedness of the potential energy surface in the all-atom model, conjugate gradient minimization is not the most satisfactory energy optimization technique. Minimized structures might be surrounded by structures with lower energy which even relatively small barriers can make inaccessible when a conjugate gradient method is used. To investigate this scenario we subjected some of the minimized structures to simulated annealing. The effects of the simulated annealing procedure (described in Model representation and algorithms, of section three) are presented in Figure 10, which plots the Cα r.m.s. deviation of the starting structure from the native versus the r.m.s. deviation of the final structure from the starting one and versus the change in total energy, respectively, for MBO. Although non-native structures tended to move by 2 Å r.m.s.
in the course of simulated annealing, this procedure altered the r.m.s. deviation to the native structure by only about 0.5 Å. To ensure that our simulated annealing scheme is appropriate for the study of medium to large molecules, we used CTF as an example to investigate the effect of the choice of parameters (cooling rate, initial temperature, random seed selection) on the final energy. All runs start from the same minimized solvated native structure and in all cases simulated annealing is preceded by a 10 ps equilibration at 300 K and is followed by 3000 conjugate gradient minimization steps. We tried five different linear cooling rates from 300 to 50 K. The final energies are reported in Table 2, which shows that the largest energy difference observed is 85 kJ/mol. Although most of our simulations were performed with an initial temperature of 300 K, we also studied the effect of varying the initial temperature. Experiments starting from 600 K yield final energies about 500 kJ/mol higher than energies obtained via conjugate gradient minimization, indicating that initial temperatures that are too high lead to other, less stable local energy minima. On the other hand, runs starting from lower temperatures (200, 300 and 400 K) give comparable final energies. How random seed selection affects the final energy was investigated using both the native and one of the dressed structures (249). For both structures we performed ten runs, each one with a different seed initialization. For the native we obtained a mean energy of kJ/mol with a standard deviation of kJ/mol, and for the 249 structure a mean energy of kJ/mol with a standard deviation of kJ/mol. Based on these results, we expect energies obtained via simulated annealing to be accurate only up to approximately 15 kJ/mol.

Figure 7. Plot of r.m.s. deviation versus AMBER*/GB energy for minimized MBO structures.

Figure 9. Plot of electrostatic Coulombic energy versus electrostatic solvation energy for all MBO structures in Table 1A.

Figure 10. Conformational and energetic effects of simulated annealing for MBO structures: plots of r.m.s. deviation of starting structure from native MBO versus r.m.s. deviation of final annealed structure from starting one (a) and versus change in total energy upon simulated annealing (b).

Table 2. Final energies of native CTF for five simulated annealing runs with different cooling rates. Columns: Run, Length (ps), Rate (K/step), Final energy (kJ/mol). In all runs the initial and final temperatures were 300 K and 50 K, respectively.

Total energies
How well does the AMBER*/GB potential do at discriminating the native structure from misfolded structures generated by our reduced model simulations? The r.m.s. deviation from the native versus the total AMBER*/GB energy for the structures presented in Table 1 is plotted in Figure 11, where different symbols are used for minimized structures and for minimized structures followed by simulated annealing. First, it is noteworthy that the minimized native structure does not have a lower energy than alternative structures subjected to simulated annealing. The conformational adjustments generated by simulated annealing are responsible for a significant lowering of the energy. After simulated annealing, the native structure is lowest for myoglobin (by 40 kJ/mol) and for CTF (by 140 kJ/mol). These energy gaps are not unreasonable estimates for the stabilization of the native structure as compared to plausible compact alternatives. For R69, some misfolded structures are very close in energy to the native even after simulated annealing has been carried out. The conclusion is therefore that the AMBER*/GB potential performs reasonably well (certainly much better than the present reduced model potential), but that improvements are necessary to render it completely reliable for an arbitrary protein. With regard to the remaining structures, native-like structures (e.g. 6 Å r.m.s. structures for myoglobin) do not have energies that are substantially better on average than non-native basins, in particular the basin centered around 12.5 Å r.m.s. deviation. We defer a discussion of the implications of this observation until the various components of the energies have been analyzed in detail. On the other hand, structures with very poor scores from the reduced model potential typically have substantially higher energies with the AMBER*/GB potential; the correlation between the two potentials is plotted in
