Natural proteins have the algorithm of finding the global

Size: px

Start display at page:

Download "Natural proteins have the algorithm of finding the global"

Sheila Armstrong
6 years ago
Views:

1 Shaping up the protein folding funnel by local interaction: Lesson from a structure prediction study George Chikenji*, Yoshimi Fujitsuka, and Shoji Takada* *Department of Chemistry, Faculty of Science, and Graduate School of Science and Technology, Kobe University, Nada, Kobe , Japan; Department of Computational Science and Engineering, Graduate School of Engineering, Nagoya University, Nagoya , Japan; and Core Research for Evolutionary Science and Technology, Japan Science and Technology Agency, Nada, Kobe , Japan Edited by R. Stephen Berry, University of Chicago, Chicago, IL, and approved December 30, 2005 (received for review September 22, 2005) Predicting protein tertiary structure by folding-like simulations is one of the most stringent tests of how much we understand the principle of protein folding. Currently, the most successful method for folding-based structure prediction is the fragment assembly (FA) method. Here, we address why the FA method is so successful and its lesson for the folding problem. To do so, using the FA method, we designed a structure prediction test of chimera proteins. In the chimera proteins, local structural preference is specific to the target sequences, whereas nonlocal interactions are only sequence-independent compaction forces. We find that these chimera proteins can find the native folds of the intact sequences with high probability indicating dominant roles of the local interactions. We further explore roles of local structural preference by exact calculation of the HP lattice model of proteins. From these results, we suggest principles of protein folding: For small proteins, compact structures that are fully compatible with local structural preference are few, one of which is the native fold. These local biases shape up the funnel-like energy landscape. computational protein design energy landscape fragment assembly Go model SIMFOLD Natural proteins have the algorithm of finding the global minimum of their free energy surface within a biologically relevant timescale (1). One of the most stringent tests of how much we understand the algorithm may be to predict protein tertiary structures by simulating processes that are analogous to folding, which is often called de novo structure prediction. Recently, significant progress has been made in de novo structure prediction, in which the most successful method is the fragment assembly (FA) method developed by Baker and coworkers (2) and others (3 6). The FA method shows considerable promise for new fold targets of recent Critical Assessments of Techniques for Protein Structure Prediction (CASPs), the community-wide blind tests of structure prediction (7 11). In the FA method, the protocol is separated into two stages: First, we collect structural candidates for every short segment of the target sequence, retrieving them from the structural database. The second stage is to assemble fold these fragments for constructing tertiary structures that have low energies. Simple questions arose as to why the FA method is so successful and what we can learn about protein folding from the success of the FA method. According to Baker and coworkers (12, 13), the FA method is based on the experimental observation that local sequence of a protein biases but does not uniquely decide its local structure. To what extent does the modest local bias influence tertiary structures generated? How is the FA method related to recently developed protein folding theory (14, 15)? In this work, we address these questions. For this purpose, we need structure prediction software that uses the FA method. Here, we use the in-house-developed software, SIMFOLD. SIMFOLD employs a coarse-grained protein representation and has its own energy function derived from physicochemical considerations (16, 17). Conformational sampling is performed by the FA method (18, 19). In the recent CASP6, our group Rokko (as well as the server Rokky) used the SIMFOLD software as a primary prediction tool, demonstrating one of the top-level prediction performances in the new fold category (elite club membership by the assessor, B.K. Lee). In this work, we start with comparison of the structures predicted in the past. We then design a structure prediction test of chimera proteins, where the local structural preference is that for the wild-type (WT) sequence and nonlocal interactions are the only sequence-independent compaction force. We show for some small proteins that the chimera proteins can find the native-like topology of the intact proteins with high probability, indicating that local structural biases and nonspecific compaction forces together lead proteins to their native folds. We further explore this idea by performing exact calculation of the HP lattice model of proteins. The result supports the above suggestion and further indicates that compact structures that are fully compatible with weak bias in local structure are few, one of which is the native fold. These studies elucidate that local structural biases of modest strength make the energy landscape of proteins funnel-like. Results and Discussion Lesson from Structure Prediction in the Past. We start with two observations about comparisons of structures predicted in the past, which provide important insight. First, by using the same energy function (SIMFOLD), we compared the prediction performance by molecular dynamics (MD) simulation with that by the FA method for some small proteins (17). We found that the performance is strikingly different, albeit use of the same energy function. In the case of protein G, for example, both methods found roughly correct secondary structure elements at the right residues, whereas their packing in three dimensions was quite different. MD simulations generated many misfolds with the wrong topology, but not the native fold (four structures that are not marked X in Fig. 1a are snapshots from MD). In contrast, FA could successfully generate the native fold with quite high population. (The structure labeled with Native Fold in Fig. 1b is from FA simulation.) Why did it happen, and what does it tell us? The lowest energy of the misfold found in MD was slightly lower than that around the native basin, indicating that the SIMFOLD energy was not good at identifying the native fold. So, why could FA generate the native fold with high population? Because the FA method restricts local structure to that found in the database, local structure is not affected by the energy function used, as is the case for MD. Indeed, the lowest-energy structures found in MD contained some local structures, especially in the loops, that never appear in the structural database, implying inaccuracy of local interac- Conflict of interest statement: No conflicts declared. This paper was submitted directly (Track II) to the PNAS office. Abbreviations: CASP, Critical Assessment of Techniques for Protein Structure Prediction; FA, fragment assembly; MD, molecular dynamics; PDB, Protein Data Bank; rmsd, rms deviation. To whom correspondence should be addressed. stakada@kobe-u.ac.jp by The National Academy of Sciences of the USA BIOPHYSICS cgi doi pnas PNAS February 28, 2006 vol. 103 no

(b) The FA method generated the native fold with high probability, suggesting a funnel-like landscape.

2 Fig. 1. Comparison of structures generated by MD (a) and FA (b) with the same energy function and corresponding schematic energy landscapes. (a) MD simulations generated four misfold structures, suggesting a highly rugged energy landscape. (b) The FA method generated the native fold with high probability, suggesting a funnel-like landscape. Structures without red X were obtained by the corresponding simulation, whereas those with red X were not observed. The cartoons of protein structures were generated by MOLSCRIPT (59). tion of SIMFOLD, e.g., because of ignored local hydrogen bonding between main chain and side chain. The FA method is free from this inaccuracy in local interactions, which may be the primary reason that the FA method works so well. Success of FA, in turn, suggests importance of local interactions for encoding the native fold. The second observation is from the recent CASPs. In the new fold category of CASP6, most prediction groups of high score used FA-based methods. Although each group used its own scoring (i.e., energy) function, predicted structures were sometimes similar to each other. Obviously, well predicted structures were similar. Interestingly, sometimes predicted structures with the wrong fold were somewhat similar. A nice example is the target T0201 from CASP6. Shown in Fig. 2 are prediction models of three independent groups, Rokko, Baker ROBETTA, and Jones UCL, by the FA methods. Apparently, they have the wrong fold, but are very similar to each other. This observation suggests that the prediction of the FA method is robust against difference in scoring functions. Chimera Experiment. Motivated by these two observations, we designed a chimera experiment for addressing robustness of the FA method with respect to the energy function and the role of local interactions for protein folding. Here, we note that local and nonlocal interactions are treated explicitly in separate stages in the FA method; local structural preference is taken into Fig. 2. Examples of structure predictions from the new fold category in CASP6. (a) The native structure of target T0201 (PDB ID code 1s12). (b) Model 1 from group Rokko. (c) Model 6 from group Baker Robetta. (d) Model 1 from group Jones UCL. account when fragment candidates are listed up in the first stage, whereas nonlocal interactions play central roles in assembling folding these fragments in the second stage. Taking advantage of this separation, we designed the following chimera experiment. We use protein G (56-residue protein) as an example. First, we prepared fragment candidates of every overlapping eight residues, as usual, for protein G. Second, in the assembling stage, we used the energy function for the 56-residue polyvaline (polyval) sequence instead of the sequence of protein G and performed multicanonical Monte Carlo simulations of assembling (18). Thus, the simulation was conducted for the chimera interactions; local structural preference is for protein G, and nonlocal interaction is for polyval. For the latter, we chose Val nearly by chance as a representative of a hydrophobic residue. Hydrophobic residues tend to attract each other, thus introducing nonspecific force of compaction (collapse). Note that we are not interested in polyval folding itself. We tested the same experiment using polyisoleucin and polyleucine and found essentially the same results (data not shown), confirming that the choice of Val is not essential. As a control experiment, we also performed structure prediction of protein G using the WT sequence in the folding assembling stage, as usual. The chimera and control experiments are denoted as [polyval WT] and [WT WT], respectively. (Here, [A B] means that fragment candidates are prepared for B and the assembling simulation uses interactions of sequence A.) The fragment candidates prepared are identical between chimera and control experiments. We now show the results. Fig. 3 a and b plot energies and rms deviations (rmsds) of structures generated by assembling simulations of [WT WT] (control) and [polyval WT] (chimera), respectively. Here, the rmsd is measured from the native. We clearly see a stronger correlation between rmsd and a lower envelope of energies in the [WT WT] case than that in the [polyval WT] case, suggesting that the WT sequence has a more funnel-like energy landscape, and thus the prediction performance of the WT sequence is better than that of the chimera protein. Surprisingly enough, there is weak, but significant, correlation between the rmsd and the lower envelope of energies in the [polyval WT] case. Next, as usual in structure prediction, we performed cluster analysis of low energy structures. Fig. 3 d and e show the structures of the largest cluster centers for [WT WT] and [polyval WT], respectively. (The native structure of protein G is depicted in Fig. 3c as a reference.) As expected, the predicted structure of the [WT WT] case is quite similar to the native (rmsd is 4.0 Å). Surprisingly, the predicted structure of the [polyval WT] case also has the correct fold (rmsd of 5.1 Å), although secondary structure elements are slightly perturbed; the second -strand is too short. The size of the largest cluster for [polyval WT] is reasonably large ( 20% in population) by the standards of de novo structure prediction, although it is smaller than the control case. The structure prediction performance by the FA method is unexpectedly robust with respect to the energy function used for assembling. This robustness is not limited to protein G. We performed the same chimera and control experiments as above for an all -protein, 434 C1 repressor [Protein Data Bank (PDB) ID code 1r69), and an all -protein Hex1 (PDB ID code 1khi). rmsd energy plots of 434 C1 repressor for the [WT WT] and that for the [polyval WT] cases are shown in Fig. 3 f and g, respectively. The two figures look quite different, reflecting the difference in energy functions used in assembling: For the [WT WT] case, weak energetic bias is found toward smaller rmsd, whereas it is not seen in the chimera case. As was the case for protein G, performing cluster analysis for low-energy structures, we identified that the structures of the largest cluster center have the native fold for both cases (see Fig. 3 i and j), although the predicted structure for [polyval WT] (rmsd 5.5 Å) is worse than cgi doi pnas Chikenji et al.

Fig. 3. Results of control and chimera experiments of protein G (a e), 434 C1 repressor (f j), and Hex1 (k o). (a) rmsd energy plots of protein G for [WT WT]. (b) rmsd energy plots for [polyval WT].

3 Fig. 3. Results of control and chimera experiments of protein G (a e), 434 C1 repressor (f j), and Hex1 (k o). (a) rmsd energy plots of protein G for [WT WT]. (b) rmsd energy plots for [polyval WT]. (c) The native structure of protein G (PDB ID code 1pgb). (d) The structure of the largest cluster center of protein G for [WT WT]. (e) The structure of the largest cluster center for [polyval WT]. (f) rmsd energy plots of 434 C1 repressor for [WT WT]. (g) rmsd energy plots for [polyval WT]. (h) The native structure of 434 C1 repressor (PDB ID code 1r69). (i) The structure of the largest cluster center of 434 C1 repressor for [WT WT]. (j) The structure of the largest cluster center for [polyval WT]. (k) rmsd energy plots of Hex1 for [WT WT]. (l) rmsd energy plots for [polyval WT]. (m) The native structure of Hex1 (PDB ID code 1khi). (n) The structure of the second largest cluster center of Hex1 for [WT WT]. (o) The structure of the largest cluster center for [polyval WT]. that for [WT WT] (rmsd 3.5 Å). Results for Hex1 are shown in Fig. 3 k and l. In this case, there is no correlation between rmsds and energies for both [WT WT] and [polyval WT] cases, and thus it is expected that predictions are worse than the previous cases of protein G and 434 C1 repressor. Fig. 3 n and o shows the structure of the second largest cluster center from the [WT WT] simulations and the structure of the largest cluster center from the [polyval WT], respectively. Interestingly, the structure for [polyval WT] has the correct fold with rmsd of 5.4 Å, whereas that for the [WT WT] simulations has a pair of -strands (fourth strand in yellow and fifth strand in red) flipped, resulting in a larger rmsd of 5.9 Å. (The largest cluster of the [WT WT] simulations is even worse.) The above chimera experiment showed that local structural preferences specific to the WT sequence and the nonspecific compaction force together drive the protein into its native fold with reasonably high probability for small proteins. In other words, given the fragment structure ensemble that reflects local structure preferences of the target sequence, compact structures that are fully compatible with these fragments are few, one of which is the native fold. We also performed the same experiments for some mediumto-large size proteins, finding that the performance of chimera experiments, on average, becomes worse as the size increases (data not shown). Thus, the conclusion above is limited to small proteins. Nature of Short Fragments Taken from Proteins. Because the above results highlight the importance of local structural bias, we here briefly discuss physical properties of short fragments taken from proteins. First, a class of peptides excised from proteins can fold to the native structure by themselves, independently from other parts of the proteins (20 23). For these peptides, local interactions are so strong and specific that the peptide conformation is constrained into basically one structure. Another class of short sequences, often called chameleon sequences, is known to fold to entirely different conformations when they are embedded in different sequence contexts, demonstrating possible multiplicity of local structure bias (24). Some other peptides, once cleaved, do not possess any well defined structure (25). For these peptides, some local structures are prohibited by, for example, steric clash between side-chain atoms or by proximity of two atoms with the same charge. Thus, local bias does not pin down their conformations but reduces available conformational space. From these observations, it is reasonable to consider that short peptides excised from proteins have local biases and restriction in conformational space that are strongly sequence dependent. One of the biased conformations corresponds to the native. All atom MD simulations for short peptides extracted from proteins support this idea (26, 27). In the FA method, it is assumed that such biased conformations of short segments can be approximated by fragment conformations retrieved from the structure database by sequence-comparison methods. Impact of Weak Local Constraint: Lattice HP Model. Then, how and to what extent do local structural biases alter the global shape of the energy landscape? For addressing this question with least ambiguity, we conducted an exact calculation using a simple two-dimensional lattice (2D) HP model of proteins (28), to which we weakly added local structural biases, mimicking characteristics of short peptides of real proteins. In the lattice HP model, a protein is represented as a selfavoiding chain on a 2D square lattice with two types of amino acids: H (hydrophobic) and P (polar). The energy, E, of a chain conformation is determined only by the number of H H contacts, h, ase h, where is a positive constant (used as the energy unit). With this simplicity, complete enumeration in structural and sequence space is feasible for short chains, and thus the sequence structure relationship of this model has been quite thoroughly uncovered (28, 29). Here, we randomly choose a sequence that has a unique ground state-structure with the chain length n 16. The sequence chosen here is HHHPHH- PHHHHPHHP. By exact enumeration of all possible conformations, we easily BIOPHYSICS Chikenji et al. PNAS February 28, 2006 vol. 103 no

4 Fig. 5. Schematic energy landscapes of the pure HP model (a) and the HP model with local structure constraints (b). Fig. 4. The HP model studied here. (a) Native structure of the HP model. (b) Example of a maximally compact structure of the HP model. (c) All four-residue segments of the HP sequence and all possible structures. Structures surrounded by green rectangles are prohibited by the local structure constraint we impose here. (d) All structures of the first excited states (E 8) of the HP model. Structures are classified by the number of native contacts Q. Only two gray shaded structures are not prohibited by imposing the local conformational constraints. find that this sequence has a unique ground state with energy E 9 at the structure depicted in Fig. 4a and 20 misfolded structures in the first excited states (E 8). All of the latter structures are shown in Fig. 4d, as classified by the nativeness score, the Q score, which is used as a measure of structural similarity to the native and is defined by the number of nonbonded contacts common to those at the native structure (Q in the native state of this model is 9). The presence of many low-energy misfolded structures with low Q scores indicates that the energy landscape of this sequence is highly rugged (see Fig. 5a). It has been known in general that the HP model has a highly rugged energy landscape, which is non-protein-like (30). The HP model simply captures the hydrophobic force, which is widely believed to be the driving force for the protein to fold (31). As is clear from argument above, the key feature missing in this HP model is biases in local conformation. Thus, we introduce sequence-dependent weak biases in local structures of the HP model and consider the effect of them on the global shape of the energy landscape. Local bias of the HP model were investigated before (32, 33), where biases were sequence independent. Based on the argument in the previous section, the local structural bias we introduce should be weak and sequence dependent. Namely, the bias should not decisively pin down local structure of a segment, but should modestly reduce diversity of local structures. Physically, given a sequence of a short segment, some local structures are prohibited by, for example, steric clash. Obviously, these prohibited fragment conformations should not contain those that appear in the native structure. In the HP lattice model, we here impose structure biases on a four-residue fragment. There are six types of four-residue sequences in the sequence we use here, which are HHHH, PHHH, HPHH, HHPH, HHHP, and PHHP. The number of possible conformations of four residues in two dimensions is five, as listed in Fig. 4c. For each four-residue sequence, one of the five structures is chosen as a prohibited conformation (marked by green rectangles in Fig. 4c), where the choice is at random from all structures that do not appear in the native structure. These structures should be considered as conformations with high free energy. We first investigate how the first-excited structures (E 8) are affected by the introduced constraint in local structures, finding that 18 of 20 first-excited structures contain at least one prohibited local structure, and thus they are eliminated. (See Fig. 4d where all nonshaded structures contain prohibited local structures.) In particular, structures with small Q score, i.e., highly dissimilar to the native, are completely eliminated by the weak local constraint, where the Q score is the number of native contacts. Only 2 of the 20 structures remain, and they have relatively high Q score of 7 (shaded ones in Fig. 4d; Q score at the native is 9), which means that these first-excited structures are near native. This result suggests that the energy landscape becomes more funnel-like by adding a weak constraint in local structure (see Fig. 5). Next, we see how characteristics of global conformational space are altered by the weak constraint in local structure. Fig. 6a plots numbers of conformations (in logarithmic scale) as a function of total number of nonbonded contacts C, which reflects compactness, where data for the pure HP model without local constraint and the locally constrained HP model are shown in white and black bars, respectively. For all C, the number of conformations is significantly reduced by adding a local constraint. Also important is that the number of conformations decreases with C, indicating that structures with higher compactness are fewer. Led by these two tendencies, the striking result is that the number of maximally compact structures (C 9) is 69 without the local constraint, but only 2 of 69 remain after imposing the local constraint. One of the two remaining structures is of course the native structure, by design. The other structure (shown in Fig. 4b) has the energy E 5 and Q 1 and thus is highly excited. Suppose we try to predict the ground state of this locally cgi doi pnas Chikenji et al.

5 Fig. 6. The effect of the local structure constraint on the conformational space of lattice protein models. (a) The number of conformations (in logarithmic scale) as a function of total number of contacts C. The white bar indicates log (C) of the pure HP model. The dotted bar indicates log (C) of the HP model with local constraints. (b) Temperature dependence of the number of nonnative contacts. Dotted, solid, and broken lines correspond to the pure HP model, the locally constrained HP model, and Go model, respectively. constrained HP model. Assuming that the ground state is a maximally compact structure, we generate maximally compact structures that satisfy the local conformational rule. There are only two of these structures, and one of them is the native. Thus, we can predict the native structure with probability 0.5 regardless of the energy function. Because two maximally compact structures have the energies of 9 and 5, an energy function that has typical error of 2 energy units is enough to identify the ground state. This finding is perfectly parallel to the fact that, in the chimera experiments, just generating compact structures using fragment libraries produces the native fold with high probability. What Makes the Protein Energy Landscape Funnel-Like? What do these results tell us about protein folding? Recently it was established that the transition state ensemble of folding pathways can often be well described by the so-called Go -like models (34 41), which correspond to the perfect-funnel. The basic assumption in the model is that only native contacts are favorable, and nonnative contacts are neutral or repulsive (42). These studies suggested that the energy landscape of proteins is funnel-like. It is unclear, however, how the smooth and funnelshaped energy landscape is organized from numerous atomic interactions. Important and interesting questions related to this point are what is the physicochemical reason of the funnelshaped energy landscape and why nonnative interactions play only minor roles. In the previous section, we showed that constraints in local structure prohibit many stable misfolded structures, resulting in a funnel-like energy landscape. To see this effect more quantitatively under thermodynamic conditions, we calculated the temperature dependence of the average number of nonnative contacts for the above lattice HP model for the pure and locally constrained HP models (Fig. 6b). For comparison, we also show results for a Go model, in which the native structure is set as the same as that of the HP model (Fig. 4a). As shown in Fig. 6b, there is an apparent peak in the number of nonnative contacts curve for the pure HP model, indicating that there is a nonnative intermediate state in the thermodynamic transition. Thus, the pure HP model does not show a two-state transition. In contrast, the thermodynamic behavior of the number of nonnative contacts of the locally constrained HP model shows a sigmoidal transition, which is quite similar to that of the Go model. Indeed nonnative interactions in the locally constrained HP model are significantly reduced, relative to the pure HP model. Namely, although nonnative interactions exist in the locally constrained HP model, they do not contribute very much, suggesting local structural biases make the energy landscape more funnel-like (see Fig. 5b). We note that the current studies do not suggest that the nonlocal interactions are unimportant. Hydrophobicity and van der Waals packing, which are predominantly nonlocal, play major roles in the thermodynamic stability of proteins. The present work suggests the primary code that specifies the native fold, but not the stability, is in the local structure preferences. Implication to Computational Protein Design. The perspective that there are only limited numbers of compact structures fully compatible with local constraints may have implications to protein design as well. Recently, considerable progress has been made in de novo design of foldable proteins (43 45), where the primary strategy is to seek the sequence that optimizes sidechain interactions at a target structure. From the perspective of folding theory, however, it has been anticipated that, instead of optimizing the interactions at the target structure, the (normalized) gap between target energy and the average energy of the denatured ensemble should be optimized. Lattice model simulations support this theory (46). However, using this idea directly for designing real protein sequences is limited (47). More importantly, major successes of protein design so far come from optimization of the interaction energy only at the target (43 45). How can the practice and the theory be reconciled? Obviously by now, the perspective that there are only limited numbers of compact structures fully compatible with local constraints indeed fills the gap: If this is the case and if the target structure is one of the possible compact structures, destabilizing alternative conformations may not be necessary. Discussion on Denatured States of Proteins. The finding that local structural preference and compaction forces together lead the protein into the native fold may have some links to characteristics of denatured states of proteins, which recently attract considerable attention (48 54). In simulations of a -hairpin confined in a pore, Thirumalai and coworkers (52) found that, under highly denatured conditions, the confined -hairpin possesses partially native-like order. This result is in contrast to the denatured ensemble of a bulk -hairpin, which is extended and has little native-like order. Thus, compaction force alone could induce native-like order as in ref. 55. The present work suggests that this hypothesis is quite probable by the help of local structural preferences. Conclusions This study addressed what microscopic interactions make the protein energy landscape funnel-like, motivated the by recent success of the FA method in structure prediction. We reached the following principles of protein folding: (i) There are only limited numbers of compact conformations that are fully compatible with local structural preferences, and the native fold is one of them, and thus (ii) local structure preferences strongly shape up the protein folding funnel. Materials and Methods SIMFOLD energy employs a coarse-grained protein representation. It includes explicit backbone structure and a simplified side chain, which is represented by the centroid of side-chain heavy atoms. The SIMFOLD energy function consists of various interactions: hydrogen bonding, hydrophobic interaction, van der Waals interaction, and so on. Their functional forms are well based on physicochemical considerations, whereas parameters are optimized based on the energy landscape theory and structural database. The explicit expression is described in ref. 17. In the first stage of structure prediction, we prepared fragment candidates for every segment of a target sequence, as usual, using profile profile comparison (56) against a nonredundant set of the structural database. Sequences homologous to the target BIOPHYSICS Chikenji et al. PNAS February 28, 2006 vol. 103 no

6 protein [PSI-BLAST (57) E value 0.01] were rigorously excluded from the fragment libraries, because the focus here is not on homology modeling. We prepared 20 fragment candidates for every eight-residue-length segment of the target protein. Prepared fragment structures would reflect the structural preference of the corresponding segments of the target protein. Second, tertiary structures were generated by assembling the fragment candidates prepared above. Monte Carlo simulation was carried out for finding low-energy conformations with the newly developed reversible fragment replacement algorithm (G.C., Y.F., and S.T., unpublished data). This algorithm has some attractive characteristics over conventional fragment insertion. First, in contrast to conventional algorithms, the current one satisfies the detailed balance condition, thus enabling us to estimate thermodynamic quantities and use modern sampling techniques in computational physics. Here, we use one of these techniques, the multicanonical ensemble method that realizes uniform sampling in energy space, avoiding too-long trapping by misfolded structures (18). Second, in conventional fragment insertion algorithms, insertion of a fragment overwrites part of old fragments and thus short portions of fragments (for example, one-residue-length remains of fragment candidates) remain in any stage of the simulation. Frequent use of short remains of fragments diminishes the advantage of the FA method, that is, correlated, angle information in consecutive residues. The new algorithm rigorously prohibits use of remains of less than four residues, and thus any structure generated is made of portions of fragments with the length between four and eight. Finally, we performed cluster analysis of the resulting lowenergy structures using the pairwise C rmsd as a measure of dissimilarity. Cluster centers of largest cluster sizes were selected as prediction models (58). The cluster cutoff was set to 6 Å here. We thank Satoshi Takahashi and Sotaro Fuchigami for fruitful discussions. We also thank David Leitner for useful discussions and careful reading of the manuscript. G.C. and Y.F. were supported by the Japan Society for the Promotion of Science Research Fellowship for Young Scientists. This work was supported by Grant-in-Aid for scientific research on priority areas Water and Biomolecules from the Ministry of Education, Science, Sports, and Culture of Japan. 1. Fersht, A. R. (1999) Structure and Mechanism in Protein Science: A Guide to Enzyme Catalysis and Protein Folding (Freeman, New York). 2. Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. (1997) J. Mol. Biol. 268, Jones, T. A. & Thirup, S. (1986) EMBO J. 5, Levitt, M. (1992) J. Mol. Biol. 226, Bowie, J. U. & Eisenberg, D. (1994) Proc. Natl. Acad. Sci. USA 91, Jones, D. T. (1997) Proteins 29, Bradley, P., Chivian, D., Meiler, J., Misura, K. M. S., Rohl, C. A., Schief, W. R., Wedemeyer, W. J., Schueler-Furman, O., Murphy, P., Strauss, J., et al. (2003) Proteins 53, Jones, D. T. & McGuffin, L. J. (2003) Proteins 53, Fang, Q. & Shortle, D. (2003) Proteins 53, Karplus, K., Karchin, R., Draper, J., Casper, J., Mandel-Gutfreund, Y., Diekhans, M. & Hughey, R. (2003) Proteins 53, Shao, Y. & Bystroff, C. (2003) Proteins 53, Bonneau, R. & Baker, D. (2001) Annu. Rev. Biophys. Biomol. Struct. 30, Rohl, C., C. E. M. Strauss, K. M. S. Misura & Baker, D. (2004) Methods Enzymol. 383, Bryngelson, J. D., Onuchic, J. N., Socci, N. D. & Wolynes, P. G. (1995) Proteins 21, Onuchic, J. N. & Wolynes, P. G. (2004) Curr. Opin. Struct. Biol. 14, Takada, S., Luthey-Schulten, Z. A. & Wolynes, P. G. (1999) J. Chem. Phys. 110, Fujitsuka, Y., Takada, S., Luthey-Schulten, Z. A. & Wolynes, P. G. (2004) Proteins 54, Chikenji, G., Fujitsuka, Y. & Takada, S. (2003) J. Chem. Phys. 119, Chikenji, G., Fujitsuka, Y. & Takada, S. (2004) Chem. Phys. 307, Blanco, F. J., Rivas, G. & Serrano, L. (1994) Nat. Struct. Biol. 1, Yi, Q., Bystroff, C., Rajagopal, P., Klevit, R. E. & Baker, D. (1998) Protein Sci. 283, Rotondi, K. S. & Gierasch, L. M. (2003) Biochemistry 42, Honda, S., Yamasaki, K., Sawada, Y. & Morii, H. (2004) Structure (London) 12, Zhou, X., Aleber, F., Folkers, G., Gonnet, G. H. & Chelvanayagam, G. (2000) Proteins 41, Wang, Y. & Shortle, D. (1997) Folding Des. 2, Nakajima, N., Higo, J., Kidera, A. & Nakamura, H. (2000) J. Mol. Biol. 296, Higo, J., Ito, N., Kuroda, M., Ono, S., Nakajima, N. & Nakamura, H. (2001) Protein Sci. 10, Lau, K. F. & Dill, K. A. (1989) Macromolecules 22, Li, H., Helling, R., Tang, C. & Wingreen, N. (1996) Science 273, Wolynes, P. G. (1997) Nat. Struct. Biol. 4, Dill, K. A. (1990) Biochemistry 29, Thomas, P. D. & Dill, K. A. (1993) Protein Sci. 2, Irbäck, A. & Sandelin, E. (1998) J. Chem. Phys. 108, Shoemaker, B. A., Wang, J. & Wolynes, P. G. (1997) Proc. Natl. Acad. Sci. USA 94, Portman, J. J., Takada, S. & Wolynes, P. G. (1998) Phys. Rev. Lett. 81, Galzitskaya, O. V. & Finkelstein, A. V. (1999) Proc. Natl. Acad. Sci. USA 96, Alm, E. & Baker, D. (1999) Proc. Natl. Acad. Sci. USA 96, Munoz, V. & Eaton, W. A. (1999) Proc. Natl. Acad. Sci. USA 96, Clementi, C., Nymeyer, H. & Onuchic, J. N. (2000) J. Mol. Biol. 298, Koga, N. & Takada, S. (2001) J. Mol. Biol. 313, Karaniclas, J. & Brooks, C. L., III (2003) J. Mol. Biol. 334, Go, N. (1983) Annu. Rev. Biophys. Bioeng. 12, Dahiyat, B. I. & Mayo, S. L. (1997) Science 278, Walsh, S. T. R., Cheng, H., Bryson, J. W., Roder, H. & DeGrado, W. F. (1999) Proc. Natl. Acad. Sci. USA 96, Kuhlman, B., Dantas, G., Ireton, G. C., Varani, G., Stoddard, B. L. & Baker, D. (2003) Science 302, Shakhnovich, E. I. (1998) Folding Des. 3, R45 R Jin, W., Kambara, O., Sasakawa, H., Tamura, A. & Takada, S. (2003) Structure (London) 11, Smith, L. J., Fiebig, K. M., Schwalbe H. & Dobson, C. M. (1996) Folding Des. 1, R95 R Hammarström, P. & Carlsson, U. (2000) Biochem. Biophys. Res. Commun. 276, Shortle, D. & Ackerman, M. S. (2001) Science 293, Mohana-Borges, R., Goto, N. K., Kroon, G. J., Dyson, H. J. & Wright, P. E. (2004) J. Mol. Biol. 304, Klimov, D. K., Newfield, D. & Thirumalai, D. (2002) Proc. Natl. Acad. Sci. USA 99, Zagrovic, B. & Pande, V. S. (2003) Nat. Struct. Biol. 10, Jha, A. K., Colubri, A., Freed, K. F. & Sosnick, T. R. (2005) Proc. Natl. Acad. Sci. USA 102, Chan, H. S. & Dill, K. A. (1990) Proc. Natl. Acad. Sci. USA 87, Pietrokovski, S. (1996) Nucleic Acids Res. 24, Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, Shortle, D., Simons, K. T. & Baker, D. (1998) Proc. Natl. Acad. Sci. USA 95, Kraulis, P. (1991) J. Appl. Crystallogr. 24, cgi doi pnas Chikenji et al.

Supporting Online Material for

www.sciencemag.org/cgi/content/full/309/5742/1868/dc1 Supporting Online Material for Toward High-Resolution de Novo Structure Prediction for Small Proteins Philip Bradley, Kira M. S. Misura, David Baker*