What should the Z-score of native protein structures be?

Protein Science (1998), 7:1201-1207. Cambridge University Press. Printed in the USA. Copyright © 1998 The Protein Society

LI ZHANG AND JEFFREY SKOLNICK
Department of Molecular Biology, TPC-5, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037

(RECEIVED November 17, 1997; ACCEPTED January 23, 1998)

Reprint requests to: Jeffrey Skolnick, Department of Molecular Biology, TPC-5, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037; e-mail: skolnick@scripps.edu.

Abstract

The Z-score of a protein is defined as the energy separation between the native fold and the average of an ensemble of misfolds, in units of the standard deviation of the ensemble. The Z-score is often used to test knowledge-based potentials for their ability to recognize the native fold among alternatives. However, it is not known what range of values the Z-scores should have if one had a correct potential. Here, we offer an estimate of Z-scores extracted from calorimetric measurements of proteins. The energies obtained from these experimental data are compared with those from computer simulations of a lattice model protein. It is suggested that the Z-scores calculated from different knowledge-based potentials are generally too small in comparison with the experimental values.

Keywords: knowledge-based potentials; protein folding; Z-scores

In protein folding studies, knowledge-based potentials derived from a statistical analysis of known protein structures (Ueda et al., 1978; Maiorov & Crippen, 1992; Kolinski & Skolnick, 1994; Sippl, 1995; Mirny & Domany, 1996; Miyazawa & Jernigan, 1996; Liwo et al., 1997; Park et al., 1997) are frequently used in simplified models of proteins. The quality of such potentials is often assessed by so-called Z-scores, which test how well the potentials differentiate the native fold of a protein from an ensemble of misfolded structures. These Z-scores are calculated by (Bowie et al., 1991; Sippl, 1993):

$$Z_{misfolds} = \frac{\langle E \rangle_{misfolds} - E_{native}}{\sigma_{misfolds}} \qquad (1)$$

where $E_{native}$ is the energy of the native structure of a protein, $\langle E \rangle_{misfolds}$ is the average energy of an ensemble of misfolded structures, and $\sigma_{misfolds}$ is the standard deviation of the energy in this ensemble. However, it is not known what range of Z-score values should be expected for real proteins with exact potentials. Table 1 lists the Z-score ranges calculated from several typical knowledge-based potentials reported in the literature. The Z-scores vary tremendously from protein to protein. The widest spread of Z-scores was found for hydrophobic fitness potentials (Huang et al., 1996), which give a range between 1 and 24 for similar sized proteins, although on average the Z-score is quite high, around 11. In fact, because many approximations are introduced in the derivation of the knowledge-based potentials, there is a great deal of uncertainty as to the validity of these potentials (Jones & Thornton, 1996; Pereira & Pochapsky, 1996; Thomas & Dill, 1996; Skolnick et al., 1997; Wang & Ben-Naim, 1997).

This conflicting situation calls for examination of the physical meaning of the Z-scores. In this paper, we suggest a means of extracting Z-scores from the experimental data of proteins. Such Z-scores are defined differently than $Z_{misfolds}$, but it is possible to relate the two quantities. Thus, it is hoped that the results presented here can provide guidance on reasonable ranges of Z-score values and on assessing the significance of knowledge-based potentials.

Experimental Z-scores of native proteins

Experimentally, from calorimetric measurements on proteins (Makhatadze & Privalov, 1995), the average energy difference between the native and the denatured state and their associated energy fluctuations can be estimated as follows:

$$\langle \mathcal{E} \rangle_U - \langle \mathcal{E} \rangle_N = \Delta H - p\,\Delta V \approx \Delta H \qquad (2)$$

$$\langle \mathcal{E}^2 \rangle_U - \langle \mathcal{E} \rangle_U^2 = k_b T^2 C_v^U \qquad (3a)$$

$$\langle \mathcal{E}^2 \rangle_N - \langle \mathcal{E} \rangle_N^2 = k_b T^2 C_v^N \qquad (3b)$$

where $k_b$ is Boltzmann's constant, $T$ is the absolute temperature, and $\mathcal{E}$ is the energy of a protein molecule (including the interactions within the protein and with the solvent); the brackets indicate ensemble averaging; the subscripts $U$ and $N$ indicate the denatured and the native state, respectively; $C_v^N$ and $C_v^U$ stand for the constant volume heat capacities of the protein in the native and denatured states, respectively. The approximation sign indicates that the effect of volume changes upon protein folding is small; hence, it is ignored here.
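As a rough illustration of the fluctuation relation in Equation 3, the sketch below converts a heat capacity into an energy standard deviation. It assumes molar units, so the gas constant stands in for Boltzmann's constant; the function name `energy_sigma` and the input values are hypothetical and are not taken from the paper's tables.

```python
import math

# Molar gas constant in kJ/(mol K); it plays the role of k_b when energies are per mole.
R_KJ = 8.314462618e-3

def energy_sigma(heat_capacity, temperature):
    """Energy standard deviation (kJ/mol) from a heat capacity in kJ/(mol K)
    at a temperature in K, using sigma^2 = k_b * T^2 * C (Equation 3)."""
    if heat_capacity < 0 or temperature <= 0:
        raise ValueError("heat capacity must be >= 0 and temperature > 0")
    return math.sqrt(R_KJ * temperature ** 2 * heat_capacity)

# Hypothetical heat capacities for a small protein (illustrative values only).
sigma_u = energy_sigma(16.0, 298.15)  # denatured state
sigma_n = energy_sigma(11.0, 298.15)  # native state
print(f"sigma_U = {sigma_u:.1f} kJ/mol, sigma_N = {sigma_n:.1f} kJ/mol")
```

With heat capacities of order 10 kJ/mol/K, the fluctuations come out on the order of 100 kJ/mol, which is the scale against which the unfolding enthalpy is compared later.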

Table 1. Z-scores of knowledge-based potentials for small proteins

Potential(a)   Interaction type                     Average Z-score
A              HP-contact environment               11
B              Atomic vdW(b) + local angles         1.5
C              Residue contact                      7
D              Residue contact                      8
E              Atomic contact                       15
F              Residue contact + torsion angles     8
G              Atomic solvation                     4

(a) The data were obtained from the following references: A, Huang et al. (1996); B, Ulrich et al. (1997); C, Godzik; D, Miyazawa and Jernigan (1996); E, Melo and Feytmans (1997); F, Kocher et al. (1994); G, Delarue and Koehl (1995).
(b) vdW, van der Waals interactions.

Note that some parts of the total energy of a protein, such as the kinetic energy of atoms and covalent bonding, are independent of conformational changes. Here, we are only interested in those contributions that do change with the conformation of the protein. Thus, we decompose the total energy into two contributions: $\mathcal{E} = E + E_0$, where only $E$ depends on the conformation of the protein. (In the rest of the paper, unless specifically stated, the term energy refers to the conformational energy $E$.) Because $E_0$ does not depend on protein conformation, we have

$$\langle \mathcal{E} \rangle_U - \langle \mathcal{E} \rangle_N = \langle E \rangle_U - \langle E \rangle_N. \qquad (4)$$

The $E_0$ part of the total energy contributes to the heat capacity of proteins. It has been estimated that this contribution is approximately equivalent to 95% of the total heat capacity of an anhydrous protein (Gomez et al., 1995). Based on a global fitting of the heat capacity of proteins in different conformations over a wide temperature range, Gomez et al. found that the conformationally independent contribution to the heat capacity can be evaluated as

$$C_0 = [v + w(T - 298)]\,M_w, \qquad (5)$$

where $v = 0.28$ cal/K/g, $w = 9.75 \times 10^{-4}$ cal/K$^2$/g, K denotes kelvin, and $M_w$ is the molecular weight of the protein. Therefore, fluctuations of $E$ in the native and the denatured state can be obtained from

$$\langle E^2 \rangle_N - \langle E \rangle_N^2 = k_b T^2 (C_p^N - C_0), \qquad \langle E^2 \rangle_U - \langle E \rangle_U^2 = k_b T^2 (C_p^U - C_0), \qquad (6)$$

where the measured constant-pressure heat capacities $C_p^N$ and $C_p^U$ (Table 2) stand in for the constant volume values of Equation 3, volume effects being neglected as in Equation 2. Now, by analogy to Equation 1, the experimental Z-score can be defined as follows:

$$Z_{exp} = \frac{\langle E \rangle_U - \langle E \rangle_N}{\sigma_U}, \qquad \sigma_U^2 = \langle E^2 \rangle_U - \langle E \rangle_U^2, \qquad (7)$$

which measures the energy separation between the native and the denatured state. By using Equations 5 and 6, one obtains

$$Z_{exp} = \frac{\Delta H}{\sqrt{k_b T^2 (C_p^U - C_0)}}. \qquad (8)$$

All the quantities in this equation can be obtained from experiment.

Relationship of $Z_{misfolds}$ to experimental quantities

The Z-score defined in Equation 7 differs from that in Equation 1 because denatured proteins are not equivalent to the ensemble of misfolded proteins (i.e., the structures corresponding to the native conformation of other protein sequences) referred to in Equation 1. The true energy of such misfolded proteins is not known because these misfolded structures are purely hypothetical. In practice, they are often generated from ungapped threading, i.e., by placing a protein's sequence on the backbone of another protein. Such structures are mostly compact and have appreciable secondary structure content. In contrast, denatured proteins are believed to be unfolded with little secondary structure (Dyson & Wright, 1993; Smith et al., 1996). In addition, there is another key difference: the denatured conformations are preselected by a Boltzmann factor (so that very high energy structures are excluded), while an unweighted energy average is used for the misfolds in Equation 1.

To be more specific, let $\mathcal{J}$ refer to the conformational space that takes into account all nonnative protein structures in which covalent bonding and/or excluded volume is not violated, and let $g_{\mathcal{J}}(E)$ denote the corresponding density of states. From elementary statistical mechanics, the average energy of the denatured state can be written as

$$\langle E \rangle_U = \frac{\int E\, g_{\mathcal{J}}(E)\, e^{-\beta E}\, dE}{\int g_{\mathcal{J}}(E)\, e^{-\beta E}\, dE}, \qquad (9)$$

where $\beta = 1/k_b T$. Now, considering all the misfolds as a subspace of the whole conformational space, with $g_{misfolds}(E)$ as their corresponding density of states, we can then write

$$\langle E \rangle_{misfolds} = \frac{\int E\, g_{misfolds}(E)\, dE}{\int g_{misfolds}(E)\, dE}. \qquad (10)$$

In contrast to Equation 9, Equation 10 does not involve a Boltzmann factor of energy. Equation 10 suggests that the corresponding quantity for the denatured state,

$$E_\infty = \frac{\int E\, g_{\mathcal{J}}(E)\, dE}{\int g_{\mathcal{J}}(E)\, dE}, \qquad (11)$$

would be of interest. $E_\infty$ is the average energy of random coils at infinite temperature ($T = \infty$). Note that only in our defined conformational space $\mathcal{J}$ does $E_\infty$ have a meaningful value. Similar to Equation 1, a new Z-score is defined as

$$Z_\infty = \frac{E_\infty - \langle E \rangle_N}{\sigma_\infty}, \qquad (12)$$

where $\sigma_\infty$ is the standard deviation of the energy of the ensemble $\mathcal{J}$.
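The experimental Z-score of Equations 5 through 8 can be sketched numerically as below. This is a minimal sketch under stated assumptions: it works in molar units with the gas constant in place of Boltzmann's constant, it evaluates Equation 5 only at T = 298 K so that the temperature-dependent term drops out, and the molecular weight, enthalpy, and heat capacity are hypothetical values of the right order of magnitude rather than entries from Table 2. The function names are illustrative.

```python
import math

R_KJ = 8.314462618e-3   # molar gas constant, kJ/(mol K)
CAL_TO_KJ = 4.184e-3    # kJ per cal

def c0_at_298(mol_weight, v=0.28):
    """Conformation-independent heat capacity in kJ/(mol K) from Equation 5,
    evaluated at T = 298 K where the w(T - 298) term vanishes.
    v is in cal/K/g and mol_weight in g/mol."""
    return v * mol_weight * CAL_TO_KJ

def z_exp(delta_h, c_u, c0, temperature):
    """Experimental Z-score (Equation 8): unfolding enthalpy divided by the
    energy standard deviation of the denatured state."""
    if c_u <= c0:
        raise ValueError("denatured-state heat capacity must exceed C0")
    sigma_u = math.sqrt(R_KJ * temperature ** 2 * (c_u - c0))
    return delta_h / sigma_u

# Hypothetical small protein: Mw = 7000 g/mol, dH = 100 kJ/mol, C_U = 16 kJ/(mol K).
c0 = c0_at_298(7000.0)
z = z_exp(delta_h=100.0, c_u=16.0, c0=c0, temperature=298.0)
print(f"C0 = {c0:.2f} kJ/(mol K), Z_exp = {z:.2f}")
```

For inputs of this size the result is of order one, which illustrates why room-temperature values of the experimental Z-score are small.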

To calculate this Z-score, one needs to relate $g_{\mathcal{J}}(E)$ to experimentally measurable quantities. In the Random Energy Model of a protein, the density of states is assumed to be a Gaussian function (Derrida, 1981; Shakhnovich & Gutin, 1989; Bryngelson et al., 1995). Here, we argue that the same approximation can be made for $g_{\mathcal{J}}(E)$. It is known that the energy distribution of the denatured state, $p(E)$, can be well approximated by a Gaussian function, as is true for almost any thermally equilibrated system (Hill, 1986), so that

$$p(E) \propto \exp\left[-\frac{(E - \langle E \rangle_U)^2}{2\sigma_U^2}\right]. \qquad (13)$$

Because $p(E)$ is proportional to $g_{\mathcal{J}}(E)\,e^{-\beta E}$, it follows that $g_{\mathcal{J}}(E)$ is also a Gaussian function. Thus,

$$g_{\mathcal{J}}(E) \propto \exp\left[-\frac{(E - E_\infty)^2}{2\sigma_\infty^2}\right]. \qquad (14)$$

It should be noted, however, that this Gaussian function only encloses a portion of the total density of states that gives rise to the observed energy distribution $p(E)$. The portion of the density of states that lies in the high energy region, once multiplied by the Boltzmann factor ($e^{-\beta E}$), contributes nothing to $p(E)$. This region of the density of states versus energy curve is not of interest to us because our objective is not to find out what happens to proteins when the temperature approaches infinity. Our objective is to consider that portion of the density of states relevant to the denatured state at room temperature. It should also be noted that the total density of states, which in general is a more complex function than a Gaussian function, should be temperature independent. However, $g_{\mathcal{J}}(E)$ in Equation 14 is temperature dependent because $p(E)$ depends on temperature. This means that the considered conformational space $\mathcal{J}$ implicitly depends on temperature.

By combining Equations 9 and 14, it follows that

$$E_\infty = \langle E \rangle_U + \beta \sigma_U^2, \qquad \sigma_\infty^2 = \sigma_U^2. \qquad (15)$$

The quantities on the right sides of these equations can be determined from the enthalpy and heat capacity measurements (Equations 2, 4, and 6). Thus, $E_\infty$ and $\sigma_\infty^2$ can be obtained, in turn, from experiment so that

$$Z_\infty = Z_{exp} + \frac{\sigma_U}{k_b T} = Z_{exp} + \sqrt{\frac{C_p^U - C_0}{k_b}}. \qquad (16)$$

Experimental Z-scores of proteins

$Z_{exp}$ and $Z_\infty$ were calculated for 16 small proteins using Equations 8 and 16.
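The Gaussian argument can be checked numerically. The sketch below assumes a Gaussian density of states, reweights it with a Boltzmann factor on a grid, and confirms that the thermal mean sits below the unweighted mean by the variance divided by the thermal energy; it then applies the resulting shift to a Z-score. Molar units are assumed (gas constant in place of Boltzmann's constant), and the function names and numerical inputs are hypothetical.

```python
import math

R_KJ = 8.314462618e-3  # molar gas constant, kJ/(mol K)

def z_infinity(z_exp, sigma_u, temperature):
    """Z-score against the T = infinity ensemble: Z_exp plus sigma_U / (k_b T)."""
    return z_exp + sigma_u / (R_KJ * temperature)

def boltzmann_weighted_mean(e_inf, sigma, temperature, n_points=20001, half_width=12.0):
    """Mean energy of a Gaussian density of states (centre e_inf, width sigma)
    after Boltzmann reweighting, computed by direct summation on a grid."""
    beta = 1.0 / (R_KJ * temperature)
    lo = e_inf - half_width * sigma
    step = 2.0 * half_width * sigma / (n_points - 1)
    energies = [lo + i * step for i in range(n_points)]
    log_w = [-(e - e_inf) ** 2 / (2.0 * sigma ** 2) - beta * e for e in energies]
    shift = max(log_w)  # subtract the maximum to keep exp() well scaled
    weights = [math.exp(lw - shift) for lw in log_w]
    return sum(e * w for e, w in zip(energies, weights)) / sum(weights)

# Hypothetical ensemble: centre 0 kJ/mol, width 10 kJ/mol, at 298 K.
numeric_mean = boltzmann_weighted_mean(0.0, 10.0, 298.0)
analytic_mean = 0.0 - 10.0 ** 2 / (R_KJ * 298.0)  # e_inf - beta * sigma^2
print(f"numeric {numeric_mean:.3f} vs analytic {analytic_mean:.3f} kJ/mol")
print(f"Z_inf for Z_exp = 1.0, sigma_U = 74 kJ/mol: {z_infinity(1.0, 74.0, 298.0):.1f}")
```

The second term dominates for realistic fluctuation widths, so the infinite-temperature Z-score is much larger than the experimental one.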
The experimental data presented in Table 2 were obtained from a survey by Makhatadze and Privalov (1995). As shown in Table 3, the values of $Z_{exp}$ are small at room temperature. Some of them are less than 1.0, which means that the energy distributions of the native and denatured state have significant overlap. Does this contradict the two-state model (Privalov, 1979) of protein folding? Below the transition temperature, since the denatured state is much less populated than the native state, the system is effectively in a single state. This does not contradict the overall two-state model of protein folding.

Table 2. Thermodynamic data of proteins(a)

Columns: PDB code; N; Mw; and $\Delta H$, $\Delta G$, $C_p^N$, and $C_p^U$, each at 25, 75, and 125 °C.

PDB code: 1shg 5pti 2ci2 1acb 1pgx 1hoe 1ubq 8rnt 5cyt 1rnb 7rsa 1lzl 1ilb 1mbo 3lzm 5cha
N: 57 58 65 75 76 104 104 110 124 129 163 237
Mw: 6,668 6,565 9,250 8,072 7,738 7,944 8,433 11,071 12,830 12,365 13,600 14,300 17,381 17,800 18,619 28,820

[Remaining values as extracted; row and column alignment not recoverable]
52 132 135 115 67 27 28 1 89 307 294 242 151 6 240 268 222 259 292 283 227 262 273 528 42 1 590 512 562 501 555 67 1 862 296 323 362 358 320 35 1 393 672 593 690 664 753 736 920 856 1,1 19
15.4 16.4 16.6 10.6 12.8 15.0 14.5 12.5 1 13.7 9.5 11.4 13.4 44.2 16.9 18.0 18.0 13.3 15.6 17.8 27.7 16.8 17.9 18.1 12.9 15.4 17.9 36.9 12.5 1 13.9 8.9 10.9 12.9 23.8 15.7 16.8 17.1 12.1 13.9 15.7 37.5 18.3 19.8 20.2 12.6 16.0 19.4 40.1 21.4 24.0 25.1 16.1 19.7 24.3 37.5 24.3 26.3 26.8 17.5 21.9 26.2 37.1 25.4 27.4 28.6 18.5 2 28.5 48.9 26.0 28.5 29.5 20.8 24.2 27.5 27.0 29.1 31.8 32.5 20.0 24.4 28.9 57.8 35.6 38.4 39.2 28.1 32.3 36.4 31.5 36.7 40.0 40.3 24.2 30.3 36.5 40.6 39.1 42.2 43.0 28.1 35.5 42.8 68.4 51.8 55.1 55.8 37.7 46.3 54.8
45.7 -5.2 18.9 -3.8 9.5 3.2 14.6 20.0 -23.2 -1.1 -19.7 -35.8 0.0 26.6 -0.1 0.1 -4 -44.7 -20.9 -52.3 -35.6 -37.0 -29.1 -32.1 -114.4 -76.0 -117.5 -127.6 -97.9 -121.7 -110.0 -114.7 -199.2

(a) T is the temperature, N is the number of residues of a protein, $\Delta H$ is the enthalpic change upon unfolding, $\Delta G$ is the free energy change upon unfolding, and $C_p^N$ and $C_p^U$ are the constant pressure heat capacities of a protein in the native and denatured state, respectively. Proteins are denoted by their code names in the Protein Data Bank. $\Delta H$ and $\Delta G$ are in kJ/mol; $C_p^N$ and $C_p^U$ are in kJ/mol/K.

Table 3. Z-scores of proteins extracted from thermodynamic data

PDB code: 1shg 5pti 2ci2 1acb 1pgx 1hoe 1ubq 8rnt 5cyt 1rnb 7rsa 1lzl 1ilb 1mbo 3lzm 5cha
N(a): 57 58 65 75 76 104 104 110 124 129 163 237
T_m(b): 65.6 101.0.2 87.4 80.1 94.5 98.0 58.7 74.0 62.6 49.4 75.0 88.8 74.9 75.1 55.1

[Z-score columns as extracted; row and column alignment not recoverable]
2.3 3.8 3.3 3.7 3.4 4.9 4.0 4.3 4.2 4.7 4.2 4.1 4.8 5.3
0.7 2.2 2.0 1.6 1.3 1.0 0.3 3.6 1.1 3.4 3.4 2.5 1.4 0.1 2.1
2.3 2.4 3.1 3.1 2.8 5.1 4.1 5.3 4.7 4.7 3.8 4.1 4.8 6.1
2.9 3.8 4.0 4.4 3.6 5.6 5.0 5.4 5.2 5.5 4.9 6.0 5.4 7.0

(a) N is the number of residues of a protein.
(b) T_m is the interpolated transition midpoint temperature.

As the temperature increases, $Z_{exp}$ increases slightly. At the transition temperature $T_m$, the native state and the denatured state are equally populated. (Values of $T_m$ were estimated from a parabolic interpolation of the free energy changes at $\Delta G = 0$. The heat capacity and enthalpic values were then obtained from a parabolic interpolation at $T_m$.) Assuming that the energy distributions of the native state and the denatured state are two Gaussian packets, $P_N(E)$ and $P_U(E)$, we can obtain the overall energy distribution:

$$P(E) = \frac{1}{2}\left[P_N(E) + P_U(E)\right] = \frac{1}{2}\left[\frac{1}{\sqrt{2\pi}\,\sigma_N} e^{-(E - \langle E \rangle_N)^2 / 2\sigma_N^2} + \frac{1}{\sqrt{2\pi}\,\sigma_U} e^{-(E - \langle E \rangle_U)^2 / 2\sigma_U^2}\right], \qquad (17)$$

where $\sigma_U$ and $\sigma_N$ are the standard deviations of the energy in the denatured and the native state, respectively. $\sigma_U$ and $\sigma_N$ can be determined from the heat capacities (Equations 5 and 6), so that $P(E)$ can be calculated. Figure 1 shows the $P(E)$ computed from the enthalpy changes upon unfolding and the heat capacities at the transition temperature for the listed 16 proteins. The energy distributions of most of the proteins exhibit two well-separated peaks, which is consistent with the two-state model of protein folding.

Figure 2 shows the values of $Z_\infty$ from experimental data at 25, 75, and 125 °C, and at the transition midpoint temperature. $Z_\infty$ correlates well with the square root of the protein size and remains almost the same at different temperatures.
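The two-Gaussian picture of the energy distribution at the transition temperature can be sketched as follows. The code assumes an equal-weight mixture of two normal distributions, which is one reading of "equally populated", and counts the local maxima on a grid to tell a two-peak distribution from an overlapping one. All numerical inputs and function names are hypothetical.

```python
import math

def gaussian(e, mean, sigma):
    """Normal probability density evaluated at energy e."""
    return math.exp(-((e - mean) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def p_total(e, mean_n, sigma_n, mean_u, sigma_u):
    """Equal-weight mixture of native and denatured energy distributions."""
    return 0.5 * (gaussian(e, mean_n, sigma_n) + gaussian(e, mean_u, sigma_u))

def count_peaks(values):
    """Number of strict interior local maxima in a sampled curve."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i - 1] < values[i] > values[i + 1])

energies = [float(e) for e in range(-500, 901)]  # kJ/mol grid, 1 kJ/mol spacing

# Hypothetical case 1: separation (400) much larger than the widths (70, 90).
separated = [p_total(e, 0.0, 70.0, 400.0, 90.0) for e in energies]
# Hypothetical case 2: separation (60) smaller than either width.
overlapping = [p_total(e, 0.0, 70.0, 60.0, 90.0) for e in energies]

print(count_peaks(separated), count_peaks(overlapping))
```

The separation between the two means measured in units of the widths is what the experimental Z-score quantifies, so a small value corresponds to the merged, single-peak case.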
From linear fitting of the data, we have Equation 19, a linear relation between $Z_\infty$ and $\sqrt{N_{res}}$, where $N_{res}$ is the number of residues of a protein. The correlation coefficient is 0.91. $Z_\infty$ varies as $\sqrt{N_{res}}$ because, according to Equation 16, $Z_\infty$ varies as the square root of the heat capacity, which is proportional to the protein size.

Z-scores of a lattice model protein

From the above analysis, the energies of proteins in the denatured state ($\langle E \rangle_U$) and in the extrapolated denatured state at infinite temperature ($E_\infty$) can be obtained, but the energy of the manifold of misfolded conformations used to assess $Z_{misfolds}$ is still unknown. Because the misfolds cannot be observed experimentally, perhaps computer simulations of a simplified model of proteins can provide a qualitative insight into the relationship between $Z_{misfolds}$ and $Z_\infty$. We have analyzed the energies of a set of knowledge-based potentials developed for a lattice protein model. This model was previously studied by Entropy Sampling Monte Carlo (ESMC) simulations and was shown to have a cooperative, two-state folding transition (Kolinski et al., 1996). ESMC is a recently developed technique (Hao & Scheraga, 1994) from which one can determine all the thermodynamic variables of a protein at equilibrium. Thus, the Z-score can be calculated according to Equations 8 and 16. Such Z-scores provide a direct comparison between "real" potentials and knowledge-based potentials of proteins.

The thermodynamic data of a lattice model protein were extracted from previously published work (Kolinski et al., 1996). The sequence of the model was manually designed to form a six-member β barrel consisting of 45 residues. We employed the model III force field, which was shown to produce two-state thermodynamic behavior. By using ESMC simulations, Kolinski et al. obtained the entropy as a function of energy, $S(E)$. Then, the free energy profile was obtained as $F(E,T) = E - T\,S(E)$. Figure 3 shows the energy distribution function,

$$P(E) = \exp\left[-\frac{F(E,T)}{k_b T}\right]. \qquad (18)$$
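The step from an entropy curve to an energy distribution can be sketched as below. This is a toy illustration in reduced units with Boltzmann's constant set to 1; the parabolic entropy used here is an assumption chosen because its answer is known in closed form, and it is not the ESMC result of Kolinski et al. The function name `energy_distribution` is hypothetical.

```python
import math

def energy_distribution(energies, entropies, temperature):
    """Normalised P(E) on a uniform energy grid from a tabulated S(E),
    using F(E, T) = E - T * S(E) and P(E) proportional to exp(-F / T)
    in reduced units (k_b = 1)."""
    log_p = [-(e - temperature * s) / temperature for e, s in zip(energies, entropies)]
    shift = max(log_p)  # subtract the maximum before exponentiating
    weights = [math.exp(lp - shift) for lp in log_p]
    step = energies[1] - energies[0]
    norm = sum(weights) * step
    return [w / norm for w in weights]

# Toy entropy: a parabola centred at E0 with width WIDTH (hypothetical values).
E0, WIDTH, T = 0.0, 2.0, 2.0
energies = [-12.0 + 0.01 * i for i in range(2001)]
entropies = [-((e - E0) ** 2) / (2.0 * WIDTH ** 2) for e in energies]

p = energy_distribution(energies, entropies, T)
peak_energy = energies[p.index(max(p))]
print(f"peak at E = {peak_energy:.2f}, expected {E0 - WIDTH ** 2 / T:.2f}")
```

For a real two-state system the entropy curve has a convex region, and the same function would return a distribution with two peaks at the transition temperature.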

Fig. 1. Energy distribution of proteins $P(E)$ at transition temperature. Energies are in kJ/mol. The distribution curves are shifted horizontally. The flat portions of the curves indicate the origins.

Fig. 2. Z-scores of proteins extracted from calorimetric data of proteins ($Z_\infty$). The triangles were obtained from data at 25 °C, the crosses at 75 °C, the pluses at 125 °C, and the circles at the transition temperature.

Fig. 3. Energy distribution of the lattice model protein at transition temperature.

The enthalpy change upon unfolding was calculated as the separation between the peaks. The heat capacities were calculated from the standard deviations of the two peaks. Consequently, the Z-scores were obtained according to Equations 8 and 16. The results are shown in Table 4. At the transition temperature, $Z_{exp}$ is 8 for this lattice model protein. However, from experimental values shown in Table 3, we can see that for a protein of similar size, the Z-score should be around 3. The exaggerated Z-score of the lattice model may be caused by the sequence design. For a different amino acid sequence, the results may change drastically. Lack of consistency may be the reason that such a force field can successfully fold some proteins to their native structures, but fail for other proteins (L. Zhang & J. Skolnick, unpubl. obs.).

Table 4. Thermodynamic data and Z-scores computed from ESMC simulations of a lattice model protein(a)

ΔG    ΔH     C_p^N    C_p^U    Z_exp    Z_∞
0     415    11.4     20       5.0      8

(a) $\Delta G$ and $\Delta H$ are in kJ/mol; $C_p^N$ and $C_p^U$ are in kJ/mol/K; the data were computed at the transition temperature (T = 2.03 in lattice model units, assumed to be equivalent to 2.03 × 298 = 605 K). $Z_{exp}$ and $Z_\infty$ were calculated using Equations 8 and 16, respectively.

To calculate the energy of the misfolded structures of the lattice model protein, its sequence was threaded through a library of native protein structures obtained from the Protein Data Bank. The averages and standard deviations of the energies, as well as the components of the energies, are shown in Table 5. For comparison, also shown in Table 5 are the energies of the protein in the native and the denatured state calculated from Monte Carlo simulations at the transition temperature. From Table 5, one can see that the average energy of the misfolded structures is close to that of the denatured proteins. By analyzing the components of the energies, it seems that the burial and tertiary contact terms are small. The local interactions have a higher energy in the misfolds because of incorrect secondary structures, but this is largely compensated for by the number of hydrogen bonds in these structures. It is suspected that the strength of the hydrogen bonds (2.5 $k_b T$ per hydrogen bond) may have been overestimated in this lattice model potential. Because there is no explicit solvent in the model, when a protein unfolds, the loss of hydrogen bonding energy is enormous. While this may help folding in computer simulations, it is not realistic for real proteins. If the contribution from hydrogen bonding is ignored, the misfolds would have a much higher energy than the denatured proteins (by about 100 kJ/mol, second column in Table 5), approaching that of the denatured state at infinite temperature.

If the data in the first column in Table 5 are used, from Equations 1, 12, and 15, one would find that $Z_{misfolds} = 7.6$ and $Z_\infty = 19.7$. With the data in the second column (i.e., excluding the contribution of hydrogen bonds), $Z_{misfolds} = 14.3$ and $Z_\infty = 17.6$. Note that the values of $Z_\infty$ are about the same as that in Table 4. It also follows that, depending on the parameterization of the potentials, the ratio $Z_{misfolds}/Z_\infty$ is about 39 or 81%. Because most of the knowledge-based potentials developed by various groups do not include a hydrogen bonding term, $Z_{misfolds}$ should be closer to 81% of $Z_\infty$ if we assume that this ratio holds for real protein structures.
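The threading-based Z-score and the two ratios quoted above can be reproduced with a few lines. The sketch assumes the sign convention in which a native energy below the decoy average gives a positive Z-score, and it uses the population standard deviation; both are assumptions, as is the small decoy list, which is made up for illustration. Only the four Z-score values in the ratio calculation come from the text.

```python
import statistics

def z_misfolds(native_energy, decoy_energies):
    """Z-score of the native energy against an unweighted ensemble of
    misfold (decoy) energies: (mean - native) / standard deviation."""
    mean = statistics.fmean(decoy_energies)
    sigma = statistics.pstdev(decoy_energies)  # population standard deviation
    if sigma == 0:
        raise ValueError("decoy energies must not all be identical")
    return (mean - native_energy) / sigma

# Hypothetical threading energies in kJ/mol (illustrative only).
decoys = [-10.0, -20.0, -30.0]
print(f"Z_misfolds = {z_misfolds(-60.0, decoys):.3f}")

# Ratios of Z_misfolds to Z_inf, using the values quoted in the text.
ratio_with_hbond = 7.6 / 19.7      # hydrogen bonding included
ratio_without_hbond = 14.3 / 17.6  # hydrogen bonding excluded
print(f"{100 * ratio_with_hbond:.0f}% and {100 * ratio_without_hbond:.0f}%")
```

In a real application the decoy list would hold one energy per threaded template, and the choice between population and sample standard deviation matters little once the library contains hundreds of structures.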
For a 100-residue protein, using Equation 19, it follows that $Z_{misfolds}$ should be around 38.3 × 81% ≈ 31, which is much higher than that obtained from the knowledge-based potentials developed so far (Table 1). Even if $Z_{misfolds}$ is taken to be 39% of $Z_\infty$, then $Z_{misfolds} = 15$ for a 100-residue protein, which is still higher than the Z-scores obtained from those knowledge-based potentials.

Table 5. Potential energies of the lattice model in various states

Columns: State; Energy; Energy(a); $E_{hbond}$; $E_{contact}$; $E_{burial}$; $E_{local}$. Each state has a row of averages and a row of standard deviations ($\sigma$).

[Values as extracted; column alignment not recoverable]
Native      -310 -16 -0 -263 -112 -589
  σ         38 35 8 28 2 15
Denatured   -34 -211 -12 -198 20 -185
  σ         45 48 10 22 10 30
Misfolded   -226 -90 -135 -33 22 -80
  σ         62 35 45 15 12 30

(a) The total energy excluding the hydrogen bonding contribution. $\sigma$ is the standard deviation of the energies. All terms are in kJ/mol.

Discussion

The Z-scores of protein structures are widely used because they characterize the conformational energy landscape of proteins. We have demonstrated that a meaningful Z-score can be extracted from the thermodynamic data of proteins. Because this method relies on experimental measurements and involves no particular formulation of the potential energies, the results shown in this study may serve as a useful reference for the development of knowledge-based potentials.

Our proposed Z-scores are defined on the basis of the energies of the native state, the denatured state, and a specially defined ($T = \infty$) denatured state. Figure 4 schematically shows the ranges of energies of interest. The energy distributions of the native state, the denatured state, the misfold ensemble, and the $T = \infty$ random coils are drawn as bell-shaped packets. The order of the packets reflects their relative energy. In practice, there is substantial uncertainty about the energy of the misfolded structures. For knowledge-based potentials, as seen in our study of a lattice model, the misfolds may be either close to denatured proteins or much higher in energy. Our results indicate that the expected range of $Z_{misfolds}$ is around 15 to 31 for a small protein of 100 residues. Although there may be some errors in our $Z_\infty$ estimation (due to the assumption of a Gaussian distribution of the density of states), the qualitative picture is unlikely to change. These results suggest that the existing knowledge-based potentials need to be improved to have more specificity for the native conformation.

Fig. 4. Schematic description of the energy distributions of a protein in various states of interest (native, denatured, misfolded, and denatured at $T = \infty$, with mean energies $\langle E \rangle_N$, $\langle E \rangle_U$, $\langle E \rangle_{misfolds}$, and $E_\infty$).

Materials and methods

Force field of the lattice model

For a detailed description of the force field, see Kolinski et al. (1996). Only a brief description is provided here. The parameters of the force field are available from anonymous ftp at scripps.edu (pub/mcsp). The force field was built for a high coordination lattice. The backbone of a polypeptide chain is represented by $C_\alpha$ carbon, carbonyl oxygen, and amide hydrogen atoms explicitly. Each side-chain rotamer is represented by a sphere located at the center of mass. The conformational energy of a protein is

$$E = E_{local} + E_{hbond} + E_{burial} + E_{contact} + E_{rot},$$

where $E_{local}$ represents the short-range interactions reflecting intrinsic secondary structure propensities of the amino acid sequences. $E_{hbond}$ represents the hydrogen bonding interaction energy. Each hydrogen bond is calculated via the positions of the oxygens and the hydrogens (Equation 22), where $r_{oi,hj}$ is the vector between peptide plane $j$ and peptide plane $i$. $\epsilon_h$ was chosen to be 2.5 $k_b T$ in the calculation done in this study. Additional cooperative terms were used to promote a protein-like hydrogen bond network. $E_{burial}$ reflects hydrophobic interactions in proteins; its definition involves $S_0$, the expected radius of gyration of a single domain protein in its native form. $E_{contact}$ is the pairwise contact energy. Also included in $E_{contact}$ is the cooperativity term that promotes native protein-like contact patterns. $E_{rot}$ is the rotamer energy of side-chain groups. This term is small (less than 5% of the total energy) and was not used in our calculation of the misfolded structures.

Generating misfolded structures through threading

A library of native protein structures was collected from the Protein Data Bank. This set of proteins is the same as that in Zhang and Skolnick (1998). The backbone amide hydrogen positions were inferred from the $C_\alpha$ carbon and oxygen positions. The side-chain groups of the threaded structures were assumed to be the same as in the original structure. However, the contact relationship was not calculated according to the positions of the side-chain groups, but from the original protein structure, based on the distances between the heavy atoms of the side-chain groups (the contact distance criterion is 4.5 Å).

Acknowledgments

This research was supported in part by NIH grant GM48835. L.Z. is an NIH Postdoctoral Fellow. The assistance of Dr. A.
Kolinski in the preparation of the ensemble of lattice protein states is greatly appreciated.

References

Bowie JU, Luthy R, Eisenberg D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253:164-170.
Bryngelson J, Onuchic J, Socci N, Wolynes P. 1995. Funnels, pathways, and the energy landscape of protein folding: A synthesis. Proteins 21:167-195.
Delarue M, Koehl P. 1995. Atomic environment energies in proteins defined from statistics of accessible and contact surface areas. J Mol Biol 249:675-690.
Derrida B. 1981. Random-energy model: An exactly solvable model of disordered systems. Phys Rev B 24:2613-2626.
Dyson JH, Wright PE. 1993. Peptide conformation and protein folding. Curr Biol 3:60-65.
Gomez J, Hilser V, Xie D, Freire E. 1995. The heat capacity of proteins. Proteins 22:404-412.
Hao M-H, Scheraga HA. 1994. Monte Carlo simulation of a first-order transition for protein folding. J Phys Chem 98:4940-4948.
Hill TL. 1986. An introduction to statistical thermodynamics. New York: Dover Publications, Inc.
Huang ES, Subbiah S, Levitt M. 1996. Recognizing native folds by the arrangement of hydrophobic and polar residues. J Mol Biol 252:709-720.
Jones DT, Thornton JM. 1996. Potential energy functions for threading. Current Opinion in Structural Biology 6:210-216.
Kocher JA, Rooman MJ, Wodak SJ. 1994. Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J Mol Biol 235:1598-1613.
Kolinski A, Skolnick J. 1994. Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins 18:338-352.
Kolinski A, Galazka W, Skolnick J. 1996. On the origin of the cooperativity of protein folding: Implications from model simulations. Proteins 26:271-287.
Liwo A, Pincus MR, Wawak RJ, Rackovsky S, Oldziej S, Scheraga HA. 1997. A united-residue force field for off-lattice protein-structure simulations. II. Parameterization of short-range interactions and determination of weights of energy terms by Z-score optimization. J Comput Chem 18:874-887.
Maiorov VN, Crippen GD. 1992. Contact potential that recognizes correct folding of globular proteins. J Mol Biol 227:876-888.
Makhatadze GI, Privalov PL. 1995. Energetics of protein structure. Adv Protein Chem 47:307-425.
Melo F, Feytmans E. 1997. Novel knowledge-based mean force potential at atomic level. J Mol Biol 267:207-222.
Mirny L, Domany E. 1996. Protein fold recognition and dynamics in the space of contact maps. Proteins 26:391-410.
Miyazawa S, Jernigan RL. 1996. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 256:623-644.
Park BH, Huang ES, Levitt M. 1997. Factors affecting the ability of energy functions to discriminate correct from incorrect folds. J Mol Biol 266:831-846.
Pereira de Araujo AF, Pochapsky TC. 1996. Monte Carlo simulations of protein folding using inexact potentials: How accurate must potentials be in order to preserve the essential features of the energy landscape? Folding & Design 1:299-314.
Privalov PL. 1979. Stability of proteins. Small globular proteins. Adv Protein Chem 33:167-241.
Shakhnovich E, Gutin A. 1989. Formation of unique structure in polypeptide chains. Theoretical investigation with the aid of a replica approach. Biophys Chem 34:187-199.
Sippl MJ. 1993. Boltzmann's principle, knowledge-based mean force fields and protein folding. An approach to the computational determination of protein structures. J Comput Aided Mol Des 7:473-501.
Sippl MJ. 1995. Knowledge-based potentials for proteins. Current Opinion in Structural Biology 5:229-235.
Skolnick J, Jaroszewski L, Kolinski A, Godzik A. 1997. Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Protein Sci 6:676-688.
Smith LJ, Fiebig KM, Schwalbe H, Dobson CM. 1996. The concept of random coil. Residual structure in peptides and denatured proteins. Folding & Design 1(5):R95-106.
Thomas PD, Dill KA. 1996. Statistical potentials extracted from protein structures: How accurate are they? J Mol Biol 257:457-469.
Ueda Y, Taketomi H, Go N. 1978. Studies of protein folding, unfolding and fluctuations by computer simulations. II. A three-dimensional lattice model of lysozyme. Biopolymers 17:1531-1548.
Ulrich P, Scott W, van Gunsteren WF, Torda AE. 1997. Protein structure prediction force fields: Parameterization with quasi-Newtonian dynamics. Proteins 27:367-384.
Wang H, Ben-Naim A. 1997. Solvation and solubility of globular proteins. J Phys Chem B 101:1077-1086.
Zhang L, Skolnick J. 1998. How do potentials derived from structural databases relate to true potentials? Protein Sci 7:112-122.
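
Supplementary sketch. The heavy-atom contact criterion described under "Generating misfolded structures through threading" can be illustrated with a short Python example. This is a minimal sketch under stated assumptions, not the code used in the study: the function names, the input layout (one list of heavy-atom coordinates per residue, in Å), and the choice to count a pair as in contact when any heavy-atom distance is at or below the cutoff are all illustrative assumptions. Only the 4.5 Å cutoff comes from the text.

```python
import math

# Contact distance criterion quoted in the Methods text (in angstroms).
CONTACT_CUTOFF = 4.5


def in_contact(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """Return True if any heavy-atom pair from the two side chains is within cutoff.

    atoms_a, atoms_b: sequences of (x, y, z) coordinates in angstroms.
    Treating "within" as <= cutoff is an assumption of this sketch.
    """
    return any(math.dist(a, b) <= cutoff for a in atoms_a for b in atoms_b)


def contact_pairs(side_chains, cutoff=CONTACT_CUTOFF):
    """List residue index pairs (i, j) with i < j whose side chains are in contact.

    side_chains: hypothetical input layout, one list of heavy-atom
    coordinates per residue, taken from the original (native) structure.
    """
    pairs = []
    for i in range(len(side_chains)):
        for j in range(i + 1, len(side_chains)):
            if in_contact(side_chains[i], side_chains[j], cutoff):
                pairs.append((i, j))
    return pairs


# Toy example with made-up coordinates (not data from the paper).
toy = [
    [(0.0, 0.0, 0.0)],
    [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(20.0, 0.0, 0.0)],
]
print(contact_pairs(toy))
```

In the toy input, only residues 0 and 1 have a heavy-atom pair closer than 4.5 Å, so they form the single reported contact. A real application would read the side-chain heavy atoms from the template structure before threading a new sequence onto it.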