Calculation of protein backbone geometry from α-carbon coordinates based on peptide-group dipole alignment


Protein science (1993), 2, 1697-1714. Cambridge University Press. Printed in the USA.

Copyright The Protein Society

A. LIWO, M.R. PINCUS, R.J. WAWAK, S. RACKOVSKY, AND H.A. SCHERAGA

Baker Laboratory of Chemistry, Cornell University, Ithaca, New York
Department of Pathology, Division of Clinical Pathology, State University of New York, Health Science Center, Syracuse, New York
Department of Biophysics, School of Medicine and Dentistry, University of Rochester, Rochester, New York

(RECEIVED March 2, 1993; ACCEPTED June 24, 1993)

Abstract

An algorithm is proposed for the conversion of a virtual-bond polypeptide chain (connected Cα atoms) to an all-atom backbone, based on determining the most extensive hydrogen-bond network between the peptide groups of the backbone, while maintaining all of the backbone atoms in energetically feasible conformations. Hydrogen bonding is represented by aligning the peptide-group dipoles. These peptide groups are not contiguous in the amino acid sequence. The first dipoles to be aligned are those that are sufficiently close in space to be arranged in approximately linear arrays termed dipole paths. The criteria used in the construction of dipole paths are: to assure good alignment of the greatest possible number of dipoles that are close in space; to optimize the electrostatic interactions between the dipoles that belong to different paths close in space; and to avoid locally unfavorable amino acid residue conformations. The equations for dipole alignment are solved separately for each path, and then the remaining single dipoles are aligned optimally with the electrostatic field from the dipoles that belong to the dipole-path network. A least-squares minimizer is used to keep the geometry of the α-carbon trace of the resulting backbone close to that of the input virtual-bond chain.
This procedure is sufficient to convert the virtual-bond chain to a real chain; in applications to real systems, however, the final structure is obtained by minimizing the total ECEPP/2 (empirical conformational energy program for peptides) energy of the system, starting from the geometry resulting from the solution of the alignment equations. When applied to model α-helical and β-sheet structures, the algorithm, followed by the ECEPP/2 energy minimization, resulted in an energy and backbone geometry characteristic of these α-helical and β-sheet structures. Application to the α-carbon trace of the backbone of the crystallographic 5PTI structure of bovine pancreatic trypsin inhibitor, followed by ECEPP/2 energy minimization with Cα-distance constraints, led to a structure with almost as low energy and root mean square deviation as the ECEPP/2 geometry analog of 5PTI, the best agreement between the crystal and reconstructed backbone being observed for the residues involved in the dipole-path network.

Keywords: conversion to all-atom backbone; dipole alignment; ECEPP/2 force field; hydrogen-bond network; protein virtual-bond chain

Reprint requests to: H.A. Scheraga, Baker Laboratory of Chemistry, Cornell University, Ithaca, New York. On leave from the Department of Chemistry, University of Gdansk, ul. Sobieskiego 18, Gdansk, Poland. Contact A. Liwo for computer programs.

Much attention has been devoted in recent years to the construction of an all-atom polypeptide chain, given its α-carbon coordinates (Purisima & Scheraga, 1984; Jones & Thirup, 1986; Reid & Thornton, 1989; Holm & Sander, 1991; Bassolino-Klimas & Bruccoleri, 1992; Rey & Skolnick, 1992), because, in many cases, the geometry of the α-carbon trace is the only available information about the conformation of a protein.
This situation arises in low-resolution X-ray studies, in homology-modeling studies when sequence alignment is poor (Chothia & Lesk, 1986; Chothia et al., 1986; Hubbard & Blundell, 1987), and in those methods for the prediction of protein conformation that involve the reduction of the polypeptide chain to a united-residue representation (Pincus & Scheraga, 1977; Crippen & Snow, 1990; Skolnick & Kolinski, 1990; Seetharamulu & Crippen, 1991).

There are two groups of algorithms for backbone reconstruction. One is based on fitting the fragments of the backbone of polypeptide structures with those from a library of fragments derived from known X-ray structures with a given Cα geometry (Jones & Thirup, 1986; Reid & Thornton, 1989; Holm & Sander, 1991). This approach is used mainly in model building. Because it is not always possible to find a fragment that fits uniquely to a given α-carbon pattern, at the present stage of development this method still requires subjective user interaction.

Algorithms of the second group are based on geometric considerations. The Purisima-Scheraga algorithm (Purisima & Scheraga, 1984) makes use of the constraints imposed by the rigid valence geometry of the peptide groups. These constraints, together with the geometry of the given virtual-bond chain and additional information supplied by an analysis of the relationship between the geometry of the local virtual-bond chain and the conformational states of amino acid residues (Rackovsky & Scheraga, 1978, 1980), result in a finite number of solutions (i.e., sets of dihedral angles φ and ψ). The solution that gives the least number of steric overlaps is chosen. This algorithm was applied successfully both to ideal virtual-bond chains (i.e., those derived from ECEPP/2 [empirical conformational energy program for peptides] structures [Momany et al., 1975; Némethy et al., 1983; Sippl et al., 1984]) and to the virtual-bond chain of bovine pancreatic trypsin inhibitor (BPTI) obtained from a distance-constraint-based prediction (Wako & Scheraga, 1982a,b). In the application to BPTI, however, the method was coupled with a least-squares algorithm, minimizing the difference between the Cα-Cα distances of the input virtual-bond chain and the resulting all-atom backbone. This was necessary because no exact solution (i.e., one that was compatible with the ECEPP/2 valence geometry) was found for the virtual-bond chain studied (Purisima & Scheraga, 1984).
The root mean square (rms) deviation of the α-carbon trace of the reconstructed backbone of BPTI from the input virtual-bond chain was low (0.26 Å).

Recently, two other algorithms were published by Rey and Skolnick (1992) and by Bassolino-Klimas and Bruccoleri (1992). In the Rey-Skolnick algorithm, given the geometry of a virtual-bond chain, the authors first establish the positions of the Cβ atoms based on a correlation between the bond angles (Cα-Cα-Cα) of the virtual-bond chain and the locations of the Cβ atoms in the coordinate system associated with three consecutive α-carbons, as derived from the Protein Data Bank. Then, the positions of the backbone atoms (N, C', O) are calculated based on valence-geometry constraints. As in the Purisima-Scheraga algorithm (1984), a least-squares minimizer must be used to overcome inconsistencies arising when real structures are fitted to rigid valence geometry. When this algorithm was applied to the virtual-bond chains derived from known protein crystal structures, the average rms deviation between the backbone atoms of the crystallographic and reconstructed chains was below 0.5 Å. The Bassolino-Klimas-Bruccoleri algorithm samples the possible conformational states of individual amino acid residues from the (φ, ψ) energy maps and tries to fit the consecutive residues to a given α-carbon trace. The conformational search is accomplished using the CONGEN algorithm of Bruccoleri and Karplus (1987). The building of a complete backbone is carried out so as to lead to a minimum rms deviation between the input virtual-bond chain and the Cα trace of the reconstructed backbone. When tested on model proteins, the algorithm led to rms deviations between the X-ray and reconstructed backbones below 1 Å.
A common feature of the algorithms outlined above is that they make use of the complete geometry of the virtual-bond chain; i.e., the values of both the virtual-bond and virtual-torsional angles must be known with some accuracy for the conversion procedure to work. On the other hand, we have recently developed a united-residue force field (Liwo et al., 1993) in which all the virtual-bond valence angles are set at the "most likely" value (90°) determined from data in the Protein Data Bank (Rackovsky, 1990). Therefore, the united-residue structures obtained do not contain any information about the actual virtual-bond valence angles, and hence the Purisima-Scheraga, the Rey-Skolnick, and the Bassolino-Klimas-Bruccoleri methods cannot be applied to obtain the all-atom backbone. For this purpose, we have developed an algorithm based on a completely different principle: optimization of the hydrogen-bond network subject to the constraints given by the geometry of the virtual-bond chain. This algorithm uses an energetic rather than a geometric criterion. The approach assumes that, while a crude protein structure might be determined by hydrophobic-packing interactions, hydrogen-bonding interactions govern the arrangement of backbone atoms within a given virtual-bond chain conformation (Dill, 1990). In our approach, hydrogen bonding is represented as electrostatic interactions between peptide-group dipoles, which reduces the problem to finding the best dipole alignment. These peptide groups are not contiguous in the amino acid sequence. The dipoles to be aligned first are grouped in ordered (approximately linear) arrays called dipole paths; thus, our algorithm is called the dipole-path method. The optimization of dipole-alignment energy is followed by minimization of the total ECEPP/2 energy (Momany et al., 1975; Némethy et al., 1983; Sippl et al., 1984). This approach is reminiscent of the electrostatic-optimization approximation of Piela and Scheraga (1987).
In the following sections, we first formulate the dipole-path algorithm. We then illustrate its performance when applied to ideal regular structures: α-helices and β-sheets. Finally, we show the results for the reconstruction of the backbone of the 5PTI crystal structure of BPTI.

Theory

The angular dependence of the hydrogen-bond energy in a protein backbone is determined mainly by the electrostatic component (Schuster, 1976; Mitchell & Price, 1989).


Furthermore, Piela and Scheraga (1987) showed that the electrostatic interactions between the peptide groups of the backbone can be well represented by the interactions between peptide-group dipoles. In this paper, we explore this idea further. We formulate the problem of finding the orientation of peptide groups corresponding to the best hydrogen-bond network as that of finding the orientation of peptide-group dipoles that corresponds to the optimum dipole-dipole interaction energy, subject to a given virtual-bond-chain geometry.

Description of the virtual-bond chain

We place the peptide-group dipoles at the centers of the virtual Cα-Cα bonds (Fig. 1). The $i$th dipole is placed between $C^\alpha_i$ and $C^\alpha_{i+1}$ (Liwo et al., 1993). We assume that all the peptide groups are in the planar trans conformation, which means that the virtual-bond lengths are fixed at 3.8 Å (Nishikawa et al., 1974). Therefore, the geometry of the $n$-carbon virtual-bond chain is described in terms of virtual-bond valence angles $\theta_1, \theta_2, \ldots, \theta_{n-2}$ ($\theta_i$ being the planar angle in the $C^\alpha_i$-$C^\alpha_{i+1}$-$C^\alpha_{i+2}$ group) and virtual-bond torsional angles $\gamma_1, \gamma_2, \ldots, \gamma_{n-3}$ ($\gamma_i$ being the torsional angle about the $C^\alpha_{i+1}$-$C^\alpha_{i+2}$ bond in the $C^\alpha_i$-$C^\alpha_{i+1}$-$C^\alpha_{i+2}$-$C^\alpha_{i+3}$ group) (Nishikawa et al., 1974); $\gamma_i$ has no relation to $\gamma_{ij}$ of Figure 1. The orientation of the peptide groups, and thus the orientation of the peptide-group dipoles, with respect to the Cα frame is described by the angles of rotation of the peptide groups about the virtual-bond axes, $\lambda_1, \lambda_2, \ldots, \lambda_{n-1}$ (Fig. 1). These angles are defined according to Nishikawa et al. (1974) as follows: Consider three consecutive α-carbons, $C^\alpha_{i-1}$, $C^\alpha_i$, $C^\alpha_{i+1}$.
Let $(\lambda_1)_i$ denote the angle of clockwise rotation (when looking from $C^\alpha_i$ toward $C^\alpha_{i-1}$) of the peptide group $i-1$ (lying between $C^\alpha_{i-1}$ and $C^\alpha_i$), with $(\lambda_1)_i = 0$ when $C^\alpha_{i+1}$ belongs to the plane of the peptide group $i-1$ and lies on the same side of the $C^\alpha_{i-1}$-$C^\alpha_i$ axis as the oxygen of this peptide group. Similarly, $(\lambda_2)_i$ is the angle of clockwise rotation (when looking from $C^\alpha_i$ toward $C^\alpha_{i+1}$) of the peptide group $i$, with $(\lambda_2)_i = 0$ when $C^\alpha_{i-1}$ lies in the $i$th peptide-group plane, on the opposite side of the $C^\alpha_i$-$C^\alpha_{i+1}$ axis from the oxygen atom of this peptide group. Now, for $i = 1, 2, \ldots, n-2$, we define $\lambda_i = (\lambda_1)_{i+1}$, while $\lambda_{n-1} = (\lambda_2)_{n-1}$, because of the termination of the chain.

Energy of interaction of peptide-group dipoles

The energy of interaction of peptide-group dipoles $i$ and $j$ is given by Equation 1 (Hirschfelder et al., 1954):

$$U_{ij} = \frac{1}{\varepsilon r_{ij}^3}\left[\boldsymbol{\mu}_i\cdot\boldsymbol{\mu}_j - 3(\boldsymbol{\mu}_i\cdot\mathbf{e}_{ij})(\boldsymbol{\mu}_j\cdot\mathbf{e}_{ij})\right], \quad (1)$$

where $\boldsymbol{\mu}_i$ and $\boldsymbol{\mu}_j$ are the dipole-moment vectors of peptide groups $i$ and $j$, $r_{ij}$ is the distance between peptide-group centers, $\mathbf{e}_{ij}$ is the unit vector pointing from the center of peptide group $i$ to the center of peptide group $j$, and $\varepsilon$ is the dielectric constant (see Fig. 1). The energy $U_{ij}$ can also be expressed by the following equation (see Equation A5 in the section on Derivation of the interaction energy of peptide-group dipoles in the Appendix):

$$U_{ij} = \ldots + \frac{\mu_i\mu_j\sin\beta_i\sin\beta_j}{\varepsilon r_{ij}^3}\left[Z_{ij}^{(1)}\cos(\lambda_i+\lambda_j-\Lambda_{ij}^{(1)}) + Z_{ij}^{(2)}\cos(\lambda_i-\lambda_j-\Lambda_{ij}^{(2)})\right]. \quad (2)$$

Fig. 1. The relative orientation of the virtual bonds $C^\alpha_i$-$C^\alpha_{i+1}$ and $C^\alpha_j$-$C^\alpha_{j+1}$ is described by the angles $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$, defined by Equation A6 (it should be noted that $\gamma_{ij}$ has no relation to the virtual-bond torsional angle $\gamma_i$). The angle $\alpha_{ij}$ is not shown here because the two virtual bonds $C^\alpha_i$-$C^\alpha_{i+1}$ and $C^\alpha_j$-$C^\alpha_{j+1}$ are not necessarily coplanar. $\theta$ is the angle between two successive virtual bonds.
The peptide-group dipole moments are represented by arrows (pointing from the carbonyl oxygen to the amide hydrogen of a peptide group), and the angles $\beta_i$ and $\beta_j$ between them and the virtual bonds are also shown, as well as the rotation angles $\lambda_i$ and $\lambda_j$ of the peptide-group dipoles. The two dipoles are separated by a distance $r_{ij}$.

All the mathematical symbols used in Equation 2 are defined in that section. The term involving $\cos\beta_i\cos\beta_j$ in Equation 2 can be identified with the interaction between the components of the dipole moments parallel to the virtual bonds. This part of the energy does not depend on $\lambda_i$ and $\lambda_j$. The term involving $\sin(\beta_i+\beta_j)$ accounts for the interaction between the parallel and perpendicular components of the dipole moments. Finally, the term involving $\sin\beta_i\sin\beta_j$ includes the interaction between the perpendicular (rotating) components of the dipole moments. Based on ECEPP/2 charges, the dipole moment of the peptide group is almost perpendicular to the virtual bond. Hence, the perpendicular component of the dipole moment is the dominant one. Therefore, in our further analysis, we consider only the last term in Equation 2, which greatly simplifies the treatment. Thus, the total electrostatic energy of the backbone is expressed in this approximation as follows:

$$U = \sum_{i=1}^{n-2}\sum_{j=i+1}^{n-1} U_{ij}(\lambda_i,\lambda_j) \approx \sum_{i=1}^{n-2}\sum_{j=i+1}^{n-1} U_{ij}^{approx}(\lambda_i,\lambda_j) = \sum_{i=1}^{n-2}\sum_{j=i+1}^{n-1}\left[\zeta_{ij}^{(1)}\cos(\lambda_i+\lambda_j-\Lambda_{ij}^{(1)}) + \zeta_{ij}^{(2)}\cos(\lambda_i-\lambda_j-\Lambda_{ij}^{(2)})\right], \quad (3)$$

where $\zeta_{ij}^{(k)} = \mu_i\mu_j\sin\beta_i\sin\beta_j Z_{ij}^{(k)}/(\varepsilon r_{ij}^3)$, with $k = 1, 2$. The problem of conversion of the virtual-bond chain to an all-atom backbone can now be identified with minimizing $U$ of Equation 3 as a function of $\lambda_1, \ldots, \lambda_{n-1}$. Obviously, this encounters a multiple-minima problem. Initially, therefore, instead of attempting to find the exact global minimum of the double sum of $U_{ij}^{approx}(\lambda_i,\lambda_j)$, we estimate it approximately by optimizing the sum of the dominant components $U_{ij}^{approx}(\lambda_i,\lambda_j)$ in Equation 3; i.e., in this equation, we consider only those pairs of peptide groups that make significant contributions to the total dipole-dipole interaction energy.

To determine the most important terms in Equation 3, i.e., the pairs of dipoles $(i, j)$ that make a significant contribution to the electrostatic energy, we use an expression, derived in the section on Boltzmann-averaged energy $\bar{U}_{ij}$ in the Appendix and based on Equation 2, for the dipole-dipole interaction energy Boltzmann-averaged over $\lambda_i$ and $\lambda_j$, viz., $\bar{U}_{ij}$:

$$\bar{U}_{ij} = \frac{A_{ij}}{r_{ij}^3}(\cos\alpha_{ij} - 3\cos\beta_{ij}\cos\gamma_{ij}) - \frac{B_{ij}}{r_{ij}^6}\left[4 + (\cos\alpha_{ij} - 3\cos\beta_{ij}\cos\gamma_{ij})^2 - 3(\cos^2\beta_{ij} + \cos^2\gamma_{ij})\right], \quad (4)$$

with constants $A_{ij}$, $B_{ij}$ summarized in Table 2 of the accompanying paper (Liwo et al., 1993).
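The two energy expressions above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' program: the function names, the unit system (left to the caller), and the numerical values used below are hypothetical, and the constants $A_{ij}$ and $B_{ij}$ must be taken from the accompanying paper. The first function is the standard point-dipole expression of Equation 1; the second evaluates the averaged energy of Equation 4 from the cosines of $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$, including the $-3(\cos^2\beta_{ij} + \cos^2\gamma_{ij})$ term that results when the rotating perpendicular dipole components are averaged.

```python
import math


def _dot(a, b):
    """Scalar product of two 3-vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def dipole_dipole_energy(mu_i, mu_j, r_vec, eps=1.0):
    """Equation 1: interaction energy of two point dipoles.

    mu_i, mu_j : dipole-moment vectors
    r_vec      : vector from the center of group i to the center of group j
    eps        : dielectric constant
    Units are whatever consistent system the caller adopts (assumption).
    """
    r = math.sqrt(_dot(r_vec, r_vec))
    e_ij = [c / r for c in r_vec]  # unit vector from i to j
    angular = _dot(mu_i, mu_j) - 3.0 * _dot(mu_i, e_ij) * _dot(mu_j, e_ij)
    return angular / (eps * r ** 3)


def averaged_energy(a_ij, b_ij, r, cos_alpha, cos_beta, cos_gamma):
    """Equation 4: dipole-dipole energy averaged over lambda_i and lambda_j.

    a_ij, b_ij are the pair constants (hypothetical inputs here).
    """
    f = cos_alpha - 3.0 * cos_beta * cos_gamma
    second_order = 4.0 + f * f - 3.0 * (cos_beta ** 2 + cos_gamma ** 2)
    return a_ij / r ** 3 * f - b_ij / r ** 6 * second_order


# Head-to-tail dipoles attract; side-by-side parallel dipoles repel.
head_to_tail = dipole_dipole_energy((1, 0, 0), (1, 0, 0), (2, 0, 0))
side_by_side = dipole_dipole_energy((1, 0, 0), (1, 0, 0), (0, 2, 0))
```

With unit dipoles 2 length units apart, the head-to-tail arrangement gives a negative energy and the side-by-side parallel arrangement a positive one, which is the behavior the alignment procedure exploits.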
We identify the most important terms in Equation 3, i.e., those that contribute significantly to $\sum_{i=1}^{n-2}\sum_{j=i+1}^{n-1} U_{ij}^{approx}(\lambda_i,\lambda_j)$, by defining two peptide groups, $i$ and $j$ (not contiguous in the amino acid sequence), to be in electrostatic contact if the value $\bar{U}_{ij}$ of their average electrostatic-interaction energy is more negative than a cutoff value of -0.3 kcal/mol. This cutoff value was determined from an analysis of the relationship between the geometry of hydrogen-bonded peptide groups and their average electrostatic-interaction energies, obtained from the Protein Data Bank; i.e., it was found that, if the value of the average electrostatic-interaction energy between two peptide groups is more positive than -0.3 kcal/mol, then these two peptide groups cannot form a hydrogen bond (optimal electrostatic-interaction energy corresponds to the most extensive hydrogen-bonding network).

Definition of a dipole path

In order to define a dipole path, let us first consider an isolated pair of dipoles $i$ and $j$, i.e., only one component $U_{ij}^{approx}(\lambda_i,\lambda_j)$ of the sum in Equation 3:

$$U_{ij}^{approx}(\lambda_i,\lambda_j) = \zeta_{ij}^{(1)}\cos(\lambda_i+\lambda_j-\Lambda_{ij}^{(1)}) + \zeta_{ij}^{(2)}\cos(\lambda_i-\lambda_j-\Lambda_{ij}^{(2)}). \quad (5)$$

It can easily be shown that there are two equivalent minima of this function, one for each value of $h$ = 0 or 1 (Equation 6). Each of the two orientations of dipoles $i$ and $j$, corresponding to $h$ = 0 or 1 and given by $[\lambda_i^{opt}(h), \lambda_j^{opt}(h)]$, is hereafter referred to as the best alignment of this pair of dipoles. If the virtual bonds $C^\alpha_i$-$C^\alpha_{i+1}$ and $C^\alpha_j$-$C^\alpha_{j+1}$ are parallel, each of these two alignments corresponds to the situation in which the dipoles $i$ and $j$ lie on the line linking them and point in the same direction (the two values of $h$ then correspond to arrangements of these dipoles that differ by 180°). The integer $h$ determines whether the dipole-moment vectors are aligned along the direction from $\boldsymbol{\mu}_i$ to $\boldsymbol{\mu}_j$ or from $\boldsymbol{\mu}_j$ to $\boldsymbol{\mu}_i$. This is the ideal alignment of peptide-group dipoles that is encountered, for example, in β-sheets.
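The existence of two equivalent minima for an isolated pair can be checked with a short Python sketch. It works directly from the form of Equation 5 (two cosine terms in the sum and the difference of the rotation angles) and is not a transcription of the paper's Equation 6; the function and argument names (`zeta1`, `zeta2`, `phase1`, `phase2`) and the numerical values are hypothetical. Each cosine term is minimized independently, which fixes the sum and the difference of the two angles modulo 360°, so the angles themselves are fixed up to a simultaneous 180° shift labeled by h.

```python
import math


def pair_energy(lam_i, lam_j, zeta1, zeta2, phase1, phase2):
    """Pair energy with the functional form of Equation 5."""
    return (zeta1 * math.cos(lam_i + lam_j - phase1)
            + zeta2 * math.cos(lam_i - lam_j - phase2))


def best_pair_alignment(zeta1, zeta2, phase1, phase2):
    """Return the two equivalent minima [(lam_i, lam_j) for h = 0 and h = 1].

    A term zeta*cos(x - phase) is lowest at x = phase + pi when zeta > 0
    and at x = phase when zeta < 0.
    """
    s = phase1 + (math.pi if zeta1 > 0 else 0.0)  # optimal lam_i + lam_j
    d = phase2 + (math.pi if zeta2 > 0 else 0.0)  # optimal lam_i - lam_j
    minima = []
    for h in (0, 1):
        lam_i = 0.5 * (s + d) + h * math.pi
        lam_j = 0.5 * (s - d) + h * math.pi
        minima.append((lam_i % (2.0 * math.pi), lam_j % (2.0 * math.pi)))
    return minima


# Hypothetical coefficients and phases (radians), for illustration only.
minima = best_pair_alignment(1.0, 0.5, 0.3, -0.7)
energies = [pair_energy(li, lj, 1.0, 0.5, 0.3, -0.7) for li, lj in minima]
```

Both minima have the same energy, $-(|\zeta^{(1)}| + |\zeta^{(2)}|)$, and differ by a rotation of both dipoles through 180°, consistent with the two path directions described in the text.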
Now, consider a sequence of different dipoles, not consecutive in the virtual-bond chain, with indices $\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_m$; i.e., the numbers $\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_m$ are different nonsequential integers that may range from 1 to $n-1$ ($n-1$ is the number of dipoles in the virtual-bond chain). The above situation (in which two dipoles lie on the line linking them and point in the same direction) is similar for the sequence of dipoles $\pi_1, \pi_2, \ldots, \pi_m$ if they satisfy the following two conditions for $k = 2, 3, \ldots, m-1$: the peptide group $p_{\pi_k}$ is in electrostatic contact with $p_{\pi_{k-1}}$ and $p_{\pi_{k+1}}$; and the angles $\lambda^{opt}_{\pi_k}$ calculated from Equation 6 for the pair $(\pi_{k-1}, \pi_k)$ and for the pair $(\pi_k, \pi_{k+1})$ do not differ by more than 90°, while, at the same time, the dipole $\boldsymbol{\mu}_{\pi_{k-1}}$ points toward the peptide group $p_{\pi_k}$ and the dipole $\boldsymbol{\mu}_{\pi_k}$ points toward the peptide group $p_{\pi_{k+1}}$. By analogy to the case of an isolated pair of dipoles, there are two possibilities for the alignment of the dipoles: one in which all the dipoles, consecutive in the path, are aligned approximately along the directions from $p_{\pi_k}$ to $p_{\pi_{k+1}}$, $k = 1, 2, \ldots, m-1$, and a second one in which all the dipoles, consecutive in the path, are aligned approximately in the opposite directions. We associate the first possibility with the sequence $\pi_1, \pi_2, \ldots, \pi_k, \ldots, \pi_m$ and the second possibility with the sequence $\pi_m, \pi_{m-1}, \ldots, \pi_k, \ldots, \pi_1$. Such sequences of dipoles are hereafter called dipole paths and are denoted by $\Pi$.

There is a close analogy between a path of aligned dipoles and the corresponding array of hydrogen-bonded peptide groups (Fig. 2). The two directions of dipole alignment correspond to the two possible directions of hydrogen bonds in the array, and the peptide group $p_{\pi_k}$ acts as a proton donor for the peptide group $p_{\pi_{k+1}}$, $k = 1, 2, \ldots, m-1$.

For an isolated pair of adjacent peptide groups in the polypeptide chain, each of the two minima of the dipole-dipole interaction energy given by Equation 6 corresponds to very poor alignment of the dipole-moment vectors (each of the two dipole-moment vectors makes an angle far from 0° with the line linking the two dipoles). This is due to the fact that two adjacent virtual Cα-Cα bonds are almost perpendicular. Physically, the alignment of dipole moments of adjacent peptide groups corresponds to the formation of 1,7- and 1,5-hydrogen bonds, as in the C7 and C5, respectively, hydrogen-bonded rings. Such hydrogen bonds are substantially distorted from optimum hydrogen-bond geometry (Baker & Hubbard, 1984; Görbitz, 1989). Thus, two peptide groups, adjacent in the polypeptide chain, will not be considered as consecutive elements of a path (the construction of dipole paths that allows adjacent peptide groups to be consecutive elements of a path was examined and did not improve the results).

Fig. 2. The two possible hydrogen-bond paths on a model α-carbon frame compared with the corresponding dipole paths. The direction of each dipole path is indicated by the larger arrow. The shorter arrows represent the dipoles at the centers of the virtual bonds. The peptide groups represented here are never consecutive in the polypeptide chain.

Determination of a dipole path

The following conditions are applied in order to choose the peptide groups to form a path, say, $\Pi_k$ (composed of dipoles with indices $\pi_k(1), \pi_k(2), \ldots, \pi_k(m_k)$; $m_k$ is the number of dipoles in the path $\Pi_k$):

1. $|\pi_k(i) - \pi_k(i+1)| > 1$ for all $i = 1, \ldots, m_k - 1$ (this condition excludes adjacent peptide groups in the chain from the dipole path).
2. The value of the average energy of electrostatic interaction between peptide groups $p_{\pi_k(i)}$ and $p_{\pi_k(i+1)}$, $\bar{U}_{\pi_k(i)\pi_k(i+1)}$, is more negative than -0.3 kcal/mol (as computed using Equation 4) for $i = 1, 2, \ldots, m_k - 1$. This is the requirement that any two peptide groups, consecutive in the path, are in electrostatic contact.
3. $\mathbf{e}_{\pi_k(i)\pi_k(i+1)}\cdot\mathbf{e}_{\pi_k(i+1)\pi_k(i+2)} > 0$ for $i = 1, 2, \ldots, m_k - 2$. This is the requirement that the curvature of the path be low; in other words, it is stipulated that the angle between the centers of any three peptide groups consecutive in the path be greater than 90°.
4. $p_{\pi_k(i)}$ does not follow the proline α-carbon for $i = 2, 3, \ldots, m_k$ (because the proline residue cannot act as a proton donor; i.e., proline must terminate a path, if it is at all involved).
5. $\pi_k(i+1) - \pi_k(i) \neq 2$ for $i = 1, 2, \ldots, m_k - 1$, because inverse turns do not occur in proteins.
By "inverse turn" we mean the arrangement that would be obtained from a β-turn by flipping the first and third peptide groups (those that form the 1,10 hydrogen bond), i.e., the situation in which a hydrogen bond would be formed between the amide hydrogen of the first peptide group and the carbonyl oxygen of the third peptide group, rather than vice versa.
6. The proline residue can terminate a path (i.e., act as a proton acceptor) only if the resulting alignment does not violate its internal geometry (i.e., the resulting value of the dihedral angle φ of proline should not differ from -75° for L-proline or 75° for D-proline).
7. A path composed of peptide groups in 1,4- (α-helical) electrostatic contacts is favored over one composed of peptide groups in 1,3- (3₁₀-helical) contacts if the α-helical path consists of more than two peptide groups. This is consistent with the fact that long 3₁₀-helices are unstable (Scott & Scheraga, 1966) and do not occur in proteins (Barlow & Thornton, 1988). The validity of this condition is illustrated by the calculations on model helices presented in the Results section.
8. If several paths can be formed, depending on the choice among different residues to extend a path, we keep only the one that results in sterically allowed conformations of individual amino acid residues. If, however, there is still more than one possibility that results in sterically allowed conformations, the path that gives the lowest electrostatic energy is chosen.

The alignment problem for a single dipole path

Having established the sequences of dipoles to form dipole paths, it is then necessary to minimize the internal electrostatic energy of each path, with respect to the λ's, in order to obtain the optimum alignment of the dipoles in each path. To solve the alignment problem (i.e., to determine the angles λ) for each path, we assume that each dipole is influenced only by the electrostatic field from the dipole preceding and the dipole following it in the path; the other dipoles in the path are much more distant from it by definition and will, therefore, have little influence on the alignment. Then, as follows from Equation 3, the alignment energy $U_{\Pi_k}$ of path $\Pi_k$ can be expressed by:

$$U_{\Pi_k} = \sum_{i=1}^{m_k-1} U^{approx}_{\pi_k(i)\pi_k(i+1)}\left(\lambda_{\pi_k(i)}, \lambda_{\pi_k(i+1)}\right). \quad (7)$$

Optimum alignment within the path under consideration is achieved if the alignment energy expressed by Equation 7 attains its minimum. This means that the following system of nonlinear equations has to be solved:

$$\frac{\partial U_{\Pi_k}}{\partial \lambda_{\pi_k(i)}} = 0, \quad i = 1, 2, \ldots, m_k. \quad (8)$$

The system of Equations 8 is solved by the Newton method (Ortega & Rheinboldt, 1970) with the Marquardt modification of the Hessian matrix (Marquardt, 1963). This assures that the minimum of $U_{\Pi_k}$, and not a saddle point or maximum, is obtained. The initial angles, $\lambda^0$'s, used for the Newton-Marquardt method are chosen with the aid of Equation 9 as the arithmetic means (from Equation 6) of those resulting from the best alignment with the preceding and the following dipole (except for the terminal $\lambda^0$'s, which are chosen as the result of the alignment of the first two or the last two dipoles); in Equation 9, $h$ = 0 or 1 indicates the path direction.

Definition of a dipole-path network

Having defined a dipole path, we now define a network of dipole paths. Let $\nu$ be the number of dipole paths, $m_i$ the number of dipoles in the $i$th path, and $n_f$ the number of the dipoles that are not involved in any path ("free" dipoles). The dipole path $\Pi_i$ is a sequence of dipoles $(\boldsymbol{\mu}_{\pi_i(1)}, \ldots, \boldsymbol{\mu}_{\pi_i(m_i)})$ for $i = 1, 2, \ldots, \nu$, while $\boldsymbol{\mu}_{f_1}, \boldsymbol{\mu}_{f_2}, \ldots, \boldsymbol{\mu}_{f_{n_f}}$ are the "free" dipoles. Then, using the dipole-path concept, we can rewrite Equation 3 as follows:

$$U = \sum_{i=1}^{\nu} U_{\Pi_i} + \sum_{i=1}^{\nu-1}\sum_{j=i+1}^{\nu} U_{\Pi_i\Pi_j} + \sum_{i=1}^{n_f}\sum_{j=1}^{\nu} U_{f_i\Pi_j} + \sum_{i=1}^{n_f-1}\sum_{j=i+1}^{n_f} U_{f_i f_j}. \quad (10)$$

The components of the first sum in Equation 10 can be considered as the internal electrostatic energies of each of the paths in the network, the "path energies." The next sum corresponds to the interaction between different paths in the network, the "path-interaction energies." The last two sums are the interactions of the "free" dipoles with the paths and with other "free" dipoles, respectively. According to our previous considerations, a dipole path corresponds to a particularly low-energy alignment of the component dipoles; we therefore assume that this alignment will not be affected substantially by the interactions with other dipoles in the system. Of course, a given path has its reversed counterpart with comparable electrostatic energy (the energy is not exactly the same, because changing the angles λ by 180° results in different values of the virtual-bond valence angles, θ). Therefore, having chosen the sequences of dipoles to form paths (see the subsection on Determination of a dipole path), we next find the groups of paths that are close in space, and we determine the orientation within each group that leads to the lowest path-interaction energy (the second sum in Equation 10). This is a combinatorial problem, because we can choose only one of the two directions for each of the interacting paths. For example, for a group of paths forming a β-sheet, the antiparallel orientation of neighboring paths is favorable. Conversely, the three dipole paths running through an α-helix are parallel, because they are shifted along the helix axis with respect to each other. However, there are, in general, still two possibilities to orient a group of interacting paths, depending on the choice of the direction of a given path in the group. We choose the one that results in sterically more favorable conformations of the individual amino acid residues. As shown in the examples presented in the Results section, in the case of α-helical and β-sheet structures, one of the possibilities results in the choice of conformational states lying in high-energy regions of the (φ, ψ) conformational map, as defined by Zimmerman et al. (1977).

Determination of dipole-path network

The dipole-path network, a collection of dipole paths, is determined in the following way. First, the paths that can be constructed unambiguously, based on the conditions enumerated in the section above, are sought; these usually correspond to α-helices and the core of β-sheets. These paths, representing the most frequently occurring hydrogen-bonded structures, serve as a seed.
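Candidate seed paths have to pass the conditions listed under Determination of a dipole path. The following Python sketch checks the purely geometric and energetic ones (conditions 1, 2, 3, and 5) for a proposed sequence of dipole indices. It is a hedged illustration rather than the authors' code: the function `is_valid_path`, the `centers` mapping, and the `avg_energy` callback are hypothetical names, and the proline restrictions, the preference for α-helical over 3₁₀-helical paths, and the steric screening (conditions 4 and 6-8) are deliberately left out.

```python
import math

CONTACT_CUTOFF = -0.3  # kcal/mol, the electrostatic-contact cutoff from the text


def _unit(a, b):
    """Unit vector pointing from point a to point b."""
    d = [y - x for x, y in zip(a, b)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]


def is_valid_path(path, centers, avg_energy):
    """Check conditions 1, 2, 3, and 5 for a candidate dipole path.

    path       : list of dipole indices in path order (donor first)
    centers    : mapping index -> (x, y, z) of the peptide-group center
    avg_energy : callable (i, j) -> averaged energy of Equation 4
    """
    for a, b in zip(path, path[1:]):
        if abs(a - b) <= 1:                      # condition 1: not adjacent
            return False
        if avg_energy(a, b) >= CONTACT_CUTOFF:   # condition 2: in contact
            return False
        if b - a == 2:                           # condition 5: no inverse turn
            return False
    for a, b, c in zip(path, path[1:], path[2:]):
        e_ab = _unit(centers[a], centers[b])
        e_bc = _unit(centers[b], centers[c])
        if sum(x * y for x, y in zip(e_ab, e_bc)) <= 0.0:  # condition 3
            return False
    return True


# Toy data: centers on a straight line and a uniformly favorable energy.
centers = {i: (float(i), 0.0, 0.0) for i in range(1, 10)}
favorable = lambda i, j: -1.0
```

In this toy setup `is_valid_path([1, 4, 7], centers, favorable)` passes, whereas `[1, 2, 5]` fails condition 1, `[1, 3, 6]` fails condition 5, and `[1, 7, 4]` fails condition 3 because the path doubles back on itself.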
Then, we look for dipoles that are in electrostatic contact with those already involved in the dipole-path network and attach them to the appropriate paths, iterating the procedure until no further single dipoles can be attached to any path. If there are still paths left whose directions are not determined (i.e., if both directions give equally good alignment with the electrostatic field of the closest paths and give sterically allowed conformations of the amino acid residues belonging to those paths), or if some dipoles can be attached to more than one path, we consider all the possible choices, and the arrangement that results in the lowest electrostatic energy, as calculated from Equation 3, is chosen.

Alignment of free dipoles

In order to complete the estimation of the best alignment of dipoles, after the dipole-path network is established and the alignment equations are solved for each path, we align the "free" dipoles in the electrostatic field produced by the dipoles belonging to the paths.

Adjustment of virtual-bond chain to satisfy the geometry of the real chain

The set of λ's (calculated with Equations 8 and 9 for each path, and from the alignment of free dipoles, according to the electrostatic field produced by all the dipoles involved in the dipole paths) will usually result in the angles τ between the N-Cα and Cα-C' bonds of the real chain being far from their standard values $\tau^s$. Because such deviations cannot be allowed, it is necessary to calculate a new set of virtual-bond angles θ from the λ's (Equations A15 and A16 of the Appendix) to reestablish the proper angles τ between the N-Cα and Cα-C' bonds. This, however, would move the Cα's.
To compensate for these displacements, we vary the virtual-bond torsional angles γ to bring the Cα's close to their original (input) positions (at the same time retaining the correct bond angles $\tau = \tau^s$ between the N-Cα and Cα-C' bonds); this is accomplished by minimizing Φ of Equation 11 with respect to the γ's, treating the θ's as constants:

$$\Phi(\gamma_1, \gamma_2, \ldots, \gamma_{n-3}) = \sum_{i=1}^{n-3}\sum_{j=i+3}^{n}\left(d_{C^\alpha_i C^\alpha_j} - d^{0}_{C^\alpha_i C^\alpha_j}\right)^2, \quad (11)$$

where $d^{0}_{C^\alpha_i C^\alpha_j}$ is the distance between α-carbons $i$ and $j$ in the input virtual-bond chain, and the values of $d_{C^\alpha_i C^\alpha_j}$ are functions of the γ's. Minimization in Equation 11 is carried out by using the Marquardt method (Marquardt, 1963). Then, the new values of θ and γ are used to compute new coefficients $\zeta^{(1)}$, $\zeta^{(2)}$, $\Lambda^{(1)}$, and $\Lambda^{(2)}$, which replace the old ones in the system of Equations 8, and the cycle consisting of solving Equation 8 and minimizing Φ of Equation 11 is, in principle, iterated until self-consistency is achieved.

However, such an iterative procedure is usually unstable; i.e., it usually does not converge. The values of the λ's calculated by solving the alignment problem for the input virtual-bond chain are used to compute new values of the virtual-bond angles θ, and these new values need not be close to the initial virtual-bond angles $\theta^0$; i.e., the changes of successive values of the θ's, γ's, and λ's are very large (e.g., the α-carbon trace obtained in united-residue calculations presented in the accompanying paper [Liwo et al., 1993] has all of the $\theta^0$ angles fixed at 90°, the average value of θ determined from the Protein Data Bank; this means that the starting values of $\theta^0$ and $\lambda^0$ are too far from the values to which the procedure should lead to satisfy the proper geometry of the real chain). To surmount this obstacle, we make smaller successive changes of the θ's, λ's, and γ's by starting with an artificial polypeptide chain consisting of peptide groups for which the N-Cα and Cα-C' bonds are collinear with the corresponding Cα-Cα virtual bonds.
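The least-squares target function of Equation 11 can be written compactly in Python. This is a minimal sketch under stated assumptions: it takes Cartesian α-carbon coordinates directly, whereas in the procedure described here the current coordinates are themselves functions of the torsional angles γ (that chain-building step and the Marquardt minimizer are not shown). The function name `phi` and the toy coordinates are hypothetical.

```python
import math


def phi(coords, ref_coords):
    """Target function of Equation 11.

    coords     : current alpha-carbon positions, list of (x, y, z)
    ref_coords : alpha-carbon positions of the input virtual-bond chain
    Only pairs separated by at least three virtual bonds (j >= i + 3) are
    summed; with fixed bond lengths and fixed valence angles, the 1-2 and
    1-3 distances do not depend on the torsional angles.
    """
    n = len(coords)
    total = 0.0
    for i in range(n - 3):
        for j in range(i + 3, n):
            d = math.dist(coords[i], coords[j])
            d0 = math.dist(ref_coords[i], ref_coords[j])
            total += (d - d0) ** 2
    return total


# Toy four-residue example: only the (1, 4) pair enters the sum.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
displaced = reference[:3] + [(4.0, 0.0, 0.0)]
```

For the toy chain, `phi(reference, reference)` vanishes and `phi(displaced, reference)` equals the squared change of the single 1-4 distance.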
With ξ_i, η_i, and τ_i (i = 2, 3, ..., n - 1) being the planar angles of peptide group i, defined by atoms Cα_{i-1}-Cα_i-N_i, Cα_{i+1}-Cα_i-C'_i, and N_i-Cα_i-C'_i, respectively, the values of these angles for this artificial chain are: ξ_i = 0, η_i = 0, and τ_i = θ^0_{i-1}. Now, the angles θ do not depend on the rotation angles λ of the peptide groups, and θ and λ are consistent with each other. We then gradually shift the angles ξ_i, η_i, and τ_i toward the standard values ξ_i^s, η_i^s, and τ_i^s of the real chain. This assures a smooth transition from the virtual-bond to the all-atom chain. The following algorithm is used for the above procedure:

1. Divide the intervals [0, ξ_i^s], [0, η_i^s], and [θ^0_{i-1}, τ_i^s], i = 2, 3, ..., n - 1, into 10 equal segments. Then, iterate the steps below over k = 0, 1, ..., 10, setting ξ_i^(k) = kξ_i^s/10, η_i^(k) = kη_i^s/10, and τ_i^(k) = θ^0_{i-1} + k(τ_i^s - θ^0_{i-1})/10, i = 2, 3, ..., n - 1, and substituting them for ξ_i^s, η_i^s, and τ_i^s in Equations A15 and A16 of the Appendix to calculate the θ's.
2. For each dipole path, solve the system of alignment Equations 8. Then, align the free dipoles with the electrostatic field of all the paths, taking into account restrictions imposed by the presence of proline (whose dihedral angle φ must be -75° for L-proline or +75° for D-proline).
3. Using Equations A15 and A16, and taking into account the restrictions imposed by the presence of proline, calculate new virtual-bond angles (θ_1, θ_2, ..., θ_{n-2}).
4. Solve the least-squares problem with the target function of Equation 11, treating the virtual-bond dihedral angles (γ_1, γ_2, ..., γ_{n-3}) as variables and setting the angles (θ_1, θ_2, ..., θ_{n-2}) at the values calculated in the preceding step.
5. Iterate steps 2-4 until the maximum difference in the angles (θ_1, θ_2, ..., θ_{n-2}) in two consecutive iterations is less than a required tolerance (0.5° in this work).

After the alignment problem has been solved, the dihedral angles φ and ψ are calculated from the resulting values of the γ's and λ's by using Equations A34 and A35 of Wako and Scheraga (1982a).
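Step 1 of the algorithm above amounts to a linear interpolation of the three planar angles over 11 stages (k = 0, ..., 10). The following Python sketch shows that schedule for a single residue. The function name and the numerical standard angles are placeholders chosen for illustration; they are not the ECEPP/2 values, and the solution of the alignment equations (steps 2-5) is not shown.

```python
def interpolation_schedule(xi_std, eta_std, tau_std, theta0_prev, steps=10):
    """Yield (k, xi_k, eta_k, tau_k) for k = 0..steps for one residue i.

    xi and eta grow linearly from 0 to their standard values, while tau
    moves linearly from the input virtual-bond angle theta0_prev
    (theta^0_{i-1}) to its standard value. Angles are in degrees.
    """
    for k in range(steps + 1):
        frac = k / steps
        xi_k = frac * xi_std
        eta_k = frac * eta_std
        tau_k = theta0_prev + frac * (tau_std - theta0_prev)
        yield k, xi_k, eta_k, tau_k


# Placeholder standard angles (assumed for illustration only) and an input
# virtual-bond angle of 90 degrees, as used for the united-residue traces.
schedule = list(
    interpolation_schedule(xi_std=20.0, eta_std=15.0, tau_std=110.0, theta0_prev=90.0)
)

for k, xi_k, eta_k, tau_k in schedule:
    print(k, xi_k, eta_k, tau_k)
```

At k = 0 the chain is the artificial one (ξ = η = 0, τ = θ^0), and at k = 10 all three angles have reached their standard values, so each stage starts from a solution that is already close to consistent.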
Results and discussion

Application to model regular systems

α-Helix

In order to assess the capability of the dipole-path method to predict the correct hydrogen-bond geometry, we first considered a model polypeptide Ac-(Ala)19-NHMe in the α-helical conformation. The α-carbon trace was calculated from the ECEPP/2-minimized α-helix, and the γ angles were found to be about 48°. The model contains a total of 20 peptide-group centers, of which those 1,3- and 1,4-separated in the chain (i.e., having one or two peptide groups between them, respectively) can be considered to be in electrostatic contact, as defined in step 2 of the procedure for constructing the dipole paths. This is illustrated in the electrostatic-contact map presented in Figure 3. Figure 3 together with rules 3-8 determine possible dipole paths. Rule 1, excluding adjacent peptide groups, eliminates contacts that might have appeared in the white space near the diagonal. Rule 2 eliminates all contacts represented by empty squares. Rule 5 eliminates the contacts represented by the crossed-out circles. Rule 7 eliminates the 1,3 contacts represented by the white circles in the lower diagonal part of the graph, when 1,4 contacts (represented by the black circles) are present. Rule 8 selects the contacts represented by the black circles in preference to those represented by the white circles in the upper diagonal part of the graph. Finally, rule 3 assures that the curvature of the paths represented by the black circles is low.

The selected dipole-path network, represented by the black circles and denoted by N0, is based on 1,4-contacts, with the dipoles directed from the C-terminus toward the N-terminus. It is the only path network that conforms to all the rules for constructing a dipole path. It is fully compatible with the hydrogen-bond network of an α-helix. Nevertheless, to test our method, we also examined two other possible regular networks: N1, the upper-diagonal line of white circles, comprised of 1,4-contacts but having reversed dipoles, and N2, the lower-diagonal line of white circles, comprised of 1,3-contacts (as in a 3₁₀-helix), with the dipoles directed from the C-terminus toward the N-terminus of the α-helix. (The fourth possible regular network, represented by the crossed-out white circles, based on 1,3-contacts but with reversed dipoles, is eliminated for steric reasons.) It should be noted that network N1 consists of a right-handed α-helical backbone even though the peptide groups are reversed. In all three cases, N0, N1, and N2, the dipole paths are parallel to each other to assure favorable electrostatic interactions between the dipoles neighboring in space; this is because the paths are shifted with respect to each other along the helix axis. The antiparallel orientation of any two paths in any of these networks would give a definitely higher electrostatic interaction energy; therefore, only parallel orientations are considered. For the N0 ("α-helical") structure, there are three dipole paths running through the centers of peptide groups , , and . The corresponding reversed sequences hold for the N1 ("reversed α-helical") paths; in these paths, the Cα trace remains the same as in N0, but all the peptide groups are rotated by 180°, and the dipoles are directed from the N-terminus toward the C-terminus.

Fig. 3. The electrostatic contact map of the model α-helix: Ac-(Ala)19-NHMe. Axis label: peptide-group center. Each square corresponds to a pair of peptide-group centers. A circle (black or white) therein indicates that the value of the corresponding average dipole-dipole interaction energy is more negative than -0.3 kcal/mol and, therefore, the pair can participate in formation of a dipole path. The position of each pair in the diagram also shows the direction of dipole alignment: the vector of the dipole moment of the center numbered by the abscissa is directed toward the center numbered by the ordinate. The crossed-out squares correspond to inverse-turn contacts; they do not occur in proteins. Black circles indicate the 1,4-type electrostatic contacts of the correct α-helix. The white circles in the lower diagonal part of the graph correspond to the 1,3-type electrostatic contacts, while the white (not crossed out) circles in the upper-diagonal part correspond to the 1,4-type electrostatic contacts of the α-helix with reverse dipoles.
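The contact criterion stated in the Figure 3 caption (average dipole-dipole interaction energy more negative than -0.3 kcal/mol), combined with rule 1 (adjacent peptide groups are excluded), can be sketched in Python as follows. The sketch assumes the pair energies have already been computed and are supplied as a square matrix, where entry [i][j] refers to the dipole of center i directed toward center j; the function name and the toy energies are hypothetical. The remaining rules (2-8) are not implemented here.

```python
def electrostatic_contacts(energy, threshold=-0.3):
    """Return ordered pairs (i, j) of peptide-group centers in electrostatic contact.

    energy[i][j] is the average dipole-dipole interaction energy (kcal/mol)
    for center i with its dipole directed toward center j (0-based indices).
    Self-pairs and adjacent centers are skipped (rule 1).
    """
    n = len(energy)
    contacts = []
    for i in range(n):
        for j in range(n):
            if abs(i - j) < 2:
                continue  # same or adjacent peptide groups
            if energy[i][j] < threshold:
                contacts.append((i, j))
    return contacts


# Toy 5-center energy matrix (made-up values, for illustration only).
toy = [[0.0] * 5 for _ in range(5)]
toy[3][0] = -0.8  # 1,4-separated pair, strong enough
toy[2][0] = -0.4  # 1,3-separated pair, strong enough
toy[1][0] = -1.0  # adjacent pair, excluded by rule 1
toy[4][1] = -0.2  # 1,4-separated pair, too weak

contacts = electrostatic_contacts(toy)
print(contacts)  # [(2, 0), (3, 0)]
```

Keeping the pairs ordered preserves the direction information that the contact map encodes through the abscissa and ordinate.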
There are two paths for the N2 ("3₁₀-helical") network: and . The average values of the dihedral angles φ and ψ resulting from the solution of the dipole-alignment problem, and the values of φ and ψ obtained after subsequent total energy minimization with the ECEPP/2 force field, together with the values of the minimum ECEPP/2 energy for all three possible dipole-path constructions, are summarized in Table 1. The energy obtained for the α-helical network (N0) is definitely the lowest one; it is the same as the minimum energy obtained for this system by Ripoll and Scheraga (1988) in their extensive simulations with the electrostatically driven Monte Carlo (EDMC) method. Although the N1 network (α-helix with reverse dipoles) is electrostatically equivalent to N0, the final ECEPP/2 energy of the N1 α-helix is high, because the solution of the alignment equations (Equations 8 and 9) results in locally unfavorable conformations; this observation indicates the applicability of rule 8 of the dipole-path construction. For the 3₁₀-helical paths (N2), the values of φ and ψ converged to the high-energy C*-region of Zimmerman et al. (1977). The N2 paths were highly curved, the angle between the lines joining the centers of any three peptide groups consecutive in the path being only slightly greater than 90° (violation of rule 3). The corresponding angles were about 160° for the N0 (α-helical) paths.

Table 1. Average values of dihedral angles for helices

                          N0              N1              N2
                       a       b       a       b       a       b
φ (deg)              -41     -67      51      64       0      52
ψ (deg)
E_tot^c (kcal/mol)

a This column contains the results from the solution of the alignment problem.
b These are the results of subsequent ECEPP/2 total energy minimization of the structures in the previous column. For the N0 dipole-path network, the energy minimization resulted in crankshaft-type changes in the backbone dihedral angles.
c Total ECEPP/2 energy of the whole chain after energy minimization.
Table 1 shows that the choice of a dipole-path network other than N0 results in high-energy conformations and in the dihedral angles φ, ψ lying in unfavorable regions.

3₁₀-Helix

Using the average values of φ = -71° and ψ = -18° of a 3₁₀-helix (Barlow & Thornton, 1988), we calculated the α-carbon trace of this helix. The resulting virtual-bond torsional angles were 70.5° (instead of 48° for an α-helix). When the dipole-path method was applied to this virtual-bond chain for our poly-alanine model, in which the only allowable electrostatic contacts are of the 1,3-type, only one dipole-path network was possible, viz., the 3₁₀-helical network. The dihedral angles φ and ψ resulting from dipole alignment were -7° and -94°. After ECEPP/2 energy minimization, the system converged to the α-helix. It is worth noting that ECEPP/2 energy minimization (without application of the dipole-path method) also led to a full α-helix when starting from the conformation with the above φ and ψ dihedral angles characteristic of a 3₁₀-helix.

β-Sheets

Next, we considered the parallel and antiparallel β-sheet models of poly(L-alanine) studied by Chou et al. (1983). The models consisted of three Ac-(Ala)6-NHMe strands. The electrostatic-contact maps for both types of sheets are shown in Figure 4A and C. For both the parallel and antiparallel sheets, there are four dipole-path networks, two composed of parallel paths and two composed of antiparallel paths. It can easily be shown that only the two networks of antiparallel dipole paths can be constructed in each case; the parallel orientation of any two neighboring paths is electrostatically unfavorable because each dipole exactly faces its counterpart in the neighboring path. The only antiparallel dipole paths (corresponding to the black circles in Fig. 4A and C) that lead to sterically allowed conformations are , , , , , , and for the parallel and , , , , , , and for the antiparallel β-sheet (where the peptide groups are numbered consecutively from the first to the last strand, as in Fig. 4B and D). As shown in Table 2, in each case the reversed paths (corresponding to white circles in Fig. 4A and C) lead to conformational states lying in the unfavorable C*-region. Moreover, in these cases, the alanine side chains overlap with the backbone atoms of the neighboring strands; therefore, in both cases energy minimization resulted in destruction of the β-sheet arrangement. The dipole-path network leading to favorable intra- and interstrand interactions corresponds to the correct hydrogen-bond network (Chou et al., 1983). The dihedral angles φ and ψ resulting from the solution of the alignment problem are close to those obtained after the final total ECEPP/2 energy minimization (see Table 2). In each case, the hydrogen-bond network resulting from the application of the dipole-path approach remained unchanged after energy minimization. Because we did not impose any regularity constraints on the dihedral angles (i.e., in contrast to the work of Chou et al. [1983], it was not required that all the dihedral angles φ and all the dihedral angles ψ be equal), our values of the total energy are lower than those obtained by Chou et al. (1983). However, starting from their structures, after unconstrained energy minimization, we obtained the same values of the dihedral angles and energies that resulted from the application of the dipole-path method.

Fig. 4. The electrostatic-contact maps of three extended Ac-(Ala)6-NHMe strands forming a parallel (A) and an antiparallel (C) β-sheet. Axis label: peptide-group center. A circle (black or white) indicates an electrostatic contact. Black circles indicate the electrostatic contacts that lead to a sterically favorable hydrogen-bond network. The crossed-out squares correspond to inverse-turn contacts, which do not occur in proteins. The numbering of the α-carbon atoms (and consequently the numbering of the peptide groups starting at those carbon atoms) and the orientations of the peptide groups resulting from solutions of the alignment equations are shown in B and D, respectively.

Table 2. Average values of dihedral angles for β-sheets

                      Parallel β-sheet                          Antiparallel β-sheet
                      Correct parallel   Reversed parallel     Correct antiparallel   Reversed antiparallel
                      network            network               network                network
φ (deg)               a a -145 Illa -
ψ (deg)
E_tot^d (kcal/mol)

a This column contains the results from the solution of the alignment problem.
b These are the results of subsequent ECEPP/2 total energy minimization of the structures in the previous column.
c A dash in place of a value indicates that the β-sheet became disrupted during ECEPP/2 energy minimization.
d Total ECEPP/2 energy for three strands after energy minimization.
Application to the 5PTI X-ray structure of bovine pancreatic trypsin inhibitor

In order to check the applicability of the dipole-path method to convert a real virtual-bond chain to an all-atom backbone, we used the α-carbon trace of the 5PTI crystal structure (Wlodawer et al., 1987) of BPTI, a 58-residue protein. The electrostatic contact map, shown in Figure 5, indicates an array of 3₁₀-helical contacts at the N-terminus, an antiparallel sheet in the middle of the chain, and 3₁₀- and α-helical contacts at the C-terminus. Using our rules for construction of dipole paths, we can uniquely determine the three C-terminal α-helical paths: , , and (as was done for the model α-helix); an array of those β-sheet paths that do not result in C* conformations of the corresponding residues: 32-20, 19-33, 34-18, 17-35, and 15-36; and a short 3₁₀-helix (a sequence of β-turns) at the N-terminus (as was done for the model 3₁₀-helix): 6-4, 5-3, and 4-2. This is the seed of the dipole-path network being constructed. There are also some remaining peptide groups (11, 44, and 45) in the middle of the chain, each of which is in electrostatic contact with one peptide group initiating or terminating an already determined β-sheet path. Thus, they are added to the corresponding dipole paths. The final dipole paths (shown in Kinemage 1) are as follows: , , (the three α-helical paths at the C-terminus); 27-24, 23-29, 30-22, , , 19-33, 34-18, , (the paths of the central β-sheet, together with the peptide groups attached to them); 6-4, 5-3, 4-2 (the N-terminal 3₁₀-helix); and (an isolated β-turn). It should be noted that about 60% of the peptide-group centers cannot be assigned to any path.

Fig. 5. The electrostatic contact map of the α-carbon trace of 5PTI. Axis label: peptide-group center. Circles (black or white) indicate electrostatic contacts. The fields for the inverse-turn contacts, and those that would imply that proline acts as a proton donor, are excluded and marked by X. If the second center in a pair being linked precedes proline, a favorable alignment cannot always be reached because the dihedral angle φ of proline is fixed. This is indicated by diamonds placed in the corresponding fields. Black circles indicate contacts leading to the sterically allowed hydrogen-bond network; white circles are sterically forbidden.

The structure resulting from the application of the dipole-path method (shown in Fig. 6 and Kinemage 2) was subjected to ECEPP/2 energy minimization, with alanine substituted for all residues except glycines and prolines, and with constraints imposed on the distances between all the α-carbons (a total of 1,653 distances). (The dipole-path method provides information only about the backbone, and not the side chains; this is why the non-glycine and non-proline side chains were replaced by alanines. If we had retained the original side chains in arbitrary initial conformations, significant interatomic overlaps would have resulted, with a consequent disruption of the backbone structure resulting from the dipole-path method during ECEPP/2 energy minimization.) The constraints were introduced into the target function by adding a Braun and Gō (1985) type term to the energy (i.e., Σ_{i<j} (d_ij - d_ij^0)^2, d_ij^0 being the Cα_i-Cα_j distance in the X-ray structure), with a weight of . The resulting rms deviations from the X-ray structure were 0.85 Å for the Cα atoms and 1.09 Å for all heavy backbone atoms. It should be noted that these deviations are 1.16 Å and 1.12 Å, respectively, for the calculated ECEPP energy-refined structure with standard ECEPP/2 geometry (Vasquez & Scheraga, 1988).
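The count of 1,653 constrained distances quoted above is simply the number of distinct pairs among the 58 α-carbons of BPTI, 58 × 57 / 2. A short Python check, in which the helper name `ca_pairs` is hypothetical, enumerates the pairs explicitly:

```python
import math
from itertools import combinations


def ca_pairs(n_residues):
    """List all unordered pairs (i, j), i < j, of 1-based C-alpha indices."""
    return list(combinations(range(1, n_residues + 1), 2))


pairs = ca_pairs(58)
print(len(pairs))       # 1653
print(math.comb(58, 2)) # 1653, the same count in closed form
```

Unlike Equation 11, which skips pairs closer than three residues apart, this constraint set includes every pair, so nearest-neighbor Cα-Cα distances are restrained as well.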
However, after carrying out the energy minimization of the poly-alanine model of the X-ray structure of BPTI, subject to the same Cα-distance constraints, and starting from the ECEPP-minimized structure (Vasquez & Scheraga, 1988), the rms deviations from the X-ray structure were 0.86 Å and 0.94 Å, respectively. The final energy of the energy-minimized structure of the poly-alanine model of BPTI, resulting from the dipole-path method, was kcal/mol, compared with kcal/mol for the ECEPP-minimized structure of the same model. We can, therefore, conclude that the dipole-path method gave almost as good agreement between the final calculated and crystallographic backbone of BPTI, and as low an energy, as those resulting from the direct fitting of the crystallographic backbone to the ECEPP/2 geometry. A plot of the average deviation of the backbone atoms of the consecutive residues along the chain from the X-ray structure (Fig. 7A) shows that agreement is very good for almost all the residues involved in dipole paths (a residue is considered to be involved in a dipole-path network if the peptide group beginning and the peptide group ending at the corresponding α-carbon are involved in the dipole-path network; such residues are indicated by large circles in Fig. 7A). Greater differences are observed for the residues that did not participate in the dipole-path network. Figure 7B and Kinemage 3, in which the crystallographic and reconstructed backbone are superposed, show that some of the peptide groups not involved in the dipole-path network (indicated by arrows) flipped to the other side (i.e., their angles λ differ by about 180° from the correct values).

The conformational states of the reconstructed and ECEPP-minimized crystallographic backbone are compared in Table 3. Again, the best agreement is observed for those residues that were involved in the dipole-path network; they are either in the same or neighboring regions of Zimmerman et al. (1977) as those of the ECEPP-minimized structure. Overall, about 77% of the residues are either in the same or neighboring regions of Zimmerman et al. (1977), which greatly exceeds the fraction (40%) involved in the dipole-path network. We can thus conclude that the alignment of the free dipoles in the electrostatic field of the paths is responsible, to some extent, for the location of the peptide groups. As might be expected, this kind of prediction is particularly good for residues that are located close to dipole paths.

Fig. 6. Stereo drawings of the superposition of the backbone and Cβ atoms of 5PTI (black circles and heavy lines) on the structure resulting from the application of the dipole-path method (empty circles and thin lines). Only the N, Cα, and C' atoms of the backbones are shown.

Fig. 7. A: Plot of the mean deviation of the backbone and Cβ atoms of the residues along the chain (abscissa: residue number) for the best superposition of the 5PTI structure and the structure resulting from application of the dipole-path method to the α-carbon trace of 5PTI (after energy minimization with constraints). The residues involved in the dipole-path network are circled. B: Stereo drawings of the superposition of the backbone and Cβ atoms of 5PTI (black circles and heavy lines) on the structure obtained by using the dipole-path method and energy minimization (empty circles and thin lines). The arrows indicate the peptide groups whose orientations differ by about 180° from those of the X-ray structure.

Computational details

The dipole-path algorithm was implemented on a Stardent series 1500 computer. The computation time was about 30 s for the model helices and β-sheets and about 600 s for BPTI.
ECEPP/2 energy minimization was carried out on a Stardent series 3000 computer, the computation time being about 5 min for the model helices and β-sheets and about 30 min for the poly-alanine model of BPTI. The secant unconstrained minimization solver (SUMSL) algorithm (Gay, 1983) was used for energy minimization. Superposition of BPTI structures was carried out by using the singular-value decomposition algorithm (Golub & Van Loan, 1985).

Table 3. Conformational states of BPTI^a

Residue F A B A A F F F D A F * A F B F L A A C E F C C A F * F D * A *
Residue E D E C E E E C C A A A A * E E C C C F C F C C A A A A * E
Residue C E F E C E A B * E A * F F A C _ F E! E C E A A * G C C F C G
Residue E E A E A A A A A A A B F E F E C F * A A A A A A A D G C *

a The top line in each grouping is the ECEPP/2-minimized backbone (starting from the crystal structure). The bottom line is the ECEPP/2-minimized structure resulting from the application of the dipole-path method to the α-carbon trace of the 5PTI structure. The conformational states of the residues involved in the dipole paths are underlined.

Conclusions

The examples presented in this paper show that the dipole-path method is capable of finding the correct location of peptide groups corresponding to a given virtual-bond chain for those residues that can be assigned to the dipole-path network. However, even if not all of the residues can be assigned to a dipole-path network, one can align the free dipoles with the electrostatic field of the paths, giving an additional fraction of correctly predicted conformational states. Therefore, the algorithm appears suitable for preparing a good approximation to the lowest-energy all-atom backbone corresponding to a Cα trace. In practice, the conversion of a united-residue structure (which is a sequence of Cα atoms and side-chain centroids) is followed by electrostatically driven Monte Carlo (EDMC) simulations (Ripoll & Scheraga, 1988, 1989; Williams et al., 1992) in order to find a lower-energy all-atom structure. This is described in the accompanying paper (Liwo et al., 1993).

Another potential application of the algorithm, which was only partly explored in this paper, is the conversion of a low-resolution X-ray structure to an all-atom backbone. Our sample calculation on BPTI showed that the algorithm could be used to solve this problem, since both the rms deviation and energy were of the same order as for the rigid-geometry analog of the 5PTI structure. On the other hand, it must be borne in mind that, in solving the alignment problem, we do not conserve the virtual-bond valence angles of the input chain, thereby allowing the virtual-bond angles to assume values that correspond to the best alignment. This is desirable as long as these angles cannot be estimated with sufficient accuracy (as in the united-residue treatment) but, if they are known with considerable accuracy, we should compromise between the best dipole alignment and the requirement of retaining the input geometry of the virtual-bond chain. From our calculations on BPTI, we conclude that this would affect mostly the alignment of the free dipoles, which experience only a weak electrostatic field, because they are not in electrostatic contact with any other dipoles. The advantage of the dipole-path method as a conversion algorithm, compared to methods that apply only geometric constraints, is that structures resulting from the dipole-path method not only satisfy the geometric constraints of the virtual-bond chain, but are also of low energy.

Acknowledgments

This work was supported by grant DMB from the National Science Foundation, grant GM from the National Institute of General Medical Sciences of the National Institutes of Health, NIH grant CA (to M.R.P.), and Office of Naval Research grant N (to S.R.). Support was also received from the National Foundation for Cancer Research. We thank L. Piela for helpful discussions, K.D. Gibson for providing the FORTRAN code of the singular-value-decomposition algorithm (for the best superposition of structures), and A. Nayeem for providing the ECEPP-minimized coordinates of 5PTI.

References

Baker, E.N. & Hubbard, R.E. (1984). Hydrogen bonding in globular proteins. Prog. Biophys. Mol. Biol. 44,
Barlow, D.J. & Thornton, J.M. (1988). Helix geometry in proteins. J. Mol. Biol. 201,
Bassolino-Klimas, D. & Bruccoleri, R.E. (1992). Application of a directed conformational search for generating 3-D coordinates for protein structures from α-carbon coordinates. Proteins Struct. Funct. Genet. 14,
Braun, W. & Gō, N. (1985). Calculation of protein conformations by proton-proton distance constraints. A new efficient algorithm. J. Mol. Biol. 186,
Bruccoleri, R.E. & Karplus, M. (1987). Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolymers 26,

Chothia, C. & Lesk, A.M. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823-826.

Chothia, C., Lesk, A.M., Levitt, M., Amit, A.G., Mariuzza, R.A., Phillips, S.E.V., & Poljak, R.J. (1986). The predicted structure of immunoglobulin D1.3 and its comparison with the crystal structure. Science 233,
Chou, K.-C., Némethy, G., & Scheraga, H.A. (1983). Effect of amino acid composition on the twist and the relative stability of parallel and antiparallel β-sheets. Biochemistry 22,
Crippen, G.M. & Snow, M.E. (1990). A 1.8 Å resolution potential function for protein folding. Biopolymers 29,
Dill, K.A. (1990). Dominant forces in protein folding. Biochemistry 29,
Gay, D.M. (1983). Algorithm 611. Subroutines for unconstrained minimization using a model/trust-region approach. Assoc. Comput. Mach. Trans. Math. Software 9,
Golub, G.H. & Van Loan, C.F. (1985). Matrix Computations, pp The Johns Hopkins University Press, Baltimore, Maryland.
Görbitz, C.H. (1989). Hydrogen-bond distances and angles in the structures of amino acids and peptides. Acta Crystallogr. B45,
Hirschfelder, J.O., Curtiss, C.F., & Bird, R.B. (1954). Molecular Theory of Gases and Liquids, p Wiley & Sons, New York.
Holm, L. & Sander, C. (1991). Database algorithm for generating protein backbone and side-chain co-ordinates from a Cα trace: Application to model building and detection of co-ordinate errors. J. Mol. Biol. 218,
Hubbard, T.J.P. & Blundell, T.L. (1987). Comparison of solvent-inaccessible cores of homologous proteins: Definitions useful for protein modelling. Protein Eng. 1,
Jones, T.A. & Thirup, S. (1986). Using known substructures in protein model building and crystallography. EMBO J. 5,
Liwo, A., Pincus, M.R., Wawak, R.J., Rackovsky, S., & Scheraga, H.A. (1993). Prediction of protein conformation on the basis of a search for compact structures: Test on avian pancreatic polypeptide. Protein Science 2,
Marquardt, D.W. (1963). An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Indust. Appl. Math. 11,
Mitchell, J.B.O. & Price, S.L. (1989). On the electrostatic directionality of N-H···O=C hydrogen bonding. Chem. Phys. Lett. 154,
Momany, F.A., McGuire, R.F., Burgess, A.W., & Scheraga, H.A. (1975). Energy parameters in polypeptides. VII. Geometric parameters, partial atomic charges, nonbonded interactions, hydrogen bond interactions, and intrinsic torsional potentials for the naturally occurring amino acids. J. Phys. Chem. 79,
Némethy, G., Pottle, M., & Scheraga, H.A. (1983). Energy parameters in polypeptides. 9. Updating of geometrical parameters, nonbonded interactions, and hydrogen bonding interactions for the naturally occurring amino acids. J. Phys. Chem. 87,
Nishikawa, K., Momany, F.A., & Scheraga, H.A. (1974). Low-energy structures of two dipeptides and their relationship to bend conformations. Macromolecules 7,
Ortega, J.M. & Rheinboldt, W.C. (1970). Iterative Solution of Nonlinear Equations in Several Variables, pp Academic Press, New York.
Piela, L. & Scheraga, H.A. (1987). On the multiple-minima problem in the conformational analysis of polypeptides. I. Backbone degrees of freedom for a perturbed α-helix. Biopolymers 26, S33-S58.
Pincus, M.R. & Scheraga, H.A. (1977). An approximate treatment of long-range interactions in proteins. J. Phys. Chem. 81,
Purisima, E.O. & Scheraga, H.A. (1984). Conversion from a virtual-bond chain to a complete polypeptide backbone chain. Biopolymers 23,
Rackovsky, S. (1990). Quantitative organization of the known protein X-ray structures. I. Methods and short-length-scale results. Proteins Struct. Funct. Genet. 7,
Rackovsky, S. & Scheraga, H.A. (1978). Differential geometry and polymer conformation. I. Comparison of protein conformations. Macromolecules 11,
Rackovsky, S. & Scheraga, H.A. (1980). Differential geometry and polymer conformation. 2. Development of conformational distance function. Macromolecules 13,
Reid, L.S. & Thornton, J.M. (1989). Rebuilding flavodoxin from Cα coordinates: A test study. Proteins Struct. Funct. Genet. 5,
Rey, A. & Skolnick, J. (1992). Efficient algorithm for the reconstruction of a protein backbone from the α-carbon coordinates. J. Comput. Chem. 13,
Ripoll, D.R. & Scheraga, H.A. (1988). On the multiple-minima problem in the conformational analysis of polypeptides. II. An electrostatically driven Monte Carlo method: Tests on poly(L-alanine). Biopolymers 27,
Ripoll, D.R. & Scheraga, H.A. (1989). The multiple-minima problem in the conformational analysis of polypeptides. III. An electrostatically driven Monte Carlo method: Tests on enkephalin. J. Protein Chem. 8,
Schuster, P. (1976). Energy surfaces for hydrogen bonded systems. In The Hydrogen Bond, Vol. I (Schuster, P., Zundel, G., & Sandorfy, C., Eds.), pp North Holland Publishing Company, Amsterdam.
Scott, R.A. & Scheraga, H.A. (1966). Conformational analysis of macromolecules. III. Helical structures of polyglycine and poly-L-alanine. J. Chem. Phys. 45,
Seetharamulu, P. & Crippen, G.M. (1991). A potential function for protein folding. J. Math. Chem. 6,
Sippl, M.J., Némethy, G., & Scheraga, H.A. (1984). Intramolecular potentials from crystal data. 6. Determination of empirical potentials for O-H···O=C hydrogen bonds from packing configurations. J. Phys. Chem. 88,
Skolnick, J. & Kolinski, A. (1990). Simulations of the folding of a globular protein. Science 250,
Vasquez, M. & Scheraga, H.A. (1988). Calculation of protein conformation by the build-up procedure. Application to bovine pancreatic trypsin inhibitor using limited simulated nuclear magnetic resonance data. J. Biomol. Struct. Dyn. 5,
Wako, H. & Scheraga, H.A. (1982a). Distance constraint approach to protein folding. I. Statistical analysis of protein conformations in terms of distances between residues. J. Protein Chem. 1,
Wako, H. & Scheraga, H.A. (1982b). Distance constraint approach to protein folding. II.
Prediction of three-dimensional structure of bovine pancreatic trypsin inhibitor. J. Protein Chem. 1,
Williams, R.L., Vila, J., Perrot, G., & Scheraga, H.A. (1992). Empirical solvation models in the context of conformational energy searches. Application to bovine pancreatic trypsin inhibitor. Proteins Struct. Funct. Genet. 14,
Wlodawer, A., Deisenhofer, J., & Huber, R. (1987). Comparison of two highly-refined structures of bovine pancreatic trypsin inhibitor. J. Mol. Biol. 193,
Zimmerman, S.S., Pottle, M.S., Némethy, G., & Scheraga, H.A. (1977). Conformational analysis of the twenty naturally occurring amino acid residues using ECEPP. Macromolecules 10, 1-9.

Appendix

Derivation of the interaction energy of peptide-group dipoles

The dipole-moment vectors of peptide groups i and j, $\mathbf{p}_i$ and $\mathbf{p}_j$, can be expressed as follows in their local coordinate systems (for dipole-moment vector $\mathbf{p}_i$, the x axis of this local coordinate system is defined by $C^\alpha_i$-$C^\alpha_{i+1}$; the y axis lies in the plane of $C^\alpha_i$-$C^\alpha_{i+1}$-$C^\alpha_{i+2}$ and is oriented so that $C^\alpha_{i+2}$ has a positive y coordinate; the z axis is defined by the cross product of the unit vectors of the x and y axes; the origin of the system is located at the $C^\alpha_i$ atom):

$$\mathbf{p}_i = p_i \begin{pmatrix} \cos\rho_i \\ -\sin\rho_i\cos\lambda_i \\ \sin\rho_i\sin\lambda_i \end{pmatrix} = p_i\,\mathbf{e}_{p_i}(\lambda_i), \qquad \mathbf{p}_j = p_j\,\mathbf{T}_{ij}\,\mathbf{e}_{p_j}(\lambda_j), \tag{A1}$$

where $p_i$ and $p_j$ are the dipole moments of peptide groups i and j (generally the two peptide groups can be of different type; e.g.,

one can be proline, which results in a different charge distribution); $\mathbf{e}_{p_i}(\lambda_i)$ and $\mathbf{e}_{p_j}(\lambda_j)$, being functions of $\lambda_i$ and $\lambda_j$, respectively, are unit vectors in the directions of the dipole-moment vectors; $\rho_i$ and $\rho_j$ are the fixed angles between the peptide-group dipole-moment vector and the virtual-bond axis for dipoles i and j, respectively; $\lambda_i$ and $\lambda_j$ are the angles of rotation of peptide groups i and j about their respective virtual-bond axes, keeping the $C^\alpha$ positions fixed (Fig. 1); and $\mathbf{T}_{ij}$ is the transformation matrix from the local coordinate system of virtual bond j to that of virtual bond i. For a fixed $C^\alpha$ frame, $\mathbf{T}_{ij}$ is a constant matrix (i.e., independent of $\lambda_i$ and $\lambda_j$; see Fig. 1) and hence is known, given only the geometry of the α-carbon trace. $\mathbf{T}_{ij}$ can be represented as the product of consecutive rotation matrices corresponding to consecutive virtual-bond and virtual-torsional angles along the virtual-bond chain. Thus, Equation 1 becomes

$$U_{ij} = \frac{p_i p_j}{\epsilon r_{ij}^3}\,\mathbf{e}_{p_i}^T(\lambda_i)\left(\mathbf{I} - 3\,\mathbf{e}_{r_{ij}}\mathbf{e}_{r_{ij}}^T\right)\mathbf{T}_{ij}\,\mathbf{e}_{p_j}(\lambda_j) = \frac{p_i p_j}{\epsilon r_{ij}^3}\,\mathbf{e}_{p_i}^T(\lambda_i)\,\mathbf{A}_{ij}\,\mathbf{e}_{p_j}(\lambda_j), \tag{A2}$$

where $\mathbf{I}$ is the identity matrix and $\mathbf{A}_{ij}$ is a constant, known matrix (i.e., independent of $\lambda_i$ and $\lambda_j$), equal to $(\mathbf{I} - 3\,\mathbf{e}_{r_{ij}}\mathbf{e}_{r_{ij}}^T)\mathbf{T}_{ij}$. The superscript T denotes the transpose of a vector or a matrix. Introducing the following constants:

$$W_{ij} = a_{11}, \quad […] \tag{A3}$$

($a_{kl}$ being the elements of the matrix $\mathbf{A}_{ij}$), and the constant angles $\lambda_i^0$, $\lambda_j^0$, $\lambda_{ij}^{(1)}$, and $\lambda_{ij}^{(2)}$ given by the equations

$$\tan\lambda_i^0 = -\frac{a_{31}}{a_{21}}, \qquad \tan\lambda_j^0 = -\frac{a_{13}}{a_{12}}, \quad […] \tag{A4}$$

straightforward but lengthy algebraic and trigonometric operations lead to the following equation for $U_{ij}$:

$$[…] \tag{A5}$$

With $\mathbf{v}_i$ and $\mathbf{v}_j$ being unit vectors pointing from $C^\alpha_i$ to $C^\alpha_{i+1}$ and from $C^\alpha_j$ to $C^\alpha_{j+1}$, respectively, we define the angles $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$ for the relative orientation of the two virtual bonds, by the equations:

$$\cos\alpha_{ij} = \mathbf{v}_i\cdot\mathbf{v}_j, \qquad \cos\beta_{ij} = \mathbf{v}_i\cdot\mathbf{e}_{r_{ij}}, \qquad \cos\gamma_{ij} = \mathbf{v}_j\cdot\mathbf{e}_{r_{ij}}. \tag{A6}$$

(See Figure 1 for $\beta_{ij}$ and $\gamma_{ij}$. The angle $\alpha_{ij}$ is not shown there because the virtual bonds $C^\alpha_i$-$C^\alpha_{i+1}$ and $C^\alpha_j$-$C^\alpha_{j+1}$ need not be coplanar.) The angles $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$ are also constant (i.e., independent of $\lambda_i$ and $\lambda_j$) for a given $C^\alpha$ frame. The constants in Equation A3 can now be expressed in terms of the orientation angles as follows:

$$W_{ij} = \cos\alpha_{ij} - 3\cos\beta_{ij}\cos\gamma_{ij}, \quad […] \tag{A7}$$

Thus, these constants (which are used to calculate the matrix $\mathbf{A}_{ij}$) do not depend on the positions of $C^\alpha_{i+2}$ and $C^\alpha_{j+2}$ but only on the relative orientation of the $C^\alpha_i$-$C^\alpha_{i+1}$ and $C^\alpha_j$-$C^\alpha_{j+1}$ virtual bonds. Furthermore, it can be shown that

$$\lambda_i + \lambda_j - \lambda_{ij}^{(1)} = (\lambda_i - \lambda_i^0) + (\lambda_j - \lambda_j^0) + C_{ij}^{(1)}, \qquad \lambda_i - \lambda_j - \lambda_{ij}^{(2)} = (\lambda_i - \lambda_i^0) - (\lambda_j - \lambda_j^0) + C_{ij}^{(2)}, \tag{A8}$$

where the constants $C_{ij}^{(1)}$ and $C_{ij}^{(2)}$, and the quantities $(\lambda_i - \lambda_i^0)$ and $(\lambda_j - \lambda_j^0)$, do not depend on the positions of $C^\alpha_{i+2}$ and $C^\alpha_{j+2}$. Hence, the angles $(\lambda_i - \lambda_i^0)$ and $(\lambda_j - \lambda_j^0)$ can be treated as independent variables in Equation A5. Thus, Equation A5, for the energy of interaction of two dipoles, is independent of the choice of the reference system (i.e., the positions of $C^\alpha_{i+2}$ and $C^\alpha_{j+2}$), as expected from a physical point of view. In numerical computations with Equation A5, only the left-hand sides of Equation A8 are used.

Boltzmann-averaged energy $\langle U_{ij}\rangle$

In order to obtain an approximate expression for the energy, Boltzmann-averaged over $\lambda_i$ and $\lambda_j$, the expression $\exp(-U_{ij}/k_BT)$ was expanded into a Taylor series in powers of $-U_{ij}/k_BT$, up to the first order (high-temperature approximation). Thus, the Boltzmann-averaged energy

is approximated by

$$[…]$$

Using the ECEPP/2 charges, it is found that the angle $\rho$ is equal to approximately 112°, and hence the perpendicular component of the dipole moment is dominant. Therefore, the values of $\cos\rho$ are close to zero. Hence, we can neglect the integral in the denominator, because it is equal to $4\pi^2 p_i p_j \cos\rho_i \cos\rho_j W_{ij}/(k_BT\epsilon r_{ij}^3)$, which is small compared to $4\pi^2$. Thus,

$$[…] \tag{A11}$$

None of the terms in Equation A11 contains the constants $\lambda_i^0$, $\lambda_j^0$, $\lambda_{ij}^{(1)}$, and $\lambda_{ij}^{(2)}$ (which depend not only on the relative orientation of the virtual bonds $C^\alpha_i$-$C^\alpha_{i+1}$ and $C^\alpha_j$-$C^\alpha_{j+1}$ but also on the particular choice of the reference system necessary to define $\lambda_i$ and $\lambda_j$), because the Taylor expansion of $\exp(-U_{ij}/k_BT)$ was truncated at the linear term. If, however, higher powers of this Taylor expansion were considered, most of the terms in the resulting equation would contain these constants. Nevertheless, taking advantage of the fact that $C_{ij}^{(1)}$ and $C_{ij}^{(2)}$ depend only on $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$, and treating $U_{ij}$ as a function of $(\lambda_i - \lambda_i^0)$ and $(\lambda_j - \lambda_j^0)$, it can easily be shown that the average value of each term in the Taylor expansion depends only on the relative orientation of $C^\alpha_i$-$C^\alpha_{i+1}$ and $C^\alpha_j$-$C^\alpha_{j+1}$, i.e., only on $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$.

In Equation A11, the terms in $\cos\rho$ in that part of the averaged energy that varies as $r^{-6}$ can be neglected, because the angle $\rho$ is close to 90° and $r^{-6}$ vanishes quickly for large r. The only remaining term in $\cos\rho$, representing the energy of interaction between the nonrotatable components of the dipole moments, varies as $r^{-3}$ and, therefore, cannot be neglected.
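The orientation dependence of the $r^{-3}$ term enters through the angles of Equation A6 and the constant $W_{ij} = \cos\alpha_{ij} - 3\cos\beta_{ij}\cos\gamma_{ij}$ of Equation A7. The following Python sketch shows one way these quantities might be computed from four α-carbon positions. It is an illustration only, not the authors' code: the function names are hypothetical, and placing each dipole at the midpoint of its virtual bond to define $\mathbf{e}_{r_{ij}}$ is an assumption made here.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _unit(a):
    norm = math.sqrt(_dot(a, a))
    return tuple(x / norm for x in a)

def orientation_constants(ca_i, ca_i1, ca_j, ca_j1):
    """Return (cos_alpha, cos_beta, cos_gamma, W) for two virtual bonds.

    ca_i, ca_i1: C-alpha positions defining virtual bond i (i -> i+1).
    ca_j, ca_j1: C-alpha positions defining virtual bond j (j -> j+1).
    ASSUMPTION: e_r points from the midpoint of bond i to the midpoint
    of bond j; the paper's own definition of r_ij is in the main text.
    """
    v_i = _unit(_sub(ca_i1, ca_i))          # unit vector along bond i
    v_j = _unit(_sub(ca_j1, ca_j))          # unit vector along bond j
    mid_i = tuple(0.5 * (a + b) for a, b in zip(ca_i, ca_i1))
    mid_j = tuple(0.5 * (a + b) for a, b in zip(ca_j, ca_j1))
    e_r = _unit(_sub(mid_j, mid_i))         # unit vector between dipoles
    cos_alpha = _dot(v_i, v_j)              # Equation A6
    cos_beta = _dot(v_i, e_r)
    cos_gamma = _dot(v_j, e_r)
    w = cos_alpha - 3.0 * cos_beta * cos_gamma   # first constant of Equation A7
    return cos_alpha, cos_beta, cos_gamma, w

# Two collinear, head-to-tail virtual bonds (3.8 A apart along x)
head_to_tail = orientation_constants((0, 0, 0), (3.8, 0, 0),
                                     (7.6, 0, 0), (11.4, 0, 0))
# Two parallel virtual bonds lying side by side (5 A apart along y)
side_by_side = orientation_constants((0, 0, 0), (3.8, 0, 0),
                                     (0, 5, 0), (3.8, 5, 0))
print(head_to_tail[3], side_by_side[3])
```

For the collinear head-to-tail pair all three cosines equal 1, so $W_{ij} = -2$; for the parallel side-by-side pair $\cos\beta_{ij} = \cos\gamma_{ij} = 0$, so $W_{ij} = +1$. These are the two limiting cases of the dipole-dipole orientation factor.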
Attempts to fit Equation A11 to the set of average ECEPP/2 energies showed that the fit is not improved when the term containing $\sin^2(\rho_i + \rho_j)$ is present; inclusion of only the first and second terms in Equation A11 (which requires less computation) is sufficient to approximate the average ECEPP/2 energy surface. With the constants $A_{p_ip_j} = p_i p_j \cos\rho_i \cos\rho_j/\epsilon$ and $B_{p_ip_j} = p_i^2 p_j^2 \sin^2\rho_i \sin^2\rho_j/(2k_BT\epsilon^2)$, which can be treated as adjustable parameters (evaluated as described in the section on Energy function for united-residue model of Liwo et al. [1993] and summarized in Table 2 of Liwo et al. [1993]), we obtain Equation 4.

Calculation of the dihedral angles $\phi$ and $\psi$ from the rotational angles $\lambda_1$ and $\lambda_2$ and the dihedral angles $\gamma$

In the conversion of the α-carbon chain to the all-atom backbone, it is necessary to determine the geometry of the chain, given the angles of rotation $\lambda$ of all peptide groups about the corresponding $C^\alpha$-$C^\alpha$ virtual-bond axes. A very convenient method is to convert them to the dihedral angles $\phi$ and $\psi$, which can later be used as the input values for the ECEPP/2 chain-generation subroutine. This can be done by using Equations A34 and A35 from Wako and Scheraga (1982a), hereafter referred to as Equations WS1 and WS2, respectively:

$$\cos\phi_i = \frac{\cos\eta_i\cos\theta_{i-1} - \cos\tau_i\cos\xi_i - \sin\eta_i\sin\theta_{i-1}\cos(\lambda_2)_i}{\sin\tau_i\sin\xi_i} \tag{WS1}$$

$$\cos\psi_i = \frac{\cos\xi_i\cos\theta_{i-1} - \cos\tau_i\cos\eta_i - \sin\xi_i\sin\theta_{i-1}\cos(\lambda_1)_i}{\sin\tau_i\sin\eta_i} \tag{WS2}$$

where $\xi_i$, $\eta_i$, $\tau_i$, and $\theta_{i-1}$ are the planar angles defined by the atoms $C^\alpha_{i-1}$-$C^\alpha_i$-$N_i$, $C^\alpha_{i+1}$-$C^\alpha_i$-$C'_i$, $N_i$-$C^\alpha_i$-$C'_i$, and $C^\alpha_{i-1}$-$C^\alpha_i$-$C^\alpha_{i+1}$, respectively, as in Nishikawa et al. (1974). As in the text, let $\lambda_1$, $\lambda_2$, $\lambda_3$, ... denote the angles $(\lambda_1)_2$, $(\lambda_1)_3$, $(\lambda_1)_4$, ... (see also Nishikawa et al., 1974). Assume, first, that the residues considered are not prolines; in this case the $\lambda$'s are independent variables.
In order to calculate the dihedral angles $\phi$ and $\psi$ for residue i, we must know the angles $(\lambda_1)_i$ and $(\lambda_2)_i$, and $\theta_{i-1}$. According to our definition, and following Equation 10 of Nishikawa et al. (1974), $(\lambda_1)_i$ and $(\lambda_2)_i$ can be expressed as follows:

$$(\lambda_1)_i = \lambda_{i-1}, \quad i = 2, 3, \ldots, n-1$$
$$(\lambda_2)_i = \gamma_{i-1} - \lambda_i - 180°, \quad i = 2, 3, \ldots, n-2$$
$$(\lambda_2)_{n-1} = \lambda_{n-1}, \tag{A12}$$

where $\gamma_{i-1}$ is the virtual-bond torsional angle defined by $C^\alpha_{i-1}$-$C^\alpha_i$-$C^\alpha_{i+1}$-$C^\alpha_{i+2}$. In order to derive expressions for the $\theta$'s, we multiply each side of Equation 11 of Nishikawa et al. (1974) by rotational matrices, one from the left and one from the right, defined by Nishikawa et al. (1974), which gives Equation A13:

$$[…] \tag{A13}$$

Multiplying each side of Equation A13 by the unit row vector from the left, and the unit column vector from the right, as in Nishikawa et al. (1974), we obtain Equation A14:

$$A\sin\theta_{i-1} + B\cos\theta_{i-1} + C = 0, \tag{A14}$$

where

$$A = -\sin\xi_i\cos\eta_i\cos(\lambda_1)_i - \cos\xi_i\sin\eta_i\cos(\lambda_2)_i$$
$$B = \cos\xi_i\cos\eta_i - \sin\xi_i\sin\eta_i\cos(\lambda_1)_i\cos(\lambda_2)_i$$
$$C = \sin\xi_i\sin\eta_i\sin(\lambda_1)_i\sin(\lambda_2)_i - \cos\tau_i. \tag{A15}$$

This equation can be solved for $\tan(\theta_{i-1}/2)$, giving two solutions, which are either equivalent or only one (a non-negative angle $\theta_{i-1}$) is allowed:

$$\tan\frac{\theta_{i-1}}{2} = \frac{-A \pm \sqrt{A^2 + B^2 - C^2}}{C - B}.$$

If any of the residues considered is L-proline, the corresponding dihedral angle $\phi$ must be fixed at -75° (or at 75° for D-proline). Therefore, the corresponding angles $\lambda_1$ and $\lambda_2$ will have to satisfy Equation WS1 for $\phi = \mp 75°$. Because (as follows from Fig. 5 in Nishikawa et al. [1974]), if a value of $\phi$ is given, it is always possible to find the appropriate value of $\lambda_1$ when $\lambda_2$ is fixed, while it is not always possible to find $\lambda_2$ when $\lambda_1$ is fixed, we treat $\lambda_2$ as the independent variable and $\lambda_1$ as the dependent variable. This is due to the fact that, in a first approximation, $\lambda_1$ and $\lambda_2$ can be identified with $\phi$ and $\psi$, respectively (there would be an exact identity between $\phi$ and $\lambda_1$, and $\psi$ and $\lambda_2$, if the angles $\xi$ and $\eta$ were zero). Given the value of the independent variable, $\lambda_2$, Equation WS1 is solved for $\lambda_1$ by the regula falsi method (Ortega & Rheinboldt, 1970).
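As a rough numerical illustration of solving Equation A14 for the virtual-bond angle, the Python sketch below substitutes $t = \tan(\theta_{i-1}/2)$ into $A\sin\theta_{i-1} + B\cos\theta_{i-1} + C = 0$, which gives the quadratic $(C - B)t^2 + 2At + (B + C) = 0$, and keeps only non-negative angles. The function names and the geometric values used in the example ($\xi$ = 15°, $\eta$ = 20°, $\tau$ = 110°) are assumptions chosen for illustration, not values taken from the paper.

```python
import math

def a15_coefficients(xi, eta, tau, lam1, lam2):
    """Coefficients A, B, C of Equation A15 (all angles in radians)."""
    a = (-math.sin(xi) * math.cos(eta) * math.cos(lam1)
         - math.cos(xi) * math.sin(eta) * math.cos(lam2))
    b = (math.cos(xi) * math.cos(eta)
         - math.sin(xi) * math.sin(eta) * math.cos(lam1) * math.cos(lam2))
    c = (math.sin(xi) * math.sin(eta) * math.sin(lam1) * math.sin(lam2)
         - math.cos(tau))
    return a, b, c

def solve_theta(xi, eta, tau, lam1, lam2, eps=1e-12):
    """Non-negative roots theta of A sin(theta) + B cos(theta) + C = 0.

    Uses t = tan(theta/2), i.e. (C - B) t^2 + 2 A t + (B + C) = 0.
    Returns a sorted list that may be empty (no real solution).
    """
    a, b, c = a15_coefficients(xi, eta, tau, lam1, lam2)
    disc = a * a + b * b - c * c
    if disc < 0.0:
        return []                       # no real virtual-bond angle exists
    root = math.sqrt(disc)
    if abs(c - b) < eps:
        # Degenerate quadratic: theta = pi is a root, the other is linear.
        candidates = [math.pi]
        if abs(a) > eps:
            candidates.append(2.0 * math.atan(-(b + c) / (2.0 * a)))
    else:
        candidates = [2.0 * math.atan((-a + s * root) / (c - b))
                      for s in (1.0, -1.0)]
    # Keep only non-negative angles; rounding merges coincident roots.
    return sorted({round(t, 12) for t in candidates if t >= 0.0})

# Illustrative (assumed) geometry, with both rotation angles set to zero
xi, eta, tau = (math.radians(v) for v in (15.0, 20.0, 110.0))
for theta in solve_theta(xi, eta, tau, 0.0, 0.0):
    print(round(math.degrees(theta), 6))
```

With both rotation angles set to zero the problem is planar and Equation A14 reduces to $\cos(\theta_{i-1} + \xi_i + \eta_i) = \cos\tau_i$, so the only non-negative root for these illustrative values is $\theta_{i-1} = \tau_i - \xi_i - \eta_i$ = 75°; the second root of the quadratic corresponds to a negative angle and is discarded.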
