CS 882 Course Project Protein Planes


Robert Fraser

Abstract

This study is a review of the properties of protein planes. The basis of the research is the covalent bonds along the backbone of the protein, since the plane is defined by these atoms. We will measure the values of the bond lengths and the angles between the bonds to determine the consistency of these values in the PDB. We will go further to measure the lengths of the planes and omega, the dihedral angle that defines the plane. Finally, we will look at secondary structures to see if there exist preferential angles between the peptide planes and the axes of the secondary structure. The results show that the first two properties are consistent and are reliably used for the purposes of refinement and structure prediction. The angles related to the secondary structures were less consistent, but the average value found for the alpha helix matched what was claimed in the literature. The other secondary structures have similar accuracy, and the preferred angles discovered are novel results based upon the literature review.

2 Introduction

The purpose of this research has been to study the properties of protein planar regions, particularly with respect to their physical configurations found in known protein structures. A significant number of properties were selected for study. We begin by looking at properties such as covalent bond lengths and angles; we progress with properties such as the plane length and shape; and we conclude with a look at how planes are configured in secondary structures. Let us begin by motivating this research with past work and potential applications.

The primary motivation for this work is an open problem known as the Cα-trace problem. To understand this problem, we must first take a quick look at protein structure determination. Traditionally, protein structures have been determined by X-ray crystallography.
Although this method is highly time-consuming due to the need to crystallize the purified protein, it is used extensively and has been the structure determination method of choice since the first high resolution structure was published in 1960 [KDS+60]. Once the crystal is formed, an electron density map is produced using X-ray diffraction. The crystallographer can determine the approximate positions of the constituent heavy atoms (i.e., other than hydrogen)

from this map, but the positions are only as good as the resolution of the map. On small molecules this resolution can be better than 0.03Å, but on proteins 1.5 or 2Å is typical [WD93]. Another common technique is nuclear magnetic resonance (NMR) spectroscopy. NMR produces a graph with peaks corresponding to the shift due to each nucleus in the molecule. The result is higher resolution structures, but it is still subject to inaccuracies and is limited to small molecules [PR04].

The Cα-trace problem arises when we are provided with only approximate positions of the alpha carbon atoms for a given protein and we would like to determine the remainder of the structure given only that information. This problem finds a number of applications. One is that you may wish to work with the approximate alpha carbon atom positions determined by X-ray crystallography. Since these positions are subject to inaccuracies, we would like to use some techniques to determine their precise coordinates, as well as the other atoms in the protein. There are a number of protein structures where the only information known is the alpha carbon coordinates, possibly because that was the only information found or because the researcher wished to obfuscate their results. It would be useful to have an accurate method to determine the remainder of the coordinates. Also, some protein structure prediction techniques produce a Cα-trace as a first step (see [GKD06], for example), so a good solution to this problem would improve their prediction accuracy.

Refinement in this context is the practice of using heuristics to improve the quality of an inaccurate structure. There are many approaches that have been used, and they can be roughly grouped into the following categories:

De novo - This method is based purely on the energy of the protein, and adjustments are made to reduce the total energy of the molecule.
Anfinsen's thermodynamic hypothesis [Anf73] is that the native state is at or very close to the global minimum energy of the protein. Therefore, force field approximations such as CHARMM [Cor90] or ECEPP/3 [KLS02] fields may be used to refine the structure so that the energy of the protein is minimized.

Fragment matching - This approach is to search for fragments of proteins with known structure that have similar alpha carbon positions and sequence information to fragments in the unknown structure. Levitt [Lev92] uses fragment matching in his program SEGMOD, which first fits the discovered fragments into the Cα-trace model, and then uses energy minimization to further refine the model.

Maximize hydrogen bonding - This approach is similar to a de novo approach, but the first step of the algorithm is to determine which orientations of the peptide planes will maximize hydrogen bonding in the protein [LPW+93]. Once this structure is determined, it is further refined using energy minimization.

Idealized covalent geometry - This approach is the inspiration for this study. Used by Engh and Huber [EH91] for X-ray crystallography refinement, the idea is that even small variations from the ideal geometry for the protein backbone result in significant increases in the energy of the protein. Therefore, the protein

structure is refined by constraining all of these values to be as close to ideal as possible. This approach has also been supplemented by including dihedral information [Pay93, DdBSB03].

All of these methods achieve under 1Å root mean square deviation (rmsd)¹ from native structures, while under 0.5Å rmsd is considered good. Perhaps including more information about the plane could further improve results, particularly in the latter approaches. That is the purpose of this research: we wish to determine whether there are additional properties associated with protein planes that could be used. We will approach this problem by first examining those properties that are traditionally used, and we will determine how closely these values are adhered to. Finally, we will look at some additional properties such as the configuration of the peptide plane with respect to secondary structures to see if the accuracy of these properties is similar to that of the traditional metrics.

3 Protein Plane Properties

The protein plane is an area of special interest because of the structural stability that it provides and the inherent reduction in the complexity of the protein model that a planar region entails. The peptide plane was identified in some of the earliest models of proteins. Pauling, Corey and Branson [PCB51] developed models which identified the planes, based solely on theoretical chemistry and crude X-ray crystallographic structures of amino acids. There was little uncertainty: they state "there can be no doubt whatever about its (existence)". They also were able to identify the covalent bond lengths and angles with remarkable accuracy. The reason for the planarity is that the bond between the nitrogen and carbonyl carbon atoms forms a partial double bond. The chemical explanation is that the carbonyl carbon, nitrogen, and oxygen atoms have adjacent p orbitals, and these interact to produce three π orbitals.
The result is electron sharing between these three orbitals, and the lowest energy configuration for this sharing is a planar one. For more details, refer to [Kyt95].

An important thing to note at this point is that peptide planes do not map one-to-one with amino acid residues. It is conventional when dealing with proteins to think of the amino acid as the basic building block, which is a useful model. The atom in the middle of the amino acid is the alpha carbon, and it forms the junction between the protein backbone and the side chain for each residue. The alpha carbon is also the corner of each protein plane, so each plane spans two residues. This planar region is sometimes called a peptide unit, since we can use it as a building block as well [BT99]. This concept is illustrated in Figure 1. Now that the nature of the protein plane is clear, we will look at the properties of interest in turn.

¹ rmsd is a standard metric for comparing two structures. It is used when we have two sets of points where each point can be identified as having a unique counterpart in the other set. The rmsd is essentially the average distance between each point in one set and its counterpart in the other; more precisely, it is the square root of the mean of the squared distances.
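As a minimal sketch of the rmsd metric described in the footnote, the following Python function computes the root mean square deviation between two paired point sets. It assumes the two structures have already been superimposed and that points are paired by index; the function name `rmsd` and the tuple-based coordinate representation are illustrative choices, not part of the original study.

```python
import math


def rmsd(points_a, points_b):
    """Root mean square deviation between two paired lists of 3D points.

    Assumption: the structures are already superimposed, and the point at
    index k in points_a is the counterpart of index k in points_b.
    """
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("point sets must be non-empty and of equal size")
    total_sq = 0.0
    for p, q in zip(points_a, points_b):
        # Squared Euclidean distance between a point and its counterpart.
        total_sq += sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total_sq / len(points_a))
```

Because the distances are squared before averaging, a few badly placed atoms raise the rmsd more than they would raise a plain mean distance.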

Fig. 1. A section of polypeptide chain is illustrated to show the planar regions. Note that the alpha carbon is at the corner of the plane, while one amino acid residue is of the sequence N-Cα-C along the backbone. Some important properties such as covalent bond lengths and angles are shown. The two dihedral angles φ and ψ are represented as well. ω, the dihedral angle in the peptide plane, is not labelled. The side chains are represented by spheres, where the sphere labelled $R_i$ is meant to represent the side chain for residue i. This image is reproduced with permission from [PR04].

3.1 Bond Lengths and Angles

The lengths of the covalent bonds along the backbone of the protein chain have been determined accurately by X-ray diffraction to hundredths of an Ångström (Å). As mentioned earlier, we are interested to see how closely the bond lengths in protein structure files adhere to these values. The known values for the bonds we are interested in are summarized in Table 1. The standard angles between these atoms are shown in Table 2. The values that are generally considered standard are those of Engh and Huber [EH91]. We include those values published in [PCB51] over 50 years earlier to demonstrate the impressive accuracy of this early work.

Although the planar hydrogen atom is shown in Figure 1, it is not being included in this study because hydrogen atoms are not resolved by X-ray diffraction methods. Thus, the majority of protein structure files do not include coordinate data for hydrogen atoms. The bond lengths and angles that are being studied are all those on the backbone that do not involve hydrogen.

Table 1. The standard values for the covalent bond lengths among protein backbone atoms, as published by Engh and Huber [EH91]. The standard values from a recent textbook [PR04] are included as well, demonstrating that the Engh and Huber values are considered standard. The classic results of Pauling et al. [PCB51] are included for interest.

Bond | Bond Length (Å) | Textbook | Pauling et al.
N-Cα
Cα-C
C-O
C-N

Table 2. The standard values for the angles between protein backbone atoms are given [EH91]. Again, the classic results of Pauling et al. [PCB51] are included for interest.

Bonds | Bond Angle (°) | Pauling et al.
N-Cα-C
Cα-C-O
Cα-C-N
O-C-N
C-N-Cα

3.2 Plane Length

When we refer to the length of the plane in this context, we are referring to the distance from one alpha carbon to the next. This could be considered the length of the plane from corner to corner, for the two corners of the plane defined by alpha

carbon atoms. This distance is also referred to as the Cα-Cα bond, as it can be treated as a sort of bond in models that consist only of alpha carbon atoms. Ideally, this distance should also be fairly consistent, since all of the intermediary lengths and angles are considered virtually fixed. The purpose of this section is to put that idea to the test.

What makes this slightly more interesting, however, is that there are two stable configurations for the plane to adopt. Vastly more common (on the order of 1000:1 [Ham92]) is the trans configuration, shown in Figure 1. The other planar configuration is the cis conformation; the difference between these is shown in Figure 2. It is clear from the figure that we should expect the length of the trans configuration to be longer than that of the cis configuration. For any non-planar instances, we would expect values that lie between these two distances.

In the analysis, we will determine the average lengths and standard deviation of the planes for each of the cis and trans configurations. We will also count the number of instances of each to determine their relative prevalence. Finally, it is known that there is a strong preference for the second residue of the plane to be proline when in the cis conformation [Kyt95]; we will measure this frequency as well. We will also determine the percentage of instances of proline that are involved in cis configurations; this has been measured as 25% [Kyt95] or in the range of 10-30% [Ham92] when proline is the second residue in the peptide unit.

Fig. 2. The trans and cis configurations of the peptide plane both lie in the plane. The difference is that the Cα atoms lie on opposite sides of the line formed by the C-N bond in the trans configuration, while they are on the same side in the cis configuration. It is clear that the distance between alpha carbon atoms is less in the cis conformation.
The terminology is from Latin, where trans means "across", and cis translates as "on the same side".

3.3 Omega

Omega (ω) is the measure of the planarity of the plane. It is the dihedral angle about the Cα-C-N-Cα atoms. The method of calculating ω is outlined in

Figure 3. We will use ω to classify the planes into the three conformations discussed in the previous section. If the angle is 180°, we have a clear instance of the trans configuration. Since we are dealing with a natural system, we will allow for a 15° deviation from this value within the trans class. Similarly, anything within 15° of 0° will be considered cis. Everything else will be placed in an "other" class. It has been documented that such values do not occur [Kyt95], and structures containing values of ω as much as 10° (e.g., [OS86]) away from trans were considered noteworthy. More recent works have noted a tendency for angles to vary from 180°, however, and a standard deviation of 6° has been observed [MT96, Edi01], as shown in Figure 4.

Fig. 3. The vectors used to calculate omega. A and B are the cross products of vectors corresponding to backbone bonds, and omega is given by the angle between them. The example shown here is in the trans configuration, so the vectors point in opposite directions and the angle between them is near 180°.

The "other" class will be an interesting one to note. For one, it is expected that the occurrence will be fairly low, because it is a higher energy configuration than the two planar ones. The reason that there is a planar tendency in the backbone is that the N-C bond becomes a partial double bond, as mentioned earlier. If we have a non-planar configuration, this partial double bond is not being adopted and the energy of the bond is higher. Correspondingly, since double bonds are shorter than single bonds, we would expect that the N-C bond is longer in these non-planar instances. We will check the data to see if this is reflected in known protein structures.

3.4 Secondary Structures

Finally, we will examine the relationship between the protein plane and the most common secondary structures: alpha helices, beta strands, and 3-10 helices.
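Before turning to the helix axis, the ω calculation and three-way classification of Section 3.3 can be sketched in Python as follows. This is a hedged illustration rather than the code used in the study: the function names `omega_degrees` and `classify_omega` are hypothetical, coordinates are plain tuples, and the angle is unsigned (0° to 180°), taken as the angle between the two cross-product vectors in the manner of Fig. 3.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def omega_degrees(ca1, c, n, ca2):
    """Unsigned dihedral angle (degrees) about the Ca-C-N-Ca atoms."""
    b1, b2, b3 = sub(c, ca1), sub(n, c), sub(ca2, n)   # backbone bond vectors
    a_vec = cross(b1, b2)                              # vector A in Fig. 3
    b_vec = cross(b2, b3)                              # vector B in Fig. 3
    cos_w = dot(a_vec, b_vec) / math.sqrt(dot(a_vec, a_vec) * dot(b_vec, b_vec))
    cos_w = max(-1.0, min(1.0, cos_w))                 # guard against rounding
    return math.degrees(math.acos(cos_w))


def classify_omega(omega, tolerance=15.0):
    """Assign a plane to trans, cis, or other using the 15-degree windows."""
    if omega >= 180.0 - tolerance:
        return "trans"
    if omega <= tolerance:
        return "cis"
    return "other"
```

For example, four coplanar atoms arranged in a zigzag, such as `omega_degrees((-1, 1, 0), (0, 0, 0), (1, 0, 0), (2, -1, 0))`, give an angle of 180° and are classified as trans.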
The inspiration for this section is the claim that protein planes are generally parallel to the axis of an alpha helix [Ham92,GP98]. This claim is worth investigating to determine whether it is reliable enough for predictive purposes and for refinement of

structures. In addition, it is worthwhile to investigate whether protein planes exhibit any preferential orientations to the other secondary structures.

Fig. 4. This graph shows the results of a previous survey of ω dihedral angles of peptide planes in the trans configuration. The histogram contains values of omega from proteins with high accuracy. The dots correspond to the Maxwell-Boltzmann relation with the mapping black-proteins, grey-peptides, and red-current ω value. The line is a classic energy function derived by Corey and Pauling. This image is reproduced from [Edi01].

Obviously, the first requirement for this analysis will be to determine the axis of the secondary structure. We will focus the discussion on the alpha helix here, and then the application of this method to the other structures will be covered briefly. There is no conventional method for determining the axis of an alpha helix; the method used should be driven by the particular application. The simplest method is to employ some sort of linear regression so that the distance from the axis to each of the alpha carbon atoms in the helix is minimized. However, secondary structures are natural structures, and as such there are deformations due to their surroundings inside and outside of the protein. Since we would like to know the angle between the protein plane and the axis, this model would not serve us as well as a curved line or segmented axis that follows the bends of the alpha helix. Chothia et al. [CLR81] developed a model that was later refined by Walther et al. [WEA96] that serves this purpose well. It is often referred to as the cross product of triad bisectors method [CSB96]. The method begins by finding the vector $B_i$ perpendicular to the axis at the $i$th alpha carbon, and then using the cross product of two consecutive such vectors to obtain the axis of the helix, $u_i$, which is local to these alpha carbons.
This method is particularly well suited to our analysis because two alpha carbons are associated with each protein plane; thus we get one axis vector corresponding to one plane. These vectors are shown in Figure 5.

Fig. 5. The method described by Walther et al. for determining local helix axis vectors.

The local axis vectors are calculated in the first iteration by taking the cross product of two of the consecutive normals:

$$B_i = r_i + r_{i+2} - 2r_{i+1}$$

$$u_i = B_i \times B_{i+1}$$

For the helix axis for residue $r_n$ (the one at the C-terminus of the helix), the axis vector of $r_{n-1}$ is used again. For the purposes of smoothing, the axes must be fit to the helix. In order to assign positions for the axis vectors, the geometric center of four consecutive residues around the present one is calculated:

$$A_i = \frac{r_{i-1} + r_i + r_{i+1} + r_{i+2}}{4}$$

The formula needs to be adjusted for the ends of the helices. The axis vectors ($u_i$) are adjusted so that their lengths are all 1.5Å, which is the average rise per residue along the axis for an ideal alpha helix. The axis of the helix is now described by a series of line segments. The endpoints, $b_i$ and $e_i$, of the local helix axis for residue $r_i$ are given by:

$$b_i = A_i$$

$$e_i = A_i + u_i$$

The axes are smoothed using an iterative approach. The first step is to take the average of three consecutive helix axis vectors (two at the ends):

$$u_{i,\mathrm{smoothed}} = \frac{u_{i-1} + u_i + u_{i+1}}{3}$$

Now the average point coordinates $A_i$ are adjusted by finding the midpoint between the beginning point of the current helix axis and the endpoint of the previous one:

$$A_{i,\mathrm{smoothed}} = \frac{b_i + e_{i-1}}{2}$$
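The first iteration of the axis construction and the vector-averaging part of one smoothing pass can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the implementation of Walther et al. or of this study: the helper names are hypothetical, the end-of-helix adjustments and the repositioning of the points $A_i$ are omitted, and the test coordinates describe a made-up ideal helix (radius 2.3Å, 100° per residue, 1.5Å rise, axis along +z).

```python
import math

RISE_PER_RESIDUE = 1.5  # Å, the average rise per residue quoted in the text


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def scale(a, k):
    return tuple(k * x for x in a)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def local_axes(ca):
    """Unsmoothed local axis vectors u_i from consecutive C-alpha positions."""
    # B_i = r_i + r_{i+2} - 2 r_{i+1}: bisector of the triad (i, i+1, i+2).
    bisectors = [sub(add(ca[i], ca[i + 2]), scale(ca[i + 1], 2.0))
                 for i in range(len(ca) - 2)]
    axes = []
    for b0, b1 in zip(bisectors, bisectors[1:]):
        u = cross(b0, b1)                                   # u_i = B_i x B_{i+1}
        # Rescale to 1.5 Å; assumes consecutive bisectors are not parallel.
        axes.append(scale(u, RISE_PER_RESIDUE / norm(u)))
    return axes


def smooth_once(axes):
    """One smoothing pass: mean of up to three neighbouring axis vectors."""
    smoothed = []
    for i in range(len(axes)):
        window = axes[max(0, i - 1):i + 2]                  # two vectors at the ends
        total = (0.0, 0.0, 0.0)
        for u in window:
            total = add(total, u)
        smoothed.append(scale(total, 1.0 / len(window)))
    return smoothed


# Hypothetical ideal right-handed helix used only to exercise the functions.
helix = [(2.3 * math.cos(math.radians(100 * k)),
          2.3 * math.sin(math.radians(100 * k)),
          1.5 * k) for k in range(8)]
axes = local_axes(helix)
smoothed = smooth_once(axes)
```

For this ideal helix every local axis points along +z with length 1.5Å, so smoothing leaves the vectors unchanged; on a real, bent helix the averaging damps the residue-to-residue fluctuation of the axis direction.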

This smoothing process is repeated three times; the result is a series of local helix axis line segments which approximate the curve of the helix. For our analysis, we will analyze the positions of the plane relative to both the smoothed and unsmoothed axis vectors.

The nature of the axis calculation technique lends itself well to any other regular structure containing an axis, since nothing in the method is specific to alpha helices (except for the rise per residue in the smoothing step). The vectors perpendicular to the axis are in the direction of the bisector of the external angle of the lines formed between three consecutive alpha carbons. Figure 6 illustrates some beta strands and alpha helices; it is clear from this figure that the same principles would apply to both. See [Rot95] for another example of treating a beta strand as a helix for the purpose of determining its axis. A review of the literature did not reveal any preferences for the configuration between beta strands and protein planes, nor for 3-10 helices, so this study will elucidate whether or not such properties exist.

Fig. 6. This figure illustrates the configuration of beta strands. The image on the left is a ball and stick model where each ball corresponds to an alpha carbon, and the image on the right is the cartoon representation of the secondary structures of the same protein. The latter image is included to assist in locating beta strands; they are the arrow-shaped objects. Notice that the cross product of consecutive triad bisectors would give a vector that would form a local axis for the beta strand.

For the plane, we will find the plane normal by taking the cross product of the C-Cα vector with the $C_{\alpha,i+1}$-$C_{\alpha,i}$ vector so that the normal will be on the same

side of the plane regardless of the isomerism of the plane. Since nearly all helices are right-handed in proteins, this normal vector will consistently point towards the interior of the helix. Therefore we can have a signed angle, and we will define it so that planes tilted inwards have a positive angle. With beta strands however, the axis will be alternating on different sides of the planes, so we will only measure the angle in the range of 0° to 90°, with 0° being parallel.

4 Implementation Details

In this section, we will begin by covering the data that was used in this study, and then we will briefly discuss how the data was analyzed. We will conclude by mentioning a few of the technical issues encountered during this research.

The standard knowledge base used for this study is the Protein Data Bank (PDB) [BKW+77]. The PDB is the standard repository for protein files once their structures have been determined. The scope of this survey was to analyze every PDB file in the repository as of There is a question of whether surveying the entire PDB would introduce bias or poor data, since many of the structures have poor resolution and there are groups of highly homologous proteins. However, since the purpose of this study was to be exhaustive, these compromises were accepted. Future studies could be easily run on data sets with only high resolution proteins and low homology to determine whether these factors make a difference.

Matlab was chosen to perform the analysis so that the pdbread function in the Bioinformatics Toolbox could be used. Because there are so many idiosyncrasies in PDB files, the robustness of the Matlab function was desirable. There are many files with unusual indexing; sometimes this is because there are loop regions that could not be resolved by the crystallographer. Sometimes there are errors in the data.
In most cases, pdbread is able to make sense of this data, and it returns a struct containing most of the data in the file in a usable format. When Matlab identified a file as unreadable, often due to the file containing only alpha carbon atoms, the file was left aside for the purposes of this study. Of the files in the snapshot of the PDB, were able to be used in the study.

Once the struct is obtained for a particular protein, we next need to extract the information relevant to our study. The struct contains an Atom field, which contains the coordinate information for each atom in the protein, along with its other attributes. We begin by extracting only the backbone atoms for each residue by parsing through the struct. The secondary structure of each protein was obtained by running the DSSP program [KS83] on each PDB file. To save computation at runtime, all of the DSSP files were precomputed and stored; this took about 12 hours on a standard 1GHz PC. For each amino acid residue, we have four atom structs (N, Cα, C, O), each containing the following (in each case a character refers to an alphanumeric character):

PDBID - the 4 character PDB identifier for the protein;

atomname - the name of the atom: N for nitrogen, CA for alpha carbon, etc.;
resname - the 3 character name of the amino acid;
resseq - the number associated with the amino acid containing the atom, incrementing from the N terminal on each chain;
chainid - the character corresponding to the current chain;
coords - the X,Y,Z coordinates of the atom;
ss - the secondary structure. For this survey, we were interested in three types of secondary structure: the alpha helix (identified by an "H" in DSSP), the beta strand ("E"), and the 3-10 helix ("G"). Everything else was considered as being other; for the most part this consisted of loop regions.

Once we have all of these atom structs for the current protein, we can run the battery of tests to get all of the desired results. These results are then merged with all of the results previously obtained from other PDB files.

4.1 Technical Issues

There were a number of technical issues that were encountered during this research that may be of interest to others hoping to do similar work. Because of the extremely large volume of data, the program had to be run in batches to avoid overflow errors. Thus, the source files were divided into 69 groups of 500 files each, and the results of each of these runs were compiled together once all the files had been analyzed.

Another obstacle is the computation time required by the pdbread function. It takes roughly 30 seconds on average to parse a PDB file with a standard 1GHz PC. When parsing such a large number of files, the computation time becomes cumbersome. Due to the previous obstacle however, it became possible to run the analysis in parallel by having different computers analyzing different batches of files. Using this approach, the analysis was performed using three PCs, and the computation took 5 days.

5 Results

This study is composed of a large number of tests, and the results are presented in this section in a series of tables and graphs.
We begin by looking at the covalent bond lengths (Table 3) and angles (Table 4). In each case, we again present the standard values derived by Engh and Huber [EH91] for reference. We then present the average values found in our study. The standard deviation (σ) values presented are the average standard deviation in each file rather than the standard deviation over the entire data set to give an impression of the variance in each file. For interest, we have also included the absolute maximum and minimum values found for each attribute, as well as the averages of the maximum and minimum values found for each of the 69 batches. These show that there are physically impossible values present, and that the results would likely be cleaner if some sanity checking were performed before the

analysis to eliminate such cases. It is possible that there were some instances where there were multiple chains that were labelled as a single chain in the PDB file, since there are examples of the C-N bond that are very long.

Table 3. The standard values for the covalent bond lengths among protein backbone atoms.

Bond | Textbook | Average | σ | Instances | Maximum | Avg Max | Minimum | Avg Min
N-Cα
Cα-C
C-O
C-N
Non-planar C-N | Variable

Notice that there is a small difference between the length of the C-N bond in the overall and non-planar cases, but this difference is not significant. It is probably not considered in most refinement approaches.

Table 4. The standard values for the angles between protein backbone atoms.

Bonds | Textbook | Average | σ | Instances | Maximum | Avg Max | Minimum | Avg Min
N-Cα-C
Cα-C-O
Cα-C-N
O-C-N
C-N-Cα

These results are what we expected: the values are very close to the textbook values, and there is very little deviation. Thus, we should expect that the lengths of the planes should have similar properties. The results are shown in Table 5. The overall average is presented first, and then we present the results for each isomer. Once again, extreme values are presented for interest.

Table 5. The results for the length of the peptide plane. The combined results are shown first, followed by the results for each of the trans and cis isomers.

Isomer | Instances | Average | σ | Maximum | Avg Max | Minimum | Avg Min
All
trans
cis

There were proline residues as the second residue in cis conformation peptide units, while there were a total of prolines encountered in the survey.

Thus 3.7% of prolines are associated with cis conformations, which is much lower than the values of 10-30% claimed earlier. The proportion of cis isomers with proline is 86.5% however, so we do have strong evidence for proline being preferred. We found that the trans isomer outnumbers its counterpart by a ratio of approximately 500:1, which is within a factor of 2 of the value presented in the methods section. In both cases, the standard deviation with respect to the length of the plane is low, so these values are reliable enough for refinement purposes.

The values for omega for each of the isomers are shown in Table 6.

Table 6. The average values of omega for each isomer class are presented.

Isomer | Instances | Average | Observed Standard Deviation
trans
cis
Other

In each case, the average is several degrees away from the ideal, which would be expected since we were measuring the absolute value of the angle. In the "other" case, the mean is surprisingly close to the trans threshold. In order to present the nature of these other values, we present two histograms of the data, as shown in Figure 7.

We can now examine the results with respect to secondary structures. The first property looked at was the number of instances of cis configurations and others in the secondary structures, to see whether the occurrence rates differ from the average, as shown in Table 7.

Table 7. The number of cis isomers and others in each of the secondary structures chosen for study is presented.

Secondary Structure | trans Instances | cis Instances | Other Instances
Alpha Helix | (26.7%) | (20.7%) | (28.5%)
Beta Strand | (10.9%) | (25.9%) | (20.6%)
3-10 Helix | (0.2%) | (4.3%) | (3.2%)
Other | (62.1%) | (49%) | (42.9%)

These results reveal some interesting tendencies. The numbers of non-trans isomers in beta strands are high, as they are for 3-10 helices. The number of cis isomers found in alpha helices is lower than for the other structures since it is detrimental to alpha helices. Considering this, the number of instances is still quite high.
It is also surprising to see the other class of secondary structures, composed mostly of loop regions, has a greater tendency to trans configurations than otherwise. The final results of the study are shown in Table 8. For each type of secondary structure, both the axis found in the initial iteration of the Walther et al. [WEA96] method and that after smoothing were analyzed. The results for the unsmoothed alpha helix

Fig. 7. These histograms show the distribution of omega dihedral angles for peptide units that are not in either the trans or cis classes. The graphs show the same data; the latter shows the logarithm (base 10) of the number of instances to elucidate the distribution. It is quite even below 135° except for a slight rise below 30°. Above 135°, there is a consistent increase in the number of instances in each bin with increasing angle. The bins are 5° wide, and are labelled according to their lower bounds.

axes were very close to the hypothesized parallel configuration, although the standard deviation was high. The distribution of the angles is shown in the histogram in Figure 8. The smoothed axes were not quite as close to parallel, and the standard deviation was no lower, so based on this result it could be concluded that the raw axes found using the cross product of consecutive bisectors method are accurate for their corresponding peptide units. We also looked at helices that were longer than three turns to see if the results would be different, but the difference was not significant. The results of the beta strands are shown in Figure 9. The standard deviation of the beta strands is lower than that for the two helices, but the range of possible values was half that of the helices. The distribution of the values for the 3-10 helices is very similar to that for the alpha helix; this histogram is shown in Figure 10.

Table 8. This table summarizes the results of the survey for the angles between peptide planes and the axes of secondary structures. In total there were alpha helices, of which were more than three turns long. There were beta strands and helices. The instances column in the table refers to the number of peptide planes of each type.

Secondary Structure | Instances | Average | Observed Standard Deviation
α-helix
Long α-helix
Smoothed α-helix
Smoothed Long α-helix
β-strand
Smoothed β-strand
3-10-Helix
Smoothed 3-10-Helix

6 Conclusions

The study reviewed several properties associated with peptide planes. The lengths of the covalent bonds on the backbone of the proteins and the bond angles between these bonds were all found to be close to the accepted values, and had very low standard deviation. The lengths of the planes were found to have low deviation values as well. Once the secondary structure of the proteins was considered, the values predictably became less consistent.
The average value of the angle between the axis of an alpha helix and the plane of the peptide unit was found to be very close to 0°, as was claimed in the literature. Similar results were obtained for the 3-10 helix, and the angle between the beta strand and the peptide plane was found to be Both of these results had standard deviation values at least as good as that for the alpha helix, so the claim that these values are valid is equally strong. These claims were not found in the literature review conducted, so they may be novel results.

Finally, it should be considered that the protein structures surveyed have been refined. What is being measured in this review is, in a sense, how refinement has been conducted in the past, so we should expect that the values used as constraints would have low deviation values. Thus, these results suggest that the secondary structure constraints are not being used in refinement, possibly to the detriment of the final model.

Fig. 8. This is the histogram for the angles between the local axis of each peptide unit in an alpha helix and the peptide plane. Notice that the main peak is slightly greater than 0°. Each bin is 5° wide and is labelled by its upper bound.

Fig. 9. This is the histogram for the angles between the local axis of each peptide unit in a beta strand and the peptide plane. The distribution is fairly even below 35°. Each bin is 5° wide and is labelled by its upper bound.

Fig. 10. This is the histogram for the angles between the local axis of each peptide unit in a 3-10 helix and the peptide plane. Notice that the shape of the distribution is very similar to that of the alpha helix. Each bin is 5° wide and is labelled by its upper bound.

7 Future Work

A more extensive literature review should be conducted to determine whether the apparently novel results of this research are indeed novel. Even if they are not, they are not widely recognized, so publication of this work may be of interest to the community. In addition, it would likely be useful to conduct this survey again on a data set consisting only of high resolution protein structures, to determine whether the results with regard to secondary structures can be improved. Many researchers like to see such surveys conducted on data sets with low homology as well, so this constraint could also be added in a future review. Finally, the secondary structure constraints could be implemented in a refinement program to determine whether any improvement can be obtained.
