CS 882 Course Project Protein Planes

Robert Fraser SN 20253608

1 Abstract

This study is a review of the properties of protein planes. The basis of the research is the covalent bonds along the backbone of the protein, since the plane is defined by these atoms. We measure the bond lengths and the angles between the bonds to determine how consistent these values are in the PDB. We then measure the lengths of the planes and omega, the dihedral angle that defines the plane. Finally, we look at secondary structures to see whether there are preferred angles between the peptide planes and the axes of the secondary structure. The results show that the first two properties are consistent and can be used reliably for refinement and structure prediction. The angles related to the secondary structures were less consistent, but the average value found for the alpha helix matched what was claimed in the literature. The other secondary structures have similar accuracy, and, based upon the literature review, the preferred angles discovered are novel results.

2 Introduction

The purpose of this research has been to study the properties of protein planar regions, particularly with respect to their physical configurations in known protein structures. A significant number of properties were selected for study. We begin with covalent bond lengths and angles, progress to the plane length and shape, and conclude with a look at how planes are configured in secondary structures. Let's begin by motivating this research with past work and potential applications.

The primary motivation for this work is an open problem known as the Cα-trace problem. To understand this problem, we must first take a quick look at protein structure determination. Traditionally, protein structures have been determined by X-ray crystallography.
Although this method is highly time-consuming due to the need to crystallize the purified protein, it is used extensively and has been the structure determination method of choice since the first high resolution structure was published in 1960 [KDS+60]. Once the crystal is formed, an electron density map is produced using X-ray diffraction. The crystallographer can determine the approximate positions of the constituent heavy atoms (i.e. those other than hydrogen) from this map, but the positions are only as good as the resolution of the map. On small molecules this resolution can be better than 0.03Å, but on proteins 1.5 or 2Å is typical [WD93]. Another common technique is nuclear magnetic resonance (NMR) spectroscopy. NMR produces a graph with peaks corresponding to the shift due to each nucleus in the molecule. The result is higher resolution structures, but it is still subject to inaccuracies and is limited to small molecules [PR04].

The Cα-trace problem arises when we are provided with only approximate positions of the alpha carbon atoms for a given protein and we would like to determine the remainder of the structure from that information alone. This problem has a number of applications. One is that you may wish to work with the approximate alpha carbon atom positions determined by X-ray crystallography. Since these positions are subject to inaccuracies, we would like to use some techniques to determine their precise coordinates, as well as those of the other atoms in the protein. There are a number of protein structures where the only information known is the alpha carbon coordinates, possibly because that was the only information found or because the researcher wished to obfuscate their results. It would be useful to have an accurate method to determine the remainder of the coordinates. Also, some protein structure prediction techniques produce a Cα-trace as a first step (see [GKD06], for example), so a good solution to this problem would improve their prediction accuracy.

Refinement in this context is the practice of using heuristics to improve the quality of an inaccurate structure. Many approaches have been used, and they can be roughly grouped into the following categories:

De novo - This method is based purely on the energy of the protein, and adjustments are made to reduce the total energy of the molecule.
Anfinsen's thermodynamic hypothesis [Anf73] is that the native state is at or very close to the global minimum energy of the protein. Therefore, force field approximations such as the CHARMM [Cor90] or ECEPP/3 [KLS02] fields may be used to refine the structure so that the energy of the protein is minimized.

Fragment matching - This approach searches for fragments of proteins with known structure that have similar alpha carbon positions and sequence information to fragments in the unknown structure. Levitt [Lev92] uses fragment matching in his program SEGMOD, which first fits the discovered fragments into the Cα-trace model, and then uses energy minimization to further refine the model.

Maximize hydrogen bonding - This approach is similar to a de novo approach, but the first step of the algorithm is to determine which orientations of the peptide planes will maximize hydrogen bonding in the protein [LPW+93]. Once this structure is determined, it is further refined using energy minimization.

Idealized covalent geometry - This approach is the inspiration for this study. Used by Engh and Huber [EH91] for X-ray crystallography refinement, the idea is that even small variations from the ideal geometry for the protein backbone result in significant increases in the energy of the protein. Therefore, the protein structure is refined by constraining all of these values to be as close to ideal as possible. This approach has also been supplemented by including dihedral information [Pay93,DdBSB03].

All of these methods achieve under 1Å root mean square deviation (rmsd; see footnote 1) from native structures, while under 0.5Å rmsd is considered good. Perhaps including more information about the plane could further improve results, particularly in the latter approaches. That is the purpose of this research: we wish to determine whether there are additional properties associated with protein planes that could be used. We will approach this problem by first examining those properties that are traditionally used, and we will determine how closely these values are adhered to. Finally, we will look at some additional properties, such as the configuration of the peptide plane with respect to secondary structures, to see whether the accuracy of these properties is similar to that of the traditional metrics.

3 Protein Plane Properties

The protein plane is an area of special interest because of the structural stability that it provides and the inherent reduction in the complexity of the protein model that a planar region entails. The peptide plane was identified in some of the earliest models of proteins. Pauling, Corey and Branson [PCB51] developed models which identified the planes, based solely on theoretical chemistry and crude X-ray crystallographic structures of amino acids. There was little uncertainty: they state that "there can be no doubt whatever about its (existence)". They were also able to identify the covalent bond lengths and angles with remarkable accuracy.

The planarity arises because the bond between the nitrogen and carbonyl carbon atoms forms a partial double bond. The chemical explanation is that the carbonyl carbon, nitrogen, and oxygen atoms have adjacent p orbitals, and these interact to produce three π orbitals.
The result is electron sharing between these three orbitals, and the lowest energy configuration for this sharing is a planar one. For more details, refer to [Kyt95].

An important thing to note at this point is that peptide planes do not map one-to-one with amino acid residues. It is conventional when dealing with proteins to think of the amino acid as the basic building block, which is a useful model. The atom in the middle of the amino acid is the alpha carbon, and it forms the junction between the protein backbone and the side chain for each residue. The alpha carbon is also the corner of each protein plane, so each plane spans two residues. This planar region is sometimes called a peptide unit, since we can use it as a building block as well [BT99]. This concept is illustrated in Figure 1. Now that the nature of the protein plane is clear, we will look at the properties of interest in turn.

Footnote 1: rmsd is a standard metric for comparing two structures. It is used when we have two sets of points where each point has a unique counterpart in the other set. The rmsd is essentially the average distance between each point in one set and its counterpart in the other (more precisely, the square root of the mean of the squared distances).
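To make the footnote concrete, here is a minimal Python sketch of the rmsd calculation. It is an illustration only, not code from this study: the function name `rmsd` is hypothetical, and the sketch assumes the two point sets are already paired in order and already superimposed (no optimal fitting is performed).

```python
import math

def rmsd(points_a, points_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points.

    Assumes points_a[k] is the counterpart of points_b[k] and that the two
    structures have already been superimposed.
    """
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("point sets must be non-empty and of equal size")
    total = 0.0
    for a, b in zip(points_a, points_b):
        # Squared Euclidean distance between a point and its counterpart.
        total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.sqrt(total / len(points_a))
```

For a single pair of points 5 Å apart, such as (0, 0, 0) and (3, 4, 0), the function returns 5.0; because the distances are squared before averaging, large deviations weigh more heavily than they would in a plain average.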

Fig. 1. A section of polypeptide chain is illustrated to show the planar regions. Note that the alpha carbon is at the corner of the plane, while one amino acid residue is of the sequence N-Cα-C along the backbone. Some important properties such as covalent bond lengths and angles are shown. The two dihedral angles φ and ψ are represented as well. ω, the dihedral angle in the peptide plane, is not labelled. The side chains are represented by spheres, where the sphere labelled R_i represents the side chain for residue i. This image is reproduced with permission from [PR04].

3.1 Bond Lengths and Angles

The lengths of the covalent bonds along the backbone of the protein chain have been determined accurately by X-ray diffraction to hundredths of an Ångström (Å). As mentioned earlier, we are interested to see how closely the bond lengths in protein structure files adhere to these values. The known values for the bonds we are interested in are summarized in Table 1. The standard angles between these atoms are shown in Table 2. The values that are generally considered standard are those of Engh and Huber [EH91]. We include those values published in [PCB51] over 50 years earlier to demonstrate the impressive accuracy of this early work.

Although the planar hydrogen atom is shown in Figure 1, it is not included in this study because hydrogen atoms are not resolved by X-ray diffraction methods. Thus, the majority of protein structure files do not include coordinate data for hydrogen atoms. The bond lengths and angles studied are all those on the backbone that do not involve hydrogen.

Table 1. The standard values for the covalent bond lengths among protein backbone atoms, as published by Engh and Huber [EH91]. The standard values from a recent textbook [PR04] are included as well, demonstrating that the Engh and Huber values are considered standard. The classic results of Pauling et al. [PCB51] are included for interest.

Bond    Bond Length (Å)    Textbook    Pauling et al.
N-Cα    1.451 ± 0.016      1.45        1.47
Cα-C    1.516 ± 0.018      1.52        1.53
C-O     1.231 ± 0.020      1.23        1.23
C-N     1.329 ± 0.014      1.33        1.32

Table 2. The standard values for the angles between protein backbone atoms [EH91]. Again, the classic results of Pauling et al. [PCB51] are included for interest.

Bonds     Bond Angle (°)    Pauling et al.
N-Cα-C    112.5 ± 2.9       110
Cα-C-O    120.8 ± 2.1       121
Cα-C-N    116.4 ± 2.1       117
O-C-N     123.0 ± 1.6       122
C-N-Cα    120.6 ± 1.7       120

3.2 Plane Length

When we refer to the length of the plane in this context, we mean the distance from one alpha carbon to the next. This could be considered the length of the plane from corner to corner, for the two corners of the plane defined by alpha carbon atoms. This distance is also referred to as the Cα-Cα bond, as it can be treated as a sort of bond in models that consist only of alpha carbon atoms. Ideally, this distance should also be fairly consistent, since all of the intermediary lengths and angles are considered virtually fixed. The purpose of this section is to put that idea to the test.

Making this slightly more interesting, however, is that there are two stable configurations for the plane to adopt. Vastly more common (on the order of 1000:1 [Ham92]) is the trans configuration, shown in Figure 1. The other planar configuration is the cis conformation; the difference between the two is shown in Figure 2. It is clear from the figure that we should expect the length of the trans configuration to be longer than that of the cis configuration. For any non-planar instances, we would expect values that lie between these two distances.

In the analysis, we will determine the average length and standard deviation of the planes for each of the cis and trans configurations. We will also count the number of instances of each to determine their relative prevalence. Finally, it is known that there is a strong preference for the second residue of the plane to be proline when in the cis conformation [Kyt95]; we will measure this frequency as well. We will also determine the percentage of instances of proline that are involved in cis configurations. This has been measured as 25% [Kyt95] or in the range of 10-30% [Ham92] when proline is the second residue in the peptide unit.

Fig. 2. The trans and cis configurations of the peptide plane both lie in the plane. The difference is that the Cα atoms lie on opposite sides of the line formed by the C-N bond in the trans configuration, while they are on the same side in the cis configuration. It is clear that the distance between alpha carbon atoms is less in the cis conformation. The terminology is from Latin, where trans means "across" and cis translates as "on the same side".
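The expectation that the trans plane is longer than the cis plane can be checked with a small planar construction. The Python sketch below is an illustration under stated assumptions, not part of the original analysis: it places C at the origin and N on the x axis, uses the bond lengths and angles quoted in Tables 1 and 2, and assumes (unrealistically for cis) that the same ideal angles hold in both isomers. The function name `ca_ca_distance` is hypothetical.

```python
import math

# Bond lengths (Å) and angles (degrees) as quoted in Tables 1 and 2.
CA_C, C_N, N_CA = 1.516, 1.329, 1.451
ANG_CA_C_N, ANG_C_N_CA = 116.4, 120.6

def ca_ca_distance(cis=False):
    """Cα-Cα distance for a perfectly planar peptide unit.

    C sits at the origin and N at (C_N, 0). The first Cα is placed above
    the C-N line; the second Cα goes below it for trans, above it for cis.
    """
    a1 = math.radians(ANG_CA_C_N)
    a2 = math.radians(ANG_C_N_CA)
    ca1 = (CA_C * math.cos(a1), CA_C * math.sin(a1))
    side = 1.0 if cis else -1.0
    # The N->Cα direction makes the angle a2 with the N->C direction (-1, 0).
    ca2 = (C_N - N_CA * math.cos(a2), side * N_CA * math.sin(a2))
    return math.dist(ca1, ca2)

if __name__ == "__main__":
    print("trans:", round(ca_ca_distance(cis=False), 2))
    print("cis:  ", round(ca_ca_distance(cis=True), 2))
```

Under these assumptions the sketch gives roughly 3.8 Å for trans and roughly 2.7 Å for cis. The cis figure is only indicative, since real cis peptides need not keep the same bond angles, but the ordering agrees with Figure 2.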
3.3 Omega

Omega (ω) is the measure of the planarity of the plane. It is the dihedral angle defined by the Cα-C-N-Cα atoms. The method of calculating ω is outlined in Figure 3. We will use ω to classify the planes into the three conformations discussed in the previous section. If the angle is 180°, we have a clear instance of the trans configuration. Since we are dealing with a natural system, we will allow for a 15° deviation from this value within the trans class. Similarly, anything within 15° of 0° will be considered cis. Everything else will be placed in an "other" class. It has been documented that such values do not occur [Kyt95], and structures containing values of ω as much as 10° away from trans (e.g. [OS86]) were considered noteworthy. More recent works have noted a tendency for angles to vary from 180°, however, and a standard deviation of 6° has been observed [MT96,Edi01], as shown in Figure 4.

Fig. 3. The vectors used to calculate omega. A and B are the cross products of vectors corresponding to backbone bonds, and omega is given by the angle between them. The example shown here is in the trans configuration, so the vectors point in opposite directions and the angle between them is near 180°.

The "other" class will be an interesting one to note. For one, it is expected that its occurrence will be fairly low, because it is a higher energy configuration than the two planar ones. The reason that there is a planar tendency in the backbone is that the C-N bond becomes a partial double bond, as mentioned earlier. In a non-planar configuration, this partial double bond is not adopted and the energy of the bond is higher. Correspondingly, since double bonds are shorter than single bonds, we would expect the C-N bond to be longer in these non-planar instances. We will check the data to see whether this is reflected in known protein structures.

3.4 Secondary Structures

Finally, we will examine the relationship between the protein plane and the most common secondary structures: alpha helices, beta strands, and 3-10 helices.
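As a brief aside on Section 3.3 before the axis calculation is developed, the ω calculation and the three-way classification can be sketched as follows. This is a minimal Python illustration, not the Matlab code used in the study. The function names `omega` and `classify` are hypothetical, and the signed atan2 formulation is an assumption made here; the two cross products play the role of the vectors A and B in Figure 3.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def omega(ca1, c, n, ca2):
    """Signed dihedral angle (degrees) about the Cα-C-N-Cα atoms."""
    b1, b2, b3 = _sub(c, ca1), _sub(n, c), _sub(ca2, n)
    vec_a = _cross(b1, b2)          # normal to the Cα-C-N plane
    vec_b = _cross(b2, b3)          # normal to the C-N-Cα plane
    x = _dot(vec_a, vec_b)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, vec_b)
    return math.degrees(math.atan2(y, x))

def classify(omega_deg, tolerance=15.0):
    """Assign a plane to 'trans', 'cis' or 'other' using the 15° windows."""
    magnitude = abs(omega_deg)
    if magnitude >= 180.0 - tolerance:
        return "trans"
    if magnitude <= tolerance:
        return "cis"
    return "other"
```

Taking the absolute value before classifying means the sign convention of the dihedral does not affect which class a plane falls into.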
The inspiration for this section is the claim that protein planes are generally parallel to the axis of an alpha helix [Ham92,GP98]. This claim is worth investigating to determine whether it is reliable enough for predictive purposes and for refinement of structures. In addition, it is worthwhile to investigate whether protein planes exhibit any preferential orientations to the other secondary structures.

Fig. 4. This graph shows the results of a previous survey of ω dihedral angles of peptide planes in the trans configuration. The histogram contains 237807 values of omega from proteins with high accuracy. The dots correspond to the Maxwell-Boltzmann relation with the mapping black-proteins, grey-peptides, and red-current ω value. The line is a classic energy function derived by Corey and Pauling. This image is reproduced from [Edi01].

The first requirement for this analysis is to determine the axis of the secondary structure. We will focus the discussion on the alpha helix here, and then the application of this method to the other structures will be covered briefly. There is no conventional method for determining the axis of an alpha helix; the method used should be driven by the particular application. The simplest method is to employ some sort of linear regression so that the distance from the axis to each of the alpha carbon atoms in the helix is minimized. However, secondary structures are natural structures, and as such there are deformations due to their surroundings inside and outside of the protein. Since we would like to know the angle between the protein plane and the axis, a single straight line would not serve us as well as a curved line or segmented axis that follows the bends of the alpha helix.

Chothia et al. [CLR81] developed a model, later refined by Walther et al. [WEA96], that serves this purpose well. It is often referred to as the cross product of triad bisectors method [CSB96]. The method begins by finding the vector B_i perpendicular to the axis at the i-th alpha carbon, and then uses the cross product of two consecutive such vectors to obtain u_i, the axis of the helix local to these alpha carbons.
This method is particularly well suited to our purposes because two alpha carbons are associated with each protein plane, so we get one axis vector corresponding to each plane. These vectors are shown in Figure 5.
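The construction of the local axis vectors can be sketched in a few lines. The Python below is an illustration only, not the study's Matlab implementation. It assumes the bisector is taken as B_i = r_i + r_{i+2} - 2 r_{i+1}, that the local axis is u_i = B_i × B_{i+1}, and that each u_i is rescaled to the 1.5 Å rise per residue of an ideal alpha helix. The function name `local_axes` is hypothetical, and the end-of-helix handling and smoothing are left out.

```python
import math

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def local_axes(ca, rise=1.5):
    """Local helix axis vectors from consecutive Cα coordinates.

    ca is a list of (x, y, z) alpha-carbon positions for one helix, ordered
    from the N-terminus to the C-terminus. Returns len(ca) - 3 vectors,
    each of length `rise`.
    """
    # Triad bisectors: one per three consecutive alpha carbons.
    bisectors = [
        tuple(ca[i][k] + ca[i + 2][k] - 2.0 * ca[i + 1][k] for k in range(3))
        for i in range(len(ca) - 2)
    ]
    axes = []
    for b0, b1 in zip(bisectors, bisectors[1:]):
        u = _cross(b0, b1)
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            raise ValueError("consecutive bisectors are parallel; axis undefined")
        axes.append(tuple(rise * c / norm for c in u))
    return axes
```

For an ideal right-handed helix built along the z axis, every vector returned points along +z, that is, in the direction of the chain. A helix of n residues yields n - 3 vectors with this sketch, so the last residues need the kind of end adjustment the method describes.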

Fig. 5. The method described by Walther et al. for determining local helix axis vectors.

The local axis vectors are calculated in the first iteration by taking the cross product of two of the consecutive normals:

$$B_i = r_i + r_{i+2} - 2r_{i+1}$$

$$u_i = B_i \times B_{i+1}$$

For the helix axis for residue r_n (the one at the C-terminus of the helix), the axis vector of r_{n-1} is used again. For the purposes of smoothing, the axes must be fit to the helix. In order to assign positions for the axis vectors, the geometric center of four consecutive residues around the present one is calculated:

$$A_i = \frac{r_{i-1} + r_i + r_{i+1} + r_{i+2}}{4}$$

The formula needs to be adjusted for the ends of the helices. The axis vectors (u_i) are adjusted so that their lengths are all 1.5Å, which is the average rise per residue along the axis for an ideal alpha helix. The axis of the helix is now described by a series of line segments. The endpoints, b_i and e_i, of the local helix axis for residue r_i are given by:

$$b_i = A_i$$

$$e_i = A_i + u_i$$

The axes are smoothed using an iterative approach. The first step is to take the average of three consecutive helix axis vectors (two at the ends):

$$u_{i,\mathrm{smoothed}} = \frac{u_{i-1} + u_i + u_{i+1}}{3}$$

Now the average point coordinates A_i are adjusted by finding the midpoint between the beginning point of the current helix axis and the endpoint of the previous one:

$$A_{i,\mathrm{smoothed}} = \frac{b_i + e_{i-1}}{2}$$

This smoothing process is repeated three times; the result is a series of local helix axis line segments which approximate the curve of the helix. For our analysis, we will analyze the positions of the plane relative to both the smoothed and unsmoothed axis vectors.

The axis calculation technique lends itself well to any other regular structure containing an axis, since nothing in the method is specific to alpha helices (except for the rise per residue in the smoothing step). The vectors perpendicular to the axis are in the direction of the bisector of the external angle of the lines formed between three consecutive alpha carbons. Figure 6 illustrates some beta strands and alpha helices; it is clear from this figure that the same principles would apply to both. See [Rot95] for another example of treating a beta strand as a helix for the purpose of determining its axis. A review of the literature did not reveal any preferences for the configuration between beta strands and protein planes, nor for 3-10 helices, so this study will elucidate whether or not such properties exist.

Fig. 6. This figure illustrates the configuration of beta strands. The image on the left is a ball and stick model where each ball corresponds to an alpha carbon, and the image on the right is the cartoon representation of the secondary structures of the same protein. The latter image is included to assist in locating beta strands; they are the arrow-shaped objects. Notice that the cross product of consecutive triad bisectors would give a vector that would form a local axis for the beta strand.

For the plane, we will find the plane normal by taking the cross product of the C-Cα vector with the Cα(i+1)-Cα(i) vector, so that the normal will be on the same side of the plane regardless of the isomerism of the plane. Since nearly all helices in proteins are right-handed, this normal vector will consistently point towards the interior of the helix. Therefore we can have a signed angle, and we will define it so that planes tilted inwards have a positive angle. With beta strands, however, the axis will alternate between different sides of the planes, so we will only measure the angle in the range of 0° to 90°, with 0° being parallel.

4 Implementation Details

In this section, we begin by covering the data that was used in this study, and then briefly discuss how the data was analyzed. We conclude by mentioning a few of the technical issues encountered during this research.

The standard knowledge base used for this study is the Protein Data Bank (PDB) [BKW+77]. The PDB is the standard repository for protein files once their structures have been determined. The scope of this survey was to analyze every PDB file in the repository as of 2006. There is a question of whether surveying the entire PDB would introduce bias or poor data, since many of the structures have poor resolution and there are groups of highly homologous proteins. However, since the purpose of this study was to be exhaustive, these compromises were accepted. Future studies could easily be run on data sets with only high resolution proteins and low homology to determine whether these factors make a difference.

Matlab was chosen to perform the analysis so that the pdbread function in the Bioinformatics Toolbox could be used. Because there are so many idiosyncrasies in PDB files, the robustness of the Matlab function was desirable. There are many files with unusual indexing; sometimes this is because there are loop regions that could not be resolved by the crystallographer. Sometimes there are errors in the data.
In most cases, pdbread is able to make sense of this data, and it returns a struct containing most of the data in the file in a usable format. When Matlab identified a file as unreadable, often because the file contained only alpha carbon atoms, the file was left aside for the purposes of this study. Of the 34091 files in the snapshot of the PDB, 27837 could be used in the study.

Once the struct is obtained for a particular protein, we next need to extract the information relevant to our study. The struct contains an Atom field, which contains the coordinate information for each atom in the protein, along with its other attributes. We begin by extracting only the backbone atoms for each residue by parsing through the struct. The secondary structure of each protein was obtained by running the DSSP program [KS83] on each PDB file. To save computation at runtime, all of the DSSP files were precomputed and stored; this took about 12 hours on a standard 1GHz PC.

For each amino acid residue, we have four atom structs (N, Cα, C, O), each containing the following (in each case a character refers to an alphanumeric character):

PDBID - the 4 character PDB identifier for the protein;
atomname - the name of the atom: N for nitrogen, CA for alpha carbon, etc.;
resname - the 3 character name of the amino acid;
resseq - the number associated with the amino acid containing the atom, incrementing from the N terminal on each chain;
chainid - the character corresponding to the current chain;
coords - the X,Y,Z coordinates of the atom;
ss - the secondary structure. For this survey, we were interested in three types of secondary structure: the alpha helix (identified by an "H" in DSSP), the beta strand ("E"), and the 3-10 helix ("G"). Everything else was considered as being "other"; for the most part this consisted of loop regions.

Once we have all of these atom structs for the current protein, we can run the battery of tests to get all of the desired results. These results are then merged with all of the results previously obtained from other PDB files.

4.1 Technical Issues

A number of technical issues were encountered during this research that may be of interest to others hoping to do similar work. Because of the extremely large volume of data, the program had to be run in batches to avoid overflow errors. Thus, the source files were divided into 69 groups of up to 500 files each, and the results of each of these runs were compiled together once all the files had been analyzed.

Another obstacle is the computation time required by the pdbread function. It takes roughly 30 seconds on average to parse a PDB file on a standard 1GHz PC. When parsing such a large number of files, the computation time becomes cumbersome. Because of the batching described above, however, it became possible to run the analysis in parallel by having different computers analyze different batches of files. Using this approach, the analysis was performed using three PCs, and the computation took 5 days.

5 Results

This study is composed of a large number of tests, and the results are presented in this section in a series of tables and graphs.
We begin by looking at the covalent bond lengths (Table 3) and angles (Table 4). In each case, we again present the standard values derived by Engh and Huber [EH91] for reference, followed by the average values found in our study. The standard deviation (σ) values presented are the average of the per-file standard deviations rather than the standard deviation over the entire data set, to give an impression of the variance within each file. For interest, we have also included the absolute maximum and minimum values found for each attribute, as well as the averages of the maximum and minimum values found for each of the 69 batches. These show that physically impossible values are present, and that the results would likely be cleaner if some sanity checking were performed before the analysis to eliminate such cases. It is possible that in some instances multiple chains were labelled as a single chain in the PDB file, since there are examples of the C-N bond that are very long.

Table 3. The standard values for the covalent bond lengths among protein backbone atoms. All lengths are in Å.

Bond            Textbook  Average  σ      Instances     Maximum  Avg Max  Minimum  Avg Min
N-Cα            1.451     1.457    0.011  1.307 × 10^8  59.924   3.527    0.177    1.134
Cα-C            1.516     1.523    0.012  1.307 × 10^8  11.672   2.176    0.162    1.185
C-O             1.231     1.232    0.009  1.307 × 10^8  6.536    2.310    0        0.851
C-N             1.329     1.330    0.060  1.296 × 10^8  403.09   125.97   0.177    0.695
Non-planar C-N  Variable  1.364    0.390  7.036 × 10^5

Notice that there is a small difference between the length of the C-N bond in the overall and non-planar cases, but this difference is not significant. It is probably not considered in most refinement approaches.

Table 4. The standard values for the angles between protein backbone atoms. All angles are in degrees.

Bonds   Textbook  Average  σ      Instances      Maximum  Avg Max  Minimum  Avg Min
N-Cα-C  112.5     111.18   3.377  1.307 × 10^8   173.45   148.96   3.504    73.901
Cα-C-O  120.8     120.63   1.074  1.307 × 10^8   176.98   149.72   5.645    77.39
Cα-C-N  116.4     116.46   1.548  1.2957 × 10^8  179.95   160.73   3.869    31.362
O-C-N   123.0     122.86   1.182  1.2957 × 10^8  177.27   159.63   1.383    27.519
C-N-Cα  120.6     121.54   1.906  1.2957 × 10^8  179.96   173.24   4.132    37.755

These results are what we expected: the values are very close to the textbook values, and there is very little deviation. Thus, we should expect the lengths of the planes to have similar properties. The results are shown in Table 5. The overall average is presented first, and then we present the results for each isomer. Once again, extreme values are presented for interest.

Table 5. The results for the length of the peptide plane. The combined results are shown first, followed by the results for each of the trans and cis isomers.

Isomer  Instances     Average  σ      Maximum  Avg Max  Minimum  Avg Min
All     1.296 × 10^8  3.800    0.094  403.5    125.92   0.286    2.429
trans   1.286 × 10^8  3.801    0.036  348.67   45.50    0.588    2.888
cis     2.602 × 10^5  2.949    0.112  153.15   29.672   0.286    2.600

There were 225172 proline residues as the second residue in cis conformation peptide units, while a total of 6023383 prolines were encountered in the survey.
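As a quick arithmetic check on how these counts relate to the isomer totals in Table 5, the following Python sketch computes the share of prolines found in cis peptide units, the share of cis units that involve proline, and the trans-to-cis ratio. The variable names are illustrative, and the isomer totals are the rounded values from Table 5, so the results are approximate.

```python
# Counts taken from the text and from Table 5 (isomer totals are rounded).
cis_with_proline = 225172   # prolines as second residue of a cis peptide unit
total_prolines = 6023383    # all prolines encountered in the survey
cis_total = 2.602e5         # all cis peptide units (Table 5)
trans_total = 1.286e8       # all trans peptide units (Table 5)

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

pct_prolines_in_cis = percent(cis_with_proline, total_prolines)
pct_cis_with_proline = percent(cis_with_proline, cis_total)
trans_to_cis = trans_total / cis_total

print(f"prolines in cis units: {pct_prolines_in_cis:.1f}%")
print(f"cis units with proline: {pct_cis_with_proline:.1f}%")
print(f"trans:cis ratio: {trans_to_cis:.0f}:1")
```

With these counts the three quantities come out at roughly 3.7%, 86.5%, and 494:1.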

Thus 3.7% of prolines are associated with cis conformations, which is much lower than the values of 10-30% claimed earlier. The proportion of cis isomers that involve proline is 86.5%, however, so we do have strong evidence for proline being preferred. We found that the trans isomer outnumbers its counterpart by a ratio of approximately 500:1, which is within a factor of 2 of the value presented in the methods section. In both cases, the standard deviation of the plane length is low, so these values are reliable enough for refinement purposes. The values of omega for each of the isomers are shown in Table 6.

Table 6. The average values of omega for each isomer class are presented. All angles are in degrees.

Isomer  Instances     Average  Observed Standard Deviation
trans   1.286 × 10^8  178.13   1.929
cis     2.602 × 10^5  2.143    1.707
Other   7.036 × 10^5  156.05   13.213

In each case, the average is several degrees away from the ideal, which would be expected since we were measuring the absolute value of the angle. In the other case, the mean is surprisingly close to the trans threshold. To show the nature of these other values, we present two histograms of the data in Figure 7.

We can now examine the results with respect to secondary structures. The first property examined was the number of instances of cis and other configurations in each secondary structure, to see whether the occurrence rates differ from the average, as shown in Table 7.

Table 7. The number of cis isomers and others in each of the secondary structures chosen for study is presented.

Secondary structure  trans instances       cis instances         Other instances
Alpha helix          3.437 × 10^7 (26.7%)  5.391 × 10^4 (20.7%)  2.107 × 10^5 (28.5%)
Beta strand          1.404 × 10^7 (10.9%)  6.752 × 10^4 (25.9%)  1.522 × 10^5 (20.6%)
3-10 helix           3.048 × 10^5 (0.2%)   1.112 × 10^4 (4.3%)   2.350 × 10^4 (3.2%)
Other                7.988 × 10^7 (62.1%)  1.276 × 10^5 (49%)    3.172 × 10^5 (42.9%)

These results reveal some interesting tendencies.
The number of non-trans isomers in beta strands is high, as it is for 3-10 helices. The number of cis isomers found in alpha helices is lower than for the other structures, since the cis configuration is detrimental to alpha helices. Even so, the number of instances is still quite high. It is also surprising to see that the other class of secondary structures, composed mostly of loop regions, has a greater tendency to trans configurations than otherwise.

Fig. 7. These histograms show the distribution of omega dihedral angles for peptide units that are not in either the trans or cis classes. The graphs show the same data; the latter shows the logarithm (base 10) of the number of instances to elucidate the distribution. It is quite even below 135° except for a slight rise below 30°. Above 135°, there is a consistent increase in the number of instances in each bin with increasing angle. The bins are 5° wide, and are labelled according to their lower bounds.

The final results of the study are shown in Table 8. For each type of secondary structure, both the axis found in the initial iteration of the Walther et al. [WEA96] method and the axis after smoothing were analyzed. The results for the unsmoothed alpha helix axes were very close to the hypothesized parallel configuration, although the standard deviation was high. The distribution of the angles is shown in the histogram in Figure 8. The smoothed axes were not quite as close to parallel, and the standard deviation was no lower, so based on this result it could be concluded that the raw axis found using the cross product of consecutive bisectors is accurate for the corresponding peptide unit. We also looked at helices that were longer than three turns to see if the results would be different, but the difference was not significant. The results for the beta strands are shown in Figure 9. The standard deviation for the beta strands is lower than that for the two helices, but the range of possible values was half that of the helices. The distribution of the values for the 3-10 helices is very similar to that for the alpha helix; this histogram is shown in Figure 10.

Table 8. This table summarizes the results of the survey for the angles between peptide planes and the axes of secondary structures. In total there were 3.365 × 10^6 alpha helices, 1.888 × 10^6 of which were more than three turns long. There were 2.828 × 10^6 beta strands and 90812 3-10 helices. The instances in the table refer to the number of peptide planes of each type. All angles are in degrees.

Secondary Structure    Instances     Average  Observed Standard Deviation
α-helix                3.463 × 10^7  1.99     23.842
Long α-helix           2.632 × 10^7  2.35     23.411
Smoothed α-helix       3.463 × 10^7  11.77    23.955
Smoothed Long α-helix  2.632 × 10^7  11.45    24.113
β-strand               1.426 × 10^7  22.877   14.155
Smoothed β-strand      1.426 × 10^7  24.725   15.793
3-10-Helix             3.394 × 10^5  2.608    22.582
Smoothed 3-10-Helix    3.394 × 10^5  12.75    21.569

6 Conclusions

The study reviewed several properties associated with peptide planes. The lengths of the covalent bonds on the backbone of the proteins and the bond angles between these bonds were all found to be close to the accepted values, and had very low standard deviation.
The lengths of the planes were found to have low deviation values as well. Once the secondary structure of the proteins was considered, the values predictably became less consistent. The average value of the angle between the axis of an alpha helix and the plane of the peptide unit was found to be very close to 0°, as was claimed in the literature. Similar results were obtained for the 3-10 helix, and the corresponding angle for the beta strand was found to be 20-25°. Both of these results had standard deviation values at least as good as that for the alpha helix, so the claim that these values are valid is equally strong. These claims were not found in the literature review conducted, so they may be novel results.

Finally, it should be considered that the protein structures surveyed have been refined. What is being measured in this review is, in a sense, how the refinement has been conducted in the past, so we should expect that the values used as constraints would have low deviation values. Thus, these results demonstrate that the secondary structure constraints are not being used in refinement, possibly to the detriment of the final model.

Fig. 8. This is the histogram for the angles between the local axis of each peptide unit in an alpha helix and the peptide plane. Notice that the main peak is slightly greater than 0°. Each bin is 5° wide, and is labelled by the upper threshold of the bin.

Fig. 9. This is the histogram for the angles between the local axis of each peptide unit in a beta strand and the peptide plane. The distribution is fairly even below 35°. Each bin is 5° wide, and is labelled by the upper threshold of the bin.

Fig. 10. This is the histogram for the angles between the local axis of each peptide unit in a 3-10 helix and the peptide plane. Notice that the shape of the distribution is very similar to that of the alpha helix. Each bin is 5° wide, and is labelled by the upper threshold of the bin.

7 Future Work

A more extensive literature review should be conducted to determine whether the apparently new results of this research are indeed novel. Even if they are not novel, they are not widely recognized, and thus publication of this work may be of interest to the community. In addition, it would likely be useful to conduct this survey again on a data set consisting only of high resolution protein structures to determine whether the results with regard to secondary structures can be improved. Many researchers like to see such surveys conducted on data sets with low homology as well, so this constraint could also be added in a future review. Finally, the secondary structure constraints could be implemented in a refinement program to determine whether improvement can be obtained.

References

[Anf73] C.B. Anfinsen. Principles that govern the folding of protein chains. Science, 181(4096):223-230, 1973.

[BKW+77] F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer, M.D. Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi. The Protein Data Bank: A computer-based archival file for macromolecular structures. European Journal of Biochemistry, 80(2):319-324, 1977.
[BT99] C. Branden and J. Tooze. Introduction to Protein Structure. Garland Publishing Inc., New York, 1999.
[CLR81] C. Chothia, M. Levitt, and D. Richardson. Helix to helix packing in proteins. Journal of Molecular Biology, 145:215-250, 1981.
[Cor90] P.E. Correa. The building of protein structures from α-carbon coordinates. Proteins, 7:366-377, 1990.
[CSB96] J.A. Christopher, R. Swanson, and T.O. Baldwin. Algorithms for finding the axis of a helix: Fast rotational and parametric least-squares methods. Computers & Chemistry, 20(3):339-345, 1996.
[DdBSB03] M.A. Depristo, P.I.W. de Bakker, R.P. Shetty, and T.L. Blundell. Discrete restraint-based protein modeling and the Cα-trace problem. Protein Science, 12:2032-2046, 2003.
[Edi01] A.S. Edison. Linus Pauling and the planar peptide bond. Nature Structural Biology, 8:201-202, 2001.
[EH91] R.A. Engh and R. Huber. Accurate bond and angle parameters for x-ray protein structure refinement. Acta Crystallographica A, 47:392-400, 1991.
[GKD06] J. Glasgow, T. Kuo, and J. Davies. Protein structure from contact maps: A case-based reasoning approach. Information Systems Frontiers, 8:29-36, 2006.
[GP98] N. Guex and M.C. Peitsch. Tutorial: Comparative protein modelling. In The Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB 98), 1998.
[Ham92] K. Hamaguchi. The Protein Molecule: Conformation, Stability and Folding. Japan Scientific Societies Press, Springer-Verlag, Tokyo, 1992.
[KDS+60] J.C. Kendrew, R.E. Dickerson, B.E. Strandberg, R.G. Hart, D.R. Davies, D.C. Phillips, and V.C. Shore. Structure of myoglobin: A three-dimensional Fourier synthesis at 2Å resolution. Nature, 185:422-427, 1960.
[KLS02] R. Kaźmierkiewicz, A. Liwo, and H.A. Scheraga. Energy-based reconstruction of a protein backbone from its α-carbon trace by a Monte-Carlo method. Journal of Computational Chemistry, 23(7):715-723, 2002.
[KS83] W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577-2637, 1983.
[Kyt95] J. Kyte. Structure in Protein Chemistry. Garland Publishing Inc., New York, 1995.
[Lev92] M. Levitt. Accurate modeling of protein conformation by automatic segment matching. Journal of Molecular Biology, 226:507-533, 1992.
[LPW+93] A. Liwo, M.R. Pincus, R.J. Wawak, S. Rackovsky, and H.A. Scheraga. Calculation of protein backbone geometry from α-carbon coordinates based on peptide-group dipole alignment. Protein Science, 2(10):1697-1714, 1993.
[MT96] M.W. MacArthur and J.M. Thornton. Deviations from planarity of the peptide bond in peptides and proteins. Journal of Molecular Biology, 264(5):1180-1195, 1996.
[OS86] C. Oefner and D. Suck. Crystallographic refinement and structure of DNase I at 2Å resolution. Journal of Molecular Biology, 192(3):605-632, 1986.
[Pay93] P.W. Payne. Reconstruction of protein conformations from estimated positions of the Cα coordinates. Protein Science, 2:315-324, 1993.
[PCB51] L. Pauling, R.B. Corey, and H.R. Branson. The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain. Proceedings of the National Academy of Science USA, Chemistry, 37:205-211, 1951.
[PR04] G.A. Petsko and D. Ringe. Protein Structure and Function. New Science Press Ltd, London, 2004.
[Rot95] I. Roterman. The geometrical analysis of peptide backbone structure and its local deformations. Biochimie, 77(3):204-216, 1995.
[WD93] D.A. Waller and G.G. Dodson. Biological structures obtained by x-ray diffraction methods. In R. Diamond, T.F. Koetzle, K. Prout, and J.S. Richardson, editors, Molecular Structures in Biology, pages 1-19. Oxford University Press, 1993.
[WEA96] D. Walther, F. Eisenhaber, and P. Argos. Principles of helix-helix packing in proteins: the helical lattice superimposition model. Journal of Molecular Biology, 255:536-553, 1996.