DATA MINING OF ELECTROSTATIC INTERACTIONS BETWEEN AMINO ACIDS IN COILED-COIL PROTEINS USING THE STABLE COIL ALGORITHM ANKUR S.


University of Colorado at Colorado Springs

DATA MINING OF ELECTROSTATIC INTERACTIONS BETWEEN AMINO ACIDS IN COILED-COIL PROTEINS USING THE STABLE COIL ALGORITHM

BY ANKUR S. DESHMUKH

A project submitted to the Faculty of Graduate School of the University of Colorado at Colorado Springs in partial fulfillment of the requirements for the degree of Master of Science

Department of Computer Science

2008

This project for the Master of Science Degree by Ankur Deshmukh has been approved for the Department of Computer Science by

Advisor: Dr. Jugal Kalita
Committee member: Dr. Edward Chow
Committee member: Dr. Robert Hodges

Date

TABLE OF CONTENTS

CHAPTER: INTRODUCTION
CHAPTER: BACKGROUND RESEARCH
    BACKGROUND RESEARCH IN UNDERSTANDING COILED-COILS
    UNDERSTANDING PROTEIN STRUCTURE
    PRIMARY STRUCTURE
    SECONDARY STRUCTURE
    TERTIARY STRUCTURE
    QUATERNARY STRUCTURE
    COILED-COILS
    BACKGROUND RESEARCH IN UNDERSTANDING COILED-COIL PREDICTION ALGORITHMS
    COILS ALGORITHM
    PAIRCOILS ALGORITHM
    SOCKET ALGORITHM
    ZIP ALGORITHM - IDENTIFYING LEUCINE ZIPPERS
    STABLE INPUT ALGORITHM
CHAPTER: STABLE COIL ALGORITHM
    STABLE COIL ALGORITHM: PART I
    STABLE COIL ALGORITHM: PART II
    CLUSTER PATTERNS IN COILED-COILS
CHAPTER: PROJECT ARCHITECTURE
    4.1 DATABASE ARCHITECTURE
    STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE
    PROTEIN TABLE
    COILED-COIL TABLE
    PROTEIN COIL TABLE
    STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE
    SALT RESIDUES LOOKUP TABLE
    SALT BRIDGE TABLE
    COILED-COIL HEPTAD TABLE
    HEPTAD SALT TABLE
    SCRAPE FILE TABLE
    MATERIALIZED VIEWS
    AMINO ACID OCCURRENCES
    COIL LENGTH VS. CLUSTER PER COIL
    PERL CODE DESIGN
    WEBSITE ARCHITECTURE
    INDEX PAGE
    PROTEIN SEARCH PAGE
    COILED-COIL SEARCH PAGE
    COILED-COIL MOTIF SEARCHING
    COIL HEPTAD AND SALT BRIDGE SEARCH
    GENERATED REPORTS
CHAPTER: RESULTS
CHAPTER: CONCLUSION
CHAPTER: REFERENCES
APPENDIX A: CREATING MATERIALIZED VIEWS IN MYSQL
APPENDIX B: SQL QUERIES FOR CREATING MATERIALIZED VIEW
APPENDIX C: INSTALLATION AND PERFORMANCE OF THE PROJECT

LIST OF FIGURES

FIGURE 2-1: THE STRUCTURE OF PART OF A DNA DOUBLE HELIX. FIGURE OBTAINED FROM [23]
FIGURE 2-2: A GENERAL STRUCTURE OF α-AMINO ACID, WITH THE AMINO GROUP ON THE LEFT AND THE CARBOXYL GROUP ON THE RIGHT. FIGURE OBTAINED FROM [1]
FIGURE 2-3: A CONDENSATION REACTION BETWEEN TWO α-AMINO ACIDS RESULTING IN A PEPTIDE BOND. FIGURE OBTAINED FROM [1]
FIGURE 2-4: PHI AND PSI ANGLES
FIGURE 2-5: HYDROGEN BONDING BETWEEN AMINO ACIDS IN THE PROTEINS. FIGURE OBTAINED FROM [25]
FIGURE 2-6: A DEPICTION OF α-HELIX, MOST COMMONLY OCCURRING PROTEIN STRUCTURE IN COILED-COILS. FIGURE OBTAINED FROM [1]
FIGURE 2-7: A DEPICTION OF β-SHEET, IN ANTI-PARALLEL AND PARALLEL FORMATION. FIGURE OBTAINED FROM [2]
FIGURE 2-8: PROTEIN STRUCTURE, FROM PRIMARY TO QUATERNARY. FIGURE OBTAINED FROM [26]
FIGURE 2-9: CLASSIC EXAMPLE OF COILED-COIL GCN4 LEUCINE ZIPPER. FIGURE OBTAINED FROM [1]
FIGURE 2-10: POSITIONS OF AMINO ACIDS IN THE COILED-COIL. THE FIGURE HAS BEEN OBTAINED FROM [3]
FIGURE 2-11: CROSS-SECTIONAL VIEW OF A TWO-STRANDED COILED-COIL. HYDROPHOBIC AND ELECTROSTATIC INTERACTIONS BETWEEN TWO STRANDED α-HELICAL COILED-COILS FORMED BY THE HOMODIMERIZATION OF 35-RESIDUE POLYPEPTIDE CHAINS. ADAPTED FROM [7]
FIGURE 4-1: E-R DIAGRAM DETAILING THE RELATIONSHIP BETWEEN TBLPROTEIN AND TBLCOILEDCOIL
FIGURE 4-2: STRUCTURE OF PROTEIN TABLE (TBLPROTEIN)
FIGURE 4-3: STRUCTURE OF COILED-COIL TABLE (TBLCOILEDCOIL)
FIGURE 4-4: STRUCTURE OF PROTEIN COIL TABLE (TBLPROTEINCOIL)
FIGURE 4-5: E-R DIAGRAM DETAILING THE RELATIONSHIP BETWEEN TBLSALTBRIDGE AND TBLSPLITHEPTADCOILS
FIGURE 4-6: STRUCTURE OF SALT RESIDUES LOOKUP TABLE (TBLSALTRESIDUESLOOKUP)
FIGURE 4-7: STRUCTURE OF SALT BRIDGE TABLE (TBLSALTBRIDGE)
FIGURE 4-8: STRUCTURE OF COILED-COIL HEPTAD TABLE (TBLSPLITHEPTADCOIL)
FIGURE 4-9: STRUCTURE OF HEPTAD SALT BRIDGE TABLE (TBLHEPTADSALT)
FIGURE 4-10: STRUCTURE OF SCRAPE FILE TABLE (TBLSCRAPEFILE)
FIGURE 4-11: MATERIALIZED VIEW OF AMINO ACID OCCURRENCES (MATVIEW_AMINOACIDOCURRENCES)
FIGURE 4-12: MATERIALIZED VIEW OF COILED-COIL LENGTH VS. THE CLUSTER COUNT (MATVIEW_COILCLUSTERCOUNT)
FIGURE 4-13: PROCESS FLOW DIAGRAM FOR THE STABLE COIL ALGORITHM
FIGURE 4-14: INDEX PAGE OF THE STABLE COIL WEBSITE
FIGURE 4-15: PROTEIN RELATED SEARCH PAGE
FIGURE 4-16: COILED-COIL RELATED SEARCH PAGE
FIGURE 4-17: COILED-COIL MOTIF SEARCH WEB PAGE
FIGURE 4-18: COILED HEPTAD AND SALT BRIDGE SEARCH PAGE
FIGURE 5-1: COILED-COILS COUNT VS. COILED-COIL LENGTH
FIGURE 5-2: LOCATION OF OCCURRENCE OF AMINO ACID WITHIN THE COILED-COIL WHEN THE AMINO ACID IS AT HEPTAD OFFSET A
FIGURE 5-3: LOCATION OF OCCURRENCE OF AMINO ACID WITHIN THE COILED-COIL WHEN THE AMINO ACID IS AT HEPTAD OFFSET D
FIGURE 5-4: NORMALIZED VALUE OF DESTABILIZING CLUSTERS IN COILED-COILS OF PARTICULAR LENGTH. RESULTS OBTAINED BY DIVIDING THE TOTAL NUMBER OF COILED-COILS WITH DE-CLUSTERS BY THE TOTAL NUMBER OF DE-CLUSTERS
FIGURE 5-5: NORMALIZED VALUE OF STABILIZING CLUSTERS IN COILED-COILS OF PARTICULAR LENGTH. RESULTS OBTAINED BY DIVIDING THE TOTAL NUMBER OF COILED-COILS WITH CLUSTERS BY THE TOTAL NUMBER OF CLUSTERS
FIGURE 5-6: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 3 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-7: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 4 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-8: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 5 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-9: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 6 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-10: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 7+ WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-11: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 3 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-12: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 4 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-13: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 5 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-14: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 6 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-15: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 7+ WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-16: RELATIONSHIP OF AMINO ACIDS IN OFFSET A TO AN I TO I + 3 SALT BRIDGE
FIGURE 5-17: RELATIONSHIP OF AMINO ACIDS IN OFFSET A TO AN I TO I + 4 SALT BRIDGE
FIGURE 5-18: RELATIONSHIP OF AMINO ACIDS IN OFFSET A TO AN I TO I + 5 SALT BRIDGE
FIGURE 5-19: RELATIONSHIP OF AMINO ACIDS IN OFFSET D TO AN I TO I + 3 SALT BRIDGE
FIGURE 5-20: RELATIONSHIP OF AMINO ACIDS IN OFFSET D TO AN I TO I + 4 SALT BRIDGE
FIGURE 5-21: RELATIONSHIP OF AMINO ACIDS IN OFFSET D TO AN I TO I + 5 SALT BRIDGE

LIST OF TABLES

TABLE 2-1: TABLE OF STANDARD AMINO ACID ABBREVIATIONS AND SIDE CHAIN PROPERTIES
TABLE 3-1: HELICAL PROPENSITY AND STABILITY VALUES OF THE 20 STANDARD AMINO ACIDS AT VARIOUS POSITIONS IN THE HEPTAD
TABLE 3-2: COILED-COIL SEQUENCE STARTING AT OFFSET A
TABLE 3-3: COILED-COIL SEQUENCE STARTING AT OFFSET B
TABLE 3-4: AN AGGREGATION OF STABILITY VALUES 42 AMINO ACIDS AT A TIME
TABLE 3-5: DETERMINING THE PRESENCE OF A COILED-COIL IN THE PROTEIN SEQUENCE
TABLE 3-6: DETERMINING THE PRESENCE OF A CLUSTER (STABILIZING OR DE-STABILIZING) IN THE COILED-COIL SEQUENCE
TABLE 4-1: LIST OF SALT BRIDGES WHICH PROVIDE I TO I + 3, I TO I + 4 AND I TO I + 5 ELECTROSTATIC INTERACTIONS
TABLE 4-2: SEARCH PARAMETERS USED ON THE PROTEIN RELATED SEARCH PAGE
TABLE 4-3: SEARCH PARAMETERS USED ON THE COILED-COIL RELATED SEARCH PAGE
TABLE 4-4: SEARCH PARAMETERS USED ON THE COILED-COIL MOTIF SEARCH PAGE
TABLE 4-5: SEARCH PARAMETERS USED ON THE COIL HEPTAD AND SALT BRIDGE SEARCH PAGE
TABLE 5-1: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS A AND D WHICH FORM THE HYDROPHOBIC CORE
TABLE 5-2: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS D AND E
TABLE 5-3: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS G AND E USUALLY ASSOCIATED WITH ELECTROSTATIC ATTRACTION I TO I
TABLE 5-4: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS E AND G USUALLY ASSOCIATED WITH ELECTROSTATIC ATTRACTION I TO I
TABLE 5-5: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS G AND A
TABLE 5-6: TOP 30 AMINO ACID PAIR OCCURRENCES IN COILED-COILS
TABLE 5-7: TOP 30 FREQUENTLY OCCURRING AMINO ACIDS IN THE STABLE COIL DATABASE

Chapter 1 INTRODUCTION

The sequencing of the human genome, as well as the genomes of many other species, has opened a wide array of research fields within the discipline of molecular biology. One of the fastest growing fields is Proteomics, the study of proteins, protein structures, and the functions these proteins perform. Various research facilities are dedicated to this field of study, including the Peptide Chemistry Lab of Dr. Robert Hodges at the University of Colorado Health Sciences Center (UCHSC). The primary focus of this group of researchers is to understand the factors that affect the stability of proteins in general and the coiled-coil oligomerization domain in particular. These factors include, but are not limited to, the hydrophobic and hydrophilic interactions and the intrachain and interchain electrostatic interactions between the amino acids present in these coiled-coils. The ability to determine coiled-coil stability will greatly facilitate the prediction of coiled-coils in protein structures and will advance protein design. Because coiled-coils are the most commonly occurring oligomerization domain in nature, understanding the interactions within them can advance the study of proteomics as a whole.

The protein data available today is not only voluminous but also complex. In order to interpret results in a timely and inexpensive manner, it is necessary to create prediction algorithms, which act as precursors to the lab experiments. This project uses such an algorithm to explore two of the primary areas of interest being studied at UCHSC. First, this project uses an established prediction algorithm, the Stable Coil Algorithm, to determine the existence of coiled-coils programmatically, eliminating the need for any human intervention. The second part of the project revolves around finding out what kinds of interactions occur within coiled-coils.
Researchers have proposed that hydrophobic and electrostatic interactions are the primary forces that contribute to the stability of the coiled-coil; hence, it is necessary to learn as much as possible about these forces. Important areas of study include efforts to determine how the location of an amino acid in the heptad sequence affects coiled-coil stability and which amino acids hinder or aid that stability.

In order to accomplish these goals, this project presents researchers with tools to efficiently study coiled-coil stability. These tools revolve around a revamped Stable Coil database. The first rendition of the Stable Coil database used the Stable Coil Algorithm to predict the presence of coiled-coils [4]. However, the database had become corrupted. Furthermore, the database did not recognize updates to the raw data available on the ExPASy server¹. These issues, combined with poor query performance and the absence of error logging, made it necessary for this project to recreate the Stable Coil database as well as the Perl programs involved in data collection. The resulting database allows researchers to designate new sources of raw data for collection; the associated Perl programs then process the new data automatically. In addition to redesigning the original database, this project provides additional tables to facilitate research into the various factors affecting coiled-coil stability.

This database is freely available via the Stable Coil web interface. The website provides users with three basic search functionalities that can help them understand the various aspects of coiled-coil formations within proteins and the electrostatic interactions within those coiled-coils. A fourth functionality allows for complex motif searching within coiled-coils, thus providing users with information on the types and frequency of amino acid residues occurring within coiled-coils.
In addition, the website presents users with fourteen unique reports, each of which provides a different insight into the realm of coiled-coils. These reports return results ranging from cluster distribution across coiled-coil lengths to lists of amino acids frequently found in and around coiled-coils. We hope this project will provide researchers with an efficient way to determine the presence of coiled-coils in proteins and act as a learning tool to help users understand the varying complexities of coiled-coil structures in proteins.

¹ The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures.

Chapter 2 BACKGROUND RESEARCH

2.1 BACKGROUND RESEARCH IN UNDERSTANDING COILED-COILS

Deoxyribonucleic acid (DNA) contains the genetic instructions used in the development and functioning of all known living organisms. As the main role of DNA is long-term storage of information, it is often likened to a blueprint repository that is used to construct cell components such as proteins and RNA molecules. DNA is a double helix consisting of two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds. The two strands run in opposite directions to each other and are therefore called anti-parallel. Attached to each sugar is one of the four molecules known as bases, which encode the genetic information. This information is interpreted using the genetic code, which specifies the sequence of amino acids within a protein sequence.

Figure 2-1: The structure of part of a DNA double helix. Figure obtained from [23].

The process by which genetic information is decoded from the DNA and converted to a protein is known as Protein Biosynthesis. Protein Biosynthesis is a multi-step process consisting of two major steps: Transcription and Translation. Transcription is the process of synthesizing RNA under the direction of DNA. Both nucleic acid sequences use the same language, and the information is simply transcribed, or copied, from one molecule to the other. The DNA sequence is enzymatically copied by RNA polymerase to produce a complementary nucleotide RNA strand, called messenger RNA (mRNA), which then carries a genetic message from the DNA to the protein-synthesizing machinery of the cell. During translation, the mRNA sequence is used as a guide to synthesize a chain of amino acids into a protein sequence. During this process, the mRNA is decoded using specific genetic instructions. A transfer RNA (tRNA), which is a small RNA, then transfers a specific amino acid to the growing polypeptide chain at the ribosomal site of protein synthesis.

During and after protein synthesis, amino acid chains often fold to assume the tertiary and quaternary structures commonly associated with proteins. This process is known as protein folding. Many proteins undergo post-translational modifications, which extend the range of a protein's functions by attaching other biochemical functional groups to it or by forming disulfide bridges. The various types of protein structures and sub-structures formed play important roles in how a protein will function. In order to understand the functions a protein performs at a molecular level, it is necessary to understand the three dimensional protein structure. This constitutes the field of Proteomics.
Researchers employ techniques such as X-ray crystallography or NMR spectroscopy to determine the structure of proteins.

UNDERSTANDING PROTEIN STRUCTURE

Proteins are an important class of macromolecules present in all biological organisms. All proteins are polymers of the 20 standard α-amino acids listed in Table 2-1. Proteins fold into one or more specific spatial conformations, driven by a number of non-covalent interactions such as hydrogen bonding, ionic interactions, van der Waals forces and hydrophobic packing, so as to be able to perform their biological functions. In order to decipher protein folding, researchers require a fundamental understanding of the stability contributions of non-covalent stabilizing and destabilizing interactions. These interactions not only guide the initial hydrophobic collapse of a protein in an aqueous environment, but also provide the basis on which a protein assumes its overall structure. In biochemistry, protein structure is classified into four levels of hierarchy. These levels range from a single linear arrangement of amino acids to complex aggregate structures, and are described in detail below.

PRIMARY STRUCTURE

A protein's primary structure is the linear sequence of its amino acids. Most protein databases represent a protein in this linear form, as a list of the amino acids which constitute the protein. The sequence of the amino acids is unique to the protein and defines the structure and the function of the protein. In its elemental form, each amino acid in a protein is a four-part molecule starting with an amine group (-NH2), also known as the N-terminus, and ending with a carboxylate group (-COOH), known as the C-terminus. In between these termini lies the α-carbon atom (Cα). The Cα is bonded to an R-group and a hydrogen atom. Counting of residues always starts at the N-terminal end, which is the end where the amino group is not involved in a peptide bond.


Figure 2-2: A general structure of α-amino acid, with the amino group on the left and the carboxyl group on the right. Figure obtained from [1].

Two amino acids are joined by a peptide bond in a condensation reaction. By repeating this process over multiple amino acids, long chains can be generated. This reaction is catalyzed by ribosomes in the translation process. During the formation of the peptide bond, the OH of the carboxylate group of the first amino acid combines with the H of the amine group of the second amino acid to form water. Once the bond is formed, the two joined amino acids have only one amine group, or N-terminus, and one carboxylate group, or C-terminus, as shown in Figure 2-3.

Figure 2-3: A condensation reaction between two α-amino acids resulting in a peptide bond. Figure obtained from [1].

The three dimensional structure of the protein is controlled by the backbone dihedral angles around the Cα atom. The phi angle (φ) describes rotation about the bond between the Cα atom and the amine nitrogen. The psi angle (ψ) describes rotation about the bond between the Cα atom and the carboxylate carbon. Figure 2-4 illustrates the phi and the psi angles.

Figure 2-4: Phi and Psi angles

The R group in the amino acid is called the side chain. A side chain can vary from a single hydrogen atom in glycine through a methyl group in alanine to a large heterocyclic group in tryptophan [1]. The type and number of side chains in a protein influence its structure. The side chain determines whether the amino acid will be hydrophobic or hydrophilic, polar or non-polar. Table 2-1 lists the 20 standard amino acids and their relative polarity.

Amino Acid Name   One Letter Code   Three Letter Code   Category
Alanine           A                 Ala                 Non-polar Amino Acids (hydrophobic)
Cysteine          C                 Cys                 Polar Amino Acids (hydrophilic)
Aspartic Acid     D                 Asp                 Electrically Charged (negative and hydrophilic)
Glutamic Acid     E                 Glu                 Electrically Charged (negative and hydrophilic)
Phenylalanine     F                 Phe                 Non-polar Amino Acids (hydrophobic)
Glycine           G                 Gly                 Non-polar Amino Acids (hydrophobic)
Histidine         H                 His                 Electrically Charged (positive and hydrophilic)
Isoleucine        I                 Ile                 Non-polar Amino Acids (hydrophobic)
Lysine            K                 Lys                 Electrically Charged (positive and hydrophilic)
Leucine           L                 Leu                 Non-polar Amino Acids (hydrophobic)
Methionine        M                 Met                 Non-polar Amino Acids (hydrophobic)
Asparagine        N                 Asn                 Polar Amino Acids (hydrophilic)
Proline           P                 Pro                 Non-polar Amino Acids (hydrophobic)
Glutamine         Q                 Gln                 Polar Amino Acids (hydrophilic)
Arginine          R                 Arg                 Electrically Charged (positive and hydrophilic)
Serine            S                 Ser                 Polar Amino Acids (hydrophilic)
Threonine         T                 Thr                 Polar Amino Acids (hydrophilic)
Valine            V                 Val                 Non-polar Amino Acids (hydrophobic)
Tryptophan        W                 Trp                 Non-polar Amino Acids (hydrophobic)
Tyrosine          Y                 Tyr                 Polar Amino Acids (hydrophilic)
Unknown           X                 UNK                 Unknown Protein

Table 2-1: Table of standard amino acid abbreviations and side chain properties

Hydrophobic amino acids repel water and tend to be non-polar. They do not form hydrogen bonds with any ionic group. Water is electrically polarized, and hence is able to form hydrogen bonds internally. Because hydrophobic amino acids are not electrically polarized, water repels hydrophobes in favor of bonding with itself. This is true for all polar solvents, and it is this effect that causes the hydrophobic interaction. To prevent destabilization, hydrophobic amino acids tend to be buried in the center of the protein, away from the surrounding aqueous solution. For similar reasons, hydrophilic amino acids occur on the protein surface. The hydrophilic residues can be polar or electrically charged. The electric charge is positive if the side chain is basic and negative if the side chain is acidic. The bonds formed due to these interactions are also known as ionic bonds. The distribution of hydrophobic and hydrophilic amino acids in the protein determines the tertiary structure of the protein, and their physical location on the outside of the protein influences the quaternary structure by reducing the collective surface area and therefore the amount of water that can influence the protein structure.

Besides these amino acid characteristics, van der Waals forces and hydrogen bonding also help determine the protein structure. Van der Waals forces are the attractive and repulsive forces between atoms, molecules and surfaces. They differ from covalent bonds or ionic bonds in that they are caused by the fluctuating polarizations of nearby particles.
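The classification in Table 2-1 lends itself to a simple lookup. The following is a minimal sketch in Python, not part of the project's Perl code: the names SIDE_CHAIN_CATEGORY, classify, and hydrophobic_fraction are hypothetical, and the category labels are shortened forms of the Table 2-1 entries. It numbers residues from the N-terminus, as described for the primary structure.

```python
# Side-chain categories transcribed from Table 2-1 (one-letter code -> category).
# The short labels are an illustrative simplification of the table's wording.
SIDE_CHAIN_CATEGORY = {
    "A": "non-polar", "C": "polar",     "D": "negative",  "E": "negative",
    "F": "non-polar", "G": "non-polar", "H": "positive",  "I": "non-polar",
    "K": "positive",  "L": "non-polar", "M": "non-polar", "N": "polar",
    "P": "non-polar", "Q": "polar",     "R": "positive",  "S": "polar",
    "T": "polar",     "V": "non-polar", "W": "non-polar", "Y": "polar",
    "X": "unknown",
}


def classify(sequence):
    """Return (position, residue, category) tuples.

    Positions are 1-based and counted from the N-terminus, i.e. from the
    first character of the one-letter sequence string.
    """
    return [
        (position, residue, SIDE_CHAIN_CATEGORY.get(residue, "unknown"))
        for position, residue in enumerate(sequence.upper(), start=1)
    ]


def hydrophobic_fraction(sequence):
    """Fraction of residues that Table 2-1 lists as non-polar (hydrophobic)."""
    if not sequence:
        return 0.0
    categories = [category for _, _, category in classify(sequence)]
    return categories.count("non-polar") / len(categories)
```

For instance, under these assumptions classify("LKE") labels leucine as non-polar, lysine as positive, and glutamic acid as negative, which matches the corresponding rows of Table 2-1.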
Hydrogen bonding is another intermolecular force that affects protein structure, characterized by the presence of a hydrogen atom in the intermolecular bond. This hydrogen is chemically bound to one molecule, the proton donor, and interacts with the other, the proton acceptor. Figure 2-5 depicts hydrogen bond formation in a water dimer.

Figure 2-5: Hydrogen bonding between amino acids in the proteins. Figure obtained from [25]

In this figure, the water molecule on the right is the proton donor while the water molecule on the left is the proton acceptor. The donated hydrogen is typically covalently bonded to an electronegative atom, oxygen in this case. Thus, the result of this bonding is a dimer which has relatively large dipole-dipole forces. In a protein, hydrogen bonding interactions contribute to the secondary structure.

SECONDARY STRUCTURE

Due to the interactions between the chemical groups in amino acids, mediated by hydrogen bonds, a few characteristic patterns occur within folded proteins. These recurring shapes describe the secondary structure of a protein. Their repeated occurrence renders a protein stable. In 1983, Kabsch and Sander [8] compiled a listing of the secondary structures found in proteins with a known 3D structure. The DSSP (Dictionary of Protein Secondary Structure) code they proposed is frequently used to describe secondary protein structures with single letter codes. The most commonly occurring secondary structures in a protein include, but are not limited to, the α-helix, the β-sheet and the β-turn.

The α-helix is the most commonly occurring secondary structure in a protein. It is a right-handed coil formation resembling a spring, in which every backbone N-H group donates a hydrogen bond to the backbone C=O group of the amino acid four residues behind it (i + 4 to i hydrogen bonding). Each amino acid corresponds to a 100° turn in the helix, which means that the α-helix has 3.6 residues per turn (360° / 100° = 3.6). For example, a helix 36 amino acids long would form 10 turns. The α-helix is tightly packed, leaving almost no free space within the helix. The amino acid side chains are on the outside of the helix, pointing roughly downwards.

Figure 2-6: A depiction of α-helix, the most commonly occurring protein structure in coiled-coils. Figure obtained from [1].

The β-sheet is another form of protein secondary structure, formed by β-strands connected laterally by three or more hydrogen bonds into a generally twisted, pleated sheet [2]. In other words, a β-sheet is an extended conformation of amino acids in a zig-zag manner. In a β-sheet, hydrogen bonding occurs between C=O and N-H groups of two or more β-strands.
This is in contrast to the α-helix, where all hydrogen bonds involve the same element of the secondary structure. These hydrogen bonds can occur among adjacent β-strands in anti-parallel, parallel, or mixed arrangements. In an anti-parallel arrangement, the successive β-strands run in opposite directions; thus, the C-terminus of one β-strand is adjacent to the N-terminus of the next β-strand. In a parallel arrangement, all the N-termini of these strands are oriented in the same direction. An individual strand may also exhibit a mixed hydrogen bonding pattern, with a parallel strand on one side and an anti-parallel strand on the other. These structures are depicted in Figure 2-7.

Figure 2-7: A depiction of β-sheet, in anti-parallel and parallel formation. Figure obtained from [2].

The third type of secondary structure, the β-turn, is characterized by hydrogen bonds in which the acceptor, meaning the main chain carboxyl oxygen (C=O), and the donor, meaning the main chain amine group (N-H), are separated by three residues (i to i + 3 hydrogen bonding). Turns are important secondary structures in proteins and occur abundantly on the surface of the protein molecule. They are distinguished by the hydrogen bonding in the i, i + 1, i + 2, and i + 3 residues. Helical regions are excluded from this definition, while turns between β-strands form a special class of turns known as the hairpin [10]. A β-hairpin connects two hydrogen-bonded anti-parallel β-strands. Turns can also connect two regular secondary structure elements that do not interact to form what is known as

diverging turns.

Amino acids vary in their ability to form secondary structures. Proline and Glycine, which are known as helix breakers, have unusual conformational properties and are commonly found in turns. The amino acids that most commonly adopt helical conformations include Methionine, Alanine, Leucine, Glutamate, and Lysine. The bigger amino acids, in contrast, prefer to adopt a β-sheet conformation.

TERTIARY STRUCTURE

The tertiary structure is the three dimensional arrangement of a protein, which develops largely because of the variety of side chains among its amino acids. The tertiary structure of a protein is largely determined by the sequence of amino acids in the protein and the interactions that occur among their side chains. As a result of these side chain interactions, the protein may have a number of folds, bends, and loops, thus assuming its final three dimensional structure. There are four types of side chain bonding interactions: disulfide bonds, hydrogen bonding, salt bridges, and non-polar hydrophobic bonding. Disulfide bonds are the only covalent bonds among these and are formed during oxidation of the sulfhydryl groups on Cysteine (C). The hydrogen bonding between side chains occurs mainly between two alcohols, between an alcohol and an acid, or between two acids. Salt bridges are ionic interactions resulting from the neutralization of an acid and an amine on the side chains. Any combination of acid and amine groups in the side chains will have this interaction. The salt bridges contribute towards the strengthening of the helix. The hydrophobic interactions are the most important factors contributing to the stability of the protein. As discussed under the primary structure, these interactions follow the simple solubility rule that like dissolves like. The hydrophobic components will repel water or any polar solvent, in turn forming strong bonds with other hydrophobic elements.
In many cases this causes the hydrophobic side chains to be buried in the centre of the protein and the hydrophilic residues to be exposed on the surface of the protein.

QUATERNARY STRUCTURE

Many large proteins consist of multiple polypeptide chains, sometimes known as protein subunits. In addition to the tertiary structure of these subunits, these large proteins also possess a quaternary structure. These large proteins are, in essence, polymers. The most common examples of proteins with quaternary structure are hemoglobin and DNA polymerase. Changes in the quaternary structure can occur through conformational changes in the underlying subunits or through the orientation of the subunits relative to each other. The forces that affect the tertiary structure of the protein also affect the quaternary structure. The different protein structures discussed above are pictorially represented in Figure 2-8.

Figure 2-8: Protein structure, from primary to quaternary. Figure obtained from [26].

Now that we understand the different levels of hierarchy in the structural formation of a protein, we can better understand coiled-coils, which form the basis of this Master's Project. The Stable Coil database is built for predicting the α-helical motifs with the ability to form α-helical coiled-coil motifs, and here we take a deeper look into coiled-coils and the importance of studying them.

2.1.2 COILED-COILS

Many proteins are involved in important biological functions. Kinesin is a protein which transports cellular components between cells, while myosin is a fundamental protein used in muscle contractions. Both of these proteins perform these functions due to the ability of the coiled-coil to uncoil, allowing the unattached heads to move. A coiled-coil is a structural motif in which two or more α-helices are coiled together like strands of a rope. α-helical structures are abundant in proteins. This project focuses on what is perhaps one of the most commonly occurring dimerization motifs in nature, the two stranded α-helical coiled-coil. This structure consists of two amphipathic, right handed α-helices that adopt a left handed super coil analogous to a two stranded rope, where the non-polar face of the first α-helix is continually adjacent to that of the other helix [16], as shown in Figure 2-9.

Figure 2-9: Classic example of a coiled-coil, the GCN4 leucine zipper. Figure obtained from [1].

The two stranded coiled-coil is an ideal model for coiled-coil studies because of its rod-like structure, which makes protein folding a one dimensional problem, thereby removing much of the complexity found in globular proteins. Coiled-coils are characterized by hydrophobic amino acids at every third and fourth residue within their sequence. They are distinguished by a heptad repeat defined as abcdefg, where positions a and d are the hydrophobic amino acids responsible for the formation and stability of the coiled-coil. Shown below is an example of a coiled-coil alongside its heptad repeat.

Figure 2-10: Positions of amino acids in the coiled-coil. Figure obtained from [3].

The hydrophobic residues occur at positions a, d, a′, and d′ and are indicated in red. These patterns repeat every 3.5 residues; thus it takes less than two full heptads for the coiled-coil to turn twice, as indicated in the figure above. The hydrophobic residues are buried in the center, away from the surrounding aqueous solution, while the hydrophilic residues are exposed to the surface. These hydrophobic interactions provide stability to the coiled-coil by aiding in inter- and intra-helical interactions.

Various researchers over the years, including [7] [12] [18], have shown that not only the hydrophobic heptad repeats but also the inter-helical and intra-helical electrostatic interactions between amino acids contribute to the formation and the stability of the coiled-coil structure. A schematic representation of two-stranded, α-helical coiled-coils, with all the hydrophobic and electrostatic interactions, is shown below in Figure 2-11.

Figure 2-11: Cross-sectional view of a two-stranded coiled-coil. Hydrophobic and electrostatic interactions between two stranded α-helical coiled-coils formed by the homodimerization of 35-residue polypeptide chains. Adapted from [7].

Figure 2-11 uses the letters a to g and a′ to g′ to designate the positions of the heptad repeat on the two chains. As discussed earlier, the hydrophobic residues interact at a and a′ and at d and d′, indicated by open arrows. Electrostatic interactions can occur between b and e (b′ and e′), indicating intrachain i to i + 3 interactions, between e and b (e′ and b′), indicating intrachain i to i + 4 interactions (dashed arrows), or between g and e′ (g′ and e), indicating interchain i to i′ + 5 interactions (solid arrows) [7]. These interactions can be attractions between the amino acid residues (salt residues) at these positions, or they can be repulsions, which respectively add to or subtract from the overall stability of the coiled-coil.

Coiled-coil prediction is an important goal pursued in bioinformatics and theoretical chemistry. Its aim is the prediction of the three-dimensional structure of proteins from their amino acid sequences, sometimes including additional relevant information such as the structures of related proteins. In other words, the goal is to predict a protein's tertiary/quaternary structure from its primary structure. A number of algorithms have been created to predict coiled-coils. Most of the current algorithms use a statistical approach, in which they compare newly discovered proteins to existing ones and determine the probability of a coiled-coil being present.

The University of Colorado at Colorado Springs, in conjunction with the Department of Biochemistry and Molecular Genetics at the University of Colorado Health Sciences Center, has built a protein database of proteins that contain coiled-coil motifs and their stability clusters as determined by the Stable Coil Algorithm [3][4]. This algorithm is based on the stability of the structure determined by the amino acids present in the protein sequence and the structural position of those amino acids.
The algorithms mentioned above are described in detail in the following section.
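The heptad positions and the i to i + 3, i to i + 4 and i to i′ + 5 pairings described for Figure 2-11 can be sketched in a few lines of Python. This is only an illustration, not the project's Perl code: the function names are hypothetical, the chains are assumed to be an in-register homodimer (so residue j on the partner chain equals residue j on the same chain), and the acidic (D, E) and basic (K, R) sets are an assumed stand-in for the list of salt residues supplied by the UCHSC researchers.

```python
HEPTAD = "abcdefg"
ACIDIC = set("DE")  # assumption: acidic salt residues
BASIC = set("KR")   # assumption: basic salt residues

# (first position, second position, spacing, kind), as described for Figure 2-11
PAIR_RULES = [
    ("b", "e", 3, "intrachain"),
    ("e", "b", 4, "intrachain"),
    ("g", "e", 5, "interchain"),
]


def heptad_positions(sequence, start="a"):
    """Label every residue with its heptad position, given the starting register."""
    offset = HEPTAD.index(start)
    return [HEPTAD[(offset + i) % 7] for i in range(len(sequence))]


def classify(res1, res2):
    """Return 'attraction', 'repulsion', or None if either residue is uncharged."""
    charged = ACIDIC | BASIC
    if res1 not in charged or res2 not in charged:
        return None
    same_sign = (res1 in ACIDIC) == (res2 in ACIDIC)
    return "repulsion" if same_sign else "attraction"


def electrostatic_pairs(sequence, start="a"):
    """List (i, j, positions, kind, effect) for every charged pair at the rule spacings."""
    positions = heptad_positions(sequence, start)
    found = []
    for i, (residue, pos) in enumerate(zip(sequence, positions)):
        for first, second, spacing, kind in PAIR_RULES:
            j = i + spacing
            if pos != first or j >= len(sequence):
                continue
            # In an in-register homodimer, residue j of the partner chain
            # is the same amino acid as residue j of this chain.
            effect = classify(residue, sequence[j])
            if effect:
                found.append((i, j, first + "-" + second, kind, effect))
    return found
```

For a sequence starting at position a with a Glutamic Acid at the first g position and a Lysine at the following e position, `electrostatic_pairs` reports a single interchain g-e attraction.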

BACKGROUND RESEARCH IN UNDERSTANDING COILED-COIL PREDICTION ALGORITHMS

From the previous discussion, it can be concluded that it is essential not only to determine the existence of coiled-coils in proteins but also to determine how the stability of the coiled-coil can be affected by certain amino acid residues at certain positions along the coiled-coil. Traditionally, the three dimensional structure of a coiled-coil has been determined by X-ray crystallography and NMR spectroscopy. These methods are both very expensive and very time consuming. Furthermore, it is highly improbable for a single group of researchers to apply these methods to all the naturally occurring proteins known to man. A better way to approach this problem is through the use of predictive algorithms, which provide researchers with answers to questions like: Which proteins are more likely to contain coiled-coils? Which proteins are more or less stable due to the presence or absence of a hydrophobic residue or an electrostatic attraction or repulsion?

Protein structure analysis was thus born out of the desire to determine protein characteristics without conducting laboratory experiments or using crystallography. Processes based on protein statistics and past experiments were generalized to create methods and algorithms which provide insights into a given protein's structure and/or stability. This section presents some of the predictive algorithms created to catalog coiled-coils present in proteins.

COILS ALGORITHM

The COILS Algorithm [21], an enhanced version of the Lupas Algorithm [20], was developed by Andrei Lupas. COILS is a program which compares the given amino acid sequence to a database of known parallel two stranded coiled-coils. The comparison yields a similarity score which is then compared with the distribution of the scores in coiled-coil proteins. Thus the program calculates a probability that a sequence will adopt a coiled-coil conformation. The similarity scores are calculated by comparing against two different matrices:

- MTK is a matrix derived from the sequences of myosins, tropomyosins and keratins (intermediate filaments type I and II).
- MTIDK is a newer matrix derived from myosins, paramyosins, tropomyosins, intermediate filaments types I - V, desmosomal proteins, and kinesins, calculated by weighting the residue frequencies of the different protein families.

Although using the MTIDK matrix results in a 20-30% drop in the number of false positives generated by the prediction algorithm, the results are still biased towards hydrophobic, hydrophilic charged residues. The program also produces a fair amount of statistical noise as the window width decreases.

PAIRCOILS ALGORITHM

PAIRCOILS [15] classifies coiled-coils using a statistical approach and utilizes a matrix similar to the MTIDK matrix used in the COILS Algorithm. The matrix used in PAIRCOILS contains all known coiled-coil sequences, extracted from the GENpept database [13]. Instead of comparing the entire sequence to the database, as is the case with COILS, PAIRCOILS determines conditional probabilities that two amino acids are found in any two heptad positions. These frequencies are then normalized and used to determine the probability that a certain pair of amino acids appears at a given heptad repeat. The probability cutoff determines how stringently the data will be scrutinized in detecting the existence of a coiled-coil domain. Although the PAIRCOILS Algorithm successfully predicts coiled-coils, reducing the number of false positives by using a scoring method based on pairwise probabilities, it is marred by the same problem as COILS: a large amount of statistical noise is present in the data as the probability cutoff increases. The PAIRCOILS Algorithm was extended to become what is known as the MULTICOIL Algorithm [22], in order to identify three stranded coiled-coils as well. The accuracy of the statistical results was limited, as the MULTICOIL Algorithm was run against a small subset of proteins.

SOCKET ALGORITHM

The SOCKET [19] program finds the Knobs-into-Holes mode of packing between alpha-helices which is characteristic of coiled-coils. It unambiguously defines the beginning and end of coiled-coil motifs in protein structures and assigns a heptad register to the sequence. Specifically, the purposes of SOCKET are:

- To objectively and unambiguously define the location of a coiled-coil motif in a protein structure, so that its sequence can be used to test new coiled-coil prediction algorithms and benchmark existing ones.
- To automatically collect statistics on frequencies of amino acids at each of the heptad positions (abcdefg) of the sequence/structure motif. Such data are useful for training computer programs that predict coiled-coils from primary structure, and for providing insights into new design rules.
- To highlight unusual assemblies of alpha-helices that go beyond the traditional coiled-coil. Again, it is hoped that design principles founded on knobs-into-holes packing between alpha-helices will enable the creation of novel and useful protein assemblies.

2ZIP ALGORITHM - IDENTIFYING LEUCINE ZIPPERS

In order to implement the 2ZIP Algorithm, the TRESPASSER Algorithm is first used to extract from the SWISS-PROT [10] database only those sequences that contain annotated leucine zippers, leucine-like zippers, and non-leucine zippers. TRESPASSER is the algorithm of choice for this extraction, as it has been reported to predict leucine zippers with high reliability. Once this extraction is complete, the 2ZIP Algorithm [6] determines the two general classes of the leucine zipper, strict and relaxed. The strict zipper is distinguished by the occurrence of at least five leucine residues at four heptad repeats. A relaxed zipper occurs where Leu in any of the five positions is replaced with Met, Val or Ile.

The results of the 2ZIP Algorithm show that the annotated proteins in the SWISS-PROT database do not really follow a strict or relaxed definition of the leucine zipper, as had been hypothesized, because a large number of false positives were generated. However, the algorithm does demonstrate, based on the appearance of the leucine zippers in DNA binding basic region (bZIP) and helix-loop-helix (bHLH-ZIP) proteins, both of which have coiled-coil characteristics, that the presence of a leucine zipper is the hallmark of the coiled-coil itself rather than of the leucine repeat.

STABLE INPUT ALGORITHM

All coiled-coil prediction algorithms mentioned thus far are based on statistical probability. The Stable Input Algorithm [3] is the first algorithm created to determine the presence of coiled-coils using the experimentally determined stability and helical propensity values of the various amino acids present in the protein sequence. This algorithm also provides stability clusters of amino acids based on the varying number of residues at the a and d positions in the coiled-coil. The SWISS-PROT database was used as the source data for this algorithm. This algorithm is the precursor to the Stable Coil Algorithm implemented in this project.

Once coiled-coils are extracted from the SWISS-PROT proteins, the Stable Input Algorithm uses a windowing function over which to calculate the relative stability of the coiled-coil. When researchers tested this algorithm using window widths of 7 and 11, their results yielded some interesting observations concerning the stability of coiled-coils. According to these results, hydrophobic amino acids occupy the hydrophobic a and d positions on average 65% of the time for the SWISS-PROT dataset. As each hydrophobic core is added to the sequence length, the number of hydrophobic clusters decreases by a factor of 2, while the number of non-hydrophobic clusters decreases by a factor of 8. Also, the cluster frequency decreases as the heptad length increases.

The Stable Input Algorithm does not evaluate the intermediate positions in the coiled-coil as strictly as it does the start and end positions. The result is that clusters are missed about 70% of the time. Also, researchers found it difficult to compare results from different sequences or perform quantitative queries, as the results were not stored in a database. Furthermore, they found that the algorithm is more susceptible to false positives; this can be attributed to the shortness of the window lengths and the method by which the stability values were assigned to each amino acid.

It is interesting to observe that although all of these algorithms predict coiled-coils in proteins, either by using statistical approaches or by using stability values, none of them store this data to allow users to perform customized searches. All of the above algorithms require the user to enter a protein sequence or a file in a certain format to produce the desired results. This approach is both inconvenient and time consuming, particularly if users want to run large sets of data.

The initial emphasis of this project is to retrieve additional information from the SWISS-PROT database concerning the electrostatic interactions among the amino acids within these coiled-coils. The scope of this project also entails helping researchers study in detail the role various amino acid residues play in the hydrophobic core of the coiled-coils. Finally, the project will try to improve the performance, accuracy, and user friendliness of the first rendition of the Stable Coil Algorithm, described in the next chapter. Hence, this project was undertaken to provide the researchers at UCHSC with a bigger, more readily accessible dataset of coiled-coils. The architecture and implementation of the database and the website are described in detail later in this document.

Chapter 3

STABLE COIL ALGORITHM

The researchers at UCHSC have used a model protein, consisting of two identical 38 residue polypeptide chains covalently linked at their N termini via a disulfide bridge, to determine the effects that substituting different amino acids in a coiled-coil sequence may have on the coiled-coil stability. This work forms the basis for the design of new coiled-coil structures, allows better understanding of the structural relationships between amino acids in a protein sequence, and also provides impetus for the design of new algorithms to predict the presence of coiled-coils within native protein sequences. The study of the coiled-coil domain has a number of advantages. These advantages are best detailed by [5]:

- Abundant motif in proteins
- Only one type of secondary structure is present, i.e., the α-helix
- Only two interacting α-helices are required to introduce tertiary and quaternary structure
- Diversity in length makes it an ideal system to test predictions
- All non-covalent interactions that stabilize the three-dimensional structure of proteins are found in the coiled-coil domain
- Experimentally easy to analyze structure and stability

To understand proteins and the functions they perform, it is necessary to predict the occurrence of a coiled-coil before performing expensive and time consuming experiments. Hence the researchers at UCHSC have experimentally derived stability values for the twenty amino acids in their different heptad positions, as listed in Table 3-1.

Amino Acid Name | One Letter Code | Three Letter Code | Stability Value at Offset A | Stability Value at Offset D | Stability Value at Other Positions
Alanine | A | Ala | | |
Cysteine | C | Cys | | |
Aspartic Acid | D | Asp | | |
Glutamic Acid | E | Glu | | |
Phenylalanine | F | Phe | | |
Glycine | G | Gly | | |
Histidine | H | His | | |
Isoleucine | I | Ile | | |
Lysine | K | Lys | | |
Leucine | L | Leu | | |
Methionine | M | Met | | |
Asparagine | N | Asn | | |
Proline | P | Pro | | |
Glutamine | Q | Gln | | |
Arginine | R | Arg | | |
Serine | S | Ser | | |
Threonine | T | Thr | | |
Valine | V | Val | | |
Tryptophan | W | Trp | | |
Tyrosine | Y | Tyr | | |

Table 3-1: Helical Propensity and Stability Values of the 20 standard amino acids at various positions in the heptad

Using these stability values as its inputs, the Stable Coil Algorithm determines the presence of a coiled-coil and the offset at which this coiled-coil occurs in a given protein. Thus, it can be said that the Stable Coil Algorithm is based on the structural stability of the coiled-coil region. The goal of this project is to provide researchers with enough data to perform quantitative analysis on a set of proteins and coiled-coils. The section below describes the workings of the Stable Coil Algorithm.

STABLE COIL ALGORITHM: PART I

PROBLEM: Calculate the stability arrays for each protein in the Stable Coil database.

INPUTS:
  The protein sequence (protein_array)
  The window size (window_size)
  The array of stability coefficients of amino acids depending on their heptad locations (stability_coefficients)

OUTPUTS: The seven scoring arrays containing the stability scores for an individual protein, where each array starts at a different heptad offset a through g (score_array)

ALGORITHM:
 1  FOR heptad_offset ← a to g
 2    DO local_offset ← heptad_offset
 3    FOR i ← 1 to length(protein_array)
 4      DO stability_array[heptad_offset][i] ← stability_coefficients[local_offset][protein_array[i]]
 5      IF local_offset = g
 6        THEN local_offset ← a
 7        ELSE local_offset ← local_offset + 1
 8  FOR heptad_offset ← a to g
 9    DO FOR i ← 1 to length(protein_array)
10      IF length(protein_array) - i > window_size
11        THEN score_array[heptad_offset][i] ← sum of stability_array[heptad_offset][i .. i + window_size - 1]
12        ELSE score_array[heptad_offset][i] ← sum of stability_array[heptad_offset][i .. length(protein_array)]
13  RETURN score_array
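As a rough illustration of Part I, the pseudocode above could be written in Python as follows. This is a sketch under stated assumptions rather than the project's Perl implementation: the function names are hypothetical, and `TOY_COEFFICIENTS` is an invented two-residue table used only to make the example runnable; it does not contain the experimentally derived values of Table 3-1.

```python
HEPTAD = "abcdefg"

# Hypothetical toy coefficients for illustration only; NOT the Table 3-1 values.
# Leucine scores 1 at the a and d positions, everything else scores 0.
TOY_COEFFICIENTS = {
    pos: {"L": 1 if pos in "ad" else 0, "A": 0} for pos in HEPTAD
}


def stability_arrays(protein, coefficients):
    """Steps 1-7: one stability array per starting heptad offset a..g.

    coefficients[position][residue] gives the stability value of a residue
    at a heptad position.
    """
    arrays = {}
    for start in range(7):
        arrays[HEPTAD[start]] = [
            coefficients[HEPTAD[(start + i) % 7]][residue]
            for i, residue in enumerate(protein)
        ]
    return arrays


def score_arrays(protein, coefficients, window_size=42):
    """Steps 8-13: sum each stability array over a sliding window.

    Python slices stop at the end of the list, so windows near the end of
    the sequence are summed over the residues that remain.
    """
    scores = {}
    for offset, values in stability_arrays(protein, coefficients).items():
        scores[offset] = [
            sum(values[i:i + window_size]) for i in range(len(values))
        ]
    return scores
```

The default `window_size=42` follows the window length quoted in this chapter; a smaller window is convenient for experimenting with short toy sequences.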

STABLE COIL ALGORITHM: PART II

PROBLEM: Determine the presence of a coiled-coil in a protein.

INPUTS:
  The cutoff value (cutoff_value)
  The seven scoring arrays determined in Part I (score_array)

OUTPUTS: The coiled-coil count array containing the coiled-coils present in the protein sequence (coiled_coil_array)

ALGORITHM:
 1  FOR heptad_offset ← a to g
 2    DO local_offset ← heptad_offset
 3    IF score_array[heptad_offset][i] >= cutoff_value
 4      THEN marker_array[heptad_offset][i] ← 1
 5      ELSE marker_array[heptad_offset][i] ← 0
 6  counter ← 0
 7  IF marker_array contains 42 or more consecutive ones
 8    THEN coiled_coil_array[counter] ← amino acids corresponding to the marker sequence
 9      counter ← counter + 1
10  RETURN coiled_coil_array
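Part II can be sketched in Python in the same spirit. The function name and the tuple layout of the result are assumptions made for illustration; the defaults of 38 and 42 are the cutoff value and minimum run length quoted in this chapter. The marker array is not stored explicitly here: a run of positions scoring at or above the cutoff is tracked directly.

```python
def find_coiled_coils(protein, scores, cutoff_value=38, min_run=42):
    """Return (heptad_offset, start_index, residues) for every predicted coiled-coil.

    scores maps each heptad offset to its scoring array from Part I.
    Indices are 0-based, unlike the 1-based pseudocode.
    """
    found = []
    for offset, values in scores.items():
        run_start = None
        # A trailing sentinel below any cutoff closes a run that reaches
        # the end of the sequence.
        for i, value in enumerate(list(values) + [float("-inf")]):
            if value >= cutoff_value:
                if run_start is None:
                    run_start = i  # first marker "1" of a new run
            elif run_start is not None:
                if i - run_start >= min_run:
                    found.append((offset, run_start, protein[run_start:i]))
                run_start = None
    return found
```

Lowering `cutoff_value` and `min_run` makes it easy to check the behaviour by hand on a short sequence before running it with the chapter's values.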

In steps 1 to 7 of PART I of the algorithm, we produce seven permutations of the stability values, each starting at a different heptad offset. Initially the protein sequence is assumed to start at heptad offset a. Then we assign a stability value to each amino acid in the protein sequence depending on its heptad position. The stability coefficients required for this are obtained from Table 3-1, which was experimentally determined by researchers at UCHSC. This process is repeated seven times, each time with the protein sequence starting at a different heptad offset (a, b, c, d, e, f and g). The resulting output of these steps is seven stability arrays (stability_array), each starting at a different heptad offset.

Example:

Heptad Offset Position     a  b  c  d  e  f  g
Sequence Of Amino Acids    M  D  Y  L  D  L  G
Stability Values

Table 3-2: Coiled-coil Sequence starting at offset a

Heptad Offset Position     b  c  d  e  f  g  a
Sequence Of Amino Acids    M  D  Y  L  D  L  G
Stability Values

Table 3-3: Coiled-coil Sequence starting at offset b

To detect the presence of a coiled-coil, we use two experimentally determined values, the cutoff value of 38 and the window length of 42. These values were found experimentally at UCHSC to be the best for predicting coiled-coils. The next step is to calculate seven scoring arrays obtained by aggregating the stability arrays. This is where we use the window length. The aggregation is performed for 42 residues at a time. If fewer than 42 residues are left, we just aggregate the values up to the end of the sequence. This provides us with seven arrays, known as the scoring arrays (score_array).

Example:

Heptad Offset Position     a  b  c  d  e  f  g
Sequence Of Amino Acids    M  D  Y  L  D  L  G
Stability Values
Scoring Arrays

Table 3-4: An aggregation of stability values 42 amino acids at a time

The next experimentally determined value, the cutoff value of 38, is used here. For each amino acid position in the protein sequence, if the aggregate score is greater than or equal to 38 we mark that position as 1; otherwise we mark it as 0, thus generating a marker_array. Then, we look for the occurrence of 42 or more consecutive 1's in the marker_array. If we find this pattern, we predict the presence of a coiled-coil, with the starting location of the pattern as the starting heptad offset of the coiled-coil. Only coiled-coils with 42 or more residues are considered for this project, as researchers at UCHSC are interested in cluster patterns found in large coiled-coils.

Example:

Heptad Offset Position     a  b  c  d  e  f  g
Sequence Of Amino Acids    M  D  Y  L  D  L  G
Scoring Arrays
Coiled-coil Arrays

Table 3-5: Determining the presence of a coiled-coil in the protein sequence

CLUSTER PATTERNS IN COILED-COILS

Once we have predicted an occurrence of the coiled-coil, the presence of a cluster can be determined by the particular hydrophobic residues occurring at the a and d positions. If one of the hydrophobic amino acids Phenylalanine, Isoleucine, Leucine, Methionine, Valine, or Tyrosine is found in an a or d position, then the cluster sequence gets a 1 for that position; otherwise it gets a 0.

Example:

Heptad Offset Position     d  e  f  g  a  b  c  d  e  f  g  a  b  c
Sequence Of Amino Acids    L  S  T  R  I  Y  M  V  Q  P  N  L  G  P
Cluster Numbering          1           1        1           1

Table 3-6: Determining the presence of a cluster (stabilizing or de-stabilizing) in the coiled-coil sequence

For this sequence, the cluster pattern would be 1111, as all the amino acids in the a and d positions are hydrophobic. The occurrence of three or more consecutive 1's is classified as a stabilizing cluster, while the occurrence of three or more consecutive 0's is termed a destabilizing cluster, indicating regions of lower stability and greater flexibility. A destabilizing cluster in a coiled-coil consists of the following amino acid residues: Alanine, Cysteine, Aspartic Acid, Glutamic Acid, Glycine, Histidine, Lysine, Asparagine, Proline, Glutamine, Arginine, Serine, Threonine, and Tryptophan.

The researchers at UCHSC are interested in cluster patterns of coiled-coils because, as their names suggest, stabilizing clusters aid the stability of the coiled-coil and destabilizing clusters hinder it. The Stable Coil database currently lists clusters and declusters of varying lengths (3, 4, 5, 6, and 6plus). There are also a number of summary queries based on these results that help the researchers further understand the relationships between clusters and the stability of the coiled-coil.
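The cluster numbering described above can be sketched as follows. The hydrophobic set is the one listed in this section; the function names and the shape of the returned values are hypothetical choices made for this illustration, not the project's implementation.

```python
from itertools import groupby

HEPTAD = "abcdefg"
HYDROPHOBIC = set("FILMVY")  # Phe, Ile, Leu, Met, Val, Tyr, as listed above


def cluster_pattern(coil, start="a"):
    """Return a string of 1s and 0s, one character per a or d position."""
    offset = HEPTAD.index(start)
    return "".join(
        "1" if residue in HYDROPHOBIC else "0"
        for i, residue in enumerate(coil)
        if HEPTAD[(offset + i) % 7] in "ad"
    )


def clusters(pattern, min_length=3):
    """List (kind, length) for every run of at least min_length equal flags."""
    found = []
    for flag, run in groupby(pattern):
        length = len(list(run))
        if length >= min_length:
            kind = "stabilizing" if flag == "1" else "destabilizing"
            found.append((kind, length))
    return found
```

Applied to the Table 3-6 sequence, `cluster_pattern("LSTRIYMVQPNLGP", start="d")` returns `"1111"`, which `clusters` classifies as one stabilizing cluster of length 4.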

Chapter 4

PROJECT ARCHITECTURE

The project began by understanding what enhancements could be made to the first implementation [4] of the Stable Coil Algorithm. The first hurdle to overcome was to restore the database to its original working state. While going through the Perl scripts which handle the scraping of the data, it was observed that they did not correctly handle the retrieval of updates posted to the SWISS-PROT files on the ExPASy (Expert Protein Analysis System) [10] website. Moreover, it was found that the performance of the Perl scripts could be vastly improved and that the scripts could be made less error prone with better error logging. Also, the website could be improved by allowing users to dynamically chart their results and to save their results for offline use. These reasons prompted a rewrite of the original implementation.

The initial step is to recreate the database in MySQL 5.0. MySQL 5.0 was selected as it has become the database of choice for a new generation of applications built on the LAMP stack (Linux, Apache, MySQL, PHP / Perl / Python), which the project also uses. Also, among the long list of MySQL 5.0's features, the one most favorable to this project is its support for stored procedures, which helped improve the performance of the data loads. The Perl scripts are modified to first compare the files on the ExPASy repository with the files on the local system, so that if any changes in file modification date or file size exist, the Perl scripts can retrieve the data and run it through the Stable Coil Algorithm. The re-written load code then loads the retrieved proteins and coiled-coils into their respective tables. These modified Perl scripts use table driven downloads, which give the user the flexibility of turning downloads of certain files on or off as required.

The next step of the project is the extraction of interchain and intrachain electrostatic interactions and heptad offsets from the retrieved coiled-coils. The researchers at UCHSC have provided a list of salt residues as inputs to the data extraction process. The coiled-coils are then split into their heptad offsets, and a table is created that lists the salt residues occurring at a particular heptad offset of the given coiled-coils.

The final step of the project is to produce an HTML/PHP/JavaScript based data-driven website, the purpose of which is to highlight the electrostatic and hydrophobic interactions among amino acid residues in a coiled-coil. This enables the researchers at the Department of Biochemistry and Molecular Genetics at the University of Colorado Health Sciences Center (UCHSC) to query the underlying database, built using data from the ExPASy website. In querying the database, researchers can retrieve information regarding the coiled-coils that are present in the proteins, the salt residues that provide the electrostatic interactions for those proteins, and the amino acid residues that occur at a given heptad offset within the proteins. They can also run summary queries to determine information such as the frequency of amino acid residues occurring at certain heptad positions, or the kinds of electrostatic interactions that occur most frequently among the proteins in the database, to name a few.
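The change check and the table driven downloads described above for the load scripts might look roughly like the sketch below. The project itself uses Perl; this Python version is only an illustration, and the row layout (`name`, `enabled`, `size`, `mtime`) and function names are hypothetical. How the remote size and modification time are obtained from the repository is left out.

```python
import os


def needs_download(local_path, remote_size, remote_mtime):
    """True if the local copy is missing or differs in size or modification time."""
    if not os.path.exists(local_path):
        return True
    info = os.stat(local_path)
    return info.st_size != remote_size or int(info.st_mtime) != int(remote_mtime)


def files_to_fetch(download_table, local_dir):
    """Return the names of enabled files whose local copy is missing or stale.

    download_table is a list of rows (hypothetical layout) such as
    {"name": "x.dat", "enabled": True, "size": 3, "mtime": 1000}.
    Rows with enabled=False are skipped, which is the on/off switch
    that a table driven download provides.
    """
    return [
        row["name"]
        for row in download_table
        if row["enabled"]
        and needs_download(
            os.path.join(local_dir, row["name"]), row["size"], row["mtime"]
        )
    ]
```

Keeping the on/off flag in a table rather than in the script means that a file can be excluded from the next load without touching the code.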

4.1 DATABASE ARCHITECTURE

The backbone of this entire project is the Stable Coil Database, which contains tables with information regarding the proteins, their coiled-coils and the salt bridges that provide the electrostatic interactions. The initial step was to recreate the database using MySQL 5.0, as the original rendition of the database had been corrupted. The MySQL database management system has become quite popular in recent years, particularly in the area of web services, where it is used in combination with a web server to construct database-backed web sites that involve dynamic content generation. There are several reasons for the popularity of MySQL: it is fast, and it is easy to set up, use and administer. Also, among the long list of MySQL 5.0's features, the ones most favorable to this project are its support for AUTO_INCREMENT and stored procedures.

One of the useful properties of an AUTO_INCREMENT column is that unique values do not need to be assigned manually; MySQL does so automatically. Hence AUTO_INCREMENT is a very useful feature that automatically generates unique primary IDs when rows are inserted.

Stored routines (procedures and functions) are supported in MySQL 5.0. A stored procedure is a set of SQL statements that can be stored in the server. Once this has been done, programmers don't need to keep reissuing the individual statements but can refer to the stored procedure instead. Situations where stored routines can be particularly useful include:

- When multiple client applications are written in different languages or work on different platforms, but need to perform the same database operations.
- When security is paramount. Banks, for example, use stored procedures and functions for all common operations. This provides a consistent and secure environment, and routines can ensure that each operation is properly logged. In such a setup, applications and users have no access to the database tables directly, and can only execute specific stored routines.

Stored routines can provide improved performance because less information needs to be sent between the server and the client. The tradeoff is that this increases the load on the database server, because more of the work is done on the server side and less is done on the client (application) side. For this project, the performance improvement provided by the stored procedures was judged to outweigh the increased load on the server.

STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE 1

The Cocolysis Database is currently hosted on the University of Colorado Health Sciences Server (simbio.uchsc.edu). The tables involved in the first phase of the project are listed in the ER diagram in Figure 4-1.

Figure 4-1: E-R Diagram detailing the relationship between tblprotein and tblcoiledcoil

The first part of the database consists of two main tables, a table of proteins containing coiled-coils (tblprotein) and a table of the coiled-coils present in the proteins (tblcoiledcoil). A bridge table called tblproteincoil connects these two tables together. This table provides information on which coiled-coils are present in which proteins. There is also a source lookup table (tbldatasourcelookup) which provides the sources from which the data was retrieved.
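The relationship between the two main tables and the bridge table can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module and an in-memory database purely for demonstration; the project itself uses MySQL 5.0, where the equivalent column attribute is AUTO_INCREMENT. The table names and the ProteinID, EntryName and Sequence columns come from this chapter, while CoiledCoilID, the reduced column set, and the sample rows are assumptions.

```python
import sqlite3

# Reduced, illustrative schema; the real tables carry more columns.
SCHEMA = """
CREATE TABLE tblprotein (
    ProteinID INTEGER PRIMARY KEY AUTOINCREMENT,
    EntryName TEXT,
    Sequence  TEXT
);
CREATE TABLE tblcoiledcoil (
    CoiledCoilID INTEGER PRIMARY KEY AUTOINCREMENT,  -- hypothetical column name
    Sequence     TEXT
);
CREATE TABLE tblproteincoil (
    ProteinID    INTEGER REFERENCES tblprotein(ProteinID),
    CoiledCoilID INTEGER REFERENCES tblcoiledcoil(CoiledCoilID)
);
"""


def coils_for_protein(conn, entry_name):
    """Follow the bridge table from a protein entry to its coiled-coil sequences."""
    rows = conn.execute(
        "SELECT c.Sequence FROM tblprotein p "
        "JOIN tblproteincoil pc ON pc.ProteinID = p.ProteinID "
        "JOIN tblcoiledcoil c ON c.CoiledCoilID = pc.CoiledCoilID "
        "WHERE p.EntryName = ? ORDER BY c.CoiledCoilID",
        (entry_name,),
    )
    return [row[0] for row in rows]


def demo():
    """Insert one made-up protein and coiled-coil, then look the coil up."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # The IDs are generated by the database, not supplied by the caller.
    pid = conn.execute(
        "INSERT INTO tblprotein (EntryName, Sequence) VALUES (?, ?)",
        ("DEMO_PROT", "MDYLDLG"),
    ).lastrowid
    cid = conn.execute(
        "INSERT INTO tblcoiledcoil (Sequence) VALUES (?)", ("LDLG",)
    ).lastrowid
    conn.execute("INSERT INTO tblproteincoil VALUES (?, ?)", (pid, cid))
    return coils_for_protein(conn, "DEMO_PROT")
```

Because the same coiled-coil row can be linked to several proteins through tblproteincoil, each distinct coiled-coil needs to be stored only once.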

39 University of Colorado at Colorado Springs 29 Every table has two audit columns, Record_Status and Change_Date. The Record_Status column distinguishes between records that are inserted or updated. The Change_Date column indicates the last time the record was inserted or updated. These columns help to keep track of how and when individual records have been changed. The next section describes the structure and content of each of these tables in detail PROTEIN TABLE The protein table is used to store information regarding proteins that have a coiled-coil motif. As of July 7 th 2008, there are 87,368 proteins with coiled motifs in the database. The table structure is listed in Figure 4-2 below. Figure 4-2: Structure of Protein Table (tblprotein) The columns specific to this table are described in detail here: ProteinID The ProteinID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table. The column also has a referential integrity to the tblproteincoil.proteinid, which is the bridge table used connect the proteins with coiled-coils. SourceID The SourceID is looked up against the tbldatasourcelookup. This ID indicates the source of data. Currently there are just two data sources in the table; SWISS-PROT which is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure and its post-translational modifications) and TREMBL, which is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated into SWISS- PROT. EntryName This field indicates the ExPASy entry name from the SWISS-PROT database.

EntryDataClass: The ExPASy data class from the SWISS-PROT database. The data class describes the status of the protein entry, i.e., whether or not the data has been manually reviewed by the UniProtKB curators.

Accession: The accession number associated with the protein entry name. Accession numbers provide a stable way of identifying entries from release to release: although protein names can change in future releases, the accession number remains the same [8].

ProteinName: The actual protein name, as opposed to the ExPASy entry name. Note that different organisms may have the same protein even though the sequence of the protein may differ.

Organism: The species of the organism in which the protein is found. In the ExPASy database, organism names are provided as both the Latin name and the English name. For viruses, only the English name is provided.

ProteinSeqLength: The number of amino acids in the protein sequence.

ProteinSeqMolWeight: The molecular weight of the protein.

ProteinSeqCRC64: The 64-bit CRC checksum of the protein sequence.

ProteinSeqCreateDate, ProteinSeqModDate: The creation and modification dates associated with the protein.

Sequence: The amino acid sequence of the protein.

COILED-COIL TABLE

The coiled-coil table stores information about the coiled-coils retrieved from the SWISS-PROT proteins using the Stable Coil Algorithm. Only unique coiled-coils are stored in this table, because the same coiled-coil can be found in multiple proteins and such duplicates could skew the results of summary queries. To be considered unique, a coiled-coil must have a different amino acid sequence and a different structural offset. As of June 7th, 2008, there are 141,204 unique coiled-coils in the database. The table structure is shown in Figure 4-3 below.

Figure 4-3: Structure of Coiled-coil Table (tblcoiledcoil)

The columns specific to this table are described in detail here:

CoilID: An auto-increment column; a new ID is created every time a row is inserted. This column is the primary key for this table. It is also tied by referential integrity to tblproteincoil.coilid in the bridge table that connects the coiled-coils with their corresponding proteins.

CoilSequence: The amino acid sequence of the coiled-coil. The coil sequence is a substring of the protein sequence and is retrieved using the Stable Coil Algorithm. Because the researchers are interested in long coiled-coils, the sequences in this table are 42 or more amino acids in length.

Cluster: The clusters found in the coiled-coil.

CoilLength: The number of amino acids in the coil sequence.

Offset: The starting heptad offset of the coiled-coil. The values range from a to g.

Cluster3, Cluster4, Cluster5, Cluster6, and Cluster6p: The number of instances in which 3, 4, 5, 6, or 7 or more 1s occur sequentially in a cluster within a coiled-coil. For example, a cluster sequence containing one run of three 1s and one run of five 1s would give Cluster3 a value of 1 and Cluster5 a value of 1.

De-cluster3, De-cluster4, De-cluster5, De-cluster6, and De-cluster6p: The number of instances in which 3, 4, 5, 6, or 7 or more 0s occur sequentially in a cluster within a coiled-coil. For example, a cluster sequence containing a single run of three 0s would give De-cluster3 a value of 1.
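The run counts stored in the Cluster and De-cluster columns can be illustrated with a short sketch. It is written in Python rather than the project's Perl, and it assumes the Cluster field is a string made up only of the characters 1 and 0; the function name count_runs and the example string are hypothetical.

```python
import re

def count_runs(cluster, symbol):
    """Count maximal runs of `symbol` by length: exactly 3, 4, 5, 6, or 7 and more."""
    counts = {3: 0, 4: 0, 5: 0, 6: 0, "6p": 0}
    # Each regex match is one maximal run of the chosen symbol.
    for run in re.findall(re.escape(symbol) + "+", cluster):
        length = len(run)
        if length >= 7:
            counts["6p"] += 1
        elif length >= 3:
            counts[length] += 1
    return counts

example = "0011100111110"  # hypothetical cluster string, not taken from the database
stabilizing = count_runs(example, "1")    # would feed Cluster3 .. Cluster6p
destabilizing = count_runs(example, "0")  # would feed De-cluster3 .. De-cluster6p
```

Runs shorter than three are ignored, which matches the column definitions above: only runs of 3, 4, 5, 6, or 7 or more are counted.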


PROTEIN COIL TABLE

In nature, some proteins contain many coiled-coils, and the same coiled-coil can be found in multiple proteins. In database terms, protein sequences and coiled-coil sequences share a many-to-many relationship. The protein coil table is a bridge table between tblprotein and tblcoiledcoil that stores these many-to-many relationships between proteins and coiled-coils.

To illustrate, consider two proteins: protein P1, containing coiled-coils C1 and C2, and protein P2, containing coiled-coil C2. For the first protein, we insert P1 into tblprotein, C1 and C2 into tblcoiledcoil, and P1-C1 and P1-C2 into tblproteincoil. For the second protein, we insert P2 into tblprotein and P2-C2 into tblproteincoil. We do not insert any coiled-coils for P2 into tblcoiledcoil, because C2 already exists in that table.

As of July 7th, 2008, there are 179,054 protein-coil records in this table. The table structure is shown in Figure 4-4 below.

Figure 4-4: Structure of Protein Coil Table (tblproteincoil)

The columns specific to this table are described in detail here:

ProteinCoilID: An auto-increment column; a new ID is created every time a row is inserted. This column is the primary key for this table.

ProteinID: Identifies the protein that contains the coiled-coil. This column is a foreign key to the ProteinID column in the tblprotein table.

CoilID: Identifies the coiled-coil in the protein. This column is a foreign key to the CoilID column in the tblcoiledcoil table.

CoilLocation: The location of the coiled-coil sequence in the protein. This is the only place where the location can be saved, because coiled-coils share a many-to-many relationship with proteins. Also, since MySQL does not support nested tables the way Oracle does, it is not possible to save this data in either the tblprotein or the tblcoiledcoil table.
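The P1/P2 scenario above can be reproduced with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, a reduced set of columns, and a column named CoilOffset in place of Offset to avoid an SQL keyword. The helper names get_or_create and load_protein, the placeholder sequences, and the CoilLocation values are all hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tblprotein (
    ProteinID INTEGER PRIMARY KEY AUTOINCREMENT,
    EntryName TEXT UNIQUE);
CREATE TABLE tblcoiledcoil (
    CoilID INTEGER PRIMARY KEY AUTOINCREMENT,
    CoilSequence TEXT,
    CoilOffset TEXT,
    UNIQUE (CoilSequence, CoilOffset));
CREATE TABLE tblproteincoil (
    ProteinCoilID INTEGER PRIMARY KEY AUTOINCREMENT,
    ProteinID INTEGER REFERENCES tblprotein (ProteinID),
    CoilID INTEGER REFERENCES tblcoiledcoil (CoilID),
    CoilLocation INTEGER);
"""

def get_or_create(conn, table, id_col, values):
    """Return the ID of the row matching `values`, inserting it first if absent."""
    cols = list(values)
    params = list(values.values())
    where = " AND ".join(f"{col} = ?" for col in cols)
    row = conn.execute(f"SELECT {id_col} FROM {table} WHERE {where}", params).fetchone()
    if row is not None:
        return row[0]
    marks = ", ".join("?" for _ in cols)
    cur = conn.execute(f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})", params)
    return cur.lastrowid

def load_protein(conn, entry_name, coils):
    """coils: list of (sequence, heptad offset, location in protein)."""
    protein_id = get_or_create(conn, "tblprotein", "ProteinID", {"EntryName": entry_name})
    for sequence, offset, location in coils:
        coil_id = get_or_create(conn, "tblcoiledcoil", "CoilID",
                                {"CoilSequence": sequence, "CoilOffset": offset})
        # The location lives on the bridge row because it differs per protein.
        conn.execute("INSERT INTO tblproteincoil (ProteinID, CoilID, CoilLocation) "
                     "VALUES (?, ?, ?)", (protein_id, coil_id, location))

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_protein(conn, "P1", [("C1", "a", 10), ("C2", "d", 120)])
load_protein(conn, "P2", [("C2", "d", 35)])

counts = {table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
          for table in ("tblprotein", "tblcoiledcoil", "tblproteincoil")}
links = conn.execute(
    "SELECT p.EntryName, c.CoilSequence FROM tblproteincoil pc "
    "JOIN tblprotein p ON p.ProteinID = pc.ProteinID "
    "JOIN tblcoiledcoil c ON c.CoilID = pc.CoilID "
    "ORDER BY pc.ProteinCoilID").fetchall()
```

After both loads, the protein and coiled-coil tables each hold two rows while the bridge table holds three, one per protein-coil pairing.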

STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE 2

The second part of the Stable Coil database consists of two additional main tables and one other bridge table. The tblsaltbridge table contains all the i to i + 3, i to i + 4, and i to i + 5 electrostatic interactions, which include attractions and repulsions between the Lys, Glu, Asp, and Arg amino acids. The lookup table (tblsaltresidueslookup) provides information on the different salt residue interactions that interest researchers. The researchers can add new residues to the database, and the Perl programs will automatically retrieve these residues from the coiled-coils. The tblsplitheptadcoils table consists of all the coiled-coils split into their individual heptads (gabcdef). A bridge table, tblheptadsalt, records which heptad of the coiled-coil contains which salt bridge. The tables involved in the second phase of this project are shown in the E-R diagram in Figure 4-5. The next section describes the structure and content of each of these tables in detail.

Figure 4-5: E-R Diagram detailing the relationship between tblsaltbridge and tblsplitheptadcoils


4.1.2.1 SALT RESIDUES LOOKUP TABLE

The Salt Residues table is a lookup table that provides information on the interaction type and the salt bridge type. The researchers can add extra records to this table for other salt bridges, and the related Perl programs will automatically retrieve information for those salt bridges from the coiled-coil table. As of July 7th, 2008, there are 48 types of salt bridges present in the table. The table structure is shown in Figure 4-6 below.

Figure 4-6: Structure of Salt Residues Lookup Table (tblsaltresidueslookup)

The columns specific to this table are described in detail here:

SaltBridgeID: An auto-increment column; a new ID is created every time a row is inserted. This column is the primary key for this table. It is also tied by referential integrity to tblheptadsalt.saltbridgeid in the bridge table that connects the salt bridges with coiled-coil heptads.

SaltResidueID: Looked up against tblsaltresidueslookup. This ID indicates the type of salt bridge, for example whether an i to i + 3 salt bridge exists between the salt residues Lys and Glu. This column is a foreign key to tblsaltresidueslookup.saltresidueid. Currently there are 48 different attraction and repulsion salt residues in the lookup table.

CoilID: A foreign key to tblcoiledcoil.coilid. This field tells us which salt bridges are located in which coiled-coils.

SaltBridgeMatch: All the residues contained within the salt bridge. These intermediate amino acids can help the researchers understand which residues occur most commonly in a particular type of salt bridge.

SaltStartLoc: The start location of the salt bridge in the coiled-coil.

SaltEndLoc: The end location of the salt bridge in the coiled-coil.

SaltStartOff: The starting heptad offset of the salt bridge.

SaltEndOff: The ending heptad offset of the salt bridge. The offset fields help researchers identify which salt residues commonly occur at which heptad offsets in a coiled-coil.
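The relationship between the location fields and the offset fields can be sketched as modular arithmetic over the heptad letters. This is an illustration in Python, not the project's Perl code. It assumes that locations are 1-based positions within the coiled-coil and that the coil's starting offset is the Offset value stored in tblcoiledcoil; the function names offset_at and bridge_offsets are hypothetical.

```python
HEPTAD = "abcdefg"

def offset_at(coil_start_offset, location):
    """Heptad offset of the residue at a 1-based location in the coiled-coil."""
    start = HEPTAD.index(coil_start_offset)
    return HEPTAD[(start + location - 1) % 7]

def bridge_offsets(coil_start_offset, salt_start_loc, spacing):
    """Return (SaltStartOff, SaltEndOff) for an i to i + spacing interaction."""
    salt_end_loc = salt_start_loc + spacing
    return (offset_at(coil_start_offset, salt_start_loc),
            offset_at(coil_start_offset, salt_end_loc))

# An i to i + 5 interaction starting at location 7 of a coil that begins at offset a
g_to_e = bridge_offsets("a", 7, 5)
```

Under these assumptions, an i to i + 5 interaction that starts at a g position always ends at an e position, which is the pairing the heptad ordering used later in this chapter is designed to keep together.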

SALT BRIDGE TABLE

The salt bridge table stores information about the electrostatic interactions (i to i + 3, i to i + 4, and i to i + 5) between the amino acids in a given coiled-coil. The researchers are interested in the following electrostatic interactions:

Attractions      Repulsions
Lys / Glu        Lys / Lys
Glu / Lys        Lys / Arg
Lys / Asp        Arg / Arg
Asp / Lys        Arg / Lys
Arg / Glu        Glu / Glu
Glu / Arg        Glu / Asp
Arg / Asp        Asp / Glu
Asp / Arg        Asp / Asp

Table 4-1: Amino acid electrostatic interactions which the researchers at UCHSC are interested in

This table provides information on the position of the salt bridge in the coiled-coil, as well as the start and end heptad offsets of the salt bridge. These residues are searched for using a regular expression parser. Using the salt bridge table, the researchers are able to answer questions such as "What is the total number of Lys/Glu i to i + 3 salt bridges, and where are these salt bridges distributed within the coiled-coil?" As of July 7th, 2008, there are 1,017,241 salt bridges present in the 141,204 unique coiled-coils. The table structure is shown in Figure 4-7 below.

Figure 4-7: Structure of Salt Bridge Table (tblsaltbridge)

The columns specific to this table are described in detail here:

SaltBridgeID: An auto-increment column; a new ID is created every time a row is inserted. This column is the primary key for this table. It is also tied by referential integrity to tblheptadsalt.saltbridgeid in the bridge table that connects the salt bridges with coiled-coil heptads.

SaltResidueID: Looked up against tblsaltresidueslookup. This ID indicates the type of salt bridge, for example whether it is an i to i + 3 salt bridge between the salt residues Lys and Glu. This column is a foreign key to tblsaltresidueslookup.saltresidueid. Currently there are 48 different attraction and repulsion salt residues in the lookup table.

CoilID: A foreign key to tblcoiledcoil.coilid. This field tells us which salt bridges are located in which coiled-coils.

SaltBridgeMatch: All the residues contained within the salt bridge. These intermediate amino acids can help the researchers determine which residues occur most commonly in a particular type of salt bridge.

SaltStartLoc: The start location of the salt bridge in the coiled-coil.

SaltEndLoc: The end location of the salt bridge in the coiled-coil.

SaltStartOff: The starting heptad offset of the salt bridge.

SaltEndOff: The ending heptad offset of the salt bridge. The offset fields help us identify which salt residues commonly occur at which heptad offsets in a coiled-coil.

COILED-COIL HEPTAD TABLE

The coiled-coil heptad table is generated by splitting the coiled-coils into their respective heptad sequences. A heptad is defined as the sequence of offsets g, a, b, c, d, e, f. The heptad starts with g here in order to capture all i to i + 5 interactions in a given heptad of a coiled-coil. The table contains the residues occurring at each heptad offset for each of the coiled-coils in tblcoiledcoil. This table helps build queries that determine whether certain residues or certain pairs of residues occur more frequently than others. As of June 7th, 2008, there are 1,186,214 heptads for 141,204 unique coiled-coils in the database. The table structure is shown in Figure 4-8 below.

Figure 4-8: Structure of Coiled-coil Heptad Table (tblsplitheptadcoil)

The columns specific to this table are described in detail here:

HeptadOffsetID: An auto-increment column; a new ID is created every time a row is inserted. This column is the primary key for this table. It is also tied by referential integrity to tblheptadsalt.heptadoffsetid in the bridge table that connects the coiled-coil heptads with the corresponding salt bridges.

CoilID: A foreign key to tblcoiledcoil.coilid. This field tells us which heptads belong to which coiled-coils.

OffsetG, OffsetA, OffsetB, OffsetC, OffsetD, OffsetE, and OffsetF: The amino acid residues of the coiled-coil at the corresponding offsets.

HeptadStartLoc: The start location of the corresponding heptad in the coiled-coil.

HeptadEndLoc: The end location of the corresponding heptad in the coiled-coil.

HeptadStartOff: The starting offset of the heptad.

HeptadEndOff: The ending offset of the heptad.

HEPTAD SALT TABLE

The table tblheptadsalt is a bridge table between tblsaltbridge and tblsplitheptadcoils. It stores pairs of salt bridge IDs and heptad offset IDs for which the salt bridge lies within that heptad of the given coiled-coil. The table does not store any salt bridges that overlap two heptads. As of July 7th, 2008, there are 549,440 records in this table. The table structure is shown in Figure 4-9 below.


Figure 4-9: Structure of Heptad Salt Bridge Table (tblheptadsalt)

The columns specific to this table are described in detail here:

HeptadSaltID: An auto-increment column; a new ID is created every time a row is inserted. This column is the primary key for this table.

HeptadOffsetID: A foreign key to tblsplitheptadcoils.heptadoffsetid. This ID identifies the heptads that contain salt bridges.

SaltBridgeID: A foreign key to tblsaltbridge.saltbridgeid.
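The heptad split and the rule that a salt bridge must lie inside a single heptad can be sketched as follows. This is a Python illustration under stated assumptions, not the project's Perl code: locations are taken to be 1-based, partial heptads at either end of the coil are kept, and the function names split_heptads and containing_heptad and the example sequence are hypothetical.

```python
HEPTAD_ORDER = "gabcdef"  # heptads start at g so that g-to-e pairs share a heptad

def split_heptads(coil_seq, start_offset):
    """Split a coil into heptads; each carries its residues and 1-based bounds."""
    heptads = []
    current = None
    position = HEPTAD_ORDER.index(start_offset)
    for loc, residue in enumerate(coil_seq, start=1):
        if current is None:
            current = {"residues": {}, "start_loc": loc}
        current["residues"][HEPTAD_ORDER[position]] = residue
        current["end_loc"] = loc
        position += 1
        if position == 7:  # reached f, so close this heptad
            heptads.append(current)
            current = None
            position = 0
    if current is not None:  # trailing partial heptad
        heptads.append(current)
    return heptads

def containing_heptad(heptads, salt_start_loc, salt_end_loc):
    """Index of the heptad that fully contains the salt bridge, else None."""
    for index, heptad in enumerate(heptads):
        if heptad["start_loc"] <= salt_start_loc and salt_end_loc <= heptad["end_loc"]:
            return index
    return None

coil = "LKELEDKLKELEDK"  # hypothetical 14-residue fragment, for illustration only
heptads = split_heptads(coil, "a")
inside = containing_heptad(heptads, 7, 12)   # g to e within the second heptad
spanning = containing_heptad(heptads, 5, 8)  # crosses a heptad boundary
```

In this sketch a bridge that crosses a heptad boundary returns None, mirroring the statement that tblheptadsalt does not store salt bridges that overlap two heptads.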

SCRAPE FILE TABLE

Finally, there is a table that drives the data scrape process (tblscrapefile). It contains information about the location of the file on an FTP server, the last time the file was updated on the source, and the size of the file. This information is used to check whether the file has changed on the host and, if so, to retrieve it. Once the new version of the file has been retrieved, the file size and the modification date are updated so that the table stores the most current attributes of the file.

As of July 7th, 2008, there are three entries in tblscrapefile. The first entry is used to retrieve the SWISS-PROT file, which is updated on the source site monthly. The second entry is used to retrieve the TREMBL file. The final entry is used to retrieve the updates to the SWISS-PROT database. If the researchers would like to add more datasets, they simply need to add an entry to this table. The only factor they must take into account is the file format of the dataset: for the data in the new file to load successfully into the database, it must be in the same format as the SWISS-PROT file, a format in which protein sequences are commonly represented. The table structure is shown in Figure 4-10 below.

Figure 4-10: Structure of Scrape File table (tblscrapefile)

The columns specific to this table are described in detail here:

ScrapeID: An auto-increment column; a new ID is created every time a row is inserted. This column is the primary key for this table.

FtpSiteUrl: The main URL of the FTP site that hosts the data. The Perl program has been designed specifically to get data from an FTP server, as most protein distribution files are large and are almost always accessible via FTP.

FtpDirPath: The directory path of the file on the FTP server.

FtpFileName: The actual name of the file retrieved from the FTP site.

LocalDirPath: The directory path on the local machine where the data obtained from the FTP site is saved.

LastModDate: The last time the file on the website was modified. If this date and the date on the file do not match, the Perl programs mark the file as changed and scrape the new version of the file.

SizeInBytes: The file size in bytes as retrieved from the file during the most recent scrape. If this file size and the size of the file on the FTP site do not match, the Perl programs mark the file as changed and scrape the new version of the file.

PullWeeklyFlag: A flag that tells the main scrape program whether or not to scrape a file every week. A Perl program calls a stored procedure, which changes the status of the PullWeeklyFlag field based on the last time the file was scraped. The field has just two values, Y or N.

ScrapeFileFlag: Indicates whether or not the scrape program requires a restart. Because the Perl programs are scraping huge files, it is entirely possible that the data transfer might end before the download is complete. If this happens while the researchers are trying to retrieve multiple files, this field determines which file was the last one successfully scraped. Every time the program starts, it determines which files need to be retrieved and marks their ScrapeFileFlag field as N. Once the scrape of a particular file is completed, its ScrapeFileFlag field is marked as Y.
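The change check driven by these columns can be sketched as below. This is a Python illustration under assumptions, not the project's Perl code: the class ScrapeRecord, the functions needs_scrape and mark_scraped, and the sample file name and values are hypothetical, and the remote date and size are passed in rather than fetched from an FTP server.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ScrapeRecord:
    ftp_file_name: str
    last_mod_date: datetime
    size_in_bytes: int
    pull_weekly_flag: str = "N"   # Y or N
    scrape_file_flag: str = "N"   # Y once the file has been scraped completely

def needs_scrape(record, remote_mod_date, remote_size):
    """A file is scraped only if it is flagged and its date or size has changed."""
    if record.pull_weekly_flag != "Y":
        return False
    return (record.last_mod_date != remote_mod_date
            or record.size_in_bytes != remote_size)

def mark_scraped(record, remote_mod_date, remote_size):
    """Store the attributes of the version just retrieved and flag completion."""
    record.last_mod_date = remote_mod_date
    record.size_in_bytes = remote_size
    record.scrape_file_flag = "Y"

record = ScrapeRecord("example.dat", datetime(2008, 6, 1), 1000, pull_weekly_flag="Y")
unchanged = needs_scrape(record, datetime(2008, 6, 1), 1000)
changed = needs_scrape(record, datetime(2008, 7, 1), 1200)
mark_scraped(record, datetime(2008, 7, 1), 1200)
after_update = needs_scrape(record, datetime(2008, 7, 1), 1200)
```

Because ScrapeFileFlag is only set to Y after a complete download, an interrupted run leaves it at N, which is what would allow a restart to pick up at the right file.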


4.1.3 MATERIALIZED VIEWS

To provide faster access from the user interface, the database includes materialized views. Like standard views, materialized views are defined by SELECT queries. A materialized view, however, caches the query result as a concrete table that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of some data being potentially out of date. It is most useful in data warehousing scenarios, where frequent queries of the actual base tables can be extremely expensive. In addition, because the view is manifested as a real table, anything that can be done to a real table can be done to the view. The most important of these is the ability to build indexes on any column, which can speed up queries drastically. In a normal view, it is typically only possible to exploit indexes on columns that come directly from (or have a mapping to) indexed columns in the base tables.

MySQL 5.0, however, does not support materialized views. Hence a new approach was designed to create tables that are updated automatically, just like a materialized view. MySQL provides a CREATE TABLE ... AS SELECT (CTAS) syntax, which allows users to create tables from a SELECT statement. This feature was used to build a procedure that takes a SELECT statement and a table name as inputs and creates the table on the fly. The procedure also assigns primary keys and any indexes, if specified. The details of the stored procedure that creates these materialized views are described in Appendix A. The updates to these materialized views are scheduled using the UNIX crontab. Each of these queries takes between 1.2 and 5 seconds to refresh, which keeps downtime, if any, to a minimum.
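The idea of rebuilding a table from a SELECT statement can be sketched with Python's built-in sqlite3 module standing in for MySQL. This is not the stored procedure from Appendix A. The function name refresh_matview, the view name matview_coillength, and the sample rows are hypothetical, and the identifiers are interpolated directly into the SQL, so the sketch is only suitable for trusted input.

```python
import sqlite3

def refresh_matview(conn, view_name, select_sql, index_columns=()):
    """Rebuild `view_name` from `select_sql`, emulating a materialized view refresh."""
    conn.execute(f"DROP TABLE IF EXISTS {view_name}")   # also drops its indexes
    conn.execute(f"CREATE TABLE {view_name} AS {select_sql}")
    for column in index_columns:
        conn.execute(f"CREATE INDEX idx_{view_name}_{column} ON {view_name} ({column})")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblcoiledcoil (CoilID INTEGER PRIMARY KEY, CoilLength INTEGER)")
conn.executemany("INSERT INTO tblcoiledcoil (CoilLength) VALUES (?)", [(42,), (42,), (56,)])

query = ("SELECT CoilLength, COUNT(*) AS CoilCount "
         "FROM tblcoiledcoil GROUP BY CoilLength")
read_view = "SELECT CoilLength, CoilCount FROM matview_coillength ORDER BY CoilLength"

refresh_matview(conn, "matview_coillength", query, index_columns=("CoilLength",))
first = conn.execute(read_view).fetchall()

# A new base row is not visible in the cached table until the next refresh.
conn.execute("INSERT INTO tblcoiledcoil (CoilLength) VALUES (56)")
stale = conn.execute(read_view).fetchall()

refresh_matview(conn, "matview_coillength", query, index_columns=("CoilLength",))
fresh = conn.execute(read_view).fetchall()
```

The stale read between the two refreshes shows the trade-off described above: reads are fast, but they can lag behind the base tables until the scheduled job runs.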
The materialized views were created to answer questions such as "What roles do different amino acid residues play in the hydrophobic core (a and d positions)?" or "What is the frequency of occurrence of pairs of residues in the coiled-coils?" Some of the important materialized views are described here in detail.

AMINO ACID OCCURRENCES

This materialized view provides the frequency of occurrence of a pair of amino acids at a certain heptad offset in a coiled-coil. It allows users to find out which amino acid pairs occur most frequently at a given heptad position and which occur less frequently. This table is based on a SELECT query performed on tblsplitheptadcoils. As of July 7th, 2008, there are 2,010 distinct amino acid pair occurrences out of a total possible 2,800. The most commonly occurring pair of amino acids is L-L at offsets a and d respectively, which occurs 57,854 times. The table structure is shown in Figure 4-11 below.

Figure 4-11: Materialized view of Amino Acid Occurrences (matview_aminoacidocurrences)

The columns specific to this materialized view (table) are described in detail here:


Amino Acid Pair: The distinct pairs of amino acids found within the coiled-coils in the database.

Offset Location 1: The heptad offset of the first amino acid of the pair in the coiled-coil.

Offset Location 2: The heptad offset of the second amino acid of the pair.

Offset Pair Occurrence: The number of occurrences of the amino acid pair at the different heptad offsets (a, b, c, d, e, f, and g).

COIL LENGTH VS. CLUSTER PER COIL

This materialized view provides data on the frequency of occurrence of coiled-coils of a certain length. It also provides the normalized number of clusters occurring in coiled-coils of a given length. This view allows users to see how cluster distribution varies as coil length increases. The coils have been divided into seven subgroups by coiled-coil length:

1. Coiled-coils with coil length less than 50 amino acids
2. Coiled-coils with coil length between 50 and 59 amino acids
3. Coiled-coils with coil length between 60 and 69 amino acids
4. Coiled-coils with coil length between 70 and 79 amino acids
5. Coiled-coils with coil length between 80 and 89 amino acids
6. Coiled-coils with coil length between 90 and 99 amino acids
7. Coiled-coils with coil length greater than 100 amino acids

The table structure is shown in Figure 4-12 below.

Figure 4-12: Materialized view of Coiled-coil Length vs. the Cluster Count (matview_coilclustercount)

The columns specific to this materialized view (table) are described in detail here:

Coiled-coils By Length: Splits the coiled-coils into different categories by length.

Coiled-coil Count: The number of coiled-coils in each group once they have been split by length.

Stabilizing Clusters per coil: The Total Stabilizing Clusters normalized by the Total Number of Coiled-coils.

Stabilizing Clusters per coil in Coiled-coils with Clusters: The Total Stabilizing Clusters normalized by the Total Number of Coiled-coils that actually contain stabilizing clusters.

Destabilizing Clusters per coil: The Total Destabilizing Clusters normalized by the Total Number of Coiled-coils.

Destabilizing Clusters per coil in Coiled-coils with Clusters: The Total Destabilizing Clusters normalized by the Total Number of Coiled-coils that actually contain destabilizing clusters.

These are two of the materialized views used to accelerate the execution of the searches. The details concerning the creation of these materialized views, and the SQL statements used to create them, are covered in Appendix B.
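The grouping and normalization behind the coil length view can be sketched in Python as follows. Several points are assumptions rather than facts from the project: "normalized by" is read as a simple division, a coil of exactly 100 amino acids is placed in the last group, the per-coil cluster count is taken as a single number supplied by the caller, and the names length_group and summarize and the sample data are hypothetical.

```python
from collections import defaultdict

def length_group(coil_length):
    """Map a coil length to one of the seven length groups."""
    if coil_length < 50:
        return "<50"
    if coil_length >= 100:  # assumption: exactly 100 falls in the last group
        return "100+"
    lower = (coil_length // 10) * 10
    return f"{lower}-{lower + 9}"

def summarize(coils):
    """coils: iterable of (coil_length, cluster_count) pairs."""
    totals = defaultdict(lambda: {"coils": 0, "clusters": 0, "with_clusters": 0})
    for length, clusters in coils:
        group = totals[length_group(length)]
        group["coils"] += 1
        group["clusters"] += clusters
        if clusters > 0:
            group["with_clusters"] += 1
    summary = {}
    for name, group in totals.items():
        per_clustered = (group["clusters"] / group["with_clusters"]
                         if group["with_clusters"] else 0.0)
        summary[name] = {
            "coil_count": group["coils"],
            "clusters_per_coil": group["clusters"] / group["coils"],
            "clusters_per_coil_with_clusters": per_clustered,
        }
    return summary

sample = [(45, 2), (48, 0), (55, 1), (120, 4)]  # made-up values for illustration
summary = summarize(sample)
```

The same function would serve for stabilizing and destabilizing clusters alike, since only the count passed in differs.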

PERL CODE DESIGN

One of the most important and complex parts of the Stable Coil project was the process of loading the data from the source. The raw data for the scrape is obtained from the ExPASy (Expert Protein Analysis System) server (see footnote 1). The ExPASy database is an open source database developed to help researchers by providing the latest annotated protein sequences. The entire database is reposted monthly, while updates for proteins that have been sequenced as a result of various genome projects are added to the database weekly. The database can be downloaded at ftp://ftp.expasy.org/databases/swiss-prot/release. It is available in XML and DAT formats and can be downloaded compressed or uncompressed. The updates to the protein sequences are available at ftp://ftp.expasy.org/databases/swiss-prot/updates_compressed in the DAT format. This project uses the DAT format in order to keep the data type consistent across the entire process. The main database is currently 2.9 gigabytes in size. The weekly updates range from 30 to 40 megabytes.

Four Perl programs are used to retrieve the source data from the website, parse the data using the Stable Coil Algorithm, and load the data into the MySQL database. They are as follows:

1. StableCoil_Algorithm_setpullweekly_call.pl
2. StableCoil_Algorithm_scrape_parse_load.pl
3. StableCoil_Algorithm_saltresidues_heptadoffsets_extract.pl
4. Older_Files_Archiving_Removal.pl

The first program sets the PULLWEEKLYFLAG in the tblscrapefile table, depending on how long it has been since the last scrape of the file. For updates, the PULLWEEKLYFLAG is set every seven days, while for the entire database it is set every 180 days.

The second program is the main program that actually scrapes and loads the data. It starts by checking whether the PULLWEEKLYFLAG has been set for the specified file. If it is set, the program compares the LASTMODDATE and SIZEINBYTES in tblscrapefile against the modification date and the size of the file on the FTP site; if they differ, the file is scraped. At this time, the LASTMODDATE and SIZEINBYTES for the particular file are also updated. Once the file has been scraped, the sequences, their molecular weights, their creation dates, and any other relevant information are extracted from the file. This information is then run through the Stable Coil Algorithm, which predicts the presence and the location of the coiled-coils in the protein sequences. If the protein sequence length is less than 42, the program ignores it, as the researchers are interested in coiled-coils with 42 or more amino acids.

Next, the program checks whether the coiled-coil already exists in the tblcoiledcoil table. If the coiled-coil does exist, the program retrieves the CoilID for that coiled-coil. If the coiled-coil does not exist in the table, the program inserts a new record into the table and retrieves its CoilID. Similarly, the program either inserts a new protein sequence into tblprotein or gets the ProteinID from the table, depending on whether the protein already exists in the database. The next step is to determine the clusters in the coiled-coils using Perl's pattern matching operators. The pattern matches are done as follows.

Footnote 1: The ExPASy website is located at

55 University of Colorado at Colorado Springs 45 Cluster3: /(?:(?<=[^1]) (?<=^))111(?=[^1] $)/g Cluster4: /(?:(?<=[^1]) (?<=^))1111(?=[^1] $)/g Cluster5: /(?:(?<=[^1]) (?<=^))11111(?=[^1] $)/g Cluster6: /(?:(?<=[^1]) (?<=^))111111(?=[^1] $)/g Cluster6p: /1{7,}/g De-cluster3: /(?:(?<=[^0]) (?<=^))000(?=[^0] $)/g De-cluster4: /(?:(?<=[^0]) (?<=^))0000(?=[^0] $)/g De-cluster5: /(?:(?<=[^0]) (?<=^))00000(?=[^0] $)/g De-cluster6: /(?:(?<=[^0]) (?<=^))000000(?=[^0] $)/g De-cluster6p: /0{7,}/g Once the pattern matches are complete, the program inserts the ProteinID and the CoilID in the bridge table. Thus only distinct proteins and distinct coiled-coils are stored in the database. The third program, StableCoil_Algorithm_saltresidues_heptadoffsets_extract.pl, is used to retrieve salt residues and their heptad offsets from the coiled-coils. The results retrieved from this program provide insights into the relationship between the salt residues and the various heptads of the coiled-coils. There are 3 types of electrostatic interactions, i to i + 3, i to i + 4 and i to i + 5, which can be searched as i..i + 3, i i + 4 and i.i + 5. The salt residues that the researchers are interested in are as follows. Start Amino End Amino Interaction Acid Acid Type Salt Bridges Type K E A Intrachain i to i + 3 K E A Intrachain i to i + 4 K E A Interchain i to i' + 5 E K A Intrachain i to i + 3 E K A Intrachain i to i + 4 E K A Interchain i to i' + 5 K D A Intrachain i to i + 3 K D A Intrachain i to i + 4 K D A Interchain i to i' + 5 D K A Intrachain i to i + 3 D K A Intrachain i to i + 4 D K A Interchain i to i' + 5 R E A Intrachain i to i + 3 R E A Intrachain i to i + 4 R E A Interchain i to i' + 5 E R A Intrachain i to i + 3 E R A Intrachain i to i + 4 E R A Interchain i to i' + 5 R D A Intrachain i to i + 3 R D A Intrachain i to i + 4 R D A Interchain i to i' + 5 D R A Intrachain i to i + 3 D R A Intrachain i to i + 4

56 University of Colorado at Colorado Springs 46 Start Amino Acid End Amino Acid Interaction Type Salt Bridges Type D R A Interchain i to i' + 5 K K R Intrachain i to i + 3 K K R Intrachain i to i + 4 K K R Interchain i to i' + 5 K R R Intrachain i to i + 3 K R R Intrachain i to i + 4 K R R Interchain i to i' + 5 R R R Intrachain i to i + 3 R R R Intrachain i to i + 4 R R R Interchain i to i' + 5 R K R Intrachain i to i + 3 R K R Intrachain i to i + 4 R K R Interchain i to i' + 5 E E R Intrachain i to i + 3 E E R Intrachain i to i + 4 E E R Interchain i to i' + 5 E D R Intrachain i to i + 3 E D R Intrachain i to i + 4 E D R Interchain i to i' + 5 D E R Intrachain i to i + 3 D E R Intrachain i to i + 4 D E R Interchain i to i' + 5 D D R Intrachain i to i + 3 D D R Intrachain i to i + 4 D D R Interchain i to i' + 5 Table 4-2: List of Salt Bridges which provide i to i + 3, i to i + 4 and i to i + 5 electrostatic interactions These residues are determined by using the regular expressions along with the RegExp::Exhaustive Perl module. This module finds lookback, lookahead, and matches within matches. In addition to the occurrence of salt bridges, the third program also stores the positions of the salt bridges in tblsaltbridge. These heptad offsets are calculated by splitting the coiled-coils into seven residues at a time and storing the respective amino acid in columns Offsetg, Offseta, Offsetb, Offsetc, Offsetd, Offsete, Offsetf in the table tblsplitheptadcoils, depending upon the amino acid s offset. If a salt bridge occurs within a heptad offset, the HeptadOffsetID and SaltBridgeID are added to the bridge tblheptadsalt. If the researchers are interested in finding any new salt bridges, they need only to add new entries to the tblsaltresidueslookup table. Thus, using the heptad repeat designations abcdefg, it is possible to know the number of types of salt residues occurring in the coiled-coils and their distribution relative to their position. 
It is also possible to identify the frequency of occurrence of amino acids in certain heptad positions.
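
The salt-bridge scan described above can be illustrated with a minimal Python sketch. It is not the project's Perl code: the function name, the small in-memory dictionary standing in for tblsaltresidueslookup (only a subset of the pairs in Table 4-2), and the plain position loop used in place of Regexp::Exhaustive are all assumptions made for illustration. The sketch also assumes that the interchain i to i' + 5 case can be located as a spacing of five residues along a single sequence.

```python
# Hypothetical subset of the lookup table:
# (start residue, end residue) -> "A" (attraction) or "R" (repulsion).
SALT_PAIRS = {
    ("K", "E"): "A", ("E", "K"): "A", ("R", "E"): "A", ("E", "R"): "A",
    ("K", "K"): "R", ("E", "E"): "R",
}
HEPTAD = "abcdefg"


def find_salt_bridges(coil, start_offset):
    """Return (position, start, end, spacing, interaction, offset) tuples.

    position is 0-based within the coil; offset is the heptad offset of the
    starting residue, derived from the coil's starting heptad offset.
    Overlapping pairs are all reported.
    """
    first = HEPTAD.index(start_offset)
    hits = []
    for i, start in enumerate(coil):
        for spacing in (3, 4, 5):
            j = i + spacing
            if j >= len(coil):
                continue
            kind = SALT_PAIRS.get((start, coil[j]))
            if kind is not None:
                offset = HEPTAD[(first + i) % 7]
                hits.append((i, start, coil[j], spacing, kind, offset))
    return hits


hits = find_salt_bridges("ALLDKTREKTRE", "g")
```

Adding a new pair to the dictionary plays the same role as adding a row to the lookup table: no other part of the scan needs to change.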

Each of the above programs uses the in-house Log and Time modules. The Log module creates log files which record every step of the program, and the Time module benchmarks the time taken for each step. The fourth program, Older_Files_Archiving_Removal.pl, is used to clean out the log files every 30 days, so as to prevent an inode failure on the server. The process flow is summarized by the flow chart in Figure 4-13 below.

Figure 4-13: Process Flow Diagram for the Stable Coil Algorithm

4.3 WEBSITE ARCHITECTURE

Another component of the project is the data-driven website, through which users access the information in the Stable Coil database. This front end was developed in PHP and uses Fusion Charts to create dynamic Flash charts, which are generated from XML; the MySQL database serves as the backend. The website was created based on search parameters provided by the researchers at UCHSC. The website as of this writing is located at

There are four search pages, each of which provides options to export the data in HTML, Microsoft Word or Microsoft Excel formats. Each page also includes a help option describing what the page does. The search results are highlighted to provide easy visual recognition of the data. The searches are stored per user between different sessions using the $_SESSION variable in PHP. Also, each search page contains a link allowing users to reset the search criteria.

Users can search for variations in proteins, coiled-coils and salt bridges. The website also provides summary queries that describe certain anomalies or points of interest in the dataset. The summary queries are performed by creating materialized views, which are based on SELECT statements and use the JOIN statement to combine two or more tables.

The following sections describe the different pages of the website and the search parameters associated with these pages. The searches are performed according to the requirements provided by researchers at UCHSC.

4.3.1 INDEX PAGE

Figure 4-14: Index Page of the Stable Coil website

The index page describes what the Stable Coil project is all about. It gives the users the necessary insights into the world of coiled-coils. It also provides a summary of the number of proteins, coiled-coils, and salt bridges in the database and displays the last date on which the database was updated. This webpage is intended to be a small introduction to the steps that the Stable Coil Algorithm performs.

4.3.2 PROTEIN SEARCH PAGE

Figure 4-15: Protein Related Search Page

This page searches both the tblprotein table and the tblproteincoil table to retrieve information on the coiled-coils related to the proteins. It also provides the salt bridges associated with the proteins using the tblsaltbridge table. The results include the ExPASy protein name, the protein sequence, the organism in which the protein is found and the date the protein was annotated in the ExPASy database. Users should use the accession number instead of the protein entry name to compare the results to the ExPASy database. The following parameters determine which fields the users can search.

Description of the search fields for the Protein Related Search Page

#   Search Parameter Name: Explanation of the Search Parameter

1. Entry Name: Users can use the SWISS-PROT ExPASy Entry Name to search for a protein. As per the ExPASy web site, the ExPASy Entry Name is not unique across different releases of their protein database. If you are looking for a particular protein, the Accession Number is the best field to search.

2. Entry Data Class: The proteins are divided into two Entry Data Classes: Reviewed and Non-Reviewed. A protein is marked as reviewed if it has been checked by the analysts at ExPASy; otherwise it is marked as non-reviewed.

3. Accession Number: The Accession Number is a unique value used to distinguish between different proteins in the same release and also to distinguish proteins with the same ExPASy names in different releases.

4. Protein Name: When you enter the name of a protein, for example Myosin, the program gives the different Myosin proteins in the database such as Myosin-1, Myosin-2, Myosin Va, Myosin Vb, etc. This is a case-insensitive search, i.e. Myosin, MYOSIN or myosin would all return the same results.

5. Organism: This field indicates the type of organism in which the protein occurs. The underlying database stores the organism's genus name as well as its common name, for example Homo Sapiens (Human). This search is case insensitive as well.

6. Protein Sequence Length: A search on this field can be performed if you want to restrict your result set of proteins to a certain length. The values in these search fields must be valid integers.

7. Protein Sequence Create Date: A search on this field can be performed if you want to restrict your result set to proteins created in a certain time frame. The values in these fields have to be of the format MM/DD/YYYY.

8. Sequence: This field allows you to search the protein sequence. Protein sequences are always represented in upper case and hence the search is case sensitive. For example, 'LKLL' will yield results while 'lkll' will not yield any results. You can also search for a sequence as 'L_LL' where the underscore represents any amino acid.

Table 4-3: Search Parameters used on the Protein Related Search Page
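
The matching rule described for the Sequence field (case-sensitive, with an underscore standing for any single amino acid) can be sketched in a few lines of Python. This is only an illustration of the rule: the helper name is hypothetical, and the sketch assumes the search is a substring match, which the text above does not state explicitly.

```python
import re


def sequence_matches(pattern, sequence):
    """Case-sensitive substring search where '_' stands for any one residue."""
    # Translate the user pattern into a regular expression: '_' becomes '.',
    # every other character is matched literally.
    regex = "".join("." if ch == "_" else re.escape(ch) for ch in pattern)
    return re.search(regex, sequence) is not None


found = sequence_matches("L_LL", "ALALLK")
```

Because the pattern is compared without any case folding, a lower-case query such as 'lkll' never matches an upper-case sequence, which mirrors the behaviour described in Table 4-3.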

4.3.3 COILED-COIL SEARCH PAGE

Figure 4-16: Coiled-coil Related Search Page

This webpage searches through the coiled-coils which have been extracted from the proteins using the Stable Coil Algorithm. The search results include the coiled-coil sequence, the heptad offset at which the coiled-coil occurs in the protein, and the clusters that are found in this coiled-coil. In the results section, the 'a' and 'd' residues for every coiled-coil are highlighted in red. Every coiled-coil record contains links to the proteins and salt bridges related to that coiled-coil. The following parameters determine which fields the users can search.

Description of the search fields for the Coiled-coil Related Search Page

#   Search Parameter Name: Explanation of the Search Parameter

1. Coil Sequence: This field allows you to search the coiled-coil sequence. Coiled-coil sequences are always represented in upper case; hence the search is case sensitive. For example, 'LKLL' will yield results while 'lkll' will not yield any results. You can also search for a sequence as 'L_LL' where the underscore represents any amino acid.

2. Cluster: The clusters in the coiled-coils are determined by the presence or absence of the following hydrophobic amino acids: Phenylalanine (F), Isoleucine (I), Leucine (L), Methionine (M), Valine (V) and Tyrosine (Y) at a and d positions. If any of the above hydrophobic amino acids appears at an a or d position in the coiled-coil, the amino acid is represented by 1. If not, the value is 0. A run of 3 or more consecutive 1's or 0's in the coiled-coil is denoted as a cluster or a de-cluster, respectively. If the first residue in the cluster maps to a 1, the cluster starts with a 0, so as to group all 1's together.

3. Coil Length: The coil length field allows you to restrict your search results to coiled-coils between certain lengths. The Stable Coil database only retrieves coiled-coils which contain 42 or more amino acids.

4. Offset: This field allows you to search for coiled-coils with certain starting heptad offsets. The heptad offsets are a, b, c, d, e, f, and g. This field is case sensitive as offsets are always indicated in lower case.

5. Organism: This field indicates the type of organism in which the protein occurs. The underlying database stores the organism's genus name as well as the common name, for example Homo Sapiens (Human). This search is case insensitive as well.

6. Cluster 3, 4, 5, 6, 6 plus: As previously described, these fields provide the information on the number of clusters in the coiled-coils. If a cluster has three consecutive 1's, the cluster3 count is increased by one. If a cluster has four consecutive 1's, the cluster4 count is increased by one, and so on.

7. De-cluster 3, 4, 5, 6, 6 plus: These fields provide the information on the number of de-clusters in the coiled-coils. If a de-cluster has three consecutive 0's, the de-cluster3 count is increased by one. If a de-cluster has four consecutive 0's, the de-cluster4 count is increased by one, and so on.

Table 4-4: Search parameters used on the Coiled-coil Related Search Page

This page does not perform motif searching. For motif searching the users can go to the Coiled-coil Motif Searching page.
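
The cluster and de-cluster counts described in Table 4-4 can be sketched as follows. This is a hedged Python illustration rather than the project's Perl implementation: the function names are hypothetical, and instead of the lookaround patterns used by the Perl programs the sketch finds maximal runs of 1's or 0's and bins them by length, which should yield the same counts. It assumes the binary string is built from the a and d residues only, in sequence order.

```python
import re

HYDROPHOBIC = set("FILMVY")  # residues listed in Table 4-4
HEPTAD = "abcdefg"


def core_string(coil, start_offset):
    """Map each a/d residue to '1' (hydrophobic) or '0', in sequence order."""
    first = HEPTAD.index(start_offset)
    bits = []
    for i, residue in enumerate(coil):
        if HEPTAD[(first + i) % 7] in "ad":
            bits.append("1" if residue in HYDROPHOBIC else "0")
    return "".join(bits)


def count_runs(bits, symbol):
    """Count maximal runs of exactly 3, 4, 5, 6 and of 7 or more of symbol."""
    counts = {"3": 0, "4": 0, "5": 0, "6": 0, "6p": 0}
    for run in re.findall(f"{symbol}+", bits):
        n = len(run)
        if n >= 7:
            counts["6p"] += 1
        elif n >= 3:
            counts[str(n)] += 1
    return counts


clusters = count_runs("1110000111111101", "1")     # stabilizing clusters
de_clusters = count_runs("1110000111111101", "0")  # destabilizing clusters
```

Runs shorter than three are ignored, so an isolated 1 or 0 contributes to neither a cluster nor a de-cluster count.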

4.3.4 COILED-COIL MOTIF SEARCHING

Figure 4-17: Coiled-coil Motif Search web page

This webpage allows users to perform motif searches on a coiled-coil. Given part of a coiled-coil sequence and a starting heptad offset, a motif search finds the places in each coiled-coil where that sequence occurs beginning at the given heptad offset. In other words, the program searches for the occurrence of certain amino acid residues at given heptad positions. The results include the starting coiled-coil offset, the clusters in the coiled-coil and the total number of matches found, along with links to the related proteins and salt bridges. The matches are highlighted in yellow as a visual aid. This search can be slow, as the query needs to go through all the coiled-coils and find the number of matches. The following parameters determine which fields the users can search.

Description of the search fields for the Coiled-coil Motif Search Page

#   Search Parameter Name: Explanation of the Search Parameter

1. Coil Sequence: This field can be used to search for a string of amino acids occurring in the coiled-coil at a certain heptad offset location. You can also search for a sequence such as 'L_L' occurring at offset 'a' in the coiled-coil, where the underscore represents any amino acid. This field is case sensitive as coiled-coil sequences are always indicated in upper case.

2. Coil Offsets: This field provides the start offset for the coil sequence match.

For executing a query, both of these fields need to have valid inputs; otherwise, you may get an invalid input error.

Table 4-5: Search parameters used on the Coiled-coil Motif Search Page
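
A motif search of the kind described above can be sketched in Python as follows. The function and parameter names are hypothetical, and the website's actual query is not shown in the source; the sketch only illustrates the rule that a motif must match residue by residue, with '_' as a wildcard, and must begin at the requested heptad offset.

```python
HEPTAD = "abcdefg"


def motif_matches(coil, coil_start_offset, motif, motif_offset):
    """Return 0-based positions where motif occurs starting at motif_offset.

    '_' in the motif matches any residue; the comparison is case sensitive.
    """
    first = HEPTAD.index(coil_start_offset)
    positions = []
    for i in range(len(coil) - len(motif) + 1):
        # Skip windows that do not begin at the requested heptad offset.
        if HEPTAD[(first + i) % 7] != motif_offset:
            continue
        window = coil[i:i + len(motif)]
        if all(m == "_" or m == r for m, r in zip(motif, window)):
            positions.append(i)
    return positions


positions = motif_matches("ALLDKTREKTRE", "g", "L_D", "a")
```

Every window of every coiled-coil has to be examined, which is consistent with the remark above that this search can be slow.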

4.3.5 COIL HEPTAD AND SALT BRIDGE SEARCH

Figure 4-18: Coiled Heptad and Salt Bridge Search Page

This webpage provides information on which salt residues occur in which heptad of the coiled-coil. The salt bridges in this table have been identified using the tblsaltresidueslookup table. The results include the coil sequence, the coil starting offset, the heptad in which the salt bridge occurs, the type of interaction, the type of salt bridge and the location of the heptad in the coiled-coil. The a and d offsets in the coil sequence are highlighted in red while the heptad offset is highlighted in yellow. The following parameters determine which fields the users can search.

Description of the search fields for the Coiled Heptad Salt Bridge Search Page

#   Search Parameter Name: Explanation of the Search Parameter

1. Coil Offset: This field allows you to search for coiled-coils with certain starting heptad offsets. The heptad offsets are a, b, c, d, e, f, and g. This field is case sensitive as offsets are always indicated in lower case.

2. Offsetg, Offseta, Offsetb, Offsetc, Offsetd, Offsete, and Offsetf: These fields are used to search heptad offsets containing certain amino acids. You can search on the occurrence of an amino acid in a heptad.

3. Interaction Type: The Interaction Type determines whether the electrostatic interaction between the salt bridges is an Attraction or Repulsion.

4. Salt Bridges Type: This field allows users to select between the three different types of salt bridges: i to i + 3, i to i + 4 and i to i + 5.

Table 4-6: Search parameters used on the Coil Heptad and Salt Bridge Search Page

GENERATED REPORTS

The reports are based on the data obtained from the Stable Coil Database. Most of the generated reports have materialized views as their backend. These materialized views are created using complex SQL statements, sometimes by joining two or more tables. There are various factors affecting the stability of a coiled-coil; the generated reports try to highlight these factors, thus providing an insight into the variations of the coiled-coil data. The reports perform similarly because the materialized views behind them are stored as tables rather than evaluated as views at query time. Hence, by creating indexes on these materialized views, we can make searches as fast as the hardware and the database permit. The sections below describe the generated reports in detail.

Amino Acid Pair Occurrences in Coiled-coils

This report provides the frequency of occurrence of pairs of amino acids in the coiled-coils and the heptad positions at which these amino acids occur. For this report the heptad is defined as gabcdef, so that i to i + 5 salt bridges, which generally occur in g and e heptad offsets, can be easily identified. The researchers are interested only in amino acid pairs which involve the following heptad offsets:

1. The d-e pair of residues helps users identify the most frequently occurring amino acid adjacent to the hydrophobic core offset d.

2. The g-a pair of residues helps users identify the most frequently occurring amino acid adjacent to the hydrophobic core offset a. The results returned by the g-a and d-e searches aid in understanding what kind of amino acid residues (hydrophobic, hydrophilic or electrically charged) occur adjoining the hydrophobic core and how they affect the overall stability of the coiled-coil.

3. The a-d pair of residues accounts for the hydrophobic core of the coiled-coil. The amino acids occurring in these positions are usually hydrophobic. The results from this report help the users understand which residues occur frequently in the hydrophobic core, thus explaining the significance of a particular amino acid to the stability of the coiled-coil through hydrophobic interactions.

4. The g-e pair of residues relates to i to i + 5 electrostatic interactions between coiled-coils.

5. The e-g pair of residues relates to i to i + 2 electrostatic interactions between coiled-coils. The difference between these residues and the residues described above is that they are not retrieved from the same heptad. The residue at offset e is retrieved from one heptad and the residue at offset g is retrieved from the following heptad.

The electrostatic interactions occur between two different α-helical coiled-coils due to the difference in the ionic charge on the amino acids involved in the α-helix. This is yet another factor that contributes to the stability of the coiled-coil. Hence, it is important to know the amino acids that frequently participate in these interactions.

These results can be filtered on amino acid pairs, the type of heptad and the number of amino acid pairs. The page also shows page aggregate and overall aggregate values at the bottom.
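
The pair counting behind this report can be sketched as follows, under stated assumptions. With the heptad read as gabcdef, each of the five pairs of interest corresponds to a fixed spacing from the first residue: a-d is 3 residues, d-e and g-a are 1, g-e is 5, and e-g (reaching into the following heptad) is 2. The Python names below are hypothetical, the sketch is not the SQL behind the materialized view, and it assumes that a pair is counted whenever both residues exist in the coiled-coil, since the source does not say how incomplete heptads at the ends are handled.

```python
from collections import Counter

HEPTAD = "abcdefg"
# (first offset, second offset) -> spacing in residues, heptad read as gabcdef.
PAIR_SPACING = {
    ("a", "d"): 3,
    ("d", "e"): 1,
    ("g", "a"): 1,
    ("g", "e"): 5,
    ("e", "g"): 2,  # g is taken from the following heptad
}


def count_pairs(coils):
    """Count residue pairs per offset pair.

    coils is an iterable of (sequence, starting heptad offset) tuples.
    Returns a Counter keyed by (pair, offset 1, offset 2), e.g. ("L-L", "a", "d").
    """
    counts = Counter()
    for coil, start_offset in coils:
        first = HEPTAD.index(start_offset)
        for i, residue in enumerate(coil):
            offset = HEPTAD[(first + i) % 7]
            for (o1, o2), spacing in PAIR_SPACING.items():
                j = i + spacing
                if offset == o1 and j < len(coil):
                    counts[(f"{residue}-{coil[j]}", o1, o2)] += 1
    return counts


counts = count_pairs([("ALLDKTREKTRE", "g")])
```

Sorting the resulting counter by value, for example with `counts.most_common(10)`, would give a ranking of the same shape as the top-ten tables reported in Chapter 5.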

Amino Acid Occurrences in Salt Bridges

A salt bridge is an electrostatic interaction occurring between two different α-helical coiled-coils. The researchers at UCHSC are only interested in intrachain i to i + 3, intrachain i to i + 4 and interchain i to i + 5 interactions. The report generated provides the frequency of occurrence of the salt bridges (indicated in Table 4-1) and the residues that occur between the salt bridges. The results can be filtered based on the type of salt bridge, the salt bridge offset, and the offsets that occur within the salt bridge. With these results, the users can determine what salt bridges, be they attractions or repulsions, affect the coiled-coil stability.

Frequency of Occurrence of i to i+n Salt Bridges based on Amino Acids at Heptad Offset a/d

There are six reports under this category, one for each i to i + n electrostatic interaction (intrachain i to i + 3, intrachain i to i + 4 and interchain i to i + 5) based on amino acids at either heptad offset a or d. Each report provides information on the occurrence of a particular amino acid in a heptad and whether or not there is a salt bridge present in the same heptad. This information is critical to users who want to know the relationship between hydrophobic and electrostatic interactions. From the results, the users are able to interpret how many times a strong hydrophobic interaction occurs in the presence of a salt bridge.

Cluster Count vs. Coiled-coil Length

This report gives the number of clusters found in coiled-coils of certain lengths. The coiled-coils are divided into seven different groups depending on their lengths.

1. Coiled-coils with length less than 50 amino acids
2. Coiled-coils with length between 50 and 59 amino acids
3. Coiled-coils with length between 60 and 69 amino acids
4. Coiled-coils with length between 70 and 79 amino acids
5. Coiled-coils with length between 80 and 89 amino acids
6. Coiled-coils with length between 90 and 99 amino acids
7. Coiled-coils with length greater than 100 amino acids.

The report provides information on the number of coiled-coils of a particular length, the number of stabilizing/destabilizing clusters in these coiled-coils and the number of stabilizing/destabilizing clusters in those coiled-coils of a particular length which actually have clusters. The stability of the coiled-coil is greatly affected by the strength of the hydrophobic interactions present in the coiled-coil. An increase in the number of stabilizing clusters in a coiled-coil is associated with an increase in the stability of the coiled-coil, and vice versa for destabilizing clusters. Hence, it is necessary to understand the distribution of clusters within the coiled-coils.

Destabilizing/Stabilizing Cluster Distribution in Coiled-coils

This report gives the number of destabilizing/stabilizing clusters found in coiled-coils of certain lengths. The coiled-coils are classified by length as follows.

1. Coiled-coils with length less than 50 amino acids
2. Coiled-coils with length between 50 and 59 amino acids
3. Coiled-coils with length between 60 and 69 amino acids
4. Coiled-coils with length between 70 and 79 amino acids
5. Coiled-coils with length between 80 and 89 amino acids
6. Coiled-coils with length between 90 and 99 amino acids
7. Coiled-coils with length greater than 100 amino acids.

The destabilizing/stabilizing clusters are also classified by length as follows.

1. De-clusters/clusters of length 3
2. De-clusters/clusters of length 4
3. De-clusters/clusters of length 5
4. De-clusters/clusters of length 6
5. De-clusters/clusters of length 7 or more.

The stability of the coiled-coil is affected by the strength of the hydrophobic interactions present in the coiled-coil. If a coiled-coil contains destabilizing clusters of larger lengths, the stability of the coiled-coil is adversely affected. Similarly, the number of stabilizing clusters of higher lengths in a coiled-coil greatly improves the stability of the coiled-coil. Hence, it is necessary to understand the distribution of de-clusters/clusters of each length within a coiled-coil.

Occurrence of Amino Acid in Offset a/d with respect to the Location of that Occurrence in the Coiled-coil

These reports provide information on how many times amino acids occur in a coiled-coil at heptad offset a/d and at what position they occur in the coiled-coil. The position of the amino acids is divided into three different groups.

1. At the beginning of the coiled-coil
2. In the center of the coiled-coil
3. At the end of the coiled-coil.

These reports aid in understanding whether the amino acids involved in the hydrophobic core of the coiled-coil are present closer to the N-terminus or the C-terminus or are buried deep in the coiled-coil.

Frequency of Occurrence of Amino Acids in Coiled-coils

This report provides information on the number of occurrences of amino acids in the coiled-coils. This report helps researchers understand the most commonly occurring residues in the coiled-coils. As hydrophobic amino acids are important to the stability of the coiled-coil, it has been hypothesized that frequently occurring amino acids would indeed be hydrophobic in nature.

CHAPTER 5 RESULTS

The major impetus in developing the Stable Coil Algorithm is to determine the presence of coiled-coils in proteins and to provide quantitative results that affirm the known factors affecting the stability of the coiled-coils. The reports based on the Stable Coil Database are intended to do exactly that. There are currently 87,368 proteins in the database, and 141,204 coiled-coils are found in these proteins. As mentioned earlier, the database is updated weekly by the Perl programs, and, as of this writing, the last update was performed on Monday, July 7th. The update is automatic, thus eliminating any need for manual intervention. The generated reports provide the results from the database in which the researchers are interested.

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               A                   D                   57,854
I-L               A                   D                   41,768
V-L               A                   D                   38,581
L-I               A                   D                   26,098
I-I               A                   D                   19,842
F-L               A                   D                   19,830
V-I               A                   D                   17,053
L-V               A                   D                   16,359
L-A               A                   D                   15,784
N-L               A                   D                   15,170

Table 5-1: Top 10 amino acid pairs occurring in heptad offsets a and d which form the hydrophobic core

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               D                   E                   33,044
L-E               D                   E                   26,198
L-A               D                   E                   23,592
L-K               D                   E                   21,898
L-S               D                   E                   20,607
L-Q               D                   E                   19,023
L-R               D                   E                   18,939
L-I               D                   E                   17,561
L-V               D                   E                   17,339
L-G               D                   E                   16,560

Table 5-2: Top 10 amino acid pairs occurring in heptad offsets d and e

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               G                   A                   25,747
A-L               G                   A                   19,562
E-L               G                   A                   17,572
L-V               G                   A                   16,772
L-I               G                   A                   16,400
S-L               G                   A                   16,150
K-L               G                   A                   14,507
I-L               G                   A                   14,463
V-L               G                   A                   13,885
A-V               G                   A                   13,752

Table 5-3: Top 10 amino acid pairs occurring in heptad offsets g and a

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               G                   E                   13,416
A-L               G                   E                   9,608
L-A               G                   E                   9,320
E-K               G                   E                   8,389
S-L               G                   E                   8,291
L-S               G                   E                   8,207
I-L               G                   E                   8,075
L-I               G                   E                   7,841
A-A               G                   E                   7,715
L-G               G                   E                   7,285

Table 5-4: Top 10 amino acid pairs occurring in heptad offsets g and e usually associated with electrostatic attraction i to i + 5

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               E                   G                   15,394
A-L               E                   G                   10,970
L-A               E                   G                   10,514
A-A               E                   G                   9,821
S-L               E                   G                   9,508
E-E               E                   G                   9,420
L-I               E                   G                   9,063
L-S               E                   G                   8,859
I-L               E                   G                   8,854
L-V               E                   G                   8,173

Table 5-5: Top 10 amino acid pairs occurring in heptad offsets e and g usually associated with electrostatic attraction i to i + 2

The results in Table 5-1 to Table 5-5 show whether an amino acid pair occurs more frequently than others at certain heptad offsets. The heptad offsets that the researchers are interested in are d-e, g-a, a-d, g-e (i to i + 5 interaction) and e-g (i to i + 2 interaction).

The a-d pair of residues accounts for the hydrophobic core of the coiled-coil. The amino acids occurring in these positions are usually hydrophobic. The results from this report help the users understand which residues occur frequently in the hydrophobic core, thus explaining the significance of a particular amino acid to the stability of a coiled-coil through hydrophobic interactions. Hence, it comes as no surprise that a Leu-Leu amino acid pair occurs most frequently in coiled-coils at heptad positions a and d respectively, as Leu is a hydrophobic amino acid. This is congruent with the findings in [18, 10].

The g-e pair of residues relates to i to i + 5 electrostatic interactions, while the e-g pair of residues relates to i to i + 2 electrostatic interactions between coiled-coils. The difference between these residues and the residues described above is that they are not retrieved from the same heptad. The residue at offset e is retrieved from one heptad and the residue at offset g is retrieved from the following heptad. For example, consider a coiled-coil sequence ALLDKTREKTRE starting at heptad offset g. With the heptad read as gabcdef, the first seven residues (ALLDKTR) form one heptad and the remaining residues (EKTRE) belong to the next. The residue at offset e of the first heptad (T) and the residue at offset g of the following heptad (E) account for the e-g (i to i + 2) electrostatic interaction. The electrostatic interactions occur between two different α-helical coiled-coils due to the ionic charge on the amino acids involved in the α-helix.
Although the effect of one electrostatic interaction plays a small part in the stability of the coiled-coil compared to the effects of hydrophobic interactions, many electrostatic interactions together can add up to have substantial effects. The results indicate that the most frequently occurring electrostatic attraction in heptad positions g and e is the attraction between Glu-Lys, where Glu is negatively charged and Lys is positively charged. The commonly occurring repulsion in heptad positions e and g is Glu-Glu. The d-e pair of residues helps users identify the most frequently occurring amino acid adjacent to the hydrophobic core offset d, while the g-a pair of residues helps users identify the most frequently occurring amino acid adjacent to the hydrophobic core offset a. But there is also a more specific reason for identifying the amino acid residues at these offsets. Residues at positions e and g may form interhelical ion pairs that increase stability, but their contribution to stability is an order of magnitude less than the hydrophobic core. Thus, the results returned by the g-a and d-e searches aid in understanding what kinds of amino acid residues (hydrophobic, hydrophilic or electrically charged) aid in electrostatic interactions, what kinds of residues occur adjoining the hydrophobic core, and what effects these residues have on the overall stability of the coiled-coil. Our results indicate that the majority of residues occurring in offset d for the d-e pair are Leu, while the electrically charged residues Lys, Glu, and Arg appear in the top ten occurrences for residue e. As expected, Leu occurs frequently in heptad offset a in the g-a pair, while the top ten occurrences of offset g contain Lys and Glu.

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               A                   D                   57,854
I-L               A                   D                   41,768
V-L               A                   D                   38,581
L-L               D                   E                   33,044
L-E               D                   E                   26,198
L-I               A                   D                   26,098
L-L               G                   A                   25,747
L-A               D                   E                   23,592
L-K               D                   E                   21,898
L-S               D                   E                   20,607
I-I               A                   D                   19,842
F-L               A                   D                   19,830
A-L               G                   A                   19,562
L-Q               D                   E                   19,023
L-R               D                   E                   18,939
E-L               G                   A                   17,572
L-I               D                   E                   17,561
L-V               D                   E                   17,339
V-I               A                   D                   17,053
L-V               G                   A                   16,772
L-G               D                   E                   16,560
L-I               G                   A                   16,400
L-V               A                   D                   16,359
S-L               G                   A                   16,150
I-L               D                   E                   15,853
L-A               A                   D                   15,784
L-T               D                   E                   15,461
L-L               E                   G                   15,394
N-L               A                   D                   15,170
K-L               G                   A                   14,507

Table 5-6: Top 30 Amino Acid Pair Occurrences in Coiled-coils

These results, taken together without considering the heptad offsets, are not very useful; however, they still help in understanding the commonly occurring amino acid residues in coiled-coils. The top 30 occurrences of pairs of amino acids in coiled-coils are listed in Table 5-6 above. Of all the heptad offsets that interest researchers, Leu-Leu is the most commonly occurring pair of amino acid residues. This is closely followed by the Ile-Leu grouping in a and d offsets, because Ile, like Leu, is a hydrophobic residue, and hydrophobic residues are commonly found in offsets a and d, the hydrophobic core.

Amino Acid Residue   Type of Heptad Offset   Total Count of Amino Acids by Heptad Offset Location
L                    d                       299,843
L                    a                       224,544
I                    a                       149,708
V                    a                       143,650
I                    d                       135,597
L                    g                       125,036
L                    e                       124,545
L                    c                       114,713
L                    b                       114,082
L                    f                       113,558
A                    f                       95,516
A                    b                       95,157
A                    c                       91,935
A                    g                       91,557
A                    e                       86,254
S                    f                       78,521
F                    a                       78,473
S                    b                       78,158
S                    c                       78,117
E                    c                       78,057
E                    b                       78,032
E                    f                       77,524
V                    d                       77,177
E                    g                       76,967
K                    f                       76,122
E                    e                       75,353
K                    b                       73,966
I                    g                       73,366
I                    e                       73,261
S                    e                       72,660

Table 5-7: Top 30 frequently occurring amino acids in the Stable Coil Database

The number of times a residue occurs individually in a coiled-coil is shown in the Frequency of Occurrence of Amino Acids in Coiled-coils report. This report shows results similar to those in Tables 5-1 to 5-6. The results are depicted in Table 5-7. They show that Leu is the dominant amino acid in coiled-coils in heptad offsets a and d. It is followed by the other hydrophobic amino acids Ile, Val, and Phe. The biggest difference between the results in [17, 18, 11] and the results seen here is that these references rank Met as the third most frequently occurring amino acid, while our results put Met as one of the least frequently occurring hydrophobic amino acids in a and d positions, even though Met has relatively high stability values associated with it (Table 3.1). This indicates that although most amino acids appear in a coiled-coil sequence in accordance with their associated stability values, there are some exceptions to the rule.

Figure 5-1: Coiled-coils Count vs. Coiled-coil Length

The project also illuminates the relationship between a coiled-coil and its length. A coiled-coil's length is the number of amino acids in the coiled-coil. In this project we are interested in retrieving coiled-coils which contain 42 or more residues. From the results in Figure 5-1 it can be inferred that coiled-coils of length 60 or more are fairly uncommon. There is unfavorable entropy associated with chain length extension, which is not overcome by the increase in hydrophobic interactions associated with the increase in chain length, even if the heptad contains the most stabilizing hydrophobic residue (Leu) at position d and stabilizing ionic attractions. Thus, our findings are congruent with recent experiments [27] performed to study the effects of chain length on larger coiled-coils.

University of Colorado at Colorado Springs 66 Figure 5-2: Location of occurrence of amino acid within the coiled-coil when the amino acid is at heptad offset a.

The researchers were also interested in knowing where amino acids occur within the sequence.

76 University of Colorado at Colorado Springs 66 Figure 5-2: Location of occurrence of amino acid within the coiled-coil when the amino acid is at heptad offset a. Figure 5-3: Location of occurrence of amino acid within the coiled-coil when the amino acid is at heptad offset d. The researchers were also interested in knowing where amino acids occur within the sequence. They would like to know whether the amino acid occurs at the beginning or in the center or at the end of the coiled-coil. The figures 5-2 and 5-3 above show the results of analyzing the coiled-coils as per their location. The occurrence of a coiled-coil is divided into three different types: 1. At the start of the coiled-coil: An amino acid is said to be at the start of the coiled-coil if it appears anywhere between the start of the coiled-coil sequence and ⅓ length of the coiled-coil. 2. At the center of the coiled-coil: An amino acid is said to be at the center of the coiled-coil if it appears anywhere between the ⅓ length of the coiled-coil and ⅔ length of the coiled-coil.

3. Near the end of the coiled-coil: An amino acid is said to be near the end of the coiled-coil if it appears anywhere between ⅔ of the length of the coiled-coil and the end of the coiled-coil sequence.

These reports aid in understanding whether the amino acids involved in the hydrophobic core of the coiled-coil lie closer to the N-terminus or the C-terminus, or are buried deep in the coiled-coil. As seen in the results above, the most common amino acid at any location is Leu, which is congruent with our earlier findings. Leu occurs times near the end of the coiled-coil, times at the center of the coiled-coil, and times at the start of the coiled-coil. Also, the top 10 occurrences are all hydrophobic residues, and these amino acids are evenly distributed among the start, center, and end locations. These results suggest that hydrophobic residues are necessary for the formation and stability of the coiled-coil. They also agree with the results in [20], which state that Leu is the residue most frequently found in the core of coiled-coils.

Figure 5-4: Normalized Value of Destabilizing Clusters in Coiled-Coils of particular length. Results obtained by dividing the total number of coiled-coils with de-clusters by the total number of de-clusters.
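The three location classes defined above (start, center, and near the end) can be sketched in Python as follows. This is only an illustration, not the project's own code: the function names, the 0-based indexing, the assignment of boundary residues to the later class, and the `register_start` parameter (the heptad offset of the first residue) are all assumptions made for the example.

```python
from collections import Counter

HEPTAD = "abcdefg"


def classify_location(index: int, coil_length: int) -> str:
    """Classify a 0-based residue index as 'start', 'center', or 'end'."""
    if not 0 <= index < coil_length:
        raise ValueError("index lies outside the coiled-coil")
    # Compare 3 * index with the length so that no floating-point
    # thirds are needed. Boundary handling is an assumption.
    if 3 * index < coil_length:
        return "start"
    if 3 * index < 2 * coil_length:
        return "center"
    return "end"


def tally_by_location(sequence: str, offset: str = "a",
                      register_start: str = "a") -> Counter:
    """Count (residue, location) pairs for one heptad offset.

    register_start is the (assumed known) heptad offset of the
    first residue of the coiled-coil.
    """
    shift = HEPTAD.index(register_start)
    counts = Counter()
    for i, residue in enumerate(sequence):
        if HEPTAD[(i + shift) % 7] == offset:
            counts[(residue, classify_location(i, len(sequence)))] += 1
    return counts
```

For a 42-residue coiled-coil, indices 0 to 13 fall in the start class, 14 to 27 in the center class, and 28 to 41 near the end.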

Figure 5-5: Normalized Value of Stabilizing Clusters in Coiled-Coils of particular length. Results obtained by dividing the total number of coiled-coils with clusters by the total number of clusters.

Clusters offer further insights into the stability of coiled-coils. For this part of the analysis, a convention of 0s and 1s is used to signify the hydrophobic amino acids (Phe, Leu, Ile, Met, Val and Thr) and non-hydrophobic amino acids, respectively. The stability of the coiled-coil is affected by the strength of the hydrophobic interactions present in the coiled-coil. If a coiled-coil contains longer destabilizing clusters, its stability is adversely affected. Similarly, a larger number of longer stabilizing clusters greatly improves the stability of the coiled-coil. Hence, it is necessary to understand the distribution of de-clusters and clusters of each length within a coiled-coil. The project provides various summary queries based on clusters in coiled-coils.

Figures 5-4 and 5-5 display the distribution of destabilizing and stabilizing clusters across coiled-coils of varying lengths. As these charts show, the number of destabilizing clusters remains almost constant regardless of the length of the coiled-coil, while the number of stabilizing clusters increases linearly as the coil length increases. As mentioned earlier, there is unfavorable entropy associated with increasing chain length, and hence, to keep the coiled-coils stable, we see an increase in the number of stabilizing clusters.
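The 0/1 convention described above lends itself to a simple run-length scan. The Python sketch below is a hypothetical illustration rather than the project's implementation. It assumes that clusters are runs of consecutive core residues (heptad offsets a and d), that a run must contain at least 3 residues to count as a cluster (the shortest cluster length shown in the figures), and that the heptad offset of the first residue is known. The hydrophobic set is the one listed in the text; all function names are invented for the example.

```python
from itertools import groupby

HEPTAD = "abcdefg"
# Hydrophobic residues as listed in the text: Phe, Leu, Ile, Met, Val, Thr.
HYDROPHOBIC = set("FLIMVT")


def core_residues(sequence: str, register_start: str = "a") -> str:
    """Return the residues at heptad offsets a and d (assumed core positions)."""
    shift = HEPTAD.index(register_start)
    return "".join(res for i, res in enumerate(sequence)
                   if HEPTAD[(i + shift) % 7] in "ad")


def encode_core(core: str) -> str:
    """Encode hydrophobic residues as '0' and all others as '1'."""
    return "".join("0" if res in HYDROPHOBIC else "1" for res in core)


def find_clusters(encoded: str, min_length: int = 3):
    """Return (kind, start, length) for each run of at least min_length symbols.

    Runs of '0' are reported as stabilizing, runs of '1' as destabilizing.
    The min_length threshold of 3 is an assumption.
    """
    clusters = []
    position = 0
    for symbol, run in groupby(encoded):
        length = len(list(run))
        if length >= min_length:
            kind = "stabilizing" if symbol == "0" else "destabilizing"
            clusters.append((kind, position, length))
        position += length
    return clusters
```

For example, the core string `LLLKAEELIVVN` encodes to `000111100001`, which contains a stabilizing cluster of length 3, a destabilizing cluster of length 4, and a stabilizing cluster of length 4. The single trailing `1` is too short to count as a cluster.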

Figure 5-6: Distribution of Destabilizing Cluster of Length 3 with respect to the Coiled-Coil length
Figure 5-7: Distribution of Destabilizing Cluster of Length 4 with respect to the Coiled-Coil length
Figure 5-8: Distribution of Destabilizing Cluster of Length 5 with respect to the Coiled-Coil length
Figure 5-9: Distribution of Destabilizing Cluster of Length 6 with respect to the Coiled-Coil length
Figure 5-10: Distribution of Destabilizing Cluster of Length 7+ with respect to the Coiled-Coil length
Figure 5-11: Distribution of Stabilizing Cluster of Length 3 with respect to the Coiled-Coil length
Figure 5-12: Distribution of Stabilizing Cluster of Length 4 with respect to the Coiled-Coil length
Figure 5-13: Distribution of Stabilizing Cluster of Length 5 with respect to the Coiled-Coil length
Figure 5-14: Distribution of Stabilizing Cluster of Length 6 with respect to the Coiled-Coil length
Figure 5-15: Distribution of Stabilizing Cluster of Length 7+ with respect to the Coiled-Coil length

The results shown in Figures 5-6 through 5-15 indicate that as the coil length increases, the number of clusters of any length in the coiled-coil decreases. The count of clusters in coiled-coils of varying length gives an appreciation for the differences found. Also, as we look at clusters of increasing length, the number of clusters found in the coiled-coils decreases. The database is dominated by cluster lengths of 3 and 4. It can be seen that there are very few destabilizing clusters compared to stabilizing clusters, which indicates that stabilizing clusters are a necessary and vital ingredient of the stability of the coiled-coil. The more stabilizing clusters a coiled-coil has, the more stable it is. On the other hand, an increase in the length of destabilizing clusters has very little or no effect on the stability of the coiled-coil. Destabilizing clusters of length 3 or 4 are the most destabilizing, and adding more residues to the cluster does not further change the stability of the coiled-coil. Coiled-coils thus tend to have fewer destabilizing clusters, as hydrophobic amino acids predominantly occupy the hydrophobic core. The findings from our database with regard to clusters are congruent with the experimental analysis done by Kwok and Hodges [27].
Type of Electrostatic Interaction   Attraction/Repulsion   Salt Bridge   Salt Bridge Start Offset   Total Number of Salt Bridges in Coiled-Coils
Intrachain i to i + 3               Attraction             E..K          b                          8688
Intrachain i to i + 4               Attraction             E...K         b                          8663
Intrachain i to i + 4               Attraction             K...E         f                          8462
Interchain i to i' + 5              Attraction             E...K         g                          8430
Intrachain i to i + 3               Attraction             E..K          f                          8338
Intrachain i to i + 3               Attraction             E..K          c                          8216
Intrachain i to i + 4               Attraction             E...K         e                          8211
Intrachain i to i + 3               Attraction             K..E          f                          8128
Intrachain i to i + 3               Attraction             K..E          g                          8113
Intrachain i to i + 4               Attraction             E...K         c                          8060

Table 5-8: Top 10 electrostatic attractions present in the coiled-coils

Type of Electrostatic Interaction   Attraction/Repulsion   Salt Bridge   Salt Bridge Start Offset   Total Number of Salt Bridges in Coiled-Coils
Intrachain i to i + 3               Repulsion              E..E          b                          7926
Intrachain i to i + 3               Repulsion              E..E          c                          7797
Intrachain i to i + 3               Repulsion              E..E          g                          7725
Intrachain i to i + 3               Repulsion              E..E          f                          7648
Intrachain i to i + 4               Repulsion              E...E         b                          7600
Intrachain i to i + 4               Repulsion              E...E         e                          7448
Intrachain i to i + 4               Repulsion              E...E         f                          7030
Intrachain i to i + 4               Repulsion              E...E         c                          6904
Intrachain i to i + 3               Repulsion              K..K          f                          6772
Intrachain i to i + 3               Repulsion              K..K          c                          6632

Table 5-9: Top 10 electrostatic repulsions present in the coiled-coils
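To make the columns of Tables 5-8 and 5-9 concrete, the following Python sketch counts intrachain i to i + 3 and i to i + 4 pairs in a single sequence and labels each pair with its type, pattern, and start offset. It is a hypothetical illustration, not the project's code. It assumes that only Glu (E) and Lys (K) are considered, since these are the only residues that appear in the two tables, and that the heptad offset of the first residue is known. The interchain i to i' + 5 case is not modelled, because it requires the partner chain.

```python
from collections import Counter

HEPTAD = "abcdefg"
# Assumption: only the charged residues that appear in Tables 5-8 and 5-9.
CHARGE = {"E": -1, "K": +1}


def intrachain_pairs(sequence: str, register_start: str = "a",
                     spacings=(3, 4)) -> Counter:
    """Count charged pairs at the given spacings within one chain.

    Keys are (spacing, kind, pattern, start_offset), for example
    (3, 'attraction', 'E..K', 'b').
    """
    shift = HEPTAD.index(register_start)
    counts = Counter()
    for i, first in enumerate(sequence):
        if first not in CHARGE:
            continue
        for gap in spacings:
            j = i + gap
            if j >= len(sequence) or sequence[j] not in CHARGE:
                continue
            second = sequence[j]
            # Opposite charges attract, like charges repel.
            kind = ("attraction" if CHARGE[first] != CHARGE[second]
                    else "repulsion")
            # Pattern notation as in the tables: dots mark the residues
            # lying between the two charged positions.
            pattern = first + "." * (gap - 1) + second
            counts[(gap, kind, pattern, HEPTAD[(i + shift) % 7])] += 1
    return counts
```

Summing such counters over every coiled-coil in a data set and sorting by count would yield a ranking of the same shape as the two tables.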

Figure 5-16: Relationship of amino acids in offset a to an i to i + 3 salt bridge
Figure 5-17: Relationship of amino acids in offset a to an i to i + 4 salt bridge
Figure 5-18: Relationship of amino acids in offset a to an i to i + 5 salt bridge
Figure 5-19: Relationship of amino acids in offset d to an i to i + 3 salt bridge
Figure 5-20: Relationship of amino acids in offset d to an i to i + 4 salt bridge
Figure 5-21: Relationship of amino acids in offset d to an i to i + 5 salt bridge
