Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche


The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its amino acid sequence, the secondary structure is its local folding, the tertiary structure is its long-range, domain-level fold, the quaternary structure is its multimeric organization (if it is made up of multiple peptide chains), and the supramolecular structure is the global assembly. A significant and challenging problem in computational biology is designing algorithms that predict one or more of these levels of structure.

1 Methods of Predicting Protein Structure

Experimental methods, such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, are generally very effective. However, they have drawbacks that make them relatively inefficient. Such methods are expensive, time consuming, and not robust (only some proteins can be crystallized). NMR spectroscopy enables the observation of both structural and dynamic properties of proteins, but is limited in the size of the protein it can handle.

Figure 1: Protein structures determined by X-ray crystallography (A) and NMR spectroscopy (B).

Computational methods, at this point, are relatively unrefined. Ab initio predictions are structure predictions based only on the sequence of the protein in question, using the fundamental principles of protein folding, such as the geometric, kinematic, and energetic properties of the molecule. These methods have not been very successful to date. The most successful ab initio method is currently Rosetta. It breaks the sequence down into short subsequences and searches a library of structural motifs for fragments that may match each one. The assembled fold is then evaluated by applying an energy function. The rationale behind this method is that local structures often fold independently of the full protein.
Such a method is unsatisfying biologically, because it does not use our understanding of how proteins fold.

Homology prediction compares the protein sequence to existing sequences in the PDB with a known three-dimensional structure to determine the evolutionary, structural, and functional changes. Once a homologous protein of known structure is found, local refinements are made in order to compute the full structure of the unknown protein.

2 Threading

2.1 Algorithm

Threading is another computational method of protein structure prediction, which combines methods of homology comparison and molecular modeling. One of the differences between homology modeling and threading is that the latter computes an energy function during the sequence alignment. Threading takes as input a protein sequence and a library of templates, or known structures, as shown in Figure 2. These are PDB structures and, ideally, the library contains all known protein folds. The collection of templates defines the search space of all possible alignments of the protein. To prevent a statistical skew in favor of large protein families when searching for the best match, databases such as SCOP, CATH, and FSSP are used to remove pairs of proteins with highly similar structures, thus normalizing the number of matches for each target.

Figure 2

We then take the sequence and "thread" it through each template in our library (Figure 3). That is, we drag the sequence ACDEFG through each location on each template (we are really just searching for the best placement of the sequence as measured by the energy function). In the third alignment shown in Figure 3, the protein sequence is aligned so that it excludes part of the template. The problem of determining the best arrangement of residues, allowing for such insertions and gaps, is based on the scoring of the structure-to-sequence alignment. Once all of the candidate structures together with their scores are gathered, the lowest-energy one is accepted as the structure prediction of the protein. The energy function is expressed as a weighted sum of five terms: how preferable it is to put two residues near each other, taking into account steric clashes and hydrophobic groupings (E_p); the alignment gap penalty (E_g); the compatibility with local secondary structure prediction (E_ss); how well a residue fits its structural environment (E_s); and how often a residue mutates to the template residue (E_m).

Figure 3

Template structures are represented as contact graphs. A contact graph abstracts each residue to a point. Core elements are short subsequences, usually in helices or sheets, within which no alignment gaps are allowed. Cores are then connected by contacts, which are the edges of the graph. Consequently, contact graphs reduce the search space of possible threadings.
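
The weighted-sum energy and the "lowest energy wins" rule described above can be sketched in a few lines of Python. This is only an illustration: the weights and the per-template term values are made-up numbers, and the names `EnergyTerms`, `WEIGHTS`, `total_energy`, and `best_template` are hypothetical, not part of any threading program.

```python
from dataclasses import dataclass

@dataclass
class EnergyTerms:
    """The five terms named in the notes, for one sequence-to-template alignment."""
    pairwise: float     # E_p: residue-residue contact preference
    gap: float          # E_g: alignment gap penalty
    secondary: float    # E_ss: agreement with predicted secondary structure
    environment: float  # E_s: fit of residues to their structural environment
    mutation: float     # E_m: substitution score against the template residue

# Hypothetical weights; a real program would tune these on training data.
WEIGHTS = {"pairwise": 1.0, "gap": 1.0, "secondary": 0.5,
           "environment": 1.0, "mutation": 0.8}

def total_energy(terms, weights=WEIGHTS):
    """Weighted sum of the five energy terms."""
    return sum(weights[name] * getattr(terms, name) for name in weights)

def best_template(scored):
    """scored maps template id -> EnergyTerms; return (id, energy) with lowest energy."""
    return min(((tid, total_energy(t)) for tid, t in scored.items()),
               key=lambda pair: pair[1])

# Made-up term values for two candidate templates.
candidates = {
    "template_A": EnergyTerms(-4.0, 1.0, -1.0, -2.0, -0.5),
    "template_B": EnergyTerms(-2.5, 0.0, -0.5, -1.0, -1.0),
}
best_id, best_energy = best_template(candidates)
print(best_id, round(best_energy, 2))
```

In a real threader each template would contribute the energy of its best alignment, so the hard part is the alignment search rather than this final comparison.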

2.2 Difficulties

There are a significant number of computational problems associated with such an algorithm. For instance, the final prediction depends on the size and details of the initial library. A small library is convenient, because threading calculations are often slow and a small library reduces computation time. However, threading score functions are imperfect, so the closer a template is to the correct structure, the more likely the sequence is to score well against it. Ideally, then, we should have large libraries to increase the chance that such a template is present. In practice, groups use template libraries of about 500 to 5000 members, and there is no way to determine the optimal size of the library.

There are also difficulties with the representation of templates as contact graphs. Since alignment gaps are confined to the connecting non-core loop regions, the lengths of the connecting loops are variable. As a result, we have an exponentially large search space of possible threadings. When we search for such a sequence-structure alignment, allowing for variable gap lengths and interactions between neighboring amino acids, finding the globally optimal threading is NP-hard. Thus, no polynomial-time exact algorithm is known, and exact methods take time exponential in protein size in the worst case.

This can be proved by showing that threading is at least as difficult as the MAX-CUT problem: given a graph G = (V, E), find a cut (S, T) of V with the maximum number of edges between S and T. The mapping from MAX-CUT to threading is direct. We start with a graph to which we want to apply MAX-CUT, and consider it a contact graph for a structure. We then make a protein sequence of 2n amino acids for a structure with a total of n nodes, where each amino acid is represented by either a 0 or a 1. If we drop all of the energy terms except for E_p, and we say that each edge whose endpoints are labeled (0,1) or (1,0) gets a score of 1, then maximizing the threading score is maximizing the number of 0s next to 1s. This is the same as finding the maximum cut of the graph.
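
The correspondence can be checked on a small graph. The sketch below assumes the simplified setting of the argument above: each node of the contact graph receives a label 0 or 1, and only the pairwise term is scored. The function names and the example graph are hypothetical, and the exhaustive search is deliberately naive.

```python
from itertools import product

def threading_score(edges, labels):
    """Pairwise-only score: 1 for every contact whose two nodes carry different labels."""
    return sum(1 for u, v in edges if labels[u] != labels[v])

def best_labeling(n_nodes, edges):
    """Try all 2^n labelings; exponential, in line with the hardness result."""
    best = max(product((0, 1), repeat=n_nodes),
               key=lambda labels: threading_score(edges, labels))
    return best, threading_score(edges, best)

# Example contact graph: a 4-cycle with one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
labels, score = best_labeling(4, edges)

# Nodes labeled 0 form S and nodes labeled 1 form T, so the score is the cut size.
S = {v for v in range(4) if labels[v] == 0}
T = {v for v in range(4) if labels[v] == 1}
print(labels, score, S, T)
```

The labeling that maximizes the score splits the nodes into S and T with the largest number of crossing edges, which is exactly a maximum cut.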

3 Threading Programs

3.1 Branch and Bound

The branch and bound method is an exact method for searching for an optimal alignment, but in the worst case it takes exponential time in the size of the protein. The algorithm assumes that the set of solutions can be partitioned into subsets, and that a lower bound on the best solution in each subset can be computed quickly. In the diagram, each circle illustrates the space of possible threadings, the solid lines indicate partitions made in a previous step, and dashed lines indicate partitions made in the current step. Furthermore, numbers indicate lower bounds for newly created subsets, and arrows indicate the set that was partitioned. Branch and bound selects the subset with the best (lowest) bound, subdivides it, and computes a bound for each resulting subset. The efficiency of this search is determined by how the lower bound for a set of possible threadings is computed, and by how a threading set is partitioned into subsets. Ideally, the lower bound should take into consideration the interaction of each core segment with the preceding segment, and its best interaction with the other segments. A threading set is partitioned by selecting a core segment and choosing a split point in its set of positions.

3.2 RAPTOR

Currently, the best protein threading program is probably RAPTOR. RAPTOR formulates threading as a linear programming problem: the optimization of a linear function of a given number of variables subject to linear constraints, which are typically inequalities. Linear programs are solvable in polynomial time. We then solve the problem with the addition of integer constraints on the variables.

RAPTOR uses integer linear programming to minimize the protein energy function subject to the constraints. In the constraints, the variable x_{i,k} denotes that core i is aligned to sequence position k, y_{(i,k)(j,l)} denotes that core i is aligned to position k and core j is aligned to position l, D[i] gives all of the positions to which core i can be aligned, and R[i,j,k] is the set of possible alignment positions of core j given that core i aligns to position k. In RAPTOR's energy function, only interactions between residues in the cores are considered, as it is generally accepted that interactions involving loop residues can be ignored since they contribute very little to fold recognition.

\begin{align*}
\text{Minimize}\quad & E = W_m E_m + W_s E_s + W_p E_p + W_g E_g + W_{ss} E_{ss} \\
\text{s.t.}\quad & x_{i,l} + x_{i+1,k} \le 1, && \forall k \notin R[i, i+1, l] \\
& y_{(i,l)(j,k)} \le x_{i,l}, && k \in R[i,j,l] \\
& y_{(i,l)(j,k)} \le x_{j,k}, && l \in R[j,i,k] \\
& y_{(i,l)(j,k)} \ge x_{i,l} + x_{j,k} - 1 \\
& \sum_{l \in D[i]} x_{i,l} = 1 \\
& x, y \in \{0,1\}
\end{align*}

The first constraint ensures that cores are aligned in order. The three constraints on y ensure that each y variable is 1 if and only if its two x variables are 1, so that x and y represent exactly the same threading. The summation constraint ensures that each core has only one alignment position.

RAPTOR then relaxes the integer constraints to linear constraints, so that x and y lie between 0 and 1. Subsequently, a standard linear programming solver, such as IBM's Optimization and Solution Library, is used to solve the linear problem. If the resulting solution is integral, then the program is done. If the solution is not integral, however, we get fractional placements of the cores. In that case, we heuristically select a non-integral variable and generate two sub-problems by setting the variable to 0 and to 1. We then solve each sub-problem using branch and bound. In practice, though, we do not usually need this last step, because almost all (99%) of solutions are integral.
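
Branch and bound appears both as the exact search of Section 3.1 and as RAPTOR's fallback for non-integral solutions. The sketch below illustrates the idea on a toy threading problem and is not the algorithm of either program. It assumes a hypothetical table `singleton[i][k]` giving the energy of placing core i at sequence position k, and a loop penalty `|gap - preferred_gap|` between consecutive cores. Each subset is the set of threadings that share the placements of the first few cores, and its lower bound ignores loop penalties for the cores not yet placed.

```python
import heapq

def branch_and_bound(singleton, core_len, preferred_gap, seq_len):
    """Best-first branch and bound over ordered, non-overlapping core placements.

    singleton[i][k]: energy of placing core i at position k (hypothetical table).
    Returns (energy, start positions) of an optimal threading, or None.
    """
    m = len(singleton)
    # tail_min[i]: cheapest conceivable singleton energy of cores i..m-1.
    # Loop penalties are non-negative, so this never overestimates.
    tail_min = [0.0] * (m + 1)
    for i in range(m - 1, -1, -1):
        tail_min[i] = tail_min[i + 1] + min(singleton[i])

    # Heap entries: (lower bound, energy so far, start positions placed so far).
    heap = [(tail_min[0], 0.0, ())]
    while heap:
        bound, energy, placed = heapq.heappop(heap)
        i = len(placed)
        if i == m:
            # A complete threading with the lowest bound on the heap is optimal.
            return energy, placed
        # Cores must stay in order and must not overlap.
        first = placed[-1] + core_len[i - 1] if placed else 0
        for k in range(first, seq_len - core_len[i] + 1):
            e = energy + singleton[i][k]
            if placed:
                e += abs((k - first) - preferred_gap)  # loop-length penalty
            heapq.heappush(heap, (e + tail_min[i + 1], e, placed + (k,)))
    return None

# Two cores of length 2 threaded onto a sequence of length 8 (made-up energies).
singleton = [
    [3, 1, 2, 4, 5, 5, 5, 5],
    [9, 9, 9, 2, 1, 3, 0, 9],
]
energy, starts = branch_and_bound(singleton, core_len=[2, 2],
                                  preferred_gap=1, seq_len=8)
print(energy, starts)
```

A tighter bound, such as one that also accounts for the interaction with the preceding segment, prunes more subsets at the cost of more work per bound.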

3.3 Results

According to benchmark tests such as CAFASP3, RAPTOR does relatively well in structure prediction. Overall, though, threading is losing ground in competitions because ab initio methods are improving for unknown proteins and homology predictions are improving as protein databases expand.

4 Specialized Prediction Methods

4.1 Predicting Motifs

Some prediction algorithms seek to identify structural motifs within proteins. For instance, a coiled coil, which is a doubly twisted helix, contains 3.5 amino acids per turn instead of 3.6, which causes the helix to bend around itself. This structure is very strong, and is typically used by viruses to penetrate the cell membrane. Other unique motifs are zinc fingers, helix-loop-helix structures, beta helices, beta trefoils, and beta barrels. Because the coiled-coil sequence pattern is identical on the left and right sides, it is computationally easy to identify. The algorithm NewCoils computes the propensity of each amino acid to occupy each position and applies a weight matrix. Similarly, PairCoils computes the pairwise probabilities of the neighboring positions in the structure. Each residue is given a score based on the likelihood that it is in a coiled coil. To compute this score, the maximum window score over the 28-residue windows containing the amino acid is determined.

A similar template method, BetaWrap, can predict beta helices. This program scores sequences based on their compatibility with the right-handed beta-helix fold. It incorporates residue pair preferences taken from beta sheets in non-beta-helix structures to create and score potential wraps of a given sequence into a beta-helical structure, using hydrophobicity. The program then returns the best-scoring parses of the sequence into multiple rungs.

Motif recognition is useful as an intermediate step toward full structure prediction, because it is more amenable to sequence analysis, and because it offers useful information about the function of the protein.
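
The per-residue scoring rule for coiled coils (take the maximum over all windows that contain the residue) can be sketched as follows. The window score used here, a plain mean of hypothetical per-position propensities, is an assumption made for illustration; the actual programs score windows with position-specific weight matrices or pairwise probabilities. The function name `residue_scores` is likewise hypothetical.

```python
def residue_scores(position_scores, window=28):
    """For each residue, return the best score among all windows containing it.

    position_scores[k]: hypothetical coiled-coil propensity at position k.
    The window score is a simple mean, standing in for a real scoring scheme.
    """
    n = len(position_scores)
    if n < window:
        raise ValueError("sequence is shorter than the window")

    # Prefix sums let each window mean be computed in constant time.
    prefix = [0.0]
    for p in position_scores:
        prefix.append(prefix[-1] + p)
    window_score = [(prefix[s + window] - prefix[s]) / window
                    for s in range(n - window + 1)]

    scores = []
    for k in range(n):
        lo = max(0, k - window + 1)   # earliest window start that still covers k
        hi = min(k, n - window)       # latest window start that still covers k
        scores.append(max(window_score[lo:hi + 1]))
    return scores

# Small demonstration with a window of 3 instead of 28.
demo = residue_scores([0, 0, 3, 3, 3, 0, 0], window=3)
print(demo)
```

Taking the maximum means a residue inherits the score of the most favorable window it belongs to, so residues at the edge of a strong region still score well.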

4.2 Predicting Secondary Structure

Secondary structure is generally harder to predict than motifs because of the relative lack of constraints. The best algorithms usually use neural networks. For instance, PSIPRED generates a profile based on the given sequence, and then passes it to a pre-trained neural network, which determines whether each residue of the sequence lies in an alpha helix, beta sheet, or loop. The training procedure starts with a database of known folds and removes structurally redundant proteins, leaving 187 proteins. This program performs with an accuracy of 76%, and is recognized as the best such algorithm.

References

Diagrams and some descriptions were obtained from:

CS273 Lecture Slides 15. <http://www.stanford.edu/class/cs273/slides/cs273_structureprediction2.ppt>

R.H. Lathrop and T.F. Smith. Global optimum protein threading with gapped alignment and empirical pair score functions. Journal of Molecular Biology, 255:641-655, 1996.

J. Xu, M. Li, D. Kim, and Y. Xu. RAPTOR: Optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology, 1, 2003.

B. Berger, D.B. Wilson, E. Wolf, T. Tonchev, M. Milla, and P.S. Kim. Predicting coiled coils by use of pairwise residue correlations. Proc. Natl. Acad. Sci., 92:8259-8263, 1995.

A.E. Torda. Proteomics Handbook: Protein Threading. University of Hamburg, 15 Nov. 2003.

D.T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195-202, 1999.