CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

Similar documents
Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

CS612 - Algorithms in Bioinformatics

Heteropolymer. Mostly in regular secondary structure

Week 10: Homology Modelling (II) - HHpred

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics.

Protein Structure: Data Bases and Classification Ingo Ruczinski

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Protein structure analysis. Risto Laakso 10th January 2005

Sequence analysis and comparison

Protein Structure Prediction

Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University

Analysis and Prediction of Protein Structure (I)

Genome Databases The CATH database

CAP 5510 Lecture 3 Protein Structures

Bioinformatics. Macromolecular structure

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1

Protein Structure Overlap

Basics of protein structure

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009

Structural Alignment of Proteins

SUPPLEMENTARY MATERIALS

COMP 598 Advanced Computational Biology Methods & Research. Introduction. Jérôme Waldispühl School of Computer Science McGill University

Algorithms in Bioinformatics

Comparing Protein Structures. Why?

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Multiple structure alignment with mstali

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Some Problems from Enzyme Families

Homologous proteins have similar structures and structural superposition means to rotate and translate the structures so that corresponding atoms are

EBI web resources II: Ensembl and InterPro

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence

Efficient Protein Tertiary Structure Retrievals and Classifications Using Content Based Comparison Algorithms

Practical considerations of working with sequencing data

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

A profile-based protein sequence alignment algorithm for a domain clustering database

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Automated Identification of Protein Structural Features

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

A New Similarity Measure among Protein Sequences

Pairwise & Multiple sequence alignments

Can protein model accuracy be. identified? NO! CBS, BioCentrum, Morten Nielsen, DTU

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Measuring quaternary structure similarity using global versus local measures.

Non-sequential Structure Alignment

The CATH Database provides insights into protein structure/function relationships

Orientational degeneracy in the presence of one alignment tensor.

Computational Genomics and Molecular Biology, Fall

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Protein tertiary structure prediction with new machine learning approaches

Protein Structure and Function Prediction using Kernel Methods.

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Homology. and. Information Gathering and Domain Annotation for Proteins

Exercise 5. Sequence Profiles & BLAST

proteins Refinement by shifting secondary structure elements improves sequence alignments

Today s Lecture: HMMs

ALL LECTURES IN SB Introduction

Motif Prediction in Amino Acid Interaction Networks

Analysis on sliding helices and strands in protein structural comparisons: A case study with protein kinases

INDEXING METHODS FOR PROTEIN TERTIARY AND PREDICTED STRUCTURES

Template Free Protein Structure Modeling Jianlin Cheng, PhD

Large-Scale Genomic Surveys

Introduction to Evolutionary Concepts

IT og Sundhed 2010/11

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

Structure-Based Comparison of Biomolecules

Automated Identification of Protein Structural Features

Using Phase for Pharmacophore Modelling. 5th European Life Science Bootcamp March, 2017

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

09/06/25. Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Non-uniform distribution of folds. Scheme of protein structure predicition

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE

EECS730: Introduction to Bioinformatics

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

MSAT a Multiple Sequence Alignment tool based on TOPS

Template Free Protein Structure Modeling Jianlin Cheng, PhD

SCOP. all-β class. all-α class, 3 different folds. T4 endonuclease V. 4-helical cytokines. Globin-like

In-Depth Assessment of Local Sequence Alignment

Getting To Know Your Protein

Bayesian Models and Algorithms for Protein Beta-Sheet Prediction

Molecular Modeling lecture 17, Tue, Mar. 19. Rotation Least-squares Superposition Structure-based alignment algorithms

Lecture 7 Sequence analysis. Hidden Markov Models

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Protein Structure Prediction, Engineering & Design CHEM 430

Protein structure alignments

Transcription:

CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison

Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture of proteins Understand evolution of protein function Infer structural relationships Infer evolutionary relationships Determine coverage of fold space Use in predictive modeling

Structure Comparison Points to Consider Feature Extraction What features are to be extracted and compared? Fine level (residue or atom) vs. Coarse Level (SSE) Fine level can be used to make functional hypotheses Coarse level used for global fold comparison / classification Maintenance of Topology? Does pattern need to have similar sequential ordering? Method of Comparison? How similar do structural elements need to be to match? Should be: - Invariant to trivial changes (ie. rotation / translation) - Robust, description should not change drastically due to minor changes in structure 1 2 3 1 3 2

Structure Comparison Protein Similarity Given a correspondence and an optimal positioning of two structures, how close are corresponding residues/ elements? Superposition A B S (A, B) = min T,C D(A, T (B),C) T: Transformation C: Correspondence

RMSD Root Mean Squared Distance Most common method to score similarity of two structures Most useful to compare relatively similar structures Often computed from C only Requires residue correspondence between two proteins Distance measured in Angstroms Smaller RMSD implies more similar structures Coordinate RMSD RMSD C (E) = min T T w i i 1 r i=1 w i r i=1 : Transformation : weights (often 1) : set of equivalenced atoms w i (T i C( i )) 2

RMSD Root Mean Squared Distance Distance RMSD Rotation and Translation Invariant RMSD(E) = 1 r ( ij A 1 i,j n B ij )2 Note that distance-rmsd increases with the size of the point sets being compared. We can normalize by the square root of the length: RMSD(E) = 1 r 1 i,j n ( A ij n B ij )2

RMSD S (A, B) = min T,C D(A, T (B),C) If distance measure (D) is RMSD and correspondence (C) is given, then T can be computed easily using SVD. Protein Bioinformatics, 2004 A B Aligned Centroids Optimal Superposition

RMSD Limitations Equivalence of positions (correspondence) must be known Relative displacement of one subdomain within one structure can result in poor overall fit Insertions and Deletions? Gaps? A B Superposition What can we learn from sequence alignment?

Structural Alignment Lets try the same thing to compute alignment and similarity. 2 1 3 4 i A 1 2 3 5 4 k B How to compute the similarity D i,k? 1 2 3 4 1 2 3 4 5 Similarity Matrix D

Structural Alignment Will Dynamic Programming Work for Structural Alignment?? small, Optimal Substructure: an optimal solution to the problem contains within it optimal solutions to subproblems Overlapping Subproblems: the space of subproblems must be solving the same subproblems over and over Best Alignment and Superposition of Entire Chain Protein Bioinformatics, 2004 Best Alignment and Superposition of first 5 Residues

Structural Alignment Will Dynamic Programming Work for Structural Alignment?? small, Optimal Substructure: an optimal solution to the problem contains within it optimal solutions to subproblems Overlapping Subproblems: the space of subproblems must be solving the same subproblems over and over Any choice to align two substructures (local alignment) will affect the scoring of the global alignment between the complete structures. The independence requirement is violated and DP can no longer guarantee an optimal solution.

Structural Alignment 1 4 3 2 A i 5 Each entry should indicate likelihood of match by incorporating global information 4 1 3 2 B k 1 2 3 4 5 Similarity Matrix 1 2 3 4 D Incorporate some global information locally.

SSAP Structure and Sequence Alignment Program Double Dynamic Programming - flashy name for utilizing two levels of dynamic programming (two levels of scoring matrices) 1 2 3 4 5 1 2 3 4 High-Level Scoring Matrix D H : locally incorporates some global information Position i,j score is likelihood that i,j will appear in the final alignment*. This is determined by computing the best alignment forced to contain each i,j and using the transformation that best superimposes i and j. This is done with the low-level scoring matrix. * this is somewhat different than before

SSAP 1 2 3 4 5 1 2 3 4 High-Level Scoring Matrix D H : locally incorporates some global information Results from DP on low-level matrix are added to the highlevel scoring matrix. ~Voting For each element D H (i, j), define a separate low-level scoring matrix, ij D L. Element ij D L (k,l) gets a score specifying how well a k fits to b l given that a i is perfectly aligned with b j. ij D L Anchor F and C

SSAP 1 3 2 i 5 1 3 2 j 1 3 2 j B 4 A 4 B 5 4 For all pairs of points, compute similarity score based on vector differences. 1 5 1 4 3 3 2 4 2 j i A B Rotated to align local coordinate frame of i and j

SSAP Low-level Scoring Matrices D L with overlaid best alignment from DP High-Level Scoring Matrix Taylor & Orengo, JMB 1989 Anchor F and C Anchor V and C

SSAP For each potential correspondence of residues i, j: 1. Pin down the correspondence i, j 2. Use local coordinate frames to orient the two proteins 3. Compute a low-level scoring matrix where the score between residues x and y is based on similarity of their positions relative to i and j. 4. Use this low-level scoring matrix to find best alignment given correspondence between i and j. 5. Use the result of this best alignment to vote for correspondences in the high-level scoring matrix The correct transformation should bring multiple consistent pairs of residues into proximity and so it should get voted for many times. Time Complexity: O(n 4 )

SSAP Returns a correspondence between two structures and a similarity score Can be used to compare structures for classification No guarantee on optimality - what cases can t we handle? Works well but is slow O(n 4 ) As database size grows this becomes a problem Used by the CATH protein structure classification database Many SSAP Extensions: Incorporate sequence information Multiple ways of computing low-level scoring matrix Iterative version - keep track of a set of candidate anchor points, the best alignment is computed and the anchor point list is updated until convergence.

GRATH Need for faster structure comparison to replace SSAP Goal: Produce a front-end filter for the more reliable SSAP - only those structures are are reasonably close need to be compared with SSAP Graph based method Secondary Structure Matching Nodes are secondary structures Edges contain relationship information Axial vectors computed for each SS element via least squares fitting of C

GRATH b b a m a m d b a : Bioinformatics, 19(14):1748, 2003 left, right, other

GRATH 3 2 4 5 6 1 ( 12, 12,d 12, 12 ) 2 1 ( 14, 14,d 14, 14 ) 4 ( 24, 24,d 24, 24 )

GRATH Given a graph for each protein, compute similarities Use of two matrices - Secondary Structure Similarity Matrix Identical secondary structural elements marked - Correspondence Matrix Consistent pairs of secondary structure marked R1 R2 R3 R4 R5 G1 G2 G3 G4 Bioinformatics, 19(14):1748, 2003

R1 R2 R3 R4 R5 G1 G1 G2 G3 G4 k matches in SS Similarity Matrix Consider all k 2 pairs of matches. The consistency of each pair of matches is recorded in the k x k Correspondence Matrix. R1 G1 ( 12, 12,d 12, 12 ) ( 14, 14,d 14, 14 ) R1 Matches are consistent if: distance, angle, torsion, and chirality are within error tolerance G2 R3

The k x k Correspondence Matrix indicates consistent pairs of secondary structure matches Enforce topology Maintain ordering Maintain self-consistency (a SS can not match itself, ie. the pair 1 and 3 is not allowed) ok not ok Can be considered a graph with k nodes and edges between nodes i and j if M i,j = 1 Bioinformatics, 19(14):1748, 2003 scoring...

GRATH Scoring SS1, SS2: number of secondary structure R1, R2: number of amino acids CS: clique size CR1, CR2: residues in secondary structures of clique W 1, W 2, W 3, W 4 : weights (W 4 = W 1 +W 2 +W 3 )

Atomic Coordinates GRATH Scoring Extract Secondary Structure Compute Axial Vectors Compute pairwise vector similarity measures Generate Secondary Structure Similarity Matrix Examine all pairs of matches for consistency generate Consistency Matrix Clique Detection Scoring

GRATH Results GRATH almost always find the correct fold at the top or close to the top of the ranked list Correct fold is within top-10 results 98% of time ENRICHMENT Bioinformatics, 19(14):1748, 2003 Empirically 3-4 Orders of Magnitude faster than SSAP

GRATH Results The larger the found clique size, the better GRATHs ability to find the correct fold Bioinformatics, 19(14):1748, 2003

Enrichment The increase in frequency of true positives in a dataset Enrichment Factor - the ratio of the frequency of positive samples in the filtered dataset to the frequency of positive samples in the original dataset. Lossy Enrichment - some results meeting specified criteria are lost. Some false negatives. Lossless Enrichment - no results meeting specified criteria are lost. No false negatives. Common Theme A fast enriching algorithm is often followed by a slower, but more precise, verification algorithm for identifying true positives. (ie. GRATH and SSAP)

SCOP: Structural Classification of Proteins http://scop.berkeley.edu/index.html Manual classification based on Sequence, Structure, and Function Unit of classification is the domain Hierarchical Classification of Structures (7 levels) Class - Fold - Superfamily - Family - Protein Domain - Species - PDB Entry

SCOP: Structural Classification of Proteins http://scop.berkeley.edu/index.html Class: Based on secondary structure content all alpha, all beta, alpha/beta, alpha+beta Fold: Based on number, type, and arrangement of secondary structural elements. Same core structure and topology. Superfamily: Based on hypothesized evolutionary relationship. Proteins have similar structure and function (but not necessarily sequence) Family: Proteins with clear evolutionary relationship. Similar structure, function, and >30% sequence identity

SCOP: Structural Classification of Proteins. 1.71 release 27599 PDB Entries (October 2006). 75930 Domains. Class Num folds Num Superfamilies Num Families All alpha proteins 226 392 645 All beta proteins 149 300 594 Alpha and beta proteins (a/b) 134 221 661 Alpha and beta proteins (a+b) 286 424 753 Multi-domain proteins 48 48 64 Membrane and cell surface prot. 49 90 101 Small proteins 79 114 186 Total 971 1589 3004 SCOP has been very useful Over 700 direct citations! Study evolution of enzymatic function Study of distantly related proteins with the same fold Study sequence and structure variability Derive AA similarity matrices Composition of multi-domain proteins Identification of new targets for structural genomics initiatives

CATH http://www.cathdb.info/latest/index.html Class, Architecture, Topology, Homologous Superfamily + Sequence Family and PDB Entry Hierarchical Grouping of Structure Significantly Automated (but not completely) Initiated in 1993

CATH http://www.cathdb.info/latest/index.html Class, Architecture, Topology, Homologous Superfamily Class + Sequence Family and PDB Entry Mainly Alpha Mainly Beta Mixed Alpha- Beta Few Secondary Structures

CATH Architecture Arrangement of secondary structures Ignores order of secondary structures Ignores topology Manually assigned into ~40 different architectures Roll Alpha-Beta Barrel Alpha-Beta Horseshoe 5-Stranded Propeller

CATH Topology (Fold Family) For proteins with same Architecture, do they have same topology (ie. ordering of elements) Performed automatically using SSAP and rules SSAP score >70 and >60% of larger protein matches smaller Similar architecture, different topology

CATH Homologous Superfamily Structures grouped by evolutionary relationships Includes sequence information To be in same Homologous Superfamily: Must have 60% of larger struct equivalent to smaller AND Sequence identity >35% OR SSAP score >80 and sequence identity >20% OR SSAP score >80 and domains with related function Statistics: C A T H CATH v3.1 Jan, 2007 Mainly Alpha 5 305 652 Mainly Beta 20 192 415 Mixed Alpha/Beta 14 496 922 Few Sec Struct 1 92 102

CATH New Structure Structure Comparison (SSAP, GRATH) Sequence Comparison Relatives Identified? Yes Relatives Identified? Structure Comparison (SSAP, GRATH) Sequence Comparison No Identify and Extract Domains Add to CATH Yes No Add to CATH Manually Assign to Existing or New Classification