

Protein structure classification is important because it organizes the protein structure universe independently of sequence similarity. A global picture of the protein universe will help us understand the functions of proteins better. Structures are usually classified by their secondary structure content and by how those secondary structures are arranged.

Example structures and their SCOP classes:
Lactate dehydrogenase: mixed alpha/beta
Immunoglobulin: all beta
Hemoglobin beta chain: all alpha
[Figure: structure 2MHR]

Proteins can be modular: a single chain may be divisible into smaller independent units of tertiary structure called domains. Domains are the basic unit of structure classification, and different domains in a protein are often associated with different functions carried out by the protein.

Definitions of a domain:
A polypeptide or part of a polypeptide chain that can independently fold into a stable tertiary structure... (from Introduction to Protein Structure, by Branden & Tooze)
Compact units within the folding pattern of a single chain that look as if they should have independent stability. (from Introduction to Protein Architecture, by Lesk)

SCOP: http://scop.berkeley.edu

SCOP classification. A hierarchical classification scheme is maintained:
Class: derived from secondary structure content
Fold: derived from topological connection, orientation, arrangement and number of secondary structures
Superfamily: clusters of low sequence similarity but related structures and functions
Family: clusters of proteins with sequence similarity > 30% and similar structure and function

[Diagram: SCOP classification tree, from the root down through the classes]

CATH classification (http://cathwww.biochem.ucl.ac.uk/latest/index.html):
Class [C]: derived from secondary structure content (automatic)
Architecture [A]: derived from orientation of secondary structures (manual)
Topology [T]: derived from topological connection and number of secondary structures
Homologous Superfamily [H]: clusters of similar structures and functions

In the CATH hierarchy, Class simply describes what type of secondary structure is present. There are only four classes: mainly alpha, mainly beta, mixed alpha-beta, and few secondary structures. 90% of structures are trivial to assign at this level.

Architecture is hard to define precisely. In CATH it is defined broadly as describing general features of protein shape, such as the arrangement of secondary structures in 3D space. It does not define connectivities between secondary structural elements; that is what the topology level does. It does not even explicitly define the directionality of secondary structure, e.g. parallel or antiparallel beta-sheets. In CATH, architectures are presently assigned manually, by visual inspection.

If two proteins have the same topology, they have the same number and arrangement of secondary structures, and the connectivities between these elements are the same. This is also sometimes called the fold of a protein. In CATH, automated structure alignment is used to group proteins according to topology.

Example: a four-stranded antiparallel beta-sheet can have many different topologies, depending on the order in which the four beta-strands are connected, e.g. up-and-down or Greek key.

The lowest two levels in the CATH hierarchy relate to common ancestry. Some, but not all, proteins with the same fold show evidence of common ancestry. The surest sign of common ancestry is that two proteins have sequences roughly >30% identical (sequence level). If the protein sequences are not that similar, common ancestry may still be inferred from a combination of structural and functional similarity, and possibly weak sequence similarity (homologous superfamily level).

SCOP levels: class, fold, superfamily, family, domain
CATH levels: class, architecture, topology, homologous superfamily, sequence family, domain

CATH is more directed toward structural classification; SCOP pays more attention to evolutionary relationships.
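The roughly 30% identity criterion mentioned above can be illustrated with a short sketch. It assumes the two sequences are already aligned (equal length, with "-" for gaps) and that identity is counted over columns where neither sequence has a gap; other denominators are in use, and the function names and the default threshold argument are illustrative only, not part of CATH or SCOP.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: the inputs are rows of an existing pairwise alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def passes_sequence_level(identity_pct: float, threshold: float = 30.0) -> bool:
    """True if identity exceeds the (hypothetical, adjustable) threshold."""
    return identity_pct > threshold


# Toy alignment: 4 identical residues out of 5 ungapped columns.
identity = percent_identity("ACDE-G", "ACDQWG")
print(identity, passes_sequence_level(identity))
```

Pairs that fall below the threshold are not ruled out as relatives; as the text notes, structural and functional similarity can still support common ancestry.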

In CATH there is one class to represent mixed alpha-beta; in SCOP there are two:
alpha/beta: beta structure is largely parallel, made of beta-alpha-beta motifs
alpha+beta: alpha and beta structure segregated to different parts of the structure

What the two schemes have in common: they are hierarchical and based on abstractions, and they both include some manual aspects and are curated by experts in the field of protein structure.

Can structure classification and comparison be fully automated?

[Diagram: SCOP classification tree with a new protein structure whose place among the classes is unknown]

Class membership: Does the query protein belong to a SCOP category, or does a new category need to be defined? This is a binary classification problem: member or non-member.
Class label assignment: Which SCOP category is the query protein assigned to? This is a multi-class classification problem.

Let p be a protein structure, and proceed bottom-up from level to level:
Does p belong to a family? If yes, report the family. If no:
Does p belong to a superfamily? If yes, report the superfamily. If no:
Does p belong to a fold? If yes, report the fold. If no, report a new fold.

Ensemble classifier: a consensus classification scheme in which the decisions of a number of individual (component) classifiers are merged together in an intelligent way. Decision tree classification is one technique that can serve as the intelligent way.
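The bottom-up procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the membership tests are passed in as plain functions that return a label or None, and the toy tests and label names below are hypothetical placeholders for real classifiers.

```python
from typing import Callable, Dict, Optional, Tuple

# Levels are tried from the most specific to the most general.
LEVELS = ("family", "superfamily", "fold")

# A membership test returns the matching label, or None for "not a member".
MembershipTest = Callable[[str], Optional[str]]


def classify_bottom_up(
    protein: str, tests: Dict[str, MembershipTest]
) -> Tuple[str, Optional[str]]:
    """Return (level, label) for the first level that accepts the protein.

    If no level accepts it, report ("new fold", None).
    """
    for level in LEVELS:
        label = tests[level](protein)
        if label is not None:
            return level, label
    return "new fold", None


# Hypothetical toy tests keyed on made-up protein identifiers.
toy_tests: Dict[str, MembershipTest] = {
    "family": lambda p: "fam_A" if p == "p1" else None,
    "superfamily": lambda p: "sf_B" if p in ("p1", "p2") else None,
    "fold": lambda p: "fold_C" if p in ("p1", "p2", "p3") else None,
}

for name in ("p1", "p2", "p3", "p4"):
    print(name, classify_bottom_up(name, toy_tests))
```

In an ensemble setting, each membership test could itself be the merged decision of several component classifiers at that level.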

Using a sequence/structure comparison tool as a classifier

Perform a nearest neighbor query:
if similarityscore(query, NN) < trained cutoff
then not a member of any category
else member of class(NN)

Comparison tools that we have used:
Sequence: PSI-Blast, HMMER + SUPERFAMILY database
Structure: three structure comparison tools

Class membership (database: SCOP 1.59; query: SCOP 1.61 - SCOP 1.59). Reported percentages:
78.6% 66.1% 72.2% 77.6% 78.4% 94.5% 73%
BLAST: 92.6% 60.7% 89% 78.5% 89% 82% 89% 85%
At least one: 98.2% 96% 100%

Class label assignment (database: SCOP 1.59; query: SCOP 1.61 - SCOP 1.59). Reported percentages:
94.8% 69% 40.5%
BLAST: 92.3% 12% 0% 91% 81% 80.4% 81.7% 40.5% 88% 46% 92% 54%
At least one: 97.9% 93.9% 64.9%

Universal confidence levels instead of tool-specific scores

Perform nearest neighbor queries (database: SCOP 1.59; query: SCOP 1.61 - SCOP 1.59). Partition the score space of each tool into confidence levels, e.g. for a z-score of 5.4 we are 80% confident that the query protein is a member of an existing category. Each component classifier reports a confidence level for the query protein: c = [C1, C2, C3, C4, C5]

What is the best way to combine these probabilistic decisions? A solution: decision trees.

[Figure: example decision tree over the component confidence levels, with threshold tests (< 45%, < 40%, < 55%, >= 55%, > 75%, > 93%) leading to leaves labeled new or existing]
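The nearest-neighbor rule and the mapping from tool-specific scores to confidence levels can be sketched as follows. This is an illustration under stated assumptions, not the tools' actual interfaces: `toy_similarity` is a hypothetical stand-in for a real comparison tool, and the cutoff, bin edges, and bin confidences are made-up values that would in practice be learned from training queries.

```python
from bisect import bisect_right
from typing import Callable, Optional, Sequence, Tuple


def nearest_neighbor_classify(
    query: str,
    database: Sequence[Tuple[str, str]],  # (item, category label) pairs
    similarity: Callable[[str, str], float],
    cutoff: float,
) -> Tuple[Optional[str], float]:
    """Return (label of nearest neighbor, score), or (None, score) below cutoff."""
    best_label: Optional[str] = None
    best_score = float("-inf")
    for item, label in database:
        score = similarity(query, item)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < cutoff:
        return None, best_score  # not a member of any category
    return best_label, best_score


def score_to_confidence(
    score: float, bin_edges: Sequence[float], bin_confidences: Sequence[float]
) -> float:
    """Map a raw score to the confidence of the bin it falls in.

    bin_edges must be sorted ascending; bin_confidences has one more entry.
    """
    return bin_confidences[bisect_right(bin_edges, score)]


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical score: fraction of identical positions (no alignment)."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b), 1)


db = [("AAAA", "fam1"), ("CCCC", "fam2")]
print(nearest_neighbor_classify("AAAC", db, toy_similarity, cutoff=0.5))
print(nearest_neighbor_classify("GGGG", db, toy_similarity, cutoff=0.5))
# Made-up bins chosen so that a z-score of 5.4 falls in the 80% bin.
print(score_to_confidence(5.4, bin_edges=[2.0, 5.0], bin_confidences=[0.1, 0.5, 0.8]))
```

Reporting confidences rather than raw scores puts all component classifiers on a common scale, which is what allows a single decision tree to combine them.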

The dataset:

             Database         Query                   new
Training     v1.59 (20449)    v1.61 - v1.59 (2241)    248
Evaluation   v1.61 (22724)    v1.63 - v1.59 (2825)    618

new 84 424
new 47 339

[Figure: error percentages for PSI-Blast, Ensemble, and None at the Family, Superfamily, and Fold classification levels]

[Figure: percentage of misclassified proteins for PSI-Blast, Ensemble, and None at the Family, Superfamily, and Fold classification levels]

Russian doll effect: a small structure is contained exactly in a larger structure, but they are classified in different groups in the SCOP hierarchy.

Automated classification of a protein structure is feasible. More accurate classifiers can be built by combining less accurate individual classifiers. The presented classification technique can also be used as an aid to manual classification by providing clues.

Dali - http://www.ebi.ac.uk/dali/
VAST - http://ww.ncbi.nlm.nih.gov/structure/vast/vast.shtml
FSSP - http://www.ebi.ac.uk/dali/fssp/fssp.html
PDBsum - http://ww.biochem.ucl.ac.uk/bsm/pdbsum/

Structural taxonomy reveals that although structures are being solved more rapidly than ever, fewer and fewer of them have new folds! Will we get them all soon?