Mining and classification of repeat protein structures

Ian Walsh, Ph.D.
BioComputing UP, Department of Biology, University of Padova, Italy
URL: http://protein.bio.unipd.it/

Repeat proteins: why are they important?
Abundant in nature: cell regulation, transcriptional control, nervous system roles, protein transport.
Disease: neurodegenerative and infectious disease.
Very modular: repeating structural motifs, good for protein design, bad for homology modeling.
Example: ribonuclease inhibitor (PDB code: 1z7x_W).

Example: why do we need a dedicated resource? (PDB code: 3oja_A)

Goal: a dedicated repeat resource
The Protein Data Bank is great for collecting information on all structures. The Repeats Data Bank adds repeat-specific annotation in four steps:
1. Extract repeat structures from the PDB.
2. Identify repeating domains and classify them.
3. Split repeats into tandem arrays.
4. Make the Data Bank publicly available.

RAPHAEL (Walsh, et al. Bioinformatics, 28:3257-64, 2012)
Idea: discriminate repeat from globular structures.
Algorithm: identify periodic changes in 3D coordinates, using geometry and machine learning.
Figure (periodic variance): (A) repeat protein, leucine-rich effector protein (PDB code 1JL5); (B) globular protein, sulfhydryl protease (PDB code 9PAP).
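The idea of detecting periodic changes along the chain can be illustrated with a toy example. This is a minimal sketch, not RAPHAEL's actual algorithm: it assumes a hypothetical one-dimensional profile (for instance, one Cα coordinate per residue) and uses plain autocorrelation, a swapped-in technique, to find the lag at which the profile best repeats. The function names and the synthetic data are illustrative only.

```python
import math


def autocorrelation(profile, lag):
    """Normalized autocorrelation of a 1D profile at the given lag."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [x - mean for x in profile]
    variance = sum(c * c for c in centered)
    if variance == 0 or lag >= n:
        return 0.0
    covariance = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return covariance / variance


def dominant_period(profile, min_lag=2, max_lag=None):
    """Return the lag (in residues) with the highest autocorrelation."""
    if max_lag is None:
        max_lag = len(profile) // 2
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(profile, lag))


# Synthetic profile (assumption): a coordinate that oscillates every 20 residues.
synthetic_profile = [math.sin(2 * math.pi * i / 20) for i in range(200)]
estimated_period = dominant_period(synthetic_profile, min_lag=5)
print(estimated_period)
```

A globular structure would be expected to show no pronounced autocorrelation peak in such a profile, which is the intuition behind separating the two cases; how RAPHAEL combines its geometric features with machine learning is described in the cited paper.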

RAPHAEL results: detection
Large improvement over sequence-based detectors.
Figure: ROC curve on the combined training and test set. RAPHAEL, trained using the leave-one-out split, is compared to four other methods. Each curve ends when a method produces no further output, i.e. when it believes it has found all solenoids.
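To make the remark about curves ending early concrete, here is a small sketch of how ROC points can be computed when a method gives no output for some structures. It is an illustration under stated assumptions, not the evaluation code behind the slide: the scores and labels are made up, `None` is a hypothetical marker for "no output", and ties between scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, lowering the threshold one prediction at a time.

    A score of None means the method produced no output for that structure.
    Such structures still count toward the totals, so the curve stops short
    of (1, 1).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(
        ((s, l) for s, l in zip(scores, labels) if s is not None),
        key=lambda pair: -pair[0],
    )
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Hypothetical data: label 1 = repeat, 0 = globular; third structure has no output.
example_scores = [0.9, 0.8, None, 0.3]
example_labels = [1, 0, 1, 1]
example_curve = roc_points(example_scores, example_labels)
print(example_curve)
```

In this example the curve stops at a true positive rate of 2/3, because one repeat never receives a score.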

RAPHAEL: period
The period is the number of residues in each repeating unit. It is derived from distance frequencies.
Accurate period calculation: accuracy = 90% within 5 residues.
(Figure label: 20 residues.)
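The "accuracy within 5 residues" measure can be read as the fraction of structures whose predicted period differs from the reference period by at most 5 residues. The sketch below shows that reading; the function name and the example periods are hypothetical and are not taken from the RAPHAEL benchmark.

```python
def accuracy_within(predicted, reference, tolerance=5):
    """Fraction of predictions within `tolerance` residues of the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return hits / len(predicted)


# Hypothetical periods for four structures.
predicted_periods = [24, 20, 33, 12]
reference_periods = [22, 28, 30, 12]
period_accuracy = accuracy_within(predicted_periods, reference_periods)
print(period_accuracy)  # 3 of 4 predictions are within 5 residues
```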

RAPHAEL: insertions
Insertions are the non-periodic parts of a repeat structure. They are derived from deviations from the period.
Not every repeat is easy to detect.
Figure: example of insertions for a β-solenoid.
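One simple way to picture "deviations from the period" is to look at the spacing between consecutive repeat units and flag any stretch that is much longer than the period. The following is a sketch of that picture only, under assumptions: unit start positions are taken as given, and the tolerance is a hypothetical parameter. It does not reproduce how RAPHAEL derives insertions.

```python
def candidate_insertions(unit_starts, period, tolerance=3):
    """Return (first_residue, last_residue) ranges where unit spacing exceeds the period."""
    insertions = []
    for start, next_start in zip(unit_starts, unit_starts[1:]):
        excess = (next_start - start) - period
        if excess > tolerance:
            # Residues beyond one full period are treated as the insertion.
            insertions.append((start + period, next_start - 1))
    return insertions


# Hypothetical unit starts for a repeat with a period of 22 residues.
example_starts = [1, 23, 45, 80, 102]
example_insertions = candidate_insertions(example_starts, period=22)
print(example_insertions)
```

Here the third unit is followed by 13 extra residues (67 to 79), which are reported as a single candidate insertion.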

ProteinZoo
Invites people to assist in the classification of large numbers of proteins (crowd science).
Large repeat protein set + large number of people + in-house software annotation tools.
Output: a consensus at the classification level and a consensus at the detailed level.
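A consensus among several annotators can be sketched as a majority vote. This is only an illustration: the slides do not say how ProteinZoo computes its consensus, and the `min_agreement` threshold and the vote lists are hypothetical.

```python
from collections import Counter


def consensus(votes, min_agreement=0.5):
    """Return the most common label if more than `min_agreement` of voters chose it."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Hypothetical class votes from annotators for two structures.
clear_case = consensus(["Solenoids", "Solenoids", "Closed (toroids)"])
split_case = consensus(["Fibrils", "Closed (toroids)"])
print(clear_case, split_case)
```

A structure without a clear majority returns `None` here and could be left unclassified until more annotations arrive.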

Annotation with ProteinZoo
RAPHAEL mining: repeats, predicted period, predicted insertions.
ProteinZoo: >7,000 structures.

RepeatsDB (Di Domenico, et al. Nucl. Acids Res., accepted)
RepeatsDB is a collaborative database of repeat proteins. It builds on the results of RAPHAEL and leverages manual curation to provide a high-quality gold standard of repeat protein annotations.
In collaboration with: Dr. Andrey V. Kajava (CNRS Montpellier) and Dr. Diego U. Ferreiro (Universidad de Buenos Aires).
Figure examples: giant RNA twister (TAL effector, PDB: 4gjr); propeller with spare part (proteasome regulatory particle, PDB: 3acp).

Schematic view of the different classification levels (pyramid scheme)
The levels, in the order they are introduced:
1. RAPHAEL mining.
2. ProteinZoo basic classification (tree), based on a scheme previously proposed (Kajava A.V., Tandem repeats in proteins: From sequence to structure. J Struct Biol, 2012).
3. Bridging manual with predicted annotations: homology by sequence alignment.
4. ProteinZoo detailed annotation: domain assignment, tandem arrays assigned in sequence.
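The bridging level, which links manually curated entries to predicted ones through sequence homology, can be pictured as transferring a class from a curated structure to a predicted one when their aligned sequences are similar enough. The sketch below assumes the alignment has already been computed, uses simple percent identity, and picks a hypothetical threshold of 0.7; none of these choices are stated in the slides.

```python
def percent_identity(aligned_a, aligned_b):
    """Identity over columns where neither aligned sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def transfer_class(aligned_query, aligned_curated, curated_class, min_identity=0.7):
    """Return the curated class if the pair is similar enough, else None."""
    if percent_identity(aligned_query, aligned_curated) >= min_identity:
        return curated_class
    return None


# Hypothetical pre-aligned fragments; the curated entry is labeled "Solenoids".
query_fragment = "LRELDLSNN-L"
curated_fragment = "LKELDLSGNKL"
bridged_class = transfer_class(query_fragment, curated_fragment, "Solenoids")
print(percent_identity(query_fragment, curated_fragment), bridged_class)
```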

The classification tree
(I) Crystallites
(II) Fibrils
(III) Solenoids
(IV) Closed (toroids)
(V) Beads-on-a-string
Each class has a 2nd-level sub-division.

ProteinZoo: in progress
Classes shown: Fibrils, Solenoids, Closed (toroids), Beads-on-a-string, Unclassified.

2nd-level sub-division: examples
Fibrils (2 sub-classes), Solenoids (5 sub-classes), Closed (6 sub-classes), Beads-on-a-string (3 sub-classes).
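For readers who want to work with the classification programmatically, the tree can be held in a small lookup table. This is a sketch with hypothetical variable names; the class names and sub-class counts are the ones given in the slides, and the count for Crystallites is left as `None` because the slides do not state it.

```python
# Top-level classes, keyed by the Roman numeral used in the classification tree.
REPEAT_CLASSES = {
    "I": "Crystallites",
    "II": "Fibrils",
    "III": "Solenoids",
    "IV": "Closed (toroids)",
    "V": "Beads-on-a-string",
}

# Number of 2nd-level sub-classes per class (None = not stated in the slides).
SUBCLASS_COUNTS = {
    "Crystallites": None,
    "Fibrils": 2,
    "Solenoids": 5,
    "Closed (toroids)": 6,
    "Beads-on-a-string": 3,
}


def known_subclass_total(counts):
    """Sum the sub-class counts that are actually stated."""
    return sum(c for c in counts.values() if c is not None)


print(known_subclass_total(SUBCLASS_COUNTS))
```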

RepeatsDB web interface (URL: http://repeatsdb.bio.unipd.it/)
Views shown: tandem arrays in sequence, secondary structure, and insertions; structural annotation; functions, domains, and cross-links.

Use case: improved homology modeling
Standard approach: the query sequence (a repeat) is turned into a Hidden Markov Model (HMM) and searched against template HMMs (HMM-HMM search). The result is often a poor model, because repeats with the same structure can have considerable sequence divergence.
With repeat templates: the query HMM is searched against repeat template HMMs. The model should improve because these HMMs are more finely tuned.

Acknowledgements
Tomàs Di Domenico, Dr. Emilio Potenza, Dr. Giovanni Minervini (University of Padova)
Manuel Giollo, Prof. Silvio Tosatto
Dr. Awais Ihsan (Sahiwal, Pakistan)
Dr. Andrey V. Kajava (CNRS Montpellier, France)
Gonzalo Parra (Universidad de Buenos Aires, Argentina)
Funding: FIRB Futuro in Ricerca, Università di Padova, CARIPLO, CARIPARO, AIRC
URL: http://protein.bio.unipd.it/