Mining and classification of repeat protein structures Ian Walsh Ph.D. BioComputing UP, Department of Biology, University of Padova, Italy URL: http://protein.bio.unipd.it/
Repeat proteins: why are they important? Abundant in nature: cell regulation, transcriptional control, nervous system roles, protein transport. Disease: neurodegenerative and infectious disease. Very modular: repeating structural motifs; good for protein design, bad for homology modeling. Example: ribonuclease inhibitor (PDB code: 1z7x_W).
Example: Why do we need a dedicated resource? PDB code: 3oja_A
Goal: a dedicated repeat resource. The Protein Data Bank (PDB) is great for collecting information on all structures. Repeats Data Bank (annotation steps): 1. Extract repeat structures from the PDB. 2. Identify repeating domains and classify them. 3. Split repeats into tandem arrays. 4. Make the data bank publicly available.
RAPHAEL (Walsh, et al. Bioinformatics, 28:3257-64, 2012). Idea: discriminate repeat from globular structures. Algorithm: identify periodic changes in 3D coordinates (geometry, machine learning). Periodic variance: (A) repeat protein, leucine-rich effector protein (PDB code 1JL5); (B) globular protein, sulfhydryl protease (PDB code 9PAP).
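The idea of looking for periodic changes in 3D coordinates can be illustrated with a small sketch. This is not RAPHAEL's published algorithm: it swaps in a plain autocorrelation of a single coordinate axis, and the function names (`autocorrelation`, `strongest_period`), the lag bounds, and the toy coordinates are all assumptions made for illustration.

```python
import math
import random

def autocorrelation(signal, lag):
    """Normalized autocorrelation of a 1D signal at the given lag."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal)
    if variance == 0:
        return 0.0
    covariance = sum(
        (signal[i] - mean) * (signal[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance

def strongest_period(signal, min_lag=5, max_lag=60):
    """Return (lag, score) for the lag with the highest autocorrelation."""
    max_lag = min(max_lag, len(signal) // 2)
    scores = {lag: autocorrelation(signal, lag)
              for lag in range(min_lag, max_lag + 1)}
    best_lag = max(scores, key=scores.get)
    return best_lag, scores[best_lag]

# Toy data only (hypothetical, not real PDB coordinates).
# A solenoid-like coil: 240 residues, one turn every 24 residues.
coil = [(10 * math.cos(2 * math.pi * i / 24),
         10 * math.sin(2 * math.pi * i / 24),
         0.2 * i) for i in range(240)]
coil_x = [x for x, _, _ in coil]

# A non-periodic profile standing in for a globular chain.
rng = random.Random(0)
globular_x = [rng.uniform(-10, 10) for _ in range(240)]

coil_lag, coil_score = strongest_period(coil_x)
glob_lag, glob_score = strongest_period(globular_x)
print("coil:", coil_lag, round(coil_score, 2))
print("globular:", glob_lag, round(glob_score, 2))
```

On the toy coil the strongest lag coincides with the turn length, while the noise profile shows no comparably strong lag. A score of this kind could serve as one input feature to a classifier, which is the role the slide assigns to machine learning.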
RAPHAEL results: Detection. Large improvement vs. sequence detectors. Figure: ROC curve on the combined training and test set. RAPHAEL, trained using the leave-one-out split, is compared to four other methods. A curve ends when a method produces no further output, i.e. when it considers all solenoids found.
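A minimal sketch of how such a ROC curve can be traced, including the case where a method stops reporting before the whole set is covered. The function name `roc_points` and the example scores are hypothetical, and ties between scores are not handled.

```python
def roc_points(scored, n_pos, n_neg):
    """Trace ROC points from the structures a method actually reported.

    scored: list of (score, is_repeat) pairs, one per reported structure.
    n_pos, n_neg: number of repeat / non-repeat structures in the full set.
    Returns a list of (false_positive_rate, true_positive_rate) points.
    """
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Walk down the ranking from the most to the least confident prediction.
    for _, is_repeat in sorted(scored, key=lambda p: p[0], reverse=True):
        if is_repeat:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical method that reports only 4 of the 8 structures in the set.
reported = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
curve = roc_points(reported, n_pos=4, n_neg=4)
print(curve)  # the last point is where this method's curve ends
```

Because the method reported only part of the set, its curve stops before reaching the top-right corner, which is the behavior the figure caption describes.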
RAPHAEL: Period. Number of residues in each repeating unit. Derived from distance frequencies. Accurate period calculation: accuracy = 90% within 5 residues. Figure label: 20 residues.
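One way to read "derived from distance frequencies" is to count how often each residue distance occurs between recurring features and take the most frequent one. The sketch below does that and also shows the "within 5 residues" accuracy measure. The helper names, the peak positions, and the period values are hypothetical illustration data, not RAPHAEL's actual procedure or results.

```python
from collections import Counter

def estimate_period(peak_positions):
    """Most frequent distance (in residues) between consecutive peaks."""
    gaps = [b - a for a, b in zip(peak_positions, peak_positions[1:])]
    return Counter(gaps).most_common(1)[0][0]

def accuracy_within(predicted, reference, tolerance=5):
    """Fraction of predictions within `tolerance` residues of the reference."""
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return hits / len(reference)

# Hypothetical residue indices where a periodic feature recurs.
peaks = [3, 23, 43, 64, 84, 104, 131, 151]
period = estimate_period(peaks)

# Hypothetical predicted vs. curated periods for four proteins.
predicted = [20, 24, 31, 12]
reference = [20, 28, 22, 12]
accuracy = accuracy_within(predicted, reference)
print(period, accuracy)
```

Taking the most frequent distance rather than the mean keeps a single long gap, such as one caused by an insertion, from skewing the estimate.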
RAPHAEL: Insertions. Non-periodic parts. Derivation: deviations from the period. Not all repeats are easy to detect. Example of insertions for a β-solenoid.
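Deriving insertions as deviations from the period can be sketched as follows: wherever the distance between two consecutive repeat units exceeds the period by more than a tolerance, the surplus residues are flagged. The function name `find_insertions`, the tolerance, the unit start positions, and the choice to place the surplus at the end of the unit are all assumptions for illustration.

```python
def find_insertions(unit_starts, period, tolerance=3):
    """Flag residue ranges where a repeat unit is longer than the period.

    unit_starts: residue index at which each repeat unit begins.
    Returns inclusive (first_residue, last_residue) ranges.
    """
    insertions = []
    for start, next_start in zip(unit_starts, unit_starts[1:]):
        extra = (next_start - start) - period
        if extra > tolerance:
            # Assumption: the surplus residues sit at the end of the unit.
            insertions.append((start + period, next_start - 1))
    return insertions

# Hypothetical unit starts for a chain with a period of 20 residues.
unit_starts = [3, 23, 43, 64, 84, 104, 131, 151]
insertions = find_insertions(unit_starts, period=20)
print(insertions)
```

The tolerance absorbs small, natural variation in unit length (the 21-residue unit above is not flagged), so only the clearly oversized unit is reported.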
ProteinZoo: invites people to assist in the classification of large numbers of proteins (crowd science). Large repeat protein set + large number of people + in-house software annotation tools. Classification level: consensus. Detailed level: consensus.
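A consensus over crowd annotations could be computed along these lines. This is a sketch under stated assumptions: the slides do not say how ProteinZoo derives its consensus, so the majority-vote rule, the `min_agreement` threshold, and the structure identifiers are hypothetical.

```python
from collections import Counter

def consensus(votes, min_agreement=0.6):
    """Return the majority label, or None if agreement is too low."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

# Hypothetical annotator votes for two placeholder structures.
annotations = {
    "structure_A": ["Solenoids", "Solenoids", "Solenoids", "Closed (toroids)"],
    "structure_B": ["Fibrils", "Solenoids"],
}
results = {name: consensus(votes) for name, votes in annotations.items()}
print(results)
```

Structures without sufficient agreement come back as `None`, so they can be routed to further review instead of being assigned a class by a narrow margin.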
Annotation with ProteinZoo RAPHAEL MINING Repeats Predicted Period Predicted insertions ProteinZoo >7,000 structures
(Di Domenico, et al. Nucl. Acids Res., accepted) RepeatsDB is a collaborative database of repeat proteins. It builds on the results of RAPHAEL and leverages manual curation to provide a high-quality gold standard of repeat protein annotations. In collaboration with: Dr. Andrey V. Kajava (CNRS Montpellier) and Dr. Diego U. Ferreiro (Universidad de Buenos Aires). Examples: Giant RNA twister (TAL effector, PDB: 4gjr); Propeller with spare part (Proteasome regulatory particle, PDB: 3acp).
Schematic view of the different classification levels (Pyramid scheme)
(Pyramid scheme) RAPHAEL mining
(Pyramid scheme) ProteinZoo basic classification (tree), based on a scheme previously proposed (Kajava A.V., Tandem repeats in proteins: From sequence to structure. J Struct Biol, 2012)
(Pyramid scheme) Bridging manual and predicted annotations: homology by sequence alignment
(Pyramid scheme) ProteinZoo detailed annotation: domain assignment; tandem arrays assigned in sequence
(The classification tree) (V) Beads-on-a-string (IV) Closed (toroids) (III) Solenoids (II) Fibrils (I) Crystallites; 2nd level sub-division
ProteinZoo in progress: Fibrils, Solenoids, Closed (toroids), Beads-on-a-string, Unclassified
2nd level sub-division, examples: Fibrils (2 sub-classes), Solenoids (5 sub-classes), Closed (6 sub-classes), Beads-on-a-string (3 sub-classes)
URL: http://repeatsdb.bio.unipd.it/
Tandem Arrays in Sequence Secondary Structure Insertions
Structural Annotation
Functions, Domains + Cross-links
Use case: Improved homology modeling (generic templates). The query sequence (a repeat) is converted to a Hidden Markov Model (HMM) and searched against template HMMs (HMM-HMM search). This often gives a poor model because repeats with the same structure can have considerable sequence divergence.
Use case: Improved homology modeling (repeat templates). The query sequence (a repeat) is converted to an HMM and searched against repeat template HMMs (HMM-HMM search). The model should be improved due to more finely tuned HMMs.
Acknowledgements: Tomàs Di Domenico, Dr. Emilio Potenza, Dr. Giovanni Minervini (University of Padova), Manuel Giollo, Prof. Silvio Tosatto, Dr. Awais Ihsan (Sahiwal, Pakistan), Dr. Andrey V. Kajava (CNRS Montpellier, France), Gonzalo Parra (Universidad de Buenos Aires, Argentina). Funding: FIRB Futuro in Ricerca, Università di Padova, CARIPLO, CARIPARO, AIRC. URL: http://protein.bio.unipd.it/