Measuring quaternary structure similarity using global versus local measures.

Supplementary Figure 1 Measuring quaternary structure similarity using global versus local measures. (a) Structural similarity of two protein complexes can be inferred from a global superposition, which yields a single global score, as was done in this work. (b) Structural similarity can also be assessed at the level of pairwise interfaces (refs. 1-14), but such information has to be integrated to infer a global similarity measure when complexes contain multiple interfaces. For example, a tetramer with four interfaces yields four similarity measures, and this number increases further when comparing complexes with more subunits.

Supplementary Figure 2 Heuristic employed for superposing protein complexes. The names of the chains in a PDB file are arbitrary. For example, considering the two tetramers depicted, chains may be labeled clockwise in one PDB file but counter-clockwise in another. Thus, although two quaternary structures (QSs) can be structurally similar, differences in chain order can yield a false negative when they are compared. To circumvent this problem, we must infer chain-chain correspondences between the structures being compared. This was achieved using a seed superposition of the two structures, based on the chains from the first QS that maximize the TM-score with the second QS. If the QSs are similar, this seed superposition naturally places structurally equivalent chains in proximity, which makes it possible to identify them by analysis of the aligned coordinates. We then used this mapping to re-write the coordinate files in matching chain order, and recalculated a global superposition of the complete QSs using the re-ordered coordinates. The latter provided the final TM-score.
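The chain-matching step of this heuristic can be illustrated with a minimal sketch. It assumes the seed superposition has already been applied, so that both structures are given as Cα coordinates in the same frame. The function names, the dictionary layout and the 2.0 Å overlap cutoff are illustrative assumptions, not values taken from this work.

```python
import numpy as np

def count_overlaps(ca_a, ca_b, cutoff=2.0):
    """Count C-alpha atoms of ca_a lying within `cutoff` angstroms of any atom of ca_b.
    The cutoff value is an assumption made for this sketch."""
    dist = np.linalg.norm(ca_a[:, None, :] - ca_b[None, :, :], axis=-1)
    return int((dist.min(axis=1) <= cutoff).sum())

def match_chains(qs1, qs2, cutoff=2.0):
    """Map chains of qs1 to chains of qs2 by bi-directional best overlap.

    qs1, qs2: dict of chain ID -> (N, 3) array of already-superposed
    C-alpha coordinates (hypothetical input layout).
    Returns a dict {chain in qs1: chain in qs2}; unmatched chains are omitted.
    """
    scores = {(a, b): count_overlaps(ca, cb, cutoff)
              for a, ca in qs1.items() for b, cb in qs2.items()}
    best_for_1 = {a: max(qs2, key=lambda b: scores[(a, b)]) for a in qs1}
    best_for_2 = {b: max(qs1, key=lambda a: scores[(a, b)]) for b in qs2}
    # Keep only reciprocal best hits that actually overlap.
    return {a: b for a, b in best_for_1.items()
            if best_for_2[b] == a and scores[(a, b)] > 0}
```

The returned mapping could then be used to re-write the second coordinate file in the chain order of the first before the final global superposition.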

Supplementary Figure 3 Procedure used to infer the biological significance of QSs. Each symmetry group is considered iteratively. Within each group, each QS is used to search for structural homologs. If a homolog is found, both QSs are annotated as correct. Once all the QSs of a symmetry group have been processed, each QS is used again to search for proteins identical in sequence but having different QSs. If found, we considered such QSs to be likely non-biological and annotated them as such.
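The second pass, in which an annotated QS is compared with sequence-identical structures that have a different QS, follows the three cases listed in the QSpropagate routine of the Supplementary Note. The sketch below restates that case logic. The function name, its arguments and the shortened reason strings are hypothetical, and the symmetry-consistency check is assumed to be computed elsewhere.

```python
def propagate_label(n_ref, n_query, query_consistent_with_symmetry):
    """Label a structure that is sequence-identical to an annotated reference QS.

    n_ref: number of subunits in the reference (annotated as likely correct)
    n_query: number of subunits in the not-yet-annotated structure
    query_consistent_with_symmetry: whether n_query agrees with the query's symmetry
    Returns (label, short reason); the reasons paraphrase the Supplementary Note.
    """
    if n_ref < n_query:
        if query_consistent_with_symmetry:
            return ("ambiguous", "might form a higher-order oligomer")
        return ("likely erroneous", "stoichiometry inconsistent with symmetry")
    if n_ref > n_query:
        return ("likely erroneous", "interface(s) appear to be missing")
    # Same number of subunits but a different structure.
    return ("likely erroneous", "wrong interface or large conformational change")
```

For example, a tetramer that is sequence-identical to a reference dimer and whose stoichiometry agrees with its symmetry is labeled ambiguous rather than erroneous.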

Supplementary Figure 4 Information flow involved in QSalign.

Supplementary Figure 5 Integrating pairwise interface information to infer biological relevance of quaternary structures. QSbio needs to compare QSs from PDB with predictions from PISA, EPPIC, and QSalign/anti-QSalign. Comparing QSs between PDB and PISA is achieved with the full QS superposition approach described above (Supplementary Figure 2). However, to compare QSs between PDB and EPPIC, we must employ a different strategy because EPPIC provides pairwise interface information (as opposed to assembly information). We therefore mapped pairwise information from EPPIC onto QSs from PDB using the following approach. First, each QS from PDB was decomposed into pairs of chains, using all pairs burying > 90 Å². Each pair was subsequently matched to an interface group from EPPIC by structural superposition. Each interface group in EPPIC is classified as being either biological (green) or non-biological (magenta). When all subunits of the QS could be linked by biological contacts, the QS was deemed to match EPPIC (example 1); otherwise it was inferred as non-matching (example 2).
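One way to read the matching rule above is as a graph connectivity test: chains are nodes, biological interfaces are edges, and the QS matches EPPIC when every chain can be reached through biological edges. The sketch below implements this interpretation. The function name and the input format are assumptions, and the buried-area filtering and structural matching to EPPIC interface groups are taken as already done.

```python
def qs_matches_eppic(chains, interfaces):
    """Return True if all chains are linked through biological contacts.

    chains: iterable of chain IDs in the QS
    interfaces: list of (chain_a, chain_b, is_biological) tuples, one per
    chain pair already filtered on buried area and matched to an EPPIC
    interface group (hypothetical input format).
    """
    chains = list(chains)
    if not chains:
        return False
    # Build an adjacency map from biological interfaces only.
    adjacency = {c: set() for c in chains}
    for a, b, is_biological in interfaces:
        if is_biological:
            adjacency[a].add(b)
            adjacency[b].add(a)
    # Depth-first traversal from an arbitrary chain.
    seen = {chains[0]}
    stack = [chains[0]]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return len(seen) == len(chains)
```

Under this reading, a tetramer held together by three biological contacts matches even if a fourth contact is classified as non-biological, whereas two biological dimers bridged only by a non-biological contact do not match.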

Supplementary Figure 6 Protein interfaces are plastic. (a) We compared interfaces of structurally similar protein complexes. We examined whether an interface property of one complex was predictive of the same property in its homologs, at different levels of sequence identity between them. (b) We first compared the interaction propensity of interfaces. Higher values indicate interfaces with a high fraction of residues normally enriched at interfaces, while lower values correspond to interfaces chemically close to solvent-exposed surfaces. (c) We then compared the hydrophobicity of interface pairs, defined as the ratio of non-polar residues to the total number of interface residues. (d) Finally, we compared evolutionary conservation of interface residues relative to surface residues. Values below 1 correspond to complexes where the interface is more conserved than the surface. The right-most plot summarizes the squared correlation coefficient (R²) for each property considered, calculated for pairs of proteins binned by shared sequence identity: < 30%, 30-45%, 45-60%, 60-75% and 75-90%. All properties show very low correlation values for pairs sharing less than 30% identity, showing that, despite being structurally similar, interfaces can differ dramatically in their chemistry and evolutionary properties. One thousand random data points were sampled for each plot to ease visualization.
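The hydrophobicity ratio in (c) and the per-bin R² summary can be sketched as follows. The set of residues treated as non-polar, the handling of bin boundaries and the input format are assumptions made for illustration, since the caption does not specify them.

```python
import numpy as np

# Assumption: one-letter codes treated as non-polar in this sketch.
NONPOLAR = set("AVLIMFWPGC")

# Sequence identity bins (percent) from the caption; lower bound inclusive here.
BINS = [(0, 30), (30, 45), (45, 60), (60, 75), (75, 90)]

def hydrophobicity(interface_residues):
    """Ratio of non-polar residues to all interface residues."""
    return sum(r in NONPOLAR for r in interface_residues) / len(interface_residues)

def r2_by_identity_bin(pairs):
    """Squared Pearson correlation of a property between homologous complexes.

    pairs: list of (sequence_identity_percent, value_complex_1, value_complex_2).
    Returns {bin: R^2}, or None for bins where the correlation is undefined.
    """
    result = {}
    for low, high in BINS:
        xy = np.array([(x, y) for ident, x, y in pairs if low <= ident < high],
                      dtype=float)
        if len(xy) < 3 or xy[:, 0].std() == 0 or xy[:, 1].std() == 0:
            result[(low, high)] = None
            continue
        r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
        result[(low, high)] = float(r ** 2)
    return result
```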

Supplementary Figure 7 Annotating monomers with anti-QSalign. We annotated monomers based on the enrichment of monomeric homologs over oligomeric ones. This enrichment is used to derive probabilities by the formulae above. Proteins sharing at least 30% and at most 90% sequence identity and having an alignment overlap of 60% or more were considered as homologs.
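The homolog criteria stated in this caption translate into a simple filter, sketched below together with the counts of monomeric and oligomeric homologs from which an enrichment would be derived. The probability formulae themselves are not reproduced, and the function names and input tuple layout are hypothetical.

```python
def is_homolog(seq_identity, overlap):
    """Homolog criteria from the caption, with values given as fractions:
    identity between 30% and 90% inclusive, overlap of 60% or more."""
    return 0.30 <= seq_identity <= 0.90 and overlap >= 0.60

def count_homolog_states(hits):
    """Count monomeric and oligomeric homologs among candidate hits.

    hits: iterable of (seq_identity, overlap, n_subunits) tuples
    (hypothetical format). Returns (n_monomeric, n_oligomeric).
    """
    hits = list(hits)
    monomeric = sum(1 for i, o, n in hits if is_homolog(i, o) and n == 1)
    oligomeric = sum(1 for i, o, n in hits if is_homolog(i, o) and n > 1)
    return monomeric, oligomeric
```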

Source Data Prediction details of PISA, EPPIC, QSalign/anti-QSalign and QSbio on the different datasets (separate Excel file).

Supplementary Table 1 Number of structures used in the different benchmark datasets. For benchmarks based on PiQSi annotations, methods had to predict whether a PDB structure is correct or not. For benchmarks using the cgs dataset, we used two different approaches: (i) In Fig. 3a, when benchmarking the prediction of monomers, the positive set consisted of 144 true monomers and the negative set contained 137 true dimers + 50 true oligomers. (ii) In Fig. 2d, a positive was counted when the prediction matched the number of subunits from cgs and a negative when it did not. The differences in the number of structures (137 vs 105, and 50 vs 57) have two origins. One is that the number-of-subunits information from EPPIC oligomers was not always reliable, so these structures were only used to benchmark anti-QSalign (137 dimers, 50 oligomers), where this information was not critical. A second is that only structures annotated by QSalign/anti-QSalign were used and their number differed (QSalign annotated 57 oligomers from the cgs benchmark, while anti-QSalign annotated only 50).

Dataset   Monomers         Dimers            Oligomers (>=3 subunits)   Figure
          correct / not    correct / not     correct / not
PiQSi     149 / 111        354 / 64          370 / 61                   Fig. 2d, Fig. 3b
cgs       144 / n/a        137, 105 / n/a    50, 57 / n/a               Fig. 3a, Fig. 2d

Supplementary Table 2 Number of structures annotated by QSalign. The total number of redundant (R) and non-redundant (NR, filtered at 90% sequence identity) structures annotated by each method is given. Note that QSalign also annotates monomers that should normally exist as oligomers (the Monomers entries in the QSalign rows, shown in grey in the original table), but these are not counted when calculating the coverage.

Method              Monomers   Dimers and Oligomers   Total     Total in PDB      Coverage
QSalign (R)         6,429      35,947                 -         51,626 (ns>1)     70%
QSalign (NR)        1,087      9,668                  -         16,643 (ns>1)     58%
anti-QSalign (R)    46,877     -                      -         58,872 (ns=1)     80%
anti-QSalign (NR)   8,687      -                      -         12,292 (ns=1)     71%
Combined (R)        46,877     35,947                 82,824    110,096           75%
Combined (NR)       8,687      9,668                  18,355    28,848            64%

Supplementary Note Description of the QSalign, QSinfer and QSpropagate routines with pseudo-code.

Function QSalign(PDB1):
    Define SUB1 = number of subunits in PDB1
    Retrieve list PDB2 of potential structure matches using the following criteria:
        - Annotated as homo-oligomer in 3DComplex
        - Number of subunits = SUB1
        - Structures should be homologous, as defined by any of these three conditions:
          (sequence identity > 0.3 & alignment overlap > 0.8) or (matching SCOP domain architecture) or (matching Pfam domain architecture)
    Execute KPAX between PDB1 and all retrieved structures PDB2
    For PDB2_j in PDB2:
        Parse coordinates produced by KPAX (such that PDB2_j coordinates are now transformed to be superposed onto PDB1)
        Record the number of overlapping Cα between pairs of chains from PDB2_j and PDB1
        Define matching chain-pairs as those with the bi-directional best scores
        Re-write PDB2_j with chain order matching that of PDB1. If a chain was not matched, its order is random.
    Execute KPAX again between PDB1 and all re-written PDB2 structures
    Record properties of the structural alignment into a MySQL table

Function QSinfer:
    Retrieve list L1 of "symmetry type (SYM) - number of subunits (SUB)" pairs, sorted in decreasing order by number of subunits
    For pairs (SYMi, SUBi) in L1:
        Retrieve list L2 of structure pairs PDB1, PDB2 that meet the following criteria, sorted by increasing sequence identity:
            - Symmetry == SYMi
            - Number of subunits == SUBi
            - Sequence identity < 80%
            - QS alignment with TM-score > 0.65
            - Overlap of sequence alignment > 0.65
        Note: PDB2_i can be from PISA but is sorted after the match with the PDB structure if it exists
        For pairs (PDB1_i, PDB2_i) in L2:
            if PDB1_i is not already annotated:
                Mark PDB1_i as likely correct "Interface geometry is similar to that of PDB2_i"
                Mark PDB1_i as annotated
            if PDB2_i is not already annotated:
                if PDB2_i is from PDB:
                    Mark PDB2_i as likely correct "Interface geometry is similar to that of PDB1_i"
                elsif PDB2_i is generated by PISA:
                    Mark PDB2_i as likely incorrect "Interface geometry is similar to that of PDB1_i but was detected based on PISA and does not appear in the PDB assembly"
                Mark PDB2_i as annotated
        Call: QSpropagate(SYMi, SUBi)

Function QSpropagate(SYMi, SUBi):
    Retrieve list L3 of structure pairs PDB1, PDB2 that meet the following criteria:
        - PDB1 is annotated as likely correct
        - PDB1 symmetry == SYMi
        - PDB1 number of subunits == SUBi
        - PDB2 is not yet annotated
        - Sequence identity between PDB1 and PDB2 > 95%
    For pairs (PDB1_j, PDB2_j) in L3:
        Define #PDB1_j and #PDB2_j as numbers of subunits in PDB1_j and PDB2_j respectively
        Case 1: #PDB1_j < #PDB2_j
            if #PDB2_j is consistent with PDB2_j symmetry:
                Mark PDB2_j as ambiguous "Sequence is/are identical to PDB1_j (which is supposedly correct) but the structure is different, and it might form a higher-order oligomer"
            else:
                Mark PDB2_j as likely erroneous "Sequence is/are identical to PDB1_j (which is supposedly correct) but the structure is different, and the stoichiometry appears inconsistent with the composition and/or symmetry"
        Case 2: #PDB1_j > #PDB2_j
            Mark PDB2_j as likely erroneous "Sequence is/are identical to PDB1_j (which is supposedly correct) suggesting that interface(s) is/are missing"
        Case 3: #PDB1_j == #PDB2_j
            Mark PDB2_j as likely erroneous "Sequence is/are identical to PDB1_j (which is supposedly correct) but the structure appears to be different. This might reflect that a wrong interface was inferred in its reconstruction, but might also be caused by a large conformational change"

References

1 Winter, C., Henschel, A., Kim, W. K. & Schroeder, M. SCOPPI: a structural classification of protein-protein interfaces. Nucleic acids research 34, D310-314 (2006).
2 Stein, A., Ceol, A. & Aloy, P. 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic acids research 39, D718-723, doi:10.1093/nar/gkq962 (2011).
3 Aloy, P., Ceulemans, H., Stark, A. & Russell, R. B. The relationship between sequence and interaction divergence in proteins. J Mol Biol 332, 989-998 (2003).
4 Shoemaker, B. A. et al. IBIS (Inferred Biomolecular Interaction Server) reports, predicts and integrates multiple types of conserved interactions for proteins. Nucleic acids research 40, D834-840, doi:10.1093/nar/gkr997 (2012).
5 Bordner, A. J. & Gorin, A. A. Comprehensive inventory of protein complexes in the Protein Data Bank from consistent classification of interfaces. BMC Bioinformatics 9, 234, doi:10.1186/1471-2105-9-234 (2008).
6 Xu, Q. et al. Statistical analysis of interface similarity in crystals of homologous proteins. J Mol Biol 381, 487-507, doi:10.1016/j.jmb.2008.06.002 (2008).
7 Tsuchiya, Y., Kinoshita, K., Ito, N. & Nakamura, H. PreBI: prediction of biological interfaces of proteins in crystals. Nucleic acids research 34, W320-324, doi:10.1093/nar/gkl267 (2006).
8 Xu, Q. & Dunbrack, R. L., Jr. The protein common interface database (ProtCID)--a comprehensive database of interactions of homologous proteins in multiple crystal forms. Nucleic acids research 39, D761-770, doi:10.1093/nar/gkq1059 (2011).
9 Ponstingl, H., Henrick, K. & Thornton, J. M. Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins 41, 47-57 (2000).
10 Faure, G., Andreani, J. & Guerois, R. InterEvol database: exploring the structure and evolution of protein complex interfaces. Nucleic acids research 40, D847-856, doi:10.1093/nar/gkr845 (2012).
11 Davis, F. P. & Sali, A. PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics (Oxford, England) 21, 1901-1907 (2005).
12 Gong, S. et al. PSIbase: a database of Protein Structural Interactome map (PSIMAP). Bioinformatics (Oxford, England) 21, 2541-2543 (2005).
13 Lo, Y. S., Chen, Y. C. & Yang, J. M. 3D-interologs: an evolution database of physical protein-protein interactions across multiple genomes. BMC genomics 11 Suppl 3, S7, doi:10.1186/1471-2164-11-s3-s7 (2010).
14 Baskaran, K., Duarte, J. M., Biyani, N., Bliven, S. & Capitani, G. A PDB-wide, evolution-based assessment of protein-protein interfaces. BMC Struct Biol 14, 22, doi:10.1186/s12900-014-0022-0 (2014).