Measuring quaternary structure similarity using global versus local measures.

Supplementary Figure 1 Measuring quaternary structure similarity using global versus local measures. (a) Structural similarity of two protein complexes can be inferred from a global superposition, which yields a global score, as was done in this work. (b) Structural similarity can also be assessed at the level of pairwise interfaces (refs. 1-14), but such information would have to be integrated to infer a global similarity measure when complexes contain multiple interfaces. For example, in the case of a tetramer with four interfaces, four similarity measures would be obtained, and this number would increase further when comparing complexes with more subunits.

Supplementary Figure 2 Heuristic employed for superposing protein complexes. The names of the chains in a PDB file are arbitrary. For example, considering the two tetramers depicted, chains may be labeled clockwise in one PDB file but counter-clockwise in another. Thus, although two structures can be structurally similar, differences in chain order can yield a false negative result when the structures are compared. To circumvent this problem, we must infer chain-chain correspondences among the structures being compared. This was achieved using a seed superposition of the two structures, which is based on chains from the first QS that maximize the TM-score with the second QS. If the QSs are similar, this seed superposition naturally places structurally equivalent chains in proximity, which made it possible to identify them by analyzing the aligned coordinates. We then used this mapping to re-write the coordinate files in matching chain order, and recalculated a global superposition of the complete QSs using the re-ordered coordinates. The latter provided us with the final TM-score.
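The chain-matching step can be illustrated with a minimal sketch. It assumes the seed superposition has already been computed and summarized as counts of overlapping Cα atoms per chain pair, with matches defined as bi-directional best scores (the criterion named in the Supplementary Note). The function names and the example counts are hypothetical, and unmatched chains are appended in their original order here for reproducibility, whereas the note leaves their order random.

```python
from typing import Dict, List, Tuple


def bidirectional_best_matches(overlap: Dict[Tuple[str, str], int]) -> Dict[str, str]:
    """Map chains of structure 1 onto chains of structure 2.

    overlap[(c1, c2)] is the number of C-alpha atoms of chain c1 (structure 1)
    that overlap chain c2 (structure 2) after the seed superposition.
    A pair is kept only if each chain is the other's best-scoring partner.
    """
    best_for_1: Dict[str, Tuple[str, int]] = {}
    best_for_2: Dict[str, Tuple[str, int]] = {}
    for (c1, c2), n in overlap.items():
        if n <= 0:
            continue  # no overlap, not a candidate
        if c1 not in best_for_1 or n > best_for_1[c1][1]:
            best_for_1[c1] = (c2, n)
        if c2 not in best_for_2 or n > best_for_2[c2][1]:
            best_for_2[c2] = (c1, n)
    return {c1: c2 for c1, (c2, _) in best_for_1.items()
            if best_for_2[c2][0] == c1}


def reorder_chains(chains1: List[str], chains2: List[str],
                   mapping: Dict[str, str]) -> List[str]:
    """Return the chains of structure 2 in the order of their partners in structure 1."""
    ordered = [mapping[c] for c in chains1 if c in mapping]
    used = set(ordered)
    # Assumption of this sketch: unmatched chains keep their original relative order.
    ordered += [c for c in chains2 if c not in used]
    return ordered


# Two tetramers labeled clockwise and counter-clockwise (made-up overlap counts).
overlap = {("A", "A"): 120, ("B", "D"): 118, ("C", "C"): 121,
           ("D", "B"): 119, ("A", "B"): 3}
mapping = bidirectional_best_matches(overlap)
new_order = reorder_chains(["A", "B", "C", "D"], ["A", "B", "C", "D"], mapping)
```

With the re-ordered chain list, the second coordinate file can be re-written and the global superposition recomputed on chains that now correspond one to one.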

Supplementary Figure 3 Procedure used to infer the biological significance of QSs. Each symmetry group is considered iteratively. Within each group, each QS is used to search for structural homologs. If a homolog is found, both QSs are annotated as correct. Once all the QSs of a symmetry group have been processed, each QS is used again to search for proteins identical in sequence but having different QSs. If found, we considered such QSs to be likely non-biological and annotated them as such.
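The two passes described in this caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: it ignores the iteration over symmetry groups and the case distinctions on subunit numbers given in the Supplementary Note. The cutoffs (TM-score > 0.65, sequence identity < 80% for homologs, > 95% for identical proteins) are taken from that note, while the `QSPair` record, the label strings, and the use of a low TM-score as the test for "different QS" are assumptions of this sketch.

```python
from typing import Dict, List, NamedTuple


class QSPair(NamedTuple):
    qs1: str         # identifier of the first quaternary structure
    qs2: str         # identifier of the second quaternary structure
    identity: float  # sequence identity between the two proteins, 0..1
    tm_score: float  # TM-score of the global QS superposition


def annotate(pairs: List[QSPair], tm_cutoff: float = 0.65,
             homolog_max_id: float = 0.80,
             identical_min_id: float = 0.95) -> Dict[str, str]:
    labels: Dict[str, str] = {}

    # Pass 1: structurally similar homologs support each other.
    for p in sorted(pairs, key=lambda q: q.identity):
        if p.identity < homolog_max_id and p.tm_score > tm_cutoff:
            labels.setdefault(p.qs1, "likely correct")
            labels.setdefault(p.qs2, "likely correct")

    # Pass 2: a near-identical sequence with a different QS is flagged,
    # but only when its partner was supported in pass 1.
    for p in pairs:
        if p.identity > identical_min_id and p.tm_score <= tm_cutoff:
            for ref, other in ((p.qs1, p.qs2), (p.qs2, p.qs1)):
                if labels.get(ref) == "likely correct" and other not in labels:
                    labels[other] = "likely non-biological"
    return labels


example = [
    QSPair("X", "Y", 0.40, 0.80),  # homologs with similar QS
    QSPair("X", "Z", 0.99, 0.30),  # same sequence, different QS
    QSPair("P", "Q", 0.99, 0.30),  # no supporting homolog: left unannotated
]
labels = annotate(example)
```

Structures such as P and Q remain unannotated because neither has a homolog that supports it in the first pass.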

Supplementary Figure 4 Information flow involved in QSalign.

Supplementary Figure 5 Integrating pairwise interface information to infer biological relevance of quaternary structures. QSbio needs to compare QSs from PDB with predictions from PISA, EPPIC, and QSalign/anti-QSalign. Comparing QSs between PDB and PISA is achieved with the full QS superposition approach described above (Supplementary Figure 2). However, to compare QSs between PDB and EPPIC, we must employ a different strategy because EPPIC provides pairwise interface information (as opposed to assembly information). We therefore mapped pairwise information from EPPIC onto QSs from PDB using the following approach. First, each QS from PDB was decomposed into pairs of chains, using all pairs burying >90 Å². Each pair was subsequently matched to an interface group from EPPIC by structural superposition. Each interface group in EPPIC is classified as being either biological (green) or non-biological (magenta). In the case where all subunits of the QS could be linked by biological contacts, the QS was deemed to match EPPIC (example 1) and otherwise it was inferred as non-matching (example 2).
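The final criterion amounts to a connectivity test on a graph whose nodes are subunits and whose edges are biological contacts. The sketch below illustrates this under stated assumptions: each chain pair has already been matched to an EPPIC interface group, and the tuple layout `(chain_a, chain_b, buried_area, is_biological)` and the function name are hypothetical.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Tuple

Interface = Tuple[str, str, float, bool]  # chain_a, chain_b, buried area (Å²), biological?


def qs_matches_eppic(chains: List[str], interfaces: Iterable[Interface],
                     min_area: float = 90.0) -> bool:
    """True if every subunit can be reached through biological contacts."""
    if not chains:
        return False
    graph = defaultdict(set)
    for a, b, area, is_biological in interfaces:
        # Keep only pairs above the buried-area cutoff that are classified as biological.
        if area > min_area and is_biological:
            graph[a].add(b)
            graph[b].add(a)

    # Breadth-first search starting from the first chain.
    seen = {chains[0]}
    queue = deque([chains[0]])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen >= set(chains)


tetramer = ["A", "B", "C", "D"]
# All subunits linked by biological contacts.
example_1 = [("A", "B", 850.0, True), ("B", "C", 400.0, True), ("C", "D", 850.0, True)]
# Two biological dimers joined only by a non-biological contact.
example_2 = [("A", "B", 850.0, True), ("B", "C", 400.0, False), ("C", "D", 850.0, True)]
match_1 = qs_matches_eppic(tetramer, example_1)
match_2 = qs_matches_eppic(tetramer, example_2)
```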

Supplementary Figure 6 Protein interfaces are plastic. (a) We compared interfaces of structurally similar protein complexes. We examined whether interface properties of one complex were predictive of the same property in its homologs, given different levels of sequence identity between them. (b) We first compared the interaction propensity of interfaces. Higher values indicate interfaces with a high fraction of residues normally enriched at interfaces, while lower values correspond to interfaces chemically close to solvent-exposed surfaces. (c) We then compared the hydrophobicity of interface pairs, defined as the ratio of non-polar residues to the total number of interface residues. (d) Finally, we compared evolutionary conservation of interface residues relative to surface residues. Values below 1 correspond to complexes where the interface is more conserved than the surface. The right-most plot summarizes the squared correlation coefficient (R²) for each property considered, calculated for pairs of proteins binned by shared sequence identity: <30%, 30-45%, 45-60%, 60-75% and 75-90%. All properties show very low correlation values for pairs sharing less than 30% identity, showing that despite being structurally similar, interfaces can differ dramatically in their chemistry and evolutionary properties. One thousand random data points were sampled for each plot to ease visualization.
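The binned R² summary can be illustrated with a short sketch. It assumes each data point is a triple (percent sequence identity, property value in one complex, property value in its homolog) and uses the squared Pearson correlation coefficient; the record layout, function names, and example values are hypothetical.

```python
from collections import defaultdict
from math import fsum
from typing import Dict, List, Sequence, Tuple

# Identity bins (percent), lower bound inclusive, upper bound exclusive.
BINS = [(0, 30), (30, 45), (45, 60), (60, 75), (75, 90)]


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Squared Pearson correlation; NaN if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = fsum(xs) / n, fsum(ys) / n
    sxy = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = fsum((x - mean_x) ** 2 for x in xs)
    syy = fsum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy * sxy / (sxx * syy)


def r2_by_identity_bin(
    records: List[Tuple[float, float, float]]
) -> Dict[Tuple[int, int], float]:
    """Group (identity, x, y) records by identity bin and compute R² per bin."""
    grouped = defaultdict(lambda: ([], []))
    for identity, x, y in records:
        for lo, hi in BINS:
            if lo <= identity < hi:
                grouped[(lo, hi)][0].append(x)
                grouped[(lo, hi)][1].append(y)
                break
    # At least two points are needed for a correlation.
    return {b: r_squared(xs, ys) for b, (xs, ys) in grouped.items() if len(xs) >= 2}


records = [
    (80, 1.0, 2.0), (85, 2.0, 4.0), (78, 3.0, 6.0),  # perfectly correlated
    (10, 1.0, 2.0), (20, 2.0, 1.0), (25, 3.0, 2.0),  # uncorrelated
]
r2 = r2_by_identity_bin(records)
```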

Supplementary Figure 7 Annotating monomers with anti-qsalign. We annotated monomers based on the enrichment of monomeric homologs over oligomeric ones. This enrichment is used to derive probabilities by the formulae above. Proteins sharing at least 30% and at most 90% sequence identity and having an overlap of 60% or more were considered as homologs.

Source Data Prediction details of PISA, EPPIC, QSalign/anti-QSalign and QSbio on the different datasets (separate Excel file).

Supplementary Table 1 Number of structures used in the different benchmark datasets. For benchmarks based on PiQSi annotations, methods had to predict whether a PDB structure is correct ( ) or not ( ). For benchmarks using the cgs dataset, we used two different approaches: (i) In Fig. 3a, when benchmarking the prediction of monomers, the positive set consisted of 144 true monomers and the negative set contained 137 true dimers + 50 true oligomers. (ii) In Fig. 2d, a positive was counted when the prediction matched the number of subunits from cgs and a negative when it did not. The differences in the number of structures (137 vs 105, and 50 vs 57) have two origins. One is that the information on the number of subunits of oligomers from EPPIC was not always reliable, so these structures were only used to benchmark anti-qsalign (green), where this information was not critical. A second is that only structures annotated by QSalign/anti-QSalign were used and their number differed (QSalign annotated 57 oligomers from the cgs benchmark, while anti-qsalign annotated only 50).

Columns: Monomers; Dimers; Oligomers (>=3 subunits); Figure
PiQSi: 149 111 354 64 370 61; Fig. 2d, Fig. 3b
cgs: 144 n/a 137 105 n/a 50 57 n/a; Fig. 3a, Fig. 2d

Supplementary Table 2 Number of structures annotated by QSalign. The total numbers of redundant (R) and non-redundant (NR, filtered at 90% sequence identity) structures annotated by each method are given. Note that QSalign also annotates monomers that should normally exist as oligomers (numbers in grey), but these are not counted when calculating the coverage.

Method              Monomers   Dimers and oligomers   Total     Total in PDB     Coverage
QSalign (R)         6,429      35,947                 -         51,626 (ns>1)    70%
QSalign (NR)        1,087      9,668                  -         16,643 (ns>1)    58%
anti-qsalign (R)    46,877     -                      -         58,872 (ns=1)    80%
anti-qsalign (NR)   8,687      -                      -         12,292 (ns=1)    71%
Combined (R)        46,877     35,947                 82,824    110,096          75%
Combined (NR)       8,687      9,668                  18,355    28,848           64%
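The coverage values in Supplementary Table 2 appear to be the number of annotated structures divided by the corresponding total in the PDB, rounded to the nearest percent. The short check below reproduces the reported percentages under that assumption; the counts are transcribed from the table, and the dictionary layout and function name are hypothetical.

```python
# (annotated structures, total in PDB) transcribed from Supplementary Table 2.
# For QSalign only dimers and oligomers are counted, as stated in the caption.
TABLE2 = {
    "QSalign (R)": (35947, 51626),
    "QSalign (NR)": (9668, 16643),
    "anti-qsalign (R)": (46877, 58872),
    "anti-qsalign (NR)": (8687, 12292),
    "Combined (R)": (82824, 110096),
    "Combined (NR)": (18355, 28848),
}


def coverage_percent(annotated: int, total: int) -> int:
    """Coverage as a whole-number percentage (assumed definition)."""
    return round(100 * annotated / total)


COVERAGE = {name: coverage_percent(a, t) for name, (a, t) in TABLE2.items()}
```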

Supplementary Note Description of the QSalign, QSinfer and QSpropagate routines with pseudo-code.

QSalign(PDB1):
    Define SUB1 = number of subunits in PDB1
    Retrieve list PDB2 of potential structure matches using the following criteria:
        - Annotated as homo-oligomer in 3DComplex
        - Number of subunits = SUB1
        - Structures should be homologous, as defined by any of these three conditions:
          (sequence identity > 0.3 & alignment overlap > 0.8) or (matching SCOP domain architecture) or (matching Pfam domain architecture)
    Execute KPAX between PDB1 and all retrieved structures PDB2
    For PDB2_j in PDB2:
        Parse coordinates produced by KPAX (such that PDB2_j coordinates are now transformed to be superposed onto PDB1)
        Record the number of overlapping Cα between pairs of chains from PDB2_j and PDB1
        Define matching chain-pairs as those with the bi-directional best scores
        Re-write PDB2_j with chain order matching that of PDB1. If a chain was not matched, its order is random.
    Execute KPAX again between PDB1 and all re-written PDB2 structures
    Record properties of the structural alignment into a MySQL table

Function QSinfer:
    Retrieve list L1 of "symmetry type (SYM) - number of subunits (SUB)" pairs, sorted in decreasing order by number of subunits
    For pairs (SYMi, SUBi) in L1:
        Retrieve list L2 of structure pairs PDB1, PDB2 that meet the following criteria, sorted by increasing sequence identity:
            - Symmetry == SYMi
            - Number of subunits == SUBi
            - Sequence identity < 80%
            - QS alignment with TM-Score > 0.65
            - Overlap of sequence alignment > 0.65
        Note: PDB2_i can be from PISA but is sorted after the match with the PDB structure if it exists
        For pairs (PDB1_i, PDB2_i) in L2:
            if PDB1_i is not already annotated:
                Mark PDB1_i as likely correct "Interface geometry is similar to that of PDB2_i"
                Mark PDB1_i as annotated
            if PDB2_i is not already annotated:
                if PDB2_i is from PDB:
                    Mark PDB2_i as likely correct "Interface geometry is similar to that of PDB1_i"
                elsif PDB2_i is generated by PISA:
                    Mark PDB2_i as likely incorrect "Interface geometry is similar to that of PDB2_i but was detected based on PISA and does not appear in the PDB assembly"
                Mark PDB2_i as annotated
        Call: QSpropagate(SYMi, SUBi)

Function QSpropagate(SYMi, SUBi):
    Retrieve list L3 of structure pairs PDB1, PDB2 that meet the following criteria:
        - PDB1 is annotated as likely correct
        - PDB1 symmetry == SYMi
        - PDB1 number of subunits == SUBi
        - PDB2 is not yet annotated
        - Sequence identity between PDB1 and PDB2 > 95%
    For pairs (PDB1_j, PDB2_j) in L3:
        Define #PDB1_j and #PDB2_j as numbers of subunits in PDB1_j and PDB2_j respectively
        Case 1: #PDB1_j < #PDB2_j
            if #PDB2_j is consistent with PDB2_j symmetry:
                Mark PDB2_j as ambiguous "Sequence is/are identical to PDB1_j (which is supposedly correct) but the structure is different, and it might form a higher-order oligomer"
            else:
                Mark PDB2_j as likely erroneous "Sequence is/are identical to PDB1_j (which is supposedly correct) but the structure is different, and the stoichiometry appears inconsistent with the composition and/or symmetry"
        Case 2: #PDB1_j > #PDB2_j
            Mark PDB2_j as likely erroneous "Sequence is/are identical to PDB1_j (which is supposedly correct) suggesting that interface(s) is/are missing"
        Case 3: #PDB1_j == #PDB2_j
            Mark PDB2_j as likely erroneous "Sequence is/are identical to PDB1_j (which is supposedly correct) but the structure appears to be different. This might reflect that a wrong interface was inferred in its reconstruction, but might also be caused by a large conformational change"

1 Winter, C., Henschel, A., Kim, W. K. & Schroeder, M. SCOPPI: a structural classification of protein-protein interfaces. Nucleic Acids Research 34, D310-314 (2006).
2 Stein, A., Ceol, A. & Aloy, P. 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Research 39, D718-723, doi:10.1093/nar/gkq962 (2011).
3 Aloy, P., Ceulemans, H., Stark, A. & Russell, R. B. The relationship between sequence and interaction divergence in proteins. J Mol Biol 332, 989-998 (2003).
4 Shoemaker, B. A. et al. IBIS (Inferred Biomolecular Interaction Server) reports, predicts and integrates multiple types of conserved interactions for proteins. Nucleic Acids Research 40, D834-840, doi:10.1093/nar/gkr997 (2012).
5 Bordner, A. J. & Gorin, A. A. Comprehensive inventory of protein complexes in the Protein Data Bank from consistent classification of interfaces. BMC Bioinformatics 9, 234, doi:10.1186/1471-2105-9-234 (2008).
6 Xu, Q. et al. Statistical analysis of interface similarity in crystals of homologous proteins. J Mol Biol 381, 487-507, doi:10.1016/j.jmb.2008.06.002 (2008).
7 Tsuchiya, Y., Kinoshita, K., Ito, N. & Nakamura, H. PreBI: prediction of biological interfaces of proteins in crystals. Nucleic Acids Research 34, W320-324, doi:10.1093/nar/gkl267 (2006).
8 Xu, Q. & Dunbrack, R. L., Jr. The protein common interface database (ProtCID)--a comprehensive database of interactions of homologous proteins in multiple crystal forms. Nucleic Acids Research 39, D761-770, doi:10.1093/nar/gkq1059 (2011).
9 Ponstingl, H., Henrick, K. & Thornton, J. M. Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins 41, 47-57 (2000).
10 Faure, G., Andreani, J. & Guerois, R. InterEvol database: exploring the structure and evolution of protein complex interfaces. Nucleic Acids Research 40, D847-856, doi:10.1093/nar/gkr845 (2012).
11 Davis, F. P. & Sali, A. PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics (Oxford, England) 21, 1901-1907 (2005).
12 Gong, S. et al. PSIbase: a database of Protein Structural Interactome map (PSIMAP). Bioinformatics (Oxford, England) 21, 2541-2543 (2005).
13 Lo, Y. S., Chen, Y. C. & Yang, J. M. 3D-interologs: an evolution database of physical protein-protein interactions across multiple genomes. BMC Genomics 11 Suppl 3, S7, doi:10.1186/1471-2164-11-s3-s7 (2010).
14 Baskaran, K., Duarte, J. M., Biyani, N., Bliven, S. & Capitani, G. A PDB-wide, evolution-based assessment of protein-protein interfaces. BMC Struct Biol 14, 22, doi:10.1186/s12900-014-0022-0 (2014).