Supplemental Materials for Structural Diversity of Protein Segments Follows a Power-law Distribution


Yoshito SAWADA and Shinya HONDA*
National Institute of Advanced Industrial Science and Technology (AIST), Central 6, Tsukuba 305-8566, Japan
*Correspondence: s.honda@aist.go.jp

Contents

SUPPORTING DESCRIPTIONS
  Summary of Classified Structural Motifs
  Methodological Advantages
  Reproducibility of Results in the Single-Pass Clustering Method
  Validity of Overlapping Segment Sampling
  Objective Function in Fitting Calculations
  Goodness of Fit Evaluated by Parametric Bootstrapping Analysis
  Assessment of the Structural Dissimilarity Threshold, D_th
REFERENCE
TABLE S1 Sensitivity of the single-pass clustering method to the order of sampling
TABLE S2 Amino acid compositions of the sets of 9-residue segments with and without overlap
TABLE S3 Confidence intervals of fit coefficients and goodness-of-fit statistics determined by parametric bootstrapping analysis

FIGURE S1 Structural and sequential summary of clusters
FIGURE S2 Flow chart of the single-pass clustering method used in the present study
FIGURE S3 Reproducibility of results in the single-pass clustering method
FIGURE S4 Sensitivity of the parameters to the order of sampling
FIGURE S5 Suitability of objective functions to minimize in fitting calculations
FIGURE S6 Histograms showing the structural dissimilarity, D, of 9-residue segments
FIGURE S7 Structural differences of 9-residue segments from the center of the cluster

Summary of Classified Structural Motifs

An in-depth analysis of each cluster is beyond the scope of this paper. Here we show several clusters obtained under a typical condition (L = 9, D_th = 30°, Culled PDB) to illustrate that our clustering method succeeded in extracting distinct structural motifs, including known canonical ones. The structural and sequential summaries of these clusters are listed in Fig. S1 in Supplemental Materials.

Methodological Advantages

To compare protein structures, algorithms based on the Cartesian coordinates of Cα atoms are the most commonly used (1-6). While these algorithms are effective in comparing the topology of whole proteins, they are costly in calculation time because they require coordinate transformations to obtain the best superimposition between two proteins. Hence, in order to classify an enormous number of short segments efficiently, we developed a new algorithm that defines the structural dissimilarity from dihedral angles (φ, ψ and ω), which contain enough information to reproduce the backbone structure of proteins. In contrast to the methods by de Brevern et al. (7), our definition contains all three dihedral angles (see Methods). Although one might expect that including the ω angle would cause confusion, we confirmed that the local structures containing a cis peptide bond were completely discriminated from the clusters of segments having an all-trans configuration (data not shown). Thus, this algorithm does not need preprocessing to eliminate structures containing cis peptide bonds.

The single-pass clustering method (8) used in the present study does not require the total number of clusters to be specified before clustering (Fig. S2 in Supplemental Materials). Time-consuming iterative calculations are also unnecessary. The method is therefore applicable to problems in which the number of clusters is unknown, and it can process large-scale calculations faster than other non-hierarchical clustering methods such as k-means and the self-organizing map (SOM). The single-pass method also has an intrinsic advantage in classifying samples that have a highly unbalanced distribution (in fact, there is more than a thousand-fold difference in frequency in Fig. 1a), whereas k-means and SOM are better suited to the strict analysis of uniformly distributed samples.

Reproducibility of Results in the Single-Pass Clustering Method

A disadvantage of the single-pass method is that the clustering results might be altered by the order of sampling. Before proceeding with in-depth analyses, we therefore checked the reproducibility of the calculations. Fig. S3a in Supplemental Materials shows ten distribution curves of 9-residue segments that were obtained independently when the order of sampling was changed randomly. Their almost identical shapes indicate that the structural distribution of protein segments is so robust that the results of the single-pass clustering method are not influenced significantly by the order of sampling.
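The single-pass procedure and the order-of-sampling check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact definition of D is given in Methods and is not reproduced here, so `dissimilarity` assumes a root-mean-square of periodic angle differences, the first member of each cluster is assumed to remain its center, and the helix-like and strand-like angle templates are toy values chosen only for the demonstration.

```python
import random

def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return 360.0 - d if d > 180.0 else d

def dissimilarity(seg_a, seg_b):
    """Assumed form of D: RMS of periodic differences over all dihedral angles."""
    squares = [angular_difference(x, y) ** 2 for x, y in zip(seg_a, seg_b)]
    return (sum(squares) / len(squares)) ** 0.5

def single_pass_cluster(segments, d_th):
    """Visit each segment once: join the nearest center within d_th, else found a new cluster."""
    centers, members = [], []
    for seg in segments:
        dists = [dissimilarity(seg, c) for c in centers]
        if dists and min(dists) <= d_th:
            members[dists.index(min(dists))].append(seg)
        else:
            # Assumption: the founding segment stays the cluster center.
            centers.append(seg)
            members.append([seg])
    return centers, members

rng = random.Random(0)

def noisy(template, n_res=9, sd=5.0):
    """Flat list of (phi, psi, omega) for n_res residues with Gaussian jitter."""
    return [angle + rng.gauss(0.0, sd) for _ in range(n_res) for angle in template]

# Toy data: 30 helix-like and 10 strand-like 9-residue segments (hypothetical values).
segments = [noisy((-64.0, -41.0, 180.0)) for _ in range(30)]
segments += [noisy((-115.0, 137.0, 178.0)) for _ in range(10)]

# Repeat the clustering with ten random sampling orders.
cluster_counts = []
for _ in range(10):
    rng.shuffle(segments)
    centers, members = single_pass_cluster(segments, d_th=30.0)
    cluster_counts.append(len(centers))
print(cluster_counts)
```

With well-separated toy motifs the number of clusters does not depend on the sampling order; with real data the comparison would be made on the full rank-frequency curves, as in Fig. S3a.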
We also evaluated the statistical deviations of the various parameters introduced in the present study by analyzing 100 (or 1000) independent sets of results for both 9-residue and 21-residue segments, obtained by changing the order of sampling randomly (Fig. S4 and Table S1 in Supplemental Materials). The resulting standard deviations of all parameters are small. For instance, the standard deviations of log(N_est)/L and S_est/L are less than 1% of the averaged parameters. This implies that the conclusion concerning the structural diversity of protein segments presented in the present study is essentially unaffected by the order of sampling. Consequently, if the standard deviations were drawn in Fig. 3 and Fig. 4, the error bars would be smaller than the circles marking the data points.

Furthermore, we checked the adequacy of the single-pass method by comparing it with an iterative clustering method. As shown in Fig. S3b in Supplemental Materials, the difference between the result of the single-pass clustering and the result of an iterative calculation (100 iterations) based on the k-means algorithm, using the former result as the initial condition, was not significant. Thus, we concluded that an iterative and time-consuming calculation is not necessary for our purpose.

Validity of Overlapping Segment Sampling

To address the concern that analyses using a set of overlapping segments might introduce serious biases into the statistical results, we checked the amino acid compositions of the sets of segments with and without overlap. Table S2 in Supplemental Materials shows the compositions calculated from nine sets of 9-residue segments, where each set has no overlap. Compared with the composition of the set of overlapping segments, i.e. the set of all 9-residue segments, there is no significant deviation among the compositions. This supports the validity of the statistical analyses using a set of overlapping segments performed in the present study.

Objective Function in Fitting Calculations

Before fitting the modified Mandelbrot formula (Eq. 2) to the empirical distributions, we tried several equations as the objective function to minimize, because the ordinary choice, i.e. the sum of squared errors, seems inappropriate for obtaining a good fit in a double-logarithmic plot. In fact, large upward shifts were found in low-ranked clusters when Eq. s1 was used (Fig. S5a in Supplemental Materials). With Eq. s2, high-ranked clusters were underestimated. Among the equations tested, we found that Eq. s3 and Eq. s4 are usable.
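The weak constraint that a plain sum of squared errors places on low-ranked clusters can be illustrated numerically. The sketch below is an illustration under stated assumptions: it takes Eq. s1 to be the ordinary mean squared error and Eq. s4 to divide each squared residual by the model value a(r + b)^(-β), and the coefficient values and the 10% perturbations are hypothetical.

```python
def mandelbrot(r, a, b, beta):
    """Modified Mandelbrot formula: f(r) = a * (r + b) ** (-beta)."""
    return a * (r + b) ** (-beta)

def g1(freqs, a, b, beta):
    """Ordinary mean squared error over ranks 2..N_cls (assumed form of Eq. s1)."""
    n_cls = len(freqs)
    total = 0.0
    for r in range(2, n_cls + 1):
        total += (freqs[r - 1] - mandelbrot(r, a, b, beta)) ** 2
    return total / n_cls

def g4(freqs, a, b, beta):
    """Squared error divided by the model value (assumed form of Eq. s4)."""
    n_cls = len(freqs)
    total = 0.0
    for r in range(2, n_cls + 1):
        model = mandelbrot(r, a, b, beta)
        total += (freqs[r - 1] - model) ** 2 / model
    return total / n_cls

# Hypothetical coefficients and number of clusters, for illustration only.
a, b, beta, n_cls = 0.2, 12.0, 1.1, 4000
exact = [mandelbrot(r, a, b, beta) for r in range(1, n_cls + 1)]

# Raise the frequencies by 10% either in the head (ranks 2-10)
# or in the tail (ranks 2001-4000) of the distribution.
head_off = [f * 1.1 if 2 <= r <= 10 else f for r, f in enumerate(exact, start=1)]
tail_off = [f * 1.1 if r > 2000 else f for r, f in enumerate(exact, start=1)]

tail_to_head_g1 = g1(tail_off, a, b, beta) / g1(head_off, a, b, beta)
tail_to_head_g4 = g4(tail_off, a, b, beta) / g4(head_off, a, b, beta)
print(tail_to_head_g1, tail_to_head_g4)
```

Under the unweighted objective the same relative misfit in the tail contributes far less than in the head, so a minimizer is free to leave low-ranked clusters poorly fitted; the weighted objective penalizes the two regions on a comparable scale.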
Then, we further examined the goodness of fit using the Kolmogorov-Smirnov (KS) statistic. As judged by the KS values, the fitting calculation using Eq. s4 performs well in many cases (Fig. S5b in Supplemental Materials). Accordingly, we chose Eq. s4 as the objective function to minimize and used it in further analyses. The equation can be interpreted on the assumption that the expected error in f(r) should be proportional to the square root of f(r).

$$ g_1(a, b, \beta) = \frac{1}{N_{cls}} \sum_{r=2}^{N_{cls}} \left[ f_{cls}(r) - a(r+b)^{-\beta} \right]^2 \tag{s1} $$

$$ g_2(a, b, \beta) = \frac{1}{N_{cls}} \sum_{r=2}^{N_{cls}} \left[ \log f_{cls}(r) - \log\left\{ a(r+b)^{-\beta} \right\} \right]^2 \tag{s2} $$

$$ g_3(a, b, \beta) = \frac{1}{N_{cls}} \sum_{r=2}^{N_{cls}} \frac{\left[ \log f_{cls}(r) - \log\left\{ a(r+b)^{-\beta} \right\} \right]^2}{r} \tag{s3} $$

$$ g_4(a, b, \beta) = \frac{1}{N_{cls}} \sum_{r=2}^{N_{cls}} \frac{\left[ f_{cls}(r) - a(r+b)^{-\beta} \right]^2}{a(r+b)^{-\beta}} \tag{s4} $$

Goodness of Fit Evaluated by Parametric Bootstrapping Analysis

To test the goodness of fit and to provide confidence intervals for the fit coefficients obtained by fitting the modified Mandelbrot formula (Eq. 2) to the empirical distributions, a parametric bootstrapping analysis was performed. Assuming that the expected error ε is proportional to the square root of f(r), as described above,

$$ \varepsilon = \frac{f_{cls}(r) - a(r+b)^{-\beta}}{\sqrt{a(r+b)^{-\beta}}} $$

the model to be evaluated is

$$ f_{cls}(r > 1) = a(r+b)^{-\beta} + \sqrt{a(r+b)^{-\beta}}\, \varepsilon \tag{s5} $$

Here, the error is assumed to follow a normal distribution, ε ~ N(0, σ_ε^2). The variance σ_ε^2 was determined by a maximum likelihood method using the residuals of f_cls(r) from the best-fitted curve. We generated 10000 sets of artificial data that follow Eq. s5, using the fit coefficients obtained from the original data. By repeating the same fitting calculations on the artificial data, we obtained bootstrap fit coefficients (a*, b*, and β*) as well as two kinds of goodness-of-fit statistics (χ^2* and G^2*):

$$ \chi^2 = \sum \frac{(X - E)^2}{E} = M \sum_{r} \frac{\left[ f_{cls}(r) - a(r+b)^{-\beta} \right]^2}{a(r+b)^{-\beta}} \tag{s6} $$

$$ G^2 = 2 \sum X \ln\frac{X}{E} = 2M \sum_{r} f_{cls}(r) \ln\frac{f_{cls}(r)}{a(r+b)^{-\beta}} \tag{s7} $$

Confidence intervals of the bootstrap parameters were determined using a percentile method. The results are summarized in Table S3 in Supplemental Materials. In many cases, the fit coefficients and the goodness-of-fit statistics calculated from the original data lie within the 99% confidence intervals of the bootstrap parameters. This is more evident for relatively long segments than for relatively short ones. The tendency is consistent with the results in Fig. 2, where the model appears to fit the empirical data of relatively long segments well. Through this analysis, the null hypothesis that the empirical data can be represented by the modified Mandelbrot formula is not rejected, especially for relatively long segments.

Assessment of the Structural Dissimilarity Threshold, D_th

In the single-pass clustering method used in the present study, the distribution of the local structures depends in principle on one parameter, the structural dissimilarity threshold D_th, whose value was assigned arbitrarily before clustering (see Methods and Fig. S2 in Supplemental Materials). The effect of the value of D_th on the shape of the distribution functions has already been described in Results. Here, we show some characteristics of the structural dissimilarity D to help readers grasp the technical meaning of this threshold.

Fig. S6 in Supplemental Materials shows histograms of 9-residue segments, in which the x-axis indicates the value of D of these segments against certain typical secondary structures. Since the unit of D corresponds to an angle, the minimum and maximum values of D are in principle 0° and 180°, respectively. However, the actual data show that most segments are distributed within the range from 0° to 110°. In these three histograms, the shapes in the right half, corresponding to the distributions of relatively dissimilar segments, are similar to each other.
In contrast, the distributions in the left half differ in shape, which may reflect the individuality of the clusters composed of relatively similar segments. Our condition for D_th is 20°, 30°, or 40°, which seems reasonable for discriminating relatively similar segments from relatively dissimilar ones.

Fig. S7 in Supplemental Materials illustrates the structural difference of 9-residue segments from the center of the cluster to which the segments were assigned. The x- and y-axes indicate the differences in D and in backbone RMS deviation (bbrmsd), respectively. Although the data are widely dispersed, a rough correlation between D and bbrmsd can be identified. For segments having similar structures, a smaller distance in angles indicates a smaller distance in coordinates. From this correlation, we can say that 10°, 20°, and 30° in D_th correspond to approximately 0.5, 1.0, and 2.0 Å in bbrmsd.

REFERENCE

1. Holm, L. and C. Sander. 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233:123-138.
2. Madej, T., J. F. Gibrat, and S. H. Bryant. 1995. Threading a database of protein cores. Proteins 23:356-369.
3. Shindyalov, I. N. and P. E. Bourne. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11:739-747.
4. Rooman, M. J., J. Rodriguez, and S. J. Wodak. 1990. Automatic definition of recurrent local structure motifs in proteins. J. Mol. Biol. 213:327-336.
5. Fetrow, J. S., M. J. Palumbo, and G. Berg. 1997. Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme. Proteins 27:249-271.
6. Hunter, C. G. and S. Subramaniam. 2003. Protein fragment clustering and canonical local shapes. Proteins 50:580-588.
7. de Brevern, A. G., C. Etchebest, and S. Hazout. 2000. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 41:271-287.
8. Richards, J. A. and X. Jia. 1999. Remote sensing digital image analysis. New York: Springer-Verlag.

TABLE S1 Sensitivity of the single-pass clustering method to the order of sampling. Averaged values and standard deviations of various parameters obtained from 100 or 1000 clustering calculations with different orders of sampling are listed.

L    f_cls(r=1)  N        N_cls   a        b       β       N_est          log(N_est)/L  S_est/L  remark

PDB Select
9    0.122       29805    4052    0.214    13.8    1.08    1.4 × 10^4     0.462         1.147    A
     ±0.002      ±49      ±27     ±0.033   ±2.6    ±0.02   ±1.2 × 10^3    ±0.004        ±0.004
21   0.0280      123526   4071    0.035    14.1    0.89    8.2 × 10^5     0.281         0.781    A
     ±0.0003     ±70      ±32     ±0.002   ±1.3    ±0.01   ±1.2 × 10^5    ±0.003        ±0.004

Culled PDB
9    0.143       10516    3841    0.261    12.6    1.12    9.0 × 10^3     0.439         1.070    A
     ±0.007      ±29      ±22     ±0.022   ±1.4    ±0.01   ±2.8 × 10^2    ±0.001        ±0.004
9    0.144       10516    3841    0.269    13.2    1.12    9.1 × 10^3     0.440         1.069    B
     ±0.006      ±29      ±22     ±0.026   ±1.6    ±0.01   ±3.1 × 10^2    ±0.002        ±0.004
21   0.0194      44016    4781    0.039    10.6    0.89    3.9 × 10^5     0.266         0.742    A
     ±0.0006     ±32      ±25     ±0.002   ±1.0    ±0.01   ±3.4 × 10^4    ±0.002        ±0.002

A: Clustering conditions: D_th = 30°, number of runs = 100.
B: Clustering conditions: D_th = 30°, number of runs = 1000.
Refer to Methods for the meaning of symbols.

TABLE S2 Amino acid compositions of the sets of 9-residue segments with and without overlap

                  Sets of segments without overlap                                                    With
                  1       2       3       4       5       6       7       8       9       overlap
no. of segments   8521    8522    8522    8522    8522    8522    8521    8521    8521    76694
ALA               0.0888  0.0884  0.0884  0.0883  0.0884  0.0883  0.0882  0.0884  0.0885  0.0884
ARG               0.0461  0.0460  0.0461  0.0460  0.0458  0.0460  0.0462  0.0462  0.0461  0.0461
ASN               0.0477  0.0479  0.0479  0.0479  0.0478  0.0479  0.0479  0.0479  0.0479  0.0479
ASP               0.0607  0.0607  0.0607  0.0607  0.0604  0.0605  0.0604  0.0606  0.0606  0.0606
CYS               0.0152  0.0151  0.0151  0.0150  0.0150  0.0150  0.0150  0.0149  0.0150  0.0150
GLN               0.0379  0.0379  0.0379  0.0378  0.0377  0.0378  0.0377  0.0379  0.0379  0.0378
GLU               0.0599  0.0598  0.0602  0.0601  0.0602  0.0602  0.0602  0.0601  0.0599  0.0601
GLY               0.0810  0.0810  0.0810  0.0811  0.0813  0.0812  0.0811  0.0809  0.0810  0.0811
HIS               0.0232  0.0233  0.0232  0.0230  0.0232  0.0232  0.0232  0.0232  0.0232  0.0232
ILE               0.0522  0.0519  0.0520  0.0521  0.0521  0.0523  0.0524  0.0523  0.0522  0.0522
LEU               0.0853  0.0855  0.0855  0.0856  0.0855  0.0853  0.0853  0.0852  0.0854  0.0854
LYS               0.0557  0.0560  0.0560  0.0559  0.0559  0.0558  0.0558  0.0557  0.0558  0.0558
MET               0.0203  0.0203  0.0202  0.0202  0.0202  0.0202  0.0202  0.0203  0.0202  0.0202
PHE               0.0392  0.0391  0.0391  0.0393  0.0395  0.0394  0.0395  0.0394  0.0394  0.0393
PRO               0.0461  0.0460  0.0461  0.0460  0.0460  0.0458  0.0459  0.0460  0.0461  0.0460
SER               0.0615  0.0616  0.0616  0.0616  0.0616  0.0618  0.0618  0.0618  0.0616  0.0617
THR               0.0588  0.0588  0.0585  0.0587  0.0586  0.0588  0.0589  0.0588  0.0589  0.0587
TRP               0.0157  0.0157  0.0157  0.0156  0.0157  0.0157  0.0156  0.0156  0.0156  0.0157
TYR               0.0361  0.0362  0.0361  0.0361  0.0361  0.0361  0.0359  0.0360  0.0361  0.0361
VAL               0.0686  0.0689  0.0689  0.0689  0.0689  0.0688  0.0688  0.0687  0.0686  0.0688

The value with an underline corresponds to either the minimum or the maximum among the sets of segments having no overlap. These data were calculated from the Culled PDB.
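The comparison summarized in Table S2 can be sketched as follows. This is only an illustration of the bookkeeping: a single random sequence stands in for the protein chains of the Culled PDB, and the function names are hypothetical. The nine non-overlapping sets are assumed to correspond to the nine possible starting offsets of a 9-residue window.

```python
from collections import Counter
import random

SEGMENT_LENGTH = 9

def segments_with_overlap(seq, length=SEGMENT_LENGTH):
    """All windows of the given length, shifted by one residue."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def segments_without_overlap(seq, offset, length=SEGMENT_LENGTH):
    """Windows that start at `offset` and advance by a full window length."""
    return [seq[i:i + length] for i in range(offset, len(seq) - length + 1, length)]

def composition(segments):
    """Fraction of each residue type among all residues of the segments."""
    counts = Counter("".join(segments))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# Stand-in data: one random sequence of 5000 residues (hypothetical).
rng = random.Random(1)
alphabet = "ACDEFGHIKLMNPQRSTVWY"
seq = "".join(rng.choice(alphabet) for _ in range(5000))

overlap_comp = composition(segments_with_overlap(seq))

# Largest composition difference between any non-overlapping set and the overlapping set.
max_dev = 0.0
for offset in range(SEGMENT_LENGTH):
    comp = composition(segments_without_overlap(seq, offset))
    for aa in alphabet:
        max_dev = max(max_dev, abs(comp.get(aa, 0.0) - overlap_comp.get(aa, 0.0)))
print(max_dev)
```

Because every non-overlapping set still covers almost every residue once, its composition can differ from that of the overlapping set only through the few residues near the chain ends.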

TABLE S3 Confidence intervals of fit coefficients and goodness-of-fit statistics determined by parametric bootstrapping analysis.

L    a        a*                    b       b*
7    0.3512   [ 0.2613, 0.3104 ]    18.56   [ 15.41, 17.88 ]
9    0.1820   [ 0.1518, 0.1688 ]    11.41   [ 9.76, 11.08 ]
11   0.0888   [ 0.0821, 0.0873 ]    6.53    [ 5.90, 6.55 ]
13   0.0599   [ 0.0573, 0.0598 ]    6.58    [ 6.16, 6.66 ]
15   0.0393   [ 0.0381, 0.0394 ]    4.79    [ 4.51, 4.88 ]
17   0.0382   [ 0.0371, 0.0384 ]    7.82    [ 7.47, 7.96 ]
19   0.0389   [ 0.0374, 0.0392 ]    12.35   [ 11.80, 12.59 ]
21   0.0330   [ 0.0320, 0.0333 ]    13.39   [ 12.90, 13.66 ]
31   0.0152   [ 0.0146, 0.0155 ]    11.45   [ 10.97, 11.81 ]

L    β       β*                  N_est      N_est*
7    1.152   [ 1.099, 1.125 ]    5683       [ 4100, 4783 ]
9    1.062   [ 1.029, 1.045 ]    13138      [ 9764, 11006 ]
11   0.962   [ 0.948, 0.957 ]    24806      [ 21214, 22947 ]
13   0.913   [ 0.905, 0.912 ]    45564      [ 41528, 44021 ]
15   0.867   [ 0.861, 0.866 ]    74892      [ 69715, 73359 ]
17   0.876   [ 0.871, 0.877 ]    162448     [ 150328, 160090 ]
19   0.893   [ 0.886, 0.893 ]    396416     [ 350967, 388838 ]
21   0.884   [ 0.879, 0.885 ]    727170     [ 653740, 728501 ]
31   0.860   [ 0.854, 0.863 ]    20231942   [ 16240935, 22164407 ]

L    S_est   S_est*              Z/α     (Z/α)*
7    9.31    [ 9.17, 9.28 ]      3.438   [ 3.282, 3.355 ]
9    10.28   [ 10.14, 10.23 ]    2.868   [ 2.775, 2.812 ]
11   11.38   [ 11.29, 11.35 ]    2.509   [ 2.474, 2.491 ]
13   12.48   [ 12.42, 12.47 ]    2.282   [ 2.266, 2.276 ]
15   13.42   [ 13.37, 13.40 ]    2.113   [ 2.103, 2.110 ]
17   14.44   [ 14.39, 14.43 ]    2.025   [ 2.016, 2.024 ]
19   15.48   [ 15.40, 15.46 ]    1.971   [ 1.958, 1.969 ]
21   16.33   [ 16.26, 16.32 ]    1.902   [ 1.892, 1.902 ]
31   20.66   [ 20.49, 20.73 ]    1.721   [ 1.708, 1.726 ]

L    χ^2    χ^2*              G^2    G^2*
7    9532   [ 7782, 8850 ]    -154   [ 718, 1279 ]
9    7487   [ 6379, 7138 ]    32     [ 476, 977 ]
11   3532   [ 3200, 3555 ]    148    [ 21, 382 ]
13   1647   [ 1524, 1683 ]    116    [ -96, 191 ]
15   1033   [ 961, 1061 ]     -49    [ -95, 151 ]
17   676    [ 636, 704 ]      -92    [ -90, 112 ]
19   636    [ 598, 666 ]      -42    [ -75, 102 ]
21   353    [ 331, 370 ]      -62    [ -61, 71 ]
31   98     [ 89, 105 ]       1      [ -25, 28 ]

Values in brackets indicate the 99% confidence intervals of the bootstrap parameters obtained from the parametric bootstrapping analysis (10000 times) with a percentile method. Clustering conditions: D_th = 30°, PDB Select. Refer to Methods for the meaning of symbols.
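The bracketed intervals above come from a parametric bootstrap with a percentile method; the loop can be sketched as follows. This is a simplified illustration rather than the procedure used for Table S3: only the coefficient a is refitted, with b and β held fixed, because minimizing the sum of [f − a·m]^2 / (a·m) over a (with m = (r + b)^(-β)) has the closed form a^2 = Σ(f^2/m) / Σm. The coefficient values, σ_ε, the rank range, and the 2000 replicates are hypothetical.

```python
import math
import random

def model_shape(r, b, beta):
    """(r + b) ** (-beta), the rank dependence of the modified Mandelbrot formula."""
    return (r + b) ** (-beta)

def fit_a(freqs, ranks, b, beta):
    """Closed-form minimizer of sum((f - a*m)**2 / (a*m)) over a, with b and beta fixed."""
    shapes = [model_shape(r, b, beta) for r in ranks]
    numerator = sum(f * f / m for f, m in zip(freqs, shapes))
    return math.sqrt(numerator / sum(shapes))

def simulate(ranks, a, b, beta, sigma, rng):
    """Artificial data: model value plus sqrt(model value) times a normal error."""
    data = []
    for r in ranks:
        mean = a * model_shape(r, b, beta)
        data.append(mean + math.sqrt(mean) * rng.gauss(0.0, sigma))
    return data

def percentile_interval(values, level=0.99):
    """Percentile-method interval from the sorted bootstrap replicates."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lo = ordered[int(tail * len(ordered))]
    hi = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return lo, hi

rng = random.Random(42)
ranks = list(range(2, 501))  # the fit uses ranks r > 1
a_hat, b_hat, beta_hat, sigma_hat = 0.2, 12.0, 1.1, 0.002  # hypothetical fit results

# Generate artificial data sets from the fitted model and refit each one.
boot_a = [
    fit_a(simulate(ranks, a_hat, b_hat, beta_hat, sigma_hat, rng), ranks, b_hat, beta_hat)
    for _ in range(2000)
]
lo, hi = percentile_interval(boot_a)
print(lo, hi)
```

A full analysis would refit all three coefficients with a numerical optimizer and collect χ^2* and G^2* for every replicate in the same loop.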

FIGURE S1 Structural and sequential summary of clusters. Only representative clusters are shown.

Rank: 1
Clustering conditions: L = 9, D_th = 30°, Culled PDB

Cluster ranking                      1
The number of assigned segments      11660
Normalized frequency                 0.1520
RMS deviation (Å)                    0.36
Kullback-Leibler entropy (bit)       0.09
The number of hydrogen bonds         7.7

The centroid of the cluster
Pos.   Phi     Psi     Omega
-4     64.9    39.0    179.5
-3     65.1    40.0    179.3
-2     64.8    40.8    179.3
-1     64.1    41.2    179.3
 0     64.2    41.3    179.3
+1     64.3    41.2    179.4
+2     64.8    40.6    179.6
+3     65.6    38.8    179.9
+4     69.5    34.2    180.0

The Position Specific Scoring Matrix (columns run from the N-terminal side to the C-terminal side)
       -4      -3      -2      -1      0       +1      +2      +3      +4
ALA    1.3631  1.4486  1.4690  1.4992  1.5031  1.5614  1.5060  1.5157  1.5011
ARG    1.1405  1.1294  1.1517  1.2766  1.2952  1.2263  1.2486  1.2952  1.3139
ASN    0.8116  0.8152  0.7845  0.7827  0.7971  0.7483  0.7212  0.7700  0.9038
ASP    1.0249  0.9793  0.7851  0.7009  0.7023  0.6409  0.6024  0.6695  0.6752
CYS    0.6916  0.7253  0.7984  0.7872  0.7478  0.8434  0.8547  0.8659  0.8209
GLN    1.3719  1.4036  1.3289  1.3153  1.3289  1.2927  1.2134  1.3266  1.3742
GLU    1.4663  1.4292  1.1965  1.1308  1.2208  1.1180  1.0737  1.2365  1.2764
GLY    0.6120  0.5991  0.5531  0.4921  0.4761  0.4472  0.4194  0.4258  0.4151
HIS    0.9530  0.9679  0.9234  0.8789  0.9048  0.8714  0.9159  0.9419  1.0643
ILE    0.9224  0.9927  1.2037  1.2593  1.2119  1.2560  1.3051  1.1693  1.0123
LEU    1.0593  1.1737  1.3282  1.3743  1.3633  1.5128  1.5609  1.4436  1.3763
LYS    0.9955  0.9864  0.9940  1.1000  1.1516  1.1000  1.1576  1.3273  1.3804
MET    1.1473  1.3039  1.4563  1.4394  1.4097  1.5283  1.6130  1.4987  1.4055
PHE    0.9264  1.0130  1.0952  1.0823  1.0281  1.0823  1.0996  0.9416  0.9004
PRO    0.7889  0.3211  0.2858  0.2858  0.2766  0.2283  0.1838  0.2135  0.1949
SER    0.8643  0.8043  0.7135  0.6898  0.6982  0.6786  0.6856  0.7708  0.8476
THR    0.8238  0.8310  0.8049  0.7599  0.7759  0.7324  0.7541  0.7541  0.8093
TRP    1.0839  1.0343  0.9903  1.0674  1.0509  1.0949  1.0949  0.9518  0.8638
TYR    0.8646  0.8622  0.9003  0.8884  0.9313  0.9599  0.9742  0.8527  0.9194
VAL    0.9276  0.9699  1.0818  1.0619  0.9798  0.9910  1.0146  0.8903  0.8120

FIGURE S1 (continued)

Rank: 2
Clustering conditions: L = 9, D_th = 30°, Culled PDB

Cluster ranking                      2
The number of assigned segments      1577
Normalized frequency                 0.0206
RMS deviation (Å)                    0.53
Kullback-Leibler entropy (bit)       0.23
The number of hydrogen bonds         6.6

The centroid of the cluster
Pos.   Phi     Psi     Omega
-4     96.8    138.1   179.0
-3     61.4    35.8    179.7
-2     65.1    36.0    179.1
-1     69.1    37.0    179.2
 0     64.1    40.8    179.7
+1     63.9    40.3    179.2
+2     64.4    40.6    179.5
+3     65.2    40.2    179.6
+4     66.5    37.2    179.5

The Position Specific Scoring Matrix (columns run from the N-terminal side to the C-terminal side)
       -4      -3      -2      -1      0       +1      +2      +3      +4
ALA    0.2875  1.0136  1.3227  1.0711  1.4305  1.5096  1.6390  1.4808  1.4377
ARG    0.5512  0.9232  0.9921  0.5925  1.1988  1.4055  1.3504  1.2401  1.2953
ASN    2.7665  0.4009  0.7751  0.7618  0.4544  0.8152  0.9088  0.6415  0.6014
ASP    3.2403  1.0238  2.0054  1.6676  0.3905  0.7388  0.9816  0.4327  0.5488
CYS    0.7899  0.6652  0.4157  0.5820  1.1225  0.7899  0.4989  0.7068  1.0393
GLN    0.5524  0.8035  1.7073  1.7743  1.2889  1.2721  1.8078  0.9374  1.1215
GLU    0.5912  1.3302  2.7131  2.2064  0.6229  1.5307  1.7419  0.7495  0.6440
GLY    0.6645  0.5379  0.5854  0.5854  0.3243  0.4983  0.4983  0.3006  0.3955
HIS    1.3161  0.7403  1.0419  1.1516  0.8500  0.9871  0.7403  0.7403  0.6032
ILE    0.1572  0.7497  0.3144  0.6409  1.8259  0.9311  0.7981  1.4631  1.5236
LEU    0.2003  0.9420  0.4154  0.8826  1.5131  1.2090  1.0829  1.8691  1.8691
LYS    0.5602  0.8290  1.0083  0.9523  0.9411  1.2884  1.3892  0.7618  1.0867
MET    0.3130  0.9077  0.6886  1.2834  1.6590  1.0956  0.8138  1.8155  1.5338
PHE    0.3201  0.8482  0.5921  0.9922  1.5684  1.0082  0.7842  1.6164  1.3443
PRO    1.4136  3.8564  0.5352  0.2196  0.1784  0.4117  0.4117  0.1510  0.1372
SER    2.7256  0.9395  1.3008  0.7433  0.5885  0.8466  0.7020  0.5781  0.6504
THR    2.2411  0.7506  0.8150  1.0509  0.5898  0.8257  0.7077  0.6756  0.6648
TRP    0.4068  1.3017  1.1797  1.0983  1.3831  0.7729  0.9763  1.3017  1.1797
TYR    0.4755  0.7396  0.7572  0.9334  1.1447  0.6164  0.8629  1.3208  1.0918
VAL    0.0919  0.8826  0.4413  0.9102  1.5721  0.9102  0.6160  1.3423  1.1676

FIGURE S1 (continued)

Rank: 3
Clustering conditions: L = 9, D_th = 30°, Culled PDB

Cluster ranking                      3
The number of assigned segments      1030
Normalized frequency                 0.0134
RMS deviation (Å)                    1.48
Kullback-Leibler entropy (bit)       0.10
The number of hydrogen bonds         6.0

The centroid of the cluster
Pos.   Phi     Psi     Omega
-4     107.2   139.6   178.0
-3     108.4   137.2   178.3
-2     113.8   138.3   177.7
-1     115.3   137.2   177.9
 0     114.7   136.2   177.9
+1     115.0   136.9   177.9
+2     111.3   135.7   178.2
+3     110.3   139.1   178.0
+4     101.9   138.8   179.1

The Position Specific Scoring Matrix (columns run from the N-terminal side to the C-terminal side)
       -4      -3      -2      -1      0       +1      +2      +3      +4
ALA    0.8144  0.7374  0.7814  0.7814  0.8475  0.7704  0.7594  0.8585  0.7704
ARG    1.1603  1.0760  1.0971  1.0549  1.1814  1.0127  1.3291  1.0338  0.9916
ASN    0.9413  0.7571  0.5729  0.7776  0.4911  0.5116  0.6139  0.4911  0.9617
ASP    0.6464  0.5656  0.3555  0.4525  0.4848  0.4363  0.4686  0.3717  0.9534
CYS    0.8911  1.3367  0.9548  1.0821  1.0821  1.4004  1.4004  0.8275  0.8275
GLN    1.2814  0.8713  0.6407  0.7688  0.8970  0.9995  0.9995  1.0507  0.9482
GLU    1.0183  0.9536  0.6465  0.6304  0.6950  0.9213  0.8566  0.8243  1.0021
GLY    0.4844  0.4360  0.4481  0.5450  0.4723  0.3997  0.4118  0.6540  0.5692
HIS    1.0495  0.7556  1.2174  1.3433  0.9235  1.1334  1.2594  1.5532  1.3853
ILE    1.0738  1.3515  1.7403  1.7033  1.6662  1.5552  1.4626  1.6477  1.1479
LEU    0.7722  0.9766  1.2492  1.2605  1.1356  1.1924  1.0788  0.9426  0.8404
LYS    1.5094  1.1321  0.8748  0.8576  0.7547  0.5317  0.7547  1.0463  1.1321
MET    1.1502  0.8626  1.1981  1.1502  1.1502  1.0064  1.0064  1.0543  0.7189
PHE    0.9066  0.8331  1.2252  1.3967  1.4212  1.3722  1.6172  1.4947  1.2497
PRO    1.3238  1.8281  1.0296  0.6304  0.7354  1.0296  0.9245  1.0296  1.5339
SER    0.7903  0.8536  0.8220  0.8852  0.8062  0.7587  0.7903  0.5216  0.9010
THR    1.1985  1.1985  1.1328  1.3298  1.3298  1.4448  1.1492  1.1492  1.3134
TRP    0.8097  0.9965  1.1834  0.7474  0.8720  0.8097  1.3079  1.0588  0.9965
TYR    1.4560  1.5908  1.6447  1.3481  1.3751  1.4830  1.4560  1.4560  0.9976
VAL    1.3372  1.4921  1.8440  1.6469  1.9566  1.8581  1.6469  1.7032  1.1824

FIGURE S1 (continued) Rank: 4
Clustering conditions: L=9, Dth=30, culled PDB

Cluster ranking                   4
The number of assigned segments   972
Normalized frequency              0.0127
RMS deviation (Å)                 0.68
Kullback-Leibler entropy (bit)    0.27
The number of hydrogen bonds      6.0

The centroid of the cluster
Pos.     Phi     Psi   Omega
  -4    97.6   142.6   177.8
  -3    87.6   146.8   179.4
  -2    60.4    37.0   179.8
  -1    64.0    36.9   179.0
   0    68.7    37.5   179.2
   1    64.3    41.0   179.8
   2    63.9    39.7   179.4
   3    65.0    39.1   179.8
   4    68.5    36.8   179.7

The Position Specific Scoring Matrix (columns run from N-terminal to C-terminal)
Res       -4      -3      -2      -1       0       1       2       3       4
ALA   0.7114  0.3382  1.0380  1.3879  0.8281  1.3879  1.3296  1.6095  1.3762
ARG   0.5813  0.4248  0.9837  0.9837  0.5365  0.9837  1.2967  1.3637  1.1625
ASN   0.3252  2.2550  0.4120  0.6288  0.7372  0.3686  0.9107  0.9324  0.7806
ASP   0.3939  2.7741  1.2329  2.0035  1.9864  0.3082  0.7535  1.1987  0.4452
CYS   1.3490  0.8094  0.7420  0.4722  0.4722  1.0118  0.7420  0.4047  0.9443
GLN   0.5703  0.2987  0.8147  1.6294  2.1454  1.1678  1.1949  1.8738  1.0320
GLU   0.5652  0.6509  1.3188  2.8089  2.7918  0.5823  1.4558  1.9012  0.8050
GLY   0.3850  0.5134  0.4492  0.6289  0.4235  0.2438  0.5005  0.5134  0.2823
HIS   1.0231  1.0676  0.7562  0.9342  1.2011  0.8007  1.1121  0.7117  0.8452
ILE   2.1384  0.1373  0.9613  0.2550  0.5493  2.1580  1.1379  0.6278  1.4518
LEU   2.0458  0.1444  0.9747  0.3971  0.8303  1.7569  1.2756  1.0229  1.8412
LYS   0.5635  0.5635  0.8906  0.9815  0.9452  0.8361  1.4723  1.5632  0.8361
MET   2.1837  0.3555  1.0665  0.5078  1.3712  1.7267  0.9141  0.8125  1.8790
PHE   1.7397  0.2077  0.8569  0.5972  0.8309  1.4800  1.0126  0.6491  1.6618
PRO   1.3582  2.4938  3.0504  0.7570  0.1336  0.0668  0.3340  0.4453  0.1336
SER   0.3685  3.0150  0.8543  1.4238  0.7035  0.5025  0.6868  0.6868  0.6365
THR   0.8873  2.6618  0.7829  0.8525  1.1308  0.6611  0.8003  0.6611  0.6263
TRP   1.0560  0.2640  1.6500  1.0560  0.9900  1.5840  0.9240  0.7260  0.8580
TYR   1.2286  0.4000  0.7714  0.5143  0.7714  1.2000  0.7143  0.8286  1.5429
VAL   1.5960  0.0746  0.8950  0.4027  0.8204  1.7154  0.9994  0.5221  1.1784

FIGURE S1 (continued) Rank: 5
Clustering conditions: L=9, Dth=30, culled PDB

Cluster ranking                   5
The number of assigned segments   871
Normalized frequency              0.0114
RMS deviation (Å)                 0.42
Kullback-Leibler entropy (bit)    0.43
The number of hydrogen bonds      6.8

The centroid of the cluster
Pos.     Phi     Psi   Omega
  -4    64.5    40.6   179.3
  -3    62.9    41.4   179.2
  -2    63.3    41.6   179.2
  -1    64.5    41.6   179.3
   0    63.9    41.0   179.4
   1    65.5    41.1   179.5
   2    67.7    31.0   179.7
   3    90.3     3.7   178.9
   4    76.9    23.8   179.1

The Position Specific Scoring Matrix (columns run from N-terminal to C-terminal)
Res       -4      -3      -2      -1       0       1       2       3       4
ALA   1.5748  1.6009  1.3666  1.4707  2.4469  1.2104  1.9132  1.5618  0.1952
ARG   0.5738  1.2225  1.7963  0.9979  0.7734  1.7713  1.2724  1.5718  0.4740
ASN   0.3630  0.7985  0.8227  0.7017  0.3146  0.7017  0.9437  1.1131  1.9600
ASP   0.4777  0.6688  1.0510  0.3058  0.1720  0.8026  1.0510  0.5160  0.6688
CYS   1.3549  1.0538  0.3011  0.8280  1.8065  0.4516  0.4516  0.5269  0.1505
GLN   1.3335  1.0304  1.6365  1.1213  0.8183  1.2122  1.8790  1.4547  0.4546
GLU   0.7645  0.9939  2.0069  1.2042  0.5543  1.7776  2.5039  0.8983  0.3823
GLY   0.5156  0.5013  0.3724  0.2578  0.2721  0.2721  0.2864  0.1862  8.2351
HIS   0.8439  0.7943  1.0921  0.8936  0.5461  1.0921  0.6950  1.4893  1.0921
ILE   1.4012  1.6201  0.7882  1.9266  1.0728  1.1822  0.4598  0.3722  0.0000
LEU   2.0010  1.3832  0.9669  1.5847  2.6321  1.0878  0.8192  2.0815  0.0806
LYS   0.8316  1.2779  1.8661  1.0751  0.7302  2.5152  1.9878  1.1765  1.2576
MET   2.0402  1.5302  0.5667  1.7002  2.0969  0.5101  1.1901  1.3602  0.0000
PHE   1.4778  1.0142  0.7534  1.4778  1.4198  0.4346  0.4057  0.8113  0.1159
PRO   0.2485  0.3727  0.2485  0.1988  0.1739  0.0497  0.1988  0.0994  0.0000
SER   0.6169  0.4299  0.6916  0.5234  0.7851  0.8225  0.9720  1.0842  0.2056
THR   0.8931  0.6795  0.7183  0.5048  0.3883  0.9125  0.3883  1.0484  0.0194
TRP   1.1784  0.4419  1.1784  1.3994  0.9575  0.4419  0.5156  0.5892  0.0000
TYR   0.7015  0.8609  0.7333  1.4348  0.9884  0.6696  0.4783  1.1160  0.0957
VAL   1.0653  1.1985  0.5826  0.9987  0.6658  0.8822  0.4827  0.4827  0.0000

FIGURE S1 (continued) Rank: 6
Clustering conditions: L=9, Dth=30, culled PDB

Cluster ranking                   6
The number of assigned segments   813
Normalized frequency              0.0106
RMS deviation (Å)                 0.62
Kullback-Leibler entropy (bit)    0.14
The number of hydrogen bonds      6.3

The centroid of the cluster
Pos.     Phi     Psi   Omega
  -4    65.5    39.2   179.5
  -3    64.8    40.1   179.1
  -2    64.6    40.0   179.4
  -1    65.0    40.3   179.8
   0    65.8    38.8   179.6
   1    66.8    33.2   179.6
   2    72.0    25.2   178.3
   3    93.1    11.7   178.5
   4    99.3   125.3   180.0

The Position Specific Scoring Matrix (columns run from N-terminal to C-terminal)
Res       -4      -3      -2      -1       0       1       2       3       4
ALA   1.4501  1.4501  1.2549  1.6175  1.3386  1.5338  1.2968  1.1573  0.9621
ARG   1.2028  1.0424  1.2295  1.4166  0.9622  1.2562  1.5235  1.4433  0.9355
ASN   0.5703  0.4407  0.5185  0.6481  0.6740  0.8296  1.0369  1.7887  1.6591
ASP   0.9827  0.5323  0.9827  0.5528  0.4299  0.5528  0.6756  0.6551  1.3922
CYS   0.8064  0.5645  0.7258  0.5645  1.5322  0.9677  1.0483  0.8871  1.5322
GLN   1.1689  1.2987  1.5260  1.7858  1.2338  1.1689  1.5585  1.4935  0.8766
GLU   1.3515  0.8191  1.4129  1.4948  1.0034  0.8396  1.3105  0.8805  0.4095
GLY   0.5524  0.3836  0.3376  0.3222  0.2762  0.2762  0.5370  0.5063  0.5524
HIS   0.9041  0.9041  0.4787  0.9573  1.0105  0.9573  0.9573  1.4891  2.0210
ILE   1.0320  1.8530  1.0320  0.9148  1.2431  1.3370  0.6098  0.5864  0.9382
LEU   1.1798  1.5538  1.3524  1.0215  1.9279  1.7553  1.3236  0.9783  1.3812
LYS   0.9996  0.6954  1.4342  1.4125  0.8910  1.0214  1.8254  1.4777  0.8910
MET   1.7001  2.0036  1.8215  0.9715  2.3072  2.1858  1.2750  0.8500  1.2143
PHE   1.4280  1.4590  0.9003  1.0555  1.2417  1.4280  0.7450  1.0555  1.1796
PRO   0.7720  0.1597  0.2662  0.2928  0.2130  0.2130  0.3727  0.1065  0.1065
SER   0.5207  0.7610  0.6408  0.7810  0.7410  0.4606  1.3417  1.3417  1.0614
THR   0.7280  0.8112  0.7280  0.9152  0.7488  0.6864  0.7904  0.9984  0.9984
TRP   0.5523  1.1836  1.8938  0.9469  1.2625  1.4203  0.3156  0.7891  0.7891
TYR   0.8198  1.1956  0.9565  1.1614  0.7857  1.2639  0.4099  1.2297  1.5030
VAL   1.1057  1.1770  1.0878  0.9452  1.1592  0.8025  0.5350  0.7133  0.7668

FIGURE S1 (continued) Rank: 7
Clustering conditions: L=9, Dth=30, culled PDB

Cluster ranking                   7
The number of assigned segments   685
Normalized frequency              0.0089
RMS deviation (Å)                 0.58
Kullback-Leibler entropy (bit)    0.47
The number of hydrogen bonds      6.3

The centroid of the cluster
Pos.     Phi     Psi   Omega
  -4    62.8    41.3   179.3
  -3    63.3    41.5   179.3
  -2    64.8    41.5   179.3
  -1    63.9    41.6   179.7
   0    64.8    40.3   179.6
   1    68.0    30.9   179.7
   2    90.9     4.1   179.0
   3    78.3    23.4   178.9
   4    93.5   139.9   178.6

The Position Specific Scoring Matrix (columns run from N-terminal to C-terminal)
Res       -4      -3      -2      -1       0       1       2       3       4
ALA   1.6218  1.3736  1.5060  2.6313  1.3901  2.0356  1.6053  0.1655  1.2743
ARG   1.3641  2.1254  1.0151  0.7613  1.8716  1.3006  1.7765  0.4441  0.8565
ASN   0.7692  0.7692  0.5846  0.2461  0.5538  0.9846  1.0153  1.7538  0.4000
ASP   0.6561  0.9234  0.3159  0.0729  0.8262  0.9719  0.3645  0.6318  0.4860
CYS   0.9571  0.4786  0.7657  2.2971  0.3828  0.4786  0.6700  0.1914  0.6700
GLN   1.1175  1.5414  0.8478  0.6936  1.6185  1.9653  1.2717  0.3083  0.4239
GLU   1.0451  2.2116  1.1423  0.5104  2.0172  2.5276  0.9721  0.4132  0.4618
GLY   0.4735  0.4189  0.2550  0.3096  0.2732  0.2550  0.1275  8.6319  0.5099
HIS   0.8206  0.8837  1.0731  0.6312  1.3256  0.5681  1.4518  1.0099  0.6943
ILE   1.5868  0.6960  1.9208  1.1135  0.8630  0.4732  0.3341  0.0000  2.0044
LEU   1.3319  0.8538  1.7076  2.4418  0.9733  0.7343  2.1686  0.0683  1.5368
LYS   1.3670  1.9344  0.9543  0.6964  2.5276  1.9860  1.1864  1.1606  1.1348
MET   1.4412  0.5765  1.5854  2.0898  0.5044  0.9368  1.4412  0.0000  1.2251
PHE   0.9948  0.6264  1.4001  1.5843  0.3684  0.2948  0.9948  0.0737  1.6211
PRO   0.3475  0.3159  0.1896  0.1580  0.0632  0.2528  0.0948  0.0000  0.2844
SER   0.4041  0.8319  0.5467  0.8081  0.8319  1.0696  0.7130  0.1664  0.3328
THR   0.6419  0.6912  0.4937  0.3209  0.9134  0.4197  1.0122  0.0247  0.7159
TRP   0.2810  1.1238  1.6857  1.2175  0.5619  0.2810  0.6556  0.0000  0.9365
TYR   0.7703  0.6081  1.6217  0.8109  0.5271  0.4054  1.3784  0.0811  1.9055
VAL   1.2699  0.5080  1.0371  0.7196  0.6138  0.5291  0.5715  0.0000  2.0319

FIGURE S1 (continued) Rank: 9
Clustering conditions: L=9, Dth=30, culled PDB

Cluster ranking                   9
The number of assigned segments   487
Normalized frequency              0.0063
RMS deviation (Å)                 0.78
Kullback-Leibler entropy (bit)    0.50
The number of hydrogen bonds      5.8

The centroid of the cluster
Pos.     Phi     Psi   Omega
  -4    63.6    40.9   179.3
  -3    64.7    40.9   179.2
  -2    64.5    41.8   179.8
  -1    64.4    40.1   179.6
   0    68.1    30.9   179.5
   1    91.9     3.4   179.2
   2    77.1    23.2   179.1
   3    97.5   144.8   177.9
   4    91.8   136.7   179.5

The Position Specific Scoring Matrix (columns run from N-terminal to C-terminal)
Res       -4      -3      -2      -1       0       1       2       3       4
ALA   1.4665  1.4432  2.3976  1.3501  2.0019  1.5829  0.1397  0.9311  0.5121
ARG   1.8741  0.8924  0.8924  1.9187  1.3386  1.7402  0.4908  0.9370  1.0263
ASN   0.8655  0.5626  0.2597  0.6492  0.9521  1.1252  1.8609  0.3029  1.3849
ASP   0.9228  0.3760  0.0342  0.7861  0.9228  0.3418  0.6836  0.3076  1.6405
CYS   0.6731  0.8077  2.2886  0.2692  0.2692  0.5385  0.1346  0.9424  0.1346
GLN   1.2467  0.8672  0.7046  1.8971  1.8429  0.9756  0.2168  0.5420  0.7588
GLU   2.2562  1.1965  0.6153  2.2220  2.5639  0.9914  0.3760  0.4444  1.5383
GLY   0.3842  0.2049  0.3586  0.2818  0.2561  0.0512  8.6578  0.5635  0.3330
HIS   0.8879  1.0654  0.6215  1.1542  0.7991  1.6869  1.2430  0.8879  0.7103
ILE   0.5482  1.8795  1.0964  0.7440  0.5482  0.4307  0.0000  2.1928  0.6657
LEU   0.7686  1.5852  2.4018  0.9127  0.8166  2.2337  0.0721  1.7293  0.4323
LYS   2.0316  1.1609  0.6530  2.4306  1.7776  1.4148  0.9432  1.1609  1.7051
MET   0.6082  1.8245  2.4326  0.6082  1.0136  1.1150  0.0000  1.4190  0.3041
PHE   0.6219  1.6065  1.7102  0.2591  0.3109  0.8810  0.1036  1.5029  0.3628
PRO   0.4444  0.1778  0.1778  0.0444  0.2222  0.0889  0.0000  0.0444  2.9775
SER   0.7355  0.5683  0.8358  0.9361  1.0029  0.8024  0.1672  0.2006  1.5378
THR   0.9375  0.4167  0.2778  0.9028  0.4167  1.0417  0.0347  0.7986  1.4584
TRP   0.9221  1.8442  1.3173  0.3952  0.3952  0.3952  0.0000  1.0538  0.0000
TYR   0.6273  1.4827  0.7413  0.4562  0.5132  1.4256  0.0570  1.9389  0.5132
VAL   0.5656  1.0718  0.7443  0.6252  0.5954  0.5061  0.0298  2.1733  0.7443

FIGURE S1 (continued) Rank: 21
Clustering conditions: L=9, Dth=30, culled PDB

Cluster ranking                   21
The number of assigned segments   295
Normalized frequency              0.0038
RMS deviation (Å)                 0.92
Kullback-Leibler entropy (bit)    0.58
The number of hydrogen bonds      5.4

The centroid of the cluster
Pos.     Phi     Psi   Omega
  -4    65.2    40.2   179.3
  -3    64.1    42.5   179.9
  -2    64.6    39.9   179.5
  -1    68.5    29.6   179.6
   0    92.2     2.4   179.1
   1    78.6    21.3   178.9
   2    94.5   142.2   177.6
   3    91.2   113.7   180.0
   4   112.0   139.9   177.8

The Position Specific Scoring Matrix (columns run from N-terminal to C-terminal)
Res       -4      -3      -2      -1       0       1       2       3       4
ALA   1.4603  2.6131  1.4218  1.6908  1.5371  0.1153  1.1528  0.6148  0.6533
ARG   0.8103  0.7366  1.9152  1.5469  1.9889  0.1473  0.8103  1.0313  0.3683
ASN   0.3572  0.1429  0.5001  0.7859  1.2860  2.0004  0.4287  1.2145  0.2858
ASP   0.3950  0.0564  0.6771  0.9592  0.3385  0.6206  0.2821  1.1849  0.4514
CYS   1.1112  2.8892  0.4445  0.0000  0.8890  0.0000  0.8890  0.0000  1.7779
GLN   0.8948  0.5369  1.7896  2.4160  1.2527  0.3579  0.5369  0.8948  0.3579
GLU   1.2980  0.6772  2.0316  3.1603  1.0158  0.4515  0.5079  1.9752  0.1129
GLY   0.1269  0.2537  0.0846  0.1691  0.0846  8.9224  0.4651  0.2960  0.3806
HIS   1.0260  0.4397  1.1726  0.8794  1.6123  1.1726  0.5863  1.0260  0.4397
ILE   1.7453  1.1635  0.9696  0.3878  0.1939  0.0000  2.0685  0.7757  2.0685
LEU   1.7050  2.5773  0.8327  0.7930  1.8239  0.0000  1.7446  0.6741  1.2688
LYS   1.1379  0.6588  2.5153  1.8566  1.4972  0.8384  0.8384  2.0362  0.2396
MET   1.8406  1.8406  0.8367  0.3347  1.0040  0.0000  0.8367  0.3347  1.3386
PHE   1.3689  1.9677  0.2567  0.4278  0.6844  0.0856  1.5400  0.2567  1.2833
PRO   0.2201  0.1467  0.0734  0.2935  0.0000  0.0000  0.3668  2.9346  1.6140
SER   0.6623  0.9382  0.9382  1.0486  0.8279  0.2208  0.2760  0.9382  0.7727
THR   0.4013  0.2866  1.2038  0.2866  1.2038  0.0000  0.6879  1.4331  0.7452
TRP   1.5222  1.3048  0.4349  0.2175  1.0873  0.0000  0.6524  0.0000  0.8698
TYR   1.7887  0.4707  0.3766  0.4707  1.2238  0.0941  2.3535  0.3766  1.5063
VAL   1.0812  0.5406  0.6881  0.5406  0.6389  0.0000  2.2608  0.6881  3.6861

FIGURE S1 (continued) Rank: 23
Clustering conditions: L=9, Dth=30, culled PDB

Cluster ranking                   23
The number of assigned segments   288
Normalized frequency              0.0038
RMS deviation (Å)                 0.99
Kullback-Leibler entropy (bit)    0.56
The number of hydrogen bonds      5.2

The centroid of the cluster
Pos.     Phi     Psi   Omega
  -4    64.3    41.6   179.9
  -3    64.2    40.2   179.5
  -2    68.0    30.0   179.2
  -1    93.0     2.3   179.0
   0    78.0    21.4   178.9
   1    98.5   145.8   177.2
   2    92.4   134.1   179.8
   3   108.6   139.1   177.3
   4   109.8   136.3   179.0

The Position Specific Scoring Matrix (columns run from N-terminal to C-terminal)
Res       -4      -3      -2      -1       0       1       2       3       4
ALA   2.5585  1.4170  2.0862  1.4958  0.1968  0.9840  0.6298  0.7872  0.6692
ARG   0.4527  2.4899  1.3581  2.1881  0.4527  1.1318  1.3581  0.3018  1.0563
ASN   0.5123  0.8782  0.8782  1.4636  2.3418  0.4391  0.9513  0.1464  0.9513
ASP   0.0000  0.5779  1.0981  0.2312  0.5779  0.3468  0.9247  0.1734  0.5201
CYS   2.5041  0.4553  0.2276  0.6829  0.0000  0.6829  0.0000  1.8212  1.3659
GLN   0.4583  1.5581  2.3830  1.1915  0.3666  0.5499  1.0999  0.3666  0.2750
GLU   0.5781  2.0232  2.7169  0.9827  0.2312  0.6359  1.7342  0.0578  0.9249
GLY   0.3465  0.1299  0.2166  0.0433  8.7494  0.8663  0.5198  0.2166  0.4331
HIS   0.6005  1.2011  1.0509  1.6515  1.2011  0.9008  1.0509  0.9008  0.9008
ILE   0.7945  0.7283  0.5297  0.1986  0.0000  2.1850  0.7283  1.9202  2.4499
LEU   2.3962  0.8529  0.6904  1.8683  0.0406  1.4621  0.5280  1.3809  0.8123
LYS   0.6748  2.1471  1.5950  1.4723  0.6748  0.9815  1.9630  0.3067  0.9815
MET   2.5710  1.0284  0.3428  1.0284  0.0000  1.1998  0.3428  1.5426  1.0284
PHE   1.6650  0.2629  0.3505  0.9640  0.1753  1.4898  0.5258  1.6650  0.9640
PRO   0.2254  0.0751  0.0751  0.0000  0.0000  0.0751  3.3816  1.6532  1.2775
SER   1.1872  0.9045  1.1872  0.9610  0.1131  0.2827  0.5653  0.4523  0.6784
THR   0.4110  0.9982  0.2936  1.0569  0.0000  0.5872  1.3505  0.8220  1.7615
TRP   1.3365  0.8910  0.4455  0.4455  0.0000  0.4455  0.0000  1.5592  0.8910
TYR   0.6750  0.3857  0.4821  1.6393  0.0964  1.9286  0.3857  1.5429  0.7714
VAL   0.6041  0.7048  0.4531  0.4027  0.0000  2.1647  0.9062  3.6246  1.6613

FIGURE S1 (continued) Rank: 79
Clustering conditions: L=9, Dth=30, culled PDB

Cluster ranking                   79
The number of assigned segments   122
Normalized frequency              0.0016
RMS deviation (Å)                 0.89
Kullback-Leibler entropy (bit)    0.69
The number of hydrogen bonds      5.2

The centroid of the cluster
Pos.     Phi     Psi   Omega
  -4   114.9   137.2   178.0
  -3   110.5   137.8   178.1
  -2   124.5   143.6   177.9
  -1   113.2   131.3   178.7
   0   131.7   124.0   179.8
   1    55.3    40.4   178.0
   2    79.3     2.9   179.9
   3   109.4   138.3   178.8
   4    89.2   130.1   179.8

The Position Specific Scoring Matrix (columns run from N-terminal to C-terminal)
Res       -4      -3      -2      -1       0       1       2       3       4
ALA   1.0221  0.8363  0.4646  0.9292  0.4646  0.0929  0.2788  0.4646  0.6504
ARG   1.0687  1.4249  1.0687  0.7125  0.8906  0.8906  0.1781  2.3155  0.5343
ASN   0.6910  0.5183  0.1728  1.0365  0.8638  6.0464  0.3455  0.6910  1.0365
ASP   0.6822  0.1364  0.5457  0.5457  0.4093  3.8201  0.8186  0.0000  0.6822
CYS   1.0748  2.6870  0.0000  1.6122  1.0748  0.0000  0.0000  0.5374  1.0748
GLN   1.0818  1.5146  0.4327  0.6491  0.2164  1.5146  0.2164  0.6491  0.4327
GLU   0.6823  0.4094  0.4094  1.3646  1.3646  0.5458  0.0000  2.1834  1.2281
GLY   0.8180  0.3067  0.7157  0.3067  0.0000  2.0450 10.2249  0.3067  0.1022
HIS   1.4177  1.4177  0.7088  1.4177  2.4809  1.7721  0.3544  1.4177  0.3544
ILE   1.4067  1.8757  2.0320  2.1883  1.7194  0.0000  0.0000  1.0941  2.9698
LEU   1.0546  0.9588  0.6711  0.8629  0.8629  0.1918  0.0000  0.3835  1.4381
LYS   0.4344  0.5793  0.8689  0.4344  1.8826  0.4344  0.1448  4.0548  1.3033
MET   1.2138  1.2138  0.4046  0.8092  0.0000  0.0000  0.4046  1.2138  0.4046
PHE   1.2412  1.0344  3.1031  0.8275  0.8275  0.4137  0.0000  0.2069  0.8275
PRO   1.0644  1.0644  0.5322  0.3548  0.5322  0.0000  0.0000  0.0000  1.5966
SER   1.2011  0.5338  0.4004  1.3345  1.2011  1.0676  0.1335  0.2669  0.4004
THR   1.1089  1.5247  0.2772  1.6633  0.1386  0.0000  0.2772  1.3861  2.0791
TRP   1.0517  1.5775  2.1033  1.5775  1.0517  0.0000  0.0000  0.0000  1.5775
TYR   0.9105  1.3658  1.8211  1.1382  2.9593  0.4553  0.2276  1.3658  0.2276
VAL   1.3072  1.7826  3.5652  1.3072  2.2580  0.0000  0.2377  1.4261  0.8319

FIGURE S2 Flow chart of the single-pass clustering method used in the present study.

START
1. Initializing: All segments are unassigned to any cluster. A unique SID is given to each segment. N <= 0
2. Does an unassigned segment exist?
   No:  go to step 7.
   Yes: An unassigned segment is chosen randomly.
        j <= SID of the chosen segment
        d(j) <= dihedral angle vector of the segment j
3. N > 0?
   No:  go to step 6.
   Yes: go to step 4.
4. Finding the nearest cluster: Find the cluster whose center is the closest to the segment j.
        k <= CID of the nearest cluster
        Dmin <= the dissimilarity between c(k) and d(j)
5. Dmin < Dth?
   No:  go to step 6.
   Yes: Updating cluster:
        Mk <= Mk + 1
        c(k) <= {(Mk-1) c(k) + d(j)}/Mk
        The segment j is assigned to the nearest cluster k. Return to step 2.
6. Creating new cluster:
        N <= N + 1
        The CID of the new cluster is numbered with N.
        k <= CID of the new cluster
        Mk <= 1
        c(k) <= d(j)
        The segment j is assigned to the new cluster k. Return to step 2.
7. Ranking clusters: Clusters are ranked in the decreasing order of M(CID).
EXIT

Symbols
SID     : numerical identification of each segment
CID     : numerical identification of each cluster
N       : total number of clusters
M(CID)  : number of segments in a cluster
j       : SID of the chosen segment
k       : CID of the nearest or created cluster
c(CID)  : averaged dihedral angle vector of segments in a cluster (the cluster centroid)
d(SID)  : dihedral angle vector of a segment
Dth     : threshold parameter for creating a new cluster
Dmin    : dissimilarity of a segment from the nearest cluster
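The flow chart can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation. The dissimilarity D is defined in Methods and is not reproduced in this section, so the `dissimilarity` function here (root-mean-square of wrapped angular differences, in degrees) is an assumption, as are all function and variable names. The centroid update follows the running-mean formula of the flow chart literally.

```python
import math
import random


def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def dissimilarity(c, d):
    """ASSUMPTION: RMS of wrapped per-angle differences (degrees).
    The paper's own definition of D is given in Methods."""
    return math.sqrt(sum(angle_diff(x, y) ** 2 for x, y in zip(c, d)) / len(c))


def single_pass_clustering(segments, d_th, seed=0):
    """segments: list of dihedral angle vectors d(j), indexed by SID."""
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)            # visiting a shuffled list = random choice of unassigned segments
    centroids = []                # c(k)
    sizes = []                    # M_k
    assignment = {}               # SID -> CID (0-based here)
    for j in order:
        d_j = segments[j]
        # Find the nearest existing cluster (skipped automatically when N == 0).
        k, d_min = None, float("inf")
        for cid, c in enumerate(centroids):
            dist = dissimilarity(c, d_j)
            if dist < d_min:
                k, d_min = cid, dist
        if k is not None and d_min < d_th:
            # Updating cluster: c(k) <= {(Mk-1) c(k) + d(j)} / Mk
            sizes[k] += 1
            m = sizes[k]
            centroids[k] = [((m - 1) * c + x) / m for c, x in zip(centroids[k], d_j)]
        else:
            # Creating new cluster with the segment as its centroid.
            centroids.append(list(d_j))
            sizes.append(1)
            k = len(centroids) - 1
        assignment[j] = k
    # Ranking clusters in decreasing order of size.
    ranking = sorted(range(len(centroids)), key=lambda cid: -sizes[cid])
    return centroids, sizes, assignment, ranking
```

Because each segment is compared only with the centroids that exist when it is visited, the result depends on the sampling order, which is the dependence examined in Figures S3 and S4.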

FIGURE S3 Reproducibility of results in the single-pass clustering method. a, Ten distribution curves which were independently obtained when the order of sampling was changed randomly. Clustering conditions: L=9, Dth=30°, Culled PDB. b, Comparison of the single-pass clustering method with an iterative clustering method. The iterative calculation based on the k-means algorithm was carried out 100 times using the result of the single-pass clustering method as an initial condition. Clustering conditions: L=9, Dth=30°, Culled PDB.
[Plot axes, panels a and b: fcls versus r]
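The iterative comparison in panel b can be illustrated with a generic k-means refinement started from existing centroids. This is a sketch under stated assumptions, not the authors' code: the dissimilarity is again taken as the RMS of wrapped angular differences, centroids are recomputed as plain arithmetic means of the member vectors, and the name `kmeans_refine` and its parameters are hypothetical.

```python
import math


def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def dissimilarity(c, d):
    """ASSUMPTION: RMS of wrapped per-angle differences (degrees)."""
    return math.sqrt(sum(angle_diff(x, y) ** 2 for x, y in zip(c, d)) / len(c))


def kmeans_refine(segments, centroids, n_iter=100):
    """Run up to n_iter k-means iterations starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each segment goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dissimilarity(centroids[k], s))
            for s in segments
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [s for s, lab in zip(segments, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)   # keep an empty cluster's centroid unchanged
        if new_centroids == centroids:      # converged before n_iter
            break
        centroids = new_centroids
    return centroids, labels
```

Unlike the single-pass method, every iteration here revisits all segments, so the cost per iteration grows with the number of segments times the number of clusters.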

FIGURE S4 Sensitivity of the parameters to the order of sampling. Each histogram shows the distribution of one parameter over 1000 clustering runs that were performed independently with a randomly changed order of sampling. Clustering conditions: L=9, Dth=30°, Culled PDB. Refer to Methods for the meaning of symbols.
[Histogram panels: fcls(r=1), N, Ncls, a, b, β, Nest, log(Nest)/L, Sest/L]

FIGURE S5 Suitability of objective functions to minimize in fitting calculations. a, Best-fitted curves obtained when the same model was fitted to the same data using different objective functions. Refer to Supporting Descriptions in Supplemental Materials for the equations Eq. S1, S2, S3, and S4. Clustering conditions: L=9, Dth=30°, Culled PDB. b, Kolmogorov-Smirnov (KS) parameters for evaluating the goodness of fit in order to choose an appropriate objective function. KS parameters were determined from two cumulative distribution functions, computed respectively from the fitted curve and from the empirical distribution of the clusters containing at least five segments. Clustering conditions: Dth=30°.
[Plot axes, panel a: fcls or fest versus r. Panel b: KS (0.000 to 0.180) versus L (7, 8, 9, 11, 13, 15, 17, 19, 21, 31) for PDB Select and Culled PDB; series Eq. S1, Eq. S2, Eq. S3, Eq. S4]
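A KS parameter of this kind can be computed as the largest absolute gap between two cumulative distribution functions. The sketch below assumes that both distributions are given per cluster rank and are normalized over the retained clusters only; the exact normalization used by the authors is not stated in this section, and the function name `ks_parameter` and its arguments are hypothetical.

```python
from itertools import accumulate


def ks_parameter(counts, fitted, min_count=5):
    """Maximum absolute difference between the empirical and fitted CDFs.

    counts    : number of segments in each cluster, ordered by rank
    fitted    : fitted frequency for the same ranks
    min_count : clusters with fewer segments are excluded (five in the caption)

    ASSUMPTION: both CDFs are cumulative sums over rank, each normalized
    to 1 over the clusters that pass the min_count filter.
    """
    if len(counts) != len(fitted):
        raise ValueError("counts and fitted must have the same length")
    pairs = [(c, f) for c, f in zip(counts, fitted) if c >= min_count]
    if not pairs:
        raise ValueError("no cluster passes the min_count filter")
    emp = [c for c, _ in pairs]
    fit = [f for _, f in pairs]
    total_emp, total_fit = sum(emp), sum(fit)
    cdf_emp = [x / total_emp for x in accumulate(emp)]
    cdf_fit = [x / total_fit for x in accumulate(fit)]
    return max(abs(a - b) for a, b in zip(cdf_emp, cdf_fit))
```

A value of 0 means the two cumulative curves coincide at every retained rank; smaller values in panel b therefore indicate a better-suited objective function.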

FIGURE S6 Histograms showing the structural dissimilarity, D, of 9-residue segments. The segments were generated from the Culled PDB and classified into classes according to the value of D against the α helix (red), β strand (green), or β-hairpin (blue). The class interval is 5°.
[Plot axes: number of segments versus structural dissimilarity D (°)]
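Binning with a fixed class interval can be sketched as below. The input is assumed to be a list of D values (in degrees) already computed against one reference structure; the function name `class_histogram` and the sample values in the usage line are hypothetical.

```python
from collections import Counter


def class_histogram(d_values, interval=5.0):
    """Count values per class [lower, lower + interval), keyed by the lower edge."""
    counts = Counter(int(d // interval) for d in d_values)
    return {b * interval: counts[b] for b in sorted(counts)}


# Usage with made-up D values: two fall in [0, 5), one in [5, 10), two in [10, 15).
hist = class_histogram([0, 4.9, 5, 12, 14.9])
```

Running the function once per reference structure gives one histogram per color in the figure.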

FIGURE S7 Structural differences of 9-residue segments from the center of the cluster. The x- and y-axes indicate the differences in D and in backbone RMS deviation, respectively. Clustering conditions: Culled PDB, L=9, Dth=30°.
[Plot axes: bbRMSD (Å) versus D (°)]