Supplemental Materials for: Structural Diversity of Protein Segments Follows a Power-law Distribution


Yoshito SAWADA and Shinya HONDA*

National Institute of Advanced Industrial Science and Technology (AIST), Central 6, Tsukuba, Japan

*Correspondence: s.honda@aist.go.jp

Contents

SUPPORTING DESCRIPTIONS
Summary of Classified Structural Motifs
Methodological Advantages
Reproducibility of Results in the Single-Pass Clustering Method
Validity of Overlapping Segment Sampling
Objective Function in Fitting Calculations
Goodness of Fit Evaluated by Parametric Bootstrapping Analysis
Assessment of the Structural Dissimilarity Threshold, D_th
REFERENCE
TABLE S1 Sensitivity of the single-pass clustering method to the order of sampling
TABLE S2 Amino acid compositions of the sets of 9-residue segments with and without overlap
TABLE S3 Confidence intervals of fit coefficients and goodness-of-fit statistics determined by parametric bootstrapping analysis

FIGURE S1 Structural and sequential summary of clusters
FIGURE S2 Flow chart of the single-pass clustering method used in the present study
FIGURE S3 Reproducibility of results in the single-pass clustering method
FIGURE S4 Sensitivity of the parameters to the order of sampling
FIGURE S5 Suitability of objective functions to minimize in fitting calculations
FIGURE S6 Histograms showing the structural dissimilarity, D, of 9-residue segments
FIGURE S7 Structural differences of 9-residue segments from the center of the cluster

Summary of Classified Structural Motifs

An in-depth analysis of each cluster is beyond the scope of this paper. Here we show several clusters obtained under a typical condition (L=9, D_th=30º, Culled PDB) to illustrate that our clustering method succeeded in extracting distinct structural motifs, including known canonical ones. Structural and sequential summaries of these clusters are listed in Fig. S1 in Supplemental Materials.

Methodological Advantages

To compare protein structures, algorithms based on the Cartesian coordinates of Cα atoms are the most commonly used (1-6). While these algorithms are effective in comparing the topology of whole proteins, they are slow because they require coordinate transformations to obtain the best superimposition between two proteins. Hence, to classify an enormous number of short segments efficiently, we developed a new algorithm that defines the structural dissimilarity based on dihedral angles (φ, ψ, and ω), which contain enough information to reproduce the backbone structure of proteins. In contrast to the method of de Brevern et al. (7), our definition contains all three dihedral angles (see Methods). Although one may think that including the ω angle in the analysis causes confusion, we confirmed that local structures containing a cis peptide bond were completely discriminated from the clusters of segments having an all-trans configuration (data not shown). Thus, this algorithm does not need preprocessing to eliminate structures containing cis peptide bonds.

The single-pass clustering method (8) used in the present study does not require the total number of clusters to be specified before clustering (Fig. S2 in Supplemental Materials). Time-consuming iterative calculations are also unnecessary in the single-pass method. Therefore, the method is applicable to problems in which the number of clusters is unknown, and it can process large-scale calculations faster than other non-hierarchical clustering methods such as k-means and the self-organizing map (SOM). The single-pass method also has an intrinsic advantage in classifying samples that have a highly unbalanced distribution (in fact, there is more than a thousand-fold difference in frequency in Fig. 1a), whereas k-means and SOM are better suited to the strict analysis of uniformly distributed samples.

Reproducibility of Results in the Single-Pass Clustering Method

A disadvantage of the single-pass method is that clustering results might be altered by the order of sampling. Before proceeding with in-depth analyses, we therefore checked the reproducibility of the calculations. Fig. S3a in Supplemental Materials shows ten distribution curves of 9-residue segments that were obtained independently when the order of sampling was changed randomly. Their almost identical shapes indicate that the structural distribution of protein segments is so robust that the results of the single-pass clustering method are not influenced significantly by the order of sampling.
We also evaluated the statistical deviations of the various parameters introduced in the present study by analyzing 100 (or 1000) independent sets of results for both 9-residue and 21-residue segments, obtained by changing the order of sampling randomly (Fig. S4 and Table S1 in Supplemental Materials). The resultant standard deviations of all parameters are very small. This implies that the conclusion concerning the structural diversity of protein segments presented in this study is not affected by the order of sampling. For instance, the standard deviations in log(N_est)/L and S_est/L are less than 1% of the averaged parameters. Consequently, if one draws the standard deviations in Fig. 3 and Fig. 4, the error bars will be smaller than the circles of the data points.

Furthermore, we checked the adequacy of the single-pass method by comparing it with an iterative clustering method. As shown in Fig. S3b in Supplemental Materials, the difference between the result of the single-pass clustering and the result of an iterative calculation (100 times) based on the k-means algorithm, using the former result as the initial condition, was not significant. Thus, we considered that an iterative and time-consuming calculation is not necessary for our purpose.

Validity of Overlapping Segment Sampling

To address the concern that analyses using a set of overlapping segments might introduce serious biases into the statistical results, we checked the amino acid compositions of the sets of segments with and without overlap. Table S2 in Supplemental Materials shows the compositions calculated from the nine sets of 9-residue segments, where each set has no overlap. Compared with the composition of the set of overlapping segments, i.e. the set of all 9-residue segments, there is no significant deviation among the compositions. This indicates the validity of the statistical analyses using a set of overlapping segments performed in the present study.

Objective Function in Fitting Calculations

Before fitting the modified Mandelbrot formula (Eq. 2) to the empirical distributions, we tried several equations as the objective function to minimize, because the ordinary choice, i.e. the sum of squared errors, seems inappropriate for obtaining a good fit in a double-logarithmic plot. In fact, large upward shifts were found in low-ranked clusters when Eq. s1 was used (Fig. S5a in Supplemental Materials). In the case of Eq. s2, high-ranked clusters were underestimated. Among the equations tested, we found that Eqs. s3 and s4 were usable.
Then, we further examined the goodness of fit using the Kolmogorov-Smirnov (KS) parameter. Judged by the KS values, the fitting calculation using Eq. s4 performs well in many cases (Fig. S5b in Supplemental Materials). Accordingly, we chose Eq. s4 as the objective function to minimize and used it in further analyses. The equation can be interpreted on the assumption that the expected error in f(r) is proportional to the square root of f(r).

\[ g_1(a, b, \beta) = \frac{1}{N_{cls}} \sum_{r=2}^{N_{cls}} \left[ f_{cls}(r) - a(r+b)^{-\beta} \right]^2 \tag{s1} \]

(s2) [equation not recoverable from the extracted text]

\[ g_3(a, b, \beta) = \frac{1}{N_{cls}} \sum_{r=2}^{N_{cls}} \left[ \log f_{cls}(r) - \log\left\{ a(r+b)^{-\beta} \right\} \right]^2 \tag{s3} \]

\[ g_4(a, b, \beta) = \frac{1}{N_{cls}} \sum_{r=2}^{N_{cls}} \frac{\left[ f_{cls}(r) - a(r+b)^{-\beta} \right]^2}{a(r+b)^{-\beta}} \tag{s4} \]

Goodness of Fit Evaluated by Parametric Bootstrapping Analysis

To test the goodness of fit and to provide confidence intervals for the fit coefficients obtained by fitting the modified Mandelbrot formula (Eq. 2) to the empirical distributions, a parametric bootstrapping analysis was performed. Assuming that the expected error ε is proportional to the square root of f(r), as described above,

\[ \varepsilon = \frac{f_{cls}(r) - a(r+b)^{-\beta}}{\sqrt{a(r+b)^{-\beta}}} \]

the model to be evaluated is designated as

\[ f_{cls}(r > 1) = a(r+b)^{-\beta} + \sqrt{a(r+b)^{-\beta}}\,\varepsilon \tag{s5} \]

Here, the error is assumed to follow a normal distribution, ε ~ N(0, σ_ε²). The variance σ_ε² was determined by a maximum likelihood method using the residuals of f_cls(r) from the best-fitted curve. We generated sets of artificial data that follow Eq. s5 by setting the fit coefficients to those obtained from the original data. By repeating the same fitting calculations against the artificial data, we obtained bootstrap fit coefficients (a*, b*, and β*) as well as two kinds of goodness-of-fit statistics (χ²* and G²*):

\[ \chi^2 = \sum \frac{(X - E)^2}{E} = M \sum_{r} \frac{\left[ f_{cls}(r) - a(r+b)^{-\beta} \right]^2}{a(r+b)^{-\beta}} \tag{s6} \]

\[ G^2 = 2 \sum X \ln\frac{X}{E} = 2M \sum_{r} f_{cls}(r) \ln\frac{f_{cls}(r)}{a(r+b)^{-\beta}} \tag{s7} \]

Confidence intervals of the bootstrap parameters were determined using a percentile method. The results are summarized in Table S3 in Supplemental Materials. In many cases, the fit coefficients and the goodness-of-fit statistics calculated from the original data lie within the 99% confidence intervals of the bootstrap parameters. This is particularly evident for relatively long segments compared with relatively short ones. This tendency is consistent with the results in Fig. 2, where the model appears to fit the empirical data of relatively long segments well. Through the analysis, the null hypothesis that the empirical data can be represented by the modified Mandelbrot formula is not rejected, especially for relatively long segments.

Assessment of the Structural Dissimilarity Threshold, D_th

In the single-pass clustering method used in the present study, the distribution of the local structures depends in principle on one parameter, the structural dissimilarity threshold D_th, whose value was assigned arbitrarily before clustering (see Methods and Fig. S2 in Supplemental Materials). The effect of the value of D_th on the shape of the distribution functions has already been described in Results. Here, we show some characteristics of the structural dissimilarity D to help readers grasp the technical meaning of this threshold. Fig. S6 in Supplemental Materials shows histograms of 9-residue segments, in which the x-axis indicates the value of D of these segments against certain typical secondary structures. Since the unit of D corresponds to an angle, the minimum and maximum values of D are in principle 0 and 180º, respectively. However, the actual data show that most segments are distributed within the range from 0 to 110º. In these three histograms, the shapes in the right half, corresponding to the distributions of relatively dissimilar segments, are similar to each other.
In contrast, the distributions in the left half differ in shape, which may indicate the individuality of the clusters composed of relatively similar segments. Our condition for D_th is 20, 30, or 40º, which seems reasonable for discriminating relatively similar segments from relatively dissimilar segments. Fig. S7 in Supplemental Materials illustrates the structural difference of 9-residue segments from the center of the cluster to which the segments were assigned. The x- and y-axes indicate the differences in D and in backbone RMS deviation (bbrmsd), respectively. Although the data are widely dispersed, a rough correlation between D and bbrmsd can be identified. For segments having similar structures, a smaller distance in angles indicates a smaller distance in coordinates. From this correlation, we can say that 10, 20, and 30º in D_th correspond to approximately 0.5, 1.0, and 2.0 Å in bbrmsd.

REFERENCE

1. Holm, L. and C. Sander. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233:
2. Madej, T., J. F. Gibrat, and S. H. Bryant. Threading a database of protein cores. Proteins 23:
3. Shindyalov, I. N. and P. E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11:
4. Rooman, M. J., J. Rodriguez, and S. J. Wodak. Automatic definition of recurrent local structure motifs in proteins. J. Mol. Biol. 213:
5. Fetrow, J. S., M. J. Palumbo, and G. Berg. Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme. Proteins 27:
6. Hunter, C. G. and S. Subramaniam. Protein fragment clustering and canonical local shapes. Proteins 50:
7. de Brevern, A. G., C. Etchebest, and S. Hazout. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 41:
8. Richards, J. A. and X. Jia. Remote sensing digital image analysis. New York: Springer-Verlag.

TABLE S1 Sensitivity of the single-pass clustering method to the order of sampling. Averaged values and standard deviations of various parameters obtained from 100 or 1000 clustering calculations with different orders of sampling are listed.

Columns: L, f_cls(r=1), N, N_cls, a, b, β, N_est, log(N_est)/L, S_est/L, remark

PDB Select
±0.002 ±49 ±27 ±0.033 ±2.6 ±0.02 ± ±0.004 ±0.004 A
± ±70 ±32 ±0.002 ±1.3 ±0.01 ± ±0.003 ±0.004 A

Culled PDB
±0.007 ±29 ±22 ±0.022 ±1.4 ±0.01 ± ±0.001 ±0.004 A
±0.006 ±29 ±22 ±0.026 ±1.6 ±0.01 ± ±0.002 ±0.004 B
± ±32 ±25 ±0.002 ±1.0 ±0.01 ± ±0.002 ±0.002 A

A: Clustering conditions: D_th=30º, number of runs=100.
B: Clustering conditions: D_th=30º, number of runs=1000.
Refer to Methods for the meaning of symbols.

TABLE S2 Amino acid compositions of the sets of 9-residue segments with and without overlap.

Columns: set of segments (without overlap; with overlap), no. of segments
Rows: ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR, VAL

An underlined value corresponds to either the minimum or the maximum among the sets of segments having no overlap. These data were calculated from the Culled PDB.
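The comparison behind Table S2 can be sketched in a few lines. This is only an illustration: the way the nine non-overlapping sets were built is not spelled out in this supplement, so the construction below (set k holds the windows starting at k, k+L, k+2L, ...) is an assumption, as are the function names and the use of one-letter residue codes.

```python
from collections import Counter


def overlapping_segments(seq, length=9):
    """All windows of the given length, shifted by one residue at a time."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def non_overlapping_sets(seq, length=9):
    """ASSUMPTION: set k holds the windows starting at k, k+L, k+2L, ...

    Windows inside one set are L residues apart, so they never overlap;
    together the L sets contain every overlapping window exactly once.
    """
    last_start = len(seq) - length
    return [
        [seq[i:i + length] for i in range(k, last_start + 1, length)]
        for k in range(length)
    ]


def composition(segments):
    """Fraction of each residue type over all positions of all segments."""
    counts = Counter("".join(segments))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}
```

Calling `composition` on each of the nine sets and on the full overlapping set gives the kind of side-by-side fractions that the table reports.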

TABLE S3 Confidence intervals of fit coefficients and goodness-of-fit statistics determined by parametric bootstrapping analysis.

Columns: L, a, a*, b, b*
[ , ] [15.41, ]
[ , ] [9.76, ]
[ , ] 6.53 [5.90, 6.55]
[ , ] 6.58 [6.16, 6.66]
[ , ] 4.79 [4.51, 4.88]
[ , ] 7.82 [7.47, 7.96]
[ , ] [11.80, ]
[ , ] [12.90, ]
[ , ] [10.97, ]

Columns: L, β, β*, N_est, N_est*
[1.099, ] 5683 [4100, 4783]
[1.029, ] [9764, ]
[0.948, ] [21214, ]
[0.905, ] [41528, ]
[0.861, ] [69715, ]
[0.871, ] [ , ]
[0.886, ] [ , ]
[0.879, ] [ , ]
[0.854, ] [ , ]

Columns: L, S_est, S_est*, Z/α, (Z/α)*
[9.17, 9.28] [3.282, ]
[10.14, ] [2.775, ]
[11.29, ] [2.474, ]
[12.42, ] [2.266, ]
[13.37, ] [2.103, ]
[14.39, ] [2.016, ]
[15.40, ] [1.958, ]
[16.26, ] [1.892, ]
[20.49, ] [1.708, ]

Columns: L, χ², χ²*, G², G²*
[7782, 8850] -154 [718, 1279]
[6379, 7138] 32 [476, 977]
[3200, 3555] 148 [21, 382]
[1524, 1683] 116 [-96, 191]
[961, 1061] -49 [-95, 151]
[636, 704] -92 [-90, 112]
[598, 666] -42 [-75, 102]
[331, 370] -62 [-61, 71]
[89, 105] 1 [-25, 28]

Values in brackets indicate the 99% confidence intervals of bootstrap parameters obtained from the parametric bootstrapping analysis (10000 times) with a percentile method. Clustering conditions: D_th=30º, PDB Select. Refer to Methods for the meaning of symbols.
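The percentile intervals in Table S3 come from a parametric bootstrap. A minimal sketch of that idea follows, under stated assumptions: artificial data are drawn from the model of Eq. s5 with normally distributed ε, χ² is computed as in Eq. s6, and the interval is read off the sorted bootstrap values. The refitting of a, b, and β to each artificial data set, which the analysis described in this supplement does perform, is deliberately omitted here, so the statistic is evaluated against the generating curve. All function names and parameter values are hypothetical.

```python
import math
import random


def model(r, a, b, beta):
    """Modified Mandelbrot form a * (r + b) ** (-beta)."""
    return a * (r + b) ** (-beta)


def artificial_data(a, b, beta, sigma, n_cls, rng):
    """One artificial data set following Eq. s5 for ranks r = 2..n_cls."""
    data = []
    for r in range(2, n_cls + 1):
        e = model(r, a, b, beta)
        # Error proportional to sqrt(f(r)), with epsilon ~ N(0, sigma^2).
        data.append(e + math.sqrt(e) * rng.gauss(0.0, sigma))
    return data


def chi_square(data, a, b, beta, m):
    """Eq. s6, with expected values taken from the given coefficients."""
    total = 0.0
    for r, f in enumerate(data, start=2):
        e = model(r, a, b, beta)
        total += (f - e) ** 2 / e
    return m * total


def percentile_interval(values, level=0.99):
    """Percentile method: cut (1 - level) / 2 from each tail."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lo = ordered[int(math.floor(tail * (len(ordered) - 1)))]
    hi = ordered[int(math.ceil((1.0 - tail) * (len(ordered) - 1)))]
    return lo, hi


def bootstrap_chi_square(a, b, beta, sigma, n_cls, m, n_boot=1000, seed=0):
    """SIMPLIFICATION: no refit per replicate; see the lead-in text."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        data = artificial_data(a, b, beta, sigma, n_cls, rng)
        stats.append(chi_square(data, a, b, beta, m))
    return percentile_interval(stats)
```

A full version would refit the model to each artificial data set and collect a*, b*, β*, χ²*, and G²* from the refitted curves before taking percentiles.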

FIGURE S1 Structural and sequential summary of clusters. Only representative clusters are shown. Clustering conditions: L=9, D_th=30º, Culled PDB. For each cluster the figure lists the cluster ranking, the number of assigned segments, the normalized frequency, the RMS deviation (Å), the Kullback-Leibler entropy (bit), the number of hydrogen bonds, the centroid of the cluster (Phi, Psi, and Omega at each position), and the Position Specific Scoring Matrix (N terminal to C terminal) over ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR, and VAL. The centroid angles and scoring matrices are not reproduced here.

Rank 1: RMS deviation 0.36 Å; Kullback-Leibler entropy 0.09 bit; number of hydrogen bonds 7.7
Rank 2: number of assigned segments 1577; normalized frequency 0.0206; RMS deviation 0.53 Å; Kullback-Leibler entropy 0.23 bit; number of hydrogen bonds 6.6
Rank 3: number of assigned segments 1030; RMS deviation 1.48 Å; Kullback-Leibler entropy 0.10 bit; number of hydrogen bonds 6.0
Rank 4: number of assigned segments 972; normalized frequency 0.0127; RMS deviation 0.68 Å; Kullback-Leibler entropy 0.27 bit; number of hydrogen bonds 6.0
Rank 5: number of assigned segments 871; RMS deviation 0.42 Å; Kullback-Leibler entropy 0.43 bit; number of hydrogen bonds 6.8
Rank 6: number of assigned segments 813; normalized frequency 0.0106; RMS deviation 0.62 Å; Kullback-Leibler entropy 0.14 bit; number of hydrogen bonds 6.3
Rank 7: number of assigned segments 685; RMS deviation 0.58 Å; Kullback-Leibler entropy 0.47 bit; number of hydrogen bonds 6.3
Rank 9: number of assigned segments 487; normalized frequency 0.0063; RMS deviation 0.78 Å; Kullback-Leibler entropy 0.50 bit; number of hydrogen bonds 5.8
Rank 21: number of assigned segments 295; RMS deviation 0.92 Å; Kullback-Leibler entropy 0.58 bit; number of hydrogen bonds 5.4
Rank 23: number of assigned segments 288; RMS deviation 0.99 Å; Kullback-Leibler entropy 0.56 bit; number of hydrogen bonds 5.2
Rank 79: number of assigned segments 122; RMS deviation 0.89 Å; Kullback-Leibler entropy 0.69 bit; number of hydrogen bonds 5.2

FIGURE S2 Flow chart of the single-pass clustering method used in the present study.

START
1. Initializing: All segments are unassigned to any cluster. A unique SID is given to each segment. N <= 0
2. Does an unassigned segment exist? If No, go to step 8.
3. An unassigned segment is chosen randomly. j <= SID of the chosen segment; d(j) <= dihedral angle vector of the segment j
4. N > 0? If No, go to step 7.
5. Finding the nearest cluster: find the cluster whose center is the closest to the segment j. k <= CID of the nearest cluster; D_min <= the dissimilarity between c(k) and d(j)
6. D_min < D_th? If No, go to step 7. If Yes, updating cluster: M_k <= M_k + 1; c(k) <= {(M_k - 1) c(k) + d(j)}/M_k. The segment j is assigned to the nearest cluster k. Return to step 2.
7. Creating new cluster: N <= N + 1. The CID of the new cluster is numbered with N. k <= CID of the new cluster; M_k <= 1; c(k) <= d(j). The segment j is assigned to the new cluster k. Return to step 2.
8. Ranking clusters: clusters are ranked in decreasing order of M_CID.
EXIT

SID: numerical identification of each segment
CID: numerical identification of each cluster
N: total number of clusters
M_CID: number of segments in a cluster
j: SID of the chosen segment
k: CID of the nearest or created cluster
c(CID): averaged dihedral angle vector of segments in a cluster (the cluster centroid)
d(SID): dihedral angle vector of a segment
D_th: threshold parameter for creating a new cluster
D_min: dissimilarity of a segment from the nearest cluster
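The flow chart translates almost line by line into code. The sketch below is an illustration under assumptions, not the authors' implementation: the dissimilarity D is defined in Methods and is not given in this supplement, so a root-mean-square of wrapped angle differences stands in for it, and the names `wrap_deg`, `dissimilarity`, and `single_pass_clustering` are hypothetical. The centroid update is the plain running mean written in the flow chart, which does not treat the ±180º wrap-around specially.

```python
import math
import random


def wrap_deg(x):
    """Wrap an angle difference into the interval [-180, 180)."""
    return (x + 180.0) % 360.0 - 180.0


def dissimilarity(u, v):
    """ASSUMPTION: RMS of wrapped angle differences, in degrees.

    This is a stand-in for the D defined in Methods; it only shares
    the property of ranging from 0 to 180 degrees.
    """
    return math.sqrt(sum(wrap_deg(a - b) ** 2 for a, b in zip(u, v)) / len(u))


def single_pass_clustering(segments, d_th, seed=0):
    """Visit segments once in random order; join the nearest cluster
    if D_min < D_th, otherwise open a new cluster."""
    order = list(range(len(segments)))
    random.Random(seed).shuffle(order)
    centroids, sizes, labels = [], [], {}
    for j in order:
        d_j = segments[j]
        # Find the nearest existing cluster (skipped when N = 0).
        k, d_min = None, float("inf")
        for cid, c in enumerate(centroids):
            d = dissimilarity(c, d_j)
            if d < d_min:
                k, d_min = cid, d
        if k is not None and d_min < d_th:
            # c(k) <= {(M_k - 1) c(k) + d(j)} / M_k
            sizes[k] += 1
            m = sizes[k]
            centroids[k] = [((m - 1) * c + x) / m
                            for c, x in zip(centroids[k], d_j)]
        else:
            k = len(centroids)
            centroids.append(list(d_j))
            sizes.append(1)
        labels[j] = k
    # Rank clusters in decreasing order of their size M.
    ranking = sorted(range(len(centroids)), key=lambda cid: -sizes[cid])
    return centroids, sizes, labels, ranking
```

Because every segment is compared only with the current centroids and is never revisited, the cost grows with the number of segments times the number of clusters, with no iteration over the whole data set.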

FIGURE S3 Reproducibility of results in the single-pass clustering method. a, Ten distribution curves which were independently obtained when the order of sampling was changed randomly. Clustering conditions: L=9, D_th=30º, Culled PDB. b, Comparison of the single-pass clustering method with an iterative clustering method. The iterative calculation based on the k-means algorithm was carried out 100 times using the result of the single-pass clustering method as the initial condition. Clustering conditions: L=9, D_th=30º, Culled PDB. Axes in both panels: f_cls versus r.

FIGURE S4 Sensitivity of the parameters to the order of sampling. Each histogram corresponds to a parameter determined from the 1000 sets of clustering results that were independently obtained by changing the order of sampling randomly. Clustering conditions: L=9, D_th=30º, Culled PDB. Refer to Methods for the meaning of symbols. Panels: f_cls(r=1), N, N_cls, a, b, β, N_est, log(N_est)/L, S_est/L.

FIGURE S5 Suitability of objective functions to minimize in fitting calculations. a, Best-fitted curves obtained when the same model was fitted to the same data using different objective functions. Refer to Supporting Descriptions in Supplemental Materials for Eqs. s1, s2, s3, and s4. Clustering conditions: L=9, D_th=30º, Culled PDB. b, Kolmogorov-Smirnov (KS) parameters for evaluating the goodness of fit in order to choose an appropriate objective function. KS parameters were determined from two cumulative distribution functions, computed respectively from the fitted curve and from the empirical distribution of the clusters containing at least five segments. Clustering conditions: D_th=30º. Axes: a, f_cls or f_est versus r; b, KS versus L for Eqs. s1 to s4 (PDB Select and Culled PDB).
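To make the difference between the objective functions concrete, the sketch below writes three of them as plain functions of the observed frequencies and the coefficients (a, b, β). It assumes the modified Mandelbrot form a(r+b)^(-β), sums starting at r = 2, and a 1/N_cls prefactor, as read from Eqs. s1, s3, and s4; Eq. s2 is left out because its form is not legible in this copy. The function names are hypothetical, and no minimizer is included.

```python
import math


def mandelbrot(r, a, b, beta):
    """Modified Mandelbrot form a * (r + b) ** (-beta)."""
    return a * (r + b) ** (-beta)


def _pairs(f_cls, a, b, beta):
    """Yield (observed, expected) for ranks r = 2..N_cls.

    f_cls[0] holds rank r = 1, which the sums skip.
    """
    for r in range(2, len(f_cls) + 1):
        yield f_cls[r - 1], mandelbrot(r, a, b, beta)


def g1(f_cls, a, b, beta):
    """Eq. s1: mean squared error, dominated by the largest clusters."""
    return sum((f - e) ** 2 for f, e in _pairs(f_cls, a, b, beta)) / len(f_cls)


def g3(f_cls, a, b, beta):
    """Eq. s3: mean squared error of the logarithms (needs f > 0)."""
    return sum((math.log(f) - math.log(e)) ** 2
               for f, e in _pairs(f_cls, a, b, beta)) / len(f_cls)


def g4(f_cls, a, b, beta):
    """Eq. s4: squared error divided by the expected value, i.e. the
    error is taken to scale with the square root of f(r)."""
    return sum((f - e) ** 2 / e
               for f, e in _pairs(f_cls, a, b, beta)) / len(f_cls)
```

Any of these can be handed to a general-purpose minimizer over (a, b, β); the weighting in `g4` is what keeps the many small clusters from being ignored in favor of the few large ones.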

FIGURE S6 Histograms showing the structural dissimilarity, D, of 9-residue segments. The segments were generated from the Culled PDB and classified into several classes depending on the value of D against α helix (red), β strand (green), or β-hairpin (blue). The class interval is 5º. Axes: number of segments versus structural dissimilarity D (º).

FIGURE S7 Structural differences of 9-residue segments from the center of the cluster. The x- and y-axes indicate the differences in D and in backbone RMS deviation, respectively. Clustering conditions: Culled PDB, L=9, D_th=30º. Axes: bbrmsd (Å) versus D (º).
