David B. Lukatsky and Ariel Afek Department of Chemistry, Ben-Gurion University of the Negev, Beer-Sheva Israel


Sequence correlations shape protein promiscuity

David B. Lukatsky and Ariel Afek
Department of Chemistry, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel

Abstract

We predict that diagonal correlations of amino acid positions within protein sequences statistically enhance protein propensity for promiscuous binding. Diagonal correlations represent statistically significant repeats of sequence patterns where amino acids of the same type are clustered together. The predicted effect is qualitatively robust with respect to the form of the microscopic interaction potential and the average amino acid composition. We suggest experimental and bioinformatics approaches to test the predicted effect.

Recent experimental evidence that proteins within a cell maintain a high degree of non-specificity has challenged the understanding of the molecular mechanisms that provide the specificity of protein-protein binding [1, 2]. Such non-specific binding is termed protein promiscuity. Numerous organismal-scale measurements of protein-protein interactions (PPI) suggest that organismal proteomes possess a high degree of non-specific binding [3-7]. It appears that protein promiscuity is an evolutionarily selectable trait that enables proteins to evolve more efficiently [1]. The key question is what makes a protein promiscuous. Are there generic sequence signatures of promiscuity? In this report we predict one such generic signature: protein sequences with enhanced correlations between the sequence positions of amino acids of the same type are generally more promiscuous. We term such correlations diagonal. This finding suggests that the symmetry properties and the strength of sequence correlations are key factors that control the global connectivity properties of PPI networks.

We begin by introducing a statistical measure of promiscuity for a given protein, A. Such a measure can be defined as the probability distribution of the interaction energies, $P(E_A)$, of this protein with a set of target proteins, where $E_A$ is the interaction energy between protein A and a protein from the target set. We can now compare the promiscuity of two proteins, A and B, interacting with the same target set, assuming that their average interaction energies are the same, $\langle E_A \rangle = \langle E_B \rangle$. The target set is not assumed to be optimized in any way for stronger binding with either protein A or protein B. If the dispersion, $\sigma_A$, of $P(E_A)$ is greater than the dispersion, $\sigma_B$, of $P(E_B)$, then protein A is statistically more promiscuous than protein B. This is because if $\sigma_A > \sigma_B$, the distribution of the minimal interaction energies (the extreme value distribution), $P_{\min}(E_A)$, will always be shifted towards lower energies compared with $P_{\min}(E_B)$ [8]. The assumption that $\langle E_A \rangle = \langle E_B \rangle$ corresponds to the constraint that the sequences of the two proteins A and B have the same average amino acid composition (see below). This constraint is necessary for a fair comparison of promiscuities, since differences in the average amino acid composition would produce a trivial shift of the average interaction energies. The predicted effect is induced exclusively by sequence correlations and goes beyond the mean-field description.

We now introduce a simple model for random and designed (or "correlated") protein-like, linear sequences. Despite its one-dimensional origin, the model is not exactly solvable because of the generally long-range nature of the potentials that we use below. For simplicity we use a minimalistic sequence alphabet with only two types of residues. A random sequence is obtained by distributing $N_p$ polar and $N_h$ hydrophobic residues at random within a linear sequence of total length $L = N_p + N_h$. Our simplistic approach therefore does not take into account the folding of the sequence. The average linear fractions of polar and hydrophobic residues are thus fixed and given by $\phi_{p,0} = N_p/L$ and $\phi_{h,0} = N_h/L$, respectively. After each random sequence is generated, the residues are fixed and not allowed to change their positions. A correlated sequence is obtained using the following stochastic Monte Carlo (MC) procedure. First, we generate a random sequence as described above. Second, we allow the residues to anneal at a given design temperature, $T_d$.
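The random-sequence construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `random_sequence` and the one-letter labels `"P"` (polar) and `"H"` (hydrophobic) are assumptions introduced here.

```python
import random


def random_sequence(n_polar, n_hydrophobic, rng):
    """Return a random two-letter sequence with a fixed composition.

    The composition is fixed exactly (N_p polar, N_h hydrophobic residues);
    only the positions are randomized, as in the model described above.
    """
    residues = ["P"] * n_polar + ["H"] * n_hydrophobic
    rng.shuffle(residues)  # uniform random permutation of the residues
    return "".join(residues)


# Example: L = 200 with 50% polar and 50% hydrophobic residues.
rng = random.Random(0)  # fixed seed only to make the example reproducible
sequence = random_sequence(100, 100, rng)
phi_p0 = sequence.count("P") / len(sequence)  # average polar fraction
phi_h0 = sequence.count("H") / len(sequence)  # average hydrophobic fraction
```

Shuffling a list of fixed composition, rather than drawing each residue independently, keeps $\phi_{p,0}$ and $\phi_{h,0}$ exactly fixed for every sequence.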
We note that our notion of designed sequences describes the existence of positional correlations of amino acids within the linear sequences, and not folding. We thus impose that, during the design procedure, the residues within the sequence interact through the pairwise additive design potential, $U_{\alpha\beta}(x)$. The intra-sequence interaction energy for any given residue distribution is thus:

\[
E_{\mathrm{intra}} = \frac{1}{2}\iint \phi_p(x)\,U_{pp}(x-x')\,\phi_p(x')\,dx\,dx' + \frac{1}{2}\iint \phi_h(x)\,U_{hh}(x-x')\,\phi_h(x')\,dx\,dx' + \iint \phi_h(x)\,U_{hp}(x-x')\,\phi_p(x')\,dx\,dx' \tag{1}
\]

where $\phi_p(x)$ and $\phi_h(x)$ are the local, linear fraction densities of polar and hydrophobic residues, respectively. The average composition of polar and hydrophobic residues is fixed by the values $\phi_{p,0}$ and $\phi_{h,0}$, respectively, and we impose that the total fraction of residues at each sequence position, $x$, is unity, $\phi_p(x) + \phi_h(x) = 1$. Here $U_{pp}(x)$, $U_{hh}(x)$, and $U_{hp}(x)$ are the interaction potentials between polar-polar, hydrophobic-hydrophobic, and hydrophobic-polar residues, respectively. We also note that $\phi_p(x)$ can be represented in the form $\phi_p(x) = \phi_{p,0} + \delta\phi_p(x)$, where $\delta\phi_p(x)$ is the deviation of the local density of polar residues from its average value, and analogously, $\phi_h(x) = \phi_{h,0} + \delta\phi_h(x)$. The only two assumptions about the interaction potentials, $U_{\alpha\beta}(x)$, used in the sequence design procedure are that they are pairwise additive and have a finite range of action.

Our next step is to analyze the probability distribution, $P(E)$, of the interaction energy, $E$, between the random and correlated sequences. Every pair of interacting sequences consists of one random and one correlated sequence superimposed in a parallel configuration, so the problem is quasi-one-dimensional. We show below that enhanced correlations between amino acids of the same type lead to the broadening of the distribution $P(E)$. Such broadening implies that the corresponding extreme value distribution (EVD) will always be shifted to lower energies for more strongly correlated sequences [8]. This property implies that such correlated sequences will be more promiscuous, i.e. statistically prone to stronger binding with an arbitrary sequence. We use an ensemble of entirely random sequences as a proxy for an ensemble of arbitrary protein sequences. The interaction energy between the random and correlated sequences is the following:

\[
E = \iint \nu_p(x)\,V_{pp}(x-x')\,\phi_p(x')\,dx\,dx' + \iint \nu_h(x)\,V_{hh}(x-x')\,\phi_h(x')\,dx\,dx' + \iint \nu_h(x)\,V_{hp}(x-x')\,\phi_p(x')\,dx\,dx' + \iint \nu_p(x)\,V_{hp}(x-x')\,\phi_h(x')\,dx\,dx' \tag{2}
\]

where $\nu_p(x)$ and $\nu_h(x)$ are the local, linear fraction densities of polar and hydrophobic residues, respectively, within the random sequence. Here again $\nu_p(x) + \nu_h(x) = 1$, and $\nu_p(x) = \phi_{p,0} + \delta\nu_p(x)$, with $\delta\nu_p(x)$ being the deviation of the polar residue density from its average value, and analogously $\nu_h(x) = \phi_{h,0} + \delta\nu_h(x)$. We thus assume that the average amino acid composition is the same for random and correlated sequences. We emphasize that the inter-sequence interaction potentials, $V_{pp}(x)$, $V_{hh}(x)$, and $V_{hp}(x)$, need not be identical to the potentials $U_{pp}(x)$, $U_{hh}(x)$, and $U_{hp}(x)$ used in the sequence design procedure. We describe below in detail the influence of the potentials $V_{\alpha\beta}(x)$ and $U_{\alpha\beta}(x)$ on the properties of $P(E)$.
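A discrete, lattice version of Eq. (2) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the sequences are taken as strings over `"P"` and `"H"` placed in register, the hypothetical parameter `reach` sets how many positions apart two residues on opposite sequences may be and still interact, and the numerical values in `V` are placeholders chosen for the example.

```python
def interaction_energy(seq_a, seq_b, potential, reach=1):
    """Sum the pairwise inter-sequence potential over residue pairs.

    Residue i of seq_a interacts with residue j of seq_b whenever
    |i - j| <= reach (parallel configuration, finite interaction range).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    n = len(seq_a)
    total = 0.0
    for i in range(n):
        for j in range(max(0, i - reach), min(n, i + reach + 1)):
            total += potential[(seq_a[i], seq_b[j])]
    return total


# Placeholder inter-sequence potential (illustrative values only):
# like residues attract, unlike residues repel.
V = {("P", "P"): -1.0, ("H", "H"): -1.0, ("H", "P"): 1.0, ("P", "H"): 1.0}

energy_in_register = interaction_energy("HHPP", "HHPP", V, reach=0)
energy_alternating = interaction_energy("HPHP", "PHPH", V, reach=0)
```

With `reach=0` only directly opposed residues interact, so the first pair gives four attractive contacts and the second pair four repulsive ones.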

The probability distribution for the interaction energies between the random and correlated sequences, $P(E)$, is characterized by its mean, $\langle E \rangle$, and by its variance. The mean, $\langle E \rangle$, is independent of the design potential, $U_{\alpha\beta}(x)$, and therefore all the different distributions $P(E)$ obtained at different values of the design temperature, $T_d$, will have exactly the same mean. The variance of $P(E)$ is $\sigma^2 = \langle (\delta E_2)^2 \rangle$, where the only relevant term for the averaging is quadratic in the sequence density fluctuations:

\[
\delta E_2 = \iint \delta\nu_p(x)\,V(x-x')\,\delta\phi_p(x')\,dx\,dx' \tag{3}
\]

where $V(x) = V_{pp}(x) + V_{hh}(x) - 2V_{ph}(x)$, and $\hat{V}(k) = \int V(x)\,e^{ikx}\,dx$. The averaging in $\sigma$ is performed using the Boltzmann probability distribution function for the sequence density fluctuations of the correlated (i.e. designed) sequences, which has the following form:

\[
P_d[\delta\phi_p(x)] = C_1 \exp\!\left(-\int \frac{\delta\phi_p^2(x)}{2\phi_{p,0}\phi_{h,0}}\,dx\right) \exp\!\left(-\frac{E_{\mathrm{intra}}}{k_B T_d}\right) \tag{4}
\]

where $T_d$ is the design temperature, and $k_B$ is the Boltzmann constant. The first exponential term in Eq. (4) is the entropic contribution [8] due to the sequence density fluctuations of the designed sequences, and the second exponential term represents the strength of the correlations within the designed sequences. The corresponding probability distribution for the density fluctuations of the random sequences contains only the entropic contribution:

\[
P_r[\delta\nu_p(x)] = C_2 \exp\!\left(-\int \frac{\delta\nu_p^2(x)}{2\phi_{p,0}\phi_{h,0}}\,dx\right) \tag{5}
\]

The constants $C_1$ and $C_2$ in Eqs. (4) and (5) are found from the normalization constraints imposed on the probability distributions. The averaging leads to the following result:

\[
\sigma^2 = 4L\,\phi_{p,0}\phi_{h,0} \int \frac{dk}{2\pi}\,\frac{|\hat{V}(k)|^2}{1/(\phi_{p,0}\phi_{h,0}) + \hat{U}(k)/(k_B T_d)} \tag{6}
\]

where $\hat{U}(k) = \int U(x)\,e^{ikx}\,dx$, and $U(x) = U_{pp}(x) + U_{hh}(x) - 2U_{ph}(x)$. The larger $\sigma$ is (and thus the broader the distribution $P(E)$ of the interaction energies between the correlated and random sequences), the more promiscuous are the correlated sequences.
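To see how Eq. (6) responds to the design potential, it can help to work through a simple limiting case. The case below is an assumption introduced here for illustration and is not taken from the text: the transform of the design potential is taken to be independent of $k$, $\hat{U}(k) = \hat{U}_0$, as for a contact-like potential. The $k$-integral then factors out, and dividing by the value at $\hat{U}_0 = 0$ (denoted $\sigma_{r,r}^2$, the case with no design potential) gives a closed-form ratio.

```latex
\[
\sigma^2 = \frac{4L\,\phi_{p,0}\phi_{h,0}}{1/(\phi_{p,0}\phi_{h,0}) + \hat{U}_0/(k_B T_d)}
           \int \frac{dk}{2\pi}\,|\hat{V}(k)|^2 ,
\qquad
\frac{\sigma^2}{\sigma_{r,r}^2} = \frac{1}{1 + \phi_{p,0}\phi_{h,0}\,\hat{U}_0/(k_B T_d)} .
\]
% For the uniform composition \phi_{p,0} = \phi_{h,0} = 1/2:
\[
\frac{\sigma^2}{\sigma_{r,r}^2} = \frac{1}{1 + \hat{U}_0/(4 k_B T_d)} .
\]
```

In this sketch a negative $\hat{U}_0$ gives a ratio above one (a broader $P(E)$), a positive $\hat{U}_0$ gives a ratio below one, and the inter-sequence potential $\hat{V}(k)$ cancels out of the ratio altogether. The expression diverges as $\hat{U}_0 \to -4 k_B T_d$, which marks the limit of this Gaussian estimate.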
We note that our model is solvable analytically only in the Gaussian approximation and, unlike the one-dimensional Ising model, is not exactly solvable, due to the generally long-range nature of the intra-sequence ("design") potential, $U(x)$, and the inter-sequence potential, $V(x)$. We also note the existence of a singularity in Eq. (6) at sufficiently large and negative values of the design potential, $U(x)$, where the Gaussian fluctuation model breaks down.

The analysis of Eq. (6) leads to three key conclusions. First, the more negative the design potential, $U(x)$, the larger is $\sigma$. Taking into account the definition $U = U_{pp} + U_{hh} - 2U_{ph}$, one concludes that in order to increase $\sigma$ one needs to design sequences with enhanced correlations in the positions of residues of similar types. This means that correlated sequences where amino acids of the same type are clustered together will be the more promiscuous ones. Second, such correlated sequences will interact statistically more strongly (than non-correlated sequences would) with any arbitrary sequence, independently of the sign of the inter-residue interaction potential, $V = V_{pp} + V_{hh} - 2V_{ph}$. Third, if the design potential is overall positive, $U > 0$, designed sequences will be even less promiscuous than random sequences. We emphasize that the predicted effects are generic and qualitatively independent of the specific form and even the sign of the microscopic interaction potentials, $V_{\alpha\beta}$, and of the average amino acid composition of the sequences.

We note that the predicted effect becomes even stronger when both interacting sequences are designed (i.e. correlated). In this case the variance, $\sigma_{d,d}^2$, of the corresponding $P(E)$ is a straightforward generalization of Eq. (6):

\[
\sigma_{d,d}^2 = 4L \int \frac{dk}{2\pi}\,\frac{|\hat{V}(k)|^2}{\left(1/(\phi_{p,0}\phi_{h,0}) + \hat{U}_1(k)/(k_B T_{d,1})\right)\left(1/(\phi_{p,0}\phi_{h,0}) + \hat{U}_2(k)/(k_B T_{d,2})\right)} \tag{7}
\]

where $U_1(x)$ and $U_2(x)$ are defined analogously to $U(x)$ for each of the interacting sequences, and $T_{d,1}$ and $T_{d,2}$ are the design temperatures for the first and second sequence, respectively. If both design potentials, $U_1(x)$ and $U_2(x)$, are overall negative, then $\sigma_{d,d} > \sigma$, and thus the sequences will be statistically more promiscuous than in the case when only one of the interacting sequences is designed (Eq. (6)). We stress that the interacting sequences are designed independently and are not optimized in any way towards stronger binding. Therefore the observed effect of statistically enhanced binding corresponds to non-specific (promiscuous) binding.

In order to verify our theoretical predictions, we first perform the standard MC annealing procedure [9] to design the correlated sequences. We begin by generating a random sequence with a given amino acid composition. In the computations described below we used a uniform sequence composition with 50% polar and 50% hydrophobic residues. Qualitatively, the results are valid for an arbitrary amino acid composition. We next perform the MC stochastic design procedure, where the residues within the sequence are allowed to exchange their positions, and each sequence configuration has the Boltzmann weight $\propto \exp(-E_{\mathrm{intra}}/k_B T_d)$, where $E_{\mathrm{intra}}$ is the internal energy of the sequence in a given configuration, given by Eq. (1). The MC design procedure is stopped after a certain number of MC moves, and the resulting annealed configuration is accepted as the final, designed configuration for a given sequence. The lower $T_d$ is, the stronger are the correlations within the sequences. Intuitively, stronger correlations correspond to repetitive sequence patterns with a longer correlation length.

The properties of the correlated patterns depend critically on the sign of the interaction potentials $U_{\alpha\beta}(x)$ used in the design procedure. If the effective design potential $U = U_{pp} + U_{hh} - 2U_{hp}$ is overall negative (this corresponds to attraction between amino acids of similar types), the correlated patterns will have the form of repetitive residues of the same type, for example: HHHHPPPPHHHPPP. If, however, the potential $U = U_{pp} + U_{hh} - 2U_{hp}$ is overall positive, the correlated patterns will have the form of alternating hydrophobic and polar residues, for example: HPHPHPHPHPHPHP.

To characterize the correlation properties of the sequences quantitatively, we introduce the normalized correlation function:

\[
\eta_{\alpha\beta}(x) = \frac{g_{\alpha\beta}(x)}{\langle g^{r}_{\alpha\beta}(x) \rangle_r} \tag{8}
\]

where $g_{\alpha\beta}(x)$ is proportional to the probability of finding a residue of type $\alpha$ separated by the distance $x$ from a residue of type $\beta$, $g^{r}_{\alpha\beta}(x)$ is the corresponding probability for the randomized sequence, and $\langle g^{r}_{\alpha\beta}(x) \rangle_r$ corresponds to the averaging with respect to different realizations of randomized sequences. The computed correlation functions are shown in Fig. 1 at the value $k_B T_d = 2$, and the inset shows the analogous calculations for a lower value of the design temperature, $k_B T_d = 1$ (in units of $k_B T$).
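The design step described above (position exchanges accepted with Boltzmann weights) can be sketched with a standard Metropolis rule. This is a minimal sketch under stated assumptions rather than the authors' code: the energy is a nearest-neighbour lattice stand-in for Eq. (1), the values in `U` are illustrative and chosen so that the effective design potential is negative, and the names `intra_energy` and `design_sequence` are introduced here.

```python
import math
import random

# Illustrative nearest-neighbour design potential (assumed values):
# like residues attract, unlike residues repel, so U_pp + U_hh - 2*U_hp < 0.
U = {("P", "P"): -1.0, ("H", "H"): -1.0, ("H", "P"): 1.0, ("P", "H"): 1.0}


def intra_energy(seq):
    """Nearest-neighbour intra-sequence energy; each bond is counted once."""
    return sum(U[(a, b)] for a, b in zip(seq, seq[1:]))


def design_sequence(seq, kT, n_moves, rng):
    """Anneal a sequence by Metropolis-accepted swaps of two residues."""
    seq = list(seq)
    energy = intra_energy(seq)
    for _ in range(n_moves):
        i = rng.randrange(len(seq))
        j = rng.randrange(len(seq))
        if seq[i] == seq[j]:
            continue  # swapping identical residues changes nothing
        seq[i], seq[j] = seq[j], seq[i]
        new_energy = intra_energy(seq)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            energy = new_energy  # accept the swap
        else:
            seq[i], seq[j] = seq[j], seq[i]  # reject: undo the swap
    return "".join(seq)


rng = random.Random(0)
start = list("P" * 50 + "H" * 50)
rng.shuffle(start)  # random sequence with fixed 50/50 composition
start = "".join(start)
designed = design_sequence(start, kT=1.0, n_moves=5000, rng=rng)
```

Because the moves only exchange positions, the amino acid composition is conserved exactly. Recomputing the full energy after each swap is simple rather than fast; a local energy difference would be the obvious refinement for long sequences.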
For the entirely uncorrelated (random) sequences, all the matrix elements of $\eta_{\alpha\beta}(x)$ are equal to unity (Fig. 1). The clustering of residues of a similar type corresponds to $\eta_{\alpha\alpha}(x) > 1$ (Fig. 1).

The next step is to compute numerically the properties of the probability distribution, $P(E)$, of the interaction energies, $E$, between random and designed sequences (i.e. each interacting pair consists of one random and one designed sequence). The results of these calculations are shown in Fig. 2. We computed $P(E)$ at different values of $T_d$, and we represent the results as the ratio between the dispersion of $P(E)$, $\sigma = \sigma_{d,r}$, and the dispersion of the corresponding probability distribution where both sequences are entirely random, $\sigma_{r,r}$ (the latter corresponds to the case of a vanishing design potential, $U_{\alpha\beta} = 0$). We used here the inter-residue interaction potential $V_{pp} = V_{hh} = -1$ and $V_{hp} = 1$, and we assumed that the nearest-neighbor and the next-nearest-neighbor amino acids can interact between the two sequences. The analytical result computed from Eq. (6) is also plotted in Fig. 2. As expected, the Gaussian fluctuation model becomes accurate at small values of the ratio, $|U(a)|/k_B T_d \ll 1$, where $a$ is the potential range. The inset of Fig. 2 shows the computed $P(E)$ for designed-random and random-random sequence pairs, respectively. The key conclusion here is that, in accordance with the analytical predictions, the dispersion of $P(E)$ is larger for the sequences designed with the overall negative $U$ than for the case where both interacting sequences are entirely random, $\sigma_{d,r} > \sigma_{r,r}$. We stress that this result is qualitatively insensitive to the sign of the inter-residue interaction potential, $V(x)$. We also emphasize that the mean interaction energy, $\langle E \rangle$, is identical in the two cases.

To summarize our results qualitatively, correlated sequences of the type HHHHHPPPPPPHHHHHPPPP, where amino acids of the same type are clustered together, will bind statistically more strongly to an arbitrary target sequence set than either random sequences or correlated sequences of the type HPHPHPHPHPHPHPHPHP. The sequences possessing the latter symmetry of correlations will be the least promiscuous ones. This effect is qualitatively robust with respect to the specific form and even the sign of the microscopic inter-sequence interaction potential.
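The step from a larger dispersion to statistically stronger binding rests on the extreme-value argument: at equal mean, a broader energy distribution has a lower typical minimum. The sketch below illustrates only this statistical point. It uses Gaussian samples as a stand-in for $P(E)$, which is an assumption made for the example, and the function name `mean_minimum` and the sample sizes are hypothetical.

```python
import random
import statistics


def mean_minimum(sigma, n_targets, n_trials, rng):
    """Average of the lowest energy found among n_targets random partners.

    Energies are drawn from a zero-mean Gaussian of width sigma; the
    minimum over the target set plays the role of the strongest binding.
    """
    minima = [
        min(rng.gauss(0.0, sigma) for _ in range(n_targets))
        for _ in range(n_trials)
    ]
    return statistics.fmean(minima)


rng = random.Random(1)
narrow = mean_minimum(1.0, n_targets=100, n_trials=500, rng=rng)
broad = mean_minimum(1.3, n_targets=100, n_trials=500, rng=rng)
# Both distributions have the same mean (zero), but the broader one is
# expected to have the lower typical minimum energy.
```

For zero-mean Gaussian energies the typical minimum scales linearly with the width, so any increase in the dispersion translates directly into a lower expected minimum energy.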
Despite the one-dimensional nature of our model, its results are directly applicable to protein-protein interaction networks, since the most recent whole-organism experimental and bioinformatics data suggest that 15-40% of all protein-protein interactions are mediated by linear sequence motifs, and not by large protein surfaces [10]. Still, our key objective for future theoretical analysis is to take into account the effect of protein folding.

There are several possible strategies to test our predictions. The direct experimental test would utilize the protein chip [11] or microfluidic protein chip [12] technology. The target protein data set would be attached to the chip surface. The test proteins or peptides would be synthesized with a varying strength and symmetry of sequence correlations while keeping the average amino acid composition fixed. Titration experiments would allow direct measurement of the binding affinity [11, 12] as a function of the sequence correlation properties. Another possibility is to use the existing high-throughput protein-protein binding data [3-6], and to compare the sequence correlation properties of multi-specific and mono-specific proteins [13].

Yet another possibility is to use the recent whole-genome protein over-expression analysis [14]. Since the over-expression of highly promiscuous proteins should presumably be toxic to a cell, the correlation analysis of such toxic proteins (hundreds of them are known [14]) will show whether the predicted effect plays a significant role in a living cell [15].

Acknowledgements

We thank Amir Aharoni, Gilad Haran, Nikolaus Rajewsky, Irit Sagi, Eugene Shakhnovich, and Dan Tawfik for helpful discussions. This work is supported by the Israel Science Foundation (ISF).

References

[1] O. Khersonsky and D. S. Tawfik, Annu Rev Biochem 79, 471 (2010).
[2] I. Nobeli, A. D. Favia, and J. M. Thornton, Nat Biotechnol 27, 157 (2009).
[3] N. N. Batada et al., PLoS Biol 4, e317 (2006).
[4] G. Butland et al., Nature 433, 531 (2005).
[5] P. Hu et al., PLoS Biol 7, e96 (2009).
[6] J. F. Rual et al., Nature 437, 1173 (2005).
[7] U. Stelzl et al., Cell 122, 957 (2005).
[8] D. B. Lukatsky, K. B. Zeldovich, and E. I. Shakhnovich, Phys Rev Lett 97, 178101 (2006).
[9] D. Frenkel and B. Smit, Understanding molecular simulation: from algorithms to applications (Academic Press, San Diego, 2002), pp. xxii.
[10] J. R. Perkins et al., Structure 18, 1233 (2010).
[11] A. Wolf Yadlin, M. Sevecka, and G. MacBeath, Curr Opin Chem Biol 13, 398 (2009).
[12] D. Gerber, S. J. Maerkl, and S. R. Quake, Nat Methods 6, 71 (2009).
[13] A. Afek and D. B. Lukatsky, (in preparation).
[14] T. Vavouri et al., Cell 138, 198 (2009).
[15] D. S. Tawfik, (private communication).

[Figure 1: plot of $\eta(x)$ versus the separation $x$; the inset shows the same quantity at $T_d = 1$.]

Figure 1: Computed sequence correlation functions for the designed sequences, $\eta_{pp}(x) = \eta_{hh}(x)$ (red squares) and $\eta_{ph}(x) = \eta_{hp}(x)$ (blue diamonds), and for the random sequences (black circles). All the matrix elements of $\eta_{\alpha\beta}(x)$ are the same for the random sequences. The design potential was chosen to be $U_{pp} = U_{hh} = -1$ and $U_{hp} = 1$, and we assumed that only the nearest-neighbor residues can interact. The design temperature is $T_d = 2$. The sequence length was chosen to be 200 amino acids, and we generated 5000 different sequences in each calculation. The plotted $\eta_{\alpha\beta}(x)$ represent the average over the entire set of designed sequences. The uniform amino acid composition was adopted: 50% polar (p) and 50% hydrophobic (h) residues in each sequence. The error bars are smaller than the symbol size. Inset: a computation analogous to that in the main figure but at a lower design temperature, $T_d = 1$ (the design temperature is in units of $k_B T$).

[Figure 2: plot of $\sigma_{d,r}/\sigma_{r,r}$ versus the design temperature $T_d$ (in units of $k_B T$); the inset shows $P(E)$ versus $E$ (in units of $k_B T$) for random-random and designed-random pairs at $T_d = 1$.]

Figure 2: Computed ratio between the dispersions of $P(E)$ for the interaction energies of the designed-random, $\sigma = \sigma_{d,r}$, and random-random, $\sigma = \sigma_{r,r}$, sequence pairs at different values of the design temperature, $T_d$ (circles). The error bars are smaller than the symbol size. The uniform amino acid composition was adopted: 50% polar (p) and 50% hydrophobic (h) residues in each sequence. The thin curve represents the corresponding analytical result, Eq. (6). Inset: computed probability distribution function, $P(E)$, for the interaction energies between pairs of two random sequences (black), and pairs each consisting of one random and one designed sequence, where the designed sequences were generated at $T_d = 1$ (red). The energy $E$ is normalized per residue.