
Sequence correlations shape protein promiscuity

David B. Lukatsky and Ariel Afek
Department of Chemistry, Ben-Gurion University of the Negev, Beer-Sheva 84105 Israel

Abstract

We predict that diagonal correlations of amino acid positions within protein sequences statistically enhance protein propensity for promiscuous binding. Diagonal correlations represent statistically significant repeats of sequence patterns where amino acids of the same type are clustered together. The predicted effect is qualitatively robust with respect to the form of the microscopic interaction potential and the average amino acid composition. We suggest experimental and bioinformatics approaches to test the predicted effect.

Recent experimental evidence that proteins within a cell maintain a high degree of nonspecificity has challenged the understanding of the molecular mechanisms that provide the specificity of protein-protein binding [1, 2]. Such nonspecific binding is termed protein promiscuity. Numerous organismal-scale measurements of protein-protein interactions (PPI) suggest that organismal proteomes possess a high degree of nonspecific binding [3-7]. It appears that protein promiscuity is an evolutionarily selectable trait enabling proteins to evolve more efficiently [1]. The key question is: what makes a protein promiscuous? Are there generic sequence signatures of promiscuity? In this report we predict one such generic signature. We predict that protein sequences with enhanced correlations between the sequence positions of amino acids of the same type are generally more promiscuous. We term such correlations diagonal. This finding suggests that the symmetry properties and the strength of sequence correlations are key factors controlling the global connectivity properties of PPI networks.

We begin by introducing a statistical measure of promiscuity for a given protein, A.
Such a measure can be defined as the probability distribution of the interaction energies, $P(E_A)$, of this protein with a set of target proteins, where $E_A$ is the interaction energy between protein A and a protein from the target set. Now we can compare the promiscuity of two proteins, A and B, interacting with the same target set, assuming that their average interaction energies are the same, $\langle E_A \rangle = \langle E_B \rangle$. The target set is not supposed to be optimized in any way for stronger binding with either protein A or protein B. If the dispersion, $\sigma_A$, of $P(E_A)$ is greater than the dispersion, $\sigma_B$, of $P(E_B)$, then protein A is statistically more promiscuous than protein B. This is because if $\sigma_A > \sigma_B$, the distribution of the minimal interaction energies (the extreme value distribution), $P_{min}(E_A)$, will always be shifted towards lower energies as compared with $P_{min}(E_B)$ [8]. The assumption that $\langle E_A \rangle = \langle E_B \rangle$ corresponds to the constraint that the sequences of the two proteins A and B have the same average amino acid composition (see below). This constraint is necessary for a fair comparison of promiscuities, since differences in the average amino acid composition would produce a trivial shift of the average interaction energies. The predicted effect is induced exclusively by sequence correlations and goes beyond the mean-field level.

We now introduce a simple model for random and designed (or correlated) protein-like, linear sequences. Despite its one-dimensional nature, the model is not exactly solvable because of the generally long-range nature of the potentials that we use below. For simplicity we use a minimalistic sequence alphabet with only two types of residues. A random sequence is obtained by distributing $N_p$ polar and $N_h$ hydrophobic residues at random within a linear sequence of total length $L = N_p + N_h$. Our simplistic approach therefore does not take into account the folding of the sequence. The average linear fractions of polar and hydrophobic residues are thus fixed and given by $\phi_{p,0} = N_p/L$ and $\phi_{h,0} = N_h/L$, respectively. After each random sequence is generated, the residues are fixed and not allowed to change their positions.

A correlated sequence is obtained using the following stochastic, Monte-Carlo (MC) procedure. First, we generate a random sequence as described above. Second, we allow the residues to anneal at a given design temperature, $T_d$.
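The extreme-value argument used above to define promiscuity can be illustrated numerically. The following is a minimal sketch, assuming (purely for illustration) that $P(E)$ is Gaussian and that the energies with different targets are independent; the function name `mean_minimum_energy` and all parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def mean_minimum_energy(sigma, n_targets=1000, n_trials=2000, mean=0.0, seed=0):
    """Average of the lowest energy found among n_targets independent draws.

    Assumption: interaction energies are Gaussian with the given mean and
    dispersion sigma. The same seed is reused so that two calls differ
    only through sigma.
    """
    rng = np.random.default_rng(seed)
    energies = rng.normal(mean, sigma, size=(n_trials, n_targets))
    # The minimum over the target set plays the role of the strongest binding.
    return energies.min(axis=1).mean()

# Same mean energy, different dispersions (illustrative values).
e_min_narrow = mean_minimum_energy(sigma=1.0)
e_min_broad = mean_minimum_energy(sigma=1.5)
print(f"sigma = 1.0: <E_min> = {e_min_narrow:.3f}")
print(f"sigma = 1.5: <E_min> = {e_min_broad:.3f}")
```

With equal means, the broader distribution yields a lower typical minimal energy, which is the sense in which a larger dispersion corresponds to a more promiscuous protein.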
We note that our notion of designed sequences describes the existence of positional correlations of amino acids within the linear sequences, and not the folding. We thus impose that the residues within the sequence under the design procedure interact through the pairwise additive design potential, $U_{\alpha\beta}(x)$. The intra-sequence interaction energy for any given residue distribution is thus:

$$
E_{intra} = \frac{1}{2}\iint \phi_p(x)\,U_{pp}(x-x')\,\phi_p(x')\,dx\,dx' + \frac{1}{2}\iint \phi_h(x)\,U_{hh}(x-x')\,\phi_h(x')\,dx\,dx' + \iint \phi_h(x)\,U_{hp}(x-x')\,\phi_p(x')\,dx\,dx' \tag{1}
$$

where $\phi_p(x)$ and $\phi_h(x)$ are the local, linear fraction densities of polar and hydrophobic residues, respectively. The average composition of polar and hydrophobic residues is fixed by the values $\phi_{p,0}$ and $\phi_{h,0}$, respectively, and we impose that the total fraction of residues at each sequence position, $x$, is unity, $\phi_p(x) + \phi_h(x) = 1$. Here $U_{pp}(x)$, $U_{hh}(x)$, and $U_{hp}(x)$ are the interaction potentials between polar-polar, hydrophobic-hydrophobic, and hydrophobic-polar residues, respectively. We also note that $\phi_p(x)$ can be represented in the form $\phi_p(x) = \phi_{p,0} + \delta\phi_p(x)$, where $\delta\phi_p(x)$ is the deviation of the local density of polar residues from its average value, and analogously, $\phi_h(x) = \phi_{h,0} + \delta\phi_h(x)$. The only two assumptions about the interaction potentials, $U_{\alpha\beta}(x)$, used in the sequence design procedure are that they are pairwise additive and have a finite range of action.

Our next step is to analyze the probability distribution $P(E)$ of the interaction energy, $E$, between the random and correlated sequences. Every pair of interacting sequences thus consists of one random and one correlated sequence superimposed in a parallel configuration, so the problem is quasi-one-dimensional. We show below that enhanced correlations between amino acids of the same type lead to a broadening of the distribution $P(E)$. Such broadening implies that the corresponding extreme value distribution (EVD) will always be shifted to lower energies for more strongly correlated sequences [8]. This property implies that such correlated sequences will be more promiscuous, i.e. statistically prone to stronger binding with an arbitrary sequence. We use an ensemble of entirely random sequences as a proxy for an ensemble of arbitrary protein sequences. The interaction energy between the random and correlated sequences is the following:

$$
E = \iint \nu_p(x)\,V_{pp}(x-x')\,\phi_p(x')\,dx\,dx' + \iint \nu_h(x)\,V_{hh}(x-x')\,\phi_h(x')\,dx\,dx' + \iint \nu_h(x)\,V_{hp}(x-x')\,\phi_p(x')\,dx\,dx' + \iint \nu_p(x)\,V_{hp}(x-x')\,\phi_h(x')\,dx\,dx' \tag{2}
$$

where $\nu_p(x)$ and $\nu_h(x)$ are the local, linear fraction densities of polar and hydrophobic residues, respectively, within the random sequence. Here again $\nu_p(x) + \nu_h(x) = 1$, and $\nu_p(x) = \phi_{p,0} + \delta\nu_p(x)$, with $\delta\nu_p(x)$ being the deviation of the polar residue density from its average value, and analogously $\nu_h(x) = \phi_{h,0} + \delta\nu_h(x)$. We thus assume that the average amino acid composition is the same for random and correlated sequences. We emphasize that the inter-sequence interaction potentials, $V_{pp}(x)$, $V_{hh}(x)$, and $V_{hp}(x)$, need not be identical to the potentials $U_{pp}(x)$, $U_{hh}(x)$, and $U_{hp}(x)$ used in the sequence design procedure. We describe below in detail the influence of the potentials $V_{\alpha\beta}(x)$ and $U_{\alpha\beta}(x)$ on the properties of $P(E)$.
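For concrete sequences, the double integrals of Eq. (2) reduce to a sum over residue pairs. The sketch below is a discrete analogue under several assumptions that are not fixed by the text above: the sequences are written as strings of `h` and `p`, residue $i$ of one sequence interacts with residues $i-1$, $i$, $i+1$ of the other, boundaries are periodic, and the potential values in `V` are illustrative. The function name `interaction_energy` is hypothetical.

```python
# Assumed illustrative inter-sequence potential (units of k_B T):
# like residues attract, unlike residues repel.
V = {("p", "p"): -1.0, ("h", "h"): -1.0, ("h", "p"): 1.0, ("p", "h"): 1.0}

def interaction_energy(seq_a, seq_b, offsets=(-1, 0, 1), potential=V):
    """Discrete analogue of Eq. (2) for two parallel, superimposed sequences.

    Residue i of seq_a interacts with residue i + d of seq_b for every
    d in offsets; periodic boundary conditions are an assumption made
    to avoid end effects.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    length = len(seq_a)
    total = 0.0
    for i, res_a in enumerate(seq_a):
        for d in offsets:
            total += potential[(res_a, seq_b[(i + d) % length])]
    return total

print(interaction_energy("hhpp", "hhpp"))  # -4.0: clusters face like clusters
print(interaction_energy("hhpp", "pphh"))  # 4.0: clusters face unlike clusters
```

The two calls show how the same clustered sequence can bind either favorably or unfavorably depending on the partner, which is what broadens $P(E)$ when the partner is drawn at random.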

The probability distribution for the interaction energies between the random and correlated sequences, $P(E)$, is characterized by its mean, $\langle E \rangle$, and by its variance, $\sigma^2$. The mean, $\langle E \rangle$, is independent of the design potential, $U_{\alpha\beta}(x)$, and therefore all the different distributions $P(E)$ obtained at different values of the design temperature, $T_d$, will have exactly the same mean. The variance of $P(E)$ is $\sigma^2 = \langle (\delta E_2)^2 \rangle$, where the only relevant term for the averaging is quadratic in the sequence density fluctuations:

$$
\delta E_2 = \iint \delta\nu_p(x)\,V(x-x')\,\delta\phi_p(x')\,dx\,dx' \tag{3}
$$

where $V(x) = V_{pp}(x) + V_{hh}(x) - 2V_{hp}(x)$, and $\hat{V}(k) = \int V(x)\,e^{ikx}\,dx$. The averaging in $\sigma^2$ is performed using the Boltzmann probability distribution function for the sequence density fluctuations of the correlated (i.e. designed) sequences, which has the following form:

$$
P_d[\delta\phi_p(x)] = C_1 \exp\left(-\frac{\int \delta\phi_p^2(x)\,dx}{2\phi_{p,0}\phi_{h,0}}\right) \exp\left(-\frac{E_{intra}}{k_B T_d}\right) \tag{4}
$$

where $T_d$ is the design temperature, and $k_B$ is the Boltzmann constant. The first exponential term in Eq. (4) is the entropic contribution [8] due to the sequence density fluctuations of the designed sequences, and the second exponential term represents the strength of the correlations within the designed sequences. The corresponding probability distribution for the density fluctuations of the random sequences contains only the entropic contribution:

$$
P_r[\delta\nu_p(x)] = C_2 \exp\left(-\frac{\int \delta\nu_p^2(x)\,dx}{2\phi_{p,0}\phi_{h,0}}\right) \tag{5}
$$

The constants $C_1$ and $C_2$ in Eqs. (4) and (5) are found from the normalization constraints imposed on the probability distributions. The averaging leads to the following result:

$$
\sigma^2 = 4L\,\phi_{p,0}\phi_{h,0} \int \frac{dk}{2\pi}\,|\hat{V}(k)|^2\,\frac{1}{1/(\phi_{p,0}\phi_{h,0}) + \hat{U}(k)/(k_B T_d)} \tag{6}
$$

where $\hat{U}(k) = \int U(x)\,e^{ikx}\,dx$, and $U(x) = U_{pp}(x) + U_{hh}(x) - 2U_{hp}(x)$. The larger $\sigma$ is (and thus the broader the distribution $P(E)$ of the interaction energies between the correlated and random sequences), the more promiscuous the correlated sequences are.
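Eq. (6) can be evaluated numerically once the Fourier transforms of the potentials are specified. The sketch below computes the ratio of the dispersion for designed-random pairs to that for random-random pairs (the case $\hat{U}(k) = 0$), in which the prefactor $4L$ cancels. The lattice discretization is an assumption: a nearest-neighbour design potential of strength `U_eff` gives $\hat{U}(k) = 2U_{\rm eff}\cos k$, and an inter-sequence potential of strength `V_eff` acting on the facing residue and its two neighbours gives $\hat{V}(k) = V_{\rm eff}(1 + 2\cos k)$. The names `dispersion_ratio`, `U_eff`, and `V_eff` and all numerical values are hypothetical, so absolute temperatures need not match any particular simulation.

```python
import numpy as np

def dispersion_ratio(u_hat, v_hat, kT_d, phi_p0=0.5):
    """sigma(designed-random) / sigma(random-random) from Eq. (6).

    u_hat, v_hat: Fourier transforms of U(x) and V(x) sampled on a
    uniform grid of k covering one period, so that a plain mean
    approximates the integral over dk / (2 pi).
    """
    c = phi_p0 * (1.0 - phi_p0)          # phi_{p,0} * phi_{h,0}
    denom = 1.0 / c + u_hat / kT_d
    if np.any(denom <= 0.0):
        raise ValueError("non-positive denominator: Gaussian approximation not usable")
    designed = np.mean(np.abs(v_hat) ** 2 / denom)
    random_pair = np.mean(np.abs(v_hat) ** 2 * c)
    return np.sqrt(designed / random_pair)

k = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
U_eff = -4.0   # assumed: U_pp + U_hh - 2 U_hp at nearest-neighbour distance
V_eff = -4.0   # assumed: V_pp + V_hh - 2 V_hp for each interacting offset
u_hat = 2.0 * U_eff * np.cos(k)
v_hat = V_eff * (1.0 + 2.0 * np.cos(k))

for kT_d in (4.0, 8.0, 16.0):
    print(kT_d, dispersion_ratio(u_hat, v_hat, kT_d))
```

In this sketch the ratio exceeds unity for the attractive (negative) design potential and approaches unity as the design temperature grows, since the correlations then vanish.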
We note that our model is solvable analytically only in the Gaussian approximation and, unlike the one-dimensional Ising model, is not exactly solvable, due to the generally long-range nature of the intra-sequence (design) potential, $U(x)$, and the inter-sequence potential, $V(x)$. We also note the existence of a singularity in Eq. (6) at sufficiently large and negative values of the design potential, $U(x)$, where the Gaussian fluctuation model breaks down.

The analysis of Eq. (6) leads to three key conclusions. First, the more negative the design potential, $U(x)$, the larger $\sigma$ is. Taking into account the definition $U = U_{pp} + U_{hh} - 2U_{hp}$, one concludes that in order to increase $\sigma$ one needs to design sequences with enhanced correlations between the positions of residues of similar types. This means that correlated sequences where amino acids of the same type are clustered together will be the more promiscuous ones. Second, such correlated sequences will interact statistically more strongly (than non-correlated sequences would) with arbitrary sequences, independently of the sign of the inter-residue interaction potential, $V = V_{pp} + V_{hh} - 2V_{hp}$. Third, if the design potential is overall positive, $U > 0$, the designed sequences will be even less promiscuous than random sequences. We emphasize that the predicted effects are generic and qualitatively independent of the specific form and even the sign of the microscopic interaction potentials, $V_{\alpha\beta}$, and of the average amino acid composition of the sequences.

We note that the predicted effect becomes even stronger when both interacting sequences are designed (i.e. correlated). In this case the variance, $\sigma_{d,d}^2$, of the corresponding $P(E)$ is a straightforward generalization of Eq. (6):

$$
\sigma_{d,d}^2 = 4L \int \frac{dk}{2\pi}\,|\hat{V}(k)|^2\,\frac{1}{\left(1/(\phi_{p,0}\phi_{h,0}) + \hat{U}_1(k)/(k_B T_1)\right)\left(1/(\phi_{p,0}\phi_{h,0}) + \hat{U}_2(k)/(k_B T_2)\right)} \tag{7}
$$

where $U_1(x)$ and $U_2(x)$ are defined analogously to $U(x)$ for each of the interacting sequences, and $T_1$ and $T_2$ are the design temperatures for the first and second sequence, respectively. If both design potentials, $U_1(x)$ and $U_2(x)$, are overall negative, then $\sigma_{d,d} > \sigma$, and thus the sequences will be statistically more promiscuous than in the case when only one of the interacting sequences is designed (Eq. (6)). We stress that the interacting sequences are designed independently and are not optimized in any way towards stronger binding. Therefore the observed effect of statistically enhanced binding corresponds to non-specific (promiscuous) binding.

In order to verify our theoretical predictions, we first perform the standard MC annealing procedure [9] to design the correlated sequences. We begin by generating a random sequence with a given amino acid composition. In the computations described below we used a uniform sequence composition with 50% polar and 50% hydrophobic residues. Qualitatively, the results are valid for an arbitrary amino acid composition. We next perform the MC stochastic design procedure, where the residues within the sequence are allowed to exchange their positions, and each sequence configuration has the Boltzmann weight $\sim \exp(-E_{intra}/k_B T_d)$, where $E_{intra}$ is the internal energy of the sequence in a given configuration, given by Eq. (1). The MC design procedure is stopped after a certain number of MC moves, and the resulting annealed configuration is accepted as the final, designed configuration for a given sequence. The lower $T_d$ is, the stronger the correlations within the sequences are. Intuitively, stronger correlations correspond to repetitive sequence patterns with a longer correlation length.

The properties of the correlated patterns depend critically on the sign of the interaction potentials $U_{\alpha\beta}(x)$ used in the design procedure. If the effective design potential $U = U_{pp} + U_{hh} - 2U_{hp}$ is overall negative (this corresponds to attraction between amino acids of similar types), the correlated patterns will have the form of repeated residues of the same type, for example:

HHHHPPPPHHHPPP

If, however, the potential $U = U_{pp} + U_{hh} - 2U_{hp}$ is overall positive, the correlated patterns will have the form of alternating hydrophobic and polar residues, for example:

HPHPHPHPHPHPHP

To characterize the correlation properties of the sequences quantitatively, we introduce the normalized correlation function:

$$
\eta_{\alpha\beta}(x) = \frac{g_{\alpha\beta}(x)}{\langle g^{r}_{\alpha\beta}(x) \rangle_r} \tag{8}
$$

where $g_{\alpha\beta}(x)$ is proportional to the probability of finding a residue of type $\alpha$ separated by the distance $x$ from a residue of type $\beta$, $g^{r}_{\alpha\beta}(x)$ is the corresponding probability for the randomized sequence, and $\langle g^{r}_{\alpha\beta}(x) \rangle_r$ denotes the average over different realizations of the randomized sequences. The computed correlation functions are shown in Fig. 1 at $k_B T_d = 2$, and the inset shows the analogous calculations for a lower value of the design temperature, $k_B T_d = 1$ (in units of $k_B T$).
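The design procedure and the correlation function of Eq. (8) can be sketched as follows. This is a minimal illustration, not the implementation used for the figures: it assumes nearest-neighbour interactions with periodic boundaries, illustrative energies `u_same` and `u_diff` for like and unlike neighbours, swaps between arbitrary pairs of positions, and small ensemble sizes chosen for speed. All function names are hypothetical.

```python
import numpy as np

def design_sequence(length=100, kT_d=1.0, n_sweeps=30,
                    u_same=-1.0, u_diff=1.0, rng=None):
    """Anneal a random 50/50 sequence with Metropolis residue swaps.

    Residues are stored as spins (+1 = hydrophobic, -1 = polar); length
    should be even. Only nearest neighbours interact: a like pair costs
    u_same, an unlike pair costs u_diff (assumed values).
    """
    rng = np.random.default_rng() if rng is None else rng
    s = np.array([1, -1] * (length // 2))
    rng.shuffle(s)  # random starting sequence with fixed composition

    def intra_energy(seq):
        like = seq == np.roll(seq, 1)
        return np.where(like, u_same, u_diff).sum()

    e = intra_energy(s)
    for _ in range(n_sweeps * length):
        i, j = rng.integers(length, size=2)
        if s[i] == s[j]:
            continue  # swapping identical residues changes nothing
        s[i], s[j] = s[j], s[i]
        e_new = intra_energy(s)
        # Metropolis rule for the Boltzmann weight exp(-E_intra / k_B T_d)
        if e_new <= e or rng.random() < np.exp(-(e_new - e) / kT_d):
            e = e_new
        else:
            s[i], s[j] = s[j], s[i]  # reject: undo the swap
    return s

def pair_count(s, x, a=1, b=1):
    """Number of positions i with type a at i and type b at i + x (periodic)."""
    return np.sum((s == a) & (np.roll(s, -x) == b))

def eta(designed, randomized, x, a=1, b=1):
    """Normalized correlation function of Eq. (8), averaged over each ensemble."""
    g = np.mean([pair_count(s, x, a, b) for s in designed])
    g_r = np.mean([pair_count(s, x, a, b) for s in randomized])
    return g / g_r

rng = np.random.default_rng(1)
designed = [design_sequence(kT_d=1.0, rng=rng) for _ in range(30)]
randomized = [rng.permutation(s) for s in designed]  # shuffled controls
eta_hh_1 = eta(designed, randomized, x=1)
eta_hp_1 = eta(designed, randomized, x=1, a=1, b=-1)
print("eta_hh(1) =", eta_hh_1, " eta_hp(1) =", eta_hp_1)
```

With an attractive potential between like residues, the sketch is expected to give a diagonal element above unity and an off-diagonal element below unity at short separations, corresponding to clustering of residues of the same type.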
For entirely uncorrelated (random) sequences, all the matrix elements of $\eta_{\alpha\beta}(x)$ are equal to unity (Fig. 1). The clustering of residues of a similar type corresponds to $\eta_{\alpha\alpha}(x) > 1$ (Fig. 1).

The next step is to compute numerically the properties of the probability distribution, $P(E)$, of the interaction energies, $E$, between random and designed sequences (i.e. each interacting pair consists of one random and one designed sequence). The results of these calculations are shown in Fig. 2. We computed $P(E)$ at different values of $T_d$, and we represent the results as the ratio between the dispersion of $P(E)$, $\sigma = \sigma_{d,r}$, and the dispersion of the corresponding probability distribution where both sequences are entirely random, $\sigma_{r,r}$ (the latter corresponds to the case of a vanishing design potential, $U_{\alpha\beta} = 0$). We used here the inter-residue interaction potential $V_{pp} = V_{hh} = -1$ and $V_{hp} = 1$, and we assumed that the nearest-neighbor and the next-nearest-neighbor amino acids can interact between the two sequences. The analytical result computed from Eq. (6) is also plotted in Fig. 2. As expected, the Gaussian fluctuation model becomes accurate at small values of the ratio, $|U(a)|/k_B T_d \ll 1$, where $a$ is the potential range. The inset of Fig. 2 shows the computed $P(E)$ for designed-random and random-random sequence pairs, respectively. The key conclusion here is that, in accordance with the analytical predictions, the dispersion of $P(E)$ is larger for the sequences designed with an overall negative $U$ than the dispersion of $P(E)$ in the case where both interacting sequences are entirely random, $\sigma_{d,r} > \sigma_{r,r}$. We stress that this result is qualitatively insensitive to the sign of the inter-residue interaction potential, $V(x)$. We emphasize also that the mean interaction energy, $\langle E \rangle$, is identical in the two cases.

To summarize our results qualitatively, correlated sequences of the type HHHHHPPPPPPHHHHHPPPP, where amino acids of the same type are clustered together, will bind statistically more strongly to an arbitrary target sequence set than either random sequences or correlated sequences of the type HPHPHPHPHPHPHPHPHP. Sequences possessing the latter symmetry of correlations will be the least promiscuous ones. This effect is qualitatively robust with respect to the specific form and even the sign of the microscopic inter-sequence interaction potential.
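The qualitative summary above can be checked with a small numerical experiment that compares a clustered, a random, and an alternating test sequence against the same kind of random targets. The sketch rests on assumptions that are not taken from the paper: residues are mapped to spins (H = +1, P = -1) so that the pair energy is $-s_i t_j$ (like residues attract), each residue interacts with the facing residue and its two neighbours on the partner sequence, boundaries are periodic, and the targets are shuffled 50/50 sequences. The names `to_spins` and `energy_dispersion` are hypothetical.

```python
import numpy as np

def to_spins(seq):
    """Map a string of H/P letters to +1/-1 spins."""
    return np.array([1 if c == "H" else -1 for c in seq])

def energy_dispersion(test_seq, n_targets=4000, offsets=(-1, 0, 1), seed=0):
    """Mean and standard deviation of E (per residue) against random targets.

    Assumes V_pp = V_hh = -1 and V_hp = +1, i.e. pair energy -s_i * t_j,
    with periodic boundaries and an even sequence length.
    """
    rng = np.random.default_rng(seed)
    s = to_spins(test_seq)
    length = len(s)
    # field[j] = sum of the test-sequence spins that interact with target site j
    field = sum(np.roll(s, d) for d in offsets)
    base = np.array([1, -1] * (length // 2))
    energies = np.array([-(field @ rng.permutation(base)) / length
                         for _ in range(n_targets)])
    return energies.mean(), energies.std()

rng = np.random.default_rng(42)
clustered = "HHHHHPPPPP" * 20
alternating = "HP" * 100
random_seq = "".join(rng.permutation(list("HP" * 100)))

for name, seq in [("clustered", clustered), ("random", random_seq),
                  ("alternating", alternating)]:
    mean, std = energy_dispersion(seq)
    print(f"{name:12s} <E> = {mean:+.4f}  sigma = {std:.4f}")
```

All three test sequences have the same composition and therefore the same mean energy, so any difference in the dispersion arises from the positional correlations alone.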
Despite the one-dimensional nature of our model, its results are directly applicable to protein-protein interaction networks, since the most recent whole-organism experimental and bioinformatics data suggest that 15-40% of all protein-protein interactions are mediated by linear sequence motifs, and not by large protein surfaces [10]. Still, our key objective for future theoretical analysis is to take into account the effect of protein folding.

There are several possible strategies to test our predictions. A direct experimental test would utilize the protein chip [11] or microfluidic protein chip [12] technology. The target protein data set would be attached to the chip surface. The test proteins or peptides would be synthesized with varying strength and symmetry of sequence correlations, while keeping the average amino acid composition fixed. Titration experiments would allow a direct measurement of the binding affinity [11, 12] as a function of the sequence correlation properties. Another possibility is to use the existing high-throughput protein-protein binding data [3-6], and to compare the sequence correlation properties of multi-specific and mono-specific proteins [13].

Yet another possibility is to use the recent whole-genome protein over-expression analysis [14]. Since the over-expression of highly promiscuous proteins should presumably be toxic to a cell, the correlation analysis of such toxic proteins (hundreds of them are known [14]) will show whether the predicted effect plays a significant role in a living cell [15].

Acknowledgements

We thank Amir Aharoni, Gilad Haran, Nikolaus Rajewsky, Irit Sagi, Eugene Shakhnovich, and Dan Tawfik for helpful discussions. This work is supported by the Israel Science Foundation (ISF).

References

[1] O. Khersonsky and D. S. Tawfik, Annu Rev Biochem 79, 471 (2010).
[2] I. Nobeli, A. D. Favia, and J. M. Thornton, Nat Biotechnol 27, 157 (2009).
[3] N. N. Batada et al., PLoS Biol 4, e317 (2006).
[4] G. Butland et al., Nature 433, 531 (2005).
[5] P. Hu et al., PLoS Biol 7, e96 (2009).
[6] J. F. Rual et al., Nature 437, 1173 (2005).
[7] U. Stelzl et al., Cell 122, 957 (2005).
[8] D. B. Lukatsky, K. B. Zeldovich, and E. I. Shakhnovich, Phys Rev Lett 97, 178101 (2006).
[9] D. Frenkel and B. Smit, Understanding molecular simulation: from algorithms to applications (Academic Press, San Diego, 2002).
[10] J. R. Perkins et al., Structure 18, 1233 (2010).
[11] A. Wolf Yadlin, M. Sevecka, and G. MacBeath, Curr Opin Chem Biol 13, 398 (2009).
[12] D. Gerber, S. J. Maerkl, and S. R. Quake, Nat Methods 6, 71 (2009).
[13] A. Afek and D. B. Lukatsky, (in preparation).
[14] T. Vavouri et al., Cell 138, 198 (2009).
[15] D. S. Tawfik, (private communication).

[Plot: $\eta(x)$ versus sequence separation $x$ for random sequences and for designed sequences at $T_d = 2$; inset: the same quantities at $T_d = 1$.]

Figure 1: Computed sequence correlation functions for the designed sequences, $\eta_{pp}(x) = \eta_{hh}(x)$ (red squares) and $\eta_{ph}(x) = \eta_{hp}(x)$ (blue diamonds), and for the random sequences (black circles). All the matrix elements of $\eta_{\alpha\beta}(x)$ are the same for the random sequences. The design potential was chosen to be $U_{pp} = U_{hh} = -1$ and $U_{hp} = 1$, and we assumed that only the nearest-neighbor residues can interact. The design temperature is $T_d = 2$. The sequence length was chosen to be 200 amino acids, and we generated 5000 different sequences in each calculation. The plotted $\eta_{\alpha\beta}(x)$ represent the average over the entire set of the designed sequences. A uniform amino acid composition was adopted: 50% polar (p) and 50% hydrophobic (h) residues in each sequence. The error bars are smaller than the symbol size. Inset: a computation analogous to the main figure but at a lower design temperature, $T_d = 1$ (the design temperature is in units of $k_B T$).

[Plot: $\sigma_{d,r}/\sigma_{r,r}$ versus design temperature $T_d$ (in units of $k_B T$); inset: $P(E)$ versus $E$ (in units of $k_B T$) for random-random pairs and for designed-random pairs at $T_d = 1$.]

Figure 2: Computed ratio between the dispersions of $P(E)$ for the interaction energies of the designed-random, $\sigma = \sigma_{d,r}$, and random-random, $\sigma = \sigma_{r,r}$, sequence pairs at different values of the design temperature, $T_d$ (circles). The error bars are smaller than the symbol size. A uniform amino acid composition was adopted: 50% polar (p) and 50% hydrophobic (h) residues in each sequence. The thin curve represents the corresponding analytical result, Eq. (6). Inset: computed probability distribution function, $P(E)$, for the interaction energies between pairs of two random sequences (black), and between pairs each consisting of a random and a designed sequence, where the designed sequences were generated at $T_d = 1$ (red). The energy $E$ is normalized per residue.