Analysis of Biological Networks: Network Robustness and Evolution

Lecturer: Roded Sharan
Scribes: Sasha Medvedovsky and Eitan Hirsh
Lecture 14, February 2, 2006

1 Introduction

This chapter is divided into two topics: network robustness and network evolution. We discuss the robustness of biological networks and analyze the way these networks evolve. We expect a network's robustness to be rooted in its evolutionary process.

1.1 Network Robustness

In this section we define measures of topological robustness and use them to study exponential and scale-free networks. We then explore the relation between topological robustness and the corresponding biological robustness.

1.2 Network Evolution

In this section we explore evolutionary processes that shape the development of PPI networks over time. We first base the entire model on gene duplication with preferential link attachment. We then explain why this process alone cannot shape the PPI network, and add processes of link loss and gain (commonly called link dynamics). Finally, we argue that link dynamics is the main process that shapes the network, and that gene duplication has only a minor effect.

2 Network Robustness

As biological networks are required to function in various conditions, we expect evolution to provide them with robust structures. For example, the fruit fly can develop in two distinct environments: one rich in glucose and the other rich in galactose. This ability may imply that the fly's PPI network is robust enough to allow it to function even when some of its nodes are disabled.

There are many ways to decide whether a network is robust. In this section we discuss topological robustness, which is measured by the network's resilience in terms of quantities such as the size of the largest component and the average distance between nodes. Network robustness is measured under two scenarios:

Failure - removal of randomly chosen nodes.
Attack - removal of specifically chosen nodes from the network.

In this chapter we apply a specific type of attack, called a hub attack: at each step, the node with the maximal degree is removed.
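The two removal scenarios can be illustrated on a toy graph. The following is a minimal sketch, assuming an undirected graph without self-loops stored as a dictionary of neighbor sets; the function names (`remove_node`, `largest_component_size`, `hub_attack`, `random_failure`) are illustrative choices and are not taken from [2].

```python
import random
from collections import deque


def remove_node(graph, node):
    """Delete `node` and all its edges from an adjacency-set graph (in place)."""
    for neighbor in graph.pop(node):
        if neighbor in graph:
            graph[neighbor].discard(node)


def largest_component_size(graph):
    """Return the number of nodes in the largest connected component (BFS)."""
    seen, best = set(), 0
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


def hub_attack(graph, steps):
    """Attack: repeatedly remove the node that currently has maximal degree."""
    graph = {n: set(nbrs) for n, nbrs in graph.items()}  # work on a copy
    for _ in range(steps):
        hub = max(graph, key=lambda n: len(graph[n]))
        remove_node(graph, hub)
    return graph


def random_failure(graph, steps, seed=0):
    """Failure: remove `steps` nodes chosen uniformly at random."""
    graph = {n: set(nbrs) for n, nbrs in graph.items()}  # work on a copy
    rng = random.Random(seed)
    for node in rng.sample(sorted(graph), steps):
        remove_node(graph, node)
    return graph


# Toy example (hypothetical data): a star with one hub (node 0) and five leaves.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
after_attack = hub_attack(star, 1)
after_failure = random_failure(star, 1)
```

In this toy star, a single hub-attack step always removes node 0 and leaves only isolated nodes, whereas a single random failure removes a leaf in five out of six cases and then leaves the rest of the star connected.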

Figure 1: The first two diagrams show an exponential random network and the effect of random node failure on the network's connectivity. The second two diagrams show the effect of random node removal in a scale-free random network; the connectivity is hardly damaged. The last two diagrams show the effect of a hub attack on the scale-free network, in which case the network's connectivity is heavily damaged [2].

2.1 Failure vs. Attack

We would like to analyze the effects of failure and attack on networks with different properties. As we have seen in chapter 2, biological networks usually have a scale-free degree distribution. In [2] the authors simulated the effects of failure and attack on such scale-free random networks. Figure 1 shows that, in terms of network connectivity, scale-free networks tend to be much more robust than exponential random networks when random nodes are removed, but more vulnerable to hub attacks.

This result can be explained by the presence of hubs in scale-free networks. Since most nodes have low degree, random node removal is likely to remove low-degree nodes and leave the hubs intact, so the impact on network connectivity is expected to be small. In a targeted attack, on the other hand, the hubs themselves are removed, which is expected to damage the network connectivity.

To further evaluate the effect of failure and attack on the two types of networks, the diameter of each network was examined as a function of the fraction of nodes removed. The results in Figure 2 strengthen the previous statement that scale-free networks are more robust than exponential networks to failures, but more vulnerable to attacks. We can also see that exponential random networks behave the same under failure and under attack.

The authors performed an additional experiment comparing the behavior of exponential and scale-free networks under failures and attacks, based on the size of the largest component and the average size of the other components (Figure 3). As before, we see that exponential random networks behave the same under failure and attack, whereas scale-free networks are resilient to failure and vulnerable to attack.
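The quantities tracked in these experiments can be computed with breadth-first search. The sketch below is one possible formulation, not the procedure of [1]: it assumes an adjacency-set graph, normalizes the largest cluster by the original number of nodes, and takes the "diameter" to be the mean shortest-path length over connected pairs. These conventions and all function names are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def component_sizes(graph):
    """Sizes of all connected components, largest first."""
    seen, sizes = set(), []
    for node in graph:
        if node not in seen:
            members = bfs_distances(graph, node)
            seen.update(members)
            sizes.append(len(members))
    return sorted(sizes, reverse=True)


def fragmentation(graph, original_size):
    """Return (S, <s>): relative largest-cluster size and mean size of the other clusters."""
    sizes = component_sizes(graph)
    if not sizes:
        return 0.0, 0.0
    rest = sizes[1:]
    mean_rest = sum(rest) / len(rest) if rest else 0.0
    return sizes[0] / original_size, mean_rest


def mean_path_length(graph):
    """Mean shortest-path length over all connected ordered pairs of distinct nodes."""
    total, pairs = 0, 0
    for node in graph:
        dist = bfs_distances(graph, node)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy graph (hypothetical data): a path 0-1-2-3 plus a separate edge 4-5.
toy = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
S, s_mean = fragmentation(toy, original_size=6)
d = mean_path_length(toy)
```

Recomputing these values after each removal step, for increasing fractions of removed nodes, would give curves of the kind described for Figures 2 and 3.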

Figure 2: Changes in the diameter of the network as a function of the fraction of removed nodes. (a) Comparison between the exponential (E) and scale-free (SF) network models, each containing N = 10,000 nodes and 20,000 links. (b) Changes in the diameter of the Internet under random failures (squares) or attacks (circles). (c) Changes in the diameter of the World Wide Web under random failures (squares) or attacks (circles) [1].

Figure 3: The relative size of the largest cluster $S$ (open symbols) and the average size of the isolated clusters $\langle s \rangle$ (filled symbols) as a function of the fraction of removed nodes $f$ for the same systems. The size $S$ is defined as the fraction of nodes contained in the largest cluster. (a) Fragmentation of the exponential network under random failures (squares) and attacks (circles). (b) Fragmentation of the scale-free network under random failures (blue squares) and attacks (red circles). The inset shows the error tolerance curves for the whole range of $f$, indicating that the main cluster falls apart only after it has been completely deflated. (c) and (d) show the effect of failure and attack on the Internet and the World Wide Web, respectively [1].

Figure 4: The fraction of essential proteins with exactly k links in the yeast proteome [7].

Figure 5: A comparison of the centrality properties of essential and non-essential nodes. The comparison covers betweenness, connectivity and closeness centrality. The results clearly show a correlation between essentiality and centrality [5].

2.2 Essentiality and Centrality

Previous work [7] studied the relation between structural and functional properties of biological networks. Specifically, it found that nodes with many links in PPI networks tend to be more essential for yeast cell survival (Figure 4). An extension of this work examined the relation between node centrality and essentiality. A node's centrality can be defined in various ways:

Betweenness - the fraction of shortest paths between pairs of graph nodes that pass through the node.
Connectivity - the degree of the node.
Closeness - the inverse of the mean distance from the node to the other nodes in the graph.

A thorough examination in [5] verifies the correlation between essentiality and centrality. As can be seen in Figure 5, in all three species examined (yeast, fly and worm) there is a strong correlation between the essentiality of a protein and its centrality, as measured by its betweenness, connectivity and closeness.
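The three centrality measures listed above can be sketched as follows, assuming an undirected, unweighted graph stored as a dictionary of neighbor sets. Betweenness is computed here with Brandes' algorithm and is left unnormalized (a sum over node pairs rather than a fraction); dividing by the number of pairs is one possible normalization. The function names and the toy graph are illustrative assumptions, not the implementation used in [5].

```python
from collections import deque


def degree(graph, v):
    """Connectivity: the number of neighbors of v."""
    return len(graph[v])


def closeness(graph, v):
    """Inverse of the mean shortest-path distance from v to the nodes it can reach."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    others = len(dist) - 1
    return others / sum(dist.values()) if others else 0.0


def betweenness(graph):
    """For each node, sum over unordered pairs (s, t) of the fraction of
    shortest s-t paths that pass through it (Brandes' algorithm)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in graph}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Back-propagate pair dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted twice, once from each endpoint.
    return {v: score[v] / 2.0 for v in graph}


# Toy graph (hypothetical data): path a - b - c; b lies on the only a-c shortest path.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
bc = betweenness(path)
```

On the toy path, the middle node has the highest value under all three measures, which is the kind of agreement between centrality definitions that the comparison in [5] relies on.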

Figure 6: Probability densities of the average PCCs (Pearson correlation coefficients), calculated from a global expression profiling compendium. The number n in the panel refers to the number of data points for each gene. Average PCCs for hubs in the FYI (filtered yeast interactome) network (red curve) show a clear bimodal distribution that is used to separate date and party hubs (the boundary is marked by the arrow). No bimodal distribution is observed for the average PCCs of non-hub proteins (cyan curve) or for hubs in randomized networks (black curve) [6].

2.3 Essentiality depends on Functionality

In the previous sections we showed that topologically central nodes (hubs) tend to be more essential, and saw their contribution to structural robustness. In this section we examine the essentiality of different types of hubs and the effect of their removal on the topological robustness of the network. We can divide hubs into two categories:

Party hubs - proteins that have many simultaneous interactions (multiple interactions happening under the same conditions).
Date hubs - proteins that have many different interaction partners, but only a few under any single condition.

2.3.1 Party vs. Date Hubs

As we have seen in chapter 1, PPI data is generated via several methods (e.g. co-IP and Y2H). All of these methods produce static data and disregard the dynamic behavior of the interactions over time. Thus we cannot distinguish between party hubs and date hubs by looking at the PPI network alone. To overcome this problem, we need to use other available data, e.g. gene expression. We would like to check the correlation between the gene expression of a hub node and that of its neighbors under various conditions. We can then check whether there really are two distinct types of hubs: those that are more correlated with their neighbors (party hubs), and those that are less correlated (date hubs).

The authors of [6] considered only high-confidence PPIs (supported by at least 2 sources), and defined hubs as nodes with more than 5 neighbors. The result of their experiment can be seen in Figure 6. Surprisingly, the average PCC of hubs clearly shows a bimodal distribution, which is used to separate date hubs from party hubs.

Another possible way to distinguish between party hubs and date hubs is the diversity of the cellular localization of their neighbors. Party hubs tend to interact within a specific cell region, while date hubs interact in different cell regions. Figure 7 shows some examples of party and date hubs and indicates that party hubs have mostly intra-module interactions (within the same functional module), while date hubs have mostly inter-module interactions (between different functional modules). Here a module is a group of proteins with similar mRNA expression patterns.
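The separation of hubs by expression correlation can be sketched as follows. This is a minimal illustration, not the pipeline of [6]: it assumes that each protein has an expression profile (a list of measurements under the same conditions), uses the hub definition from the text (more than 5 neighbors), and separates the two classes with an average-PCC `threshold` of 0.5 that is an illustrative assumption. All names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def average_pcc(graph, expression, hub):
    """Mean PCC between a hub's expression profile and those of its neighbors."""
    values = [pearson(expression[hub], expression[n]) for n in graph[hub]]
    return sum(values) / len(values)


def classify_hubs(graph, expression, min_degree=6, threshold=0.5):
    """Label every node with at least `min_degree` neighbors as 'party' or 'date'.

    `threshold` is an assumed cut-off between the two modes of the
    average-PCC distribution, chosen here only for illustration.
    """
    labels = {}
    for node, neighbors in graph.items():
        if len(neighbors) >= min_degree:
            avg = average_pcc(graph, expression, node)
            labels[node] = "party" if avg >= threshold else "date"
    return labels


# Toy data (hypothetical): P is co-expressed with its partners, D is not.
toy_graph = {"P": {"p1", "p2"}, "p1": {"P"}, "p2": {"P"},
             "D": {"d1", "d2"}, "d1": {"D"}, "d2": {"D"}}
toy_expr = {"P": [1, 2, 3, 4], "p1": [2, 4, 6, 8], "p2": [1, 2, 3, 5],
            "D": [1, 2, 3, 4], "d1": [4, 3, 2, 1], "d2": [1, 2, 3, 4]}
# min_degree is lowered to 2 only so that the toy graph contains hubs.
labels = classify_hubs(toy_graph, toy_expr, min_degree=2)
```

In practice the cut-off would be read off the observed bimodal distribution (the arrow in Figure 6) rather than fixed in advance.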

Figure 7: In this schematic protein interaction network, proteins are colored according to mutual similarity in their mRNA expression patterns. Party hubs are highly correlated in expression with their partners, and presumably interact with them at similar times. The partners of date hubs exhibit more limited co-expression, and presumably the corresponding physical interactions occur at different times and/or different locations [6].

Now that we have seen that there are two types of hubs, and we know how to distinguish between them, let us go back and check how this observation influences network robustness. As can be seen in Figure 8, the network is resilient to the loss of party hubs but vulnerable to the deletion of date hubs. As mentioned, party hubs are intra-module and thus take part in a specific function. Given the functional resilience of the network, we can expect it to be more resilient to the loss of party hubs.

To further investigate the differences between date and party hubs, the authors compare the network's resilience to the deletion of date hubs and of party hubs with similar properties, such as degree and clustering coefficient, on the FYI (filtered yeast interactome) network (Figure 9). As shown, removal of party hubs does not affect the connectivity of the network and thus resembles failures, whereas removal of date hubs causes rapid network breakdown.

2.3.2 Essentiality and Hubs

A further study of the biological relevance of date and party hubs examines their essentiality and their participation in genetic interactions. As can be seen in Figure 10b, in single-gene knockout experiments date and party hubs have similar chances of being essential. However, among known genetic interactions there are twice as many interactions involving date hubs as there are interactions involving party hubs (see Figure 10c). Note, though, that these differences may result from other factors, such as a larger number of studies involving date hubs or the total numbers of date and party hubs.

Figure 8: The effect of gradual node removal on the characteristic path length of the network. Random removal of nodes (failures) is represented by the green line, attacks against all hubs by the brown line, attacks against party hubs by the blue line, and attacks against date hubs by the red line. The breakdown point is the threshold after which the main component of the network starts disintegrating [6].

Figure 9: Subsets of date and party hubs with comparable degree k and clustering coefficient C (lines are colored as in Figure 8) [6].

Figure 10: (b) Date and party hubs are both more likely to be essential than non-hubs, but their single knockout affects cellular viability to the same extent. (c) Date hubs participate in more genetic interactions than party hubs or non-hubs, as measured by genetic interaction density (GID), based on genetic interactions gathered at MIPS [9, 6].

3 PPI Network Evolution

In this section we explore various possible models of network development during evolution, focusing on the evolution of PPI networks.

3.1 Growth and Preferential Attachment

A simple model for the evolution of PPI networks is network growth: nodes with a fixed degree m are continually added to the network and are never removed. A refinement of this model is preferential attachment, in which the probability that a new node connects to an existing node is proportional to the existing node's degree.

One way to validate these models is to divide proteins into age groups and check whether proteins from different groups differ in degree. The authors of [4] divided proteins into age groups according to the evolutionary stage at which the proteins emerged. First, they took the yeast PPI network and used BLAST to find matching proteins in E. coli, Arabidopsis and fission yeast. Then they divided the yeast proteins into 4 age groups, based on their presence in the different species. For example, a protein found in all 4 species was tagged as group 4, and a protein found in yeast and fission yeast but not in E. coli or Arabidopsis was tagged as group 2 (Figure 11). Next we show how the authors used the interactions between proteins from different age groups to examine the growth process.

3.1.1 Empirical Evidence for Growth

In the growth model (with or without preferential attachment), we expect older proteins to have higher degrees, since they have had more chances to connect with new proteins.
Figure 12 shows evidence supporting the growth model: older proteins clearly have higher degrees on average [4].

3.1.2 Empirical Evidence for Preferential Attachment

In the following experiment the authors examine the number of links a protein has to proteins in newer age groups, as a function of its number of links to proteins in older age groups. This analysis shows that proteins with higher degrees are more likely to gain new interactions. Figure 13 shows that proteins that started with more links gained more links. Thus the experiment provides evidence for preferential attachment. Moreover, the cumulative number of acquired links resembles a power-law function of k.

Further support for the preferential attachment model is given in [11]. That study considers pairs of paralogous proteins, assuming that interactions present in one protein but not in the other were added after the duplication took place. It then compares the chance of developing new interactions to the degree of the protein. Figure 14 shows an approximately linear association between protein degree and the chance of developing new interactions.

Figure 11: A schematic representation of the relative position of the four studied organisms on the phylogenetic tree [4].

Figure 12: Averaged connectivity for four age groups of yeast proteins. Groups are numbered in order of increasing age: group 1 proteins (those with no similarities in the fission yeast, Arabidopsis, or E. coli genomes) are expected to be the newest, and group 4 proteins (with similarities in all three organisms) are expected to be the oldest. Results are presented for the whole interaction database (solid symbols), and for a restricted set excluding the low-confidence interactions (open symbols). For most data points, the error bar is smaller than the symbol [4].

Figure 13: Preferential attachment in protein network evolution. The averaged number of links a protein acquires to proteins from newer groups as a function of $k$, its number of connections to all other (older) proteins (denoted $N(k)$). The solid lines are the integration over the values of $N(k)$; they resemble the power-law function $k^2$ (shown as a dashed line for comparison). Results have been obtained using the full interaction database. (a) New links to proteins from group 1 alone, as a function of the number of links in groups 2, 3 and 4. (b) New links to groups 1 and 2, as a function of the number of links in groups 3 and 4. (c) New links to groups 1, 2 and 3 for all group-4 proteins [4].

3.2 Link Dynamics

Until now we have discussed only one possible mechanism of evolutionary network development: the addition of new nodes with a fixed number of links. In biological networks this process is explained by gene duplication. But can network development be explained solely by gene duplication? According to [10], the effective rate of gene duplication is $8.3 \times 10^{-4}$ duplications per gene per Myr. However, examination of duplicate genes shows that they diverge in their interactions much faster (see Figure 15). This means that the divergence cannot be explained solely by the attachment of newly duplicated genes, suggesting that some additional process affects evolutionary development.
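For reference, the growth model with preferential attachment from Section 3.1 can be simulated in a few lines. The following is a minimal sketch under stated assumptions: the seed network is a complete graph on m + 1 nodes, each new node attaches to m distinct existing nodes, and a target is drawn with probability proportional to its current degree. The function name `grow_network` and its parameters are illustrative and are not taken from [4] or [11].

```python
import random


def grow_network(n_nodes, m, seed=0):
    """Grow a graph: each new node attaches m links, preferring high-degree targets."""
    rng = random.Random(seed)
    # Assumed seed network: a complete graph on m + 1 nodes (every degree is m).
    graph = {i: set(range(m + 1)) - {i} for i in range(m + 1)}
    # Each node appears in `endpoints` once per incident edge, so a uniform draw
    # from this list picks a node with probability proportional to its degree.
    endpoints = [i for i in graph for _ in graph[i]]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        graph[new] = set(targets)
        for t in targets:
            graph[t].add(new)
            endpoints.extend((t, new))
    return graph


# Nodes are numbered in order of arrival, so a lower number means an older node.
net = grow_network(n_nodes=200, m=2)
degrees = {node: len(nbrs) for node, nbrs in net.items()}
```

Comparing the average degree of low-numbered (old) and high-numbered (new) nodes in such a simulation is the analogue of the age-group comparison in Figure 12; under this model one would expect older nodes to have higher degrees on average.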

Figure 14: Empirical evidence for preferential attachment in protein interaction networks. The horizontal axis shows the degree $d$ of the protein. The vertical axis shows the likelihood $P_d$ that a protein of degree $d$ has evolved new interactions. There is a strong, approximately linear association between protein degree and the likelihood of evolving new interactions [11].

Figure 15: The effect of gene duplications on gene products that interact with proteins. Shortly after a gene duplication, the products $P$ and $P'$ of the duplicate genes will interact with the same proteins. Eventually, some or all of the common interactions will be lost, and new interactions may be gained by either protein. In the rightmost panel, protein $P$ has lost one interaction and gained a new interaction partner, whereas protein $P'$ has lost two interactions. If the number of common interaction partners is taken as a measure of functional overlap, then one of the functions of $P$ is also covered by $P'$, and vice versa [10].

Figure 16: (a) Histogram of the fraction of duplicate genes whose products have at least one interacting protein in common, as a function of $K_s$, the fraction of synonymous substitutions per synonymous site [8]. Gene pairs were grouped according to their $K_s$ values into bins of width 0.5 whose lower boundaries are indicated on the x-axis. The horizontal line labeled "random expectation" indicates the estimated probability that two proteins chosen at random from the entire network share an interaction partner. This estimate was obtained numerically by randomly choosing 1,000 pairs of proteins from the network. Duplicate gene pairs whose products interact with each other are included in the values shown here. Asterisks above a bar indicate that the number of duplicates with shared interactions is significantly different from the random expectation, as assessed by a $\chi^2$ test. (b) Mean and standard deviation of path lengths among products of duplicate genes in the same component, as a function of $K_s$. The solid and dotted horizontal lines indicate the mean and standard deviation of path lengths between two proteins chosen at random from the same subnet within the protein contact graph, as estimated by choosing 1,000 protein pairs at random [10].

3.2.1 Interaction Turnover

To investigate how quickly duplicate genes diverge, the number of common interactions of two duplicate proteins was measured with respect to the time at which the duplication took place. To estimate the time of duplication, the fraction of synonymous substitutions per synonymous site, denoted $K_s$, is examined [8]. Substitutions at synonymous sites do not affect the resulting protein, and are therefore under fewer evolutionary constraints than substitutions at nonsynonymous sites. As can be seen in Figure 16, the fraction of duplicates with shared interactions decreases very quickly after duplication ($K_s = 1$ corresponds to approximately 100 Myr).
We also see that almost immediately after duplication ($K_s = 0.5$) the mean path length between duplicates becomes similar to the mean distance in the network. These findings suggest that after a duplication, both duplicates lose some of their interactions.

3.2.2 Rate of Interaction Loss

To calculate the interaction loss rate, the author of [10] examines duplicate proteins with $K_s < 2$ and assumes that the diversification observed between them is caused by interaction loss. The resulting lower bound for the interaction loss rate is $2.3 \times 10^{-3}$ per Myr, obtained as the number of lost interactions divided by the number of interactions after duplication, divided by 200 Myr. The actual rate may be much higher, because (1) interactions lost by both proteins are disregarded, and (2) many duplicates are younger than 200 Myr. Notice, though, that this calculation rests on the assumption that the diversification is caused only by interaction loss, which might not be true.

3.2.3 Interaction Gain

In the previous section we examined interaction loss. Is there also a process of interaction gain? Figure 17a shows patterns of interactions between products of duplicate genes, which may have evolved through either interaction loss or interaction gain. Some of the patterns, including the most common one (two duplicate proteins that interact with each other but do not self-interact), are more easily explained by interaction gain than by loss (see Figure 17b).

3.2.4 Rate of Interaction Gain

To evaluate the interaction gain rate we look at interacting duplicates that do not self-interact. When processing the PPI network, the author of [10] finds that 13 out of 9059 duplicate pairs with $K_s < 5$ interact with each other while neither protein self-interacts (see Figure 17b). Assuming these interactions are a result of interaction gain, the author concludes that the interaction gain rate is approximately $10^{-2}$ new interactions per protein per Myr. Note that the calculations of interaction loss and gain rates in the past two sections can only suggest the order of magnitude of the actual rates.

3.2.5 Preferential Attachment in Interaction Gain

Interaction gain, like node addition in the growth model above, seems to follow preferential attachment. When observing all cross-interactions (interacting duplicates) and treating them as new attachments, we see that most interacting pairs are proteins with high degrees (Figure 18).
Based on this finding it was suggested [3] that:

(1) The higher a protein's degree, the more likely it is to gain a new interaction.
(2) The choice is asymmetric: one protein is chosen uniformly and the other according to its degree.

This would explain the correlation between the fraction of cross-interactions and $k + k'$, the sum of the degrees of the two proteins. If the choice were symmetric, we would expect the correlation to be with the product $kk'$.

3.3 Duplication vs. Link Dynamics

We now know that two main processes shape the PPI networks we are looking at: the duplication process, in which a node (protein) is duplicated together with its interactions, and link dynamics, in which edges are removed or new edges are added (see Figure 19). In [3] the authors examine the influence of the duplication process compared to the influence of link dynamics. Their conclusion is that the influence of duplication is negligible. If the duplication process had a substantial influence, we would expect each node (especially one with a high degree) to have a high fraction of pairs of neighbors that are products of gene duplication. Figure 20 shows that this is not the case: this fraction is small, and it does not increase with node degree.

Figure 17: Self-interactions and interactions between products of duplicate genes. (a) Self-interactions of genes with paralogs and interactions between duplicate genes may have evolved by two different routes. First, a gene product may have been a self-interactor before duplication. In this case, observed self-interactions and interactions between paralogs are a reflection of self-interaction before duplication. Second, the interactions may have evolved de novo after the duplication. (b) Number of paralogous gene pairs observed in the yeast protein interaction networks with the indicated combination of self- and cross-interaction. The last category of five duplicates, in which only one of the paralogs is self-interacting, involves 16 paralogous gene pairs, but 11 of them are redundant. Notice the abundance of duplicate pairs without self-interactions (13/25) and the small number of gene pairs (1/25) where both genes are self-interacting [10].

Figure 18: The histogram shows the fraction of cross-interactions versus the sum of the connectivities of the two participating proteins ($k + k'$). There is a strong correlation between the two quantities [3].
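The difference between a dependence on $k + k'$ and a dependence on $kk'$ can be made explicit with a short calculation. The following is a sketch under simplifying assumptions that are not part of [3]: a network of $N$ proteins with total degree $K = \sum_i k_i$ (both symbols introduced here for illustration), a new link formed between two proteins with degrees $k$ and $k'$, and the exclusion of self-pairs ignored.

```latex
\begin{align*}
% Asymmetric choice: one endpoint uniform (1/N), the other by degree (k/K);
% either protein may play either role.
P_{\text{asym}} &= \frac{1}{N}\cdot\frac{k'}{K} + \frac{1}{N}\cdot\frac{k}{K}
                 = \frac{k + k'}{N K}, \\
% Symmetric choice: both endpoints drawn by degree; two possible orders.
P_{\text{sym}}  &= 2\cdot\frac{k}{K}\cdot\frac{k'}{K}
                 = \frac{2\,k k'}{K^{2}}.
\end{align*}
```

Under these assumptions the asymmetric rule gives a linking probability proportional to $k + k'$, whereas a symmetric degree-weighted rule gives one proportional to $kk'$, so a correlation with the sum of the degrees points to the asymmetric choice.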

Figure 19: The elementary processes of protein network evolution. The progression of time is symbolized by arrows. (a) Link attachment, (b) link detachment and (c) gene duplication. Empirical data suggests that duplications occur at a much lower rate than link dynamics and that redundant links are lost subsequently (often in an asymmetric fashion), which affects the connectivity of the duplicate pair and of all its binding partners [3].

Figure 20: The histogram shows the fraction of duplicate pairs among the $k(k-1)/2$ neighbor pairs of a node of connectivity $k$, plotted versus $k$. A high number of duplicate pairs would be expected if duplications were a significant mechanism of link gain [3].
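The statistic plotted in Figure 20 can be computed directly from its definition. The sketch below assumes an adjacency-set graph and a set of known duplicate pairs stored as `frozenset` objects; the function name `duplicate_pair_fraction` and the toy data are hypothetical.

```python
from itertools import combinations


def duplicate_pair_fraction(graph, duplicates, node):
    """Fraction of the k(k-1)/2 neighbor pairs of `node` that are duplicate pairs."""
    neighbors = sorted(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs to examine
    n_pairs = k * (k - 1) // 2
    hits = sum(1 for a, b in combinations(neighbors, 2)
               if frozenset((a, b)) in duplicates)
    return hits / n_pairs


# Toy data (hypothetical): hub h has four neighbors, and only (a, b) are paralogs.
toy = {"h": {"a", "b", "c", "d"}, "a": {"h"}, "b": {"h"}, "c": {"h"}, "d": {"h"}}
dups = {frozenset(("a", "b"))}
frac = duplicate_pair_fraction(toy, dups, "h")
```

Averaging this fraction over all nodes of a given connectivity k would give one bar of a histogram like the one described in the caption.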

4 Summary

In section 2 we introduced the concept of topological robustness and showed that scale-free networks, which biological networks resemble, are more robust to random failures than exponential random networks but more vulnerable to attacks, and that this behavior is rooted in the existence of hubs. Then we showed that in biological networks hubs tend to be more essential. We divided the hubs into two types, party hubs and date hubs, and examined the differences between them.

In section 3 we examined different models of network evolution. We started with a model based on network growth with preferential link attachment and showed supporting evidence for this model. Then we introduced a model based on link dynamics, and showed that it is better suited to explain the divergence between duplicate genes. Finally, we showed that link dynamics processes are more dominant than gene duplication in network evolution.

References

[1] R. Albert, H. Jeong, and A.L. Barabasi. Error and attack tolerance of complex networks. Nature, 406, 2000.
[2] A.L. Barabasi and E. Bonabeau. Scale-free networks. Scientific American, 288:60-69, 2003.
[3] J. Berg, M. Lassig, and A. Wagner. Structure and evolution of protein interaction networks: A statistical model for link dynamics and gene duplications. BMC Evolutionary Biology, 4, 2004.
[4] E. Eisenberg and E.Y. Levanon. Preferential attachment in the protein network evolution. Physical Review Letters, 91, 2003.
[5] M.W. Hahn and A.D. Kern. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Molecular Biology and Evolution, 22:803-806, 2005.
[6] J.D. Han, N. Bertin, T. Hao, D.S. Goldberg, G.F. Berriz, L.V. Zhang, D. Dupuy, A.J. Walhout, M.E. Cusick, F.P. Roth, and M. Vidal. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430:88-93, 2004.
[7] H. Jeong, S.P. Mason, A.L. Barabasi, and Z.N. Oltvai. Lethality and centrality in protein networks. Nature, 411:41-42, 2001.
[8] W.H. Li. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Molecular Evolution, 36, 1993.
[9] H.W. Mewes, J. Hani, F. Pfeiffer, and D. Frishman. MIPS: a database for protein sequences and complete genomes. Nucleic Acids Research, 26(1):33-37, 1998.
[10] A. Wagner. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Molecular Biology and Evolution, 18:1283-1292, 2001.
[11] A. Wagner. How the global structure of protein interaction networks evolves. Proc. Biol. Sci., 270:257-266, 2003.