Long Term Evolution of Networks CS224W Final Report


Jave Kane (SUNET ID javekane), Conrad Roche (SUNET ID conradr)
Nov. 13, 2014

Introduction and Motivation

Many studies in social networking and computer science have investigated the evolution of large networks on time scales of one to ten years. Less well investigated is the long-term evolution of large networks and the stability of communities when the external forces generating these structures are themselves evolving. We investigate collaboration networks generated from American Physical Society (APS) article metadata. We study the correlation between author interests as represented by the APS PACS fields. We then simulate the APS network using the Kronecker and Forest Fire models, which turn out to be difficult to use with APS. Therefore, we create a third, simple model of a collaboration network based on a few intuitive rules about how coauthors are chosen. This model produces a network with a power law degree structure and a clustering coefficient similar to APS. The effective diameter is systematically lower than for APS, and we explore why. Overall the results suggest APS is a collection of tightly bound communities with few edges joining them. Finally, we briefly investigate longer-term evolution in two what-if scenarios.

Summary & Critique of selected papers

Microscopic Evolution of Social Networks, Leskovec et al. Analyzes the evolution of four social networks at the microscopic level. Shows that edge creation for a node seems unaffected by its age, but is proportional to its degree.

Evolution of the social network of scientific collaborations, Barabási et al. Investigates structural properties of two collaboration networks. For both networks, the degree distribution has a power law tail, with different exponents for the two networks.

Tracking the Evolution of Communities in Dynamic Social Networks, Greene & Doyle. A model for tracking user communities in dynamic networks where the evolution of each community is determined by a set of significant events.

Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, Leskovec et al. A graph generator that obeys static properties and temporal evolution patterns of real-life networks and is mathematically tractable. Generated graphs exhibit a multinomial degree distribution and a multinomial eigenvalue distribution.

Processing the APS Data and Creating Network Snapshots

The corpus of Physical Review Letters, Physical Review, and Reviews of Modern Physics contains metadata for 541,448 articles dating back to 1893 from 12 APS journals described in Table 1. Pre-processing the data to generate network snapshots is a significant effort;

after considerable investigation we settled on the following. We select only articles that have the authors and names fields, thus eliminating preambles and commentaries, and that list first names, thus eliminating collaborations (with an average of 331 authors per article). We build 121 yearly snapshots (1893 to 2013) by scanning the articles in date order. We add an author as a node at the time of her first article. For each article, we add a coauthor edge for each pair of authors. We accumulate the time-dependent interest vector for each author as the sum of the PACS vectors of all their papers.

Identifying authors is very difficult. An author may be listed with various first names or initials. Some names are very common, e.g. Lee and Brown. The affiliations field is not helpful; its format varies even for the same author. Furthermore, authors often change institutions. Misidentification means authors may map one-to-many or many-to-one to nodes, spuriously creating or omitting edges. The final (2013) network has 2.7e5 nodes and 5.7e6 edges.

Table 1. The APS journals

Full Title (Year): Description
Physical Review: Original journal
Physical Review A: Atomic, molecular, optical, quantum information
Physical Review B: Condensed matter and materials
Physical Review C: Experimental and theoretical nuclear
Physical Review D: Elementary particle, field theory, gravitation, cosmology
Physical Review E: Collective phenomena of many-body systems
PR Series I
Physical Review Letters: All fields, short letters, important research
PR Special Topics: Accelerators and Beams
PR Special Topics: Education
PR X: all areas: pure, applied, interdisciplinary
Reviews of Modern Physics (1929): broad fundamental current trends and applications

We use the APS PACS classification field to assign interest vectors to authors. The 10 PACS areas were introduced in 1975, e.g. (10) The Physics of Elementary Particles and Fields; (70) Condensed Matter: Electronic Structure, Electrical, Magnetic, and Optical Properties. For each article we parse the PACS code down to a broad category, so that a code of the form 7y.xx maps to its leading-digit category. A fraction of the articles have no PACS code; these are mostly older articles, but they persist well into the 1980s. We tested using journal-averaged PACS in these cases but felt this procedure was too complicated and could obscure the network of interest, so we assign a zero PACS vector instead, and for analyzing the interests of communities we average over the normalized non-zero vectors.

Community Detection and PACS Interest Vectors

To detect communities in the APS network snapshots and investigate their correlation with the accumulated PACS vectors of authors, we tried the Girvan-Newman (GN) and Clauset-Newman-Moore (CNM) algorithms (fastcommunity_mh in C). GN appears intractable, while CNM takes 18 hours on the 2013 network. We considered BigCLAM and settled on Louvain, which takes less than a minute on both the APS and simulated networks and generates intuitively reasonable communities in both cases. We also tested Louvain with edge weighting on our simulated network; we found it was too slow. Figure 1 shows the distribution of PACS for each of the top 5 communities by size in the 2010, 2011, 2012

and 2013 snapshots. The largest community (blue bars in all plots) is always specialized in Condensed Matter Physics (PACS categories 7 and 8), but the continuity of the other communities is less clear. All communities grow in size with time, but in addition Louvain unpredictably splits communities with similar interests that might intuitively be combined into one. Communities also change order in the ranking by size. To match communities from one year to the next, we computed two Jaccard-type similarities on the author (node) ids, and also cosine similarities between the mean normalized total PACS vectors (summed over authors) within each community. Because a very small community U (of, say, 200 nodes) could become split off from a much larger community V that contained it a year earlier, we computed both $J_1 = |U \cap V| / |U \cup V|$ and $J_2 = |U \cap V| / \min(|U|, |V|)$. We found that both $J_1$ and $J_2$ were well correlated with the cosine similarities of the groups. Since our goal here was only to confirm a supposed link between communities and interests, we did not attempt to compute exhaustive statistics for the database, but for example $J_1 = 72\%$ and $J_2 = 88\%$ both match community 3 (green) in 2012 with community 2 (light blue) in 2012; the interest cosine similarity is 99.76%. These results suggest APS communities are tight-knit and identifiable by similar PACS interests.

Figure 1. Fraction of total community PACS vector in each category for the top 5 Louvain APS communities by size for the years 2010, 2011, 2012 and 2013. The legend shows the total number of nodes in each group.
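The two set similarities and the interest cosine similarity used for matching communities across years can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the 200-versus-1000 author example are assumptions chosen to show why $J_2$ is useful when a small community splits off from a large one.

```python
import math


def jaccard_union(u, v):
    """J1 = |U ∩ V| / |U ∪ V|, the standard Jaccard similarity."""
    u, v = set(u), set(v)
    union = u | v
    return len(u & v) / len(union) if union else 0.0


def jaccard_min(u, v):
    """J2 = |U ∩ V| / min(|U|, |V|); equals 1 when one set contains the other."""
    u, v = set(u), set(v)
    smaller = min(len(u), len(v))
    return len(u & v) / smaller if smaller else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between two interest vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical example: a 200-author community that split off from a
# 1000-author community which contained it a year earlier.
small = range(200)
large = range(1000)
j1 = jaccard_union(small, large)  # 200 / 1000
j2 = jaccard_min(small, large)    # 200 / 200
```

In this example $J_1$ is low even though every member of the small community came from the large one, while $J_2$ reports a complete match, which is the motivation for computing both.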

Models of the APS Network

Forest Fire Model

The Forest Fire Model [3] generates evolving graphs by "burning" through an existing graph. It has two parameters, the forward and backward burning probabilities p and r. Each arriving node v attaches to an ambassador node w chosen uniformly at random, then selects a subset of the out- and in-neighbors of w, with the subset sizes drawn from geometric distributions based on p and r. The node v then creates edges to the selected nodes. This is repeated recursively on the selected nodes until no nodes are left, with any node visited only once. The resulting graphs exhibit densification and a power law degree distribution. For the APS network, the forest fire model can be viewed as a new author selecting a primary coauthor, then selecting more coauthors recursively.

The forward and backward burning probabilities were determined for each of the APS snapshots. A simple least-squares goodness-of-fit measure was used to determine the probabilities for which the model most closely matches the properties of the actual network. The representative properties used were node count, edge count, approximate diameter, maximum WCC size, clustering coefficient and maximum node degree. The probabilities for the best fits lay in a sweet spot from 0.2 upward. The forest fire model with the best fit for all the snapshots had a forward burning probability p of 0.33, together with a fitted backward burning probability r. The basic Forest Fire model generates a graph with a single component: ignoring edge direction, one can navigate from any node to any other node. Thus, the maximal WCC is the entire network. This is not true for APS, however (Figure 4), so the forest fire model did not correctly model the maximum WCC size.

Figure 2. Graph properties over time for the Forest Fire models. The orphan model fits the initial years of the graph, while the basic forest fire model fits the later years of the graph.
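The burning process described for the basic Forest Fire model can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used for the fits: both subset sizes are drawn as geometric counts with means p/(1-p) and r/(1-r), the value r = 0.3 is a hypothetical placeholder (the fitted value is not given above), and the function names are invented for the example.

```python
import random
from collections import deque


def geometric(prob, rng):
    """Count successes before the first failure; the mean is prob / (1 - prob)."""
    count = 0
    while rng.random() < prob:
        count += 1
    return count


def forest_fire(n_nodes, p, r, seed=0):
    """Grow a directed graph; returns {node: set of nodes it links to}."""
    rng = random.Random(seed)
    out_nbrs = {0: set()}
    in_nbrs = {0: set()}
    for v in range(1, n_nodes):
        ambassador = rng.randrange(v)  # uniform over the existing nodes
        visited = {ambassador}
        frontier = deque([ambassador])
        while frontier:
            w = frontier.popleft()
            fwd = sorted(x for x in out_nbrs[w] if x not in visited)
            bwd = sorted(x for x in in_nbrs[w] if x not in visited)
            rng.shuffle(fwd)
            rng.shuffle(bwd)
            # Burn a geometric number of forward and backward neighbors of w.
            burned = fwd[:geometric(p, rng)] + bwd[:geometric(r, rng)]
            for x in burned:
                if x not in visited:
                    visited.add(x)
                    frontier.append(x)
        # The new node links to the ambassador and to every burned node.
        out_nbrs[v] = set(visited)
        in_nbrs[v] = set()
        for x in visited:
            in_nbrs[x].add(v)
    return out_nbrs


# p = 0.33 is the fitted forward probability; r = 0.3 is a placeholder.
graph = forest_fire(n_nodes=500, p=0.33, r=0.3)
edge_count = sum(len(targets) for targets in graph.values())
```

Because every arriving node links to at least its ambassador, the sketch always yields a single weakly connected component, which is the property that prevents the basic model from matching the APS maximum WCC size.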
As suggested in [3], one way to generate a graph which contains multiple weakly connected components is the orphan model. Here, with orphan probability op, an arriving node v does not establish an edge with the existing graph. This leads to many orphan nodes which will eventually form edges

with new nodes. The forest fire orphan model with the best fit for all the snapshots had a forward burning probability p of 0.27, a backward burning probability r of 0.28 and an orphan probability of 0.3. The Forest Fire orphan model fit the initial evolution of APS well, but the basic model fit the latter part of the evolution better. The clustering coefficients in both models remain flat for most of the evolution. The APS network, on the other hand, has a steep initial increase in the clustering coefficient, which then levels off as the number of nodes increases. This makes it difficult to fit the forest fire model to the data.

Stochastic Kronecker Graph model

The self-similar Stochastic Kronecker graph [4] is generated recursively, starting with an $N_1 \times N_1$ probability matrix ($N_1 = 2$ in our case) and computing its $k$-th Kronecker power. Edges are created according to the probability in the corresponding entry of the resulting matrix. For APS, self-similarity would imply that the network amongst authors is similar (but not exactly the same) across the network. This makes sense, as authors in different fields would exhibit similar collaborative behavior. Unlike Forest Fire, stochastic Kronecker is undirected. It exhibits a power law degree distribution, small diameter and densification. Another interpretation of Stochastic Kronecker graphs, as described in [4], is to associate each node with a set of features, where the probability of an edge depends on the feature similarities. For APS the features are the authors' (research) interests.

We used the KronFit approach [4] to fit the APS network snapshots and determine the initiator matrix. We chose $N_1 = 2$, since larger $N_1$ did not radically improve the fit, as noted in [4]. The matrix varied for each of the snapshots. The diagonal values in the matrices were nearly the same across the years. The median value of the matrix elements in the yearly snapshots was chosen as the initiator Θ for the entire time series.

Figure 3.

The Stochastic Kronecker graph modeled the edge growth and the effective diameter well. It could not model the behavior of the maximum SCC or of the clustering coefficient, which fell steeply with network growth in the model, while the clustering coefficient of the APS network rose over the latter half of the period.
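How edge probabilities arise from the Kronecker power of the initiator can be made concrete with a small sketch. It is an illustration only: the initiator values below are hypothetical (the fitted APS initiator is not reproduced here), the function names are invented, and the sampler is a naive loop over all node pairs rather than the fast generators used in practice.

```python
import random


def kron_edge_prob(theta, k, u, v):
    """Entry (u, v) of the k-th Kronecker power of the initiator theta."""
    n1 = len(theta)
    prob = 1.0
    for _ in range(k):
        prob *= theta[u % n1][v % n1]  # one base-n1 digit of u and v per level
        u //= n1
        v //= n1
    return prob


def sample_kronecker(theta, k, seed=0):
    """Sample an undirected graph on n1**k nodes; assumes theta is symmetric."""
    rng = random.Random(seed)
    n = len(theta) ** k
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < kron_edge_prob(theta, k, u, v)]


# Hypothetical symmetric 2x2 initiator; NOT the fitted APS values.
THETA = [[0.9, 0.5],
         [0.5, 0.2]]
edges = sample_kronecker(THETA, k=8)  # 2**8 = 256 nodes
```

Each node index is read as a string of k base-$N_1$ digits, which corresponds to the feature interpretation mentioned above: two nodes are likely to connect when their digits select large initiator entries at every level.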

Simple Model

We have also built a simple model for generating an APS-like network. The key element is a small set of intuitive rules for choosing coauthors that account for the interests of the authors, where interest means any external influence, especially funding or popular support for particular subfields of physics. A rule is chosen randomly using the weighted distribution shown in Table 2. Through extensive experimentation, these rules were found to be both effective in controlling network properties and feasible for APS-size graphs.

Table 2. Intuitive rules for choosing coauthors in the simple model.

Coauthor selection rule (typical relative weighting)
1. complete a triangle where nodes have the same interest (23)
2. seek a high degree node with same interest (70)
3. seek any node with same interest (6)
4. seek a high degree node (neighbor of random node with any interest) (1e-3)
5. random node (any degree, any interest) (0.4)

Notably, Rule 4, essentially an edge rewiring rule, reduces the effective diameter in the model shown here from around 8 to an APS-like value of around 5 to 6. In the model shown here, for simplicity of exposition each author has one of three possible scalar interest values (1, 2, or 3). We also tried as many as 20 scalar interests.

For the model run shown here, the network G = (V, E) is initialized with 4 nodes in each of three equal-size complete subgraph communities C1, C2 and C3, where every node in Ci has a single scalar interest i. The subgraphs are connected in a ring (three additional edges joining pairs of subgraphs). The model is run for 1440 time steps (representing 120 years times 12 months). Each month $n = \max(1, rN)$ new authors (nodes) are added to the network, where the rate r = 0.058/12 and $N = |V|$. Each new author is assigned an interest value 1, 2, or 3 with equal probability. The top right panel of Figure 4 shows an excellent quadratic fit to the number of edges vs. the number of nodes in the APS Wcc, so we stipulate that the number of new papers ΔP is related to the number of new nodes ΔN at each step as $\Delta P = (1 + 2\alpha N^{2})\,\Delta N$; the constant α = 3e-3 is tuned along with the weighting for the number of coauthors. Each new paper has m coauthors in addition to the first author, where m is chosen randomly for each paper separately using the weighting [0.35, 0.15, 0.10, 0.10, 0.05] for m = [1, 2, 3, 4, 5]. This weighting influences the rate at which edges appear.

All n new authors are placed in an initially empty queue q of first authors waiting for coauthors, and then $s = |q| - n$ existing first authors are also added to the list. To choose an existing first author for q, an existing node v is chosen randomly from the network. With 10% probability, v is added to the list; with 90% probability, a random neighbor of v is added to the list instead. The latter implements a preferential "well-connected get more publications" mode. Once q is assembled, each v in q is assigned its own $m(v) \ge 1$ coauthors by the rules in Table 2. Because the initial network is connected and every paper has at least one coauthor, the network remains connected.

Using a profiling utility in Python, we found that careful approximations to the rules reduced the running time to generate the full APS-size network from several hours to two minutes. For example, at first we used sampling routines in the Python stats package to generate a list of

candidate nodes for Rule 1, from which we randomly downselected. However, the profiler showed this and similar approaches are very expensive. Therefore, in Rule 1, if the chosen neighbor of v or its neighbor does not have the same interest as v, we simply iterate and choose a rule again (possibly Rule 1). This means the rules are not applied with exactly the distribution shown in Table 2. However, because the edges are sparse, $|E| \ll N(N-1)/2$, and the rules tend to surround nodes with neighbors of similar interest, the rules usually find a candidate, so we expect the approximations to work well. The improved speed made it possible to test many concepts for the model and to vary the parameters.

Results of Simple Model

As the top left panel of Figure 4 shows, the APS network is significantly disconnected until the 1940s, when the network has only a few thousand nodes. Therefore we compare simulated results to the maximum weakly connected component (Wcc) of APS. The left panel on the second line of Figure 4 shows the number of edges versus the number of nodes. The simple model produces an APS-like network with the following features.

1) clustering coefficient C: close to APS value of 6.1: second line, right panel of Figure 4.
2) power law degree distribution similar to APS: third line of Figure 4.
3) effective diameter $D_e$ and # of nodes at a given # of hops similar to APS: bottom line.

Surprisingly, the effective diameter $D_e$ is the measure most sensitive to the input parameters. As the left panel on the bottom line of Figure 4 shows, for the model run shown the final $D_e$ is closer to 4 than to the final $D_e$ of about 5 to 6 in the APS Wcc, although the two may be converging. We note that the full diameter of the model is very similar to the APS Wcc $D_e$.

One way a graph can have high C and high $D_e$ is if it has dense communities with tendrils. For example, a complete graph $G_C$ on n nodes has C = 1. If a chain (line) subgraph $G_L$ of m nodes is attached by one edge to one node v of $G_C$, then every node i in $G_L$ has $C_i = 0$. The full diameter of the graph is $D = m + 1$. Every node j in $G_C$ except v has $C_j = 1$. The node v has $C_v = \frac{2 e_v}{k_v (k_v - 1)}$, where $k_v = (n - 1) + 1 = n$ is the degree of v and $e_v = \frac{(n-1)(n-2)}{2}$ is the number of edges between pairs of neighbors of v, i.e. between the $n - 1$ neighbors of v in $G_C$. Thus $C_v = \frac{2 e_v}{k_v (k_v - 1)} = \frac{(n-1)(n-2)}{n(n-1)} = \frac{n-2}{n}$. Therefore for the entire graph, the clustering coefficient is

$$C = \frac{1}{m+n}\left[ m \cdot 0 + \frac{n-2}{n} + (n-1) \cdot 1 \right] = \frac{1}{m+n}\,\frac{n^{2}-2}{n} = \frac{1}{D+n-1}\,\frac{n^{2}-2}{n}.$$

For a given D, $\lim_{n \to \infty} C = 1$, i.e. C can be arbitrarily close to 1. We suspect this type of structure is present in APS, but it would not result from the simple rules in Table 2. (This would be interesting further work.)

Notably, APS has a cloud of high-frequency, high-degree counts above the power law (third line, left panel of Figure 4) that the simple model does not reproduce. These could be explained as small, highly connected communities nearly detached from the main network, and could raise $D_e$. (These may be specialty institutions or fields; this would be worth investigating further.) In general, by varying parameters in the simple model it is easy to improve the match on any of criteria 1) to 3) individually (not shown), such as the number of nodes at degree 1 or the tail of single-count, high-degree nodes. We have made no systematic attempt to find a simultaneous good match on all 3 criteria. However, given the simplicity of our rules, these results suggest that simple intuitive rules can robustly explain the APS network.
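The clique-with-a-chain argument (high clustering together with a large diameter) can be checked numerically. The sketch below is an illustration with invented function names: it builds a complete graph on n nodes with a chain of m nodes attached to one clique node, computes the average local clustering coefficient directly, and compares it with the closed form $(n^2 - 2) / (n(m + n))$, which follows from the per-node values $C_i = 0$, $C_j = 1$ and $C_v = (n-2)/n$. It assumes n >= 3 and m >= 1.

```python
from itertools import combinations


def clique_with_chain(n, m):
    """Complete graph on nodes 0..n-1 with a chain of m nodes hung off node 0."""
    adj = {i: set() for i in range(n + m)}
    for a, b in combinations(range(n), 2):
        adj[a].add(b)
        adj[b].add(a)
    prev = 0  # the attachment node v
    for c in range(n, n + m):
        adj[prev].add(c)
        adj[c].add(prev)
        prev = c
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def closed_form(n, m):
    """Average clustering from the derivation: (n^2 - 2) / (n (m + n))."""
    return (n * n - 2) / (n * (m + n))


measured = average_clustering(clique_with_chain(n=10, m=5))
predicted = closed_form(n=10, m=5)
```

Holding m fixed (and hence the diameter D = m + 1) while increasing n pushes `closed_form(n, m)` toward 1, in line with the limit stated above.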

Figure 4. Results of the simple model. Top left: APS # Nodes vs. Year. Top right: Fit to APS # Edges vs. # Nodes. 2nd line left: # Edges vs. # Nodes. 2nd line right: Global Clustering Coefficient. 3rd line left: degree distribution for APS. 3rd line right: same for simple model. Bottom left: Diameter. Bottom right: distribution of nodes versus number of hops.
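The random choices that drive the simple model (which coauthor rule to apply, how many coauthors a paper gets, how many authors join each month, and which existing author publishes again) can be sketched as below. This is a minimal illustration rather than the authors' implementation: the rule names, the function names, and the truncation of rN to an integer are assumptions, while the weights, the rate 0.058/12 and the 10%/90% split come from the model description.

```python
import random

# Relative weights from Table 2 (rule names are invented labels).
RULE_WEIGHTS = {
    "close_triangle_same_interest": 23,
    "high_degree_same_interest": 70,
    "any_same_interest": 6,
    "high_degree_any_interest": 1e-3,
    "random_node": 0.4,
}
COAUTHOR_COUNTS = [1, 2, 3, 4, 5]
COAUTHOR_WEIGHTS = [0.35, 0.15, 0.10, 0.10, 0.05]
MONTHLY_RATE = 0.058 / 12


def choose_rule(rng):
    """Pick a coauthor-selection rule according to the relative weights."""
    rules = list(RULE_WEIGHTS)
    return rng.choices(rules, weights=[RULE_WEIGHTS[name] for name in rules])[0]


def choose_coauthor_count(rng):
    """Pick m, the number of coauthors in addition to the first author."""
    return rng.choices(COAUTHOR_COUNTS, weights=COAUTHOR_WEIGHTS)[0]


def new_author_count(num_nodes):
    """n = max(1, r N); truncating r N to an integer is an assumption."""
    return max(1, int(MONTHLY_RATE * num_nodes))


def pick_existing_first_author(adjacency, rng):
    """10%: a uniformly random node; 90%: a random neighbor of that node."""
    v = rng.choice(list(adjacency))
    if rng.random() < 0.10 or not adjacency[v]:
        return v
    return rng.choice(adjacency[v])


rng = random.Random(1)
example_rule = choose_rule(rng)
example_m = choose_coauthor_count(rng)
```

Picking a random neighbor of a random node favors high-degree nodes, because a node with many edges is a neighbor of many others; this is what gives the "well-connected get more publications" behavior without maintaining an explicit degree-sorted list.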

Long term evolution of simulated network

We have also used the simple model to investigate longer-term evolution of a collaboration network in two what-if scenarios. The type of scenario we envision is a significant change in the funding levels for two subfields of physics research. An example might be an increase in funding for Condensed Matter as it becomes the dominant profitable field in physics, and a decrease in funding for nuclear physics as it loses political support. To mock up these scenarios, at 1200 months (out of 1440 months) we impose a change in the interest attributes of new nodes and of existing nodes publishing new papers. At 1200 months the network contains about 1/3 the number of nodes it does at 1440 months. The change in interest mocks up an increase in the fractional funding to Interest 1 from 33% to 60% of total physics funding, and a decrease in funding for Interest 3 from 33% to 5%. New authors entering the network are assigned interests according to the changed interest distribution, while existing first authors who currently have Interest 3 reassign their interest according to the new interest distribution when they publish a new paper. This mocks up the effect of existing authors possibly bailing out of a declining field. We run Louvain community detection on yearly network snapshots.

The result of this model is that the clustering coefficient increases, while the effective diameter stays nearly constant. This is a somewhat surprising result. A change in author interest might be expected to generate more long-distance connections, since an existing author who changes interest would tend to have different interests from its neighbors; this would not tend to decrease the clustering coefficient, but would decrease the diameter. As a fraction of the total number of authors in the network, C1 increases in size in response to increased funding, while C2 stays constant in size. C3 changes in several ways: it shrinks in fractional size by a factor of 3; its average interest changes from 3 to 1.6; and the fractions of authors in Community 3 with interests 1, 2 and 3 each become about 1/3. This result suggests that a fairly simple change in the funding input to a collaboration network can lead to the disruption of a previously stable community in a short amount of time. Since existing edges (papers published on earlier interests) are not removed in this simple model, it is interesting that the new edges come to dominate the structure. However, this result was obtained for the case where the network continued to grow; in this case new authors entering the network may dominate the structure.

To address the last point, we simulate a case where we stop adding nodes to the network at 1200 months, and where existing authors with Interest 3 change their interest according to the prescription described above when they publish a paper. By 1440 months the effective network diameter drops to 5 and the clustering coefficient actually decreases to 0.46, because many existing nodes that had Interest 3 have started new triangles that are not yet completed. The number of edges in the network has increased from 1.8e6 to 3.9e6 (compared to ~5.9e6 in the case where new nodes are added). At 1440 months the Interest 3 community from 1200 months has significantly changed: the community is only 5% Interest 3, and 51% Interest 1. This result is not surprising, but suggests that the rapid transformation of communities takes place among both existing nodes and new nodes.
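The funding-shift scenario can be expressed as a change in the distribution from which interests are drawn. The sketch below is illustrative only, with invented function names: the 60% and 5% shares for Interests 1 and 3 and the switch at month 1200 come from the scenario description, while the 35% share for Interest 2 is inferred as the remainder and is therefore an assumption.

```python
import random

SHIFT_MONTH = 1200
INTERESTS = [1, 2, 3]
WEIGHTS_BEFORE = [1 / 3, 1 / 3, 1 / 3]
# 60% and 5% are from the scenario; 35% for Interest 2 is the inferred remainder.
WEIGHTS_AFTER = [0.60, 0.35, 0.05]


def draw_interest(month, rng):
    """Interest assigned to an author entering the network in a given month."""
    weights = WEIGHTS_AFTER if month >= SHIFT_MONTH else WEIGHTS_BEFORE
    return rng.choices(INTERESTS, weights=weights)[0]


def interest_when_publishing(current, month, rng):
    """After the shift, a first author with Interest 3 redraws its interest."""
    if month >= SHIFT_MONTH and current == 3:
        return draw_interest(month, rng)
    return current


rng = random.Random(0)
redrawn = [interest_when_publishing(3, 1300, rng) for _ in range(10000)]
share_still_3 = redrawn.count(3) / len(redrawn)
```

Under these assumptions only about one in twenty of the redraws keeps Interest 3, so authors in the declining field move to the other interests as soon as they publish again, even though their earlier edges remain in the graph.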

Conclusions and Further work

The forward and backward probabilities of the forest fire model that fit each of the yearly snapshots varied from year to year. We could consider a model where the probabilities are a function of the number of nodes in the graph, decreasing as the number of nodes increases. An alternative approach could be a model where the orphan probability decreases as the graph grows, so that it reduces to zero once the graph reaches a certain size.

We would like to add more features to our simple model of network generation. Nodes and edges could age, and be refreshed by new papers; edges could have strengths based on the cosine similarity of the endpoint interests. We have created and run models with these features (shown in the milestone report) but have not presented results here. External changes on collaboration networks could also include influence/outbreak, where author interests spread by contact edges. However, from the viewpoint of creating a simple, realistic model, it is not obvious that such effects are as important as more practical concerns like funding levels and the political/social palatability of particular areas of research.

The results for extended 'longer term' evolution with the simple model are interesting, but not too surprising, because APS consists of tight communities with few links between them, i.e. as a whole it is barely a network. The average number of edges from a node to a node in a different community is much less than one. This is not astonishing; coauthoring a paper is a much more involved undertaking and commitment than friending someone or liking their post. The main conclusion is that the APS network is formed by authors choosing like-minded (or like-funded) coauthors that are well-connected.

References

[1]
[2] A. Clauset, M.E.J. Newman and C. Moore, "Finding community structure in very large networks." Phys. Rev. E 70 (2004).
[3] J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[4] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research (JMLR).

Tasks/Roles

C. Roche: compiled Louvain for Mac; ran and analyzed Forest Fire and Kronecker models.
J. Kane: formulated project goals; supervised project; preprocessed APS data; created network snapshots; performed Louvain community detection and analysis; built and ran simple model and analyzed output; did long-term evolution runs; assembled final report and wrote 80% of it.
