Rethinking Network Structure Kristina Lerman USC Information Sciences Institute Università della Svizzera Italiana, December 16, 2011
Measuring network structure: central nodes, community structure, strength of ties. [Zachary, J. Anthro. Research 33 No. 4. (1977)]
Measuring network structure. SNA metrics examine network topology to measure structure. Centrality: degree, Katz score [Katz, 1953], betweenness [Freeman, 1977], eigenvector [Bonacich, 1987], PageRank [Brin et al, 1998]. Community detection: dozens of algorithms to partition a network into groups; 400+ references in a 2010 review of community detection. Strength of ties: neighborhood overlap as a measure of tie strength [Granovetter, 1973]. Claim: the nature of interactions between nodes affects how we measure network structure, with consequences for network analysis metrics and algorithms.
Types of interactions. Two classes of interactions between network nodes. Conservative (one-to-one): phone calls, money transfer, web surfing; modeled by a random walk; transfer matrix ~ $D^{-1}A$. Non-conservative (one-to-many): epidemics, information diffusion, innovation adoption; modeled by a contact process; transfer matrix ~ $A$. [Figure: the 5-node example network, shown once for each class.]
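Matrix formulation. [Figure: the 5-node example network.] Adjacency matrix of the network: $A = \begin{pmatrix} 0&1&0&0&0 \\ 0&0&1&1&0 \\ 0&0&0&1&1 \\ 0&0&0&0&1 \\ 0&1&0&0&0 \end{pmatrix}$. Out-degree matrix: $D = \begin{pmatrix} 1&0&0&0&0 \\ 0&2&0&0&0 \\ 0&0&2&0&0 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \end{pmatrix}$.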
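The matrix formulation can be sketched in a few lines of plain Python. The edge list below is an assumed reading of the slide's adjacency matrix (row i lists the out-links of node i), and the helper names `adjacency`, `out_degrees`, and `transfer_matrix` are illustrative, not taken from the talk.

```python
# Directed edges (source, target) of the 5-node example; an assumed reading
# of the adjacency matrix on the slide, with 1-based node labels.
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (5, 2)]
N = 5


def adjacency(edges, n):
    """Return the n x n adjacency matrix A, with A[i][j] = 1 for an edge i -> j."""
    a = [[0] * n for _ in range(n)]
    for src, dst in edges:
        a[src - 1][dst - 1] = 1  # shift 1-based labels to 0-based indices
    return a


def out_degrees(a):
    """Return the diagonal of the out-degree matrix D (row sums of A)."""
    return [sum(row) for row in a]


def transfer_matrix(a):
    """Return T = D^{-1} A: each node splits its amount evenly over its out-links."""
    degs = out_degrees(a)
    n = len(a)
    return [[a[i][j] / degs[i] if degs[i] else 0.0 for j in range(n)]
            for i in range(n)]


A = adjacency(EDGES, N)
D = out_degrees(A)
T = transfer_matrix(A)
```

Every row of `T` sums to 1 here because each node in this example has at least one out-link; a node without out-links would get an all-zero row in this sketch.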
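Conservative interactions. At time $t$, each node receives amount $\Delta(t)$. At the next time step, the node retains a fraction $(1-\alpha)$ of the amount it received and divides the rest among its neighbors. Transfer matrix $T$: amount given to each neighbor; $T = D^{-1}A$ when the amount is evenly divided among neighbors. At $t=0$: $\Delta(0) = w_0$. At $t=1$: $\Delta(1) = \alpha\Delta(0)T = \alpha w_0 T$. At $t=2$: $\Delta(2) = \alpha\Delta(1)T = \alpha^2 w_0 T^2$. At general $t$: $\Delta(t) = \alpha\Delta(t-1)T = \alpha^t w_0 T^t$. Accumulated amount: $w(t) = (1-\alpha)\sum_{k=0}^{t-1}\Delta(k) + \Delta(t) = (1-\alpha)w_0 + \alpha\, w(t-1)\,T$.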
Steady state of conservative dynamic process. At time $t$, each node receives amount $\Delta(t)$; at the next time step, it retains a fraction $(1-\alpha)$ of the amount it received and divides the rest among its neighbors, with transfer matrix $T = D^{-1}A$ when the amount is evenly divided. As $t \to \infty$: $w = (1-\alpha)w_0 + \alpha\, w\, T$, so $w = (1-\alpha)\,w_0\,(I - \alpha T)^{-1}$.
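A minimal sketch of the conservative process, assuming the recurrence reads $w(t) = (1-\alpha)w_0 + \alpha\,w(t-1)\,T$ with row vectors and $T = D^{-1}A$. The adjacency matrix is the 5-node example as read from the slides; the uniform `W0` and `ALPHA = 0.85` are illustrative choices, not values from the talk.

```python
# 5-node example graph (assumed reading of the slide's adjacency matrix).
A = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]


def vec_mat(v, m):
    """Multiply a row vector by a matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def transfer_matrix(a):
    """Return T = D^{-1} A (rows of A divided by out-degree)."""
    n = len(a)
    return [[a[i][j] / sum(a[i]) if sum(a[i]) else 0.0 for j in range(n)]
            for i in range(n)]


def conservative_steady_state(a, w0, alpha, steps=500):
    """Iterate w(t) = (1 - alpha) * w0 + alpha * w(t-1) T for a fixed number of steps."""
    t_mat = transfer_matrix(a)
    w = list(w0)
    for _ in range(steps):
        spread = vec_mat(w, t_mat)  # amount passed along out-links
        w = [(1 - alpha) * w0[j] + alpha * spread[j] for j in range(len(w0))]
    return w


W0 = [0.2] * 5   # illustrative starting amounts, summing to 1
ALPHA = 0.85     # illustrative fraction passed on at each step
W = conservative_steady_state(A, W0, ALPHA)
```

Because every row of `T` sums to 1 in this example, the total amount is preserved: `sum(W)` stays at `sum(W0)`, which is the sense in which the process is conservative.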
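Non-conservative interactions. At time $t$, each node receives amount $\Delta(t)$. At the next time step, the node prints a fraction $\alpha$ of this amount for each out-neighbor. Replication matrix $R$: the additional amount produced for each neighbor; $R = A$, where $A$ is the adjacency matrix. At $t=0$: $\Delta(0) = w_0$. At $t=1$: $\Delta(1) = \alpha\Delta(0)R = \alpha w_0 R$. At $t=2$: $\Delta(2) = \alpha\Delta(1)R = \alpha^2 w_0 R^2$. At general $t$: $\Delta(t) = \alpha\Delta(t-1)R = \alpha^t w_0 R^t$. Accumulated amount: $w(t) = \sum_{k=0}^{t}\Delta(k) = w_0 + \alpha\, w(t-1)\,R = \sum_{k=0}^{t}\alpha^k w_0 R^k$.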
Steady state of non-conservative dynamic process. At time $t$, each node receives amount $\Delta(t)$; at the next time step, it prints a fraction $\alpha$ of this amount for each out-neighbor, with replication matrix $R = A$, where $A$ is the adjacency matrix. As $t \to \infty$: $w = w_0 + \alpha\, w\, R$, so $w = w_0\,(I - \alpha R)^{-1}$, which holds while $\alpha < 1/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $R$.
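A matching sketch for the non-conservative process, assuming the recurrence reads $w(t) = w_0 + \alpha\,w(t-1)\,R$ with $R = A$. The graph is the 5-node example as read from the slides; `W0` and `ALPHA = 0.5` are illustrative. For this particular graph the largest eigenvalue of $A$ is about 1.4, so 0.5 is below the convergence bound of roughly 0.7.

```python
# 5-node example graph (assumed reading of the slide's adjacency matrix).
A = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]


def vec_mat(v, m):
    """Multiply a row vector by a matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def non_conservative_steady_state(r, w0, alpha, steps=500):
    """Iterate w(t) = w0 + alpha * w(t-1) R; converges only if alpha < 1/lambda_max."""
    w = list(w0)
    for _ in range(steps):
        produced = vec_mat(w, r)  # copies printed for every out-neighbor
        w = [w0[j] + alpha * produced[j] for j in range(len(w0))]
    return w


W0 = [0.2] * 5   # illustrative starting amounts
ALPHA = 0.5      # illustrative; must stay below 1/lambda_max
W = non_conservative_steady_state(A, W0, ALPHA)
```

Unlike the conservative case, the total amount grows: `sum(W)` exceeds `sum(W0)` because each node replicates what it received instead of dividing it.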
Interactions and centrality. Centrality identifies important nodes in the network: e.g., the most connected (degree centrality), those in the middle of most shortest paths (betweenness centrality), or those that are often visited by a process. The nature of the process matters: conservative processes correspond to PageRank, non-conservative processes to Alpha-Centrality.
Interactions and Centrality. Centrality identifies important nodes in the network, i.e., those that are often visited by a dynamic process. Conservative (random surfer): follows out-links at random with probability $\alpha$; otherwise, jumps to a random node. Its equilibrium is PageRank, $pr = (1-\alpha)\,s + \alpha\, pr\, D^{-1}A$, which has the form of the conservative steady state $w_c = (1-\alpha)\,w(0) + \alpha\, w_c\, T$. Non-conservative (epidemic spread): with probability $\alpha$, transmit the disease to each out-neighbor. Its equilibrium, while $\alpha < 1/\lambda_{\max}$, is Alpha-Centrality, $cr = s + \alpha\, cr\, A$, which has the form of the non-conservative steady state $w_n = w(0) + \alpha\, w_n\, R$.
Which centrality metric is right for social media? [Figure: a submitter and its followers.] Information flow in social media is non-conservative.
Ground truth. User activity data in social media provides ground truth: an empirical measure of influence/importance, such as (1) the average size of cascades a node triggers or (2) the average number of re-broadcasts by followers. Rank nodes by the empirical measure to obtain the ground truth, then compare rankings produced by centrality metrics to the ground truth.
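One way to compare a metric's ranking with the ground-truth ranking is a rank correlation. The sketch below uses Spearman's coefficient, which is an assumption: the slides do not name the statistic that was used. All numbers are made up for illustration.

```python
def ranks(scores):
    """Return rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank = [0] * len(scores)
    for position, node in enumerate(order, start=1):
        rank[node] = position
    return rank


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score lists without ties."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical data: empirical influence (e.g., average cascade size) per node,
# and the scores some centrality metric assigns to the same nodes.
cascade_size = [12.0, 3.5, 7.1, 0.4, 5.0]
metric_scores = [0.9, 0.2, 0.5, 0.1, 0.4]
rho = spearman(cascade_size, metric_scores)
```

A value of `rho` near 1 means the metric orders nodes the same way as the empirical measure; a value near -1 means the order is reversed.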
Which centrality metric is right for social media? Correlation between the ground truth and the rankings predicted by Alpha-Centrality and PageRank. [Figure panels: Digg, Twitter.] Non-conservative Alpha-Centrality best predicts node centrality.
Alpha-Centrality [Bonacich, 87]. $C(\alpha) = A + \alpha A^2 + \alpha^2 A^3 + \dots = A\sum_{k=0}^{\infty}\alpha^k A^k = A\,(I - \alpha A)^{-1}$. Measures the number of paths between nodes, each path attenuated by its length with parameter $\alpha$. Parameter $\alpha \in [0, 1/\lambda_1)$, where $\lambda_1$ is the largest eigenvalue of $A$, sets the length scale of interactions. Local: for $\alpha = 0$, only short-range (local) interactions are considered; same rankings as degree centrality. Meso: as $\alpha$ grows, the length scale of interactions grows. Global: as $\alpha \to 1/\lambda_1$, global interactions are considered (length diverges); same rankings as eigenvector centrality. [Ghosh and Lerman, Parameterized Metric for Network Analysis, Physical Review E, 2011]
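The path-counting series can be sketched directly, assuming it reads $C(\alpha) = A + \alpha A^2 + \alpha^2 A^3 + \dots$. Reducing the matrix to one score per node by summing each column (all attenuated paths ending at that node) is an assumption of this sketch, as is the truncation after a fixed number of terms. The graph is the 5-node example as read from the slides.

```python
# 5-node example graph (assumed reading of the slide's adjacency matrix).
A = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]


def vec_mat(v, m):
    """Multiply a row vector by a matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def alpha_centrality(a, alpha, terms=200):
    """Column sums of A + alpha A^2 + alpha^2 A^3 + ..., truncated after `terms` extra powers."""
    term = vec_mat([1] * len(a), a)   # paths of length 1 ending at each node
    total = list(term)
    for _ in range(terms):
        # Extend every counted path by one edge and attenuate it by alpha.
        term = [alpha * x for x in vec_mat(term, a)]
        total = [s + x for s, x in zip(total, term)]
    return total


local_scores = alpha_centrality(A, 0.0)   # only direct links count
meso_scores = alpha_centrality(A, 0.5)    # longer paths contribute as well
```

At `alpha = 0` the scores reduce to in-degree, which matches the "local" limit described above; increasing `alpha` lets longer paths contribute, and the series only converges while `alpha` stays below the inverse of the largest eigenvalue.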
Epidemic threshold for non-conservative processes. A diverging length scale signals critical phenomena. Threshold behavior in non-conservative diffusion: the critical value of transmissibility is $\alpha_c = 1/\lambda_1$ [Wang et al., 2003]. For $\alpha < \alpha_c$, the epidemic dies out, i.e., reaches a vanishing fraction of nodes; for $\alpha > \alpha_c$, the epidemic reaches a large fraction of nodes. [Figure: size of simulated epidemics on the Digg follower graph and a synthetic graph vs. transmissibility; $\alpha_c = 0.006$.] [Ver Steeg, Ghosh & Lerman, What stops social epidemics? ICWSM, 2011]
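The threshold is the inverse of the largest eigenvalue of the adjacency matrix, which can be estimated by power iteration. This is a minimal sketch: the function names are illustrative, and plain power iteration is assumed to converge, which requires a single dominant eigenvalue (it can oscillate on bipartite graphs, for example).

```python
def vec_mat(v, m):
    """Multiply a row vector by a matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def largest_eigenvalue(a, iterations=1000):
    """Power-iteration estimate of the largest eigenvalue of a non-negative matrix."""
    v = [1.0] * len(a)
    lam = 0.0
    for _ in range(iterations):
        nxt = vec_mat(v, a)
        total = sum(nxt)
        if total == 0:
            return 0.0               # no cycles: every path eventually dies out
        lam = total / sum(v)         # growth factor of the vector's total mass
        v = [x / total for x in nxt] # renormalize to avoid overflow
    return lam


def epidemic_threshold(a):
    """Return 1 / lambda_1, or infinity if the matrix has no positive eigenvalue."""
    lam = largest_eigenvalue(a)
    return 1.0 / lam if lam > 0 else float("inf")


# Illustrative check on the complete graph K4, whose largest eigenvalue is 3.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
threshold = epidemic_threshold(K4)
```

Denser graphs have a larger leading eigenvalue and therefore a lower threshold.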
Multi-scale analysis with Alpha-Centrality. The length scale parameter $\alpha$ allows for multi-scale analysis of networks: differentiate between local and global structures by the change in rankings with $\alpha$. Leaders: high influence on group members; nodes with high centrality locally (small $\alpha$). Bridges: mediate communication between groups; nodes with low centrality locally (small $\alpha$) but high centrality globally (large $\alpha$). Peripherals: poorly connected to everyone; nodes with low centrality for any $\alpha$.
Karate club network [Zachary, 1977]. [Figure: the network, with the administrator and the instructor labeled.] [Zachary, An Information Flow Model for Conflict and Fission in Small Groups. J. Anthro. Research 33 No. 4. (1977)]
Ranking karate club members. [Figure: centrality scores of nodes vs. $\alpha$.] No need to know communities to find bridging nodes.
Community detection. Divide the network into groups such that nodes within a group are more similar to each other than to other nodes. [Zachary, An Information Flow Model for Conflict and Fission in Small Groups. J. Anthro. Research 33 No. 4. (1977)]
Synchronization in complex networks. [Figure: state of the network after a long time.] Hierarchical community structure is revealed en route to synchronization. [Arenas et al., Synchronization Reveals Topological Scales in Complex Networks, Phys. Rev. Lett. 96 (2006)]
Mathematics of synchronization. Conservative: Kuramoto model of coupled oscillators, $\frac{d\theta_i}{dt} = \omega_i + \sum_{j \in \mathrm{neighbors}(i)} \sin(\theta_j - \theta_i)$. Linearized model (Laplace operator): $\frac{d\theta}{dt} = \omega - (D - A)\,\theta = \omega - L\theta$. Non-conservative model: a node does not divide its coupling energy among neighbors; rather, it applies its full coupling energy to each neighbor. Linearized model (Replicator operator): $\frac{d\theta}{dt} = \omega - (\lambda_{\max} I - A)\,\theta = \omega - R\theta$.
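Steady state. $\frac{d\theta}{dt} = \omega - X\theta$, with $X = L$ or $R$, has the solution $\theta(t) = e^{-Xt}\,(\theta_0 - X^{-1}\omega) + X^{-1}\omega$. The system reaches a steady state iff $X$ is positive semi-definite. Time to reach the steady state ~ $1/\lambda_1$, where $\lambda_1$ is the smallest positive eigenvalue of $X$. In the steady state, $\theta$ ~ eigenvector corresponding to $\lambda_0$, the smallest eigenvalue of $X$. Conservative ($X = L$): $\theta_i(t) = \theta_i(t+1) = \theta_j(t+1)$. Non-conservative ($X = R$): $\theta_i(t) = \theta_i(t+1)$;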
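The two linearized models can be compared numerically with a forward-Euler sketch. Several things here are assumptions rather than details from the talk: natural frequencies are set to zero (so the dynamics reduce to $d\theta/dt = -X\theta$), the Replicator operator is taken to be $\lambda_{\max} I - A$, and the toy graph (two triangles joined by one edge), the step size, and the starting phases are illustrative.

```python
# Toy undirected graph: two triangles joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
N = 6


def mat_vec(m, v):
    """Multiply a matrix by a column vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def adjacency(edges, n):
    """Symmetric adjacency matrix of an undirected graph."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a


def laplacian(a):
    """L = D - A."""
    n = len(a)
    return [[(sum(a[i]) if i == j else 0) - a[i][j] for j in range(n)] for i in range(n)]


def largest_eigenvalue(a, iterations=500):
    """Power-iteration estimate; assumes a single dominant eigenvalue."""
    v = [1.0] * len(a)
    lam = 0.0
    for _ in range(iterations):
        nxt = mat_vec(a, v)
        total = sum(nxt)
        lam = total / sum(v)
        v = [x / total for x in nxt]
    return lam


def replicator(a):
    """R = lambda_max I - A (assumed form of the Replicator operator)."""
    lam = largest_eigenvalue(a)
    n = len(a)
    return [[(lam if i == j else 0.0) - a[i][j] for j in range(n)] for i in range(n)]


def evolve(x, theta0, dt=0.05, steps=4000):
    """Forward-Euler integration of d(theta)/dt = -X theta."""
    theta = list(theta0)
    for _ in range(steps):
        flow = mat_vec(x, theta)
        theta = [t - dt * f for t, f in zip(theta, flow)]
    return theta


A = adjacency(EDGES, N)
THETA0 = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
theta_L = evolve(laplacian(A), THETA0)    # conservative: all phases agree
theta_R = evolve(replicator(A), THETA0)   # non-conservative: leading eigenvector of A
```

Under the Laplacian, every node ends at the mean of the starting phases. Under the Replicator operator, the phases settle on a multiple of the leading eigenvector of `A`, so the better-connected bridge nodes 2 and 3 end with larger values than the others.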
Synthetic graph. [Figure: adjacency matrix of the graph.]
Eigenvalue spectrum. The eigenvalue spectrum of the Laplacian is used to characterize graph structure. Number of null eigenvalues: number of disconnected components. Smallest positive eigenvalue: sets the equilibration time. Gaps between consecutive eigenvalues: relative difference of time scales. Large eigenvalues: hubs in the network.
Synchronization matrix at T=1500. [Figure panels: conservative, non-conservative.]
Zachary karate club. [Figure: adjacency matrix of the graph.]
Eigenvalue spectrum. The eigenvalue spectrum of the Laplacian is used to characterize graph structure. Number of null eigenvalues: number of disconnected components. Gaps between consecutive eigenvalues: relative difference of time scales. Large eigenvalues: hubs in the network. Cheeger bounds, graph partitioning criteria, conductance, ...
Synchronization matrix of the Karate Club Network. [Figure panels: Laplacian (T=1000), Replicator (T=1000); color scale from less synchronization to more synchronization.]
Hierarchical clustering: emerging communities. [Figure panels: non-conservative and conservative, at t=10, t=1000, t=3000, t=3899.]
Community structure. [Figure panels: conservative, non-conservative.] Hierarchical agglomerative clustering on the synchronization matrix at time=3899. Non-conservative: clustering reveals the ground-truth community structure. Conservative: two nodes are misassigned.
Community structure of Digg social network. [Figure: whiskers and core.] No further structure in the core [Leskovec et al., 2008].
Onion-like structure of the core. [Figure panels: non-conservative, conservative.] Digg mutual follower network with ~40K nodes, ~360K edges. Each core has its own core-and-whiskers structure. Little overlap between the cores discovered by the two models. [Ghosh & Lerman, Role of Dynamic Interactions in Multi-scale Analysis of Community Structure, submitted to WWW]
Long-tailed size distribution of whiskers. [Figure panels: non-conservative, conservative; series: whiskers in a sub-core, disconnected components in the mutual follower graph.] Clustering nodes in the core reveals many small communities (whiskers) with a long-tailed size distribution. [Ghosh & Lerman, Role of Dynamic Interactions in Multi-scale Analysis of Community Structure, submitted to WWW]
Strength of ties. Social ties and proximity: people receive novel information (e.g., new jobs) not through close friends (strong ties) but through acquaintances (weak ties) [Granovetter, 1973], who proposed neighborhood overlap as a measure of tie strength. Tie strength ~ proximity in networks: empirical correlation between proximity (neighborhood overlap) and tie strength (frequency of calls) in a mobile call graph [Onnela et al, 2007]. Link prediction: proximity predicts future links in networks, e.g., future collaborations between scientists [Liben-Nowell & Kleinberg, 2003], who tested many proximity metrics.
Measuring proximity. A variety of metrics have been proposed to measure proximity in a graph. CN: number of common neighbors. JA: fraction of common neighbors (Jaccard). AA: Adamic-Adar metric [Adamic & Adar, 1998], which weighs each common neighbor $z$ by $1/\log(d_z)$, where $d_z$ is its degree: $AA(u,v) = \sum_{z \in \mathrm{Neighbors}(u) \cap \mathrm{Neighbors}(v)} \frac{1}{\log(d_z)}$; the best metric for predicting future collaborations [Liben-Nowell & Kleinberg, 2003]. Effective conductance [Koren et al., 2006].
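The three neighborhood-based metrics are short enough to sketch directly. The toy graph and the function names are hypothetical; the graph is undirected and stored as a dictionary of neighbor sets.

```python
import math

# Hypothetical undirected toy graph, stored as neighbor sets.
GRAPH = {
    "u": {"a", "b"},
    "v": {"a", "b"},
    "a": {"u", "v"},
    "b": {"u", "v", "c"},
    "c": {"b"},
}


def common_neighbors(g, u, v):
    """CN: number of neighbors shared by u and v."""
    return len(g[u] & g[v])


def jaccard(g, u, v):
    """JA: shared neighbors as a fraction of all neighbors of u or v."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0


def adamic_adar(g, u, v):
    """AA: each shared neighbor z contributes 1 / log(degree of z)."""
    # A shared neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    return sum(1.0 / math.log(len(g[z])) for z in g[u] & g[v])


cn = common_neighbors(GRAPH, "u", "v")
ja = jaccard(GRAPH, "u", "v")
aa = adamic_adar(GRAPH, "u", "v")
```

In this toy graph, `u` and `v` share both of their neighbors, so CN is 2 and JA is 1; AA discounts the shared neighbor `b` more than `a` because `b` has the higher degree.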
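Interactions and proximity. Proximity between $u$ and $v$ = likelihood that a message will get from $u$ to $v$ or vice versa. [Figure: nodes $u$ and $v$ in the example network, for the conservative and non-conservative cases.] Conservative: $CO = \frac{1}{2}\sum_{z \in \mathrm{Neighbors}} \left( \frac{1}{d_u d_z} + \frac{1}{d_v d_z} \right)$. Non-conservative: $NC = \frac{1}{2}\sum_{z \in \mathrm{Neighbors}} (1 + 1) = CN$. Attention-limited conservative: $CO\_AL = \frac{1}{2}\sum_{z \in \mathrm{Neighbors}} \frac{2}{d_u d_z \, d_z d_v}$. Attention-limited non-conservative: $NC\_AL = \frac{1}{2}\sum_{z \in \mathrm{Neighbors}} \left( \frac{1}{d_z d_v} + \frac{1}{d_z d_u} \right)$.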
Activity prediction in social media. What posts will a user retweet? Social media users tend to be similar to their friends, i.e., they retweet the same posts as their friends do (or vote for the same stories on Digg [Lerman, 2007]), but they tend to be more similar to closer friends. Closeness is based on proximity in the follower graph. Which proximity metric is better? [Lerman et al., Using proximity to predict activity in social networks, submitted to WWW]
Prediction experiment. [Figure: a user and its friends, with precision (Pr) and recall (Re) of the predicted posts for the user.] Measure how well each proximity metric predicts activity. [Lerman et al., Using proximity to predict activity in social networks, submitted to WWW]
Prediction results: Digg. Baseline = all friends contribute equally to the user's activity. Lift = percent change over baseline. [Figure: lift (%) in precision and recall for CN/NC, JA, AA, CS, CS_AL, NC_AL.] [Lerman et al., Using proximity to predict activity in social networks, submitted to WWW]
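Lift, as defined here, is the percent change of a score over the baseline score. The sketch below pairs it with the standard set-based definitions of precision and recall, which are an assumption since the talk's exact formulas are not recoverable from the slide text. All post identifiers and predictions are made up.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall of predicted posts against actual posts."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


def lift(value, baseline):
    """Percent change of `value` over `baseline`."""
    return 100.0 * (value - baseline) / baseline


# Hypothetical example: posts the user actually retweeted, and the posts
# predicted by the equal-weight baseline and by a proximity-weighted metric.
actual = {"p1", "p2", "p3", "p4"}
baseline_pred = {"p1", "p5", "p6", "p7"}
metric_pred = {"p1", "p2", "p5", "p6"}

base_p, base_r = precision_recall(baseline_pred, actual)
met_p, met_r = precision_recall(metric_pred, actual)
precision_lift = lift(met_p, base_p)
recall_lift = lift(met_r, base_r)
```

A positive lift means the proximity metric predicts activity better than weighting all friends equally; a negative lift means it does worse.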
Prediction results: Twitter. Baseline = all friends contribute equally to the user's activity. Lift = percent change over baseline. [Figure: lift (%) in precision and recall for CN/NC, JA, AA, CS, CS_AL, NC_AL.] [Lerman et al., Using proximity to predict activity in social networks, submitted to WWW]
Conclusion. How we measure network structure depends on the nature of interactions between nodes. Centrality: conservative interactions correspond to PageRank, non-conservative interactions to Alpha-Centrality; Alpha-Centrality better predicts influential users on Digg and Twitter. Community structure: for conservative interactions, use the Laplacian to probe structure; for non-conservative interactions, use the Replicator operator; communities synchronize faster in non-conservative interactions. Social ties: a principled way to measure proximity in graphs; attention-limited proximity better predicts user activity on Digg and Twitter.