Network Observational Methods and. Quantitative Metrics: II

Similar documents
Quantitative Metrics: II

Design and characterization of chemical space networks

Complex Networks, Course 303A, Spring, Prof. Peter Dodds

Assortativity and Mixing. Outline. Definition. General mixing. Definition. Assortativity by degree. Contagion. References. Contagion.

A LINE GRAPH as a model of a social network

Network Analysis and Modeling

6.207/14.15: Networks Lecture 12: Generalized Random Graphs

Emerging communities in networks a flow of ties

Class President: A Network Approach to Popularity. Due July 18, 2014

Mini course on Complex Networks

ECS 253 / MAE 253 April 26, Intro to Biological Networks, Motifs, and Model selection/validation

Erzsébet Ravasz Advisor: Albert-László Barabási


Random Networks. Complex Networks CSYS/MATH 303, Spring, Prof. Peter Dodds

Random Networks. Complex Networks, CSYS/MATH 303, Spring, Prof. Peter Dodds

Social Networks. Chapter 9

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Complex (Biological) Networks

Spectral Analysis of Directed Complex Networks. Tetsuro Murai

Complex Networks CSYS/MATH 303, Spring, Prof. Peter Dodds

Network Science (overview, part 1)

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties

arxiv:cond-mat/ v1 [cond-mat.dis-nn] 4 May 2000

Kristina Lerman USC Information Sciences Institute

networks in molecular biology Wolfgang Huber

1 Complex Networks - A Brief Overview

Complex (Biological) Networks

CS6220: DATA MINING TECHNIQUES

Identifying communities within energy landscapes

Degree Distribution: The case of Citation Networks

Complex networks: an introduction

LAPLACIAN MATRIX AND APPLICATIONS

CS224W: Analysis of Networks Jure Leskovec, Stanford University

ENV Laboratory 1: Quadrant Sampling

CS6220: DATA MINING TECHNIQUES

A Modified Method Using the Bethe Hessian Matrix to Estimate the Number of Communities

Vocabulary. 1. the product of nine and y 2. the sum of m and six

Betweenness centrality of fractal and nonfractal scale-free model networks and tests on real networks

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

CS224W: Social and Information Network Analysis

ECS 289 / MAE 298 May 8, Intro to Biological Networks, Motifs, and Model selection/validation

Math 304 Handout: Linear algebra, graphs, and networks.

Finding central nodes in large networks

Freeman (2005) - Graphic Techniques for Exploring Social Network Data

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Social Network Analysis. Mrigesh Kshatriya, CIFOR Sentinel Landscape Data Analysis workshop (3rd-7th March 2014) Venue: CATIE, Costa Rica.

Slides based on those in:

Overview of Network Theory

Network landscape from a Brownian particle s perspective

Networks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource

The Trouble with Community Detection

Groups of vertices and Core-periphery structure. By: Ralucca Gera, Applied math department, Naval Postgraduate School Monterey, CA, USA

Quadratic Equations Part I

HIERARCHICAL GRAPHS AND OSCILLATOR DYNAMICS.

Chaos, Complexity, and Inference (36-462)

Mathematical Foundations of Social Network Analysis

Networks and stability

Quadratics Unit 3 Tentative TEST date

2 nd February Round Oak Newsletter

Complex Networks, Course 303A, Spring, Prof. Peter Dodds

Adventures in random graphs: Models, structures and algorithms

GraphRNN: A Deep Generative Model for Graphs (24 Feb 2018)

Chaos, Complexity, and Inference (36-462)

cs/ee/ids 143 Communication Networks

Midterm Exam 2 Solutions

BECOMING FAMILIAR WITH SOCIAL NETWORKS

Show that the following problems are NP-complete

Spectral Methods for Subgraph Detection

Westerly High School. Math Department. Summer Packet

2.4. Conditional Probability

Class Note #20. In today s class, the following four concepts were introduced: decision

OUR 2016 BIG STORIES F R O M R U R A L C O M M U N I T I E S F O L L O W T H E M O N E Y N G.

CS 124/LINGUIST 180 From Languages to Information

June Dear prospective Pre-Calculus student,

Data Collection. Lecture Notes in Transportation Systems Engineering. Prof. Tom V. Mathew. 1 Overview 1

Comparing linear modularization criteria using the relational notation

Typical information required from the data collection can be grouped into four categories, enumerated as below.

SCOPE & SEQUENCE. Algebra I

Web Structure Mining Nodes, Links and Influence

CS261: A Second Course in Algorithms Lecture #8: Linear Programming Duality (Part 1)

New Directions in Computer Science

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Follow links Class Use and other Permissions. For more information, send to:

Lecture December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about

Absence of depletion zone effects for the trapping reaction in complex networks

Joanne N. Halls, PhD Dept. of Geography & Geology David Kirk Information Technology Services

Econ 172A, Fall 2012: Final Examination Solutions (I) 1. The entries in the table below describe the costs associated with an assignment

MAT 210 Test #1 Solutions, Form A

Exact solution of site and bond percolation. on small-world networks. Abstract

BOSTON UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES. Dissertation TRANSPORT AND PERCOLATION IN COMPLEX NETWORKS GUANLIANG LI

Preface. Contributors

Networks as a tool for Complex systems

1 Reductions and Expressiveness

Balancing Chemical Equations

Nonlinear Dynamical Behavior in BS Evolution Model Based on Small-World Network Added with Nonlinear Preference

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: NP-Completeness I Date: 11/13/18

Econ 172A, Fall 2012: Final Examination (I) 1. The examination has seven questions. Answer them all.

Origins of fractality in the growth of complex networks. Abstract

Heavy Tails: The Origins and Implications for Large Scale Biological & Information Systems

Transcription:

Network Observational Methods and Whitney topics Quantitative Metrics: II Community structure (some done already in Constraints - I) The Zachary Karate club story Degree correlation Calculating degree correlation for simple regular structures like trees and grids 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 1

Clustering or Grouping Metrics Community structure Seek to find tightly connected subgroups within a larger network Clustering coefficient Measure the extent to which nodes link to each other in triangles Are your friends friends? Clusters are often called modules by network researchers and are also associated by them with function Assortativity and disassortativity (AKA degree correlation) Do highly linked nodes ( hubs ) link to each other (assortative) or do they link with weakly linked nodes (disassortative) Average (shortest) path length (AKA geodesic) How far apart are nodes Max geodesic is called network diameter 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 2

Community-finding and Pearson Coefficient r Technological networks seem to have r < 0 Social networks seem to have r > 0 Newman and Park sought an explanation in community structure and clustering Their algorithm for finding communities looks like a flow algorithm Zachary used a flow algorithm to find the communities in the Karate Club 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 3

Summary Properties of Several Big Networks (Newman) Network Type n m z l α C (1) C (2) r SOCIAL Film actors Company directors Math coauthorship Physics coauthorship Biology coauthorship Telephone call graph E-mail messages E-mail address books Student relationships Sexual contacts directed directed 449913 7673 253339 52909 1520251 47000000 59912 16881 573 2810 25516482 55392 496489 245300 11803064 80000000 86300 57029 477 113.43 14.44 3.92 9.27 15.53 3.16 1.44 3.38 1.66 3.48 4.60 7.57 6.19 4.92 4.95 5.22 16.01 2.3 - - - - 2.1 1.5/2.0 - - 3.2 0.20 0.59 0.15 0.45 0.088 0.17 0.005 0.78 0.88 0.34 0.56 0.60 0.16 0.13 0.001 0.208 0.276 0.120 0.363 0.127 0.092-0.029 INFORMATION WWW nd.edu WWW Altavista Citation network Roget's Thesaurus Word co-occurrence directed directed directed directed 269504 203549046 783339 1022 460902 1497135 2130000000 6716198 5103 17000000 5.55 10.46 8.57 4.99 70.13 11.27 16.18 4.87 2.1/2.4 2.1/2.7 3.0/- - 2.7 0.11 0.13 0.29 0.15 0.44-0.067 0.157 TECHNOLOGICAL Internet Power grid Train routes Software packages Software classes Electronic circuits Peer-to-peer network directed directed 10697 4941 587 1439 1377 24097 880 31992 6594 19603 1723 2213 53248 1296 5.98 2.67 66.79 1.20 1.61 4.34 1.47 3.31 18.99 2.16 2.42 1.51 11.05 4.28 2.5 - - 1.6/1.4-3.0 2.1 0.035 0.10 0.070 0.033 0.010 0.012 0.39 0.080 0.69 0.082 0.012 0.030 0.011-0.189-0.003-0.033-0.016-0.119-0.154-0.366 BIOLOGICAL Metabolic network Protein interactions Marine food web Freshwater food web Neural network directed directed directed 765 2115 135 92 307 3686 2240 598 997 2359 9.64 2.12 4.43 10.84 7.68 2.56 6.80 2.05 1.90 3.97 2.2 2.4 - - - 0.090 0.072 0.16 0.20 0.18 0.67 0.071 0.23 0.087 0.28-0.240-0.156-0.263-0.326-0.226 Figure by MIT OCW. Basic statistics for a number of published networks. The properties measured are: type of graph, directed or ; total number of vertices n; total number of edges m; mean degree z; mean vertex-vertex distance l; exponent α of degree distribution if the distribution follows a power law (or "-" if not; in/out-degree exponents are given for directed graphs); clustering coefficient C (1) ; clustering coefficient C (2) ; and degree correlation coefficient r. Blank entries indicate unavailable data.

Calculating r r = # # ( x " x ) y " y ( ) ( x " x ) 2 # y " y ( ) 2 # x y 1 2 3 1 2 2 2 2 3 2 2 3 2 3 4 2 3 4 2 3 5 2 2 5 2 2 x = 2 y = 2 3 1 2 5 4 r = -0.676752968 using Pearson function in Excel Note: if all nodes have the same k then r = 0/0 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 5

n=1 binary tree with n=5 2 3 2 3 3 2 3 2 n=3 n=4 n=5 2 rows like this - ignore n=2 6 rows like this - ignore All other rows like this except last two sets 2 n-2 rows of 3-3 2*2 n-2 rows of 3-1 3-3 means 3" x 3-1 = 1-3 and means! 2 n-1 rows of 3-1 ( ) 2! ( 3" x )( 1" x ) 1 2 3 4 5 Pure Binary Tree 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 6 Census of Pairs for

Result of Census Sum of row entries = Total number of rows = #k i 2 =10*2 n"1 "14 "k i = 2 n +1 + 4 = ksum " x = #k 2 # k = 2.5 in the limit of large n <k>=2 Total 2 n rows of 3-1 Approx (ksum - 2 n ) rows of 3-3 r = - 0.4122 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 7

Closed Form Results 1 2 3 4 5 1 2 3 4 5 Property Pure Binary Tree Binary Tree with Cross-linking ksum 2 n +1 " 4 3*2 n "10 ksqsum 10*2 n"1 "14 13*2 n " 64 x # 2.5 as n becomes large (>~ 6) # 13 as n becomes large (>~ 6) 3 Pearson numerator ~ 2 n (3" x )(1" x ) + (ksum " 2 n )(3" x ) 2 ~ 2 n (5 " x )(1" x ) + (ksum " 2 n )(5 " x ) 2 Pearson denominator ~ 2 n"1 (1" x ) 2 + (ksum " 2 n"1 )(3" x ) 2 ~ 2 n"1 (1" x ) 2 + (ksum " 2 n"1 )(5 " x ) 2 r # " as n becomes large # " 1 5 as n becomes large l Note: Western Power Grid r = 0.0035 Bounded grid r = 16(2 " x )(3 " x ) + 8(l " 3)(3 " x )2 2(2 " x ) 2 +12(l " 2)(3 " x ) 2 # 2 3 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 8

Nested Self-Similar Networks nested r = - 0.25, c = 0.625 nested2 Probably, r = 0 in the limit as the network grows r = - 0.0925, c = 0.5500 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 9

Tree with Diminishing Branching Ratio 1 node with k = 16 16 times 16 nodes with k = 9 8 times 8*16 = 128 nodes with k = 5 4 times 4*8*16 = 512 nodes with k = 3 2 times 2*4*8*16 = 1024 nodes with k = 1 r = 0.38166 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 10

Toy Networks with Positive and Negative r 1 3 4 2 3 2 1 5 4 6 5 6 7 7 8 8 9 9 10 11 10 11 12 13 12 13 14 14 15 r = 0.4330 15 16 17 18 19 20 21 r = - 0.4434 16 17 18 19 20 21 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 11

Toward Matlab for Pearson (symmetric) r = # # ( x " x ) y " y ( ) ( x " x ) 2 # y " y ( ) 2 Look at numerator, ignore xbar for the moment #( x i y ) j = x ' i " ij y j = x ' Ax " ij =1 if i links to j " ij = 0 if i does not link to j Essentially the calculation is a quadratic form. My bias: control theory, where quadratic forms are common 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 12

Matlab Implementation function prs = pearson(a) %calculates pearson degree correlation of A [rows,colms]=size(a); won=ones(rows,1); k=won'*a; ksum=won'*k'; ksqsum=k*k'; xbar=ksqsum/ksum; num=(k-won'*xbar)*a*(k'-xbar*won); kkk=(k'-xbar*won).*(k'.^.5); denom=kkk'*kkk; prs=num/denom; 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 13

Newman-Girvan Algorithm Seeks edges along which a lot of traffic flows between nodes, revealed by high edge betweenness Edge betweenness rises with number of shortest paths between all node pairs that pass along that edge Removing this edge and repeating the process reveals clusters that roughly conform to Modularity 1 (?) 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 14

Zachary s Karate Club: A Social Network with r < 0 (from UCINET) 17 12 1 6 11 5 7 4 8 2 13 22 18 14 20 3 r = - 0.475 C = 0.588 26 25 27 15 23 19 16 30 31 28 32 9 33 29 10 There is no link between 23 and 34. 24 34 Every later scholar has this error. Figure by MIT OCW. 3/ 20/ 06 Quantitative Metrics II Daniel E Whitney 1997-2006 15

Zachary s Karate Club - Most Studied by Community-Finding Researchers Zachary studied a karate club that had an internal fight and split into two Based on data he took about relationships between club members, he predicted how the group would split His algorithm correctly assigned all but one person to the groups they actually joined after the split An Information Flow Model for Conflict and Fission in Small Groups, J Anth Res v 33, 1977, pp 452-473 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 16

The Reason for the Split The karate instructor Mr Hi wanted more money The club president John A felt the club administrators should set his salary Many angry club meetings occurred over this conflict When John A fired Mr Hi, the group split Half formed a new club around Mr Hi The other half found another instructor or gave up karate 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 17

The Dynamics Different club members took different sides Club meetings (different from karate lessons) were fights based on votes, and the faction with the most votes prevailed at any given meeting Political activity occurred outside the club as the sides activists recruited others to attend meetings and vote their way 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 18

Zachary s Model Nodes are club members, plus Mr Hi Mr Hi is node #1, John A is node #34 There is a line between two nodes if those people meet in some venue outside of the club Venues include local campus pub, Mr Hi s private karate school, common classes, outside karate tournaments, etc Each edge has a weight = the number of outside venues that the two people have in common Based on the idea that communication, including recruiting people to come to club meetings, happens in the outside venues, and that more venues in common means stronger communication, represented by stronger edge weight 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 19

Zachary s Algorithm Zachary assumed that each side tried to recruit its adherents and keep the other side from learning about a meeting So communication flow was important, and the group would likely split at chokepoints of communication between the groups He adopted the Ford-Fulkerson capacitated flow algorithm - max flow/min cut - from source Mr Hi to sink John A: the cut closest to Mr Hi that cuts saturated edges divides the network into the two factions He correctly predicted every member s decision except #9 His algorithm depended on knowing who was who and what was what 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 20

Max-flow Min-Cut Theorem c 1 f 1 S c 2 f 2 T flow in = F max c 3 f 3 flow out = F max The cut divides the network in two Its capacity = c 1 +c 2 +c 3 Its flow = f 1 +f 2 +f 3 There is a cut such that c 1 +c 2 +c 3 = F max No other cut can have less capacity or else the total flow will be less than F max Other cuts can have more capacity but that makes no difference. 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 21

The Relationship Graph 27 15 23 10 This version of the Karate club appears in several papers by Newman or Girvan and Newman 6 26 25 11 13 17 5 7 22 18 15 23 19 16 27 30 31 28 12 4 14 20 32 9 1 33 8 3 29 31 2 27 21 19 r = - 0.475 C = 0.588 23 10 16 30 24 26 33 34 29 28 25 31 32 9 3 14 20 2 1 Breakdown of the Karate Club according to Newman. Every algorithm from Zachary to Newman puts 9 in 34's group - but 9 actually went with 1, not 34. Algorithms have trouble with #3, not #9. 8 13 4 5 18 Figure by MIT OCW. 22 12 11 6 7 17 24 34 Figure by MIT OCW. 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 22

Different Approaches and Lessons Zachary s method depends on knowing facts about both nodes and edges, and uses a weighted graph Edges describe relationships outside the club #9 chose Mr Hi s group for an inside reason, something no one else did Later scholars used no info about nodes and edges and used an unweighted graph, but get the same answer and make the same mistake with #9 Newman uses geodesics between all pairs of nodes while Zachary uses only paths between 1 and 34. How come later scholars get the same answer? 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 23

Possible Explanation The uncommitted members were the only bridges between two committed groups There were only a few such people and they shared few venues with members of both factions Thus the break can practically be seen on the unweighted graph with the naked eye So possibly later scholars have simply been lucky The goal of abstraction is to learn as much as you can while knowing a priori as little as possible 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 24

Network Comparisons Network Type Clustering Path Length Pearson Degree Coefficient Coeff Random (Erdös & Small Small Zero Rényi) Regular Grid Large Large compared to Positive random Regular Grid Large Small, similar to? Randomly random Rewired (Watts and Strogatz) Trees Small Small Negative Sociological Large compared to Small Positive? random Technological Large? Negative? 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 25

Conventional Wisdom Regarding r Positive r means hubs connect to hubs Positive r means that high degree nodes tend to connect to each other and so do low degree nodes If self-loops and multiple edges between nodes are not allowed, then hubs have no choice but to connect to lowdegree nodes, so r will be < 0 ( hubs repel each other ) These explanations do not work reliably, although the converses work sometimes If high k link to high k and low k to low k then r > 0 Note: a random graph has r = 0 (-0.0105 in MATLAB) Also, small networks can have big values of r 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 26

Bike Rewired to Have Max r r = 0.1815 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 27

Another Bike with r = 0.1448 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 28

Li-Alderson Toy Internet Client-Server Networks All Have Same Degree Sequence and r ~ - 0.17 3/20/06 Quantitative Metrics II Daniel E Whitney 1997-2006 Figure 5 in Li, Lun, David Alderson, John C. Doyle, and Walter Willinger. "Towards a Theory of Scale-Free Graphs: Definition, Properties, and Implications." Internet Mathematics 2, no. 4 (2006): 431-523. Reproduced courtesy of A K Peters, Ltd. and David Alderson. Used with permission. 29