CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Size: px

Start display at page:

Download "CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University"

Daniel Long
5 years ago
Views:

1 CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

2 Intro sessions to SNAP C++ and SNAP.PY: SNAP.PY: Friday 9/27, 4:5 5:30pm in Gates B03 SNAP C++: Thursday 0/3, 4:5 5:30pm in Gates B03 Sessions will be recorded and available via SCPD About the software libraries: TAs support SNAP C++ (Justin, Bell), SNAP.PY (Christie, Yoni) You can use other libraries: NetworkX, JUNG, Boost, R They will do the job but we don t offer support for them Start early on HW0 since these packages are new to you, complex and non trivial to use! Review of: Probability: Friday, 0/4, 4:5 5:30pm in Gates B03 Linear algebra: Tuesday, 0/8, 2:5 3:30pm,Gates B03 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 2

3 Observations Models Algorithms Small diameter, Edge clustering Patterns of signed edge creation Viral Marketing, Blogosphere, Memetracking Scale Free Densification power law, Shrinking diameters Strength of weak ties, Core periphery Erdös Renyi model, Small world model Structural balance, Theory of status Independent cascade model, Game theoretic model Preferential attachment, Copying model Microscopic model of evolving networks Kronecker Graphs Decentralized search Models for predicting edge signs Influence maximization, Outbreak detection, LIM PageRank, Hubs and authorities Link prediction, Supervised random walks Community detection: Girvan Newman, Modularity 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 3

For example, last time we talked about Observations and Models for the Web graph: ) We took a real system: the Web 2) We represented it as a directed graph 3) We used the language of graph theory

4 For example, last time we talked about Observations and Models for the Web graph: ) We took a real system: the Web 2) We represented it as a directed graph 3) We used the language of graph theory Strongly Connected Components 4) We designed a computational experiment: Find In and Out components of a given node v 5) We learned something about the structure of the Web: BOWTIE! v Out(v) 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 4

5 Undirected graphs Links: undirected (symmetrical, reciprocal relations) A L F M Directed graphs Links: directed (asymmetrical relations) B C D I D B G E G A H C F Undirected links: Directed links: Collaborations Phone calls Friendship on Facebook Following on Twitter 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 5

6 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 6 A ij = if there is a link from node i to node j A ij = 0 otherwise A A Note that for a directed graph (right) the matrix is not symmetric

7 Undirected Directed A B G A C Source: node with k in = 0 Sink: node with k out = 0 D F E Node degree, k i : the number of edges adjacent to node i k A 4 Avg. degree: In directed networks we define an in degree and out degree. The (total) degree of a node is the sum of in and out degrees. out k 2 k 3 in C k 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 7 E N C k k N k in k C N k i 2E i N k out

8 The maximum number of edges in an undirected graph on N nodes is E max N 2 N ( N 2 ) A graph with the number of edges E = E max is a complete graph, and its average degree is N- 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 8

9 Most real world networks are sparse E << E max (or k << N-) WWW (Stanford Berkeley): N=39,77 k=9.65 Social networks (LinkedIn): N=6,946,668 k=8.87 Communication (MSN IM): N=242,720,596 k=. Coauthorships (DBLP): N=37,080 k=6.62 Internet (AS Skitter): N=,79,037 k=4.9 Roads (California): N=,957,027 k=2.82 Proteins (S. Cerevisiae): N=,870 k=2.39 (Source: Leskovec et al., Internet Mathematics, 2009) Consequence: Adjacency matrix is filled with zeros! (Density of the matrix (E/N 2 ): WWW=.50-5, MSN IM = ) 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 9

10 Unweighted (undirected) 4 Weighted (undirected) 4 A ij E 2 A ii 0 N i, j A ij A ij A ji k 2E N Examples: Friendship, Hyperlink 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 0 E A ij N A i, j ii 0 nonzero( A ij ) Aij Aji 2E k N Examples: Collaboration, Internet, Roads

Self edges (self loops) (undirected) 4 Multigraph (undirected) 4

CS224W: Social and Information Network Analysis, http://cs224w.

11 Self edges (self loops) (undirected) 4 Multigraph (undirected) 4 E A ij A ii N 0 A N ij i, j, i j i A ii A Examples: Proteins, Hyperlink ij? A ji nonzero( A Aij Aji 2E k N 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, E A ij N A i, j ii 0 Examples: Communication, Collaboration ij )

12 WWW > Facebook friendships > Citation networks > Collaboration networks > Mobile phone calls > Protein Interactions > > directed multigraph with self edges > undirected, unweighted > unweighted, directed, acyclic > undirected multigraph or weighted graph > directed, (weighted?) multigraph > undirected, unweighted with self interactions 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 2

Bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets A B C Examples:

13 Bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets A B C Examples: Authors to papers (they authored) Actors to Movies (they appeared in) Users to Movies (they rated) Folded networks: Author collaboration networks Movie co rating networks 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 3 D E U E A B C V D Folded version of the graph above

15 Degree distribution P(k): Probability that a randomly chosen node has degree k N k = # nodes with degree k Normalized histogram: P(k) = N k / N plot P(k) k N k 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 5 k

16 A path is a sequence of nodes in which each node is linked to the next one P n {i 0,i,i 2,...,i n } P n {(i 0,i ),(i,i 2 ),(i 2,i 3 ),...,(i n,i n )} Path can intersect itself and pass through the same edge multiple times E.g.: ACBDCDEG In a directed graph a path can only follow the direction of the arrow A C B D E H F G 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 6

17 Number of paths between nodes u and v : Length h=: If there is a link between u and v, A uv = else A uv =0 Length h=2: If there is a path of length two between u and v then A uk A kv = else A uk A kv =0 N ( 2 ) 2 H uv A uk A kv [ A ] k Length h: If there is a path of length h between u and v then A uk... A kv = else A uk... A kv =0 So, the no. of paths of length h between u and v is ( h ) h H uv [ A ] uv (holds for both directed and undirected graphs) 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 7 uv

18 B A C D Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes h B,D = 2 A D *If the two nodes are disconnected, the distance is usually defined as infinite In directed graphs paths need to follow the direction of the arrows C B h B,C =, h C,B = 2 Consequence: Distance is not symmetric: h A,C h C, A 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 8

19 Diameter: the maximum (shortest path) distance between any pair of nodes in a graph Average path length for a connected graph (component) or a strongly connected (component of a) directed graph h 2E max i, ji h ij where h ij is the distance from node i to node j Many times we compute the average only over the connected pairs of nodes (we ignore infinite length paths) 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 9

20 Breath First Search: Start with node u, mark it to be at distance h u (u)=0, add u to the queue While the queue not empty: Take node v off the queue, put its unmarked neighbors w into the queue and mark h u (w)=h u (v) u /25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 20

21 Clustering coefficient: What portion of i s neighbors are connected? Node i with degree k i C i [0,] where e i is the number of edges between the neighbors of node i i i i C i =0 C i =/3 C i = Average Clustering Coefficient: C N N i C i 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 2

22 Clustering coefficient: What portion of i s neighbors are connected? Node i with degree k i where e i is the number of edges between the neighbors of node i A C B D E F G k B =2, e B =, C B =2/2 = k D =4, e D =2, C D =4/2 = /3 H 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 22

23 Degree distribution: Path length: Clustering coefficient: P(k) h C 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 23

25 MSN Messenger activity in June 2006: 50Gb/day (compressed) 4.5Tb / month 245 million users logged in 80 million users engaged in conversations More than 30 billion conversations More than 255 billion exchanged messages 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 25

26 26

27 Network: 80M people,.3b edges 27

3 billion edges 9/25/203 Jure Leskovec, Stanford CS224W:

28 Buddy Conversation Communication graph Edge (u,v) if users u and v exchanged at least msg N=80 million people E=.3 billion edges 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 28

29 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 29

30 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 30

31 We plot the same data as on the previous slide, just the axes are now logarithmic. 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 3

32 Avg. clustering of the MSN: C = 0.40 C k : average C i of nodes i of degree k: C k i 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 32 N k i: k k C i

33 Steps #Nodes ,96 4 8,648 Number of links between pairs of nodes Avg. path length % of the people can be reached in < 8 hops # nodes as we do BFS out of a random node 5 3,299, ,395, ,059, ,995, ,32,008 0,955,007 58, , ,66 4 3, ,476 6, /25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis,

34 Degree distribution: heavily skewed avg. degree= 4.4 Path length: 6.6 Clustering coefficient: 0. Are these metrics expected? Are they surprising? To answer this we need a null model! 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 34

35 P(k) = (k-4) C = ½ Path length: k i =4 for all nodes h max N 2 The average shortest path length: all as N O(N) h O(N) So, we have: Constant degree, Constant avg. clustering coeff. Linear avg. path length Note about calculations: We are interested in quantities as graphs get large (N ) We will use big-o: f(x) = O(g(x)) as x if f(x) < g(x)*c for all x > x 0 and some constant c. 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 35

36 P(k) = (k-6) k =6 for each inside node C = 6/5 for inside nodes Path length: h max O( N ) In general, for lattices: Average path length is / D (D lattice dimensionality) Constant degree, constant clustering coefficient h 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 36 N

38 Erdös Renyi Random Graphs [Erdös Renyi, 60] Two variants: G n,p : undirected graph on n nodes and each edge (u,v) appears i.i.d. with probability p G n,m : undirected graph with n nodes, and m uniformly at random picked edges What kinds of networks does such model produce? 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 38

39 n and p do not uniquely determine the graph! The graph is a result of a random process We can have many different realizations given the same n and p n = 0 p= /6 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 39

40 How likely is a graph on E edges? P(E): the probability that a given G np generates a graph on exactly E edges: P( E) E max E p E ( p) E where E max =n(n-)/2 is the maximum possible number of edges in an undirected graph of n nodes P(E) is exactly the Binomial distribution >>> Number of successes in a sequence of n independent yes/no experiments E max 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 40

41 What is expected degree of a node? Let X v be a rnd. var. measuring the degree of node v n We want to know: E[ X ] j P ( X j) j 0 For the calculation we will need: Linearity of expectation For any random variables Y,Y 2,,Y k If Y=Y +Y 2 + Y k, then E[Y]= i E[Y i ] An easier way: Decompose X v to X v = X v, +X v,2 + +X v,n- where X v,u is a {0,} random variable which tells if edge (v,u) exists or not E[X v ] n u E[X vu ] (n )p v How to think about this? Prob. of node u linking to node v is p u can link (flips a coin) to all other (n-) nodes Thus, the expected degree of node u is: p(n-) 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 4 v

42 Degree distribution: Path length: Clustering coefficient: P(k) h C What are values of these properties for G np? 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 42

43 Fact: Degree distribution of G np is Binomial. Let P(k) denote a fraction of nodes with degree k: P( k ) Select k nodes out of n- n k p k ( Probability of having k edges p) n k Probability of missing the rest of the n--k edges P(k) k Mean, variance of a binomial distribution k p( n ) 2 p( p)(n ) k p p (n ) /2 (n ) /2 As the network size increases, the distribution becomes increasingly narrow we are increasingly confident that the degree of a node is in the vicinity of k. 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 43

44 2ei Remember: Ci ki ( ki ) Edges in G np appear i.i.d with prob. p So: Then: e i p k i(k i ) Each pair is connected with prob. p 2 Number of distinct pairs of neighbors of node i of degree k i C pk i(k i ) k i (k i ) p k N Clustering coefficient of a random graph is small. For a fixed avg. degree, C decreases with the graph size N. Where e i is the number of edges between i s neighbors 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 44

45 Degree distribution: n P( k) p k k ( p) n k Clustering coefficient: C=p=k/n Path length: next! 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 45

46 To prove the diameter of a G np we define few concepts Random k Regular graph: Assume each node has k spokes (half edges) k=: Graph is a set of pairs k=2: Graph is a set of cycles k=3: Arbitrarily complicated graphs Randomly pair them up! 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 46

$Graph G(V, E) has expansion α: if S V: # of edges leaving S α min( S, V\S ) Or equivalently: min S V # edges leaving min( S,$

47 Graph G(V, E) has expansion α: if S V: # of edges leaving S α min( S, V\S ) Or equivalently: min S V # edges leaving min( S, V \ S S ) S V \ S 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 47

48 S nodes α S edges S nodes α S edges min S V # edges leaving min( S, V \ S (A big) graph with good expansion S ) 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 48

49 Expansion is measure of robustness: To disconnect l nodes, we need to cut α l edges Low expansion: High expansion: Social networks: Communities min SV # edges leaving S min( S, V \ S ) 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 49

50 k regular graph (every node has degree k): Expansion is at most k (when S is a single node) Is there a graph on n nodes (n), of fixed max deg. k, so that expansion α remains const? Examples: nngrid:k=4: α =2n/(n 2 /4)0 (S=n/2 n/2 square in the center) S min SV # edges leaving S min( S, V \ S ) Complete binary tree: α 0 for S =(n/2)- S Fact: For a random 3 regular graph on n nodes, there is some const α (α >0, independent. of n) such that w.h.p. the expansion of the graph is α 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 50

51 Fact: In a graph on n nodes with expansion α for all pairs of nodes s and t there is a path of O((log n) / α) edges connecting them. Proof: Proof strategy: We want to show that from any node s there is a path of length O((log n)/α) to any other node t Let S j be a set of all nodes S 2 found within j steps of BFS from s. How does S j increase as a function of j? S 0 S s 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 5

52 Proof (continued): Let S j be a set of all nodes found within j steps of BFS from s. We want to relate S j and S j+ S j S j Expansion S k j At most k edges collide at a node S 0 S S 2 S j nodes s S j+ nodes S j S j k k j At least α S j edges Each of degree k 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 52

53 Proof (continued): In how many steps of BFS we reach >n/2 nodes? j Need j so that: S j Let s set: Then: k log k log 2 j log 2 2 k 2 n n In 2k/α log n steps S j grows to Θ(n). So, the diameter of G is O(log(n)/ α) n n k n 2 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 53 n 2 In j steps, we reach >n/2 nodes s In log(n) steps, we reach >n/2 nodes In j steps, we reach >n/2 nodes In log(n) steps, we reach >n/2 nodes Claim: k log 2 n 2 k t log Remember n>0, α k then: if k : k if 0 then x : and x e x log log 2 lim n x 2 n e log Diameter = 2 j Diameter = 2 log(n) 2 2 n 2 n log 2 n 2 x log x 2 n

54 Degree distribution: n P( k) p k k ( p) n k Path length: O(log n) Clustering coefficient: C = p = k / n 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 54

k / n C 8 0-8 9/25/203 Jure Leskovec, Stanford

55 MSN G np Degree distribution: Path length: 6.6 O(log n) h 8.2 Clustering coefficient: 0. k / n C /25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 55

56 Are real networks like random graphs? Giant connected component: Average path length: Clustering Coefficient: Degree Distribution: Problems with the random network model: Degreed distribution differs from that of real networks Giant component in most real network does NOT emerge through a phase transition No local structure clustering coefficient is too low Most important: Are real networks random? The answer is simply: NO! 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 56

57 If G np is wrong, why did we spend time on it? It is the reference model for the rest of the class. It will help us calculate many quantities, that can then be compared to the real data It will help us understand to what degree is a particular property the result of some random process So, while G np is WRONG, it will turn out to be extremly USEFUL! 9/25/203 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 57

58 What happens to G np when we vary p?

59 Remember, expected degree We want E[X v ] be independent of n So let: p=c/(n-) Observation: If we build random graph G np with p=c/(n-) we have many isolated nodes Why? 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 59 c n n n e n c p v P ) ( 0] degree has [ c c x x c x n n e x x n c lim lim n c x Use substitution e x x x e lim By definition: p n X E v ) ( ] [

60 How big do we have to make p before we are likely to have no isolated nodes? We know: P[v has degree 0] = e -c Event we are asking about is: I = some node is isolated I where I v is the event that v is isolated vn I v We have: P I P I v P Iv vn vn 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 60 ne c Union bound A i A i i A i i

61 We just learned: P(I) = n e -c Let s try: c = ln n then: n e -c = n e -ln n =n/n= c= 2 ln n then: n e -2 ln n = n/n 2 = /n So if: p= ln n then: P(I) = p= 2 ln n then: P(I) = /n 0 as n 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 6

62 Graph structure of G np as p changes: p /(n-) 0 Giant component Avg. deg const. Fewer isolated No isolated nodes. Empty graph appears c/(n-) Lots of isolated nodes. log(n)/(n-) nodes. 2*log(n)/(n-) Complete graph Emergence of a Giant Component: avg. degree k=2e/n or p=k/(n-) k=-ε: all components are of size Ω(log n) k=+ε: component of size Ω(n), others have size Ω(log n) 0/4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 62

Fraction of nodes in the largest component G np, n=00k, p(n-) = 0.

63 Fraction of nodes in the largest component G np, n=00k, p(n-) = /4/20 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 63

CS224W: Analysis of Networks Jure Leskovec, Stanford University

CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/30/17 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2