CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University


1 CS224W: Social and Information Network Analysis. Jure Leskovec, Stanford University

2 Task: Find coalitions in signed networks. Incentives: European chocolates! Fame. Up to 10% extra credit. Due: Friday midnight. No late days! 11/8/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 2


3 Today: 3 methods
(3) Trawling: Community signatures that can be efficiently extracted
(4) Spectral graph partitioning: Laplacian matrix, 2nd eigenvector
(5) Overlapping communities: Clique percolation method

4 [Kumar et al. 99] Searching for small communities in the Web graph. (1) What is the signature of a community/discussion in a Web graph? A dense 2-layer graph. Intuition: many people all talking about the same things. Use this to define topics: what the same people on the left talk about on the right.

5 [Kumar et al. 99] (2) A more well-defined problem: Enumerate complete bipartite subgraphs $K_{s,t}$, where $K_{s,t}$ = s nodes that each link to the same t other nodes.

6 [Kumar et al. 99] Two points:
(1) The signature of a community/discussion
(2) Complete bipartite subgraph $K_{s,t}$: a graph on s nodes, each of which links to the same t other nodes
Plan:
(A) From (2) get back to (1), via: any dense enough graph contains a smaller $K_{s,t}$ as a subgraph
(B) How do we solve (2) in a giant graph? What similar problems have been solved on big non-graph data? (3) Frequent itemset enumeration [Agrawal Srikant 99]

7 [Agrawal Srikant 99] Market basket analysis: What items are bought together in a store?
Setting: Universe U of n items. m subsets of U: $S_1, S_2, \dots, S_m \subseteq U$ ($S_i$ is the set of items one person bought). Frequency threshold f.
Goal: Find all subsets T such that $T \subseteq S_i$ for at least f of the sets $S_i$ (the items in T were bought together at least f times).

8 Example: Universe of items: U={1,2,3,4,5}. Itemsets: $S_1$={1,3,5}, $S_2$={2,3,4}, $S_3$={2,4,5}, $S_4$={3,4,5}, $S_5$={1,3,4,5}, $S_6$={2,3,4,5}. Minimum support: f=3. Algorithm: Build up the lists. Insight: for a frequent set of size k, all its subsets are also frequent.

9 [Agrawal Srikant 99] U={1,2,3,4,5}, f=3. $S_1$={1,3,5}, $S_2$={2,3,4}, $S_3$={2,4,5}, $S_4$={3,4,5}, $S_5$={1,3,4,5}, $S_6$={2,3,4,5}

10 [Agrawal Srikant 99] For $i = 1, \dots, k$: Find all frequent sets of size i by composing sets of size i-1 that differ in 1 element. Open question: Efficiently find only maximal frequent sets.
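The level-by-level build-up can be sketched in a few lines of Python. This is a minimal illustration run on the example itemsets with f=3, not the authors' implementation; the function name `frequent_itemsets` and the brute-force support count are assumptions made for the sketch.

```python
from itertools import combinations

def frequent_itemsets(baskets, f):
    """Return {itemset: support} for every itemset contained in at least f baskets."""
    baskets = [frozenset(b) for b in baskets]

    def support(candidate):
        # Number of baskets that contain the candidate set.
        return sum(1 for b in baskets if candidate <= b)

    # Level 1: frequent single items.
    items = sorted(set().union(*baskets))
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= f}
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s)
        # Compose two frequent sets of size k that differ in one element.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Insight from the slides: every subset of a frequent set is frequent,
        # so drop candidates that have an infrequent size-k subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, k))}
        level = {c for c in candidates if support(c) >= f}
        k += 1
    return result

baskets = [{1, 3, 5}, {2, 3, 4}, {2, 4, 5}, {3, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}]
freq = frequent_itemsets(baskets, f=3)
for itemset in sorted(freq, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), freq[itemset])
```

On this input the only frequent set of size 3 is {3,4,5}; item 1 appears in just two baskets and is dropped at the first level.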

11 [Kumar et al. 99] Claim: (3) (itemsets) solves (2) (bipartite graphs). How? View each node i as a set $S_i$ of the nodes i points to. $K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$. Looking for $K_{s,t}$: set the frequency threshold to s and look at layer t, i.e., all frequent sets of size t.
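The reduction can be illustrated with a brute-force sketch: treat each node's out-links as an itemset, count how many nodes share each size-t target set, and keep the target sets shared by at least s nodes. The toy graph and the name `find_kst` are hypothetical, and enumerating all size-t subsets directly is a stand-in for the scalable level-wise itemset search.

```python
from collections import defaultdict
from itertools import combinations

def find_kst(out_links, s, t):
    """Map each size-t target set to the nodes that all link to it,
    keeping only target sets shared by at least s nodes."""
    buckets = defaultdict(set)
    for node, targets in out_links.items():
        # Node i is placed in one bucket per size-t subset of S_i.
        for y in combinations(sorted(targets), t):
            buckets[y].add(node)
    return {y: nodes for y, nodes in buckets.items() if len(nodes) >= s}

# Hypothetical toy graph: node -> set of nodes it points to.
out_links = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"x", "y", "z"},
    "d": {"z"},
}
found = find_kst(out_links, s=3, t=2)
print(found)
```

Here nodes a, b and c all point to both x and y, so the pair (x, y) is the right-hand side of a $K_{3,2}$.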

12 [Kumar et al. 99] (2) $\Rightarrow$ (1): Informally, every dense enough graph G contains a bipartite $K_{s,t}$ subgraph, where s and t depend on the size (# of nodes) and density (avg. degree) of G [Kovari Sos Turan 53].
Theorem: Let $G=(X,Y,E)$, $|X|=|Y|=n$, with avg. degree $\bar{d} = s^{1/t} n^{1-1/t} + t$. Then G contains $K_{s,t}$ as a subgraph.
[Will not prove it here. See online slides]

13 For the proof we will need the following fact.
Recall: $\binom{a}{b} = \frac{a(a-1)\cdots(a-b+1)}{b!}$
Let $f(x) = x(x-1)(x-2)\cdots(x-k)$. Once $x \ge k$, $f(x)$ curves upward (convex).
Suppose a setting: $g(y)$ is convex. Want to minimize $\sum_{i=1}^{n} g(x_i)$ where $\sum_{i=1}^{n} x_i = x$. To minimize $\sum_{i} g(x_i)$, make each $x_i = x/n$.

14 Node i has degree $d_i$ and neighbor set $S_i$. Put node i in buckets for all size-t subsets of its neighbors. Buckets: potential right-hand sides of $K_{s,t}$ (i.e., all size-t subsets of $S_i$). 11/10/2009 Jure Leskovec, Stanford CS322: Network Analysis 14

15 Note: As soon as s nodes appear in a bucket we have a $K_{s,t}$. How many buckets does node i contribute to? $\binom{d_i}{t}$, where $d_i$ is the degree of node i. What is the total size of all buckets? $\sum_{i=1}^{n} \binom{d_i}{t} \ge n \binom{\bar{d}}{t}$ by convexity, where $\bar{d}$ is the average degree.

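16 Using $\binom{a}{b} = \frac{a(a-1)\cdots(a-b+1)}{b!}$, the total height of all buckets is at least $n \binom{\bar{d}}{t} \ge n \frac{(\bar{d}-t)^t}{t!}$.
Plug in $\bar{d} = s^{1/t} n^{1-1/t} + t$: $n \frac{(\bar{d}-t)^t}{t!} = n \frac{s\, n^{t-1}}{t!} = \frac{s\, n^t}{t!}$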

17 We have: total height of all buckets $\ge \frac{s\, n^t}{t!}$.
How many buckets are there? $\binom{n}{t} \le \frac{n^t}{t!}$
What is the average height of buckets? $\frac{s\, n^t}{t!} \Big/ \binom{n}{t} \ge \frac{s\, n^t}{t!} \cdot \frac{t!}{n^t} = s$
So, avg. bucket height $\ge s$. So by the pigeonhole principle, there must be at least one bucket with at least s nodes in it.

18 [Kumar et al. 99] Theoretical result: Complete bipartite subgraphs $K_{s,t}$ are embedded in larger dense enough graphs (i.e., the communities), i.e., bipartite subgraphs act as signatures of communities. Algorithmic result: Frequent itemset extraction and dynamic programming. SCALABLE!

19 Undirected graph G(V,E). Bi-partitioning task: Divide vertices into two disjoint groups (A,B). [Figure: example graph with numbered nodes split into groups A and B] Questions: How can we define a good partition of G? How can we efficiently identify such a partition?

20 What makes a good partition? Maximize the number of within-group connections. Minimize the number of between-group connections.

21 Express partitioning objectives as a function of the edge cut of the partition. Cut: Set of edges with only one vertex in a group: $cut(A,B) = \sum_{i \in A, j \in B} w_{ij}$. [Figure: example graph with cut(A,B) = 2]

22 Criterion: Minimum cut. Minimise weight of connections between groups: $\min_{A,B} cut(A,B)$. Degenerate case: [Figure: graph in which the minimum cut differs from the optimal cut]. Problem: Only considers external cluster connections. Does not consider internal cluster connectivity.

23 Criterion: Normalized cut [Shi Malik, 97]. Connectivity between groups relative to the density of each group: $ncut(A,B) = \frac{cut(A,B)}{vol(A)} + \frac{cut(A,B)}{vol(B)}$, where vol(A) is the total weight of the edges originating from group A. Why use this criterion? Produces more balanced partitions. How do we efficiently find a good partition? Problem: Computing the optimal cut is NP-hard.
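To make the difference between the two criteria concrete, here is a small sketch that evaluates the cut and the normalized cut for two candidate partitions. The 6-node edge list is an assumed example graph with unit edge weights, and the helper names are illustrative only.

```python
def cut(edges, A):
    """Number of edges with exactly one endpoint in A (unit edge weights)."""
    return sum(1 for u, v in edges if (u in A) != (v in A))

def vol(edges, A):
    """Total degree of the nodes in A: edge endpoints that fall inside A."""
    return sum((u in A) + (v in A) for u, v in edges)

def ncut(edges, A, nodes):
    """Normalized cut of the partition (A, nodes - A)."""
    B = nodes - A
    c = cut(edges, A)
    return c / vol(edges, A) + c / vol(edges, B)

# Assumed example graph on nodes 1..6.
nodes = {1, 2, 3, 4, 5, 6}
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (1, 5)]

balanced = {1, 2, 3}
lopsided = {2}
print(cut(edges, balanced), ncut(edges, balanced, nodes))
print(cut(edges, lopsided), ncut(edges, lopsided, nodes))
```

Both partitions cut two edges, so the minimum-cut criterion cannot tell them apart, while the normalized cut is noticeably smaller for the balanced split.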

24 A: adjacency matrix of undirected G. $A_{ij} = 1$ if (i,j) is an edge, else 0. x is a vector in $\mathbb{R}^n$ with components $(x_1, \dots, x_n)$: just a label/value for each node of G. What is the meaning of $Ax$? Entry $y_j$ of $y = Ax$ is the sum of the labels $x_i$ of the neighbors of j.


25 j-th coordinate of $Ax$: Sum of the x values of the neighbors of j. Make this a new value at node j. Spectral Graph Theory: Analyze the spectrum of the matrix representing G. Spectrum: Eigenvectors of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues: $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$

26 Suppose all nodes in G have degree d and G is connected. What are some eigenvalues/vectors of G? $Ax = \lambda x$. What is $\lambda$? What is x?
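One pair can be checked by hand. The extracted slide text does not state the answer, so the following is a short worked check rather than a quotation: take the all-ones vector as the candidate x and use the fact that every row of A has exactly d ones.

```latex
x = (1, \dots, 1)^T : \qquad
(Ax)_j = \sum_{i=1}^{n} A_{ji}\, x_i = \sum_{i=1}^{n} A_{ji} = d
\quad\Longrightarrow\quad Ax = d\,x, \qquad \lambda = d .
```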

27 What if G is not connected? Say G has 2 components, A and B, each d-regular. What are some eigenvectors? x = put all 1s on A and 0s on B, or vice versa.

28 Adjacency matrix (A): $n \times n$ matrix, $A = [a_{ij}]$, $a_{ij} = 1$ if there is an edge between nodes i and j. Important properties: Symmetric matrix. Eigenvectors are real and orthogonal.

29 Degree matrix (D): $n \times n$ diagonal matrix, $D = [d_{ii}]$, $d_{ii}$ = degree of node i.

30 Laplacian matrix (L): $n \times n$ symmetric matrix, $L = D - A$. What is the trivial eigenvector/eigenvalue? Important properties: Eigenvalues are non-negative real numbers. Eigenvectors are real and orthogonal.
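A small sketch of building L = D - A from an edge list, followed by a numerical check that the all-ones vector is mapped to zero (so it is an eigenvector with eigenvalue 0). The edge list is an assumed 6-node example, and the code assumes a simple graph with no self-loops or repeated edges.

```python
def laplacian(nodes, edges):
    """Return L = D - A as a dense list of lists for a simple undirected graph."""
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        L[i][j] -= 1   # off-diagonal entries come from -A
        L[j][i] -= 1
        L[i][i] += 1   # diagonal entries come from D (node degrees)
        L[j][j] += 1
    return L

# Assumed example graph on nodes 1..6.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (1, 5)]

L = laplacian(nodes, edges)
# Multiply L by the all-ones vector: each entry is a row sum of L.
L_ones = [sum(row) for row in L]
print(L_ones)
```

Every row of L sums to zero because the degree on the diagonal cancels the -1 entries for that node's neighbors.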

31 For symmetric matrix M: $\lambda_2 = \min_{x} \frac{x^T M x}{x^T x}$. What is the meaning of $\min_x x^T L x$ on G?

32 What else do we know about x? x is a unit vector: $\sum_i x_i^2 = 1$. x is orthogonal to the 1st eigenvector $(1, \dots, 1)$, thus: $\sum_i x_i \cdot 1 = \sum_i x_i = 0$.
Then: $\lambda_2 = \min \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_i x_i^2}$, where the minimum is over all labelings of nodes so that $\sum_i x_i = 0$.
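The step from the quadratic form of the Laplacian to a sum over edges can be written out explicitly. This is a standard identity, shown here as a worked derivation using $L = D - A$ and writing $d_i$ for the degree of node i; each undirected edge appears twice in the double sum over $A_{ij}$.

```latex
\begin{aligned}
x^T L x &= x^T D x - x^T A x
         = \sum_{i} d_i x_i^2 - \sum_{i,j} A_{ij}\, x_i x_j \\
        &= \sum_{(i,j) \in E} \left( x_i^2 + x_j^2 \right) - \sum_{(i,j) \in E} 2\, x_i x_j
         = \sum_{(i,j) \in E} (x_i - x_j)^2 .
\end{aligned}
```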

33 Express partition (A,B) as a vector: $x_i = +1$ if $i \in A$, $x_i = -1$ if $i \in B$. We can minimize the cut of the partition by finding a non-trivial vector x that minimizes: $f(x) = \sum_{(i,j) \in E} (x_i - x_j)^2$


34 The minimum value is given by the 2nd smallest eigenvalue $\lambda_2$ of the Laplacian matrix L. The optimal solution for x is given by the eigenvector corresponding to $\lambda_2$, referred to as the Fiedler vector.


35 How to define a good partition of a graph? Minimise a given graph cut criterion. How to efficiently identify such a partition? Approximate using information provided by the eigenvalues and eigenvectors of a graph: Spectral Clustering.

36 Three basic stages:
1. Pre-processing: Construct a matrix representation of the graph.
2. Decomposition: Compute eigenvalues and eigenvectors of the matrix. Map each point to a lower-dimensional representation based on one or more eigenvectors.
3. Grouping: Assign points to two or more clusters, based on the new representation.

37 Pre-processing: Build Laplacian matrix L of the graph. Decomposition: Find eigenvalues $\lambda$ and eigenvectors x of the matrix L. Map vertices to the corresponding components of the eigenvector for $\lambda_2$. [Figure: the matrix L with its eigenvalues and eigenvector matrix X for the example graph] How do we now find the clusters?

38 Grouping: Sort components of the reduced 1-dimensional vector. Identify clusters by splitting the sorted vector in two. How to choose a splitting point? Naïve approaches: Split at 0, mean or median value. More expensive approaches: Attempt to minimise the normalized cut criterion in 1 dimension. Split at 0: Cluster A: Positive points. Cluster B: Negative points.
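The three stages for the two-cluster case can be sketched end to end in plain Python. The eigenvector is approximated here by power iteration on $cI - L$ with the all-ones direction projected out; that is a technique swapped in to keep the sketch dependency-free, where a library eigensolver would normally be used. The edge list, the start vector and the iteration count are assumptions of the sketch.

```python
def fiedler_vector(nodes, edges, iters=500):
    """Approximate the eigenvector of the 2nd smallest Laplacian eigenvalue
    by power iteration on (c*I - L), keeping x orthogonal to (1, ..., 1)."""
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[index[u]].append(index[v])
        nbrs[index[v]].append(index[u])
    c = 2 * max(len(a) for a in nbrs)   # upper bound on the largest eigenvalue of L
    x = [float(k) for k in range(n)]    # arbitrary deterministic start vector

    def center_and_normalize(vec):
        mean = sum(vec) / n
        vec = [a - mean for a in vec]   # remove the trivial eigenvector component
        norm = sum(a * a for a in vec) ** 0.5
        return [a / norm for a in vec]

    for _ in range(iters):
        x = center_and_normalize(x)
        # x <- (c*I - L) x, where (L x)_i = deg(i) * x_i - sum of neighbor values
        x = [c * x[i] - (len(nbrs[i]) * x[i] - sum(x[j] for j in nbrs[i]))
             for i in range(n)]
    x = center_and_normalize(x)
    return {v: x[index[v]] for v in nodes}

# Assumed example graph on nodes 1..6.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (1, 5)]

fv = fiedler_vector(nodes, edges)
cluster_a = {v for v in nodes if fv[v] > 0}   # split at 0: positive points
cluster_b = set(nodes) - cluster_a            # negative points
print(sorted(cluster_a), sorted(cluster_b))
```

The sign of an eigenvector is arbitrary, so which group ends up labelled A or B depends on the start vector; only the split itself is meaningful.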


40 How do we partition a graph into k clusters? Two basic approaches:
Recursive bi-partitioning [Hagen et al., 92]: Recursively apply the bi-partitioning algorithm in a hierarchical divisive manner. Disadvantages: Inefficient, unstable.
Cluster multiple eigenvectors [Shi Malik, 00]: Build a reduced space from multiple eigenvectors. Commonly used in recent papers. A preferable approach.

41 k-eigenvector Algorithm [Ng et al., 01]
Pre-processing: Construct the scaled adjacency matrix $A' = D^{-1/2} A D^{-1/2}$.
Decomposition: Find the eigenvalues and eigenvectors of A'. Build the embedded space from the eigenvectors corresponding to the k largest eigenvalues.
Grouping: Apply k-means to the reduced $n \times k$ space to get k clusters.
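Only the pre-processing step is sketched here; the eigendecomposition and k-means steps would normally come from a numerical library and are left out. The sketch builds $A' = D^{-1/2} A D^{-1/2}$ for an assumed example graph and checks one known property: the vector with entries $\sqrt{d_i}$ is an eigenvector of A' with eigenvalue 1. The function name and edge list are illustrative, and every node is assumed to have at least one edge.

```python
def scaled_adjacency(nodes, edges):
    """Return (A', degrees) with A' = D^{-1/2} A D^{-1/2}, unit edge weights."""
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[index[u]][index[v]] = 1.0
        A[index[v]][index[u]] = 1.0
    deg = [sum(row) for row in A]
    # Entry (i, j) of A' is A_ij / sqrt(d_i * d_j); requires d_i > 0 for all i.
    scaled = [[A[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
              for i in range(n)]
    return scaled, deg

# Assumed example graph on nodes 1..6.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (1, 5)]

A_scaled, deg = scaled_adjacency(nodes, edges)
v = [d ** 0.5 for d in deg]
Av = [sum(A_scaled[i][j] * v[j] for j in range(len(nodes))) for i in range(len(nodes))]
print(max(abs(a - b) for a, b in zip(Av, v)))
```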

42 Approximates the optimal cut [Shi Malik, 00]: Can be used to approximate the optimal k-way normalized cut.
Emphasizes cohesive clusters: Increases the unevenness in the distribution of the data. Associations between similar points are amplified, associations between dissimilar points are attenuated. The data begins to approximate a clustering.
Well-separated space: Transforms data to a new embedded space, consisting of k orthogonal basis vectors.
NB: Multiple eigenvectors prevent instability due to information loss.

43 Eigengap: The difference between two consecutive eigenvalues. Most stable clustering is generally given by the value k that maximises the eigengap: $\Delta_k = |\lambda_k - \lambda_{k-1}|$. Example: [Figure: eigenvalues plotted against k, with $\max_k \Delta_k$ marking the chosen k]

44 METIS: Heuristic but works really well in practice.
Graclus: Based on kernel k-means.
Cluto
Clique percolation method: For finding overlapping clusters.

45 Non-overlapping vs. overlapping communities
