Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Note to other teachers and users of these slides: We would be delighted if you found our material useful for your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org

We often think of networks as being organized into modules, clusters, communities. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


Find micro-markets by partitioning the query-to-advertiser graph (figure: queries on one side, advertisers on the other). [Andersen, Lang: Communities from seed sets, 2006]

Clusters in the Movies-to-Actors graph. [Andersen, Lang: Communities from seed sets, 2006]

Discovering social circles, circles of trust. [McAuley, Leskovec: Discovering social circles in ego networks, 2012]

How to find communities? We will work with undirected (unweighted) networks.

Edge betweenness: the number of shortest paths passing over the edge. Intuition (figures): edge strengths (call volume) in a real network vs. edge betweenness in the same network; example betweenness values b=16 and b=7.5.

[Girvan-Newman 02] Divisive hierarchical clustering based on the notion of edge betweenness: the number of shortest paths passing through the edge. Girvan-Newman algorithm (undirected, unweighted networks): repeat until no edges are left: (1) calculate the betweenness of all edges; (2) remove the edge(s) with highest betweenness. Connected components are communities. Gives a hierarchical decomposition of the network.
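The loop above can be sketched in a few lines of pure Python. This is a minimal illustration, not the authors' code: the dictionary-of-sets graph representation and the function names are my own, and the betweenness routine is the standard Brandes-style BFS credit assignment that the later slides describe informally.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Edge betweenness for an undirected, unweighted graph given as
    {node: set_of_neighbors}, via one BFS per starting node."""
    bet = defaultdict(float)
    for s in adj:
        dist, sigma, preds = {s: 0}, defaultdict(float), defaultdict(list)
        sigma[s] = 1.0
        order, queue = [], deque([s])
        while queue:                          # BFS: count shortest paths to each node
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:    # w is one level below v
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)            # work back up the BFS tree:
        for w in reversed(order):             # node flow = 1 + flow from child edges,
            for v in preds[w]:                # split among parents by their path counts
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2.0 for e, b in bet.items()}   # each path seen from both endpoints

def girvan_newman_step(adj):
    """One divisive step: remove the edge(s) of highest betweenness."""
    bet = edge_betweenness(adj)
    top = max(bet.values())
    for e in (e for e, b in bet.items() if b == top):
        u, v = tuple(e)
        adj[u].discard(v)
        adj[v].discard(u)

# two triangles joined by a bridge; the bridge carries all 3*3 cross pairs,
# so its betweenness is 9 and the first step disconnects the two communities
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
girvan_newman_step(adj)
```

Re-running `girvan_newman_step` until no edges remain, and recording the components after each step, yields the hierarchical decomposition the slide mentions.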

Need to re-compute betweenness at every step. (Figure: example edge betweenness values 1, 12, 33, 49.)

Steps 1, 2, 3 (figure): hierarchical network decomposition.

Communities in physics collaborations.

Zachary's Karate club: hierarchical decomposition.

1. How to compute betweenness? 2. How to select the number of clusters?

Want to compute the betweenness of paths starting at a given node. Breadth-first search starting from that node (figure: BFS levels 0, 1, 2, 3, 4).

Count the number of shortest paths from the starting node to all other nodes of the network.

Compute betweenness by working up the tree: if there are multiple paths, count them fractionally. The algorithm: add edge flows: node flow = 1 + sum of flows on child edges; split the flow up among parents based on the parent values (their shortest-path counts). Repeat the BFS procedure for each starting node. Examples from the figure: 1 path to K, split evenly; 1+0.5 paths to J, split 1:2; 1+1 paths to H, split evenly.


1. How to compute betweenness? 2. How to select the number of clusters?

Communities: sets of tightly connected nodes. Define: Modularity Q, a measure of how well a network is partitioned into communities. Given a partitioning of the network into groups s ∈ S: Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]. Need a null model!

Given a real G on n nodes and m edges, construct a rewired network G': same degree distribution but random connections. Consider G' as a multigraph. The expected number of edges between nodes i and j of degrees k_i and k_j equals k_i·k_j/(2m). The expected number of edges in the (multigraph) G' is (1/2)·Σ_i Σ_j k_i·k_j/(2m) = (1/2)·(1/2m)·(Σ_i k_i)·(Σ_j k_j) = (1/4m)·(2m)·(2m) = m. (Note: Σ_i k_i = 2m.)

Modularity of partitioning S of graph G: Q(G,S) = (1/2m)·Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i·k_j/2m ), where A_ij = 1 if there is an edge (i,j), 0 else, and 1/2m is a normalizing constant, so −1 < Q < 1. Modularity values take range [−1, 1]. Q is positive if the number of edges within groups exceeds the expected number. Q in the range 0.3-0.7 means significant community structure.
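The formula above translates directly into code. A brute-force sketch (function name and example graph are mine, not from the slides); for the two-triangles-plus-bridge graph, the natural partition scores Q = 5/14 ≈ 0.36, inside the "significant structure" range quoted above:

```python
import numpy as np

def modularity(A, communities):
    """Q(G,S) = (1/2m) * sum_{s in S} sum_{i,j in s} (A_ij - k_i*k_j / 2m)."""
    k = A.sum(axis=1)          # node degrees
    two_m = k.sum()            # sum of degrees = 2m
    Q = 0.0
    for group in communities:
        for i in group:
            for j in group:    # note i == j and both edge orientations are included
                Q += A[i, j] - k[i] * k[j] / two_m
    return Q / two_m

# two triangles joined by one edge: clear community structure
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
Q = modularity(A, [{0, 1, 2}, {3, 4, 5}])   # 5/14, about 0.357
```

Putting both nodes of the bridge in one group lowers Q, which is exactly the behavior a community score should have.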

Modularity is useful for selecting the number of clusters (figure: Q plotted against the number of clusters). Next time: why not optimize modularity directly?

Undirected graph G(V, E). Bi-partitioning task: divide the vertices into two disjoint groups A and B (figure: nodes 1-6 split into A and B). Questions: How can we define a "good" partition of G? How can we efficiently identify such a partition?

What makes a good partition? Maximize the number of within-group connections. Minimize the number of between-group connections.

Express partitioning objectives as a function of the "edge cut" of the partition. Cut: the set of edges with only one vertex in a group. Figure example: cut(A,B) = 2.

Criterion: Minimum-cut. Minimize the weight of connections between groups: arg min_{A,B} cut(A,B). Degenerate case (figure): the "optimal cut" vs. the "minimum cut". Problem: only considers external cluster connections; does not consider internal cluster connectivity.

Criterion: Normalized-cut [Shi-Malik, 97]. Connectivity between groups relative to the density of each group: ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B), where vol(A) is the total weight of the edges with at least one endpoint in A. Why use this criterion? It produces more balanced partitions. How do we efficiently find a good partition? Problem: computing the optimal cut is NP-hard.
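Evaluating the criterion for a given partition is cheap (it is finding the optimal partition that is NP-hard). A small sketch, with names and example graph of my choosing; on the two-triangles graph, cut = 1 and vol(A) = vol(B) = 7, so ncut = 2/7:

```python
import numpy as np

def ncut(A, group_a, group_b):
    """ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B),
    where vol(S) is the total degree of the nodes in S."""
    cut = sum(A[i, j] for i in group_a for j in group_b)
    vol = lambda S: A[sorted(S)].sum()      # sum of the rows of S = total degree
    return cut / vol(group_a) + cut / vol(group_b)

# two triangles joined by one edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
val = ncut(A, {0, 1, 2}, {3, 4, 5})         # 1/7 + 1/7 = 2/7
```

Note how a degenerate partition such as ({0}, everything else) is penalized: its small volume makes the first term large, which is the balancing effect the slide refers to.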

A: adjacency matrix of undirected G; A_ij = 1 if (i,j) is an edge, else 0. x is a vector in R^n with components (x_1, …, x_n); think of it as a label/value on each node of G. What is the meaning of A·x? For y = A·x, entry y_i = Σ_j A_ij·x_j is the sum of the labels x_j of the neighbors j of i.

j-th coordinate of A·x: the sum of the x-values of the neighbors of j; make this the new value at node j. Spectral Graph Theory: analyze the "spectrum" of a matrix representing G. Spectrum: the eigenvectors x_i of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ_i: Λ = {λ_1, λ_2, …, λ_n} with λ_1 ≤ λ_2 ≤ … ≤ λ_n.

Suppose all nodes in G have degree d and G is connected. What are some eigenvalues/eigenvectors of A? A·x = λ·x: what is λ? What is x? Let's try x = (1, 1, …, 1). Then A·x = (d, d, …, d) = λ·x, so λ = d. We found an eigenpair of G: x = (1, 1, …, 1), λ = d. Remember the meaning of A·x: each entry sums the labels of a node's neighbors.
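Both this claim and the disconnected case discussed on the next slides are easy to check numerically. A small sketch (the example graphs are my own): a 4-cycle is connected and 2-regular, so (2, all-ones) is an eigenpair; two disjoint triangles are 2-regular but disconnected, so the eigenvalue d = 2 appears twice.

```python
import numpy as np

# a 4-cycle: connected, 2-regular
C4 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)
ones = np.ones(4)
assert np.allclose(C4 @ ones, 2 * ones)      # A·(1,...,1) = d·(1,...,1)

# two disjoint triangles: 2-regular but disconnected
T2 = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    T2[i, j] = T2[j, i] = 1.0
vals = np.linalg.eigvalsh(T2)                # eigenvalues in ascending order
assert np.isclose(vals[-1], 2.0)             # largest eigenvalue is d = 2 ...
assert np.isclose(vals[-2], 2.0)             # ... and it repeats: multiplicity 2
```

The repeated eigenvalue d in the disconnected case is exactly the "λ_{n−1} close to λ_n" phenomenon the next slides build intuition from.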

Details! G is d-regular and connected, A is its adjacency matrix. Claim: d is the largest eigenvalue of A, and d has multiplicity 1 (there is only 1 eigenvector associated with eigenvalue d). Proof: to obtain eigenvalue d we needed x_i = x_j for every i, j; this means x = c·(1, 1, …, 1) for some constant c. Define S = the set of nodes with the maximum possible value of x_i. Consider some vector y which is not a multiple of the vector (1, …, 1); then not all nodes (with labels y_i) are in S. Consider some node j in S with a neighbor i not in S: then node j gets a value strictly less than d times the maximum. So y is not an eigenvector with eigenvalue at least d, and d is the largest eigenvalue!

What if G is not connected? Say G has 2 components, each d-regular. What are some eigenvectors? Put all 1s on one component and 0s on the other, or vice versa: x' = (1, …, 1, 0, …, 0) gives A·x' = (d, …, d, 0, …, 0); x'' = (0, …, 0, 1, …, 1) gives A·x'' = (0, …, 0, d, …, d). So in both cases the corresponding λ = d. A bit of intuition (figure): once the components A and B are joined by a few edges, the 2nd largest eigenvalue λ_{n−1} has a value very close to λ_n = d.

More intuition (figure): when A and B are barely connected, the 2nd largest eigenvalue λ_{n−1} is still very close to λ_n. If the graph is connected (right example) then we already know that x_n = (1, …, 1) is an eigenvector. Since eigenvectors are orthogonal, the components of x_{n−1} sum to 0. Why? Because x_n · x_{n−1} = Σ_i x_{n−1}[i] = 0. So we can look at the eigenvector of the 2nd largest eigenvalue and declare nodes with positive label to be in A and those with negative label to be in B. But there is still lots to sort out.

Adjacency matrix (A): n×n matrix A = [a_ij], a_ij = 1 if there is an edge between nodes i and j. Important properties: symmetric matrix; eigenvectors are real and orthogonal.

      1  2  3  4  5  6
   1  0  1  1  0  1  0
   2  1  0  1  0  0  0
   3  1  1  0  1  0  0
   4  0  0  1  0  1  1
   5  1  0  0  1  0  1
   6  0  0  0  1  1  0

Degree matrix (D): n×n diagonal matrix D = [d_ii], d_ii = degree of node i.

      1  2  3  4  5  6
   1  3  0  0  0  0  0
   2  0  2  0  0  0  0
   3  0  0  3  0  0  0
   4  0  0  0  3  0  0
   5  0  0  0  0  3  0
   6  0  0  0  0  0  2

Laplacian matrix (L): n×n symmetric matrix L = D − A. What is the trivial eigenpair? x = (1, …, 1): then L·x = 0, so λ = λ_1 = 0. Important properties: eigenvalues are non-negative real numbers; eigenvectors are real and orthogonal.

      1  2  3  4  5  6
   1  3 -1 -1  0 -1  0
   2 -1  2 -1  0  0  0
   3 -1 -1  3 -1  0  0
   4  0  0 -1  3 -1 -1
   5 -1  0  0 -1  3 -1
   6  0  0  0 -1 -1  2
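The construction and the two properties are easy to verify numerically for the 6-node example graph above (a minimal sketch; the 0-indexed edge list is just the adjacency matrix shown on the earlier slide):

```python
import numpy as np

# the 6-node example graph from the slides, 0-indexed
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # Laplacian: L = D - A

assert np.allclose(L @ np.ones(6), 0)        # trivial eigenpair: x = (1,...,1), lambda = 0
assert np.linalg.eigvalsh(L).min() > -1e-9   # all eigenvalues non-negative: L is PSD
```

Note that each row of L sums to zero by construction (degree on the diagonal, −1 per incident edge), which is why the all-ones vector is always in the null space.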

Details! (a) All eigenvalues are ≥ 0; (b) x^T·L·x ≥ 0 for every x; (c) L = N^T·N for some matrix N. That is, L is positive semi-definite. Proof: (c)⇒(b): x^T·L·x = x^T·N^T·N·x = (N·x)^T·(N·x) ≥ 0, as it is just the squared length of the vector N·x. (b)⇒(a): let λ be an eigenvalue of L with eigenvector x; then by (b), 0 ≤ x^T·L·x = λ·x^T·x, so λ ≥ 0. (a)⇒(c) is also easy! Do it yourself.

Fact: For symmetric matrix M: What is the meaning of min x T L x on G? x L x,,, 2 λ = min 2 x x T x M x T x, 2, Node has degree. So, value needs to be summed up times. But each edge, has two endpoints so we need J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 40

Details! Write x in the axes of the eigenvectors w_1, w_2, …, w_n of M: x = Σ_i α_i·w_i. Then we get: x^T·M·x = Σ_i α_i²·λ_i. So, what is min_x x^T·M·x? We minimize over choices of (α_1, …, α_n) so that: Σ_i α_i² = 1 (unit length) and α_1 = 0 (orthogonal to w_1). To minimize, set α_2 = 1 and α_i = 0 otherwise; then min x^T·M·x = λ_2.

What else do we know about x? x is a unit vector: Σ_i x_i² = 1. x is orthogonal to the 1st eigenvector (1, …, 1), thus Σ_i x_i = 0. Remember: λ_2 = min over all labelings x of nodes with Σ_i x_i = 0 of Σ_{(i,j)∈E} (x_i − x_j)² / Σ_i x_i². We want to assign values x_i to nodes such that few edges cross 0 (we want x_i and x_j on each edge to nearly cancel), while Σ_i x_i = 0 forces balance.

Back to finding the optimal cut. Express partition (A, B) as a vector y: y_i = +1 if i ∈ A, y_i = −1 if i ∈ B. We can minimize the cut of the partition by finding a non-trivial vector that minimizes Σ_{(i,j)∈E} (y_i − y_j)². Can't solve this exactly over y_i ∈ {−1, +1}; let's relax and allow x_i to take any real value.

λ_2 = min_{x: Σx_i = 0} Σ_{(i,j)∈E} (x_i − x_j)² / Σ_i x_i². The minimum value is given by the 2nd smallest eigenvalue λ_2 of the Laplacian matrix L. The optimal solution for x is given by the corresponding eigenvector x_2, referred to as the Fiedler vector.

Details! Suppose there is a partition of G into A and B with |A| ≤ |B|, and let α = (# edges from A to B) / |A|. Then λ_2 ≤ 2α. This is the approximation guarantee of spectral clustering: the quantity spectral optimizes is at most a factor 2 away from the optimal cut score α. Proof: let a = |A|, b = |B| and e = # edges from A to B. It is enough to choose some x based on A and B such that λ_2 ≤ Σ_{(i,j)∈E}(x_i − x_j)² / Σ_i x_i² ≤ 2α (λ_2 is only smaller), while also Σ_i x_i = 0.

Details! Proof (continued): 1) Let's set x_i = −1/a if i ∈ A and x_i = +1/b if i ∈ B. Quickly verify that Σ_i x_i = 0: a·(−1/a) + b·(1/b) = 0. 2) Then: Σ_{(i,j)∈E}(x_i − x_j)² / Σ_i x_i² = e·(1/a + 1/b)² / (a·(1/a)² + b·(1/b)²) = e·(1/a + 1/b)² / (1/a + 1/b) = e·(1/a + 1/b) ≤ e·(1/a + 1/a) = 2e/a = 2α, using a ≤ b (e = # edges between A and B). Which proves that the quantity spectral minimizes is at most twice the OPT cut score.

Details! Putting it all together: 2α ≥ λ_2 ≥ α²/(2·k_max), where k_max is the maximum node degree in the graph. Note we only proved the 1st part, 2α ≥ λ_2; we did not prove λ_2 ≥ α²/(2·k_max). Overall this certifies that λ_2 always gives a useful bound on the best cut score α.

How to define a "good" partition of a graph? Minimize a given graph cut criterion. How to efficiently identify such a partition? Approximate using information provided by the eigenvalues and eigenvectors of the graph: Spectral Clustering.

Three basic stages: 1) Pre-processing: construct a matrix representation of the graph. 2) Decomposition: compute eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors. 3) Grouping: assign points to two or more clusters, based on the new representation.

1) Pre-processing: build the Laplacian matrix L of the graph (the 6-node example above). 2) Decomposition: find the eigenvalues λ and eigenvectors x of the matrix L: λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0), with eigenvector matrix X = [x_1, …, x_6]:

   node  x1    x2    x3    x4    x5    x6
   1     0.4   0.3  -0.5  -0.2  -0.4  -0.5
   2     0.4   0.6   0.4  -0.4   0.4   0.0
   3     0.4   0.3   0.1   0.6  -0.4   0.5
   4     0.4  -0.3   0.1   0.6   0.4  -0.5
   5     0.4  -0.3  -0.5  -0.2   0.4   0.5
   6     0.4  -0.6   0.4  -0.4  -0.4   0.0

Map vertices to the corresponding components of λ_2's eigenvector: x_2 = (0.3, 0.6, 0.3, −0.3, −0.3, −0.6). How do we now find the clusters?

3) Grouping: sort the components of the reduced 1-dimensional vector; identify clusters by splitting the sorted vector in two. How to choose a splitting point? Naïve approaches: split at 0 or at the median value. More expensive approaches: attempt to minimize the normalized cut in 1 dimension (sweep over the ordering of nodes induced by the eigenvector). Split at 0 for x_2 = (0.3, 0.6, 0.3, −0.3, −0.3, −0.6): Cluster A (positive points): nodes 1, 2, 3. Cluster B (negative points): nodes 4, 5, 6.
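The three stages above fit in a few lines of numpy. This sketch runs the full pipeline on the same 6-node example graph (0-indexed); for this graph λ_2 = 1.0 and the split at 0 recovers the clusters {1,2,3} and {4,5,6} from the slide:

```python
import numpy as np

# 1) Pre-processing: Laplacian of the 6-node example graph (0-indexed)
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# 2) Decomposition: eigh returns eigenvalues in ascending order,
#    so column 1 of the eigenvector matrix is the Fiedler vector
vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]                      # lambda_2 = vals[1] = 1.0 here

# 3) Grouping: naive split at 0 (the eigenvector's overall sign is arbitrary)
cluster_a = {i for i in range(6) if fiedler[i] >= 0}
cluster_b = set(range(6)) - cluster_a
```

A sweep-based split would instead sort nodes by their Fiedler-vector value and pick the prefix/suffix split with the lowest normalized cut, as the slide's "more expensive" option suggests.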

(Figure: value of x_2 plotted against rank in x_2.)

(Figure: components of x_2; value of x_2 plotted against rank in x_2.)

(Figures: components of x_1 and components of x_3.)

How do we partition a graph into k clusters? Two basic approaches: Recursive bi-partitioning [Hagen et al., 92]: recursively apply the bi-partitioning algorithm in a hierarchical divisive manner. Disadvantages: inefficient, unstable. Cluster multiple eigenvectors [Shi-Malik, 00]: build a reduced space from multiple eigenvectors. Commonly used in recent papers; a preferable approach.

Approximates the optimal cut [Shi-Malik, 00]: can be used to approximate the optimal k-way normalized cut. Emphasizes cohesive clusters: increases the unevenness in the distribution of the data; associations between similar points are amplified, associations between dissimilar points are attenuated; the data begins to approximate a clustering. Well-separated space: transforms data to a new embedded space consisting of k orthogonal basis vectors. Multiple eigenvectors prevent instability due to information loss.

[Kumar et al. 99] Searching for small communities in the Web graph. What is the signature of a community / discussion in a Web graph? Use this to define "topics": what the same people on the left talk about on the right. Remember HITS! Dense 2-layer graph. Intuition: many people all talking about the same things.

A more well-defined problem: enumerate complete bipartite subgraphs K_{s,t}, where K_{s,t} has s nodes on the left, each linking to the same t other nodes on the right. Figure: K_{3,4} with |X| = s = 3, |Y| = t = 4, fully connected.

[Agrawal-Srikant 99] Market basket analysis. Setting: Market: universe U of n items. Baskets: m subsets of U: S_1, S_2, …, S_m ⊆ U (S_i is the set of items one person bought). Support: frequency threshold f. Goal: find all subsets T such that T ⊆ S_i for at least f of the sets S_i (items in T were bought together at least f times). What's the connection between the itemsets and complete bipartite graphs?

[Kumar et al. 99] Frequent itemsets = complete bipartite graphs! How? View each node i as the set S_i of nodes that i points to (figure: i → a, b, c, d, so S_i = {a,b,c,d}). K_{s,t} = a set Y of size t that occurs in s of the sets S_i. Looking for K_{s,t}: set the frequency threshold to s and look at layer t, i.e., all frequent sets of size t. Here s = minimum support (|X| = s) and t = itemset size (|Y| = t).

[Kumar et al. 99] View each node i as the set S_i of nodes i points to. Find frequent itemsets with s = minimum support and t = itemset size; K_{s,t} = a set Y of size t that occurs in s sets S_i. Say we find a frequent itemset Y = {a,b,c} of support s: then there are s nodes (say x, y, z) that link to all of {a,b,c}. We found K_{s,t}!

Example (figure: a directed graph on nodes a-f). Itemsets: a = {b,c,d}, b = {d}, c = {b,d,e,f}, d = {e,f}, e = {b,d}, f = {}. Support threshold s = 2: {b,d} has support 3, {e,f} has support 2. And we just found 2 complete bipartite subgraphs: {a, c, e} all link to {b, d}; {c, d} both link to {e, f}.
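The reduction can be checked on exactly this example. A brute-force sketch (the function name is mine; real trawling would use a scalable frequent-itemset algorithm such as A-Priori rather than enumerating all size-t subsets):

```python
from itertools import combinations

def frequent_itemsets(baskets, s, t):
    """All size-t itemsets contained in at least s baskets; for each, also
    return the supporting baskets, which form the left side X of K_{s,t}."""
    items = sorted(set().union(*baskets))
    result = {}
    for Y in combinations(items, t):          # candidate right side Y, |Y| = t
        support = [i for i, b in enumerate(baskets) if set(Y) <= b]
        if len(support) >= s:
            result[Y] = support
    return result

# out-links from the slide's example: a={b,c,d}, b={d}, c={b,d,e,f}, d={e,f}, e={b,d}, f={}
baskets = [{'b', 'c', 'd'}, {'d'}, {'b', 'd', 'e', 'f'}, {'e', 'f'}, {'b', 'd'}, set()]
found = frequent_itemsets(baskets, s=2, t=2)
# found == {('b','d'): [0, 2, 4], ('e','f'): [2, 3]}:
# {b,d} supported by nodes a, c, e and {e,f} by nodes c, d, as on the slide
```

Each returned entry is a K_{s,t}: the supporting basket indices are the left-side nodes X, the itemset is the right-side node set Y.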

Example of a community from a web graph (figure: nodes on the left linking to nodes on the right). [Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities, 1999]