Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices


The Laplacian Approach
As with the betweenness approach, we want to divide a social graph into communities with most edges contained within a community. A surprising technique involving the eigenvector with the second-smallest eigenvalue serves as a good heuristic for breaking a graph into two parts that have the smallest number of edges between them. We can iterate to divide the graph into as many parts as we like.

Three Matrices That Describe Graphs
1. Degree matrix: entry (i, i) is the degree of node i; off-diagonal entries are 0.
2. Adjacency matrix: entry (i, j) is 1 if there is an edge between node i and node j, otherwise 0.
3. Laplacian matrix = degree matrix minus adjacency matrix.

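Example: Matrices
Graph: A - B - C - D (a path on four nodes).
Degree matrix:
$$D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
Adjacency matrix:
$$A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
Laplacian matrix:
$$L = \begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 1 \end{pmatrix}$$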

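Every Laplacian Has Zero as an Eigenvalue
Proof: Each row has a sum of 0, so the Laplacian L multiplying the all-1's vector gives the all-0's vector, which is also 0 times the all-1's vector.
Example:
$$\begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = 0 \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$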
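To make the three matrices concrete, here is a minimal Python sketch that builds them from an edge list and checks the zero-row-sum property. The function name `laplacian` and the edge-list representation are illustrative assumptions, not part of the slides; the graph is the four-node path whose Laplacian appears in the example.

```python
def laplacian(nodes, edges):
    """Return (D, A, L) as lists of lists for an undirected graph."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    # Adjacency matrix: 1 where an edge exists, 0 elsewhere.
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        A[i][j] = 1
        A[j][i] = 1
    # Degree matrix: node degrees on the diagonal.
    D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]
    # Laplacian: degree matrix minus adjacency matrix.
    L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    return D, A, L


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]  # the path A - B - C - D
D, A, L = laplacian(nodes, edges)

# Every row of L sums to 0, so L times the all-1's vector is the zero vector.
row_sums = [sum(row) for row in L]
```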

The Second-Smallest Eigenvalue
Let L be a Laplacian matrix, so $L = D - A$, where D and A are the degree matrix and adjacency matrix for some graph. The second eigenvector x can be found by minimizing $x^T L x$ subject to the constraints:
1. The length of x is 1.
2. x is orthogonal to the eigenvector associated with the smallest eigenvalue, which for a Laplacian matrix L is the all-1's vector.
The minimum of $x^T L x$ is then the second-smallest eigenvalue.

Meaning of Second Eigenvector
Let the i-th component of x be $x_i$. Aside: the constraint that x is orthogonal to the all-1's vector says $\sum_i x_i = 0$. Break up $x^T L x$ as $x^T L x = x^T D x - x^T A x$. Since D is diagonal, with degree $d_i$ as its i-th diagonal entry, Dx is the vector with i-th element $d_i x_i$. Therefore $x^T D x = \sum_i d_i x_i^2$. The i-th component of Ax is the sum of the $x_j$'s where node j is adjacent to node i. Therefore $x^T A x$ is the sum of $2 x_i x_j$ over all adjacent pairs i and j (each edge counted once), so the term $-x^T A x$ contributes $-2 x_i x_j$ for each edge.

Second Eigenvector (2)
Now we know $x^T L x = \sum_i d_i x_i^2 - \sum_{i,j\ \text{adjacent}} 2 x_i x_j$. Distribute $d_i x_i^2$ over all nodes adjacent to node i, one $x_i^2$ per edge. This gives us $x^T L x = \sum_{i,j\ \text{adjacent}} (x_i^2 - 2 x_i x_j + x_j^2) = \sum_{i,j\ \text{adjacent}} (x_i - x_j)^2$. Remember: we're minimizing $x^T L x$. The minimum will tend to make $x_i$ and $x_j$ close when there is an edge between i and j. Also, the constraint that the sum of the $x_i$'s is 0 means there will be roughly the same number of positive and negative $x_i$'s.
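The identity between the quadratic form and the sum of squared differences over edges can be checked numerically. The sketch below does this for the four-node path Laplacian from the earlier example; the test vector `x` is an arbitrary choice made for illustration, and the helper names are assumptions.

```python
# Laplacian of the path A - B - C - D, with nodes indexed 0..3.
L = [[1, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 1]]
edges = [(0, 1), (1, 2), (2, 3)]  # each undirected edge listed once


def quadratic_form(L, x):
    """Compute x^T L x directly from the matrix entries."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


def edge_sum(edges, x):
    """Compute the sum of (x_i - x_j)^2 over all edges."""
    return sum((x[i] - x[j]) ** 2 for i, j in edges)


x = [3.0, -1.0, 0.5, -2.5]  # arbitrary vector, chosen only for the check
lhs = quadratic_form(L, x)
rhs = edge_sum(edges, x)
```

Both quantities agree for any vector `x`, not only for eigenvectors, which is why minimizing the quadratic form pulls the components of adjacent nodes together.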

Second Eigenvector (3)
Put another way: if there is an edge between i and j, then there is a good chance that $x_i$ and $x_j$ will both be positive or both be negative. So partition the graph according to the sign of $x_i$. This is likely to minimize the number of edges with one end on each side.

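Example: Second Eigenvector
Nodes A, B, C, D with Laplacian matrix
$$L = \begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 1 \end{pmatrix}$$
Eigenvalues: $0,\ 2-\sqrt{2},\ 2,\ 2+\sqrt{2}$. For the second-smallest eigenvalue $2-\sqrt{2} \approx .586$:
$$L \begin{pmatrix} 1 \\ \sqrt{2}-1 \\ 1-\sqrt{2} \\ -1 \end{pmatrix} = \begin{pmatrix} 2-\sqrt{2} \\ 3\sqrt{2}-4 \\ 4-3\sqrt{2} \\ \sqrt{2}-2 \end{pmatrix} \approx \begin{pmatrix} .586 \\ .242 \\ -.242 \\ -.586 \end{pmatrix}$$
Puts A and B in the positive group, C and D in the negative group.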
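One way to compute the second eigenvector without a linear-algebra library is power iteration on a shifted matrix, sketched below. This is an illustrative technique, not the method used in the slides: it iterates on $cI - L$ (with $c$ an upper bound on the largest eigenvalue, here twice the maximum degree) while repeatedly removing the all-1's component. The function name, the starting vector, and the iteration count are assumptions.

```python
import math


def fiedler_vector(L, iterations=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L."""
    n = len(L)
    # Twice the maximum degree bounds the largest Laplacian eigenvalue,
    # so c*I - L has only non-negative eigenvalues.
    c = 2 * max(L[i][i] for i in range(n))
    x = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        # Project out the all-1's direction (eigenvalue 0 of L).
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        # Multiply by (c*I - L); the dominant remaining direction is
        # the eigenvector for the second-smallest eigenvalue of L.
        y = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return x


nodes = ["A", "B", "C", "D"]
L = [[1, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 1]]
x = fiedler_vector(L)

# Partition by sign. Which group ends up "positive" depends on the start
# vector, since an eigenvector is only determined up to sign.
positive = [nodes[i] for i in range(len(nodes)) if x[i] > 0]
negative = [nodes[i] for i in range(len(nodes)) if x[i] <= 0]

# With x of unit length, x^T L x estimates the second-smallest eigenvalue.
rayleigh = sum(x[i] * L[i][j] * x[j] for i in range(4) for j in range(4))
```

The two groups separate {A, B} from {C, D}, matching the partition in the example. For large sparse graphs a library eigensolver would normally be preferred over this loop.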

Analysis of Large Graphs: Trawling

Trawling [Kumar et al. 99]
Searching for small communities in the Web graph. What is the signature of a community / discussion in a Web graph? A dense 2-layer graph. Intuition: many people all talking about the same things. Use this to define "topics": what the same people on the left talk about on the right. Remember HITS!

Searching for Small Communities
A more well-defined problem: enumerate complete bipartite subgraphs $K_{s,t}$, where $K_{s,t}$ has s nodes on the left, each of which links to the same t other nodes on the right. Example: $K_{3,4}$, with $|X| = s = 3$ and $|Y| = t = 4$, fully connected.

Frequent Itemset Enumeration [Agrawal-Srikant 99]
Market basket analysis. Setting: Market: universe U of n items. Baskets: m subsets of U: $S_1, S_2, \ldots, S_m \subseteq U$ ($S_i$ is a set of items one person bought). Support: frequency threshold f. Goal: find all subsets T such that $T \subseteq S_i$ for at least f sets $S_i$ (items in T were bought together at least f times). What's the connection between the itemsets and complete bipartite graphs?

From Itemsets to Bipartite $K_{s,t}$ [Kumar et al. 99]
Frequent itemsets = complete bipartite graphs! How? View each node i as a set $S_i$ of the nodes i points to; for example, if i points to a, b, c, and d, then $S_i = \{a, b, c, d\}$. $K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$. Looking for $K_{s,t}$: set the frequency threshold to s and look at layer t (all frequent sets of size t). Here s is the minimum support ($|X| = s$) and t is the itemset size ($|Y| = t$).

From Itemsets to Bipartite $K_{s,t}$ [Kumar et al. 99]
View each node i as a set $S_i$ of the nodes i points to, e.g. $S_i = \{a, b, c, d\}$. Find frequent itemsets, with s the minimum support and t the itemset size. Say we find a frequent itemset Y = {a, b, c} of support s. Then there are s nodes that link to all of {a, b, c}; for example, nodes x, y, and z each link to a, b, and c. We found $K_{s,t}$! $K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$.

Example (1)
Itemsets: a = {b,c,d}, b = {d}, c = {b,d,e,f}, d = {e,f}, e = {b,d}, f = {}. Support threshold s = 2. {b,d}: support 3. {e,f}: support 2. And we just found 2 bipartite subgraphs: a, c, and e all point to both b and d; c and d both point to both e and f.
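The example can be reproduced with a short brute-force search. This sketch checks every candidate right-hand set Y of size t rather than using an Apriori-style pass, which is an assumption made for brevity; the function name `bicliques` and the dictionary layout are illustrative.

```python
from itertools import combinations

# Each node is viewed as the set of nodes it points to (its "basket").
out_links = {
    "a": {"b", "c", "d"},
    "b": {"d"},
    "c": {"b", "d", "e", "f"},
    "d": {"e", "f"},
    "e": {"b", "d"},
    "f": set(),
}


def bicliques(out_links, s, t):
    """Map each itemset Y of size t with support >= s to the nodes X containing it."""
    items = sorted(set().union(*out_links.values()))
    found = {}
    for Y in combinations(items, t):
        # X = all left-side nodes whose out-link set contains every item in Y.
        X = sorted(u for u, S in out_links.items() if set(Y) <= S)
        if len(X) >= s:
            found[Y] = X
    return found


result = bicliques(out_links, s=2, t=2)
```

Each entry of `result` is a complete bipartite subgraph: every node in X links to every node in Y. On a real Web graph the exhaustive loop over all size-t sets would be replaced by a frequent-itemset algorithm.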

Dense Communities Have Big Bi-Cliques
Suppose we have a community with 2n nodes, divided into left and right sides of size n. Suppose the average degree of a node within the community is 2d, so the average node has d edges connecting to the other side. Then a basket (right-side node) with $d_i$ items generates about $\binom{d_i}{t}$ itemsets of size t. The minimum number of itemsets of size t is generated when all $d_i$'s are the same and therefore equal to d. That number is $n\binom{d}{t}$.

Bi-Cliques Exist (2)
The total number of itemsets of size t is $\binom{n}{t}$. The average number of baskets per itemset is at least $n\binom{d}{t} / \binom{n}{t}$. Assume $n > d \gg t$, and we can approximate the average by $n(d/n)^t$. At least one itemset of size t must appear in at least the average number of baskets, so there will be an itemset of size t with support s as long as $n(d/n)^t > s$. This uses the approximation that x choose y is about $x^y / y!$ when $x \gg y$.

Example: Bi-Cliques Exist
Suppose there is a community of 200 nodes, which we divide into the two sides with n = 100 each. Suppose that within the community, half of all possible edges exist, so d = 50. Then there is a bi-clique with t nodes on the left and s nodes on the right as long as $100(1/2)^t > s$. Since $100(1/2)^t$ is 25, 12.5, and 6.25 for t = 2, 3, and 4, (t, s) could be about (2, 25), (3, 13), or (4, 6).
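The bound is easy to evaluate directly. The following sketch compares the exact average $n\binom{d}{t} / \binom{n}{t}$ with the approximation $n(d/n)^t$ for the values in the example; the function names are assumptions introduced here.

```python
from math import comb


def average_support_exact(n, d, t):
    """Average number of baskets per size-t itemset: n * C(d, t) / C(n, t)."""
    return n * comb(d, t) / comb(n, t)


def average_support_approx(n, d, t):
    """Approximation n * (d / n)^t, valid when n > d >> t."""
    return n * (d / n) ** t


n, d = 100, 50
bounds = {
    t: (average_support_exact(n, d, t), average_support_approx(n, d, t))
    for t in (2, 3, 4)
}
```

For these values the exact average is slightly below the approximation, so the (t, s) pairs quoted in the example should be read as rough guides rather than sharp thresholds.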

Example (2)
Example of a community from a web graph, showing the nodes on the left and the nodes on the right. [Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities, 1999]

Analysis of Large Graphs: Overlapping Communities

Identifying Communities Can we identify node groups? (communities, modules, clusters) Nodes: Football Teams Edges: Games played

NCAA Football Network NCAA conferences Nodes: Football Teams Edges: Games played

Protein-Protein Interactions Can we identify functional modules? Nodes: Proteins Edges: Physical interactions

Protein-Protein Interactions Functional modules Nodes: Proteins Edges: Physical interactions

Facebook Network Can we identify social communities? Nodes: Facebook Users Edges: Friendships

Facebook Network Social communities High school Summer internship Stanford (Squash) Stanford (Basketball) Nodes: Facebook Users Edges: Friendships

Overlapping Communities Non-overlapping vs. overlapping communities

Non-overlapping Communities
[Figure: a network and its adjacency matrix (nodes by nodes)]

Communities as Tiles!
What is the structure of community overlaps? Edge density in the overlaps is higher! Communities as "tiles".

Recap so far Communities in a network This is what we want!

Plan of attack
1) Given a model, we generate the network (generative model for networks).
2) Given a network, find the best model.

Model of networks
Goal: define a model that can generate networks. The model will have a set of parameters that we will later want to estimate (and so detect communities). Q: Given a set of nodes, how do communities generate edges of the network?

Community-Affiliation Graph
Generative model $B(V, C, M, \{p_c\})$ for graphs: nodes V, communities C, memberships M. Each community c has a single probability $p_c$. Later we fit the model to networks to detect communities.

AGM: Generative Process
Given the community affiliations, nodes u and v are connected with probability
$$P(u, v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$$
where $M_u$ is the set of communities that node u belongs to. Think of this as an OR function: if at least 1 community says YES, we create an edge.
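The generative process can be sketched in a few lines of Python. The membership sets, community probabilities, seed, and function names below are made-up illustrations, not values from the slides; the code only implements the basic edge-probability formula, so a pair with no shared community never gets an edge.

```python
import random


def edge_probability(u, v, memberships, p):
    """P(u, v) = 1 - product over shared communities c of (1 - p[c])."""
    prob_no_edge = 1.0
    for c in memberships[u] & memberships[v]:
        prob_no_edge *= 1.0 - p[c]
    return 1.0 - prob_no_edge


def generate_network(memberships, p, seed=0):
    """Flip one biased coin per node pair and return the set of edges."""
    rng = random.Random(seed)
    nodes = sorted(memberships)
    edges = set()
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if rng.random() < edge_probability(u, v, memberships, p):
                edges.add((u, v))
    return edges


# Hypothetical affiliations: u and v share both communities.
memberships = {"u": {"A", "B"}, "v": {"A", "B"}, "w": {"B"}, "x": {"A"}}
p = {"A": 0.5, "B": 0.2}

p_uv = edge_probability("u", "v", memberships, p)  # 1 - 0.5 * 0.8
p_wx = edge_probability("w", "x", memberships, p)  # no shared community
edges = generate_network(memberships, p)
```

Nodes that share more communities get more chances to be linked, which is why edge density is higher in community overlaps under this model.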

Recap: AGM networks Model Network

AGM: Flexibility AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested

How do we detect communities with AGM?

Detecting Communities
Detecting communities with AGM means finding: 1) the affiliation graph M, 2) the number of communities C, 3) the parameters $p_c$.

Maximum Likelihood Estimation

Example: MLE

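MLE for Graphs
Given the edge probabilities P(u, v), flip a biased coin for each pair of nodes to generate a graph G:
$$P = \begin{pmatrix} 0 & 0.10 & 0.10 & 0.04 \\ 0.10 & 0 & 0.02 & 0.06 \\ 0.10 & 0.02 & 0 & 0.06 \\ 0.04 & 0.06 & 0.06 & 0 \end{pmatrix} \qquad G = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$
$$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} (1 - P(u, v))$$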

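Graphs: Likelihood $P(G \mid \Theta)$
Given graph G(V, E) and $\Theta = B(V, C, M, \{p_c\})$, we calculate the likelihood that Θ generated G:
$$P = \begin{pmatrix} 0 & 0.9 & 0.9 & 0 \\ 0.9 & 0 & 0.9 & 0 \\ 0.9 & 0.9 & 0 & 0.9 \\ 0 & 0 & 0.9 & 0 \end{pmatrix} \qquad G = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
$$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} (1 - P(u, v))$$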
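As a worked instance of the likelihood formula, the sketch below evaluates it for the 0.9 probability matrix and the graph shown with it. It assumes an undirected graph, so each unordered node pair contributes one factor; the function name is an illustrative choice.

```python
def graph_likelihood(P, G):
    """Product of P(u,v) over edges and 1 - P(u,v) over non-edges."""
    n = len(G)
    likelihood = 1.0
    for u in range(n):
        for v in range(u + 1, n):  # each unordered pair once (assumption)
            if G[u][v] == 1:
                likelihood *= P[u][v]
            else:
                likelihood *= 1.0 - P[u][v]
    return likelihood


P = [[0, 0.9, 0.9, 0],
     [0.9, 0, 0.9, 0],
     [0.9, 0.9, 0, 0.9],
     [0, 0, 0.9, 0]]
G = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

likelihood = graph_likelihood(P, G)  # four edges at 0.9, two non-edges at 1 - 0
```

In practice the log of this product is maximized instead, since the raw product underflows quickly for graphs with many node pairs.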

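MLE for Graphs
Goal: find the AGM parameters Θ that maximize the likelihood of the observed graph: $\arg\max_{\Theta} P(G \mid \Theta)$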

MLE for AGM

From AGM to BigCLAM u v

j Nodes Communities

From AGM to BigCLAM 0 1.2 0 0.2 0.5 0 0 0.8 0 1.8 1 0 Node community membership strengths

BigCLAM: How to find F

BigCLAM: V1.0

BigCLAM: V2.0

BigCLAM: Scalability
BigCLAM takes 5 minutes for networks with 300k nodes. Other methods take 10 days. Can process networks with 100M edges!

CPM: Clique-based community detection

Complete Mutuality: Cliques
Clique: a complete subgraph, in which all nodes are adjacent to each other. Nodes 5, 6, 7 and 8 form a clique. It is NP-hard to find the maximum clique in a network. A straightforward implementation to find cliques is very expensive in time complexity.

Finding the Maximum Clique
In a clique of size k, each node has degree >= k-1, so nodes with degree < k-1 cannot be part of a clique of size k. Recursively apply the following pruning procedure: Sample a sub-network from the given network, and find a clique in the sub-network, say, by a greedy approach. Suppose the clique found has size k; in order to find a larger clique, all nodes with degree <= k-1 should be removed. Repeat until the network is small enough. Many nodes will be pruned, as social media networks follow a power law distribution for node degrees.

Maximum Clique Example
Suppose we sample a sub-network with nodes {1-9} and find a clique {1, 2, 3} of size 3. In order to find a clique of size > 3, remove all nodes with degree <= 3-1 = 2: first remove nodes 2 and 9, then nodes 1 and 3, then node 4.
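The pruning step can be sketched as a loop that repeatedly deletes low-degree nodes. The slide's figure is not reproduced in the text, so the nine-node edge list below is an assumption chosen to be consistent with the removal order stated in the example; the function name is also illustrative.

```python
def prune_low_degree(adj, k):
    """Repeatedly remove nodes with degree <= k - 1; return survivors and rounds."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    rounds = []
    while True:
        low = sorted(u for u, vs in adj.items() if len(vs) <= k - 1)
        if not low:
            break
        rounds.append(low)
        for u in low:
            # Drop u and erase it from its remaining neighbors' sets.
            for v in adj.pop(u):
                if v in adj:
                    adj[v].discard(u)
    return adj, rounds


# Assumed edge list (hypothetical reconstruction of the figure).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# A clique of size k = 3 was already found, so prune degree <= 2.
remaining, rounds = prune_low_degree(adj, k=3)
```

On this graph the rounds remove {2, 9}, then {1, 3}, then {4}, leaving nodes 5 to 8, which can then be searched for a clique larger than 3.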

Clique Percolation Method (CPM)
A clique is a very strict definition, and unstable. Normally cliques are used as a core or a seed to find larger communities. CPM is such a method for finding overlapping communities. Input: a parameter k and a network. Procedure: Find all cliques of size k in the given network. Construct a clique graph, in which two cliques are adjacent if they share k-1 nodes. Each connected component in the clique graph forms a community.

CPM Example
Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}. Communities: {1, 2, 3, 4} and {4, 5, 6, 7, 8}.
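Starting from the list of size-3 cliques in the example, the clique-graph step of CPM can be sketched as follows. The sketch assumes the k-cliques have already been enumerated and uses a small union-find to collect connected components; the function name and that choice of data structure are assumptions.

```python
def cpm_communities(cliques, k):
    """Merge k-cliques that share k-1 nodes; return the resulting communities."""
    cliques = [frozenset(c) for c in cliques]
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two cliques are adjacent in the clique graph if they share k-1 nodes.
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            if len(cliques[i] & cliques[j]) == k - 1:
                parent[find(i)] = find(j)

    # Each connected component of the clique graph forms one community.
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(sorted(g) for g in groups.values())


cliques = [{1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7},
           {5, 6, 8}, {5, 7, 8}, {6, 7, 8}]
communities = cpm_communities(cliques, k=3)
```

Node 4 appears in both communities because {1, 3, 4} and {4, 5, 6} share only one node and so are not adjacent in the clique graph; this is how CPM produces overlapping communities.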

Counting Triangles Bounds on Numbers of Triangles Heavy Hitters An Optimal Algorithm

Counting Triangles Why Care? 1. Density of triangles measures maturity of a community. As communities age, their members tend to connect. 2. The algorithm is actually an example of a recent and powerful theory of optimal join computation.

Data Structures Needed We need to represent a graph by data structures that let us do two things efficiently: 1. Given nodes u and v, determine whether there exists an edge between them in O(1) time. 2. Find the edges out of a node in time proportional to the number of those edges. Question for thought: What data structures would you recommend?

First Observations
Let the graph have N nodes and M edges, with $N < M < N^2$. One approach: consider all N-choose-3 sets of nodes, and see if there are edges connecting all 3. An $O(N^3)$ algorithm. Another approach: consider all edges e and all nodes u and see if both ends of e have edges to u. An O(MN) algorithm. Since $M < N^2$, it is never worse than the first approach.

Heavy Hitters
To find a better algorithm, we need to use the concept of a heavy hitter: a node with degree at least $\sqrt{M}$. Note: there can be no more than $2\sqrt{M}$ heavy hitters, or the sum of the degrees of all nodes would exceed 2M. That is impossible because each edge contributes exactly 2 to the sum of degrees. A heavy-hitter triangle is one whose three nodes are all heavy hitters.

Finding Heavy-Hitter Triangles
First, find the heavy hitters. Determine the degrees of all nodes. This takes time O(M), assuming you can find the incident edges for a node in time proportional to the number of such edges. Consider all triples of heavy hitters and see if there are edges between each pair of the three. This takes time $O(M^{1.5})$, since there is a limit of $2\sqrt{M}$ on the number of heavy hitters.

Finding Other Triangles
At least one node is not a heavy hitter. Consider each edge e. If both ends are heavy hitters, ignore it. Otherwise, let end node u not be a heavy hitter. For each of the at most $\sqrt{M}$ nodes v connected to u, see whether v is connected to the other end of e. This takes time $O(M^{1.5})$: M edges, and at most $\sqrt{M}$ work with each.

Optimality of This Algorithm
Both parts take $O(M^{1.5})$ time and together find any triangle in the graph. For any N and M, you can find a graph with N nodes, M edges, and $\Omega(M^{1.5})$ triangles, so no algorithm can do significantly better. Hint: consider a complete graph with $\sqrt{M}$ nodes, plus other isolated nodes. Note that $M^{1.5}$ can never be greater than the running times of the two obvious algorithms with which we began: $N^3$ and MN.
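The two-part algorithm can be sketched in Python as below. The sketch assumes a simple undirected graph (no self-loops or parallel edges) with sortable node labels, uses hash sets so that both edge lookup and neighbor enumeration are cheap, and collects triangles in a set so that one found from several edges is stored once; those choices, the function name, and the small test graph are assumptions rather than part of the slides.

```python
import math
from itertools import combinations


def find_triangles(edges):
    """Return all triangles, using the heavy-hitter split at degree sqrt(M)."""
    adj = {}
    edge_set = set()
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        edge_set.add(frozenset((u, v)))
    m = len(edge_set)
    heavy = {u for u, nbrs in adj.items() if len(nbrs) >= math.sqrt(m)}

    triangles = set()
    # Part 1: heavy-hitter triangles, by checking all triples of heavy hitters.
    for a, b, c in combinations(sorted(heavy), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            triangles.add(frozenset((a, b, c)))
    # Part 2: triangles with at least one node that is not a heavy hitter.
    for e in edge_set:
        u, w = tuple(e)
        if u in heavy and w in heavy:
            continue  # ignore edges whose ends are both heavy hitters
        if u in heavy:
            u, w = w, u  # make u the end that is not a heavy hitter
        for v in adj[u]:  # fewer than sqrt(M) neighbors to scan
            if v != w and v in adj[w]:
                triangles.add(frozenset((u, v, w)))
    return triangles


# Small illustrative graph (hypothetical).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]
triangles = find_triangles(edges)
```

The adjacency sets answer both requirements on the data structure: membership tests decide in expected constant time whether an edge exists, and iterating a node's set visits its edges in time proportional to their number.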

SimRank

Basic Graph Model
Directed graph G = (V, E). V = set of objects. E = set of unweighted edges. Edge (u, v) exists if there is a relation u → v. I(v) = set of in-neighbors of vertex v. O(v) = set of out-neighbors of vertex v. Example: adjacency matrix over the nodes U, Pa, Pb, Sa, Sb, in that order (rows are sources, columns are targets):
$$\begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix}$$

SimRank Similarity
Recursive model: two objects are similar if they are referenced by similar objects. That is, a ~ b if c → a and d → b, and c ~ d. An object is equivalent to itself (score = 1). Example: 1. ProfA ~ ProfB because both are referenced by Univ. 2. StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}.

Basic SimRank Equation
s(a, b) = similarity between a and b = average similarity between in-neighbors of a and in-neighbors of b. s(a, b) is in the range [0, 1]. If a = b, then s(a, b) = 1. If a ≠ b, then
$$s(a, b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i \in I(a)} \sum_{j \in I(b)} s(i, j)$$
where C is a constant, 0 < C < 1. If I(a) = ∅ or I(b) = ∅, then s(a, b) = 0. D. Fogaras and B. Rácz, Scaling link-based similarity search, presented at the Proceedings of the 14th international conference, 2005.
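The recursive definition can be evaluated by simple fixed-point iteration, sketched below for the university example (Univ references ProfA and ProfB, and so on). The decay constant C = 0.8, the iteration count, and the function name are assumptions for illustration; the in-neighbor sets are read off the example graph.

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """Iterate the basic SimRank equation starting from s(a, a) = 1, else 0."""
    nodes = sorted(in_neighbors)
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif not in_neighbors[a] or not in_neighbors[b]:
                    new[(a, b)] = 0.0  # no in-neighbors: similarity is 0
                else:
                    # Sum of similarities over all pairs of in-neighbors.
                    total = sum(s[(i, j)]
                                for i in in_neighbors[a]
                                for j in in_neighbors[b])
                    new[(a, b)] = C * total / (
                        len(in_neighbors[a]) * len(in_neighbors[b]))
        s = new
    return s


# I(v) for each node: Univ -> ProfA, Univ -> ProfB, ProfA -> StudentA,
# ProfB -> StudentB, StudentA -> Univ, StudentB -> ProfB.
in_neighbors = {
    "Univ": {"StudentA"},
    "ProfA": {"Univ"},
    "ProfB": {"Univ", "StudentB"},
    "StudentA": {"ProfA"},
    "StudentB": {"ProfB"},
}
scores = simrank(in_neighbors)
first_round = simrank(in_neighbors, iterations=1)
```

After one round, ProfA and ProfB already have a positive score because they share the in-neighbor Univ; the similarity of StudentA and StudentB only appears in later rounds, once the professors are known to be similar.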

Random Surfer-Pair Model
s(u, v) = Pr[W(u) and W(v) meet at the same step], where W(u) is a reverse $\sqrt{C}$-walk that starts at u. Monte Carlo algorithm: Sample t pairs of walks $W_i(u)$, $W_i(v)$, i = 1, ..., t. Set X(u, v) = X(u, v) + 1 if $W_i(u)$ and $W_i(v)$ meet at the same step. Use $\hat{s}(u, v) = X(u, v)/t$ as an estimate of s(u, v). If $t > \frac{2}{\varepsilon^2} \log \frac{1}{\delta}$, then $\Pr\{|\hat{s} - s| < \varepsilon\} > 1 - \delta$. B. Tian and X. Xiao, SLING: A Near-Optimal Index Structure for SimRank, presented at the SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, New York, New York, USA, 2016, pp. 1859-1874.

Homework 10.5.2 10.3.2