Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

The Laplacian Approach. As with the betweenness approach, we want to divide a social graph into communities with most edges contained within a community. A surprising technique involving the eigenvector with the second-smallest eigenvalue of the Laplacian matrix serves as a good heuristic for breaking a graph into two parts that have the smallest number of edges between them. We can iterate to divide the graph into as many parts as we like.

Three Matrices That Describe Graphs. 1. Degree matrix: entry (i, i) is the degree of node i; off-diagonal entries are 0. 2. Adjacency matrix: entry (i, j) is 1 if there is an edge between node i and node j, otherwise 0. 3. Laplacian matrix = degree matrix minus adjacency matrix.

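Example: Matrices. For the path graph A - B - C - D, with rows and columns in the order A, B, C, D:
Degree matrix: $$D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
Adjacency matrix: $$A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
Laplacian matrix: $$L = \begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 1 \end{pmatrix}$$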

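Every Laplacian Has Zero as an Eigenvalue. Proof: Each row has a sum of 0, so the Laplacian L multiplying the all-1's vector gives the all-0's vector, which is also 0 times the all-1's vector. Example:
$$\begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = 0 \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$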
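The three matrices and the zero-eigenvalue check can be reproduced with a few lines of plain Python. This is a minimal sketch for the four-node path graph of the example; the helper name `graph_matrices` and the edge-list input format are assumptions made for illustration, not part of the slides.

```python
# Path graph A - B - C - D from the example (node order A, B, C, D).
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]

def graph_matrices(nodes, edges):
    """Return (D, A, L) as lists of rows, with L = D - A."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        A[i][j] = A[j][i] = 1  # undirected edge
    # Degree of node i is the sum of row i of the adjacency matrix.
    D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    return D, A, L

D, A, L = graph_matrices(nodes, edges)

# Multiply L by the all-1's vector; every row sums to 0.
ones = [1] * len(nodes)
L_times_ones = [sum(row[j] * ones[j] for j in range(len(ones))) for row in L]

for row in L:
    print(row)
print("L * 1 =", L_times_ones)
```

Because each diagonal entry of L is the degree and each off-diagonal entry is minus the adjacency entry, every row cancels to 0 for any undirected graph, not only this one.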

The Second-Smallest Eigenvalue. Let L be a Laplacian matrix, so $L = D - A$, where D and A are the degree matrix and adjacency matrix for some graph. The second eigenvector x can be found by minimizing $x^T L x$ subject to the constraints: 1. The length of x is 1. 2. x is orthogonal to the eigenvector associated with the smallest eigenvalue, which for a Laplacian matrix L is the all-1's vector. The minimum of $x^T L x$ is then the second-smallest eigenvalue.

Meaning of Second Eigenvector. Let the i-th component of x be $x_i$. Aside: the constraint that x is orthogonal to the all-1's vector says $\sum_i x_i = 0$. Break up $x^T L x$ as $x^T L x = x^T D x - x^T A x$. Since D is diagonal, with degree $d_i$ as its i-th diagonal entry, Dx is the vector with i-th element $d_i x_i$. Therefore, $x^T D x = \sum_i d_i x_i^2$. The i-th component of Ax is the sum of the $x_j$'s where node j is adjacent to node i. Therefore, $-x^T A x$ is the sum of $-2 x_i x_j$ over all adjacent i and j, with each edge counted once.

Second Eigenvector (2). Now we know $x^T L x = \sum_i d_i x_i^2 - \sum_{i,j\ \text{adjacent}} 2 x_i x_j$. Distribute $d_i x_i^2$ over the $d_i$ edges incident to node i, so that each edge between i and j receives $x_i^2 + x_j^2$. This gives us $x^T L x = \sum_{i,j\ \text{adjacent}} (x_i^2 - 2 x_i x_j + x_j^2) = \sum_{i,j\ \text{adjacent}} (x_i - x_j)^2$. Remember: we're minimizing $x^T L x$. The minimum will tend to make $x_i$ and $x_j$ close when there is an edge between i and j. Also, the constraint that $\sum_i x_i = 0$ means there will be roughly the same number of positive and negative $x_i$'s.
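The identity between the quadratic form and the sum over edges is easy to check numerically. The sketch below uses the path-graph Laplacian from the earlier example and an arbitrary test vector x (chosen here only for illustration, with components summing to 0); the function names are hypothetical.

```python
def quadratic_form(L, x):
    """Compute x^T L x directly from the matrix."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

def edge_sum(edges, x):
    """Compute the sum of (x_i - x_j)^2 over edges, each edge counted once."""
    return sum((x[i] - x[j]) ** 2 for i, j in edges)

# Laplacian of the path graph A - B - C - D (nodes indexed 0..3).
L = [[1, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 1]]
edges = [(0, 1), (1, 2), (2, 3)]

x = [0.5, 0.25, -0.25, -0.5]  # arbitrary vector whose components sum to 0
print(quadratic_form(L, x), edge_sum(edges, x))
```

Both expressions give the same value, and the edge form makes it visible why vectors that assign similar values to adjacent nodes keep the total small.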

Second Eigenvector (3). Put another way: if there is an edge between i and j, then there is a good chance that $x_i$ and $x_j$ will be both positive or both negative. So partition the graph according to the sign of $x_i$. This is likely to minimize the number of edges with one end on each side.

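Example: Second Eigenvector. For the path graph A - B - C - D with Laplacian matrix $$L = \begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 1 \end{pmatrix}$$ the eigenvalues are $0$, $2 - \sqrt{2}$, $2$, $2 + \sqrt{2}$. For the second-smallest eigenvalue $2 - \sqrt{2} \approx .586$:
$$L \begin{pmatrix} 1 \\ \sqrt{2} - 1 \\ 1 - \sqrt{2} \\ -1 \end{pmatrix} = \begin{pmatrix} 2 - \sqrt{2} \\ 3\sqrt{2} - 4 \\ 4 - 3\sqrt{2} \\ \sqrt{2} - 2 \end{pmatrix} \approx \begin{pmatrix} .586 \\ .242 \\ -.242 \\ -.586 \end{pmatrix}$$
The second eigenvector puts A and B in the positive group, C and D in the negative group.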
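One way to find the second eigenvector without a linear algebra library is power iteration on $cI - L$ while repeatedly projecting out the all-1's vector. The slides do not prescribe a method, so this is only a sketch under stated assumptions: the graph is connected, c is taken as twice the maximum degree (an upper bound on the eigenvalues of L), and the start vector is arbitrary but not orthogonal to the answer. In practice a library eigen-solver would normally be used instead.

```python
import math

def fiedler_vector(L, iterations=500):
    """Approximate the second-smallest eigenvalue of L and its eigenvector.

    Power iteration on (c*I - L), keeping x orthogonal to the all-1's vector.
    """
    n = len(L)
    c = 2 * max(L[i][i] for i in range(n))  # eigenvalues of L lie in [0, c]
    x = [float(i) for i in range(n)]        # arbitrary start vector (assumption)
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]         # project out the all-1's direction
        y = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]         # keep the length of x equal to 1
    lam = sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))
    return lam, x

names = ["A", "B", "C", "D"]
L = [[1, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 1]]

lam, x = fiedler_vector(L)
# Partition by sign; which side is "positive" depends on the arbitrary sign of x.
groups = ([names[i] for i in range(4) if x[i] > 0],
          [names[i] for i in range(4) if x[i] <= 0])
print(lam, groups)
```

An eigenvector is only determined up to sign, so the code may label {C, D} as the positive group; the partition into {A, B} and {C, D} is the same either way.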

Analysis of Large Graphs: Trawling

Trawling [Kumar et al. 99]. Searching for small communities in the Web graph. What is the signature of a community / discussion in a Web graph? Intuition: many people all talking about the same things, which appears as a dense 2-layer graph (remember HITS!). Use this to define "topics": what the same people on the left talk about on the right.

Searching for Small Communities. A more well-defined problem: enumerate complete bipartite subgraphs $K_{s,t}$, where $K_{s,t}$ has s nodes on the left, each of which links to the same t other nodes on the right. Example: $K_{3,4}$ with left side X, $|X| = s = 3$, and right side Y, $|Y| = t = 4$, fully connected.

[Agrawal-Srikant 99] Frequent Itemset Enumeration. Market basket analysis. Setting: Market: universe U of n items. Baskets: m subsets of U: $S_1, S_2, \ldots, S_m \subseteq U$ ($S_i$ is a set of items one person bought). Support: frequency threshold f. Goal: find all subsets T such that $T \subseteq S_i$ for at least f sets $S_i$ (items in T were bought together at least f times). What's the connection between the itemsets and complete bipartite graphs?

[Kumar et al. 99] From Itemsets to Bipartite $K_{s,t}$. Frequent itemsets = complete bipartite graphs! How? View each node i as a set $S_i$ of the nodes that i points to. For example, if node i points to a, b, c, and d, then $S_i = \{a, b, c, d\}$. $K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$. Looking for $K_{s,t}$: set the frequency threshold to s and look at layer t, that is, all frequent sets of size t. Here s is the minimum support ($|X| = s$) and t is the itemset size ($|Y| = t$).

From Itemsets to Bipartite $K_{s,t}$ [Kumar et al. 99]. View each node i as a set $S_i$ of the nodes that i points to, e.g., $S_i = \{a, b, c, d\}$. Find frequent itemsets, with s = minimum support and t = itemset size. $K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$. Say we find a frequent itemset $Y = \{a, b, c\}$ of support s. Then there are s nodes that link to all of {a, b, c}; for s = 3, these are nodes x, y, and z, giving $X = \{x, y, z\}$ and $Y = \{a, b, c\}$. We found $K_{s,t}$!

Example (1). Itemsets: a = {b,c,d}, b = {d}, c = {b,d,e,f}, d = {e,f}, e = {b,d}, f = {}. Support threshold s = 2. {b,d}: support 3. {e,f}: support 2. And we just found 2 bipartite subgraphs: left side {a, c, e} with right side {b, d}, and left side {c, d} with right side {e, f}.
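The example can be reproduced by counting, for every size-t subset of each node's out-link set, how many nodes contain it. The sketch below enumerates all size-t subsets by brute force rather than using a level-wise frequent-itemset algorithm such as A-Priori, so it is only suitable for small inputs; the function name `bicliques` is an assumption made for illustration.

```python
from itertools import combinations

def bicliques(out_links, s, t):
    """Map each size-t itemset Y with support >= s to the nodes X that contain it.

    Each returned pair (X, Y) is a complete bipartite subgraph with
    |X| >= s nodes on the left and |Y| = t nodes on the right.
    """
    support = {}
    for node, targets in out_links.items():
        for itemset in combinations(sorted(targets), t):
            support.setdefault(itemset, []).append(node)
    return {Y: sorted(X) for Y, X in support.items() if len(X) >= s}

# Node -> set of nodes it points to, as in Example (1).
out_links = {
    "a": {"b", "c", "d"},
    "b": {"d"},
    "c": {"b", "d", "e", "f"},
    "d": {"e", "f"},
    "e": {"b", "d"},
    "f": set(),
}

found = bicliques(out_links, s=2, t=2)
for Y, X in sorted(found.items()):
    print("left", X, "-> right", list(Y))
```

An itemset with support larger than s, such as {b, d} here, yields a biclique with more than s nodes on the left, and any s of them form a $K_{s,t}$.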

Dense Communities Have Big Bi-Cliques. Suppose we have a community with 2n nodes, divided into left and right sides of size n. Suppose the average degree of a node within the community is 2d, so the average node has d edges connecting to the other side. Then a basket (right-side node) with $d_i$ items generates about $\binom{d_i}{t}$ itemsets of size t. The minimum number of itemsets of size t is generated when all $d_i$'s are the same and therefore equal to d. That number is $n \binom{d}{t}$.

Bi-Cliques Exist (2). The total number of itemsets of size t is $\binom{n}{t}$. The average number of baskets per itemset is at least $n \binom{d}{t} / \binom{n}{t}$. Assume n > d >> t, and we can approximate the average by $n (d/n)^t$. At least one itemset of size t must appear in at least the average number of baskets, so there will be an itemset of size t with support s as long as $n (d/n)^t \ge s$. This uses the approximation that x choose y is about $x^y / y!$ when x >> y.

Example: Bi-Cliques Exist. Suppose there is a community of 200 nodes, which we divide into two sides with n = 100 each. Suppose that within the community, half of all possible edges exist, so d = 50. Then, under the approximation, there is a bi-clique with t nodes on the left and s nodes on the right as long as $100 (1/2)^t \ge s$. For instance, (t, s) could be (2, 25), (3, 12), or (4, 6), since $100 (1/2)^3 = 12.5$ and $100 (1/2)^4 = 6.25$.
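The bound in this example is simple arithmetic, and it can be compared against the unapproximated average $n \binom{d}{t} / \binom{n}{t}$. The sketch below evaluates both for n = 100 and d = 50; the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def approx_average_support(n, d, t):
    """Approximate average number of baskets per size-t itemset: n * (d/n)^t."""
    return n * (d / n) ** t

def exact_average_support(n, d, t):
    """Average without the approximation: n * C(d, t) / C(n, t)."""
    return n * comb(d, t) / comb(n, t)

n, d = 100, 50
for t in (2, 3, 4):
    print(t, approx_average_support(n, d, t), exact_average_support(n, d, t))
```

The approximation slightly overestimates the exact average when t is not tiny compared with d, so the (t, s) pairs it suggests are best read as estimates rather than guarantees.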

Example (2). Example of a community from a web graph. [Figure: nodes on the left linking to nodes on the right.] [Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities 1999]

Analysis of Large Graphs: Overlapping Communities

Identifying Communities Can we identify node groups? (communities, modules, clusters) Nodes: Football Teams Edges: Games played

NCAA Football Network NCAA conferences Nodes: Football Teams Edges: Games played

Protein-Protein Interactions Can we identify functional modules? Nodes: Proteins Edges: Physical interactions

Protein-Protein Interactions Functional modules Nodes: Proteins Edges: Physical interactions

Facebook Network Can we identify social communities? Nodes: Facebook Users Edges: Friendships

Facebook Network Social communities High school Summer internship Stanford (Squash) Stanford (Basketball) Nodes: Facebook Users Edges: Friendships

Overlapping Communities Non-overlapping vs. overlapping communities

Non-overlapping Communities. [Figure: a network and its adjacency matrix, with nodes on both axes.]

Communities as Tiles! What is the structure of community overlaps: Edge density in the overlaps is higher! Communities as tiles

Recap so far Communities in a network This is what we want!

Plan of attack. 1) Given a model, we generate the network (generative model for networks). 2) Given a network, find the best model. [Figure: example network on nodes A through H, shown alongside the generative model in both directions.]

Model of networks. Goal: define a model that can generate networks. The model will have a set of parameters that we will later want to estimate (and thereby detect communities). Q: Given a set of nodes, how do communities generate the edges of the network? [Figure: generative model for networks producing a network on nodes A through H.]

Community-Affiliation Graph. Generative model $B(V, C, M, \{p_c\})$ for graphs: nodes V, communities C, memberships M. Each community c has a single probability $p_c$. Later we fit the model to networks to detect communities. [Figure: the model, with communities C (probabilities $p_A$, $p_B$) linked to nodes V by memberships M, and the network it generates.]

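AGM: Generative Process. [Figure: community affiliations, with communities C (probabilities $p_A$, $p_B$) linked to nodes V by memberships M, and the generated network.] $$P(u, v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$$ where $M_u$ is the set of communities that node u belongs to. Think of this as an OR function: if at least 1 community says YES, we create an edge.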
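The generative process can be sketched directly from the edge-probability formula. The membership sets and probabilities below are made-up illustration values, and the function names are hypothetical; pairs that share no community get probability 0 here, exactly as the formula states.

```python
import random
from itertools import combinations

def edge_probability(u, v, memberships, p):
    """P(u, v) = 1 - product over shared communities c of (1 - p_c)."""
    none_says_yes = 1.0
    for c in memberships[u] & memberships[v]:
        none_says_yes *= 1.0 - p[c]
    return 1.0 - none_says_yes

def generate_network(memberships, p, seed=0):
    """Flip one biased coin per node pair and return the resulting edge list."""
    rng = random.Random(seed)
    nodes = sorted(memberships)
    return [(u, v) for u, v in combinations(nodes, 2)
            if rng.random() < edge_probability(u, v, memberships, p)]

# Hypothetical affiliations: node -> set of communities it belongs to.
memberships = {"u": {"A", "B"}, "v": {"A", "B"}, "w": {"B"}, "x": {"A"}}
p = {"A": 0.5, "B": 0.2}  # one probability per community (made-up values)

print(edge_probability("u", "v", memberships, p))  # two shared communities
print(edge_probability("u", "w", memberships, p))  # one shared community
print(edge_probability("w", "x", memberships, p))  # no shared community
print(generate_network(memberships, p))
```

Nodes that share more communities get a higher edge probability, which matches the observation that edge density is higher in community overlaps.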

Recap: AGM networks Model Network

AGM: Flexibility AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested

How do we detect communities with AGM?

Detecting Communities. Detecting communities with AGM: given a network (here, on nodes A through H), find 1) the affiliation graph M, 2) the number of communities C, and 3) the parameters $p_c$.

Maximum Likelihood Estimation

Example: MLE

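MLE for Graphs. Edge probabilities P(u, v) given by the model Θ: $$\begin{pmatrix} 0 & 0.10 & 0.10 & 0.04 \\ 0.10 & 0 & 0.02 & 0.06 \\ 0.10 & 0.02 & 0 & 0.06 \\ 0.04 & 0.06 & 0.06 & 0 \end{pmatrix}$$ Flip biased coins to obtain a graph G: $$\begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$ $$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} (1 - P(u, v))$$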

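Graphs: Likelihood $P(G \mid \Theta)$. Given a graph G(V, E) and Θ, we calculate the likelihood that Θ generated G. Here $\Theta = B(V, C, M, \{p_c\})$ with communities A and B gives the edge probabilities $$\begin{pmatrix} 0 & 0.9 & 0.9 & 0 \\ 0.9 & 0 & 0.9 & 0 \\ 0.9 & 0.9 & 0 & 0.9 \\ 0 & 0 & 0.9 & 0 \end{pmatrix}$$ and the graph G has adjacency matrix $$\begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$ $$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} (1 - P(u, v))$$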
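The likelihood formula translates into a short loop over unordered node pairs. The sketch below evaluates it for the 4-node probability matrix and graph shown on this slide; the function name `likelihood` is an assumption. For graphs of realistic size one would sum logarithms instead of multiplying probabilities, to avoid numerical underflow.

```python
def likelihood(P, G):
    """P(G | Theta): product of P(u,v) over edges and 1 - P(u,v) over non-edges.

    P and G are symmetric matrices (lists of rows); each unordered pair
    u < v is counted once.
    """
    n = len(G)
    value = 1.0
    for u in range(n):
        for v in range(u + 1, n):
            value *= P[u][v] if G[u][v] else 1.0 - P[u][v]
    return value

# Edge probabilities and observed graph from the slide.
P = [[0, 0.9, 0.9, 0],
     [0.9, 0, 0.9, 0],
     [0.9, 0.9, 0, 0.9],
     [0, 0, 0.9, 0]]
G = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

print(likelihood(P, G))
```

Here the four edges each contribute 0.9 and the two non-edges each contribute 1 - 0 = 1, so the likelihood is $0.9^4$.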

MLE for Graphs. Find the AGM parameters Θ that maximize the likelihood of the observed graph G: $\arg\max_{\Theta} P(G \mid \Theta)$.

MLE for AGM

From AGM to BigCLAM. [Figure: nodes u and v; a matrix with nodes on one axis and communities on the other.]

From AGM to BigCLAM 0 1.2 0 0.2 0.5 0 0 0.8 0 1.8 1 0 Node community membership strengths

BigCLAM: How to find F

BigCLAM: V1.0

BigCLAM: V2.0

BigCLAM: Scalability. BigCLAM takes 5 minutes for 300k-node networks; other methods take 10 days. It can process networks with 100M edges!

CPM: Clique-based community detection

Complete Mutuality: Cliques. Clique: a complete subgraph, that is, one in which all nodes are adjacent to each other. For example, nodes 5, 6, 7 and 8 form a clique. It is NP-hard to find the maximum clique in a network, and a straightforward implementation to find cliques is very expensive in time complexity.

Finding the Maximum Clique. In a clique of size k, each node has degree >= k-1, so nodes with degree < k-1 will not be included in the maximum clique. Recursively apply the following pruning procedure: sample a sub-network from the given network, and find a clique in the sub-network, say, by a greedy approach. Suppose the clique found has size k; in order to find a larger clique, all nodes with degree <= k-1 should be removed. Repeat until the network is small enough. Many nodes will be pruned, as social media networks follow a power law distribution for node degrees.

Maximum Clique Example. Suppose we sample a sub-network with nodes {1-9} and find a clique {1, 2, 3} of size 3. In order to find a clique of size > 3, remove all nodes with degree <= 3-1 = 2: first remove nodes 2 and 9, then remove nodes 1 and 3, then remove node 4.
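The pruning step of this example can be sketched as a loop that repeatedly removes low-degree nodes. The figure with the actual network is not reproduced in the text, so the edge list below is a hypothetical one, chosen to be consistent with the removal sequence above and with the size-3 cliques listed in the CPM example; the sampling and greedy clique search are not shown.

```python
def prune_for_larger_clique(adj, k):
    """Repeatedly remove nodes with degree <= k-1.

    Returns (rounds, remaining), where rounds lists the nodes removed in
    each pass and remaining is the pruned adjacency structure.
    """
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    rounds = []
    while True:
        low = sorted(u for u, vs in adj.items() if len(vs) <= k - 1)
        if not low:
            return rounds, adj
        rounds.append(low)
        for u in low:
            for v in adj.pop(u):
                if v in adj:          # neighbor may already be removed
                    adj[v].discard(u)

# Hypothetical edge list for the 9-node example (an assumption, see above).
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (6, 7), (5, 8), (6, 8), (7, 8), (7, 9)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

rounds, remaining = prune_for_larger_clique(adj, k=3)
print(rounds, sorted(remaining))
```

On this edge list the surviving nodes are 5, 6, 7 and 8, which form a clique of size 4; in general the surviving nodes are only candidates and still have to be searched.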

Clique Percolation Method (CPM). A clique is a very strict definition and is unstable, so cliques are normally used as a core or a seed to find larger communities. CPM is such a method for finding overlapping communities. Input: a parameter k and a network. Procedure: find all cliques of size k in the given network. Construct a clique graph, in which two cliques are adjacent if they share k-1 nodes. Each connected component in the clique graph forms a community.

CPM Example. Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}. Communities: {1, 2, 3, 4} and {4, 5, 6, 7, 8}.
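The CPM procedure can be sketched with brute-force clique enumeration followed by a connected-components search on the clique graph. The edge list below is an assumption: it contains exactly the edges implied by the seven listed cliques, since the original figure is not available in the text. Enumerating all k-subsets is exponential in general, so this is for illustration on small graphs only.

```python
from itertools import combinations

def clique_percolation(adj, k):
    """Return (k-cliques, communities) following the CPM procedure."""
    nodes = sorted(adj)
    # Step 1: all size-k node sets in which every pair is adjacent.
    cliques = [set(c) for c in combinations(nodes, k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    # Steps 2-3: cliques sharing k-1 nodes are adjacent; collect components.
    seen, communities = set(), []
    for start in range(len(cliques)):
        if start in seen:
            continue
        seen.add(start)
        stack, community = [start], set()
        while stack:
            i = stack.pop()
            community |= cliques[i]
            for j in range(len(cliques)):
                if j not in seen and len(cliques[i] & cliques[j]) == k - 1:
                    seen.add(j)
                    stack.append(j)
        communities.append(sorted(community))
    return cliques, communities

# Edges implied by the listed size-3 cliques (an assumption about the figure).
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (6, 7), (5, 8), (6, 8), (7, 8)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cliques, communities = clique_percolation(adj, k=3)
print(len(cliques), communities)
```

Node 4 ends up in both communities because {1, 3, 4} and {4, 5, 6} share only one node, so they are not adjacent in the clique graph; this is how CPM produces overlapping communities.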

Counting Triangles Bounds on Numbers of Triangles Heavy Hitters An Optimal Algorithm

Counting Triangles Why Care? 1. Density of triangles measures maturity of a community. As communities age, their members tend to connect. 2. The algorithm is actually an example of a recent and powerful theory of optimal join computation.

Data Structures Needed We need to represent a graph by data structures that let us do two things efficiently: 1. Given nodes u and v, determine whether there exists an edge between them in O(1) time. 2. Find the edges out of a node in time proportional to the number of those edges. Question for thought: What data structures would you recommend?

First Observations. Let the graph have N nodes and M edges, with $N < M < N^2$. One approach: consider all N-choose-3 sets of nodes, and see if there are edges connecting all 3. This is an $O(N^3)$ algorithm. Another approach: consider all edges e and all nodes u, and see if both ends of e have edges to u. This is an $O(MN)$ algorithm. Since $M < N^2$, it is never worse than the first approach.

Heavy Hitters. To find a better algorithm, we need to use the concept of a heavy hitter: a node with degree at least $\sqrt{M}$. Note: there can be no more than $2\sqrt{M}$ heavy hitters, or the sum of the degrees of all nodes would exceed 2M. That is impossible, because each edge contributes exactly 2 to the sum of degrees. A heavy-hitter triangle is one whose three nodes are all heavy hitters.

Finding Heavy-Hitter Triangles. First, find the heavy hitters: determine the degrees of all nodes. This takes time O(M), assuming you can find the incident edges for a node in time proportional to the number of such edges. Then consider all triples of heavy hitters and see if there are edges between each pair of the three. This takes time $O(M^{1.5})$, since there is a limit of $2\sqrt{M}$ on the number of heavy hitters.

Finding Other Triangles. In these triangles, at least one node is not a heavy hitter. Consider each edge e. If both ends are heavy hitters, ignore it. Otherwise, let end node u not be a heavy hitter. For each of the fewer than $\sqrt{M}$ nodes v connected to u, see whether v is connected to the other end of e. This takes time $O(M^{1.5})$: M edges, and at most $\sqrt{M}$ work with each.
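The two phases can be sketched as follows. Adjacency sets (a dictionary from node to a set of neighbors) are one possible choice of data structure that gives expected O(1) edge lookups and neighbor iteration in time proportional to the degree. The slides describe how to find the triangles; the ordering by (degree, node id) used below to count each triangle exactly once is an added assumption, as are the function names and the example edge list.

```python
import math
from itertools import combinations

def count_triangles(edges):
    """Return (heavy-hitter triangles, other triangles).

    Assumes each undirected edge appears once, with no self-loops.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    threshold = math.sqrt(len(edges))
    heavy = [u for u in adj if len(adj[u]) >= threshold]
    heavy_set = set(heavy)

    # Phase 1: test every triple of heavy hitters.
    heavy_count = sum(1 for a, b, c in combinations(heavy, 3)
                      if b in adj[a] and c in adj[a] and c in adj[b])

    # Phase 2: triangles with at least one non-heavy node. Order nodes by
    # (degree, id); count a triangle only from the edge joining its lowest
    # and middle nodes, so that each triangle is counted once.
    rank = {u: (len(adj[u]), u) for u in adj}
    other_count = 0
    for a, b in edges:
        u, w = (a, b) if rank[a] < rank[b] else (b, a)
        if u in heavy_set:
            continue  # lower end is heavy, so both ends are heavy: ignore
        for v in adj[u]:  # fewer than sqrt(M) neighbors to scan
            if rank[v] > rank[w] and v in adj[w]:
                other_count += 1
    return heavy_count, other_count

# Hypothetical 9-node example graph (14 edges).
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (6, 7), (5, 8), (6, 8), (7, 8), (7, 9)]
print(count_triangles(edges))
```

With 14 edges the threshold is about 3.74, so the heavy hitters are the four nodes of degree 4, and the two phases together account for all seven triangles of this graph.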

Optimality of This Algorithm. Both parts take $O(M^{1.5})$ time and together find any triangle in the graph. For any N and M, you can find a graph with N nodes, M edges, and $\Omega(M^{1.5})$ triangles, so no algorithm can do significantly better. Hint: consider a complete graph with $\sqrt{M}$ nodes, plus other isolated nodes. Note that $M^{1.5}$ can never be greater than the running times of the two obvious algorithms with which we began: $N^3$ and $MN$.

SimRank

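Basic Graph Model. Directed graph G = (V, E). V = set of objects. E = set of unweighted edges. Edge (u, v) exists if there is a relation u → v. I(v) = set of in-neighbors of vertex v. O(v) = set of out-neighbors of vertex v. Example adjacency matrix, with rows and columns in the order U, Pa, Pb, Sa, Sb and entry (u, v) equal to 1 if u → v: $$\begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix}$$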

SimRank Similarity. Recursive model: two objects are similar if they are referenced by similar objects. That is, a ~ b if c → a and d → b, and c ~ d. An object is equivalent to itself (score = 1). Example: 1. ProfA ~ ProfB because both are referenced by Univ. 2. StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}.

Basic SimRank Equation. s(a,b) = similarity between a and b = average similarity between in-neighbors of a and in-neighbors of b. s(a,b) is in the range [0, 1]. If a = b, then s(a,b) = 1. If a ≠ b, then $$s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i \in I(a)} \sum_{j \in I(b)} s(i, j)$$ where C is a constant, 0 < C < 1. If $I(a) = \emptyset$ or $I(b) = \emptyset$, then s(a,b) = 0. D. Fogaras and B. Rácz, Scaling link-based similarity search, presented at the Proceedings of the 14th international conference, 2005.
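The recursive definition can be evaluated by simple fixed-point iteration: start with s(a,a) = 1 and all other scores 0, then repeatedly apply the averaging rule. The sketch below does this for a university/professor/student graph; its in-neighbor lists, the choice C = 0.8, and the iteration count are assumptions made for illustration (the slide only requires 0 < C < 1).

```python
def simrank(in_neighbors, C=0.8, iterations=100):
    """Iterate the basic SimRank equation; returns a dict keyed by (a, b)."""
    nodes = list(in_neighbors)
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if a == b:
                    new[a, b] = 1.0
                elif not Ia or not Ib:
                    new[a, b] = 0.0   # no in-neighbors: similarity is 0
                else:
                    total = sum(s[i, j] for i in Ia for j in Ib)
                    new[a, b] = C * total / (len(Ia) * len(Ib))
        s = new
    return s

# Assumed example graph, given as node -> list of in-neighbors.
in_neighbors = {
    "Univ": ["StudentA"],
    "ProfA": ["Univ"],
    "ProfB": ["Univ", "StudentB"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}

s = simrank(in_neighbors)
print(round(s["ProfA", "ProfB"], 3), round(s["StudentA", "StudentB"], 3))
```

This all-pairs iteration stores a score for every pair of nodes, so it needs memory quadratic in the number of nodes, which is the motivation for sampling-based estimates on large graphs.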

Random Surfer-Pair Model. s(u, v) = Pr[W(u) and W(v) meet at the same step], where W(u) is a reverse $\sqrt{c}$-walk that starts at u. Monte Carlo algorithm: sample t pairs of walks $W_i(u)$, $W_i(v)$, i = 1, ..., t. Set X(u, v) = X(u, v) + 1 if $W_i(u)$ and $W_i(v)$ meet at the same step. Use $\hat{s}(u, v) = X(u, v)/t$ as an estimate of s(u, v). If $t > \frac{2}{\varepsilon^2} \log \frac{1}{\delta}$, then $\Pr\{|\hat{s} - s| < \varepsilon\} > 1 - \delta$. B. Tian and X. Xiao, SLING: A Near-Optimal Index Structure for SimRank, presented at the SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, New York, New York, USA, 2016, pp. 1859-1874.
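The Monte Carlo estimate can be sketched by simulating pairs of reverse walks, where at each step a walk continues to a uniformly random in-neighbor with probability $\sqrt{c}$ and otherwise stops. The graph, the value c = 0.8, the sample count, the fixed seed, and the cap on walk length (which introduces a negligible truncation) are all assumptions chosen for illustration; this is not the indexing scheme of the cited paper.

```python
import math
import random

def monte_carlo_simrank(in_neighbors, u, v, c=0.8, t=20000, max_steps=50, seed=0):
    """Estimate s(u, v) as the fraction of walk pairs that meet at the same step."""
    if u == v:
        return 1.0
    rng = random.Random(seed)
    keep_going = math.sqrt(c)
    meetings = 0
    for _ in range(t):
        a, b = u, v
        for _ in range(max_steps):
            # Each walk independently continues with probability sqrt(c).
            if rng.random() >= keep_going or rng.random() >= keep_going:
                break
            if not in_neighbors[a] or not in_neighbors[b]:
                break  # a walk with no in-neighbor to move to must stop
            a = rng.choice(in_neighbors[a])
            b = rng.choice(in_neighbors[b])
            if a == b:
                meetings += 1
                break
    return meetings / t

# Assumed example graph, given as node -> list of in-neighbors.
in_neighbors = {
    "Univ": ["StudentA"],
    "ProfA": ["Univ"],
    "ProfB": ["Univ", "StudentB"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}

estimate = monte_carlo_simrank(in_neighbors, "ProfA", "ProfB")
print(estimate)
```

Both walks survive a given step with probability c, so the chance that they meet reproduces the decay factor C of the basic SimRank equation, and the estimate should land near the value obtained by iterating that equation with C = 0.8.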

Homework 10.5.2 10.3.2