Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach

Authors: Jaewon Yang, Jure Leskovec (Stanford University)
Venue: WSDM 2013
Presenter: Yupeng Gu

Background
[Figure: Community]

[Figure: Community overlap]

Communities are everywhere in networks, especially in large social networks.

Nodes can belong to multiple communities simultaneously, which leads to overlapping community structure.

In traditional methods, it is assumed that overlaps between communities are sparsely connected.

Cluster Affiliation Model
The social/information network is assumed to be undirected and unweighted.
The authors represent node community memberships with a bipartite affiliation network.
[Figure: bipartite affiliation network linking communities to nodes]

[Figure: affiliation network in which the edge between node u and community a carries the weight $F_{ua}$]

Notations

Notation                          Meaning
G(V, E)                           Network
N                                 Total number of nodes, |V| = N
B(V, C, M)                        Bipartite affiliation network
C                                 Set of communities
M                                 Node community affiliations
K                                 Total number of communities, |C| = K
$F \in \mathbb{R}^{N \times K}$   Affiliation factor matrix

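Cluster Affiliation Model for Big Networks (BIGCLAM)
The process of generating network G(V, E) given a bipartite community affiliation B(V, C, M):
Each node u is attached to community c with a nonnegative weight $F_{uc}$.
[Figure: affiliation network B(V, C, M) with nodes u and v attached to community C by weights $F_{uc}$ and $F_{vc}$]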

Each community c connects its members u and v with probability $1 - e^{-F_{uc} F_{vc}}$.
A node u with a larger $F_{uc}$ is more likely to generate links inside c.
$F_{uc} = 0$ does not affect the link generation probabilities.

Combining all communities, nodes u and v are linked with probability $p(u, v) = 1 - \exp(-F_u^T F_v)$, where $F_u = F_{u\cdot}$ is the weight vector for node u (row u of F).
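The closed form for $p(u, v)$ can be checked numerically. The sketch below assumes, as the generative story suggests, that communities create edges independently, so an edge is missing only if every community fails to create it. The function names and the weight vectors are made up for illustration.

```python
import math

def edge_prob_per_community(f_u, f_v):
    """Probability that at least one community links u and v."""
    p_no_edge = 1.0
    for f_uc, f_vc in zip(f_u, f_v):
        p_c = 1.0 - math.exp(-f_uc * f_vc)  # community c creates the edge
        p_no_edge *= 1.0 - p_c              # community c fails to create it
    return 1.0 - p_no_edge

def edge_prob_closed_form(f_u, f_v):
    """p(u, v) = 1 - exp(-F_u^T F_v)."""
    dot = sum(a * b for a, b in zip(f_u, f_v))
    return 1.0 - math.exp(-dot)

# Hypothetical weights for two nodes over three communities.
f_u = [0.8, 0.0, 1.5]
f_v = [1.2, 2.0, 0.3]
p_product = edge_prob_per_community(f_u, f_v)
p_closed = edge_prob_closed_form(f_u, f_v)
```

Both functions agree up to floating-point rounding, and a zero weight (the second community for `f_u`) contributes nothing to the result.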

Probabilistic Interpretation
In the generation process, we have an undirected weighted network in which each pair of nodes u, v has a latent interaction of non-negative strength $X_{uv}$.
In the observed graph G(V, E), u and v are connected if $X_{uv} > 0$.
Within each community c, u and v generate an interaction of strength $X_{uv}^{(c)}$, drawn from a Poisson distribution with mean $F_{uc} F_{vc}$.
The total amount of interaction is then $X_{uv} = \sum_c X_{uv}^{(c)}$:
$X_{uv}^{(c)} \sim \mathrm{Pois}(F_{uc} F_{vc})$
$X_{uv} \sim \mathrm{Pois}\left(\sum_c F_{uc} F_{vc}\right)$
$\Pr(X_{uv} > 0) = 1 - \exp(-F_u^T F_v)$
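The last probability can be obtained from the Poisson zero probability $\Pr(\mathrm{Pois}(\lambda) = 0) = e^{-\lambda}$. The short derivation below assumes the per-community interactions are independent, which the slides do not state explicitly.

```latex
\begin{align*}
\Pr(X_{uv} = 0)
  &= \prod_c \Pr\left(X_{uv}^{(c)} = 0\right)
   = \prod_c e^{-F_{uc} F_{vc}}
   = \exp\Big(-\sum_c F_{uc} F_{vc}\Big)
   = \exp(-F_u^T F_v) \\
\Pr(X_{uv} > 0)
  &= 1 - \Pr(X_{uv} = 0)
   = 1 - \exp(-F_u^T F_v)
\end{align*}
```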

Which kind of node is likely to have a high degree?
Answer: A node u with a larger $F_{uc}$ is more likely to be connected to other members of c, because $1 - e^{-F_{uc} F_{vc}}$ grows with $F_{uc}$.
If $F_{uc} = 0$, then $p_c(u, v) = 0$ for all $v \in c$.

Which pair of nodes is likely to have a link?
Answer: Pairs of nodes that share multiple community memberships receive multiple chances to create a link.
Through community A alone, $p(u, v) = 1 - e^{-F_{ua} F_{va}}$.
[Figure: nodes u and v attached to community A by weights $F_{ua}$ and $F_{va}$]

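Background Part
What about edges between pairs of nodes u, v that do not share any common community? Their edge probability under the model would be 0.
Instead, $p(u, v) = \varepsilon$, where $\varepsilon$ is the background edge probability (between a random pair of nodes):
$\varepsilon = \frac{2|E|}{|V|(|V| - 1)} \sim 10^{-8}$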
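As a small illustration, the background probability is simply the edge density of the graph. The function name and the toy graph sizes below are assumptions for the example, not values from the slides.

```python
def background_edge_probability(num_nodes, num_edges):
    """epsilon = 2|E| / (|V| (|V| - 1)) for an undirected graph without self-loops."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))

# Toy graph: 4 nodes and 3 edges out of 6 possible node pairs.
eps_toy = background_edge_probability(num_nodes=4, num_edges=3)
```

For large sparse networks the same ratio becomes very small, since the number of node pairs grows quadratically while the number of edges typically does not.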

Community Detection
Given an undirected network G(V, E), detect K communities by finding the most likely affiliation factor matrix $\hat{F}$ for the underlying network G, that is, by maximizing the log-likelihood $l(F) = \log P(G \mid F)$:
$\hat{F} = \arg\max_{F \ge 0} l(F)$
where
$l(F) = \sum_{(u,v) \in E} \log\left(1 - \exp(-F_u^T F_v)\right) - \sum_{(u,v) \notin E} F_u^T F_v$
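A direct, unoptimized evaluation of $l(F)$ might look like the sketch below. It loops over all node pairs, so it is only meant for small graphs. The `floor` parameter is an assumption added here to avoid `log(0)` when an observed edge has a zero inner product; it is not part of the slides.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_likelihood(F, edges, floor=1e-10):
    """l(F) = sum over edges of log(1 - exp(-Fu.Fv)) minus sum over non-edges of Fu.Fv."""
    n = len(F)
    edge_set = {(min(u, v), max(u, v)) for u, v in edges}
    total = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            d = dot(F[u], F[v])
            if (u, v) in edge_set:
                # -expm1(-x) equals 1 - exp(-x) and is accurate for small x.
                total += math.log(-math.expm1(-max(d, floor)))
            else:
                total -= d
    return total

# Toy example: nodes 0 and 1 are linked; node 2 is isolated.
edges = [(0, 1)]
F_good = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # 0 and 1 share a community
F_bad = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # 0 and 2 share a community instead
ll_good = log_likelihood(F_good, edges)
ll_bad = log_likelihood(F_bad, edges)
```

The affiliation matrix that places the two linked nodes in the same community receives the higher log-likelihood.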

A variant of nonnegative matrix factorization (NMF): learn $F \in \mathbb{R}^{N \times K}$ that best approximates the adjacency matrix A of a given network G.
$\hat{F} = \arg\min_{F \ge 0} D(A, f(FF^T))$
where the loss function is $D = -l(F)$ and the link function is $f(\cdot) = 1 - \exp(-\cdot)$.

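Optimization
The objective is $\hat{F} = \arg\max_{F \ge 0} l(F)$, with $l(F)$ the log-likelihood defined under Community Detection.
Update $F_u$ with all other $F_v$ fixed; this subproblem is convex.
After each update, $F_u$ is projected onto the space of non-negative vectors: $F_{uc} = \max(F_{uc}, 0)$.
Each step requires $O(|\mathcal{N}(u)|)$ time, where $\mathcal{N}(u)$ is the set of neighbors of u.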
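One way to realize such an update is a projected gradient-ascent step on a single row. The slides do not specify the update rule, so the fixed step size and the function name below are assumptions. The gradient is obtained by differentiating $l(F)$ with respect to $F_u$: neighbors contribute $F_v \, e^{-F_u^T F_v} / (1 - e^{-F_u^T F_v})$ and non-neighbors contribute $-F_v$. Caching the column sums of F lets the non-neighbor term be computed from the neighbors only, which is what keeps the cost proportional to the number of neighbors.

```python
import math

def update_row(F, neighbors, u, col_sums, step=0.01, floor=1e-10):
    """One projected gradient-ascent step on F[u], all other rows fixed.

    col_sums[c] must equal sum over all nodes v of F[v][c]; it is kept in sync here.
    """
    K = len(F[u])
    grad = [0.0] * K
    nbr_sum = [0.0] * K
    for v in neighbors[u]:
        d = max(sum(a * b for a, b in zip(F[u], F[v])), floor)
        w = math.exp(-d) / (-math.expm1(-d))  # exp(-d) / (1 - exp(-d))
        for c in range(K):
            grad[c] += F[v][c] * w
            nbr_sum[c] += F[v][c]
    for c in range(K):
        # Sum of F[v][c] over non-neighbors v != u, via the cached column sums.
        grad[c] -= col_sums[c] - F[u][c] - nbr_sum[c]
    # Gradient step followed by projection onto the non-negative orthant.
    new_row = [max(F[u][c] + step * grad[c], 0.0) for c in range(K)]
    for c in range(K):
        col_sums[c] += new_row[c] - F[u][c]
    F[u] = new_row
    return new_row

# Toy example: three nodes, one community, a single edge between nodes 0 and 1.
F = [[1.0], [1.0], [1.0]]
neighbors = {0: [1], 1: [0], 2: []}
col_sums = [3.0]
row0 = update_row(F, neighbors, 0, col_sums)
```

In practice the step size would be tuned or chosen by a line search; the fixed value here is only for illustration.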

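Community Affiliations
Does u belong to community c or not?
Criterion: ignore the membership of u in c if $F_{uc}$ is below some threshold $\delta$.
$\delta$ is chosen so that two members of a community are linked with probability at least the background probability $\varepsilon$:
$\varepsilon \le 1 - \exp(-\delta^2) \;\Rightarrow\; \delta = \sqrt{-\log(1 - \varepsilon)} \approx 10^{-5} \sim 10^{-4}$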
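The threshold rule can be sketched as follows. The function names and the example weights are hypothetical; only the formula for $\delta$ comes from the slide.

```python
import math

def membership_threshold(eps):
    """delta = sqrt(-log(1 - eps)); log1p keeps precision for tiny eps."""
    return math.sqrt(-math.log1p(-eps))

def assign_communities(F, delta):
    """Return, for each community c, the nodes u with F[u][c] >= delta."""
    num_communities = len(F[0]) if F else 0
    return [[u for u, row in enumerate(F) if row[c] >= delta]
            for c in range(num_communities)]

delta = membership_threshold(1e-8)
F = [[0.9, 0.0], [0.7, 2e-5], [0.0, 1.3]]  # made-up weights
communities = assign_communities(F, delta)
```

With $\varepsilon = 10^{-8}$ the threshold is about $10^{-4}$, so the weight `2e-5` of node 1 in the second community is ignored.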

Number of communities
Reserve 20% of node pairs as a hold-out set.
The K with the maximum hold-out likelihood is chosen as the number of communities.
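The selection step can be sketched as below, under several assumptions: node pairs are enumerated exhaustively (feasible only for small graphs), the split is a uniform random shuffle, and a fitted matrix F is already available for each candidate K. The fitting procedure itself is not shown, and all function names are hypothetical.

```python
import math
import random

def split_pairs(num_nodes, holdout_frac=0.2, seed=0):
    """Shuffle all unordered node pairs and return (training pairs, held-out pairs)."""
    pairs = [(u, v) for u in range(num_nodes) for v in range(u + 1, num_nodes)]
    random.Random(seed).shuffle(pairs)
    cut = round(holdout_frac * len(pairs))
    return pairs[cut:], pairs[:cut]

def pair_log_likelihood(F, edge_set, pairs, floor=1e-10):
    """Log-likelihood restricted to the given pairs; edge_set holds (u, v) with u < v."""
    total = 0.0
    for u, v in pairs:
        d = sum(a * b for a, b in zip(F[u], F[v]))
        if (u, v) in edge_set:
            total += math.log(-math.expm1(-max(d, floor)))  # log(1 - exp(-d))
        else:
            total -= d
    return total

def choose_num_communities(fitted, edge_set, heldout_pairs):
    """fitted maps each candidate K to an F trained on the training pairs only."""
    return max(fitted,
               key=lambda k: pair_log_likelihood(fitted[k], edge_set, heldout_pairs))

train_pairs, heldout_pairs = split_pairs(num_nodes=5)
```

A caller would fit one F per candidate K on `train_pairs` and pass the resulting dictionary to `choose_num_communities`.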

Experimental results

Data Description
The authors collected six large social and information networks in which nodes explicitly state their community memberships. (Having ground-truth communities also makes it possible to evaluate performance quantitatively.)
LiveJournal, Friendster, Orkut, YouTube: nodes are users, edges are friendships; groups are formed around specific interests, hobbies, etc.
DBLP: nodes are authors, edges are co-authorships; communities are publication venues.
Amazon: nodes are products, edges connect commonly co-purchased products; each node belongs to one or more hierarchically organized product categories.

Some statistics
[Table: per-dataset statistics, with the following columns]
N: number of nodes
E: number of edges
C: number of communities
S: average community size
A: community memberships per node

Data Overview
Ground-truth communities heavily overlap: on average, 95% of all communities overlap with at least one other community.
How the edge probability changes as the number of shared communities k increases: the edge probability increases by a factor of $10^4$ (from $10^{-5}$ to $10^{-1}$) when the pair shares two communities.

Evaluation Measures
[Figure: runtime comparison]

F1 score
[Equation: definition of the F1 score]

Omega Index
[Equation: definition of the Omega Index]
$C_{uv}$ is the set of ground-truth communities that u and v share.

NMI (normalized mutual information)
Accuracy in the number of communities
[Equation: definition of the accuracy in the number of communities]

Experimental results
Composite performance on the six datasets.
Scores of methods are scaled so that the best-performing method achieves a score of 1.


Conclusion
A novel large-scale community detection method.
A set of networks with explicit ground-truth labels for nodes.
Overlaps between communities are more densely connected, contrary to the traditional assumption of sparse overlaps.