An ADMM Algorithm for Clustering Partially Observed Networks
1 An ADMM Algorithm for Clustering Partially Observed Networks
Necdet Serhat Aybat, Industrial Engineering, Penn State University
2015 SIAM International Conference on Data Mining, Vancouver, Canada
2 Problem setting
Goal: detect "communities" from a partially observed "relation" graph.
Details:
- N = {1, ..., n}: nodes of an undirected graph G = (N, E); (i, j) = (j, i) ∈ E if nodes i and j are "related".
- N̄_l: nodes in community l; {N̄_l}_{l=1}^r is a partition of N, i.e., ∪_{l=1}^r N̄_l = N and N̄_{l1} ∩ N̄_{l2} = ∅ if l1 ≠ l2.
- p, q ∈ [0, 1] s.t. 0 ≤ q < p ≤ 1, and P((i, j) ∈ E) = p if ∃ l s.t. i, j ∈ N̄_l, and q otherwise.
- Each status "(i, j) ∈ E" or "(i, j) ∉ E" is observable with probability p_0.
3-4 A popular quality function: Modularity
Measures the density of edges within clusters relative to a random graph with the same node degrees.
For {N̄_l}_{l=1}^r a partition of the nodes in G = (N, E),
modularity({N̄_l}_{l=1}^r) := (1 / 2|E|) Σ_{i,j=1}^n ( D_ij − d(i) d(j) / 2|E| ) δ(i, j; {N̄_l}_{l=1}^r),
where δ(i, j; ·) = 1 if i and j belong to the same cluster and 0 otherwise.
Maximizing modularity is NP-hard. The Louvain method is a popular greedy algorithm outperforming many others in both time and modularity.
Shortcoming: the resolution limit.
- Implicit assumption: there is a chance that any two nodes can be connected in the random model (not realistic for large networks).
- Cliques connected by a single edge would be merged as |E| → ∞.
- Leads to undetectable small communities.
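As a quick check of the definition above, here is a minimal NumPy sketch (the function name and the toy graph are our own) that evaluates modularity for a labeled partition:

```python
import numpy as np

def modularity(A, labels):
    """Modularity of a partition of an undirected graph.

    A: symmetric 0/1 adjacency matrix (no self-loops assumed);
    labels: labels[i] is the community of node i.
    Implements (1 / 2|E|) * sum_ij (A_ij - d(i) d(j) / 2|E|) * delta(i, j).
    """
    d = A.sum(axis=1)           # node degrees
    two_m = d.sum()             # 2|E|
    same = (labels[:, None] == labels[None, :])
    return ((A - np.outer(d, d) / two_m) * same).sum() / two_m

# Two triangles joined by one edge: the natural split scores 5/14.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(modularity(A, labels), 4))   # 0.3571
```

Putting all six nodes in one community gives modularity exactly 0, as the definition requires.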
5 Preliminary: Robust PCA
D ∈ R^{n×n} with D = L̄ + S̄ such that rank(L̄) = r ≪ n and ||S̄||_0 = s ≪ n².
- L̄: low-rank and incoherent (not sparse).
- S̄: sparse, nonzero locations uniformly at random (not low-rank).
- Ω: observable locations, uniformly at random, s.t. |Ω| = p_0 n².
Candès et al. '09 proposed solving
(L*_ρ, S*_ρ) ∈ argmin { ||L||_* + ρ ||S||_1 : π_Ω(L + S) = π_Ω(D) }.
For ρ = 1/√(p_0 n), ∃ c > 0 such that (L*_ρ, S*_ρ) = (L̄, S̄) (exact recovery) w.p. at least 1 − c n^{−10}.
Notation: ||L||_* := Σ_{i=1}^{rank(L)} σ_i(L), ||S||_1 := Σ_{i,j} |S_ij|, and π_Ω : R^{n×n} → R^{n×n} s.t. (π_Ω(D))_ij = D_ij if (i, j) ∈ Ω, and 0 otherwise.
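The projection π_Ω used throughout keeps observed entries and zeroes the rest; a minimal sketch (our own names) is:

```python
import numpy as np

def pi_omega(M, omega):
    """pi_Omega: keep entries of M whose indices are in the observed set
    omega, zero out everything else."""
    out = np.zeros_like(M)
    for i, j in omega:
        out[i, j] = M[i, j]
    return out

M = np.arange(9.0).reshape(3, 3)       # [[0,1,2],[3,4,5],[6,7,8]]
P = pi_omega(M, {(0, 0), (1, 2), (2, 1)})
print(P)
```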
6 Robust PCA for clustering
D ∈ S^n: adjacency matrix for G such that diag(D) = 1:
D_ij = 1 if (i, j) ∈ E or i = j, and 0 otherwise.
S̄_ij := −1 if (i, j) ∉ E and ∃ l s.t. i, j ∈ N̄_l; 1 if (i, j) ∈ E and ∄ l s.t. i, j ∈ N̄_l; 0 otherwise;
and L̄ := D − S̄.
7 Robust PCA for clustering (example)
N̄_1 = {1, 2, 3}, N̄_2 = {4, 5}, E = {(1, 3), (2, 3), (2, 4), (3, 5), (4, 5)}:
D = [1 0 1 0 0; 0 1 1 1 0; 1 1 1 0 1; 0 1 0 1 1; 0 0 1 1 1],
L̄ = [1 1 1 0 0; 1 1 1 0 0; 1 1 1 0 0; 0 0 0 1 1; 0 0 0 1 1],
S̄ = D − L̄ = [0 −1 0 0 0; −1 0 0 1 0; 0 0 0 0 1; 0 1 0 0 0; 0 0 1 0 0],
i.e., D = Low-Rank + Sparse.
8 Robust PCA for clustering (example, continued)
For the same example (N̄_1 = {1, 2, 3}, N̄_2 = {4, 5}, E = {(1, 3), (2, 3), (2, 4), (3, 5), (4, 5)}), D = Low-Rank + Sparse with:
- rank(L̄) = r, the number of clusters (communities);
- the nonzero eigenvalues of L̄ are λ_l = |N̄_l| for l = 1, ..., r, e.g., λ_1 = 3, λ_2 = 2;
- S̄ is sparse with high probability.
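The 5-node example can be verified numerically; the sketch below (variable names ours) builds D and the block-diagonal L̄ and checks the claimed rank and eigenvalues:

```python
import numpy as np

# Communities {1,2,3} and {4,5} (1-based); E = {(1,3),(2,3),(2,4),(3,5),(4,5)}.
n = 5
D = np.eye(n)
for i, j in [(0, 2), (1, 2), (1, 3), (2, 4), (3, 4)]:   # 0-based edges
    D[i, j] = D[j, i] = 1

# Low-rank part: one all-ones diagonal block per community.
L = np.zeros((n, n))
L[:3, :3] = 1
L[3:, 3:] = 1

S = D - L   # sparse part: -1 on missing intra-edges, +1 on inter-edges
print(np.linalg.matrix_rank(L))   # 2 = number of communities
```

The two nonzero eigenvalues of L come out as the cluster sizes 3 and 2, matching the slide.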
9 Related work
Chen et al. '14 showed statistical guarantees and proposed RPCA-B.
Thm: Let γ = max{1 − p, q}, ρ = 1/(32 √(p_0 n)), and K_min := min_l |N̄_l|. Then ∃ C, c > 0 such that (L*_ρ, S*_ρ) = (L̄, S̄) w.p. 1 − c n^{−10}, provided that n log²(n) ≤ C K_min² p_0 (1 − 2γ)².
Algorithm RPCA-B(ρ):
1: STOP ← false
2: while STOP == false do
3:   (L*_ρ, S*_ρ) ∈ argmin { ||L||_* + ρ ||S||_1 : π_Ω(L + S) = π_Ω(D) }
4:   if Tr(L*_ρ) > n (1 + tol) then
5:     ρ ← ρ/2
6:   else if Tr(L*_ρ) < n (1 − tol) then
7:     ρ ← 2ρ
8:   else
9:     STOP ← true
10:  end if
11: end while
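The outer bisection loop of RPCA-B can be sketched as below; `solve_rpca` stands in for the convex RPCA solver and is a hypothetical stub here, used only to exercise the trace-based update of ρ:

```python
import numpy as np

def rpca_b(solve_rpca, n, rho0, tol=0.05):
    """RPCA-B outer loop: halve rho when the trace of the low-rank
    component overshoots n, double it when it undershoots."""
    rho = rho0
    while True:
        L, S = solve_rpca(rho)
        tr = np.trace(L)
        if tr > n * (1 + tol):
            rho /= 2
        elif tr < n * (1 - tol):
            rho *= 2
        else:
            return rho, L, S

# Hypothetical stub in place of the convex solver: trace jumps with rho.
def stub(rho, n=10):
    tr = 2 * n if rho > 1 else (n if rho >= 0.1 else 0)
    return np.diag([tr / n] * n), np.zeros((n, n))

rho, L, S = rpca_b(stub, n=10, rho0=4.0)
print(rho)   # 1.0
```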
10 RPCA-C model by Aybat et al.
(RPCA-C): Robust PCA for Clustering (a tighter formulation):
(L*, S*) ∈ argmin_{L, S ∈ S^n} Tr(L) + ρ ||S||_1
s.t. π_Ω(L + S) = π_Ω(D), diag(S) = 0, |S_ij| ≤ 1 ∀ i ≠ j, L ⪰ 0_n, L ≥ 0_n.
Let χ ⊂ S^n × S^n denote the feasible region. Clearly, (L̄, S̄) ∈ χ and χ ⊂ {(L, S) : π_Ω(L + S) = π_Ω(D)}, so the same results as in Chen et al. '14 hold: (L*, S*) = (L̄, S̄) w.p. at least 1 − c n^{−10}.
11 ADMIP-C: ADMM for RPCA-C
Equivalent problem using partial variable splitting:
min_{L, X, S ∈ S^n} Tr(L) + ρ ||S||_1 s.t. X = L, L ⪰ 0_n, (X, S) ∈ φ,
for ρ > 0 and φ defined as
φ := {(X, S) ∈ S^n × S^n : π_Ω(X + S) = π_Ω(D), X ≥ 0_n, diag(S) = 0, |S_ij| ≤ 1 ∀ i ≠ j}.
Objective: develop an ADMM with increasing penalty.
12-14 ADMIP-C: ADMM for RPCA-C
Augmented Lagrangian:
L_µ(L, X, S; Y) := Tr(L) + ρ ||S||_1 + ⟨Y, X − L⟩ + (µ/2) ||X − L||²_F.
Method of Multipliers iterate sequence (hard to compute):
(L^{k+1}, X^{k+1}, S^{k+1}) ∈ argmin { L_µ(L, X, S; Y^k) : L ⪰ 0_n, (X, S) ∈ φ },
Y^{k+1} = Y^k + µ (X^{k+1} − L^{k+1}), where µ > 0.
ADMIP-C iterate sequence (easy to compute):
(X^{k+1}, S^{k+1}) = argmin { L_{µ_k}(L^k, X, S; Y^k) : (X, S) ∈ φ },
L^{k+1} = argmin { L_{µ_k}(L, X^{k+1}, S^{k+1}; Y^k) : L ⪰ 0_n },
Y^{k+1} = Y^k + µ_k (X^{k+1} − L^{k+1}).
Note: ADMM is sensitive to µ if µ_k = µ for all k. Choose {µ_k} such that µ_{k+1} ≥ µ_k and sup_k µ_k < ∞; then there is no need to search for an optimal µ.
15 ADMIP-C: L-subproblem
Augmented Lagrangian: L_µ(L, X, S; Y) := Tr(L) + ρ ||S||_1 + ⟨Y, X − L⟩ + (µ/2) ||X − L||²_F.
L^{k+1} = argmin { L_{µ_k}(L, X^{k+1}, S^{k+1}; Y^k) : L ⪰ 0_n }
        = W diag( max{λ^k − (1/µ_k) 1, 0} ) W^T,
where W diag(λ^k) W^T is the eigenvalue decomposition of Q_X^k := X^k + (1/µ_k) Y^k.
Notes:
- No need to compute eigenvalues < 1/µ_k; use PROPACK to compute a partial SVD of Q_X^k.
- X^k → L̄ and {Y^k} bounded ⇒ {Q_X^k} becomes numerically low-rank eventually.
- Cheap partial SVDs in the transient phase for small µ_k.
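A dense NumPy sketch of this eigenvalue-shrinkage step, assuming the sign convention Q_X = X + Y/µ reconstructed from the augmented Lagrangian above (a real implementation would use a partial decomposition as the slide notes; the function name is ours):

```python
import numpy as np

def l_update(X, Y, mu):
    """L-subproblem: form Q = X + Y/mu, shrink each eigenvalue by 1/mu,
    and keep only the nonnegative part (projection onto the PSD cone)."""
    Q = X + Y / mu
    lam, W = np.linalg.eigh((Q + Q.T) / 2)   # symmetrize for safety
    return W @ np.diag(np.maximum(lam - 1.0 / mu, 0.0)) @ W.T

# Diagonal sanity check: eigenvalues 3, 1, -2 with mu = 1 shrink to 2, 0, 0.
X = np.diag([3.0, 1.0, -2.0])
L = l_update(X, np.zeros((3, 3)), mu=1.0)
print(np.round(np.diag(L), 6))   # [2. 0. 0.]
```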
16 ADMIP-C: (X, S)-subproblem
Augmented Lagrangian: L_µ(L, X, S; Y) := Tr(L) + ρ ||S||_1 + ⟨Y, X − L⟩ + (µ/2) ||X − L||²_F.
(X^{k+1}, S^{k+1}) = argmin { L_{µ_k}(L^k, X, S; Y^k) : (X, S) ∈ φ } has a closed-form solution:
C^k = sgn( π_Ω̄(D − Q_L^k) ) ⊙ max{ |π_Ω̄(D − Q_L^k)| − ρ µ_k^{−1} E_n, 0_n },
S^{k+1} = min{ π_Ω̄(D), max{ −E_n, C^k } },
X^{k+1} = π_Ω̄(D − S^{k+1}) + max{ π_{Ω̄^c}(Q_L^k), 0_n },
where E_n is the n×n matrix of all ones, Ω̄ := Ω ∪ D for Ω ⊂ (N × N) \ D with D the diagonal index set, Q_L^k := L^k − (1/µ_k) Y^k, and
φ := {(X, S) ∈ S^n × S^n : π_Ω(X + S) = π_Ω(D), X ≥ 0_n, diag(S) = 0, |S_ij| ≤ 1 ∀ i ≠ j}.
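On an observed off-diagonal entry the S-update reduces to a scalar soft-threshold followed by clipping; a sketch under the reconstruction above (soft-threshold D_ij − Q_ij by ρ/µ_k, then clip into [−1, D_ij], the upper bound coming from X_ij = D_ij − S_ij ≥ 0; the function name is ours):

```python
import numpy as np

def s_update_entry(d, q, rho, mu):
    """Per observed entry (i,j): soft-threshold d - q by rho/mu,
    then clip the result into [-1, d]."""
    r = d - q
    c = np.sign(r) * max(abs(r) - rho / mu, 0.0)   # soft-thresholding
    return min(d, max(-1.0, c))

# d = 1 (observed edge), q = -0.5, rho/mu = 0.3:
# soft-threshold 1.5 -> 1.2, then clip to [-1, 1] gives 1.0.
print(s_update_entry(1.0, -0.5, rho=0.3, mu=1.0))   # 1.0
```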
17 ADMIP-C: Convergence
Algorithm ADMIP-C(ρ, Y^0, {µ_k}):
1: k ← 0, L^0 ← 0_n
2: while not converged do
3:   (X^{k+1}, S^{k+1}) ∈ argmin { L_{µ_k}(L^k, X, S; Y^k) : (X, S) ∈ φ }
4:   L^{k+1} ∈ argmin { L_{µ_k}(L, X^{k+1}, S^{k+1}; Y^k) : L ⪰ 0_n }
5:   Y^{k+1} ← Y^k + µ_k (X^{k+1} − L^{k+1})
6: end while
Let Z^k = (L^k, X^k, S^k, Y^k) denote the primal-dual ADMIP-C iterates, and Z* the set of primal-dual optimal pairs to RPCA-C:
min_{L, X, S ∈ S^n} Tr(L) + ρ ||S||_1 s.t. X = L, L ⪰ 0_n, (X, S) ∈ φ.
Then min{ ||Z − Z^k||_F : Z ∈ Z* } → 0.
18 G = (N, E): Random Network Generation
G: undirected network with n := |N| nodes and r non-overlapping communities.
α ∈ (0, 1]: parameter controlling the cluster sizes |N̄_l| for l = 1, ..., r:
|N̄_l| := n_l = [ ((1 − α) / (1 − α^r)) α^{l−1} n ] for l = 1, ..., r,
N̄_l = { Σ_{i=1}^{l−1} n_i + 1, ..., Σ_{i=1}^{l} n_i }.
Example: |N| = n = 100 and r = 5 (the slide tabulates n_1, ..., n_5 for several values of α; the numerical entries are not preserved in this transcription).
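The size formula can be sketched directly; reading the bracket [·] as rounding (an assumption) and handling α = 1 as the equal-size limit:

```python
def cluster_sizes(n, r, alpha):
    """Geometric cluster sizes: n_l proportional to alpha**(l-1),
    normalized so the unrounded sizes sum to n; [.] read as rounding."""
    if alpha == 1.0:                       # limit case: equal sizes
        return [round(n / r)] * r
    c = (1 - alpha) / (1 - alpha ** r)
    return [round(c * alpha ** (l - 1) * n) for l in range(1, r + 1)]

print(cluster_sizes(100, 5, 1.0))   # [20, 20, 20, 20, 20]
print(cluster_sizes(100, 5, 0.9))   # [24, 22, 20, 18, 16]
```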
19 G = (N, E): Random Network Generation
After {N̄_l}_{l=1}^r is generated with r = [0.05 |N|]:
Edge generation:
- E^U ⊂ U := {(i, j) ∈ N × N : i < j}, chosen uniformly at random among all subsets with cardinality |E^U| = [0.05 |U|];
- E^L := {(i, j) ∈ N × N : (j, i) ∈ E^U};
- Ē := {(i, j) ∈ N × N : ∃ l s.t. i, j ∈ N̄_l};
- E := Ē Δ (E^U ∪ E^L), the symmetric difference.
Observable indices:
- Ω^U ⊂ U, chosen uniformly at random among all subsets with cardinality |Ω^U| = [p_0 |U|];
- Ω^L := {(i, j) ∈ N × N : (j, i) ∈ Ω^U};
- Ω := Ω^U ∪ Ω^L and Ω̄ := Ω ∪ D.
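The edge-generation step, restricted to ordered pairs i < j, can be sketched as follows (names and the tiny example are ours): start from perfect within-community cliques Ē, then flip a random fraction of all pairs via symmetric difference:

```python
import random

def generate_edges(clusters, flip_frac=0.05, seed=0):
    """E = Ebar symmetric-difference Eflip: within-community cliques
    perturbed by flipping a random fraction of all node pairs."""
    rng = random.Random(seed)
    nodes = sorted(v for c in clusters for v in c)
    U = [(i, j) for i in nodes for j in nodes if i < j]
    Ebar = {(i, j) for c in clusters for i in c for j in c if i < j}
    Eflip = set(rng.sample(U, round(flip_frac * len(U))))
    return Ebar ^ Eflip   # symmetric difference

E = generate_edges([{1, 2, 3}, {4, 5}], flip_frac=0.2)
print(sorted(E))
```

Flipping 20% of the 10 pairs changes exactly two entries relative to the clique graph, whichever pairs the seed selects.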
20 Numerical tests: ADMIP-C vs RPCA-B
ADMIP-C solves the RPCA-C formulation (tighter than RPCA); RPCA-B solves RPCA with bisection on ρ.
Given the low-rank component output L*, define for 1 ≤ l1, l2 ≤ r:
B*_{l1,l2} := [L*_ij]_{i ∈ N̄_{l1}, j ∈ N̄_{l2}}, and B̄_{l1,l2} := E_{n_l} if l1 = l2 = l, and 0_{n_{l1} × n_{l2}} otherwise.
Define R_{l1,l2} := ||B*_{l1,l2} − B̄_{l1,l2}||_F for 1 ≤ l1, l2 ≤ r.
The quality of clustering is measured by 3 statistics:
- s_in ∈ [0, 1]: the average of { R_{l,l} / n_l }_{l=1}^r;
- s_out ∈ [0, 1]: s_out := √( Σ_{l1 ≠ l2} R²_{l1,l2} / Σ_{l1 ≠ l2} n_{l1} n_{l2} );
- s_f ∈ [0, 1]: the fraction of correctly recovered clusters.
21 Numerical tests: ADMIP-C vs RPCA-B
10 random G = (N, E) for each (n, α), and p_0 = 0.9.
Table: mean values of s_in, s_out, and s_f in % for ADMIP-C, RPCA-B, and RPCA over a grid of (n, α), with ρ = 1/√n (numerical entries not preserved in this transcription).
22 Numerical tests: ADMIP-C vs RPCA-B
5 random G = (N, E) for n ∈ {500, 1000}, α = 0.95, and p_0 = 0.9.
Table: recovery and runtime statistics (s_in, s_out, s_f in %, cpu in seconds, number of partial SVDs, and bisection iterations for RPCA-B); numerical entries not preserved in this transcription.
23 Numerical tests: ADMIP-C vs RPCA-B
Summary:
- Increasing n and decreasing α adversely affect all methods.
- Failure is caused by s_out for RPCA-B, and by s_in for RPCA and ADMIP-C.
- RPCA cannot detect small clusters due to high s_in.
- RPCA-B reduces s_in at the cost of s_out: it merges small clusters.
- ADMIP-C correctly identifies at least 92% of the clusters all the time.
24 Numerical tests: ADMIP-C vs Louvain
Measures for comparing the similarity of C = {N̄_i}_{i=1}^r and C' = {N'_j}_{j=1}^{r'}:
n_i := |N̄_i| and n'_j := |N'_j| s.t. n = Σ_{i=1}^r n_i = Σ_{j=1}^{r'} n'_j, and m_ij := |N̄_i ∩ N'_j|.
- Jaccard Index: JI := (# node pairs in the same cluster both in C and C') / (# node pairs in the same cluster either in C or C').
- Normalized Mutual Information: NMI := I(C, C') / √(H(C) H(C')), where
  I(C, C') = Σ_{i=1}^r Σ_{j=1}^{r'} (m_ij / n) log₂( (m_ij / n) / (n_i n'_j / n²) ),
  H(C) = − Σ_{i=1}^r (n_i / n) log₂(n_i / n), and H(C') is defined similarly.
- PERC := proportion of exactly recovered clusters.
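The NMI formulas above can be checked with a short sketch (function and variable names ours), building the contingency counts m_ij from two label arrays:

```python
from math import log2, sqrt

def nmi(labels_a, labels_b):
    """Normalized mutual information I(C,C') / sqrt(H(C) H(C'))
    between two partitions given as per-node label lists."""
    n = len(labels_a)
    ca, cb = sorted(set(labels_a)), sorted(set(labels_b))
    # contingency counts m_ij = |N_i intersect N'_j|
    m = {(i, j): sum(1 for a, b in zip(labels_a, labels_b) if a == i and b == j)
         for i in ca for j in cb}
    na = {i: sum(m[i, j] for j in cb) for i in ca}   # n_i
    nb = {j: sum(m[i, j] for i in ca) for j in cb}   # n'_j
    I = sum(m[i, j] / n * log2((m[i, j] / n) / (na[i] * nb[j] / n ** 2))
            for i in ca for j in cb if m[i, j] > 0)
    H = lambda cnt: -sum(c / n * log2(c / n) for c in cnt.values())
    return I / sqrt(H(na) * H(nb))

# Identical partitions (up to relabeling) give NMI = 1.
print(round(nmi([0, 0, 1, 1], [5, 5, 7, 7]), 6))   # 1.0
```

Independent partitions (e.g., a perfectly crossed one) give NMI = 0.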
25 Numerical tests: ADMIP-C vs Louvain
Experimental setup:
- n ∈ {100, 200, 300, 400, 500} and α ∈ {1, 0.9, 0.8};
- 20 random G = (N, E) generated for each (n, α);
- the output of Louvain depends on the ordering of the nodes, so for each G, 200 node orderings were generated uniformly at random.
26 Numerical tests: ADMIP-C vs Louvain
Table: mean values of JI, NMI, and PERC in % for Louvain and ADMIP-C when p_0 = 0.9 (numerical entries not preserved in this transcription).
- Increasing n and decreasing α adversely affect both methods.
- Louvain: small clusters are swallowed by large ones in both cases.
- ADMIP-C: detects small clusters even for small α.
- ADMIP-C: creates isolated nodes for large n.
27 Conclusions and Future Work
Conclusions:
- RPCA-C: a tighter model based on RPCA.
- ADMIP-C: an ADMM with increasing penalty for RPCA-C.
- Detection is robust to increases in n and to variation among cluster sizes.
Future work:
- Weighted networks.
- Overlapping communities.
- Partial SVD is the bottleneck.
Code is available at (URL not preserved in this transcription). Thanks!
More informationLinearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation Zhouchen Lin Visual Computing Group Microsoft Research Asia Risheng Liu Zhixun Su School of Mathematical Sciences
More informationSparse PCA with applications in finance
Sparse PCA with applications in finance A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon 1 Introduction
More informationGeneralized Power Method for Sparse Principal Component Analysis
Generalized Power Method for Sparse Principal Component Analysis Peter Richtárik CORE/INMA Catholic University of Louvain Belgium VOCAL 2008, Veszprém, Hungary CORE Discussion Paper #2008/70 joint work
More informationLecture 3. 1 Polynomial-time algorithms for the maximum flow problem
ORIE 633 Network Flows August 30, 2007 Lecturer: David P. Williamson Lecture 3 Scribe: Gema Plaza-Martínez 1 Polynomial-time algorithms for the maximum flow problem 1.1 Introduction Let s turn now to considering
More informationRobust Sparse Recovery via Non-Convex Optimization
Robust Sparse Recovery via Non-Convex Optimization Laming Chen and Yuantao Gu Department of Electronic Engineering, Tsinghua University Homepage: http://gu.ee.tsinghua.edu.cn/ Email: gyt@tsinghua.edu.cn
More informationChapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.
Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved. 1 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should
More informationDense Error Correction for Low-Rank Matrices via Principal Component Pursuit
Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit Arvind Ganesh, John Wright, Xiaodong Li, Emmanuel J. Candès, and Yi Ma, Microsoft Research Asia, Beijing, P.R.C Dept. of Electrical
More informationDistributed Control of Connected Vehicles and Fast ADMM for Sparse SDPs
Distributed Control of Connected Vehicles and Fast ADMM for Sparse SDPs Yang Zheng Department of Engineering Science, University of Oxford Homepage: http://users.ox.ac.uk/~ball4503/index.html Talk at Cranfield
More informationClustering compiled by Alvin Wan from Professor Benjamin Recht s lecture, Samaneh s discussion
Clustering compiled by Alvin Wan from Professor Benjamin Recht s lecture, Samaneh s discussion 1 Overview With clustering, we have several key motivations: archetypes (factor analysis) segmentation hierarchy
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationCSCE 750 Final Exam Answer Key Wednesday December 7, 2005
CSCE 750 Final Exam Answer Key Wednesday December 7, 2005 Do all problems. Put your answers on blank paper or in a test booklet. There are 00 points total in the exam. You have 80 minutes. Please note
More informationA DETERMINISTIC APPROXIMATION ALGORITHM FOR THE DENSEST K-SUBGRAPH PROBLEM
A DETERMINISTIC APPROXIMATION ALGORITHM FOR THE DENSEST K-SUBGRAPH PROBLEM ALAIN BILLIONNET, FRÉDÉRIC ROUPIN CEDRIC, CNAM-IIE, 8 allée Jean Rostand 905 Evry cedex, France. e-mails: {billionnet,roupin}@iie.cnam.fr
More informationLeast Sparsity of p-norm based Optimization Problems with p > 1
Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from
More informationGraph Clustering Algorithms
PhD Course on Graph Mining Algorithms, Università di Pisa February, 2018 Clustering: Intuition to Formalization Task Partition a graph into natural groups so that the nodes in the same cluster are more
More informationThe dual simplex method with bounds
The dual simplex method with bounds Linear programming basis. Let a linear programming problem be given by min s.t. c T x Ax = b x R n, (P) where we assume A R m n to be full row rank (we will see in the
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationIncomplete Cholesky preconditioners that exploit the low-rank property
anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université
More informationChapter DM:II (continued)
Chapter DM:II (continued) II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis
More information