Matrix Factorization with Applications to Clustering Problems: Formulation, Algorithms and Performance

1 Date: Mar. 3rd, 2017
Matrix Factorization with Applications to Clustering Problems: Formulation, Algorithms and Performance
Presenter: Songtao Lu, Department of Electrical and Computer Engineering, Iowa State University

2 Outline
- Formulation: Spectral Clustering; Joint Factor Analysis and Latent Clustering
- Algorithms: SymNMF; Joint Factor Analysis and Latent Clustering; Deep Neural Networks for Clustering
- Other Problems
- Conclusion

3 Applications: Graph Partitioning
Figure: The two graphs shown are the same graph, re-organized. It is drawn from the stochastic block model (SBM) with 1000 vertices, 5 balanced communities, within-cluster probability of 1/50 and across-cluster probability of 1/1000.
Source: Emmanuel Abbe, http : // soc CD1.pdf
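The SBM parameters in the caption can be turned into a small sampler. The following is a minimal sketch, not the code used to produce the figure: the function name `sample_sbm`, the random seed, and the choice to draw each unordered vertex pair exactly once are assumptions made for illustration.

```python
import numpy as np

def sample_sbm(n=1000, k=5, p_in=1 / 50, p_out=1 / 1000, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from a balanced SBM.

    Assumes k divides n, so that all communities have the same size.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n // k)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    # Draw each unordered pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(np.int8)
    return adj, labels

adj, labels = sample_sbm()
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
within_density = adj[same & off_diag].mean()   # empirical within-cluster edge rate
across_density = adj[~same].mean()             # empirical across-cluster edge rate
print(within_density, across_density)
```

Sorting the vertices by `labels` gives the block-structured view of the graph; a random permutation of the same matrix gives the "re-organized" view in which the communities are hidden.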

4 Clustering Problem Formulation

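5 Kernel K-means clustering
K-means clustering:
$$J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - m_k\|^2 \quad (1)$$
$$J_K = c^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j \quad (2)$$
where $m_k = \sum_{i \in C_k} x_i / n_k$ is the centroid of cluster $C_k$ of $n_k$ points, and $c^2 = \sum_i \|x_i\|^2$.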

6 Kernel K-means clustering
K-means clustering (matrix form):
$$J_K = \mathrm{Tr}(X^T X) - \mathrm{Tr}(H^T X^T X H) \quad (3)$$
where $H = (h_1, \ldots, h_K)$, $h_k^T h_l = \delta_{kl}$, and
$$h_k = (0, \ldots, 0, \underbrace{1, \ldots, 1}_{n_k}, 0, \ldots, 0)^T / n_k^{1/2}.$$
With $W = X^T X$, $\min J_K$ becomes
$$\max_{H^T H = I,\ H \ge 0} J_W(H) = \mathrm{Tr}(H^T W H) \quad (4)$$
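The matrix form (3) can be checked numerically against the direct sum of squared distances to the centroids. This is a small sketch on random data; the sizes `d`, `N`, `K` and the fixed cluster assignment are arbitrary illustrative choices, and the columns of `X` are taken to be the data points so that `W = X.T @ X` is N by N.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K = 3, 12, 3
X = rng.normal(size=(d, N))        # columns are data points x_i
labels = np.arange(N) % K          # arbitrary assignment, every cluster nonempty

# Direct K-means objective: squared distances to each cluster centroid.
J_direct = 0.0
for k in range(K):
    Xk = X[:, labels == k]
    J_direct += np.sum((Xk - Xk.mean(axis=1, keepdims=True)) ** 2)

# Normalized membership matrix H with entries 1 / sqrt(n_k) on cluster k.
H = np.zeros((N, K))
for k in range(K):
    idx = labels == k
    H[idx, k] = 1.0 / np.sqrt(idx.sum())

W = X.T @ X
J_matrix = np.trace(W) - np.trace(H.T @ W @ H)
print(J_direct, J_matrix)
```

The two printed values agree up to floating-point error, and `H.T @ H` is the identity, which is the orthogonality constraint in (4).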

7 Kernel K-means clustering
A nonlinear transformation (mapping):
$$x_i \rightarrow \phi(x_i) \quad (5)$$
Kernel K-means can be written as
$$\min_{C_k,\ \forall k} \sum_{k=1}^{K} \sum_{i \in C_k} \|\phi(x_i) - m_k\|^2 \quad (6)$$
where $m_k = \sum_{i \in C_k} \phi(x_i) / n_k$ is the centroid of cluster $C_k$ of $n_k$ points.
Kernel K-means is equivalent to
$$\max_{H} \sum_{k} \frac{1}{n_k} \sum_{i,j \in C_k} W_{i,j} = \mathrm{Tr}(H^T W H) \quad (7)$$
Kernel: $W_{i,j} = \phi(x_i)^T \phi(x_j)$
Membership matrix: $H = (h_1, \ldots, h_K)$, $h_k^T h_l = \delta_{kl}$, $h_k = (0, \ldots, 0, \underbrace{1, \ldots, 1}_{n_k}, 0, \ldots, 0)^T / n_k^{1/2}$

8 Spectral Clustering

9 Spectral Clustering vs. K-Means

10 Challenge of Spectral Clustering
$$W = \begin{bmatrix} W_1 & 0 & 0 \\ 0 & W_2 & 0 \\ 0 & 0 & W_3 \end{bmatrix} \quad (8)$$
If $\lambda_3(W_1) > \max(\lambda_1(W_2), \lambda_2(W_3))$, the three leading eigenvectors of the similarity matrix are

11 Challenge of Spectral Clustering
Figure panels: Original Data; New Representation in Eigenvectors; Spectral Clustering (accuracy: 37.95%); SymNMF (accuracy: 88.78%).
D. Kuang, S. Yun and H. Park, SymNMF: nonnegative low rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization, vol. 62, no. 3, July.

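12 Kernel K-means clustering
Equivalence between K-means and matrix factorization:
$$H^* = \arg\min_{H^T H = I,\ H \ge 0} -2\,\mathrm{Tr}(H^T W H) \quad (9)$$
$$H^* = \arg\min_{H^T H = I,\ H \ge 0} \|W\|_F^2 - 2\,\mathrm{Tr}(H^T W H) + \|H^T H\|_F^2 \quad (10)$$
$$H^* = \arg\min_{H^T H = I,\ H \ge 0} \|W - H H^T\|_F^2 \quad (11)$$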
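The step from (10) to (11) is a direct expansion of the Frobenius norm. The worked derivation below fills in that step; it uses only the cyclic property of the trace and the constraint $H^T H = I$, with $K$ denoting the number of columns of $H$.

```latex
\begin{align*}
\|W - HH^T\|_F^2
  &= \mathrm{Tr}\big((W - HH^T)^T (W - HH^T)\big) \\
  &= \mathrm{Tr}(W^T W) - \mathrm{Tr}(W^T HH^T) - \mathrm{Tr}(HH^T W) + \mathrm{Tr}(HH^T HH^T) \\
  &= \|W\|_F^2 - 2\,\mathrm{Tr}(H^T W H) + \mathrm{Tr}(H^T H\, H^T H) \\
  &= \|W\|_F^2 - 2\,\mathrm{Tr}(H^T W H) + \|H^T H\|_F^2 .
\end{align*}
% Under H^T H = I the last term equals K, and \|W\|_F^2 does not depend on H,
% so (9), (10) and (11) share the same minimizers.
```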

13 Motivation
K-means:
$$\max_{H^T H = I,\ H \ge 0} \mathrm{Tr}(H^T W H) \quad (12)$$
Relaxed versions of K-means:
- Spectral Clustering: $\min_H \|HH^T - W\|_F^2$ subject to $H^T H = I$
- SymNMF: $\min_H \|HH^T - W\|_F^2$ subject to $H \ge 0$
In both cases the similarity matrix is approximated as $W \approx H H^T$.
D. Kuang, S. Yun and H. Park, SymNMF: nonnegative low rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization, vol. 62, no. 3, July.

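14 SymNMF for Clustering
Data matrix: samples (N) by features (M); construct the similarity matrix Z from it.
SymNMF:
$$\min_{X \in \mathbb{R}^{N \times K}} \|XX^T - Z\|_F^2 \quad \text{subject to } X \ge 0$$
- $Z \in \mathbb{R}^{N \times N}$: pairwise similarity matrix
- $X \in \mathbb{R}^{N \times K}$: clustering indicator matrix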

15 Joint factor analysis and latent clustering
$$\min_{W \in \mathbb{R}^{N \times F},\ H \in \mathbb{R}^{F \times M}} \|X - WH\|_F^2 \quad (13)$$
$$\text{s.t. } W \ge 0,\ H \ge 0 \quad (14)$$

16 Joint factor analysis and latent clustering
$$\min_{S, M} \|X - SM\|_F^2 \quad (15)$$
$$\text{s.t. } S(i,j) \in \{0, 1\},\ \|S(i,:)\|_0 = 1 \quad (16)$$
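Problem (15)-(16) is K-means written as a factorization: S is a one-hot assignment matrix and the rows of M are centroids. The sketch below checks this on random data; the sizes and the fixed assignment `labels` are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
N, F, K = 10, 4, 3
X = rng.normal(size=(N, F))          # rows are data points
labels = np.arange(N) % K            # arbitrary assignment, every cluster nonempty

S = np.eye(K)[labels]                # one-hot rows, so ||S(i,:)||_0 = 1
M = (S.T @ X) / S.sum(axis=0)[:, None]   # row k is the mean of cluster k

obj_factorization = np.linalg.norm(X - S @ M, "fro") ** 2
obj_kmeans = sum(np.sum((X[i] - M[labels[i]]) ** 2) for i in range(N))
print(obj_factorization, obj_kmeans)
```

For a fixed S, the centroid matrix M above is also the least-squares minimizer of (15), since it equals `np.linalg.pinv(S) @ X`.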

17 Joint factor analysis and latent clustering
Two-step approach:
- Step 1: dimension reduction via factorization (e.g. SVD, NMF)
- Step 2: perform K-means clustering on the latent factor W
Drawbacks of the two-step approach:
- ignores latent cluster structure when performing dimension reduction
- uses naive factorization when clustering

18 Motivation
A real-world example:
- 2 clusters of documents taken from the Reuters text corpus
- NMF with rank = 2
The figure shows the weights of each document on the two latent topics.

19 Latent clustering
$\{X(i,:)\}_{i=1}^{N}$ have latent representations drawn from K clusters, i.e. the rows of W can be divided into K clusters:
$$\min_{W, H, S, M} \|X - WH\|_F^2 + \lambda \|W - SM\|_F^2 \quad (17)$$
$$\text{s.t. } W \ge 0,\ H \ge 0,\ \|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (18)$$
- $\lambda \ge 0$ is a pre-specified regularization parameter
- $S \in \mathbb{R}^{N \times K}$ denotes cluster membership; $S(i,k) = 1$ means that $W(i,:)$ belongs to cluster k
- $M \in \mathbb{R}^{K \times F}$ denotes the centroid matrix, where each centroid is $M(k,:)$

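20 DNN
$$\min_{W, S, M} \sum_{i=1}^{N} \|f(x_i; H) - S(i,:)M\|_F^2 \quad (19)$$
$$\text{s.t. } \|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (20)$$
where $w_i = f(x_i; H)$.
$$\min_{H, Z, S, M} \sum_{i=1}^{N} \left[ \ell\big(g(f(x_i; H), Z), x_i\big) + \frac{\lambda}{2} \|f(x_i; H) - S(i,:)M\|_F^2 \right] \quad (21)$$
$$\text{s.t. } \|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (22)$$
where $\hat{x}_i = g(w_i; Z)$.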

21 Algorithms

22 Symmetric Nonnegative Matrix Factorization
$$\text{P1}: \quad \min_{X \in \mathbb{R}^{N \times K}} f(X) \triangleq \frac{1}{2} \|XX^T - Z\|_F^2 \quad \text{subject to } X \ge 0$$
Challenges:
- 4th-order polynomial (nonconvex) in terms of X
- No Lipschitz continuous gradient
- General matrix $Z \in \mathbb{R}^{N \times N}$: non-symmetric; indefinite; contains negative entries
- K is any integer in [1, N].

23 SymNMF (two-dimensional case, K = 1)
$$\min_{x} \|xx^T - Z\|_F^2 \quad \text{subject to } x \ge 0$$
where $x = [x_1, x_2]^T$.
Hessian matrix:
$$H = \begin{pmatrix} 12x_1^2 + 4x_2^2 - 4Z_{11} & 8x_1 x_2 - 4Z_{12} \\ 8x_2 x_1 - 4Z_{21} & 12x_2^2 + 4x_1^2 - 4Z_{22} \end{pmatrix}$$
Figure: three panels (Z positive definite; Z indefinite; Z negative definite), each marking the point $[0, 0]^T$.
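The Hessian on this slide can be checked against finite differences, and evaluated at the origin where it reduces to $-4Z$. The sketch below assumes a symmetric Z (the off-diagonal entries of the formula rely on $Z_{12} = Z_{21}$); the test matrix `Z_pd`, the test point `x0`, and the step size `h` are illustrative choices.

```python
import numpy as np

def objective(x, Z):
    """f(x) = ||x x^T - Z||_F^2 for a 2-vector x."""
    return np.linalg.norm(np.outer(x, x) - Z, "fro") ** 2

def hessian(x, Z):
    """Closed-form Hessian from the slide (valid for symmetric Z)."""
    x1, x2 = x
    return np.array([
        [12 * x1**2 + 4 * x2**2 - 4 * Z[0, 0], 8 * x1 * x2 - 4 * Z[0, 1]],
        [8 * x2 * x1 - 4 * Z[1, 0], 12 * x2**2 + 4 * x1**2 - 4 * Z[1, 1]],
    ])

def numerical_hessian(x, Z, h=1e-4):
    """Central finite-difference approximation of the Hessian."""
    n = x.size
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            out[i, j] = (objective(x + ei + ej, Z) - objective(x + ei - ej, Z)
                         - objective(x - ei + ej, Z) + objective(x - ei - ej, Z)) / (4 * h * h)
    return out

Z_pd = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite example
x0 = np.array([0.7, -0.3])
H_exact = hessian(x0, Z_pd)
H_numeric = numerical_hessian(x0, Z_pd)
H_origin = hessian(np.zeros(2), Z_pd)       # equals -4 * Z_pd
print(np.linalg.eigvalsh(H_origin))
```

Because the Hessian at the origin is $-4Z$, the origin is a local maximizer of the unconstrained objective when Z is positive definite, a saddle point when Z is indefinite, and a local minimizer when Z is negative definite, which matches the three cases named in the figure.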

24 Background and Motivation
Clustering: Symmetric Nonnegative Matrix Factorization (SymNMF)
- Probabilistic clustering [Zass et al 05]
- Community detection [Wang et al 11] [Ma et al 10]
- Overlapping community detection [Zhang et al 13]
- Graph partitioning and image segmentation [Park et al 15]
Clustering accuracy (methods compared): K-means variants, NMF variants, Spectral clustering variants, SymNMF.
D. Kuang, S. Yun and H. Park, SymNMF: nonnegative low rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization, vol. 62, no. 3, July.

25 Related Work
Existing algorithms:
- Projected Gradient Descent (PGD) [Kuang, Park et al 12]:
$$X^{(t+1)} = \mathrm{proj}_+\big[X^{(t)} - \alpha \nabla f(X^{(t)})\big], \quad \text{where } \mathrm{proj}_+[X] = \max\{X, 0\}.$$
- Projected Newton Method (PNewton) [Kuang, Park et al 12]:
$$X^{(t+1)} = \mathrm{proj}_+\big[X^{(t)} - \alpha (\nabla^2 f(X^{(t)}))^{-1} \nabla f(X^{(t)})\big]$$
Disadvantages: no global convergence guarantee (no Lipschitz continuous gradient)
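The PGD update above can be sketched in a few lines for the SymNMF objective $f(X) = \frac{1}{2}\|XX^T - Z\|_F^2$. This is an illustration, not the implementation of [Kuang, Park et al 12]: it assumes a symmetric Z (so that the gradient is $2(XX^T - Z)X$), and the backtracking rule used to pick the step size $\alpha$ is an added assumption, since the slide does not say how $\alpha$ is chosen.

```python
import numpy as np

def symnmf_objective(X, Z):
    return 0.5 * np.linalg.norm(X @ X.T - Z, "fro") ** 2

def symnmf_gradient(X, Z):
    # Gradient of 0.5 * ||X X^T - Z||_F^2 when Z is symmetric.
    return 2.0 * (X @ X.T - Z) @ X

def projected_gradient_descent(Z, K, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((Z.shape[0], K))
    history = [symnmf_objective(X, Z)]
    for _ in range(iters):
        G = symnmf_gradient(X, Z)
        f_old = history[-1]
        X_new, f_new = X, f_old          # keep the iterate if no step is accepted
        alpha = 1.0
        while alpha > 1e-12:
            cand = np.maximum(X - alpha * G, 0.0)      # proj_+ step
            f_cand = symnmf_objective(cand, Z)
            # Sufficient-decrease test along the projection arc (assumed rule).
            if f_cand <= f_old + 1e-4 * np.sum(G * (cand - X)):
                X_new, f_new = cand, f_cand
                break
            alpha *= 0.5
        X = X_new
        history.append(f_new)
    return X, history

rng = np.random.default_rng(1)
X_true = rng.random((30, 3))
Z_demo = X_true @ X_true.T               # exactly factorizable test matrix
X_hat, history = projected_gradient_descent(Z_demo, K=3, iters=50)
print(history[0], history[-1])
```

The backtracking loop is what keeps the objective from increasing here; with a fixed $\alpha$ the quartic objective gives no such guarantee, which is the difficulty the slide attributes to the missing Lipschitz continuous gradient.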

26 Related Work
Existing algorithms (continued):
Eigenvalue-decomposition-based SymNMF [Huang, Sidiropoulos et al 14]
- $Z = U_K \Sigma_K U_K^T$, and let $B \triangleq U_K \Sigma_K^{1/2}$
- Assume $Z = XX^T$
$$\min_{X, Q} \frac{1}{2} \|X - BQ\|_F^2 \quad \text{subject to } X \ge 0,\ Q^T Q = QQ^T = I$$
Disadvantages:
- assumes there is an exact decomposition (i.e., $Z = XX^T$)
- there is no proof that the optimal objective value is 0

27 Related Work
Existing algorithms (continued):
Alternating Non-negative Least Squares (ANLS) [Kuang, Park et al 15]
$$\min_{X, Y} \frac{1}{2} \|XY^T - Z\|_F^2 + \lambda \|X - Y\|_F^2 \quad \text{subject to } Y \ge 0,\ X \ge 0$$
Disadvantages: the KKT points of this problem are different from those of P1
Coordinate Descent (CD) [Vandaele, Gillis et al 16]
$$\min_{X_{:,j} \ge 0} \|X_{:,j} X_{:,j}^T - R^{(j)}\|_F^2, \quad \text{where } R^{(j)} = Z - \sum_{k=1,\ k \ne j}^{K} X_{:,k} X_{:,k}^T$$
Disadvantages: no convergence guarantee to KKT points (the optimal solution of each subproblem is not unique)

28 New Formulation of SymNMF
Our formulation of SymNMF:
$$\text{P2}: \quad \min_{X \in \mathbb{R}^{N \times K},\ Y \in \mathbb{R}^{N \times K}} \frac{1}{2} \|XY^T - Z\|_F^2 \quad \text{subject to } Y \ge 0,\ X = Y,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i$$
Advantages of the new formulation:
- variable splitting
- feasible set is compact (closed and bounded)

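29 New Formulation of SymNMF
How to solve this problem?
$$\text{P2}: \quad \min_{X \in \mathbb{R}^{N \times K},\ Y \in \mathbb{R}^{N \times K}} \frac{1}{2} \|XY^T - Z\|_F^2 \quad \text{subject to } Y \ge 0,\ X = Y,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i$$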

30 Alternating Direction Method of Multipliers (ADMM)
Problem:
$$\min_{x, y} h(x) + g(y) \quad \text{subject to } Ax + By = c$$
where $x \in \mathbb{R}^N$, $y \in \mathbb{R}^M$, $A \in \mathbb{R}^{P \times N}$, $B \in \mathbb{R}^{P \times M}$, and $c \in \mathbb{R}^P$.
The augmented Lagrangian is
$$L(x, y; \lambda) = h(x) + g(y) + \lambda^T (Ax + By - c) + \frac{\rho}{2} \|Ax + By - c\|_2^2$$
where $\rho > 0$.
ADMM consists of the iterations [Boyd et al 04]
$$x^{(t+1)} = \arg\min_x L(x, y^{(t)}; \lambda^{(t)})$$
$$y^{(t+1)} = \arg\min_y L(x^{(t+1)}, y; \lambda^{(t)})$$
$$\lambda^{(t+1)} = \lambda^{(t)} + \rho (Ax^{(t+1)} + By^{(t+1)} - c)$$
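The three ADMM updates are easiest to see on a toy convex problem. The sketch below is an illustration chosen for this purpose and does not come from the talk: it takes $h(x) = \frac{1}{2}\|x - a\|_2^2$, $g(y) = \gamma\|y\|_1$, $A = I$, $B = -I$, $c = 0$, so both subproblems have closed forms and the known solution is the soft-thresholding of $a$.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_l1_denoise(a, gamma, rho=1.0, iters=200):
    """ADMM for min 0.5*||x - a||^2 + gamma*||y||_1  s.t.  x - y = 0."""
    x = np.zeros_like(a)
    y = np.zeros_like(a)
    lam = np.zeros_like(a)
    for _ in range(iters):
        # x-update: minimize the augmented Lagrangian in x (a quadratic).
        x = (a - lam + rho * y) / (1.0 + rho)
        # y-update: minimize in y, which is a soft-thresholding step.
        y = soft_threshold(x + lam / rho, gamma / rho)
        # Dual update: lam <- lam + rho * (A x + B y - c) with A = I, B = -I, c = 0.
        lam = lam + rho * (x - y)
    return x, y, lam

a = np.array([3.0, -0.5, 0.2, -2.0])
x_admm, y_admm, lam_admm = admm_l1_denoise(a, gamma=1.0)
expected = soft_threshold(a, 1.0)
print(x_admm, expected)
```

On this example the iterates approach `expected` and the constraint residual `x_admm - y_admm` goes to zero, which is the behavior guaranteed in the convex, separable setting.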

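31 ADMM for SymNMF
Partial augmented Lagrangian:
$$L(X, Y; \Lambda) = \frac{1}{2} \|XY^T - Z\|_F^2 + \langle Y - X, \Lambda \rangle + \frac{\rho}{2} \|Y - X\|_F^2$$
- $\Lambda \in \mathbb{R}^{N \times K}$: dual variables
- $\langle \cdot, \cdot \rangle$: inner product operator
- $\rho > 0$
Y-subproblem:
$$\min_{Y \ge 0,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i} \frac{1}{2} \|XY^T - Z\|_F^2 + \langle Y - X, \Lambda \rangle + \frac{\rho}{2} \|Y - X\|_F^2$$
X-subproblem:
$$\min_{X} \frac{1}{2} \|XY^T - Z\|_F^2 + \langle Y - X, \Lambda \rangle + \frac{\rho}{2} \|Y - X\|_F^2$$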

32 Comparison with Classical ADMM
Classical ADMM:
$$\min_{x \in \mathcal{X},\ y \in \mathcal{Y}} h(x) + g(y) \quad \text{subject to } Ax + By = c$$
P2:
$$\min_{X, Y} \frac{1}{2} \|XY^T - Z\|_F^2 \quad \text{subject to } Y \ge 0,\ X = Y,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i$$
Challenges:
- Nonconvex objective
- Objective function is non-separable
- Recent analysis results of ADMM for nonconvex problems do not apply [Hong et al 16] [Pong et al 15]

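33 Nonconvex Splitting SymNMF (NS-SymNMF)
Parameter update (iteration-dependent penalty parameter):
$$\beta^{(t)} = \frac{6}{\rho} \|X^{(t)} (Y^{(t)})^T - Z\|_F^2$$
Primal update:
$$Y^{(t+1)} = \arg\min_{Y \ge 0,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i} \frac{1}{2} \|X^{(t)} Y^T - Z\|_F^2 + \frac{\rho}{2} \|Y - X^{(t)} + \Lambda^{(t)}/\rho\|_F^2 + \underbrace{\frac{\beta^{(t)}}{2} \|Y - Y^{(t)}\|_F^2}_{\text{proximal term}}$$
$$X^{(t+1)} = \arg\min_{X} \frac{1}{2} \|X (Y^{(t+1)})^T - Z\|_F^2 + \frac{\rho}{2} \|X - \Lambda^{(t)}/\rho - Y^{(t+1)}\|_F^2$$
Dual update:
$$\Lambda^{(t+1)} = \Lambda^{(t)} + \rho (Y^{(t+1)} - X^{(t+1)})$$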
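The three updates can be put together in a short loop. The following is a rough sketch under stated assumptions, not the authors' implementation: the Y-subproblem is solved only approximately with a few projected-gradient steps (the slide does not specify a solver), the X-subproblem uses the closed form obtained by setting its gradient to zero, and the names `project_rows` and `ns_symnmf`, the choice `tau = max(diag(Z))`, and the fixed `rho = 6.1 * N * tau` in the example are all illustrative.

```python
import numpy as np

def project_rows(Y, tau):
    """Project onto {Y >= 0, ||Y[i, :]||_2^2 <= tau for every row i}."""
    Y = np.maximum(Y, 0.0)
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    scale = np.minimum(1.0, np.sqrt(tau) / np.maximum(norms, 1e-12))
    return Y * scale

def ns_symnmf(Z, K, tau, rho, iters=30, inner_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    Y = project_rows(rng.random((N, K)), tau)
    X = Y.copy()
    Lam = np.zeros((N, K))
    for _ in range(iters):
        # Iteration-dependent proximal weight beta^(t).
        beta = 6.0 / rho * np.linalg.norm(X @ Y.T - Z, "fro") ** 2
        # Y-update: inexact, a few projected-gradient steps (assumed solver).
        Y_prev = Y.copy()
        XtX = X.T @ X
        step = 1.0 / (np.linalg.norm(XtX, 2) + rho + beta)
        for _ in range(inner_iters):
            grad = (Y @ XtX - Z.T @ X) + rho * (Y - X + Lam / rho) + beta * (Y - Y_prev)
            Y = project_rows(Y - step * grad, tau)
        # X-update: solve X (Y^T Y + rho I) = Z Y + Lam + rho Y.
        A = Y.T @ Y + rho * np.eye(K)
        X = np.linalg.solve(A, (Z @ Y + Lam + rho * Y).T).T
        # Dual update.
        Lam = Lam + rho * (Y - X)
    return X, Y, Lam

rng = np.random.default_rng(1)
X_true = rng.random((20, 3))
Z_demo = X_true @ X_true.T
tau_demo = Z_demo.diagonal().max()          # assumed bound on squared row norms
rho_demo = 6.1 * Z_demo.shape[0] * tau_demo # assumed fixed penalty
X_out, Y_out, Lam_out = ns_symnmf(Z_demo, K=3, tau=tau_demo, rho=rho_demo)
print(np.linalg.norm(X_out - Y_out))
```

By construction every `Y` iterate satisfies the constraints of P2; how quickly `X` and `Y` agree depends on `rho`, the number of inner steps, and the iteration count, and this sketch makes no claim about matching the reported results.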

34 Convergence Analysis of ADMM (convex)
Convex case [Boyd et al 11], [Hong, Luo 12]:
$$\|X^{(t)} - X^*\|^2 + \|Y^{(t)} - Y^*\|^2 \rightarrow 0 \quad \text{and} \quad \|\Lambda^{(t)} - \Lambda^*\|^2 \rightarrow 0$$
where $(X^*, Y^*; \Lambda^*)$ is the globally optimal primal-dual pair.

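35 Convergence Analysis (Convergence Rate)
Define the proximal gradient of the augmented Lagrangian function as
$$\tilde{\nabla} L(X, Y, \Lambda) \triangleq \begin{bmatrix} Y^T - \mathrm{proj}_Y\big[Y^T - \nabla_Y L(X, Y, \Lambda)\big] \\ \nabla_X L(X, Y, \Lambda) \end{bmatrix}$$
where the operator
$$\mathrm{proj}_Y(W) \triangleq \arg\min_{Y \ge 0,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i} \|W - Y\|_F^2.$$
We use the following quantity to measure the progress of the algorithm:
$$P(X^{(t)}, Y^{(t)}, \Lambda^{(t)}) \triangleq \underbrace{\|\tilde{\nabla} L(X^{(t)}, Y^{(t)}, \Lambda^{(t)})\|_F^2}_{\text{primal gap}} + \underbrace{\|X^{(t)} - Y^{(t)}\|_F^2}_{\text{dual gap}}.$$
If $\lim_{t \to \infty} P(X^{(t)}, Y^{(t)}, \Lambda^{(t)}) = 0$, then a KKT point of P2 is obtained.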

36 Numerical Results
Data sets:
- Synthetic data sets
- Real data sets
Algorithms compared:
- PGD: Projected Gradient Descent [Kuang, Park et al 12]
- PNewton: Projected Newton Method [Kuang, Park et al 12]
- SNMF: Eigenvalue-based SymNMF [Huang, Sidiropoulos 14]
- ANLS: Alternating Non-negative Least Squares [Kuang, Park et al 15]
- CD: Coordinate Descent [Vandaele, Gillis et al 16]

37 Numerical Results
Initialization:
- 20 random initializations (Y or X follows an i.i.d. uniform distribution in the range [0, τ])
- Every algorithm starts from the same initial point
Parameter choice for NS-SymNMF (accelerating the convergence rate):
- Initialization: $\rho^{(0)} = N\tau$, where τ is the average of the column norms of Z.
- $\rho^{(t+1)} = \min\{\rho^{(t)} / (1 - \epsilon/\rho^{(t)}),\ 6.1N\tau\}$, where $\epsilon = 10^{-3}$
- $\beta^{(t)} = 6\xi^{(t)} \|X^{(t)} (Y^{(t)})^T - Z\|_F^2 / \rho^{(t)}$, where $\xi^{(t+1)} = \min\{\xi^{(t)} / (1 - \epsilon/\xi^{(t)}),\ 1\}$ and $\xi^{(1)} =$
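The update rule for $\rho^{(t)}$ and $\xi^{(t)}$ is the same capped recursion, $v \leftarrow \min\{v / (1 - \epsilon/v),\ \text{cap}\}$. The sketch below shows its behavior; the helper name `next_penalty` and the starting value and cap in the example (1.0 and 6.1, i.e. the case $N\tau = 1$) are illustrative assumptions.

```python
def next_penalty(value, cap, eps=1e-3):
    """One step of v <- min(v / (1 - eps / v), cap); assumes v > eps."""
    return min(value / (1.0 - eps / value), cap)

rho, cap = 1.0, 6.1          # illustrative: rho(0) = N*tau and cap = 6.1*N*tau with N*tau = 1
trace = [rho]
for _ in range(6000):
    rho = next_penalty(rho, cap)
    trace.append(rho)
print(trace[1], trace[-1])
```

Each step increases the value by slightly more than `eps`, so the sequence grows slowly and then stays at the cap once it is reached.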

38 Numerical Results (Synthetic Data)
Data Set I: Random symmetric matrices
- $M \in \mathbb{R}^{N \times N}$, i.i.d. Gaussian
- $Z = M + M^T$
Data Set II: Adjacency matrices
- N = 2000, K = 4
- The numbers of data points within each cluster are 300, 500, 800, 400.
- Data points $\{x_i\} \in \mathbb{R}$, $i = 1, \ldots, N$. Mean: 2, 3, 6, 8; Variance: 0.5.
- Gaussian function $Z_{i,j} = \exp(-(x_i - x_j)^2 / (2\sigma^2))$, where $\sigma^2 = 0.5$
Relative objective value: $\|XX^T - Z\|_F^2 / \|Z\|_F^2$
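Data Set II is fully specified by the numbers on this slide, so it can be regenerated approximately. The sketch below is one way to do so; the function names, the random seed, and the decision to return the ground-truth labels are assumptions for illustration.

```python
import numpy as np

def make_dataset_two(sizes=(300, 500, 800, 400), means=(2.0, 3.0, 6.0, 8.0),
                     variance=0.5, sigma2=0.5, seed=0):
    """1-D Gaussian clusters and their Gaussian-kernel similarity matrix."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([rng.normal(m, np.sqrt(variance), size=n)
                        for m, n in zip(means, sizes)])
    labels = np.repeat(np.arange(len(sizes)), sizes)
    diff = x[:, None] - x[None, :]
    Z = np.exp(-diff ** 2 / (2.0 * sigma2))
    return x, labels, Z

def relative_objective(X, Z):
    """||X X^T - Z||_F^2 / ||Z||_F^2, the metric reported on the slide."""
    return np.linalg.norm(X @ X.T - Z, "fro") ** 2 / np.linalg.norm(Z, "fro") ** 2

x, labels, Z = make_dataset_two()
baseline = relative_objective(np.zeros((Z.shape[0], 4)), Z)   # X = 0 gives 1.0
print(Z.shape, baseline)
```

The relative objective equals 1 for $X = 0$ and 0 for an exact factorization, which gives a scale for reading the reported values.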

39 Numerical Results (Random symmetric matrices)
Data set I: Random symmetric matrices: N = 500, K = Monte Carlo (MC) trials
Full rank

40 Numerical Results (Adjacency matrices)
Swamp
Data set II: Adjacency matrices: N = 2000, K = 4
Optimality gap: $\|X - \mathrm{proj}_+[X - \nabla_X(g(X, Y))]\|$
20 MC trials

41 Numerical Results (Optimality)
Check local optimality:
- initialize δ as 1
- decrease it by 0.01 each time
- check the minimum eigenvalue of T.
More examples:
- fix the ratio of the number of nodes within each cluster (i.e., 3 : 5 : 8 : 4)
- test on different total numbers of nodes
Table columns: N, $\lambda_{\min}(T)$, δ, Local Optimality (true), in %.

42 Numerical Results (Real Data Set)
Text Mining (dense similarity matrix):
- Vertices: documents
- Edges: similarity between two documents
- Datasets: Reuters; topic detection and tracking2 (TDT2) [Cai, et al 11]
Social Network (sparse similarity matrix):
- Vertices: individuals
- Edges: relationship
- Datasets: Email-Enron [Leskovec et al 09]; Brightkite (location-based social networking) [Cho et al 11]; Facebook [McAuley et al 12]

43 Numerical Results
Algorithms: NS-SymNMF, PGD, ANLS, SNMF, CD
Mean and Variance: 1.01e-2±5.35e e-2±7,34e e-2±1.25e e e-2±1.21e-6
Text mining data (dense similarity matrix): topic detection and tracking2 (TDT2), N = 8,939 documents, K = 25 classes, Gaussian function (similarity)

44 Numerical Results
Algorithms: NS-SymNMF, ANLS, SNMF, CD
Mean and Variance: 8.75e-1±9.52e e-1±1.93e e ±1.49e-3
Social network data (sparse similarity matrix): Brightkite (location-based social networking), N = 58,228 people, 428,156 edges, K = 50

45 Joint factor analysis and latent clustering
$$\min_{W, H, S, M, \{d_i\}_{i=1}^{N}} \|X - DWH\|_F^2 + \lambda \|W - SM\|_F^2 + \eta \|H\|_F^2 \quad (23)$$
$$\text{s.t. } W \ge 0,\ H \ge 0,\ \|W(i,:)\|_2 = 1,\ \forall i \quad (24)$$
$$D = \mathrm{diag}(d_1, \ldots, d_N) \quad (25)$$
$$\|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (26)$$
- $\eta \ge 0$ is a regularization parameter
- Euclidean-distance-based K-means clustering on the unit 2-norm ball is equivalent to correlation-based clustering
- $\eta \|H\|_F^2$ controls the scaling ambiguity
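The equivalence between Euclidean and correlation-based clustering on the unit sphere follows from $\|a - b\|_2^2 = 2 - 2a^T b$ when $\|a\|_2 = \|b\|_2 = 1$. The sketch below checks this identity on random nonnegative rows; the matrix sizes are arbitrary, and "correlation" is read here as the inner product of unit-norm rows.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 4))                              # nonnegative rows, as in W >= 0
W /= np.linalg.norm(W, axis=1, keepdims=True)       # enforce ||W(i,:)||_2 = 1

sq_dist = np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=2)   # pairwise squared distances
inner = W @ W.T                                                   # pairwise inner products
print(np.max(np.abs(sq_dist - (2.0 - 2.0 * inner))))
```

Since the squared distance is a decreasing affine function of the inner product, assigning a row to its nearest centroid direction in Euclidean distance is the same as assigning it to the most correlated one.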

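46 Joint factor analysis and latent clustering
$$\min_{W, H, Z, S, M, \{d_i\}_{i=1}^{N}} \|X - DWH\|_F^2 + \lambda \|W - SM\|_F^2 + \eta \|H\|_F^2 + \mu \|W - Z\|_F^2$$
$$\text{s.t. } W \ge 0,\ H \ge 0,\ \|Z(i,:)\|_2 = 1,\ \forall i \quad (27)$$
$$D = \mathrm{diag}(d_1, \ldots, d_N) \quad (28)$$
$$\|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (29)$$
- $\mu \ge 0$ and Z is a slack variable
- a large µ is used to enforce $W \approx Z$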

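47 Alternating Optimization
W-update:
$$W = \arg\min_{W \ge 0} \|X - DWH\|_F^2 + \lambda \|W - SM\|_F^2 + \mu \|W - Z\|_F^2 \quad (30)$$
H-update:
$$H = \arg\min_{H \ge 0} \|X - DWH\|_F^2 + \eta \|H\|_F^2 \quad (31)$$
$d_i$-update:
$$d_i = \arg\min_{d_i} \|X(i,:) - d_i W(i,:) H\|_2^2 \quad (32)$$
$$d_i = \frac{X(i,:)\, b_i}{b_i^T b_i}, \quad \text{where } b_i^T = W(i,:) H$$
Z-update:
$$Z = \arg\min_{\|Z(i,:)\|_2 = 1,\ \forall i} \|W - Z\|_F^2 \quad (33)$$
S, M-update: K-means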
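Two of these updates have simple closed forms: the $d_i$-update is a scalar least-squares fit per row, and the Z-update normalizes each row of W. The sketch below implements just those two steps; the function names and the small guard against division by zero are assumptions, and the W-, H- and K-means updates are not shown.

```python
import numpy as np

def update_d(X, W, H):
    """d_i = X(i,:) b_i / (b_i^T b_i) with b_i^T = W(i,:) H, for all rows at once."""
    B = W @ H                               # row i holds b_i^T
    num = np.sum(X * B, axis=1)             # X(i,:) b_i
    den = np.sum(B * B, axis=1)             # b_i^T b_i
    return num / np.maximum(den, 1e-12)

def update_Z(W):
    """Closest matrix to W (Frobenius norm) whose rows have unit 2-norm."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
W_demo = rng.random((5, 3))
H_demo = rng.random((3, 7))
d_true = rng.random(5) + 0.5
X_demo = np.diag(d_true) @ W_demo @ H_demo  # noiseless data, so d is recoverable
d_hat = update_d(X_demo, W_demo, H_demo)
Z_hat = update_Z(W_demo)
print(d_hat, d_true)
```

On noiseless data generated as $X = DWH$ the update returns the true scalings exactly, which is a convenient sanity check before running the full alternating scheme.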

48 Numerical Results
- Reuters text corpus, top 41 clusters, 8213 documents, words
- Test for various numbers of clusters k
- For each k, 10 Monte-Carlo trials by randomly picking k clusters
Baseline: locally consistent concept factorization (LCCF) [Cai et al 11]
$$\min_{U, V} \|X - XUV^T\|_F^2 + \lambda\, \mathrm{Tr}(V^T L V) \quad \text{s.t. } U \ge 0,\ V \ge 0 \quad (34)$$
L: graph Laplacian. Intuition: ambient proximity → latent proximity

49 DNN

50 Other Works
- Tensor decomposition
- Distributed matrix factorization

51 PARAFAC2
PARAllel FACtor analysis 2 (PARAFAC2) [Harshman 72]: Cross-Language Information Retrieval [Chew 07] and Multilingual Document Clustering [Romeo 14]:
$$X(:, :, k) = F_k D_k(C) A^T, \quad k = 1, \ldots, K$$
Figure: each slice $X(:, :, k)$ is a terms-by-documents matrix for one language (French, Chinese, Spanish, Italian, English; slices $X(:, :, 1)$ to $X(:, :, 5)$), factored into $F_k$ (terms by concepts), $D_k(C)$ (concept weights) and $A^T$ (concepts by documents).

52 Distributed Matrix Factorization
$$\min_{X, Y} \|XY^T - Z\|_F^2$$
Distributed matrix factorization:
$$\min_{U_i, Y_i,\ \forall i} \sum_{i=1}^{N} \|U_i Y_i^T - Z_i\|_F^2 \quad \text{subject to } U_i = U_j,\ \forall (i, j) \in E$$
Figure: a network of 7 nodes with edge set E; node i holds the data block $Z_i$ and the local variables $(U_i, Y_i)$.
Convergence analysis:
- KKT points [Hong 16]
- Global optimal solution (future work)

53 Conclusion
- Clustering Formulation: Spectral Clustering; Joint Factor Analysis and Latent Clustering
- Algorithms: Joint Factor Analysis and Latent Clustering; SymNMF
- Other Problems: Tensor decomposition; Distributed Matrix Factorization

54 Thanks for Your Attention!

55 References
- D. Kuang, S. Yun, and H. Park, SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization, vol. 62, no. 3, Jul.
- S. Lu, M. Hong, and Z. Wang, A nonconvex splitting method for symmetric nonnegative matrix factorization: Convergence analysis and optimality, IEEE Transactions on Signal Processing, Feb.
- B. Yang, X. Fu, and N. D. Sidiropoulos, Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering, 2016.
- B. Yang, X. Fu, and N. D. Sidiropoulos, Learning from hidden traits: Joint factor analysis and latent clustering, IEEE Transactions on Signal Processing, accepted, Sep.
