Clustering with the SVD and NMF Amy Langville Mathematics Department College of Charleston Dagstuhl 2/14/2007
Outline Fielder Method Extended Fielder Method and SVD Clustering with SVD vs. NMF Demos with vismatrix Tool
Fiedler Method
Fiedler Method s.p.d Laplacian matrix L = D A A must be symmetrix. Else, symmetrize by: A T A AA T A + A T Properties of L L is spd so all λ i 0. λ i = 0, connected component i If L is scc, then λ 1 = 0, λ i > 0, i = 2,...,n. The subdominant eigenvector v 2 gives info. on clustering.
Understanding Graph Theory 2 3 1 4 5 6 10 7 8 9 Given this graph, there are no apparent clusters.
Understanding Graph Theory 2 3 1 4 5 6 10 7 8 9 However, suppose we untangle the web.
Understanding Graph Theory 2 3 1 4 6 5 10 7 8 9 However, suppose we untangle the web.
Understanding Graph Theory 2 3 1 7 4 6 5 10 8 9 However, suppose we untangle the web.
Understanding Graph Theory 2 3 1 7 4 9 6 5 10 8 However, suppose we untangle the web.
Understanding Graph Theory 7 1 2 3 4 8 9 6 5 10 However, suppose we untangle the web.
Understanding Graph Theory 7 1 2 3 4 8 9 6 5 10 However, suppose we untangle the web.
Understanding Graph Theory 7 1 8 2 3 4 9 Although the clusters are now apparent, we need a better method. 6 5 10
Using the Fiedler Vector Generally, v 2 is recursively used to partition the graph by separating the components into negative and positive values. sign(v 2 )=[-, -, -, +, +, +, -, -, -, +] 4 6 2 7 5 1 3 8 9 10
A Fiedler Clustering Example 4 5 6 10 2 4 6 2 7 5 1 3 8 9 10 3 7 1 8 9 4 5 6 10 2 3 7 1 8 9
Reminder: Problems With Laplacian The Laplacian method requires the use of: an undirected graph. a pictorially symmetric matrix. Square matrices. Zero may not always be the best choice for partitioning the eigenvector values of v 2. Recursive algorithms are expensive.
Why does Fiedler vector cluster? Two-way partition A = [ ] A1 A 2 A 3 A 4 D = [ D1 0 0 D 4 Assign each node to one of two clusters. Create decision variables x i x i = 1, if node i goes in cluster 1 x i = 1, if node i goes in cluster 2 Objective: minimize the number of between-cluster links maximize the number of in-cluster links min x x T Lx Suppose x = [ e e ]. Then min x x T Lx =(e T D 1 e + e T D 4 e)+(e T A 2 e + e T A 3 e) (e T A 1 e + e T A 4 e) for balancing between-cluster links in-cluster links ]
Why does Fiedler vector cluster? Optimization Problem min x T Lx is NP-hard. Relax from discrete to continuous values for x. By Rayleigh theorem, min x 2 =1 xt Lx = λ 1, with x as the eigenvector corresponding to λ 1. BUT, x = e, which is not helpful for clustering!
Optimization Solution Add constraint x T e = 0. By Courant-Fischer theorem, min x T Lx = λ 2, x 2 =1, x T e=0 with x = v 2 as the eigenvector corresponding to λ 2. v 2 is called the Fiedler vector.
Extended Fiedler
Other subdominant eigenvectors If v 2 gives approximate info. about clustering, what about the other subdominant eigenvectors v 3, v 4,...? Since L is symmetric, it has an orthogonal eigendecomposition. Thus, for i 1, v i v 1 = e. Thus, for i 1, v i sums to 0. Thus, for i 1, v i is centered about 0. This explains why, when bisecting a graph, a split about 0 makes sense.
Example Graph
Clustered Example Graph using eigenvectors v 2, v 3, v 4 Nodes 4 and 18 are saddle nodes that straddle a few clusters.
Clustered Example Eigenvectors - - + - + - - + + 6 clusters, 6 [3,2 3 ] + + + + - + + - - Sign patterns in eigenvectors give clusters and saddle nodes. Number of clusters found is not fixed, but lies in [j, 2 j ], where j is the number of eigenvectors used. takes a global view for clustering (whereas recursive Fiedler tunnels down, despite possible errors at initial iterations.)
Clustering with the SVD
Applying this to SVD A A k = U k Σ k V T k Left singular vectors in U k of term-by-document matrix A are eigenvectors of AA T. sign pattern in U k will cluster terms. Right singular vectors in V k of term-by-document matrix A are eigenvectors of A T A. sign pattern in V k will cluster documents.
Pseudocode For Term Clustering k = truncation point for SVD input U k (matrix of k left singular vectors of A) input j (user-defined scalar, # of clusters [j, 2 j ]. Note: j k.) create B = U(:, 1 : j) >= 0); binary matrix with sign patterns associate a number x(i) with each unique sign pattern x=zeros(n,1); for i=1:j x=x+(2 j i )*(B(:,i)); end reorder rows of A by indices in sorted vector x
vismatrix tool David Gleich
SVD Clustering of Reuters10 72K terms, 9K docs, 415K nonzeros j = 10 for terms produces 486 term clusters j = 4 for documents produces 8 document clusters
SVD Clustering of Reuters10 72K terms, 9K docs, 415K nonzeros j = 10 for terms produces 486 term clusters j = 4 for documents produces 8 document clusters j = 10 for terms produces 486 term clusters j = 10 for documents produces 391 document clusters
Summary of SVD Clustering + variable # of clusters returned between j and 2 j + sign pattern idea allows for natural division within clusters + clusters ordered so that similar clusters are nearby each other + less work than recursive Fiedler + can choose different # of clusters for terms and documents + can identify saddle terms and saddle documents only hard clustering is possible picture can be too refined with too many clusters as j increases, range for # of clusters becomes too wide EX: j=10, # of clusters is between 10 and 1024 In some sense, terms and docs are treated separately. (due to symmetry requirement)
Clustering with the NMF
Nonnegative Matrix Factorization (Lee and Seung 2000) Mean squared error objective function min A WH 2 F s.t. W, H 0 Nonlinear Optimization Problem convex in W or H, but not both tough to get global min huge # unknowns: mk for W and kn for H (EX: A 70K 10K and k=10 topics 800K unknowns) above objective is one of many possible convergence to local min NOT guaranteed for any algorithm
Multiplicative update rules Lee-Seung 2000 Hoyer 2002 NMF Algorithms Gradient Descent Hoyer 2004 Berry-Plemmons 2004 Lin 2007 Alternating Least Squares Paatero 1994 ACLS AHCLS
NMF Algorithm: Lee and Seung 2000 Mean Squared Error objective function min A WH 2 F s.t. W, H 0 W = abs(randn(m,k)); H = abs(randn(k,n)); for i = 1 : maxiter H = H.* (W T A)./(W T WH + 10 9 ); W = W.* (AH T )./(WHH T + 10 9 ); end (proof of convergence to fixed point based on E-M convergence proof) (objective function tails off after 50-100 iterations)
NMF Algorithm: Lee and Seung 2000 Divergence objective function min i,j (A ij log A ij [WH] ij A ij +[WH] ij ) s.t. W, H 0 W = abs(randn(m,k)); H = abs(randn(k,n)); for i = 1 : maxiter H = H.* (W T (A./ (WH + 10 9 )))./ W T ee T ; W = W.* ((A./ (WH + 10 9 ))H T )./ee T H T ; end (proof of convergence to fixed point based on E-M convergence proof) (objective function tails off after 50-100 iterations)
Multiplicative Update Summary Pros Cons + convergence theory: guaranteed to converge to fixed point + good initialization W (0), H (0) speeds convergence and gets to better fixed point fixed point may be local min or saddle point good initialization W (0), H (0) speeds convergence and gets to better fixed point slow: many M-M multiplications at each iteration hundreds/thousands of iterations until convergence no sparsity of W and H incorporated into mathematical setup 0 elements locked
Multiplicative Update and Locking During iterations of mult. update algorithms, once an element in W or H becomes 0, it can never become positive. Implications for W: In order to improve objective function, algorithm can only take terms out, not add terms, to topic vectors. Very inflexible: once algorithm starts down a path for a topic vector, it must continue in that vein. ALS-type algorithms do not lock elements, greater flexibility allows them to escape from path heading towards poor local min
Sparsity Measures Berry et al. x 2 2 Hoyer spar(x n 1 )= n x 1 / x 2 n 1 Diversity measure E (p) (x) = n i=1 x i p, 0 p 1 E (p) (x) = n i=1 x i p,p<0 Rao and Kreutz-Delgado: algorithms for minimizing E (p) (x) s.t. Ax = b, but expensive iterative procedure Ideal nnz(x) not continuous, NP-hard to use this in optim.
NMF Algorithm: Berry et al. 2004 Gradient Descent Constrained Least Squares W = abs(randn(m,k)); (scale cols of W to unit norm) H = zeros(k,n); for i = 1 : maxiter CLS forj=1: #docs, solve min H j A j WH j 2 2 + λ H j 2 2 s.t. H j 0 GD W = W.* (AH T )./(WHH T + 10 9 ); (scale cols of W) end
NMF Algorithm: Berry et al. 2004 Gradient Descent Constrained Least Squares W = abs(randn(m,k)); (scale cols of W to unit norm) H = zeros(k,n); for i = 1 : maxiter CLS forj=1: #docs, solve min H j A j WH j 2 2 + λ H j 2 2 s.t. H j 0 solve for H: (W T W + λ I) H = W T A; (small matrix solve) GD W = W.* (AH T )./(WHH T + 10 9 ); (scale cols of W) end (objective function tails off after 15-30 iterations)
Berry et al. 2004 Summary Pros Cons + fast: less work per iteration than most other NMF algorithms + fast: small # of iterations until convergence + sparsity parameter for H 0 elements in W are locked no sparsity parameter for W ad hoc nonnegativity: negative elements in H are set to 0, could run lsqnonneg or snnls instead no convergence theory
PMF Algorithm: Paatero & Tapper 1994 Mean Squared Error Alternating Least Squares min A WH 2 F s.t. W, H 0 W = abs(randn(m,k)); for i = 1 : maxiter LS for j = 1 : #docs, solve LS min H j A j WH j 2 2 s.t. H j 0 for j = 1 : #terms, solve min Wj A j W j H 2 2 s.t. W j 0 end
ALS Algorithm W = abs(randn(m,k)); for i = 1 : maxiter LS solve matrix equation W T WH = W T A for H nonneg H = H. (H >= 0) LS solve matrix equation HH T W T = HA T for W nonneg W = W. (W >= 0) end
ALS Summary Pros + fast + works well in practice + speedy convergence + only need to initialize W (0) + 0 elements not locked Cons no sparsity of W and H incorporated into mathematical setup ad hoc nonnegativity: negative elements are set to 0 ad hoc sparsity: negative elements are set to 0 no convergence theory
Alternating Constrained Least Squares If the very fast ALS works well in practice and no NMF algorithms guarantee convergence to local min, why not use ALS? W = abs(randn(m,k)); for i = 1 : maxiter CLS forj=1: #docs, solve CLS forj=1: #terms, solve min H j A j WH j 2 2 + λ H H j 2 2 s.t. H j 0 min Wj A j W j H 2 2 + λ W W j 2 2 s.t. W j 0 end
Alternating Constrained Least Squares If the very fast ALS works well in practice and no NMF algorithms guarantee convergence to local min, why not use ALS? W = abs(randn(m,k)); for i = 1 : maxiter cls solve for H: (W T W + λ H I) H = W T A nonneg H = H. (H >= 0) cls solve for W: (HH T + λ W I) W T = HA T nonneg W = W. (W >= 0) end
ACLS Summary Pros Cons + fast: 6.6 sec vs. 9.8 sec (gd-cls) + works well in practice + speedy convergence + only need to initialize W (0) + 0 elements not locked + allows for sparsity in both W and H ad hoc nonnegativity: after LS, negative elements set to 0, could run lsqnonneg or snnls instead (doesn t improve accuracy much) no convergence theory
ACLS + spar(x) Is there a better way to measure sparsity and still maintain speed of ACLS? spar(x n 1 )= n x 1 / x 2 n 1 ((1 spar(x)) n+spar(x)) x 2 x 1 =0 (spar(w j )=α W and spar(h j )=α H ) W = abs(randn(m,k)); for i = 1 : maxiter cls cls for j = 1 : #docs, solve min H j A j WH j 2 2 + λ H(((1 α H ) k + α H ) H j 2 2 H j 2 1 ) for j = 1 : #terms, solve s.t. H j 0 min Wj A j W j H 2 2 + λ W(((1 α W ) k + α W ) W j 2 2 W j 2 1 ) s.t. W j 0 end
AHCLS (spar(w j )=α W and spar(h j )=α H ) W = abs(randn(m,k)); β H =((1 α H ) k + α H ) 2 β W =((1 α W ) k + α W ) 2 for i = 1 : maxiter cls solve for H: (W T W + λ H β H I λ H E) H = W T A nonneg H = H. (H >= 0) cls solve for W: (HH T + λ W β W I λ W E) W T = HA T nonneg W = W. (W >= 0) end
AHCLS Summary 122 Convergence of GDCLS and AHCLS Algorithms on MED dataset Pros + fast: 6.8 vs. 9.8 sec (gd-cls) + works well in practice + speedy convergence + only need to initialize W (0) + 0 elements not locked Accuracy A -WH F 121.5 121 120.5 120 119. 5 119 AHCLS GDCLS SVD Error = 118.49 118.5 0 5 10 15 20 25 30 35 40 45 50 number of iterations + allows for more explicit sparsity in both W and H Cons ad hoc nonnegativity: after LS, negative elements set to 0, could run lsqnonneg or snnls instead (doesn t improve accuracy much) convergence to stationary point only guaranteed when nonnegativity properly enforced.
Clustering with the NMF Clustering Terms use rows of W m k = cl.1 cl.2... cl.k term1.9 0....3 term2.1.8....2......... Clustering Documents use cols of H k n = doc1 doc2... docn cl.1.4 0....5......... cl.k 0.8....2 soft clustering is very natural
The Enron Email Dataset (SAS) PRIVATE email collection of 150 Enron employees during 2001 92,000 terms and 65,000 messages Term-by-Message Matrix f astow1 f astow2 skilling1.......... subpoena 2 0 1... dynegy 0 3 0..........
Summary of NMF Clustering + appealing picture of clusters + some NMF algorithms are faster than SVD + terms and docs are treated in integrated fashion + hard and soft clustering is possible user must choose k, a fixed # of clusters clusters have no particular order, similar clusters do not appear near each other
Conclusions SVD variable # of clusters returned clusters ordered so that similar clusters are nearby each other can choose different # of clusters for terms and documents can identify saddle terms and saddle documents NMF fast sparse factorization terms and docs are treated in integrated fashion hard and soft clustering is possible