Clustering. SVD and NMF

Clustering with the SVD and NMF Amy Langville Mathematics Department College of Charleston Dagstuhl 2/14/2007

Outline: Fiedler Method; Extended Fiedler Method and SVD; Clustering with SVD vs. NMF; Demos with the vismatrix Tool

Fiedler Method

Fiedler Method. The Laplacian matrix L = D - A. A must be symmetric; else, symmetrize by A^T A, AA^T, or A + A^T. Properties of L: L is symmetric positive semidefinite, so all λ_i ≥ 0; there is one zero eigenvalue per connected component; if the graph is connected (a single s.c.c.), then λ_1 = 0 and λ_i > 0 for i = 2, ..., n. The subdominant eigenvector v_2 gives info on clustering.

Understanding Graph Theory. [figure: a tangled 10-node graph] Given this graph, there are no apparent clusters.

Understanding Graph Theory. [figure: the same graph, progressively redrawn over several slides] However, suppose we untangle the web.

Understanding Graph Theory. [figure: the untangled graph, with two groups of nodes now visible] Although the clusters are now apparent, we need a better method.

Using the Fiedler Vector. Generally, v_2 is used recursively to partition the graph by separating the components into negative and positive values. sign(v_2) = [-, -, -, +, +, +, -, -, -, +] [figure: the graph split into the two sign-based clusters {1, 2, 3, 7, 8, 9} and {4, 5, 6, 10}]

A Fiedler Clustering Example. [figure: the example graph and its matrix reordered by sign(v_2), so that the two clusters {4, 5, 6, 10} and {1, 2, 3, 7, 8, 9} appear as blocks]

Reminder: Problems with the Laplacian. The Laplacian method requires an undirected graph, a symmetric matrix, and a square matrix. Zero may not always be the best threshold for partitioning the values of the eigenvector v_2. Recursive algorithms are expensive.

Why does the Fiedler vector cluster? Two-way partition: write A = [A_1 A_2; A_3 A_4] and D = [D_1 0; 0 D_4]. Assign each node to one of two clusters with decision variables x_i: x_i = 1 if node i goes in cluster 1, x_i = -1 if node i goes in cluster 2. Objective: minimize the number of between-cluster links and maximize the number of in-cluster links: min_x x^T L x. Suppose x = [e; -e]. Then x^T L x = (e^T D_1 e + e^T D_4 e) + (e^T A_2 e + e^T A_3 e) - (e^T A_1 e + e^T A_4 e), where the degree terms are for balancing, the A_2, A_3 terms count between-cluster links, and the A_1, A_4 terms count in-cluster links.

Why does the Fiedler vector cluster? Optimization problem: min_x x^T L x over discrete x is NP-hard. Relax from discrete to continuous values for x. By the Rayleigh theorem, min_{||x||_2 = 1} x^T L x = λ_1, with x the eigenvector corresponding to λ_1. BUT x = e, which is not helpful for clustering!

Optimization Solution. Add the constraint x^T e = 0. By the Courant-Fischer theorem, min_{||x||_2 = 1, x^T e = 0} x^T L x = λ_2, with x = v_2 the eigenvector corresponding to λ_2. v_2 is called the Fiedler vector.
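As a concrete illustration of the method, here is a minimal Python/NumPy sketch of the two-way Fiedler split (the helper name, the dense eigensolver, and the toy graph are my own illustrative choices, not part of the talk):

    import numpy as np

    def fiedler_partition(A):
        """Split an undirected graph (symmetric 0/1 adjacency matrix A) by the sign of v_2."""
        D = np.diag(A.sum(axis=1))        # degree matrix
        L = D - A                         # graph Laplacian
        vals, vecs = np.linalg.eigh(L)    # eigenvalues in ascending order
        v2 = vecs[:, 1]                   # Fiedler vector (eigenvector for lambda_2)
        return v2 >= 0                    # boolean cluster labels

    # toy example: two triangles joined by a single edge
    A = np.zeros((6, 6), dtype=int)
    for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
        A[i, j] = A[j, i] = 1
    print(fiedler_partition(A))           # separates {0, 1, 2} from {3, 4, 5}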

Extended Fiedler

Other subdominant eigenvectors. If v_2 gives approximate info about clustering, what about the other subdominant eigenvectors v_3, v_4, ...? Since L is symmetric, it has an orthogonal eigendecomposition. Thus, for i ≠ 1, v_i ⊥ v_1 = e; so for i ≠ 1, v_i sums to 0 and is centered about 0. This explains why, when bisecting a graph, a split about 0 makes sense.

Example Graph

Clustered Example Graph using eigenvectors v_2, v_3, v_4. Nodes 4 and 18 are saddle nodes that straddle a few clusters.

Clustered Example Eigenvectors. [figure: sign patterns (+/-) of v_2, v_3, v_4 across the node groups; 6 clusters found, with 6 ∈ [3, 2^3]] Sign patterns in the eigenvectors give clusters and saddle nodes. The number of clusters found is not fixed, but lies in [j, 2^j], where j is the number of eigenvectors used. This takes a global view for clustering (whereas recursive Fiedler tunnels down, despite possible errors at the initial iterations).

Clustering with the SVD

Applying this to the SVD: A ≈ A_k = U_k Σ_k V_k^T. The left singular vectors in U_k of the term-by-document matrix A are eigenvectors of AA^T, so the sign patterns in U_k cluster terms. The right singular vectors in V_k are eigenvectors of A^T A, so the sign patterns in V_k cluster documents.

Pseudocode For Term Clustering
k = truncation point for SVD
input U_k (matrix of k left singular vectors of A)
input j (user-defined scalar; # of clusters lies in [j, 2^j]; note j ≤ k)
create B = (U_k(:, 1:j) >= 0);   % binary matrix with sign patterns
associate a number x(i) with each unique sign pattern:
x = zeros(n,1);                  % one entry per term (row of A)
for i = 1:j
    x = x + 2^(j-i) * B(:,i);
end
reorder the rows of A by the indices of the sorted vector x
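A rough Python/NumPy transliteration of this pseudocode, assuming a dense matrix small enough for a full SVD (the function name and the random stand-in matrix are illustrative only):

    import numpy as np

    def svd_sign_clusters(A, j):
        """Cluster the rows (terms) of A by the sign patterns of its first j left singular vectors."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        B = (U[:, :j] >= 0).astype(int)             # binary sign-pattern matrix
        weights = 2 ** np.arange(j - 1, -1, -1)     # 2^(j-1), ..., 2^0
        x = B @ weights                             # one integer label per term
        order = np.argsort(x)                       # reorder rows so clusters are contiguous
        return x, order

    A = np.random.rand(20, 8)                       # stand-in for a term-by-document matrix
    labels, order = svd_sign_clusters(A, j=3)       # between 3 and 2^3 = 8 clusters
    print(labels[order])

Clustering documents works the same way with V_k in place of U_k.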

vismatrix tool (David Gleich)

SVD Clustering of Reuters10: 72K terms, 9K docs, 415K nonzeros. j = 10 for terms produces 486 term clusters; j = 4 for documents produces 8 document clusters.

SVD Clustering of Reuters10 (continued): with j = 10 for terms (486 term clusters) and j = 10 for documents, 391 document clusters are produced.

Summary of SVD Clustering
+ variable # of clusters returned, between j and 2^j
+ sign pattern idea allows for natural division within clusters
+ clusters ordered so that similar clusters are nearby each other
+ less work than recursive Fiedler
+ can choose different # of clusters for terms and documents
+ can identify saddle terms and saddle documents
- only hard clustering is possible
- picture can be too refined, with too many clusters
- as j increases, the range for # of clusters becomes too wide (e.g., j = 10 gives between 10 and 1024 clusters)
- in some sense, terms and docs are treated separately (due to the symmetry requirement)

Clustering with the NMF

Nonnegative Matrix Factorization (Lee and Seung 2000). Mean squared error objective function: min ||A - WH||_F^2 s.t. W, H ≥ 0. A nonlinear optimization problem: convex in W or H, but not both; tough to get the global min; huge # of unknowns: mk for W and kn for H (e.g., A of size 70K × 10K with k = 10 topics gives 800K unknowns); the above objective is one of many possible; convergence to a local min is NOT guaranteed for any algorithm.

NMF Algorithms: multiplicative update rules (Lee-Seung 2000, Hoyer 2002); gradient descent (Hoyer 2004, Berry-Plemmons 2004, Lin 2007); alternating least squares (Paatero 1994, ACLS, AHCLS).

NMF Algorithm: Lee and Seung 2000. Mean squared error objective function: min ||A - WH||_F^2 s.t. W, H ≥ 0
W = abs(randn(m,k)); H = abs(randn(k,n));
for i = 1 : maxiter
    H = H .* (W^T A) ./ (W^T W H + 10^-9);
    W = W .* (A H^T) ./ (W H H^T + 10^-9);
end
(proof of convergence to a fixed point is based on the E-M convergence proof; the objective function tails off after 50-100 iterations)
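A direct NumPy transliteration of these updates (the random initialization, stand-in matrix, and iteration count are placeholders; this is a sketch, not the authors' code):

    import numpy as np

    def nmf_mu(A, k, maxiter=100, eps=1e-9):
        """Multiplicative updates for min ||A - WH||_F^2 s.t. W, H >= 0."""
        m, n = A.shape
        W = np.abs(np.random.randn(m, k))
        H = np.abs(np.random.randn(k, n))
        for _ in range(maxiter):
            H *= (W.T @ A) / (W.T @ W @ H + eps)
            W *= (A @ H.T) / (W @ H @ H.T + eps)
        return W, H

    A = np.abs(np.random.rand(100, 40))             # stand-in for a nonnegative data matrix
    W, H = nmf_mu(A, k=5)
    print(np.linalg.norm(A - W @ H, 'fro'))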

NMF Algorithm: Lee and Seung 2000. Divergence objective function: min Σ_{i,j} (A_ij log(A_ij / [WH]_ij) - A_ij + [WH]_ij) s.t. W, H ≥ 0
W = abs(randn(m,k)); H = abs(randn(k,n));
for i = 1 : maxiter
    H = H .* (W^T (A ./ (W H + 10^-9))) ./ (W^T e e^T);
    W = W .* ((A ./ (W H + 10^-9)) H^T) ./ (e e^T H^T);
end
(proof of convergence to a fixed point is based on the E-M convergence proof; the objective function tails off after 50-100 iterations)
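And a matching NumPy sketch for the divergence updates, again a rough transliteration of the slide (the column sums of W and row sums of H play the role of W^T e e^T and e e^T H^T):

    import numpy as np

    def nmf_mu_kl(A, k, maxiter=100, eps=1e-9):
        """Multiplicative updates for the divergence objective."""
        m, n = A.shape
        W = np.abs(np.random.randn(m, k))
        H = np.abs(np.random.randn(k, n))
        for _ in range(maxiter):
            H *= (W.T @ (A / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
            W *= ((A / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
        return W, H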

Multiplicative Update Summary
Pros:
+ convergence theory: guaranteed to converge to a fixed point
+ good initialization W(0), H(0) speeds convergence and gets to a better fixed point
Cons:
- the fixed point may be a local min or a saddle point
- slow: many matrix-matrix multiplications at each iteration
- hundreds/thousands of iterations until convergence
- no sparsity of W and H incorporated into the mathematical setup
- 0 elements are locked

Multiplicative Update and Locking. During the iterations of multiplicative update algorithms, once an element in W or H becomes 0, it can never become positive again. Implications for W: to improve the objective function, the algorithm can only take terms out of, not add terms to, topic vectors. Very inflexible: once the algorithm starts down a path for a topic vector, it must continue in that vein. ALS-type algorithms do not lock elements; this greater flexibility allows them to escape from a path heading toward a poor local min.

Sparsity Measures
Berry et al.: ||x||_2^2
Hoyer: spar(x_{n×1}) = (√n - ||x||_1/||x||_2) / (√n - 1)
Diversity measure: E^(p)(x) = Σ_{i=1}^n |x_i|^p for 0 ≤ p ≤ 1, and E^(p)(x) = -Σ_{i=1}^n |x_i|^p for p < 0
Rao and Kreutz-Delgado: algorithms for minimizing E^(p)(x) s.t. Ax = b, but an expensive iterative procedure
Ideal: nnz(x), but it is not continuous and NP-hard to use in optimization
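Hoyer's measure is cheap to compute; a small helper (my own, not from the talk) shows the two extremes:

    import numpy as np

    def hoyer_sparsity(x):
        """spar(x) = (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1); 0 for a flat vector, 1 for a 1-sparse one."""
        x = np.asarray(x, dtype=float)
        n = x.size
        return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

    print(hoyer_sparsity([1, 1, 1, 1]))   # 0.0
    print(hoyer_sparsity([0, 0, 3, 0]))   # 1.0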

NMF Algorithm: Berry et al. 2004. Gradient Descent - Constrained Least Squares (GD-CLS)
W = abs(randn(m,k)); (scale cols of W to unit norm)
H = zeros(k,n);
for i = 1 : maxiter
    CLS: for j = 1 : #docs, solve min_{H_j} ||A_j - W H_j||_2^2 + λ ||H_j||_2^2 s.t. H_j ≥ 0,
         i.e., solve for H: (W^T W + λ I) H = W^T A (small matrix solve)
    GD:  W = W .* (A H^T) ./ (W H H^T + 10^-9); (scale cols of W)
end
(objective function tails off after 15-30 iterations)
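A NumPy sketch of this GD-CLS loop as written on the slide: a ridge solve for H with negatives clipped (the "ad hoc nonnegativity" noted on the next slide) and a multiplicative step for W; the parameter values are placeholders:

    import numpy as np

    def gdcls(A, k, lam=0.1, maxiter=30, eps=1e-9):
        """Gradient Descent - Constrained Least Squares NMF, in the style of Berry et al. 2004."""
        m, n = A.shape
        W = np.abs(np.random.randn(m, k))
        W /= np.linalg.norm(W, axis=0)                 # scale cols of W to unit norm
        H = np.zeros((k, n))
        for _ in range(maxiter):
            # CLS step: solve (W^T W + lam I) H = W^T A, then clip negatives
            H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ A)
            H[H < 0] = 0
            # GD (multiplicative) step for W, then rescale its columns
            W *= (A @ H.T) / (W @ H @ H.T + eps)
            W /= np.linalg.norm(W, axis=0) + eps
        return W, H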

Berry et al. 2004 Summary
Pros:
+ fast: less work per iteration than most other NMF algorithms
+ fast: small # of iterations until convergence
+ sparsity parameter for H
Cons:
- 0 elements in W are locked
- no sparsity parameter for W
- ad hoc nonnegativity: negative elements in H are set to 0; could run lsqnonneg or snnls instead
- no convergence theory

PMF Algorithm: Paatero & Tapper 1994. Mean squared error, Alternating Least Squares: min ||A - WH||_F^2 s.t. W, H ≥ 0
W = abs(randn(m,k));
for i = 1 : maxiter
    LS: for j = 1 : #docs, solve min_{H_j} ||A_j - W H_j||_2^2 s.t. H_j ≥ 0
    LS: for j = 1 : #terms, solve min_{W_j} ||A_j - W_j H||_2^2 s.t. W_j ≥ 0 (here A_j and W_j denote rows)
end

ALS Algorithm
W = abs(randn(m,k));
for i = 1 : maxiter
    LS: solve the matrix equation W^T W H = W^T A for H
    nonneg: H = H .* (H >= 0)
    LS: solve the matrix equation H H^T W^T = H A^T for W
    nonneg: W = W .* (W >= 0)
end

ALS Summary
Pros:
+ fast
+ works well in practice
+ speedy convergence
+ only need to initialize W(0)
+ 0 elements not locked
Cons:
- no sparsity of W and H incorporated into the mathematical setup
- ad hoc nonnegativity: negative elements are set to 0
- ad hoc sparsity: negative elements are set to 0
- no convergence theory

Alternating Constrained Least Squares (ACLS). If the very fast ALS works well in practice and no NMF algorithm guarantees convergence to a local min, why not use ALS?
W = abs(randn(m,k));
for i = 1 : maxiter
    CLS: for j = 1 : #docs, solve min_{H_j} ||A_j - W H_j||_2^2 + λ_H ||H_j||_2^2 s.t. H_j ≥ 0
    CLS: for j = 1 : #terms, solve min_{W_j} ||A_j - W_j H||_2^2 + λ_W ||W_j||_2^2 s.t. W_j ≥ 0
end

Alternating Constrained Least Squares, implemented with normal equations:
W = abs(randn(m,k));
for i = 1 : maxiter
    cls: solve for H: (W^T W + λ_H I) H = W^T A
    nonneg: H = H .* (H >= 0)
    cls: solve for W: (H H^T + λ_W I) W^T = H A^T
    nonneg: W = W .* (W >= 0)
end
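A compact NumPy sketch of ACLS exactly as laid out above; setting λ_H = λ_W = 0 recovers the plain ALS loop from the earlier slide (the parameter defaults are mine):

    import numpy as np

    def acls(A, k, lam_W=0.1, lam_H=0.1, maxiter=50):
        """Alternating Constrained (ridge-regularized) Least Squares NMF."""
        m, n = A.shape
        W = np.abs(np.random.randn(m, k))     # only W needs initializing
        for _ in range(maxiter):
            H = np.linalg.solve(W.T @ W + lam_H * np.eye(k), W.T @ A)
            H[H < 0] = 0                      # ad hoc nonnegativity
            W = np.linalg.solve(H @ H.T + lam_W * np.eye(k), H @ A.T).T
            W[W < 0] = 0
        return W, H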

ACLS Summary
Pros:
+ fast: 6.6 sec vs. 9.8 sec (GD-CLS)
+ works well in practice
+ speedy convergence
+ only need to initialize W(0)
+ 0 elements not locked
+ allows for sparsity in both W and H
Cons:
- ad hoc nonnegativity: after LS, negative elements are set to 0; could run lsqnonneg or snnls instead (doesn't improve accuracy much)
- no convergence theory

ACLS + spar(x). Is there a better way to measure sparsity and still maintain the speed of ACLS? From spar(x_{n×1}) = (√n - ||x||_1/||x||_2)/(√n - 1) we get ((1 - spar(x))√n + spar(x)) ||x||_2 - ||x||_1 = 0. (Set spar(W_j) = α_W and spar(H_j) = α_H.)
W = abs(randn(m,k));
for i = 1 : maxiter
    cls: for j = 1 : #docs, solve min_{H_j} ||A_j - W H_j||_2^2 + λ_H (((1 - α_H)√k + α_H)^2 ||H_j||_2^2 - ||H_j||_1^2) s.t. H_j ≥ 0
    cls: for j = 1 : #terms, solve min_{W_j} ||A_j - W_j H||_2^2 + λ_W (((1 - α_W)√k + α_W)^2 ||W_j||_2^2 - ||W_j||_1^2) s.t. W_j ≥ 0
end

AHCLS (set spar(W_j) = α_W and spar(H_j) = α_H; E is the k × k matrix of all ones)
W = abs(randn(m,k));
β_H = ((1 - α_H)√k + α_H)^2
β_W = ((1 - α_W)√k + α_W)^2
for i = 1 : maxiter
    cls: solve for H: (W^T W + λ_H β_H I - λ_H E) H = W^T A
    nonneg: H = H .* (H >= 0)
    cls: solve for W: (H H^T + λ_W β_W I - λ_W E) W^T = H A^T
    nonneg: W = W .* (W >= 0)
end
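The same loop with the β and E terms of AHCLS transliterated directly from the slide (E = ones(k, k); the α and λ defaults are placeholders, and this sketch includes no safeguards for ill-conditioned k × k systems):

    import numpy as np

    def ahcls(A, k, lam_W=0.1, lam_H=0.1, alpha_W=0.5, alpha_H=0.5, maxiter=50):
        """AHCLS: ACLS with Hoyer-style target sparsities alpha_W, alpha_H."""
        m, n = A.shape
        W = np.abs(np.random.randn(m, k))
        E = np.ones((k, k))
        beta_H = ((1 - alpha_H) * np.sqrt(k) + alpha_H) ** 2
        beta_W = ((1 - alpha_W) * np.sqrt(k) + alpha_W) ** 2
        for _ in range(maxiter):
            H = np.linalg.solve(W.T @ W + lam_H * beta_H * np.eye(k) - lam_H * E, W.T @ A)
            H[H < 0] = 0
            W = np.linalg.solve(H @ H.T + lam_W * beta_W * np.eye(k) - lam_W * E, H @ A.T).T
            W[W < 0] = 0
        return W, H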

AHCLS Summary
[figure: convergence of GD-CLS and AHCLS on the MED dataset, plotting accuracy ||A - WH||_F against number of iterations, with the SVD error of 118.49 as a baseline]
Pros:
+ fast: 6.8 vs. 9.8 sec (GD-CLS)
+ works well in practice
+ speedy convergence
+ only need to initialize W(0)
+ 0 elements not locked
+ allows for more explicit sparsity in both W and H
Cons:
- ad hoc nonnegativity: after LS, negative elements are set to 0; could run lsqnonneg or snnls instead (doesn't improve accuracy much)
- convergence to a stationary point is only guaranteed when nonnegativity is properly enforced

Clustering with the NMF. Clustering terms: use the rows of W (m × k), e.g.
           cl.1   cl.2   ...   cl.k
  term1     .9     0     ...    .3
  term2     .1     .8    ...    .2
  ...
Clustering documents: use the columns of H (k × n), e.g.
           doc1   doc2   ...   docn
  cl.1      .4     0     ...    .5
  ...
  cl.k      0      .8    ...    .2
Soft clustering is very natural.
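For instance, once W and H are computed, hard labels fall out of an argmax and soft memberships from normalizing the columns of H (a small illustrative snippet, not from the talk):

    import numpy as np

    # H is k x n: column j holds document j's loadings on the k clusters/topics
    H = np.abs(np.random.rand(5, 12))                  # stand-in for an NMF factor
    hard_labels = H.argmax(axis=0)                     # hard clustering: dominant cluster per document
    soft_memberships = H / (H.sum(axis=0) + 1e-12)     # soft clustering: each column sums to 1
    print(hard_labels)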

The Enron Email Dataset (SAS). PRIVATE email collection of 150 Enron employees during 2001; 92,000 terms and 65,000 messages. Term-by-message matrix, e.g.
             fastow1   fastow2   skilling1   ...
  subpoena      2         0          1       ...
  dynegy        0         3          0       ...
  ...

Summary of NMF Clustering
+ appealing picture of clusters
+ some NMF algorithms are faster than the SVD
+ terms and docs are treated in an integrated fashion
+ both hard and soft clustering are possible
- user must choose k, a fixed # of clusters
- clusters have no particular order; similar clusters do not appear near each other

Conclusions
SVD: variable # of clusters returned; clusters ordered so that similar clusters are nearby each other; can choose different # of clusters for terms and documents; can identify saddle terms and saddle documents.
NMF: fast sparse factorization; terms and docs are treated in an integrated fashion; both hard and soft clustering are possible.