Nonnegative Matrix Factorization in Data Mining


The Nonnegative Matrix Factorization in Data Mining. Amy Langville (langvillea@cofc.edu), Mathematics Department, College of Charleston, Charleston, SC. Yahoo! Research, 10/18/2005.

Outline. Part 1: Historical developments in data mining: the Vector Space Model (1960s-1970s), Latent Semantic Indexing (1990s), other VSM decompositions (1990s). Part 2: Nonnegative Matrix Factorization (2000): applications in image and text mining, algorithms, current and future work.

Vector Space Model (1960s and 1970s). Gerard Salton's information retrieval system SMART: System for the Mechanical Analysis and Retrieval of Text (also "Salton's Magical Automatic Retriever of Text"). Turn n textual documents into n document vectors d_1, d_2, ..., d_n; create the term-by-document matrix A_{m×n} = [d_1 d_2 ... d_n]; to retrieve information, create a query vector q, which is a pseudo-document. GOAL: find the document d_i closest to q. The angular cosine measure is used: δ_i = cos θ_i = q^T d_i / (‖q‖_2 ‖d_i‖_2).
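As a concrete illustration (mine, not from the slides), a minimal MATLAB sketch of this retrieval step with a toy term-by-document matrix and query:

% Vector space model retrieval: rank documents by cosine similarity to a query.
% The small A (m terms x n docs) and q below are illustrative assumptions.
A = [1 0 2;    % term 1 counts in docs 1..3
     0 1 1;    % term 2
     3 0 0];   % term 3
q = [1; 0; 1]; % pseudo-document built from the query terms
% delta_i = cos(theta_i) = q' * d_i / (||q||_2 ||d_i||_2)
delta = (q' * A) ./ (norm(q) * sqrt(sum(A.^2, 1)));
[~, ranking] = sort(delta, 'descend');  % documents ordered by relevance
disp(ranking)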

Latent Semantic Indexing (1990s). Susan Dumais's improvement to the VSM = LSI. Idea: use a low-rank approximation to A to filter out noise. A_{m×n}: rank-r term-by-document matrix. SVD: A = UΣV^T = Σ_{i=1}^r σ_i u_i v_i^T. LSI: use A_k = Σ_{i=1}^k σ_i u_i v_i^T in place of A. Why? Reduce storage when k << r, and filter out uncertainty, so that performance on text mining tasks (e.g., query processing and clustering) improves.
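A minimal MATLAB sketch (mine) of forming A_k with svds, using a random stand-in for the term-by-document matrix; the sizes and truncation point are illustrative only:

% Rank-k approximation A_k = U_k * S_k * V_k' used by LSI.
m = 100; n = 50;
A = rand(m, n);                % stand-in term-by-document matrix
k = 2;                         % illustrative truncation point, k << rank(A)
[U, S, V] = svds(A, k);        % k largest singular triplets
Ak = U * S * V';               % best rank-k approximation in the Frobenius norm
fprintf('||A - A_k||_F = %g\n', norm(A - Ak, 'fro'));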

Properties of the SVD. The basis vectors u_i are orthogonal, but the entries u_ij, v_ij are mixed in sign: in A_k = U_k Σ_k V_k^T, the data matrix is nonnegative while U_k and V_k are mixed in sign (Σ_k is nonnegative). U and V are dense. Uniqueness: while there are many SVD algorithms, they all create the same (truncated) factorization. Optimality: of all rank-k approximations, A_k is optimal in the Frobenius norm, ‖A − A_k‖_F = min_{rank(B) ≤ k} ‖A − B‖_F.

Strengths and Weaknesses of LSI. Strengths: using A_k in place of A gives improved performance; dimension reduction considers only the essential components of the term-by-document matrix and filters out noise; best rank-k approximation. Weaknesses: storage (U_k and V_k are usually completely dense); interpretation of the basis vectors u_i is impossible due to mixed signs; a good truncation point k is hard to determine; the orthogonality restriction.

Other Low-Rank Approximations. QR decomposition; any URV^T factorization; semidiscrete decomposition (SDD): A_k = X_k D_k Y_k^T, where D_k is diagonal and the elements of X_k, Y_k lie in {−1, 0, 1}. BUT all of these create basis vectors that are mixed in sign, and negative elements make interpretation difficult.

The Power of Positivity. "Positive anything is better than negative nothing." (Elbert Hubbard) "It takes but one positive thought when given a chance to survive and thrive to overpower an entire army of negative thoughts." (Robert H. Schuller) "Learn to think like a winner. Think positive and visualize your strengths." (Vic Braden) "Positive thinking will let you do everything better than negative thinking will." (Zig Ziglar)

The Power of Nonnegativity. "Nonnegative anything is better than negative nothing." (Elbert Hubbard) "It takes but one nonnegative thought when given a chance to survive and thrive to overpower an entire army of negative thoughts." (Robert H. Schuller) "Learn to think like a winner. Think nonnegative and visualize your strengths." (Vic Braden) "Nonnegative thinking will let you do everything better than negative thinking will." (Zig Ziglar)

Nonnegative Matrix Factorization (2000). Daniel Lee and Sebastian Seung's nonnegative matrix factorization. Idea: use a low-rank approximation with nonnegative factors to improve LSI. In the SVD-based A_k = U_k Σ_k V_k^T, the data is nonnegative but U_k and V_k are mixed in sign; in the NMF A_k = W_k H_k, both factors W_k and H_k are nonnegative.

Interpretation with NMF. The columns of W are the underlying basis vectors, i.e., each of the n columns of A can be built from k columns of W; the columns of H give the weights associated with each basis vector: A_k e_1 = W_k H_{*1} = h_11 w_1 + h_21 w_2 + ... + h_k1 w_k. Since W_k, H_k ≥ 0, this gives an immediate interpretation (an additive parts-based representation).
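A small MATLAB check (mine, with random nonnegative factors standing in for a real NMF) that the first column of W*H really is the weighted sum of the basis vectors:

% A_k * e_1 = W * H(:,1) = h(1,1)*w_1 + h(2,1)*w_2 + ... + h(k,1)*w_k
m = 6; n = 4; k = 3;
W = rand(m, k);  H = rand(k, n);        % any nonnegative factors
doc1  = W * H(:, 1);                    % first column of the approximation
parts = zeros(m, 1);
for i = 1:k
    parts = parts + H(i, 1) * W(:, i);  % add up the weighted basis vectors
end
fprintf('max difference: %g\n', max(abs(doc1 - parts)));   % ~ 0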

Image Mining. (Figure: an image, column A_{*i}, approximated by the NMF as W H_{*i} and by the SVD as U Σ V_{*i}.)

Image Mining Applications. Data compression; find similar images; cluster images. (Figure: original image (r = 400) and reconstructed images (k = 100).)

Text Mining. MED dataset (k = 10). (Figure: the ten highest weighted terms in four of the basis vectors of W. One basis vector is dominated by ventricular, aortic, septal, left, defect, regurgitation, ventricle, valve, cardiac, pressure; another by oxygen, flow, pressure, blood, cerebral, hypothermia, fluid, venous, arterial, perfusion; another by children, child, autistic, speech, group, early, visual, anxiety, emotional, autism; another by kidney, marrow, dna, cells, nephrectomy, unilateral, lymphocytes, bone, thymidine, rats.)

Text Mining. Polysemes (words with multiple meanings) are broken across several basis vectors w_i.

Text Mining Applications. Data compression; find similar terms; find similar documents; cluster documents; topic detection and tracking.

Text Mining Applications. Enron email messages, 2001.

Recommendation Systems. The purchase-history matrix A is an item-by-user matrix: rows Item 1, ..., Item m, columns User 1, ..., User n, with entry (i, j) recording user j's purchases of item i (e.g., row Item 1 = 1 5 ... 0, row Item 2 = 0 0 ... 1, row Item m = 0 1 ... 2). Create profiles for classes of users from the basis vectors w_i; find similar users; find similar items.
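As an illustration only (not from the talk), a toy purchase matrix factored with MATLAB's nnmf from the Statistics and Machine Learning Toolbox, assuming that toolbox is available:

% Factor a tiny item-by-user purchase matrix; columns of W act as item
% profiles for classes of users, columns of H as user loadings.
A = [1 5 0 2;     % Item 1 purchases by Users 1..4
     0 0 1 0;     % Item 2
     0 1 2 4];    % Item 3
k = 2;
[W, H] = nnmf(A, k);     % A ~ W*H with W, H nonnegative
% H(:, j) says how strongly user j loads on each of the k profiles;
% users with similar columns of H are "similar users".
disp(H)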

Properties of NMF. The basis vectors w_i are not orthogonal, so there can be overlap of topics; W and H can be restricted to be sparse. W_k, H_k ≥ 0 gives an immediate interpretation (an additive parts-based representation): the large entries of basis vector w_i identify the terms j it is mostly about, and h_i1 says how much document 1 points in the direction of topic vector w_i, since A_k e_1 = W_k H_{*1} = h_11 w_1 + h_21 w_2 + ... + h_k1 w_k. NMF is algorithm-dependent: W and H are not unique.

Computation of NMF (Lee and Seung 2000). Mean squared error objective function: min ‖A − WH‖_F^2 s.t. W, H ≥ 0. This is a nonlinear optimization problem: convex in W or in H, but not in both, so it is tough to get the global minimum; there is a huge number of unknowns, mk for W and kn for H (e.g., a 70,000 × 1,000 matrix A with k = 10 topics gives over 700,000 unknowns); the objective above is one of many possible; convergence to a local minimum is NOT guaranteed for any algorithm.

NMF Algorithms. Multiplicative update rules: Lee-Seung 2000, Hoyer 2002. Gradient descent: Hoyer 2004, Berry-Plemmons 2004. Alternating least squares: Paatero 1994, ACLS, AHCLS.

NMF Algorithm: Lee and Seung 2000, mean squared error objective function: min ‖A − WH‖_F^2 s.t. W, H ≥ 0.
W = abs(randn(m,k)); H = abs(randn(k,n));
for i = 1:maxiter
    H = H .* (W'*A) ./ (W'*W*H + 1e-9);
    W = W .* (A*H') ./ (W*H*H' + 1e-9);
end
Many parameters affect performance (k, objective function, sparsity constraints, algorithm, etc.). NMF is not unique! (The proof of convergence to a fixed point is based on the E-M convergence proof.)
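For convenience, the same update rules wrapped as a self-contained MATLAB function (my sketch; the fixed iteration count and the example sizes in the commented call are illustrative, not the speaker's settings):

% Save as nmf_mult.m
function [W, H] = nmf_mult(A, k, maxiter)
% Multiplicative-update NMF (Lee & Seung, mean squared error objective).
% Minimizes ||A - W*H||_F^2 subject to W, H >= 0.
[m, n] = size(A);
W = abs(randn(m, k));  H = abs(randn(k, n));
eps0 = 1e-9;                          % guard against division by zero
for i = 1:maxiter
    H = H .* (W' * A) ./ (W' * W * H + eps0);
    W = W .* (A * H') ./ (W * H * H' + eps0);
end
end

% Example call:
%   A = rand(100, 50);            % nonnegative data matrix
%   [W, H] = nmf_mult(A, 10, 200);
%   norm(A - W*H, 'fro')          % reconstruction error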

NMF Algorithm: Lee and Seung 2000, divergence objective function: min Σ_{i,j} (A_ij log(A_ij / [WH]_ij) − A_ij + [WH]_ij) s.t. W, H ≥ 0.
W = abs(randn(m,k)); H = abs(randn(k,n));
for i = 1:maxiter
    H = H .* (W'*(A ./ (W*H + 1e-9))) ./ (W'*ones(m,n));
    W = W .* ((A ./ (W*H + 1e-9))*H') ./ (ones(m,n)*H');
end
Here ones(m,n) plays the role of e e^T, the all-ones matrix. (The proof of convergence to a fixed point is based on the E-M convergence proof; the objective function tails off after 50-100 iterations.)

Multiplicative Update Summary. Pros: + convergence theory: guaranteed to converge to a fixed point; + good initialization W^(0), H^(0) speeds convergence and gets to a better fixed point. Cons: the fixed point may be a local minimum or a saddle point; slow: many matrix-matrix multiplications at each iteration; hundreds or thousands of iterations until convergence; no sparsity of W and H incorporated into the mathematical setup; 0 elements are locked.

Multiplicative Update and Locking. During the iterations of multiplicative update algorithms, once an element in W or H becomes 0, it can never become positive again. Implications for W: in order to improve the objective function, the algorithm can only take terms out of, not add terms to, topic vectors. Very inflexible: once the algorithm starts down a path for a topic vector, it must continue in that vein. ALS-type algorithms do not lock elements; this greater flexibility allows them to escape from a path heading towards a poor local minimum, as the small experiment below illustrates.
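A tiny MATLAB experiment (mine, with arbitrary random data) showing the locking effect: an entry of H that is zeroed stays exactly zero under the multiplicative updates, since it is only ever multiplied by a finite factor.

% Once H(i,j) = 0, the update H = H .* (...) keeps it at 0 forever.
A = rand(5, 4); k = 2;
W = abs(randn(5, k));  H = abs(randn(k, 4));
H(1, 1) = 0;                                  % force one entry to zero
for i = 1:50
    H = H .* (W' * A) ./ (W' * W * H + 1e-9);
    W = W .* (A * H') ./ (W * H * H' + 1e-9);
end
fprintf('H(1,1) after 50 iterations: %g\n', H(1, 1));   % still exactly 0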

Sparsity Measures. Berry et al.: ‖x‖_2^2. Hoyer: spar(x_{n×1}) = (√n − ‖x‖_1/‖x‖_2) / (√n − 1). Diversity measures: E^(p)(x) = Σ_{i=1}^n |x_i|^p for 0 ≤ p ≤ 1, and E^(p)(x) = −Σ_{i=1}^n |x_i|^p for p < 0. Rao and Kreutz-Delgado give algorithms for minimizing E^(p)(x) s.t. Ax = b, but they require an expensive iterative procedure. The ideal measure, nnz(x), is not continuous and is NP-hard to use in optimization.
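A one-line MATLAB implementation of Hoyer's measure (my sketch): it is 0 for a vector with all entries equal and 1 for a vector with a single nonzero entry.

% Hoyer sparseness: spar(x) = (sqrt(n) - ||x||_1/||x||_2) / (sqrt(n) - 1)
spar = @(x) (sqrt(numel(x)) - norm(x, 1)/norm(x, 2)) / (sqrt(numel(x)) - 1);
spar(ones(10, 1))          % 0 : perfectly dense, all entries equal
spar([5; zeros(9, 1)])     % 1 : only one nonzero entry
spar(rand(10, 1))          % somewhere in between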

NMF Algorithm: Berry et al. 2004, gradient descent / constrained least squares (GD-CLS).
W = abs(randn(m,k)); (scale the columns of W to unit norm)
H = zeros(k,n);
for i = 1:maxiter
    CLS: for j = 1:#docs, solve min_{H_j} ‖A_j − W H_j‖_2^2 + λ ‖H_j‖_2^2 s.t. H_j ≥ 0, i.e., solve (W^T W + λI) H = W^T A for H (a small k × k matrix solve)
    GD: W = W .* (A*H') ./ (W*H*H' + 1e-9); (scale the columns of W)
end
(The objective function tails off after 15-30 iterations.)
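A minimal sketch (mine, not the speaker's code) of the CLS step for H as a single k-by-k linear solve, assuming W, A, and a regularization parameter lambda are already defined:

% Constrained least squares step for H: (W'W + lambda*I) H = W'A,
% followed by the ad hoc projection of negative entries to zero.
lambda = 0.1;                                % illustrative parameter
k = size(W, 2);
H = (W' * W + lambda * eye(k)) \ (W' * A);   % all columns solved at once
H(H < 0) = 0;                                % enforce nonnegativity ad hoc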

Berry et al. 2004 Summary. Pros: + fast: less work per iteration than most other NMF algorithms; + fast: small number of iterations until convergence; + sparsity parameter for H. Cons: 0 elements in W are locked; no sparsity parameter for W; ad hoc nonnegativity: negative elements in H are set to 0 (one could run lsqnonneg or snnls instead); no convergence theory.

PMF Algorithm: Paatero and Tapper 1994, mean squared error, alternating least squares: min ‖A − WH‖_F^2 s.t. W, H ≥ 0.
W = abs(randn(m,k));
for i = 1:maxiter
    LS: for j = 1:#docs, solve min_{H_j} ‖A_j − W H_j‖_2^2 s.t. H_j ≥ 0
    LS: for j = 1:#terms, solve min_{W_j} ‖A_j − W_j H‖_2^2 s.t. W_j ≥ 0
end

ALS Algorithm.
W = abs(randn(m,k));
for i = 1:maxiter
    LS: solve the matrix equation W^T W H = W^T A for H
    nonneg: H = H .* (H >= 0)
    LS: solve the matrix equation H H^T W^T = H A^T for W
    nonneg: W = W .* (W >= 0)
end
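Written out as a self-contained MATLAB sketch (mine; the data, k, and iteration count are illustrative, and a small ridge term as in ACLS can be added if W'*W or H*H' becomes ill-conditioned):

% Alternating least squares NMF with ad hoc nonnegativity.
A = rand(100, 50);  k = 10;  maxiter = 100;
W = abs(randn(100, k));
for i = 1:maxiter
    H = (W' * W) \ (W' * A);      % LS solve for H
    H(H < 0) = 0;                 % project onto the nonnegative orthant
    W = ((H * H') \ (H * A'))';   % LS solve for W (via W^T = (HH^T)^{-1} H A^T)
    W(W < 0) = 0;
end
norm(A - W*H, 'fro')              % reconstruction error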

ALS Summary. Pros: + fast; + works well in practice; + speedy convergence; + only need to initialize W^(0); + 0 elements are not locked. Cons: no sparsity of W and H incorporated into the mathematical setup; ad hoc nonnegativity and ad hoc sparsity: negative elements are set to 0; no convergence theory.

Alternating Constrained Least Squares (ACLS). If the very fast ALS works well in practice and no NMF algorithm guarantees convergence to a local minimum, why not use ALS with regularization?
W = abs(randn(m,k));
for i = 1:maxiter
    CLS: for j = 1:#docs, solve min_{H_j} ‖A_j − W H_j‖_2^2 + λ_H ‖H_j‖_2^2 s.t. H_j ≥ 0
    CLS: for j = 1:#terms, solve min_{W_j} ‖A_j − W_j H‖_2^2 + λ_W ‖W_j‖_2^2 s.t. W_j ≥ 0
end
Equivalently, in matrix form:
W = abs(randn(m,k));
for i = 1:maxiter
    cls: solve for H: (W^T W + λ_H I) H = W^T A
    nonneg: H = H .* (H >= 0)
    cls: solve for W: (H H^T + λ_W I) W^T = H A^T
    nonneg: W = W .* (W >= 0)
end

ACLS Summary. Pros: + fast: 6.6 sec vs. 9.8 sec (GD-CLS); + works well in practice; + speedy convergence; + only need to initialize W^(0); + 0 elements are not locked; + allows for sparsity in both W and H. Cons: ad hoc nonnegativity: after each LS solve, negative elements are set to 0 (one could run lsqnonneg or snnls instead, but it doesn't improve accuracy much); no convergence theory.

ACLS + spar(x). Is there a better way to measure sparsity and still maintain the speed of ACLS? Hoyer's measure spar(x_{n×1}) = (√n − ‖x‖_1/‖x‖_2) / (√n − 1) can be rewritten as ((1 − spar(x))√n + spar(x)) ‖x‖_2 − ‖x‖_1 = 0. Imposing target sparsities spar(W_j) = α_W and spar(H_j) = α_H gives:
W = abs(randn(m,k));
for i = 1:maxiter
    cls: for j = 1:#docs, solve min_{H_j} ‖A_j − W H_j‖_2^2 + λ_H(((1 − α_H)√k + α_H)^2 ‖H_j‖_2^2 − ‖H_j‖_1^2) s.t. H_j ≥ 0
    cls: for j = 1:#terms, solve min_{W_j} ‖A_j − W_j H‖_2^2 + λ_W(((1 − α_W)√k + α_W)^2 ‖W_j‖_2^2 − ‖W_j‖_1^2) s.t. W_j ≥ 0
end

AHCLS (spar(w j )=α W and spar(h j )=α H ) W = abs(randn(m,k)); β H =((1 α H ) k + α H ) 2 β W =((1 α W ) k + α W ) 2 fori=1: maxiter cls solve for H: (W T W + λ H β H I λ H E) H = W T A nonneg H = H. (H >= 0) cls solve for W: (HH T + λ W β W I λ W E) W T = HA T nonneg W = W. (W >= 0) end

AHCLS Summary. Pros: + fast: 6.8 sec vs. 9.8 sec (GD-CLS); + works well in practice; + speedy convergence; + only need to initialize W^(0); + 0 elements are not locked; + allows for more explicit sparsity in both W and H. Cons: ad hoc nonnegativity: after each LS solve, negative elements are set to 0 (one could run lsqnonneg or snnls instead, but it doesn't improve accuracy much); no convergence theory.

Strengths and Weaknesses of NMF. Strengths: great interpretability; performance on data mining tasks comparable to LSI; sparsity of the factorization allows for significant storage savings; good scalability as k, m, n increase, with possibly faster computation time than the SVD. Weaknesses: the factorization is not unique and depends on the algorithm and parameters; unable to reduce the size of the basis without recomputing the NMF.

Current NMF Research. Algorithms; alternative objective functions; convergence criteria; updating the NMF; initializing the NMF; choosing k.

Extensions for NMF. Tensor NMF: a p-way factorization A = A_1 A_2 ... A_p with A, A_i ≥ 0. Embedded NMF: factor A into a term/topic factor A_1 and a topic/document factor A_2, then factor A_1 again into subtopic factors B_1 and B_2. NMF on the Web's hyperlink matrix: terms from anchor text create A, a term-by-node matrix with rows term 1, ..., term m and columns node 1, ..., node n (e.g., row term 1 = 1 5 ... 0, row term 2 = 0 0 ... 1, row term m = 0 1 ... 2).