Data Mining and Matrices
6: Non-Negative Matrix Factorization
Rainer Gemulla, Pauli Miettinen
May 23, 2013

Non-Negative Datasets

Some datasets are intrinsically non-negative:
- Counts (e.g., the number of occurrences of each word in a text document)
- Quantities (e.g., the amount of each ingredient in a chemical experiment)
- Intensities (e.g., the intensity of each color in an image)

The corresponding data matrix D has only non-negative values. Decompositions such as SVD and SDD may involve negative values in their factors and components:
- Negative values describe the absence of something
- Often there is no natural interpretation

Can we find a decomposition that is more natural for non-negative data?

Example (SVD)

Consider the bridge matrix and its truncated SVD, D ≈ U Σ Vᵀ, together with the corresponding rank-1 components, D ≈ U₁ Σ₁₁ V₁ᵀ + U₂ Σ₂₂ V₂ᵀ. (The numerical matrices are not legible in the transcript.) Negative values make the interpretation unnatural or difficult.

Outline

1 Non-Negative Matrix Factorization
2 Algorithms
3 Probabilistic Latent Semantic Analysis
4 Summary

Non-Negative Matrix Factorization (NMF)

Definition (non-negative matrix factorization, basic form). Given a non-negative matrix D ∈ R^{m×n}_+, a non-negative matrix factorization of rank r is D ≈ LR, where L ∈ R^{m×r}_+ and R ∈ R^{r×n}_+ are both non-negative.

- Additive decomposition: factors and components are non-negative, so there are no cancellation effects
- The rows of R can be thought of as parts; a row of D is obtained by mixing (or assembling) these parts with the weights in the corresponding row of L
- The smallest r such that D = LR exists is called the non-negative rank of D:
  rank(D) ≤ rank₊(D) ≤ min{m, n}
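
A minimal sketch (not part of the original slides) of computing such a factorization with scikit-learn; the small count matrix D, the rank, and all parameter choices are illustrative assumptions. L plays the role of the slide's L and model.components_ the role of R.

```python
# Minimal NMF sketch (assumes scikit-learn is available; data is illustrative).
import numpy as np
from sklearn.decomposition import NMF

D = np.array([[3., 2., 8., 0., 0.],
              [1., 4., 12., 0., 0.],
              [0., 0., 0., 10., 4.],
              [0., 0., 1., 8., 5.],
              [0., 1., 0., 1., 5.]])

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
L = model.fit_transform(D)      # m x r, non-negative
R = model.components_           # r x n, non-negative

print(np.round(L, 2))
print(np.round(R, 2))
print("reconstruction error:", np.linalg.norm(D - L @ R, "fro"))
```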

Example (NMF)

Consider the bridge matrix and its rank-2 NMF, D ≈ LR, together with the corresponding rank-1 components, D ≈ L₁R₁ + L₂R₂. (The numerical matrices are not legible in the transcript.) Non-negative matrix factorization encourages a more natural, parts-based representation and (sometimes) sparsity.

Decomposing faces (PCA)

Original face D_i and its reconstruction [UΣVᵀ]_i = U_i Σ Vᵀ from the truncated SVD factors (figure not reproduced). PCA factors are hard to interpret. Lee and Seung, 1999.

Decomposing faces (NMF)

Original face D_i and its reconstruction [LR]_i = L_i R from the NMF factors (figure not reproduced). NMF factors correspond to parts of faces. Lee and Seung, 1999.

Decomposing digits (NMF)

Data matrix D and factor matrix R (figure not reproduced). NMF factors correspond to parts of digits and to the background. Cichocki et al., 2009.

Some applications (Cichocki et al., 2009)

- Text mining (more later)
- Bioinformatics
- Microarray analysis
- Mineral exploration
- Neuroscience
- Image understanding
- Air pollution research
- Chemometrics
- Spectral data analysis
- Linear sparse coding
- Image classification
- Clustering
- Neural learning processes
- Sound recognition
- Remote sensing
- Object characterization
- ...

Gaussian NMF

Gaussian NMF is the most basic form of non-negative factorization:
  minimize ‖D − LR‖²_F  subject to  L ∈ R^{m×r}_+, R ∈ R^{r×n}_+

- Truncated SVD minimizes the same objective (but without the non-negativity constraints)
- Many other variants exist:
  - Different objective functions (e.g., KL divergence)
  - Additional regularizers (e.g., L1 regularization)
  - Different constraints (e.g., orthogonality of R)
  - Different compositions (e.g., 3 matrices)
- Multi-layer NMF, semi-NMF, sparse NMF, tri-NMF, symmetric NMF, orthogonal NMF, non-smooth NMF (nsNMF), overlapping NMF, convolutive NMF (CNMF), k-means, ...

k-means can be seen as a variant of NMF

Additional constraint: each row of L contains exactly one 1, the rest are 0 (hard cluster assignments). Original face D_i and its reconstruction [LR]_i = L_i R (figure not reproduced). The k-means factors correspond to prototypical faces. Lee and Seung, 1999.

NMF is not unique

- Factors are not ordered. One way of ordering: decreasing Frobenius norm of the components (i.e., order by ‖L_k R_k‖_F)
- Factors/components are not unique: the same matrix can have several different exact NMFs (the small numerical examples on this slide are not legible in the transcript)
- Additional constraints or regularization can encourage uniqueness

NMF is not hierarchical

Compare a rank-1 NMF and a rank-2 NMF of the bridge matrix (numerical matrices not legible in the transcript):
- The best rank-k approximation may differ significantly from the best rank-(k−1) approximation
- The rank influences sparsity, interpretability, and statistical fidelity
- The optimal choice of the rank is not well studied (often requires experimentation)

Outline

1 Non-Negative Matrix Factorization
2 Algorithms
3 Probabilistic Latent Semantic Analysis
4 Summary

NMF is difficult

We focus on minimizing L(L, R) = ‖D − LR‖²_F.
- For varying m, n, and r, the problem is NP-hard
- When rank(D) = 1 (or r = 1), it can be solved in polynomial time:
  1. Take the first non-zero column of D as L ∈ R^{m×1}
  2. Determine R ∈ R^{1×n} entry by entry (using the fact that D_j = L R_j)
- The problem is not convex:
  - A local optimum may not correspond to a global optimum
  - Generally there is little hope of finding the global optimum
- But: the problem is biconvex, which allows for efficient algorithms
  - For fixed R, f(L) = ‖D − LR‖²_F = Σ_i ‖D_i − L_i R‖²_F is convex:
    ∂f/∂L_ik = −2(D_i − L_i R) R_kᵀ (chain rule, product rule)
    ∂²f/∂L_ik² = 2 R_k R_kᵀ (does not depend on L)
  - For fixed L, f(R) = ‖D − LR‖²_F is convex

General framework

- Gradient descent is generally slow; stochastic gradient descent is inappropriate here
- Key approach: alternating minimization

  1: Pick starting point L and R
  2: while not converged do
  3:   Keep R fixed, optimize L
  4:   Keep L fixed, optimize R
  5: end while

- Update steps 3 and 4 are easier than the full problem
- Also called alternating projections or (block) coordinate descent
- Starting point:
  - Random
  - Multi-start initialization: try multiple random starting points, run a few epochs, continue with the best
  - Based on SVD
  - ...
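
To make the framework concrete, here is a minimal sketch (not from the slides; it assumes SciPy is available) that instantiates steps 3 and 4 with exact non-negative least squares solves. The function name and the toy data are illustrative.

```python
# Alternating minimization for Gaussian NMF, with exact NNLS subproblems.
import numpy as np
from scipy.optimize import nnls

def als_nmf(D, r, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = D.shape
    L = rng.random((m, r))
    R = rng.random((r, n))
    for _ in range(iters):
        # Keep R fixed, optimize L: each row of L is an NNLS problem.
        for i in range(m):
            L[i], _ = nnls(R.T, D[i])
        # Keep L fixed, optimize R: each column of R is an NNLS problem.
        for j in range(n):
            R[:, j], _ = nnls(L, D[:, j])
    return L, R

D = np.abs(np.random.default_rng(1).random((6, 5)))
L, R = als_nmf(D, r=2)
print("error:", np.linalg.norm(D - L @ R, "fro"))
```

Exact NNLS per row and column is simple but can be costly; the ANLS, HALS, and multiplicative-update slides below replace these subproblems with cheaper updates.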

Example

Ignore non-negativity for now. Consider the regularized least-squares error
  L(L, R) = ‖D − LR‖²_F + λ(‖L‖²_F + ‖R‖²_F).
By setting m = n = r = 1, D = (1), and λ = 0.05, we obtain
  L(l, r) = (1 − lr)² + 0.05(l² + r²)
with partial derivatives
  ∂L/∂l = −2r(1 − lr) + 0.1 l,
  ∂L/∂r = −2l(1 − lr) + 0.1 r.
Local optima: (√(19/20), √(19/20)) and (−√(19/20), −√(19/20)) ≈ ±(0.97, 0.97). Stationary point (saddle): (0, 0).

Example (ALS)

f(l, r) = (1 − lr)² + 0.05(l² + r²)
Coordinate minimizers: l ← argmin_l f(l, r) = 2r/(2r² + 0.1) and r ← argmin_r f(l, r) = 2l/(2l² + 0.1).

Step   l     r
0      1.00  2.00
1      0.49  2.00
2      0.49  1.68
3      0.58  1.68
4      0.58  1.49
...
       0.97  0.97

Converges to the local minimum (0.97, 0.97).
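
The alternating updates above are easy to reproduce. The following sketch assumes the starting point (l, r) = (1, 2) read off the table and simply iterates the two closed-form coordinate minimizers.

```python
# Toy ALS on f(l, r) = (1 - l*r)**2 + 0.05*(l**2 + r**2); the coordinate-wise
# minimizers are l = 2r/(2r^2 + 0.1) and r = 2l/(2l^2 + 0.1).
l, r = 1.0, 2.0                       # assumed starting point from the table above
for step in range(1, 101):
    if step % 2 == 1:
        l = 2 * r / (2 * r**2 + 0.1)  # keep r fixed, minimize over l
    else:
        r = 2 * l / (2 * l**2 + 0.1)  # keep l fixed, minimize over r
    if step <= 4:
        print(step, round(l, 2), round(r, 2))
print("final:", round(l, 2), round(r, 2))   # approaches the local minimum (0.97, 0.97)
```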

Example (ALS)

Same setup, f(l, r) = (1 − lr)² + 0.05(l² + r²), but starting from (l, r) = (2, 0):

Step   l     r
0      2.00  0.00
1      0.00  0.00
2      0.00  0.00

Converges to the stationary point (0, 0) rather than to a local minimum.

Alternating non-negative least squares (ANLS)

Uses non-negative least squares approximations for L and R:
  argmin_{L ∈ R^{m×r}_+} ‖D − LR‖²_F  and  argmin_{R ∈ R^{r×n}_+} ‖D − LR‖²_F
Equivalently: find non-negative least squares solutions to LR = D.
Common approach: solve the unconstrained least squares problems and remove negative values. E.g., when the columns (rows) of L (R) are linearly independent, set
  L = [D R⁺]_ε  and  R = [L⁺ D]_ε,
where
- R⁺ = Rᵀ(RRᵀ)⁻¹ is the right pseudo-inverse of R,
- L⁺ = (LᵀL)⁻¹Lᵀ is the left pseudo-inverse of L,
- [a]_ε = max{ε, a} element-wise, for ε = 0 or some small constant (e.g., ε = 10⁻⁹).
Difficult to analyze due to the non-linear update steps. Often slow convergence to a bad local minimum (better when regularized).
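
A minimal matrix-level sketch (not from the slides) of the clipped pseudo-inverse heuristic described above; ε, the iteration count, and the random data are illustrative, and this is an approximation rather than an exact non-negative least squares solver.

```python
# ANLS-style updates: unconstrained least squares followed by clipping at eps.
import numpy as np

def anls_clipped(D, r, iters=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = D.shape
    L = rng.random((m, r))
    R = rng.random((r, n))
    for _ in range(iters):
        L = np.maximum(D @ np.linalg.pinv(R), eps)  # L <- [D R^+]_eps
        R = np.maximum(np.linalg.pinv(L) @ D, eps)  # R <- [L^+ D]_eps
    return L, R

D = np.abs(np.random.default_rng(1).random((6, 5)))
L, R = anls_clipped(D, r=2)
print("error:", np.linalg.norm(D - L @ R, "fro"))
```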

Example (ANLS)

f(l, r) = (1 − lr)² + 0.05(l² + r²), with ε = 10⁻⁹:
  l ← [2r/(2r² + 0.1)]_ε,  r ← [2l/(2l² + 0.1)]_ε

Step   l        r
0      2        0
1      10⁻⁹     0
2      10⁻⁹     2·10⁻⁸
3      4·10⁻⁷   2·10⁻⁸
4      4·10⁻⁷   8·10⁻⁶
...
       0.97     0.97

Converges to the local minimum: the ε-clipping lets the iterates escape the stationary point (0, 0).

Hierarchical alternating least squares (HALS)

- Work locally on a single factor, then proceed to the next factor, and so on
- Let D^(k) be the residual matrix (error) when the k-th factor is removed:
    D^(k) = D − LR + L_k R_k = D − Σ_{k' ≠ k} L_{k'} R_{k'}
- HALS minimizes ‖D^(k) − L_k R_k‖²_F for k = 1, 2, ..., r, 1, ... (equivalently: it finds the best solution for the k-th factor while fixing the rest)
- In each iteration, set (once or multiple times):
    L_k = (1/‖R_k‖²_F) [D^(k) R_kᵀ]_ε  and  R_kᵀ = (1/‖L_k‖²_F) [(D^(k))ᵀ L_k]_ε
- D^(k) can be maintained incrementally (fast implementation):
    D^(k+1) = D^(k) + L_k R_k − L_{k+1} R_{k+1}
- Often better performance in practice than ANLS
- Converges to a stationary point when initialized with a positive matrix and a sufficiently small ε > 0
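
A small sketch (not from the slides) of HALS with incremental maintenance of the residual; column L_k and row R_k are updated in place, and the initialization, sweep count, and ε are illustrative.

```python
# HALS: cycle over rank-1 factors, updating one column of L and one row of R at a time.
import numpy as np

def hals(D, r, sweeps=100, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = D.shape
    L = rng.random((m, r)) + eps
    R = rng.random((r, n)) + eps
    E = D - L @ R                              # residual of the full model
    for _ in range(sweeps):
        for k in range(r):
            Dk = E + np.outer(L[:, k], R[k])   # residual with factor k removed
            L[:, k] = np.maximum(Dk @ R[k] / (R[k] @ R[k]), eps)
            R[k] = np.maximum(L[:, k] @ Dk / (L[:, k] @ L[:, k]), eps)
            E = Dk - np.outer(L[:, k], R[k])   # incrementally maintain the residual
    return L, R

D = np.abs(np.random.default_rng(1).random((6, 5)))
L, R = hals(D, r=2)
print("error:", np.linalg.norm(D - L @ R, "fro"))
```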

Multiplicative updates

Gradient descent step with step size η_kj:
  R_kj ← R_kj + η_kj ([LᵀD]_kj − [LᵀLR]_kj)
Setting η_kj = R_kj / [LᵀLR]_kj, we obtain the multiplicative update rules
  L ← L ∘ (DRᵀ) / (LRRᵀ)  and  R ← R ∘ (LᵀD) / (LᵀLR),
where multiplication (∘) and division are element-wise.
- Does not necessarily find the optimal L (or R), but can be shown to never increase the loss
- Faster than ANLS (no computation of pseudo-inverses), easy to implement and parallelize
- Zeros in the factors are problematic (divisions become undefined); a common fix:
  L ← L ∘ [DRᵀ]_ε / (LRRᵀ + ε)  and  R ← R ∘ [LᵀD]_ε / (LᵀLR + ε)
Lee and Seung, 2001. Cichocki et al., 2006.
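
A small sketch (not from the slides) of the ε-smoothed multiplicative updates given above for the Frobenius objective; data, rank, and iteration count are illustrative.

```python
# Lee-Seung style multiplicative updates with eps-smoothing against zero divisions.
import numpy as np

def mu_nmf(D, r, iters=500, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = D.shape
    L = rng.random((m, r))
    R = rng.random((r, n))
    for _ in range(iters):
        L *= np.maximum(D @ R.T, eps) / (L @ R @ R.T + eps)   # element-wise
        R *= np.maximum(L.T @ D, eps) / (L.T @ L @ R + eps)   # element-wise
    return L, R

D = np.abs(np.random.default_rng(1).random((6, 5)))
L, R = mu_nmf(D, r=2)
print("error:", np.linalg.norm(D - L @ R, "fro"))
```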

Example (multiplicative updates)

f(l, r) = (1 − lr)² + 0.05(l² + r²):
  l ← l · r / (0.05 l + l r²),  r ← r · l / (0.05 r + l² r)

Step   l     r
0      1.00  2.00
1      0.48  2.00
2      0.48  1.66
3      0.59  1.66
4      0.58  1.45
...
       0.97  0.97

Converges to the local minimum (0.97, 0.97).

Outline

1 Non-Negative Matrix Factorization
2 Algorithms
3 Probabilistic Latent Semantic Analysis
4 Summary

Topic modeling

Consider a document-word matrix D constructed from some corpus, with columns for the words air, water, pollution, democrat, and republican and rows for documents doc1 to doc5 (the exact counts are not legible in the transcript). The documents seem to talk about two topics:
1. Environment (with the words air, water, and pollution)
2. Congress (with the words democrat and republican)
Can we automatically detect topics in documents?

A probabilistic viewpoint

Let's normalize D such that its entries sum to unity (the normalized matrix values are not legible in the transcript). Put all words in an urn and draw: the probability of drawing word w from document d is given by P(d, w) = D_dw. Matrix D can represent any probability distribution over (document, word) pairs. pLSA tries to find a distribution that is close to D but exposes information about topics.

Probabilistic latent semantic analysis (pLSA)

Definition (pLSA, NMF formulation). Given a rank r, find matrices L, Σ, and R such that D ≈ LΣR, where
- L ∈ R^{m×r} is a non-negative, column-stochastic matrix (columns sum to unity),
- Σ ∈ R^{r×r} is a non-negative, diagonal matrix that sums to unity, and
- R ∈ R^{r×n} is a non-negative, row-stochastic matrix (rows sum to unity).
The approximation "≈" is usually taken with respect to the (generalized) KL divergence. Additional regularization or tempering is necessary to avoid overfitting. Hofmann, 2001.

Example

pLSA factorization of the example matrix, D ≈ L Σ R (the numerical values are only partially legible in the transcript).
- The rank r corresponds to the number of topics
- Σ_kk corresponds to the overall frequency of topic k
- L_dk corresponds to the contribution of document d to topic k
- R_kw corresponds to the frequency of word w in topic k
- The pLSA constraints allow for a probabilistic interpretation:
    P(d, w) ≈ [LΣR]_dw = Σ_k Σ_kk L_dk R_kw = Σ_k P(k) P(d | k) P(w | k)
- The pLSA model imposes conditional independence constraints, i.e., a restricted space of distributions
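
A small sketch of the probabilistic reading of LΣR; the numbers are illustrative (they are not the slide's factorization) and merely satisfy the stochasticity constraints.

```python
# P(d, w) = sum_k P(k) P(d|k) P(w|k), written as L diag(sigma) R.
import numpy as np

L = np.array([[0.45, 0.00],     # P(d | k): columns sum to 1
              [0.55, 0.00],
              [0.00, 0.40],
              [0.00, 0.35],
              [0.00, 0.25]])
sigma = np.array([0.55, 0.45])  # P(k): sums to 1
R = np.array([[0.40, 0.35, 0.25, 0.00, 0.00],   # P(w | k): rows sum to 1
              [0.00, 0.00, 0.10, 0.50, 0.40]])

P = L @ np.diag(sigma) @ R      # joint distribution P(d, w)
print(np.round(P, 3))
print("sums to one:", np.isclose(P.sum(), 1.0))
print("P(w | d=0):", np.round(P[0] / P[0].sum(), 3))
```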

Another example

Concepts (a selection of those extracted) from a corpus of Science Magazine articles (figure not reproduced). Hofmann, 2004.

pLSA geometry

Rewrite the probabilistic formulation:
  P(d, w) = Σ_k P(k) P(d | k) P(w | k)
  P(w | d) = Σ_z P(w | z) P(z | d)
Generative process for creating a word:
1. Pick a document according to P(d)
2. Select a topic according to P(z | d)
3. Select a word according to P(w | z)
Hofmann, 2001.
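
A tiny sketch of this generative process, using illustrative distributions (not the slide's): pick a document, then a topic, then a word.

```python
# Sample (document, topic, word) triples from the pLSA generative model.
import numpy as np

rng = np.random.default_rng(0)
P_d = np.array([0.3, 0.3, 0.4])                       # P(d)
P_z_given_d = np.array([[0.9, 0.1],                   # P(z | d), rows sum to 1
                        [0.2, 0.8],
                        [0.5, 0.5]])
P_w_given_z = np.array([[0.4, 0.35, 0.25, 0.0, 0.0],  # P(w | z), rows sum to 1
                        [0.0, 0.0, 0.1, 0.5, 0.4]])

def sample_word():
    d = rng.choice(3, p=P_d)                 # 1. pick a document
    z = rng.choice(2, p=P_z_given_d[d])      # 2. select a topic given d
    w = rng.choice(5, p=P_w_given_z[z])      # 3. select a word given z
    return d, z, w

print([sample_word() for _ in range(5)])
```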

Kullback-Leibler divergence (1)

Let D̃ be the unnormalized word-count data and denote by N the total number of words. The likelihood of seeing D̃ when drawing N words with replacement is proportional to
  Π_{d=1}^{m} Π_{w=1}^{n} P(d, w)^{D̃_dw}.
pLSA maximizes the log-likelihood of seeing the data given the model:
  log P(D̃ | L, Σ, R) ∝ Σ_{d=1}^{m} Σ_{w=1}^{n} D̃_dw log P(d, w | L, Σ, R)
                      = Σ_d Σ_w D̃_dw log [LΣR]_dw
                      ∝ −Σ_d Σ_w D_dw log (D_dw / [LΣR]_dw) + c,
where the final sum is the Kullback-Leibler divergence between the normalized data D and the model LΣR. Gaussier and Goutte, 2005.

Kullback-Leibler divergence (2)

KL divergence:
  D_KL(P ‖ Q) = Σ_{d=1}^{m} Σ_{w=1}^{n} P_dw log (P_dw / Q_dw)
Interpretation: the expected number of extra bits needed to encode a value drawn from P using an optimal code for distribution Q.
- D_KL(P ‖ Q) ≥ D_KL(P ‖ P) = 0
- D_KL(P ‖ Q) ≠ D_KL(Q ‖ P) in general
NMF-based pLSA algorithms minimize the generalized KL divergence
  D_GKL(P ‖ Q) = Σ_{d=1}^{m} Σ_{w=1}^{n} (P_dw log (P_dw / Q_dw) − P_dw + Q_dw),
where P = D and Q = LΣR.
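
A small helper (not from the slides) that evaluates the generalized KL divergence elementwise, treating 0·log 0 as 0; the matrices are illustrative.

```python
# D_GKL(P || Q) = sum_ij ( P_ij * log(P_ij / Q_ij) - P_ij + Q_ij )
import numpy as np

def gkl(P, Q, eps=1e-12):
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # 0 * log(0 / q) is treated as 0; eps guards the logarithm.
    log_term = np.where(P > 0, P * np.log((P + eps) / (Q + eps)), 0.0)
    return float(np.sum(log_term - P + Q))

P = np.array([[0.2, 0.3], [0.1, 0.4]])
Q = np.array([[0.25, 0.25], [0.15, 0.35]])
print(gkl(P, Q))   # positive
print(gkl(P, P))   # zero up to numerical error
```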

Multiplicative updates for GKL (w/o tempering)

We first find a decomposition D ≈ L̃ R̃, where L̃ and R̃ are non-negative matrices. Update rules:
  L̃ ← L̃ ∘ [ (D / (L̃R̃)) R̃ᵀ ] diag(1 / rowsums(R̃))
  R̃ ← diag(1 / colsums(L̃)) [ L̃ᵀ (D / (L̃R̃)) ] ∘ R̃
The GKL is non-increasing under these update rules. Then normalize by rescaling the columns of L̃ and the rows of R̃:
  L = L̃ diag(1 / colsums(L̃))
  R = diag(1 / rowsums(R̃)) R̃
  Σ̃ = diag(colsums(L̃) ∘ rowsums(R̃)),  Σ = Σ̃ / Σ_k Σ̃_kk
Lee and Seung, 2001.
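
A minimal sketch (not from the slides) of these multiplicative GKL updates followed by the normalization into column-stochastic L, diagonal Σ, and row-stochastic R; the count matrix, rank, and iteration count are illustrative.

```python
# pLSA via KL-NMF: multiplicative GKL updates, then normalization into L, sigma, R.
import numpy as np

def plsa_nmf(D, r, iters=500, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    m, n = D.shape
    Lt = rng.random((m, r))
    Rt = rng.random((r, n))
    for _ in range(iters):
        ratio = D / (Lt @ Rt + eps)
        Lt *= (ratio @ Rt.T) / (Rt.sum(axis=1) + eps)          # column k scaled by 1/rowsum_k(Rt)
        ratio = D / (Lt @ Rt + eps)
        Rt *= (Lt.T @ ratio) / (Lt.sum(axis=0)[:, None] + eps)  # row k scaled by 1/colsum_k(Lt)
    # normalize into the pLSA parameterization
    L = Lt / Lt.sum(axis=0)                                     # column-stochastic
    R = Rt / Rt.sum(axis=1, keepdims=True)                      # row-stochastic
    sigma = Lt.sum(axis=0) * Rt.sum(axis=1)
    sigma = sigma / sigma.sum()                                 # P(k), sums to 1
    return L, sigma, R

D = np.array([[3., 2., 8., 0., 0.],
              [1., 4., 12., 0., 0.],
              [0., 0., 0., 10., 4.],
              [0., 0., 1., 8., 5.],
              [0., 1., 0., 1., 5.]])
L, sigma, R = plsa_nmf(D, r=2)
print(np.round(sigma, 3))
print(np.round(R, 2))
```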

Applications of pLSA

- Topic modeling
- Clustering documents
- Clustering terms
- Information retrieval:
  - Treat a query q as a new document (a new row in D and L)
  - Determine P(k | q) by keeping Σ and R fixed ("fold in" the query; see the sketch below)
  - Retrieve documents with a topic mixture similar to that of the query
- Can deal with synonymy and polysemy
- Better generalization performance than LSA (= SVD), especially with tempering
- In practice, outperformed by latent Dirichlet allocation (LDA)
Hofmann, 2001.
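
One way to fold in a query is sketched below, under the assumption that only the query's topic mixture is re-estimated with multiplicative GKL updates while the topics R are held fixed; this is a hedged illustration rather than Hofmann's exact procedure, and all numbers are made up.

```python
# Fold in a query: fit its topic mixture against fixed topic-word distributions R.
import numpy as np

def fold_in(q_counts, R, iters=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    q = np.asarray(q_counts, dtype=float)
    theta = rng.random(R.shape[0])            # unnormalized topic weights for the query
    for _ in range(iters):
        ratio = q / (theta @ R + eps)         # element-wise over words
        theta *= (R @ ratio) / (R.sum(axis=1) + eps)
    return theta / theta.sum()                # P(k | q)

# topic-word distributions (rows sum to 1): topic 0 = environment, topic 1 = congress
R = np.array([[0.40, 0.35, 0.25, 0.00, 0.00],
              [0.00, 0.00, 0.10, 0.50, 0.40]])
query = np.array([1., 1., 0., 0., 0.])        # counts for: air, water, ...
print(np.round(fold_in(query, R), 3))         # almost all mass on topic 0
# documents whose rows of L have a similar topic mixture are retrieved first
```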

Outline

1 Non-Negative Matrix Factorization
2 Algorithms
3 Probabilistic Latent Semantic Analysis
4 Summary

Lessons learned

- Non-negative matrix factorization (NMF) appears natural for non-negative data
- NMF encourages a parts-based decomposition, interpretability, and (sometimes) sparseness
- Many variants, many applications
- Usually solved via alternating minimization algorithms:
  - Alternating non-negative least squares (ANLS)
  - Projected gradient methods
  - Local / hierarchical ALS (HALS)
  - Multiplicative updates
- pLSA is an approach to topic modeling that can be seen as an NMF

Literature

- David Skillicorn. Understanding Complex Datasets: Data Mining with Matrix Decompositions (Chapter 8). Chapman and Hall, 2007.
- Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, 2009.
- Yifeng Li and Alioune Ngom. The NMF MATLAB Toolbox. http://cs.uwindsor.ca/~li2c/nmf
- Renaud Gaujoux and Cathal Seoighe. NMF R package. http://cran.r-project.org/web/packages/nmf/index.html
- Further references are given at the bottom of the slides.