Quick Introduction to Nonnegative Matrix Factorization

Norm Matloff
University of California at Davis

1 The Goal

Given a u × v matrix A with nonnegative elements, we wish to find nonnegative, rank-k matrices W (u × k) and H (k × v) such that

    A \approx WH                                        (1)

We typically hope that a good approximation can be achieved with

    k \ll \mathrm{rank}(A)                              (2)

The benefits range from compression (W and H are much smaller than A, which in some applications can be huge) to avoidance of overfitting in a prediction context.

2 Notation

We'll use the following notation for a matrix Q:

    Q_{ij}: the element in row i, column j
    Q_{i.}: row i
    Q_{.j}: column j

Note the key relation

    (WH)_{.j} = \sum_{i=1}^{k} H_{ij} W_{.i}            (3)

In other words, in (1) we have that:

    Column j of A is approximately a linear combination of the columns of W, with the coefficients in that linear combination being the elements of column j of H.

Thus the columns W_{.1}, ..., W_{.k} (approximately) form a basis for the vector space spanned by the columns of A, or at least for the nonnegative orthant of that space, meaning all the vectors in that space with nonnegative components. Since the relation is only approximate, let's use the term pseudo-basis.

Similarly, we could view everything in terms of rows of A and H: row i of WH is a linear combination of the rows of H, with the coefficients being row i of W. The rows of H would then be a pseudo-basis for the rows of A.

If (1) is a good approximation, then most of the information carried by A is contained in W and H. We can think of the columns of W as typifying those of A, with a similar statement holding for the rows of H and A. Here "information" could mean the appearance of an image, the predictive content of data to be used for machine text classification, and so on.
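To see relation (3) concretely, here is a tiny numerical check in R; the matrices w and h and their dimensions are made up purely for illustration.

    # tiny numerical check of relation (3), with made-up matrices
    set.seed(1)
    w <- matrix(runif(12), nrow = 4)   # 4 x 3, playing the role of W
    h <- matrix(runif(6), nrow = 3)    # 3 x 2, playing the role of H
    j <- 2
    lhs <- (w %*% h)[, j]              # column j of WH
    rhs <- h[1, j] * w[, 1] + h[2, j] * w[, 2] + h[3, j] * w[, 3]  # right side of (3)
    all.equal(lhs, rhs)                # TRUE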

3 Applications

This notion of nonnegative matrix factorization has become widely used in a variety of applications, such as:

Image recognition: Say we have n image files, each of which has brightness data for r rows and c columns of pixels. We also know the class, i.e. the subject, of each image, say car, building, person and so on. From this data, we wish to predict the classes of new images. Denote the class of image j in our original data by C_j.

We form a matrix A with rc rows and n columns, where the j-th column, A_{.j}, stores the data for the j-th image, say in column-major order.

In the sense stated above, the columns of W typify images. What about H? This is a little more complex, though probably more important. Each image contains information about rc pixels. The m-th of these, say, varies from image to image, and thus can be thought of as a random variable in the statistical sense. Row m of A can be thought of as a random sample from the population distribution of that random variable.

So the k rows of H typify these rc random variables, and since k is hopefully much smaller than rc, we have achieved what is called dimension reduction among those variables. It should be intuitively clear that this is possible, because there should be high correlation between neighboring pixels, and thus a lot of redundant information.

This is very important with respect to the overfitting problem in statistics/machine learning. A full treatment is beyond the scope of this tutorial, but the essence of the concept is that having too complex a model can lead to noise fitting, which actually harms our goal of predicting the classes of future images. By having only k rows in H, we are reducing the dimensionality of the problem from rc to k, thus reducing the chance of overfitting.

Our prediction of new images works as follows. Given the rc-element vector q of a new image of unknown class, we find its representation q_p in the pseudo-basis for the columns of A, and then find the g such that H_{.g} best matches q_p. Our predicted class is then C_g.

Text classification: Here A consists of, say, word counts. We have a list of d key words and m documents of known classes (politics, finance, sports, etc.). A_{ij} is the count of the number of times word i appears in document j. Otherwise, the situation is the same as for image recognition above: we find the NMF, and then, given a new document to classify with word counts q, we find its coordinates q_p and predict the class by matching q_p to the columns of H.

Recommender systems: Here most of A is unknown. For instance, we have users who have rated movies they've seen, with A_{ij} being the rating of movie j by user i. Our goal is to fill in estimates of the unknown entries. One approach is to initially insert 0s for those entries, then perform NMF, producing W and H. (A popular alternative uses stochastic gradient methods to find the best fit directly, without creating the matrix A.) We then compute WH as our estimate of A, and now have estimates for the missing entries.
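As a minimal sketch of that zero-fill approach, using the NMF package introduced in Section 4 below (the ratings matrix, its dimensions and the rank 3 are all made up for illustration):

    # minimal sketch of the zero-fill approach described above
    library(NMF)
    set.seed(1)
    ratings <- matrix(sample(c(NA, 1:5), 100, replace = TRUE), nrow = 10)
    a <- ratings
    a[is.na(a)] <- 0                   # insert 0s for the unknown entries
    nout <- nmf(a, 3)                  # rank chosen arbitrarily here
    est <- nout@fit@w %*% nout@fit@h   # WH, our estimate of A
    est[is.na(ratings)]                # estimates of the missing ratings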

4 The R Package NMF

The R package NMF is quite extensive, with many, many options. In its simplest form, though, it is quite easy to use. For a matrix a and desired rank k, we simply run

> nout <- nmf(a, k)

Here the returned value nout is an object of class NMF defined in the package. It uses R's S4 class structure, with @ as the delimiter for class members. As is the case in many R packages, NMF objects contain classes within classes. The computed factors are in nout@fit@w and nout@fit@h.

Let's illustrate this in an image context, using a grayscale photo of Mt. Rushmore (the file MtRush.pgm below). Here we have only one image, and we'll use NMF to compress it, not do classification. First obtain A:

> library(pixmap)
# read file
> mtr <- read.pnm("MtRush.pgm")
> class(mtr)
[1] "pixmapGrey"
attr(,"package")
[1] "pixmap"
# mtr is an R S4 object of class pixmapGrey
# extract the pixels matrix
> a <- mtr@grey

Now perform NMF, find the approximation to A, and display it:

> aout <- nmf(a, 50)
> w <- aout@fit@w
> h <- aout@fit@h
> approxa <- w %*% h
# brightness values must be in [0,1]
> approxa <- pmin(approxa, 1)
> mtrnew <- mtr
> mtrnew@grey <- approxa
> plot(mtrnew)
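As a rough numerical companion to the visual check, we can also compute the relative error of the approximation in the Frobenius norm discussed in Section 5.1 (the exact value will vary from run to run, since nmf() starts from random initial values):

> norm(a - approxa, "F") / norm(a, "F")   # relative Frobenius-norm error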

The result is a somewhat blurry version of the original image. The original matrix has dimension 194 × 259, and thus presumably has rank 194. (This is confirmed by running the function rankMatrix() in the Matrix package.) We've approximated the matrix by one of rank only 50, a storage savings. That is not important for one small picture, but possibly worthwhile if we have many. The approximation is not bad in that light, and may be good enough for image recognition or other applications.
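To make the storage comparison concrete, with the 194 × 259 matrix and k = 50:

    storage for A:        194 x 259           = 50,246 numbers
    storage for W and H:  194 x 50 + 50 x 259 = 22,650 numbers

so W and H together require a bit less than half the storage of A.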

5 Algorithms

How are the NMF solutions found? What is nmf() doing internally? Needless to say, the methods are all iterative, with one approach being the Alternating Least Squares algorithm. By the way, this is not the default for nmf(); to select it, set method = "snmf/r".

5.1 Objective Function

We need an objective function, a criterion to optimize, in this case a criterion for goodness of approximation. Here we will take that to be the Frobenius norm, which is just the Euclidean (L_2) norm with the matrix treated as a vector:

    \|Q\|^2 = \sum_{i,j} Q_{ij}^2                       (4)

(In general, the L_p norm of a vector v = (v_1, ..., v_r) is defined to be (\sum_i |v_i|^p)^{1/p}.) So our criterion for error of approximation will be

    \|A - WH\|^2                                        (5)

This measure is specified in nmf() by setting objective = "euclidean".

5.2 Alternating Least Squares

So how does Alternating Least Squares work? It's actually quite simple. Suppose, just for a moment, that we know the exact value of W, with H unknown. Then for each j we could minimize

    \|A_{.j} - W H_{.j}\|^2                             (6)

We are seeking the H_{.j} that minimizes (6), with A_{.j} and W known. But since the Frobenius norm is just a sum of squares, that minimization is simply a least-squares problem, i.e. linear regression: we are predicting A_{.j} from W. The solution is well known to be

    H_{.j} = (W'W)^{-1} W' A_{.j}                       (7)

(If you have background in regression analysis, you might notice that there is no constant term β_0 here.) R's lm() function does this for us,

> h[, j] <- coef(lm(a[, j] ~ w - 1))

for each j. (The -1 specifies that we do not want a constant term in the model.)

On the other hand, suppose we know H but not W. We could take transposes,

    A' = H'W'                                           (8)

and then just interchange the roles of W and H above. Here a call to lm() gives us a row of W, and we do this for all rows.

Putting all this together, we first choose initial guesses, say random numbers, for W and H; nmf() gives us various choices as to how to do this. Then we alternate: compute the new guess for W assuming H is correct, then choose the new guess for H based on that new W, and so on. During this process we may generate some negative values; if so, we simply truncate them to 0.
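Putting the pieces of this section together, here is a bare-bones sketch of the alternation. It is not the NMF package's implementation: it applies the normal-equations solution (7) to all columns at once instead of calling lm() column by column, and the tiny ridge term eps is an added numerical safeguard.

    # bare-bones Alternating Least Squares sketch, not the NMF package's code
    alsnmf <- function(a, k, niters = 100) {
       u <- nrow(a)
       v <- ncol(a)
       w <- matrix(runif(u * k), nrow = u)   # random initial guess for W
       h <- matrix(runif(k * v), nrow = k)   # random initial guess for H
       eps <- 1e-10 * diag(k)                # small ridge, for numerical safety only
       for (iter in 1:niters) {
          # new guess for H, pretending W is exactly right (equation (7))
          h <- solve(t(w) %*% w + eps, t(w) %*% a)
          h[h < 0] <- 0                      # truncate negative values to 0
          # new guess for W, via the transposed relation A' = H'W' (equation (8))
          w <- t(solve(h %*% t(h) + eps, h %*% t(a)))
          w[w < 0] <- 0
       }
       list(w = w, h = h)
    }
    # e.g. wh <- alsnmf(a, 50)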

5.3 Multiplicative Update

Alternating Least Squares is appealing in several senses. At each iteration, it is minimizing a convex function, meaning in essence that there is a unique local and global minimum; it is easy to implement, since there are many least-squares routines publicly available, such as lm() here; and above all, it has a nice interpretation, predicting columns of A.

Another popular algorithm is multiplicative update, due to Lee and Seung. Here are the update formulas for W given H and vice versa:

    W \leftarrow W \otimes (AH') \oslash (WHH')         (9)

    H \leftarrow H \otimes (W'A) \oslash (W'WH)         (10)

where Q \otimes R and Q \oslash R represent elementwise multiplication and division of conformable matrices Q and R, and the juxtaposition QR means ordinary matrix multiplication.
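Here is an equally bare-bones sketch of the iteration defined by (9) and (10); again this is not the package's code, and the small constant eps added to the denominators is just a guard against division by zero, not part of the formulas themselves.

    # bare-bones multiplicative-update sketch (Lee-Seung), not the NMF package's code
    munmf <- function(a, k, niters = 200, eps = 1e-9) {
       u <- nrow(a)
       v <- ncol(a)
       w <- matrix(runif(u * k), nrow = u)
       h <- matrix(runif(k * v), nrow = k)
       for (iter in 1:niters) {
          w <- w * (a %*% t(h)) / (w %*% h %*% t(h) + eps)   # equation (9)
          h <- h * (t(w) %*% a) / (t(w) %*% w %*% h + eps)   # equation (10)
       }
       list(w = w, h = h)
    }
    # e.g. wh <- munmf(a, 50)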

6 Predicting New Cases

Once we have settled on W and H, what can we do with them? In the recommender system application mentioned earlier, we simply multiply them, and then retrieve the predicted values at the matrix positions at which we did not have user ratings.

But recall the description given above for the image and text processing examples: given the rc-element vector q of the new image, we find its representation q_p in the pseudo-basis, and then find the g such that H_{.g} best matches q_p. Our predicted class is then C_g.

How do we find q_p? The answer is that again we can use the least-squares approach above, regressing q on the columns of W:

> qp <- coef(lm(q ~ w - 1))
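Continuing that sketch, the matching step might look as follows; here h is assumed to be the k × n coefficient matrix from the NMF of the training data, and classes is assumed to hold the known classes C_1, ..., C_n.

    # hedged sketch of the matching step; h and classes are assumed to exist
    dists <- colSums((h - qp)^2)    # squared distance from qp to each column of H
    g <- which.min(dists)
    classes[g]                      # predicted class C_g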

7 Convergence and Uniqueness Issues

There are no panaceas for the applications considered here; every solution has potential problems. With NMF, an issue may be uniqueness: there might not be a unique pair (W, H) that minimizes (5). (See Donoho and Stodden, "When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?", https://web.stanford.edu/~vcs/papers/nmfcdp.pdf.) In turn, this may result in convergence problems.

The NMF documentation recommends running nmf() multiple times; it will use a different seed for the random initial values each time.

The Alternating Least Squares method used here is considered by some to have better convergence properties, since the solution at each iteration is unique.

8 How Do We Choose the Rank?

This is not an easy question. One approach is to first decompose our matrix A via the singular value decomposition (SVD). Roughly speaking, SVD finds an orthonormal basis for the column space of A and then treats the coordinates of the data with respect to that basis as random variables; the basis directions along which there is relatively little variation may be considered unimportant.

> asvd <- svd(a)
> asvd$d
 [1] 1.188935e+02 2.337674e+01 1.685734e+01 1.353372e+01 1.292724e+01
 [6] 1.039371e+01 9.517687e+00 9.154770e+00 8.551464e+00 7.776239e+00
[11] 6.984436e+00 6.505657e+00 5.987315e+00 5.059873e+00 5.032140e+00
[16] 4.826665e+00 4.576616e+00 4.558703e+00 4.107498e+00 3.983899e+00
[21] 3.923439e+00 3.606591e+00 3.415380e+00 3.279176e+00 3.093363e+00
[26] 3.019918e+00 2.892923e+00 2.814265e+00 2.721734e+00 2.593321e+00

So the first singular value is about 119, the next about 23, and so on; the seventh is already down into the single-digit range. So we might take k to be, say, 10: the columns of A lie approximately in a subspace of R^194 of dimension about 10.

9 Why Nonnegative?

In the applications we've mentioned here, we always have A ≥ 0. However, that doesn't necessarily mean that we need W and H to be nonnegative, since we could always truncate. Indeed, we could consider using SVD instead. There are a couple of reasons NMF may be preferable.

First, truncation may be difficult if we have a lot of negative values. But the second reason is rather philosophical, as follows. In the image recognition case, there is hope that the columns W_{.j} will be sparse, i.e. mostly 0s. Then we might have, say, the nonzero elements of W_{.1} correspond to eyes, those of W_{.2} to a nose, and so on with other parts of the face. We are then summing these parts to form a complete face. This may enable effective parts-based recognition.

Sparsity (and uniqueness) might be achieved by using regularization methods, in which we minimize something like

    \|A - WH\| + \lambda (\|W\|_1 + \|H\|_1)            (11)

where the subscript 1 ("L_1 norm") means the norm involves sums of absolute values rather than sums of squares. This guards against either factor becoming too large, and it turns out that it can also lead to sparse solutions. We can try this under nmf() by setting method = "pe-nmf".
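For example, a minimal sketch of such a call, with the rank 10 suggested in Section 8 and the method's regularization settings simply left at their defaults:

> nout <- nmf(a, 10, method = "pe-nmf")
> w <- nout@fit@w
> mean(w < 1e-6)   # rough sparsity check: fraction of (near-)zero entries in W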