Rank Selection in Low-rank Matrix Approximations: A Study of Cross-Validation for NMFs


Bhargav Kanagal, Department of Computer Science, University of Maryland, College Park, MD, bhargav@cs.umd.edu
Vikas Sindhwani, Mathematical Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY, vsindhw@us.ibm.com

Abstract

We consider the problem of model selection in unsupervised statistical learning techniques based on low-rank matrix approximations. While k-fold cross-validation (CV) has become the standard method of choice for model selection in supervised learning, its adaptation to unsupervised matrix approximation settings has not received sufficient attention in the literature. In this paper, we emphasize the natural link between cross-validating matrix approximations and the task of matrix completion from partially observed entries. In particular, we focus on Non-negative Matrix Factorizations and propose scalable adaptations of Weighted NMF algorithms to efficiently implement cross-validation procedures for different choices of holdout patterns and sizes. Empirical observations on text modeling problems involving large, sparse document-term matrices suggest that these procedures enable easier and more accurate selection of rank (i.e., the number of topics in text) than other alternatives for implementing CV.

1 Introduction

Let V be a large, sparse matrix of size $m \times n$ representing a collection of m high-dimensional data points in $\mathbb{R}^n$. The assumption that V is low-rank implies that most data points can be compactly and accurately represented as linear combinations of a small set of k basis vectors $H \in \mathbb{R}^{k \times n}$, with coefficients $W \in \mathbb{R}^{m \times k}$ providing a lower-dimensional encoding. When V is non-negative, it is appealing to constrain H and W to be non-negative as well, since this lends a parts-based interpretability to the representation: each non-negative data point may be seen as an additive, sparse composition of k parts. Such Non-negative Matrix Factorizations (NMF) [5], $V \approx WH$, find wide applicability in a variety of problems [2]. In particular, when the approximation error is measured in terms of generalized KL-divergence (I-divergence), NMF becomes identical to probabilistic Latent Semantic Indexing (pLSI), a classic algorithm for modeling topics in text. NMF can also be regularized with different types of matrix norms, including $\ell_1$ for promoting sparsity. Similarly, with squared Frobenius approximation error, NMF is akin to Latent Semantic Indexing (LSI), except for the additional non-negativity constraints (which turn it into an NP-hard optimization problem).

In this paper, we are concerned with the problem of model selection in NMFs and, more specifically, with how to choose the correct number of topics when building NMF models of text using forms of cross-validation. In standard supervised learning, K-fold cross-validation refers to the common procedure where the data is partitioned into K chunks, each of which takes a turn serving as the held-out test set for a model trained on the rest. In the end, the error is averaged across the K runs and the model with the smallest K-fold error is selected. When thinking of how to extend CV ideas to matrix approximation, several natural questions arise: How should a matrix be partitioned to define CV folds? How many folds should one use? Should rows be held out, or columns, or both? Should the held-out pattern be somehow aligned with the data sparsity to avoid holding out too many zero cells?
How should the model be trained in the presence of held-out cells?

[Figure 1: (i), (ii) illustrate Gabriel 3×3 holdouts ((i) partition of the matrix, (ii) fold #5 of 9). (iii) Owen and Perry's reconstruction algorithm: Step 1: $W_D, H_D = \mathrm{NMF}(D)$; Step 2: $W_A = \mathrm{NMF}(B, H_D)$; Step 3: $H_A = \mathrm{NMF}(C, W_D)$; Step 4: $\hat{A} = W_A H_A$. (iv) Row-holdouts for CV in NMF cause overfitting (monotonic decrease in held-out error).]

The last question immediately establishes a link to matrix completion, a subject of significant interest both in the practice of recommender systems [4] and in the theory of nuclear norm minimization [1]. Held-out cells may be seen as missing values that need to be estimated. In this work, we propose a non-negative matrix completion based approach to implementing CV as an alternative to previously considered procedures. For scalability, we adapt weighted NMF algorithms for this purpose, where zero-valued weights allow cells to be held out. We empirically explore the distinction between Gabriel [3] and Wold [11] holdouts, which correspond to holding out submatrices or random cells respectively, and study the sensitivity of rank selection to holdout sizes.

2 Cross-Validation in Matrix Approximations

For simplicity of exposition, we work with the Frobenius distance, $\|V - WH\|^2_{fro}$, as the measure of matrix approximation error in the rest of the paper. The most common CV procedure is based on holding out rows of the matrix, as in the supervised learning case. However, models like LSI, pLSI and NMF are transductive and do not inherently allow out-of-sample extension, and hence a fold-in procedure is employed. Let $V_{train}$ be the held-in rows and $V_{test}$ the held-out rows. An NMF is first learnt on the training set, $(W_{train}, H_{train}) = \arg\min_{W,H} \|V_{train} - WH\|^2_{fro}$. The held-out error, estimated as $\min_W \|V_{test} - W H_{train}\|^2$ where $H_{train}$ is held fixed, may be seen as a measure of how well topics learnt on the training rows explain the unseen held-out data. However, row-holdouts have been criticized in the literature [10] and may lead to overfitting, since the model parameters W are re-estimated on $V_{test}$ and in this sense the held-out data is not completely unseen (also observed in our analysis, Figure 1(iv)).

In a recent paper, Owen and Perry [7] revisit cross-validation in the context of SVD and NMF. They use the term Bi-Cross-Validation (BiCV) to refer to more general two-dimensional holdout patterns. In particular, they work with Gabriel holdout patterns [3] in which the rows of the matrix are divided into h groups and the columns into l groups. The number of folds is hl; in each fold, a given row and column group identifies the held-out submatrix while the remaining cells are available for training. An illustration of 3×3-fold cross-validation is shown in Figure 1(i, ii). During each round of BiCV, the matrix may be viewed as being composed of 4 submatrices, $V = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$, where A is the held-out block and the training blocks B, C, D are used to reconstruct A as $\hat{A}$, resulting in the held-out error estimate $\|A - \hat{A}\|^2_{fro}$.

BiCV for SVD and NMF calls for different reconstruction procedures. For SVD, it may be understood in terms of multivariate regression of target variables C on input variables D, followed by prediction of A from B. Thus, a least squares problem is solved, $\hat{\beta} = \arg\min_{\beta} \|C - D\beta\|^2_{fro}$, which provides the estimate $\hat{A} = B\hat{\beta} = B D^{+} C$, where $D^{+}$ is the pseudo-inverse of D. In this procedure, $D^{+}$ may be exactly computed from the SVD of D.
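To make the SVD reconstruction concrete, here is a minimal numpy sketch of one Gabriel BiCV fold, assuming boolean masks that mark the held-out rows and columns; the function name and the use of a rank-k truncated SVD of D are illustrative choices, not code from the paper:

```python
import numpy as np

def svd_bicv_error(V, row_mask, col_mask, k):
    """Held-out error for one Gabriel fold of BiCV with a rank-k SVD.

    row_mask / col_mask are boolean vectors marking the held-out rows and
    columns; the held-out block is A = V[row_mask][:, col_mask].
    """
    A = V[np.ix_(row_mask, col_mask)]    # held-out block
    B = V[np.ix_(row_mask, ~col_mask)]   # held-out rows, held-in columns
    C = V[np.ix_(~row_mask, col_mask)]   # held-in rows, held-out columns
    D = V[np.ix_(~row_mask, ~col_mask)]  # fully held-in training block

    # Rank-k truncated SVD of the training block D.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # A_hat = B * pinv(D_k) * C: regression of C on D, then prediction of A from B.
    A_hat = B @ np.linalg.pinv(D_k) @ C
    return np.linalg.norm(A - A_hat, "fro") ** 2
```

Looping this over all hl folds and averaging would give the BiCV error as a function of the candidate rank k.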
Thus, the BiCV procedure for SVD reduces to the SVD of a training submatrix followed by reconstruction of the held-out submatrix. Owen and Perry remark that this avoids the use of missing-value methods to estimate held-out cells, for which only locally optimal solutions can be found. This is also the reason Gabriel holdouts may be preferred over Wold [11] holdouts, where random cells (instead of submatrices) are held out and must be estimated by missing-value imputation. Owen and Perry's NMF BiCV procedure is described in Figure 1(iii). It is reasoned as follows. Instead of taking the SVD of D, an NMF $D \approx W_D H_D$ is constructed. Then $\hat{A} = B (W_D H_D)^{+} C = (B H_D^{+})(W_D^{+} C) = W_A H_A$, where $W_A = B H_D^{+} = \arg\min_W \|B - W H_D\|^2_{fro}$ and $H_A = W_D^{+} C = \arg\min_H \|C - W_D H\|^2_{fro}$. However, this reasoning produces $W_A$ and $H_A$ that may have negative entries, and therefore Owen and Perry replace these least squares estimates with non-negative least squares (Steps 2 and 3 in Figure 1(iii)).
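A corresponding sketch of Owen and Perry's NMF reconstruction (Steps 1 to 4 of Figure 1(iii)), assuming some external NMF solver `nmf(D, k)` returning non-negative factors; the row-wise NNLS helper and all names are illustrative:

```python
import numpy as np
from scipy.optimize import nnls

def nnls_rows(Y, X):
    """Row-wise non-negative least squares: find G >= 0 minimizing ||Y - G X||_F."""
    return np.vstack([nnls(X.T, y)[0] for y in Y])

def op_nmf_bicv_error(A, B, C, D, k, nmf):
    """Owen-Perry BiCV error for one Gabriel fold, given a solver
    nmf(D, k) -> (W_D, H_D) with D approximately equal to W_D @ H_D."""
    W_D, H_D = nmf(D, k)           # Step 1: NMF of the training block
    W_A = nnls_rows(B, H_D)        # Step 2: B ~ W_A @ H_D with W_A >= 0
    H_A = nnls_rows(C.T, W_D.T).T  # Step 3: C ~ W_D @ H_A with H_A >= 0
    A_hat = W_A @ H_A              # Step 4: reconstruct the held-out block
    return np.linalg.norm(A - A_hat, "fro") ** 2
```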

Rather than adapting SVD BiCV formulations to NMF in this manner, we propose the direct use of non-negative matrix completion (imputation) methods. Unlike BiCV for SVD, where Owen and Perry's procedure avoids potentially suboptimal imputation methods, the NMF case is different since the factorization problem is itself non-convex. Hence, we revisit imputation for both Gabriel and Wold holdouts in the NMF context. We begin by describing weighted NMFs that can be adapted for BiCV. We then show how BiCV can be performed efficiently by using low-rank weight matrices. An empirical study is then reported on high-dimensional text modeling problems. For related work on NMF model selection, see [8, 7] and references therein.

3 Cross-Validation Using Weighted NMF

Weighted NMFs minimize the objective function $\arg\min_{W,H \geq 0} \|\sqrt{S} \odot (V - WH)\|^2_{fro}$, where V is an $m \times n$ data matrix, S is an $m \times n$ matrix of weights, W and H are $m \times k$ and $k \times n$ matrices with $k \ll m, n$, $\odot$ is the Hadamard (element-wise) product, and $\sqrt{S}$ is the element-wise square root, not the matrix square root. The weights S allow domain-specific emphasis on reconstructing certain matrix entries in preference to others. For example, in face recognition applications [9], more emphasis may be placed on central image pixels, which are more likely not to be part of the background. It has been shown [6, 9] that the following modifications to the Lee and Seung multiplicative update rules [5] minimize the above objective:

$$W \leftarrow W \odot \frac{(S \odot V) H^T}{(S \odot (WH)) H^T}, \qquad H \leftarrow H \odot \frac{W^T (S \odot V)}{W^T (S \odot (WH))} \qquad (1)$$

In the context of cross-validation, weighted NMFs may be used by setting S to be binary, i.e., $S_{ij} = 1$ for held-in entries and $S_{ij} = 0$ for held-out entries. Note also that in this case $\sqrt{S} = S$.

We now develop techniques for efficiently evaluating Equation (1). Since the held-in set is typically larger than the held-out set, we work with $C = \mathbf{1} - S$ to allow for more efficient sparse matrix computations. $S \odot V$ zeros out the held-out entries of V and can be efficiently computed as $V - C \odot V$. Computing $S \odot (WH)$ efficiently is non-trivial, since it might seem that we first need to multiply W and H, resulting in a large dense matrix WH. In the unweighted case, $W^T(WH) = (W^T W)H$, so forming the dense matrix WH can be avoided; however, this does not carry over to the weighted case. First, if C is highly sparse, we can compute $W^T((\mathbf{1} - C) \odot (WH)) = (W^T W)H - W^T(C \odot (WH))$, where $C \odot (WH)$ can be computed efficiently by evaluating WH only at the non-zero entries of C. However, this trick does not work when C is large and dense, such as in a 2×2 holdout.

We now show that by setting C to be a low-rank binary matrix, we can cover Gabriel holdouts in addition to a large class of Wold holdouts. First, we observe a simple relationship involving element-wise products with low-rank matrices. Let $C = \sum_{i=1}^{q} u_i v_i^T$ be a rank-q matrix, where $u_i \in \mathbb{R}^m$ and $v_i \in \mathbb{R}^n$. Then

$$W^T (C \odot (WH)) = \sum_{i=1}^{q} W^T \left( D_{u_i} (WH) D_{v_i} \right) = \sum_{i=1}^{q} (W^T D_{u_i} W)(H D_{v_i}),$$

where $D_u$ is the diagonal matrix with diagonal elements given by u, i.e., $D_u(i,i) = u(i)$. Gabriel-style block holdouts are covered trivially by noting that $C = uv^T$, where $u(i) = 1$ if row i is in the held-out block and 0 otherwise, and $v(j) = 1$ if column j is in the held-out block and 0 otherwise. In other words, Gabriel holdouts are the special case q = 1. Taking q > 1, more general held-out patterns can be designed by appropriately choosing the $(u_i, v_i)$; in particular, holdout patterns consisting of multiple blocks can also be expressed.
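For reference, here is a minimal sketch of the weighted multiplicative updates of Equation (1) with a binary mask S; for clarity it forms the dense product WH directly rather than exploiting the low-rank structure of C, and all names, the iteration count, and the small damping constant are illustrative choices:

```python
import numpy as np

def weighted_nmf(V, S, k, n_iter=200, eps=1e-9, seed=0):
    """Weighted NMF via the multiplicative updates of Equation (1).

    S is a binary weight matrix: S[i, j] = 1 for held-in cells and 0 for
    held-out cells.  Returns non-negative W (m x k) and H (k x n).
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    SV = S * V                                        # S o V: zero out held-out cells
    for _ in range(n_iter):
        W *= (SV @ H.T) / ((S * (W @ H)) @ H.T + eps)
        H *= (W.T @ SV) / (W.T @ (S * (W @ H)) + eps)
    return W, H

def heldout_error(V, S, W, H):
    """Frobenius error restricted to the held-out cells (where S == 0)."""
    R = (1 - S) * (V - W @ H)
    return np.linalg.norm(R, "fro") ** 2
```

Building S with zeros on a Gabriel block (or on a low-rank Wold pattern) and minimizing the held-out error over candidate ranks gives the cross-validation procedure studied below.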
4 Experimental Evaluation

In this section, we describe the results of our preliminary experimental evaluation. To implement Wold holdouts, we split the matrix into 16 blocks (after shuffling rows and columns). We randomly select four blocks (one in each block-row, i.e., q = 4 in Section 3) to hold out, and the remaining submatrices are held in. Since weighted NMF is highly sensitive to the initial values of the W and H matrices [6], we execute it multiple times and take the mean of the errors.
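The Wold-style pattern just described can be encoded as a rank-4 mask $C = \sum_i u_i v_i^T$, which also lets us sanity-check the identity from Section 3. The sketch below assumes one held-out block per row group and per column group; all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, q = 400, 600, 10, 4

# Split rows and columns into 4 groups each (after shuffling), giving 16 blocks,
# and hold out one block per row group (and, here, per column group).
row_grp = rng.permutation(m) % 4
col_grp = rng.permutation(n) % 4
p = rng.permutation(4)  # held-out column group for each row group

# C = sum_i u_i v_i^T is the rank-q indicator of held-out cells.
U = np.stack([(row_grp == i).astype(float) for i in range(q)], axis=1)     # m x q
Vc = np.stack([(col_grp == p[i]).astype(float) for i in range(q)], axis=1)  # n x q
C = U @ Vc.T
S = 1.0 - C  # binary weights: 1 = held in, 0 = held out

# Sanity check of W^T (C o (WH)) = sum_i (W^T D_{u_i} W)(H D_{v_i}).
W = rng.random((m, k))
H = rng.random((k, n))
lhs = W.T @ (C * (W @ H))
rhs = sum((W.T * U[:, i]) @ W @ (H * Vc[:, i]) for i in range(q))
print(np.allclose(lhs, rhs))  # True
```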

[Figure 2: Results on the synthetic, Reuters, BBC and TDT2 datasets. Parts (a,b,c,d) illustrate the benefits of WNMF over OP; parts (e,f,g,h) show performance for Gabriel holdouts of different sizes; parts (i,j,k,l) compare Wold and Gabriel holdouts.]

We work with one synthetic and three real-world text datasets. For the synthetic dataset, we construct a V matrix with a known rank, i.e., a known number of topics. We use three real-world document collections with human-labeled topics: BBC (size 2235 × 9635), Reuters (size 7285 × 18221), and TDT2 (size 9394 × 6545). We evaluate model selection in terms of recovering the estimated number of topics in each dataset.

OP vs. WNMF for Gabriel holdouts.
Here, we compare the technique of Owen and Perry (OP) and our WNMF-based approach for predicting the number of topics. For both techniques, we pick the same 2×2 Gabriel holdout. We compute the cross-validation errors for different values of K for both OP and the WNMF approach, and plot them in Figure 2(a,b,c,d) for the four datasets. As shown in the figures, for small values of K both WNMF and OP produce similar errors; however, beyond a certain model complexity, the error in WNMF increases dramatically. Hence, the WNMF-based technique provides a distinct point at which we can read off the model parameters, unlike OP, for which the cross-validation error decreases monotonically with increasing model complexity. For each of the datasets, we find that the minimum of the WNMF error curve occurs close to the expected values.

Study of WNMF for different Gabriel holdouts.
In this experiment, we plot the cross-validation error as a function of the number of topics for Gabriel holdouts of different sizes: 2×2, 3×3 and 4×4. The results are shown in Figure 2(e,f,g,h) for the different datasets. The errors for 4×4 holdouts are smaller than for 3×3 holdouts, which in turn are smaller than for 2×2, since the error is computed over a smaller set of matrix entries. As we move from 2×2 to 4×4 holdouts, the value of K at which the minimum is achieved shifts progressively to the right. This behavior is expected: since we are using more training data to predict a smaller held-out set, increasing model complexity can lead to a smaller error. A similar observation was made by Owen and Perry [7]. The error curves also become progressively smoother; since we are using a larger training set, we obtain lower variance in the error.

Gabriel vs. Wold.
In this experiment, we check whether the ability to handle more general holdouts provides a more robust procedure for model selection. We compare the cross-validation errors obtained using 2×2 Gabriel holdouts and Wold holdouts (4 iterations) for different values of K. Our results are shown in Figure 2(i,j,k,l). As shown in the figure, the errors for Wold holdouts are quite similar to those for Gabriel holdouts. However, the error curves for Wold holdouts are not only smoother than those for Gabriel holdouts but also allow for more distinctive model selection.
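Putting these pieces together, the rank-selection loop used in such experiments can be sketched as follows, reusing the hypothetical `weighted_nmf` and `heldout_error` helpers from the earlier sketch; the fold masks and the number of random restarts are left to the caller:

```python
import numpy as np

def select_rank(V, masks, candidate_ks, n_restarts=3):
    """Pick the rank whose mean held-out error across folds and restarts is smallest.

    masks is a list of binary matrices S (1 = held in, 0 = held out), one per fold.
    """
    mean_err = {}
    for k in candidate_ks:
        errs = []
        for S in masks:
            for seed in range(n_restarts):  # WNMF is sensitive to initialization
                W, H = weighted_nmf(V, S, k, seed=seed)
                errs.append(heldout_error(V, S, W, H))
        mean_err[k] = float(np.mean(errs))
    return min(mean_err, key=mean_err.get), mean_err
```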

References

[1] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[2] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, 2009.
[3] K. R. Gabriel. Le biplot – outil d'exploration de données multidimensionnelles. Journal de la Société Française de Statistique, 2002.
[4] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[5] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[6] Y. Mao and L. K. Saul. Modeling distances in large-scale networks by matrix factorization. In Internet Measurement Conference, pages 278–287, 2004.
[7] A. B. Owen and P. O. Perry. Bi-cross-validation of the SVD and the non-negative matrix factorization. Annals of Applied Statistics, 3(2):564–594, 2009.
[8] V. Y. F. Tan and C. Févotte. Automatic relevance determination in nonnegative matrix factorization. In SPARS, 2009.
[9] V. D. Blondel, N.-D. Ho, and P. Van Dooren. Weighted nonnegative matrix factorization and face feature extraction. Submitted to Image and Vision Computing, 2007.
[10] M. Welling, C. Chemudugunta, and N. Sutter. Deterministic latent variable models and their pitfalls. In SDM, 2008.
[11] H. Wold. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics, 1978.