Fast Coordinate Descent methods for Non-Negative Matrix Factorization

Fast Coordinate Descent methods for Non-Negative Matrix Factorization
Inderjit S. Dhillon, University of Texas at Austin
SIAM Conference on Applied Linear Algebra, Valencia, Spain, June 19, 2012
Joint work with Cho-Jui Hsieh

Outline: 1. Non-negative Matrix Factorization

Problem Definition
Input: a non-negative matrix $V \in \mathbb{R}^{m \times n}$ and the target rank $k$.
Output: two non-negative matrices $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{n \times k}$ such that $WH^T$ is a good approximation to $V$. Usually $m, n \gg k$.
How to measure goodness of approximation? Two widely used choices:
Least squares NMF: $\min_{W,H \ge 0} f(W,H) \equiv \|V - WH^T\|_F^2 = \sum_{i,j} \big(V_{ij} - (WH^T)_{ij}\big)^2$
KL-divergence NMF: $\min_{W,H \ge 0} L(W,H) \equiv \sum_{i,j} \big[ V_{ij} \log\big(V_{ij}/(WH^T)_{ij}\big) - V_{ij} + (WH^T)_{ij} \big]$
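
As a concrete reference, here is a minimal NumPy sketch of the two objectives above (function and variable names are illustrative, not from the authors' code):

    import numpy as np

    def ls_nmf_objective(V, W, H):
        """Least squares NMF objective ||V - W H^T||_F^2."""
        R = V - W @ H.T
        return np.sum(R ** 2)

    def kl_nmf_objective(V, W, H, eps=1e-12):
        """KL objective sum_ij [V_ij log(V_ij/(WH^T)_ij) - V_ij + (WH^T)_ij]."""
        WH = W @ H.T
        # Entries with V_ij = 0 contribute only (WH^T)_ij; eps guards log(0).
        return np.sum(np.where(V > 0, V * np.log((V + eps) / (WH + eps)) - V, 0.0) + WH)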

Problem Definition (Cont'd)
Applications: text mining, image processing, .... NMF can give a more interpretable basis than the SVD. To achieve better sparsity, researchers have proposed adding L1 regularization terms on $W$ and $H$:
$(W,H) = \arg\min_{W,H \ge 0} \tfrac{1}{2}\|V - WH^T\|_F^2 + \rho_1 \sum_{i,r} W_{ir} + \rho_2 \sum_{j,r} H_{jr}$

Existing Optimization Methods
NMF is nonconvex, but it is convex when $W$ or $H$ is fixed. Recent methods follow the alternating minimization framework: iteratively solve $\min_{W \ge 0} f(W,H)$ and $\min_{H \ge 0} f(W,H)$ until convergence. For least squares NMF, each sub-problem can be exactly or approximately solved by:
1. Multiplicative rule (Lee and Seung, 2001)
2. Projected gradient method (Lin, 2007)
3. Newton-type updates (Kim, Sra and Dhillon, 2007)
4. Active set method (Kim and Park, 2008)
5. Cyclic coordinate descent method (Cichocki and Phan, 2009)

Coordinate Descent Method
Update one variable at a time until convergence: $(W,H) \leftarrow (W + s E_{ir}, H)$.
Get $s$ by solving a one-variable problem: $\min_{s:\, W_{ir}+s \ge 0} g^W_{ir}(s) \equiv f(W + s E_{ir}, H)$.
For the square loss, $g^W_{ir}$ has a closed-form solution:
$s^* = \max\big(0,\ W_{ir} - g'_{ir}(0)/g''_{ir}(0)\big) - W_{ir}$,
where $g'_{ir}(0) = \nabla_{W_{ir}} f(W,H) = (WH^TH - VH)_{ir}$ and $g''_{ir}(0) = \nabla^2_{W_{ir}} f(W,H) = (H^TH)_{rr}$.
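
A minimal NumPy sketch of this closed-form update, assuming $H^TH$ and $VH$ have been precomputed and $(H^TH)_{rr} > 0$ (names are illustrative):

    import numpy as np

    def update_W_ir(W, i, r, HtH, VH):
        """Closed-form coordinate update of W[i, r] for the squared loss.

        HtH = H.T @ H (k x k) and VH = V @ H (m x k) are precomputed, so the
        update costs O(k). Returns the applied step s*.
        """
        g1 = W[i, :] @ HtH[:, r] - VH[i, r]   # g'_ir(0) = (W H^T H - V H)_ir
        g2 = HtH[r, r]                        # g''_ir(0) = (H^T H)_rr, assumed > 0
        s = max(0.0, W[i, r] - g1 / g2) - W[i, r]
        W[i, r] += s
        return s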

Cyclic Coordinate Descent for Least Squares NMF (FastHals)
Recently, Cichocki and Phan (2009) proposed a cyclic coordinate descent algorithm (FastHals) for least squares NMF.
Fixed update sequence: $W_{1,1}, W_{1,2}, \ldots, W_{1,k}, W_{2,1}, \ldots, W_{m,k}, H_{1,1}, \ldots, H_{n,k}, W_{1,1}, \ldots$
Each update has time complexity $O(k)$.
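
One cyclic sweep over $W$ in this fixed order could look like the following sketch, reusing update_W_ir from the sketch above (the sweep over $H$ is symmetric):

    def fasthals_sweep_W(V, W, H):
        """One cyclic coordinate-descent sweep over all entries of W."""
        HtH = H.T @ H          # k x k, computed once per sweep
        VH = V @ H             # m x k, computed once per sweep
        m, k = W.shape
        for i in range(m):
            for r in range(k):
                update_W_ir(W, i, r, HtH, VH)   # each update is O(k)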

Variable Selection
FastHals updates all variables uniformly. However, an efficient algorithm should update variables with frequency proportional to their importance! We propose a Greedy Coordinate Descent method (GCD) for NMF.
[Figures: objective value vs. number of coordinate updates, and number of updates per variable vs. the variable values in the solution $H$, contrasting the behavior of FastHals with the behavior of GCD.]

Greedy Coordinate Descent (GCD)
Strategy: select the variables which maximally reduce the objective function.
When $W_{ir}$ is selected, the objective function is reduced by
$D^W_{ir} \equiv f(W,H) - f(W + s^* E_{ir}, H) = -G^W_{ir} s^* - \tfrac{1}{2}(H^TH)_{rr}(s^*)^2$,
where $G^W \equiv \nabla_W f(W,H) = WH^TH - VH$ and $s^*$ is the optimal step size.
If $D^W$ can be easily maintained, we can choose the variables with the largest objective function value reduction according to $D^W$.

How to maintain $D^W$ (objective value reduction)
$s^*$ can be computed from $G^W$ and $H^TH$ (from the one-variable update rule):
$D^W_{ir} = -G^W_{ir} s^* - \tfrac{1}{2}(H^TH)_{rr}(s^*)^2$, where $G^W = WH^TH - VH$.
Therefore, we can maintain $D^W$ if $G^W$ and $H^TH$ are known. When $W_{ir} \leftarrow W_{ir} + s^*$, only the $i$th row of $G^W$ changes:
$G^W_{ij} \leftarrow G^W_{ij} + s^*(H^TH)_{rj}$ for $j = 1,\ldots,k$.
Therefore, the time for maintaining $D^W$ is only $O(k)$, the same time complexity as cyclic coordinate descent!

Greedy Coordinate Descent (GCD)
Following the alternating minimization framework, our algorithm GCD alternately updates the variables in $W$ and $H$. When updating one variable in $W$, we can maintain $D^W$ in $O(k)$ time.
We conduct a sequence of updates on $W$: $W^{(0)}, W^{(1)}, \ldots$ with a corresponding sequence $(D^W)^{(0)}, (D^W)^{(1)}, \ldots$
When to switch from $W$'s updates to $H$'s updates? We update variables in $W$ until the maximum function value decrease is small enough:
$\max_{i,j} D^W_{ij} < \epsilon\, p_{\mathrm{init}}$, where $p_{\mathrm{init}} = \max_{i,j} (D^W)^{(0)}_{ij}$.

Greedy Coordinate Descent (GCD)
Initialize $H^TH$, $W^TW$.
While (not converged):
1. Compute $G^W = W(H^TH) - VH$.
2. Compute $D^W$ according to $G^W$.
3. Compute $p_{\mathrm{init}} = \max_{i,r} D^W_{ir}$.
4. For each row $i$ of $W$:
   - $q_i = \arg\max_r D^W_{i,r}$
   - While $D^W_{i,q_i} > \epsilon\, p_{\mathrm{init}}$:
     4.1 Update $W_{i,q_i}$.
     4.2 Update $W^TW$ and $D^W$.
     4.3 $q_i \leftarrow \arg\max_r D^W_{ir}$.
5. For updates to $H$, repeat steps analogous to Steps 1 through 4.
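
A minimal NumPy sketch of one such greedy pass over $W$ (Steps 1-4), assuming no column of $H$ is entirely zero; the pass over $H$ is analogous and all names are illustrative:

    import numpy as np

    def gcd_pass_W(V, W, H, eps=1e-3):
        """One greedy coordinate-descent pass over W for the squared loss."""
        HtH = H.T @ H
        d = np.diag(HtH)                          # (H^T H)_rr, assumed > 0
        G = W @ HtH - V @ H                       # G^W = W H^T H - V H
        S = np.maximum(W - G / d, 0.0) - W        # optimal steps s*
        D = -G * S - 0.5 * d * S ** 2             # objective decreases D^W
        p_init = D.max()
        for i in range(W.shape[0]):
            q = int(np.argmax(D[i]))
            while D[i, q] > eps * p_init:
                s = S[i, q]
                W[i, q] += s
                G[i, :] += s * HtH[q, :]          # O(k) maintenance of row i of G^W
                S[i, :] = np.maximum(W[i, :] - G[i, :] / d, 0.0) - W[i, :]
                D[i, :] = -G[i, :] * S[i, :] - 0.5 * d * S[i, :] ** 2
                q = int(np.argmax(D[i]))
        return W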

Comparisons
Time (in seconds) to reach the specified relative error for least squares NMF. FHals = FastHals, PGrad = projected gradient, BPivot = block pivot.

dataset   m       n      k   rel. error   GCD     FHals   PGrad   BPivot
Synth03   500     1,000  10  10^-4        0.6     2.3     2.1     1.7
                         30  10^-4        4.0     9.3     26.6    12.4
Synth08   500     1,000  10  10^-4        0.21    0.43    0.53    0.56
                         30  10^-4        0.43    0.77    2.54    2.86
CBCL      361     2,429  49  0.0410       2.3     4.0     13.5    10.6
                             0.0376       8.9     18.0    45.6    30.9
                             0.0373       14.6    29.0    84.6    51.5
ORL       10,304  400    25  0.0365       1.8     6.5     9.0     7.4
                             0.0335       14.1    30.3    98.6    33.9
                             0.0332       33.0    63.3    256.8   76.5

Comparisons
Results on MNIST ($m = 780$, $n = 60{,}000$, #nonzeros = 8,994,156, $k = 10$).
[Figure: relative function value difference vs. time (sec) for GCD, FastHals, and BlockPivot.]

KL-NMF
KL-NMF: $\min_{W,H \ge 0} L(W,H) \equiv \sum_{i,j} \big[ V_{ij}\log\big(V_{ij}/(WH^T)_{ij}\big) - V_{ij} + (WH^T)_{ij} \big]$
The one-variable sub-problem:
$D_{ir}(s) = L(W + sE_{ir}, H) = \sum_{j=1}^{n} \big[ -V_{ij}\log\big((WH^T)_{ij} + sH_{jr}\big) + sH_{jr} \big] + \mathrm{constant}$
The one-variable update does not have a closed-form solution.

Newton Update for One-Variable Sub-problem
We use Newton's method to solve each one-variable sub-problem. When updating $W_{ir}$, iteratively update $s$ by the Newton direction:
$s \leftarrow \max\big(-W_{ir},\ s - h'_{ir}(s)/h''_{ir}(s)\big)$,
where
$h'_{ir}(s) = \sum_{j=1}^{n} H_{jr}\Big(1 - \frac{V_{ij}}{(WH^T)_{ij} + sH_{jr}}\Big)$ and
$h''_{ir}(s) = \sum_{j=1}^{n} \frac{V_{ij} H_{jr}^2}{\big((WH^T)_{ij} + sH_{jr}\big)^2}$.
One can show that Newton's method without line search converges to the optimum of this one-variable sub-problem.
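
A sketch of this Newton iteration for a single $W_{ir}$, assuming $(WH^T)_{ij} > 0$ and a fixed small number of inner iterations (names are illustrative):

    import numpy as np

    def newton_one_var_kl(V_row, H_col, WHT_row, w_ir, n_iters=10):
        """Newton steps for the one-variable KL sub-problem in W_ir.

        V_row = V[i, :], H_col = H[:, r], WHT_row = (W @ H.T)[i, :].
        The constraint W_ir + s >= 0 is enforced at every iteration.
        """
        s = 0.0
        for _ in range(n_iters):
            denom = WHT_row + s * H_col                     # (WH^T)_ij + s H_jr
            h1 = np.sum(H_col * (1.0 - V_row / denom))      # h'_ir(s)
            h2 = np.sum(V_row * H_col ** 2 / denom ** 2)    # h''_ir(s)
            if h2 <= 0.0:                                   # degenerate row: stop
                break
            s = max(-w_ir, s - h1 / h2)
        return s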

CCD for KL-divergence
Compute $WH^T$.
While (not converged):
1. For all $(i,r)$ pairs:
   - While (not converged): $s \leftarrow \max\big(-W_{ir},\ s - h'_{ir}(s)/h''_{ir}(s)\big)$
   - $W_{ir} \leftarrow W_{ir} + s$
   - Maintain $(WH^T)_{i,:}$ by $(WH^T)_{i,:} \leftarrow (WH^T)_{i,:} + s\, H_{:,r}^T$.
2. For updates to $H$, repeat steps analogous to Step 1.
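
The outer loop over $W$ might then look like this sketch, reusing newton_one_var_kl from above (the sweep over $H$ is analogous):

    def ccd_kl_sweep_W(V, W, H, WHT):
        """One CCD sweep over W for KL-NMF; WHT holds the current W @ H.T."""
        m, k = W.shape
        for i in range(m):
            for r in range(k):
                s = newton_one_var_kl(V[i, :], H[:, r], WHT[i, :], W[i, r])
                W[i, r] += s
                WHT[i, :] += s * H[:, r]        # maintain (WH^T)_{i,:} in O(n)
        return W, WHT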

Experimental Results
Time comparison results for KL divergence (time in seconds to reach the given relative error). * indicates the specified objective value is not achievable.

dataset   k   rel. error   CCD      Multiplicative
Synth03   30  10^-3        121.1    749.5
              10^-5        184.32   7092.3
Synth08   30  10^-2        22.6     46.0
              10^-5        56.8     *
CBCL      49  0.1202       38.2     21.2
              0.1103       123.2    562.6
              0.1093       166.0    3266.9
ORL       25  0.3370       73.7     165.2
              0.3095       253.6    902.2
              0.3067       370.2    1631.9

Our method can be naturally extended to solve Non-negative Tensor Factorization (NTF). A tensor is a multi-dimensional matrix: $V \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, where $I_1, \ldots, I_N$ are index upper bounds and $N$ is the order (dimension) of the tensor.
A rank-$k$ approximation to the $N$-way tensor:
$V \approx \sum_{j=1}^{k} u^j_1 \otimes u^j_2 \otimes \cdots \otimes u^j_N$,
where $\otimes$ indicates the outer product of vectors.
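
For the three-way case ($N = 3$), the rank-$k$ approximation above can be formed directly with an einsum; this small NumPy sketch uses illustrative factor names:

    import numpy as np

    def rank_k_tensor_approx(U1, U2, U3):
        """Sum over j of the outer products U1[:, j] x U2[:, j] x U3[:, j].

        U1, U2, U3 have shapes (I1, k), (I2, k), (I3, k); the result has
        shape (I1, I2, I3).
        """
        return np.einsum('aj,bj,cj->abc', U1, U2, U3)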

GCD for non-negative tensor factorization
GCD can be extended to solve NTF problems. As in the NMF case, GCD outperforms state-of-the-art algorithms.
[Figures: relative error vs. time (sec) for GCD, FastHals, and BlockPivot on a synthetic tensor (1000 × 200 × 100) and on CMU image data (640 × 128 × 120).]

Conclusions We propose two algorithms: Greedy Coordinate Descent (GCD) for least squares NMF and Cyclic Coordinate Descent (CCD) for KL-NMF. Both algorithms outperform state-of-the-art methods. Code is available at http://www.cs.utexas.edu/~cjhsieh/nmf/