
Non-negative Matrix Factorization via accelerated Projected Gradient Descent

Andersen Ang
Mathématique et recherche opérationnelle, UMONS, Belgium
Email: manshun.ang@umons.ac.be
Homepage: angms.science

First draft: March 11, 2018
Last update: March 13, 2018

Overview

1. NMF
2. Projected gradient descent
3. Accelerated projected gradient descent
4. Summary

NMF

The NMF problem: given $X \in \mathbb{R}^{m \times n}_+$, find $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$ by
$$[W, H] = \arg\min_{W, H \geq 0} \frac{1}{2}\|X - WH\|_F^2.$$
- It has two variables $W$ and $H$ (hence it is not convex!)
- It is a constrained optimization problem ($W$, $H$ have to be non-negative)
- It is an NP-hard problem: see Vavasis's 2008 paper "On the complexity of nonnegative matrix factorization"
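Since the later slides only need the gradient of $f(W, H) = \frac{1}{2}\|X - WH\|_F^2$ with respect to each block, here is a small MATLAB sketch of the objective value and the two block gradients; the handle names are illustrative and not part of the original slides.

```matlab
% f(W,H) = 0.5*||X - W*H||_F^2 and its block gradients,
% obtained by expanding the Frobenius norm:
%   grad_W f = (W*H - X)*H',    grad_H f = W'*(W*H - X)
f     = @(X, W, H) 0.5 * norm(X - W*H, 'fro')^2;
gradW = @(X, W, H) (W*H - X) * H';
gradH = @(X, W, H) W' * (W*H - X);
```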

Projected gradient descent algorithm (PGD)

PGD is a way to solve constrained optimization problems of the form
$$\min_{x \in Q} f(x)$$
where $Q$ is the constraint set.

Starting from an initial feasible point $x_0 \in Q$, PGD iterates the following equation until a stopping condition is met:
$$x_{k+1} = P_Q\big(x_k - t_k \nabla f(x_k)\big).$$
$P_Q(\cdot)$ is the projection operator
$$P_Q(x_0) = \arg\min_{x \in Q} \frac{1}{2}\|x - x_0\|_2^2,$$
i.e. given a point $x_0$, $P_Q$ finds the point $x \in Q$ that is closest to $x_0$.

The PGD algorithm

PGD for the single-variable problem $\min_{x \in Q} f(x)$:

1. Pick an initial point $x_0 \in Q$
2. Loop until a stopping condition is met:
   1. Pick the descent direction $-\nabla f(x_k)$ and a step size $t_k$
   2. Update and project: $x_{k+1} = P_Q\big(x_k - t_k \nabla f(x_k)\big)$, where the projection is $P_Q(z) = \arg\min_{x \in Q} \frac{1}{2}\|x - z\|_2^2$

Note. If $f$ is $\beta$-smooth ($\nabla f$ is $\beta$-Lipschitz), we can pick $t_k = \frac{1}{\beta}$.
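As an illustration (not part of the original slides), a minimal MATLAB sketch of PGD for one NMF subproblem, $\min_{H \geq 0} \frac{1}{2}\|X - WH\|_F^2$ with $W$ held fixed, where $Q$ is the nonnegative orthant so $P_Q$ is an entrywise max with zero; the function name and the fixed iteration count are illustrative.

```matlab
% PGD sketch for min_{H >= 0} 0.5*||X - W*H||_F^2 with W held fixed.
function H = pgd_nnls(X, W, H, maxiter)
    t = 1 / norm(W'*W, 2);      % t = 1/beta; beta = ||W'*W||_2 is the Lipschitz
                                % constant of the gradient H -> W'*(W*H - X)
    for k = 1:maxiter
        G = W' * (W*H - X);     % gradient of f w.r.t. H
        H = max(H - t*G, 0);    % gradient step, then projection onto H >= 0
    end
end
```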

The Accelerated PGD algorithm

Like GD, Nesterov's acceleration can also be applied to PGD. The Accelerated PGD (APGD) algorithm for the single-variable problem $\min_{x \in Q} f(x)$ has two additional parameter sequences $\lambda, \gamma$ and an additional coupling sequence $y$. Acceleration is done by extrapolating the current iterate with a momentum term built from the previous iterate.

1. Pick an initial point $x_0 \in Q$, set $y_0 = x_0$, $\lambda_0 = 0$
2. Loop until a stopping condition is met:
   1. Pick the descent direction $-\nabla f(x_k)$ and compute the step size $t_k$
   2. Update and project: $y_k = P_Q\big(x_k - t_k \nabla f(x_k)\big)$
   3. Compute the coefficients $\lambda_k = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_{k-1}^2}\Big)$, $\gamma_k = \frac{1 - \lambda_{k-1}}{\lambda_k}$
   4. Extrapolation: $x_{k+1} = (1 - \gamma_k)y_k + \gamma_k y_{k-1}$

For the details of the accelerated gradient: click me
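Again as an illustration (an assumed helper, not from the slides), the same subproblem as before solved with the APGD scheme above, using exactly the $\lambda_k$, $\gamma_k$ sequences stated on this slide.

```matlab
% APGD sketch for min_{H >= 0} 0.5*||X - W*H||_F^2 with W held fixed.
function H = apgd_nnls(X, W, H, maxiter)
    t = 1 / norm(W'*W, 2);                   % step size 1/beta
    Y_prev = H;  lambda_prev = 0;            % y_0 = x_0, lambda_0 = 0
    for k = 1:maxiter
        G = W' * (W*H - X);                  % gradient at x_k (stored in H)
        Y = max(H - t*G, 0);                 % y_k = P_Q(x_k - t_k*grad f(x_k))
        lambda = (1 + sqrt(1 + 4*lambda_prev^2)) / 2;
        gamma  = (1 - lambda_prev) / lambda;
        H = (1 - gamma)*Y + gamma*Y_prev;    % x_{k+1} = (1-gamma_k)*y_k + gamma_k*y_{k-1}
        Y_prev = Y;  lambda_prev = lambda;
    end
end
```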

Applying the APGD algorithm on NMF

To apply APGD on NMF, note that NMF has two variables $W, H$ $\Rightarrow$ we have two loops, since the original AGD framework considers a single-variable problem.

The variables are matrices, not vectors $\Rightarrow$ the projections become
$$P_{Q_W}(W_0) = \arg\min_{Z \in Q_W} \frac{1}{2}\|Z - W_0\|_F^2, \qquad
P_{Q_H}(H_0) = \arg\min_{Z \in Q_H} \frac{1}{2}\|Z - H_0\|_F^2.$$
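Here $Q_W$ and $Q_H$ are nonnegative orthants, so both Frobenius-norm projections have an entrywise closed form; a one-line MATLAB sketch (the handle name is illustrative), with the lower bound set to $0$ or to the small $\epsilon$ used on the later slides:

```matlab
% Projection onto the nonnegative orthant is entrywise.
proj = @(Z, lb) max(Z, lb);   % P_Q(Z) with lower bound lb (0 or a small epsilon)
```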

The framework of the APGD algorithm on NMF

The framework of the alternating* APGD for NMF:

1. Pick initial matrices $W_0 \in \mathbb{R}^{m \times r}_+$ and $H_0 \in \mathbb{R}^{r \times n}_+$ and the parameters.
2. On $W$, loop until a stopping condition is met:
   1. Pick the descent direction and step size
   2. Update and project
   3. Extrapolate
3. On $H$, loop until a stopping condition is met:
   1. Pick the descent direction and step size
   2. Update and project
   3. Extrapolate

* So in fact such an algorithm falls into the framework of block coordinate descent; the blocks are the two variable matrices.

The APGD algorithm for NMF

To be specific, the APGD algorithm for NMF is:

1. (On $W$) Initialize $W_0$, $\epsilon$, set $V_0 = W_0$, $\lambda_0 = 0$, then loop until a stopping condition is met:
   1. $\lambda_k = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_{k-1}^2}\Big)$, $\gamma_k = \frac{1 - \lambda_{k-1}}{\lambda_k}$
   2. Update: $V_k = \max\Big([W_k - t_k^W \nabla_W f(W_k; H)]_{ij},\ \epsilon\Big)$
   3. Extrapolation: $W_{k+1} = (1 - \gamma_k)V_k + \gamma_k V_{k-1}$
2. (On $H$) Initialize $H_0$, $\epsilon$, set $G_0 = H_0$, $\lambda_0 = 0$, then loop until a stopping condition is met:
   1. $\lambda_k = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_{k-1}^2}\Big)$, $\gamma_k = \frac{1 - \lambda_{k-1}}{\lambda_k}$
   2. Update: $G_k = \max\Big([H_k - t_k^H \nabla_H f(H_k; W)]_{ij},\ \epsilon\Big)$
   3. Extrapolation: $H_{k+1} = (1 - \gamma_k)G_k + \gamma_k G_{k-1}$

The acceleration is achieved by extrapolating the current $W$, $H$ with the coupling variables $V$, $G$ using the extrapolation weights $\{\gamma_k\}$.

MATLAB code (click me)
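The linked MATLAB code is not reproduced here; below is instead a minimal, self-contained MATLAB sketch of the alternating scheme described on this slide, written so that one inner solver serves both blocks. The function names, the random initialization and the iteration counts are assumptions for illustration only.

```matlab
% Alternating APGD for NMF (sketch).  f(W,H) = 0.5*||X - W*H||_F^2.
function [W, H] = nmf_apgd(X, r, outer, inner, eps0)
    [m, n] = size(X);
    W = max(rand(m, r), eps0);                    % nonnegative random init (illustrative)
    H = max(rand(r, n), eps0);
    for it = 1:outer
        W = apgd_block(X', H', W', inner, eps0)'; % W-loop: min over W with H fixed
        H = apgd_block(X,  W,  H,  inner, eps0);  % H-loop: min over H with W fixed
    end
end

function B = apgd_block(X, A, B, inner, eps0)
    % APGD on min_{B >= eps0} 0.5*||X - A*B||_F^2 with A held fixed.
    t = 1 / norm(A'*A, 2);                        % step size 1/beta
    C_prev = B;  lambda_prev = 0;                 % C plays the role of V (or G)
    for k = 1:inner
        G = A' * (A*B - X);                       % gradient w.r.t. B
        C = max(B - t*G, eps0);                   % update and project
        lambda = (1 + sqrt(1 + 4*lambda_prev^2)) / 2;
        gamma  = (1 - lambda_prev) / lambda;
        B = (1 - gamma)*C + gamma*C_prev;         % extrapolation
        C_prev = C;  lambda_prev = lambda;
    end
end
```

Note the W-loop is obtained from the same solver by transposition, since $\frac{1}{2}\|X - WH\|_F^2 = \frac{1}{2}\|X^\top - H^\top W^\top\|_F^2$.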

The wrong way to apply APGD on NMF

Note that in each Update step the other variable is held fixed (written without the superscript $k$). That is, the matrix $H$ used when updating $W_k$ is the same for all $k = 1, 2, \ldots$. The following shows a wrong way to apply APGD on NMF:

1. Initialize $W_0$, $H_0$ and $\epsilon$, set $V_0 = W_0$, $G_0 = H_0$, and $\lambda_0 = 0$
2. Loop until a stopping condition is met:
   1. $\lambda_k = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_{k-1}^2}\Big)$, $\gamma_k = \frac{1 - \lambda_{k-1}}{\lambda_k}$
   2. Update: $V_k = \max\Big([W_k - t_k^W \nabla_W f(W_k; H_k)]_{ij},\ \epsilon\Big)$
   3. Extrapolation: $W_{k+1} = (1 - \gamma_k)V_k + \gamma_k V_{k-1}$
   4. Update: $G_k = \max\Big([H_k - t_k^H \nabla_H f(H_k; W_{k+1})]_{ij},\ \epsilon\Big)$
   5. Extrapolation: $H_{k+1} = (1 - \gamma_k)G_k + \gamma_k G_{k-1}$

Why it is wrong: consider step 2 of the loop (the update of $V$). At $k = 1$:
$$V_1 = \max\Big([W_1 - t_1^W \nabla_W f(W_1; H_1)]_{ij},\ \epsilon\Big).$$
At $k = 2$:
$$V_2 = \max\Big([W_2 - t_2^W \nabla_W f(W_2; H_2)]_{ij},\ \epsilon\Big).$$
Unless the variable $H$ has already converged to an optimum (so that $H_1 = H_2$), the update of $H$ at $k = 1$ makes $H_2 \neq H_1$, and so the optimization problem being minimized over $W$ at $k = 1$ is different from the one at $k = 2$. The momentum term $V_{k-1}$ therefore comes from a different subproblem, and the acceleration argument no longer applies.

PGD vs APGD

Example: MATLAB rice image, $(m, n, r) = (256, 256, 64)$.
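For readers who want to run a comparison of this kind, a hedged driver sketch using the nmf_apgd sketch from the previous slide; rice.png ships with the Image Processing Toolbox, the iteration counts are illustrative, and this is not the experiment behind the original figure.

```matlab
% Illustrative driver on the MATLAB demo image rice.png
% (the image file ships with the Image Processing Toolbox).
X = double(imread('rice.png')) / 255;     % 256 x 256 grayscale, entries in [0,1]
r = 64;
[W, H] = nmf_apgd(X, r, 50, 10, 1e-16);   % alternating APGD sketch from above
relerr = norm(X - W*H, 'fro') / norm(X, 'fro')   % relative fitting error
```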

PGD vs APGD

Here APGD takes about 50% of the computation of PGD to reach the same fitting error.

Further discussion

Recall that accelerated GD for an unconstrained convex problem with a single variable is not monotone: the trace of the objective function value has ripples. In this case adaptive restart can be used. That is, if the extrapolated solution produces an objective function value higher than that of the previous iterate, the extrapolated solution is thrown away, and the momentum and the extrapolation weight are reset to their initial state. That is, if at step $k$, $f(W_k) > f(W_{k-1})$:
$$W_k = W_{k-1}, \qquad V_k = W_k, \qquad \lambda_k = 0.$$
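A hedged MATLAB sketch of how this test could be inserted right after the extrapolation step of the W-loop (an analogous test goes in the H-loop). It is written in the slide's notation ($W$, $V$, $\lambda$) rather than the variable names of the nmf_apgd sketch, and W_prev is an assumed bookkeeping variable.

```matlab
% Adaptive restart check, placed right after the extrapolation step of the W-loop.
if 0.5*norm(X - W*H, 'fro')^2 > 0.5*norm(X - W_prev*H, 'fro')^2
    W = W_prev;      % discard the extrapolated iterate
    V = W;           % reset the coupling (momentum) variable
    lambda = 0;      % reset the extrapolation weight sequence
end
W_prev = W;          % remember the accepted iterate for the next check
```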

Last page - summary

Accelerated projected gradient descent on NMF.

End of document