
Non-negative Matrix Factorization via accelerated Projected Gradient Descent

Andersen Ang
Mathématique et recherche opérationnelle, UMONS, Belgium
Email: manshun.ang@umons.ac.be
Homepage: angms.science

First draft: March 11, 2018
Last update: March 13, 2018

Overview

1. NMF
2. Projected gradient descent
3. Accelerated projected gradient descent
4. Summary

NMF

The NMF problem: given $X \in \mathbb{R}^{m \times n}_+$, find $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$ by
$$[W, H] = \arg\min_{W, H \geq 0} \frac{1}{2}\|X - WH\|_F^2.$$
- It has two variables $W$ and $H$ (hence it is not convex!)
- It is a constrained optimization problem ($W$, $H$ have to be non-negative)
- It is an NP-hard problem: see Vavasis's 2008 paper "On the complexity of nonnegative matrix factorization"
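Since the later slides only need the gradient of $f(W, H) = \frac{1}{2}\|X - WH\|_F^2$ with respect to each block, here is a small MATLAB sketch of the objective value and the two block gradients; the handle names are illustrative and not part of the original slides.

```matlab
% f(W,H) = 0.5*||X - W*H||_F^2 and its block gradients,
% obtained by expanding the Frobenius norm:
%   grad_W f = (W*H - X)*H',    grad_H f = W'*(W*H - X)
f     = @(X, W, H) 0.5 * norm(X - W*H, 'fro')^2;
gradW = @(X, W, H) (W*H - X) * H';
gradH = @(X, W, H) W' * (W*H - X);
```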

Projected gradient descent algorithm (PGD)

PGD is a way to solve constrained optimization problems of the form
$$\min_{x \in Q} f(x)$$
where $Q$ is the constraint set.

Starting from an initial feasible point $x_0 \in Q$, PGD iterates the following equation until a stopping condition is met:
$$x_{k+1} = P_Q\big(x_k - t_k \nabla f(x_k)\big).$$
$P_Q(\cdot)$ is the projection operator
$$P_Q(x_0) = \arg\min_{x \in Q} \frac{1}{2}\|x - x_0\|_2^2,$$
i.e. given a point $x_0$, $P_Q$ finds the point $x \in Q$ that is closest to $x_0$.

The PGD algorithm

PGD for the single-variable problem $\min_{x \in Q} f(x)$:

1. Pick an initial point $x_0 \in Q$
2. Loop until a stopping condition is met:
   1. Pick the descent direction $-\nabla f(x_k)$ and a step size $t_k$
   2. Update and project: $x_{k+1} = P_Q\big(x_k - t_k \nabla f(x_k)\big)$, where the projection is $P_Q(z) = \arg\min_{x \in Q} \frac{1}{2}\|x - z\|_2^2$

Note. If $f$ is $\beta$-smooth ($\nabla f$ is $\beta$-Lipschitz), we can pick $t_k = \frac{1}{\beta}$.
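As an illustration (not part of the original slides), a minimal MATLAB sketch of PGD for one NMF subproblem, $\min_{H \geq 0} \frac{1}{2}\|X - WH\|_F^2$ with $W$ held fixed, where $Q$ is the nonnegative orthant so $P_Q$ is an entrywise max with zero; the function name and the fixed iteration count are illustrative.

```matlab
% PGD sketch for min_{H >= 0} 0.5*||X - W*H||_F^2 with W held fixed.
function H = pgd_nnls(X, W, H, maxiter)
    t = 1 / norm(W'*W, 2);      % t = 1/beta; beta = ||W'*W||_2 is the Lipschitz
                                % constant of the gradient H -> W'*(W*H - X)
    for k = 1:maxiter
        G = W' * (W*H - X);     % gradient of f w.r.t. H
        H = max(H - t*G, 0);    % gradient step, then projection onto H >= 0
    end
end
```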

The Accelerated PGD algorithm

Like GD, Nesterov's acceleration can also be applied to PGD. The Accelerated PGD (APGD) algorithm for the single-variable problem $\min_{x \in Q} f(x)$ has two additional parameter sequences $\lambda, \gamma$ and an additional coupling sequence $y$. Acceleration is done by extrapolating the current iterate with a momentum term built from the previous iterate.

1. Pick an initial point $x_0 \in Q$, set $y_0 = x_0$, $\lambda_0 = 0$
2. Loop until a stopping condition is met:
   1. Pick the descent direction $-\nabla f(x_k)$ and compute the step size $t_k$
   2. Update and project: $y_k = P_Q\big(x_k - t_k \nabla f(x_k)\big)$
   3. Compute the coefficients $\lambda_k = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_{k-1}^2}\Big)$, $\gamma_k = \frac{1 - \lambda_{k-1}}{\lambda_k}$
   4. Extrapolation: $x_{k+1} = (1 - \gamma_k)y_k + \gamma_k y_{k-1}$

For the details of the accelerated gradient: click me
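Again as an illustration (an assumed helper, not from the slides), the same subproblem as before solved with the APGD scheme above, using exactly the $\lambda_k$, $\gamma_k$ sequences stated on this slide.

```matlab
% APGD sketch for min_{H >= 0} 0.5*||X - W*H||_F^2 with W held fixed.
function H = apgd_nnls(X, W, H, maxiter)
    t = 1 / norm(W'*W, 2);                   % step size 1/beta
    Y_prev = H;  lambda_prev = 0;            % y_0 = x_0, lambda_0 = 0
    for k = 1:maxiter
        G = W' * (W*H - X);                  % gradient at x_k (stored in H)
        Y = max(H - t*G, 0);                 % y_k = P_Q(x_k - t_k*grad f(x_k))
        lambda = (1 + sqrt(1 + 4*lambda_prev^2)) / 2;
        gamma  = (1 - lambda_prev) / lambda;
        H = (1 - gamma)*Y + gamma*Y_prev;    % x_{k+1} = (1-gamma_k)*y_k + gamma_k*y_{k-1}
        Y_prev = Y;  lambda_prev = lambda;
    end
end
```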

Applying the APGD algorithm on NMF

To apply APGD on NMF, note that NMF has two variables $W, H$ $\Rightarrow$ we have two loops, since the original AGD framework considers a single-variable problem.

The variables are matrices, not vectors $\Rightarrow$ the projections become
$$P_{Q_W}(W_0) = \arg\min_{Z \in Q_W} \frac{1}{2}\|Z - W_0\|_F^2, \qquad
P_{Q_H}(H_0) = \arg\min_{Z \in Q_H} \frac{1}{2}\|Z - H_0\|_F^2.$$
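Here $Q_W$ and $Q_H$ are nonnegative orthants, so both Frobenius-norm projections have an entrywise closed form; a one-line MATLAB sketch (the handle name is illustrative), with the lower bound set to $0$ or to the small $\epsilon$ used on the later slides:

```matlab
% Projection onto the nonnegative orthant is entrywise.
proj = @(Z, lb) max(Z, lb);   % P_Q(Z) with lower bound lb (0 or a small epsilon)
```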

The framework of the APGD algorithm on NMF

The framework of the alternating* APGD for NMF:

1. Pick initial matrices $W_0 \in \mathbb{R}^{m \times r}_+$ and $H_0 \in \mathbb{R}^{r \times n}_+$ and the parameters.
2. On $W$, loop until a stopping condition is met:
   1. Pick the descent direction and step size
   2. Update and project
   3. Extrapolate
3. On $H$, loop until a stopping condition is met:
   1. Pick the descent direction and step size
   2. Update and project
   3. Extrapolate

* So in fact such an algorithm falls into the framework of block coordinate descent; the blocks are the two variable matrices.

The APGD algorithm for NMF

To be specific, the APGD algorithm for NMF is:

1. (On $W$) Initialize $W_0$, $\epsilon$, set $V_0 = W_0$, $\lambda_0 = 0$, then loop until a stopping condition is met:
   1. $\lambda_k = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_{k-1}^2}\Big)$, $\gamma_k = \frac{1 - \lambda_{k-1}}{\lambda_k}$
   2. Update: $V_k = \max\Big([W_k - t_k^W \nabla_W f(W_k; H)]_{ij},\ \epsilon\Big)$
   3. Extrapolation: $W_{k+1} = (1 - \gamma_k)V_k + \gamma_k V_{k-1}$
2. (On $H$) Initialize $H_0$, $\epsilon$, set $G_0 = H_0$, $\lambda_0 = 0$, then loop until a stopping condition is met:
   1. $\lambda_k = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_{k-1}^2}\Big)$, $\gamma_k = \frac{1 - \lambda_{k-1}}{\lambda_k}$
   2. Update: $G_k = \max\Big([H_k - t_k^H \nabla_H f(H_k; W)]_{ij},\ \epsilon\Big)$
   3. Extrapolation: $H_{k+1} = (1 - \gamma_k)G_k + \gamma_k G_{k-1}$

The acceleration is achieved by extrapolating the current $W$, $H$ with the coupling variables $V$, $G$ using the extrapolation weights $\{\gamma_k\}$.

MATLAB code (click me)
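The linked MATLAB code is not reproduced here; below is instead a minimal, self-contained MATLAB sketch of the alternating scheme described on this slide, written so that one inner solver serves both blocks. The function names, the random initialization and the iteration counts are assumptions for illustration only.

```matlab
% Alternating APGD for NMF (sketch).  f(W,H) = 0.5*||X - W*H||_F^2.
function [W, H] = nmf_apgd(X, r, outer, inner, eps0)
    [m, n] = size(X);
    W = max(rand(m, r), eps0);                    % nonnegative random init (illustrative)
    H = max(rand(r, n), eps0);
    for it = 1:outer
        W = apgd_block(X', H', W', inner, eps0)'; % W-loop: min over W with H fixed
        H = apgd_block(X,  W,  H,  inner, eps0);  % H-loop: min over H with W fixed
    end
end

function B = apgd_block(X, A, B, inner, eps0)
    % APGD on min_{B >= eps0} 0.5*||X - A*B||_F^2 with A held fixed.
    t = 1 / norm(A'*A, 2);                        % step size 1/beta
    C_prev = B;  lambda_prev = 0;                 % C plays the role of V (or G)
    for k = 1:inner
        G = A' * (A*B - X);                       % gradient w.r.t. B
        C = max(B - t*G, eps0);                   % update and project
        lambda = (1 + sqrt(1 + 4*lambda_prev^2)) / 2;
        gamma  = (1 - lambda_prev) / lambda;
        B = (1 - gamma)*C + gamma*C_prev;         % extrapolation
        C_prev = C;  lambda_prev = lambda;
    end
end
```

Note the W-loop is obtained from the same solver by transposition, since $\frac{1}{2}\|X - WH\|_F^2 = \frac{1}{2}\|X^\top - H^\top W^\top\|_F^2$.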

The wrong way to apply APGD on NMF

Note that in each Update step the other variable is held fixed (written without the superscript $k$). That is, the matrix $H$ used when updating $W_k$ is the same for all $k = 1, 2, \ldots$. The following shows a wrong way to apply APGD on NMF:

1. Initialize $W_0$, $H_0$ and $\epsilon$, set $V_0 = W_0$, $G_0 = H_0$, and $\lambda_0 = 0$
2. Loop until a stopping condition is met:
   1. $\lambda_k = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_{k-1}^2}\Big)$, $\gamma_k = \frac{1 - \lambda_{k-1}}{\lambda_k}$
   2. Update: $V_k = \max\Big([W_k - t_k^W \nabla_W f(W_k; H_k)]_{ij},\ \epsilon\Big)$
   3. Extrapolation: $W_{k+1} = (1 - \gamma_k)V_k + \gamma_k V_{k-1}$
   4. Update: $G_k = \max\Big([H_k - t_k^H \nabla_H f(H_k; W_{k+1})]_{ij},\ \epsilon\Big)$
   5. Extrapolation: $H_{k+1} = (1 - \gamma_k)G_k + \gamma_k G_{k-1}$

Why it is wrong: consider step 2 of the loop (the update of $V$). At $k = 1$:
$$V_1 = \max\Big([W_1 - t_1^W \nabla_W f(W_1; H_1)]_{ij},\ \epsilon\Big).$$
At $k = 2$:
$$V_2 = \max\Big([W_2 - t_2^W \nabla_W f(W_2; H_2)]_{ij},\ \epsilon\Big).$$
Unless the variable $H$ has already converged to an optimum (so that $H_1 = H_2$), the update of $H$ at $k = 1$ makes $H_2 \neq H_1$, and so the optimization problem being minimized over $W$ at $k = 1$ is different from the one at $k = 2$. The momentum term $V_{k-1}$ therefore comes from a different subproblem, and the acceleration argument no longer applies.

PGD vs APGD

Example: MATLAB rice image, $(m, n, r) = (256, 256, 64)$.
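For readers who want to run a comparison of this kind, a hedged driver sketch using the nmf_apgd sketch from the previous slide; rice.png ships with the Image Processing Toolbox, the iteration counts are illustrative, and this is not the experiment behind the original figure.

```matlab
% Illustrative driver on the MATLAB demo image rice.png
% (the image file ships with the Image Processing Toolbox).
X = double(imread('rice.png')) / 255;     % 256 x 256 grayscale, entries in [0,1]
r = 64;
[W, H] = nmf_apgd(X, r, 50, 10, 1e-16);   % alternating APGD sketch from above
relerr = norm(X - W*H, 'fro') / norm(X, 'fro')   % relative fitting error
```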

PGD vs APGD

Here APGD takes about 50% of the computation of PGD to reach the same fitting error.

Further discussion

Recall that accelerated GD for an unconstrained convex problem with a single variable is not monotone: the trace of the objective function value has ripples. In this case adaptive restart can be used. That is, if the extrapolated solution produces an objective function value higher than that of the previous iterate, the extrapolated solution is thrown away, and the momentum and the extrapolation weight are reset to their initial state. That is, if at step $k$, $f(W_k) > f(W_{k-1})$:
$$W_k = W_{k-1}, \qquad V_k = W_k, \qquad \lambda_k = 0.$$
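A hedged MATLAB sketch of how this test could be inserted right after the extrapolation step of the W-loop (an analogous test goes in the H-loop). It is written in the slide's notation ($W$, $V$, $\lambda$) rather than the variable names of the nmf_apgd sketch, and W_prev is an assumed bookkeeping variable.

```matlab
% Adaptive restart check, placed right after the extrapolation step of the W-loop.
if 0.5*norm(X - W*H, 'fro')^2 > 0.5*norm(X - W_prev*H, 'fro')^2
    W = W_prev;      % discard the extrapolated iterate
    V = W;           % reset the coupling (momentum) variable
    lambda = 0;      % reset the extrapolation weight sequence
end
W_prev = W;          % remember the accepted iterate for the next check
```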

Last page - summary

Accelerated projected gradient descent on NMF.

End of document