Proximal Methods for Optimization with Sparsity-inducing Norms


1 Proximal Methods for Optimization with Sparsity-inducing Norms
Group Learning Presentation
Xiaowei Zhou
Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology

2 Outline
1 Background
2 Proximal Methods (Overview; Convergence; Computation of the Proximal Operator)
3 ProxFlow algorithm for group-sparsity
4 Discussion

3 Problem Setting

    \min_{w \in \mathbb{R}^p} \underbrace{f(w)}_{\text{data fitting}} + \lambda \underbrace{\Omega(w)}_{\text{sparsity inducing}}.   (1)

f : R^p → R is convex and differentiable.
Ω : R^p → R is convex but nonsmooth. Examples:
ℓ1-norm: \|w\|_1 = \sum_{j=1}^{p} |w_j|
ℓ1/ℓq-norm: \|w\|_{\ell_1/\ell_q} = \sum_{g \in G} \eta_g \|w_g\|_q
nuclear norm: \|w\|_* = \sum_{j=1}^{r} \sigma_j(w), the sum of the singular values.

4 Generic Methods
Generic convex optimization methods:
Subgradient descent: w_{t+1} = w_t − α(s + λs'), where s ∈ ∂f(w_t), s' ∈ ∂Ω(w_t).
Reformulation as standard solvable convex programs (LP, QP, SOCP, SDP). For example, the Lasso can be formulated as

    \min_{w_+, w_- \in \mathbb{R}^p_+} \frac{1}{2}\|y - Xw_+ + Xw_-\|^2 + \lambda\,(\mathbf{1}^T w_+ + \mathbf{1}^T w_-),   (2)

so that general-purpose toolboxes (e.g. CVX) can be used.
These approaches are blind to the problem structure and tend to be slow and memory-consuming.
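To make the split-variable reformulation (2) concrete, here is a minimal sketch using cvxpy as a Python stand-in for the CVX toolbox mentioned above; the data X, y, the dimensions, and the value of lambda are placeholders chosen purely for illustration.

```python
import numpy as np
import cvxpy as cp

# Toy data (placeholders for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
y = rng.standard_normal(50)
lam = 0.1

# Split w = w_plus - w_minus with w_plus, w_minus >= 0, as in (2).
w_plus = cp.Variable(100, nonneg=True)
w_minus = cp.Variable(100, nonneg=True)
objective = cp.Minimize(0.5 * cp.sum_squares(y - X @ (w_plus - w_minus))
                        + lam * (cp.sum(w_plus) + cp.sum(w_minus)))
cp.Problem(objective).solve()

w = w_plus.value - w_minus.value  # Lasso solution recovered from the split variables
```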

5 Outline
1 Background
2 Proximal Methods (Overview; Convergence; Computation of the Proximal Operator)
3 ProxFlow algorithm for group-sparsity
4 Discussion

6 Proximal methods
In proximal methods, (1) is optimized by iteratively solving the following problem:

    \min_{w \in \mathbb{R}^p} \underbrace{f(w_0) + (w - w_0)^T \nabla f(w_0)}_{\text{linearization of } f(w)} + \lambda\,\Omega(w) + \underbrace{\frac{L}{2}\|w - w_0\|^2}_{\text{augmented term}}.   (3)

The first two terms linearize f at w_0. The last term keeps the solution in a neighborhood where the current linear approximation holds. L > 0 is an upper bound on the Lipschitz constant of ∇f.

7 Proximal Operator
Definition (Proximal Operator). The proximal operator associated with the regularization λΩ is the function that maps a vector u ∈ R^p to the unique solution of

    \min_{v \in \mathbb{R}^p} \frac{1}{2}\|u - v\|^2 + \lambda\,\Omega(v).   (4)

The operator is usually denoted Prox_{λΩ}(u).

Example: when Ω is the ℓ1-norm, the proximal operator is the well-known elementwise soft-thresholding operator:

    \forall j \in [1, \dots, p], \quad u_j \mapsto \operatorname{sign}(u_j)\,(|u_j| - \lambda)_+ = \begin{cases} 0 & \text{if } |u_j| \le \lambda \\ \operatorname{sign}(u_j)(|u_j| - \lambda) & \text{otherwise.} \end{cases}
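As an illustration of the ℓ1 case, a minimal NumPy sketch of elementwise soft-thresholding (the function name is ours, not from the slides):

```python
import numpy as np

def prox_l1(u, lam):
    """Elementwise soft-thresholding: Prox_{lam * ||.||_1}(u)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# Example: entries with |u_j| <= lam are set exactly to zero.
print(prox_l1(np.array([3.0, -0.5, 1.2, -2.0]), lam=1.0))  # approx [ 2.  -0.   0.2 -1. ]
```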

8 Algorithm
Since (3) can be rewritten as

    \min_{w \in \mathbb{R}^p} \frac{1}{2}\Big\|w - \big(w_0 - \tfrac{1}{L}\nabla f(w_0)\big)\Big\|^2 + \frac{\lambda}{L}\,\Omega(w),   (5)

the proximal method can be written as:

Algorithm 1: proximal method to solve (1)
  repeat
      w ← Prox_{(λ/L)Ω}( w − (1/L)∇f(w) )
  until convergence

Remark: if λ = 0, the iteration reduces to gradient descent.

Questions:
1 How do we compute the proximal operator?
2 What is the convergence rate of the algorithm?
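A minimal sketch of Algorithm 1 for the Lasso case, i.e. f(w) = ½‖y − Xw‖² with Ω the ℓ1-norm, assuming the soft-thresholding operator from the previous sketch and the standard choice L = λ_max(X^T X); data and parameters are placeholders.

```python
import numpy as np

def prox_l1(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # soft-thresholding (slide 7)

def proximal_gradient_lasso(X, y, lam, n_iter=200):
    """Algorithm 1 for f(w) = 0.5*||y - Xw||^2 and Omega = l1-norm."""
    L = np.linalg.norm(X, ord=2) ** 2          # Lipschitz constant of grad f
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)               # gradient of the data-fitting term
        w = prox_l1(w - grad / L, lam / L)     # w <- Prox_{(lam/L) Omega}(w - (1/L) grad f(w))
    return w

# Toy usage (placeholder data).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((30, 60)), rng.standard_normal(30)
w_hat = proximal_gradient_lasso(X, y, lam=0.5)
```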

9 Outline
1 Background
2 Proximal Methods (Overview; Convergence; Computation of the Proximal Operator)
3 ProxFlow algorithm for group-sparsity
4 Discussion

10 Assumption
We assume that f has a Lipschitz-continuous gradient:

    \|\nabla f(w) - \nabla f(v)\| \le L\,\|w - v\|.   (6)

Remark: this means that the change of the gradient of f is bounded above by L times the change of its argument.

Lemma. For convex f, the above condition is equivalent to

    f(w) \le f(v) + \langle \nabla f(v), w - v \rangle + \frac{L}{2}\|w - v\|^2.   (7)
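For concreteness, a standard worked example (our addition, assuming a least-squares data-fitting term, which is not specified on the slide):

```latex
% For f(w) = \tfrac{1}{2}\|y - Xw\|_2^2 one has \nabla f(w) = X^\top(Xw - y), so
% \|\nabla f(w) - \nabla f(v)\| = \|X^\top X (w - v)\| \le \lambda_{\max}(X^\top X)\,\|w - v\|,
% and condition (6) holds with
L = \lambda_{\max}(X^\top X).
```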

11 Two lemmas
The following two lemmas are central to the proof of convergence.

Lemma (Sandwich). Let F(w; v) be the approximation of F(w) = f(w) + λΩ(w) obtained by linearizing f at v, that is F(w; v) = f(v) + ⟨∇f(v), w − v⟩ + λΩ(w). Then

    F(w) \le F(w; v) + \frac{L}{2}\|w - v\|^2 \le F(w) + \frac{L}{2}\|w - v\|^2.   (8)

Lemma (3-point property). If \hat{w} = \arg\min_w \frac{1}{2}\|w - w_0\|^2 + \varphi(w) for a convex function φ, then for any w:

    \varphi(\hat{w}) + \frac{1}{2}\|\hat{w} - w_0\|^2 \le \varphi(w) + \frac{1}{2}\|w - w_0\|^2 - \frac{1}{2}\|w - \hat{w}\|^2.   (9)

12 Proof of the convergence rate I
F(w_t) is monotone non-increasing for t = 0, …, T:

    F(w_{t+1}) \le F(w_{t+1}; w_t) + \frac{L}{2}\|w_{t+1} - w_t\|^2          [Sandwich, left inequality]
               \le F(w_t; w_t) + \frac{L}{2}\|w_t - w_t\|^2 = F(w_t)          [definition of w_{t+1} as the minimizer of (3)]

Also, we have

    F(w_{t+1}) \le F(w_{t+1}; w_t) + \frac{L}{2}\|w_{t+1} - w_t\|^2                                      [Sandwich, left inequality]
               \le F(w^*; w_t) + \frac{L}{2}\|w^* - w_t\|^2 - \frac{L}{2}\|w^* - w_{t+1}\|^2             [3-point property]
               \le F(w^*) + \frac{L}{2}\|w^* - w_t\|^2 - \frac{L}{2}\|w^* - w_{t+1}\|^2                  [Sandwich, right inequality]

13 Proof of the convergence rate II
Define ε_t = F(w_t) − F(w^*), so that

    \epsilon_{t+1} \le \frac{L}{2}\|w^* - w_t\|^2 - \frac{L}{2}\|w^* - w_{t+1}\|^2.

Summing over t = 0, …, T − 1 (the right-hand side telescopes) and using that ε_t is non-increasing:

    T\,\epsilon_T \le \sum_{t=0}^{T-1} \epsilon_{t+1} \le \frac{L}{2}\|w^* - w_0\|^2 - \frac{L}{2}\|w^* - w_T\|^2 \le \frac{L}{2}\|w^* - w_0\|^2,

so that

    \epsilon_T \le \frac{L\,\|w^* - w_0\|^2}{2T}.   (10)

Thus at iteration T, Algorithm 1 yields a solution w_T that satisfies F(w_T) − F(w^*) ≤ L‖w^* − w_0‖² / (2T), i.e. Algorithm 1 has a convergence rate of O(1/T).

14 Outline
1 Background
2 Proximal Methods (Overview; Convergence; Computation of the Proximal Operator)
3 ProxFlow algorithm for group-sparsity
4 Discussion

15 Dual norm
Definition (Dual Norm). The dual norm Ω^* of the norm Ω is defined by

    \Omega^*(z) = \max_{w \in \mathbb{R}^p} z^T w \quad \text{s.t. } \Omega(w) \le 1.   (11)

Examples:
the ℓ2-norm is self-dual;
the ℓ1-norm is dual to the ℓ∞-norm.

16 Dual proximal operator
In the case where Ω is a norm, the following problem is dual to the proximal problem (4):

    \max_{v \in \mathbb{R}^p} -\frac{1}{2}\big(\|v - u\|_2^2 - \|u\|_2^2\big) \quad \text{s.t. } \Omega^*(v) \le \lambda.   (12)

Let Proj_{Ω^* ≤ λ} be the projector onto the ball of radius λ of the dual norm Ω^*. Then Proj_{Ω^* ≤ λ}(u) is the unique solution of problem (12), and it is related to Prox_{λΩ} by

    \mathrm{Prox}_{\lambda\Omega} = I - \mathrm{Proj}_{\Omega^* \le \lambda}.   (13)
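Relation (13) can be checked numerically for the ℓ1-norm, whose dual-norm ball is the ℓ∞ ball; a minimal sketch (function names are ours):

```python
import numpy as np

def proj_linf_ball(u, lam):
    """Projection onto the l_infinity ball of radius lam (the dual ball of the l1-norm)."""
    return np.clip(u, -lam, lam)

def prox_l1(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

u = np.array([3.0, -0.5, 1.2, -2.0])
lam = 1.0
# Moreau decomposition (13): Prox_{lam*Omega} = Id - Proj onto {Omega^* <= lam}.
assert np.allclose(prox_l1(u, lam), u - proj_linf_ball(u, lam))
```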

17 Proof: conic duality
Definition (Dual Cone). If K is a convex cone, the dual cone K^* is defined as

    K^* = \{ z \in \mathbb{R}^p : z^T x \ge 0, \ \forall x \in K \}.

Primal:       \min_x f(x) \quad \text{s.t. } Ax + b \preceq_K 0
Lagrangian:   L(x, \lambda) = f(x) + \lambda^T (Ax + b), \quad \lambda \succeq_{K^*} 0
Dual:         \max_\lambda \inf_x L(x, \lambda) \quad \text{s.t. } \lambda \succeq_{K^*} 0

18 Typical examples
When Ω is the ℓ1-norm, the solution of (4) is

    u \mapsto u - \mathrm{Proj}_{\|\cdot\|_\infty \le \lambda}(u),

which gives the elementwise soft-thresholding operator:

    \forall j \in [1, \dots, p], \quad u_j \mapsto \operatorname{sign}(u_j)(|u_j| - \lambda)_+.

When Ω is the ℓ1/ℓ2-norm with nonoverlapping groups, the proximal problem is separable across groups, and the solution is a group-thresholding operator:

    \forall g \in G, \quad u_g \mapsto u_g - \mathrm{Proj}_{\|\cdot\|_2 \le \lambda}(u_g) = \begin{cases} 0 & \text{if } \|u_g\|_2 \le \lambda \\ \dfrac{\|u_g\|_2 - \lambda}{\|u_g\|_2}\, u_g & \text{otherwise.} \end{cases}
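A minimal sketch of the group-thresholding operator for nonoverlapping groups (the function name and group layout are illustrative):

```python
import numpy as np

def prox_group_l2(u, groups, lam):
    """Blockwise group soft-thresholding: prox of lam * sum_g ||u_g||_2, nonoverlapping groups."""
    v = np.zeros_like(u)
    for g in groups:
        norm_g = np.linalg.norm(u[g])
        if norm_g > lam:
            v[g] = (norm_g - lam) / norm_g * u[g]   # shrink the whole group
        # else: the group is set exactly to zero
    return v

# Example: two groups of size 3; the second group is zeroed out entirely.
u = np.array([3.0, -2.0, 1.0, 0.3, -0.2, 0.1])
print(prox_group_l2(u, groups=[np.arange(0, 3), np.arange(3, 6)], lam=1.0))
```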

19 Outline
1 Background
2 Proximal Methods (Overview; Convergence; Computation of the Proximal Operator)
3 ProxFlow algorithm for group-sparsity
4 Discussion

20 Duality of the proximal problem with group-sparsity
The following two problems are dual to each other:

    P:  \min_{v \in \mathbb{R}^p} \frac{1}{2}\|u - v\|^2 + \lambda \sum_{g \in G} \eta_g \|v_g\|   (14)

    D:  \max_{\xi \in \mathbb{R}^{p \times |G|}} -\frac{1}{2}\Big( \big\|u - \sum_{g \in G} \xi^g\big\|_2^2 - \|u\|_2^2 \Big)   (15)
        \text{s.t. } \forall g \in G, \ \|\xi^g\|_* \le \lambda\eta_g \ \text{and} \ \xi^g_j = 0 \ \text{if} \ j \notin g,

where ‖·‖ and ‖·‖_* are two convex norms dual to each other, and ξ = (ξ^g)_{g∈G} with ξ^g ∈ R^p the dual variable associated with v_g.

21 Block Coordinate Descent
BCD algorithm to solve the proximal problem (14):

  Initialization: ξ = 0.
  repeat
      for g ∈ G: ξ^g ← Proj_{‖·‖_* ≤ λη_g}( [ u − Σ_{h≠g} ξ^h ]_g )
  until convergence
  v ← u − Σ_{g∈G} ξ^g.

In general, the above algorithm is not guaranteed to converge in a finite number of iterations. When ‖·‖ is the ℓ1 or ℓ∞ norm and the groups in G form a hierarchical structure, only one iteration is needed to reach convergence.

Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research 12 (2011).
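A minimal sketch (not the authors' reference implementation) of this dual block coordinate descent for an ℓ1/ℓ2 penalty with possibly overlapping groups; since the ℓ2-norm is self-dual, each block update is a projection onto an ℓ2 ball.

```python
import numpy as np

def prox_group_l2_bcd(u, groups, lam, etas, n_iter=100):
    """groups: list of index arrays; etas: list of group weights eta_g."""
    xis = [np.zeros_like(u) for _ in groups]          # dual variables xi^g, supported on g
    for _ in range(n_iter):
        for i, g in enumerate(groups):
            # residual restricted to group g, excluding the current block
            r_g = u[g] - (sum(xis) - xis[i])[g]
            radius = lam * etas[i]
            norm = np.linalg.norm(r_g)
            xi_g = r_g if norm <= radius else (radius / norm) * r_g   # Proj onto l2 ball
            xis[i] = np.zeros_like(u)
            xis[i][g] = xi_g
    return u - sum(xis)                               # primal solution v = u - sum_g xi^g

# Example: two overlapping groups on a 5-dimensional vector.
u = np.array([3.0, -2.0, 0.5, 1.5, -4.0])
v = prox_group_l2_bcd(u, groups=[np.arange(0, 3), np.arange(2, 5)], lam=1.0, etas=[1.0, 1.0])
```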

22 Quadratic Min-Cost Flow Formulation
How can the group-sparsity problem be solved efficiently in the general case? If the ℓ1/ℓ∞-norm is used, the dual problem (15) becomes

    \min_{\xi \in \mathbb{R}^{p \times |G|}} \frac{1}{2}\Big\|u - \sum_{g \in G} \xi^g\Big\|_2^2 \quad \text{s.t. } \forall g \in G, \ \|\xi^g\|_1 \le \lambda\eta_g \ \text{and} \ \xi^g_j = 0 \ \text{if} \ j \notin g.

This problem can be formulated on a graph as a quadratic min-cost flow problem.

Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. Convex and Network Flow Optimization for Structured Sparsity. Journal of Machine Learning Research 12 (2011).

23 Figure: Graph representation with different group structures

24 Speed comparison
Figure: Comparison of Proximal + Network Flow (ProxFlow), Quadratic Programming (QP), Conic Programming (CP), and Subgradient Descent (SG).

25 Outline
1 Background
2 Proximal Methods (Overview; Convergence; Computation of the Proximal Operator)
3 ProxFlow algorithm for group-sparsity
4 Discussion

26 Toolbox SPAMS
SPAMS (SPArse Modeling Software), developed at INRIA, addresses various machine learning and signal processing problems:
Dictionary learning and matrix factorization (NMF, sparse PCA, ...).
Sparse decomposition problems with LARS, coordinate descent, OMP, SOMP, proximal methods.
Structured sparse decomposition problems.

27 Convex formulation of DECOLOR?
Replace the MRF penalty in DECOLOR with a group-sparsity-inducing norm:

    \min_{L,S} \frac{1}{2}\|M - L - S\|_F^2 + \mu\|L\|_* + \lambda\,\Omega(S), \quad \text{where } \Omega(S) := \sum_{g \in G} \eta_g \|s_g\|.

Stable PCP with Group Sparsity by BCD
  Initialize: S_0 = 0.
  repeat
      compute L_{k+1} = D_µ(M − S_k);
      compute S_{k+1} = Prox_{λΩ}(M − L_{k+1});
  until convergence
  Output: L, S.
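A minimal sketch of this alternating scheme (an illustration under our assumptions, not the DECOLOR authors' code): D_µ is taken to be singular-value soft-thresholding, i.e. the proximal operator of the nuclear norm, and the S-update reuses a proximal operator of the chosen group norm, passed in as a function.

```python
import numpy as np

def svt(A, mu):
    """Singular-value thresholding D_mu(A), the prox of mu * nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - mu, 0.0)) @ Vt

def stable_pcp_group(M, mu, lam, prox_group, n_iter=50):
    """prox_group(U, lam): proximal operator of the chosen group-sparsity norm."""
    S = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S, mu)              # L_{k+1} = D_mu(M - S_k)
        S = prox_group(M - L, lam)      # S_{k+1} = Prox_{lam*Omega}(M - L_{k+1})
    return L, S
```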

28 Key References
Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. Convex and Network Flow Optimization for Structured Sparsity. Journal of Machine Learning Research 12 (2011).
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Convex Optimization with Sparsity-Inducing Norms. In Optimization for Machine Learning, MIT Press, to appear.
Baldassarre, L. Proximal Methods (lecture notes).
Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. Proximal Methods for Hierarchical Sparse Coding. Journal of Machine Learning Research 12 (2011).
