Proximal methods
S. Villa
October 7, 2014


1 Review of the basics

Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem of the form
\[
\min_{c \in \mathbb{R}^d} \|y - Kc\|^2,
\]
for various choices of the loss function. Another typical problem is the regularized one, e.g. Tikhonov regularization, where for linear kernels one looks for
\[
\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n V(\langle w, x_i\rangle, y_i) + \lambda R(w).
\]
The class of methods we will consider is suitable for problems involving a nonsmooth regularization term. In particular, we will be motivated by the following regularization terms:
(i) $\ell^1$ norm regularization
(ii) group $\ell^1$ norm regularization
(iii) matrix norm regularization
More generally, we are interested in solving a minimization problem
\[
\min_{w \in \mathbb{R}^d} F(w).
\]
We review the basic concepts that allow us to study this problem.

Existence of a minimizer. We will consider extended real valued functions $F \colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$. The domain of $F$ is
\[
\operatorname{dom} F = \{w \in \mathbb{R}^d : F(w) < +\infty\}.
\]
We say that $F$ is proper if its domain is nonempty. It is useful to consider extended valued functions since they allow us to include constraints in the regularization. $F$ is lower semicontinuous if $\operatorname{epi} F$ is closed. $F$ is coercive if $\lim_{\|w\| \to +\infty} F(w) = +\infty$.

Theorem 1.1. If $F$ is lower semicontinuous and coercive, then there exists $w^\ast$ such that $F(w^\ast) = \min F$.

We will always assume that the functions we consider are lower semicontinuous.
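To fix ideas, here is a minimal numerical instance of the regularized problem above (a sketch only: the synthetic data, the square loss and the $\ell^1$ penalty are illustrative choices, not prescribed by the notes).

```python
import numpy as np

# Sketch of a regularized ERM objective: square loss plus an l1 penalty.
# All data here is synthetic and only meant to fix notation.
rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))        # inputs x_i stacked as rows
y = rng.standard_normal(n)             # outputs y_i
lam = 0.1                              # regularization parameter lambda

def objective(w):
    data_fit = 0.5 / n * np.linalg.norm(X @ w - y) ** 2   # (1/n) sum V(<w, x_i>, y_i)
    penalty = lam * np.abs(w).sum()                       # R(w) = ||w||_1
    return data_fit + penalty

print(objective(np.zeros(d)))
```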

1.1 Convexity concepts

Convexity. $F$ is convex if
\[
(\forall w, w' \in \operatorname{dom} F)(\forall \lambda \in [0,1]) \quad F(\lambda w + (1-\lambda) w') \le \lambda F(w) + (1-\lambda) F(w').
\]
If $F$ is differentiable, we can write an equivalent characterization of convexity based on the gradient:
\[
(\forall w, w' \in \mathbb{R}^d) \quad F(w') \ge F(w) + \langle \nabla F(w), w' - w\rangle.
\]
If $F$ is twice differentiable and $\nabla^2 F$ is the Hessian matrix, convexity is equivalent to $\nabla^2 F(w)$ being positive semidefinite for all $w \in \mathbb{R}^d$. If a function is convex and differentiable, then $\nabla F(w) = 0$ implies that $w$ is a global minimizer.

Strict convexity. $F$ is strictly convex if
\[
(\forall w, w' \in \operatorname{dom} F,\ w \neq w')(\forall \lambda \in (0,1)) \quad F(\lambda w + (1-\lambda) w') < \lambda F(w) + (1-\lambda) F(w').
\]
If $F$ is differentiable, an equivalent characterization of strict convexity based on the gradient is
\[
(\forall w, w' \in \mathbb{R}^d,\ w \neq w') \quad F(w') > F(w) + \langle \nabla F(w), w' - w\rangle.
\]
If $F$ is twice differentiable, strict convexity is implied by $\nabla^2 F(w)$ being positive definite for all $w \in \mathbb{R}^d$. The minimizer of a strictly convex function is unique (if it exists).

Strong convexity. $F$ is $\mu$-strongly convex if the function $F - \frac{\mu}{2}\|\cdot\|^2$ is convex, i.e.
\[
(\forall w, w' \in \operatorname{dom} F)(\forall \lambda \in [0,1]) \quad F(\lambda w + (1-\lambda) w') \le \lambda F(w) + (1-\lambda) F(w') - \frac{\mu}{2}\lambda(1-\lambda)\|w - w'\|^2.
\]
If $F$ is differentiable, then strong convexity is equivalent to
\[
(\forall w, w' \in \mathbb{R}^d) \quad F(w') \ge F(w) + \langle \nabla F(w), w' - w\rangle + \frac{\mu}{2}\|w' - w\|^2.
\]
If $F$ is twice differentiable, strong convexity is equivalent to $\nabla^2 F(w) \succeq \mu I$ for all $w \in \mathbb{R}^d$. If $F$ is strongly convex then it is coercive; therefore, if it is also lsc, it admits a unique minimizer $w^\ast$. Moreover,
\[
F(w) - F(w^\ast) \ge \frac{\mu}{2}\|w - w^\ast\|^2.
\]

Lipschitz continuity of the gradient. We will often assume Lipschitz continuity of the gradient,
\[
\|\nabla F(w) - \nabla F(w')\| \le L\, \|w - w'\|.
\]
This gives a useful quadratic upper bound on $F$:
\[
F(w') \le F(w) + \langle \nabla F(w), w' - w\rangle + \frac{L}{2}\|w' - w\|^2 \qquad (\forall w, w' \in \operatorname{dom} F). \tag{1}
\]
Moreover, for every $w \in \operatorname{dom} F$ and every minimizer $w^\ast$,
\[
\frac{1}{2L}\|\nabla F(w)\|^2 \le F(w) - F(w^\ast) \le \frac{L}{2}\|w - w^\ast\|^2.
\]
The second inequality follows by substituting $w = w^\ast$ and $w' = w$ in the quadratic upper bound (1); the first follows by substituting $w' = w - \frac{1}{L}\nabla F(w)$.
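As a concrete instance of these constants (a sketch with assumed synthetic data; the variable names below are not from the notes): for the least squares objective $F(w) = \frac{1}{2}\|Xw - y\|^2$, the gradient is $X^T(Xw - y)$ and the Hessian is $X^T X$, so $L$ and $\mu$ are its largest and smallest eigenvalues.

```python
import numpy as np

# Lipschitz and strong-convexity constants for least squares:
# F(w) = 0.5 * ||X w - y||^2, grad F(w) = X^T (X w - y), Hess F = X^T X.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))   # assumed synthetic data
y = rng.standard_normal(50)

H = X.T @ X                          # Hessian (constant in w)
eigs = np.linalg.eigvalsh(H)
L, mu = eigs.max(), eigs.min()       # gradient Lipschitz constant and strong-convexity modulus

def grad_F(w):
    return X.T @ (X @ w - y)

F = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2

# Quadratic upper bound (1): F(w') <= F(w) + <grad F(w), w'-w> + L/2 ||w'-w||^2
w, w2 = rng.standard_normal(10), rng.standard_normal(10)
assert F(w2) <= F(w) + grad_F(w) @ (w2 - w) + L / 2 * np.linalg.norm(w2 - w) ** 2
```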

2 Convergence of the gradient method with constant step-size

Assume $F$ to be convex and differentiable with $L$-Lipschitz continuous gradient, and that a minimizer exists. The first order necessary condition is $\nabla F(w^\ast) = 0$, therefore
\[
w^\ast - \alpha \nabla F(w^\ast) = w^\ast .
\]
This suggests an algorithm based on the fixed point iteration
\[
w_{k+1} = w_k - \alpha \nabla F(w_k).
\]
We want to study the convergence of this algorithm. Convergence can be intended in two senses, towards the minimum value or towards a minimizer; we start from the first one. There are different strategies to choose the stepsize; here we keep $\alpha$ fixed and determine a priori conditions guaranteeing convergence. From the quadratic upper bound (1) we get
\[
F(w_{k+1}) \le F(w_k) - \alpha \|\nabla F(w_k)\|^2 + \frac{L\alpha^2}{2}\|\nabla F(w_k)\|^2 = F(w_k) - \alpha\Big(1 - \frac{L\alpha}{2}\Big)\|\nabla F(w_k)\|^2 .
\]
If $0 < \alpha < 2/L$ the iteration decreases the function value. Choose $\alpha = 1/L$ (which gives the maximum decrease) and get
\[
\begin{aligned}
F(w_{k+1}) &\le F(w_k) - \frac{1}{2L}\|\nabla F(w_k)\|^2 \\
&\le F(w^\ast) + \langle \nabla F(w_k), w_k - w^\ast\rangle - \frac{1}{2L}\|\nabla F(w_k)\|^2 \\
&= F(w^\ast) + \frac{L}{2}\Big( \frac{2}{L}\langle \nabla F(w_k), w_k - w^\ast\rangle - \frac{1}{L^2}\|\nabla F(w_k)\|^2 - \|w_k - w^\ast\|^2 + \|w_k - w^\ast\|^2 \Big) \\
&= F(w^\ast) + \frac{L}{2}\Big( \|w_k - w^\ast\|^2 - \big\|w_k - \tfrac{1}{L}\nabla F(w_k) - w^\ast\big\|^2 \Big) \\
&= F(w^\ast) + \frac{L}{2}\Big( \|w_k - w^\ast\|^2 - \|w_{k+1} - w^\ast\|^2 \Big).
\end{aligned}
\]
Summing the above inequality for $k = 0, \ldots, K-1$ we get
\[
\sum_{k=0}^{K-1} \big( F(w_{k+1}) - F(w^\ast) \big) \le \frac{L}{2} \sum_{k=0}^{K-1} \big( \|w_k - w^\ast\|^2 - \|w_{k+1} - w^\ast\|^2 \big) \le \frac{L}{2}\|w_0 - w^\ast\|^2 .
\]
Noting that $F(w_k)$ is decreasing, $F(w_K) - F(w^\ast) \le F(w_{k+1}) - F(w^\ast)$ for every $k \le K-1$, therefore we obtain
\[
F(w_K) - F(w^\ast) \le \frac{L}{2K}\|w_0 - w^\ast\|^2 .
\]
This is called a sublinear rate of convergence. For strongly convex functions, it is possible to prove that the operator $I - \alpha \nabla F$ is a contraction, and therefore we get a linear convergence rate:
\[
\|w_K - w^\ast\|^2 \le \Big( \frac{L - \mu}{L + \mu} \Big)^{2K} \|w_0 - w^\ast\|^2 ,
\]
which gives, using the bound following (1),
\[
F(w_K) - F(w^\ast) \le \frac{L}{2}\Big( \frac{L - \mu}{L + \mu} \Big)^{2K} \|w_0 - w^\ast\|^2 ,
\]
which is much better.
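A minimal sketch of the constant step-size iteration above, run on a synthetic least squares problem (data, sizes and names are assumptions made for illustration):

```python
import numpy as np

# Gradient descent with constant step alpha = 1/L on F(w) = 0.5||Xw - y||^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
L = np.linalg.eigvalsh(X.T @ X).max()

F = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad_F = lambda w: X.T @ (X @ w - y)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # reference minimizer

K = 200
w = np.zeros(10)                                # w_0 = 0
for k in range(K):
    w = w - (1.0 / L) * grad_F(w)               # w_{k+1} = w_k - (1/L) grad F(w_k)

# The objective gap should sit below the L/(2K) ||w_0 - w*||^2 bound.
print(F(w) - F(w_star), L / (2 * K) * np.linalg.norm(w_star) ** 2)
```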

It is known that for general convex problems with Lipschitz continuous gradient, the worst-case performance of any first order method is lower bounded by a quantity of order $1/k^2$. Nesterov in 1983 devised an algorithm attaining this lower bound. The algorithm is called accelerated gradient descent and is very similar to the gradient method; it needs to store two iterates instead of only one. It is of the form
\[
\begin{aligned}
w_{k+1} &= u_k - \frac{1}{L}\nabla F(u_k) \\
u_{k+1} &= a_k w_k + b_k w_{k+1},
\end{aligned}
\]
for some $w_0 \in \operatorname{dom} F$, with $u_1 = w_0$ and a suitable (a priori determined) sequence of parameters $a_k$ and $b_k$. More precisely, choose $w_0 \in \operatorname{dom} F$, set $u_1 = w_0$ and $t_1 = 1$, and then define
\[
\begin{aligned}
w_{k+1} &= u_k - \frac{1}{L}\nabla F(u_k) \\
t_{k+1} &= \frac{1 + \sqrt{1 + 4 t_k^2}}{2} \\
u_{k+1} &= \Big( 1 + \frac{t_k - 1}{t_{k+1}} \Big) w_{k+1} - \frac{t_k - 1}{t_{k+1}}\, w_k .
\end{aligned}
\]
We obtain a rate of order $1/k^2$:
\[
F(w_k) - F(w^\ast) \le \frac{2 L \|w_0 - w^\ast\|^2}{(k+1)^2}.
\]
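A minimal sketch of the accelerated scheme just described, again on a synthetic least squares problem (data and names are assumptions; the update is exactly the $(w_k, u_k, t_k)$ recursion above):

```python
import numpy as np

# Accelerated (Nesterov / FISTA-type) gradient method on least squares.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
L = np.linalg.eigvalsh(X.T @ X).max()
grad_F = lambda w: X.T @ (X @ w - y)
F = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2

w = np.zeros(10)      # w_0
u = w.copy()          # u_1 = w_0
t = 1.0               # t_1 = 1
for k in range(200):
    w_next = u - (1.0 / L) * grad_F(u)                # gradient step at the extrapolated point
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
    u = w_next + (t - 1.0) / t_next * (w_next - w)    # momentum / extrapolation step
    w, t = w_next, t_next

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(F(w) - F(w_star))   # should be far below the O(1/k^2) bound
```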

3 Regularized optimization

We often want to solve
\[
\min_{w \in \mathbb{R}^d} F(w) + R(w),
\]
where either $F$ is smooth (e.g. the square loss) and $R$ is convex and nonsmooth, or $R$ is smooth and $F$ is not (as for the SVM). We would like to write a condition similar to $\nabla F(w^\ast) = 0$ to characterize a minimizer. We use the subdifferential. Let $R$ be a convex, lsc, proper function. A vector $\eta \in \mathbb{R}^d$ is a subgradient of $R$ at $w$ if
\[
R(w') \ge R(w) + \langle \eta, w' - w\rangle \qquad \text{for all } w' \in \mathbb{R}^d .
\]
The subdifferential $\partial R(w)$ is the set of all subgradients of $R$ at $w$. It is easy to see that
\[
R(w^\ast) = \min R \iff 0 \in \partial R(w^\ast).
\]
If $R$ is differentiable, the subdifferential is a singleton and coincides with the gradient.

Example 3.1 (Subdifferential of the indicator function). Let $i_C$ be the indicator function of a convex set $C$ (constrained regularization). If $w \notin C$, then $\partial i_C(w) = \emptyset$. If $w \in C$, then $\eta \in \partial i_C(w)$ if and only if, for all $v \in C$,
\[
i_C(v) \ge i_C(w) + \langle \eta, v - w\rangle \iff 0 \ge \langle \eta, v - w\rangle .
\]
This set is the normal cone to $C$ at $w$.

Example 3.2 (Subdifferential of $R = \|\cdot\|_1$). By definition, $\eta \in \partial R(w)$ if and only if, for all $v$,
\[
\sum_{i=1}^d |v_i| \ge \sum_{i=1}^d |w_i| + \langle \eta, v - w\rangle .
\]
If, for every $i = 1, \ldots, d$ and every $v_i$, we have $|v_i| \ge |w_i| + \eta_i (v_i - w_i)$, then summing over $i$ gives $\eta \in \partial R(w)$. Vice versa, taking $v_j = w_j$ for all $j \neq i$, we get that $\eta \in \partial R(w)$ implies $|v_i| \ge |w_i| + \eta_i (v_i - w_i)$ for every $v_i$, and thus $\eta_i \in \partial|\cdot|(w_i)$. We therefore proved that
\[
\partial R(w) = \partial|\cdot|(w_1) \times \cdots \times \partial|\cdot|(w_d).
\]

Proximity operator. Let $R$ be lsc, convex and proper. Then
\[
\operatorname{prox}_R(v) = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Big\{ R(w) + \frac{1}{2}\|w - v\|^2 \Big\}
\]
is well-defined and unique. Imposing the first order necessary conditions, we get
\[
u = \operatorname{prox}_R(v) \iff 0 \in \partial R(u) + (u - v) \iff v - u \in \partial R(u) \iff u = (I + \partial R)^{-1}(v).
\]

Example 3.3.
(i) If $R = 0$, then $\operatorname{prox}_R(v) = v$.
(ii) If $R = i_C$, directly from the definition $\operatorname{prox}_R(v) = P_C(v)$.
(iii) Proximity operator of the $\ell^1$ norm. Let $\lambda > 0$ and set $R = \lambda\|\cdot\|_1$. Let $v \in \mathbb{R}^d$ and $u = \operatorname{prox}_R(v)$. Then $v - u \in \lambda\, \partial\|\cdot\|_1(u)$. Since the subdifferential can be computed componentwise, the same holds for the prox, and componentwise $u_i = (I + \lambda\,\partial|\cdot|)^{-1}(v_i)$. To compute this quantity, first note that
\[
\big( (I + \lambda\,\partial|\cdot|)(u) \big)_i = \begin{cases} u_i + \lambda & \text{if } u_i > 0 \\ [-\lambda, \lambda] & \text{if } u_i = 0 \\ u_i - \lambda & \text{if } u_i < 0 . \end{cases}
\]
Inverting the previous relationship we get the soft-thresholding operator
\[
\big( \operatorname{prox}_{\lambda\|\cdot\|_1}(v) \big)_i = \begin{cases} v_i - \lambda & \text{if } v_i > \lambda \\ 0 & \text{if } v_i \in [-\lambda, \lambda] \\ v_i + \lambda & \text{if } v_i < -\lambda . \end{cases}
\]
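The examples above translate directly into code. Below is a minimal sketch (function names are illustrative) of the proximity operators just computed: the identity for $R = 0$, a projection for $R = i_C$ (with $C$ a Euclidean ball, as an assumed example), and componentwise soft-thresholding for $R = \lambda\|\cdot\|_1$.

```python
import numpy as np

def prox_zero(v):
    # R = 0: the prox is the identity
    return v

def prox_indicator_ball(v, radius=1.0):
    # R = i_C with C the Euclidean ball of given radius: the prox is the projection P_C
    norm_v = np.linalg.norm(v)
    return v if norm_v <= radius else radius / norm_v * v

def prox_l1(v, lam):
    # R = lam * ||.||_1: componentwise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([2.0, -0.3, 0.7, -1.5])
print(prox_l1(v, 0.5))   # each component is shrunk toward zero by 0.5 and small ones vanish
```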

4 Basic proximal algorithm (forward-backward splitting)

Assume that $F$ is convex and differentiable with Lipschitz continuous gradient. As for gradient descent, the idea is to start from a fixed point equation characterizing the minimizer. Writing the first order conditions, we get
\[
0 \in \nabla F(w^\ast) + \partial R(w^\ast) \iff -\alpha \nabla F(w^\ast) \in \alpha\, \partial R(w^\ast) \iff w^\ast - \alpha \nabla F(w^\ast) \in w^\ast + \alpha\, \partial R(w^\ast) \iff w^\ast = \operatorname{prox}_{\alpha R}\big( w^\ast - \alpha \nabla F(w^\ast) \big).
\]
We consider the corresponding fixed point iteration
\[
w_{k+1} = \operatorname{prox}_{\alpha_k R}\big( w_k - \alpha_k \nabla F(w_k) \big).
\]
Another interpretation:
\[
w_{k+1} = \operatorname*{argmin}_{w} \Big\{ \alpha_k R(w) + \frac{1}{2}\big\| w - \big( w_k - \alpha_k \nabla F(w_k) \big) \big\|^2 \Big\} = \operatorname*{argmin}_{w} \Big\{ R(w) + \frac{1}{2\alpha_k}\|w - w_k\|^2 + \langle w - w_k, \nabla F(w_k)\rangle + F(w_k) \Big\}.
\]
Special cases: $R = 0$ (gradient method) and $R = i_C$ (projected gradient method). The proof of convergence for the sequence of objective values with $\alpha_k = 1/L$ is similar to the proof of convergence in the differentiable case, and the rate of convergence is the same as in the differentiable case (this would not be the case if a subgradient method were used):
\[
F(w_k) - F(w^\ast) \le \frac{L\|w_0 - w^\ast\|^2}{2k}.
\]

Convergence proof. Set $\alpha_k = 1/L$ and define the gradient mapping as
\[
G_{1/L}(w) = L\Big( w - \operatorname{prox}_{R/L}\big( w - \tfrac{1}{L}\nabla F(w) \big) \Big).
\]
Then
\[
w_{k+1} = w_k - \frac{1}{L} G_{1/L}(w_k).
\]
Note that $G_{1/L}$ is not a gradient or a subgradient of $F + R$, but it is called the gradient mapping. By writing the first order condition for the prox operator, we get
\[
G_{1/L}(w) \in \nabla F(w) + \partial R\big( w - \tfrac{1}{L} G_{1/L}(w) \big).
\]
Recalling the upper bound (1), we obtain
\[
F\big( w - \tfrac{1}{L} G_{1/L}(w) \big) \le F(w) - \frac{1}{L}\langle \nabla F(w), G_{1/L}(w)\rangle + \frac{1}{2L}\|G_{1/L}(w)\|^2 . \tag{2}
\]
If inequality (2) holds, then for every $v \in \mathbb{R}^d$,
\[
(F + R)\big( w - \tfrac{1}{L} G_{1/L}(w) \big) \le (F + R)(v) + \langle G_{1/L}(w), w - v\rangle - \frac{1}{2L}\|G_{1/L}(w)\|^2,
\]
from which the rate above follows as in the smooth case.

Accelerated versions. Accelerated versions can be obtained as for the gradient method. The practical limitation is that the forward-backward algorithm is effective only when the prox is easy to compute. Note indeed that we replaced our original problem with a sequence of new minimization problems: they are strongly convex (therefore easier), but in general not solvable in closed form.
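A minimal sketch of the forward-backward (ISTA) iteration above applied to the lasso problem, with constant step $1/L$ and the soft-thresholding prox of Example 3.3 (synthetic data and names are assumptions):

```python
import numpy as np

# Forward-backward splitting (ISTA) for min_w 0.5||Xw - y||^2 + lam*||w||_1,
# with constant step alpha = 1/L.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = rng.standard_normal(100)
lam = 0.1
L = np.linalg.eigvalsh(X.T @ X).max()       # Lipschitz constant of the smooth part

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(20)
for k in range(500):
    grad = X.T @ (X @ w - y)                # forward (gradient) step on F
    w = prox_l1(w - grad / L, lam / L)      # backward (proximal) step on alpha * R

print(np.count_nonzero(w), "nonzero coefficients")
```

Adding the momentum of the previous section to this iteration gives the accelerated (FISTA-type) variant mentioned above.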

5 Fenchel conjugate and Moreau decomposition

Fenchel conjugate. The Fenchel conjugate of $R$ is the function $R^\ast \colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ defined as
\[
R^\ast(\eta) = \sup_{w \in \mathbb{R}^d} \{ \langle \eta, w\rangle - R(w) \}.
\]
$R^\ast$ is a convex function (even if $R$ is not), since it is the pointwise supremum of convex (affine) functions.

Examples.
1. Conjugate of an affine function. If $R(w) = \langle a, w\rangle + b$, then $R^\ast = i_{\{a\}} - b$, i.e. the conjugate is (up to a constant) an indicator function.
2. The conjugate of an indicator function $i_C$ is the support function $\sup_{u \in C}\langle u, \cdot\rangle$.
3. Conjugate of the norm $R(w) = \|w\|$. In this case define the dual norm $\|\eta\|_\ast = \sup_{\|w\| \le 1}\langle \eta, w\rangle$. Then $R^\ast = i_{B_\ast}$, where $B_\ast = \{\eta : \|\eta\|_\ast \le 1\}$ (for the $\ell^1$ norm, the dual norm is the $\ell^\infty$ norm). Indeed, for $\eta \in \mathbb{R}^d$,
\[
R^\ast(\eta) = \sup_{w \in \mathbb{R}^d} \{ \langle \eta, w\rangle - \|w\| \} = \sup_{t \in \mathbb{R}_+} \sup_{\|w\| = 1} \{ t\langle \eta, w\rangle - t \} = \sup_{t \in \mathbb{R}_+} t\big( \|\eta\|_\ast - 1 \big),
\]
and thus $R^\ast(\eta) = i_{B_\ast}(\eta)$.

By definition, $R^\ast(\eta) + R(w) \ge \langle \eta, w\rangle$ for all $\eta, w \in \mathbb{R}^d$ (Fenchel-Young inequality). Moreover,
\[
R(w) + R^\ast(\eta) = \langle \eta, w\rangle \iff \eta \in \partial R(w) \iff w \in \partial R^\ast(\eta).
\]
Indeed, $R^\ast(\eta) = \langle \eta, w\rangle - R(w)$ if and only if $\langle \eta, u\rangle - R(u) \le \langle \eta, w\rangle - R(w)$ for every $u$, if and only if $\eta \in \partial R(w)$. Moreover, from $R(w) + R^\ast(\eta) = \langle w, \eta\rangle$ we get, for every $\eta'$,
\[
R^\ast(\eta') = \sup_{u} \{ \langle \eta', u\rangle - R(u) \} \ge \langle \eta', w\rangle - R(w) = \langle \eta' - \eta, w\rangle + \langle \eta, w\rangle - R(w) = \langle \eta' - \eta, w\rangle + R^\ast(\eta),
\]
that is, $w \in \partial R^\ast(\eta)$. If $R$ is lsc and convex, then $R^{\ast\ast} = R$, which gives the remaining implication.

Moreau decomposition.
\[
w = \operatorname{prox}_R(w) + \operatorname{prox}_{R^\ast}(w).
\]
It follows from the properties of the subdifferential and of the conjugate stated above:
\[
u = \operatorname{prox}_R(w) \iff w - u \in \partial R(u) \iff u \in \partial R^\ast(w - u) \iff w - (w - u) \in \partial R^\ast(w - u) \iff w - u = \operatorname{prox}_{R^\ast}(w).
\]
Note that this is a generalization of the classical decomposition into orthogonal components: if $V$ is a linear subspace and $V^\perp$ is the orthogonal subspace, we know that $w = P_V(w) + P_{V^\perp}(w)$. This is the special case of the Moreau decomposition obtained by choosing $R = i_V$ (and noting that $R^\ast = i_{V^\perp}$).

Properties of the proximity operator.
Separable sum: if $R(w) = R_1(w_1) + R_2(w_2)$, then $\operatorname{prox}_R(w) = \big( \operatorname{prox}_{R_1}(w_1), \operatorname{prox}_{R_2}(w_2) \big)$.
Scaling: $\operatorname{prox}_{R + \frac{\mu}{2}\|\cdot\|^2}(v) = \operatorname{prox}_{\frac{1}{1+\mu} R}\Big( \dfrac{v}{1+\mu} \Big)$.
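Before generalizing, here is a quick numerical check of the Moreau decomposition (a sketch; the $\ell^1$/$\ell^\infty$ pair is an assumed example): with $R = \|\cdot\|_1$, whose conjugate is the indicator of the $\ell^\infty$ unit ball, $\operatorname{prox}_R(w) + \operatorname{prox}_{R^\ast}(w)$ should return $w$.

```python
import numpy as np

# Numerical check of w = prox_R(w) + prox_{R*}(w) for R = ||.||_1,
# whose conjugate is the indicator of the l_inf unit ball.
def prox_l1(v, t=1.0):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proj_linf_ball(v, radius=1.0):
    return np.clip(v, -radius, radius)     # prox of the indicator = projection

w = np.array([3.0, -0.4, 0.9, -2.2, 0.0])
assert np.allclose(prox_l1(w) + proj_linf_ball(w), w)
```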

Generalized Moreau decomposition: for every $\lambda > 0$,
\[
w = \operatorname{prox}_{\lambda R}(w) + \lambda\, \operatorname{prox}_{R^\ast/\lambda}(w/\lambda).
\]
Sometimes, the Moreau decomposition is useful to compute proximity operators. Let $R(w) = \lambda\|w\|$. We have seen that $R^\ast = i_{B_\ast(\lambda)}$, the indicator of the dual-norm ball of radius $\lambda$. Therefore, from the Moreau decomposition, we get
\[
\operatorname{prox}_R(w) = w - P_{B_\ast(\lambda)}(w).
\]
In particular, if $R = \lambda\|\cdot\|_1$, noting that the dual norm of $\|\cdot\|_1$ is $\|\cdot\|_\infty$, we obtain again the soft-thresholding formula seen before.

Elastic net. The scaling property above reduces the proximity operator of the elastic net penalty $\lambda\|\cdot\|_1 + \frac{\mu}{2}\|\cdot\|^2$ to a rescaled soft-thresholding.

Group lasso. Let $\mathcal{G} = \{G_1, \ldots, G_t\}$ be a partition of the indices $\{1, \ldots, d\}$. The following norm is called the group lasso penalty:
\[
R(w) = \sum_{i=1}^t \|w_{G_i}\|, \qquad \text{where } \|w_{G_i}\|^2 = \sum_{j \in G_i} w_j^2 .
\]
The dual norm is $\max_{j=1,\ldots,t} \|w_{G_j}\|$, and therefore
\[
\operatorname{prox}_R(w) = w - P_B(w), \qquad B = \{ w \in \mathbb{R}^d : \|w_{G_j}\| \le 1,\ j = 1, \ldots, t \}.
\]
The projection onto this set can be expressed blockwise as
\[
\big( P_B(w) \big)_{G_j} = \begin{cases} w_{G_j} & \text{if } \|w_{G_j}\| \le 1 \\ \dfrac{w_{G_j}}{\|w_{G_j}\|} & \text{otherwise.} \end{cases}
\]
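Putting the last formula to work, a sketch of the group lasso proximity operator obtained via the Moreau decomposition, $\operatorname{prox}_R(w) = w - P_B(w)$ (group structure and data are assumptions made for illustration):

```python
import numpy as np

# Prox of the group lasso penalty R(w) = sum_j ||w_{G_j}|| via the Moreau
# decomposition prox_R(w) = w - P_B(w), where B is the unit ball of the
# dual norm max_j ||w_{G_j}||.
def proj_dual_ball(w, groups):
    p = w.copy()
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        if norm_g > 1.0:
            p[g] = w[g] / norm_g
    return p

def prox_group_lasso(w, groups):
    return w - proj_dual_ball(w, groups)   # blockwise soft-thresholding

groups = [np.arange(0, 3), np.arange(3, 6)]
w = np.array([3.0, 0.0, 4.0, 0.1, 0.2, 0.1])
print(prox_group_lasso(w, groups))   # first block shrunk, second block set to zero
```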

6 Computation of the proximity operator of matrix functions

Let $W \in \mathbb{R}^{D \times T}$ and let $\sigma \colon \mathbb{R}^{D \times T} \to \mathbb{R}^q$, where $q = \min\{D, T\}$ and $\sigma_1(W) \ge \sigma_2(W) \ge \cdots \ge \sigma_q(W) \ge 0$ are the singular values of the matrix $W$. We consider regularization terms of the form
\[
R(W) = g(\sigma(W)),
\]
where $g$ is absolutely symmetric, namely $g(\alpha) = g(\hat{\alpha})$, where $\hat{\alpha}$ is the vector obtained by ordering the absolute values of the components of $\alpha$ in decreasing order. We will show that the computation of the proximity operator of $R$ can be reduced to the computation of that of the function $g$. Indeed, if $W = U \operatorname{diag}(\sigma(W)) V^T$ (with $U \in \mathbb{R}^{D \times D}$, $\operatorname{diag}(\sigma(W)) \in \mathbb{R}^{D \times T}$, and $V \in \mathbb{R}^{T \times T}$), it holds that
\[
\operatorname{prox}_{\lambda R}(W) = U \operatorname{diag}\big( \operatorname{prox}_{\lambda g}(\sigma(W)) \big) V^T .
\]
The proof of this statement is based on two results.

Theorem 6.1 (Von Neumann, 1937). For any $D \times T$ matrices $Z$ and $W$,
\[
\max\{ \langle U Z V^T, W\rangle : U \text{ and } V \text{ orthogonal} \} = \langle \sigma(Z), \sigma(W)\rangle,
\]
and hence $\langle Z, W\rangle \le \langle \sigma(Z), \sigma(W)\rangle$. Equality holds if and only if there exists a simultaneous SVD of $Z$ and $W$.

Theorem 6.2 (Conjugacy formula). If $g \colon \mathbb{R}^q \to \,]-\infty, +\infty]$ is absolutely symmetric, then $(g \circ \sigma)^\ast = g^\ast \circ \sigma$.

Proof. Let $W \in \mathbb{R}^{D \times T}$. Then
\[
\begin{aligned}
(g \circ \sigma)^\ast(W) &= \sup_{Z \in \mathbb{R}^{D \times T}} \{ \langle Z, W\rangle - g(\sigma(Z)) \} \\
&= \sup_{U, V \text{ orthogonal},\, A \in \mathbb{R}^{D \times T}} \{ \langle U A V^T, W\rangle - g(\sigma(A)) \} \\
&= \sup_{A \in \mathbb{R}^{D \times T}} \Big\{ \sup_{U, V \text{ orthogonal}} \langle U A V^T, W\rangle - g(\sigma(A)) \Big\} \\
&= \sup_{A \in \mathbb{R}^{D \times T}} \{ \langle \sigma(A), \sigma(W)\rangle - g(\sigma(A)) \} \\
&= \sup_{\alpha \in \mathbb{R}^q} \{ \langle \alpha, \sigma(W)\rangle - g(\alpha) \} = g^\ast(\sigma(W)).
\end{aligned}
\]

Using the two previous results it is easy to show a formula for the subdifferential of the function $R$.

Theorem 6.3. Let $g \colon \mathbb{R}^q \to \,]-\infty, +\infty]$ be absolutely symmetric. Then
\[
\partial (g \circ \sigma)(W) = \{ U \operatorname{diag}(\alpha) V^T : \alpha \in \partial g(\sigma(W)),\ W = U \operatorname{diag}(\sigma(W)) V^T,\ U \text{ and } V \text{ orthogonal} \}.
\]

Proof. By definition of the Fenchel conjugate and using the previous theorem,
\[
\partial (g \circ \sigma)(W) = \{ Z : (g \circ \sigma)(W) + (g \circ \sigma)^\ast(Z) = \langle W, Z\rangle \} = \{ Z : (g \circ \sigma)(W) + (g^\ast \circ \sigma)(Z) = \langle W, Z\rangle \}.
\]
Now, by the Fenchel-Young inequality and Theorem 6.1,
\[
(g \circ \sigma)(W) + (g^\ast \circ \sigma)(Z) \ge \langle \sigma(W), \sigma(Z)\rangle \ge \langle W, Z\rangle .
\]
Let $Z \in \partial (g \circ \sigma)(W)$. Then we must have $\langle \sigma(W), \sigma(Z)\rangle = \langle W, Z\rangle$ and hence, by Von Neumann's theorem, $Z$ and $W$ have a simultaneous singular value decomposition. Moreover, $g(\sigma(W)) + g^\ast(\sigma(Z)) = \langle \sigma(W), \sigma(Z)\rangle$, and therefore $\sigma(Z) \in \partial g(\sigma(W))$. So we proved one inclusion. Let us prove the other one. Take $\alpha \in \partial g(\sigma(W))$ and define $Z = U \operatorname{diag}(\alpha) V^T$, where $U$ and $V$ are orthogonal matrices such that $W = U \operatorname{diag}(\sigma(W)) V^T$. Then $W$ and $Z$ have a simultaneous singular value decomposition and hence
\[
(g \circ \sigma)(W) + (g \circ \sigma)^\ast(Z) = g(\sigma(W)) + g^\ast(\alpha) = \langle \sigma(W), \alpha\rangle = \langle \sigma(W), \sigma(Z)\rangle = \langle W, Z\rangle .
\]
This implies that $Z \in \partial (g \circ \sigma)(W)$.

Theorem 6.4. Let $R = g \circ \sigma$, with $g \colon \mathbb{R}^q \to \,]-\infty, +\infty]$ absolutely symmetric, let $W = U \operatorname{diag}(\sigma(W)) V^T$, with $U$ and $V$ orthogonal, and let $\lambda > 0$. Then
\[
\operatorname{prox}_{\lambda R}(W) = U \operatorname{diag}\big( \operatorname{prox}_{\lambda g}(\sigma(W)) \big) V^T .
\]

Proof. By definition, $\hat{Z} = \operatorname{prox}_{\lambda R}(W)$ is the unique minimizer of
\[
Z \mapsto R(Z) + \frac{1}{2\lambda}\|Z - W\|^2,
\]
and is thus the unique point in $\mathbb{R}^{D \times T}$ satisfying
\[
\frac{1}{\lambda}\big( W - \hat{Z} \big) \in \partial R(\hat{Z}).
\]
Now, let $\bar{Z} = U \operatorname{diag}\big( \operatorname{prox}_{\lambda g}(\sigma(W)) \big) V^T$, and let $\alpha = \operatorname{prox}_{\lambda g}(\sigma(W))$. By definition of $\operatorname{prox}_{\lambda g}$, we have
\[
\frac{1}{\lambda}\big( \sigma(W) - \alpha \big) \in \partial g(\alpha).
\]
Using the characterization of the subdifferential found in Theorem 6.3, we get
\[
U \operatorname{diag}\Big( \frac{1}{\lambda}\big( \sigma(W) - \alpha \big) \Big) V^T \in \partial (g \circ \sigma)(\bar{Z}),
\]
and the conclusion follows by noting that $\hat{Z} = \bar{Z}$, since the left hand side coincides with $(W - \bar{Z})/\lambda$.

Using these general results it is easy to compute the proximity operator of the nuclear norm (a.k.a. trace class norm, Schatten 1-norm):
\[
R(W) = \|W\|_1 = \sum_{i=1}^q \sigma_i(W).
\]
The function $\|\cdot\|_1$ is the composition of the absolutely symmetric function $g \colon \mathbb{R}^q \to \mathbb{R}$, $g(\alpha) = \sum_{i=1}^q |\alpha_i|$, with $\sigma$. The computation of the prox of $R$ is thus reduced to the computation of the prox of the $\ell^1$ norm in $\mathbb{R}^q$. We already know that $\operatorname{prox}_{\lambda g}(\alpha) = S_\lambda(\alpha)$, the soft-thresholding operator, and finally we get
\[
\operatorname{prox}_{\lambda R}(W) = U \operatorname{diag}\big( S_\lambda(\sigma(W)) \big) V^T, \qquad \text{where } W = U \operatorname{diag}(\sigma(W)) V^T .
\]
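As a concrete instance of Theorem 6.4, a sketch of the nuclear norm proximity operator obtained by soft-thresholding the singular values (the test matrix is arbitrary; `np.linalg.svd` returns the factors of $W = U \operatorname{diag}(\sigma(W)) V^T$):

```python
import numpy as np

# Prox of the nuclear norm lam * ||W||_* by soft-thresholding the singular
# values (Theorem 6.4 with g the l1 norm). Test matrix is arbitrary.
def prox_nuclear(W, lam):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_thresh = np.maximum(s - lam, 0.0)          # prox of lam * ||.||_1 on the singular values
    return U @ np.diag(s_thresh) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
Z = prox_nuclear(W, lam=1.0)
print(np.linalg.matrix_rank(Z), "<= rank of W =", np.linalg.matrix_rank(W))
```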

7 References

[A] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal processing, 2009.
[B] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling and Simulation, 2005.
[C] A. S. Lewis, The convex analysis of unitarily invariant matrix functions, Journal of Convex Analysis, 1995.
[D] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 2004.
[E] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2009.