ELE 538B: Large-Scale Optimization for Data Science

Proximal gradient methods

Yuxin Chen, Princeton University, Spring 2018
Outline
- Proximal gradient descent for composite functions
- Proximal mapping / operator
- Convergence analysis
Proximal gradient descent for composite functions
Composite models

$$\text{minimize}_{x \in \mathbb{R}^n} \quad F(x) := f(x) + h(x)$$

- $f$: convex and smooth
- $h$: convex (may not be differentiable)

Let $F^{\mathrm{opt}} := \min_x F(x)$ be the optimal cost.

Proximal gradient methods 6-4
Examples

ℓ1-regularized minimization

$$\text{minimize}_x \quad f(x) + \underbrace{\|x\|_1}_{h(x):\ \ell_1 \text{ norm}}$$

use ℓ1 regularization to promote sparsity

Nuclear norm regularized minimization

$$\text{minimize}_X \quad f(X) + \underbrace{\|X\|_*}_{h(X):\ \text{nuclear norm}}$$

use nuclear norm regularization to promote low-rank structure
A proximal view of gradient descent

To motivate proximal gradient methods, we first revisit gradient descent

$$x^{t+1} = x^t - \eta_t \nabla f(x^t)$$

which can equivalently be written as

$$x^{t+1} = \arg\min_x \Big\{ \underbrace{f(x^t) + \langle \nabla f(x^t),\, x - x^t \rangle}_{\text{first-order approximation}} + \underbrace{\tfrac{1}{2\eta_t} \|x - x^t\|^2}_{\text{proximal term}} \Big\}$$
A proximal view of gradient descent

$$x^{t+1} = \arg\min_x \Big\{ \underbrace{f(x^t) + \langle \nabla f(x^t),\, x - x^t \rangle}_{\text{first-order approximation}} + \underbrace{\tfrac{1}{2\eta_t} \|x - x^t\|^2}_{\text{proximal term}} \Big\}$$

[Figure: the first-order approximation of $f$ at $x^t$ and the curve $-\tfrac{1}{2\eta_t}\|x - x^t\|^2 + c$, plotted as functions of $x$.]

By the optimality condition, $x^{t+1}$ is the point where $f(x^t) + \langle \nabla f(x^t),\, x - x^t \rangle$ and $-\tfrac{1}{2\eta_t}\|x - x^t\|^2$ have the same slope.
How about projected gradient descent?

$$\begin{aligned}
x^{t+1} = \mathcal{P}_C\big(x^t - \eta_t \nabla f(x^t)\big)
&= \arg\min_x \Big\{ f(x^t) + \langle \nabla f(x^t),\, x - x^t \rangle + \tfrac{1}{2\eta_t}\|x - x^t\|^2 + \mathbb{1}_C(x) \Big\} \\
&= \arg\min_x \Big\{ \tfrac{1}{2}\big\|x - \big(x^t - \eta_t \nabla f(x^t)\big)\big\|^2 + \eta_t \mathbb{1}_C(x) \Big\}
\end{aligned} \tag{6.1}$$

where $\mathbb{1}_C(x) = \begin{cases} 0, & \text{if } x \in C \\ \infty, & \text{else} \end{cases}$
Proximal operator

Define the proximal operator

$$\text{prox}_h(x) := \arg\min_z \Big\{ \tfrac{1}{2}\|z - x\|^2 + h(z) \Big\}$$

for any convex function $h$. This allows one to express the projected GD update (6.1) as

$$x^{t+1} = \text{prox}_{\eta_t \mathbb{1}_C}\big(x^t - \eta_t \nabla f(x^t)\big) \tag{6.2}$$
Proximal gradient methods

One can generalize (6.2) to accommodate more general $h$

Algorithm 6.1 Proximal gradient algorithm
1: for $t = 0, 1, \ldots$ do
2:   $x^{t+1} = \text{prox}_{\eta_t h}\big(x^t - \eta_t \nabla f(x^t)\big)$

- alternates between gradient updates on $f$ and proximal minimization on $h$
- useful if $\text{prox}_h$ is inexpensive
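The proximal gradient iteration above can be sketched in a few lines. The sketch below assumes a prox oracle `prox_h(v, s)` returning $\text{prox}_{sh}(v)$; the function names are illustrative, not from the notes.

```python
import numpy as np

def proximal_gradient(grad_f, prox_h, x0, eta, num_iters=100):
    """Proximal gradient iteration: x <- prox_{eta*h}(x - eta*grad_f(x)).

    grad_f(x):    gradient of the smooth part f
    prox_h(v, s): proximal operator of s*h evaluated at v
    """
    x = x0
    for _ in range(num_iters):
        x = prox_h(x - eta * grad_f(x), eta)
    return x

# Toy composite problem: f(x) = 0.5*(x - 3)^2, h(x) = |x|.
# The minimizer of F = f + h is the soft-threshold of 3 at level 1, i.e. x* = 2.
grad_f = lambda x: x - 3.0
prox_h = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0)
x_star = proximal_gradient(grad_f, prox_h, np.array([0.0]), eta=1.0)
```

With $\eta = 1/L = 1$, the iterate lands on the minimizer $x^* = 2$ after a single step in this toy example.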
Proximal mapping / operator
Why consider proximal operators?

$$\text{prox}_h(x) := \arg\min_z \Big\{ \tfrac{1}{2}\|z - x\|^2 + h(z) \Big\}$$

- well-defined under very general conditions (including nonsmooth convex functions)
- can be evaluated efficiently for many widely used functions (in particular, regularizers)
- this abstraction is conceptually and mathematically simple, and covers many well-known optimization algorithms
Example: indicator functions

If $h = \mathbb{1}_C$ is the indicator function of a closed convex set $C$, then

$$\text{prox}_h(x) = \arg\min_{z \in C} \tfrac{1}{2}\|z - x\|^2 = \mathcal{P}_C(x) \qquad \text{(Euclidean projection)}$$
Example: ℓ1 norm

If $h(x) = \|x\|_1$, then

$$\big(\text{prox}_{\lambda h}(x)\big)_i = \psi_{\mathrm{st}}(x_i; \lambda) \qquad \text{(soft-thresholding)}$$

where $\psi_{\mathrm{st}}$ is applied in an entrywise manner and

$$\psi_{\mathrm{st}}(x; \lambda) = \begin{cases} x - \lambda, & \text{if } x > \lambda \\ x + \lambda, & \text{if } x < -\lambda \\ 0, & \text{else} \end{cases}$$
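A minimal numpy sketch of the soft-thresholding operator; the sign/maximum form below is equivalent to the three-way case split above.

```python
import numpy as np

def soft_threshold(x, lam):
    """Entrywise soft-thresholding: prox of lam * ||.||_1 evaluated at x."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```

For example, `soft_threshold(np.array([3.0, -0.5, 1.5]), 1.0)` shrinks each entry toward zero by $\lambda = 1$ and zeroes out entries with magnitude below $\lambda$.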
Basic rules

- if $f(x) = a\, g(x) + b$ with $a > 0$, then $\text{prox}_f(x) = \text{prox}_{a g}(x)$
- affine addition: if $f(x) = g(x) + \langle a, x \rangle + b$, then

$$\text{prox}_f(x) = \text{prox}_g(x - a)$$
Basic rules

- quadratic addition: if $f(x) = g(x) + \frac{\rho}{2}\|x - a\|^2$, then

$$\text{prox}_f(x) = \text{prox}_{\frac{1}{1+\rho} g}\Big( \frac{1}{1+\rho}\, x + \frac{\rho}{1+\rho}\, a \Big)$$

- scaling and translation: if $f(x) = g(ax + b)$ with $a \neq 0$, then

$$\text{prox}_f(x) = \frac{1}{a}\Big( \text{prox}_{a^2 g}(ax + b) - b \Big) \qquad \text{(homework)}$$
Proof for quadratic addition

$$\begin{aligned}
\text{prox}_f(x) &= \arg\min_z \Big\{ \tfrac{1}{2}\|z - x\|^2 + g(z) + \tfrac{\rho}{2}\|z - a\|^2 \Big\} \\
&= \arg\min_z \Big\{ \tfrac{1+\rho}{2}\|z\|^2 - \langle z,\, x + \rho a \rangle + g(z) \Big\} \\
&= \arg\min_z \Big\{ \tfrac{1}{2}\|z\|^2 - \tfrac{1}{1+\rho}\langle z,\, x + \rho a \rangle + \tfrac{1}{1+\rho}\, g(z) \Big\} \\
&= \arg\min_z \Big\{ \tfrac{1}{2}\Big\|z - \Big(\tfrac{1}{1+\rho} x + \tfrac{\rho}{1+\rho} a\Big)\Big\|^2 + \tfrac{1}{1+\rho}\, g(z) \Big\} \\
&= \text{prox}_{\frac{1}{1+\rho} g}\Big( \tfrac{1}{1+\rho}\, x + \tfrac{\rho}{1+\rho}\, a \Big)
\end{aligned}$$
Basic rules

- orthogonal mapping: if $f(x) = g(Qx)$ with $Q$ orthogonal ($QQ^\top = Q^\top Q = I$), then

$$\text{prox}_f(x) = Q^\top \text{prox}_g(Qx) \qquad \text{(homework)}$$

- orthogonal affine mapping: if $f(x) = g(Qx + b)$ with $\underbrace{QQ^\top = \alpha^{-1} I}_{\text{does not require } Q^\top Q = \alpha^{-1} I}$, then

$$\text{prox}_f(x) = \big(I - \alpha Q^\top Q\big)x + \alpha Q^\top\Big( \text{prox}_{\alpha^{-1} g}(Qx + b) - b \Big)$$

- for general $Q$, it is not easy to derive $\text{prox}_f$ from $\text{prox}_g$
Basic rules

- norm composition: if $f(x) = g(\|x\|)$ with $\text{dom}(g) = [0, \infty)$, then

$$\text{prox}_f(x) = \frac{\text{prox}_g(\|x\|)}{\|x\|}\, x, \qquad x \neq 0$$
Proof for norm composition

Observe that

$$\begin{aligned}
\min_z \Big\{ f(z) + \tfrac{1}{2}\|z - x\|^2 \Big\}
&= \min_z \Big\{ g(\|z\|) + \tfrac{1}{2}\|z\|^2 - \langle z, x \rangle + \tfrac{1}{2}\|x\|^2 \Big\} \\
&= \min_{\alpha \ge 0}\ \min_{z:\, \|z\| = \alpha} \Big\{ g(\alpha) + \tfrac{1}{2}\alpha^2 - \langle z, x \rangle + \tfrac{1}{2}\|x\|^2 \Big\} \\
&= \min_{\alpha \ge 0} \Big\{ g(\alpha) + \tfrac{1}{2}\alpha^2 - \alpha\|x\| + \tfrac{1}{2}\|x\|^2 \Big\} \qquad \text{(Cauchy–Schwarz)} \\
&= \min_{\alpha \ge 0} \Big\{ g(\alpha) + \tfrac{1}{2}\big(\alpha - \|x\|\big)^2 \Big\}
\end{aligned}$$

From the above calculation, the optimal point is $\alpha^* = \text{prox}_g(\|x\|)$ and

$$z^* = \alpha^* \frac{x}{\|x\|} = \frac{\text{prox}_g(\|x\|)}{\|x\|}\, x,$$

thus concluding the proof.
Nonexpansiveness of projection

Recall that when $h(x) = \mathbb{1}_C(x)$, $\text{prox}_h(x)$ is the Euclidean projection $\mathcal{P}_C$ onto $C$, which is nonexpansive for convex $C$:

$$\|\mathcal{P}_C(x_1) - \mathcal{P}_C(x_2)\| \le \|x_1 - x_2\|$$
Nonexpansiveness of proximal operators

Nonexpansiveness is a property of the general $\text{prox}_h(\cdot)$; in some sense, the proximal operator behaves like a projection.

Fact 6.1 (Nonexpansiveness)

$$\|\text{prox}_h(x_1) - \text{prox}_h(x_2)\| \le \|x_1 - x_2\|$$
Proof of Fact 6.1

Let $z_1 = \text{prox}_h(x_1)$ and $z_2 = \text{prox}_h(x_2)$. The subgradient characterizations of $z_1$ and $z_2$ read

$$x_1 - z_1 \in \partial h(z_1) \qquad \text{and} \qquad x_2 - z_2 \in \partial h(z_2)$$

The nonexpansiveness claim $\|z_1 - z_2\| \le \|x_1 - x_2\|$ would follow (together with Cauchy–Schwarz) from

$$\underbrace{(x_1 - x_2)^\top (z_1 - z_2) \ge \|z_1 - z_2\|^2}_{\text{firm nonexpansiveness}}$$

To see this, use the subgradient inequalities

$$h(z_2) \ge h(z_1) + \langle x_1 - z_1,\, z_2 - z_1 \rangle \qquad \text{and} \qquad h(z_1) \ge h(z_2) + \langle x_2 - z_2,\, z_1 - z_2 \rangle$$

Adding these inequalities gives $\langle (x_1 - z_1) - (x_2 - z_2),\, z_1 - z_2 \rangle \ge 0$, which is exactly firm nonexpansiveness.
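As a quick numerical sanity check of the nonexpansiveness fact (not part of the notes), one can test the inequality with the ℓ1 prox (soft-thresholding) on a random pair of points:

```python
import numpy as np

def soft(x, lam=1.0):
    """prox of lam*||.||_1, used here as a concrete prox_h."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(10), rng.standard_normal(10)

lhs = np.linalg.norm(soft(x1) - soft(x2))   # ||prox_h(x1) - prox_h(x2)||
rhs = np.linalg.norm(x1 - x2)               # ||x1 - x2||
```

By Fact 6.1, `lhs <= rhs` must hold for every such pair; rerunning with other seeds gives the same conclusion.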
Resolvent of the subdifferential operator

One can interpret $\text{prox}$ via the resolvent of the subdifferential operator.

Fact 6.2

$$z = \text{prox}_f(x) \quad \Longleftrightarrow \quad z = \underbrace{(I + \partial f)^{-1}}_{\text{resolvent of operator } \partial f}(x)$$

where $I$ is the identity mapping.
Justification of Fact 6.2

$$\begin{aligned}
z = \arg\min_u \Big\{ f(u) + \tfrac{1}{2}\|u - x\|^2 \Big\}
&\Longleftrightarrow\ 0 \in \partial f(z) + z - x \qquad \text{(optimality condition)} \\
&\Longleftrightarrow\ x \in (I + \partial f)(z) \\
&\Longleftrightarrow\ z = (I + \partial f)^{-1}(x)
\end{aligned}$$
Moreau decomposition

Fact 6.3

Suppose $f$ is closed and convex, and let $f^*(x) := \sup_z \{ \langle x, z \rangle - f(z) \}$ be the convex conjugate of $f$. Then

$$x = \text{prox}_f(x) + \text{prox}_{f^*}(x)$$

- a key relationship between proximal mapping and duality
- a generalization of orthogonal decomposition
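A numerical illustration of the Moreau decomposition (a sanity check, not from the notes): take $f = \|\cdot\|_1$, whose conjugate $f^*$ is the indicator of the unit ℓ∞ ball, so $\text{prox}_{f^*}$ is entrywise clipping to $[-1, 1]$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 3.0 * rng.standard_normal(8)

prox_f = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)  # prox of ||.||_1 (soft-threshold)
prox_fstar = np.clip(x, -1.0, 1.0)                      # projection onto unit l_inf ball

recomposed = prox_f + prox_fstar   # Moreau decomposition: should equal x exactly
```

The identity holds entrywise and exactly here: each coordinate either gets clipped (and the soft-threshold contributes the excess) or soft-thresholds to zero (and clipping returns the coordinate itself).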
Moreau decomposition for convex cones

When $K$ is a closed convex cone, $(\mathbb{1}_K)^*(x) = \mathbb{1}_{K^\circ}(x)$ (exercise), with

$$K^\circ := \{ x : \langle x, z \rangle \le 0 \ \ \forall z \in K \}$$

the polar cone of $K$. This gives

$$x = \mathcal{P}_K(x) + \mathcal{P}_{K^\circ}(x)$$

A special case: if $K$ is a subspace, then $K^\circ = K^\perp$, and hence

$$x = \mathcal{P}_K(x) + \mathcal{P}_{K^\perp}(x)$$
Proof of Fact 6.3

Let $u = \text{prox}_f(x)$; then from the optimality condition we know that $x - u \in \partial f(u)$. This together with the conjugate subgradient theorem (homework) yields

$$u \in \partial f^*(x - u)$$

In view of the optimality condition, this means $x - u = \text{prox}_{f^*}(x)$, and hence

$$x = u + (x - u) = \text{prox}_f(x) + \text{prox}_{f^*}(x)$$
Example: prox of support function

For any closed and convex set $C$, the support function of $C$ is defined as $S_C(x) := \sup_{z \in C} \langle x, z \rangle$. Then

$$\text{prox}_{S_C}(x) = x - \mathcal{P}_C(x) \tag{6.3}$$

Proof: First of all, it is easy to verify that (exercise)

$$S_C^*(x) = \mathbb{1}_C(x)$$

Then Moreau decomposition gives

$$\text{prox}_{S_C}(x) = x - \text{prox}_{S_C^*}(x) = x - \text{prox}_{\mathbb{1}_C}(x) = x - \mathcal{P}_C(x)$$
Example: ℓ∞ norm

$$\text{prox}_{\|\cdot\|_\infty}(x) = x - \mathcal{P}_{B_1}(x)$$

where $B_1 := \{ z : \|z\|_1 \le 1 \}$ is the unit ℓ1 ball.

Remark: projection onto the ℓ1 ball can be computed efficiently.

Proof: Since $\|x\|_\infty = \sup_{z:\, \|z\|_1 \le 1} \langle x, z \rangle = S_{B_1}(x)$, we can invoke (6.3) to arrive at

$$\text{prox}_{\|\cdot\|_\infty}(x) = \text{prox}_{S_{B_1}}(x) = x - \mathcal{P}_{B_1}(x)$$
Example: max function

Let $g(x) = \max\{x_1, \ldots, x_n\}$; then

$$\text{prox}_g(x) = x - \mathcal{P}_\Delta(x)$$

where $\Delta := \{ z \in \mathbb{R}^n_+ : \mathbf{1}^\top z = 1 \}$ is the probability simplex.

Remark: projection onto $\Delta$ can be computed efficiently.

Proof: Since $g(x) = \max\{x_1, \ldots, x_n\} = S_\Delta(x)$ (the support function of $\Delta$), we can invoke (6.3) to reach $\text{prox}_g(x) = x - \mathcal{P}_\Delta(x)$.
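The remark about efficient simplex projection can be made concrete with the standard sort-based algorithm (a sketch; the helper names are illustrative). Combined with the identity above, it also yields the prox of the max function:

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto the probability simplex (sort-based, O(n log n))."""
    n = len(x)
    u = np.sort(x)[::-1]                  # entries sorted in decreasing order
    css = np.cumsum(u)
    # largest index j with u_j + (1 - sum_{i<=j} u_i)/(j+1) > 0
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

def prox_max(x):
    """prox of g(x) = max_i x_i, via prox_g(x) = x - P_Delta(x)."""
    return x - project_simplex(x)
```

For instance, `project_simplex(np.array([2.0, 0.0]))` returns `[1.0, 0.0]`, and any input is mapped to a nonnegative vector summing to one.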
Extended Moreau decomposition

A useful extension (homework):

Fact 6.4

Suppose $f$ is closed and convex, and $\lambda > 0$. Then

$$x = \text{prox}_{\lambda f}(x) + \lambda\, \text{prox}_{\lambda^{-1} f^*}(x / \lambda)$$
Convergence analysis
Cost monotonicity

The objective value is non-increasing in $t$:

Lemma 6.5

Suppose $f$ is convex and $L$-smooth. If $\eta_t \equiv 1/L$, then

$$F(x^{t+1}) \le F(x^t)$$

- different from subgradient methods (for which the objective value might be non-monotonic in $t$)
- the constant stepsize rule is recommended when $f$ is convex and smooth
Proof of cost monotonicity

Main pillar: a fundamental inequality

Lemma 6.6

Let $y^+ = \text{prox}_{\frac{1}{L} h}\big( y - \tfrac{1}{L} \nabla f(y) \big)$. Then

$$F(y^+) \le F(x) + \frac{L}{2}\|x - y\|^2 - \frac{L}{2}\|x - y^+\|^2 - \underbrace{g(x, y)}_{\ge 0 \text{ by convexity}}$$

where $g(x, y) := f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle$.

Take $x = y = x^t$ (and hence $y^+ = x^{t+1}$) to complete the proof.
Monotonicity in estimation error

Proximal gradient iterates are not only monotonic w.r.t. the cost, but also monotonic in estimation error:

Lemma 6.7

Suppose $f$ is convex and $L$-smooth. If $\eta_t \equiv 1/L$, then

$$\|x^{t+1} - x^*\| \le \|x^t - x^*\|$$

Proof: From Lemma 6.6, taking $x = x^*$, $y = x^t$ (and hence $y^+ = x^{t+1}$) yields

$$\underbrace{F(x^{t+1}) - F(x^*)}_{\ge 0} + \underbrace{g(x^*, x^t)}_{\ge 0} \le \frac{L}{2}\|x^* - x^t\|^2 - \frac{L}{2}\|x^* - x^{t+1}\|^2$$

which immediately concludes the proof.
Proof of Lemma 6.6

Define

$$\phi(z) = f(y) + \langle \nabla f(y),\, z - y \rangle + \frac{L}{2}\|z - y\|^2 + h(z)$$

It is easily seen that $y^+ = \arg\min_z \phi(z)$. Two important properties:

1. Since $\phi(z)$ is $L$-strongly convex, one has

$$\phi(x) \ge \phi(y^+) + \frac{L}{2}\|x - y^+\|^2$$

Remark: we are propagating smoothness of $f$ to strong convexity of another function $\phi$.

2. From smoothness,

$$\phi(y^+) = \underbrace{f(y) + \langle \nabla f(y),\, y^+ - y \rangle + \frac{L}{2}\|y^+ - y\|^2}_{\text{upper bound on } f(y^+)} + h(y^+) \ge f(y^+) + h(y^+) = F(y^+)$$
Proof of Lemma 6.6 (cont.)

Taken collectively, these yield $\phi(x) \ge F(y^+) + \frac{L}{2}\|x - y^+\|^2$, which together with the definition of $\phi(x)$ gives

$$\underbrace{f(y) + \langle \nabla f(y),\, x - y \rangle}_{= f(x) - g(x, y)} + h(x) + \frac{L}{2}\|x - y\|^2 \ \ge\ F(y^+) + \frac{L}{2}\|x - y^+\|^2$$

Since the left-hand side equals $F(x) - g(x, y) + \frac{L}{2}\|x - y\|^2$, this finishes the proof.
Convergence for convex problems

Theorem 6.8 (Convergence of proximal gradient methods for convex problems)

Suppose $f$ is convex and $L$-smooth. If $\eta_t \equiv 1/L$, then

$$F(x^t) - F^{\mathrm{opt}} \le \frac{L \|x^0 - x^*\|^2}{2t}$$

- achieves better iteration complexity (i.e. $O(1/\varepsilon)$) than the subgradient method (i.e. $O(1/\varepsilon^2)$)
- fast if prox can be efficiently implemented
Proof of Theorem 6.8

With Lemma 6.6 in mind, set $x = x^*$, $y = x^k$ to obtain

$$F(x^{k+1}) - F(x^*) \le \frac{L}{2}\|x^k - x^*\|^2 - \frac{L}{2}\|x^{k+1} - x^*\|^2 - \underbrace{g(x^*, x^k)}_{\ge 0 \text{ by convexity}} \le \frac{L}{2}\|x^k - x^*\|^2 - \frac{L}{2}\|x^{k+1} - x^*\|^2$$

Apply this recursively and add up all the inequalities to get

$$\sum_{k=0}^{t-1} \Big( F(x^{k+1}) - F(x^*) \Big) \le \frac{L}{2}\|x^0 - x^*\|^2 - \frac{L}{2}\|x^t - x^*\|^2 \le \frac{L}{2}\|x^0 - x^*\|^2$$

This combined with the monotonicity of $F(x^t)$ (cf. Lemma 6.5) yields

$$F(x^t) - F(x^*) \le \frac{L\|x^0 - x^*\|^2}{2t}$$
Convergence for strongly convex problems

Theorem 6.9 (Convergence of proximal gradient methods for strongly convex problems)

Suppose $f$ is $\mu$-strongly convex and $L$-smooth. If $\eta_t \equiv 1/L$, then

$$\|x^t - x^*\|^2 \le \Big(1 - \frac{\mu}{L}\Big)^t \|x^0 - x^*\|^2$$

- linear convergence: attains $\varepsilon$ accuracy within $O\big(\log \tfrac{1}{\varepsilon}\big)$ iterations
Proof of Theorem 6.9

Taking $x = x^*$, $y = x^t$ (and hence $y^+ = x^{t+1}$) in Lemma 6.6 gives

$$\begin{aligned}
F(x^{t+1}) - F(x^*) &\le \frac{L}{2}\|x^* - x^t\|^2 - \frac{L}{2}\|x^* - x^{t+1}\|^2 - \underbrace{g(x^*, x^t)}_{\ge \frac{\mu}{2}\|x^* - x^t\|^2 \text{ by strong convexity}} \\
&\le \frac{L - \mu}{2}\|x^t - x^*\|^2 - \frac{L}{2}\|x^{t+1} - x^*\|^2
\end{aligned}$$

This taken collectively with $F(x^{t+1}) - F(x^*) \ge 0$ yields

$$\|x^{t+1} - x^*\|^2 \le \Big(1 - \frac{\mu}{L}\Big)\|x^t - x^*\|^2$$

Applying this recursively concludes the proof.
Numerical example: LASSO (taken from UCLA EE236C)

$$\text{minimize}_x \quad f(x) = \frac{1}{2}\|Ax - b\|^2 + \|x\|_1$$

with a randomly generated i.i.d. Gaussian $A$; stepsize $\eta_t = 1/L$ with $L = \lambda_{\max}(A^\top A)$.

[Figure: relative objective error $(f(x^{(k)}) - f^{\mathrm{opt}})/f^{\mathrm{opt}}$ versus iteration $k$, on a semilog scale over roughly 100 iterations.]
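The LASSO experiment above can be reproduced in miniature with the sketch below; the dimensions and iteration count here are illustrative stand-ins for the full-size instance on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 60                      # small stand-in dimensions
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

L = np.linalg.eigvalsh(A.T @ A).max()   # L = lambda_max(A^T A)
eta = 1.0 / L

def soft(v, t):
    """prox of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def F(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2 + np.linalg.norm(x, 1)

x = np.zeros(n)
vals = [F(x)]
for _ in range(300):
    x = soft(x - eta * (A.T @ (A @ x - b)), eta)   # proximal gradient step
    vals.append(F(x))
# vals decreases monotonically, consistent with Lemma 6.5
```

Plotting `vals` (minus the limiting value) against the iteration index reproduces the qualitative decay shown in the figure.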
Backtracking line search

Recall that for the unconstrained case, backtracking line search is based on the sufficient decrease criterion

$$f\big( x^t - \eta \nabla f(x^t) \big) \le f(x^t) - \frac{\eta}{2}\|\nabla f(x^t)\|^2$$

As a result, this is equivalent to updating $\eta_t = 1/L_t$ until

$$\begin{aligned}
f\big( x^t - \eta_t \nabla f(x^t) \big) &\le f(x^t) - \frac{1}{L_t}\big\langle \nabla f(x^t),\, \nabla f(x^t) \big\rangle + \frac{1}{2 L_t}\|\nabla f(x^t)\|^2 \\
&= f(x^t) - \big\langle \nabla f(x^t),\, x^t - x^{t+1} \big\rangle + \frac{L_t}{2}\|x^t - x^{t+1}\|^2
\end{aligned}$$
Backtracking line search

Let $T_L(x) := \text{prox}_{\frac{1}{L} h}\big( x - \tfrac{1}{L} \nabla f(x) \big)$:

Algorithm 6.2 Backtracking line search for proximal gradient methods
1: Initialize $\eta = 1$, $0 < \alpha \le 1/2$, $0 < \beta < 1$
2: while $f\big( T_{L_t}(x^t) \big) > f(x^t) - \big\langle \nabla f(x^t),\, x^t - T_{L_t}(x^t) \big\rangle + \frac{L_t}{2}\big\|T_{L_t}(x^t) - x^t\big\|^2$ do
3:   $L_t \leftarrow L_t / \beta$ (i.e. $\eta_t \leftarrow \beta \eta_t$)

Here, $L_t$ corresponds to $1/\eta_t$, and $T_{L_t}(x^t)$ generalizes $x^{t+1}$.
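The backtracking scheme above can be sketched as follows (a minimal illustration; the acceptance test mirrors the sufficient decrease condition, and `prox_h(v, s)` returning $\text{prox}_{sh}(v)$ is an assumed oracle):

```python
import numpy as np

def prox_grad_backtracking(f, grad_f, prox_h, x0, L0=1.0, beta=0.5, num_iters=100):
    """Proximal gradient with backtracking on L (stepsize eta = 1/L).

    Each iteration increases L (i.e. shrinks eta) until the sufficient
    decrease condition holds, then takes the accepted prox step.
    """
    x, L = x0, L0
    for _ in range(num_iters):
        g = grad_f(x)
        while True:
            z = prox_h(x - g / L, 1.0 / L)        # z = T_L(x)
            d = z - x
            # accept iff f(z) <= f(x) + <grad_f(x), z - x> + (L/2)||z - x||^2
            if f(z) <= f(x) + g @ d + 0.5 * L * (d @ d):
                break
            L = L / beta                          # beta in (0,1): backtrack
        x = z
    return x

# Example: f(x) = 0.5*||x - c||^2 (true L = 1), h = ||.||_1;
# the minimizer of f + h is the soft-threshold of c at level 1.
c = np.array([3.0, -0.2])
soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0)
x_hat = prox_grad_backtracking(lambda x: 0.5 * np.sum((x - c) ** 2),
                               lambda x: x - c, soft,
                               np.zeros(2), L0=0.1)
```

Starting from the deliberately small estimate `L0 = 0.1`, the line search grows $L$ until the condition holds, and the iterates converge to the soft-threshold of $c$.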
Summary: proximal gradient methods

convex & smooth (w.r.t. $f$) problems: stepsize rule $\eta_t = 1/L$; convergence rate $O(1/t)$; iteration complexity $O(1/\varepsilon)$

strongly convex & smooth (w.r.t. $f$) problems: stepsize rule $\eta_t = 1/L$; convergence rate $O\big((1 - 1/\kappa)^t\big)$; iteration complexity $O\big(\kappa \log \tfrac{1}{\varepsilon}\big)$
Reference

[1] "Proximal algorithms," N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[2] "First-order methods in optimization," A. Beck, Vol. 25, SIAM, 2017.
[3] "Convex optimization algorithms," D. Bertsekas, 2015.
[4] "Convex optimization: algorithms and complexity," S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[5] "Mathematical optimization," MATH301 lecture notes, E. Candes, Stanford.
[6] "Optimization methods for large-scale systems," EE236C lecture notes, L. Vandenberghe, UCLA.