Proximal gradient methods


ELE 538B: Large-Scale Optimization for Data Science Proximal gradient methods Yuxin Chen Princeton University, Spring 2018

Outline

- Proximal gradient descent for composite functions
- Proximal mapping / operator
- Convergence analysis

Proximal gradient descent for composite functions

Composite models

    minimize_x  F(x) := f(x) + h(x)
    subject to  x ∈ R^n

- f: convex and smooth
- h: convex (may not be differentiable)

Let F^opt := min_x F(x) be the optimal cost.

Examples

ℓ1-regularized minimization:

    minimize_x  f(x) + ‖x‖_1        (h(x): ℓ1 norm)

use ℓ1 regularization to promote sparsity.

Nuclear norm regularized minimization:

    minimize_X  f(X) + ‖X‖_*        (h(X): nuclear norm)

use nuclear norm regularization to promote low-rank structure.

A proximal view of gradient descent

To motivate proximal gradient methods, we first revisit gradient descent

    x^{t+1} = x^t − η_t ∇f(x^t)

which can be rewritten as

    x^{t+1} = arg min_x { f(x^t) + ⟨∇f(x^t), x − x^t⟩        [first-order approximation]
                          + (1/(2η_t)) ‖x − x^t‖_2^2 }        [proximal term]

A proximal view of gradient descent

[Figure: f together with the surrogate f(x^t) + ⟨∇f(x^t), x − x^t⟩ + (1/(2η_t))‖x − x^t‖_2^2 minimized at x^{t+1}]

    x^{t+1} = arg min_x { f(x^t) + ⟨∇f(x^t), x − x^t⟩ + (1/(2η_t)) ‖x − x^t‖_2^2 }

By the optimality condition, x^{t+1} is the point where f(x^t) + ⟨∇f(x^t), x − x^t⟩ and −(1/(2η_t))‖x − x^t‖_2^2 + c have the same slope.

How about projected gradient descent?

    x^{t+1} = P_C( x^t − η_t ∇f(x^t) )
            = arg min_x { f(x^t) + ⟨∇f(x^t), x − x^t⟩ + (1/(2η_t)) ‖x − x^t‖_2^2 + 1_C(x) }
            = arg min_x { (1/2) ‖x − (x^t − η_t ∇f(x^t))‖_2^2 + η_t 1_C(x) }        (6.1)

where 1_C(x) := 0 if x ∈ C, and ∞ else.

Proximal operator

For any convex function h, define the proximal operator

    prox_h(x) := arg min_z { (1/2) ‖z − x‖_2^2 + h(z) }

This allows one to express the projected GD update (6.1) as

    x^{t+1} = prox_{η_t 1_C}( x^t − η_t ∇f(x^t) )        (6.2)

Proximal gradient methods

One can generalize (6.2) to accommodate more general h:

Algorithm 6.1 Proximal gradient algorithm
1: for t = 0, 1, ... do
2:     x^{t+1} = prox_{η_t h}( x^t − η_t ∇f(x^t) )

- alternates between gradient updates on f and proximal minimization on h
- useful if prox_h is inexpensive
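The update can be sketched in a few lines of Python (our own illustration; the helper names and the toy problem below are not from the slides), assuming the user supplies ∇f and the prox of h:

```python
import numpy as np

def proximal_gradient(grad_f, prox_h, x0, eta, num_iters=100):
    """Run x^{t+1} = prox_{eta*h}(x^t - eta * grad_f(x^t)) for num_iters steps.

    grad_f : gradient of the smooth part f
    prox_h : map (v, s) -> prox_{s*h}(v) for the nonsmooth part h
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_h(x - eta * grad_f(x), eta)
    return x

# Toy composite problem: minimize (1/2)||x - b||_2^2 + ||x||_1,
# whose prox step is entry-wise soft-thresholding.
b = np.array([3.0, 0.5, -2.0])
soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0)
x_hat = proximal_gradient(lambda x: x - b, soft, np.zeros(3), eta=1.0)
# minimizer is soft-thresholding of b at level 1, i.e. [2.0, 0.0, -1.0]
```

Since f here is 1-smooth, η = 1/L = 1 and the iteration lands on the minimizer after a single gradient-plus-prox step.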

Proximal mapping / operator

Why consider proximal operators?

    prox_h(x) := arg min_z { (1/2) ‖z − x‖_2^2 + h(z) }

- well-defined under very general conditions (including nonsmooth convex functions)
- can be evaluated efficiently for many widely used functions (in particular, regularizers)
- this abstraction is conceptually and mathematically simple, and covers many well-known optimization algorithms

Example: indicator functions

If h = 1_C is the indicator function of a closed convex set C, i.e.

    h(x) = 0 if x ∈ C, and ∞ else,

then

    prox_h(x) = arg min_{z ∈ C} ‖z − x‖_2^2 = P_C(x)        (Euclidean projection)

Example: ℓ1 norm

If h(x) = ‖x‖_1, then

    (prox_{λh}(x))_i = ψ_st(x_i; λ)        (soft-thresholding)

where ψ_st(·; λ) is applied in an entry-wise manner:

    ψ_st(x; λ) = x − λ  if x > λ,    x + λ  if x < −λ,    0  else
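A minimal numpy sketch (ours, not the slides'): implement ψ_st and sanity-check it against the scalar prox definition by brute-force grid search.

```python
import numpy as np

def soft_threshold(x, lam):
    """Entry-wise soft-thresholding: the prox of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# brute-force check of the 1-D prox definition: argmin_z (1/2)(z - x)^2 + lam*|z|
lam, x0 = 0.7, 1.9
z = np.linspace(-5.0, 5.0, 200001)           # fine grid with spacing 5e-5
z_star = z[np.argmin(0.5 * (z - x0) ** 2 + lam * np.abs(z))]
# z_star should match soft_threshold(x0, lam) = x0 - lam = 1.2
```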

Basic rules

- scaling: if f(x) = a g(x) + b with a > 0, then prox_f(x) = prox_{ag}(x)
- affine addition: if f(x) = g(x) + ⟨a, x⟩ + b, then prox_f(x) = prox_g(x − a)

Basic rules

- quadratic addition: if f(x) = g(x) + (ρ/2) ‖x − a‖_2^2, then

      prox_f(x) = prox_{(1/(1+ρ)) g}( (1/(1+ρ)) x + (ρ/(1+ρ)) a )

- scaling and translation: if f(x) = g(ax + b) with a ≠ 0, then

      prox_f(x) = (1/a) ( prox_{a^2 g}(ax + b) − b )        (homework)
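These calculus rules are easy to check numerically. As an illustration (ours, taking g = |·| in one dimension so that prox_g is soft-thresholding), the quadratic-addition rule agrees with a brute-force minimization:

```python
import numpy as np

def soft(v, lam):                             # prox of lam*|.|
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

rho, a, x = 3.0, 1.0, 2.0
# rule: for f(z) = |z| + (rho/2)(z - a)^2,
# prox_f(x) = prox_{g/(1+rho)}((x + rho*a)/(1+rho)) with g = |.|
pred = soft((x + rho * a) / (1.0 + rho), 1.0 / (1.0 + rho))

# brute-force prox: argmin_z (1/2)(z - x)^2 + |z| + (rho/2)(z - a)^2
z = np.linspace(-4.0, 4.0, 800001)            # grid with spacing 1e-5
brute = z[np.argmin(0.5 * (z - x) ** 2 + np.abs(z) + 0.5 * rho * (z - a) ** 2)]
# both should equal 1.0 for this instance
```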

Proof for quadratic addition

    prox_f(x) = arg min_z { (1/2) ‖z − x‖_2^2 + g(z) + (ρ/2) ‖z − a‖_2^2 }
              = arg min_z { ((1+ρ)/2) ‖z‖_2^2 − ⟨z, x + ρa⟩ + g(z) }
              = arg min_z { (1/2) ‖z‖_2^2 − (1/(1+ρ)) ⟨z, x + ρa⟩ + (1/(1+ρ)) g(z) }
              = arg min_z { (1/2) ‖z − ( (1/(1+ρ)) x + (ρ/(1+ρ)) a )‖_2^2 + (1/(1+ρ)) g(z) }
              = prox_{(1/(1+ρ)) g}( (1/(1+ρ)) x + (ρ/(1+ρ)) a )

(constants independent of z are dropped along the way)

Basic rules

- orthogonal mapping: if f(x) = g(Qx) with Q orthogonal (QQᵀ = QᵀQ = I), then

      prox_f(x) = Qᵀ prox_g(Qx)        (homework)

- orthogonal affine mapping: if f(x) = g(Qx + b) with QQᵀ = αI (this does not require QᵀQ = αI), then

      prox_f(x) = ( I − α⁻¹ QᵀQ ) x + α⁻¹ Qᵀ ( prox_{αg}(Qx + b) − b )

- for general Q, it is not easy to derive prox_f from prox_g

Basic rules

- norm composition: if f(x) = g(‖x‖_2) with dom(g) = [0, ∞), then

      prox_f(x) = ( prox_g(‖x‖_2) / ‖x‖_2 ) x,        x ≠ 0

Proof for norm composition

Observe that

    min_z { f(z) + (1/2) ‖z − x‖_2^2 }
      = min_z { g(‖z‖_2) + (1/2) ‖z‖_2^2 − ⟨z, x⟩ + (1/2) ‖x‖_2^2 }
      = min_{α ≥ 0} min_{z: ‖z‖_2 = α} { g(α) + (1/2) α^2 − ⟨z, x⟩ + (1/2) ‖x‖_2^2 }
      = min_{α ≥ 0} { g(α) + (1/2) α^2 − α ‖x‖_2 + (1/2) ‖x‖_2^2 }        (Cauchy–Schwarz)
      = min_{α ≥ 0} { g(α) + (1/2) (α − ‖x‖_2)^2 }

From the above calculation, we know the optimal point is α* = prox_g(‖x‖_2) and

    z* = α* x / ‖x‖_2 = ( prox_g(‖x‖_2) / ‖x‖_2 ) x,

thus concluding the proof.

Nonexpansiveness of proximal operators

Recall that when h(x) = 1_C(x), prox_h(x) is the Euclidean projection P_C onto C, which is nonexpansive for convex C:

    ‖P_C(x_1) − P_C(x_2)‖_2 ≤ ‖x_1 − x_2‖_2

Nonexpansiveness of proximal operators

Nonexpansiveness is a property of the general prox_h(·):

Fact 6.1 (Nonexpansiveness)

    ‖prox_h(x_1) − prox_h(x_2)‖_2 ≤ ‖x_1 − x_2‖_2

In some sense, the proximal operator behaves like a projection.

Proof of Fact 6.1

Let z_1 = prox_h(x_1) and z_2 = prox_h(x_2). The subgradient characterizations of z_1 and z_2 read

    x_1 − z_1 ∈ ∂h(z_1)   and   x_2 − z_2 ∈ ∂h(z_2)

The nonexpansiveness claim ‖z_1 − z_2‖_2 ≤ ‖x_1 − x_2‖_2 would follow (together with Cauchy–Schwarz) from

    (x_1 − x_2)ᵀ(z_1 − z_2) ≥ ‖z_1 − z_2‖_2^2        (firm nonexpansiveness)

⟺  (x_1 − z_1 − x_2 + z_2)ᵀ(z_1 − z_2) ≥ 0

To see this, add the two subgradient inequalities

    h(z_2) ≥ h(z_1) + ⟨x_1 − z_1, z_2 − z_1⟩
    h(z_1) ≥ h(z_2) + ⟨x_2 − z_2, z_1 − z_2⟩

Resolvent of subdifferential operator

One can interpret prox via the resolvent of the subdifferential operator.

Fact 6.2

    z = prox_f(x)   ⟺   z = (I + ∂f)^{−1}(x)

where (I + ∂f)^{−1} is the resolvent of the operator ∂f and I is the identity mapping.

Justification of Fact 6.2

    z = arg min_u { f(u) + (1/2) ‖u − x‖_2^2 }
    ⟺  0 ∈ ∂f(z) + z − x        (optimality condition)
    ⟺  x ∈ (I + ∂f)(z)
    ⟺  z = (I + ∂f)^{−1}(x)

Moreau decomposition

Fact 6.3

Suppose f is closed and convex, and f*(x) := sup_z { ⟨x, z⟩ − f(z) } is the convex conjugate of f. Then

    x = prox_f(x) + prox_{f*}(x)

- key relationship between proximal mapping and duality
- generalization of orthogonal decomposition
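A quick numerical illustration (ours): for f = ‖·‖_1, the conjugate f* is the indicator of the unit ℓ∞ ball, so prox_f is soft-thresholding at level 1 while prox_{f*} is entry-wise clipping to [−1, 1], and the two pieces recompose x exactly.

```python
import numpy as np

x = np.array([2.5, -0.3, 1.0, -4.0])
prox_f = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)   # prox of ||.||_1
prox_fstar = np.clip(x, -1.0, 1.0)                        # projection onto unit l_inf ball
recomposed = prox_f + prox_fstar
# Moreau decomposition: recomposed equals x entry-wise
```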

Moreau decomposition for convex cones

When K is a closed convex cone, (1_K)*(x) = 1_{K°}(x) (exercise), with

    K° := { x : ⟨x, z⟩ ≤ 0 for all z ∈ K }

the polar cone of K. This gives

    x = P_K(x) + P_{K°}(x)

A special case: if K is a subspace, then K° = K^⊥, and hence

    x = P_K(x) + P_{K^⊥}(x)

Proof of Fact 6.3

Let u = prox_f(x); then from the optimality condition we know that x − u ∈ ∂f(u). This together with the conjugate subgradient theorem (homework) yields

    u ∈ ∂f*(x − u)

In view of the optimality condition, this means x − u = prox_{f*}(x), and hence

    x = u + (x − u) = prox_f(x) + prox_{f*}(x)

Example: prox of support function

For any closed and convex set C, the support function S_C is defined as

    S_C(x) = sup_{z ∈ C} ⟨x, z⟩

Then

    prox_{S_C}(x) = x − P_C(x)        (6.3)

Proof: First of all, it is easy to verify that (exercise)

    S_C*(x) = 1_C(x)

Then the Moreau decomposition gives

    prox_{S_C}(x) = x − prox_{S_C*}(x) = x − prox_{1_C}(x) = x − P_C(x)

Example: ℓ∞ norm

    prox_{‖·‖_∞}(x) = x − P_{B_1}(x)

where B_1 := { z : ‖z‖_1 ≤ 1 } is the unit ℓ1 ball.

Remark: projection onto the ℓ1 ball can be computed efficiently.

Proof: Since ‖x‖_∞ = sup_{z: ‖z‖_1 ≤ 1} ⟨z, x⟩ = S_{B_1}(x), we can invoke (6.3) to arrive at

    prox_{‖·‖_∞}(x) = prox_{S_{B_1}}(x) = x − P_{B_1}(x)

Example: max function

Let g(x) = max{x_1, ..., x_n}. Then

    prox_g(x) = x − P_Δ(x)

where Δ := { z ∈ R^n_+ : 1ᵀz = 1 } is the probability simplex.

Remark: projection onto Δ can be computed efficiently.

Proof: Since g(x) = max{x_1, ..., x_n} = S_Δ(x) (the support function of Δ), we can invoke (6.3) to reach prox_g(x) = x − P_Δ(x).
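The projection onto the probability simplex can be computed in O(n log n) by the classic sort-and-threshold method; a sketch (ours, not from the slides) is below, which then yields the prox of the max function by subtracting the projection from x.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto the probability simplex {z >= 0, sum(z) = 1}."""
    u = np.sort(x)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, x.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # largest index with a positive gap
    theta = css[rho] / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

x = np.array([0.9, 0.4, -0.2])
p = project_simplex(x)                        # = [0.75, 0.25, 0.0]
prox_max = x - p                              # prox of g(x) = max_i x_i
```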

Extended Moreau decomposition

A useful extension (homework):

Fact 6.4

Suppose f is closed and convex and λ > 0. Then

    x = prox_{λf}(x) + λ prox_{λ^{−1} f*}(x/λ)

Convergence analysis

Cost monotonicity

The objective value is non-increasing in t:

Lemma 6.5

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

    F(x^{t+1}) ≤ F(x^t)

- different from subgradient methods (for which the objective value might be non-monotonic in t)
- the constant stepsize rule is recommended when f is convex and smooth

Proof of cost monotonicity

Main pillar: a fundamental inequality

Lemma 6.6

Let y^+ = prox_{(1/L) h}( y − (1/L) ∇f(y) ). Then

    F(y^+) − F(x) ≤ (L/2) ‖x − y‖_2^2 − (L/2) ‖x − y^+‖_2^2 − g(x, y)

where g(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩ ≥ 0 by convexity.

Take x = y = x^t (and hence y^+ = x^{t+1}) to complete the proof.

Monotonicity in estimation error

Proximal gradient iterates are not only monotonic w.r.t. the cost, but also monotonic in estimation error:

Lemma 6.7

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

    ‖x^{t+1} − x*‖_2 ≤ ‖x^t − x*‖_2

Proof: From Lemma 6.6, taking x = x*, y = x^t (and hence y^+ = x^{t+1}) yields

    0 ≤ F(x^{t+1}) − F(x*) + g(x*, x^t) ≤ (L/2) ‖x* − x^t‖_2^2 − (L/2) ‖x* − x^{t+1}‖_2^2

(both F(x^{t+1}) − F(x*) ≥ 0 and g(x*, x^t) ≥ 0), which immediately concludes the proof.

Proof of Lemma 6.6

Define

    φ(z) := f(y) + ⟨∇f(y), z − y⟩ + (L/2) ‖z − y‖_2^2 + h(z)

It is easily seen that y^+ = arg min_z φ(z). Two important properties:

1. Since φ(z) is L-strongly convex, one has

       φ(x) ≥ φ(y^+) + (L/2) ‖x − y^+‖_2^2

   Remark: we are propagating smoothness of f to strong convexity of another function φ.

2. From smoothness, f(y) + ⟨∇f(y), y^+ − y⟩ + (L/2) ‖y^+ − y‖_2^2 is an upper bound on f(y^+), so

       φ(y^+) = f(y) + ⟨∇f(y), y^+ − y⟩ + (L/2) ‖y^+ − y‖_2^2 + h(y^+) ≥ f(y^+) + h(y^+) = F(y^+)

Proof of Lemma 6.6 (cont.)

Taken collectively, these yield

    φ(x) ≥ F(y^+) + (L/2) ‖x − y^+‖_2^2

which together with the definition of φ(x) gives

    f(y) + ⟨∇f(y), x − y⟩ + h(x) + (L/2) ‖x − y‖_2^2 ≥ F(y^+) + (L/2) ‖x − y^+‖_2^2

Since f(y) + ⟨∇f(y), x − y⟩ = f(x) − g(x, y), the left-hand side equals F(x) − g(x, y) + (L/2) ‖x − y‖_2^2, which finishes the proof.

Convergence for convex problems

Theorem 6.8 (Convergence of proximal gradient methods for convex problems)

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

    F(x^t) − F^opt ≤ L ‖x^0 − x*‖_2^2 / (2t)

- achieves better iteration complexity (i.e. O(1/ε)) than the subgradient method (i.e. O(1/ε^2))
- fast if prox can be efficiently implemented

Proof of Theorem 6.8

With Lemma 6.6 in mind, set x = x*, y = x^t to obtain

    F(x^{t+1}) − F(x*) ≤ (L/2) ‖x^t − x*‖_2^2 − (L/2) ‖x^{t+1} − x*‖_2^2 − g(x*, x^t)
                       ≤ (L/2) ‖x^t − x*‖_2^2 − (L/2) ‖x^{t+1} − x*‖_2^2        (g ≥ 0 by convexity)

Apply this recursively and add up all the inequalities to get

    Σ_{k=0}^{t−1} ( F(x^{k+1}) − F(x*) ) ≤ (L/2) ‖x^0 − x*‖_2^2 − (L/2) ‖x^t − x*‖_2^2

This combined with the monotonicity of F(x^t) (cf. Lemma 6.5) yields

    F(x^t) − F(x*) ≤ L ‖x^0 − x*‖_2^2 / (2t)

Convergence for strongly convex problems

Theorem 6.9 (Convergence of proximal gradient methods for strongly convex problems)

Suppose f is µ-strongly convex and L-smooth. If η_t ≡ 1/L, then

    ‖x^t − x*‖_2^2 ≤ (1 − µ/L)^t ‖x^0 − x*‖_2^2

- linear convergence: attains ε accuracy within O(log(1/ε)) iterations

Proof of Theorem 6.9

Taking x = x*, y = x^t (and hence y^+ = x^{t+1}) in Lemma 6.6 gives

    F(x^{t+1}) − F(x*) ≤ (L/2) ‖x* − x^t‖_2^2 − (L/2) ‖x* − x^{t+1}‖_2^2 − g(x*, x^t)
                       ≤ ((L − µ)/2) ‖x^t − x*‖_2^2 − (L/2) ‖x^{t+1} − x*‖_2^2

using g(x*, x^t) ≥ (µ/2) ‖x* − x^t‖_2^2 by strong convexity. This taken collectively with F(x^{t+1}) − F(x*) ≥ 0 yields

    ‖x^{t+1} − x*‖_2^2 ≤ (1 − µ/L) ‖x^t − x*‖_2^2

Applying this recursively concludes the proof.

Numerical example: LASSO (taken from UCLA EE236C)

    minimize_x  f(x) = (1/2) ‖Ax − b‖_2^2 + ‖x‖_1

with i.i.d. Gaussian A; stepsize η_t = 1/L with L = λ_max(AᵀA)
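The experiment can be reproduced in miniature with a few lines of numpy (our own small random instance, not the one from the slides):

```python
import numpy as np

# Small LASSO instance: minimize (1/2)||Ax - b||_2^2 + lam*||x||_1 via proximal gradient
rng = np.random.default_rng(0)
m, n, lam = 40, 100, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

L = np.linalg.eigvalsh(A.T @ A).max()         # smoothness constant: lambda_max(A^T A)
eta = 1.0 / L
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

x = np.zeros(n)
history = [F(x)]
for _ in range(300):
    x = x - eta * (A.T @ (A @ x - b))                        # gradient step on f
    x = np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)  # prox step: soft-thresholding
    history.append(F(x))
# with eta = 1/L the objective is non-increasing, as cost monotonicity predicts
```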

[Figure: relative error (f(x^(k)) − f^opt)/f^opt versus iteration k for proximal gradient on the randomly generated LASSO instance, with step η_t = 1/L, L = λ_max(AᵀA)]

Backtracking line search

Recall that for the unconstrained case, backtracking line search is based on the sufficient decrease criterion

    f( x^t − η ∇f(x^t) ) ≤ f(x^t) − (η/2) ‖∇f(x^t)‖_2^2

As a result, this is equivalent to updating η_t = 1/L_t until

    f( x^t − η_t ∇f(x^t) ) ≤ f(x^t) − (1/L_t) ⟨∇f(x^t), ∇f(x^t)⟩ + (1/(2L_t)) ‖∇f(x^t)‖_2^2
                           = f(x^t) − ⟨∇f(x^t), x^t − x^{t+1}⟩ + (L_t/2) ‖x^t − x^{t+1}‖_2^2

Backtracking line search

Let T_L(x) := prox_{(1/L) h}( x − (1/L) ∇f(x) ).

Algorithm 6.2 Backtracking line search for proximal gradient methods
1: Initialize η = 1, 0 < α ≤ 1/2, 0 < β < 1
2: while f( T_{L_t}(x^t) ) > f(x^t) − ⟨∇f(x^t), x^t − T_{L_t}(x^t)⟩ + (L_t/2) ‖T_{L_t}(x^t) − x^t‖_2^2 do
3:     L_t ← L_t / β

Here, L_t corresponds to 1/η_t, and T_{L_t}(x^t) generalizes x^{t+1}.
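A sketch of the backtracking variant in Python (ours; the function names and toy problem are illustrative), where the smoothness estimate is inflated by a factor 1/β until the sufficient decrease condition holds:

```python
import numpy as np

def prox_grad_backtracking(f, grad_f, prox_h, x0, L0=1.0, beta=0.5, num_iters=50):
    """Proximal gradient with backtracking on the smoothness estimate L_t.

    prox_h : map (v, s) -> prox_{s*h}(v)
    """
    x, L = np.asarray(x0, dtype=float), L0
    for _ in range(num_iters):
        g = grad_f(x)
        while True:
            x_next = prox_h(x - g / L, 1.0 / L)
            d = x - x_next
            # sufficient decrease: f(T_L(x)) <= f(x) - <grad f(x), d> + (L/2)||d||^2
            if f(x_next) <= f(x) - g @ d + 0.5 * L * (d @ d):
                break
            L = L / beta                      # condition failed: increase L_t
        x = x_next
    return x

# toy problem: minimize (1/2)||x - b||_2^2 + ||x||_1, starting from a too-small L
b = np.array([3.0, -0.2])
x_hat = prox_grad_backtracking(
    f=lambda x: 0.5 * np.sum((x - b) ** 2),
    grad_f=lambda x: x - b,
    prox_h=lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0),
    x0=np.zeros(2), L0=0.1)
# converges to soft-thresholding of b at level 1, i.e. [2.0, 0.0]
```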

Summary: proximal gradient methods

problem class                        | stepsize rule | convergence rate | iteration complexity
convex & smooth (w.r.t. f)           | η_t = 1/L     | O(1/t)           | O(1/ε)
strongly convex & smooth (w.r.t. f)  | η_t = 1/L     | O((1 − 1/κ)^t)   | O(κ log(1/ε))

where κ := L/µ is the condition number.

Reference

[1] Proximal algorithms, N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[2] First-order methods in optimization, A. Beck, Vol. 25, SIAM, 2017.
[3] Convex optimization and algorithms, D. Bertsekas, 2015.
[4] Convex optimization: algorithms and complexity, S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[5] Mathematical optimization, MATH301 lecture notes, E. Candes, Stanford.
[6] Optimization methods for large-scale systems, EE236C lecture notes, L. Vandenberghe, UCLA.