Convex Optimization (EE227A: UC Berkeley)
Lecture 15 (Gradient methods III)
12 March, 2013
Suvrit Sra
Optimal gradient methods
Optimal gradient methods

We saw the following efficiency estimates for the gradient method:

$f \in C_L^1$: $\quad f(x^k) - f^* \le \dfrac{2L\|x^0 - x^*\|_2^2}{k+4}$

$f \in S_{L,\mu}^1$: $\quad f(x^k) - f^* \le \dfrac{L}{2}\left(\dfrac{L-\mu}{L+\mu}\right)^{2k} \|x^0 - x^*\|_2^2$

We also saw lower complexity bounds:

$f \in C_L^1$: $\quad f(x^k) - f(x^*) \ge \dfrac{3L\|x^0 - x^*\|_2^2}{32(k+1)^2}$

$f \in S_{L,\mu}$: $\quad f(x^k) - f(x^*) \ge \dfrac{\mu}{2}\left(\dfrac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^{2k} \|x^0 - x^*\|_2^2$

Can we close the gap?
Gradient with momentum

Polyak's method (aka heavy-ball) for $f \in S_{L,\mu}^1$:

$x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})$

Converges (locally, i.e., for $\|x^0 - x^*\|_2 \le \epsilon$) as

$\|x^k - x^*\|_2^2 \le \left(\dfrac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^{2k} \|x^0 - x^*\|_2^2,$

for $\alpha_k = \dfrac{4}{(\sqrt{L}+\sqrt{\mu})^2}$ and $\beta_k = \left(\dfrac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^2$.
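The heavy-ball iteration above can be sketched in a few lines of Python (not from the lecture; the quadratic test function and all names are my own, chosen only to illustrate the constant stepsizes):

```python
import numpy as np

def heavy_ball(grad, x0, L, mu, iters=200):
    """Polyak's heavy-ball method with the constant parameters from the slide:
    alpha = 4/(sqrt(L)+sqrt(mu))^2, beta = ((sqrt(L)-sqrt(mu))/(sqrt(L)+sqrt(mu)))^2."""
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(iters):
        # momentum step: gradient step plus a multiple of the previous displacement
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

# Strongly convex quadratic f(x) = 0.5 x^T A x with mu = 1, L = 10
A = np.diag([1.0, 10.0])
x = heavy_ball(lambda v: A @ v, np.array([5.0, 5.0]), L=10.0, mu=1.0)
```

On this quadratic the iterates contract toward the minimizer $x^* = 0$ at roughly the linear rate stated above.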
Nesterov's optimal gradient method

$\min_x f(x)$, where $f \in S_{L,\mu}^1$ with $\mu \ge 0$

1. Choose $x^0 \in \mathbb{R}^n$, $\alpha_0 \in (0,1)$
2. Let $y^0 \leftarrow x^0$; set $q = \mu/L$
3. $k$-th iteration ($k \ge 0$):
   a) Compute $f(y^k)$ and $\nabla f(y^k)$; update primary solution $x^{k+1} = y^k - \frac{1}{L}\nabla f(y^k)$
   b) Compute stepsize $\alpha_{k+1}$ by solving $\alpha_{k+1}^2 = (1-\alpha_{k+1})\alpha_k^2 + q\,\alpha_{k+1}$
   c) Set $\beta_k = \alpha_k(1-\alpha_k)/(\alpha_k^2 + \alpha_{k+1})$
   d) Update secondary solution $y^{k+1} = x^{k+1} + \beta_k(x^{k+1} - x^k)$
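Steps (a)-(d) translate directly into code. A minimal Python sketch (my own, not from the notes; step (b) is solved by taking the positive root of the quadratic $\alpha^2 + (\alpha_k^2 - q)\alpha - \alpha_k^2 = 0$):

```python
import numpy as np

def nesterov_optimal(grad, x0, L, mu, alpha0=0.5, iters=100):
    """Nesterov's optimal method: primary sequence x^k, secondary sequence y^k."""
    q = mu / L
    x = x0.copy()
    y = x0.copy()
    alpha = alpha0
    for _ in range(iters):
        x_next = y - grad(y) / L                              # step (a)
        # step (b): alpha_{k+1}^2 = (1 - alpha_{k+1}) alpha_k^2 + q alpha_{k+1}
        b = alpha ** 2 - q
        alpha_next = (-b + np.sqrt(b ** 2 + 4 * alpha ** 2)) / 2
        beta = alpha * (1 - alpha) / (alpha ** 2 + alpha_next)  # step (c)
        y = x_next + beta * (x_next - x)                      # step (d)
        x, alpha = x_next, alpha_next
    return x

# Same quadratic test: f(x) = 0.5 x^T A x, mu = 1, L = 10
A = np.diag([1.0, 10.0])
x = nesterov_optimal(lambda v: A @ v, np.array([5.0, 5.0]), L=10.0, mu=1.0)
```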
Optimal gradient method rate

Theorem. Let $\{x^k\}$ be the sequence generated by the above algorithm. If $\alpha_0 \ge \sqrt{\mu/L}$, then

$f(x^k) - f(x^*) \le c_1 \min\left\{ \left(1 - \sqrt{\tfrac{\mu}{L}}\right)^{k},\ \dfrac{4L}{(2\sqrt{L} + c_2 k)^2} \right\},$

where the constants $c_1, c_2$ depend on $\alpha_0$, $L$, $\mu$.

Proof: Somewhat involved; see notes.
Strongly convex case simplification

If $\mu > 0$, select $\alpha_0 = \sqrt{\mu/L}$. The two main steps

1. Set $\beta_k = \alpha_k(1-\alpha_k)/(\alpha_k^2 + \alpha_{k+1})$
2. $y^{k+1} = x^{k+1} + \beta_k(x^{k+1} - x^k)$

get simplified, since

$\alpha_k = \sqrt{\dfrac{\mu}{L}}, \qquad \beta_k = \dfrac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}, \qquad k \ge 0.$

The optimal method simplifies to:

1. Choose $y^0 = x^0 \in \mathbb{R}^n$
2. $k$-th iteration ($k \ge 0$):
   a) $x^{k+1} = y^k - \frac{1}{L}\nabla f(y^k)$
   b) $y^{k+1} = x^{k+1} + \beta(x^{k+1} - x^k)$

Notice the similarity to Polyak's method!
Summary so far

- Convex $f(x)$ with subgradients bounded by $G$: subgradient method
- Differentiable $f \in C_L^1$: gradient methods
- Rate of convergence for smooth convex problems
- Faster rate of convergence for smooth, strongly convex problems
- Constrained gradient methods: Frank-Wolfe method
- Constrained gradient methods: gradient projection
- Nesterov's optimal gradient methods (smooth)

Gap between lower and upper bounds: $O(1/\sqrt{t})$ for convex (subgradient method); $O(1/t^2)$ for $C_L^1$; linear for smooth, strongly convex.
Nonsmooth optimization

Unconstrained problem: $\min_{x \in \mathbb{R}^n} f(x)$, where

- $f$ is convex on $\mathbb{R}^n$, and Lipschitz continuous on a bounded set $X$:
  $|f(x) - f(y)| \le L\|x - y\|_2, \quad \forall x, y \in X.$
- At each step, we have access to $f(x)$ and some $g \in \partial f(x)$
- Goal: find $x^k \in \mathbb{R}^n$ such that $f(x^k) - f^* \le \epsilon$
- First-order methods: $x^k \in x^0 + \operatorname{span}\{g^0, \ldots, g^{k-1}\}$
Nonsmooth optimization

EXAMPLE. Let $\varphi(x) = |x|$ for $x \in \mathbb{R}$.

Subgradient method: $x^{k+1} = x^k - \alpha_k g^k$, where $g^k \in \partial\varphi(x^k)$.

If $x^0 = 1$ and $\alpha_k = \frac{1}{\sqrt{k+1}} - \frac{1}{\sqrt{k+2}}$ (this stepsize is known to be optimal), then $x^k = \frac{1}{\sqrt{k+1}}$.

Thus, $O(1/\epsilon^2)$ iterations are needed to obtain $\epsilon$-accuracy.

This behavior is typical for the subgradient method, which exhibits $O(1/\sqrt{k})$ convergence in general.

Can we do better in general?
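The one-dimensional example above is easy to verify numerically. A short Python sketch (mine, not from the lecture) that reproduces $x^k = 1/\sqrt{k+1}$ exactly:

```python
import math

def subgradient_abs(x0=1.0, iters=100):
    """Subgradient method on phi(x) = |x| with the stepsize from the slide:
    alpha_k = 1/sqrt(k+1) - 1/sqrt(k+2). Starting from x0 = 1, the iterates
    trace out x^k = 1/sqrt(k+1)."""
    x = x0
    for k in range(iters):
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |x|
        alpha = 1.0 / math.sqrt(k + 1) - 1.0 / math.sqrt(k + 2)
        x = x - alpha * g
    return x

# After 99 steps: x^99 = 1/sqrt(100) = 0.1
x99 = subgradient_abs(iters=99)
```

The slow $O(1/\sqrt{k})$ decay is plainly visible: even 99 iterations only reach accuracy $0.1$.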
Nonsmooth optimization

Nope!

Theorem (Nesterov). Let $B = \{x \mid \|x - x^0\|_2 \le D\}$. Assume $x^* \in B$. There exists a convex function $f \in C_L^0(B)$ (with $L > 0$), such that for $0 \le k \le n-1$, the lower bound

$f(x^k) - f(x^*) \ge \dfrac{LD}{2(1+\sqrt{k+1})}$

holds for any algorithm that generates $x^k$ by linearly combining the previous iterates and subgradients.

Should we give up? No! Several possibilities remain!
Nonsmooth optimization

Blackbox model too pessimistic!

- Nesterov's breakthroughs: excessive gap technique; composite objective minimization
- Nemirovski's workhorses of general convex optimization: mirror-descent, NERML; mirror-prox
- Other techniques, problem classes, etc.
Proximal splitting
Composite objectives

Frequently, nonsmooth problems take the composite form

minimize $f(x) := l(x) + r(x)$

with $l$ smooth and $r$ nonsmooth.

Example: $l(x) = \frac{1}{2}\|Ax - b\|^2$ and $r(x) = \lambda\|x\|_1$ — Lasso, L1-LS, compressed sensing

Example: $l(x)$ the logistic loss, and $r(x) = \lambda\|x\|_1$ — L1-logistic regression, sparse LR
Composite objective minimization

minimize $f(x) := l(x) + r(x)$

Subgradient method: $x^{k+1} = x^k - \alpha_k g^k$, $g^k \in \partial f(x^k)$

- The subgradient method converges slowly, at rate $O(1/\sqrt{k})$
- But: $f$ is smooth plus nonsmooth — we should exploit the smoothness of $l$ for a better method!
Projections: another view

Let $I_X$ be the indicator function of the closed, convex set $X$, defined as:

$I_X(x) := \begin{cases} 0 & \text{if } x \in X, \\ \infty & \text{otherwise.} \end{cases}$

Recall the orthogonal projection $P_X(y)$:

$P_X(y) := \operatorname{argmin}_x \ \tfrac{1}{2}\|x - y\|_2^2 \quad \text{s.t. } x \in X.$

Rewrite the orthogonal projection $P_X(y)$ as

$P_X(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|x - y\|_2^2 + I_X(x).$
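To make the indicator-function view concrete, here is a tiny numerical sketch (the box constraint set is my own choice, used only because its projection has a closed form):

```python
import numpy as np

# For a box X = {x : lo <= x_i <= hi}, the projection
#   P_X(y) = argmin 0.5||x - y||^2 + I_X(x)
# is solved componentwise: clip each coordinate to [lo, hi].
lo, hi = -1.0, 1.0
y = np.array([2.0, -3.0, 0.5])
p = np.clip(y, lo, hi)   # P_X(y) = [1.0, -1.0, 0.5]
```

Viewing $P_X$ as the minimizer of $\tfrac{1}{2}\|x-y\|_2^2 + I_X(x)$ is exactly what the next slide generalizes.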
Generalizing projections: proximity

Projection: $P_X(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|x - y\|_2^2 + I_X(x)$

Proximity: replace $I_X$ by some convex function!

$\operatorname{prox}_r(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|x - y\|_2^2 + r(x)$

Def. $\operatorname{prox}_r : \mathbb{R}^n \to \mathbb{R}^n$ is called a proximity operator.
Proximity operator (figure: the $\ell_1$-norm ball of radius $\rho(\lambda)$)
Proximity operators

Exercise: Let $r(x) = \|x\|_1$. Solve $\operatorname{prox}_{\lambda r}(y)$:

$\min_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_1.$

Hint 1: The above problem decomposes into $n$ independent subproblems of the form $\min_{x \in \mathbb{R}} \ \tfrac{1}{2}(x - y)^2 + \lambda|x|$.
Hint 2: Consider the two cases separately: either $x = 0$ or $x \ne 0$.

Aka: the soft-thresholding operator.
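The solution of the exercise (soft-thresholding) is a one-liner. A Python sketch of the operator, applied componentwise as Hint 1 suggests:

```python
import numpy as np

def prox_l1(y, lam):
    """Soft-thresholding: prox of lam*||x||_1, componentwise
    sgn(y_i) * max(|y_i| - lam, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, -0.5, 1.0, -2.0])
shrunk = prox_l1(y, 1.0)   # components inside [-1, 1] are set to 0
```

Components with $|y_i| \le \lambda$ are mapped to zero, which is why this prox produces sparse solutions.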
Basics of proximal splitting

Recall gradient projection for solving $\min_{x \in X} f(x)$ for $f \in C_L^1$:

$x^{k+1} = P_X(x^k - \alpha_k \nabla f(x^k))$

The proximal gradient method solves $\min \ l(x) + r(x)$ via

$x^{k+1} = \operatorname{prox}_{\alpha_k r}(x^k - \alpha_k \nabla l(x^k)).$

This method is aka forward-backward splitting (FBS):
- Forward step: the gradient-descent step
- Backward step: the prox-operator
FBS example: Lasso / L1-LS

$\min \ \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1.$

$\operatorname{prox}_{\lambda\|\cdot\|_1}(y) = \operatorname{sgn}(y) \cdot \max(|y| - \lambda, 0)$ (componentwise)

$x^{k+1} = \operatorname{prox}_{\alpha_k \lambda\|\cdot\|_1}\!\left(x^k - \alpha_k A^T(Ax^k - b)\right).$

This is the so-called iterative soft-thresholding algorithm!

Exercise: Try solving the problem $\min \ \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_2$.
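Putting the forward step and the soft-thresholding backward step together gives iterative soft-thresholding in a few lines. A Python sketch (my own; the constant stepsize $1/L$ with $L = \|A\|_2^2$ is one standard choice, not prescribed by the slide):

```python
import numpy as np

def ista(A, b, lam, iters=100):
    """Forward-backward splitting (ISTA) for min 0.5||Ax-b||^2 + lam*||x||_1,
    using the constant stepsize 1/L with L = ||A||_2^2 (Lipschitz constant
    of the gradient of the smooth part)."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - (A.T @ (A @ x - b)) / L                         # forward step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # backward (prox) step
    return x

# Sanity check with A = I: the solution is soft-thresholding of b itself
x = ista(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0, iters=10)
```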
Exercise

Recall our older example: $\tfrac{1}{2}\|D^T x - b\|_2^2$. We solved its unconstrained and constrained versions so far. Now implement a Matlab script to solve

$\min \ \tfrac{1}{2}\|D^T x - b\|_2^2 + \lambda\|x\|_1.$

- Use FBS as shown above
- Try different values of $\lambda > 0$ in your code
- Use different choices of $b$
- Show a choice of $b$ for which $x^* = 0$ (the zero vector) is optimal
- Experiment with different stepsizes (try $\alpha_k = 1/4$, and if that does not work, try smaller values that might work). Do not expect monotonic descent.
- Compare with versions of the subgradient method
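A skeleton for this exercise, in Python rather than Matlab for consistency with the other sketches here. The matrix $D$ below is a random stand-in (the actual $D$ from the earlier lectures is not reproduced in these slides); swap in your own $D$ and $b$:

```python
import numpy as np

def fbs_l1(D, b, lam, alpha, iters=2000):
    """FBS for min 0.5||D^T x - b||^2 + lam*||x||_1.
    Gradient of the smooth part: D (D^T x - b)."""
    x = np.zeros(D.shape[0])
    for _ in range(iters):
        z = x - alpha * (D @ (D.T @ x - b))                         # forward step
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)   # prox of lam*||.||_1
    return x

# Stand-in data; replace D, b with the ones from the earlier example
rng = np.random.default_rng(0)
D = rng.standard_normal((5, 8))
b = rng.standard_normal(8)
lam = 0.1
x = fbs_l1(D, b, lam, alpha=1.0 / np.linalg.norm(D, 2) ** 2)
```

With $\alpha \le 1/L$ (here $L = \|D\|_2^2$), the objective at the output should be no worse than at the starting point $x = 0$, which gives a quick sanity check.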
Proximity operators
$\operatorname{prox}_r$ has several nice properties.

Read / skim the paper: Proximal Splitting Methods in Signal Processing, by Combettes and Pesquet (2010).

Theorem. The operator $\operatorname{prox}_r$ is firmly nonexpansive (FNE):

$\|\operatorname{prox}_r x - \operatorname{prox}_r y\|_2^2 \le \langle \operatorname{prox}_r x - \operatorname{prox}_r y,\ x - y \rangle$

Proof: (blackboard)

Corollary. The operator $\operatorname{prox}_r$ is nonexpansive.
Proof: apply Cauchy-Schwarz to FNE.
Consequence of FNE

Gradient projection: $x^{k+1} = P_X(x^k - \alpha_k \nabla f(x^k))$
Proximal gradients / FBS: $x^{k+1} = \operatorname{prox}_{\alpha_k r}(x^k - \alpha_k \nabla l(x^k))$

The same convergence theory goes through!

Exercise: Try extending the proof of gradient-projection convergence to a convergence proof for FBS. Hint: First show that $x^*$ satisfies the fixed-point equation

$x^* = \operatorname{prox}_{\alpha r}(x^* - \alpha \nabla l(x^*)), \quad \forall \alpha > 0.$
Moreau decomposition

Aim: compute $\operatorname{prox}_r y$. Sometimes it is easier to compute $\operatorname{prox}_{r^*} y$, where $r^*$ is the Fenchel conjugate

$r^*(u) := \sup_x \ \langle u, x \rangle - r(x).$

Moreau decomposition: $y = \operatorname{prox}_r y + \operatorname{prox}_{r^*} y$.
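The decomposition is easy to check numerically for $r = \lambda\|\cdot\|_1$: the conjugate $r^*$ is the indicator of the $\ell_\infty$-ball of radius $\lambda$, so $\operatorname{prox}_{r^*}$ is the projection onto that ball, i.e., componentwise clipping. A Python sketch (mine, for illustration):

```python
import numpy as np

# Check y = prox_r(y) + prox_{r*}(y) for r = lam*||.||_1:
# prox_r is soft-thresholding; prox_{r*} is projection onto the
# l_inf-ball of radius lam (componentwise clipping to [-lam, lam]).
lam = 1.5
y = np.array([3.0, -0.2, 1.5, -4.0])
prox_r = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)   # soft-thresholding
prox_rstar = np.clip(y, -lam, lam)                       # projection onto l_inf ball
recombined = prox_r + prox_rstar                         # should equal y exactly
```

The two pieces always add back up to $y$: the part of each coordinate inside $[-\lambda, \lambda]$ goes to $\operatorname{prox}_{r^*}$, the excess to $\operatorname{prox}_r$.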
Moreau decomposition

Proof sketch:
- Consider $\min \ \tfrac{1}{2}\|x - y\|_2^2 + r(x)$
- Introduce a new variable $z = x$ to get $\operatorname{prox}_r y := \operatorname{argmin} \ \tfrac{1}{2}\|x - y\|_2^2 + r(z)$, s.t. $x = z$
- Derive the Lagrangian dual of this problem
- Simplify, and conclude!