ELE 538B: Large-Scale Optimization for Data Science
Accelerated gradient methods
Yuxin Chen, Princeton University, Spring 2018
Outline
• Heavy-ball methods
• Nesterov's accelerated gradient methods
• Accelerated proximal gradient methods (FISTA)
• Convergence analysis
• Lower bounds
(Proximal) gradient methods
Iteration complexities of (proximal) gradient methods:
• strongly convex and smooth problems: $O\big(\kappa \log \frac{1}{\varepsilon}\big)$
• convex and smooth problems: $O\big(\frac{1}{\varepsilon}\big)$
Can one still hope to further accelerate convergence?
Issues and possible solutions
Issues:
• GD focuses on improving cost per iteration, which might be too short-sighted
• GD might sometimes zigzag / experience abrupt changes
Solutions:
• exploit information from history / past iterates
• add buffer (like momentum) to yield smoother trajectory
Heavy-ball methods (Polyak '64)
Heavy-ball method
$$\text{minimize}_{x \in \mathbb{R}^n} \quad f(x)$$
$$x^{t+1} = x^t - \eta_t \nabla f(x^t) + \underbrace{\theta_t \big( x^t - x^{t-1} \big)}_{\text{momentum term}}$$
add inertia to the ball (i.e. include momentum term) to mitigate zigzagging (B. Polyak)
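As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the heavy-ball iteration; the quadratic test problem and the particular values of `eta` and `theta` are arbitrary placeholders.

```python
import numpy as np

def heavy_ball(grad_f, x0, eta, theta, num_iters=500):
    """Heavy-ball iteration: x_{t+1} = x_t - eta * grad_f(x_t) + theta * (x_t - x_{t-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_next = x - eta * grad_f(x) + theta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# illustrative strongly convex quadratic f(x) = 0.5 * x^T diag(d) x, minimizer x* = 0
d = np.linspace(1.0, 100.0, 50)          # mu = 1, L = 100
grad_f = lambda x: d * x
x_hb = heavy_ball(grad_f, np.ones(50), eta=1.0 / 100, theta=0.8)
print(np.linalg.norm(x_hb))              # distance to the minimizer after 500 iterations
```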
Heavy-ball method
[Figure: trajectories toward $x^\star$ of gradient descent vs. the heavy-ball method]
State-space models
One can understand heavy-ball methods through the dynamics of the following dynamical system
$$\begin{bmatrix} x^{t+1} \\ x^{t} \end{bmatrix} = \begin{bmatrix} (1+\theta_t) I & -\theta_t I \\ I & 0 \end{bmatrix} \begin{bmatrix} x^{t} \\ x^{t-1} \end{bmatrix} - \begin{bmatrix} \eta_t \nabla f(x^t) \\ 0 \end{bmatrix}$$
or equivalently,
$$\underbrace{\begin{bmatrix} x^{t+1} - x^\star \\ x^{t} - x^\star \end{bmatrix}}_{\text{state}} = \begin{bmatrix} (1+\theta_t) I & -\theta_t I \\ I & 0 \end{bmatrix} \begin{bmatrix} x^{t} - x^\star \\ x^{t-1} - x^\star \end{bmatrix} - \begin{bmatrix} \eta_t \nabla f(x^t) \\ 0 \end{bmatrix}
= \underbrace{\begin{bmatrix} (1+\theta_t) I - \eta_t \int_0^1 \nabla^2 f(x_\tau)\, \mathrm{d}\tau & -\theta_t I \\ I & 0 \end{bmatrix}}_{\text{system matrix}} \begin{bmatrix} x^{t} - x^\star \\ x^{t-1} - x^\star \end{bmatrix}$$
with $x_\tau = x^t + \tau (x^\star - x^t)$, where the last line comes from the fundamental theorem of calculus
System matrix
$$\begin{bmatrix} x^{t+1} - x^\star \\ x^{t} - x^\star \end{bmatrix} = \underbrace{\begin{bmatrix} (1+\theta_t) I - \eta_t \int_0^1 \nabla^2 f(x_\tau)\, \mathrm{d}\tau & -\theta_t I \\ I & 0 \end{bmatrix}}_{:= H_t \ (\text{system matrix})} \begin{bmatrix} x^{t} - x^\star \\ x^{t-1} - x^\star \end{bmatrix} \tag{7.1}$$
• implication: convergence of heavy-ball methods depends on the spectrum of the system matrix $H_t$
• key idea: find appropriate stepsize $\eta_t$ and momentum coefficient $\theta_t$ to control the spectrum of $H_t$
Convergence for strongly convex problems
Theorem 7.1 (Convergence of heavy-ball methods for strongly convex functions)
Suppose $f$ is $\mu$-strongly convex and $L$-smooth. Set $\kappa = L/\mu$,
$$\eta_t \equiv \frac{4}{(\sqrt{L} + \sqrt{\mu})^2}, \qquad \theta_t \equiv \max\big\{ |1 - \sqrt{\eta_t L}|, \ |1 - \sqrt{\eta_t \mu}| \big\}^2 .$$
Then
$$\left\| \begin{bmatrix} x^{t+1} - x^\star \\ x^{t} - x^\star \end{bmatrix} \right\| \lesssim \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{t} \left\| \begin{bmatrix} x^{1} - x^\star \\ x^{0} - x^\star \end{bmatrix} \right\|$$
• iteration complexity: $O\big(\sqrt{\kappa} \log \frac{1}{\varepsilon}\big)$
• significant improvement over GD: $O\big(\sqrt{\kappa} \log \frac{1}{\varepsilon}\big)$ vs. $O\big(\kappa \log \frac{1}{\varepsilon}\big)$
• relies on knowledge of both $L$ and $\mu$
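As a quick numerical sanity check of this parameter choice (an illustrative sketch, not part of the slides), one can form, for each eigenvalue $\lambda \in [\mu, L]$, the corresponding $2 \times 2$ block of the system matrix and verify that its spectral radius equals $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$; the values $L = 100$, $\mu = 1$ below are arbitrary.

```python
import numpy as np

L, mu = 100.0, 1.0
kappa = L / mu
eta = 4 / (np.sqrt(L) + np.sqrt(mu))**2
theta = max((1 - np.sqrt(eta * L))**2, (1 - np.sqrt(eta * mu))**2)

# spectral radius of the 2x2 block of the system matrix, for eigenvalues lambda in [mu, L]
radii = []
for lam in np.linspace(mu, L, 1000):
    H = np.array([[1 + theta - eta * lam, -theta],
                  [1.0, 0.0]])
    radii.append(max(abs(np.linalg.eigvals(H))))

print(max(radii))                                    # ~ 0.818 for kappa = 100
print((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))   # (sqrt(kappa)-1)/(sqrt(kappa)+1)
```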
Proof of Theorem 7.1
In view of (7.1), it suffices to control the spectrum of $H_t$. Let $\lambda_i$ denote the $i$th eigenvalue of $\int_0^1 \nabla^2 f(x_\tau)\, \mathrm{d}\tau$ and set $\Lambda := \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$; then
$$\| H_t \| = \left\| \begin{bmatrix} (1+\theta_t) I - \eta_t \Lambda & -\theta_t I \\ I & 0 \end{bmatrix} \right\| \le \max_{1 \le i \le n} \left\| \begin{bmatrix} 1+\theta_t - \eta_t \lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} \right\|$$
To finish the proof, it suffices to show
$$\max_{i} \left\| \begin{bmatrix} 1+\theta_t - \eta_t \lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} \right\| \le \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \tag{7.2}$$
Proof of Theorem 7.1
To show (7.2), note that the two eigenvalues of $\begin{bmatrix} 1+\theta_t - \eta_t \lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix}$ are the roots of
$$z^2 - (1 + \theta_t - \eta_t \lambda_i)\, z + \theta_t = 0 .$$
If $(1 + \theta_t - \eta_t \lambda_i)^2 \le 4\theta_t$, then the roots of this equation have the same magnitude $\sqrt{\theta_t}$ (as they are either both imaginary or there is only one root). In addition, one can easily check that $(1 + \theta_t - \eta_t \lambda_i)^2 \le 4\theta_t$ is satisfied if
$$\theta_t \in \Big[ \big( 1 - \sqrt{\eta_t \lambda_i} \big)^2, \ \big( 1 + \sqrt{\eta_t \lambda_i} \big)^2 \Big], \tag{7.3}$$
which would hold if one picks $\theta_t = \max\big\{ \big( 1 - \sqrt{\eta_t L} \big)^2, \ \big( 1 - \sqrt{\eta_t \mu} \big)^2 \big\}$
Proof of Theorem 7.1
With this choice of $\theta_t$, we have $\|H_t\| \le \sqrt{\theta_t}$. Finally, setting $\eta_t = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$ ensures $\sqrt{\eta_t L} - 1 = 1 - \sqrt{\eta_t \mu}$, which yields
$$\theta_t = \max\left\{ \Big( 1 - \frac{2\sqrt{L}}{\sqrt{L}+\sqrt{\mu}} \Big)^2, \ \Big( 1 - \frac{2\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}} \Big)^2 \right\} = \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^2$$
This in turn establishes $\|H_t\| \le \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$
Nesterov's accelerated gradient methods
Convex case
$$\text{minimize}_{x \in \mathbb{R}^n} \quad f(x)$$
For strongly convex $f$, including momentum terms allows one to improve the iteration complexity from $O\big(\kappa \log \frac{1}{\varepsilon}\big)$ to $O\big(\sqrt{\kappa} \log \frac{1}{\varepsilon}\big)$.
Can we obtain improvement for the convex case (without strong convexity) as well?
Nesterov's idea (Nesterov '83)
$$x^{t+1} = y^t - \eta_t \nabla f(y^t)$$
$$y^{t+1} = x^{t+1} + \frac{t}{t+3} \big( x^{t+1} - x^t \big)$$
• alternates between gradient updates and proper extrapolation
• each iteration takes nearly the same cost as GD
• not a descent method (i.e. we may not have $f(x^{t+1}) \le f(x^t)$)
• one of the most beautiful and mysterious results in optimization ...
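A minimal NumPy sketch of this iteration (illustrative, not part of the slides); the least-squares instance and the iteration count are arbitrary.

```python
import numpy as np

def nesterov_agd(grad_f, x0, eta, num_iters=300):
    """Nesterov's accelerated gradient method for convex, L-smooth f (momentum t/(t+3))."""
    x, y = x0.copy(), x0.copy()
    for t in range(num_iters):
        x_next = y - eta * grad_f(y)                 # gradient step from the extrapolated point
        y = x_next + t / (t + 3) * (x_next - x)      # extrapolation with coefficient t/(t+3)
        x = x_next
    return x

# illustrative least-squares problem f(x) = 0.5 * ||Ax - b||^2
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 100))
b = rng.standard_normal(200)
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2)**2                          # smoothness constant of f
x_hat = nesterov_agd(grad_f, np.zeros(100), eta=1 / L)
print(0.5 * np.linalg.norm(A @ x_hat - b)**2)        # objective value after 300 iterations
```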
Convergence of Nesterov's accelerated gradient method
Suppose $f$ is convex and $L$-smooth. If $\eta_t \equiv \eta = 1/L$, then
$$f(x^t) - f^{\mathrm{opt}} \le \frac{2L \|x^0 - x^\star\|^2}{(t+1)^2}$$
• iteration complexity: $O\big(\frac{1}{\sqrt{\varepsilon}}\big)$
• much faster than gradient methods
• we'll provide the proof for the (more general) proximal version later
Interpretation using differential equations
Nesterov's momentum coefficient $\frac{t}{t+3} = 1 - \frac{3}{t+3}$ is particularly mysterious
Interpretation using differential equations
To develop insights into why Nesterov's method works so well, it is helpful to look at its continuous limit ($\eta_t \to 0$), which is given by a second-order ordinary differential equation (ODE)
$$\ddot{X}(\tau) + \underbrace{\frac{3}{\tau}}_{\text{damping coefficient}} \dot{X}(\tau) + \underbrace{\nabla f\big( X(\tau) \big)}_{\text{potential}} = 0$$
(Su, Boyd, Candes '14)
[Figure: damped-oscillator analogy with a time-varying damping force and a spring force]
Heuristic derivation of the ODE
To begin with, Nesterov's update rule is equivalent to
$$\frac{x^{t+1} - x^t}{\sqrt{\eta}} = \frac{t-1}{t+2} \cdot \frac{x^t - x^{t-1}}{\sqrt{\eta}} - \sqrt{\eta}\, \nabla f(y^t) \tag{7.4}$$
Let $t = \frac{\tau}{\sqrt{\eta}}$. Set $X(\tau) \approx x^{\tau / \sqrt{\eta}} = x^t$ and $X(\tau + \sqrt{\eta}) \approx x^{t+1}$. Then Taylor expansion gives
$$\frac{x^{t+1} - x^t}{\sqrt{\eta}} \approx \dot{X}(\tau) + \frac{1}{2} \ddot{X}(\tau) \sqrt{\eta}, \qquad \frac{x^{t} - x^{t-1}}{\sqrt{\eta}} \approx \dot{X}(\tau) - \frac{1}{2} \ddot{X}(\tau) \sqrt{\eta},$$
which combined with (7.4) yields
$$\dot{X}(\tau) + \frac{1}{2} \ddot{X}(\tau) \sqrt{\eta} \approx \Big( 1 - \frac{3\sqrt{\eta}}{\tau} \Big) \Big( \dot{X}(\tau) - \frac{1}{2} \ddot{X}(\tau) \sqrt{\eta} \Big) - \sqrt{\eta}\, \nabla f\big( X(\tau) \big)$$
$$\Longrightarrow \quad \ddot{X}(\tau) + \frac{3}{\tau} \dot{X}(\tau) + \nabla f\big( X(\tau) \big) \approx 0$$
Convergence rate of the ODE
$$\ddot{X} + \frac{3}{\tau} \dot{X} + \nabla f(X) = 0 \tag{7.5}$$
It is not hard to show that this ODE obeys
$$f\big( X(\tau) \big) - f^{\mathrm{opt}} \le O\Big( \frac{1}{\tau^2} \Big) \tag{7.6}$$
which somehow explains Nesterov's $O(1/t^2)$ convergence
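To see (7.6) numerically, one can integrate the ODE for a toy quadratic and compare against the bound $2\|x^0 - x^\star\|^2 / \tau^2$ established in the proof that follows. The sketch below (not from the slides) uses a simple semi-implicit Euler discretization; the diagonal quadratic, stepsize, and horizon are arbitrary.

```python
import numpy as np

# integrate X'' + (3/tau) X' + grad_f(X) = 0 for f(x) = 0.5 * x^T diag(d) x (f_opt = 0)
d = np.array([1.0, 10.0, 100.0])
grad_f = lambda x: d * x
f = lambda x: 0.5 * np.sum(d * x**2)

dt, tau, tau_end = 1e-4, 1e-3, 30.0      # start slightly after tau = 0 to avoid the 3/tau singularity
x0 = np.ones(3)
X, V = x0.copy(), np.zeros(3)            # X(0) = x0, X'(0) = 0
while tau < tau_end:
    V = V + dt * (-3.0 / tau * V - grad_f(X))   # semi-implicit Euler: update velocity first
    X = X + dt * V
    tau = tau + dt

print(f(X))                                       # f(X(tau_end))
print(2 * np.linalg.norm(x0)**2 / tau_end**2)     # the bound E(0)/tau^2 = 2*||x0 - x*||^2 / tau^2
```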
Proof of (7.6)
Define the Lyapunov function / energy function
$$\mathcal{E}(\tau) = \tau^2 \big( f(X) - f^{\mathrm{opt}} \big) + 2 \Big\| X + \frac{\tau}{2} \dot{X} - x^\star \Big\|^2 .$$
This obeys
$$\dot{\mathcal{E}} = 2\tau \big( f(X) - f^{\mathrm{opt}} \big) + \tau^2 \big\langle \nabla f(X), \dot{X} \big\rangle + 4 \Big\langle X + \frac{\tau}{2} \dot{X} - x^\star, \ \frac{3}{2} \dot{X} + \frac{\tau}{2} \ddot{X} \Big\rangle$$
$$\overset{\text{(i)}}{=} 2\tau \big( f(X) - f^{\mathrm{opt}} \big) - 2\tau \big\langle X - x^\star, \nabla f(X) \big\rangle \ \le \ 0 \quad \text{(by convexity)}$$
where (i) follows by replacing $\tau \ddot{X} + 3 \dot{X}$ with $-\tau \nabla f(X)$.
This means $\mathcal{E}$ is non-increasing in $\tau$, and hence
$$f\big( X(\tau) \big) - f^{\mathrm{opt}} \ \overset{\text{(defn of } \mathcal{E})}{\le} \ \frac{\mathcal{E}(\tau)}{\tau^2} \ \le \ \frac{\mathcal{E}(0)}{\tau^2} = O\Big( \frac{1}{\tau^2} \Big) .$$
Magic number 3
$$\ddot{X} + \frac{3}{\tau} \dot{X} + \nabla f(X) = 0$$
• 3 is the smallest constant that guarantees $O(1/\tau^2)$ convergence, and can be replaced by any other $\alpha \ge 3$
• in some sense, 3 minimizes the pre-constant in the convergence bound $O(1/\tau^2)$ (see Su, Boyd, Candes '14)
Numerical example
Example taken from UCLA EE236C: minimize $\log \sum_{i=1}^m \exp\big( a_i^\top x + b_i \big)$, on randomly generated problems with $m = 2000$, $n = 200$; fixed step size $1/L$ used for the gradient method and FISTA.
[Figure: relative error $(f(x^{(k)}) - f^\star)/f^\star$ vs. iteration $k$ for the gradient method and FISTA]
Extension to composite models
$$\text{minimize}_{x \in \mathbb{R}^n} \quad F(x) := f(x) + h(x)$$
• $f$: convex and smooth
• $h$: convex (may not be differentiable)
• let $F^{\mathrm{opt}} := \min_x F(x)$ be the optimal cost
FISTA (Beck & Teboulle '09)
Fast iterative shrinkage-thresholding algorithm:
$$x^{t+1} = \mathrm{prox}_{\eta_t h} \big( y^t - \eta_t \nabla f(y^t) \big)$$
$$y^{t+1} = x^{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} \big( x^{t+1} - x^t \big)$$
where $y^0 = x^0$, $\theta_0 = 1$ and $\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}$
• momentum coefficient originally proposed by Nesterov '83
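For concreteness, here is a minimal sketch of FISTA specialized to the lasso, minimize $\frac{1}{2}\|Ax - b\|^2 + \lambda \|x\|_1$, where the prox of the $\ell_1$ term is soft-thresholding; the problem instance and the value of $\lambda$ are illustrative (this example is not part of the original slides).

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista_lasso(A, b, lam, num_iters=500):
    """FISTA for minimize 0.5*||Ax - b||^2 + lam*||x||_1 (a minimal sketch)."""
    L = np.linalg.norm(A, 2)**2              # Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    y, theta = x.copy(), 1.0
    for _ in range(num_iters):
        x_next = soft_threshold(y - (A.T @ (A @ y - b)) / L, lam / L)
        theta_next = (1 + np.sqrt(1 + 4 * theta**2)) / 2
        y = x_next + (theta - 1) / theta_next * (x_next - x)
        x, theta = x_next, theta_next
    return x

# illustrative sparse recovery instance
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 300))
x_true = np.zeros(300); x_true[:5] = 1.0
b = A @ x_true
print(np.round(fista_lasso(A, b, lam=0.1)[:8], 3))
```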
Momentum coefficient
$$\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2} \qquad \text{with } \theta_0 = 1$$
[Figure: $\theta_t$ and the momentum coefficient $\frac{\theta_t - 1}{\theta_{t+1}}$ vs. $t$]
• $\frac{\theta_t - 1}{\theta_{t+1}} = 1 - \frac{3}{t} + o\big( \frac{1}{t} \big)$ (homework)
• asymptotically equivalent to $\frac{t}{t+3}$
Momentum coefficient
$$\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2} \qquad \text{with } \theta_0 = 1$$
Fact 7.2: For all $t \ge 1$, one has $\theta_t \ge \frac{t+2}{2}$ (homework)
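A few lines of NumPy (illustrative, not from the slides) confirm Fact 7.2 and the asymptotic equivalence with $t/(t+3)$ over the first few dozen iterations.

```python
import numpy as np

theta = [1.0]                                   # theta_0 = 1
for t in range(1, 50):
    theta.append((1 + np.sqrt(1 + 4 * theta[-1]**2)) / 2)

# Fact 7.2: theta_t >= (t + 2) / 2 for all t >= 1
print(all(theta[t] >= (t + 2) / 2 for t in range(1, 50)))

# momentum coefficient (theta_t - 1) / theta_{t+1} is asymptotically t / (t + 3)
t = 40
print((theta[t] - 1) / theta[t + 1], t / (t + 3))
```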
Convergence analysis
Convergence for convex problems
Theorem 7.3 (Convergence of accelerated proximal gradient methods for convex problems)
Suppose $f$ is convex and $L$-smooth. If $\eta_t \equiv 1/L$, then
$$F(x^t) - F^{\mathrm{opt}} \le \frac{2L \|x^0 - x^\star\|^2}{(t+1)^2}$$
• improved iteration complexity (i.e. $O(1/\sqrt{\varepsilon})$) compared with the proximal gradient method (i.e. $O(1/\varepsilon)$)
• fast if prox can be efficiently implemented
Recap: fundamental inequality for the proximal method
Recall the following fundamental inequality shown in the last lecture:
Lemma 7.4
Let $y^+ = \mathrm{prox}_{\frac{1}{L} h} \big( y - \frac{1}{L} \nabla f(y) \big)$. Then
$$F(y^+) - F(x) \le \frac{L}{2} \|x - y\|^2 - \frac{L}{2} \|x - y^+\|^2$$
Proof of Theorem 7.3
1. build a discrete-time version of the Lyapunov function
2. magic happens! the Lyapunov function is non-increasing when Nesterov's momentum coefficient is adopted
Proof of Theorem 7.3
Key lemma: monotonicity of a certain Lyapunov function
Lemma 7.5
Let $u^t = \underbrace{\theta_{t-1} x^t - \big( x^\star + (\theta_{t-1} - 1) x^{t-1} \big)}_{\text{or } \ \theta_{t-1}(x^t - x^\star) - (\theta_{t-1} - 1)(x^{t-1} - x^\star)}$. Then
$$\|u^{t+1}\|^2 + \frac{2}{L} \theta_t^2 \big( F(x^{t+1}) - F^{\mathrm{opt}} \big) \ \le \ \|u^t\|^2 + \frac{2}{L} \theta_{t-1}^2 \big( F(x^t) - F^{\mathrm{opt}} \big) .$$
• quite similar to the Lyapunov function $2 \| X + \frac{\tau}{2} \dot{X} - x^\star \|^2 + \tau^2 \big( f(X) - f^{\mathrm{opt}} \big)$ discussed before (think about $\theta_t \approx t/2$)
Proof of Theorem 7.3
With Lemma 7.5 in place, one has
$$\frac{2}{L} \theta_{t-1}^2 \big( F(x^t) - F^{\mathrm{opt}} \big) \le \|u^1\|^2 + \frac{2}{L} \theta_0^2 \big( F(x^1) - F^{\mathrm{opt}} \big) = \|x^1 - x^\star\|^2 + \frac{2}{L} \big( F(x^1) - F^{\mathrm{opt}} \big)$$
To bound the RHS of this inequality, we use Lemma 7.4 and $y^0 = x^0$ to get
$$\frac{2}{L} \big( F(x^1) - F^{\mathrm{opt}} \big) \le \|y^0 - x^\star\|^2 - \|x^1 - x^\star\|^2 = \|x^0 - x^\star\|^2 - \|x^1 - x^\star\|^2$$
$$\Longrightarrow \quad \|x^1 - x^\star\|^2 + \frac{2}{L} \big( F(x^1) - F^{\mathrm{opt}} \big) \le \|x^0 - x^\star\|^2$$
As a result,
$$\frac{2}{L} \theta_{t-1}^2 \big( F(x^t) - F^{\mathrm{opt}} \big) \le \|x^1 - x^\star\|^2 + \frac{2}{L} \big( F(x^1) - F^{\mathrm{opt}} \big) \le \|x^0 - x^\star\|^2$$
$$\Longrightarrow \quad F(x^t) - F^{\mathrm{opt}} \le \frac{L \|x^0 - x^\star\|^2}{2 \theta_{t-1}^2} \ \overset{\text{(Fact 7.2)}}{\le} \ \frac{2L \|x^0 - x^\star\|^2}{(t+1)^2}$$
Proof of Lemma 7.5
Take $x = \theta_t^{-1} x^\star + \big( 1 - \theta_t^{-1} \big) x^t$ and $y = y^t$ in Lemma 7.4 to get
$$F(x^{t+1}) - F\big( \theta_t^{-1} x^\star + (1 - \theta_t^{-1}) x^t \big) \tag{7.7}$$
$$\le \frac{L}{2} \big\| \theta_t^{-1} x^\star + (1 - \theta_t^{-1}) x^t - y^t \big\|^2 - \frac{L}{2} \big\| \theta_t^{-1} x^\star + (1 - \theta_t^{-1}) x^t - x^{t+1} \big\|^2$$
$$= \frac{L}{2\theta_t^2} \big\| x^\star + (\theta_t - 1) x^t - \theta_t y^t \big\|^2 - \frac{L}{2\theta_t^2} \big\| \underbrace{x^\star + (\theta_t - 1) x^t - \theta_t x^{t+1}}_{= -u^{t+1}} \big\|^2$$
$$\overset{\text{(i)}}{=} \frac{L}{2\theta_t^2} \Big( \|u^t\|^2 - \|u^{t+1}\|^2 \Big), \tag{7.8}$$
where (i) follows from the definition of $u^t$ and $y^t = x^t + \frac{\theta_{t-1} - 1}{\theta_t} \big( x^t - x^{t-1} \big)$
Proof of Lemma 7.5 (cont.)
We will also lower bound (7.7). By convexity of $F$,
$$F\big( \theta_t^{-1} x^\star + (1 - \theta_t^{-1}) x^t \big) \le \theta_t^{-1} F(x^\star) + \big( 1 - \theta_t^{-1} \big) F(x^t) = \theta_t^{-1} F^{\mathrm{opt}} + \big( 1 - \theta_t^{-1} \big) F(x^t)$$
$$\Longrightarrow \quad F(x^{t+1}) - F\big( \theta_t^{-1} x^\star + (1 - \theta_t^{-1}) x^t \big) \ge \big( F(x^{t+1}) - F^{\mathrm{opt}} \big) - \big( 1 - \theta_t^{-1} \big) \big( F(x^t) - F^{\mathrm{opt}} \big)$$
Combining this with (7.8) and $\theta_t^2 - \theta_t = \theta_{t-1}^2$ yields
$$\frac{L}{2} \big( \|u^t\|^2 - \|u^{t+1}\|^2 \big) \ge \theta_t^2 \big( F(x^{t+1}) - F^{\mathrm{opt}} \big) - \big( \theta_t^2 - \theta_t \big) \big( F(x^t) - F^{\mathrm{opt}} \big) = \theta_t^2 \big( F(x^{t+1}) - F^{\mathrm{opt}} \big) - \theta_{t-1}^2 \big( F(x^t) - F^{\mathrm{opt}} \big),$$
thus finishing the proof
Convergence for the strongly convex case
$$x^{t+1} = \mathrm{prox}_{\eta_t h} \big( y^t - \eta_t \nabla f(y^t) \big)$$
$$y^{t+1} = x^{t+1} + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \big( x^{t+1} - x^t \big)$$
Theorem 7.6 (Convergence of accelerated proximal gradient methods for the strongly convex case)
Suppose $f$ is $\mu$-strongly convex and $L$-smooth. If $\eta_t \equiv 1/L$, then
$$F(x^t) - F^{\mathrm{opt}} \le \Big( 1 - \frac{1}{\sqrt{\kappa}} \Big)^t \Big( F(x^0) - F^{\mathrm{opt}} + \frac{\mu \|x^0 - x^\star\|^2}{2} \Big)$$
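A minimal sketch of this scheme for the smooth case $h = 0$ (so the prox step reduces to the identity); the quadratic instance is an arbitrary illustration, not from the slides.

```python
import numpy as np

def accelerated_gd_strongly_convex(grad_f, x0, L, mu, num_iters=300):
    """Accelerated gradient iteration with constant momentum (sqrt(kappa)-1)/(sqrt(kappa)+1);
    h = 0 here, so the prox step is the identity."""
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    x, y = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_next = y - grad_f(y) / L
        y = x_next + beta * (x_next - x)
        x = x_next
    return x

# illustrative strongly convex quadratic with mu = 1, L = 100 (minimizer x* = 0)
d = np.linspace(1.0, 100.0, 50)
x_hat = accelerated_gd_strongly_convex(lambda x: d * x, np.ones(50), L=100.0, mu=1.0)
print(np.linalg.norm(x_hat))   # should be tiny: linear convergence at rate about (1 - 1/sqrt(kappa))
```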
A practical issue
• fast convergence requires knowledge of $\kappa = L/\mu$
• in practice, estimating $\mu$ is typically very challenging
• a common observation: ripples / bumps in the traces of cost values
as long as xk œ x0 + span{òf (x0 ),, Òf (xk 1 )} for More precisely, convex and L-smooth function f s.t. 3LÎx0 xú Î 3(t + 1) Rippling behavior Convergence analysis definition of first-order methods Numericalarrive example: take y t+1 = xt+1 +, we immediate at 1 q t+1 1+ q (x vex and Lipschitz continuous (i.e. Îg t Îú Æ Lf ) on C, fl-strongly convex w.r.t. Î Î. Then t f (x ) f opt Ø f (x ) f 4! supxœc DÏ x, x 6 k 8 1 qt q q q q q q q " 0 k=0 =0 = q / = q /3 = q = 3q = q =1! + Lf fl k " qt k=0 period of ripples is often proportional to p L/µ k O Donoghue, Candes 1 with R := supxœc DÏ x, x0, then A Ô B k Lf ofr log 1twith different estimates of q. Algorithm best,t opt Figure 1: Convergence Ô f f ÆO Ô fl t when q > q : we underestimate momentum slower convergence Interpretation. The optimal momentum depends on the condition number of the function; 14 16 0 500 00 1500 000 500 3000 3500 4000 4500 5000 Accelerated GD 1 Ô t result in general f opt Æ 0 xt ); q = 1/κ specifically, higher momentum is required when the function has a higher condition number. Under- estimating the amount of momentum required leads to slower convergence. However we are more 1 q further remove log t factor when q < q : we overestimate momentum ( overshooting / rippling behavior often in the other regime, that of overestimated momentum, because generally q = 0, in which case 1+ βk 1; this corresponds to high momentum and rippling behavior, as we see in Figure 5-371. This can be visually understood in Figure (), which shows the trajectories of sequences generated by Algorithm 1 minimizing a positive definite quadratic in two dimensions, under q = q, the optimal choice of q, and q = 0. The high momentum causes the trajectory to overshoot the minimum and oscillate around it. This causes a rippling in the function values along the trajectory. Later we Accelerated GD shall demonstrate that the period of these ripples is proportional to the square root of the (local) condition number of the function. q is large) 7-38
Adaptive restart (O'Donoghue, Candes '12)
When a certain criterion is met, restart running FISTA with
$$x^0 \leftarrow x^t, \qquad y^0 \leftarrow x^t, \qquad \theta_0 = 1$$
• take the current iteration as a new starting point
• erase all memory of previous iterates and reset the momentum back to zero
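A sketch of both restart criteria discussed on the next slide (restart when the objective increases, or when the momentum points against the gradient direction), written here for the smooth case $h = 0$; the helper name `agd_with_restart` and the test problem are illustrative, not part of the slides.

```python
import numpy as np

def agd_with_restart(f, grad_f, x0, L, num_iters=1000, scheme="gradient"):
    """FISTA-style iteration (h = 0) with adaptive restart, following the
    function / gradient restart criteria of O'Donoghue & Candes (a sketch)."""
    x, y, theta = x0.copy(), x0.copy(), 1.0
    for _ in range(num_iters):
        x_next = y - grad_f(y) / L
        restart = (f(x_next) > f(x)) if scheme == "function" \
                  else (grad_f(y) @ (x_next - x) > 0)
        if restart:
            y, theta = x_next.copy(), 1.0          # erase momentum, restart from current point
        else:
            theta_next = (1 + np.sqrt(1 + 4 * theta**2)) / 2
            y = x_next + (theta - 1) / theta_next * (x_next - x)
            theta = theta_next
        x = x_next
    return x

# illustrative usage on an ill-conditioned quadratic (mu unknown to the algorithm)
d = np.linspace(0.01, 1.0, 100)
f = lambda x: 0.5 * np.sum(d * x**2)
grad_f = lambda x: d * x
print(f(agd_with_restart(f, grad_f, np.ones(100), L=1.0, scheme="gradient")))
```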
as long as xk œ x0 + span{òf (x0 ),, Òf (x 3LÎx0 xú 3(t + 1) More precisely, convex and L-smooth functio Interestingly, no first-order methods can improv result in general 5.3 definition of first-order methods manumerical 5.4, we immediate arrive at comparisons of adaptive restart schemes flR Ô1 Lf t 4 6! " supxœc DÏ x, x0 + 8 k Æ 1 14 qt Lf fl k=0 k qt k=0 k Gradient descent No Restart Restart every 0 Restart every 400 Restart every 00 Restart optimal, 700 With q=µ/l Adaptive Restart Function scheme Adaptive Restart Gradient scheme! " with R := supxœc DÏ x, x0, then A Ô B k Lf Roflog t best,t opt Figure 3: Comparison fixed Ô and adaptive restart intervals. f f ÆO Ô fl function scheme: restart when ft(xt ) > f (xt 1 ) 18 00 400 600 800 00 0 Accelerated GD 16 = Ô f opt t f (x ) f opt Ø f (x ) f f best,t 0 is convex and Lipschitz continuous (i.e. Îg t Îú Æ Lf ) on C, e Ï is fl-strongly convex w.r.t. Î Î. Then 1400 1600 1800 000 gradient scheme: restart when h f (y t 1 ), xt xt 1 i > 0 one can further remove log t factor {z } q=q q = 0 momentum lead us towards5-37 restart when a bad direction 0.1 adaptive restart Accelerated GD 0.05 7-40
Illustration
[Figure: iterate trajectories $(x_1, x_2)$ under $q = q^\star$, $q = 0$, and adaptive restart]
• with overestimated momentum (e.g. $q = 0$), one sees a spiralling trajectory
• adaptive restart helps mitigate this issue
Lower bounds
Optimality of Nesterov's method
Interestingly, no first-order methods can improve upon Nesterov's result in general.
More precisely, there exists a convex and $L$-smooth function $f$ s.t.
$$f(x^t) - f^{\mathrm{opt}} \ge \frac{3L \|x^0 - x^\star\|^2}{32 (t+1)^2}$$
as long as
$$\underbrace{x^k \in x^0 + \mathrm{span}\big\{ \nabla f(x^0), \ldots, \nabla f(x^{k-1}) \big\}}_{\text{definition of first-order methods}} \qquad \text{for all } 1 \le k \le t$$
(Nemirovski, Yudin '83)
Example
$$\text{minimize}_{x \in \mathbb{R}^{2n+1}} \quad f(x) = \frac{L}{4} \Big( \frac{1}{2} x^\top A x - e_1^\top x \Big)$$
where
$$A = \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{bmatrix} \in \mathbb{R}^{(2n+1) \times (2n+1)}$$
• $f$ is convex and $L$-smooth
• the optimizer $x^\star$ is given by $x_i^\star = 1 - \frac{i}{2n+2}$ ($1 \le i \le 2n+1$), obeying $f^{\mathrm{opt}} = \frac{L}{8} \big( \frac{1}{2n+2} - 1 \big)$ and $\|x^\star\|^2 \le \frac{2n+2}{3}$
Example (cont.)
$$\nabla f(x) = \frac{L}{4} A x - \frac{L}{4} e_1$$
$$\mathrm{span}\big\{ \nabla f(x^0), \ldots, \nabla f(x^{k-1}) \big\} = \underbrace{\mathrm{span}\{ e_1, \ldots, e_k \}}_{:= \mathcal{K}_k} \qquad \text{if } x^0 = 0$$
• every iteration of first-order methods expands the search space by at most one dimension
Example (cont.)
If we start with $x^0 = 0$, then
$$f(x^n) \ge \inf_{x \in \mathcal{K}_n} f(x) = \frac{L}{8} \Big( \frac{1}{n+1} - 1 \Big)$$
$$\Longrightarrow \quad \frac{f(x^n) - f^{\mathrm{opt}}}{\|x^0 - x^\star\|^2} \ge \frac{\frac{L}{8} \big( \frac{1}{n+1} - \frac{1}{2n+2} \big)}{\frac{1}{3} (2n+2)} = \frac{3L}{32 (n+1)^2}$$
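The span property used above is easy to check numerically. This sketch (not from the slides) builds the worst-case quadratic and verifies that, starting from $x^0 = 0$, the $k$-th gradient descent iterate is supported on the first $k$ coordinates; the dimension and stepsize are arbitrary.

```python
import numpy as np

# Nesterov's worst-case quadratic: f(x) = (L/4) * (0.5 * x^T A x - x_1),
# with A the tridiagonal matrix having 2 on the diagonal and -1 off-diagonal
n, L = 10, 1.0
dim = 2 * n + 1
A = 2 * np.eye(dim) - np.eye(dim, k=1) - np.eye(dim, k=-1)
grad_f = lambda x: (L / 4) * (A @ x - np.eye(dim)[0])

# starting from x0 = 0, k gradient steps only explore span{e_1, ..., e_k}
x = np.zeros(dim)
for k in range(1, 6):
    x = x - 0.1 * grad_f(x)
    print(k, np.flatnonzero(np.abs(x) > 1e-12).max() + 1)   # last nonzero coordinate stays <= k
```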
Summary: accelerated proximal gradient
• convex & smooth problems: stepsize rule $\eta_t = 1/L$, convergence rate $O\big(\frac{1}{t^2}\big)$, iteration complexity $O\big(\frac{1}{\sqrt{\varepsilon}}\big)$
• strongly convex & smooth problems: stepsize rule $\eta_t = 1/L$, convergence rate $O\big( (1 - \frac{1}{\sqrt{\kappa}})^t \big)$, iteration complexity $O\big( \sqrt{\kappa} \log \frac{1}{\varepsilon} \big)$
Reference
[1] Some methods of speeding up the convergence of iteration methods, B. Polyak, USSR Computational Mathematics and Mathematical Physics, 1964.
[2] A method of solving a convex programming problem with convergence rate O(1/k²), Y. Nesterov, Soviet Mathematics Doklady, 1983.
[3] A fast iterative shrinkage-thresholding algorithm for linear inverse problems, A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences, 2009.
[4] First-order methods in optimization, A. Beck, Vol. 25, SIAM, 2017.
[5] Mathematical optimization, MATH301 lecture notes, E. Candes, Stanford.
[6] Gradient methods for minimizing composite functions, Y. Nesterov, Technical Report, 2007.
Reference
[7] Large-scale numerical optimization, MS&E318 lecture notes, M. Saunders, Stanford.
[8] Proximal algorithms, N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[9] Convex optimization: algorithms and complexity, S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[10] Optimization methods for large-scale systems, EE236C lecture notes, L. Vandenberghe, UCLA.
[11] Problem complexity and method efficiency in optimization, A. Nemirovski and D. Yudin, Wiley, 1983.
[12] Introductory lectures on convex optimization: a basic course, Y. Nesterov, 2004.
[13] A differential equation for modeling Nesterov's accelerated gradient method, W. Su, S. Boyd, and E. Candes, NIPS, 2014.
Reference
[14] On accelerated proximal gradient methods for convex-concave optimization, P. Tseng, 2008.
[15] Adaptive restart for accelerated gradient schemes, B. O'Donoghue and E. Candes, Foundations of Computational Mathematics, 2012.