Accelerated gradient methods


ELE 538B: Large-Scale Optimization for Data Science
Accelerated gradient methods
Yuxin Chen, Princeton University, Spring 2018

Outline
- Heavy-ball methods
- Nesterov's accelerated gradient methods
- Accelerated proximal gradient methods (FISTA)
- Convergence analysis
- Lower bounds

(Proximal) gradient methods

Iteration complexities of (proximal) gradient methods:
- strongly convex and smooth problems: O(\kappa \log(1/\varepsilon))
- convex and smooth problems: O(1/\varepsilon)

Can one still hope to further accelerate convergence?

Issues and possible solutions

Issues:
- GD focuses on improving the cost per iteration, which might be too short-sighted
- GD might sometimes zigzag or experience abrupt changes

Solutions:
- exploit information from history / past iterates
- add a buffer (like momentum) to yield a smoother trajectory

Heavy-ball methods (Polyak '64)

Heavy-ball method (B. Polyak)

minimize_{x \in R^n} \quad f(x)

x^{t+1} = x^t - \eta_t \nabla f(x^t) + \underbrace{\theta_t (x^t - x^{t-1})}_{\text{momentum term}}

- add inertia to the "ball" (i.e. include a momentum term) to mitigate zigzagging
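To make the update concrete, here is a minimal Python sketch of the heavy-ball iteration (an illustrative addition, not from the slides); the quadratic test objective and the parameter choices, taken from Theorem 7.1 below, are assumptions for the example.

    import numpy as np

    def heavy_ball(grad_f, x0, eta, theta, num_iters=500):
        # heavy-ball iteration: x^{t+1} = x^t - eta * grad_f(x^t) + theta * (x^t - x^{t-1})
        x_prev, x = x0.copy(), x0.copy()
        for _ in range(num_iters):
            x_prev, x = x, x - eta * grad_f(x) + theta * (x - x_prev)
        return x

    # toy strongly convex quadratic f(x) = 0.5 x^T Q x (assumed example) with mu = 1, L = 10
    Q = np.diag([1.0, 10.0])
    grad_f = lambda x: Q @ x
    L, mu = 10.0, 1.0
    eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2                    # stepsize as in Theorem 7.1
    theta = ((np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)) ** 2   # momentum as in Theorem 7.1
    x_hb = heavy_ball(grad_f, np.array([5.0, 5.0]), eta, theta)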

Heavy-ball method

[Figure: iterate trajectories converging to x^*, comparing gradient descent (zigzagging path) with the heavy-ball method (smoother path).]

State-space models

One can understand heavy-ball methods through the dynamics of the following dynamical system

\begin{bmatrix} x^{t+1} \\ x^t \end{bmatrix} = \begin{bmatrix} (1+\theta_t) I & -\theta_t I \\ I & 0 \end{bmatrix} \begin{bmatrix} x^t \\ x^{t-1} \end{bmatrix} - \begin{bmatrix} \eta_t \nabla f(x^t) \\ 0 \end{bmatrix}

or equivalently,

\begin{bmatrix} x^{t+1} - x^* \\ x^t - x^* \end{bmatrix} = \begin{bmatrix} (1+\theta_t) I & -\theta_t I \\ I & 0 \end{bmatrix} \begin{bmatrix} x^t - x^* \\ x^{t-1} - x^* \end{bmatrix} - \begin{bmatrix} \eta_t \nabla f(x^t) \\ 0 \end{bmatrix}

= \begin{bmatrix} (1+\theta_t) I - \eta_t \int_0^1 \nabla^2 f(x_\tau)\, d\tau & -\theta_t I \\ I & 0 \end{bmatrix} \begin{bmatrix} x^t - x^* \\ x^{t-1} - x^* \end{bmatrix}

with x_\tau = x^t + \tau (x^* - x^t), where the last line comes from the fundamental theorem of calculus.

System matrix

\begin{bmatrix} x^{t+1} - x^* \\ x^t - x^* \end{bmatrix} = \underbrace{ \begin{bmatrix} (1+\theta_t) I - \eta_t \int_0^1 \nabla^2 f(x_\tau)\, d\tau & -\theta_t I \\ I & 0 \end{bmatrix} }_{:= H_t \ \text{(system matrix)}} \begin{bmatrix} x^t - x^* \\ x^{t-1} - x^* \end{bmatrix}   (7.1)

- implication: convergence of heavy-ball methods depends on the spectrum of the system matrix H_t
- key idea: find an appropriate stepsize \eta_t and momentum coefficient \theta_t to control the spectrum of H_t

Convergence for strongly convex problems

Theorem 7.1 (Convergence of heavy-ball methods for strongly convex functions)
Suppose f is \mu-strongly convex and L-smooth. Set \kappa = L/\mu,
\eta_t \equiv 4 / (\sqrt{L} + \sqrt{\mu})^2 and \theta_t \equiv \max\{ |1 - \sqrt{\eta_t L}|, |1 - \sqrt{\eta_t \mu}| \}^2. Then

\left\| \begin{bmatrix} x^{t+1} - x^* \\ x^t - x^* \end{bmatrix} \right\| \lesssim \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^t \left\| \begin{bmatrix} x^1 - x^* \\ x^0 - x^* \end{bmatrix} \right\|

- iteration complexity: O(\sqrt{\kappa} \log(1/\varepsilon))
- significant improvement over GD: O(\sqrt{\kappa} \log(1/\varepsilon)) vs. O(\kappa \log(1/\varepsilon))
- relies on knowledge of both L and \mu

Proof of Theorem 7.1

In view of (7.1), it suffices to control the spectrum of H_t. Let \lambda_i denote the i-th eigenvalue of \int_0^1 \nabla^2 f(x_\tau)\, d\tau and set \Lambda := \mathrm{diag}(\lambda_1, \ldots, \lambda_n); then (working in the eigenbasis of this matrix)

H_t = \begin{bmatrix} (1+\theta_t) I - \eta_t \Lambda & -\theta_t I \\ I & 0 \end{bmatrix},

whose spectrum is determined by the 2 \times 2 blocks:

\rho(H_t) \le \max_{1 \le i \le n} \rho\left( \begin{bmatrix} 1 + \theta_t - \eta_t \lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} \right).

To finish the proof, it suffices to show

\max_{i} \rho\left( \begin{bmatrix} 1 + \theta_t - \eta_t \lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} \right) \le \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}.   (7.2)

Proof of Theorem 7.1 (cont.)

To show (7.2), note that the two eigenvalues of \begin{bmatrix} 1 + \theta_t - \eta_t \lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} are the roots of

z^2 - (1 + \theta_t - \eta_t \lambda_i) z + \theta_t = 0.

If (1 + \theta_t - \eta_t \lambda_i)^2 \le 4 \theta_t, then the roots of this equation have the same magnitude \sqrt{\theta_t} (as they are either both complex or form a double root). In addition, one can easily check that (1 + \theta_t - \eta_t \lambda_i)^2 \le 4 \theta_t is satisfied if

\theta_t \in \left[ \left( 1 - \sqrt{\eta_t \lambda_i} \right)^2, \; \left( 1 + \sqrt{\eta_t \lambda_i} \right)^2 \right],   (7.3)

which holds if one picks \theta_t = \max\left\{ \left( 1 - \sqrt{\eta_t L} \right)^2, \left( 1 - \sqrt{\eta_t \mu} \right)^2 \right\}.

Proof of Theorem 7.1 (cont.)

With this choice of \theta_t, every eigenvalue of H_t has magnitude at most \sqrt{\theta_t}. Finally, setting \eta_t = \frac{4}{(\sqrt{L} + \sqrt{\mu})^2} ensures \sqrt{\eta_t L} - 1 = 1 - \sqrt{\eta_t \mu}, which yields

\theta_t = \max\left\{ \left( 1 - \frac{2\sqrt{L}}{\sqrt{L} + \sqrt{\mu}} \right)^2, \left( 1 - \frac{2\sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} \right)^2 \right\} = \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^2.

This in turn establishes

\rho(H_t) \le \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}.
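As a quick numerical sanity check on (7.2) and (7.3) (an illustrative addition, not part of the slides), the snippet below forms the 2x2 block for a few eigenvalues \lambda \in [\mu, L] with the stepsize and momentum of Theorem 7.1 and verifies that every eigenvalue of the block has magnitude \sqrt{\theta_t} = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1).

    import numpy as np

    L, mu = 10.0, 1.0
    kappa = L / mu
    eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    theta = max((1 - np.sqrt(eta * L)) ** 2, (1 - np.sqrt(eta * mu)) ** 2)
    target = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

    for lam in [mu, 0.5 * (mu + L), L]:                 # a few eigenvalues in [mu, L]
        block = np.array([[1 + theta - eta * lam, -theta],
                          [1.0,                    0.0]])
        radii = np.abs(np.linalg.eigvals(block))
        # since (7.3) holds, both roots have magnitude sqrt(theta) = target
        assert np.allclose(radii, target), (lam, radii, target)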

Nesterov's accelerated gradient methods

Convex case

minimize_{x \in R^n} \quad f(x)

For strongly convex f, including momentum terms allows one to improve the iteration complexity from O(\kappa \log(1/\varepsilon)) to O(\sqrt{\kappa} \log(1/\varepsilon)).

Can we obtain an improvement for the convex case (without strong convexity) as well?

Nesterov's idea (Nesterov '83)

x^{t+1} = y^t - \eta_t \nabla f(y^t)
y^{t+1} = x^{t+1} + \frac{t}{t+3} \left( x^{t+1} - x^t \right)

- alternates between gradient updates and proper extrapolation
- each iteration takes nearly the same cost as GD
- not a descent method (i.e. we may not have f(x^{t+1}) \le f(x^t))
- one of the most beautiful and mysterious results in optimization...
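For concreteness, here is a minimal Python sketch of this update (illustrative; the least-squares test problem, the stepsize 1/L, and the iteration count are assumptions, not from the slides).

    import numpy as np

    def nesterov_agd(grad_f, x0, eta, num_iters=500):
        # x^{t+1} = y^t - eta * grad_f(y^t);  y^{t+1} = x^{t+1} + t/(t+3) * (x^{t+1} - x^t)
        x, y = x0.copy(), x0.copy()
        for t in range(num_iters):
            x_next = y - eta * grad_f(y)
            y = x_next + (t / (t + 3.0)) * (x_next - x)
            x = x_next
        return x

    # toy smooth convex objective (assumed): f(x) = 0.5 ||Ax - b||^2, stepsize 1/L with L = ||A||_2^2
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
    grad_f = lambda x: A.T @ (A @ x - b)
    x_hat = nesterov_agd(grad_f, np.zeros(20), eta=1.0 / np.linalg.norm(A, 2) ** 2)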

Convergence of Nesterov's accelerated gradient method

Suppose f is convex and L-smooth. If \eta_t \equiv \eta = 1/L, then

f(x^t) - f^{opt} \le \frac{2L \|x^0 - x^*\|^2}{(t+1)^2}

- iteration complexity: O(1/\sqrt{\varepsilon})
- much faster than gradient methods
- we'll provide a proof for the (more general) proximal version later

Interpretation using differential equations

Nesterov's momentum coefficient \frac{t}{t+3} \approx 1 - \frac{3}{t} is particularly mysterious.

Interpretation using differential equations

[Figure: a ball rolling in the potential f, subject to a spring force F = -\nabla f(X) and a time-varying damping force proportional to \frac{3}{\tau} \dot{X}.]

To develop insights into why Nesterov's method works so well, it is helpful to look at its continuous limit (\eta_t \to 0), which is given by a second-order ordinary differential equation (ODE)

\ddot{X}(\tau) + \underbrace{\frac{3}{\tau}}_{\text{damping coefficient}} \dot{X}(\tau) + \underbrace{\nabla f\left( X(\tau) \right)}_{\text{potential}} = 0

(Su, Boyd, Candes '14)

Heuristic derivation of the ODE

To begin with, Nesterov's update rule is equivalent to

\frac{x^{t+1} - x^t}{\sqrt{\eta}} = \frac{t-1}{t+2} \cdot \frac{x^t - x^{t-1}}{\sqrt{\eta}} - \sqrt{\eta}\, \nabla f(y^t)   (7.4)

Let t = \tau / \sqrt{\eta}. Set X(\tau) \approx x^{\tau/\sqrt{\eta}} = x^t and X(\tau + \sqrt{\eta}) \approx x^{t+1}. Then Taylor expansion gives

\frac{x^{t+1} - x^t}{\sqrt{\eta}} \approx \dot{X}(\tau) + \frac{1}{2} \ddot{X}(\tau) \sqrt{\eta}, \qquad \frac{x^t - x^{t-1}}{\sqrt{\eta}} \approx \dot{X}(\tau) - \frac{1}{2} \ddot{X}(\tau) \sqrt{\eta},

which combined with (7.4) yields

\dot{X}(\tau) + \frac{1}{2} \ddot{X}(\tau) \sqrt{\eta} \approx \left( 1 - \frac{3\sqrt{\eta}}{\tau} \right) \left( \dot{X}(\tau) - \frac{1}{2} \ddot{X}(\tau) \sqrt{\eta} \right) - \sqrt{\eta}\, \nabla f\left( X(\tau) \right)

\Longrightarrow \qquad \ddot{X}(\tau) + \frac{3}{\tau} \dot{X}(\tau) + \nabla f\left( X(\tau) \right) \approx 0

Convergence rate of the ODE

\ddot{X} + \frac{3}{\tau} \dot{X} + \nabla f(X) = 0   (7.5)

It is not hard to show that this ODE obeys

f(X(\tau)) - f^{opt} \le O\left( \frac{1}{\tau^2} \right)   (7.6)

which somehow explains Nesterov's O(1/t^2) convergence.

Proof of (7.6)

Define the Lyapunov function (energy function)

\mathcal{E}(\tau) = \tau^2 \left( f(X) - f^{opt} \right) + 2 \left\| X + \frac{\tau}{2} \dot{X} - x^* \right\|^2.

It obeys

\dot{\mathcal{E}} = 2\tau \left( f(X) - f^{opt} \right) + \tau^2 \left\langle \nabla f(X), \dot{X} \right\rangle + 4 \left\langle X + \frac{\tau}{2} \dot{X} - x^*, \; \frac{3}{2} \dot{X} + \frac{\tau}{2} \ddot{X} \right\rangle

\overset{(i)}{=} 2\tau \left( f(X) - f^{opt} \right) - 2\tau \left\langle X - x^*, \nabla f(X) \right\rangle \le 0 \qquad \text{(by convexity)}

where (i) follows by replacing \frac{\tau}{2} \ddot{X} + \frac{3}{2} \dot{X} with -\frac{\tau}{2} \nabla f(X), using (7.5).

This means \mathcal{E} is non-increasing in \tau, and hence

f(X(\tau)) - f^{opt} \overset{\text{(defn of } \mathcal{E})}{\le} \frac{\mathcal{E}(\tau)}{\tau^2} \le \frac{\mathcal{E}(0)}{\tau^2} = O\left( \frac{1}{\tau^2} \right).
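As an illustrative numerical check (not from the slides), the script below integrates the ODE (7.5) for an assumed toy quadratic with a simple semi-implicit Euler scheme, starting at some \tau_0 > 0 to avoid the singularity of 3/\tau, and then reports that the energy has not increased and that f(X(\tau)) - f^{opt} is of order \mathcal{E}(\tau_0)/\tau^2.

    import numpy as np

    # toy convex objective f(x) = 0.5 x^T Q x with minimizer x* = 0 (assumed example)
    Q = np.diag([1.0, 25.0])
    f = lambda x: 0.5 * x @ Q @ x
    grad_f = lambda x: Q @ x
    x_star, f_opt = np.zeros(2), 0.0

    def energy(tau, X, V):
        # E(tau) = tau^2 (f(X) - f_opt) + 2 ||X + (tau/2) V - x*||^2
        return tau ** 2 * (f(X) - f_opt) + 2 * np.linalg.norm(X + 0.5 * tau * V - x_star) ** 2

    dt, tau = 1e-3, 0.1                        # start at tau0 = 0.1 > 0
    X, V = np.array([1.0, 1.0]), np.zeros(2)   # X(tau0) and Xdot(tau0)
    E0 = energy(tau, X, V)

    for _ in range(50000):
        # semi-implicit Euler step for X'' + (3/tau) X' + grad f(X) = 0
        V = V + dt * (-(3.0 / tau) * V - grad_f(X))
        X = X + dt * V
        tau += dt

    print(energy(tau, X, V) <= E0, f(X) - f_opt, E0 / tau ** 2)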

Magic number 3

\ddot{X} + \frac{3}{\tau} \dot{X} + \nabla f(X) = 0

- 3 is the smallest constant that guarantees O(1/\tau^2) convergence, and can be replaced by any other \alpha \ge 3
- in some sense, 3 minimizes the pre-constant in the convergence bound O(1/\tau^2) (see Su, Boyd, Candes '14)

Numerical example

Example taken from UCLA EE236C (Vandenberghe):

minimize_x \quad \log \sum_{i=1}^{m} \exp(a_i^\top x + b_i)

- randomly generated problem instance with m = 2000, n = 200
- step size 1/L used for both the gradient method and FISTA

[Figure: relative error (f(x^{(k)}) - f^{opt}) / f^{opt} versus iteration k; FISTA converges much faster than the gradient method.]

Extension to composite models

minimize_{x \in R^n} \quad F(x) := f(x) + h(x)

- f: convex and smooth
- h: convex (may not be differentiable)
- let F^{opt} := \min_x F(x) be the optimal cost

FISTA (Beck & Teboulle '09)

Fast iterative shrinkage-thresholding algorithm:

x^{t+1} = prox_{\eta_t h}\left( y^t - \eta_t \nabla f(y^t) \right)
y^{t+1} = x^{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} \left( x^{t+1} - x^t \right)

where y^0 = x^0, \theta_0 = 1 and \theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}

- momentum coefficient originally proposed by Nesterov '83
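To make the recursion concrete, here is a minimal FISTA sketch for the lasso, where h(x) = \lambda \|x\|_1 and prox_{\eta h} is elementwise soft-thresholding (an illustrative addition; the data, \lambda, and iteration count are assumptions, not from the slides).

    import numpy as np

    def soft_threshold(v, t):
        # prox of t * ||.||_1 : elementwise soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def fista_lasso(A, b, lam, num_iters=500):
        # FISTA for minimize 0.5 ||Ax - b||^2 + lam * ||x||_1, stepsize eta = 1/L
        L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
        eta = 1.0 / L
        x = np.zeros(A.shape[1])
        y, theta = x.copy(), 1.0
        for _ in range(num_iters):
            x_next = soft_threshold(y - eta * A.T @ (A @ y - b), eta * lam)
            theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
            y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)
            x, theta = x_next, theta_next
        return x

    rng = np.random.default_rng(1)
    A, b = rng.standard_normal((100, 300)), rng.standard_normal(100)
    x_hat = fista_lasso(A, b, lam=0.1)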

Momentum coefficient

\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}, \qquad \theta_0 = 1

[Figure: the momentum coefficient \frac{\theta_t - 1}{\theta_{t+1}} as a function of t, increasing from 0 towards 1.]

- \frac{\theta_t - 1}{\theta_{t+1}} = 1 - \frac{3}{t} + o\left( \frac{1}{t} \right)   (homework)
- asymptotically equivalent to \frac{t}{t+3}

Momentum coefficient

\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}, \qquad \theta_0 = 1

Fact 7.2
For all t \ge 1, one has \theta_t \ge \frac{t+2}{2}   (homework)
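A short numerical check of Fact 7.2 and of the asymptotics of the momentum coefficient (illustrative, not from the slides):

    theta, prev = 1.0, 1.0                    # theta_0 = 1
    for t in range(1, 2001):
        prev, theta = theta, (1.0 + (1.0 + 4.0 * theta ** 2) ** 0.5) / 2.0
        assert theta >= (t + 2) / 2.0         # Fact 7.2: theta_t >= (t + 2)/2
    # momentum coefficient (theta_{t-1} - 1)/theta_t near t = 2000 vs. the approximation 1 - 3/t
    print((prev - 1.0) / theta, 1.0 - 3.0 / 2000)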

Convergence analysis

Convergence for convex problems

Theorem 7.3 (Convergence of accelerated proximal gradient methods for convex problems)
Suppose f is convex and L-smooth. If \eta_t \equiv 1/L, then

F(x^t) - F^{opt} \le \frac{2L \|x^0 - x^*\|^2}{(t+1)^2}

- improved iteration complexity (i.e. O(1/\sqrt{\varepsilon})) compared with the proximal gradient method (i.e. O(1/\varepsilon))
- fast if the prox can be efficiently implemented

Recap: fundamental inequality for the proximal method

Recall the following fundamental inequality shown in the last lecture:

Lemma 7.4
Let y^+ = prox_{\frac{1}{L} h}\left( y - \frac{1}{L} \nabla f(y) \right). Then

F(y^+) - F(x) \le \frac{L}{2} \|x - y\|^2 - \frac{L}{2} \|x - y^+\|^2
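As an illustrative sanity check (not part of the slides), the snippet below verifies Lemma 7.4 numerically on an assumed random least-squares-plus-l1 instance, using soft-thresholding as the prox.

    import numpy as np

    rng = np.random.default_rng(2)
    A, b, lam = rng.standard_normal((40, 30)), rng.standard_normal(40), 0.5
    L = np.linalg.norm(A, 2) ** 2

    f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
    grad_f = lambda x: A.T @ (A @ x - b)
    F = lambda x: f(x) + lam * np.linalg.norm(x, 1)
    prox = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)   # prox of t * lam * ||.||_1

    for _ in range(100):
        x, y = rng.standard_normal(30), rng.standard_normal(30)
        y_plus = prox(y - grad_f(y) / L, 1.0 / L)
        lhs = F(y_plus) - F(x)
        rhs = L / 2 * np.linalg.norm(x - y) ** 2 - L / 2 * np.linalg.norm(x - y_plus) ** 2
        assert lhs <= rhs + 1e-8                                            # Lemma 7.4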

Proof of Theorem 7.3

1. build a discrete-time version of the Lyapunov function
2. magic happens! the Lyapunov function is non-increasing when Nesterov's momentum coefficient is adopted

Proof of Theorem 7.3

Key lemma: monotonicity of a certain Lyapunov function.

Lemma 7.5
Let u^t := \theta_{t-1} x^t - \left( x^* + (\theta_{t-1} - 1) x^{t-1} \right) = \theta_{t-1} (x^t - x^*) - (\theta_{t-1} - 1)(x^{t-1} - x^*). Then

\|u^{t+1}\|^2 + \frac{2}{L} \theta_t^2 \left( F(x^{t+1}) - F^{opt} \right) \le \|u^t\|^2 + \frac{2}{L} \theta_{t-1}^2 \left( F(x^t) - F^{opt} \right)

- quite similar to the Lyapunov function 2 \left\| X + \frac{\tau}{2} \dot{X} - x^* \right\|^2 + \tau^2 \left( f(X) - f^{opt} \right) discussed before (think about \theta_t \approx t/2)

Proof of Theorem 7.3 (cont.)

With Lemma 7.5 in place, one has

\frac{2}{L} \theta_{t-1}^2 \left( F(x^t) - F^{opt} \right) \le \|u^1\|^2 + \frac{2}{L} \theta_0^2 \left( F(x^1) - F^{opt} \right) = \|x^1 - x^*\|^2 + \frac{2}{L} \left( F(x^1) - F^{opt} \right)

To bound the RHS of this inequality, we use Lemma 7.4 (with x = x^*, y = y^0) and y^0 = x^0 to get

\frac{2}{L} \left( F(x^1) - F^{opt} \right) \le \|y^0 - x^*\|^2 - \|x^1 - x^*\|^2 = \|x^0 - x^*\|^2 - \|x^1 - x^*\|^2

\Longrightarrow \qquad \|x^1 - x^*\|^2 + \frac{2}{L} \left( F(x^1) - F^{opt} \right) \le \|x^0 - x^*\|^2

As a result,

\frac{2}{L} \theta_{t-1}^2 \left( F(x^t) - F^{opt} \right) \le \|x^0 - x^*\|^2
\qquad \Longrightarrow \qquad
F(x^t) - F^{opt} \le \frac{L \|x^0 - x^*\|^2}{2 \theta_{t-1}^2} \overset{\text{(Fact 7.2)}}{\le} \frac{2L \|x^0 - x^*\|^2}{(t+1)^2}

Proof of Lemma 7.5

Take x = \theta_t^{-1} x^* + (1 - \theta_t^{-1}) x^t and y = y^t in Lemma 7.4 to get

F(x^{t+1}) - F\left( \theta_t^{-1} x^* + (1 - \theta_t^{-1}) x^t \right)
\le \frac{L}{2} \left\| \theta_t^{-1} x^* + (1 - \theta_t^{-1}) x^t - y^t \right\|^2 - \frac{L}{2} \left\| \theta_t^{-1} x^* + (1 - \theta_t^{-1}) x^t - x^{t+1} \right\|^2   (7.7)

= \frac{L}{2 \theta_t^2} \left\| x^* + (\theta_t - 1) x^t - \theta_t y^t \right\|^2 - \frac{L}{2 \theta_t^2} \underbrace{ \left\| x^* + (\theta_t - 1) x^t - \theta_t x^{t+1} \right\|^2 }_{= \|u^{t+1}\|^2}

\overset{(i)}{=} \frac{L}{2 \theta_t^2} \left( \|u^t\|^2 - \|u^{t+1}\|^2 \right),   (7.8)

where (i) follows from the definition of u^t and y^t = x^t + \frac{\theta_{t-1} - 1}{\theta_t} (x^t - x^{t-1}).

Proof of Lemma 7.5 (cont.)

We will also lower bound (7.7). By convexity of F,

F\left( \theta_t^{-1} x^* + (1 - \theta_t^{-1}) x^t \right) \le \theta_t^{-1} F(x^*) + (1 - \theta_t^{-1}) F(x^t) = \theta_t^{-1} F^{opt} + (1 - \theta_t^{-1}) F(x^t)

\Longrightarrow \qquad F(x^{t+1}) - F\left( \theta_t^{-1} x^* + (1 - \theta_t^{-1}) x^t \right) \ge \left( F(x^{t+1}) - F^{opt} \right) - \left( 1 - \theta_t^{-1} \right) \left( F(x^t) - F^{opt} \right)

Combining this with (7.8) and \theta_t^2 - \theta_t = \theta_{t-1}^2 yields

\frac{L}{2} \left( \|u^t\|^2 - \|u^{t+1}\|^2 \right) \ge \theta_t^2 \left( F(x^{t+1}) - F^{opt} \right) - \left( \theta_t^2 - \theta_t \right) \left( F(x^t) - F^{opt} \right) = \theta_t^2 \left( F(x^{t+1}) - F^{opt} \right) - \theta_{t-1}^2 \left( F(x^t) - F^{opt} \right),

thus finishing the proof.

Convergence for strongly convex case

x^{t+1} = prox_{\eta_t h}\left( y^t - \eta_t \nabla f(y^t) \right)
y^{t+1} = x^{t+1} + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \left( x^{t+1} - x^t \right)

Theorem 7.6 (Convergence of accelerated proximal gradient methods for strongly convex case)
Suppose f is \mu-strongly convex and L-smooth. If \eta_t \equiv 1/L, then

F(x^t) - F^{opt} \le \left( 1 - \frac{1}{\sqrt{\kappa}} \right)^t \left( F(x^0) - F^{opt} + \frac{\mu}{2} \|x^0 - x^*\|^2 \right)
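A minimal sketch of this constant-momentum variant (illustrative; the strongly convex least-squares instance and the choice h = \lambda \|\cdot\|_1 are assumptions, not from the slides):

    import numpy as np

    def acc_prox_grad_sc(grad_f, prox_h, x0, L, mu, num_iters=300):
        # accelerated proximal gradient with constant momentum (sqrt(kappa)-1)/(sqrt(kappa)+1)
        kappa = L / mu
        beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
        x, y = x0.copy(), x0.copy()
        for _ in range(num_iters):
            x_next = prox_h(y - grad_f(y) / L, 1.0 / L)
            y = x_next + beta * (x_next - x)
            x = x_next
        return x

    # assumed example: f(x) = 0.5 ||Ax - b||^2 is strongly convex since A has full column rank
    rng = np.random.default_rng(3)
    A, b, lam = rng.standard_normal((200, 50)), rng.standard_normal(200), 0.1
    grad_f = lambda x: A.T @ (A @ x - b)
    sv = np.linalg.svd(A, compute_uv=False)
    L, mu = sv[0] ** 2, sv[-1] ** 2
    prox_h = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)
    x_hat = acc_prox_grad_sc(grad_f, prox_h, np.zeros(50), L, mu)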

A practical issue

- fast convergence requires knowledge of \kappa = L/\mu; in practice, estimating \mu is typically very challenging
- a common observation: ripples / bumps in the traces of the cost values

Rippling behavior

Numerical example: take

y^{t+1} = x^{t+1} + \frac{1 - \sqrt{q}}{1 + \sqrt{q}} \left( x^{t+1} - x^t \right), \qquad q^* := 1/\kappa = \mu/L

[Figure 1 (O'Donoghue, Candes '12): convergence of the accelerated scheme under several estimates of q above and below q^*.]

- the period of the ripples is often proportional to \sqrt{L/\mu}
- when q > q^*: we underestimate the momentum, which results in slower convergence
- when q < q^*: we overestimate the momentum, which results in overshooting / rippling behavior; this is the common regime, since one often takes q = 0 (high momentum)

(O'Donoghue, Candes '12)

Adaptive restart (O'Donoghue, Candes '12)

When a certain criterion is met, restart the FISTA iteration with

x^0 \leftarrow x^t, \qquad y^0 \leftarrow x^t, \qquad \theta_0 = 1

- take the current iterate as the new starting point
- erase all memory of previous iterates and reset the momentum back to zero

Numerical comparisons of adaptive restart schemes

[Figure 3 (O'Donoghue, Candes '12): comparison of gradient descent, no restart, several fixed restart intervals, the optimal fixed restart interval, restart with q = \mu/L, and the adaptive restart schemes.]

- function scheme: restart when f(x^t) > f(x^{t-1})
- gradient scheme: restart when \langle \nabla f(y^{t-1}), x^t - x^{t-1} \rangle > 0, i.e. when the momentum leads us towards a bad direction
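Below is a minimal sketch of FISTA with the gradient restart scheme described above (illustrative; the lasso instance is an assumption, not from the slides).

    import numpy as np

    def fista_gradient_restart(grad_f, prox_h, x0, L, num_iters=1000):
        # FISTA that restarts (theta <- 1, momentum erased) whenever
        # <grad f(y^{t-1}), x^t - x^{t-1}> > 0, i.e. momentum points in a bad direction
        x, y, theta = x0.copy(), x0.copy(), 1.0
        for _ in range(num_iters):
            g = grad_f(y)
            x_next = prox_h(y - g / L, 1.0 / L)
            if g @ (x_next - x) > 0:
                y, theta = x_next.copy(), 1.0
            else:
                theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
                y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)
                theta = theta_next
            x = x_next
        return x

    # assumed lasso instance, with soft-thresholding as prox of lam * ||.||_1
    rng = np.random.default_rng(4)
    A, b, lam = rng.standard_normal((100, 300)), rng.standard_normal(100), 0.1
    grad_f = lambda x: A.T @ (A @ x - b)
    L = np.linalg.norm(A, 2) ** 2
    prox_h = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)
    x_hat = fista_gradient_restart(grad_f, prox_h, np.zeros(300), L)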

Illustration

[Figure (O'Donoghue, Candes '12): trajectories of the iterates for a two-dimensional quadratic, comparing q = q^*, q = 0, and adaptive restart.]

- with overestimated momentum (e.g. q = 0), one sees a spiralling trajectory
- adaptive restart helps mitigate this issue

Lower bounds

Optimality of Nesterov's method

Interestingly, no first-order method can improve upon Nesterov's result in general.

More precisely, there exists a convex and L-smooth function f such that

f(x^t) - f^{opt} \ge \frac{3L \|x^0 - x^*\|^2}{32 (t+1)^2}

as long as

x^k \in x^0 + \mathrm{span}\{ \nabla f(x^0), \ldots, \nabla f(x^{k-1}) \} \quad \text{for all } 1 \le k \le t   (definition of first-order methods)

(Nemirovski, Yudin '83)

Example

minimize_{x \in R^{2n+1}} \quad f(x) = \frac{L}{4} \left( \frac{1}{2} x^\top A x - e_1^\top x \right)

where

A = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{bmatrix} \in R^{(2n+1) \times (2n+1)}

- f is convex and L-smooth
- the optimizer x^* is given by x_i^* = 1 - \frac{i}{2n+2} \; (1 \le i \le 2n+1), obeying
  f^{opt} = \frac{L}{8} \left( -1 + \frac{1}{2n+2} \right) and \|x^*\|^2 \le \frac{2n+2}{3}

Example (cont.)

\nabla f(x) = \frac{L}{4} A x - \frac{L}{4} e_1

\mathrm{span}\{ \nabla f(x^0), \ldots, \nabla f(x^{k-1}) \} = \mathrm{span}\{ e_1, \ldots, e_k \} =: \mathcal{K}_k \qquad \text{if } x^0 = 0

- every iteration of a first-order method expands the search space by at most one dimension

Example (cont.)

If we start with x^0 = 0, then

f(x^n) \ge \inf_{x \in \mathcal{K}_n} f(x) = \frac{L}{8} \left( -1 + \frac{1}{n+1} \right)

\Longrightarrow \qquad \frac{f(x^n) - f^{opt}}{\|x^0 - x^*\|^2} \ge \frac{ \frac{L}{8} \left( \frac{1}{n+1} - \frac{1}{2n+2} \right) }{ \frac{2n+2}{3} } = \frac{3L}{32 (n+1)^2}
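The following illustrative script (not from the slides) builds this worst-case instance, runs plain gradient descent from x^0 = 0 (which satisfies the span condition), and checks that the gap after n iterations is at least 3L \|x^0 - x^*\|^2 / (32 (n+1)^2).

    import numpy as np

    n, L = 50, 1.0
    N = 2 * n + 1
    A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)    # tridiagonal worst-case matrix
    e1 = np.zeros(N); e1[0] = 1.0

    f = lambda x: (L / 4) * (0.5 * x @ A @ x - e1 @ x)
    grad_f = lambda x: (L / 4) * (A @ x - e1)

    x_star = 1.0 - np.arange(1, N + 1) / (2 * n + 2)        # x*_i = 1 - i/(2n+2)
    f_opt = f(x_star)

    # gradient descent (a first-order method) for n iterations, starting from x^0 = 0
    x = np.zeros(N)
    for _ in range(n):
        x = x - (1.0 / L) * grad_f(x)

    lower_bound = 3 * L * np.linalg.norm(x_star) ** 2 / (32 * (n + 1) ** 2)
    print(f(x) - f_opt >= lower_bound)                      # expected: True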

Summary: accelerated proximal gradient

                                       stepsize rule     convergence rate              iteration complexity
  convex & smooth problems             \eta_t = 1/L      O(1/t^2)                      O(1/\sqrt{\varepsilon})
  strongly convex & smooth problems    \eta_t = 1/L      O((1 - 1/\sqrt{\kappa})^t)    O(\sqrt{\kappa} \log(1/\varepsilon))

Reference

[1] Some methods of speeding up the convergence of iteration methods, B. Polyak, USSR Computational Mathematics and Mathematical Physics, 1964.
[2] A method of solving a convex programming problem with convergence rate O(1/k^2), Y. Nesterov, Soviet Mathematics Doklady, 1983.
[3] A fast iterative shrinkage-thresholding algorithm for linear inverse problems, A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences, 2009.
[4] First-order methods in optimization, A. Beck, Vol. 25, SIAM, 2017.
[5] Mathematical optimization, MATH301 lecture notes, E. Candes, Stanford.
[6] Gradient methods for minimizing composite functions, Y. Nesterov, Technical Report, 2007.

Reference

[7] Large-scale numerical optimization, MS&E318 lecture notes, M. Saunders, Stanford.
[8] Proximal algorithms, N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[9] Convex optimization: algorithms and complexity, S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[10] Optimization methods for large-scale systems, EE236C lecture notes, L. Vandenberghe, UCLA.
[11] Problem complexity and method efficiency in optimization, A. Nemirovski and D. Yudin, Wiley, 1983.
[12] Introductory lectures on convex optimization: a basic course, Y. Nesterov, 2004.
[13] A differential equation for modeling Nesterov's accelerated gradient method, W. Su, S. Boyd, and E. Candes, NIPS, 2014.

Reference

[14] On accelerated proximal gradient methods for convex-concave optimization, P. Tseng, 2008.
[15] Adaptive restart for accelerated gradient schemes, B. O'Donoghue and E. Candes, Foundations of Computational Mathematics, 2012.