NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS

1. Introduction. We consider first-order methods for smooth, unconstrained optimization:

(1.1)  minimize_{x ∈ R^n} f(x),

where f : R^n → R. We assume the optimum of f exists and is unique. By first-order, we are referring to methods that use only function value and gradient information. Compared to second-order methods (e.g. Newton-type methods, interior-point methods), first-order methods usually converge more slowly, and their performance degrades on ill-conditioned problems. However, they are well-suited to large-scale problems because each iteration usually requires only vector operations,¹ in addition to the evaluation of f(x) and ∇f(x).

2. Gradient descent. The simplest first-order method is gradient descent or steepest descent. It starts at an initial point x_0 and generates successive iterates by

(2.1)  x_{k+1} ← x_k − α_k ∇f(x_k).

The step size α_k is either fixed or chosen by a line search that (approximately) minimizes f(x_k − α∇f(x_k)). For small to medium-scale problems, each iteration is inexpensive, requiring only vector operations. However, the iterates may converge slowly to the optimum x*.

Let's take a closer look at the convergence of gradient descent. For now, fix a step size α and consider gradient descent as a fixed point iteration: x_{k+1} ← G_α(x_k), where G_α(x) := x − α∇f(x) is a contraction:

‖G_α(x) − G_α(y)‖ ≤ L_G ‖x − y‖ for some L_G < 1.

Since the optimum x* is a fixed point of G_α, we expect the fixed point iteration x_{k+1} ← G_α(x_k) to converge to the fixed point.

¹ Second-order methods usually require the solution of linear system(s) at each iteration, making them prohibitively expensive for very large problems.
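To make the iteration concrete, here is a minimal sketch of gradient descent with a fixed step size; the NumPy implementation and the quadratic test problem are our own illustrative choices, not part of the original notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_iters=100):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k), as in (2.1)."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Illustrative problem (ours): f(x) = 0.5 x^T A x with eigenvalues 1 and 10.
A = np.diag([1.0, 10.0])
x = gradient_descent(lambda x: A @ x, np.array([-1.0, 1.0]), alpha=2 / 11)
```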

Lemma 2.1. When the gradient step G_α(x) is a contraction, gradient descent converges linearly to x*:

‖x_k − x*‖ ≤ L_G^k ‖x_0 − x*‖.

Proof. We begin by showing that each iteration of gradient descent is a contraction:

‖x_{k+1} − x*‖ = ‖x_k − α∇f(x_k) − (x* − α∇f(x*))‖ = ‖G_α(x_k) − G_α(x*)‖ ≤ L_G ‖x_k − x*‖.

By induction, ‖x_k − x*‖ ≤ L_G^k ‖x_0 − x*‖. ∎

At first blush, it seems like gradient descent converges quickly. However, the contraction coefficient L_G depends poorly on the condition number κ := L/µ of the problem. In particular, L_G ≈ 1 − 2/κ with an optimal step size. From now on, we assume the function f is convex.

Lemma 2.2. When f : R^n → R is
1. twice continuously differentiable,
2. strongly convex with constant µ > 0:
(2.2)  f(y) ≥ f(x) + ∇f(x)^T (y − x) + (µ/2) ‖y − x‖²,
3. strongly smooth with constant L ≥ 0:
(2.3)  f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖²,
the contraction coefficient L_G is at most max{|1 − αµ|, |1 − αL|}.

Proof. By the mean value theorem,

‖G_α(x) − G_α(y)‖ = ‖x − α∇f(x) − (y − α∇f(y))‖ = ‖(I − α∇²f(z))(x − y)‖ ≤ ‖I − α∇²f(z)‖ ‖x − y‖

for some z on the line segment between x and y. By strong convexity and smoothness, the eigenvalues of ∇²f(z) are between µ and L. Thus ‖I − α∇²f(z)‖ ≤ max{|1 − αµ|, |1 − αL|}. ∎
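As a sanity check on Lemma 2.2, the sketch below (our own illustration, on an assumed diagonal quadratic) verifies numerically that the step size α = 2/(L + µ), introduced on the next page, yields the contraction factor (κ − 1)/(κ + 1).

```python
import numpy as np

# Our own check: for f(x) = 0.5 x^T A x with diagonal A, G_alpha(x) = (I - alpha A) x,
# so the contraction coefficient is the spectral norm of I - alpha A. With
# alpha = 2/(L + mu) it equals (kappa - 1)/(kappa + 1), which matches
# max{|1 - alpha mu|, |1 - alpha L|} from Lemma 2.2.
mu, L = 1.0, 10.0
A = np.diag(np.linspace(mu, L, 5))        # eigenvalues between mu and L
alpha = 2 / (L + mu)
contraction = np.linalg.norm(np.eye(5) - alpha * A, 2)
kappa = L / mu
assert np.isclose(contraction, (kappa - 1) / (kappa + 1))
```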

Fig 1. The iterates of gradient descent (left panel) and the heavy ball method (right panel) starting at (−1, 1).

We combine Lemmas 2.1 and 2.2 to deduce

‖x_k − x*‖ ≤ max{|1 − αµ|, |1 − αL|}^k ‖x_0 − x*‖.

The step size α = 2/(L + µ) minimizes the contraction coefficient. Substituting the optimal step size, we conclude that

‖x_k − x*‖ ≤ ((κ − 1)/(κ + 1))^k ‖x_0 − x*‖,

where κ := L/µ is a condition number of the problem. When κ is large, the contraction coefficient is roughly 1 − 2/κ. In other words, gradient descent requires at most O(κ log(1/ɛ)) iterations to attain an ɛ-accurate solution.

3. The heavy ball method. The heavy ball method is usually attributed to Polyak (1964). The intuition is simple: the iterates of gradient descent tend to bounce between the walls of narrow valleys on the objective surface. The left panel of Figure 1 shows the iterates of gradient descent bouncing from wall to wall. To avoid bouncing, the heavy ball method adds a momentum term to the gradient step:

x_{k+1} ← x_k − α_k ∇f(x_k) + β_k (x_k − x_{k−1}).

The term x_k − x_{k−1} nudges x_{k+1} in the direction of the previous step (hence momentum). The right panel of Figure 1 shows the effect of adding momentum.
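The following is a minimal sketch of the heavy ball iteration; it is our own illustration, and the parameter choices anticipate Theorem 3.2 below.

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha, beta, num_iters=100):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
    return x

# Fixed parameters from Theorem 3.2, on an assumed quadratic with mu = 1, L = 10.
mu, L = 1.0, 10.0
alpha = 4 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = max(abs(1 - np.sqrt(alpha * mu)), abs(1 - np.sqrt(alpha * L))) ** 2
A = np.diag([mu, L])
x = heavy_ball(lambda x: A @ x, np.array([-1.0, 1.0]), alpha, beta)
```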

Let's study the heavy ball method and compare it to gradient descent. Since each iterate depends on the previous two iterates, we study the three-term recurrence in matrix form (with a fixed step size α and momentum parameter β):

    [ x_{k+1} − x* ]   [ x_k + β(x_k − x_{k−1}) − α∇f(x_k) − x* ]
    [ x_k − x*     ] = [ x_k − x*                               ]

                       [ (1+β)I   −βI ] [ x_k − x*     ]   [ α∇²f(z_k)(x_k − x*) ]
                     = [ I          0 ] [ x_{k−1} − x* ] − [ 0                   ]

for some z_k on the line segment between x_k and x* (by the mean value theorem, ∇f(x_k) = ∇²f(z_k)(x_k − x*) since ∇f(x*) = 0)

                       [ (1+β)I − α∇²f(z_k)   −βI ] [ x_k − x*     ]
    (3.1)            = [ I                      0 ] [ x_{k−1} − x* ].

The contraction coefficient is given by the norm of the block matrix in (3.1). We introduce a lemma to bound the norm.

Lemma 3.1. Under the conditions of Lemma 2.2,

    ‖ [ (1+β)I − α∇²f(z)   −βI ] ‖
    ‖ [ I                    0 ] ‖ ≤ max{|1 − √(αµ)|, |1 − √(αL)|}

for β = max{|1 − √(αµ)|, |1 − √(αL)|}².

Proof. Let Λ := diag(λ_1, ..., λ_n), where {λ_i}_{i ∈ [1:n]} are the eigenvalues of ∇²f(z). By diagonalization,

    ‖ [ (1+β)I − α∇²f(z)   −βI ] ‖     ‖ [ (1+β)I − αΛ   −βI ] ‖                    ‖ [ 1+β−αλ_i   −β ] ‖
    ‖ [ I                    0 ] ‖  =  ‖ [ I               0 ] ‖  =  max_{i∈[1:n]}  ‖ [ 1            0 ] ‖.

The second equality holds because it is possible to permute the matrix into a block diagonal matrix with 2×2 blocks. For each i ∈ [1:n], the eigenvalues of the 2×2 matrices are given by the roots of

p_i(λ) := λ² − (1 + β − αλ_i)λ + β = 0.

It is possible to show that when β is at least (1 − √(αλ_i))², the roots of p_i are complex and both have magnitude √β. Since β := max{|1 − √(αµ)|, |1 − √(αL)|}², the magnitudes of the roots are at most √β. ∎
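The key step in the proof — that each 2×2 block has eigenvalues of magnitude √β once β ≥ (1 − √(αλ_i))² — is easy to check numerically; the snippet below is our own illustration.

```python
import numpy as np

# For eigenvalues lam between mu and L, check that the 2x2 block
# [[1 + beta - alpha*lam, -beta], [1, 0]] has eigenvalues of magnitude
# at most sqrt(beta), as in the proof of Lemma 3.1.
mu, L = 1.0, 10.0
alpha = 4 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = max(abs(1 - np.sqrt(alpha * mu)), abs(1 - np.sqrt(alpha * L))) ** 2
for lam in np.linspace(mu, L, 7):
    M = np.array([[1 + beta - alpha * lam, -beta], [1.0, 0.0]])
    assert np.all(np.abs(np.linalg.eigvals(M)) <= np.sqrt(beta) + 1e-12)
```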

Theorem 3.2. Under the conditions of Lemma 2.2, the heavy ball method with fixed parameters α = 4/(√L + √µ)² and β = max{|1 − √(αµ)|, |1 − √(αL)|}² converges linearly to x*:

‖x_{k+1} − x*‖ ≤ ((√κ − 1)/(√κ + 1))^k ‖x_0 − x*‖,

where κ = L/µ.

Proof. By Lemma 3.1,

    ‖ [ x_{k+1} − x* ] ‖                                          ‖ [ x_k − x*     ] ‖
    ‖ [ x_k − x*     ] ‖  ≤  max{|1 − √(αµ)|, |1 − √(αL)|}  ·     ‖ [ x_{k−1} − x* ] ‖.

Substituting α = 4/(√L + √µ)² gives

    ‖ [ x_{k+1} − x* ] ‖     √κ − 1     ‖ [ x_k − x*     ] ‖
    ‖ [ x_k − x*     ] ‖  ≤  ────── ·   ‖ [ x_{k−1} − x* ] ‖,
                             √κ + 1

where κ := L/µ. We induct on k to obtain the stated result. ∎

The heavy ball method also converges linearly, but its contraction coefficient is roughly 1 − 2/√κ. In other words, the heavy ball method attains an ɛ-accurate solution after at most O(√κ log(1/ɛ)) iterations. Figure 2 compares the convergence of gradient descent and the heavy ball method.

Unlike gradient descent, the heavy ball method is not a descent method, i.e. f(x_{k+1}) is not necessarily smaller than f(x_k). On a related note, the conjugate gradient method (CG) is an instance of the heavy ball method with adaptive step sizes and momentum parameters. However, CG does not require knowledge of L and µ to choose step sizes.

4. Accelerated gradient methods.

4.1. Gradient descent in the absence of strong convexity. Before delving into Nesterov's celebrated method, we study gradient descent in the setting he considered. Nesterov studied (1.1) when f is continuously differentiable and strongly smooth. In the absence of strong convexity, (1.1) may not have a unique optimum, so we cannot expect the iterates to always converge to a point x*. (The iterates may converge to different points depending on the initial point x_0.) However, if f remains convex, we expect the optimality gap f(x_k) − f* to converge to zero. To begin, we study the convergence rate of gradient descent in the absence of strong convexity.

Fig 2. Convergence of gradient descent and heavy ball function values on a strongly convex function. Although the heavy ball function values are not monotone, they converge faster.

Theorem 4.1. When f : R^n → R is convex and
1. continuously differentiable,
2. strongly smooth with constant L,
gradient descent with step size 1/L satisfies

f(x_k) − f* ≤ (L/(2k)) ‖x_0 − x*‖².

Proof. The proof consists of two parts. First, we show that, at each step, gradient descent reduces the function value by at least a certain amount. In other words, we show each iteration of gradient descent achieves sufficient descent. Recall x_{k+1} ← x_k − α∇f(x_k). By the strong smoothness of f,

f(x_{k+1}) ≤ f(x_k) − α(1 − αL/2) ‖∇f(x_k)‖².

When α = 1/L, the right side simplifies to

(4.1)  f(x_{k+1}) ≤ f(x_k) − (1/(2L)) ‖∇f(x_k)‖².

The second part of the proof combines the sufficient descent condition with the convexity of f to obtain a bound on the optimality gap.

By the first-order condition for convexity,

f(x_{k+1}) ≤ f* + ∇f(x_k)^T (x_k − x*) − (1/(2L)) ‖∇f(x_k)‖²
 = f* − (L/2) ( ‖x_k − x*‖² − (2/L) ∇f(x_k)^T (x_k − x*) + (1/L²) ‖∇f(x_k)‖² ) + (L/2) ‖x_k − x*‖²,

where the second equality follows from completing the square,

 = f* − (L/2) ‖x_k − x* − (1/L)∇f(x_k)‖² + (L/2) ‖x_k − x*‖²
 = f* − (L/2) ‖x_{k+1} − x*‖² + (L/2) ‖x_k − x*‖².

We sum over iterations to obtain

Σ_{l=1}^k (f(x_l) − f*) ≤ (L/2) Σ_{l=1}^k ( ‖x_{l−1} − x*‖² − ‖x_l − x*‖² ) = (L/2) ( ‖x_0 − x*‖² − ‖x_k − x*‖² ) ≤ (L/2) ‖x_0 − x*‖².

Since f(x_k) is non-increasing, we have

f(x_k) − f* ≤ (1/k) Σ_{l=1}^k (f(x_l) − f*) ≤ (L/(2k)) ‖x_0 − x*‖². ∎

In the absence of strong convexity, the convergence rate of gradient descent slows to sublinear. Roughly speaking, gradient descent requires at most O(L/ɛ) iterations to obtain an ɛ-optimal solution.

The story is similar when the step sizes are not fixed, but chosen by a backtracking line search. More precisely, a backtracking line search (see the sketch after this list)
1. starts with some initial trial step size α_k ← α_init,
2. decreases the trial step size geometrically: α_k ← α_k/2,³
3. continues until α_k satisfies a sufficient descent condition:
(4.2)  f(x_k − α_k ∇f(x_k)) ≤ f(x_k) − (α_k/2) ‖∇f(x_k)‖².

³ The new trial step may be βα, for any β ∈ (0, 1). In practice, it's frequently set close to 1 to avoid taking conservative step sizes. We let β = 1/2 for simplicity.
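Here is a sketch of this backtracking loop; it is our own illustration, with the halving factor β = 1/2 matching the simplification in the footnote.

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha_init=1.0):
    """Halve the trial step until the sufficient descent condition (4.2) holds."""
    g = grad_f(x)
    alpha = alpha_init
    # (4.2): f(x - alpha * g) <= f(x) - (alpha / 2) * ||g||^2
    while f(x - alpha * g) > f(x) - 0.5 * alpha * (g @ g):
        alpha *= 0.5
    return x - alpha * g, alpha
```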

Lemma 4.2. Under the conditions of Theorem 4.1, backtracking with initial step size α_init chooses a step size that is at least min{α_init, 1/(2L)}.

Proof. Recall the quadratic upper bound

f(x_k − α∇f(x_k)) ≤ f(x_k) − α(1 − αL/2) ‖∇f(x_k)‖²

for any α ≥ 0. It's minimized at α = 1/L. We check that any α_k ≤ 1/L satisfies (4.2):

f(x_k − α_k∇f(x_k)) ≤ f(x_k) + α_k (α_k L/2 − 1) ‖∇f(x_k)‖² ≤ f(x_k) + α_k (1/2 − 1) ‖∇f(x_k)‖² = f(x_k) − (α_k/2) ‖∇f(x_k)‖².

Since any step size no longer than 1/L satisfies (4.2), the line search stops backtracking at a step size of at least 1/(2L). ∎

Thus, at each step, gradient descent with backtracking line search decreases the function value by at least (1/(4L)) ‖∇f(x_k)‖². The rest of the story is similar to that of gradient descent with a fixed step size. To summarize, gradient descent with a backtracking line search also requires at most O(L/ɛ) iterations to obtain an ɛ-optimal solution.

4.2. Nesterov's accelerated gradient method. Accelerated gradient methods obtain an ɛ-optimal solution in O(√(L/ɛ)) iterations. Nesterov's (1983) method is the original and (one of) the best known. It initializes y_1 ← x_0, γ_0 ← 1 and generates iterates according to

(4.3)  y_{k+1} ← x_k + β_k (x_k − x_{k−1}),
(4.4)  x_{k+1} ← y_{k+1} − (1/L) ∇f(y_{k+1}),

where L is the strong smoothness constant of f and

(4.5)  γ_k ← (1/2) ( (4γ_{k−1}² + γ_{k−1}⁴)^{1/2} − γ_{k−1}² ),
(4.6)  β_k ← γ_k (1/γ_{k−1} − 1).
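Putting (4.3)-(4.6) together gives the following sketch; it is our own illustration, with γ_k computed by the closed-form root (4.5).

```python
import numpy as np

def nesterov_1983(grad_f, x0, L, num_iters=100):
    """Nesterov's 1983 method, following (4.3)-(4.6) with y_1 = x_0, gamma_0 = 1."""
    x_prev, x = x0.copy(), x0.copy()
    gamma_prev = 1.0
    for _ in range(num_iters):
        g2 = gamma_prev ** 2
        gamma = 0.5 * (np.sqrt(4 * g2 + g2 ** 2) - g2)   # (4.5)
        beta = gamma * (1 / gamma_prev - 1)              # (4.6)
        y = x + beta * (x - x_prev)                      # (4.3)
        x_prev, x = x, y - grad_f(y) / L                 # (4.4)
        gamma_prev = gamma
    return x
```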

It's very similar to the heavy ball method, except the gradient is evaluated at an intermediate point y_{k+1}. Often, the γ-update (4.5) is stated implicitly: γ_k is the positive solution to

(4.7)  γ_k² = (1 − γ_k) γ_{k−1}².

To study Nesterov's 1983 method, we first formulate the recursion in terms of interpretable quantities:

(4.8)  z_k ← x_{k−1} + (1/γ_{k−1}) (x_k − x_{k−1}),  z_0 ← x_0,
(4.9)  y_{k+1} ← (1 − γ_k) x_k + γ_k z_k,
(4.10)  x_{k+1} ← y_{k+1} − (1/L) ∇f(y_{k+1}).

Behind the scenes, the method forms an estimating sequence of quadratic functions centered at {z_k}:

φ_k(x) := φ_k* + (η_k/2) ‖x − z_k‖².

Definition 4.3. A sequence of functions φ_k : R^n → R is an estimating sequence of f : R^n → R if, for some non-negative sequence λ_k → 0,

φ_k(x) ≤ (1 − λ_k) f(x) + λ_k φ_0(x).

Intuitively, an estimating sequence for f consists of successively tighter approximations of f. As we shall see, {φ_k} is an estimating sequence of f. Nesterov chose φ_0 to be a quadratic function centered at the initial point:

φ_0(x) ← φ_0* + (η_0/2) ‖x − x_0‖²,

where φ_0* = f(x_0) and η_0 = L. The subsequent φ_k are convex combinations of their predecessors and the first-order approximation of f at y_{k+1}:

(4.11)  φ_{k+1}(x) ← (1 − γ_k) φ_k(x) + γ_k f̂_k(x),

where f̂_k(x) := f(y_{k+1}) + ∇f(y_{k+1})^T (x − y_{k+1}) and x_k, y_k are given by (4.10) and (4.9). It's straightforward to show the subsequent φ_k are quadratic.

Lemma 4.4. The functions φ_k have the form φ_k* + (η_k/2) ‖x − z_k‖², where

(4.12)  η_{k+1} ← (1 − γ_k) η_k,
(4.13)  z_{k+1} ← z_k − (γ_k/η_{k+1}) ∇f(y_{k+1}),
(4.14)  φ*_{k+1} ← (1 − γ_k) φ_k* + γ_k f̂_k(z_k) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖².

Proof. We know ∇²φ_k = η_k I, but by (4.11), it is also

∇²φ_k = (1 − γ_{k−1}) ∇²φ_{k−1} + γ_{k−1} ∇²f̂_{k−1} = (1 − γ_{k−1}) η_{k−1} I.

Thus η_k = (1 − γ_{k−1}) η_{k−1}. Similarly, we know ∇φ_k(x) = η_k (x − z_k), but it is also

∇φ_k(x) = (1 − γ_{k−1}) ∇φ_{k−1}(x) + γ_{k−1} ∇f̂_{k−1}(x)
 = (1 − γ_{k−1}) η_{k−1} (x − z_{k−1}) + γ_{k−1} ∇f(y_k)
 = η_k (x − z_{k−1}) + γ_{k−1} ∇f(y_k)
 = η_k ( x − ( z_{k−1} − (γ_{k−1}/η_k) ∇f(y_k) ) ).

Thus z_k ← z_{k−1} − (γ_{k−1}/η_k) ∇f(y_k). The intercept φ_k* is simply φ_k(z_k):

φ_k(z_k) = (1 − γ_{k−1}) φ_{k−1}(z_k) + γ_{k−1} f̂_{k−1}(z_k)
 = (1 − γ_{k−1}) ( φ*_{k−1} + (η_{k−1}/2) ‖z_k − z_{k−1}‖² ) + γ_{k−1} f̂_{k−1}(z_k).

Since z_k − z_{k−1} = −(γ_{k−1}/η_k) ∇f(y_k) and η_k = (1 − γ_{k−1}) η_{k−1},

(4.15)  φ_k(z_k) = (1 − γ_{k−1}) φ*_{k−1} + ( (1 − γ_{k−1}) η_{k−1} γ²_{k−1} / (2η_k²) ) ‖∇f(y_k)‖² + γ_{k−1} f̂_{k−1}(z_k)
 = (1 − γ_{k−1}) φ*_{k−1} + ( γ²_{k−1}/(2η_k) ) ‖∇f(y_k)‖² + γ_{k−1} f̂_{k−1}(z_k).

Since f̂_{k−1}(x) = f(y_k) + ∇f(y_k)^T (x − y_k) and z_k = z_{k−1} − (γ_{k−1}/η_k) ∇f(y_k),

(4.16)  f̂_{k−1}(z_k) = f(y_k) + ∇f(y_k)^T ( z_{k−1} − (γ_{k−1}/η_k) ∇f(y_k) − y_k ) = f̂_{k−1}(z_{k−1}) − (γ_{k−1}/η_k) ‖∇f(y_k)‖².

We combine (4.15) and (4.16) to obtain (4.14). ∎

Why are estimating sequences useful? As it turns out, for any sequence {x_k} obeying f(x_k) ≤ inf_x φ_k(x), the convergence rate of {f(x_k)} is given by the convergence rate of {λ_k}.

Corollary 4.5. If {φ_k} is an estimating sequence of f and if f(x_k) ≤ inf_x φ_k(x) for some sequence {x_k}, then

f(x_k) − f* ≤ λ_k (φ_0(x*) − f*).

Proof. Since f(x_k) ≤ inf_x φ_k(x),

f(x_k) − f* ≤ φ_k(x*) − f* ≤ (1 − λ_k) f* + λ_k φ_0(x*) − f* = λ_k (φ_0(x*) − f*). ∎

We show that Nesterov's choice of {φ_k} is an estimating sequence. First, we state two lemmas concerning the sequence {γ_k}. Their proofs are in the Appendix.

Lemma 4.6. The sequence {γ_k} satisfies γ_k² = Π_{l=1}^k (1 − γ_l).

Lemma 4.7. At the k-th iteration, γ_k is at most 2/(k + 2).

Lemma 4.8. The functions {φ_k} form an estimating sequence of f:

φ_k(x) ≤ (1 − λ_k) f(x) + λ_k φ_0(x),

where λ_k := Π_{l=1}^k (1 − γ_l). Further, λ_k ≤ 4/(k + 2)² for any k ≥ 0.

Proof. Consider the sequence λ_0 = 1, λ_k = Π_{l=1}^k (1 − γ_l). At k = 0, clearly φ_0(x) ≤ φ_0(x). Since f is convex, f̂_k(x) ≤ f(x), and

φ_{k+1}(x) ≤ (1 − γ_k) φ_k(x) + γ_k f(x).

By the induction hypothesis,

φ_{k+1}(x) ≤ (1 − γ_k) ( (1 − λ_k) f(x) + λ_k φ_0(x) ) + γ_k f(x) = (1 − (1 − γ_k) λ_k) f(x) + (1 − γ_k) λ_k φ_0(x) = (1 − λ_{k+1}) f(x) + λ_{k+1} φ_0(x).

It remains to show λ_k ≤ 4/(k + 2)². By Lemma 4.6,

(4.17)  λ_k = Π_{l=1}^k (1 − γ_l) = γ_k².

We bound the right side by Lemma 4.7 to obtain the stated result. ∎

To invoke Corollary 4.5, we must first show f(x_k) ≤ inf_x φ_k(x) = φ_k*. As we shall see, {γ_k} and {y_k} are carefully chosen to ensure this. For now, assume f(x_k) ≤ φ_k*. By (4.14) and the convexity of f,

φ*_{k+1} ≥ (1 − γ_k) f(x_k) + γ_k f̂_k(z_k) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖²
 ≥ (1 − γ_k) f̂_k(x_k) + γ_k f̂_k(z_k) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖²
 = f(y_{k+1}) + ∇f(y_{k+1})^T ( (1 − γ_k) x_k + γ_k z_k − y_{k+1} ) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖².

By (4.9), the linear term is zero:

φ*_{k+1} ≥ f(y_{k+1}) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖².

To ensure f(x_{k+1}) ≤ φ*_{k+1}, it suffices to ensure

f(x_{k+1}) ≤ f(y_{k+1}) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖²,

which, as we shall see, is possible by taking a gradient step from y_{k+1}.

Lemma 4.9. Under the conditions of Theorem 4.1, the sequence {x_k} given by (4.10) obeys f(x_k) ≤ φ_k*.

Proof. We proceed by induction. At k = 0, f(x_0) = φ_0*. At each iteration, x_{k+1} is a gradient step from y_{k+1}. By (4.1), the function value decreases by at least (1/(2L)) ‖∇f(y_{k+1})‖²:

f(x_{k+1}) ≤ f(y_{k+1}) − (1/(2L)) ‖∇f(y_{k+1})‖² = f̂_k(y_{k+1}) − (1/(2L)) ‖∇f(y_{k+1})‖²,

where the equality follows from the definition of f̂_k. By (4.9), and because f̂_k is affine,

f̂_k(y_{k+1}) = f̂_k( (1 − γ_k) x_k + γ_k z_k ) = (1 − γ_k) f̂_k(x_k) + γ_k f̂_k(z_k).

Thus f(x_{k+1}) is further bounded by

f(x_{k+1}) ≤ (1 − γ_k) f̂_k(x_k) + γ_k f̂_k(z_k) − (1/(2L)) ‖∇f(y_{k+1})‖²
 ≤ (1 − γ_k) f(x_k) + γ_k f̂_k(z_k) − (1/(2L)) ‖∇f(y_{k+1})‖²
 ≤ (1 − γ_k) φ_k* + γ_k f̂_k(z_k) − (1/(2L)) ‖∇f(y_{k+1})‖²,

where the second inequality follows from the convexity of f and the third from the induction hypothesis. Substituting (4.14), we obtain

f(x_{k+1}) ≤ φ*_{k+1} + ( γ_k²/(2η_{k+1}) − 1/(2L) ) ‖∇f(y_{k+1})‖².

By (4.12), η_{k+1} = L Π_{l=1}^k (1 − γ_l) (the recursion is started from η_1 = η_0 = L). Thus

γ_k²/(2η_{k+1}) − 1/(2L) = (1/(2L)) ( γ_k² / Π_{l=1}^k (1 − γ_l) − 1 ).

The difference in parentheses is zero by Lemma 4.6. Thus the iterates {x_k} obey f(x_k) ≤ inf_x φ_k(x). ∎
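The bookkeeping above can be checked numerically. The sketch below (our own illustration) runs (4.9)-(4.10) while maintaining (z_k, φ_k*) via (4.13)-(4.14), writing η_{k+1} = Lγ_k² (which is what (4.12) and Lemma 4.6 give), and asserts the invariant f(x_k) ≤ φ_k* of Lemma 4.9 on an assumed convex quadratic.

```python
import numpy as np

def agd_estimating_sequence(f, grad_f, x0, L, num_iters=50):
    """Run (4.9)-(4.10), tracking z_k and phi_k* via (4.13)-(4.14)."""
    x, z = x0.copy(), x0.copy()
    gamma, phi_star = 1.0, f(x0)                # gamma_0 = 1, phi_0* = f(x_0)
    for _ in range(num_iters):
        y = (1 - gamma) * x + gamma * z         # (4.9)
        g = grad_f(y)
        fhat_z = f(y) + g @ (z - y)             # hat f_k(z_k)
        x = y - g / L                           # (4.10)
        eta = L * gamma ** 2                    # eta_{k+1} = L * gamma_k^2
        z = z - (gamma / eta) * g               # (4.13)
        phi_star = ((1 - gamma) * phi_star + gamma * fhat_z
                    - 0.5 * gamma ** 2 / eta * (g @ g))   # (4.14)
        assert f(x) <= phi_star + 1e-8          # Lemma 4.9's invariant
        g2 = gamma ** 2
        gamma = 0.5 * (np.sqrt(4 * g2 + g2 ** 2) - g2)    # (4.5)
    return x

A = np.diag([1.0, 10.0])
x = agd_estimating_sequence(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                            np.array([-1.0, 1.0]), L=10.0)
```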

We put the pieces together to obtain the convergence rate of Nesterov's 1983 method.

Theorem 4.10. Under the conditions of Theorem 4.1, the iterates {x_k} generated by (4.10) satisfy

f(x_k) − f* ≤ (4L/(k + 2)²) ‖x_0 − x*‖².

Proof. By Corollary 4.5 and Lemma 4.8, the function values obey

f(x_k) − f* ≤ λ_k (φ_0(x*) − f*) ≤ (4/(k + 2)²) (φ_0(x*) − f*).

Further, recalling φ_0(x*) = f(x_0) + (L/2) ‖x* − x_0‖², we obtain

f(x_k) − f* ≤ (4/(k + 2)²) ( (L/2) ‖x* − x_0‖² + f(x_0) − f* ) ≤ (4L/(k + 2)²) ‖x* − x_0‖²,

where the second inequality follows from the strong smoothness of f (which gives f(x_0) − f* ≤ (L/2) ‖x_0 − x*‖²). ∎

Finally, we check that the iterates {(x_k, y_k, z_k)} produced by forming auxiliary functions are the same as those generated by (4.8), (4.9), (4.10).

Lemma 4.11. The iterates {(x_k, y_k, z_k)} generated by the estimating sequence are identical to those generated by (4.8), (4.9), (4.10).

Fig 3. Convergence of gradient descent and Nesterov's 1983 method on a strongly convex function. Like the heavy ball method, Nesterov's 1983 method generates function values that are not monotonically decreasing.

Proof. Since x_k, y_k are given by (4.10) and (4.9) when forming an estimating sequence, it suffices to check that (4.8) and (4.13) are identical. By (4.8),

z_{k+1} ← x_k + (1/γ_k) (x_{k+1} − x_k)
 = x_k + (1/γ_k) ( y_{k+1} − (1/L)∇f(y_{k+1}) − x_k )
 = x_k + (1/γ_k) ( (1 − γ_k) x_k + γ_k z_k − (1/L)∇f(y_{k+1}) − x_k )
 = z_k − (1/(γ_k L)) ∇f(y_{k+1}).

It remains to show 1/(γ_k L) = γ_k/η_{k+1}. By (4.12), η_{k+1} = L Π_{l=1}^k (1 − γ_l). Thus

γ_k/η_{k+1} = γ_k / ( L Π_{l=1}^k (1 − γ_l) ) = 1/(γ_k L),

where the second equality follows by Lemma 4.6. ∎

Nesterov's 1983 method, as stated, does not converge linearly when f is strongly convex. Figure 3 compares the convergence of gradient descent and Nesterov's 1983 method. It's possible to modify the method to attain linear convergence. The key idea is to form φ_k by convex combinations of φ_{k−1} and quadratic lower bounds of f given by (2.2). We refer to Nesterov (2004), Section 2.2 for details.

A more practical issue is choosing the step size in (4.10) when L is unknown. It is possible to incorporate a line search:

(4.18)  x_{k+1} ← y_{k+1} − α_k ∇f(y_{k+1}),

where α_k is chosen by a line search. To form a valid estimating sequence, the step sizes must decrease: α_k ≤ α_{k−1}. If α_k is chosen by a backtracking line search, the initial trial step at the k-th iteration is set to α_{k−1}.

The key idea in Nesterov (1983) is forming an estimating sequence, and Nesterov's 1983 method is merely a special case. The development of accelerated first-order methods by forming estimating sequences remains an active area of research. Just recently,
1. Meng and Chen (2011) proposed a modification of Nesterov's 1983 method that further accelerates the speed of convergence.
2. Gonzaga and Karas (2013) proposed a variant of Nesterov's original method that adapts to unknown strong convexity.
3. O'Donoghue and Candès (2013) proposed a heuristic for resetting the momentum term to zero that improves the convergence rate.
To end on a practical note, the Matlab package TFOCS by Becker, Candès and Grant (2011) implements some of the aforementioned methods.

4.3. Optimality of accelerated gradient methods. To wrap up the section, we pause and ask: are there methods that converge faster than accelerated gradient descent? In other words, given a convex function with Lipschitz gradient, what is the fastest convergence rate attainable by first-order methods? More precisely, a first-order method is one that chooses the k-th iterate in

x_0 + span{∇f(x_0), ..., ∇f(x_{k−1})} for k = 1, 2, ...

As it turns out, the O(1/k²) convergence rate attained by Nesterov's accelerated gradient method is optimal. That is, unless f has special extra structure (e.g. strong convexity), no method has better worst-case performance than Nesterov's 1983 method.

Theorem 4.12. There exists a convex and strongly smooth function f : R^{2k+1} → R such that

f(x_k) − f* ≥ (3L ‖x_0 − x*‖²) / (32(k + 1)²),

where the iterates {x_l} obey x_l ∈ x_0 + span{∇f(x_0), ..., ∇f(x_{l−1})} for any l ∈ [1:k].

Proof sketch. It suffices to exhibit a function that is hard to optimize. More precisely, we construct a function and initial guess such that any first-order method has at best O(1/k²) convergence rate when applied to the problem. Consider the convex quadratic function

f(x) = (L/4) ( (1/2) ( x_1² + Σ_{i=1}^{2k} (x_i − x_{i+1})² + x_{2k+1}² ) − x_1 ).

In matrix form, f(x) = (L/4) ( (1/2) x^T A x − e_1^T x ), where

    A = [  2 −1          ]
        [ −1  2 −1       ]
        [     ·  ·  ·    ]
        [       −1  2 −1 ]
        [          −1  2 ]  ∈ R^{(2k+1)×(2k+1)}.

It's possible to show
1. f is strongly smooth with constant L,
2. its minimum f* = (L/8) ( 1/(2k + 2) − 1 ) is attained at [x*]_i = 1 − i/(2k + 2),
3. K_l[f] := span{∇f(x_0), ..., ∇f(x_{l−1})} is span{e_1, ..., e_l}.

If we initialize x_0 ← 0, then at the k-th iteration,

f(x_k) ≥ inf_{x ∈ K_k[f]} f(x) = (L/8) ( 1/(k + 1) − 1 ).

Further, ‖x_0 − x*‖² = Σ_{i=1}^{2k+1} (1 − i/(2k + 2))² ≤ (2k + 2)/3. Thus

(f(x_k) − f*) / ‖x_0 − x*‖² ≥ (L/8) ( 1/(k + 1) − 1/(2k + 2) ) · 3/(2k + 2) = 3L / (32(k + 1)²). ∎
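For concreteness, here is a sketch (our own illustration) that builds this worst-case quadratic; starting from x_0 = 0, each gradient evaluation can reveal at most one new coordinate, which is the mechanism behind property 3.

```python
import numpy as np

def worst_case_quadratic(k, L=1.0):
    """Build f(x) = (L/4)(x^T A x / 2 - x_1) on R^(2k+1), A tridiagonal."""
    n = 2 * k + 1
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    e1 = np.zeros(n); e1[0] = 1.0
    f = lambda x: (L / 4) * (0.5 * x @ A @ x - x[0])
    grad_f = lambda x: (L / 4) * (A @ x - e1)
    return f, grad_f
```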

5. Projected gradient descent. To wrap up, we consider a simple modification of gradient descent for constrained optimization: projected gradient descent. At each iteration, we take a gradient step and project back onto the feasible set:

(5.1)  x_{k+1} ← P_C(x_k − α_k ∇f(x_k)),

where P_C(x) := argmin_{y ∈ C} (1/2) ‖x − y‖² is the projection of x onto C. For many sets (e.g. subspaces, (symmetric) cones, norm balls, etc.), it is possible to project onto the set efficiently, making projected gradient descent a viable choice for optimization over the set.

Despite the requirement of being able to project efficiently onto the feasible set, projected gradient descent is quite general. Consider the (generic) nonlinear optimization problem

(5.2)  minimize_{x ∈ R^n} f(x) subject to l ≤ c(x) ≤ u.

By introducing slack variables, (5.2) is equivalent to

minimize_{s ∈ R^m, x ∈ R^n} f(x) subject to c(x) = s, l ≤ s ≤ u.

The bound-constrained Lagrangian approach (BCL) solves a sequence of bound-constrained augmented Lagrangian subproblems of the form

minimize_{s ∈ R^m, x ∈ R^n} f(x) + λ_k^T (c(x) − s) + (ρ_k/2) ‖c(x) − s‖² subject to l ≤ s ≤ u.

Since projection onto rectangles is efficient, projected gradient descent may be used to solve the BCL subproblems.

Given the similarity of projected gradient descent to gradient descent, it is no surprise that the convergence rate of projected gradient descent is the same as that of its unconstrained counterpart. For now, fix a step size α and consider projected gradient descent as a fixed point iteration: x_{k+1} ← P_C(G_α(x_k)), where G_α is the gradient step. Since the optimum x* is a fixed point of P_C ∘ G_α, the iterates converge if P_C ∘ G_α is a contraction. (A sketch of the method appears below.)
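Here is a minimal sketch of the iteration (5.1); the NumPy code and the toy problem are our own illustrative choices, with a box projection of the kind used in the BCL subproblems.

```python
import numpy as np

def projected_gradient_descent(grad_f, x0, alpha, project, num_iters=100):
    """Iterate x_{k+1} = P_C(x_k - alpha * grad_f(x_k)), as in (5.1)."""
    x = x0.copy()
    for _ in range(num_iters):
        x = project(x - alpha * grad_f(x))
    return x

# Projection onto the rectangle {x : l <= x <= u} is a componentwise clip.
l, u = -np.ones(2), np.ones(2)
x = projected_gradient_descent(lambda x: x - 2.0, np.zeros(2), alpha=0.5,
                               project=lambda z: np.clip(z, l, u))
# Here f(x) = 0.5 ||x - 2||^2, so the constrained optimum is x* = u.
```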

Lemma 5.1. The projection onto a convex set C ⊆ R^n is non-expansive:

‖P_C(y) − P_C(x)‖ ≤ ‖y − x‖.

Proof. The optimality condition of the constrained optimization problem defining the projection onto C is

(x − P_C(x))^T (y − P_C(x)) ≤ 0 for any y ∈ C.

For any two points x, y and their projections onto C, we have

(x − P_C(x))^T (P_C(y) − P_C(x)) ≤ 0,
(y − P_C(y))^T (P_C(x) − P_C(y)) ≤ 0.

We add the two inequalities and rearrange to obtain

‖P_C(y) − P_C(x)‖² ≤ (y − x)^T (P_C(y) − P_C(x)).

We apply the Cauchy-Schwarz inequality to obtain the stated result. ∎

Since P_C is non-expansive,

‖P_C(G_α(x)) − P_C(G_α(y))‖ ≤ ‖G_α(x) − G_α(y)‖.

Thus, as long as α is chosen to ensure G_α is a contraction, P_C ∘ G_α is also a contraction. The rest of the story is the same as that of gradient descent: when f is strongly convex and strongly smooth, projected gradient descent converges linearly. That is, it requires at most O(κ log(1/ɛ)) iterations to obtain an ɛ-accurate solution.

When f is not strongly convex (but convex and strongly smooth), the story is also similar to that of gradient descent. However, the proof is more involved. First, we show projected gradient descent is a descent method.

Lemma 5.2. When
1. f : R^n → R is convex and strongly smooth with constant L,
2. C ⊆ R^n is convex,
projected gradient descent with step size 1/L satisfies

f(x_{k+1}) ≤ f(x_k) − (L/2) ‖x_{k+1} − x_k‖².

Proof. Since f is strongly smooth, we have

(5.3)  f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (L/2) ‖x_{k+1} − x_k‖².

The point x_{k+1} is the projection of x_k − (1/L)∇f(x_k) onto C. Thus

(5.4)  (x_k − (1/L)∇f(x_k) − x_{k+1})^T (y − x_{k+1}) ≤ 0 for any y ∈ C.

By setting y = x_k in (5.4) and rearranging, we obtain

∇f(x_k)^T (x_k − x_{k+1}) ≥ L ‖x_k − x_{k+1}‖².

We substitute the bound into (5.3) to obtain the stated result. ∎

We put the pieces together to show projected gradient descent requires at most O(L/ɛ) iterations to obtain an ɛ-suboptimal point.

Theorem 5.3. Under the conditions of Lemma 5.2, projected gradient descent with step size 1/L satisfies

f(x_k) − f* ≤ (L/(2k)) ‖x* − x_0‖².

Proof. By setting y = x* in (5.4) (and rearranging), we obtain

∇f(x_k)^T (x_{k+1} − x*) ≤ L (x_{k+1} − x_k)^T (x* − x_{k+1}).

Substituting the bound into (5.3), together with the convexity of f (i.e. f(x_k) ≤ f* + ∇f(x_k)^T (x_k − x*)), we have

f(x_{k+1}) ≤ f* + L (x_{k+1} − x_k)^T (x* − x_{k+1}) + (L/2) ‖x_{k+1} − x_k‖².

We complete the square to obtain

f(x_{k+1}) ≤ f* + (L/2) ( ‖x* − x_k‖² − ‖x* − x_{k+1}‖² ).

Summing over iterations, we obtain

Σ_{l=1}^k (f(x_l) − f*) ≤ (L/2) Σ_{l=1}^k ( ‖x* − x_{l−1}‖² − ‖x* − x_l‖² ) = (L/2) ( ‖x* − x_0‖² − ‖x* − x_k‖² ) ≤ (L/2) ‖x* − x_0‖².

By Lemma 5.2, {f(x_k)} is decreasing. Thus

f(x_k) − f* ≤ (1/k) Σ_{l=1}^k (f(x_l) − f*) ≤ (L/(2k)) ‖x* − x_0‖². ∎

When the strong smoothness constant is unknown, it is possible to choose step sizes by a projected backtracking line search (sketched below). We
1. start with some initial trial step size α_k ← α_init,
2. decrease the trial step size geometrically: α_k ← α_k/2,
3. continue until α_k satisfies a sufficient descent condition:
(5.5)  f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (1/(2α_k)) ‖x_{k+1} − x_k‖²,
where x_{k+1} ← P_C(x_k − α_k ∇f(x_k)). Thus the line search backtracks along the projection of the search ray onto the feasible set. It is possible to show projected gradient descent with a projected backtracking line search preserves the O(1/k) convergence rate of projected gradient descent with step size 1/L.
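A sketch of the projected backtracking step follows; it is our own illustration of the three-step recipe above, using (5.5) with the 1/(2α_k) curvature term.

```python
import numpy as np

def projected_backtracking_step(f, grad_f, x, project, alpha_init=1.0):
    """Backtrack along the projection of the search ray until (5.5) holds."""
    g = grad_f(x)
    alpha = alpha_init
    while True:
        x_new = project(x - alpha * g)
        d = x_new - x
        if f(x_new) <= f(x) + g @ d + (d @ d) / (2 * alpha):
            return x_new, alpha
        alpha *= 0.5
```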

APPENDIX

Proof of Lemma 4.6. We proceed by induction. Recall {γ_k} is generated by the recursion (4.7). Since γ_0 = 1, we automatically have γ_1² = (1 − γ_1). By the induction hypothesis γ_{k−1}² = Π_{l=1}^{k−1} (1 − γ_l) and (4.7),

γ_k² = (1 − γ_k) γ_{k−1}² = Π_{l=1}^k (1 − γ_l). ∎

Proof of Lemma 4.7. We consider the gap between successive 1/γ_l:

1/γ_{l+1} − 1/γ_l = (γ_l − γ_{l+1}) / (γ_l γ_{l+1}) = (γ_l² − γ_{l+1}²) / ( γ_l γ_{l+1} (γ_l + γ_{l+1}) ).

By (4.7), we have

1/γ_{l+1} − 1/γ_l = ( γ_l² − (1 − γ_{l+1}) γ_l² ) / ( γ_l γ_{l+1} (γ_l + γ_{l+1}) ) = γ_{l+1} γ_l² / ( γ_l γ_{l+1} (γ_l + γ_{l+1}) ) = γ_l / (γ_l + γ_{l+1}).

It is easy to show the sequence {γ_l} is decreasing. Thus

1/γ_{l+1} − 1/γ_l = γ_l / (γ_l + γ_{l+1}) ≥ γ_l / (2γ_l) = 1/2.

Summing the inequalities for l = 0, 1, ..., k − 1, we obtain

1/γ_k − 1/γ_0 ≥ k/2.

We recall γ_0 = 1 to deduce 1/γ_k ≥ 1 + k/2, which implies γ_k ≤ 2/(k + 2). ∎
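Both lemmas are easy to confirm numerically; the check below is our own illustration of the recursion (4.5)/(4.7).

```python
import numpy as np

# Numerical check of Lemmas 4.6 and 4.7 along the sequence generated by (4.5).
gamma, prod = 1.0, 1.0
for k in range(1, 50):
    g2 = gamma ** 2
    gamma = 0.5 * (np.sqrt(4 * g2 + g2 ** 2) - g2)   # (4.5), i.e. solve (4.7)
    prod *= 1 - gamma                                # prod_{l=1}^k (1 - gamma_l)
    assert np.isclose(gamma ** 2, prod)              # Lemma 4.6
    assert gamma <= 2 / (k + 2) + 1e-12              # Lemma 4.7
```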

REFERENCES

Beck, A. and Teboulle, M. (2009). Gradient-based algorithms with applications to signal recovery. In Convex Optimization in Signal Processing and Communications.
Becker, S. R., Candès, E. J. and Grant, M. C. (2011). Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation.
Gonzaga, C. C. and Karas, E. W. (2013). Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming. Mathematical Programming.
Meng, X. and Chen, H. (2011). Accelerating Nesterov's method for strongly convex functions with Lipschitz gradient. arXiv preprint.
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady.
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization, vol. 87. Springer Science & Business Media.
O'Donoghue, B. and Candès, E. (2013). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics.

Yuekai Sun
Stanford, California
April 9
yuekai@stanford.edu


More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente October 25, 2012 Abstract In this paper we prove that the broad class of direct-search methods of directional type based on imposing sufficient decrease

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

1 Computing with constraints

1 Computing with constraints Notes for 2017-04-26 1 Computing with constraints Recall that our basic problem is minimize φ(x) s.t. x Ω where the feasible set Ω is defined by equality and inequality conditions Ω = {x R n : c i (x)

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information