NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS

1. Introduction. We consider first-order methods for smooth, unconstrained optimization:

(1.1)  minimize_{x ∈ R^n} f(x),

where f : R^n → R. We assume the optimum of f exists and is unique. By first-order, we are referring to methods that use only function value and gradient information. Compared to second-order methods (e.g. Newton-type methods, interior-point methods), first-order methods usually converge more slowly, and their performance degrades on ill-conditioned problems. However, they are well-suited to large-scale problems because each iteration usually requires only vector operations,¹ in addition to the evaluation of f(x) and ∇f(x).

2. Gradient descent. The simplest first-order method is gradient descent or steepest descent. It starts at an initial point x_0 and generates successive iterates by

(2.1)  x_{k+1} ← x_k − α_k ∇f(x_k).

The step size α_k is either fixed or chosen by a line search that (approximately) minimizes f(x_k − α∇f(x_k)). For small to medium-scale problems, each iteration is inexpensive, requiring only vector operations. However, the iterates may converge slowly to the optimum x*.

Let's take a closer look at the convergence of gradient descent. For now, fix a step size α and consider gradient descent as a fixed point iteration: x_{k+1} ← G_α(x_k), where G_α(x) := x − α∇f(x) is a contraction:

‖G_α(x) − G_α(y)‖ ≤ L_G ‖x − y‖ for some L_G < 1.

Since the optimum x* is a fixed point of G_α, we expect the fixed point iteration x_{k+1} ← G_α(x_k) to converge to the fixed point.

¹ Second-order methods usually require the solution of linear system(s) at each iteration, making them prohibitively expensive for very large problems.
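To make the iteration concrete, here is a minimal sketch of gradient descent with a fixed step size; the NumPy implementation and the quadratic test problem are our own illustrative choices, not part of the original notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_iters=100):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k), as in (2.1)."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Illustrative problem (ours): f(x) = 0.5 x^T A x with eigenvalues 1 and 10.
A = np.diag([1.0, 10.0])
x = gradient_descent(lambda x: A @ x, np.array([-1.0, 1.0]), alpha=2 / 11)
```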

Lemma 2.1. When the gradient step G_α(x) is a contraction, gradient descent converges linearly to x*:

‖x_k − x*‖ ≤ L_G^k ‖x_0 − x*‖.

Proof. We begin by showing that each iteration of gradient descent is a contraction:

‖x_{k+1} − x*‖ = ‖x_k − α∇f(x_k) − (x* − α∇f(x*))‖ = ‖G_α(x_k) − G_α(x*)‖ ≤ L_G ‖x_k − x*‖.

By induction, ‖x_k − x*‖ ≤ L_G^k ‖x_0 − x*‖. ∎

At first blush, it seems like gradient descent converges quickly. However, the contraction coefficient L_G depends poorly on the condition number κ := L/µ of the problem. In particular, L_G ≈ 1 − 2/κ with an optimal step size. From now on, we assume the function f is convex.

Lemma 2.2. When f : R^n → R is
1. twice continuously differentiable,
2. strongly convex with constant µ > 0:
(2.2)  f(y) ≥ f(x) + ∇f(x)^T (y − x) + (µ/2) ‖y − x‖²,
3. strongly smooth with constant L ≥ 0:
(2.3)  f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖²,
the contraction coefficient L_G is at most max{|1 − αµ|, |1 − αL|}.

Proof. By the mean value theorem,

‖G_α(x) − G_α(y)‖ = ‖x − α∇f(x) − (y − α∇f(y))‖ = ‖(I − α∇²f(z))(x − y)‖ ≤ ‖I − α∇²f(z)‖ ‖x − y‖

for some z on the line segment between x and y. By strong convexity and smoothness, the eigenvalues of ∇²f(z) are between µ and L. Thus ‖I − α∇²f(z)‖ ≤ max{|1 − αµ|, |1 − αL|}. ∎
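As a sanity check on Lemma 2.2, the sketch below (our own illustration, on an assumed diagonal quadratic) verifies numerically that the step size α = 2/(L + µ), introduced on the next page, yields the contraction factor (κ − 1)/(κ + 1).

```python
import numpy as np

# Our own check: for f(x) = 0.5 x^T A x with diagonal A, G_alpha(x) = (I - alpha A) x,
# so the contraction coefficient is the spectral norm of I - alpha A. With
# alpha = 2/(L + mu) it equals (kappa - 1)/(kappa + 1), which matches
# max{|1 - alpha mu|, |1 - alpha L|} from Lemma 2.2.
mu, L = 1.0, 10.0
A = np.diag(np.linspace(mu, L, 5))        # eigenvalues between mu and L
alpha = 2 / (L + mu)
contraction = np.linalg.norm(np.eye(5) - alpha * A, 2)
kappa = L / mu
assert np.isclose(contraction, (kappa - 1) / (kappa + 1))
```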

Fig 1. The iterates of gradient descent (left panel) and the heavy ball method (right panel) starting at (−1, 1).

We combine Lemmas 2.1 and 2.2 to deduce

‖x_k − x*‖ ≤ max{|1 − αµ|, |1 − αL|}^k ‖x_0 − x*‖.

The step size α = 2/(L + µ) minimizes the contraction coefficient. Substituting the optimal step size, we conclude that

‖x_k − x*‖ ≤ ((κ − 1)/(κ + 1))^k ‖x_0 − x*‖,

where κ := L/µ is a condition number of the problem. When κ is large, the contraction coefficient is roughly 1 − 2/κ. In other words, gradient descent requires at most O(κ log(1/ɛ)) iterations to attain an ɛ-accurate solution.

3. The heavy ball method. The heavy ball method is usually attributed to Polyak (1964). The intuition is simple: the iterates of gradient descent tend to bounce between the walls of narrow valleys on the objective surface. The left panel of Figure 1 shows the iterates of gradient descent bouncing from wall to wall. To avoid bouncing, the heavy ball method adds a momentum term to the gradient step:

x_{k+1} ← x_k − α_k ∇f(x_k) + β_k (x_k − x_{k−1}).

The term x_k − x_{k−1} nudges x_{k+1} in the direction of the previous step (hence momentum). The right panel of Figure 1 shows the effect of adding momentum.
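The following is a minimal sketch of the heavy ball iteration; it is our own illustration, and the parameter choices anticipate Theorem 3.2 below.

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha, beta, num_iters=100):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
    return x

# Fixed parameters from Theorem 3.2, on an assumed quadratic with mu = 1, L = 10.
mu, L = 1.0, 10.0
alpha = 4 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = max(abs(1 - np.sqrt(alpha * mu)), abs(1 - np.sqrt(alpha * L))) ** 2
A = np.diag([mu, L])
x = heavy_ball(lambda x: A @ x, np.array([-1.0, 1.0]), alpha, beta)
```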

Let's study the heavy ball method and compare it to gradient descent. Since each iterate depends on the previous two iterates, we study the three-term recurrence in matrix form (with a fixed step size α and momentum parameter β):

    [ x_{k+1} − x* ]   [ x_k + β(x_k − x_{k−1}) − α∇f(x_k) − x* ]
    [ x_k − x*     ] = [ x_k − x*                               ]

                       [ (1+β)I   −βI ] [ x_k − x*     ]   [ α∇²f(z_k)(x_k − x*) ]
                     = [ I          0 ] [ x_{k−1} − x* ] − [ 0                   ]

for some z_k on the line segment between x_k and x* (by the mean value theorem, ∇f(x_k) = ∇²f(z_k)(x_k − x*) since ∇f(x*) = 0)

                       [ (1+β)I − α∇²f(z_k)   −βI ] [ x_k − x*     ]
    (3.1)            = [ I                      0 ] [ x_{k−1} − x* ].

The contraction coefficient is given by the norm of the block matrix in (3.1). We introduce a lemma to bound the norm.

Lemma 3.1. Under the conditions of Lemma 2.2,

    ‖ [ (1+β)I − α∇²f(z)   −βI ] ‖
    ‖ [ I                    0 ] ‖ ≤ max{|1 − √(αµ)|, |1 − √(αL)|}

for β = max{|1 − √(αµ)|, |1 − √(αL)|}².

Proof. Let Λ := diag(λ_1, ..., λ_n), where {λ_i}_{i ∈ [1:n]} are the eigenvalues of ∇²f(z). By diagonalization,

    ‖ [ (1+β)I − α∇²f(z)   −βI ] ‖     ‖ [ (1+β)I − αΛ   −βI ] ‖                    ‖ [ 1+β−αλ_i   −β ] ‖
    ‖ [ I                    0 ] ‖  =  ‖ [ I               0 ] ‖  =  max_{i∈[1:n]}  ‖ [ 1            0 ] ‖.

The second equality holds because it is possible to permute the matrix into a block diagonal matrix with 2×2 blocks. For each i ∈ [1:n], the eigenvalues of the 2×2 matrices are given by the roots of

p_i(λ) := λ² − (1 + β − αλ_i)λ + β = 0.

It is possible to show that when β is at least (1 − √(αλ_i))², the roots of p_i are complex and both have magnitude √β. Since β := max{|1 − √(αµ)|, |1 − √(αL)|}², the magnitudes of the roots are at most √β. ∎
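The key step in the proof — that each 2×2 block has eigenvalues of magnitude √β once β ≥ (1 − √(αλ_i))² — is easy to check numerically; the snippet below is our own illustration.

```python
import numpy as np

# For eigenvalues lam between mu and L, check that the 2x2 block
# [[1 + beta - alpha*lam, -beta], [1, 0]] has eigenvalues of magnitude
# at most sqrt(beta), as in the proof of Lemma 3.1.
mu, L = 1.0, 10.0
alpha = 4 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = max(abs(1 - np.sqrt(alpha * mu)), abs(1 - np.sqrt(alpha * L))) ** 2
for lam in np.linspace(mu, L, 7):
    M = np.array([[1 + beta - alpha * lam, -beta], [1.0, 0.0]])
    assert np.all(np.abs(np.linalg.eigvals(M)) <= np.sqrt(beta) + 1e-12)
```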

Theorem 3.2. Under the conditions of Lemma 2.2, the heavy ball method with fixed parameters α = 4/(√L + √µ)² and β = max{|1 − √(αµ)|, |1 − √(αL)|}² converges linearly to x*:

‖x_{k+1} − x*‖ ≤ ((√κ − 1)/(√κ + 1))^k ‖x_0 − x*‖,

where κ = L/µ.

Proof. By Lemma 3.1,

    ‖ [ x_{k+1} − x* ] ‖                                          ‖ [ x_k − x*     ] ‖
    ‖ [ x_k − x*     ] ‖  ≤  max{|1 − √(αµ)|, |1 − √(αL)|}  ·     ‖ [ x_{k−1} − x* ] ‖.

Substituting α = 4/(√L + √µ)² gives

    ‖ [ x_{k+1} − x* ] ‖     √κ − 1     ‖ [ x_k − x*     ] ‖
    ‖ [ x_k − x*     ] ‖  ≤  ────── ·   ‖ [ x_{k−1} − x* ] ‖,
                             √κ + 1

where κ := L/µ. We induct on k to obtain the stated result. ∎

The heavy ball method also converges linearly, but its contraction coefficient is roughly 1 − 2/√κ. In other words, the heavy ball method attains an ɛ-accurate solution after at most O(√κ log(1/ɛ)) iterations. Figure 2 compares the convergence of gradient descent and the heavy ball method.

Unlike gradient descent, the heavy ball method is not a descent method, i.e. f(x_{k+1}) is not necessarily smaller than f(x_k). On a related note, the conjugate gradient method (CG) is an instance of the heavy ball method with adaptive step sizes and momentum parameters. However, CG does not require knowledge of L and µ to choose step sizes.

4. Accelerated gradient methods.

4.1. Gradient descent in the absence of strong convexity. Before delving into Nesterov's celebrated method, we study gradient descent in the setting he considered. Nesterov studied (1.1) when f is continuously differentiable and strongly smooth. In the absence of strong convexity, (1.1) may not have a unique optimum, so we cannot expect the iterates to always converge to a point x*. (The iterates may converge to different points depending on the initial point x_0.) However, if f remains convex, we expect the optimality gap f(x_k) − f* to converge to zero. To begin, we study the convergence rate of gradient descent in the absence of strong convexity.

Fig 2. Convergence of gradient descent and heavy ball function values on a strongly convex function. Although the heavy ball function values are not monotone, they converge faster.

Theorem 4.1. When f : R^n → R is convex and
1. continuously differentiable,
2. strongly smooth with constant L,
gradient descent with step size 1/L satisfies

f(x_k) − f* ≤ (L/(2k)) ‖x_0 − x*‖².

Proof. The proof consists of two parts. First, we show that, at each step, gradient descent reduces the function value by at least a certain amount. In other words, we show each iteration of gradient descent achieves sufficient descent. Recall x_{k+1} ← x_k − α∇f(x_k). By the strong smoothness of f,

f(x_{k+1}) ≤ f(x_k) − α(1 − αL/2) ‖∇f(x_k)‖².

When α = 1/L, the right side simplifies to

(4.1)  f(x_{k+1}) ≤ f(x_k) − (1/(2L)) ‖∇f(x_k)‖².

The second part of the proof combines the sufficient descent condition with the convexity of f to obtain a bound on the optimality gap.

By the first-order condition for convexity,

f(x_{k+1}) ≤ f* + ∇f(x_k)^T (x_k − x*) − (1/(2L)) ‖∇f(x_k)‖²
 = f* − (L/2) ( ‖x_k − x*‖² − (2/L) ∇f(x_k)^T (x_k − x*) + (1/L²) ‖∇f(x_k)‖² ) + (L/2) ‖x_k − x*‖²,

where the second equality follows from completing the square,

 = f* − (L/2) ‖x_k − x* − (1/L)∇f(x_k)‖² + (L/2) ‖x_k − x*‖²
 = f* − (L/2) ‖x_{k+1} − x*‖² + (L/2) ‖x_k − x*‖².

We sum over iterations to obtain

Σ_{l=1}^k (f(x_l) − f*) ≤ (L/2) Σ_{l=1}^k ( ‖x_{l−1} − x*‖² − ‖x_l − x*‖² ) = (L/2) ( ‖x_0 − x*‖² − ‖x_k − x*‖² ) ≤ (L/2) ‖x_0 − x*‖².

Since f(x_k) is non-increasing, we have

f(x_k) − f* ≤ (1/k) Σ_{l=1}^k (f(x_l) − f*) ≤ (L/(2k)) ‖x_0 − x*‖². ∎

In the absence of strong convexity, the convergence rate of gradient descent slows to sublinear. Roughly speaking, gradient descent requires at most O(L/ɛ) iterations to obtain an ɛ-optimal solution.

The story is similar when the step sizes are not fixed, but chosen by a backtracking line search. More precisely, a backtracking line search (see the sketch after this list)
1. starts with some initial trial step size α_k ← α_init,
2. decreases the trial step size geometrically: α_k ← α_k/2,³
3. continues until α_k satisfies a sufficient descent condition:
(4.2)  f(x_k − α_k ∇f(x_k)) ≤ f(x_k) − (α_k/2) ‖∇f(x_k)‖².

³ The new trial step may be βα, for any β ∈ (0, 1). In practice, it's frequently set close to 1 to avoid taking conservative step sizes. We let β = 1/2 for simplicity.
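Here is a sketch of this backtracking loop; it is our own illustration, with the halving factor β = 1/2 matching the simplification in the footnote.

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha_init=1.0):
    """Halve the trial step until the sufficient descent condition (4.2) holds."""
    g = grad_f(x)
    alpha = alpha_init
    # (4.2): f(x - alpha * g) <= f(x) - (alpha / 2) * ||g||^2
    while f(x - alpha * g) > f(x) - 0.5 * alpha * (g @ g):
        alpha *= 0.5
    return x - alpha * g, alpha
```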

Lemma 4.2. Under the conditions of Theorem 4.1, backtracking with initial step size α_init chooses a step size that is at least min{α_init, 1/(2L)}.

Proof. Recall the quadratic upper bound

f(x_k − α∇f(x_k)) ≤ f(x_k) − α(1 − αL/2) ‖∇f(x_k)‖²

for any α ≥ 0. It's minimized at α = 1/L. We check that any α_k ≤ 1/L satisfies (4.2):

f(x_k − α_k∇f(x_k)) ≤ f(x_k) + α_k (α_k L/2 − 1) ‖∇f(x_k)‖² ≤ f(x_k) + α_k (1/2 − 1) ‖∇f(x_k)‖² = f(x_k) − (α_k/2) ‖∇f(x_k)‖².

Since any step size no longer than 1/L satisfies (4.2), the line search stops backtracking at a step size of at least 1/(2L). ∎

Thus, at each step, gradient descent with backtracking line search decreases the function value by at least (1/(4L)) ‖∇f(x_k)‖². The rest of the story is similar to that of gradient descent with a fixed step size. To summarize, gradient descent with a backtracking line search also requires at most O(L/ɛ) iterations to obtain an ɛ-optimal solution.

4.2. Nesterov's accelerated gradient method. Accelerated gradient methods obtain an ɛ-optimal solution in O(√(L/ɛ)) iterations. Nesterov's (1983) method is the original and (one of) the best known. It initializes y_1 ← x_0, γ_0 ← 1 and generates iterates according to

(4.3)  y_{k+1} ← x_k + β_k (x_k − x_{k−1}),
(4.4)  x_{k+1} ← y_{k+1} − (1/L) ∇f(y_{k+1}),

where L is the strong smoothness constant of f and

(4.5)  γ_k ← (1/2) ( (4γ_{k−1}² + γ_{k−1}⁴)^{1/2} − γ_{k−1}² ),
(4.6)  β_k ← γ_k (1/γ_{k−1} − 1).
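Putting (4.3)-(4.6) together gives the following sketch; it is our own illustration, with γ_k computed by the closed-form root (4.5).

```python
import numpy as np

def nesterov_1983(grad_f, x0, L, num_iters=100):
    """Nesterov's 1983 method, following (4.3)-(4.6) with y_1 = x_0, gamma_0 = 1."""
    x_prev, x = x0.copy(), x0.copy()
    gamma_prev = 1.0
    for _ in range(num_iters):
        g2 = gamma_prev ** 2
        gamma = 0.5 * (np.sqrt(4 * g2 + g2 ** 2) - g2)   # (4.5)
        beta = gamma * (1 / gamma_prev - 1)              # (4.6)
        y = x + beta * (x - x_prev)                      # (4.3)
        x_prev, x = x, y - grad_f(y) / L                 # (4.4)
        gamma_prev = gamma
    return x
```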

It's very similar to the heavy ball method, except the gradient is evaluated at an intermediate point y_{k+1}. Often, the γ-update (4.5) is stated implicitly: γ_k is the positive solution to

(4.7)  γ_k² = (1 − γ_k) γ_{k−1}².

To study Nesterov's 1983 method, we first formulate the recursion in terms of interpretable quantities:

(4.8)  z_k ← x_{k−1} + (1/γ_{k−1}) (x_k − x_{k−1}),  z_0 ← x_0,
(4.9)  y_{k+1} ← (1 − γ_k) x_k + γ_k z_k,
(4.10)  x_{k+1} ← y_{k+1} − (1/L) ∇f(y_{k+1}).

Behind the scenes, the method forms an estimating sequence of quadratic functions centered at {z_k}:

φ_k(x) := φ_k* + (η_k/2) ‖x − z_k‖².

Definition 4.3. A sequence of functions φ_k : R^n → R is an estimating sequence of f : R^n → R if, for some non-negative sequence λ_k → 0,

φ_k(x) ≤ (1 − λ_k) f(x) + λ_k φ_0(x).

Intuitively, an estimating sequence for f consists of successively tighter approximations of f. As we shall see, {φ_k} is an estimating sequence of f. Nesterov chose φ_0 to be a quadratic function centered at the initial point:

φ_0(x) ← φ_0* + (η_0/2) ‖x − x_0‖²,

where φ_0* = f(x_0) and η_0 = L. The subsequent φ_k are convex combinations of their predecessors and the first-order approximation of f at y_{k+1}:

(4.11)  φ_{k+1}(x) ← (1 − γ_k) φ_k(x) + γ_k f̂_k(x),

where f̂_k(x) := f(y_{k+1}) + ∇f(y_{k+1})^T (x − y_{k+1}) and x_k, y_k are given by (4.10) and (4.9). It's straightforward to show the subsequent φ_k are quadratic.

Lemma 4.4. The functions φ_k have the form φ_k* + (η_k/2) ‖x − z_k‖², where

(4.12)  η_{k+1} ← (1 − γ_k) η_k,
(4.13)  z_{k+1} ← z_k − (γ_k/η_{k+1}) ∇f(y_{k+1}),
(4.14)  φ*_{k+1} ← (1 − γ_k) φ_k* + γ_k f̂_k(z_k) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖².

Proof. We know ∇²φ_k = η_k I, but by (4.11), it is also

∇²φ_k = (1 − γ_{k−1}) ∇²φ_{k−1} + γ_{k−1} ∇²f̂_{k−1} = (1 − γ_{k−1}) η_{k−1} I.

Thus η_k = (1 − γ_{k−1}) η_{k−1}. Similarly, we know ∇φ_k(x) = η_k (x − z_k), but it is also

∇φ_k(x) = (1 − γ_{k−1}) ∇φ_{k−1}(x) + γ_{k−1} ∇f̂_{k−1}(x)
 = (1 − γ_{k−1}) η_{k−1} (x − z_{k−1}) + γ_{k−1} ∇f(y_k)
 = η_k (x − z_{k−1}) + γ_{k−1} ∇f(y_k)
 = η_k ( x − ( z_{k−1} − (γ_{k−1}/η_k) ∇f(y_k) ) ).

Thus z_k ← z_{k−1} − (γ_{k−1}/η_k) ∇f(y_k). The intercept φ_k* is simply φ_k(z_k):

φ_k(z_k) = (1 − γ_{k−1}) φ_{k−1}(z_k) + γ_{k−1} f̂_{k−1}(z_k)
 = (1 − γ_{k−1}) ( φ*_{k−1} + (η_{k−1}/2) ‖z_k − z_{k−1}‖² ) + γ_{k−1} f̂_{k−1}(z_k).

Since z_k − z_{k−1} = −(γ_{k−1}/η_k) ∇f(y_k) and η_k = (1 − γ_{k−1}) η_{k−1},

(4.15)  φ_k(z_k) = (1 − γ_{k−1}) φ*_{k−1} + ( (1 − γ_{k−1}) η_{k−1} γ²_{k−1} / (2η_k²) ) ‖∇f(y_k)‖² + γ_{k−1} f̂_{k−1}(z_k)
 = (1 − γ_{k−1}) φ*_{k−1} + ( γ²_{k−1}/(2η_k) ) ‖∇f(y_k)‖² + γ_{k−1} f̂_{k−1}(z_k).

Since f̂_{k−1}(x) = f(y_k) + ∇f(y_k)^T (x − y_k) and z_k = z_{k−1} − (γ_{k−1}/η_k) ∇f(y_k),

(4.16)  f̂_{k−1}(z_k) = f(y_k) + ∇f(y_k)^T ( z_{k−1} − (γ_{k−1}/η_k) ∇f(y_k) − y_k ) = f̂_{k−1}(z_{k−1}) − (γ_{k−1}/η_k) ‖∇f(y_k)‖².

We combine (4.15) and (4.16) to obtain (4.14). ∎

Why are estimating sequences useful? As it turns out, for any sequence {x_k} obeying f(x_k) ≤ inf_x φ_k(x), the convergence rate of {f(x_k)} is given by the convergence rate of {λ_k}.

Corollary 4.5. If {φ_k} is an estimating sequence of f and if f(x_k) ≤ inf_x φ_k(x) for some sequence {x_k}, then

f(x_k) − f* ≤ λ_k (φ_0(x*) − f*).

Proof. Since f(x_k) ≤ inf_x φ_k(x),

f(x_k) − f* ≤ φ_k(x*) − f* ≤ (1 − λ_k) f* + λ_k φ_0(x*) − f* = λ_k (φ_0(x*) − f*). ∎

We show that Nesterov's choice of {φ_k} is an estimating sequence. First, we state two lemmas concerning the sequence {γ_k}. Their proofs are in the Appendix.

Lemma 4.6. The sequence {γ_k} satisfies γ_k² = Π_{l=1}^k (1 − γ_l).

Lemma 4.7. At the k-th iteration, γ_k is at most 2/(k + 2).

Lemma 4.8. The functions {φ_k} form an estimating sequence of f:

φ_k(x) ≤ (1 − λ_k) f(x) + λ_k φ_0(x),

where λ_k := Π_{l=1}^k (1 − γ_l). Further, λ_k ≤ 4/(k + 2)² for any k ≥ 0.

Proof. Consider the sequence λ_0 = 1, λ_k = Π_{l=1}^k (1 − γ_l). At k = 0, clearly φ_0(x) ≤ φ_0(x). Since f is convex, f̂_k(x) ≤ f(x), and

φ_{k+1}(x) ≤ (1 − γ_k) φ_k(x) + γ_k f(x).

By the induction hypothesis,

φ_{k+1}(x) ≤ (1 − γ_k) ( (1 − λ_k) f(x) + λ_k φ_0(x) ) + γ_k f(x) = (1 − (1 − γ_k) λ_k) f(x) + (1 − γ_k) λ_k φ_0(x) = (1 − λ_{k+1}) f(x) + λ_{k+1} φ_0(x).

It remains to show λ_k ≤ 4/(k + 2)². By Lemma 4.6,

(4.17)  λ_k = Π_{l=1}^k (1 − γ_l) = γ_k².

We bound the right side by Lemma 4.7 to obtain the stated result. ∎

To invoke Corollary 4.5, we must first show f(x_k) ≤ inf_x φ_k(x) = φ_k*. As we shall see, {γ_k} and {y_k} are carefully chosen to ensure this. For now, assume f(x_k) ≤ φ_k*. By (4.14) and the convexity of f,

φ*_{k+1} ≥ (1 − γ_k) f(x_k) + γ_k f̂_k(z_k) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖²
 ≥ (1 − γ_k) f̂_k(x_k) + γ_k f̂_k(z_k) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖²
 = f(y_{k+1}) + ∇f(y_{k+1})^T ( (1 − γ_k) x_k + γ_k z_k − y_{k+1} ) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖².

By (4.9), the linear term is zero:

φ*_{k+1} ≥ f(y_{k+1}) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖².

To ensure f(x_{k+1}) ≤ φ*_{k+1}, it suffices to ensure

f(x_{k+1}) ≤ f(y_{k+1}) − (γ_k²/(2η_{k+1})) ‖∇f(y_{k+1})‖²,

which, as we shall see, is possible by taking a gradient step from y_{k+1}.

Lemma 4.9. Under the conditions of Theorem 4.1, the sequence {x_k} given by (4.10) obeys f(x_k) ≤ φ_k*.

Proof. We proceed by induction. At k = 0, f(x_0) = φ_0*. At each iteration, x_{k+1} is a gradient step from y_{k+1}. By (4.1), the function value decreases by at least (1/(2L)) ‖∇f(y_{k+1})‖²:

f(x_{k+1}) ≤ f(y_{k+1}) − (1/(2L)) ‖∇f(y_{k+1})‖² = f̂_k(y_{k+1}) − (1/(2L)) ‖∇f(y_{k+1})‖²,

where the equality follows from the definition of f̂_k. By (4.9), and because f̂_k is affine,

f̂_k(y_{k+1}) = f̂_k( (1 − γ_k) x_k + γ_k z_k ) = (1 − γ_k) f̂_k(x_k) + γ_k f̂_k(z_k).

Thus f(x_{k+1}) is further bounded by

f(x_{k+1}) ≤ (1 − γ_k) f̂_k(x_k) + γ_k f̂_k(z_k) − (1/(2L)) ‖∇f(y_{k+1})‖²
 ≤ (1 − γ_k) f(x_k) + γ_k f̂_k(z_k) − (1/(2L)) ‖∇f(y_{k+1})‖²
 ≤ (1 − γ_k) φ_k* + γ_k f̂_k(z_k) − (1/(2L)) ‖∇f(y_{k+1})‖²,

where the second inequality follows from the convexity of f and the third from the induction hypothesis. Substituting (4.14), we obtain

f(x_{k+1}) ≤ φ*_{k+1} + ( γ_k²/(2η_{k+1}) − 1/(2L) ) ‖∇f(y_{k+1})‖².

By (4.12), η_{k+1} = L Π_{l=1}^k (1 − γ_l) (the recursion is started from η_1 = η_0 = L). Thus

γ_k²/(2η_{k+1}) − 1/(2L) = (1/(2L)) ( γ_k² / Π_{l=1}^k (1 − γ_l) − 1 ).

The difference in parentheses is zero by Lemma 4.6. Thus the iterates {x_k} obey f(x_k) ≤ inf_x φ_k(x). ∎
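The bookkeeping above can be checked numerically. The sketch below (our own illustration) runs (4.9)-(4.10) while maintaining (z_k, φ_k*) via (4.13)-(4.14), writing η_{k+1} = Lγ_k² (which is what (4.12) and Lemma 4.6 give), and asserts the invariant f(x_k) ≤ φ_k* of Lemma 4.9 on an assumed convex quadratic.

```python
import numpy as np

def agd_estimating_sequence(f, grad_f, x0, L, num_iters=50):
    """Run (4.9)-(4.10), tracking z_k and phi_k* via (4.13)-(4.14)."""
    x, z = x0.copy(), x0.copy()
    gamma, phi_star = 1.0, f(x0)                # gamma_0 = 1, phi_0* = f(x_0)
    for _ in range(num_iters):
        y = (1 - gamma) * x + gamma * z         # (4.9)
        g = grad_f(y)
        fhat_z = f(y) + g @ (z - y)             # hat f_k(z_k)
        x = y - g / L                           # (4.10)
        eta = L * gamma ** 2                    # eta_{k+1} = L * gamma_k^2
        z = z - (gamma / eta) * g               # (4.13)
        phi_star = ((1 - gamma) * phi_star + gamma * fhat_z
                    - 0.5 * gamma ** 2 / eta * (g @ g))   # (4.14)
        assert f(x) <= phi_star + 1e-8          # Lemma 4.9's invariant
        g2 = gamma ** 2
        gamma = 0.5 * (np.sqrt(4 * g2 + g2 ** 2) - g2)    # (4.5)
    return x

A = np.diag([1.0, 10.0])
x = agd_estimating_sequence(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                            np.array([-1.0, 1.0]), L=10.0)
```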

We put the pieces together to obtain the convergence rate of Nesterov's 1983 method.

Theorem 4.10. Under the conditions of Theorem 4.1, the iterates {x_k} generated by (4.10) satisfy

f(x_k) − f* ≤ (4L/(k + 2)²) ‖x_0 − x*‖².

Proof. By Corollary 4.5 and Lemma 4.8, the function values obey

f(x_k) − f* ≤ λ_k (φ_0(x*) − f*) ≤ (4/(k + 2)²) (φ_0(x*) − f*).

Further, recalling φ_0(x*) = f(x_0) + (L/2) ‖x* − x_0‖², we obtain

f(x_k) − f* ≤ (4/(k + 2)²) ( (L/2) ‖x* − x_0‖² + f(x_0) − f* ) ≤ (4L/(k + 2)²) ‖x* − x_0‖²,

where the second inequality follows from the strong smoothness of f (which gives f(x_0) − f* ≤ (L/2) ‖x_0 − x*‖²). ∎

Finally, we check that the iterates {(x_k, y_k, z_k)} produced by forming auxiliary functions are the same as those generated by (4.8), (4.9), (4.10).

Lemma 4.11. The iterates {(x_k, y_k, z_k)} generated by the estimating sequence are identical to those generated by (4.8), (4.9), (4.10).

Fig 3. Convergence of gradient descent and Nesterov's 1983 method on a strongly convex function. Like the heavy ball method, Nesterov's 1983 method generates function values that are not monotonically decreasing.

Proof. Since x_k, y_k are given by (4.10) and (4.9) when forming an estimating sequence, it suffices to check that (4.8) and (4.13) are identical. By (4.8),

z_{k+1} ← x_k + (1/γ_k) (x_{k+1} − x_k)
 = x_k + (1/γ_k) ( y_{k+1} − (1/L)∇f(y_{k+1}) − x_k )
 = x_k + (1/γ_k) ( (1 − γ_k) x_k + γ_k z_k − (1/L)∇f(y_{k+1}) − x_k )
 = z_k − (1/(γ_k L)) ∇f(y_{k+1}).

It remains to show 1/(γ_k L) = γ_k/η_{k+1}. By (4.12), η_{k+1} = L Π_{l=1}^k (1 − γ_l). Thus

γ_k/η_{k+1} = γ_k / ( L Π_{l=1}^k (1 − γ_l) ) = 1/(γ_k L),

where the second equality follows by Lemma 4.6. ∎

Nesterov's 1983 method, as stated, does not converge linearly when f is strongly convex. Figure 3 compares the convergence of gradient descent and Nesterov's 1983 method. It's possible to modify the method to attain linear convergence. The key idea is to form φ_k by convex combinations of φ_{k−1} and quadratic lower bounds of f given by (2.2). We refer to Nesterov (2004), Section 2.2 for details.

A more practical issue is choosing the step size in (4.10) when L is unknown. It is possible to incorporate a line search:

(4.18)  x_{k+1} ← y_{k+1} − α_k ∇f(y_{k+1}),

where α_k is chosen by a line search. To form a valid estimating sequence, the step sizes must decrease: α_k ≤ α_{k−1}. If α_k is chosen by a backtracking line search, the initial trial step at the k-th iteration is set to α_{k−1}.

The key idea in Nesterov (1983) is forming an estimating sequence, and Nesterov's 1983 method is merely a special case. The development of accelerated first-order methods by forming estimating sequences remains an active area of research. Just recently,
1. Meng and Chen (2011) proposed a modification of Nesterov's 1983 method that further accelerates the speed of convergence.
2. Gonzaga and Karas (2013) proposed a variant of Nesterov's original method that adapts to unknown strong convexity.
3. O'Donoghue and Candès (2013) proposed a heuristic for resetting the momentum term to zero that improves the convergence rate.
To end on a practical note, the Matlab package TFOCS by Becker, Candès and Grant (2011) implements some of the aforementioned methods.

4.3. Optimality of accelerated gradient methods. To wrap up the section, we pause and ask: are there methods that converge faster than accelerated gradient descent? In other words, given a convex function with Lipschitz gradient, what is the fastest convergence rate attainable by first-order methods? More precisely, a first-order method is one that chooses the k-th iterate in

x_0 + span{∇f(x_0), ..., ∇f(x_{k−1})} for k = 1, 2, ...

As it turns out, the O(1/k²) convergence rate attained by Nesterov's accelerated gradient method is optimal. That is, unless f has special extra structure (e.g. strong convexity), no method has better worst-case performance than Nesterov's 1983 method.

Theorem 4.12. There exists a convex and strongly smooth function f : R^{2k+1} → R such that

f(x_k) − f* ≥ (3L ‖x_0 − x*‖²) / (32(k + 1)²),

where the iterates {x_l} obey x_l ∈ x_0 + span{∇f(x_0), ..., ∇f(x_{l−1})} for any l ∈ [1:k].

Proof sketch. It suffices to exhibit a function that is hard to optimize. More precisely, we construct a function and initial guess such that any first-order method has at best O(1/k²) convergence rate when applied to the problem. Consider the convex quadratic function

f(x) = (L/4) ( (1/2) ( x_1² + Σ_{i=1}^{2k} (x_i − x_{i+1})² + x_{2k+1}² ) − x_1 ).

In matrix form, f(x) = (L/4) ( (1/2) x^T A x − e_1^T x ), where

    A = [  2 −1          ]
        [ −1  2 −1       ]
        [     ·  ·  ·    ]
        [       −1  2 −1 ]
        [          −1  2 ]  ∈ R^{(2k+1)×(2k+1)}.

It's possible to show
1. f is strongly smooth with constant L,
2. its minimum f* = (L/8) ( 1/(2k + 2) − 1 ) is attained at [x*]_i = 1 − i/(2k + 2),
3. K_l[f] := span{∇f(x_0), ..., ∇f(x_{l−1})} is span{e_1, ..., e_l}.

If we initialize x_0 ← 0, then at the k-th iteration,

f(x_k) ≥ inf_{x ∈ K_k[f]} f(x) = (L/8) ( 1/(k + 1) − 1 ).

Further, ‖x_0 − x*‖² = Σ_{i=1}^{2k+1} (1 − i/(2k + 2))² ≤ (2k + 2)/3. Thus

(f(x_k) − f*) / ‖x_0 − x*‖² ≥ (L/8) ( 1/(k + 1) − 1/(2k + 2) ) · 3/(2k + 2) = 3L / (32(k + 1)²). ∎
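For concreteness, here is a sketch (our own illustration) that builds this worst-case quadratic; starting from x_0 = 0, each gradient evaluation can reveal at most one new coordinate, which is the mechanism behind property 3.

```python
import numpy as np

def worst_case_quadratic(k, L=1.0):
    """Build f(x) = (L/4)(x^T A x / 2 - x_1) on R^(2k+1), A tridiagonal."""
    n = 2 * k + 1
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    e1 = np.zeros(n); e1[0] = 1.0
    f = lambda x: (L / 4) * (0.5 * x @ A @ x - x[0])
    grad_f = lambda x: (L / 4) * (A @ x - e1)
    return f, grad_f
```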

5. Projected gradient descent. To wrap up, we consider a simple modification of gradient descent for constrained optimization: projected gradient descent. At each iteration, we take a gradient step and project back onto the feasible set:

(5.1)  x_{k+1} ← P_C(x_k − α_k ∇f(x_k)),

where P_C(x) := argmin_{y ∈ C} (1/2) ‖x − y‖² is the projection of x onto C. For many sets (e.g. subspaces, (symmetric) cones, norm balls, etc.), it is possible to project onto the set efficiently, making projected gradient descent a viable choice for optimization over the set.

Despite the requirement of being able to project efficiently onto the feasible set, projected gradient descent is quite general. Consider the (generic) nonlinear optimization problem

(5.2)  minimize_{x ∈ R^n} f(x) subject to l ≤ c(x) ≤ u.

By introducing slack variables, (5.2) is equivalent to

minimize_{s ∈ R^m, x ∈ R^n} f(x) subject to c(x) = s, l ≤ s ≤ u.

The bound-constrained Lagrangian approach (BCL) solves a sequence of bound-constrained augmented Lagrangian subproblems of the form

minimize_{s ∈ R^m, x ∈ R^n} f(x) + λ_k^T (c(x) − s) + (ρ_k/2) ‖c(x) − s‖² subject to l ≤ s ≤ u.

Since projection onto rectangles is efficient, projected gradient descent may be used to solve the BCL subproblems.

Given the similarity of projected gradient descent to gradient descent, it is no surprise that the convergence rate of projected gradient descent is the same as that of its unconstrained counterpart. For now, fix a step size α and consider projected gradient descent as a fixed point iteration: x_{k+1} ← P_C(G_α(x_k)), where G_α is the gradient step. Since the optimum x* is a fixed point of P_C ∘ G_α, the iterates converge if P_C ∘ G_α is a contraction. (A sketch of the method appears below.)
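Here is a minimal sketch of the iteration (5.1); the NumPy code and the toy problem are our own illustrative choices, with a box projection of the kind used in the BCL subproblems.

```python
import numpy as np

def projected_gradient_descent(grad_f, x0, alpha, project, num_iters=100):
    """Iterate x_{k+1} = P_C(x_k - alpha * grad_f(x_k)), as in (5.1)."""
    x = x0.copy()
    for _ in range(num_iters):
        x = project(x - alpha * grad_f(x))
    return x

# Projection onto the rectangle {x : l <= x <= u} is a componentwise clip.
l, u = -np.ones(2), np.ones(2)
x = projected_gradient_descent(lambda x: x - 2.0, np.zeros(2), alpha=0.5,
                               project=lambda z: np.clip(z, l, u))
# Here f(x) = 0.5 ||x - 2||^2, so the constrained optimum is x* = u.
```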

Lemma 5.1. The projection onto a convex set C ⊆ R^n is non-expansive:

‖P_C(y) − P_C(x)‖ ≤ ‖y − x‖.

Proof. The optimality condition of the constrained optimization problem defining the projection onto C is

(x − P_C(x))^T (y − P_C(x)) ≤ 0 for any y ∈ C.

For any two points x, y and their projections onto C, we have

(x − P_C(x))^T (P_C(y) − P_C(x)) ≤ 0,
(y − P_C(y))^T (P_C(x) − P_C(y)) ≤ 0.

We add the two inequalities and rearrange to obtain

‖P_C(y) − P_C(x)‖² ≤ (y − x)^T (P_C(y) − P_C(x)).

We apply the Cauchy-Schwarz inequality to obtain the stated result. ∎

Since P_C is non-expansive,

‖P_C(G_α(x)) − P_C(G_α(y))‖ ≤ ‖G_α(x) − G_α(y)‖.

Thus, as long as α is chosen to ensure G_α is a contraction, P_C ∘ G_α is also a contraction. The rest of the story is the same as that of gradient descent: when f is strongly convex and strongly smooth, projected gradient descent converges linearly. That is, it requires at most O(κ log(1/ɛ)) iterations to obtain an ɛ-accurate solution.

When f is not strongly convex (but convex and strongly smooth), the story is also similar to that of gradient descent. However, the proof is more involved. First, we show projected gradient descent is a descent method.

Lemma 5.2. When
1. f : R^n → R is convex and strongly smooth with constant L,
2. C ⊆ R^n is convex,
projected gradient descent with step size 1/L satisfies

f(x_{k+1}) ≤ f(x_k) − (L/2) ‖x_{k+1} − x_k‖².

Proof. Since f is strongly smooth, we have

(5.3)  f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (L/2) ‖x_{k+1} − x_k‖².

The point x_{k+1} is the projection of x_k − (1/L)∇f(x_k) onto C. Thus

(5.4)  (x_k − (1/L)∇f(x_k) − x_{k+1})^T (y − x_{k+1}) ≤ 0 for any y ∈ C.

By setting y = x_k in (5.4) and rearranging, we obtain

∇f(x_k)^T (x_k − x_{k+1}) ≥ L ‖x_k − x_{k+1}‖².

We substitute the bound into (5.3) to obtain the stated result. ∎

We put the pieces together to show projected gradient descent requires at most O(L/ɛ) iterations to obtain an ɛ-suboptimal point.

Theorem 5.3. Under the conditions of Lemma 5.2, projected gradient descent with step size 1/L satisfies

f(x_k) − f* ≤ (L/(2k)) ‖x* − x_0‖².

Proof. By setting y = x* in (5.4) (and rearranging), we obtain

∇f(x_k)^T (x_{k+1} − x*) ≤ L (x_{k+1} − x_k)^T (x* − x_{k+1}).

Substituting the bound into (5.3), together with the convexity of f (i.e. f(x_k) ≤ f* + ∇f(x_k)^T (x_k − x*)), we have

f(x_{k+1}) ≤ f* + L (x_{k+1} − x_k)^T (x* − x_{k+1}) + (L/2) ‖x_{k+1} − x_k‖².

We complete the square to obtain

f(x_{k+1}) ≤ f* + (L/2) ( ‖x* − x_k‖² − ‖x* − x_{k+1}‖² ).

Summing over iterations, we obtain

Σ_{l=1}^k (f(x_l) − f*) ≤ (L/2) Σ_{l=1}^k ( ‖x* − x_{l−1}‖² − ‖x* − x_l‖² ) = (L/2) ( ‖x* − x_0‖² − ‖x* − x_k‖² ) ≤ (L/2) ‖x* − x_0‖².

By Lemma 5.2, {f(x_k)} is decreasing. Thus

f(x_k) − f* ≤ (1/k) Σ_{l=1}^k (f(x_l) − f*) ≤ (L/(2k)) ‖x* − x_0‖². ∎

When the strong smoothness constant is unknown, it is possible to choose step sizes by a projected backtracking line search (sketched below). We
1. start with some initial trial step size α_k ← α_init,
2. decrease the trial step size geometrically: α_k ← α_k/2,
3. continue until α_k satisfies a sufficient descent condition:
(5.5)  f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (1/(2α_k)) ‖x_{k+1} − x_k‖²,
where x_{k+1} ← P_C(x_k − α_k ∇f(x_k)). Thus the line search backtracks along the projection of the search ray onto the feasible set. It is possible to show projected gradient descent with a projected backtracking line search preserves the O(1/k) convergence rate of projected gradient descent with step size 1/L.
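A sketch of the projected backtracking step follows; it is our own illustration of the three-step recipe above, using (5.5) with the 1/(2α_k) curvature term.

```python
import numpy as np

def projected_backtracking_step(f, grad_f, x, project, alpha_init=1.0):
    """Backtrack along the projection of the search ray until (5.5) holds."""
    g = grad_f(x)
    alpha = alpha_init
    while True:
        x_new = project(x - alpha * g)
        d = x_new - x
        if f(x_new) <= f(x) + g @ d + (d @ d) / (2 * alpha):
            return x_new, alpha
        alpha *= 0.5
```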

APPENDIX

Proof of Lemma 4.6. We proceed by induction. Recall {γ_k} is generated by the recursion (4.7). Since γ_0 = 1, we automatically have γ_1² = (1 − γ_1). By the induction hypothesis γ_{k−1}² = Π_{l=1}^{k−1} (1 − γ_l) and (4.7),

γ_k² = (1 − γ_k) γ_{k−1}² = Π_{l=1}^k (1 − γ_l). ∎

Proof of Lemma 4.7. We consider the gap between successive 1/γ_l:

1/γ_{l+1} − 1/γ_l = (γ_l − γ_{l+1}) / (γ_l γ_{l+1}) = (γ_l² − γ_{l+1}²) / ( γ_l γ_{l+1} (γ_l + γ_{l+1}) ).

By (4.7), we have

1/γ_{l+1} − 1/γ_l = ( γ_l² − (1 − γ_{l+1}) γ_l² ) / ( γ_l γ_{l+1} (γ_l + γ_{l+1}) ) = γ_{l+1} γ_l² / ( γ_l γ_{l+1} (γ_l + γ_{l+1}) ) = γ_l / (γ_l + γ_{l+1}).

It is easy to show the sequence {γ_l} is decreasing. Thus

1/γ_{l+1} − 1/γ_l = γ_l / (γ_l + γ_{l+1}) ≥ γ_l / (2γ_l) = 1/2.

Summing the inequalities for l = 0, 1, ..., k − 1, we obtain

1/γ_k − 1/γ_0 ≥ k/2.

We recall γ_0 = 1 to deduce 1/γ_k ≥ 1 + k/2, which implies γ_k ≤ 2/(k + 2). ∎
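Both lemmas are easy to confirm numerically; the check below is our own illustration of the recursion (4.5)/(4.7).

```python
import numpy as np

# Numerical check of Lemmas 4.6 and 4.7 along the sequence generated by (4.5).
gamma, prod = 1.0, 1.0
for k in range(1, 50):
    g2 = gamma ** 2
    gamma = 0.5 * (np.sqrt(4 * g2 + g2 ** 2) - g2)   # (4.5), i.e. solve (4.7)
    prod *= 1 - gamma                                # prod_{l=1}^k (1 - gamma_l)
    assert np.isclose(gamma ** 2, prod)              # Lemma 4.6
    assert gamma <= 2 / (k + 2) + 1e-12              # Lemma 4.7
```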

REFERENCES

Beck, A. and Teboulle, M. (2009). Gradient-based algorithms with applications to signal recovery. In Convex Optimization in Signal Processing and Communications.
Becker, S. R., Candès, E. J. and Grant, M. C. (2011). Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation.
Gonzaga, C. C. and Karas, E. W. (2013). Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming. Mathematical Programming.
Meng, X. and Chen, H. (2011). Accelerating Nesterov's method for strongly convex functions with Lipschitz gradient. arXiv preprint.
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady.
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization, vol. 87. Springer Science & Business Media.
O'Donoghue, B. and Candès, E. (2013). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics.

Yuekai Sun
Stanford, California
April 9
yuekai@stanford.edu


More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente October 25, 2012 Abstract In this paper we prove that the broad class of direct-search methods of directional type based on imposing sufficient decrease

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

1 Computing with constraints

1 Computing with constraints Notes for 2017-04-26 1 Computing with constraints Recall that our basic problem is minimize φ(x) s.t. x Ω where the feasible set Ω is defined by equality and inequality conditions Ω = {x R n : c i (x)

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information