NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS

1. Introduction. We consider first-order methods for smooth, unconstrained optimization:

(1.1)  minimize $f(x)$, $x \in \mathbb{R}^n$,

where $f : \mathbb{R}^n \to \mathbb{R}$. We assume the optimum of $f$ exists and is unique. By first-order, we are referring to methods that use only function value and gradient information. Compared to second-order methods (e.g. Newton-type methods, interior-point methods), first-order methods usually converge more slowly, and their performance degrades on ill-conditioned problems. However, they are well-suited to large-scale problems because each iteration usually requires only vector operations,¹ in addition to the evaluation of $f(x)$ and $\nabla f(x)$.

¹ Second-order methods usually require the solution of linear system(s) at each iteration, making them prohibitively expensive for very large problems.

2. Gradient descent. The simplest first-order method is gradient descent or steepest descent. It starts at an initial point $x_0$ and generates successive iterates by

(2.1)  $x_{k+1} \leftarrow x_k - \alpha_k \nabla f(x_k)$.

The step size $\alpha_k$ is either fixed or chosen by a line search that (approximately) minimizes $f(x_k - \alpha \nabla f(x_k))$. For small to medium-scale problems, each iteration is inexpensive, requiring only vector operations. However, the iterates may converge slowly to the optimum $x^*$.

Let's take a closer look at the convergence of gradient descent. For now, fix a step size $\alpha$ and consider gradient descent as a fixed point iteration: $x_{k+1} \leftarrow G_\alpha(x_k)$, where $G_\alpha(x) := x - \alpha \nabla f(x)$ is a contraction: $\|G_\alpha(x) - G_\alpha(y)\| \le L_G \|x - y\|$ for some $L_G < 1$. Since the optimum $x^*$ is a fixed point of $G_\alpha$, we expect the fixed point iteration $x_{k+1} \leftarrow G_\alpha(x_k)$ to converge to the fixed point.
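To make the iteration (2.1) concrete, here is a minimal sketch of gradient descent with a fixed step size. The diagonal quadratic test function and the step size $\alpha = 1/L$ are illustrative choices, not part of the notes themselves.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, iters):
    """Fixed-step gradient descent: x_{k+1} <- x_k - alpha * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

# Illustrative test problem: f(x) = 0.5*(x_1^2 + 10*x_2^2),
# so mu = 1, L = 10 and the minimizer is the origin.
grad_f = lambda x: np.array([1.0, 10.0]) * x
x = gradient_descent(grad_f, x0=[1.0, 1.0], alpha=1.0 / 10.0, iters=500)
```

With the fixed step $1/L$, the iterates contract toward the minimizer at a linear rate governed by the contraction coefficient.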
MS&E 318: LARGE-SCALE NUMERICAL OPTIMIZATION

Lemma 2.1. When the gradient step $G_\alpha(x)$ is a contraction, gradient descent converges linearly to $x^*$: $\|x_k - x^*\| \le L_G^k \|x_0 - x^*\|$.

Proof. We begin by showing that each iteration of gradient descent contracts the distance to $x^*$:

$\|x_{k+1} - x^*\| = \|x_k - \alpha \nabla f(x_k) - (x^* - \alpha \nabla f(x^*))\| = \|G_\alpha(x_k) - G_\alpha(x^*)\| \le L_G \|x_k - x^*\|$.

By induction, $\|x_k - x^*\| \le L_G^k \|x_0 - x^*\|$.

At first blush, it seems like gradient descent converges quickly. However, the contraction coefficient $L_G$ depends poorly on the condition number $\kappa := L/\mu$ of the problem. In particular, $L_G \approx 1 - 2/\kappa$ with an optimal step size. From now on, we assume the function $f$ is convex.

Lemma 2.2. When $f : \mathbb{R}^n \to \mathbb{R}$ is
1. twice continuously differentiable,
2. strongly convex with constant $\mu > 0$:
(2.2)  $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2} \|y - x\|^2$,
3. strongly smooth with constant $L > 0$:
(2.3)  $f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2$,
the contraction coefficient $L_G$ is at most $\max\{|1 - \alpha\mu|, |1 - \alpha L|\}$.

Proof. By the mean value theorem,

$\|G_\alpha(x) - G_\alpha(y)\| = \|x - \alpha \nabla f(x) - (y - \alpha \nabla f(y))\| = \|(I - \alpha \nabla^2 f(z))(x - y)\| \le \|I - \alpha \nabla^2 f(z)\| \|x - y\|$

for some $z$ on the line segment between $x$ and $y$. By strong convexity and smoothness, the eigenvalues of $\nabla^2 f(z)$ are between $\mu$ and $L$. Thus $\|I - \alpha \nabla^2 f(z)\| \le \max\{|1 - \alpha\mu|, |1 - \alpha L|\}$.
Fig 1. The iterates of gradient descent (left panel) and the heavy ball method (right panel) starting at $(1, 1)$.

We combine Lemmas 2.1 and 2.2 to deduce

$\|x_{k+1} - x^*\| \le \max\{|1 - \alpha\mu|, |1 - \alpha L|\}^{k+1} \|x_0 - x^*\|$.

The step size $\frac{2}{L+\mu}$ minimizes the contraction coefficient. Substituting the optimal step size, we conclude that

$\|x_{k+1} - x^*\| \le \left( \frac{\kappa - 1}{\kappa + 1} \right)^{k+1} \|x_0 - x^*\|$,

where $\kappa := L/\mu$ is a condition number of the problem. When $\kappa$ is large, the contraction coefficient is roughly $1 - 2/\kappa$. In other words, gradient descent requires at most $O(\kappa \log(1/\epsilon))$ iterations to attain an $\epsilon$-accurate solution.

3. The heavy ball method. The heavy ball method is usually attributed to Polyak (1964). The intuition is simple: the iterates of gradient descent tend to bounce between the walls of narrow valleys on the objective surface. The left panel of Figure 1 shows the iterates of gradient descent bouncing from wall to wall. To avoid bouncing, the heavy ball method adds a momentum term to the gradient step:

$x_{k+1} \leftarrow x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1})$.

The term $x_k - x_{k-1}$ nudges $x_{k+1}$ in the direction of the previous step (hence momentum). The right panel of Figure 1 shows the effect of adding momentum.
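The momentum update is equally short to implement. The sketch below uses Polyak's fixed parameter choice $\alpha = 4/(\sqrt{L}+\sqrt{\mu})^2$ and $\beta = ((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1))^2$ on an illustrative diagonal quadratic (the test problem is not from the notes):

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha, beta, iters):
    """Heavy ball: x_{k+1} <- x_k - alpha*grad_f(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
    return x

# Illustrative quadratic with mu = 1, L = 100 (kappa = 100), minimizer at 0.
mu, L = 1.0, 100.0
grad_f = lambda x: np.array([mu, L]) * x
alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)) ** 2
x = heavy_ball(grad_f, [1.0, 1.0], alpha, beta, iters=200)
```

At $\kappa = 100$, the per-iteration contraction factor $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1) \approx 0.82$ is far better than gradient descent's $(\kappa-1)/(\kappa+1) \approx 0.98$.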
Let's study the heavy ball method and compare it to gradient descent. Since each iterate depends on the previous two iterates, we study the three-term recurrence in matrix form:

$\begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} = \begin{bmatrix} x_k + \beta(x_k - x_{k-1}) - \alpha \nabla f(x_k) - x^* \\ x_k - x^* \end{bmatrix} = \begin{bmatrix} (1+\beta)I & -\beta I \\ I & 0 \end{bmatrix} \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix} - \begin{bmatrix} \alpha \nabla^2 f(z_k)(x_k - x^*) \\ 0 \end{bmatrix}$

for some $z_k$ on the line segment between $x_k$ and $x^*$

(3.1)  $= \begin{bmatrix} (1+\beta)I - \alpha \nabla^2 f(z_k) & -\beta I \\ I & 0 \end{bmatrix} \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix}$.

The contraction coefficient is given by $\left\| \begin{bmatrix} (1+\beta)I - \alpha \nabla^2 f(z_k) & -\beta I \\ I & 0 \end{bmatrix} \right\|$. We introduce a lemma to bound the norm.

Lemma 3.1. Under the conditions of Lemma 2.2,

$\left\| \begin{bmatrix} (1+\beta)I - \alpha \nabla^2 f(z) & -\beta I \\ I & 0 \end{bmatrix} \right\| \le \max\{|1 - \sqrt{\alpha\mu}|, |1 - \sqrt{\alpha L}|\}$

for $\beta = \max\{|1 - \sqrt{\alpha\mu}|, |1 - \sqrt{\alpha L}|\}^2$.

Proof. Let $\Lambda := \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, where $\{\lambda_i\}_{i \in [1:n]}$ are the eigenvalues of $\nabla^2 f(z)$. By diagonalization,

$\left\| \begin{bmatrix} (1+\beta)I - \alpha \nabla^2 f(z) & -\beta I \\ I & 0 \end{bmatrix} \right\| = \left\| \begin{bmatrix} (1+\beta)I - \alpha \Lambda & -\beta I \\ I & 0 \end{bmatrix} \right\| = \max_{i \in [1:n]} \left\| \begin{bmatrix} 1 + \beta - \alpha\lambda_i & -\beta \\ 1 & 0 \end{bmatrix} \right\|$.

The second equality holds because it is possible to permute the matrix to a block diagonal matrix with $2 \times 2$ blocks. For each $i \in [1:n]$, the eigenvalues of the $2 \times 2$ matrices are given by the roots of

$p_i(\lambda) := \lambda^2 - (1 + \beta - \alpha\lambda_i)\lambda + \beta = 0$.

It is possible to show that when $\beta$ is at least $(1 - \sqrt{\alpha\lambda_i})^2$, the roots of $p_i$ are complex and both have magnitude $\sqrt{\beta}$. Since $\beta := \max\{|1 - \sqrt{\alpha\mu}|, |1 - \sqrt{\alpha L}|\}^2$, the magnitudes of the roots are at most $\sqrt{\beta}$.
Theorem 3.2. Under the conditions of Lemma 2.2, the heavy ball method with fixed parameters $\alpha = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$ and $\beta = \max\{|1 - \sqrt{\alpha\mu}|, |1 - \sqrt{\alpha L}|\}^2$ converges linearly to $x^*$:

$\|x_{k+1} - x^*\| \le \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k \|x_0 - x^*\|$,

where $\kappa = L/\mu$.

Proof. By Lemma 3.1,

$\left\| \begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} \right\| \le \max\{|1 - \sqrt{\alpha\mu}|, |1 - \sqrt{\alpha L}|\} \left\| \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix} \right\|$.

Substituting $\alpha = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$ gives

$\left\| \begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} \right\| \le \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \left\| \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix} \right\|$,

where $\kappa := L/\mu$. We induct on $k$ to obtain the stated result.

The heavy ball method also converges linearly, but its contraction coefficient is roughly $1 - 2/\sqrt{\kappa}$. In other words, the heavy ball method attains an $\epsilon$-accurate solution after at most $O(\sqrt{\kappa} \log(1/\epsilon))$ iterations. Figure 2 compares the convergence of gradient descent and the heavy ball method.

Unlike gradient descent, the heavy ball method is not a descent method, i.e. $f(x_{k+1})$ is not necessarily smaller than $f(x_k)$. On a related note, the conjugate gradient method (CG) is an instance of the heavy ball method with adaptive step sizes and momentum parameters. However, CG does not require knowledge of $L$ and $\mu$ to choose step sizes.

4. Accelerated gradient methods.

4.1. Gradient descent in the absence of strong convexity. Before delving into Nesterov's celebrated method, we study gradient descent in the setting he considered. Nesterov studied (1.1) when $f$ is continuously differentiable and strongly smooth. In the absence of strong convexity, (1.1) may not have a unique optimum, so we cannot expect the iterates to always converge to a point $x^*$; the iterates may converge to different points depending on the initial point $x_0$. However, if $f$ remains convex, we expect the optimality gap $f(x_k) - f^*$ to converge to zero. To begin, we study the convergence rate of gradient descent in the absence of strong convexity.
Fig 2. Convergence of gradient descent and heavy ball function values ($f(x_k) - f^*$) on a strongly convex function. Although the heavy ball function values are not monotone, they converge faster.

Theorem 4.1. When $f : \mathbb{R}^n \to \mathbb{R}$ is convex and
1. continuously differentiable,
2. strongly smooth with constant $L$,
gradient descent with step size $\frac{1}{L}$ satisfies

$f(x_k) - f^* \le \frac{L}{2k} \|x_0 - x^*\|^2$.

Proof. The proof consists of two parts. First, we show that, at each step, gradient descent reduces the function value by at least a certain amount. In other words, we show each iteration of gradient descent achieves sufficient descent. Recall $x_{k+1} \leftarrow x_k - \alpha \nabla f(x_k)$. By the strong smoothness of $f$,

$f(x_{k+1}) \le f(x_k) - \alpha \left( 1 - \frac{\alpha L}{2} \right) \|\nabla f(x_k)\|^2$.

When $\alpha = \frac{1}{L}$, the right side simplifies to

(4.1)  $f(x_{k+1}) \le f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2$.

The second part of the proof combines the sufficient descent condition with the convexity of $f$ to obtain a bound on the optimality gap.
By the first-order condition for convexity,

$f(x_{k+1}) \le f^* + \nabla f(x_k)^T (x_k - x^*) - \frac{1}{2L} \|\nabla f(x_k)\|^2$
$= f^* - \frac{L}{2} \left( \|x_k - x^*\|^2 - \frac{2}{L} \nabla f(x_k)^T (x_k - x^*) + \frac{1}{L^2} \|\nabla f(x_k)\|^2 \right) + \frac{L}{2} \|x_k - x^*\|^2$,

where the second equality follows from completing the square,

$= f^* - \frac{L}{2} \left\| x_k - x^* - \frac{1}{L} \nabla f(x_k) \right\|^2 + \frac{L}{2} \|x_k - x^*\|^2 = f^* - \frac{L}{2} \|x_{k+1} - x^*\|^2 + \frac{L}{2} \|x_k - x^*\|^2$.

We sum over iterations to obtain

$\sum_{l=1}^{k} (f(x_l) - f^*) \le \frac{L}{2} \sum_{l=1}^{k} \left( \|x_{l-1} - x^*\|^2 - \|x_l - x^*\|^2 \right) = \frac{L}{2} \left( \|x_0 - x^*\|^2 - \|x_k - x^*\|^2 \right) \le \frac{L}{2} \|x_0 - x^*\|^2$.

Since $f(x_k)$ is non-increasing, we have

$f(x_k) - f^* \le \frac{1}{k} \sum_{l=1}^{k} (f(x_l) - f^*) \le \frac{L}{2k} \|x_0 - x^*\|^2$.

In the absence of strong convexity, the convergence rate of gradient descent slows to sublinear. Roughly speaking, gradient descent requires at most $O(\frac{L}{\epsilon})$ iterations to obtain an $\epsilon$-optimal solution.

The story is similar when the step sizes are not fixed, but chosen by a backtracking line search. More precisely, a backtracking line search
1. starts with some initial trial step size $\alpha_k \leftarrow \alpha_{\mathrm{init}}$,
2. decreases the trial step size geometrically: $\alpha_k \leftarrow \frac{\alpha_k}{2}$,³
3. continues until $\alpha_k$ satisfies a sufficient descent condition:

(4.2)  $f(x_k - \alpha_k \nabla f(x_k)) \le f(x_k) - \frac{\alpha_k}{2} \|\nabla f(x_k)\|^2$.

³ The new trial step may be $\beta\alpha$, for any $\beta \in (0, 1)$. In practice, it's frequently set close to 1 to avoid taking overly conservative step sizes. We let $\beta = \frac{1}{2}$ for simplicity.
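The three steps above can be sketched as follows. The quadratic test function is an illustrative choice; note that the line search only evaluates $f$ and never uses $L$:

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, alpha_init=1.0, iters=500):
    """Gradient descent with a backtracking line search:
    halve the trial step until the sufficient descent condition (4.2) holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        alpha = alpha_init
        # Backtrack: f(x - alpha*g) <= f(x) - (alpha/2)*||g||^2.
        while f(x - alpha * g) > f(x) - 0.5 * alpha * (g @ g):
            alpha *= 0.5
        x = x - alpha * g
    return x

# Illustrative problem: f(x) = 0.5*(x_1^2 + 10*x_2^2); here L = 10,
# but the line search does not need to know it.
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad_f = lambda x: np.array([x[0], 10.0 * x[1]])
x = backtracking_gd(f, grad_f, [1.0, 1.0])
```

Because halving stops once the trial step drops below $1/L$, each accepted step is at least $\min\{\alpha_{\mathrm{init}}, 1/(2L)\}$, matching the analysis that follows.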
Lemma 4.2. Under the conditions of Theorem 4.1, backtracking with initial step size $\alpha_{\mathrm{init}}$ chooses a step size that is at least $\min\{\alpha_{\mathrm{init}}, \frac{1}{2L}\}$.

Proof. Recall the quadratic upper bound

$f(x_k - \alpha \nabla f(x_k)) \le f(x_k) - \alpha \left( 1 - \frac{\alpha L}{2} \right) \|\nabla f(x_k)\|^2$

for any $\alpha \ge 0$. It's minimized at $\alpha = \frac{1}{L}$. We check that choosing any $\alpha_k \le \frac{1}{L}$ satisfies (4.2):

$f(x_k - \alpha_k \nabla f(x_k)) \le f(x_k) + \alpha_k \left( \frac{\alpha_k L}{2} - 1 \right) \|\nabla f(x_k)\|^2 \le f(x_k) + \alpha_k \left( \frac{1}{2} - 1 \right) \|\nabla f(x_k)\|^2 = f(x_k) - \frac{\alpha_k}{2} \|\nabla f(x_k)\|^2$.

Since any step size shorter than $\frac{1}{L}$ satisfies (4.2), the line search stops backtracking at a step size no shorter than $\frac{1}{2L}$.

Thus, at each step, gradient descent with backtracking line search decreases the function value by at least $\frac{1}{4L} \|\nabla f(x_k)\|^2$. The rest of the story is similar to that of gradient descent with a fixed step size. To summarize, gradient descent with a backtracking line search also requires at most $O(\frac{L}{\epsilon})$ iterations to obtain an $\epsilon$-optimal solution.

4.2. Nesterov's accelerated gradient method. Accelerated gradient methods obtain an $\epsilon$-optimal solution in $O(\sqrt{L/\epsilon})$ iterations. Nesterov (1983) is the original and (one of) the best known. It initializes $y_1 \leftarrow x_0$, $\gamma_0 \leftarrow 1$ and generates iterates according to

(4.3)  $y_{k+1} \leftarrow x_k + \beta_k (x_k - x_{k-1})$
(4.4)  $x_{k+1} \leftarrow y_{k+1} - \frac{1}{L} \nabla f(y_{k+1})$,

where $L$ is the strong smoothness constant of $f$ and

(4.5)  $\gamma_k \leftarrow \frac{1}{2} \left( \left( 4\gamma_{k-1}^2 + \gamma_{k-1}^4 \right)^{1/2} - \gamma_{k-1}^2 \right)$
(4.6)  $\beta_k \leftarrow \gamma_k \left( \frac{1}{\gamma_{k-1}} - 1 \right)$.
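Updates (4.3)-(4.6) translate directly into code. The quadratic test problem below is an illustrative choice:

```python
import numpy as np

def nesterov_1983(grad_f, x0, L, iters):
    """Nesterov's 1983 method, following updates (4.3)-(4.6)."""
    x_prev = x = np.asarray(x0, dtype=float)
    gamma_prev = 1.0  # gamma_0 = 1, so beta_1 = 0 and y_1 = x_0
    for _ in range(iters):
        # gamma_k solves gamma_k^2 = (1 - gamma_k) * gamma_{k-1}^2; cf. (4.7)
        gamma = 0.5 * (np.sqrt(4.0 * gamma_prev**2 + gamma_prev**4) - gamma_prev**2)
        beta = gamma * (1.0 / gamma_prev - 1.0)
        y = x + beta * (x - x_prev)          # (4.3)
        x, x_prev = y - grad_f(y) / L, x     # (4.4)
        gamma_prev = gamma
    return x

# Illustrative quadratic with L = 10 and minimizer at the origin.
L = 10.0
grad_f = lambda x: np.array([1.0, L]) * x
x = nesterov_1983(grad_f, [1.0, 1.0], L, iters=300)
```

The only structural difference from the heavy ball sketch is that the gradient is evaluated at the extrapolated point $y_{k+1}$ rather than at $x_k$.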
It's very similar to the heavy ball method, except the gradient is evaluated at an intermediate point $y_{k+1}$. Often, the $\gamma$-update (4.5) is stated as the solution to

(4.7)  $\gamma_k^2 = (1 - \gamma_k) \gamma_{k-1}^2$.

To study Nesterov's 1983 method, we first formulate the recursion in terms of interpretable quantities:

(4.8)  $z_k \leftarrow x_{k-1} + \frac{1}{\gamma_{k-1}} (x_k - x_{k-1})$, with $z_0 \leftarrow x_0$,
(4.9)  $y_{k+1} \leftarrow (1 - \gamma_k) x_k + \gamma_k z_k$,
(4.10)  $x_{k+1} \leftarrow y_{k+1} - \frac{1}{L} \nabla f(y_{k+1})$.

Behind the scenes, the method forms an estimating sequence of quadratic functions centered at $\{z_k\}$: $\phi_k(x) := \phi_k^* + \frac{\eta_k}{2} \|x - z_k\|^2$.

Definition 4.3. A sequence of functions $\phi_k : \mathbb{R}^n \to \mathbb{R}$ is an estimating sequence of $f : \mathbb{R}^n \to \mathbb{R}$ if, for some non-negative sequence $\lambda_k \to 0$,

$\phi_k(x) \le (1 - \lambda_k) f(x) + \lambda_k \phi_0(x)$.

Intuitively, an estimating sequence for $f$ consists of successively tighter approximations of $f$. As we shall see, $\{\phi_k\}$ is an estimating sequence of $f$. Nesterov chose $\phi_0$ to be a quadratic function centered at the initial point: $\phi_0(x) \leftarrow \phi_0^* + \frac{\eta_0}{2} \|x - x_0\|^2$, where $\phi_0^* = f(x_0)$ and $\eta_0 = L$. The subsequent $\phi_k$ are convex combinations of $\phi_{k-1}$ and the first-order approximation of $f$ at $y_{k+1}$:

(4.11)  $\phi_{k+1}(x) \leftarrow (1 - \gamma_k) \phi_k(x) + \gamma_k \hat{f}_k(x)$,

where $\hat{f}_k(x) := f(y_{k+1}) + \nabla f(y_{k+1})^T (x - y_{k+1})$ and $x_k, y_k$ are given by (4.10) and (4.9). It's straightforward to show the subsequent $\phi_k$ are quadratic.

Lemma 4.4. The functions $\phi_k$ have the form $\phi_k^* + \frac{\eta_k}{2} \|x - z_k\|^2$, where

(4.12)  $\eta_{k+1} \leftarrow (1 - \gamma_k) \eta_k$,
(4.13)  $z_{k+1} \leftarrow z_k - \frac{\gamma_k}{\eta_{k+1}} \nabla f(y_{k+1})$,
(4.14)  $\phi_{k+1}^* \leftarrow (1 - \gamma_k) \phi_k^* + \gamma_k \hat{f}_k(z_k) - \frac{\gamma_k^2}{2\eta_{k+1}} \|\nabla f(y_{k+1})\|^2$.
Proof. We know $\nabla^2 \phi_k = \eta_k I$, but by (4.11), it is also $(1 - \gamma_{k-1}) \nabla^2 \phi_{k-1} + \gamma_{k-1} \nabla^2 \hat{f}_{k-1} = (1 - \gamma_{k-1}) \eta_{k-1} I$. Thus $\eta_k \leftarrow (1 - \gamma_{k-1}) \eta_{k-1}$. Similarly, we know $\nabla \phi_k(x) = \eta_k (x - z_k)$, but it is also

$(1 - \gamma_{k-1}) \nabla \phi_{k-1}(x) + \gamma_{k-1} \nabla \hat{f}_{k-1}(x) = (1 - \gamma_{k-1}) \eta_{k-1} (x - z_{k-1}) + \gamma_{k-1} \nabla f(y_k)$
$= \eta_k (x - z_{k-1}) + \gamma_{k-1} \nabla f(y_k) = \eta_k \left( x - \left( z_{k-1} - \frac{\gamma_{k-1}}{\eta_k} \nabla f(y_k) \right) \right)$.

Thus $z_k \leftarrow z_{k-1} - \frac{\gamma_{k-1}}{\eta_k} \nabla f(y_k)$. The intercept $\phi_k^*$ is simply $\phi_k(z_k)$:

$\phi_k(z_k) = (1 - \gamma_{k-1}) \phi_{k-1}(z_k) + \gamma_{k-1} \hat{f}_{k-1}(z_k) = (1 - \gamma_{k-1}) \left( \phi_{k-1}^* + \frac{\eta_{k-1}}{2} \|z_k - z_{k-1}\|^2 \right) + \gamma_{k-1} \hat{f}_{k-1}(z_k)$.

Since $z_k - z_{k-1} = -\frac{\gamma_{k-1}}{\eta_k} \nabla f(y_k)$ and $\eta_k = (1 - \gamma_{k-1}) \eta_{k-1}$,

(4.15)  $\phi_k(z_k) = (1 - \gamma_{k-1}) \phi_{k-1}^* + \frac{(1 - \gamma_{k-1}) \eta_{k-1} \gamma_{k-1}^2}{2\eta_k^2} \|\nabla f(y_k)\|^2 + \gamma_{k-1} \hat{f}_{k-1}(z_k) = (1 - \gamma_{k-1}) \phi_{k-1}^* + \frac{\gamma_{k-1}^2}{2\eta_k} \|\nabla f(y_k)\|^2 + \gamma_{k-1} \hat{f}_{k-1}(z_k)$.

Since $\hat{f}_{k-1}(x) = f(y_k) + \nabla f(y_k)^T (x - y_k)$ and $z_k = z_{k-1} - \frac{\gamma_{k-1}}{\eta_k} \nabla f(y_k)$,

(4.16)  $\hat{f}_{k-1}(z_k) = f(y_k) + \nabla f(y_k)^T \left( z_{k-1} - \frac{\gamma_{k-1}}{\eta_k} \nabla f(y_k) - y_k \right) = \hat{f}_{k-1}(z_{k-1}) - \frac{\gamma_{k-1}}{\eta_k} \|\nabla f(y_k)\|^2$.

We combine (4.15) and (4.16) to obtain (4.14).

Why are estimating sequences useful? As it turns out, for any sequence $\{x_k\}$ obeying $f(x_k) \le \inf_x \phi_k(x)$, the convergence rate of $\{f(x_k) - f^*\}$ is given by the convergence rate of $\{\lambda_k\}$.
Corollary 4.5. If $\{\phi_k\}$ is an estimating sequence of $f$ and if $f(x_k) \le \inf_x \phi_k(x)$ for some sequence $\{x_k\}$, then

$f(x_k) - f^* \le \lambda_k (\phi_0(x^*) - f^*)$.

Proof. Since $f(x_k) \le \inf_x \phi_k(x)$,

$f(x_k) - f^* \le \phi_k(x^*) - f^* \le (1 - \lambda_k) f^* + \lambda_k \phi_0(x^*) - f^* = \lambda_k (\phi_0(x^*) - f^*)$.

We show that Nesterov's choice of $\{\phi_k\}$ is an estimating sequence. First, we state two lemmas concerning the sequence $\{\gamma_k\}$. Their proofs are in the Appendix.

Lemma 4.6. The sequence $\{\gamma_k\}$ satisfies $\gamma_k^2 = \prod_{l=1}^{k} (1 - \gamma_l)$.

Lemma 4.7. At the $k$-th iteration, $\gamma_k$ is at most $\frac{2}{k+2}$.

Lemma 4.8. The functions $\{\phi_k\}$ form an estimating sequence of $f$:

$\phi_k(x) \le (1 - \lambda_k) f(x) + \lambda_k \phi_0(x)$,

where $\lambda_k := \prod_{l=1}^{k} (1 - \gamma_l)$. Further, $\lambda_k \le \frac{4}{(k+2)^2}$ for any $k \ge 0$.

Proof. Consider the sequence $\lambda_0 = 1$, $\lambda_k = \prod_{l=1}^{k} (1 - \gamma_l)$. At $k = 0$, clearly $\phi_0(x) \le \phi_0(x)$. Since $f$ is convex, $f(x) \ge \hat{f}_k(x)$, and

$\phi_{k+1}(x) \le (1 - \gamma_k) \phi_k(x) + \gamma_k f(x)$.

By the induction hypothesis,

$\phi_{k+1}(x) \le (1 - \gamma_k) \left( (1 - \lambda_k) f(x) + \lambda_k \phi_0(x) \right) + \gamma_k f(x) = (1 - (1 - \gamma_k) \lambda_k) f(x) + (1 - \gamma_k) \lambda_k \phi_0(x) = (1 - \lambda_{k+1}) f(x) + \lambda_{k+1} \phi_0(x)$.

It remains to show $\lambda_k \le \frac{4}{(k+2)^2}$. By Lemma 4.6,

(4.17)  $\lambda_k = \prod_{l=1}^{k} (1 - \gamma_l) = \gamma_k^2$.

We bound the right side by Lemma 4.7 to obtain the stated result.
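Lemmas 4.6 and 4.7 are easy to check numerically. The horizon of 50 iterations below is an arbitrary illustrative choice:

```python
# Generate gamma_k via the update (4.5) with gamma_0 = 1, then verify the
# identity gamma_k^2 = prod_{l<=k} (1 - gamma_l) of Lemma 4.6 and the
# bound gamma_k <= 2/(k+2) of Lemma 4.7.
gammas = [1.0]
for _ in range(50):
    g = gammas[-1]
    gammas.append(0.5 * ((4.0 * g**2 + g**4) ** 0.5 - g**2))

prod = 1.0
for k in range(1, 51):
    prod *= 1.0 - gammas[k]
    assert abs(gammas[k] ** 2 - prod) < 1e-12
    assert gammas[k] <= 2.0 / (k + 2)
```

For instance, $\gamma_1 = \frac{1}{2}(\sqrt{5} - 1) \approx 0.618$, which indeed satisfies $\gamma_1^2 = 1 - \gamma_1$ and $\gamma_1 \le \frac{2}{3}$.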
To invoke Corollary 4.5, we must first show $f(x_k) \le \inf_x \phi_k(x) = \phi_k^*$. As we shall see, $\{\gamma_k\}$ and $\{y_k\}$ are carefully chosen to ensure $f(x_k) \le \inf_x \phi_k(x)$. For now, assume $f(x_k) \le \phi_k^*$. By (4.14) and the convexity of $f$,

$\phi_{k+1}^* \ge (1 - \gamma_k) f(x_k) + \gamma_k \hat{f}_k(z_k) - \frac{\gamma_k^2}{2\eta_{k+1}} \|\nabla f(y_{k+1})\|^2$
$\ge (1 - \gamma_k) \hat{f}_k(x_k) + \gamma_k \hat{f}_k(z_k) - \frac{\gamma_k^2}{2\eta_{k+1}} \|\nabla f(y_{k+1})\|^2$
$= f(y_{k+1}) + \nabla f(y_{k+1})^T \left( (1 - \gamma_k) x_k + \gamma_k z_k - y_{k+1} \right) - \frac{\gamma_k^2}{2\eta_{k+1}} \|\nabla f(y_{k+1})\|^2$.

By (4.9), the linear term is zero:

$\phi_{k+1}^* \ge f(y_{k+1}) - \frac{\gamma_k^2}{2\eta_{k+1}} \|\nabla f(y_{k+1})\|^2$.

To ensure $f(x_{k+1}) \le \phi_{k+1}^*$, it suffices to ensure

$f(x_{k+1}) \le f(y_{k+1}) - \frac{\gamma_k^2}{2\eta_{k+1}} \|\nabla f(y_{k+1})\|^2$,

which, as we shall see, is possible by taking a gradient step from $y_{k+1}$.

Lemma 4.9. Under the conditions of Theorem 4.1, the sequence $\{x_k\}$ given by (4.10) obeys $f(x_k) \le \phi_k^*$.

Proof. We proceed by induction. At $x_0$, $f(x_0) = \phi_0^*$. At each iteration, $x_{k+1}$ is a gradient step from $y_{k+1}$. By the sufficient descent condition (4.1), the function value decreases by at least $\frac{1}{2L} \|\nabla f(y_{k+1})\|^2$:

$f(x_{k+1}) \le f(y_{k+1}) - \frac{1}{2L} \|\nabla f(y_{k+1})\|^2 = \hat{f}_k(y_{k+1}) - \frac{1}{2L} \|\nabla f(y_{k+1})\|^2$,

where the second equality follows from the definition of $\hat{f}_k$. By (4.9) and the linearity of $\hat{f}_k$,

$\hat{f}_k(y_{k+1}) = \hat{f}_k((1 - \gamma_k) x_k + \gamma_k z_k) = (1 - \gamma_k) \hat{f}_k(x_k) + \gamma_k \hat{f}_k(z_k)$.

Thus $f(x_{k+1})$ is further bounded by

$f(x_{k+1}) \le (1 - \gamma_k) \hat{f}_k(x_k) + \gamma_k \hat{f}_k(z_k) - \frac{1}{2L} \|\nabla f(y_{k+1})\|^2$
$\le (1 - \gamma_k) f(x_k) + \gamma_k \hat{f}_k(z_k) - \frac{1}{2L} \|\nabla f(y_{k+1})\|^2$
$\le (1 - \gamma_k) \phi_k^* + \gamma_k \hat{f}_k(z_k) - \frac{1}{2L} \|\nabla f(y_{k+1})\|^2$,
where the last inequality follows from the induction hypothesis. Substituting (4.14), we obtain

$f(x_{k+1}) \le \phi_{k+1}^* + \left( \frac{\gamma_k^2}{2\eta_{k+1}} - \frac{1}{2L} \right) \|\nabla f(y_{k+1})\|^2$.

By (4.12), $\eta_{k+1} = L \prod_{l=1}^{k} (1 - \gamma_l)$. Thus

$\frac{\gamma_k^2}{2\eta_{k+1}} - \frac{1}{2L} = \frac{1}{2L} \left( \frac{\gamma_k^2}{\prod_{l=1}^{k} (1 - \gamma_l)} - 1 \right)$.

The difference in parentheses is zero by Lemma 4.6. Thus the iterates $\{x_k\}$ obey $f(x_k) \le \inf_x \phi_k(x)$.

We put the pieces together to obtain the convergence rate of Nesterov's 1983 method.

Theorem 4.10. Under the conditions of Theorem 4.1, the iterates $\{x_k\}$ generated by (4.8)-(4.10) satisfy

$f(x_k) - f^* \le \frac{4L}{(k+2)^2} \|x_0 - x^*\|^2$.

Proof. By Corollary 4.5 and Lemma 4.8, the function values obey

$f(x_k) - f^* \le \lambda_k (\phi_0(x^*) - f^*) \le \frac{4}{(k+2)^2} (\phi_0(x^*) - f^*)$.

Further, recalling $\phi_0(x) = f(x_0) + \frac{L}{2} \|x - x_0\|^2$, we obtain

$f(x_k) - f^* \le \frac{4}{(k+2)^2} \left( \frac{L}{2} \|x^* - x_0\|^2 + f(x_0) - f^* \right) \le \frac{4L}{(k+2)^2} \|x^* - x_0\|^2$,

where the second inequality follows from the strong smoothness of $f$ (since $\nabla f(x^*) = 0$, $f(x_0) - f^* \le \frac{L}{2} \|x_0 - x^*\|^2$).

Finally, we check that the iterates $\{(x_k, y_k, z_k)\}$ produced by forming the auxiliary functions are the same as those generated by (4.8), (4.9), (4.10).

Lemma 4.11. The iterates $\{(x_k, y_k, z_k)\}$ generated by the estimating sequence are identical to those generated by (4.8), (4.9), (4.10).
Fig 3. Convergence of gradient descent and Nesterov's 1983 method ($f(x_k) - f^*$) on a strongly convex function. Like the heavy ball method, Nesterov's 1983 method generates function values that are not monotonically decreasing.

Proof. Since $x_k, y_k$ are given by (4.10) and (4.9) when forming an estimating sequence, it suffices to check that (4.8) and (4.13) are identical:

$z_{k+1} \leftarrow x_k + \frac{1}{\gamma_k} (x_{k+1} - x_k)$
$= x_k + \frac{1}{\gamma_k} \left( y_{k+1} - \frac{1}{L} \nabla f(y_{k+1}) - x_k \right)$
$= x_k + \frac{1}{\gamma_k} \left( (1 - \gamma_k) x_k + \gamma_k z_k - \frac{1}{L} \nabla f(y_{k+1}) - x_k \right)$
$= z_k - \frac{1}{\gamma_k L} \nabla f(y_{k+1})$.

It remains to show $\frac{1}{\gamma_k L} = \frac{\gamma_k}{\eta_{k+1}}$. By (4.12), $\eta_{k+1} = L \prod_{l=1}^{k} (1 - \gamma_l)$. Thus

$\frac{\gamma_k}{\eta_{k+1}} = \frac{\gamma_k}{L \prod_{l=1}^{k} (1 - \gamma_l)} = \frac{1}{\gamma_k L}$,

where the second equality follows by Lemma 4.6.

Nesterov's 1983 method, as stated, does not converge linearly when $f$ is strongly convex. Figure 3 compares the convergence of gradient descent and Nesterov's 1983 method. It's possible to modify the method to attain linear convergence. The key idea is to form $\phi_k$ by convex combinations of $\phi_{k-1}$ and quadratic lower bounds of $f$ given by (2.2). We refer to Nesterov (2004), Section 2.2 for details.
A more practical issue is choosing the step size in (4.10) when $L$ is unknown. It is possible to incorporate a line search:

(4.18)  $x_{k+1} \leftarrow y_{k+1} - \alpha_k \nabla f(y_{k+1})$,

where $\alpha_k$ is chosen by a line search. To form a valid estimating sequence, the step sizes must decrease: $\alpha_k \le \alpha_{k-1}$. If $\alpha_k$ is chosen by a backtracking line search, the initial trial step at the $k$-th iteration is set to $\alpha_{k-1}$.

The key idea in Nesterov (1983) is forming an estimating sequence, and Nesterov's 1983 method is merely a special case. The development of accelerated first-order methods by forming estimating sequences remains an active area of research. Just recently,
1. Meng and Chen (2011) proposed a modification of Nesterov's 1983 method that further accelerates the speed of convergence.
2. Gonzaga and Karas (2013) proposed a variant of Nesterov's original method that adapts to unknown strong convexity.
3. O'Donoghue and Candès (2013) proposed a heuristic for resetting the momentum term to zero that improves the convergence rate.
To end on a practical note, the Matlab package TFOCS by Becker, Candès and Grant (2011) implements some of the aforementioned methods.

4.3. Optimality of accelerated gradient methods. To wrap up the section, we pause and ask: are there methods that converge faster than accelerated gradient descent? In other words, given a convex function with Lipschitz gradient, what is the fastest convergence rate attainable by first-order methods? More precisely, a first-order method is one that chooses the $k$-th iterate in

$x_0 + \mathrm{span}\{\nabla f(x_0), \ldots, \nabla f(x_{k-1})\}$ for $k = 1, 2, \ldots$

As it turns out, the $O(\frac{1}{k^2})$ convergence rate attained by Nesterov's accelerated gradient method is optimal. That is, unless $f$ has special extra structure (e.g. strong convexity), no method has better worst-case performance than Nesterov's 1983 method.

Theorem 4.12.
There exists a convex and strongly smooth function $f : \mathbb{R}^{2k+1} \to \mathbb{R}$ such that

$f(x_k) - f^* \ge \frac{3L \|x_0 - x^*\|^2}{32(k+1)^2}$,

where $\{x_l\}$ obey $x_l \in x_0 + \mathrm{span}(\{\nabla f(x_0), \ldots, \nabla f(x_{l-1})\})$ for any $l \in [1:k]$.
Proof sketch. It suffices to exhibit a function that is hard to optimize. More precisely, we construct a function and initial guess such that any first-order method has at best $O(\frac{1}{k^2})$ convergence rate when applied to the problem. Consider the convex quadratic function

$f(x) = \frac{L}{4} \left( \frac{1}{2} \left( x_1^2 + \sum_{i=1}^{2k} (x_i - x_{i+1})^2 + x_{2k+1}^2 \right) - x_1 \right)$.

In matrix form, $f(x) = \frac{L}{4} \left( \frac{1}{2} x^T A x - e_1^T x \right)$, where

$A = \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{bmatrix} \in \mathbb{R}^{(2k+1) \times (2k+1)}$.

It's possible to show
1. $f$ is strongly smooth with constant $L$,
2. its minimum $f^* = \frac{L}{8} \left( \frac{1}{2k+2} - 1 \right)$ is attained at $x_i^* = 1 - \frac{i}{2k+2}$,
3. $\mathcal{K}_l[f] := \mathrm{span}(\{\nabla f(x_0), \ldots, \nabla f(x_{l-1})\})$ is $\mathrm{span}(\{e_1, \ldots, e_l\})$.

If we initialize $x_0 \leftarrow 0$, then at the $k$-th iteration,

$f(x_k) \ge \inf_{x \in \mathcal{K}_k[f]} f(x) = \frac{L}{8} \left( \frac{1}{k+1} - 1 \right)$.

Thus

$\frac{f(x_k) - f^*}{\|x_0 - x^*\|^2} \ge \frac{\frac{L}{8} \left( \frac{1}{k+1} - \frac{1}{2k+2} \right)}{\frac{2k+2}{3}} = \frac{3L}{32(k+1)^2}$,

where we use $\|x_0 - x^*\|^2 = \sum_{i=1}^{2k+1} \left( 1 - \frac{i}{2k+2} \right)^2 \le \frac{2k+2}{3}$.

5. Projected gradient descent. To wrap up, we consider a simple modification of gradient descent for constrained optimization: projected gradient descent. At each iteration, we take a gradient step and project back onto the feasible set:

(5.1)  $x_{k+1} \leftarrow P_C(x_k - \alpha_k \nabla f(x_k))$,

where $P_C(x) := \arg\min_{y \in C} \frac{1}{2} \|x - y\|^2$ is the projection of $x$ onto $C$. For many sets (e.g. subspaces, (symmetric) cones, norm balls, etc.), it is possible
to project onto the set efficiently, making projected gradient descent a viable choice for optimization over the set.

Despite the requirement of being able to project efficiently onto the feasible set, projected gradient descent is quite general. Consider the (generic) nonlinear optimization problem

(5.2)  minimize $f(x)$, $x \in \mathbb{R}^n$, subject to $l \le c(x) \le u$.

By introducing slack variables, (5.2) is equivalent to

minimize $f(x)$ over $s \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, subject to $c(x) = s$, $l \le s \le u$.

The bound-constrained Lagrangian approach (BCL) solves a sequence of bound-constrained augmented Lagrangian subproblems of the form

minimize $f(x) + \lambda_k^T c(x) + \frac{\rho_k}{2} \|c(x) - s\|^2$ over $s \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, subject to $l \le s \le u$.

Since projection onto rectangles is efficient, projected gradient descent may be used to solve the BCL subproblems.

Given the similarity of projected gradient descent to gradient descent, it is no surprise that the convergence rate of projected gradient descent is the same as that of its unconstrained counterpart. For now, fix a step size $\alpha$ and consider projected gradient descent as a fixed point iteration: $x_{k+1} \leftarrow P_C(G_\alpha(x_k))$, where $G_\alpha$ is the gradient step. Since the optimum $x^*$ is a fixed point of $P_C \circ G_\alpha$, the iterates converge if $P_C \circ G_\alpha$ is a contraction.

Lemma 5.1. The projection onto a convex set $C \subseteq \mathbb{R}^n$ is non-expansive:

$\|P_C(y) - P_C(x)\|^2 \le (y - x)^T (P_C(y) - P_C(x))$.

Proof. The optimality condition of the constrained optimization problem defining the projection onto $C$ is

$(x - P_C(x))^T (y - P_C(x)) \le 0$ for any $y \in C$.
For any two points $x, y$ and their projections onto $C$, we have

$(x - P_C(x))^T (P_C(y) - P_C(x)) \le 0$,
$(y - P_C(y))^T (P_C(x) - P_C(y)) \le 0$.

We add the two inequalities and rearrange to obtain

$\|P_C(y) - P_C(x)\|^2 \le (y - x)^T (P_C(y) - P_C(x))$.

Applying the Cauchy-Schwarz inequality to the right side yields $\|P_C(y) - P_C(x)\| \le \|y - x\|$, the stated non-expansiveness.

Since $P_C$ is non-expansive, $\|P_C(G_\alpha(x)) - P_C(G_\alpha(y))\| \le \|G_\alpha(x) - G_\alpha(y)\|$. Thus, as long as $\alpha$ is chosen to ensure $G_\alpha$ is a contraction, $P_C \circ G_\alpha$ is also a contraction. The rest of the story is the same as that of gradient descent: when $f$ is strongly convex and strongly smooth, projected gradient descent converges linearly. That is, it requires at most $O(\kappa \log(1/\epsilon))$ iterations to obtain an $\epsilon$-accurate solution.

When $f$ is not strongly convex (but convex and strongly smooth), the story is also similar to that of gradient descent. However, the proof is more involved. First, we show projected gradient descent is a descent method.

Lemma 5.2. When
1. $f : \mathbb{R}^n \to \mathbb{R}$ is convex and strongly smooth with constant $L$,
2. $C \subseteq \mathbb{R}^n$ is convex,
projected gradient descent with step size $\frac{1}{L}$ satisfies

$f(x_{k+1}) \le f(x_k) - \frac{L}{2} \|x_{k+1} - x_k\|^2$.

Proof. Since $f$ is strongly smooth, we have

(5.3)  $f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2} \|x_{k+1} - x_k\|^2$.

The point $x_{k+1}$ is the projection of $x_k - \frac{1}{L} \nabla f(x_k)$ onto $C$. Thus

(5.4)  $\left( x_k - \frac{1}{L} \nabla f(x_k) - x_{k+1} \right)^T (y - x_{k+1}) \le 0$ for any $y \in C$.

By setting $y = x_k$ in (5.4) and rearranging, we obtain

$\nabla f(x_k)^T (x_k - x_{k+1}) \ge L \|x_k - x_{k+1}\|^2$.

We substitute the bound into (5.3) to obtain the stated result.
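A minimal sketch of the update (5.1); the box constraint and the objective below are illustrative choices, picked so that the projection is a coordinate-wise clip:

```python
import numpy as np

def projected_gd(grad_f, project, x0, alpha, iters):
    """Projected gradient descent (5.1): gradient step, then project onto C."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project(x - alpha * grad_f(x))
    return x

# Illustrative problem: minimize f(x) = 0.5*||x - c||^2 over the box
# [0, 1]^2, where c lies outside the box. Here L = 1, so alpha = 1/L = 1.
c = np.array([2.0, -1.0])
grad_f = lambda x: x - c
project = lambda x: np.clip(x, 0.0, 1.0)  # projection onto the rectangle
x = projected_gd(grad_f, project, [0.5, 0.5], alpha=1.0, iters=50)
# The solution is the projection of c onto the box, namely (1, 0).
```

The same template covers any convex $C$ for which `project` is cheap to evaluate, e.g. the rectangles arising in the BCL subproblems.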
We put the pieces together to show projected gradient descent requires at most $O(\frac{L}{\epsilon})$ iterations to obtain an $\epsilon$-suboptimal point.

Theorem 5.3. Under the conditions of Lemma 5.2, projected gradient descent with step size $\frac{1}{L}$ satisfies

$f(x_k) - f^* \le \frac{L}{2k} \|x^* - x_0\|^2$.

Proof. By setting $y = x^*$ in (5.4) (and rearranging), we obtain

$\nabla f(x_k)^T (x_{k+1} - x^*) \le L (x_k - x_{k+1})^T (x_{k+1} - x^*)$.

Combining the convexity of $f$ (which gives $f(x_k) \le f^* + \nabla f(x_k)^T (x_k - x^*)$) with (5.3) and the bound above, we have

$f(x_{k+1}) \le f^* + \nabla f(x_k)^T (x_{k+1} - x^*) + \frac{L}{2} \|x_{k+1} - x_k\|^2 \le f^* + L (x_k - x_{k+1})^T (x_{k+1} - x^*) + \frac{L}{2} \|x_{k+1} - x_k\|^2$.

We complete the square to obtain

$f(x_{k+1}) \le f^* + \frac{L}{2} \left( \|x^* - x_k\|^2 - \|x^* - x_{k+1}\|^2 \right)$.

Summing over iterations, we obtain

$\sum_{l=1}^{k} (f(x_l) - f^*) \le \frac{L}{2} \sum_{l=1}^{k} \left( \|x^* - x_{l-1}\|^2 - \|x^* - x_l\|^2 \right) = \frac{L}{2} \left( \|x^* - x_0\|^2 - \|x^* - x_k\|^2 \right) \le \frac{L}{2} \|x^* - x_0\|^2$.

By Lemma 5.2, $\{f(x_k)\}$ is decreasing. Thus

$f(x_k) - f^* \le \frac{1}{k} \sum_{l=1}^{k} (f(x_l) - f^*) \le \frac{L}{2k} \|x^* - x_0\|^2$.

When the strong smoothness constant is unknown, it is possible to choose step sizes by a projected backtracking line search. We
1. start with some initial trial step size $\alpha_k \leftarrow \alpha_{\mathrm{init}}$,
2. decrease the trial step size geometrically: $\alpha_k \leftarrow \frac{\alpha_k}{2}$,
3. continue until $\alpha_k$ satisfies a sufficient descent condition:

(5.5)  $f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{1}{2\alpha_k} \|x_{k+1} - x_k\|^2$,

where $x_{k+1} \leftarrow P_C(x_k - \alpha_k \nabla f(x_k))$. Thus the line search backtracks along the projection of the search ray onto the feasible set. It is possible to show projected gradient descent with a projected backtracking line search preserves the $O(\frac{1}{k})$ convergence rate of projected gradient descent with step size $\frac{1}{L}$.

APPENDIX

Proof of Lemma 4.6. We proceed by induction. Recall $\{\gamma_k\}$ is generated by the recursion (4.7). Since $\gamma_0 \leftarrow 1$, we automatically have $\gamma_1^2 = (1 - \gamma_1)$. By the induction hypothesis $\gamma_{k-1}^2 = \prod_{l=1}^{k-1} (1 - \gamma_l)$ and (4.7),

$\gamma_k^2 = (1 - \gamma_k) \gamma_{k-1}^2 = \prod_{l=1}^{k} (1 - \gamma_l)$.

Proof of Lemma 4.7. We consider the gap between successive $\frac{1}{\gamma_k}$:

$\frac{1}{\gamma_{l+1}} - \frac{1}{\gamma_l} = \frac{\gamma_l - \gamma_{l+1}}{\gamma_l \gamma_{l+1}} = \frac{\gamma_l^2 - \gamma_{l+1}^2}{\gamma_l \gamma_{l+1} (\gamma_l + \gamma_{l+1})}$.

By (4.7), we have

$\frac{1}{\gamma_{l+1}} - \frac{1}{\gamma_l} = \frac{\gamma_l^2 - \gamma_l^2 (1 - \gamma_{l+1})}{\gamma_l \gamma_{l+1} (\gamma_l + \gamma_{l+1})} = \frac{\gamma_{l+1} \gamma_l^2}{\gamma_l \gamma_{l+1} (\gamma_l + \gamma_{l+1})} = \frac{\gamma_l}{\gamma_l + \gamma_{l+1}}$.

It is easy to show the sequence $\{\gamma_l\}$ is decreasing. Thus

$\frac{1}{\gamma_{l+1}} - \frac{1}{\gamma_l} \ge \frac{\gamma_l}{2\gamma_l} = \frac{1}{2}$.

Summing the inequalities for $l = 0, 1, \ldots, k-1$, we obtain

$\frac{1}{\gamma_k} - \frac{1}{\gamma_0} \ge \frac{k}{2}$.

We recall $\gamma_0 \leftarrow 1$ to deduce $\frac{1}{\gamma_k} \ge 1 + \frac{k}{2}$, which implies the stated bound $\gamma_k \le \frac{2}{k+2}$.
REFERENCES

Beck, A. and Teboulle, M. (2009). Gradient-based algorithms with applications to signal recovery. In Convex Optimization in Signal Processing and Communications.
Becker, S. R., Candès, E. J. and Grant, M. C. (2011). Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation.
Gonzaga, C. C. and Karas, E. W. (2013). Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming. Mathematical Programming.
Meng, X. and Chen, H. (2011). Accelerating Nesterov's method for strongly convex functions with Lipschitz gradient. arXiv preprint.
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady.
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization 87. Springer Science & Business Media.
O'Donoghue, B. and Candès, E. (2013). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics.

Yuekai Sun
Stanford, California
April 9, yuekai@stanford.edu
More informationFrank-Wolfe Method. Ryan Tibshirani Convex Optimization
Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)
More informationLecture 5: September 12
10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 12 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Barun Patra and Tyler Vuong Note: LaTeX template courtesy of UC Berkeley EECS
More informationNon-negative Matrix Factorization via accelerated Projected Gradient Descent
Non-negative Matrix Factorization via accelerated Projected Gradient Descent Andersen Ang Mathématique et recherche opérationnelle UMONS, Belgium Email: manshun.ang@umons.ac.be Homepage: angms.science
More informationOptimization Tutorial 1. Basic Gradient Descent
E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.
More informationLecture 3: Linesearch methods (continued). Steepest descent methods
Lecture 3: Linesearch methods (continued). Steepest descent methods Coralia Cartis, Mathematical Institute, University of Oxford C6.2/B2: Continuous Optimization Lecture 3: Linesearch methods (continued).
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationGlobal convergence of the Heavy-ball method for convex optimization
Noname manuscript No. will be inserted by the editor Global convergence of the Heavy-ball method for convex optimization Euhanna Ghadimi Hamid Reza Feyzmahdavian Mikael Johansson Received: date / Accepted:
More informationOptimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30
Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained
More informationDescent methods. min x. f(x)
Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained
More informationProgramming, numerics and optimization
Programming, numerics and optimization Lecture C-3: Unconstrained optimization II Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428
More informationUnconstrained optimization
Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout
More information6.252 NONLINEAR PROGRAMMING LECTURE 10 ALTERNATIVES TO GRADIENT PROJECTION LECTURE OUTLINE. Three Alternatives/Remedies for Gradient Projection
6.252 NONLINEAR PROGRAMMING LECTURE 10 ALTERNATIVES TO GRADIENT PROJECTION LECTURE OUTLINE Three Alternatives/Remedies for Gradient Projection Two-Metric Projection Methods Manifold Suboptimization Methods
More informationORIE 6326: Convex Optimization. Quasi-Newton Methods
ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationLecture 7: September 17
10-725: Optimization Fall 2013 Lecture 7: September 17 Lecturer: Ryan Tibshirani Scribes: Serim Park,Yiming Gu 7.1 Recap. The drawbacks of Gradient Methods are: (1) requires f is differentiable; (2) relatively
More informationHigher-Order Methods
Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth
More informationComplexity of gradient descent for multiobjective optimization
Complexity of gradient descent for multiobjective optimization J. Fliege A. I. F. Vaz L. N. Vicente July 18, 2018 Abstract A number of first-order methods have been proposed for smooth multiobjective optimization
More informationPart 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)
Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective
More informationAccelerated primal-dual methods for linearly constrained convex problems
Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize
More informationSelected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018
Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected
More informationSuppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.
Maria Cameron 1. Trust Region Methods At every iteration the trust region methods generate a model m k (p), choose a trust region, and solve the constraint optimization problem of finding the minimum of
More information10. Unconstrained minimization
Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation
More informationLecture 14: Newton s Method
10-725/36-725: Conve Optimization Fall 2016 Lecturer: Javier Pena Lecture 14: Newton s ethod Scribes: Varun Joshi, Xuan Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes
More informationA Conservation Law Method in Optimization
A Conservation Law Method in Optimization Bin Shi Florida International University Tao Li Florida International University Sundaraja S. Iyengar Florida International University Abstract bshi1@cs.fiu.edu
More information5. Subgradient method
L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize
More informationMultipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations
Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations Oleg Burdakov a,, Ahmad Kamandi b a Department of Mathematics, Linköping University,
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex
More informationCubic regularization of Newton s method for convex problems with constraints
CORE DISCUSSION PAPER 006/39 Cubic regularization of Newton s method for convex problems with constraints Yu. Nesterov March 31, 006 Abstract In this paper we derive efficiency estimates of the regularized
More informationNonlinear Programming
Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week
More informationMS&E 318 (CME 338) Large-Scale Numerical Optimization
Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods
More information8 Numerical methods for unconstrained problems
8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers
More informationLecture 14: October 17
1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationMore First-Order Optimization Algorithms
More First-Order Optimization Algorithms Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 3, 8, 3 The SDM
More informationAn adaptive accelerated first-order method for convex optimization
An adaptive accelerated first-order method for convex optimization Renato D.C Monteiro Camilo Ortiz Benar F. Svaiter July 3, 22 (Revised: May 4, 24) Abstract This paper presents a new accelerated variant
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationAccelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)
Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan
More informationFINE TUNING NESTEROV S STEEPEST DESCENT ALGORITHM FOR DIFFERENTIABLE CONVEX PROGRAMMING. 1. Introduction. We study the nonlinear programming problem
FINE TUNING NESTEROV S STEEPEST DESCENT ALGORITHM FOR DIFFERENTIABLE CONVEX PROGRAMMING CLÓVIS C. GONZAGA AND ELIZABETH W. KARAS Abstract. We modify the first order algorithm for convex programming described
More informationLecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent
10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 5: Gradient Descent Scribes: Loc Do,2,3 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationCS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent
CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi Gradient Descent This lecture introduces Gradient Descent a meta-algorithm for unconstrained minimization. Under convexity of the
More informationLecture 15 Newton Method and Self-Concordance. October 23, 2008
Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationConjugate Gradient (CG) Method
Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationThe Newton Bracketing method for the minimization of convex functions subject to affine constraints
Discrete Applied Mathematics 156 (2008) 1977 1987 www.elsevier.com/locate/dam The Newton Bracketing method for the minimization of convex functions subject to affine constraints Adi Ben-Israel a, Yuri
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More information5 Handling Constraints
5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest
More informationE5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization
E5295/5B5749 Convex optimization with engineering applications Lecture 8 Smooth convex unconstrained and equality-constrained minimization A. Forsgren, KTH 1 Lecture 8 Convex optimization 2006/2007 Unconstrained
More informationLecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem
Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Michael Patriksson 0-0 The Relaxation Theorem 1 Problem: find f := infimum f(x), x subject to x S, (1a) (1b) where f : R n R
More informationAlgorithms for Constrained Optimization
1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic
More informationWorst Case Complexity of Direct Search
Worst Case Complexity of Direct Search L. N. Vicente May 3, 200 Abstract In this paper we prove that direct search of directional type shares the worst case complexity bound of steepest descent when sufficient
More informationTHE NEWTON BRACKETING METHOD FOR THE MINIMIZATION OF CONVEX FUNCTIONS SUBJECT TO AFFINE CONSTRAINTS
THE NEWTON BRACKETING METHOD FOR THE MINIMIZATION OF CONVEX FUNCTIONS SUBJECT TO AFFINE CONSTRAINTS ADI BEN-ISRAEL AND YURI LEVIN Abstract. The Newton Bracketing method [9] for the minimization of convex
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationLECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE
LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization
More informationLecture 8: February 9
0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we
More informationFAST FIRST-ORDER METHODS FOR COMPOSITE CONVEX OPTIMIZATION WITH BACKTRACKING
FAST FIRST-ORDER METHODS FOR COMPOSITE CONVEX OPTIMIZATION WITH BACKTRACKING KATYA SCHEINBERG, DONALD GOLDFARB, AND XI BAI Abstract. We propose new versions of accelerated first order methods for convex
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationA Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.
More informationOn the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method
Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationAgenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples
Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method
More informationR-Linear Convergence of Limited Memory Steepest Descent
R-Linear Convergence of Limited Memory Steepest Descent Frank E. Curtis, Lehigh University joint work with Wei Guo, Lehigh University OP17 Vancouver, British Columbia, Canada 24 May 2017 R-Linear Convergence
More informationOptimization and Optimal Control in Banach Spaces
Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,
More informationLecture 9 Sequential unconstrained minimization
S. Boyd EE364 Lecture 9 Sequential unconstrained minimization brief history of SUMT & IP methods logarithmic barrier function central path UMT & SUMT complexity analysis feasibility phase generalized inequalities
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationAM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods
AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality
More informationLine Search Methods for Unconstrained Optimisation
Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic
More informationCoordinate Update Algorithm Short Course Proximal Operators and Algorithms
Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow
More informationWorst Case Complexity of Direct Search
Worst Case Complexity of Direct Search L. N. Vicente October 25, 2012 Abstract In this paper we prove that the broad class of direct-search methods of directional type based on imposing sufficient decrease
More informationA Unified Approach to Proximal Algorithms using Bregman Distance
A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department
More informationSecond Order Optimization Algorithms I
Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The
More informationIterative Methods for Solving A x = b
Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http
More informationUnconstrained minimization
CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More information1 Computing with constraints
Notes for 2017-04-26 1 Computing with constraints Recall that our basic problem is minimize φ(x) s.t. x Ω where the feasible set Ω is defined by equality and inequality conditions Ω = {x R n : c i (x)
More informationOne Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties
One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,
More information