
FINE TUNING NESTEROV'S STEEPEST DESCENT ALGORITHM FOR DIFFERENTIABLE CONVEX PROGRAMMING

CLÓVIS C. GONZAGA AND ELIZABETH W. KARAS

Abstract. We modify the first order algorithm for convex programming described by Nesterov in his book [5]. In his algorithms, Nesterov makes explicit use of a Lipschitz constant L for the function gradient, which is either assumed known [5] or estimated by an adaptive procedure [7]. We eliminate the use of L at the cost of an extra imprecise line search, and obtain an algorithm which keeps the optimal complexity properties and also inherits the global convergence properties of the steepest descent method for general continuously differentiable optimization. Besides this, we develop an adaptive procedure for estimating a strong convexity constant for the function. Numerical tests for a limited set of toy problems show an improvement in performance when compared with the original Nesterov algorithms.

Key words. Unconstrained convex problems, optimal method

AMS subject classifications. 62K05, 65K05, 68Q25, 90C25, 90C60

1. Introduction. We study the nonlinear programming problem

(P)   minimize f(x) subject to x ∈ ℝⁿ,

where f : ℝⁿ → ℝ is convex and continuously differentiable, with a Lipschitz constant L > 0 for the gradient and a convexity parameter µ ≥ 0. This means that for all x, y ∈ ℝⁿ,

(1.1)   ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖

and

(1.2)   f(x) ≥ f(y) + ∇f(y)ᵀ(x − y) + (µ/2)‖x − y‖².

If µ > 0, the function is said to be strongly convex. Note that µ ≤ L.

Assume that the problem admits an optimal solution x* (not necessarily unique) and consider an algorithm which generates a sequence {x_k}, starting at a given point x_0 ∈ ℝⁿ. The complexity results here consist in establishing bounds for the number of iterations k which ensure f(x_k) − f(x*) ≤ ε, for a given precision ε > 0. The bounds will depend on ‖x_0 − x*‖, on the Lipschitz constant L and on the strong convexity constant µ (if available).
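As a quick sanity check of definitions (1.1) and (1.2), the following minimal sketch verifies both inequalities for a diagonal quadratic, for which L and µ are exactly the largest and smallest coefficients. The test function and the points x, y are illustrative and not from the paper.

```python
import math

# Hypothetical test function: a diagonal quadratic f(x) = 0.5 * sum(q_i x_i^2),
# for which L = max(q) and mu = min(q) are exact.
q = [1.0, 4.0, 10.0]
f = lambda x: 0.5 * sum(qi * xi * xi for qi, xi in zip(q, x))
grad = lambda x: [qi * xi for qi, xi in zip(q, x)]
norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
L, mu = max(q), min(q)

x, y = [1.0, -2.0, 0.5], [0.3, 1.0, -1.5]
dxy = [a - b for a, b in zip(x, y)]
# (1.1): ||grad f(x) - grad f(y)|| <= L ||x - y||
assert norm([a - b for a, b in zip(grad(x), grad(y))]) <= L * norm(dxy) + 1e-12
# (1.2): f(x) >= f(y) + grad f(y)^T (x - y) + (mu/2) ||x - y||^2
inner = sum(g * d for g, d in zip(grad(y), dxy))
assert f(x) >= f(y) + inner + 0.5 * mu * norm(dxy) ** 2 - 1e-12
```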
No method for solving (P) based on accumulated first order information can achieve an iteration-complexity bound of less than O(1/√ε) [4, 13]. The steepest descent method with constant step lengths cannot achieve a complexity better than O(1/ε) [5]. Results for higher order methods are discussed in Nesterov and Polyak [9].

Department of Mathematics, Federal University of Santa Catarina, Cx. Postal 52, Florianópolis, SC, Brazil; clovis@mtm.ufsc.br. Research done while visiting LIMOS, Université Blaise Pascal, Clermont-Ferrand, France. This author is supported by CNPq and PRONEX-Optimization.

Department of Mathematics, Federal University of Paraná, Cx. Postal 19081, Curitiba, PR, Brazil; ewkaras@ufpr.br, ewkaras@gmail.com. Research done while visiting the Tokyo Institute of Technology, Global Edge Institute. Supported by CNPq, PRONEX-Optimization and MEXT's program.

Nesterov [5] proves that with an insignificant increase in computational effort per iteration, the computational complexity of the steepest descent method is lowered to the optimal value O(1/√ε) for general smooth convex functions. For strongly convex functions, Nesterov's methods are linearly convergent, with a complexity bound of O(√(L/µ) log(1/ε)) (see [5, Thm 2.2.2]). For general convex functions, the bound obtained in this theorem is

(1.3)   f(x_k) − f(x*) ≤ 4L‖x_0 − x*‖²/(k + 2)² ≈ 4L‖x_0 − x*‖²/k².

These results were extended to more general convex problems, including problems with so-called simple constraints (see, for instance, [2, 6, 7, 8]). In the special case of minimizing a strictly convex quadratic function, conjugate direction methods with exact line searches find the optimal solution in no more than n iterations and satisfy (see [11, p. 170]):

(1.4)   f(x_k) − f(x*) ≤ L‖x_0 − x*‖²/(2(2k + 1)²) ≈ L‖x_0 − x*‖²/(8k²).

Note that this bound is independent of n, and is about 32 times better than the bound (1.3) obtained by Nesterov for general convex functions. Nemirovsky and Yudin [4, Sec 7.2.2] show that no first order method can be more than twice better than conjugate gradients for strongly convex quadratic problems. For non-quadratic convex functions, the same reference [4, Sec 8.2.2] shows that both the Fletcher-Reeves and the Polak-Ribière versions of the conjugate gradient method are not optimal. This is done by constructing examples in which, even with exact line searches, both methods need a multiple of 1/ε iterations to achieve an ε accuracy.

Our paper does a fine tuning of Nesterov's algorithm [5, Scheme 2.2.6]. His method relies heavily on the usage of a Lipschitz constant L, which is either given or adaptively estimated as in Nesterov [7] at the cost of extra gradient computations.
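The "about 32 times" comparison between (1.3) and (1.4) can be checked directly: the ratio of the two bounds tends to 32 as k grows. A one-line numeric confirmation (k chosen arbitrarily large):

```python
# Asymptotic ratio of the bound (1.3), 4L R^2/(k+2)^2, to the conjugate
# gradient bound (1.4), L R^2/(2(2k+1)^2): it tends to 8 * 4 = 32.
k = 10 ** 6
ratio = (4 / (k + 2) ** 2) / (1 / (2 * (2 * k + 1) ** 2))
assert abs(ratio - 32) < 1e-3
```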
The methods achieve optimal complexity, and when L is given this is done with the smallest possible computational time per iteration: only one gradient computation and no function evaluations (besides the ones used in the line search, if it is done at all). We trade this economy in function evaluations for an extra inexact line search, eliminating the usage of L and improving the efficiency of each iteration, while keeping the optimal complexity in number of iterations. Nesterov's algorithm takes advantage of a strong convexity parameter µ > 0 whenever one is known. Our best algorithm, presented in Section 4, uses a decreasing sequence of estimates for the parameter µ.

Remark. Estimations of the Lipschitz constant were used by Nesterov [7]. In another context (constrained minimization of non-smooth homogeneous convex functions) Richtárik [12] uses an adaptive scheme for guessing bounds for the distance between a point x_0 and an optimal solution x*. This bound determines the number of subgradient steps needed to obtain a desired precision. We found no method for guessing the strong convexity constant µ in the literature.

Structure of the method. A detailed explanation of the method is in Nesterov's book [5, Sec ]. Here we give a rough description of its main features. It uses local properties of f(·) (the steepest descent directions) and global properties of convex functions, by generating a sequence of functional upper bounds which are locally imposed on the epigraph of f(·) at each iteration. The functional upper bounds are simple distance functions with the

shape

(1.5)   φ_k(x) = φ_k* + (γ_k/2)‖x − v_k‖²,

where φ_k* ∈ ℝ, γ_k > 0 and v_k ∈ ℝⁿ. Note that these functions are strictly convex with minimum value φ_k* at v_k. At the start, v_0 = x_0 is a given point and γ_0 > 0 is given. The method will compute a sequence of points x_k, a sequence of second order constants γ_k ≥ µ such that γ_k → µ, and a sequence of functions φ_k(·) as above. The construction is such that at each iteration φ_k* ≥ f(x_k) (in the present paper, we require φ_k* = f(x_k)). Hence any optimal solution x* satisfies f(x*) ≤ φ_k* ≤ φ_k(x_k), and so we can think of φ_k(·) as an upper bound imposed on the epigraph of f(·). As k increases, the functions φ_k(·) become flatter. This is shown in Fig. 1.1.

[Fig. 1.1. The mechanics of the algorithm.]

Nesterov shows how to construct these functions so that, defining

λ_k = (γ_k − µ)/(γ_0 − µ),

the following relation holds:

f(x_k) ≤ φ_k(·) ≤ λ_k φ_0(·) + (1 − λ_k)f(·).

With this property, an immediate examination at an optimal solution results in

f(x_k) ≤ φ_k(x*) ≤ λ_k φ_0(x*) + (1 − λ_k)f(x*) = f(x*) + λ_k (φ_0(x*) − f(x*)),

and we see that f(x_k) → f(x*) with the speed with which λ_k → 0. Nesterov's method ensures λ_k = O(1/k²), and this means optimal complexity: an error of ε > 0 is achieved in O(1/√ε) iterations [5, Chap. 2]. In the present paper we shall not use these parameters λ_k, dealing directly with the sequences (γ_k). Moreover, the complete complexity proof made in Section 4 is for a modified algorithm based on the construction of an adaptive sequence of strong convexity parameter estimates µ_k ≥ µ such that µ_k → µ.

The practical efficiency of the method depends on how much each iteration explores the local properties around each iterate x_k to push λ_k down. Of course, the worst case complexity cannot be improved, but the speed can be improved in practical problems. This is the goal of this paper.

Structure of the paper.
In Section 2 we present a prototype algorithm and its main variants, and describe all variables, functions and parameters needed in the analysis.

Section 3 discusses some details on the implementation and on simplifications of the methods in several different environments. Section 4 describes the modified algorithm based on adaptive estimates of the strong convexity parameter and does a complete complexity analysis, proving that Nesterov's complexity bound is preserved. Finally, Section 5 compares numerically the performance of the algorithms. The numerical results show that our modifications improve the performance of Nesterov's algorithms in most examples in a set of toy problems, mainly when no good bound for L is known.

2. Description of the main algorithms. In this section we present a prototype algorithm which includes the choice of two parameters (α_k and θ_k) in each iteration: different choices of these parameters give different algorithms.

2.1. The algorithm. Here we state the prototype algorithm. At this moment it is not possible to understand the meaning of each procedure: this will be the subject of the rest of this section. Consider the problem (P). Here we assume that the strong convexity parameter µ is given (possibly null). In Section 4 we modify the algorithm to use an adaptive decreasing sequence of estimates for µ.

Algorithm 2.1. Prototype algorithm.
Data: x_0 ∈ ℝⁿ, v_0 = x_0, γ_0 > µ.
k = 0
repeat
  d_k = v_k − x_k.
  Choose θ_k ∈ [0, 1].
  y_k = x_k + θ_k d_k.
  If ∇f(y_k) = 0, then stop with y_k as an optimal solution.
  Steepest descent step: x_{k+1} = y_k − ν∇f(y_k).
  Choose α_k ∈ (0, 1].
  γ_{k+1} = (1 − α_k)γ_k + α_k µ.
  v_{k+1} = ((1 − α_k)γ_k v_k + α_k (µ y_k − ∇f(y_k)))/γ_{k+1}.
  k = k + 1.

A complete proof of the optimal complexity will be done in Section 4, for an algorithm based on adaptive estimates of µ. The proof for a fixed µ is simpler and follows from that one. The variables associated with each iteration of the algorithm are:
  x_k, v_k ∈ ℝⁿ, with x_0 given and v_0 = x_0.
  α_k ∈ [0, 1).
  γ_k ∈ ℝ₊₊: second order constants; γ_0 > µ is given.
If L is available, then Nesterov [5, Thm ] suggests the choice γ_0 = L.
  y_k ∈ ℝⁿ: a point in the segment joining x_k and v_k, y_k = x_k + θ_k(v_k − x_k). It is from y_k that a steepest descent step is computed in each iteration to obtain x_{k+1}.

x_{k+1} = y_k − ν∇f(y_k): next iterate, computed by any efficient line search procedure. We know that if L is known, then the step length ν = 1/L ensures that f(y_k) − f(x_{k+1}) ≥ ‖∇f(y_k)‖²/(2L). We require that the line search used be at least as good as this whenever L is known. An Armijo search with well chosen parameters is usually the best choice, and even if L is not known it ensures a decrease of at least ‖∇f(y_k)‖²/(4L), which is also acceptable for the complexity analysis. Hence we shall assume that at all iterations

(2.1)   f(x_{k+1}) ≤ f(y_k) − (1/(4L))‖∇f(y_k)‖².

The geometry of one iteration of the algorithm is shown in Fig. 2.1 (left): y_k is between x_k and v_k, and x_{k+1} is obtained by a steepest descent step from y_k. The directions (x_{k+1} − y_k) and (v_{k+1} − v_k) are both collinear with ∇f(y_k).

[Fig. 2.1. One iteration of the algorithm.]

Choice of the parameters: different variants of the algorithm stem from different choices of the parameters α_k and θ_k. The study of efficient choices will be the main task of this paper. The choices made by Nesterov and by us will be described in Sec. 2.3, after a technical section on the construction of the functions φ_k(·).

2.2. Construction of the functions φ_k(·). Each iteration starts with the simple quadratic function φ_k(·) as defined in (1.5) (we call simple a quadratic function whose Hessian has the shape tI, with t ≥ 0). A new simple quadratic φ_{k+1}(·) is constructed as a convex combination of φ_k(·) and a quadratic approximation l(·) of f(·) around y_k. This is shown in Fig. 2.1 (right) for a one-dimensional problem. Nesterov chooses this combination so that the minimum value φ_{k+1}* of φ_{k+1}(·) satisfies φ_{k+1}* ≥ f(x_{k+1}) ≥ f(x*). We shall require that φ_{k+1}* = f(x_{k+1}). Now we quote Nesterov [5] to describe the construction of the functions φ_k(·). We start from given x_k, v_k, γ_k, φ_k(·).
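The decrease requirement (2.1) can be met without knowing L. A minimal sketch of an Armijo backtracking search from y in the direction −∇f(y), with parameter 1/2: when the step is halved from an initial guess nu0, the accepted step satisfies ν > 1/(2L), so the decrease is at least ‖∇f(y)‖²/(4L). The function names, nu0, and the test problem are illustrative assumptions, not the paper's implementation.

```python
import math

def armijo_step(f, grad, y, nu0=1.0, max_halvings=60):
    # Backtracking by halving; accepts the first nu with
    # f(y - nu g) <= f(y) - (nu/2)||g||^2  (Armijo condition, parameter 1/2).
    g = grad(y)
    g2 = sum(gi * gi for gi in g)
    nu = nu0
    for _ in range(max_halvings):
        x = [yi - nu * gi for yi, gi in zip(y, g)]
        if f(x) <= f(y) - 0.5 * nu * g2:
            return x, nu
        nu *= 0.5
    return y, 0.0

q = [1.0, 50.0]                              # f(x) = 0.5 sum(q_i x_i^2), L = 50
f = lambda x: 0.5 * sum(qi * xi * xi for qi, xi in zip(q, x))
grad = lambda x: [qi * xi for qi, xi in zip(q, x)]
y = [1.0, 1.0]
x_new, nu = armijo_step(f, grad, y)
L = max(q)
g2 = sum(gi * gi for gi in grad(y))
assert nu > 1 / (2 * L)                      # accepted step is not too small
assert f(y) - f(x_new) >= g2 / (4 * L)       # decrease required by (2.1)
```

Any step ν ≤ 1/L passes the test above by the descent lemma, so the previous (rejected) step 2ν must exceed 1/L, which gives ν > 1/(2L).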
At this point we assume that y_k is also given (its calculation will be one of the tasks of this paper). Assume also that x_{k+1} has been computed and that (2.1) holds. Once y_k and x_{k+1} are given, we must compute α_k. Here we change the notation to include α as a variable, defining the function (α ∈ ℝ, x ∈ ℝⁿ) ↦ φ_{k+1}(α, x).
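The overall loop of Algorithm 2.1 can be sketched numerically before detailing the construction. The sketch below uses Nesterov's parameter choices described in Sec. 2.3 for the case µ = 0 with known L (α_k solves 2Lα² = (1 − α)γ_k, θ_k = α_k) and ν = 1/L; the quadratic test problem, the iteration count, and the tolerance are illustrative.

```python
import math

# Minimal sketch of Algorithm 2.1 with mu = 0, known L, nu = 1/L, and
# Nesterov's choices alpha_k (positive root of 2 L a^2 = (1 - a) gamma_k)
# and theta_k = alpha_k.  Illustrative test problem: f(x) = 0.5 sum(q_i x_i^2).
q = [1.0, 10.0, 100.0]
f = lambda x: 0.5 * sum(qi * xi * xi for qi, xi in zip(q, x))
grad = lambda x: [qi * xi for qi, xi in zip(q, x)]
L = max(q)

x = [1.0, 1.0, 1.0]
v = x[:]                                     # v_0 = x_0
gamma = L                                    # suggested gamma_0 = L
for k in range(2000):
    # positive root of 2 L a^2 + gamma a - gamma = 0
    alpha = (-gamma + math.sqrt(gamma * gamma + 8 * L * gamma)) / (4 * L)
    theta = alpha                            # theta_N = alpha_N when mu = 0
    y = [xi + theta * (vi - xi) for xi, vi in zip(x, v)]
    g = grad(y)
    x_new = [yi - gi / L for yi, gi in zip(y, g)]   # steepest descent, nu = 1/L
    gamma_new = (1 - alpha) * gamma
    v = [((1 - alpha) * gamma * vi - alpha * gi) / gamma_new
         for vi, gi in zip(v, g)]
    x, gamma = x_new, gamma_new

assert f(x) < 1e-3                           # guaranteed by the O(1/k^2) bound
```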

The function φ_{k+1}(α, ·) is an affine combination of φ_k(·) and a quadratic approximation l_k(·) of f(·) around y_k:

(2.2)   φ_{k+1}(α, x) = (1 − α)φ_k(x) + α l_k(x),

where

l_k(x) = f(y_k) + ∇f(y_k)ᵀ(x − y_k) + (µ/2)‖x − y_k‖².

Remark. In this subsection we shall not require l_k(·) to be a lower quadratic approximation, i.e., µ > 0 may not be a strong convexity constant. If µ > 0, then l_k(·) is minimized at x̂ = y_k − ∇f(y_k)/µ, with

(2.3)   min_{x∈ℝⁿ} l_k(x) = l_k(x̂) = f(y_k) − ‖∇f(y_k)‖²/(2µ),

and l_k(·) may be expressed as l_k(x) = l_k(x̂) + (µ/2)‖x − x̂‖². We shall assume that l_k(·) is not constant, otherwise y_k would be an optimal solution for the problem (P).

Define for α ∈ ℝ,

(2.4)   φ_{k+1}*(α) = inf_{x∈ℝⁿ} φ_{k+1}(α, x) ∈ ℝ ∪ {−∞}

and

(2.5)   γ_{k+1}(α) = (1 − α)γ_k + αµ.

The function φ_{k+1}(α, ·) is also a simple quadratic function, with Hessian γ_{k+1}(α)I. If γ_{k+1}(α) > 0, then φ_{k+1}*(α) is finite. If γ_{k+1}(α) < 0, then φ_{k+1}*(α) = −∞. The fact below follows trivially from (2.5):

Fact 2.2. For γ_k > µ ≥ 0, {α ∈ ℝ | γ_{k+1}(α) > 0} = (−∞, α_max), with α_max = γ_k/(γ_k − µ).

So φ_{k+1}*(α) is well defined for any α < α_max, and φ_{k+1}*(α) = −∞ for α > α_max because in this case γ_{k+1}(α) < 0. Here we shall look at what happens for α = α_max and isolate a simple situation in which φ_k and l_k have coincident minimizers.

Lemma 2.3. Consider γ_k > µ and α_max = γ_k/(γ_k − µ). Then one and only one of the following situations occurs:
(i) v_k is a minimizer of l_k and φ_{k+1}*(·) is linear on (−∞, α_max], with φ_{k+1}*(α) = (1 − α)φ_k* + α l_k(v_k).
(ii) φ_{k+1}*(α_max) = −∞.

Proof. Let us assume that inf_{x∈ℝⁿ} φ_{k+1}(α_max, x) is finite. By construction, φ_{k+1}(α_max, ·) has zero Hessian (because γ_{k+1}(α_max) = 0), hence it is affine; since its infimum is finite, φ_{k+1}(α_max, ·) = (1 − α_max)φ_k(·) + α_max l_k(·) is constant. If µ = 0, then α_max = 1, l_k(·) is constant and obviously v_k is a minimizer of l_k.

On the other hand, if µ > 0, we can write l_k(x) = l_k(x̂) + (µ/2)‖x − x̂‖², where x̂ is the minimizer of l_k(·). Differentiating the expression for φ_{k+1}(α_max, ·) with respect to x, we have for any x ∈ ℝⁿ,

(1 − α_max)γ_k(x − v_k) + α_max µ(x − x̂) = 0.

If, by contradiction, v_k ≠ x̂, then substituting x = v_k we get α_max µ = 0, which cannot be true, proving that v_k = x̂. So, in any case, v_k is a minimizer of both φ_k(·) and l_k(·), and for any α ∈ (−∞, α_max], φ_{k+1}*(α) = (1 − α)φ_k* + α l_k(v_k), which is linear, completing the proof.

Lemma 2.4. For any α ∈ (−∞, α_max), the function x ↦ φ_{k+1}(α, x) is a simple quadratic function given by

(2.6)   φ_{k+1}(α, x) = φ_{k+1}*(α) + (γ_{k+1}(α)/2)‖x − v_{k+1}(α)‖²,

with

(2.7)   φ_{k+1}*(α) = f(y_k) − (α²/(2γ_{k+1}(α)))‖∇f(y_k)‖² + (1 − α)[φ_k* − f(y_k) + (αγ_k/γ_{k+1}(α))((µ/2)‖y_k − v_k‖² + ∇f(y_k)ᵀ(v_k − y_k))],

(2.8)   v_{k+1}(α) = ((1 − α)γ_k v_k + α(µy_k − ∇f(y_k)))/γ_{k+1}(α)

and

(2.9)   γ_{k+1}(α) = (1 − α)γ_k + αµ.

In particular, if µ > 0 then

(2.10)   φ_{k+1}*(1) = f(y_k) − ‖∇f(y_k)‖²/(2µ).

Proof. The proof is straightforward and done in detail in Nesterov's book [5].

We now use Danskin's theorem, in the format presented in Bertsekas [1], to prove that φ_{k+1}*(·) is concave.

Lemma 2.5. The function α ∈ ℝ ↦ φ_{k+1}*(α) constructed above is concave in ℝ and differentiable in (−∞, α_max).

Proof. If φ_{k+1}*(α_max) is finite, then Lemma 2.3 ensures that φ_{k+1}*(·) is linear, and hence this case is proved. Assume then that φ_{k+1}*(α_max) = −∞. We must prove that φ_{k+1}*(·) is concave and differentiable in (−∞, α_max). We prove the concavity in an arbitrary interval [α_1, α_2] ⊂ (−∞, α_max). For each α ∈ [α_1, α_2], φ_{k+1}(α, ·) has a unique minimizer v_{k+1}(α). By continuity of v_{k+1}(·) in (2.8), v_{k+1}([α_1, α_2]) is bounded, i.e., there exists a compact set V ⊂ ℝⁿ such that φ_{k+1}*(α) = min_{x∈V} φ_{k+1}(α, x). So the hypotheses of Danskin's theorem are satisfied: φ_{k+1}(·, x) is concave (in fact linear) for each x ∈ ℝⁿ,

φ_{k+1}(α, ·) has a unique minimizer in the compact set V for each α ∈ [α_1, α_2]. We conclude that φ_{k+1}*(·) is concave and differentiable in [α_1, α_2]. Since [α_1, α_2] was arbitrary, this proves the concavity and differentiability in (−∞, α_max), completing the proof.

Setting y_k = x_k + θ_k d_k, with d_k = v_k − x_k and θ_k ∈ [0, 1], (2.7) may be written as

(2.11)   φ_{k+1}*(α) = f(y_k) − (α²/(2γ_{k+1}))‖∇f(y_k)‖² + (1 − α)ζ(α, θ_k),

with

(2.12)   ζ(α, θ) = φ_k* − f(x_k + θd_k) + (αγ_k/((1 − α)γ_k + αµ))((µ/2)(1 − θ)²‖d_k‖² + (1 − θ)f′(x_k + θd_k, d_k)).

Lemma 2.6. Assume that f(x_k) ≤ φ_k*, y_k = x_k + θ_k d_k with d_k = v_k − x_k and θ_k ∈ [0, 1]. Assume also that x_{k+1} is obtained by a steepest descent step from y_k satisfying (2.1), where the Lipschitz constant L is possibly infinite. Then

(2.13)   φ_{k+1}*(α) ≥ f(x_{k+1}) + (1/(4L) − α²/(2γ_{k+1}))‖∇f(y_k)‖² + (1 − α)ζ(α, θ_k).

Furthermore,

(2.14)   ζ(α, θ) ≥ (−θ + (αγ_k/((1 − α)γ_k + αµ))(1 − θ)) f′(x_k + θd_k, d_k).

Proof. The expression (2.13) follows directly from (2.11) and the fact that the steepest descent step satisfies (2.1). Using the facts that f(x_k) ≤ φ_k* and µ ≥ 0 in the definition of ζ(·, ·), we can write

ζ(α, θ) ≥ f(x_k) − f(x_k + θd_k) + (αγ_k/((1 − α)γ_k + αµ))(1 − θ)f′(x_k + θd_k, d_k).

By the convexity of f(·), we know that f(x_k) − f(x_k + θd_k) ≥ −θf′(x_k + θd_k, d_k), and so

ζ(α, θ) ≥ (−θ + (αγ_k/((1 − α)γ_k + αµ))(1 − θ)) f′(x_k + θd_k, d_k),

completing the proof.

2.3. Choice of the parameters. In this section we assume that µ ≥ 0 is a strong convexity constant for f(·). We present two different schemes for choosing the parameters θ_k and α_k in Algorithm 2.1. For both schemes we have two main objectives:
  Prove that for all choices of θ_k, it is always possible to compute α_k such that φ_{k+1}*(α_k) = f(x_{k+1});
  Prove that in any case α_k is large, i.e., α_k = Ω(√γ_{k+1}).
We begin by describing Nesterov's choice.

Nesterov's choice [5, Scheme 2.2.6]. When a Lipschitz constant L is known, Nesterov's choices are:
  α_N is computed so that the central term in (2.13) is null. So α_N is the positive solution of the second degree equation

(2.15)   2Lα² − (1 − α)γ_k − αµ = 0.

  θ_N is such that the right hand side of (2.14) is null:

(2.16)   θ_N = (γ_k/(γ_k + α_N µ)) α_N.

Note: if L is unknown but exists, α_N and θ_N are well defined (but unknown). If L does not exist, we set α_N = 0, θ_N = 0. Nesterov's original method sets α_k = α_N and θ_k = θ_N, obtaining directly from (2.13) and ζ(α_N, θ_N) = 0 that

(2.17)   φ_{k+1}*(α_N) ≥ f(x_{k+1}) and α_N = √(γ_{k+1}/(2L)).

Our choice.
  Compute θ_k ∈ [0, 1] such that

(2.18)   f(x_k + θ_k d_k) ≤ f(x_k) and (θ_k = 1 or f′(x_k + θ_k d_k, d_k) ≥ 0).

These are an Armijo condition with parameter 0 and a Wolfe condition with parameter 0.
  Compute α_k ∈ [0, 1] as the largest root of the equation

(2.19)   Aα² + Bα + C = 0,

with

G = γ_k ((µ/2)‖v_k − y_k‖² + ∇f(y_k)ᵀ(v_k − y_k)),
A = G + ‖∇f(y_k)‖²/2 + (µ − γ_k)(f(x_k) − f(y_k)),
B = (µ − γ_k)(f(x_{k+1}) − f(x_k)) − γ_k(f(y_k) − f(x_k)) − G,
C = γ_k(f(x_{k+1}) − f(x_k)).

As we will see, this equation comes from the equality φ_{k+1}*(α) = f(x_{k+1}). The next lemma uses the definition of α_N to show under what conditions we can compute α_k ≥ α_N such that this equality holds. By construction, we have

(2.20)   φ_{k+1}*(0) = φ_k* = f(x_k),
(2.21)   φ_{k+1}*(1) = inf_{x∈ℝⁿ} l_k(x) ≤ f(x*) ≤ f(x_{k+1}).

Let us first eliminate a trivial case: due to (2.21), if φ_{k+1}*(1) = f(x_{k+1}) then x_{k+1} is an optimal solution, and we may terminate the algorithm. Let us assume that φ_{k+1}*(1) < f(x_{k+1}).
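The computation of α_k from (2.19) can be sketched numerically and checked against (2.7). The one-dimensional data below are arbitrary illustrative values (f(x) = x², so L = 2; the µ used is an estimate); as a practical safeguard, the sketch keeps only roots below α_max = γ_k/(γ_k − µ), since multiplying by γ_{k+1}(α) can introduce spurious roots beyond α_max.

```python
import math

fx  = lambda x: x * x                # illustrative f, with grad 2x and L = 2
gfx = lambda x: 2 * x

mu, gam = 0.5, 3.0                   # mu estimate and gamma_k
xk, vk, theta = 1.0, -0.5, 0.3
yk = xk + theta * (vk - xk)
g  = gfx(yk)
xk1 = yk - g / 2.0                   # steepest descent step with nu = 1/L
phik = fx(xk)                        # phi*_k = f(x_k)

# coefficients of (2.19)
Ghat = 0.5 * mu * (vk - yk) ** 2 + g * (vk - yk)
G = gam * Ghat
A = G + 0.5 * g * g + (mu - gam) * (fx(xk) - fx(yk))
B = (mu - gam) * (fx(xk1) - fx(xk)) - gam * (fx(yk) - fx(xk)) - G
C = gam * (fx(xk1) - fx(xk))
disc = B * B - 4 * A * C
r1 = (-B + math.sqrt(disc)) / (2 * A)
r2 = (-B - math.sqrt(disc)) / (2 * A)
amax = gam / (gam - mu)
alpha = max(r for r in (r1, r2) if r < amax)

# check against phi*_{k+1}(alpha) from (2.7)
gam1 = (1 - alpha) * gam + alpha * mu
phi1 = (fx(yk) - alpha ** 2 / (2 * gam1) * g * g
        + (1 - alpha) * (phik - fx(yk) + alpha * gam * Ghat / gam1))
assert 0.0 <= alpha <= 1.0
assert abs(phi1 - fx(xk1)) < 1e-9    # phi*_{k+1}(alpha_k) = f(x_{k+1})
```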

The following lemma does not need the assumption that µ > 0 is a strong convexity constant.

Lemma 2.7. Assume that φ_k* > f(x_{k+1}) > inf_{x∈ℝⁿ} l_k(x). Then φ_{k+1}*(α) = f(x_{k+1}) has one or two real roots, and the largest root belongs to [0, 1]. The roots are computed by solving the second degree equation (2.19).

Proof. Let ᾱ be the largest α ∈ [0, 1) such that φ_{k+1}*(ᾱ) = max_{α∈[0,1]} φ_{k+1}*(α), which exists because φ_{k+1}*(·) is concave and continuous in [0, 1), with φ_{k+1}*(1) < φ_{k+1}*(ᾱ). Since φ_{k+1}*(1) < φ_{k+1}*(ᾱ) and f(x_{k+1}) ≥ φ_{k+1}*(1), the concave function decreases from ᾱ to 1 and crosses the value f(x_{k+1}) exactly once in [ᾱ, +∞), at some point α* ∈ [0, 1]. Setting φ_{k+1}*(α) = f(x_{k+1}) in (2.7) and multiplying both sides of the equation by γ_{k+1}(α) = (1 − α)γ_k + αµ, we obtain a second degree equation in α. The largest root of this equation must be α*, completing the proof.

Lemma 2.8. Consider an iteration k of Algorithm 2.1. Assume that x_k is not a minimizer of f and φ_{k+1}*(1) < f(x_{k+1}). Assume that y_k = x_k + θ_k d_k is chosen so that ζ(α_N, θ_k) ≥ 0. Then there exists a value α_k ∈ [α_N, 1] that solves the equation φ_{k+1}*(α_k) = f(x_{k+1}) in (2.7).

Proof. Assume that α_N and y_k = x_k + θ_k d_k have been computed as in the lemma. Since x_k is not a minimizer of f, by convexity of f and the definition of l_k(·),

φ_k* = f(x_k) > f(x*) ≥ l_k(x*) ≥ inf_{x∈ℝⁿ} l_k(x).

On the other hand,

f(x_{k+1}) > φ_{k+1}*(1) = inf_{x∈ℝⁿ} l_k(x).

Using Lemma 2.6 with α = α_N and ζ(α_N, θ_k) ≥ 0, we get from (2.13) that φ_{k+1}*(α_N) ≥ f(x_{k+1}). Now from Lemma 2.7 the equation φ_{k+1}*(α) = f(x_{k+1}) has a real root, and the largest root α* is in [0, 1]. Note that α* ≥ α_N, because α* is the largest root of the concave function φ_{k+1}*(·) − f(x_{k+1}).

We conclude from this lemma that the equality φ_{k+1}*(α_k) = f(x_{k+1}) can always be achieved whenever ζ(α_N, θ_k) ≥ 0.

Theorem 2.9.
Algorithm 2.1 with either of the choices (2.16) or (2.18) for θ_k is well defined and generates sequences {x_k} and {φ_k*} such that for all k = 0, 1, ..., φ_k* = f(x_k). Moreover, the parameters α_k and γ_k satisfy

(2.22)   α_k ≥ √(γ_{k+1}/(2L)).

Proof. (i) Let us first prove that both versions of the algorithm are well defined:
  Nesterov: immediate. Taking θ_k = θ_N we obtain by the definitions of α_N and θ_N that ζ(α_N, θ_N) = 0, and Lemma 2.8 ensures that α_k ∈ [α_N, 1) can be computed as desired.
  Our method: since f(x_k) − f(x_k + θ_k d_k) ≥ 0 and f′(x_k + θ_k d_k, d_k) ≥ 0 by construction, ζ(α, θ_k) ≥ 0 for any α ∈ [0, 1], and Lemma 2.8 can be used similarly.
(ii) In both cases we obtain ζ(α_k, θ_k) ≥ 0. Hence from (2.13),

f(x_{k+1}) = φ_{k+1}*(α_k) ≥ f(x_{k+1}) + (1/(4L) − α_k²/(2γ_{k+1}))‖∇f(y_k)‖²,

which immediately gives (2.22), completing the proof.

We conclude from this that Nesterov's algorithm, which uses the value θ_k = θ_N with α_k = α_N, can be improved by re-calculating α_k using (2.7) (and keeping θ_k = θ_N).

Remark. Even computing the best value for α_k, the sequence {f(x_k)} is not necessarily decreasing.

Remark. In the case µ = 0, the expressions are simplified. Then α_N is the positive solution of 2Lα_k² = (1 − α_k)γ_k, and expression (2.14) reduces to

ζ(α, θ) ≥ ((α − θ)/(1 − α)) f′(x_k + θd_k, d_k).

In this case, θ_N = α_N.

3. More on the choice of θ_k. Our choice of θ_k uses a line search along d_k. In this section we discuss the amount of computation needed in this line search, and show that when L and µ are known it may be done in polynomial time. We conclude from the preceding analysis that what must be shown for a given choice of y_k = x_k + θ_k d_k is that ζ(α_N, θ_k) ≥ 0. The following lemma summarizes the conditions that ensure this property.

Lemma 3.1. Let y_k = x_k + θ_k d_k be the choice made by Algorithm 2.1 in iteration k, and let θ_N be the Nesterov choice (2.16). Assume that one of the following conditions is satisfied:
(i) f(y_k) ≤ f(x_k) and f′(y_k, d_k) ≥ 0;
(ii) f(y_k) ≤ f(x_k) and θ_k ≥ θ_N;
(iii) f′(y_k, d_k) ≥ 0 and θ_k ≤ θ_N;
(iv) f′(x_k, d_k) ≥ −(µ/2)‖d_k‖² and θ_k = 0.
Then ζ(α_N, θ_k) ≥ 0 (and α_k ≥ α_N may be computed using (2.7)).

Proof. (i) It has been proved in Theorem 2.9.
Let us substitute θ_k = θ_N + ∆θ in (2.14), with γ_N = (1 − α_N)γ_k + α_N µ:

ζ(α_N, θ_k) ≥ (−θ_N + (α_Nγ_k/γ_N)(1 − θ_N) − (1 + α_Nγ_k/γ_N)∆θ) f′(x_k + θ_k d_k, d_k).

By the definition of θ_N, −θ_N + (α_Nγ_k/γ_N)(1 − θ_N) = 0. Hence

(3.1)   ζ(α_N, θ_k) ≥ −(1 + α_Nγ_k/γ_N)∆θ f′(x_k + θ_k d_k, d_k).

(ii) Assume that f(y_k) ≤ f(x_k) and θ_k ≥ θ_N. If f′(y_k, d_k) ≤ 0, then the result follows from (3.1) because ∆θ ≥ 0. Otherwise, it follows from (i).
(iii) Assume now that f′(y_k, d_k) ≥ 0 and θ_k ≤ θ_N. Then the result follows from (3.1) because ∆θ ≤ 0.
(iv) This case is also trivial by (2.12), completing the proof.

All we need is to specify in Algorithm 2.1 the choice of θ_k and α_k as follows:
  Choose θ_k ∈ [0, 1] satisfying one of the conditions in Lemma 3.1.
  Compute α_k such that φ_{k+1}* = f(x_{k+1}), using (2.7).

We must discuss the computation of θ_k, which requires a line search. Of course, if L is given, we can use θ_k = θ_N and ensure the optimal complexity, but larger values of α_k may be obtained.
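A search for a θ_k satisfying (2.18) (the zero-parameter Armijo and Wolfe conditions) can be sketched as a bisection on the sign of the directional derivative. The function names, the step budget, and the one-dimensional restriction used as test data are illustrative assumptions, not the interval reduction algorithm of the paper's Appendix.

```python
def choose_theta(f, fprime, steps=60):
    # f(t), fprime(t): restriction of the objective and its derivative along
    # x_k + t d_k.  Returns t with f(t) <= f(0) and (t == 1 or fprime(t) >= 0).
    if f(1.0) <= f(0.0):
        return 1.0                            # steepest descent from v_k
    if fprime(0.0) >= 0.0:
        return 0.0                            # steepest descent from x_k
    lo, hi = 0.0, 1.0                         # fprime(lo) < 0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) <= f(0.0) and fprime(mid) >= 0.0:
            return mid
        if fprime(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return lo

# illustrative restriction of a convex function: h(t) = (t - 0.3)^2 + 1
h  = lambda t: (t - 0.3) ** 2 + 1
hp = lambda t: 2 * (t - 0.3)
theta = choose_theta(h, hp)
assert h(theta) <= h(0.0)                     # Armijo condition with parameter 0
assert theta == 1.0 or hp(theta) >= -1e-12    # Wolfe condition with parameter 0
```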

We would like to obtain θ_k (and then α_k) such that ζ(α_k, θ_k) is as large as possible. Since the determination of α_k depends on the steepest descent step from y_k = x_k + θ_k d_k, we cannot maximize ζ(·, ·). But positive values of ζ(α, θ_k) will be obtained if θ_k is such that f(x_k) − f(x_k + θ_k d_k) and f′(x_k + θ_k d_k, d_k) are made as large as possible (of course, these are conflicting objectives). We define (but do not compute) two points:

(3.2)   θ* ∈ argmin{f(x_k + θd_k) | θ ≥ 0},
(3.3)   θ** = max{θ ∈ [0, 1] | f(x_k + θd_k) ≤ f(x_k)}.

Let us examine two important cases:
  If f′(x_k, d_k) ≥ 0, then we should choose θ_k = 0: a steepest descent step from x_k follows.
  If f(x_k + d_k) ≤ f(x_k), then we may choose θ_k = 1: a steepest descent step from v_k follows.
If none of these two special situations occurs, then 0 < θ* < θ** < 1. Any point θ_k ∈ [θ*, θ**] gives ζ(α, θ_k) ≥ 0 for any α ∈ [0, 1], and y_k = x_k + θ_k d_k satisfies condition (i) of Lemma 3.1. A point in this interval can be computed by the interval reduction algorithm described in the Appendix.

We have no complexity estimates for a line search procedure such as this if L and µ are not available. When these constants are known, it is possible to limit the number of iterations in the line search to keep Nesterov's complexity. We shall briefly discuss the choice of θ_k in three possible situations, according to our knowledge of the constants.

Case 1: no knowledge of L or µ. We start the iteration by checking the special cases above:
  If f(x_k + d_k) ≤ f(x_k), then we choose θ_k = 1: a steepest descent step from v_k will follow.
  Otherwise, we must decide whether f′(x_k, d_k) ≥ 0. This may be done as follows:
    Compute f(x_k + θd_k) for a small value of θ > 0.
    If f(x_k + θd_k) ≥ f(x_k), compute ∇f(x_k) and then f′(x_k, d_k) = ∇f(x_k)ᵀd_k.
    Else f′(x_k, d_k) < 0.
  If f′(x_k, d_k) ≥ 0, set θ_k = 0. Else, do a line search as in the Appendix, starting with the points 0, θ and 1.

Case 2: L is given.
In this case we start the search with θ_N computed as in (2.16). The line search can be abbreviated to keep the number of function calculations within a given bound, resulting in a step satisfying one of the first three conditions in Lemma 3.1. There are two cases to consider, depending on the sign of f(x_k) − f(x_k + θ_N d_k).

f(x_k + θ_N d_k) < f(x_k): condition (ii) of Lemma 3.1 is satisfied for any θ_k ≥ θ_N such that f(x_k + θ_k d_k) ≤ f(x_k). We may use an interval reduction method as in the Appendix, or simply take steps to increase θ. A trivial method is the following, using a constant β > 1:
  θ_k = θ_N
  while βθ_k ≤ 1 and f(x_k + βθ_k d_k) ≤ f(x_k), set θ_k = βθ_k.
In any method, the number of function calculations may be limited by a given constant, adopting as θ_k the largest computed step length which satisfies the descent condition.

f(x_k + θ_N d_k) > f(x_k): condition (iii) of Lemma 3.1 holds for θ_N, and we intend to reduce the value of θ. For an interval reduction search between 0 and θ_N we need

an intermediate point ν such that f(x_k + νd_k) ≤ f(x_k), if it exists. We propose the following procedure:
  Compute f(x_k + νd_k) for ν ≪ θ_N.
  If f(x_k + νd_k) ≤ f(x_k), set θ_k = ν.
  Else compute θ_k by an interval reduction method as in the Appendix, starting with the points 0, ν and θ_N.
The algorithm can be interrupted at any time, setting θ_k = B.

Case 3: L and µ > 0 are given. This is a special case of Case 2, with an interesting property: a point satisfying condition (i) in Lemma 3.1 can be computed in polynomial time without computing ∇f(x_k), using the next lemma.

Lemma 3.2. Consider θ* ∈ argmin{f(x_k + θd_k) | θ ≥ 0} and θ** > θ* such that f(x_k + θ**d_k) = f(x_k), if it exists. Define θ̃ = µ/(2L). If

f(x_k + θ̃d_k) ≤ f(x_k) − (µ²/(8L))‖d‖²,

then θ* ≥ θ̃ and θ** − θ* ≥ θ̃; otherwise f′(x_k, d_k) ≥ −(µ/2)‖d‖².

Proof. Assume that f(x_k + θ*d_k) ≤ f(x_k + θ̃d_k) ≤ f(x_k) − (µ²/(8L))‖d‖². Using the Lipschitz condition at θ* with f′(x_k + θ*d_k, d_k) = 0, we get for θ ∈ ℝ,

f(x_k + θd_k) ≤ f(x_k + θ*d_k) + (L/2)(θ − θ*)²‖d‖².

Using the assumption,

f(x_k + θd_k) ≤ f(x_k) + ((L/2)(θ − θ*)² − µ²/(8L))‖d‖².

For θ = 0, this gives Lθ*² − µ²/(4L) ≥ 0, or equivalently θ*² ≥ θ̃², proving the first inequality. For θ = θ**, f(x_k + θ**d_k) = f(x_k), and we obtain immediately (θ** − θ*)² ≥ θ̃².

Assume now that f(x_k + θ̃d_k) > f(x_k) − (µ²/(8L))‖d‖². By the Lipschitz condition for f(·), we know that

f(x_k + θ̃d_k) ≤ f(x_k) + f′(x_k, d_k)θ̃ + (L/2)θ̃²‖d‖².

Assuming by contradiction that f′(x_k, d_k) < −(µ/2)‖d‖², we obtain

f(x_k + θ̃d_k) < f(x_k) − (µ/2)θ̃‖d‖² + (L/2)θ̃²‖d‖² = f(x_k) − (µ²/(4L))‖d‖² + (µ²/(8L))‖d‖² = f(x_k) − (µ²/(8L))‖d‖²,

contradicting the hypothesis and completing the proof.

In this case we may use a golden search, as follows:
  If f(x_k + θ̃d_k) > f(x_k) − (µ²/(8L))‖d‖², set θ_k = 0 and stop (condition (iv) in Lemma 3.1 holds).
  If f(x_k + d_k) ≤ f(x_k), set θ_k = 1 and stop (condition (ii) in Lemma 3.1 holds).
  Now we know that θ** ≥ θ* ≥ θ̃ = µ/(2L) (< 1). Use an interval reduction search as in the Appendix in the interval [θ̃, 1].
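The golden-section variant of this interval reduction can be sketched as follows: each step shrinks the bracket by the factor β = (√5 − 1)/2, so reaching width µ/(2L) takes at most ⌈log(µ/(2L))/log β⌉ steps. The unimodal test function and the values of L and µ are illustrative assumptions.

```python
import math

def golden_search(h, a, b, tol):
    # Golden-section reduction of a bracket [a, b] containing the minimizer
    # of a unimodal h; stops when the bracket width is at most tol.
    beta = (math.sqrt(5) - 1) / 2            # about 0.618
    c, d = b - beta * (b - a), a + beta * (b - a)
    hc, hd = h(c), h(d)
    evals = 2
    while b - a > tol:
        if hc < hd:
            b, d, hd = d, c, hc
            c = b - beta * (b - a)
            hc = h(c)
        else:
            a, c, hc = c, d, hd
            d = a + beta * (b - a)
            hd = h(d)
        evals += 1
    return (a + b) / 2, evals

L, mu = 10.0, 1.0
h = lambda t: (t - 0.4) ** 2                 # illustrative unimodal function
t, evals = golden_search(h, 0.0, 1.0, mu / (2 * L))
jmax = math.ceil(math.log(mu / (2 * L)) / math.log((math.sqrt(5) - 1) / 2))
assert abs(t - 0.4) <= mu / (2 * L)          # minimizer localized to width tol
assert evals <= jmax + 2                     # O(log(2L/mu)) evaluations
```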

If we use a golden section search, then at each iteration j before the stopping condition is met, we have A ≤ θ* ≤ B, and the interval length satisfies B − A ≤ β^j, with β = (√5 − 1)/2 ≈ 0.618. Using the lemma above, a point B ∈ [θ*, θ**] will have been found when β^j ≤ µ/(2L). Hence the number of iterations of the search is bounded by

j_max = ⌈log(µ/(2L))/log β⌉.

This provides a bound on the computational effort per iteration of O(log(2L/µ)) function evaluations.

4. Adaptive convexity parameter µ. In computational tests (see next section) we see that the use of a strong convexity constant is very effective in reducing the number of iterations. But very seldom is such a constant available. In this section we state an algorithm which uses a decreasing sequence of estimates for the value of the constant µ. We assume that a strong convexity parameter µ (possibly null) is given, and the method will generate a sequence µ_k ≥ µ, starting with µ_0 ∈ [µ, γ_0). We still need an extra hypothesis: that the level set associated with x_0 is bounded. Note that since f(·) is convex, this is equivalent to the hypothesis that the optimal set is bounded.

The algorithm iterations are the same as in Algorithm 2.1, using a parameter µ_k ≥ µ. The parameter is reduced (we do this by setting µ_k = max{µ, µ_k/ }) in two situations:
  When γ_k − µ < β(µ_k − µ), where β > 1 is fixed (we used β = 1.02), meaning that γ_k is too close to µ_k.
  When it is impossible to satisfy the equation φ_{k+1}*(α) = f(x_{k+1}) for α ∈ [0, 1]. By Lemma 2.7, this can only happen when f(x_{k+1}) ≤ min_{x∈ℝⁿ} l(x). This minimum is given by φ_{k+1}*(1) = f(y_k) − ‖∇f(y_k)‖²/(2µ_k), using expression (2.10). So µ_k cannot be larger than

µ̃ = ‖∇f(y_k)‖²/(2(f(y_k) − f(x_{k+1}))).

If µ_k > µ̃, we reduce µ_k.

Notation: denote φ_{k+1}(x) = φ_{k+1}(α_k, x) and φ_{k+1}* = φ_{k+1}*(α_k).

Algorithm 4.1. Algorithm with adaptive convexity parameter.
Data: x_0 ∈ ℝⁿ, v_0 = x_0, γ_0 > 0, β > 1, µ ≥ 0, µ_0 ∈ [µ, γ_0).
k = 0
Set µ_+ = µ_0.
repeat
  d_k = v_k − x_k.
  Choose θ_k ∈ [0, 1] as in Algorithm 2.1.
  y_k = x_k + θ_k d_k.
  If ∇f(y_k) = 0, then stop with y_k as an optimal solution.
  Steepest descent step: x_{k+1} = y_k − ν∇f(y_k) (if L is known, ν = 1/L).
  If γ_k − µ < β(µ_+ − µ), then choose µ_+ ∈ [µ, γ_k/β].
  Compute µ̃ = ‖∇f(y_k)‖²/(2(f(y_k) − f(x_{k+1}))).
  If µ_+ ≥ µ̃, then µ_+ = max{µ, µ̃/ }.
  Compute α_k as the largest root of (2.19), with µ = µ_+.

  γ_{k+1} = (1 − α_k)γ_k + α_k µ_+.
  v_{k+1} = (1/γ_{k+1}) [(1 − α_k)γ_k v_k + α_k (µ_+ y_k − ∇f(y_k))].
  Set µ_k = µ_+. k = k + 1.

We suggest γ_0 = L if L is known, µ_0 = max{µ, γ_0/100}, and β = 1.02. We also suggest the following reduction scheme for µ_+, provided that β ≤ 10:

µ_+ = max{µ, γ_k/10}.

Independently of this choice, the following fact holds.

Fact 4.2. For any k ≥ 0, (γ_k − µ) ≥ β(µ_k − µ).

Proof. If at the beginning of an iteration k we have (γ_k − µ) < β(µ_+ − µ), then we choose µ_+ ∈ [µ, γ_k/β] with β > 1. Then

β(µ_+ − µ) ≤ γ_k − βµ ≤ γ_k − µ.

As µ_k ≤ µ_+, this completes the proof.

We now show the optimality of the algorithm, using an additional hypothesis.

Hypothesis. Given x_0 ∈ IR^n, assume that the level set associated with x_0 is bounded, i.e.,

D = sup{ ‖x − y‖ | f(x) ≤ f(x_0), f(y) ≤ f(x_0) } < ∞.

Define Q = D²/2.

Lemma 4.3. Consider γ_0 > µ. Let x* be an optimal solution of (P). Then, at all iterations of Algorithm 4.1,

(4.1)  φ_k(x*) − f(x*) ≤ ((γ_0 + L)/(γ_0 − µ)) Q (γ_k − µ).

Proof. First let us prove that (4.1) holds for k = 0. By the definition of φ_0 and the convexity of f with Lipschitz constant L for the gradient, we have

φ_0(x*) − f(x*) = f(x_0) − f(x*) + (γ_0/2)‖x* − x_0‖² ≤ ((L + γ_0)/2)‖x* − x_0‖² ≤ (L + γ_0)Q.

Thus, we can use induction. By (2.2) and (1.2), we have

φ_{k+1}(x) = (1 − α_k)φ_k(x) + α_k ( f(y_k) + ∇f(y_k)^T (x − y_k) + (µ_k/2)‖x − y_k‖² )
           ≤ (1 − α_k)φ_k(x) + α_k ( f(x) + ((µ_k − µ)/2)‖x − y_k‖² ).

Thus,

φ_{k+1}(x*) − f(x*) ≤ (1 − α_k)(φ_k(x*) − f(x*)) + α_k Q(µ_k − µ).

Using the induction hypothesis and the definition of γ_{k+1}, we have

φ_{k+1}(x*) − f(x*) ≤ (1 − α_k) ((γ_0 + L)/(γ_0 − µ)) Q (γ_k − µ) + α_k Q(µ_k − µ)
                   ≤ ((γ_0 + L)/(γ_0 − µ)) Q ((1 − α_k)γ_k + α_k µ_k − µ)
                   = ((γ_0 + L)/(γ_0 − µ)) Q (γ_{k+1} − µ),

completing the proof.

Now we prove a technical lemma, following a very similar result in Nesterov [5, Lemma 2.2.4].

Lemma 4.4. Consider a positive sequence {λ_k}. Assume that there exists M > 0 such that λ_{k+1} ≤ (1 − M√λ_{k+1}) λ_k. Then, for all k > 0,

λ_k ≤ 4/(M² k²).

Proof. Denote a_k = 1/√λ_k. By the hypothesis, {λ_k} is a decreasing sequence, and hence

a_{k+1} − a_k = 1/√λ_{k+1} − 1/√λ_k = (√λ_k − √λ_{k+1}) / (√λ_{k+1} √λ_k)
             = (λ_k − λ_{k+1}) / (√λ_{k+1} √λ_k (√λ_k + √λ_{k+1}))
             ≥ (λ_k − λ_{k+1}) / (2λ_k √λ_{k+1}).

Using the hypothesis, λ_k − λ_{k+1} ≥ M√λ_{k+1} λ_k. Thus,

a_{k+1} − a_k ≥ (M√λ_{k+1} λ_k) / (2λ_k √λ_{k+1}) = M/2.

Hence a_k ≥ a_0 + Mk/2 ≥ Mk/2, and consequently λ_k ≤ 4/(M²k²), completing the proof.

Now we can discuss how fast (γ_k − µ) goes to zero, to estimate the rate of convergence of the algorithm.

Lemma 4.5. Consider γ_0 > µ. Then, for all k > 0,

(4.2)  γ_k − µ ≤ min{ (1 − ((β − 1)/β)√(µ/(2L)))^k (γ_0 − µ),  8β²L / ((β − 1)² k²) }.

Proof. By the definition of γ_{k+1},

γ_{k+1} − µ = (1 − α_k)(γ_k − µ) + α_k(µ_k − µ).

By Fact 4.2, (γ_k − µ) ≥ β(µ_k − µ). Thus,

(4.3)  γ_{k+1} − µ ≤ (1 − α_k)(γ_k − µ) + (α_k/β)(γ_k − µ) = (1 − ((β − 1)/β) α_k)(γ_k − µ).

By the theorem proved in Section 2,

(4.4)  α_k ≥ √(γ_{k+1}/(2L)) ≥ √((γ_{k+1} − µ)/(2L)).

Thus,

γ_{k+1} − µ ≤ (1 − ((β − 1)/β)√(µ/(2L)))(γ_k − µ),

and by induction γ_k − µ ≤ (1 − ((β − 1)/β)√(µ/(2L)))^k (γ_0 − µ), proving the first inequality in (4.2). On the other hand, from (4.3) and (4.4), we have

γ_{k+1} − µ ≤ (1 − ((β − 1)/β)√((γ_{k+1} − µ)/(2L)))(γ_k − µ).

Applying Lemma 4.4 with λ_k = γ_k − µ, for k ≥ 0, and M = (β − 1)/(β√(2L)), we complete the proof.

The next theorem gives the complexity bound of the algorithm.

Theorem 4.6. Consider γ_0 > µ_0. Then Algorithm 4.1 generates a sequence {x_k} such that, for all k > 0,

f(x_k) − f(x*) ≤ ((L + γ_0)/(γ_0 − µ)) Q · min{ (1 − ((β − 1)/β)√(µ/(2L)))^k (γ_0 − µ),  8β²L / ((β − 1)² k²) }.

Proof. By construction, f(x_k) ≤ φ_k(x) for all x ∈ IR^n, in particular for x = x*. Using this and Lemma 4.3, we have

f(x_k) − f(x*) ≤ ((L + γ_0)/(γ_0 − µ)) Q (γ_k − µ).

Applying Lemma 4.5, we complete the proof.

This theorem ensures that an error of ε > 0 in the final objective function value is achieved in O(1/√ε) iterations.

Global convergence. In our algorithms, each iteration produces f(x_{k+1}) ≤ f(y_k) ≤ f(x_k). Hence, if we consider the sequence x_1, y_1, x_2, y_2, ..., we have a descent algorithm in which every other step is a steepest descent step with an Armijo step length. The global convergence of such a method is a standard result.

5. Numerical results. In this section we report the results of a limited set of computational experiments, comparing the variants of Algorithm 2.1 and Algorithm 4.1. The codes are written in Matlab, and the stopping criteria are based on the norm of the objective function gradient and on the function values, whenever the optimal value is available. In all the algorithms the line searches are of the Armijo type, and the same routine is used for all. We did not work on the efficiency of the line searches, and so it does not make much sense to compare the methods by the number of function evaluations, with one exception: the original Nesterov algorithm never computes function values, which will be commented on ahead. Hence our comparisons will be based

on the number of iterations, with one or two gradient computations per iteration and one or two line searches per iteration (see below).

Here are the algorithms to be tested, with their acronyms:
[Nesterov1] Algorithm 2.1 with θ_k = θ_N, α_k = α_N.
[α_G, θ_G] Algorithm 2.1 with our choice of parameters.
[Adaptive] Algorithm 4.1 with our choice of parameters.
[Nesterov2] Nesterov's algorithm with adaptive estimates for L, from reference [7].
[Cauchy] Steepest descent algorithm.
[F-R] Fletcher-Reeves as presented in [10, p. 120].

We shall see that in most examples [Adaptive] has the best performance among the optimal methods cited. When there are good estimates for the constants L and µ, [Nesterov1] may be the best. At the end of the section we compare our method with the Fletcher-Reeves conjugate gradient method, and conclude that in many cases this last method has the best performance.

Now we describe our tests for three different convex functions:
(i) A quadratic function with µ = 1 and L = 1000.
(ii) Nesterov's "worst function in the world".
(iii) A convex function including exponentials.
In (i) and (iii), the Hessian eigenvalues at the unique minimizer are randomly distributed between µ = 1 and L = 1000.

(i) Quadratic functions. Given n and L > µ > 0 (in these tests, L = 1000 and µ = 1), the quadratic test functions are given by

x ∈ IR^n → f(x) = (1/2) x^T Q x,

where Q is a diagonal matrix with random entries in the interval [1, L]. For each experiment, we generate Q and a random initial point x_0 with ‖x_0‖ = 1. We generated several instances of problems with n assuming the values 10, 100, 200, 500, 1000, 2500, which were solved by all methods.

Figure 5.1 shows the performance of the methods assuming ignorance of µ and L: the algorithms were fed with the data µ = 0, L = 10000. On the left, we show the function values for a typical example in dimension 1000, and on the right the performance profiles for all methods on the whole set of problems.
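The quadratic experiments above are easy to reproduce. The following is a minimal sketch in Python (the paper's code is in Matlab and is not reproduced here; the function names, the seeding, and the pinning of the two extreme eigenvalues are our own choices):

```python
import math
import random

def make_quadratic_problem(n, mu=1.0, L=1000.0, seed=0):
    """Build f(x) = 0.5 * x'Qx with diagonal Q whose entries are random
    in [mu, L], plus a random starting point with ||x0|| = 1."""
    rng = random.Random(seed)
    d = [rng.uniform(mu, L) for _ in range(n)]
    d[0], d[-1] = mu, L  # pin the extreme eigenvalues (our assumption)

    def f(x):
        return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

    def grad(x):
        return [di * xi for di, xi in zip(d, x)]

    x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in x0))
    x0 = [v / norm for v in x0]  # ||x0|| = 1, as in the experiments
    return f, grad, x0
```

Feeding the solvers µ = 0 and L = 10000 on problems built this way then mimics the "unknown constants" runs of Figure 5.1.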
In all cases [Adaptive] used the smallest number of iterations, followed by [Nesterov2] and [α_G, θ_G]. [Nesterov1] always needed over 10 times the number of iterations of [Adaptive].

Figure 5.2 shows the results for the same tests, but using the knowledge of the values µ = 1 and L = 1000. Now [Nesterov1] performs very well: even if its number of iterations is a little larger, each iteration is faster because only one line search is needed. Now [Nesterov2] is slower because it does not take advantage of the knowledge of µ.

(ii) Nesterov's worst function in the world. This function is defined and explained in Nesterov [5, p. 67]. Given L > µ > 0 (we took µ = 1, L = 1000), set Q = L/µ. The function is

(5.1)  x ∈ IR^n → f(x) = (µ(Q − 1)/8) ( x_1² + Σ_{i=1}^{n−1} (x_i − x_{i+1})² − 2x_1 ) + (µ/2)‖x‖².
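A sketch of (5.1) and its gradient, assuming the standard strongly convex form of this function from Nesterov's book (the implementation and names are ours; the gradient exploits the tridiagonal structure of the quadratic part):

```python
def worst_function(n, mu=1.0, L=1000.0):
    """Nesterov's 'worst function in the world', strongly convex version:
    f(x) = (mu*(Q-1)/8) * (x_1^2 + sum (x_i - x_{i+1})^2 - 2*x_1)
           + (mu/2)*||x||^2,   with Q = L/mu."""
    Q = L / mu
    c = mu * (Q - 1.0) / 8.0

    def f(x):
        quad = x[0] ** 2 + sum((x[i] - x[i + 1]) ** 2 for i in range(n - 1))
        return c * (quad - 2.0 * x[0]) + 0.5 * mu * sum(v * v for v in x)

    def grad(x):
        # quadratic part is x'Ax with A tridiagonal: diag (2,...,2,1),
        # off-diagonals -1; its gradient is 2*A*x - 2*e_1
        g = [0.0] * n
        for i in range(n):
            ax = 2.0 * x[i] if i < n - 1 else x[i]
            if i > 0:
                ax -= x[i - 1]
            if i < n - 1:
                ax -= x[i + 1]
            g[i] = 2.0 * c * ax + mu * x[i]
        g[0] -= 2.0 * c
        return g

    return f, grad
```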

Fig. 5.1. Quadratic functions with µ and L unknown.

Fig. 5.2. Quadratic functions with µ and L given.

Generating a set of problems for the same values of n as before and several starting points, Figure 5.3 shows the performance profiles for all problems in two situations: on the left, considering µ and L unknown (µ = 0 and L = 10000 fed to the algorithms); on the right, for µ = 1 and L = 1000 given to the algorithms. We see that the behavior is similar to case (i), with one detail: [Nesterov1] was worse than [Cauchy] for unknown µ and L. This is certainly due to the choice L = 10000, which affects the computation of θ_N in the algorithm.

Fig. 5.3. Performance profiles for the function defined by (5.1).

(iii) Function with exponentials. This function is convex and includes exponentials. The

Hessian eigenvalues at the optimizer are randomly distributed between 1 and 1000:

(5.2)  f(x) = (1/2)‖x‖² + Σ_{i=1}^n q_i (e^{x_i} − x_i − 1),

where q_i ∈ [1, 999] for i = 1, ..., n. The minimizer is x* = 0, with f(0) = 0. We use the same values of n as in the former examples. For this function the Hessian matrix H(x) is diagonal with H_ii(x) = q_i e^{x_i} + 1, and hence there is no global Lipschitz constant for the gradient. Figure 5.4 shows a typical run with n = 1000 and the performance profile for µ = 0, L = 10000 fed to the algorithms.

Fig. 5.4. Objective function value and performance profile for a function with exponentials.

Again [Adaptive] has the best performance, and [Nesterov2] takes about 5 times the number of iterations (computing at least two gradients per iteration). Here again [Nesterov1] is slower than [Cauchy].

Comparison with the Fletcher-Reeves algorithm. As we saw in the Introduction, conjugate gradient algorithms are the most efficient first order methods for quadratic functions. They are also known to perform very well for a strictly convex function in a region where the second derivatives are near their values at an optimal solution (possibly using restarts and precise line searches). There exists a vast literature on conjugate gradient methods, and we do not intend to review it here. We shall construct an example in which the second derivatives are ill-behaved near the optimal solution, and observe that our method performs better than Fletcher-Reeves. For a drastic but not so simple counterexample, see the reference [4] discussed in the Introduction.

The separable function. We start by defining a function g : IR → IR as follows: partition the set [0, +∞] into p intervals defined by the points 0, 2^{−p+2}, 2^{−p+3}, ..., 2^0, +∞, and associate with each interval a positive value chosen at random. Now integrate the resulting step function twice to obtain a strictly convex piecewise quadratic function.
Given τ_j ∈ [1, L], j = 1, ..., n, L > 0, our separable function is constructed as

x ∈ IR^n → f(x) = Σ_{j=1}^n τ_j g(x_j).

This function has a global minimum at the origin, the Hessian is everywhere positive diagonal, and the derivatives are L-Lipschitz continuous, but the Hessian H(x) differs very much from H(0) at any point x with ‖x‖ > 2^{−p+2}.
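The construction above can be sketched as follows (our own illustration: the breakpoints follow the text, while the random range for g″ and the even extension of g to negative arguments are assumptions on our part):

```python
import random

def make_g(p, seed=0):
    """Convex piecewise-quadratic g: g'' is a random positive step function
    on the intervals with breakpoints 0, 2^(-p+2), ..., 2^0, +inf,
    integrated twice so that g(0) = g'(0) = 0."""
    rng = random.Random(seed)
    b = [0.0] + [2.0 ** e for e in range(-p + 2, 1)]  # finite breakpoints
    h = [rng.uniform(0.5, 2.0) for _ in b]            # g'' on each interval
    gp = [0.0] * len(b)   # accumulated g' at the breakpoints
    gv = [0.0] * len(b)   # accumulated g  at the breakpoints
    for i in range(1, len(b)):
        dt = b[i] - b[i - 1]
        gp[i] = gp[i - 1] + h[i - 1] * dt
        gv[i] = gv[i - 1] + gp[i - 1] * dt + 0.5 * h[i - 1] * dt * dt

    def g(t):
        t = abs(t)  # even extension to t < 0 (our assumption)
        i = max(j for j in range(len(b)) if b[j] <= t)
        dt = t - b[i]
        return gv[i] + gp[i] * dt + 0.5 * h[i] * dt * dt

    return g

def separable_f(taus, g):
    """f(x) = sum_j tau_j * g(x_j)."""
    return lambda x: sum(t * g(v) for t, v in zip(taus, x))
```

With this g, the second derivative changes abruptly across every breakpoint 2^{−p+2}, ..., 2^0, which is exactly the ill-behavior near the optimum discussed in the text.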

Figure 5.5 shows a typical behavior of our algorithm and [F-R] on problems of dimension n = 200 and on a larger instance. The algorithm [F-R] was implemented with restarts at every n iterations, and also when the search direction becomes almost orthogonal to the gradient. The line search used was either golden section or Armijo. In practically all runs with dimensions from 100 to 10000, [F-R] needed more iterations than [Adaptive].

Fig. 5.5. Separable function values for two problem dimensions.

Appendix: interval reduction search. Let g : [0, 1] → IR be a convex differentiable function. Assume that g(0) = 0, g′(0) < 0 and g′(1) > 0. We shall study the problem of finding θ ∈ [0, 1] such that g(θ) ≤ 0 and g′(θ) ≥ 0. This can be seen as a line search problem with an Armijo constant 0 and a Wolfe condition with constant 0, for which there are algorithms in the literature (see for instance [3, 10]). In the present case, in which the function is convex, we can use a simple interval reduction algorithm, based on the following step: assume that three points 0 ≤ A < ν < B ≤ 1 are given, satisfying g(A) ≥ g(ν) ≤ g(B). Note that, as consequences of the convexity of g, the interval [A, B] contains a minimizer of g and g′(B) ≥ 0. The problem can be solved by the following interval reduction scheme:

Algorithm 5.1. Interval reduction algorithm.
while g(B) > 0,
  Choose ξ ∈ (A, B), ξ ≠ ν. Set u = min{ν, ξ}, v = max{ν, ξ}.
  If g(u) ≤ g(v), set B = v, ν = u; else set A = u, ν = v.

The initial value of ν and the values of ξ in each iteration may be chosen so that u and v define the golden section of the interval [A, B], and then the interval length is reduced by a factor of (√5 − 1)/2 ≈ 0.618 in each iteration. We can also choose ξ as the minimizer of the quadratic interpolant through g(A), g(ν), g(B). In this case one must avoid ξ = ν, in which case ξ must be perturbed.

Acknowledgements. The authors would like to thank Prof. Masakazu Kojima and Prof. Mituhiro Fukuda for useful discussions about the subject.
We also thank the anonymous referees and Prof. Paulo J. S. Silva for their valuable comments and suggestions, which very much improved this paper. This research was partially completed while the second author was visiting the Tokyo Institute of Technology under a project supported by MEXT's program Promotion of Environmental Improvement

for Independence of Young Researchers under the Special Coordination Funds for Promoting Science and Technology.

REFERENCES

[1] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Belmont, USA, 2003.
[2] G. Lan, Z. Lu, and R. D. C. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Technical report, School of ISyE, Georgia Tech, USA. Accepted in Mathematical Programming.
[3] J. J. Moré and D. J. Thuente. Line search algorithms with guaranteed sufficient decrease. Technical Report MCS-P330-1092, Mathematics and Computer Science Division, Argonne National Laboratory, 1992.
[4] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, New York, 1983.
[5] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston, 2004.
[6] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[7] Y. Nesterov. Gradient methods for minimizing composite objective function. Discussion paper 76, CORE, UCL, Belgium, 2007.
[8] Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245–259, 2007.
[9] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108:177–205, 2006.
[10] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, 1999.
[11] B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.
[12] P. Richtárik. Improved algorithms for convex minimization in relative scale. SIAM Journal on Optimization, 21(3):1073–1099, 2011.
[13] N. Shor. Minimization Methods for Non-Differentiable Functions. Springer Verlag, Berlin, 1985.


More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Complexity bounds for primal-dual methods minimizing the model of objective function

Complexity bounds for primal-dual methods minimizing the model of objective function Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

A derivative-free nonmonotone line search and its application to the spectral residual method

A derivative-free nonmonotone line search and its application to the spectral residual method IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral

More information

Optimality, Duality, Complementarity for Constrained Optimization

Optimality, Duality, Complementarity for Constrained Optimization Optimality, Duality, Complementarity for Constrained Optimization Stephen Wright University of Wisconsin-Madison May 2014 Wright (UW-Madison) Optimality, Duality, Complementarity May 2014 1 / 41 Linear

More information

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization MATHEMATICS OF OPERATIONS RESEARCH Vol. 29, No. 3, August 2004, pp. 479 491 issn 0364-765X eissn 1526-5471 04 2903 0479 informs doi 10.1287/moor.1040.0103 2004 INFORMS Some Properties of the Augmented

More information

QUADRATIC MAJORIZATION 1. INTRODUCTION

QUADRATIC MAJORIZATION 1. INTRODUCTION QUADRATIC MAJORIZATION JAN DE LEEUW 1. INTRODUCTION Majorization methods are used extensively to solve complicated multivariate optimizaton problems. We refer to de Leeuw [1994]; Heiser [1995]; Lange et

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

On self-concordant barriers for generalized power cones

On self-concordant barriers for generalized power cones On self-concordant barriers for generalized power cones Scott Roy Lin Xiao January 30, 2018 Abstract In the study of interior-point methods for nonsymmetric conic optimization and their applications, Nesterov

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems Optimization Methods and Software ISSN: 1055-6788 (Print) 1029-4937 (Online) Journal homepage: http://www.tandfonline.com/loi/goms20 An accelerated non-euclidean hybrid proximal extragradient-type algorithm

More information

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization Iranian Journal of Operations Research Vol. 4, No. 1, 2013, pp. 88-107 Research Note A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization B. Kheirfam We

More information

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE Journal of Applied Analysis Vol. 6, No. 1 (2000), pp. 139 148 A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE A. W. A. TAHA Received

More information

Chapter 2 Convex Analysis

Chapter 2 Convex Analysis Chapter 2 Convex Analysis The theory of nonsmooth analysis is based on convex analysis. Thus, we start this chapter by giving basic concepts and results of convexity (for further readings see also [202,

More information

MS&E 318 (CME 338) Large-Scale Numerical Optimization

MS&E 318 (CME 338) Large-Scale Numerical Optimization Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods

More information

Accelerating Nesterov s Method for Strongly Convex Functions

Accelerating Nesterov s Method for Strongly Convex Functions Accelerating Nesterov s Method for Strongly Convex Functions Hao Chen Xiangrui Meng MATH301, 2011 Outline The Gap 1 The Gap 2 3 Outline The Gap 1 The Gap 2 3 Our talk begins with a tiny gap For any x 0

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

Conjugate Gradient (CG) Method

Conjugate Gradient (CG) Method Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous

More information

DO NOT OPEN THIS QUESTION BOOKLET UNTIL YOU ARE TOLD TO DO SO

DO NOT OPEN THIS QUESTION BOOKLET UNTIL YOU ARE TOLD TO DO SO QUESTION BOOKLET EECS 227A Fall 2009 Midterm Tuesday, Ocotober 20, 11:10-12:30pm DO NOT OPEN THIS QUESTION BOOKLET UNTIL YOU ARE TOLD TO DO SO You have 80 minutes to complete the midterm. The midterm consists

More information

MA/OR/ST 706: Nonlinear Programming Midterm Exam Instructor: Dr. Kartik Sivaramakrishnan INSTRUCTIONS

MA/OR/ST 706: Nonlinear Programming Midterm Exam Instructor: Dr. Kartik Sivaramakrishnan INSTRUCTIONS MA/OR/ST 706: Nonlinear Programming Midterm Exam Instructor: Dr. Kartik Sivaramakrishnan INSTRUCTIONS 1. Please write your name and student number clearly on the front page of the exam. 2. The exam is

More information

AM 205: lecture 18. Last time: optimization methods Today: conditions for optimality

AM 205: lecture 18. Last time: optimization methods Today: conditions for optimality AM 205: lecture 18 Last time: optimization methods Today: conditions for optimality Existence of Global Minimum For example: f (x, y) = x 2 + y 2 is coercive on R 2 (global min. at (0, 0)) f (x) = x 3

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL)

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL) Part 4: Active-set methods for linearly constrained optimization Nick Gould RAL fx subject to Ax b Part C course on continuoue optimization LINEARLY CONSTRAINED MINIMIZATION fx subject to Ax { } b where

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

1 Numerical optimization

1 Numerical optimization Contents Numerical optimization 5. Optimization of single-variable functions.............................. 5.. Golden Section Search..................................... 6.. Fibonacci Search........................................

More information

Lecture 10: September 26

Lecture 10: September 26 0-725: Optimization Fall 202 Lecture 0: September 26 Lecturer: Barnabas Poczos/Ryan Tibshirani Scribes: Yipei Wang, Zhiguang Huo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Lecture 2: Convex Sets and Functions

Lecture 2: Convex Sets and Functions Lecture 2: Convex Sets and Functions Hyang-Won Lee Dept. of Internet & Multimedia Eng. Konkuk University Lecture 2 Network Optimization, Fall 2015 1 / 22 Optimization Problems Optimization problems are

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

Constrained Optimization

Constrained Optimization 1 / 22 Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 30, 2015 2 / 22 1. Equality constraints only 1.1 Reduced gradient 1.2 Lagrange

More information

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES IJMMS 25:6 2001) 397 409 PII. S0161171201002290 http://ijmms.hindawi.com Hindawi Publishing Corp. A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

More information

Kantorovich s Majorants Principle for Newton s Method

Kantorovich s Majorants Principle for Newton s Method Kantorovich s Majorants Principle for Newton s Method O. P. Ferreira B. F. Svaiter January 17, 2006 Abstract We prove Kantorovich s theorem on Newton s method using a convergence analysis which makes clear,

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean Renato D.C. Monteiro B. F. Svaiter March 17, 2009 Abstract In this paper we analyze the iteration-complexity

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

Acceleration Method for Convex Optimization over the Fixed Point Set of a Nonexpansive Mapping

Acceleration Method for Convex Optimization over the Fixed Point Set of a Nonexpansive Mapping Noname manuscript No. will be inserted by the editor) Acceleration Method for Convex Optimization over the Fixed Point Set of a Nonexpansive Mapping Hideaki Iiduka Received: date / Accepted: date Abstract

More information

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH Jie Sun 1 Department of Decision Sciences National University of Singapore, Republic of Singapore Jiapu Zhang 2 Department of Mathematics

More information

On the convergence properties of the modified Polak Ribiére Polyak method with the standard Armijo line search

On the convergence properties of the modified Polak Ribiére Polyak method with the standard Armijo line search ANZIAM J. 55 (E) pp.e79 E89, 2014 E79 On the convergence properties of the modified Polak Ribiére Polyak method with the standard Armijo line search Lijun Li 1 Weijun Zhou 2 (Received 21 May 2013; revised

More information

FALL 2018 MATH 4211/6211 Optimization Homework 4

FALL 2018 MATH 4211/6211 Optimization Homework 4 FALL 2018 MATH 4211/6211 Optimization Homework 4 This homework assignment is open to textbook, reference books, slides, and online resources, excluding any direct solution to the problem (such as solution

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information