FINE TUNING NESTEROV'S STEEPEST DESCENT ALGORITHM FOR DIFFERENTIABLE CONVEX PROGRAMMING

CLÓVIS C. GONZAGA AND ELIZABETH W. KARAS

Abstract. We modify the first-order algorithm for convex programming described by Nesterov in his book [5]. In his algorithms, Nesterov makes explicit use of a Lipschitz constant L for the function gradient, which is either assumed known [5] or is estimated by an adaptive procedure [7]. We eliminate the use of L at the cost of an extra imprecise line search, and obtain an algorithm which keeps the optimal complexity properties and also inherits the global convergence properties of the steepest descent method for general continuously differentiable optimization. Besides this, we develop an adaptive procedure for estimating a strong convexity constant for the function. Numerical tests for a limited set of toy problems show an improvement in performance when compared with the original Nesterov algorithms.

Key words. Unconstrained convex problems, optimal method

AMS subject classifications. 62K05, 65K05, 68Q25, 90C25, 90C60

1. Introduction. We study the nonlinear programming problem

(P) minimize f(x) subject to x ∈ IR^n,

where f : IR^n → IR is convex and continuously differentiable, with a Lipschitz constant L > 0 for the gradient and a convexity parameter µ ≥ 0. This means that for all x, y ∈ IR^n,

(1.1) ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖

and

(1.2) f(x) ≥ f(y) + ∇f(y)^T (x − y) + (µ/2) ‖x − y‖².

If µ > 0, the function is said to be strongly convex. Note that µ ≤ L. Assume that the problem admits an optimal solution x* (not necessarily unique) and consider an algorithm which generates a sequence {x_k}, starting at a given point x_0 ∈ IR^n. The complexity results here consist in establishing bounds for the number of iterations k which ensure f(x_k) − f(x*) ≤ ε, for a given precision ε > 0. The bounds depend on ‖x_0 − x*‖, on the Lipschitz constant L, and on the strong convexity constant µ (if available).
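As a concrete illustration of these definitions (our own example, not from the paper): for a quadratic f(x) = (1/2) x^T Q x with Q symmetric positive definite, the constants can be taken as L = λ_max(Q) and µ = λ_min(Q). The sketch below checks (1.1) and (1.2) numerically at random point pairs:

```python
# Numerical check of (1.1) and (1.2) for f(x) = 0.5 x^T Q x,
# where L = lambda_max(Q) and mu = lambda_min(Q).  Q is an arbitrary
# SPD matrix built for illustration only.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + np.eye(5)            # symmetric positive definite
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = np.linalg.eigvalsh(Q).max()    # Lipschitz constant of the gradient
mu = np.linalg.eigvalsh(Q).min()   # strong convexity parameter

for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # (1.1): gradient is Lipschitz with constant L
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # (1.2): strong convexity lower bound with parameter mu
    assert f(x) >= f(y) + grad(y) @ (x - y) + 0.5 * mu * np.linalg.norm(x - y) ** 2 - 1e-9
assert 0 < mu <= L
```

For a quadratic both constants are exact; for a general smooth convex f they are only the best available bounds.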
No method for solving (P) based on accumulated first-order information can achieve an iteration-complexity bound of less than O(1/√ε) [4, 13]. The steepest descent method with constant step lengths cannot achieve a complexity better than O(1/ε) [5]. Results for higher-order methods are discussed in Nesterov and Polyak [9].

Department of Mathematics, Federal University of Santa Catarina, Cx. Postal 52, 88040-970 Florianópolis, SC, Brazil; e-mail: clovis@mtm.ufsc.br. Research done while visiting LIMOS, Université Blaise Pascal, Clermont-Ferrand, France. This author is supported by CNPq and PRONEX - Optimization.

Department of Mathematics, Federal University of Paraná, Cx. Postal 19081, 81531-980 Curitiba, PR, Brazil; e-mail: ewkaras@ufpr.br, ewkaras@gmail.com. Research done while visiting the Tokyo Institute of Technology, Global Edge Institute. Supported by CNPq, PRONEX - Optimization and MEXT's program.
Nesterov [5] proves that with an insignificant increase in computational effort per iteration, the computational complexity of the steepest descent method is lowered to the optimal value O(1/√ε) for general smooth convex functions. For strongly convex functions, Nesterov's methods are linearly convergent, with a complexity bound of O(√(L/µ) log(1/ε)) (see [5, Thm 2.2.2]). For general convex functions, the bound obtained in this theorem is

(1.3) f(x_k) − f(x*) ≤ 4L ‖x_0 − x*‖² / (k + 2)² ≈ 4L ‖x_0 − x*‖² / k².

These results were extended to more general convex problems, including problems with so-called simple constraints (see, for instance, [2, 6, 7, 8]). In the special case of minimizing a strictly convex quadratic function, conjugate direction methods with exact line searches find the optimal solution in no more than n iterations and satisfy (see [11, p. 170]):

(1.4) f(x_k) − f(x*) ≤ L ‖x_0 − x*‖² / (2(2k + 1)²) ≈ L ‖x_0 − x*‖² / (8k²).

Note that this bound is independent of n, and is about 32 times better than the bound (1.3) obtained by Nesterov for general convex functions. Nemirovsky and Yudin [4, Sec 7.2.2] show that no first-order method can be more than twice better than conjugate gradients for strongly convex quadratic problems. For non-quadratic convex functions, the same reference [4, Sec 8.2.2] shows that both the Fletcher-Reeves and Polak-Ribière versions of conjugate gradient methods are not optimal. This is done by constructing examples in which, even with exact line searches, both methods need a multiple of 1/ε iterations to achieve an ε accuracy.

Our paper does a fine tuning of Nesterov's algorithm [5, Scheme 2.2.6]. His method relies heavily on the usage of a Lipschitz constant L, which is either given or adaptively estimated as in Nesterov [7], at the cost of extra gradient computations.
The methods achieve optimal complexity, and when L is given this is done with the smallest possible computational cost per iteration: only one gradient computation and no function evaluations (besides the ones used in the line search, if one is done at all). We trade this economy in function evaluations for an extra inexact line search, eliminating the usage of L and improving the efficiency of each iteration, while keeping the optimal complexity in the number of iterations. Nesterov's algorithm takes advantage of a strong convexity parameter µ > 0 whenever one is known. Our best algorithm, presented in Section 4, uses a decreasing sequence of estimates for the parameter µ.

Remark. Estimates of the Lipschitz constant were used by Nesterov [7]. In another context (constrained minimization of non-smooth homogeneous convex functions), Richtárik [12] uses an adaptive scheme for guessing bounds for the distance between a point x_0 and an optimal solution x*. This bound determines the number of subgradient steps needed to obtain a desired precision. We found no method for guessing the strong convexity constant µ in the literature.

Structure of the method. A detailed explanation of the method is in Nesterov's book [5, Sec. 2.2.1]. Here we give a rough description of its main features. It uses local properties of f(·) (the steepest descent directions) and global properties of convex functions, by generating a sequence of functional upper bounds which are locally imposed on the epigraph of f(·) at each iteration. The functional upper bounds are simple distance functions with the
shape

(1.5) φ_k(x) = φ*_k + (γ_k/2) ‖x − v_k‖²,

where φ*_k ∈ IR, γ_k > 0 and v_k ∈ IR^n. Note that these functions are strictly convex, with minimum value φ*_k attained at v_k. At the start, v_0 = x_0 is a given point and γ_0 > 0 is given. The method computes a sequence of points x_k, a sequence of second-order constants γ_k ≥ µ such that γ_k → µ, and a sequence of functions φ_k(·) as above. The construction is such that at each iteration φ*_k ≥ f(x_k) (in the present paper, we require φ*_k = f(x_k)). Hence any optimal solution x* satisfies f(x*) ≤ φ*_k ≤ φ_k(x*), and so we can think of φ_k(·) as an upper bound imposed on the epigraph of f(·). As k increases, the functions φ_k become flatter. This is shown in Fig. 1.1.

Fig. 1.1. The mechanics of the algorithm.

Nesterov shows how to construct these functions so that, defining

λ_k = (γ_k − µ)/(γ_0 − µ),

the following relation holds for all x ∈ IR^n:

f(x_k) ≤ φ_k(x) ≤ λ_k φ_0(x) + (1 − λ_k) f(x).

With this property, an immediate examination at an optimal solution results in

f(x_k) ≤ φ_k(x*) ≤ λ_k φ_0(x*) + (1 − λ_k) f(x*) = f(x*) + λ_k (φ_0(x*) − f(x*)),

and we see that f(x_k) → f(x*) with the speed with which λ_k → 0. Nesterov's method ensures λ_k = O(1/k²), and this means optimal complexity: an error of ε > 0 is achieved in O(1/√ε) iterations [5, Chap. 2]. In the present paper we shall not use these parameters λ_k, dealing directly with the sequences (γ_k). Moreover, the complete complexity proof made in Section 4 is for a modified algorithm based on the construction of an adaptive sequence of strong convexity parameter estimates µ_k ≥ µ such that µ_k → µ. The practical efficiency of the method depends on how much each iteration explores the local properties around each iterate x_k to push λ_k down. Of course, the worst-case complexity cannot be improved, but the speed can be improved in practical problems. This is the goal of this paper.

Structure of the paper.
In Section 2 we present a prototype algorithm and its main variants, and we describe all variables, functions and parameters needed in the analysis.
Section 3 discusses some details on the implementation and on simplifications of the methods in several different environments. Section 4 describes the modified algorithm based on adaptive estimates of the strong convexity parameter and does a complete complexity analysis, proving that Nesterov's complexity bound is preserved. Finally, Section 5 compares numerically the performance of the algorithms. The numerical results show that our modifications improve the performance of Nesterov's algorithms in most examples in a set of toy problems, mainly when no good bound for L is known.

2. Description of the main algorithms. In this section we present a prototype algorithm which includes the choice of two parameters (α_k and θ_k) in each iteration: different choices of these parameters give different algorithms.

2.1. The algorithm. Here we state the prototype algorithm. At this moment it is not possible to understand the meaning of each procedure: this will be the subject of the rest of this section. Consider the problem (P). Here we assume that the strong convexity parameter µ is given (possibly null). In Section 4 we modify the algorithm to use an adaptive decreasing sequence of estimates for µ.

Algorithm 2.1. Prototype algorithm.
Data: x_0 ∈ IR^n, v_0 = x_0, γ_0 > µ.
k = 0
repeat
    d_k = v_k − x_k.
    Choose θ_k ∈ [0, 1].
    y_k = x_k + θ_k d_k.
    If ∇f(y_k) = 0, then stop with y_k as an optimal solution.
    Steepest descent step: x_{k+1} = y_k − ν ∇f(y_k).
    Choose α_k ∈ (0, 1].
    γ_{k+1} = (1 − α_k) γ_k + α_k µ.
    v_{k+1} = (1/γ_{k+1}) ((1 − α_k) γ_k v_k + α_k (µ y_k − ∇f(y_k))).
    k = k + 1.

A complete proof of the optimal complexity will be done in Section 4, for an algorithm based on adaptive estimates of µ. The proof for a fixed µ is simpler and follows from that one. The variables associated with each iteration of the algorithm are:
x_k, v_k ∈ IR^n, with x_0 given and v_0 = x_0.
α_k ∈ [0, 1).
γ_k ∈ IR_{++}: second-order constants. γ_0 > µ is given. If L is available, then Nesterov [5, Thm. 2.2.2] suggests the choice γ_0 = L.
y_k ∈ IR^n: a point in the segment joining x_k and v_k, y_k = x_k + θ_k (v_k − x_k). It is from y_k that a steepest descent step is computed in each iteration to obtain x_{k+1}.
x_{k+1} = y_k − ν ∇f(y_k): the next iterate, computed by any efficient line search procedure. We know that if L is known, then the step length ν = 1/L ensures that

f(x_{k+1}) ≤ f(y_k) − ‖∇f(y_k)‖² / (2L).

We require that the line search used be at least as good as this whenever L is known. An Armijo search with well-chosen parameters is usually the best choice, and even if L is not known it ensures a decrease of at least ‖∇f(y_k)‖² / (4L), which is also acceptable for the complexity analysis. Hence we shall assume that at all iterations

(2.1) f(x_{k+1}) ≤ f(y_k) − (1/(4L)) ‖∇f(y_k)‖².

The geometry of one iteration of the algorithm is shown in Fig. 2.1 (left): y_k is between x_k and v_k, and x_{k+1} is obtained by a steepest descent step from y_k. The directions x_{k+1} − y_k and v_{k+1} − v_k are both collinear with ∇f(y_k).

Fig. 2.1. One iteration of the algorithm.

Choice of the parameters: different variants of the algorithm stem from different choices of the parameters α_k and θ_k. The study of efficient choices will be the main task of this paper. The choices made by Nesterov and by us will be described in Sec. 2.3, after a technical section on the construction of the functions φ_k(·).

2.2. Construction of the functions φ_k(·). Each iteration starts with the simple quadratic function φ_k(·) as defined in (1.5) (we call simple a quadratic function whose Hessian has the shape tI, with t ≥ 0). A new simple quadratic φ_{k+1}(·) is constructed as a convex combination of φ_k(·) and a quadratic approximation ℓ_k(·) of f(·) around y_k. This is shown in Fig. 2.1 (right) for a one-dimensional problem. Nesterov chooses this combination so that the minimum value φ*_{k+1} of φ_{k+1}(·) satisfies φ*_{k+1} ≥ f(x_{k+1}) ≥ f(x*). We shall require that φ*_{k+1} = f(x_{k+1}). Now we quote Nesterov [5] to describe the construction of the functions φ_k(·). We start from given x_k, v_k, γ_k, φ*_k.
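The decrease requirement (2.1) is exactly what a backtracking Armijo search delivers. A minimal sketch (the parameter names nu0, sigma, rho are our own choices, not the paper's): with sigma = 1/2 and halving steps, a standard argument shows the accepted step satisfies nu ≥ min(nu0, 1/(2L)), hence the decrease is at least ‖∇f(y_k)‖²/(4L) when nu0 ≥ 1/L.

```python
# Armijo backtracking for the steepest descent step from y_k.
# Hedged sketch: a generic implementation, not the authors' code.
import numpy as np

def armijo_steepest_descent(f, grad, y, nu0=1.0, sigma=0.5, rho=0.5, max_iter=60):
    """Return x = y - nu * grad(y) with f(x) <= f(y) - sigma * nu * ||grad(y)||^2."""
    g = grad(y)
    gg = float(g @ g)
    nu = nu0
    for _ in range(max_iter):
        x = y - nu * g
        if f(x) <= f(y) - sigma * nu * gg:   # Armijo condition with parameter sigma
            return x, nu
        nu *= rho                            # backtrack: shrink the step
    return x, nu                             # fallback after max_iter reductions
```

On a quadratic with L = 1 and nu0 = 1, the first trial step is already accepted, reproducing the exact step ν = 1/L.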
At this point we assume that y_k is also given (its calculation is one of the tasks of this paper). Assume also that x_{k+1} has been computed and that (2.1) holds. Once y_k and x_{k+1} are given, we must compute α_k. Here we change the notation to include α as a variable and define the function (α, x) ∈ IR × IR^n ↦ φ_{k+1}(α, x).
The function φ_{k+1}(α, ·) is an affine combination of φ_k(·) and a quadratic approximation ℓ_k(·) of f(·) around y_k:

(2.2) φ_{k+1}(α, x) = (1 − α) φ_k(x) + α ℓ_k(x), where ℓ_k(x) = f(y_k) + ∇f(y_k)^T (x − y_k) + (µ/2) ‖x − y_k‖².

Remark. In this subsection we shall not require ℓ_k(·) to be a lower quadratic approximation, i.e., µ > 0 may fail to be a strong convexity constant. If µ > 0, then ℓ_k(·) is minimized at x̂ = y_k − ∇f(y_k)/µ, with

(2.3) min_{x ∈ IR^n} ℓ_k(x) = ℓ_k(x̂) = f(y_k) − ‖∇f(y_k)‖² / (2µ),

and ℓ_k(·) may be expressed as ℓ_k(x) = ℓ_k(x̂) + (µ/2) ‖x − x̂‖². We shall assume that ℓ_k(·) is not constant, otherwise y_k would be an optimal solution for the problem (P). Define for α ∈ IR

(2.4) φ*_{k+1}(α) = inf_{x ∈ IR^n} φ_{k+1}(α, x) ∈ IR ∪ {−∞}

and

(2.5) γ_{k+1}(α) = (1 − α) γ_k + α µ.

The function φ_{k+1}(α, ·) is also a simple quadratic function, with Hessian γ_{k+1}(α) I. If γ_{k+1}(α) > 0, then φ*_{k+1}(α) is finite. If γ_{k+1}(α) < 0, then φ*_{k+1}(α) = −∞. The fact below follows trivially from (2.5):

Fact 2.2. For γ_k > µ ≥ 0, {α ∈ IR | γ_{k+1}(α) > 0} = (−∞, α_max), with α_max = γ_k / (γ_k − µ).

So φ*_{k+1}(α) is well defined for any α < α_max, and φ*_{k+1}(α) = −∞ for α > α_max because in this case γ_{k+1}(α) < 0. Here we shall look at what happens for α = α_max and isolate a simple situation in which φ_k and ℓ_k have coincident minimizers.

Lemma 2.3. Consider γ_k > µ and α_max = γ_k / (γ_k − µ). Then one and only one of the following situations occurs:
(i) v_k is a minimizer of ℓ_k and φ*_{k+1} is linear on (−∞, α_max], with φ*_{k+1}(α) = (1 − α) φ*_k + α ℓ_k(v_k).
(ii) φ*_{k+1}(α_max) = −∞.

Proof. Let us assume that inf_{x ∈ IR^n} φ_{k+1}(α_max, x) is finite. By construction, φ_{k+1}(α_max, ·) is linear, because γ_{k+1}(α_max) = 0. Thus, φ_{k+1}(α_max, ·) = (1 − α_max) φ_k(·) + α_max ℓ_k(·) is constant. If µ = 0, then α_max = 1, ℓ_k(·) is constant, and obviously v_k is a minimizer of ℓ_k.
On the other hand, if µ > 0, we can write ℓ_k(x) = ℓ_k(x̂) + (µ/2) ‖x − x̂‖², where x̂ is the minimizer of ℓ_k(·). Differentiating the expression for φ_{k+1}(α_max, ·) with respect to x, we have for any x ∈ IR^n,

(1 − α_max) γ_k (x − v_k) + α_max µ (x − x̂) = 0.

If, by contradiction, v_k ≠ x̂, then substituting x = v_k we get α_max µ = 0, which cannot be true, proving that v_k = x̂. So, in any case, v_k is a minimizer of both φ_k(·) and ℓ_k(·), and for any α ∈ (−∞, α_max],

φ*_{k+1}(α) = (1 − α) φ*_k + α ℓ_k(v_k),

which is linear, completing the proof.

Lemma 2.4. For any α ∈ (−∞, α_max), the function x ↦ φ_{k+1}(α, x) is a simple quadratic function given by

(2.6) φ_{k+1}(α, x) = φ*_{k+1}(α) + (γ_{k+1}(α)/2) ‖x − v_{k+1}(α)‖²,

with

(2.7) φ*_{k+1}(α) = f(y_k) − (α² / (2 γ_{k+1}(α))) ‖∇f(y_k)‖² + (1 − α) [φ*_k − f(y_k) + (α γ_k / γ_{k+1}(α)) ((µ/2) ‖y_k − v_k‖² + ∇f(y_k)^T (v_k − y_k))],

(2.8) v_{k+1}(α) = (1/γ_{k+1}(α)) ((1 − α) γ_k v_k + α (µ y_k − ∇f(y_k)))

and

(2.9) γ_{k+1}(α) = (1 − α) γ_k + α µ.

In particular, if µ > 0 then

(2.10) φ*_{k+1}(1) = f(y_k) − ‖∇f(y_k)‖² / (2µ).

Proof. The proof is straightforward and done in detail in Nesterov's book [5].

We now use Danskin's theorem, in the format presented in Bertsekas [1], to prove that φ*_{k+1}(·) is concave.

Lemma 2.5. The function α ∈ IR ↦ φ*_{k+1}(α) constructed above is concave in IR and differentiable in (−∞, α_max).

Proof. If φ*_{k+1}(α_max) is finite, then Lemma 2.3 ensures that φ*_{k+1}(·) is linear, and hence this case is proved. Assume then that φ*_{k+1}(α_max) = −∞. We must prove that φ*_{k+1}(·) is concave and differentiable in (−∞, α_max). We will prove the concavity in an arbitrary interval [α_1, α_2] ⊂ (−∞, α_max). For each α ∈ [α_1, α_2], φ_{k+1}(α, ·) has a unique minimizer v_{k+1}(α). By continuity of v_{k+1}(·) in (2.8), v_{k+1}([α_1, α_2]) is bounded, i.e., there exists a compact set V ⊂ IR^n such that φ*_{k+1}(α) = min_{x ∈ V} φ_{k+1}(α, x). So the hypotheses for Danskin's theorem are satisfied: φ_{k+1}(·, x) is concave (in fact linear) for each x ∈ IR^n,
φ_{k+1}(α, ·) has a unique minimizer in the compact set V for each α ∈ [α_1, α_2]. We conclude that φ*_{k+1}(·) is concave and differentiable in [α_1, α_2]. Since [α_1, α_2] was taken arbitrarily, this proves the concavity and differentiability in (−∞, α_max), completing the proof.

Setting y_k = x_k + θ_k d_k, with d_k = v_k − x_k and θ_k ∈ [0, 1], (2.7) may be written as

(2.11) φ*_{k+1}(α) = f(y_k) − (α² / (2 γ_{k+1})) ‖∇f(y_k)‖² + (1 − α) ζ(α, θ_k)

with

(2.12) ζ(α, θ) = φ*_k − f(x_k + θ d_k) + (α γ_k / ((1 − α) γ_k + α µ)) ((µ/2) (1 − θ)² ‖d_k‖² + (1 − θ) f′(x_k + θ d_k, d_k)),

where f′(x, d) = ∇f(x)^T d denotes the directional derivative of f at x along d.

Lemma 2.6. Assume that f(x_k) ≤ φ*_k, y_k = x_k + θ_k d_k with d_k = v_k − x_k and θ_k ∈ [0, 1]. Assume also that x_{k+1} is obtained by a steepest descent step from y_k, satisfying (2.1), where the Lipschitz constant L is possibly infinite. Then

(2.13) φ*_{k+1}(α) ≥ f(x_{k+1}) + (1/(4L) − α² / (2 γ_{k+1})) ‖∇f(y_k)‖² + (1 − α) ζ(α, θ_k).

Furthermore,

(2.14) ζ(α, θ) ≥ ((α γ_k / ((1 − α) γ_k + α µ)) (1 − θ) − θ) f′(x_k + θ d_k, d_k).

Proof. The expression (2.13) follows directly from (2.11) and the fact that the steepest descent step satisfies (2.1). Using the facts that f(x_k) ≤ φ*_k and µ ≥ 0 in the definition of ζ(·, ·), we can write

ζ(α, θ) ≥ f(x_k) − f(x_k + θ d_k) + (α γ_k / ((1 − α) γ_k + α µ)) (1 − θ) f′(x_k + θ d_k, d_k).

By the convexity of f(·), we know that f(x_k) − f(x_k + θ d_k) ≥ −θ f′(x_k + θ d_k, d_k), and so

ζ(α, θ) ≥ ((α γ_k / ((1 − α) γ_k + α µ)) (1 − θ) − θ) f′(x_k + θ d_k, d_k),

completing the proof.

2.3. Choice of the parameters. In this section we assume that µ ≥ 0 is a strong convexity constant for f(·). We present two different schemes for choosing the parameters θ_k and α_k in Algorithm 2.1. For both schemes we have two main objectives:
Prove that for all choices of θ_k it is always possible to compute α_k such that φ*_{k+1}(α_k) = f(x_{k+1});
Prove that in any case α_k is large, i.e., α_k = Ω(√γ_{k+1}).
We begin by describing Nesterov's choice.
Nesterov's choice [5, Scheme 2.2.6]. When a Lipschitz constant L is known, Nesterov's choices are:
α_N is computed so that the central term in (2.13) is null. So α_N is the positive solution of the second-degree equation

(2.15) 2L α² − (1 − α) γ_k − α µ = 0.

θ_N is such that the right-hand side of (2.14) is null:

(2.16) θ_N = α_N γ_k / (γ_k + α_N µ).

Note: if L is unknown but exists, α_N and θ_N are well defined (but unknown). If L does not exist, we set α_N = 0, θ_N = 0. Nesterov's original method sets α_k = α_N and θ_k = θ_N, obtaining directly from (2.13) and ζ(α_N, θ_N) = 0 that

(2.17) φ*_{k+1}(α_N) ≥ f(x_{k+1}) and α_N = √(γ_{k+1} / (2L)).

Our choice.
Compute θ_k ∈ [0, 1] such that

(2.18) f(x_k + θ_k d_k) ≤ f(x_k) and (θ_k = 1 or f′(x_k + θ_k d_k, d_k) ≥ 0).

These are an Armijo condition with parameter 0 and a Wolfe condition with parameter 0.
Compute α_k ∈ [0, 1] as the largest root of the equation

(2.19) A α² + B α + C = 0

with

G = γ_k ((µ/2) ‖v_k − y_k‖² + ∇f(y_k)^T (v_k − y_k)),
A = G + (1/2) ‖∇f(y_k)‖² + (µ − γ_k) (f(x_k) − f(y_k)),
B = (µ − γ_k) (f(x_{k+1}) − f(x_k)) − γ_k (f(y_k) − f(x_k)) − G,
C = γ_k (f(x_{k+1}) − f(x_k)).

As we will see, this equation comes from the equality φ*_{k+1}(α) = f(x_{k+1}). The next lemma uses the definition of α_N to show under which conditions we can compute α_k ≥ α_N such that this equality holds. By construction, we have

(2.20) φ*_{k+1}(0) = φ*_k ≥ f(x_k),
(2.21) φ*_{k+1}(1) = inf_{x ∈ IR^n} ℓ_k(x) ≤ f(x*) ≤ f(x_{k+1}).

Let us first eliminate a trivial case: due to (2.21), if φ*_{k+1}(1) = f(x_{k+1}) then x_{k+1} is an optimal solution, and we may terminate the algorithm. Let us assume that φ*_{k+1}(1) < f(x_{k+1}).
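Once the coefficients of (2.19) are assembled, computing α_k is just a matter of taking the largest root of a scalar quadratic. A hedged sketch (the guard for a nearly-zero leading coefficient is our own implementation choice, not part of the paper):

```python
# Largest root of A alpha^2 + B alpha + C = 0, as needed for (2.19).
# Generic sketch; the caller is assumed to have computed A, B, C.
import math

def largest_root(A, B, C, eps=1e-14):
    """Return the largest real root; falls back to the linear case when A ~ 0."""
    if abs(A) < eps:
        return -C / B                      # degenerate: B alpha + C = 0
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        # No real root: the equation phi*_{k+1}(alpha) = f(x_{k+1}) is infeasible
        raise ValueError("no real root")
    r = math.sqrt(disc)
    return max((-B + r) / (2.0 * A), (-B - r) / (2.0 * A))
```

In exact arithmetic Lemma 2.7 below guarantees that, under its hypotheses, this largest root lies in [0, 1].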
The following lemma does not need the assumption that µ > 0 is a strong convexity constant.

Lemma 2.7. Assume that φ*_k > f(x_{k+1}) > inf_{x ∈ IR^n} ℓ_k(x). Then φ*_{k+1}(α) = f(x_{k+1}) has one or two real roots, and the largest root belongs to [0, 1]. The roots are computed by solving the second-degree equation (2.19).

Proof. Let ᾱ be the largest α ∈ [0, 1) such that φ*_{k+1}(ᾱ) = max_{α ∈ [0,1]} φ*_{k+1}(α), which exists because φ*_{k+1}(·) is concave and continuous in [0, 1), with φ*_{k+1}(1) < φ*_{k+1}(ᾱ). Since φ*_{k+1}(1) < φ*_{k+1}(ᾱ) and f(x_{k+1}) > φ*_{k+1}(1), the concave function decreases from ᾱ to 1 and crosses the value f(x_{k+1}) exactly once in [ᾱ, +∞), at some point α ∈ [0, 1]. Setting φ*_{k+1}(α) = f(x_{k+1}) in (2.7) and multiplying both sides of the equation by γ_{k+1}(α) = (1 − α) γ_k + α µ, we obtain a second-degree equation in α. The largest root of this equation must be α, completing the proof.

Lemma 2.8. Consider an iteration k of Algorithm 2.1. Assume that x_k is not a minimizer of f and φ*_{k+1}(1) < f(x_{k+1}). Assume that y_k = x_k + θ_k d_k is chosen so that ζ(α_N, θ_k) ≥ 0. Then there exists a value α_k ∈ [α_N, 1] that solves the equation φ*_{k+1}(α_k) = f(x_{k+1}) in (2.7).

Proof. Assume that α_N and y_k = x_k + θ_k d_k have been computed as in the lemma. Since x_k is not a minimizer of f, by convexity of f and the definition of ℓ_k(·),

φ*_k ≥ f(x_k) > f(x*) ≥ ℓ_k(x*) ≥ inf_{x ∈ IR^n} ℓ_k(x).

On the other hand,

f(x_{k+1}) > φ*_{k+1}(1) = inf_{x ∈ IR^n} ℓ_k(x).

Using Lemma 2.6 with α = α_N and ζ(α_N, θ_k) ≥ 0, we get from (2.13) that φ*_{k+1}(α_N) ≥ f(x_{k+1}). Now from Lemma 2.7 the equation φ*_{k+1}(α) = f(x_{k+1}) has a real root, and the largest root α is in [0, 1]. Note that α ≥ α_N, because α is the largest root of the concave function φ*_{k+1}(·) − f(x_{k+1}).

We conclude from this lemma that the equality φ*_{k+1}(α_k) = f(x_{k+1}) can always be achieved whenever ζ(α_N, θ_k) ≥ 0.

Theorem 2.9.
Algorithm 2.1 with either of the choices (2.16) or (2.18) for θ_k is well defined and generates sequences {x_k} and {φ*_k} such that for all k = 0, 1, ..., φ*_k = f(x_k). Moreover, the parameters α_k and γ_k satisfy

(2.22) α_k ≥ √(γ_{k+1} / (2L)).

Proof. (i) Let us first prove that both versions of the algorithm are well defined.
Nesterov: immediate. Taking θ_k = θ_N we obtain, by definition of α_N and θ_N, ζ(α_N, θ_N) = 0, and Lemma 2.8 ensures that α_k ∈ [α_N, 1) can be computed as desired.
Our method: since f(x_k) − f(x_k + θ_k d_k) ≥ 0 and f′(x_k + θ_k d_k, d_k) ≥ 0 by construction, ζ(α, θ_k) ≥ 0 for any α ∈ [0, 1], and Lemma 2.8 can be used similarly.
(ii) In both cases we obtain ζ(α_k, θ_k) ≥ 0. Hence from (2.13),

f(x_{k+1}) = φ*_{k+1}(α_k) ≥ f(x_{k+1}) + (1/(4L) − α_k² / (2 γ_{k+1})) ‖∇f(y_k)‖²,
which immediately gives (2.22), completing the proof.

We conclude from this that Nesterov's algorithm, which uses the value θ_k = θ_N with α_k = α_N, can be improved by re-calculating α_k using (2.7) (and keeping θ_k = θ_N).

Remark. Even computing the best value for α_k, the sequence {f(x_k)} is not necessarily decreasing.

Remark. In the case µ = 0 the expressions get simplified. Then α_N is the positive solution of L α_k² = (1 − α_k) γ_k, and expression (2.14) reduces to

ζ(α, θ) ≥ ((α − θ) / (1 − α)) f′(x_k + θ d_k, d_k).

In this case, θ_N = α_N.

3. More on the choice of θ_k. Our choice of θ_k uses a line search along d_k. In this section we discuss the amount of computation needed in this line search, and show that when L and µ are known it may be done in polynomial time. We conclude from the preceding analysis that what must be shown for a given choice of y_k = x_k + θ_k d_k is that ζ(α_N, θ_k) ≥ 0. The following lemma summarizes the conditions that ensure this property.

Lemma 3.1. Let y_k = x_k + θ_k d_k be the choice made by Algorithm 2.1 in iteration k, and let θ_N be the Nesterov choice (2.16). Assume that one of the following conditions is satisfied:
(i) f(y_k) ≤ f(x_k) and f′(y_k, d_k) ≥ 0;
(ii) f(y_k) ≤ f(x_k) and θ_k ≥ θ_N;
(iii) f′(y_k, d_k) ≥ 0 and θ_k ≤ θ_N;
(iv) f′(x_k, d_k) ≥ −(µ/2) ‖d_k‖² and θ_k = 0.
Then ζ(α_N, θ_k) ≥ 0 (and α_k ≥ α_N may be computed using (2.7)).

Proof. (i) has been proved in Theorem 2.9. Let us substitute θ_k = θ_N + ∆θ in (2.14), with γ_N = (1 − α_N) γ_k + α_N µ:

ζ(α_N, θ_k) ≥ (−θ_N + (α_N γ_k / γ_N)(1 − θ_N) − ∆θ (1 + α_N γ_k / γ_N)) f′(x_k + θ_k d_k, d_k).

By the definition of θ_N, −θ_N + α_N γ_k (1 − θ_N) / γ_N = 0. Hence

(3.1) ζ(α_N, θ_k) ≥ −(1 + α_N γ_k / γ_N) ∆θ f′(x_k + θ_k d_k, d_k).

(ii) Assume that f(y_k) ≤ f(x_k) and θ_k ≥ θ_N. If f′(y_k, d_k) ≤ 0, then the result follows from (3.1) because ∆θ ≥ 0. Otherwise, it follows from (i).
(iii) Assume now that f′(y_k, d_k) ≥ 0 and θ_k ≤ θ_N. Then the result follows from (3.1) because ∆θ ≤ 0.
(iv) This case is also trivial by (2.12), completing the proof.

All we need is to specify in Algorithm 2.1 the choice of θ_k and α_k as follows:
Choose θ_k ∈ [0, 1] satisfying one of the conditions in Lemma 3.1.
Compute α_k such that φ*_{k+1} = f(x_{k+1}) using (2.7).
We must discuss the computation of θ_k, which requires a line search. Of course, if L is given, we can use θ_k = θ_N and ensure the optimal complexity, but larger values of α_k may be obtained.
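Condition (2.18) (equivalently, condition (i) of Lemma 3.1) is inexpensive to test at a candidate θ, given function and gradient oracles. A sketch (the oracle names f, grad are our own):

```python
# Test of condition (i) of Lemma 3.1 / (2.18) at a candidate theta.
# Hedged sketch: zero-parameter Armijo and Wolfe conditions along d_k.
import numpy as np

def accepts_theta(f, grad, x_k, d_k, theta):
    """True if y = x_k + theta d_k satisfies (2.18)."""
    y = x_k + theta * d_k
    armijo = f(y) <= f(x_k)                            # f does not increase
    wolfe = (theta == 1.0) or (float(grad(y) @ d_k) >= 0.0)  # f'(y, d_k) >= 0
    return armijo and wolfe
```

Any θ between the minimizer of θ ↦ f(x_k + θ d_k) and the point where f returns to the value f(x_k) passes this test, which is why an imprecise interval reduction search suffices.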
We would like to obtain θ_k (and then α_k) such that ζ(α_k, θ_k) is as large as possible. Since the determination of α_k depends on the steepest descent step from y_k = x_k + θ_k d_k, we cannot maximize ζ(·, ·). But positive values of ζ(α, θ_k) will be obtained if θ_k is such that f(x_k) − f(x_k + θ_k d_k) and f′(x_k + θ_k d_k, d_k) are made as large as possible (of course, these are conflicting objectives). We define (but do not compute) two points:

(3.2) θ* ∈ argmin{f(x_k + θ d_k) | θ ≥ 0},
(3.3) θ** = max{θ ∈ [0, 1] | f(x_k + θ d_k) ≤ f(x_k)}.

Let us examine two important cases:
If f′(x_k, d_k) ≥ 0, then we should choose θ_k = 0: a steepest descent step from x_k follows.
If f(x_k + d_k) ≤ f(x_k), then we may choose θ_k = 1: a steepest descent step from v_k follows.
If neither of these two special situations occurs, then 0 < θ* < θ** < 1. Any point θ_k ∈ [θ*, θ**] gives ζ(α, θ_k) ≥ 0 for any α ∈ [0, 1], and y_k = x_k + θ_k d_k satisfies condition (i) of Lemma 3.1. A point in this interval can be computed by the interval reduction algorithm described in the Appendix. We have no complexity estimates for a line search procedure such as this if L and µ are not available. When these constants are known, it is possible to limit the number of iterations in the line search to keep Nesterov's complexity. We shall briefly discuss the choice of θ_k in three possible situations, according to our knowledge of the constants.

Case 1: no knowledge of L or µ. We start the iteration checking the special cases above:
If f(x_k + d_k) ≤ f(x_k), then we choose θ_k = 1: a steepest descent step from v_k will follow.
Otherwise, we must decide whether f′(x_k, d_k) ≥ 0. This may be done as follows:
    Compute f(x_k + θ d_k) for a small value of θ > 0.
    If f(x_k + θ d_k) ≥ f(x_k), compute ∇f(x_k) and then f′(x_k, d_k) = ∇f(x_k)^T d_k.
    Else f′(x_k, d_k) < 0.
If f′(x_k, d_k) ≥ 0, set θ_k = 0. Else, do a line search as in the Appendix, starting with the points 0, θ and 1.

Case 2: L is given.
In this case we start the search with θ_N computed as in (2.16). The line search can be abbreviated to keep the number of function evaluations within a given bound, resulting in a step satisfying one of the first three conditions in Lemma 3.1. There are two cases to consider, depending on the sign of f(x_k) − f(x_k + θ_N d_k).

f(x_k + θ_N d_k) < f(x_k): condition (ii) of Lemma 3.1 is satisfied for any θ_k ≥ θ_N such that f(x_k + θ_k d_k) ≤ f(x_k). We may use an interval reduction method as in the Appendix, or simply take steps to increase θ. A trivial method is the following, using a constant β > 1:
    θ_k = θ_N
    while β θ_k ≤ 1 and f(x_k + β θ_k d_k) ≤ f(x_k), set θ_k = β θ_k.
In any method, the number of function evaluations may be limited by a given constant, adopting as θ_k the largest computed step length which satisfies the descent condition.

f(x_k + θ_N d_k) > f(x_k): condition (iii) of Lemma 3.1 holds for θ_N, and we intend to reduce the value of θ. For an interval reduction search between 0 and θ_N we need
FINE TUNING NESTEROV S ALGORITHM 13 an intermediate point ν such that fx k + νd k ) fx k ), if it exists. We propose the following procedure: Compute fx k + νd k ) for ν << θ N If fx k + νd k ) fx k ), set θ k = ν Else compute θ k by an interval reduction method as in the Appendix, starting with the points 0, ν and θ N. The algorithm can be interrupted at any time, setting θ k = B. Case 3: L and µ > 0 are given.. This is a special case of case 2, with an interesting property: a point satisfying condition i) in Lemma 3.1 can be computed in polynomial time without computing fx k ), using the next lemma. Lemma 3.2. Consider θ = argmin{fx k + θd k ) θ 0} and θ > θ such that fx k + θ d k ) = fx k ), if it exists. Define θ = µ/2l). If fx k + θd k ) fx k ) µ2 8L d 2, then θ θ and θ θ θ, otherwise f x k, d k ) µ 2 d 2. Proof. Assume that fx k + θ d k ) fx k + θd k ) fx k ) µ2 8L d 2. Using the Lipschitz condition at θ with f x k + θ d k, d k ) = 0, we get for θ IR, Using the assumption, fx k + θd k ) fx k + θ d k ) + L 2 θ θ ) 2 d 2. fx k + θd k ) fx k ) + ) Lθ θ ) 2 µ2 d 2 4L 2. For θ = 0, this gives Lθ 2 µ 2 /4L) 0, or equivalently θ 2 θ 2, proving the first inequality. For θ = θ, fx k + θ d k ) = fx k ), and we obtain immediately θ θ ) 2 θ 2. Assume now that fx k + θd k ) > fx k ) µ2 8L d 2. By the Lipschitz condition for f ), we know that fx k + θd k ) fx k ) + f x k, d k ) θ + L 2 θ 2 d 2. Assuming by contradiction that f x k, d k ) < µ d 2 /2, we obtain fx k + θd k ) < fx k ) µ 2 θ d 2 + L 2 θ 2 d 2 = fx k ) µ2 4L d 2 + µ2 8L d 2 = fx k ) µ2 8L d 2, contradicting the hypothesis and completing the proof. In this case we may use a golden search, as follows: If fx k + θd k ) fx k ) µ2 8L d 2, set θ k = 0 and stop condition iv) in Lemma 3.1 holds). If fx k + d k ) fx k ), set θ k = 1 and stop condition ii) in Lemma 3.1 holds). Now we know that θ θ µ/2l) < 1). Use an interval reduction search as in the Appendix in the interval [ θ, 1]
If we use a golden section search, then at each iteration j before the stopping condition is met we have A ≤ θ* ≤ B, and the interval length satisfies B − A ≤ β^j, with β = (√5 − 1)/2 ≈ 0.62. Using the lemma above, a point B ∈ [θ*, θ**] will have been found when β^j ≤ µ/(2L). Hence the number of iterations of the search is bounded by

j_max = log(µ/(2L)) / log(β).

This provides a bound on the computational effort per iteration of O(log(2L/µ)) function evaluations.

4. Adaptive convexity parameter µ. In computational tests (see the next section) we see that the use of a strong convexity constant is very effective in reducing the number of iterations, but such a constant is very seldom available. In this section we state an algorithm which uses a decreasing sequence of estimates for the value of the constant. We assume that a strong convexity parameter µ* ≥ 0 (possibly null) is given, and the method generates a sequence µ_k ≥ µ*, starting with µ_0 ∈ [µ*, γ_0). We still need an extra hypothesis: that the level set associated with x_0 is bounded. Note that since f(·) is convex, this is equivalent to the hypothesis that the optimal set is bounded.

The algorithm iterations are the same as in Algorithm 2.1, using a parameter µ_k ≥ µ*. The parameter is reduced (we do this by dividing it by a fixed factor c > 1 and taking the maximum with µ*) in two situations:
When γ_k − µ* < β(µ_k − µ*), where β > 1 is fixed (we used β = 1.02), meaning that γ_k is too close to µ_k.
When it is impossible to satisfy the equation φ*_{k+1}(α) = f(x_{k+1}) for α ∈ [0, 1]. By Lemma 2.7, this can only happen when f(x_{k+1}) ≤ min_{x ∈ IR^n} ℓ_k(x). This minimum is given by φ*_{k+1}(1) = f(y_k) − ‖∇f(y_k)‖²/(2µ_k), using expression (2.10). So µ_k cannot be larger than

µ̄ = ‖∇f(y_k)‖² / (2 (f(y_k) − f(x_{k+1}))).

If µ_k ≥ µ̄, we reduce µ_k.

Notation: denote φ_{k+1}(x) = φ_{k+1}(α_k, x) and φ*_{k+1} = φ*_{k+1}(α_k).

Algorithm 4.1. Algorithm with adaptive convexity parameter.
Data: x_0 ∈ IR^n, v_0 = x_0, γ_0 > 0, β > 1, µ* = µ, µ_0 ∈ [µ*, γ_0).
k = 0
Set µ_+ = µ_0.
repeat
    d_k = v_k − x_k.
    Choose θ_k ∈ [0, 1] as in Algorithm 2.1.
    y_k = x_k + θ_k d_k.
    If ∇f(y_k) = 0, then stop with y_k as an optimal solution.
    Steepest descent step: x_{k+1} = y_k − ν ∇f(y_k). If L is known, ν = 1/L may be used.
    If γ_k − µ* < β(µ_+ − µ*), then choose µ_+ ∈ [µ*, γ_k/β].
    Compute µ̄ = ‖∇f(y_k)‖² / (2 (f(y_k) − f(x_{k+1}))).
    If µ_+ ≥ µ̄, then µ_+ = max{µ*, µ̄/c}, where c > 1 is a fixed reduction factor.
    Compute α_k as the largest root of (2.19) with µ = µ_+.
    γ_{k+1} = (1 − α_k)γ_k + α_k µ_+.
    v_{k+1} = (1/γ_{k+1}) ((1 − α_k)γ_k v_k + α_k (µ_+ y_k − ∇f(y_k))).
    Set µ_k = µ_+.
    k = k + 1.

We suggest γ_0 = L if L is known, µ_0 = max{µ∗, γ_0/100}, β = 1.02. We also suggest the following reduction scheme for µ_+, provided that β ≤ 10: µ_+ = max{µ∗, γ_k/10}. Independently of this choice, the following fact holds:

Fact 4.2. For any k ≥ 0, γ_k − µ∗ ≥ β(µ_k − µ∗).

Proof. If at the beginning of an iteration k we have γ_k − µ∗ < β(µ_+ − µ∗), then we choose µ_+ ∈ [µ∗, γ_k/β] with β > 1. Then

β(µ_+ − µ∗) ≤ γ_k − βµ∗ ≤ γ_k − µ∗.

As µ_k ≤ µ_+, we complete the proof.

We now show the optimality of the algorithm, using an additional hypothesis.

Hypothesis. Given x_0 ∈ IR^n, assume that the level set associated with x_0 is bounded, i.e.,

D = sup{‖x − y‖ | f(x) ≤ f(x_0), f(y) ≤ f(x_0)} < ∞.

Define Q = D²/2.

Lemma 4.3. Consider γ_0 > µ∗. Let x∗ be an optimal solution of (P). Then, at all iterations of Algorithm 4.1,

(4.1)    φ_k(x∗) − f(x∗) ≤ ((γ_0 + L)/(γ_0 − µ∗)) Q (γ_k − µ∗).

Proof. First let us prove that (4.1) holds for k = 0. By the definition of φ_0 and the convexity of f with Lipschitz constant L for the gradient, we have

φ_0(x∗) − f(x∗) = f(x_0) − f(x∗) + (γ_0/2)‖x∗ − x_0‖² ≤ ((L + γ_0)/2)‖x∗ − x_0‖² ≤ (L + γ_0)Q.

Thus, we can use induction. By (2.2) and (1.2), we have

φ_{k+1}(x) = (1 − α_k)φ_k(x) + α_k (f(y_k) + ∇f(y_k)ᵀ(x − y_k) + (µ_k/2)‖x − y_k‖²)
          ≤ (1 − α_k)φ_k(x) + α_k (f(x) + ((µ_k − µ∗)/2)‖x − y_k‖²).

Thus,

φ_{k+1}(x∗) − f(x∗) ≤ (1 − α_k)(φ_k(x∗) − f(x∗)) + α_k Q(µ_k − µ∗).
Using the induction hypothesis and the definition of γ_{k+1}, we have

φ_{k+1}(x∗) − f(x∗) ≤ (1 − α_k)((γ_0 + L)/(γ_0 − µ∗)) Q (γ_k − µ∗) + α_k Q(µ_k − µ∗)
                  ≤ ((γ_0 + L)/(γ_0 − µ∗)) Q ((1 − α_k)γ_k + α_k µ_k − µ∗)
                  = ((γ_0 + L)/(γ_0 − µ∗)) Q (γ_{k+1} − µ∗),

completing the proof.

Now we prove a technical lemma, following a very similar result in Nesterov [5, Lemma 2.2.4].

Lemma 4.4. Consider a positive sequence {λ_k}. Assume that there exists M > 0 such that λ_{k+1} ≤ (1 − M√λ_{k+1}) λ_k. Then, for all k > 0,

λ_k ≤ 4/(M² k²).

Proof. Denote a_k = 1/√λ_k. Since {λ_k} is a decreasing sequence, we have

a_{k+1} − a_k = (√λ_k − √λ_{k+1})/(√λ_{k+1} √λ_k) = (λ_k − λ_{k+1})/(√λ_{k+1} √λ_k (√λ_k + √λ_{k+1})) ≥ (λ_k − λ_{k+1})/(2 λ_k √λ_{k+1}).

Using the hypothesis,

a_{k+1} − a_k ≥ (λ_k − (1 − M√λ_{k+1})λ_k)/(2 λ_k √λ_{k+1}) = M/2.

Thus, a_k ≥ a_0 + Mk/2 ≥ Mk/2, and consequently λ_k ≤ 4/(M²k²), completing the proof.

Now we can discuss how fast (γ_k − µ∗) goes to zero for estimating the rate of convergence of the algorithm.

Lemma 4.5. Consider γ_0 > µ∗. Then, for all k > 0,

(4.2)    γ_k − µ∗ ≤ min{ (1 − ((β − 1)/β)√(µ∗/(2L)))^k (γ_0 − µ∗),  8β²L/((β − 1)² k²) }.

Proof. By the definition of γ_{k+1},

γ_{k+1} − µ∗ = (1 − α_k)(γ_k − µ∗) + α_k(µ_k − µ∗).

By Fact 4.2, µ_k − µ∗ ≤ (γ_k − µ∗)/β. Thus,

(4.3)    γ_{k+1} − µ∗ ≤ (1 − α_k)(γ_k − µ∗) + (α_k/β)(γ_k − µ∗) = (1 − ((β − 1)/β) α_k)(γ_k − µ∗).
By Theorem 2.9,

(4.4)    α_k ≥ √(γ_{k+1}/(2L)) ≥ √((γ_{k+1} − µ∗)/(2L)).

Thus,

γ_{k+1} − µ∗ ≤ (1 − ((β − 1)/β)√(µ∗/(2L)))(γ_k − µ∗) ≤ (1 − ((β − 1)/β)√(µ∗/(2L)))^{k+1} (γ_0 − µ∗),

proving the first inequality in (4.2). On the other hand, from (4.3) and (4.4), we have

γ_{k+1} − µ∗ ≤ (1 − ((β − 1)/β)√((γ_{k+1} − µ∗)/(2L)))(γ_k − µ∗).

Applying Lemma 4.4 with λ_k = γ_k − µ∗, for k ≥ 0, and M = (β − 1)/(β√(2L)), we complete the proof.

The next theorem gives the complexity bound of the algorithm.

Theorem 4.6. Consider γ_0 > µ_0. Then Algorithm 4.1 generates a sequence {x_k} such that, for all k > 0,

f(x_k) − f(x∗) ≤ min{ (1 − ((β − 1)/β)√(µ∗/(2L)))^k (γ_0 − µ∗),  8β²L/((β − 1)² k²) } · Q(L + γ_0)/(γ_0 − µ∗).

Proof. By construction, f(x_k) ≤ φ_k(x) for all x ∈ IR^n, in particular for x∗. Using this and Lemma 4.3, we have

f(x_k) − f(x∗) ≤ ((L + γ_0)/(γ_0 − µ∗)) Q (γ_k − µ∗).

Applying Lemma 4.5, we complete the proof.

This theorem ensures that an error of ε > 0 for the final objective function value is achieved in O(1/√ε) iterations.

Global convergence. In our algorithms, each iteration produces f(x_{k+1}) ≤ f(y_k) ≤ f(x_k). Hence, if we consider the sequence x_1, y_1, x_2, y_2, ..., we have a descent algorithm, and every other step is a steepest descent step with an Armijo step length. The global convergence of such a method is a standard result.

5. Numerical results. In this section, we report the results of a limited set of computational experiments, comparing the variants of Algorithm 2.1 and Algorithm 4.1. The codes are written in Matlab, and we used as stopping criteria the norm of the objective function gradient and the function values, whenever the optimal value is available. In all the algorithms, the line searches are of the Armijo type, and the same routine is used for all. We did not work on the efficiency of the line searches, and so it does not make much sense to compare methods based on the number of function evaluations, with one exception: the original Nesterov algorithm never computes function values, which will be commented on ahead. Hence our comparisons will be based
on the number of iterations, with one or two gradient computations per iteration and one or two line searches per iteration (see below).

Here are the algorithms to be tested, with their acronyms:
[Nesterov1] Algorithm 2.1 with θ_k = θ_N, α_k = α_N.
[α_G, θ_G] Algorithm 2.1 with our choice of parameters.
[Adaptive] Algorithm 4.1 with our choice of parameters.
[Nesterov2] Nesterov's algorithm with adaptive estimates for L, from reference [7].
[Cauchy] Steepest descent algorithm.
[F-R] Fletcher-Reeves, as presented in [10, p.120].

We shall see that in most examples [Adaptive] has the best performance among the optimal methods cited. When there are good estimates for the constants L and µ, [Nesterov1] may be the best. At the end of the section we compare our method with the Fletcher-Reeves conjugate gradient method, and conclude that in many cases this last method has the best performance.

Now we describe our tests for three different convex functions:
(i) Quadratic function with µ = 1 and L = 1000.
(ii) Nesterov's worst function in the world.
(iii) A convex function including exponentials.
In (i) and (iii), the Hessian eigenvalues at the unique minimizer are randomly distributed between µ = 1 and L = 1000.

(i) Quadratic functions. Given n, L > µ > 0 (in these tests, L = 1000 and µ = 1), the quadratic test functions are given by

x ∈ IR^n ↦ f(x) = (1/2) xᵀ Q x,

where Q is a diagonal matrix with random entries in the interval [1, L]. For each experiment, we generate Q and a random initial point x_0 with ‖x_0‖ = 1. We generated several instances of problems with n assuming the values 10, 100, 200, 500, 1000, 2500, which were solved by all methods.

Figure 5.1 shows the performance of the methods assuming ignorance of µ and L. The algorithms were fed with the data µ = 0, L = 10000. On the left, we show the function values for a typical example in dimension 1000, and on the right the performance profiles for all methods on the whole set of problems.
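As an illustration, the quadratic test instances above can be generated in a few lines. This is our own sketch, not the authors' Matlab code; pinning the extreme eigenvalues to µ and L (so that the condition number is exactly L/µ) is an assumption on our part, since the text only asks for random diagonal entries in [1, L].

```python
import numpy as np

def make_quadratic(n, L=1000.0, mu=1.0, seed=0):
    """Random diagonal quadratic f(x) = 0.5 x'Qx with eigenvalues in [mu, L]."""
    rng = np.random.default_rng(seed)
    d = rng.uniform(mu, L, size=n)   # diagonal of Q
    d[0], d[-1] = mu, L              # pin extreme eigenvalues (our assumption)
    f = lambda x: 0.5 * x @ (d * x)
    grad = lambda x: d * x
    x0 = rng.standard_normal(n)
    x0 /= np.linalg.norm(x0)         # random starting point with ||x0|| = 1
    return f, grad, x0
```

Any first-order method can then be run on `(f, grad, x0)`; for instance, steepest descent with the safe constant step 1/L is guaranteed to decrease f at every iteration.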
In all cases [Adaptive] used the smallest number of iterations, followed by [Nesterov2] and [α_G, θ_G]. [Nesterov1] always needed over 10 times the number of iterations of [Adaptive].

Figure 5.2 shows the results for the same tests, but using the knowledge of the values µ = 1 and L = 1000. Now [Nesterov1] performs very well, and even if the number of iterations is a little larger, each iteration is faster because only one line search is needed. Now [Nesterov2] is slower because it does not take advantage of the knowledge of µ.

(ii) Nesterov's worst function of the world. This function is defined and explained in Nesterov [5, p.67]. Given L > µ > 0 (we took µ = 1, L = 1000), set Q = L/µ. The function is

(5.1)    f(x) = (µ(Q − 1)/8) ( x_1² + Σ_{i=1}^{n−1} (x_i − x_{i+1})² − 2x_1 ) + (µ/2)‖x‖²,  x ∈ IR^n.
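For reference, (5.1) and its gradient can be implemented directly. This is our own sketch following Nesterov's definition in [5]; the tridiagonal structure of the gradient comes from differentiating the chain of squared differences.

```python
import numpy as np

def worst_function(mu=1.0, L=1000.0):
    """Nesterov's 'worst function in the world' (5.1) and its gradient (a sketch)."""
    Q = L / mu
    c = mu * (Q - 1.0) / 8.0
    def f(x):
        g = x[0]**2 + np.sum((x[:-1] - x[1:])**2) - 2.0 * x[0]
        return c * g + 0.5 * mu * (x @ x)
    def grad(x):
        g = np.zeros(x.size)
        diff = x[:-1] - x[1:]        # x_i - x_{i+1}
        g[0] = 2.0 * x[0] - 2.0      # from x_1^2 - 2 x_1
        g[:-1] += 2.0 * diff         # each term (x_i - x_{i+1})^2 contributes +...
        g[1:] -= 2.0 * diff          # ...to index i and - to index i+1
        return c * g + mu * x
    return f, grad
```

Since f is quadratic, a central finite difference reproduces the gradient to rounding accuracy, which gives a cheap correctness check.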
Fig. 5.1. Quadratic functions with µ and L unknown.

Fig. 5.2. Quadratic functions with µ and L given.

Generating a set of problems for the same values of n as before and several starting points, Figure 5.3 shows the performance profiles for all problems in two situations: on the left, considering µ and L unknown (µ = 0 and L = 10000); on the right, for µ = 1 and L = 1000 given to the algorithms. We see that the behavior is similar to case (i), with one detail: [Nesterov1] was worse than [Cauchy] for unknown µ and L. This is certainly due to the choice L = 10000, which affects the computation of θ_N in the algorithm.

Fig. 5.3. Performance profile for the function defined by (5.1).

(iii) Function with exponentials. This function is convex and includes exponentials. The Hessian eigenvalues at the
optimizer are randomly distributed between 1 and 1000:

(5.2)    f(x) = (1/2)‖x‖² + Σ_{i=1}^n q_i (e^{x_i} − x_i − 1),

where q_i ∈ [1, 999] for i = 1, ..., n. The minimizer is x∗ = 0, with f(0) = 0. We use the same values of n as in the former examples. For this function the Hessian matrix H(x) is diagonal with H_ii(x) = q_i e^{x_i} + 1, and hence there is no global Lipschitz constant for the gradient. Figure 5.4 shows a typical run with n = 1000 and the performance profile for µ = 0, L = 10000 fed to the algorithms.

Fig. 5.4. Objective function value and performance profile for a function with exponentials.

Again, [Adaptive] has the best performance, and [Nesterov2] takes about 5 times the number of iterations (computing at least two gradients per iteration). Here again, [Nesterov1] is slower than [Cauchy].

Comparison with the Fletcher-Reeves algorithm. As we saw in the Introduction, conjugate gradient algorithms are the most efficient first order methods for quadratic functions. They are also known to perform very well for a strictly convex function in a region where the second derivatives are near their values at an optimal solution (possibly using restarts and precise line searches). There exists a vast literature on conjugate gradient methods, and we do not intend to review it here. We shall construct an example in which the second derivatives are ill-behaved near the optimal solution, and observe that our method performs better than Fletcher-Reeves. For a drastic but not so simple counterexample, see the reference [4] discussed in the Introduction.

The separable function. We start by defining a function g : IR → IR as follows: partition the set [0, +∞] into p intervals defined by the points 0, 2^{−p+2}, 2^{−p+3}, ..., 2^0, +∞, and associate with each interval a positive value chosen at random.
Now integrate this piecewise constant function twice to obtain a strictly convex piecewise quadratic function. Given τ_j ∈ [1, L], j = 1, ..., n, L > 0, our separable function is constructed as

x ∈ IR^n ↦ f(x) = Σ_{j=1}^n τ_j g(x_j).

This function has a global minimum at the origin, the Hessian is always positive diagonal, and the derivatives are L-Lipschitz continuous, but the Hessian H(x) differs very much from H(0) at any point x with ‖x‖ > 2^{−p+2}.
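The construction of g can be sketched as follows. This is our own illustration: the breakpoints 2^{−p+2}, ..., 2^0 follow the text, while the range [0.5, 2] for the random piecewise second derivative and the even extension of g to negative arguments (so that the origin is the global minimizer) are our assumptions.

```python
import numpy as np

def make_g(p=8, seed=0):
    """Convex piecewise quadratic g: a random positive second derivative on each
    interval [0, 2^(-p+2)], ..., [2^0, +inf), integrated twice with g(0) = g'(0) = 0."""
    rng = np.random.default_rng(seed)
    pts = np.concatenate(([0.0], 2.0 ** np.arange(-p + 2, 1)))  # interval endpoints
    c = rng.uniform(0.5, 2.0, size=len(pts))  # g'' on each interval (hypothetical range)
    gp = np.zeros(len(pts))   # g'(pts), accumulated by integrating c once
    gv = np.zeros(len(pts))   # g(pts), accumulated by integrating g' once
    for j in range(1, len(pts)):
        h = pts[j] - pts[j - 1]
        gp[j] = gp[j - 1] + c[j - 1] * h
        gv[j] = gv[j - 1] + gp[j - 1] * h + 0.5 * c[j - 1] * h * h
    def g(t):
        s = abs(t)  # even extension: minimum at 0
        j = np.searchsorted(pts, s, side='right') - 1
        h = s - pts[j]
        return gv[j] + gp[j] * h + 0.5 * c[j] * h * h
    return g
```

Because g'' jumps between adjacent intervals, the curvature of f(x) = Σ τ_j g(x_j) changes abruptly away from the origin, which is exactly the ill-behavior the comparison with [F-R] exploits.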
Figure 5.5 shows a typical behavior of our algorithm and [F-R] for problems with dimensions n = 200 and n = 2000. The algorithm [F-R] was implemented with restarts every n iterations, and also whenever the search direction becomes almost orthogonal to the gradient. The line search used was either golden section or Armijo. In practically all runs with dimensions from 100 to 10000, [F-R] needed more iterations than [Adaptive].

Fig. 5.5. Separable function values in IR^200 and IR^2000.

Appendix: interval reduction search. Let g : [0, 1] → IR be a convex differentiable function. Assume that g(0) = 0, g′(0) < 0 and g′(1) > 0. We shall study the problem of finding θ ∈ [0, 1] such that g(θ) ≤ 0 and g′(θ) ≥ 0. This can be seen as a line search problem with an Armijo constant 0 and a Wolfe condition with constant 0, for which there are algorithms in the literature (see for instance [3, 10]). In the present case, in which the function is convex, we can use a simple interval reduction algorithm, based on the following step: assume that three points 0 ≤ A < ν < B ≤ 1 are given, satisfying g(A) ≥ g(ν) ≤ g(B). Note that, as consequences of the convexity of g, the interval [A, B] contains a minimizer of g and g′(B) ≥ 0. The problem can be solved by the following interval reduction scheme:

Algorithm 5.1. Interval reduction algorithm
while g(B) > 0,
    Choose ξ ∈ (A, B), ξ ≠ ν.
    Set u = min{ν, ξ}, v = max{ν, ξ}.
    If g(u) ≤ g(v), set B = v, ν = u,
    else set A = u, ν = v.

The initial value of ν and the values of ξ in each iteration may be chosen so that u and v define the golden section of the interval [A, B], and then the interval length is reduced by a factor of (√5 − 1)/2 ≈ 0.62 in each iteration. We can also choose ξ as the minimizer of the quadratic function through g(A), g(ν), g(B). In this case one must avoid the case ξ = ν, in which ξ must be perturbed.

Acknowledgements. The authors would like to thank Prof.
Masakazu Kojima and Prof. Mituhiro Fukuda for useful discussions about the subject. We also thank the anonymous referees and Prof. Paulo J. S. Silva for their valuable comments and suggestions, which very much improved this paper. This research was partially completed while the second author was visiting the Tokyo Institute of Technology under the project supported by MEXT's program Promotion of Environmental Improvement
for Independence of Young Researchers under the Special Coordination Funds for Promoting Science and Technology.

REFERENCES

[1] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Belmont, USA, 2003.
[2] G. Lan, Z. Lu, and R. D. C. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Technical report, School of ISyE, Georgia Tech, USA, 2006. Accepted in Mathematical Programming.
[3] J. J. Moré and D. J. Thuente. Line search algorithms with guaranteed sufficient decrease. Technical Report MCS-P330-92, Mathematics and Computer Science Division, Argonne National Laboratory, 1992.
[4] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, New York, 1983.
[5] Y. Nesterov. Introductory Lectures on Convex Optimization. A Basic Course. Kluwer Academic Publishers, Boston, 2004.
[6] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[7] Y. Nesterov. Gradient methods for minimizing composite objective function. Discussion paper 76, CORE, UCL, Belgium, 2007.
[8] Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245-259, 2007.
[9] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
[10] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, 1999.
[11] B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.
[12] P. Richtárik. Improved algorithms for convex minimization in relative scale. SIAM Journal on Optimization, 21(3):1141-1167, 2011.
[13] N. Shor. Minimization Methods for Non-Differentiable Functions. Springer Verlag, Berlin, 1985.