Accelerating the cubic regularization of Newton's method on convex problems


Accelerating the cubic regularization of Newton's method on convex problems

Yu. Nesterov

September 2005

Abstract

In this paper we propose an accelerated version of the cubic regularization of Newton's method [6]. The original version, used for minimizing a convex function with Lipschitz-continuous Hessian, guarantees a global rate of convergence of order $O(1/k^2)$, where $k$ is the iteration counter. Our modified version converges for the same problem class with order $O(1/k^3)$, keeping the complexity of each iteration unchanged. We study the complexity of both schemes on different classes of convex problems. In particular, we argue that for the second-order schemes, the class of non-degenerate problems is different from the standard class.

Keywords: convex optimization, unconstrained minimization, Newton's method, cubic regularization, worst-case complexity, global complexity bounds, non-degenerate problems, condition number.

Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium; nesterov@core.ucl.ac.be. The research results presented in this paper have been supported by a grant "Action de recherche concertée ARC 04/09-315" from the Direction de la recherche scientifique - Communauté française de Belgique. The scientific responsibility rests with the author.

1 Introduction

Motivation. Newton's method is one of the oldest schemes in Numerical Analysis [1]. The first theoretical study of its local performance was carried out in [4]. During many years, numerous useful suggestions were developed for stabilizing its local behavior (a detailed description of the results and references can be found in the comprehensive monographs [2, 3]). However, the global worst-case complexity analysis for the second-order schemes was only given recently. In [6] a cubic regularization of Newton's method (CNM), which can be applied to a general smooth unconstrained minimization problem, was proposed. The main advantage of this scheme consists in its natural geometrical interpretation: at each iteration we minimize a cubic model, which appears to be a global upper estimate for the objective function. This key feature ensures predictable global behavior of the process.

In [6], attention was mainly paid to different classes of nonconvex problems. For convex problems, some of the statements of [6] can be strengthened. Moreover, we can employ convex optimization techniques for accelerating the local scheme. In this paper we assume that the objective function of a convex unconstrained minimization problem has a Lipschitz-continuous Hessian (that is the basic problem class under consideration). As was shown in [6], the global rate of convergence of CNM on this problem class is of the order $O(1/k^2)$, where $k$ is the iteration counter. Note, however, that CNM is a local one-step second-order method. From the complexity theory of smooth convex optimization (see, for example, Chapter 2 in [5]), it is known that the rate of convergence of the local one-step first-order method (that is, just a gradient method) can be improved from $O(1/k)$ to $O(1/k^2)$ by passing to a multi-step strategy. In this paper we show that a similar trick works with CNM also. As a result, we get a new method which converges on the specified class of problems as $O(1/k^3)$.

Contents.
Section 2 contains all necessary results related to regular functions. Namely, we present the main properties of uniformly convex functions and of functions with Lipschitz-continuous derivatives of a certain degree. In Section 3 we introduce a cubic regularization of the Newton step and prove several useful inequalities. At the end of this section we show that CNM, as applied to a convex function from the basic problem class, converges globally as $O(1/k^2)$. The majority of the results of this section can be found in [6], but in a slightly weaker form. In Section 4 we derive an accelerated multi-step version of CNM. We prove that on the basic problem class it converges as $O(1/k^3)$. This acceleration is achieved by a modification of the technique of estimate sequences described in Section 2.2.1 of [5]. In Section 5 we introduce a new class of problems, which can be seen as the non-degenerate problems for the second-order methods. This class is composed of the functions from the basic problem class which are uniformly convex of degree three. On these problems both CNM and its accelerated version exhibit a global linear rate of convergence, which is proportional to the product of a certain power of the second-order condition number and the logarithm of the required accuracy. We show that the accelerated scheme with an appropriately chosen restarting strategy works better than the pure CNM. The results of this section suggest that the notion of non-degeneracy is method-dependent. The standard classification of non-degenerate problems, based on the usual condition number, works

properly only for the first-order schemes. In Section 6 we analyze the performance of the proposed schemes on strongly convex functions from the basic problem class. We show that the main computational effort is spent at the first stage of the process, aiming to enter the region of quadratic convergence. In some sense, the complexity of such problems is almost independent of the required accuracy. This situation can lead to erroneous conclusions on the efficiency of some numerical schemes. In Section 7 we give an example of a method which, when applied to strongly convex functions, formally converges as $O(1/k^8)$. However, its actual performance appears to be worse than that of accelerated CNM. Finally, in the last Section 8 we discuss the results.

Notation. In what follows, $E$ denotes a finite-dimensional real vector space, and $E^*$ the dual space, which is formed by all linear functions on $E$. The value of a function $s \in E^*$ at $x \in E$ is denoted by $\langle s, x \rangle$. Let us fix a positive definite self-adjoint operator $B: E \to E^*$. Define the following norms:

$\|h\| = \langle Bh, h \rangle^{1/2}$, $h \in E$,
$\|s\|_* = \langle s, B^{-1} s \rangle^{1/2}$, $s \in E^*$,
$\|A\| = \max_{\|h\| \le 1} \|Ah\|_*$, $A: E \to E^*$.

For a self-adjoint operator $A = A^*$, the same norm can be defined as

$\|A\| = \max_{\|h\| \le 1} |\langle Ah, h \rangle|$.   (1.1)

Any $s \in E^*$ generates a rank-one self-adjoint operator $ss^*: E \to E^*$ acting as follows: $(ss^*) x = \langle s, x \rangle s$, $x \in E$. We extend the operator $A(s) = \frac{ss^*}{\|s\|_*}$ onto the origin in a continuous way: $A(0) = 0$.

Further, for a function $f(x)$, $x \in E$, we denote by $\nabla f(x)$ its gradient at $x$:

$f(x + h) = f(x) + \langle \nabla f(x), h \rangle + o(\|h\|)$, $h \in E$.

Clearly $\nabla f(x) \in E^*$. Similarly, we denote by $\nabla^2 f(x)$ the Hessian of $f$ at $x$:

$\nabla f(x + h) = \nabla f(x) + \nabla^2 f(x) h + o(\|h\|)$, $h \in E$.

Of course, $\nabla^2 f(x)$ is a self-adjoint linear operator from $E$ to $E^*$.

Acknowledgements. The author would like to thank Laurence Wolsey for very useful comments on the content of the paper.

2 Regular functions

In this section, for the sake of completeness, we include all necessary results on convex functions possessing a certain type of regularity.
We start from well-known properties of uniformly convex functions (see, for example, [8]).

Let a function $f(x)$ be differentiable on $E$. We call it uniformly convex on $E$ of degree $p \ge 2$ if there exists a constant $\sigma_p = \sigma_p(f) > 0$ such that

$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma_p}{p} \|y - x\|^p$, $x, y \in E$.   (2.1)

The pair $(p, \sigma_p)$ is called the pair of parameters of the uniformly convex function. Adding such a function to an arbitrary convex function gives a uniformly convex function with the same pair of parameters. Recall that the degree $p = 2$ corresponds to strongly convex functions.

Lemma 1. Assume that for some $p \ge 2$, $\sigma > 0$, and all $x, y \in E$ the following inequality holds:

$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \sigma \|x - y\|^p$.   (2.2)

Then the function $f$ is uniformly convex on $E$ with parameters $p$ and $\sigma$.

Proof: Indeed,

$f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \int_0^1 \langle \nabla f(x + \tau(y - x)) - \nabla f(x), y - x \rangle \, d\tau$
$= \int_0^1 \frac{1}{\tau} \langle \nabla f(x + \tau(y - x)) - \nabla f(x), \tau(y - x) \rangle \, d\tau \stackrel{(2.2)}{\ge} \int_0^1 \sigma \tau^{p-1} \|y - x\|^p \, d\tau = \frac{\sigma}{p} \|y - x\|^p$.

In our analysis we often use the following inequality.

Lemma 2. For any $h \in E$ and $s \in E^*$ we have

$\langle s, h \rangle + \frac{\sigma}{p} \|h\|^p \ge -\frac{p-1}{p} \left( \frac{1}{\sigma} \right)^{\frac{1}{p-1}} \|s\|_*^{\frac{p}{p-1}}$.   (2.3)

Proof: Denote by $h^*$ the minimizer of the left-hand side of (2.3). It satisfies the first-order optimality condition $s + \sigma \|h^*\|^{p-2} B h^* = 0$. Hence, $\langle s, h^* \rangle = -\sigma \|h^*\|^p$ and $\|s\|_* = \sigma \|h^*\|^{p-1}$. Therefore

$\langle s, h^* \rangle + \frac{\sigma}{p} \|h^*\|^p = -\frac{p-1}{p} \sigma \|h^*\|^p = -\frac{p-1}{p} \sigma \left( \frac{\|s\|_*}{\sigma} \right)^{\frac{p}{p-1}}$,

and (2.3) follows.
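As a quick numerical sanity check of the last lemma (with $B = I$, so the primal and dual norms coincide, and with illustrative random data of our own choosing), note that the minimizer admits the closed form $h^* = -(\|s\|/\sigma)^{1/(p-1)} \, s/\|s\|$:

```python
import numpy as np

# Check of Lemma 2 with B = I:
#   <s, h> + (sigma/p)*||h||^p >= -((p-1)/p) * (1/sigma)^(1/(p-1)) * ||s||^(p/(p-1)),
# with equality at the minimizer h* = -(||s||/sigma)^(1/(p-1)) * s/||s||,
# which solves the first-order condition s + sigma*||h||^(p-2)*h = 0.

rng = np.random.default_rng(0)
p, sigma = 3.0, 0.5
s = rng.standard_normal(5)

def left_hand_side(h):
    return s @ h + (sigma / p) * np.linalg.norm(h) ** p

bound = -((p - 1) / p) * (1.0 / sigma) ** (1.0 / (p - 1)) \
        * np.linalg.norm(s) ** (p / (p - 1))

norm_s = np.linalg.norm(s)
h_star = -((norm_s / sigma) ** (1.0 / (p - 1))) * s / norm_s

assert abs(left_hand_side(h_star) - bound) < 1e-10   # equality at h*
for _ in range(1000):                                # lower bound everywhere
    h = 10.0 * rng.standard_normal(5)
    assert left_hand_side(h) >= bound - 1e-10
print("Lemma 2 check passed")
```

The closed-form minimizer is exactly the computation used in the proof above, written out in coordinates.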

Lemma 3. Let $f$ be uniformly convex on $E$ of degree $p$. Then

$f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \frac{p-1}{p} \left( \frac{1}{\sigma_p} \right)^{\frac{1}{p-1}} \|\nabla f(y) - \nabla f(x)\|_*^{\frac{p}{p-1}}$.   (2.4)

Proof: Assume that $f$ attains its global minimum at some point $x^*$. Then

$f(x^*) = \min_y f(y) \stackrel{(2.1)}{\ge} \min_{y \in E} \left[ f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma_p}{p} \|y - x\|^p \right] \stackrel{(2.3)}{\ge} f(x) - \frac{p-1}{p} \left( \frac{1}{\sigma_p} \right)^{\frac{1}{p-1}} \|\nabla f(x)\|_*^{\frac{p}{p-1}}$.

Let us now fix $x$ and consider the convex function $\varphi(y) = f(y) - \langle \nabla f(x), y \rangle$. Note that it is uniformly convex with parameters $p$ and $\sigma_p$. Moreover, it attains its minimum at $y = x$. Hence, applying the above inequality to $\varphi(y)$, we get (2.4).

Note that for $p = 2$ conditions (2.2) and (2.4) are necessary and sufficient for a function $f$ to be strongly convex with parameter $\sigma = \sigma_2$. Twice differentiable strongly convex functions admit also the following convenient characterization:

$\langle \nabla^2 f(x) h, h \rangle \ge \sigma \|h\|^2$, $x, h \in E$.   (2.5)

Finally, let us give an important example of a uniformly convex function. Let us fix an arbitrary $x_0 \in E$. Define the function $d_p(x) = \frac{1}{p} \|x - x_0\|^p$. Then

$\nabla d_p(x) = \|x - x_0\|^{p-2} B (x - x_0)$, $x \in E$.

Lemma 4. For any $x$ and $y$ from $E$ we have

$\langle \nabla d_p(x) - \nabla d_p(y), x - y \rangle \ge \left( \frac{1}{2} \right)^{p-2} \|x - y\|^p$,   (2.6)
$d_p(x) - d_p(y) - \langle \nabla d_p(y), x - y \rangle \ge \frac{1}{p} \left( \frac{1}{2} \right)^{p-2} \|x - y\|^p$.   (2.7)

Proof: Without loss of generality, assume $x_0 = 0$. Then

$\langle \nabla d_p(x) - \nabla d_p(y), x - y \rangle = \langle \|x\|^{p-2} Bx - \|y\|^{p-2} By, x - y \rangle = \|x\|^p + \|y\|^p - \langle Bx, y \rangle \left( \|x\|^{p-2} + \|y\|^{p-2} \right)$.

For proving (2.6), we need to show that the right-hand side of the latter equality is greater than or equal to

$\left( \frac{1}{2} \right)^{p-2} \|x - y\|^p = \left( \frac{1}{2} \right)^{p-2} \left[ \|x\|^2 + \|y\|^2 - 2 \langle Bx, y \rangle \right]^{p/2}$.

Without loss of generality we can assume that $x \ne 0$ and $y \ne 0$. Then, denoting

$\tau = \frac{\|y\|}{\|x\|}$, $\alpha = \frac{\langle Bx, y \rangle}{\|x\| \cdot \|y\|} \in [-1, 1]$,
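Before the proof of Lemma 4 is completed below, the first inequality of the lemma can also be probed numerically. This is a sketch with $B = I$, $x_0 = 0$, and sample points of our own choosing, using the explicit gradient $\nabla d_p(z) = \|z\|^{p-2} z$:

```python
import numpy as np

# Numerical check of the first inequality of Lemma 4 (B = I, x0 = 0):
#   <grad d_p(x) - grad d_p(y), x - y>  >=  (1/2)^(p-2) * ||x - y||^p,
# where grad d_p(z) = ||z||^(p-2) * z. Equality holds at x = -y.

rng = np.random.default_rng(0)

def grad_dp(z, p):
    return np.linalg.norm(z) ** (p - 2) * z

worst = 0.0   # largest observed violation (should stay <= 0 up to rounding)
for p in (3, 4):
    for _ in range(2000):
        x = rng.standard_normal(4)
        y = rng.standard_normal(4)
        lhs = (grad_dp(x, p) - grad_dp(y, p)) @ (x - y)
        rhs = 0.5 ** (p - 2) * np.linalg.norm(x - y) ** p
        worst = max(worst, rhs - lhs)
assert worst <= 1e-9
print("inequality of Lemma 4 holds on all sampled pairs")
```

The antipodal pair $x = -y$ attains the inequality with equality, which is why the constant $(1/2)^{p-2}$ cannot be improved.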

we get the statement required to be proved:

$1 + \tau^p - \alpha \tau (1 + \tau^{p-2}) \ge \left( \frac{1}{2} \right)^{p-2} \left[ 1 + \tau^2 - 2 \alpha \tau \right]^{p/2}$, $\tau \ge 0$, $\alpha \in [-1, 1]$.   (2.8)

Since the right-hand side of this inequality is convex in $\alpha$, we need to justify only two marginal inequalities:

$\alpha = 1$:  $1 + \tau^p - \tau (1 + \tau^{p-2}) \ge \left( \frac{1}{2} \right)^{p-2} |1 - \tau|^p$,
$\alpha = -1$:  $1 + \tau^p + \tau (1 + \tau^{p-2}) \ge \left( \frac{1}{2} \right)^{p-2} (1 + \tau)^p$   (2.9)

for all $\tau \ge 0$. The second inequality in (2.9) can be derived from the lower bound for the ratio

$\frac{1 + \tau^p + \tau (1 + \tau^{p-2})}{(1 + \tau)^p} = \frac{1 + \tau^{p-1}}{(1 + \tau)^{p-1}}$, $\tau \ge 0$.

Indeed, its minimum is attained at $\tau = 1$, and that proves the second line in (2.9). For proving the first line, note that it is valid for $\tau = 1$. If $\tau \ge 0$ and $\tau \ne 1$, then we need to estimate from below the ratio

$\frac{1 + \tau^p - \tau (1 + \tau^{p-2})}{|1 - \tau|^p} = \frac{(1 - \tau)(1 - \tau^{p-1})}{|1 - \tau|^p} = \frac{1 + \tau + \dots + \tau^{p-2}}{|1 - \tau|^{p-2}}$.

Since the absolute value of any coefficient of the polynomial $(1 - \tau)^{p-2}$ does not exceed $2^{p-2}$, the first line in (2.9) is also justified. This proves (2.6), and, for proving (2.7), we can now use Lemma 1.

Another type of regularity that we are interested in concerns smoothness conditions (see, for example, [7]). Usually they are stated in terms of Lipschitz conditions for derivatives of a certain order:

$\|\nabla^k f(x) - \nabla^k f(y)\| \le L_{k+1}(f) \|x - y\|$, $x, y \in E$, $k \ge 0$.

In this paper we mainly consider functions with Lipschitz-continuous Hessian:

$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L \|x - y\|$, $x, y \in E$,   (2.10)

where $L \stackrel{\mathrm{def}}{=} L_3(f)$. Consequently, for all $x$ and $y$ from $E$ we have

$\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\|_* \le \frac{1}{2} L \|y - x\|^2$.   (2.11)

Moreover, for the quadratic model

$f_2(x; y) \stackrel{\mathrm{def}}{=} f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} \langle \nabla^2 f(x)(y - x), y - x \rangle$

we can bound the residual:

$|f(y) - f_2(x; y)| \le \frac{L}{6} \|y - x\|^3$, $x, y \in E$.   (2.12)

The simplest functions satisfying condition (2.10) are, of course, the quadratic functions. However, sometimes another function, $d_3(x)$, is quite useful.
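The cubic bound on the residual of the quadratic model is easy to visualize on a concrete function. The sketch below uses a separable cubic of our own choosing, $f(x) = \frac{1}{6} \sum_i |x_i|^3$, whose Hessian $\mathrm{diag}(|x_i|)$ is Lipschitz continuous with constant $1$ in the Euclidean norm:

```python
import numpy as np

# f(x) = sum_i |x_i|^3 / 6 has gradient components x_i*|x_i|/2 and Hessian
# diag(|x_i|); the Hessian is Lipschitz with constant L = 1 (Euclidean norm).
# Check the residual bound of the quadratic model:
#   |f(y) - f2(x; y)| <= (L/6) * ||y - x||^3.

rng = np.random.default_rng(0)
L = 1.0

f = lambda x: np.sum(np.abs(x) ** 3) / 6.0
grad = lambda x: 0.5 * x * np.abs(x)
hess = lambda x: np.diag(np.abs(x))

max_err = 0.0   # largest violation of the bound (should be <= 0 up to rounding)
for _ in range(2000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    d = y - x
    model = f(x) + grad(x) @ d + 0.5 * d @ hess(x) @ d
    max_err = max(max_err,
                  abs(f(y) - model) - (L / 6.0) * np.linalg.norm(d) ** 3)
assert max_err <= 1e-9
print("cubic residual bound verified")
```

The same test function is reused in later sketches, precisely because its Hessian-Lipschitz constant is known exactly.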

Lemma 5. For any $x, y \in E$ we have

$\|\nabla^2 d_3(x) - \nabla^2 d_3(y)\| \le 2 \|x - y\|$.   (2.13)

Proof: Without loss of generality, assume $x_0 = 0$. Then $d_3(x) = \frac{1}{3} \|x\|^3$, and for any $x \ne 0$ we have

$\nabla^2 d_3(x) = \|x\| B + \frac{1}{\|x\|} B x x^* B$.

Let us fix two points $x, y \in E$ and an arbitrary direction $h \in E$. Define $x(\tau) = x + \tau (y - x)$ and

$\varphi(\tau) = \langle \nabla^2 d_3(x(\tau)) h, h \rangle = \|x(\tau)\| \cdot \|h\|^2 + \frac{1}{\|x(\tau)\|} \langle B x(\tau), h \rangle^2$, $\tau \in [0, 1]$.

Assume first that $0 \notin [x, y]$. Then $\varphi(\tau)$ is continuously differentiable on $[0, 1]$ and

$\varphi'(\tau) = \frac{\langle B x(\tau), y - x \rangle}{\|x(\tau)\|} \left( \|h\|^2 - \frac{\langle B x(\tau), h \rangle^2}{\|x(\tau)\|^2} \right) + \frac{2 \langle B x(\tau), h \rangle}{\|x(\tau)\|} \langle B h, y - x \rangle$,

where the expression in the brackets is nonnegative. Denote $\alpha = \frac{\langle B x(\tau), h \rangle}{\|x(\tau)\| \cdot \|h\|} \in [-1, 1]$. Then

$|\varphi'(\tau)| \le \|y - x\| \cdot \|h\|^2 \left( 1 - \alpha^2 + 2 |\alpha| \right) \le 2 \|y - x\| \cdot \|h\|^2$.

Hence, $|\langle (\nabla^2 d_3(y) - \nabla^2 d_3(x)) h, h \rangle| = |\varphi(1) - \varphi(0)| \le 2 \|y - x\| \cdot \|h\|^2$, and we get (2.13) from (1.1). The remaining case $0 \in [x, y]$ is trivial since $\nabla^2 d_3(0) = 0$.

In the sequel we often establish the complexity of different problem classes in terms of condition numbers of a certain degree:

$\gamma_p(f) \stackrel{\mathrm{def}}{=} \frac{\sigma_p(f)}{L_p(f)}$, $p \ge 2$.   (2.14)

It is clear, for example, that $\gamma_2(d_2) = 1$. On the other hand, from (2.7) and (2.13) we conclude that $\gamma_3(d_3) = \frac{1}{4}$.

3 Cubic regularization of the Newton iteration

In this section we present the most important properties of the cubic regularization of Newton's method. As compared with [6], here we take into account the convexity of the objective function. The main problem of interest is as follows:

$\min_{x \in E} f(x)$,   (3.1)
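Lemma 5 can be checked directly, since with $B = I$ and $x_0 = 0$ the Hessian of $d_3$ has the explicit form $\|z\| I + z z^{\mathsf T} / \|z\|$. A small numerical sketch (sample points are ours):

```python
import numpy as np

# Check of Lemma 5 with B = I, x0 = 0: the Hessian of d3(z) = ||z||^3/3 is
#   H(z) = ||z|| I + z z^T / ||z||,
# and it is Lipschitz continuous with constant 2 in the spectral norm.

rng = np.random.default_rng(0)

def hess_d3(z):
    nz = np.linalg.norm(z)
    return nz * np.eye(len(z)) + np.outer(z, z) / nz

worst = 0.0
for _ in range(2000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    gap = np.linalg.norm(hess_d3(x) - hess_d3(y), 2)   # spectral norm
    worst = max(worst, gap - 2.0 * np.linalg.norm(x - y))
assert worst <= 1e-9
print("Hessian-Lipschitz bound for d3 verified")
```

The constant $2$ is tight: it is approached for antipodal points $x = -y$.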

where $f$ is a twice differentiable convex function with Lipschitz-continuous Hessian. As suggested in [6], we introduce the following mapping:

$T_M(x) \stackrel{\mathrm{def}}{=} \arg\min_{y \in E} \left[ \hat{f}_M(x; y) \stackrel{\mathrm{def}}{=} f_2(x; y) + \frac{M}{6} \|y - x\|^3 \right]$.   (3.2)

Note that $T = T_M(x)$ is the unique solution of the following system:

$\nabla f(x) + \nabla^2 f(x)(T - x) + \frac{M}{2} \|T - x\| \cdot B (T - x) = 0$.   (3.3)

Denote $r_M(x) = \|x - T_M(x)\|$. Then, by (3.3) and (2.11),

$\|\nabla f(T)\|_* = \left\| \nabla f(T) - \nabla f(x) - \nabla^2 f(x)(T - x) - \frac{M}{2} r_M(x) B(T - x) \right\|_* \le \frac{L + M}{2} \, r_M^2(x)$.   (3.4)

Further, multiplying (3.3) by $T - x$, we obtain

$\langle \nabla f(x), x - T \rangle = \langle \nabla^2 f(x)(T - x), T - x \rangle + \frac{M}{2} r_M^3(x)$.   (3.5)

Let us assume $M \ge L$. Then, in view of (2.12), we have

$f(x) - f(T) \ge f(x) - \hat{f}_M(x; T) = \langle \nabla f(x), x - T \rangle - \frac{1}{2} \langle \nabla^2 f(x)(T - x), T - x \rangle - \frac{M}{6} r_M^3(x) \stackrel{(3.5)}{=} \frac{1}{2} \langle \nabla^2 f(x)(T - x), T - x \rangle + \frac{M}{3} r_M^3(x)$.   (3.6)

In particular, since $f$ is convex,

$f(x) - f(T) \stackrel{(3.6)}{\ge} \frac{M}{3} r_M^3(x) \stackrel{(3.4)}{\ge} \frac{M}{3} \left[ \frac{2}{L + M} \|\nabla f(T)\|_* \right]^{3/2}$.   (3.7)

Sometimes we need to interpret this step from a global perspective:

$f(T_M(x)) \stackrel{(M \ge L)}{\le} \min_y \left[ f_2(x; y) + \frac{M}{6} \|y - x\|^3 \right] \stackrel{(2.12)}{\le} \min_y \left[ f(y) + \frac{L + M}{6} \|y - x\|^3 \right]$.   (3.8)

Finally, let us prove the following result.

Lemma 6. If $M \ge 2L$, then

$\langle \nabla f(T), x - T \rangle \ge \left( \frac{2}{L + M} \right)^{1/2} \|\nabla f(T)\|_*^{3/2}$.   (3.9)

Proof: Denote $T = T_M(x)$ and $r = r_M(x)$. Then

$\frac{L^2}{4} r^4 = \left( \frac{L}{2} \|T - x\|^2 \right)^2 \stackrel{(2.11)}{\ge} \|\nabla f(T) - \nabla f(x) - \nabla^2 f(x)(T - x)\|_*^2 \stackrel{(3.3)}{=} \left\| \nabla f(T) + \frac{M}{2} r B(T - x) \right\|_*^2$
$= \|\nabla f(T)\|_*^2 + M r \langle \nabla f(T), T - x \rangle + \frac{M^2}{4} r^4$.
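For $B = I$, the regularized step is computable by elementary means: $T$ satisfies $T = x - \left( \nabla^2 f(x) + \frac{M r}{2} I \right)^{-1} \nabla f(x)$ with $r = \|T - x\|$, and for convex $f$ the map $r \mapsto \|(\nabla^2 f(x) + \frac{M r}{2} I)^{-1} \nabla f(x)\|$ is nonincreasing, so the scalar $r$ can be found by bisection. A minimal sketch (helper names and the test function are ours, not the paper's):

```python
import numpy as np

def cubic_newton_step(grad, hess, x, M):
    """Solve the stationarity system of the cubic model with B = I:
    g + H d + (M/2)*||d||*d = 0, returning T = x + d."""
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) == 0.0:
        return x.copy()
    n = len(x)
    step = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(n), g)
    # phi(r) = ||step(r)|| - r decreases and changes sign on [0, ||step(0)||]
    lo, hi = 0.0, np.linalg.norm(step(0.0))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return x - step(0.5 * (lo + hi))

# Example: f(x) = sum |x_i|^3/6 + 0.5*||x - a||^2 (Hessian-Lipschitz, L = 1).
a = np.array([1.0, -2.0, 0.5])
grad = lambda x: 0.5 * x * np.abs(x) + (x - a)
hess = lambda x: np.diag(np.abs(x)) + np.eye(3)

x, M = np.zeros(3), 1.0
T = cubic_newton_step(grad, hess, x, M)
d = T - x
# residual of the first-order stationarity condition
res = grad(x) + hess(x) @ d + 0.5 * M * np.linalg.norm(d) * d
assert np.linalg.norm(res) < 1e-8
print("step computed, ||T - x|| =", np.linalg.norm(d))
```

This one-dimensional root-finding view is only a convenient implementation for the positive-definite case; it is not the only way to solve the subproblem.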

Hence,

$\langle \nabla f(T), x - T \rangle \ge \frac{1}{M r} \|\nabla f(T)\|_*^2 + \frac{1}{4M} (M^2 - L^2) r^3$.   (3.10)

In view of the conditions of the lemma, we can estimate the derivative in $r$ of the right-hand side of the latter inequality on the feasible ray (3.4):

$-\frac{1}{M r^2} \|\nabla f(T)\|_*^2 + \frac{3 (M^2 - L^2)}{4M} r^2 \stackrel{(3.4)}{\ge} -\frac{L + M}{2M} \|\nabla f(T)\|_* + \frac{3 (M - L)}{2M} \|\nabla f(T)\|_* = \frac{2M - 4L}{2M} \|\nabla f(T)\|_* \ge 0$.

Thus, the right-hand side of (3.10) is nondecreasing in $r$, and its minimum is attained at the boundary point $r = \left[ \frac{2}{L + M} \|\nabla f(T)\|_* \right]^{1/2}$ of the feasible ray (3.4). Substituting this value in (3.10), we obtain (3.9).

To conclude this section, let us estimate the rate of convergence of CNM as applied to our basic problem (3.1). We assume that a solution $x^*$ of this problem exists, and that the Lipschitz constant $L$ for the Hessian of the objective function is known. Thus, we just iterate

$x_{k+1} = T_L(x_k)$, $k = 0, 1, \dots$.   (3.11)

Using the same arguments as in [6], we can prove the following statement.

Theorem 1. Assume that the level sets of the problem (3.1) are bounded:

$\|x - x^*\| \le D$ for all $x$ with $f(x) \le f(x_0)$.   (3.12)

If the sequence $\{x_k\}_{k \ge 1}$ is generated by (3.11), then

$f(x_k) - f(x^*) \le \frac{9 L D^3}{(k + 4)^2}$, $k \ge 1$.   (3.13)

Proof: In view of (3.6), $f(x_{k+1}) \le f(x_k)$ for all $k \ge 0$. Thus, $\|x_k - x^*\| \le D$ for all $k \ge 0$. Further, in view of (3.8), we have

$f(x_1) \le f(x^*) + \frac{L}{3} D^3$.   (3.14)

Consider now an arbitrary $k \ge 1$. Denote $x_k(\tau) = x^* + (1 - \tau)(x_k - x^*)$. In view of (3.8), for any $\tau \in [0, 1]$ we have

$f(x_{k+1}) \le f(x_k(\tau)) + \frac{L}{3} \tau^3 \|x_k - x^*\|^3 \le f(x_k) - \tau (f(x_k) - f(x^*)) + \frac{L}{3} \tau^3 D^3$.

The minimum of the right-hand side is attained for $\tau^* = \left[ \frac{f(x_k) - f(x^*)}{L D^3} \right]^{1/2} \le \left[ \frac{f(x_1) - f(x^*)}{L D^3} \right]^{1/2} \stackrel{(3.14)}{<} 1$. Thus, for any $k \ge 1$ we have

$f(x_{k+1}) \le f(x_k) - \frac{2 \, (f(x_k) - f(x^*))^{3/2}}{3 \, (L D^3)^{1/2}}$.   (3.15)

Denote $\delta_k = f(x_k) - f(x^*)$. Then

$\frac{1}{\delta_{k+1}^{1/2}} - \frac{1}{\delta_k^{1/2}} = \frac{\delta_k^{1/2} - \delta_{k+1}^{1/2}}{\delta_k^{1/2} \delta_{k+1}^{1/2}} = \frac{\delta_k - \delta_{k+1}}{\delta_k^{1/2} \delta_{k+1}^{1/2} \left( \delta_k^{1/2} + \delta_{k+1}^{1/2} \right)} \stackrel{(3.15)}{\ge} \frac{2 \, \delta_k^{3/2}}{3 (L D^3)^{1/2} \, \delta_k^{1/2} \delta_{k+1}^{1/2} \left( \delta_k^{1/2} + \delta_{k+1}^{1/2} \right)} \ge \frac{1}{3 (L D^3)^{1/2}}$.
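The iteration just introduced is easy to simulate. A compact sketch of the plain cubic-regularized Newton process on our strongly convex test function (with $B = I$ and $L = 1$ known exactly) shows the predicted monotone decrease of $f$:

```python
import numpy as np

def cubic_newton_step(grad, hess, x, M):
    # solve the cubic-model stationarity system (B = I) by bisection on ||d||
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) == 0.0:
        return x.copy()
    n = len(x)
    step = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(n), g)
    lo, hi = 0.0, np.linalg.norm(step(0.0))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return x - step(0.5 * (lo + hi))

# test problem: f(x) = sum |x_i|^3/6 + 0.5*||x - a||^2, Hessian-Lipschitz L = 1
a = np.array([1.0, -2.0, 0.5])
f = lambda x: np.sum(np.abs(x) ** 3) / 6.0 + 0.5 * np.sum((x - a) ** 2)
grad = lambda x: 0.5 * x * np.abs(x) + (x - a)
hess = lambda x: np.diag(np.abs(x)) + np.eye(3)

L = 1.0
x = np.zeros(3)
values = [f(x)]
for k in range(30):                 # the basic process: x_{k+1} = T_L(x_k)
    x = cubic_newton_step(grad, hess, x, L)
    values.append(f(x))

assert all(v2 <= v1 + 1e-12 for v1, v2 in zip(values, values[1:]))  # monotone
assert np.linalg.norm(grad(x)) < 1e-6                               # stationary
print("f decreased from", values[0], "to", values[-1])
```

On this strongly convex example the process also illustrates the local quadratic convergence discussed later: the gradient norm collapses after a handful of iterations.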

Thus, for any $k \ge 1$ we have

$\frac{1}{\delta_k^{1/2}} \ge \frac{1}{\delta_1^{1/2}} + \frac{k - 1}{3 (L D^3)^{1/2}} \stackrel{(3.14)}{\ge} \frac{1}{(L D^3)^{1/2}} \left( 3^{1/2} + \frac{k - 1}{3} \right) \ge \frac{k + 4}{3 (L D^3)^{1/2}}$,

and (3.13) follows.

4 Accelerated scheme

In order to accelerate method (3.11), we apply a variant of the technique of estimate sequences, which was described in Section 2.2.1 of [5] as a tool for accelerating the usual gradient method. In our situation, this idea can be applied to CNM in different ways. We mention only two of them.

1. Linear estimate functions. For solving the optimization problem (3.1), we recursively update the following sequences. A sequence of estimate functions

$\psi_k(x) = l_k(x) + \frac{N}{6} \|x - x_0\|^3$, $k = 1, 2, \dots$,

where the $l_k(x)$ are linear functions in $x \in E$, and $N$ is a positive real parameter. A minimizing sequence $\{x_k\}_{k \ge 1}$. A sequence of scaling parameters $\{A_k\}_{k \ge 1}$:

$A_{k+1} \stackrel{\mathrm{def}}{=} A_k + a_k$, $k \ge 1$.

For these objects, we are going to maintain the following relations:

$\mathcal{R}^1_k$:  $A_k f(x_k) \le \psi_k^* \equiv \min_x \psi_k(x)$,
$\mathcal{R}^2_k$:  $\psi_k(x) \le A_k f(x) + \frac{2L + N}{6} \|x - x_0\|^3$, $x \in E$,   for $k \ge 1$.   (4.1)

Let us ensure that relations (4.1) hold for $k = 1$. We choose

$x_1 = T_L(x_0)$, $l_1(x) \equiv f(x_1)$, $x \in E$, $A_1 = 1$.   (4.2)

Then $\psi_1^* = f(x_1)$, so $\mathcal{R}^1_1$ holds. On the other hand, in view of (3.8), we get

$\psi_1(x) = f(x_1) + \frac{N}{6} \|x - x_0\|^3 \stackrel{(3.8)}{\le} \left[ \min_y f(y) + \frac{2L}{6} \|y - x_0\|^3 \right] + \frac{N}{6} \|x - x_0\|^3$,

and $\mathcal{R}^2_1$ follows. Assume now that relations (4.1) hold for some $k \ge 1$. Denote

$v_k = \arg\min_x \psi_k(x)$.

Let us choose some $a_k > 0$ and $M \ge L$. Define

$\alpha_k = \frac{a_k}{A_k + a_k}$,  $y_k = (1 - \alpha_k) x_k + \alpha_k v_k$,  $x_{k+1} = T_M(y_k)$,
$\psi_{k+1}(x) = \psi_k(x) + a_k \left[ f(x_{k+1}) + \langle \nabla f(x_{k+1}), x - x_{k+1} \rangle \right]$.   (4.3)

In view of $\mathcal{R}^2_k$ and the convexity of $f$, for any $x \in E$ we have

$\psi_{k+1}(x) \le A_k f(x) + \frac{2L + N}{6} \|x - x_0\|^3 + a_k \left[ f(x_{k+1}) + \langle \nabla f(x_{k+1}), x - x_{k+1} \rangle \right] \le (A_k + a_k) f(x) + \frac{2L + N}{6} \|x - x_0\|^3$,

and this is $\mathcal{R}^2_{k+1}$. Let us show now that, for an appropriate choice of $a_k$, $N$ and $M$, relation $\mathcal{R}^1_{k+1}$ is also valid. Indeed, in view of $\mathcal{R}^1_k$ and by Lemma 4 with $p = 3$, for any $x \in E$ we have

$\psi_k(x) = l_k(x) + \frac{N}{2} d_3(x) \ge \psi_k^* + \frac{N}{12} \|x - v_k\|^3 \ge A_k f(x_k) + \frac{N}{12} \|x - v_k\|^3$.   (4.4)

Therefore

$\psi_{k+1}^* = \min_x \left\{ \psi_k(x) + a_k \left[ f(x_{k+1}) + \langle \nabla f(x_{k+1}), x - x_{k+1} \rangle \right] \right\}$
$\stackrel{(4.4)}{\ge} \min_x \left\{ A_k f(x_k) + \frac{N}{12} \|x - v_k\|^3 + a_k \left[ f(x_{k+1}) + \langle \nabla f(x_{k+1}), x - x_{k+1} \rangle \right] \right\}$
$\ge \min_x \left\{ (A_k + a_k) f(x_{k+1}) + A_k \langle \nabla f(x_{k+1}), x_k - x_{k+1} \rangle + a_k \langle \nabla f(x_{k+1}), x - x_{k+1} \rangle + \frac{N}{12} \|x - v_k\|^3 \right\}$
$\stackrel{(4.3)}{=} \min_x \left\{ A_{k+1} f(x_{k+1}) + \langle \nabla f(x_{k+1}), A_{k+1} y_k - a_k v_k - A_k x_{k+1} - a_k x_{k+1} + a_k v_k \rangle + a_k \langle \nabla f(x_{k+1}), x - v_k \rangle + \frac{N}{12} \|x - v_k\|^3 \right\}$
$= \min_x \left\{ A_{k+1} f(x_{k+1}) + A_{k+1} \langle \nabla f(x_{k+1}), y_k - x_{k+1} \rangle + a_k \langle \nabla f(x_{k+1}), x - v_k \rangle + \frac{N}{12} \|x - v_k\|^3 \right\}$.

Further, if we choose $M \ge 2L$, then by (3.9) we have

$\langle \nabla f(x_{k+1}), y_k - x_{k+1} \rangle \ge \left( \frac{2}{L + M} \right)^{1/2} \|\nabla f(x_{k+1})\|_*^{3/2}$.

Hence, our choice of parameters must ensure the following inequality:

$A_{k+1} \left( \frac{2}{L + M} \right)^{1/2} \|\nabla f(x_{k+1})\|_*^{3/2} + a_k \langle \nabla f(x_{k+1}), x - v_k \rangle + \frac{N}{12} \|x - v_k\|^3 \ge 0$

for all $x \in E$. Using inequality (2.3) with $p = 3$, $s = a_k \nabla f(x_{k+1})$, and $\sigma = \frac{N}{4}$, we come to the following condition:

$A_{k+1} \left( \frac{2}{L + M} \right)^{1/2} \ge \frac{4}{3 N^{1/2}} \, a_k^{3/2}$.   (4.5)

For $k \ge 1$, let us choose

$A_k = \frac{k(k+1)(k+2)}{6}$,  $a_k = A_{k+1} - A_k = \frac{(k+1)(k+2)(k+3)}{6} - \frac{k(k+1)(k+2)}{6} = \frac{(k+1)(k+2)}{2}$.   (4.6)

Since

$\frac{a_k^{3/2}}{A_{k+1}} = \frac{6 \, [(k+1)(k+2)]^{3/2}}{2^{3/2} (k+1)(k+2)(k+3)} = \frac{3 \, [(k+1)(k+2)]^{1/2}}{2^{1/2} (k+3)} \le \frac{3}{2^{1/2}}$,

inequality (4.5) leads to the following condition on the parameters: $N \ge 4(L + M)$. Hence, we can choose

$M = 2L$,  $N = 4(L + M) = 12L$.   (4.7)

In this case $2L + N = 14L$. Now we are ready to put all the pieces together.

Accelerated cubic regularization of Newton's method

Initialization: Choose $x_0 \in E$. Set $M = 2L$ and $N = 12L$. Compute $x_1 = T_L(x_0)$ and define $\psi_1(x) = f(x_1) + \frac{N}{6} \|x - x_0\|^3$.

Iteration $k$ ($k \ge 1$):
1. Compute $v_k = \arg\min_{x \in E} \psi_k(x)$ and choose $y_k = \frac{k}{k+3} x_k + \frac{3}{k+3} v_k$.
2. Compute $x_{k+1} = T_M(y_k)$ and update
$\psi_{k+1}(x) = \psi_k(x) + \frac{(k+1)(k+2)}{2} \left[ f(x_{k+1}) + \langle \nabla f(x_{k+1}), x - x_{k+1} \rangle \right]$.   (4.8)

The above discussion proves the following theorem.
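The multi-step scheme can be sketched in a few lines once the regularized step is available. The sketch below assumes $B = I$ and an exactly known $L$; it keeps only the gradient $s_k$ of the linear part of $\psi_k$, since the minimizer $v_k$ is then available in closed form. The test function and reference solution are our own illustrative choices:

```python
import numpy as np

def cubic_newton_step(grad, hess, x, M):
    # minimize the cubic model f2(x; .) + (M/6)||.-x||^3 with B = I
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) == 0.0:
        return x.copy()
    n = len(x)
    step = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(n), g)
    lo, hi = 0.0, np.linalg.norm(step(0.0))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return x - step(0.5 * (lo + hi))

a = np.array([1.0, -2.0, 0.5])
f = lambda x: np.sum(np.abs(x) ** 3) / 6.0 + 0.5 * np.sum((x - a) ** 2)
grad = lambda x: 0.5 * x * np.abs(x) + (x - a)
hess = lambda x: np.diag(np.abs(x)) + np.eye(3)

L = 1.0
M, N = 2 * L, 12 * L                         # the parameter choice above
x0 = np.zeros(3)

xk = cubic_newton_step(grad, hess, x0, L)    # x_1 = T_L(x_0)
s = np.zeros(3)                              # gradient of the linear part l_k
K = 25
for k in range(1, K):
    ns = np.linalg.norm(s)
    # v_k = x_0 - (2/(N*||s||))^(1/2) * s  (v_1 = x_0, since s = 0)
    v = x0 if ns == 0.0 else x0 - np.sqrt(2.0 / (N * ns)) * s
    y = (k * xk + 3.0 * v) / (k + 3.0)
    xk = cubic_newton_step(grad, hess, y, M)
    s = s + 0.5 * (k + 1) * (k + 2) * grad(xk)

# reference minimizer via a long run of the plain process
x_star = x0.copy()
for _ in range(100):
    x_star = cubic_newton_step(grad, hess, x_star, L)
f_star = f(x_star)

gap = f(xk) - f_star
bound = 14 * L * np.linalg.norm(x0 - x_star) ** 3 / (K * (K + 1) * (K + 2))
assert -1e-9 <= gap <= bound      # the O(1/k^3) guarantee at k = K
print("gap:", gap, " guaranteed bound:", bound)
```

In practice the observed gap is far below the worst-case bound; the assertion only confirms that the guarantee is respected on this instance.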

Theorem 2. If the sequence $\{x_k\}_{k \ge 1}$ is generated by method (4.8) as applied to the problem (3.1), then for any $k \ge 1$ we have

$f(x_k) - f(x^*) \le \frac{14 L \|x_0 - x^*\|^3}{k(k+1)(k+2)}$,   (4.9)

where $x^*$ is an optimal solution to the problem.

Proof: Indeed, we have shown that

$A_k f(x_k) \stackrel{\mathcal{R}^1_k}{\le} \psi_k^* \stackrel{\mathcal{R}^2_k}{\le} A_k f(x^*) + \frac{2L + N}{6} \|x_0 - x^*\|^3$.

Thus, (4.9) follows from (4.6) and (4.7).

Note that the point $v_k$ in (4.8) can be found by an explicit formula. Consider $s_k = \nabla l_k(x)$. This vector does not depend on $x$ since the function $l_k(x)$ is linear. Then

$v_k = x_0 - \left[ \frac{2}{N \|s_k\|_*} \right]^{1/2} B^{-1} s_k$.

2. Quadratic estimate functions. A similar analysis can be applied to a variant of scheme (4.8) based on quadratic estimate functions. Indeed, let us define

$\psi_k(x) = q_k(x) + \frac{N}{6} \|x - x_0\|^3$, $k = 1, 2, \dots$,

where the $q_k(x)$ are some quadratic functions. If we choose

$x_1 = T_L(x_0)$, $q_1(x) = f_2(x_0; x)$, $A_1 = 1$,   (4.10)

then, for any $N \ge L$, we can guarantee that the conditions

$A_k f(x_k) \stackrel{(3.8)}{\le} \psi_k^*$,  $\psi_k(x) \stackrel{(2.12)}{\le} A_k f(x) + \frac{L + N}{6} \|x - x_0\|^3$, $x \in E$,

hold for $k = 1$. Then, by the same arguments as above, we can see that the rules (4.3) maintain these conditions for all $k \ge 1$, provided that $N \ge 4(L + M)$. Thus, the new version of (4.8) has a slightly better constant in the estimate of the rate of convergence:

$f(x_k) - f(x^*) \le \frac{13 L \|x_0 - x^*\|^3}{k(k+1)(k+2)}$.

However, the rules for computing $v_k$ become more complicated.

5 Global non-degeneracy for second-order schemes

Traditionally, in numerical analysis the term non-degenerate is applied to certain classes of efficiently solvable problems. For unconstrained optimization, non-degeneracy of the objective function is usually characterized by a lower bound on the angle between the gradient at point $x$ and the direction pointing to the optimal solution:

$\alpha(x) \stackrel{\mathrm{def}}{=} \frac{\langle \nabla f(x), x - x^* \rangle}{\|\nabla f(x)\|_* \cdot \|x - x^*\|} \ge \mu(f) > 0$, $x \in E$.   (5.1)

This condition has a nice geometric interpretation. Moreover, there exists a large class of smooth convex functions possessing the required property. This is the class of strongly convex functions with Lipschitz-continuous gradient.

Lemma 7. $\mu(f) \ge \frac{2 \gamma_2^{1/2}(f)}{1 + \gamma_2(f)} \ge \gamma_2^{1/2}(f)$.

Indeed, denoting $\sigma = \sigma_2(f)$ and $L_2 = L_2(f)$, in view of the corresponding inequality for this class in Section 2.1 of [5] we have

$\langle \nabla f(x), x - x^* \rangle \ge \frac{1}{\sigma + L_2} \|\nabla f(x)\|_*^2 + \frac{\sigma L_2}{\sigma + L_2} \|x - x^*\|^2 \ge \frac{2 (\sigma L_2)^{1/2}}{\sigma + L_2} \|\nabla f(x)\|_* \cdot \|x - x^*\|$,

and this proves the required inequality.

Note that the complexity of the first-order schemes on the class of smooth strongly convex functions can be completely characterized in terms of the condition number $\gamma_2$. Indeed, on the one hand, the lower complexity bound for finding an $\epsilon$-solution of any problem from this class is proven to be

$O\left( \left[ \frac{1}{\gamma_2(f)} \right]^{1/2} \ln \frac{L_2 D^2}{\epsilon} \right)$   (5.2)

calls of the oracle, where the constant $D$ bounds the distance between the initial point and the optimal solution. On the other hand, there exist simple numerical schemes which exhibit the required rate of convergence (see Chapter 2 in [5] for details).

What can be said about the complexity of the above problem class for the second-order schemes? Surprisingly enough, in this situation it is quite difficult to find any advantages in condition (5.1). We will discuss the complexity of this class in detail later, in Section 6. Now let us present a new non-degeneracy condition, which replaces (5.1) for the second-order methods.

Assume that $\gamma_3(f) = \frac{\sigma_3(f)}{L_3(f)} > 0$. In this case,

$f(x) - f(x^*) \stackrel{(2.4)}{\le} \frac{2}{3} \left( \frac{1}{\sigma_3} \right)^{1/2} \|\nabla f(x)\|_*^{3/2}$.   (5.3)

Therefore, for method (3.11) we have

$f(x_k) - f(x_{k+1}) \stackrel{(3.7)}{\ge} \frac{1}{3 L^{1/2}} \|\nabla f(x_{k+1})\|_*^{3/2} \stackrel{(5.3)}{\ge} \frac{1}{2} \gamma_3^{1/2}(f) \, (f(x_{k+1}) - f(x^*))$.   (5.4)

Hence, for any $k \ge 1$ we have

$f(x_k) - f(x^*) \stackrel{(5.4)}{\le} \left( 1 + \frac{1}{2} \gamma_3^{1/2}(f) \right)^{-(k-1)} (f(x_1) - f(x^*)) \le e^{-\frac{(k-1) \gamma_3^{1/2}(f)}{2 + \gamma_3^{1/2}(f)}} \cdot \frac{L}{3} \|x_0 - x^*\|^3$,   (5.5)

where the last factor follows from (3.8). Thus, the complexity of minimizing a function with positive third-degree condition number $\gamma_3(f)$ by method (3.11) is of the order

$O\left( \left[ \frac{1}{\gamma_3(f)} \right]^{1/2} \ln \frac{L D^3}{\epsilon} \right)$   (5.6)

calls of the oracle. The structure of this estimate is similar to that of (5.2). Therefore, we will say that such functions possess the global second-order non-degeneracy property.

Let us demonstrate that the accelerated variant of Newton's method (4.8) can be used to improve the complexity estimate (5.6). Denote by $\mathcal{A}_k(x_0)$, $k \ge 1$, the point $x_k$ generated by method (4.8) with starting point $x_0$. Consider the following process:

1. Define $m = \left\lceil 15 \left[ \frac{1}{\gamma_3(f)} \right]^{1/3} \right\rceil$, and set $y_0 = x_0$.
2. For $k \ge 0$, iterate $y_{k+1} = \mathcal{A}_m(y_k)$.   (5.7)

The performance of this scheme can be derived from the following lemma.

Lemma 8. For any $k \ge 0$ we have $\|y_{k+1} - x^*\| \le \frac{1}{e} \|y_k - x^*\|$.

Proof: Indeed, since $m \ge 4e \left[ \frac{1}{\gamma_3(f)} \right]^{1/3}$, we have

$\frac{\sigma_3}{3} \|y_{k+1} - x^*\|^3 \stackrel{(2.1)}{\le} f(y_{k+1}) - f(x^*) \stackrel{(4.9)}{\le} \frac{14 L \|y_k - x^*\|^3}{m(m+1)(m+2)} \le \frac{\sigma_3}{3 e^3} \|y_k - x^*\|^3$.

Thus,

$f(T_L(y_k)) - f(x^*) \stackrel{(3.8)}{\le} \frac{L}{3} \|y_k - x^*\|^3 \le \frac{L}{3} \|y_0 - x^*\|^3 e^{-3k}$,

and we conclude that an $\epsilon$-solution of our problem can be found by (5.7) in

$O\left( \left[ \frac{1}{\gamma_3(f)} \right]^{1/3} \ln \left[ \frac{L}{\epsilon} \|x_0 - x^*\|^3 \right] \right)$   (5.8)

iterations. Unfortunately, since the complexity theory for this problem class is not developed yet, we cannot say how far our results are from the best possible ones.

6 Minimizing strongly convex functions

Let us now look at the complexity of problem (3.1) with

$\sigma_2(f) > 0$, $L_3(f) < \infty$.   (6.1)

The main advantage of such functions consists in the possibility for Newton's method (3.11) to converge quadratically in a certain neighborhood of the solution. Indeed, for $T = T_L(x)$ we have

$f(x) - f(T) \stackrel{(3.6)}{\ge} \frac{1}{2} \langle \nabla^2 f(x)(T - x), T - x \rangle \stackrel{(2.5)}{\ge} \frac{\sigma_2}{2} r_L^2(x) \stackrel{(3.4)}{\ge} \frac{\sigma_2}{2L} \|\nabla f(T)\|_* \stackrel{(2.4)}{\ge} \frac{\sigma_2}{2L} \left[ 2 \sigma_2 (f(T) - f(x^*)) \right]^{1/2}$.   (6.2)

Hence,

$f(T) - f(x^*) \le \frac{2 L^2}{\sigma_2^3} (f(x) - f(T))^2 \le \frac{2 L^2}{\sigma_2^3} (f(x) - f(x^*))^2$,   (6.3)

and the region of quadratic convergence of method (3.11) can be defined as

$Q_f = \left\{ x \in E : \ f(x) - f(x^*) \le \frac{\sigma_2^3}{2 L^2} \right\}$.   (6.4)

Alternatively, the region of quadratic convergence can be described in terms of the norm of the gradients. Indeed,

$\frac{\sigma_2}{2} r_L^2(x) \stackrel{(2.5)}{\le} \frac{1}{2} \langle \nabla^2 f(x)(T - x), T - x \rangle \stackrel{(3.6)}{\le} f(x) - f(T) \le \|\nabla f(x)\|_* \cdot r_L(x)$.

Thus,

$\|\nabla f(x)\|_* \ge \frac{\sigma_2}{2} r_L(x) \stackrel{(3.4)}{\ge} \frac{\sigma_2}{2} \left[ \frac{1}{L} \|\nabla f(T)\|_* \right]^{1/2}$.

Consequently,

$\|\nabla f(T)\|_* \le \frac{4 L}{\sigma_2^2} \|\nabla f(x)\|_*^2$,   (6.5)

and the required region of quadratic convergence can be defined as

$Q_g = \left\{ x \in E : \ \|\nabla f(x)\|_* \le \frac{\sigma_2^2}{4 L} \right\}$.   (6.6)

Thus, the global complexity of problem (3.1), (6.1) is mainly related to the number of iterations required to get from $x_0$ to the region $Q_f$ (or to $Q_g$). For method (3.11), this value can be estimated from above by

$O\left( \left[ \frac{L_3(f) D}{\sigma_2(f)} \right]^{1/2} \right)$,   (6.7)

where $D$ is defined by (3.12) (see Section 6 in [6]). Let us show that, using the accelerated scheme (4.8), it is possible to improve this complexity bound. Assume that we know an upper bound for the distance to the solution: $\|x_0 - x^*\| \le R$ ($\le D$).

Consider the following process:

1. Set $y_0 = T_L(x_0)$, and define $m_0 = \left\lceil 5 \left[ \frac{L_3(f) R}{\sigma_2(f)} \right]^{1/3} \right\rceil$.
2. While $\|\nabla f(T_L(y_k))\|_* > \frac{\sigma_2^2}{4L}$, iterate $\{ \, y_{k+1} = \mathcal{A}_{m_k}(y_k)$, $m_{k+1} = 2^{-1/3} m_k \, \}$.   (6.8)

Theorem 3. The process (6.8) terminates after at most

$\left\lceil \frac{1}{\ln 4} \ln \frac{8 L_3^3(f) R^3}{3 \sigma_2^3(f)} \right\rceil$   (6.9)

stages. The total number of Newton steps in all stages does not exceed $5 m_0$.

Proof: Denote $R_k = R \cdot 2^{-k}$. It is clear that

$m_k \ge 5 \left[ \frac{L_3(f) R_k}{\sigma_2(f)} \right]^{1/3}$, $k \ge 0$.   (6.10)

For $k \ge 0$, let us prove by induction that

$\|y_k - x^*\| \le R_k$.   (6.11)

Assume that for some $k \ge 0$ this statement is valid (it is true for $k = 0$). Then

$\frac{\sigma_2}{2} \|y_{k+1} - x^*\|^2 \stackrel{(2.1)}{\le} f(y_{k+1}) - f(x^*) \stackrel{(4.9)}{\le} \frac{14 L R_k^3}{m_k (m_k + 1)(m_k + 2)} \stackrel{(6.10)}{\le} \frac{14}{125} \sigma_2 R_k^2 \le \frac{\sigma_2}{8} R_k^2 = \frac{\sigma_2}{2} R_{k+1}^2$.

Thus, (6.11) is valid for all $k \ge 0$. On the other hand,

$f(y_{k+1}) - f(x^*) \stackrel{(4.9)}{\le} \frac{14 L \|y_k - x^*\|^3}{m_k (m_k + 1)(m_k + 2)} \stackrel{(6.11)}{\le} \frac{14 L \|y_k - x^*\|^2 R_k}{m_k (m_k + 1)(m_k + 2)} \stackrel{(6.10)}{\le} \frac{14}{125} \sigma_2 \|y_k - x^*\|^2 \stackrel{(2.1)}{\le} \frac{1}{4} (f(y_k) - f(x^*))$.   (6.12)

Hence,

$\frac{\sigma_2}{2L} \|\nabla f(T_L(y_k))\|_* \le f(y_k) - f(T_L(y_k)) \le f(y_k) - f(x^*) \le \left( \frac{1}{4} \right)^k (f(y_0) - f(x^*)) \stackrel{(3.8)}{\le} \left( \frac{1}{4} \right)^k \frac{L R^3}{3}$,

and (6.9) follows from (6.6). Finally, the total number of Newton steps does not exceed

$\sum_{k \ge 0} m_k = m_0 \sum_{k \ge 0} 2^{-k/3} = \frac{m_0}{1 - 2^{-1/3}} < 5 m_0$.

7 Fake acceleration

Note that the properties of the class of smooth strongly convex functions (6.1) leave some space for erroneous conclusions related to the rate of convergence of optimization methods at the first stage of the process, aiming to enter the region of quadratic convergence of Newton's method. Let us demonstrate a possible mistake on a particular example.

Consider a modified version $\hat{\mathcal{M}}$ of the method (4.8). The only modification is introduced in Step 2. Now it looks as follows:

2'. Compute $\hat{y}_k = T_M(y_k)$ and update
$\psi_{k+1}(x) = \psi_k(x) + \frac{(k+1)(k+2)}{2} \left[ f(\hat{y}_k) + \langle \nabla f(\hat{y}_k), x - \hat{y}_k \rangle \right]$.   (7.1)
Choose $\hat{x}_k$: $f(\hat{x}_k) = \min\{ f(x_k), f(\hat{y}_k) \}$. Set $x_{k+1} = T_M(\hat{x}_k)$.

Note that for $\hat{\mathcal{M}}$ the statement of Theorem 2 remains valid. Moreover, the process now becomes monotone, and, using the same reasoning as in (6.2) and $M = 2L$, we obtain

$f(x_k) - f(x_{k+1}) \ge f(\hat{x}_k) - f(x_{k+1}) \ge \frac{2^{1/2} \sigma_2^{3/2}}{3 L} \left[ f(x_{k+1}) - f(x^*) \right]^{1/2}$.   (7.2)

Further, let us fix the number of steps $N$. Define $\hat{k} = \frac{N}{2}$. Then, in view of (4.9), we can guarantee that

$f(x_{\hat{k}}) - f(x^*) \le \frac{112 \, L R^3}{N^3}$.   (7.3)

On the other hand,

$f(x_{\hat{k}}) - f(x^*) \ge f(x_{\hat{k}}) - f(x_{N+1}) \stackrel{(7.2)}{\ge} \frac{N}{2} \cdot \frac{2^{1/2} \sigma_2^{3/2}}{3 L} \left[ f(x_{N+1}) - f(x^*) \right]^{1/2}$.   (7.4)

Combining (7.3) and (7.4), we obtain

$f(x_{N+1}) - f(x^*) \le \frac{2^9 \cdot 3^2 \cdot 7^2 \, L^4 R^6}{\sigma_2^3 N^8}$.   (7.5)

As compared with (4.9), the proposed modification looks amazingly efficient. However, this is just an illusion. Indeed, in view of (6.4), in order to enter the region of quadratic convergence of Newton's method, we need to make the right-hand side of inequality (7.5) smaller than $\frac{\sigma_2^3}{2 L^2}$. For that we need

$O\left( \left[ \frac{L R}{\sigma_2} \right]^{3/4} \right)$   (7.6)

iterations of $\hat{\mathcal{M}}$. This is much worse than the complexity estimate (6.7) of the basic scheme (3.11), even without the acceleration (4.8).

Another test could be an estimate of the number of steps which is necessary for $\hat{\mathcal{M}}$ to halve the distance to the minimum. From (7.5) we see that it needs $O\left( \left[ \frac{L R}{\sigma_2} \right]^{1/2} \right)$ iterations, which is worse than the corresponding estimate for the method (4.8).

8 Discussion

1. From the complexity results presented in the previous sections we can derive a class of problems which are easy for the second-order schemes:

$\sigma_2(f) > 0$, $\sigma_3(f) > 0$, $L_3(f) < \infty$.   (8.1)

For such functions, the second-order methods exhibit a global linear rate of convergence and a local quadratic convergence. In accordance with (5.8) and (6.4), we need

$O\left( \left[ \frac{L_3(f)}{\sigma_3(f)} \right]^{1/3} \ln \left[ \frac{L_3(f)}{\sigma_2(f)} \|x_0 - x^*\| \right] \right)$   (8.2)

iterations of (4.8) to enter the region of quadratic convergence. Note that the class (8.1) is non-trivial. It contains, for example, all functions

$\xi_{\alpha,\beta}(x) = \alpha \, d_2(x) + \beta \, d_3(x)$, $\alpha, \beta > 0$,

with parameters

$\sigma_2(\xi_{\alpha,\beta}) = \alpha$, $\sigma_3(\xi_{\alpha,\beta}) = \frac{\beta}{2}$, $L_3(\xi_{\alpha,\beta}) = 2\beta$.

Moreover, any convex function with Lipschitz-continuous Hessian can be regularized by adding an auxiliary function $\xi_{\alpha,\beta}$.

2. For one important class of convex problems, namely

$\sigma_2(f) > 0$, $L_2(f) < \infty$, $L_3(f) < \infty$,   (8.3)

we have actually failed to clarify the situation. The standard theory of the optimal first-order methods (see, for example, Section 2.2 in [5]) can bound the number of iterations required to enter the region of quadratic convergence (6.4) as follows:

$O\left( \left[ \frac{L_2(f)}{\sigma_2(f)} \right]^{1/2} \ln \left[ \frac{L_2(f) L_3^2(f)}{\sigma_2^3(f)} \|x_0 - x^*\|^2 \right] \right)$.   (8.4)

Note that in this estimate the role of the second-order scheme is quite weak: it is used only to establish the bounds of the termination stage. Of course, as shown in Section 6, we could use it at the first stage also. However, in this case the size of the optimal solution $x^*$ enters the estimate for the number of iterations polynomially. Thus, the following question is still open: for the problem class (8.3), can we get any advantage from the second-order schemes being used at the initial stage of the minimization process?

3. From the computational point of view, the results presented in this paper are rather preliminary. For example, we always assume that all necessary constants are known. In practical algorithms, adaptive strategies for updating these parameters are of crucial importance.
However, the author believes that the theory presented here could serve as a useful guideline for such developments. 9

References

[1] Bennet A.A., Newton's method in general analysis, Proc. Nat. Acad. Sci. USA, 1916, 2, No. 10.
[2] Conn A.R., Gould N.I.M., Toint Ph.L., Trust Region Methods, SIAM, Philadelphia, 2000.
[3] Dennis J.E., Jr., Schnabel R.B., Numerical Methods for Unconstrained Optimization and Nonlinear Equations, SIAM, Philadelphia, 1996.
[4] Kantorovich L.V., Functional analysis and applied mathematics, Uspehi Matem. Nauk, 1948, 3, No. 1, 89-185 (in Russian); translated as N.B.S. Report 1509, Washington D.C., 1952.
[5] Nesterov Yu., Introductory Lectures on Convex Optimization. A Basic Course, Kluwer, Boston, 2004.
[6] Nesterov Yu., Polyak B.T., Cubic regularization of Newton method and its global performance, CORE Discussion Paper 2003/41, 2003. Submitted to Mathematical Programming.
[7] Ortega J.M., Rheinboldt W.C., Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, NY, 1970.
[8] Vladimirov A., Nesterov Yu., Chekanov Yu., Uniformly convex functionals, Vestnik Moskovskogo universiteta, Ser. Vychislit. Matem. i Kibern., 1978, No. 4 (in Russian).


More information

Universal Gradient Methods for Convex Optimization Problems

Universal Gradient Methods for Convex Optimization Problems CORE DISCUSSION PAPER 203/26 Universal Gradient Methods for Convex Optimization Problems Yu. Nesterov April 8, 203; revised June 2, 203 Abstract In this paper, we present new methods for black-box convex

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Smooth minimization of non-smooth functions

Smooth minimization of non-smooth functions Math. Program., Ser. A 103, 127 152 (2005) Digital Object Identifier (DOI) 10.1007/s10107-004-0552-5 Yu. Nesterov Smooth minimization of non-smooth functions Received: February 4, 2003 / Accepted: July

More information

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

An example of slow convergence for Newton s method on a function with globally Lipschitz continuous Hessian

An example of slow convergence for Newton s method on a function with globally Lipschitz continuous Hessian An example of slow convergence for Newton s method on a function with globally Lipschitz continuous Hessian C. Cartis, N. I. M. Gould and Ph. L. Toint 3 May 23 Abstract An example is presented where Newton

More information

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Boris T. Polyak Andrey A. Tremba V.A. Trapeznikov Institute of Control Sciences RAS, Moscow, Russia Profsoyuznaya, 65, 117997

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente May 3, 200 Abstract In this paper we prove that direct search of directional type shares the worst case complexity bound of steepest descent when sufficient

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente October 25, 2012 Abstract In this paper we prove that the broad class of direct-search methods of directional type based on imposing sufficient decrease

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

Spectral gradient projection method for solving nonlinear monotone equations

Spectral gradient projection method for solving nonlinear monotone equations Journal of Computational and Applied Mathematics 196 (2006) 478 484 www.elsevier.com/locate/cam Spectral gradient projection method for solving nonlinear monotone equations Li Zhang, Weijun Zhou Department

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

1. Introduction. We consider the numerical solution of the unconstrained (possibly nonconvex) optimization problem

1. Introduction. We consider the numerical solution of the unconstrained (possibly nonconvex) optimization problem SIAM J. OPTIM. Vol. 2, No. 6, pp. 2833 2852 c 2 Society for Industrial and Applied Mathematics ON THE COMPLEXITY OF STEEPEST DESCENT, NEWTON S AND REGULARIZED NEWTON S METHODS FOR NONCONVEX UNCONSTRAINED

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

An improved convergence theorem for the Newton method under relaxed continuity assumptions

An improved convergence theorem for the Newton method under relaxed continuity assumptions An improved convergence theorem for the Newton method under relaxed continuity assumptions Andrei Dubin ITEP, 117218, BCheremushinsaya 25, Moscow, Russia Abstract In the framewor of the majorization technique,

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Subgradient methods for huge-scale optimization problems

Subgradient methods for huge-scale optimization problems CORE DISCUSSION PAPER 2012/02 Subgradient methods for huge-scale optimization problems Yu. Nesterov January, 2012 Abstract We consider a new class of huge-scale problems, the problems with sparse subgradients.

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Lecture 5, Continuous Optimisation Oxford University Computing Laboratory, HT 2006 Notes by Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The notion of complexity (per iteration)

More information

Worst case complexity of direct search

Worst case complexity of direct search EURO J Comput Optim (2013) 1:143 153 DOI 10.1007/s13675-012-0003-7 ORIGINAL PAPER Worst case complexity of direct search L. N. Vicente Received: 7 May 2012 / Accepted: 2 November 2012 / Published online:

More information

Complexity of gradient descent for multiobjective optimization

Complexity of gradient descent for multiobjective optimization Complexity of gradient descent for multiobjective optimization J. Fliege A. I. F. Vaz L. N. Vicente July 18, 2018 Abstract A number of first-order methods have been proposed for smooth multiobjective optimization

More information

An introduction to complexity analysis for nonconvex optimization

An introduction to complexity analysis for nonconvex optimization An introduction to complexity analysis for nonconvex optimization Philippe Toint (with Coralia Cartis and Nick Gould) FUNDP University of Namur, Belgium Séminaire Résidentiel Interdisciplinaire, Saint

More information

Lecture 25: Subgradient Method and Bundle Methods April 24

Lecture 25: Subgradient Method and Bundle Methods April 24 IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything

More information

Lecture 3: Huge-scale optimization problems

Lecture 3: Huge-scale optimization problems Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) March 9, 2012 Yu. Nesterov () Huge-scale optimization problems 1/32March 9, 2012 1

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Introduction to Nonlinear Stochastic Programming

Introduction to Nonlinear Stochastic Programming School of Mathematics T H E U N I V E R S I T Y O H F R G E D I N B U Introduction to Nonlinear Stochastic Programming Jacek Gondzio Email: J.Gondzio@ed.ac.uk URL: http://www.maths.ed.ac.uk/~gondzio SPS

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

Primal-dual IPM with Asymmetric Barrier

Primal-dual IPM with Asymmetric Barrier Primal-dual IPM with Asymmetric Barrier Yurii Nesterov, CORE/INMA (UCL) September 29, 2008 (IFOR, ETHZ) Yu. Nesterov Primal-dual IPM with Asymmetric Barrier 1/28 Outline 1 Symmetric and asymmetric barriers

More information

Evaluation complexity for nonlinear constrained optimization using unscaled KKT conditions and high-order models by E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos and Ph. L. Toint Report NAXYS-08-2015

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005 3 Numerical Solution of Nonlinear Equations and Systems 3.1 Fixed point iteration Reamrk 3.1 Problem Given a function F : lr n lr n, compute x lr n such that ( ) F(x ) = 0. In this chapter, we consider

More information

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Primal-dual Subgradient Method for Convex Problems with Functional Constraints Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual

More information

Statistics 580 Optimization Methods

Statistics 580 Optimization Methods Statistics 580 Optimization Methods Introduction Let fx be a given real-valued function on R p. The general optimization problem is to find an x ɛ R p at which fx attain a maximum or a minimum. It is of

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Clément Royer - University of Wisconsin-Madison Joint work with Stephen J. Wright MOPTA, Bethlehem,

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

The Trust Region Subproblem with Non-Intersecting Linear Constraints

The Trust Region Subproblem with Non-Intersecting Linear Constraints The Trust Region Subproblem with Non-Intersecting Linear Constraints Samuel Burer Boshi Yang February 21, 2013 Abstract This paper studies an extended trust region subproblem (etrs in which the trust region

More information

Local Superlinear Convergence of Polynomial-Time Interior-Point Methods for Hyperbolic Cone Optimization Problems

Local Superlinear Convergence of Polynomial-Time Interior-Point Methods for Hyperbolic Cone Optimization Problems Local Superlinear Convergence of Polynomial-Time Interior-Point Methods for Hyperbolic Cone Optimization Problems Yu. Nesterov and L. Tunçel November 009, revised: December 04 Abstract In this paper, we

More information

A projection algorithm for strictly monotone linear complementarity problems.

A projection algorithm for strictly monotone linear complementarity problems. A projection algorithm for strictly monotone linear complementarity problems. Erik Zawadzki Department of Computer Science epz@cs.cmu.edu Geoffrey J. Gordon Machine Learning Department ggordon@cs.cmu.edu

More information

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization Yinyu Ye Department is Management Science & Engineering and Institute of Computational & Mathematical Engineering Stanford

More information

Optimal Newton-type methods for nonconvex smooth optimization problems

Optimal Newton-type methods for nonconvex smooth optimization problems Optimal Newton-type methods for nonconvex smooth optimization problems Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint June 9, 20 Abstract We consider a general class of second-order iterations

More information

5. Subgradient method

5. Subgradient method L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize

More information

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization Frank E. Curtis Department of Industrial and Systems Engineering, Lehigh University Daniel P. Robinson Department

More information

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity Mohammadreza Samadi, Lehigh University joint work with Frank E. Curtis (stand-in presenter), Lehigh University

More information

An interior-point trust-region polynomial algorithm for convex programming

An interior-point trust-region polynomial algorithm for convex programming An interior-point trust-region polynomial algorithm for convex programming Ye LU and Ya-xiang YUAN Abstract. An interior-point trust-region algorithm is proposed for minimization of a convex quadratic

More information

Lagrange Relaxation and Duality

Lagrange Relaxation and Duality Lagrange Relaxation and Duality As we have already known, constrained optimization problems are harder to solve than unconstrained problems. By relaxation we can solve a more difficult problem by a simpler

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Kantorovich s Majorants Principle for Newton s Method

Kantorovich s Majorants Principle for Newton s Method Kantorovich s Majorants Principle for Newton s Method O. P. Ferreira B. F. Svaiter January 17, 2006 Abstract We prove Kantorovich s theorem on Newton s method using a convergence analysis which makes clear,

More information

Augmented self-concordant barriers and nonlinear optimization problems with finite complexity

Augmented self-concordant barriers and nonlinear optimization problems with finite complexity Augmented self-concordant barriers and nonlinear optimization problems with finite complexity Yu. Nesterov J.-Ph. Vial October 9, 2000 Abstract In this paper we study special barrier functions for the

More information

A NOTE ON Q-ORDER OF CONVERGENCE

A NOTE ON Q-ORDER OF CONVERGENCE BIT 0006-3835/01/4102-0422 $16.00 2001, Vol. 41, No. 2, pp. 422 429 c Swets & Zeitlinger A NOTE ON Q-ORDER OF CONVERGENCE L. O. JAY Department of Mathematics, The University of Iowa, 14 MacLean Hall Iowa

More information

On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition

On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition Man-Chung Yue Zirui Zhou Anthony Man-Cho So Abstract In this paper we consider the cubic regularization

More information

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL)

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL) Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization Nick Gould (RAL) x IR n f(x) subject to c(x) = Part C course on continuoue optimization CONSTRAINED MINIMIZATION x

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition

On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition Man-Chung Yue Zirui Zhou Anthony Man-Cho So Abstract In this paper we consider the cubic regularization

More information

Step lengths in BFGS method for monotone gradients

Step lengths in BFGS method for monotone gradients Noname manuscript No. (will be inserted by the editor) Step lengths in BFGS method for monotone gradients Yunda Dong Received: date / Accepted: date Abstract In this paper, we consider how to directly

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Numerical Methods for Large-Scale Nonlinear Systems

Numerical Methods for Large-Scale Nonlinear Systems Numerical Methods for Large-Scale Nonlinear Systems Handouts by Ronald H.W. Hoppe following the monograph P. Deuflhard Newton Methods for Nonlinear Problems Springer, Berlin-Heidelberg-New York, 2004 Num.

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Convergence Rates in Regularization for Nonlinear Ill-Posed Equations Involving m-accretive Mappings in Banach Spaces

Convergence Rates in Regularization for Nonlinear Ill-Posed Equations Involving m-accretive Mappings in Banach Spaces Applied Mathematical Sciences, Vol. 6, 212, no. 63, 319-3117 Convergence Rates in Regularization for Nonlinear Ill-Posed Equations Involving m-accretive Mappings in Banach Spaces Nguyen Buong Vietnamese

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Optimisation in Higher Dimensions

Optimisation in Higher Dimensions CHAPTER 6 Optimisation in Higher Dimensions Beyond optimisation in 1D, we will study two directions. First, the equivalent in nth dimension, x R n such that f(x ) f(x) for all x R n. Second, constrained

More information

On the Local Convergence of Regula-falsi-type Method for Generalized Equations

On the Local Convergence of Regula-falsi-type Method for Generalized Equations Journal of Advances in Applied Mathematics, Vol., No. 3, July 017 https://dx.doi.org/10.606/jaam.017.300 115 On the Local Convergence of Regula-falsi-type Method for Generalized Equations Farhana Alam

More information

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen

More information

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS Igor V. Konnov Department of Applied Mathematics, Kazan University Kazan 420008, Russia Preprint, March 2002 ISBN 951-42-6687-0 AMS classification:

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Coralia Cartis, University of Oxford INFOMM CDT: Modelling, Analysis and Computation of Continuous Real-World Problems Methods

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization E5295/5B5749 Convex optimization with engineering applications Lecture 8 Smooth convex unconstrained and equality-constrained minimization A. Forsgren, KTH 1 Lecture 8 Convex optimization 2006/2007 Unconstrained

More information

Lecture 6 : Projected Gradient Descent

Lecture 6 : Projected Gradient Descent Lecture 6 : Projected Gradient Descent EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Consider the following update. x l+1 = Π C (x l α f(x l )) Theorem Say f : R d R is (m, M)-strongly

More information

arxiv: v1 [math.oc] 5 Dec 2014

arxiv: v1 [math.oc] 5 Dec 2014 FAST BUNDLE-LEVEL TYPE METHODS FOR UNCONSTRAINED AND BALL-CONSTRAINED CONVEX OPTIMIZATION YUNMEI CHEN, GUANGHUI LAN, YUYUAN OUYANG, AND WEI ZHANG arxiv:141.18v1 [math.oc] 5 Dec 014 Abstract. It has been

More information

Convex Functions and Optimization

Convex Functions and Optimization Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties
