CORE 50 YEARS OF DISCUSSION PAPERS. Globally Convergent Second-order Schemes for Minimizing Twicedifferentiable 2016/28

Size: px
Start display at page:

Download "CORE 50 YEARS OF DISCUSSION PAPERS. Globally Convergent Second-order Schemes for Minimizing Twicedifferentiable 2016/28"

Transcription

1 26/28 Globally Convergent Second-order Schemes for Minimizing Twicedifferentiable Functions YURII NESTEROV AND GEOVANI NUNES GRAPIGLIA 5 YEARS OF CORE DISCUSSION PAPERS

2 CORE Voie du Roman Pays 4, L Tel 2 ) Fax 2 ) immaq-library@uclouvainbe

3 CORE DISCUSSION PAPER 26/28 Globally convergent second-order schemes for minimizing twice-differentiable functions GN Grapiglia and Yu Nesterov July 8, 26 Abstract In this paper, we suggest new universal second-order methods for unconstrained minimization of twice-differentiable convex or non-convex) objective function For the current function, these methods automatically achieve the best possible global complexity estimates among different Hölder classes containing the Hessian of the objective The universal methods for functional residual and for norm of the gradient are different For development of the latter methods, we introduced a new line-search acceptance criterion, which can be seen as a nonlinear modification of the Armijo-Goldstein condition Keywords: unconstrained minimization, second-order methods, Hölder condition, worstcase global complexity bounds Federal University of Paraná, Brazil; research of this author was partially supported by CNPq - Brazil, grant 4288/24-5 CORE, UCL, Belgium; research of this author was partially supported by CNPq - Brazil, grants 4288/24-5 and 498/24-5, and by the grant Action de recherche concertè ARC 4/9-6 from the Direction de la recherche scientifique - Communautè française de Belgique Scientific responsibility rests with the author

4 Introduction Motivation Recent results on the global worst-case complexity bounds for a new variant of Newton method 9] became a starting point for a sequence of publications, addressing the global complexity issues for the second-order methods see, for example, 2], ], 4], 6], 8]) These papers, present the upper and lower estimates for the rate of convergence of second-order schemes in terms of the residual in function value or in the norm of the gradient In the majority of these publications, the standard assumption is the Lipschitz property of the Hessian of the objective function However, it is clear that this is not the only possibility for measuring the level of smoothness of the Hessian For example, we could assume that the Hessian is Höldercontinuous At the same time, a straightforward tuning of the methods to corresponding smoothness assumptions usually results in the explicit changes in the algorithms This is indeed unfortunate, since very often we do not know a priori which smoothness assumption better fits our particular objective function Recently, in the paper 7], it was shown that it is possible to develop universal first-order methods, which do not need preliminary information on the type of Hölder continuity of the gradient Therefore, it looks interesting to develop a similar theory for the second-order methods This is the main goal of this paper Contents In Section 2, we study the structure of Hölder constants for twice differentiable functions and derive main geometric inequalities We also justify the rate of convergence in terms of the function residual both for the simplest overestimating second-order method with perfect knowledge of the Hölder parameter and Hölder constant for the Hessian, and for an adaptive scheme, which needs only the Hölder parameter at the input In Section we present a universal second-order method for achieving a small residual in the function value It appears that, up to the choice of initial estimate for the Hölder constant, this is exactly the Cubic Regularization of Newton Method suggested in 9] Its complexity bound for a twice-differentiable objective function coincides with the bound for a method employing the knowledge on the best Hölder class for this particular function In the second part of the paper, we present methods targeting the finding of points with small norm of the gradient First of all, in Section 4 we show that, if the Hölder parameter is close to zero, then the standard line-search acceptance criterion, based on the relation between the function value at the candidate point and the predicted value of the regularized quadratic model, cannot help in finding the points with small gradients Instead, we propose two new acceptance criteria, which can be seen as a multi-dimensional generalization of Armijo-Goldstein condition First of them allows efficient adaptation of Hölder constant And the second one leads to a fully universal method for finding the points with small norm of the gradient both for convex and non-convex objective functions see Section 5) Notations and generalities In what follows, we denote by E a finite-dimensional linear space, and by E its dual space, composed by linear functions on E The value of function s E at point x E is denoted by s, x Important elements of the dual space are the gradients of a differentiable function f : E R: fx) E, x E

5 For an operator A : E E, denote by A its adjoint operator defined by identity Ax, y = A y, x, x, y E Thus, A : E E It is called self-adjoint if A = A Important examples of such operators are Hessians of twice differentiable function f : E R: 2 fx)u, v = 2 fx)v, u, x, u, v E Operator B : E E is positive-definite if Bx, x >, x E \ {, notation B ; we use notation B if the above inequality is not strict) In what follows, we fix some self-adjoint positive-definite operator B for defining Euclidean norms in the primal and dual spaces: x = Bx, x /2, x E, s = s, B s /2, s E The norm of operator A : E E is defined in a standard way: A = max { Au { : u = min r : r 2 B A B A u E r In what follows, we often use the following simple statement Lemma Let α, 2] Assume that the sequence of positive numbers {δ k k satisfies inequalities δ k δ k+ δk+ α, k ) Then for any t, t m, we have ln δ t ) t α ln δ, 2) and for t m we can guarantee that δ t δ m + α )t m)δα m +δm α ] α On the other hand, if for some m and k m we have then δ t < for t m, and the following inequality holds: δ t ] +δ α m α α )t m) ) δ k δ k+ δ α k, 4) δ m +α )t m)δm α ] α α )t m) ] α 5) Indeed, from condition ), we have δ t δt+ α, and 2) follows On the other hand, inequality ) can be rewritten as δ α k δ k+ + δk+ α )α Therefore, δ α k+ δ α k δ α k+ δ k+ = δ k++δ +δk+ α k+ α )α δ α k+ )α δ α k+ δ k++δk+ α )α 6) 2

6 Since α, ], function τ α is concave for τ Therefore τ α τ α 2 + α )τ α 2 2 τ τ 2 ), τ, τ 2 > Choosing τ = δ k+ and τ 2 = δ k+ + δk+ α, we get δ α k+ δ k+ + δ α k+ )α α )δ k+ + δ α k+ )α 2 δ α k+ Substituting this inequality in 6), for any k {m,, t, we obtain δ α k+ δ α k α )δ k++δk+ α )α 2 δk+ α = α )δ k+ δ α k+ δ k++δk+ α )α δ k+ +δk+ α = α +δ α k+ α +δm α It remains to sum up these inequalities for k = m,, t In order to prove inequality 5), note that function is convex in τ for τ > τ α Therefore, α ) τ+ ) α τ α τ, τ, τ + > 7) α Let δ m < Therefore, δ k < for all k m Choosing in inequality 7) τ = δ k and = δk α, we get δ α k+ 4) δ k δ α k ] α 7) + α )δα δ α k δ α k k = + α δ α k Summing up these inequalities for k = m,, t, we get inequality 5) Note that the right-hand sides of inequalities ) and 5) are increasing functions of δ m Since in inequality 5) δ m, we get the following simple bound: lim α δ t +t m)α )] α Remark In inequality ) we can take the limit in α Since ]) α ln + t m)α )δα m = t m 2, +δm α we conclude that inequality ) has the following limiting form: δ t δ m exp t m ) 2, t m 2 Twice-differentiable convex functions 8) In this section and in Section we consider second-order methods for solving the following unconstrained minimization problem: min fx), 2) x E where f is a convex twice-differentiable function on E We assume that there exists at least one optimal solution x E of this problem, and denote f = fx )

7 The level of smoothness of the objective function f in problem 2) is characterized by the system of Hölder constants H f ν) : x y, ν 22) def = sup x,y E { 2 fx) 2 fy) x y ν For some values of ν the correspondent constants can be infinite However, it is easy to see that function H f ) is log-convex Hence, if ν < ν 2 and H f ν i ) <, i =, 2, then ν 2 ν ν ν ν H f ν) H 2 ν ν f ν ) H 2 ν f ν 2 ), ν ν, ν 2 ] 2) In particular, if H f ) < and H f ) <, then H f ν) H ν f ) Hf ν ), ν, ] 24) At the same time, it is easy to find a uniform lower bound for all constants H f ν) Indeed, if we take two points x, y E with x y =, then clearly H f ν) 2 fx) 2 fy), ν, ] 25) Note that the condition H f ) < allows discontinuous Hessian ) However, its global variation must be bounded An interesting example of such a function is a quadratic penalty for the system of linear inequalities: fx) = m a i, x b i ) 2 +, 26) i= where τ) + = max{, τ The following two simple consequences of definition 22) are valid for all ν, ] and x, y E: fy) fx) fx), y x 2 2 fx)y x), y x H f ν) y x +ν)), 27) fy) fx) 2 fx)y x) H f ν) y x +ν +ν 28) A geometric interpretation of the upper variant of inequality 27) leads to the following constructions Let H f ν) < for some ν, ] Consider the following model of function f around some point x E: Qx; y) M ν,h x; y) def = fx) + fx), y x fx)y x), y x, def = Qx; y) + H y x +ν)), y E, where the parameter H > is an estimate for the Hölder constant H f ν) Clearly, for H H f ν) we have 27) fy) M ν,h x; y), y E 29) ) For this, we need to assume that function f is twice differentiable almost everywhere In this case, we need to use in definition 22) only the points x and y where the Hessian do exist 4

8 Therefore, it is natural to consider the point T ν,h x) def = arg min y E M ν,hx; y) 2) Since function M ν,h x; y) is strictly convex in y, point T = T ν,h x) is the unique solution of the following equation: fx) + 2 fx)t x) + Multiplying equation 2) by T x, we get H T x ν +ν BT x) = 2) fx), T x + 2 fx)t x), T x + H +ν T x = 22) Denote R ν,h x) = T x Then, Mν,H def x) = M ν,h x; T ) 22) = fx) 2 2 fx)t x), T x H R ν,h x) 2) Thus, taking into account inequality 29), we come to the following statement Lemma 2 Let H H f ν) Then for T = T ν,h x) we have fx) ft ) 2 2 fx)t x), T x + H R ν,h x) We will need also some bounds on the growth of values R ν,h x) Lemma For any ν, ] and H > we have In particular, R ν,2h Consider the following function: ξτ) = min y E H R ν,h x) 24) x) R ν,h x) 4R ν,2h x) 25) R,H x) 2 2/ R,2H x) 26) {Qx; y) + y x τ+ν)), τ > The objective function in this minimization problem is jointly convex in y and τ Therefore, function ξ ) is also convex Consequently, its derivative ξ τ) = increasing in τ Thus, choosing τ = H, we conclude that R ν,/τ x) τ 2 +ν)) is H 2 R ν,h x) +ν)) = ξ ) H ξ ) 4H 2H = 2 R ν,2h x) +ν)) In particular, for ν =, we get inequality 26) For solving problem 2), the methods presented in Sections 2 and of this paper, generate the minimizing sequence {x t t E in accordance to the following generic iteration: Define x t+ = T ν,ht x t ) with coefficient H t > such that 27) fx t+ ) Mν,H t x t ) 5

9 The methods differ only by the rules for choosing the parameter ν, ] and generating the sequence of scaling coefficients {H t t Since Mν,H t x t ) fx t ), all our schemes produce the minimizing sequence with monotonically decreasing values of objective function Thus, {x t t Fx ) def = {x E : fx) fx ) Denote D = sup x x x Fx ) 28) We assume that D < Theorem Assume that for some ν, ] with H f ν) <, the scaling coefficients in method 27) satisfy condition for some constant γ Then, for any t we have < H t γ H f ν), t, 29) fx t ) f +γ +ν H f ν)d + t +2ν ) +ν 22) Indeed, M ν,h t x t ) = min y E {Qx t ; y) + H t y x t +ν)) Therefore, for all t, we have 27) 29) min y E min y E {fy) + H f ν)+h t) y x t +ν)) {fy) + +γ)h f ν) y x t +ν)) fx t ) fx ) f + +γ)h f ν) x x +ν)) f + +γ)h f ν) D +ν)) 22) On the other hand, for t we get fx t+ ) Mν,H t x t ) min {fy) + +γ)h f ν) y x t α,] +ν)) : y = x t + αx x t ) min {fx t ) αfx t ) f ) + +γ)h f ν) D α,] +ν)) α The minimum of the last minimization problem is achieved at α = ] fxt) f )+ν) +ν +γ)h f ν) D 22) 6

10 Hence, ] fx t+ ) fx t ) fxt ) f )+ν) +ν fx +γ)h f ν) D t ) f ) Denoting now δ t = +ν is satisfied with α = +ν ] + +γ)h f ν) D +ν)) fxt) f )+ν) +ν +γ)h f ν) D = fx t ) +ν +ν +γ)h f ν)d ] +ν fx t ) f ) +ν ) ] +ν = fx t ) +ν +ν)fxt) f +ν ) +γ)h f ν)d fxt ) f ) ) +ν +ν)fxt ) f ) +γ)h f ν)d At the same time,, we see that the condition ) of Lemma Therefore, δ α +ν ) +ν δ = +ν +ν)fx ) f ) +γ)h f ν)d, and we conclude that 22) +ν ) +ν ) δ t + t δ α +ν +δ α ] +ν) +ν + t +ν +2ν ] +ν) Let us look now at different strategies for satisfying condition 29) Constant H f ν) is known Then we can take H t = H f ν) for all t In this case, we choose in 29) γ = Therefore, in view of the estimate 22), this scheme can find ϵ-solution of problem 2) in ) ] Hf ν)d O +ν ϵ 222) iterations At each iteration the oracle of our objective function is called only once 2 Adaptive estimate of H f ν) For real-life problems, usually it is difficult to have a good a priori estimate of the constant H f ν) In this case, we can apply the following 7

11 adaptive strategy Adaptive method I with specific ν, ] Initialization Choose x E and H, H f ν)] Iteration t 22) a) Find the smallest integer i t such that ft ν,2 i t Ht x t )) M ν,2 i t H t x t ) b) Set x t+ = T ν,2 i tht x t ) and H t+ = 2 it H t Note that the constant H can be chosen, for example, from inequality 25) Theorem 2 Assume that H f ν) < Then the scaling coefficients in method 22) satisfy condition < 2 i t H t 2 H f ν), t 224) Moreover, for any t we have fx t ) f 2 +ν H f ν)d + t +2ν ) +ν 225) At the same time, the total number of calls of oracle N t after t iterations of method 22) is bounded as follows: N t 2t log 2 H f ν) log 2 H Let us prove that all scaling coefficients H t in method 22) satisfy inequality H t H f ν) Indeed, this is true for H Assume that we enter tth iteration with H t H f ν) Then the final value of this coefficient 2 it H t cannot be bigger than 2H f ν) since otherwise we should stop the line-search process earlier Hence, 224) is valid, an consequently, H t+ = 2 2it H t H f ν) Since we have justified relation 224), we can use inequality 22) with γ = This is 225) Finally, let us estimate the total number of calls of oracle At each iteration the oracle is called i t + times At the same time, H k+ = 2 2i kh k Therefore, t i k + ) = t + + t k= k= log 2 2H k+ H k = 2t + ) + log 2 H t+ log 2 H 2t log 2 H f ν) log 2 H As we have seen, the scheme 22) can efficiently estimate the constant H f ν) At the same time, in average it needs only two calls of oracle per iteration Nevertheless, for this strategy we need to choose the value of smoothness parameter ν In the next section we show how to avoid this requirement 8

12 Universal second-order method Denote γ ν ϵ) = 6Hf ν) ] 2 +ν)) Consider the following minimization scheme +ν 2D 5ϵ ) ν +ν ) Universal Method I Initialization Choose x E and H Iteration t ], inf γ νϵ) ν 2) a) Find the smallest integer i t such that ft,2 i t Ht x t )) M,2 i t H t x t ) b) Set x t+ = T,2 i t Ht x t ) and H t+ = 2 i t H t As compared with method 22), we choose here ν = Up to the choice of initial value H, this is exactly 2) version 57) of Cubic Regularization of the Newton Method suggested in 9] for minimizing functions with Lipschitz continuous Hessians However, we prove that this method can work properly even if H f ν) < for some ν, ] Let us prove first the following auxiliary result, which is valid also for nonconvex function f Lemma 4 Let x + = T,H x) for some x E and H > If for some δ > and ν, ] we have ] 2 CHf ν) +ν fx + ) δ and H ) ν +ν +ν)) δ, ) where the constant C 6 Then Moreover, in this case, x + x ν CH f ν) +ν))h 4) H x x + 2 fx + ) 5) For ν =, the statements are trivial Assume ν, ) Denote r = x + x Then δ ) fx + ) fx + ) f x) 2 f x)x + x) + f x) + 2 f x)x + x) 6) 28),2) H f ν)r +ν +ν + 2 Hr2 = r +ν Hf ν) +ν + 2 Hr ν ] 2) As compared with 57) from 9], in method 2) there is no artificial lower bound for scaling coefficients H t 9

13 Assume that Hr ν < δ < r +ν Hf ν) CH f ν) +ν)) Then +ν + CH f ν) 2 +ν)) ] ) = r+ν +ν H f ν) + C 2) < H f ν) +ν ) + C 2) Note that for C 6 we have + C 2) CH f ν) +ν))h ] +ν ν ] 2 C Therefore, δ < 6Hf ν) ν ) +ν ν +ν)) H This contradicts the second inequality in ) Let us prove now inequality 5) In view of inequality 4), we have H f ν) +ν C Hr ν 7) Therefore, in view of inequality 6), we have ] fx + ) r +ν Hf ν) +ν + 2 Hr ν r 2 H C + ] 2 r 2 H 6+C 2C r2 H Corollary Under conditions of Lemma 4 with C 6, we have fx + ) M,H x) f x) 8) fx + ) 27) Q x; x + ) + H f ν)r +ν)) 4) Q x; x + ) + Hr 6 = M,H x) Now we can justify complexity bounds of method 2) on the whole class of twice differentiable objective functions Theorem Assume that for some ν, ] we have H f ν) < + Let sequence {x t T t= be generated by method 2) and satisfy conditions ft,2 i H t x t )) f ϵ >, i =,, i t, t =,, T 9) Then, for all t =,, T we have H t γ ν ϵ) ) Moreover, for t =,, T, we have fx t ) f ) 2 6γ νϵ) t+) 2 +ν D, ) Therefore, T 6) +ν 26 ) ν ] ) Hf ν)d +ν 5 +ν))ϵ 2)

14 Let us prove first, that the sequence {x t T t= is well defined Indeed, assume that i t > Since for all i =,, i t we have M,2 i H t x t ) 2) < ft,2 i H t x t )) 27) M ν,hf ν)x t, T,2 i H t x t )), we get 2i H t 6 R ν,2 i H t x t ) conclude that H f ν) +ν)) Therefore, R ν,2 i H t x t ) T,2 i H t x t ) x ν T,2 i H t x t ) x t + x t x ] ν 6H f ν) +ν)) 2 i H t, and we T,2 i H t x t ) x t ν + x t x ν 6H f ν) +ν)) 2 i H t + D ν Ω i Note that ft,2 i H t x t )) ft,2 i H t x t)) f T,2 i Ht x t ) x 9) ϵ T,2 i Ht x t ) x Therefore, in view of Corollary, the line-search process at each iteration of method 2) terminates in finite time, at least when the following inequality is satisfied: 2 i H t ] 2 ) 6Hf ν) +ν Ω i +ν +ν)) ϵ ν Thus, we have proved that the whole sequence {x t T t= is well defined and x t x 28) D, t =,, T ) Let us establish now an upper bound for values H t, t =,, T For t =, inequality ) is justified by the initial conditions of method 2) Suppose that ) is valid for some t If i t =, then H t+ = 2 H t < γ ν ϵ) Consider now the case i t > Denote ϵ y t = T,2 i t H t x t ), and choose δ = y t x Since in view of Corollary, we have δ 9) fy t) f y t x fy t ), H t+ 2) = 2 i t H t 6Hf ν) ] 2 +ν)) +ν y t x ϵ ) ν +ν 4) It remains to note that y t x y t x t + x t x 26),) 2 2/ x t+ x t + D ) + 2 5/ )D < 2 5 D Substituting this upper bound in inequality 4), we get H t+ γ ν ϵ)

15 Let us estimate now the rate of convergence of method 2) Denote r t = x t x Note that fx t+ ) 2) 27) ) min M,2H t+ x t, y) y E { min fy) + H f ν) y E +ν)) y x t + 2H t+ 6 y x t min α,] {fx t ) αfx t ) f ] + H f ν)α r t +ν)) + H t+α r t min {fx t ) αfx t ) f ] + H f ν)α D α,] +ν)) + γ νϵ)α D Denote the objective function in the latter optimization problem by ω t α) Note that ω ) = f + H f ν)d +ν)) + γνϵ)d fx ) fx t ), t =,, T 5) Therefore, for all t = 2,, T we have ω t) = 5) H f ν)d +ν + γ ν ϵ)d fx t) f ) H f ν)d +ν + γ ν ϵ)d H f ν)d +ν)) γ νϵ)d > Therefore, the solution α t of the optimization problem ω t = min α,] ω t α) is smaller than one and can be found from the equation = ω tα) = H f ν)d +ν α +ν + γ ν ϵ)d α2 fx t ) f ) = H f ν)d +ν α +ν 6Hf ν)d + α +ν +ν)) ] 2 +ν 2 5ϵ ) ν +ν fx t ) f ) 6) Note that ω t 6) = ω t α t ) α t ω tα t ) 6) = fx t ) αt fx t ) f ] + H f ν)α t ) D +ν)) + γνϵ)α t ) D ] α t fx t ) f ] + H f ν)α t )+ν D +ν + γ ν ϵ)αt ) 2 D ) ) = fx t ) αt fx t ) f ] + γ ν ϵ)αt ) D 2

16 Thus, for any t =,, T, we have fx t+ ) ω t fx t ) +ν α t fx t ) f ] 7) Let us find now the lower bound for αt Multiplying the second line in 6) by 2 5ϵ, we have = 2 5ϵ ω tα) = H f ν)d +ν 2α+ν Let us define ᾱ as a solution to equation Then H f ν)d +ν 2ᾱ+ν 5ϵ 6Hf ν)d 5ϵ + +ν)) 2α+ν 5ϵ 6H f ν)d 2ᾱ +ν +ν)) 5ϵ = 2 = 2) 6 Therefore, 2 5ϵ ω tᾱ) ν 2fx t) f ) 5ϵ ] 2 +ν 9) < 2fx t) f ) 5ϵ Thus, ] αt ᾱ = 5ϵ +ν)) +ν 8) 6H f ν)d Note that the second inequality in 8) can be rewritten in the following way: Therefore, = α t ) 2 γ ν ϵ)d fx t ) f 6) 9) α t ) 2 H f ν)d +ν ] αt ) ν 6H f ν)d ν +ν 5ϵ +ν)) 9) = H f ν)d H f ν)d +ν +ν α t ) +ν + γ ν ϵ)d α t ) 2 ] 6Hf ν)d ν 5ϵ +ν)) ] 6Hf ν)d ν +ν 5ϵ +ν)) +ν +ν)) 6H f ν) + γ ν ϵ)d ] ] 2 ] ) ν +ν 5ϵ +ν 2D D + = α t ) 2 γ ν ϵ)d 2 + ν) ) 2 ] 6 ν +ν + = α t ) 2 γ ν ϵ)d 6 ] +ν + 2) ν 2 α t ) 2 γ ν ϵ)d Hence, we obtain the following inequality: fx t ) fx t+ ) 7) +ν 2fxt) f ) γ νϵ)d fx t ) f ], t Denoting now δ t = 2fx t) f ) γ ν ϵ)d +ν ) 2, we get δ t δ t+ δ /2 t, t

17 This implies δ Now we can apply inequality 8) with α = 2 This gives us inequality ) In order to get the upper bound 2), note that ϵ 9) fx t ) f ) 6γ νϵ) t+) 2 +ν ) 2 D, t T Therefore, ] /2 ] 2 T 6γνϵ) +ν ϵ D = 6 6Hf ν) +ν ϵ +ν)) = +ν 6 ] 6Hf ν)d 2 +ν))ϵ +ν 2 5 +ν 2D 5ϵ ) ν +ν ] /2 = 6) +ν ] ) ν /2 +ν D 26 ) ν ] ) Hf ν)d +ν 5 +ν))ϵ Note that the complexity bound 2), up to a constant factor, coincides with the estimate 222) However, method 2) does not require knowledge of the smoothness parameter ν, ] It adapts automatically to its best value, ensuring the smallest right-hand side of inequality 2) Using the same reasoning, as in the proof of Theorem 2, we can prove the following bound for the total number of calls of oracle N t in method 2) after t iterations: N t 2t log 2 ˆγϵ) log 2 H, t, 2) where ˆγϵ) = inf γ νϵ) ν For starting the method 2), we need to ensure initial condition H, ˆγϵ)] 2) Usually, this is not a serious problem since typically all values γ ν ϵ), ν, ], are expected to be big In any case, we can try to find this value from an auxiliary search procedure based on the following fact Lemma 5 Let H >, and for two points y = T,H x ), y = T,2H x ) we have fy i ) f ϵ, i =, 2 If fy ) M,H x ), 22) fy ) M,2H x ), then H ˆγϵ) Denote δ = ϵ y x Since δ fy ) f y x fy ), using Corollary, we get H < 6Hf ν) ] 2 +ν)) +ν y x ϵ ) ν +ν 2) 4

18 On the other hand, fy ) 22) y x 28) D Thus, M,2H x ) fx ) Hence, y Fx ) and therefore, y x y x + x x 26) 2 2/ y x + x x 28) 2 2/ + ) D < 2 D Consequently, H 2) ] 2 6Hf ν) +ν < 2D ) ν +ν ) +ν)) ϵ = γ ν ϵ) It remains to note that the condition 22) does not involve any particular value of ν, ] Thus, we can try to get an appropriate value of H by checking the points y i = T,2 ix ) with positive or negative integer values of i 4 Decreasing the norm of the gradient Let us assume now that the objective function in problem min x E fx) 4) is not convex Then our main goal consists in finding a point with a small norm of the gradient However, note that non-convexity of the objective creates additional difficulties First of all, in our optimization schemes we need to use the global solution of the auxiliary problem 2) it can be efficiently computed, see 5, 9]) This solution T = T ν,h x) is characterized by the first-order optimality condition 2), and by the secondorder condition 2 fx) + H +ν T x ν B 42) see ]; for ν =, this characterization was firstly obtained in 9]) Note that condition 42) is stronger than the usual second-order optimality condition for the auxiliary problem 2) However, even with its help, the standard machinery for convergence analysis often do not work Let us look, for example, what we can guarantee for the process as applied to non-convex problem 4) Indeed, x E, x t+ = T ν,hf ν)x t ), t, 4) fx t+ ) 27) fx t ) + fx t ), x t+ x t fx t )x t+ x t ), x t+ x t + H f ν) +ν)) x t+ x t 22) = fx t ) 2 2 fx t )x t+ x t ), x t+ x t H f ν) x t+ x t 42) fx t ) νh f ν) )) x t+ x t 5

19 Thus, it seems that for ν, the method becomes slower and slower and we cannot guarantee any rate of convergence for the limiting value ν = Therefore, for non-convex problems we need to employ stronger conditions for accepting next points in the minimization sequence Let us replace the functional condition 27) by the following criterion: { G κ x, x + ) fx) fx + ) κ fx) x x +, 44) where κ is a constant belonging to the interval, 2 ] Criterion G κ can be seen as a natural strengthening of Armijo-Goldstein condition Note that it does not depend on the smoothness parameter ν We will need also the following inequality: ft ) ft ) fx) 2 ft )T x) + fx) + 2 ft )T x) 45) 28),2) H f ν)+h +ν T x +ν Lemma 6 Let x + = T ν,h x) and H f ν) < + Then G /4 x, x + ) is true for any H satisfying the inequality ) H + 4 H f ν) 46) If function f is convex, then G /2 x, x + ) is satisfied by any ) H + 2 H f ν) 47) Indeed, f x) fx + ) 27) f x), x x f x)x + x), x + x H f ν) x + x +ν)) ] 22) = 2 2 f x)x + x), x + x + H +ν H f ν) +ν)) x + x 48) Using matrix inequality 42), we get f x) fx + ) H ) ] H f ν) +ν)) x + x In view of assumption 46), the right-hand side of this inequality is nonnegative Thus, f x) fx + ) 45) ) H+H f ν) 2 H H f ν) fx + ) x + x 46) 4 fx +) x + x 6

20 Let us assume now that function f is convex Then, from inequality 48) we have: ] f x) fx + ) H +ν H f ν) +ν)) x + x Since by assumption 47) the right-hand side of this inequality is nonnegative, we get 45) ) f x) fx + ) H+H f ν) H H f ν) fx + ) x + x 47) 2 fx +) x + x Let us consider now different variants of the regularized Newton Method The simplest version uses a constant coefficient in the prox-term: Denote g t = min fx k) kt+ x E, x t+ = T ν,h x t ), t 49) Theorem 4 Let the objective function in problem 4) be below bounded by f and H f ν) < + for some ν, ] If the parameter H of method 49) satisfies the condition 46), then ] ) +ν gt H+Hf ν) 4fx ) f ) +ν t+ 4) Indeed, in view of Lemma 6, we have fx k ) fx k+ ) 4 fx k+) x k x k+ 45) 4 Summing up these inequalities for k =,, t, we get ] +ν 4 t + )gt ) +ν 4 +ν H+H f ν) +ν H+H f ν) ] +ν +ν H+H f ν) t k= ] +ν +ν fx k+ ) +ν fx k+ ) fx ) fx t+ ) fx ) f This is exactly the inequality 4) If the objective function of problem 4) is convex, we can guarantee a better rate of convergence In this case, we assume that problem 4) is solvable and denote by x one of its optimal solutions We assume also that the constant D defined by 28) is finite Theorem 5 Let function f in problem 4) be convex and parameter H in method 49) satisfy condition 47) Then for any t we have ] +ν fx t ) fx H+Hf ν))d ) +ν, 4) and for any t 2 we have ν) t ) ] ] gt 2 +ν +ν) 2 ) H+Hf ν) +ν D +ν 8+ν) +ν 2 t 42) 7

21 Indeed, in view of Lemma 6, at each iteration of method 49) criterion G /2 x t, x t+ ) is satisfied Therefore, this method forms a monotonically decreasing sequence of function values Hence, {x t t Fx ), and we conclude that Further, fx t+ ) fx ) fx t+ ), x t+ x fx t+ ) D fx k ) fx k+ ) 2 fx k+) x k x k+ 45) 2 2 +ν H+H f ν) +ν H+H f ν) ] ) +ν + D fx k+ ) fx +ν )) ] +ν +ν fx k+ ) = ] +ν)fxk+ ) fx )) +ν fx 2 +ν H+H f ν))d k+ ) fx )) Denoting δ k = +ν)fx k) fx )), we can rewrite this inequality as δ k δ k+ δk+ α with 4 +ν H+H f ν))d α = 47) +ν Note that H H f ν) Therefore, fx ) 27) 27) { H min Qx ; x) + x E +ν)) x x { min fx) + H+H f ν) x E +ν)) x x fx ) + H+H f ν) +ν)) D This means that δ Consequently, 2 +ν δα ) 2) /+ν) 2 < Since the estimate ) is monotone in δ m, we conclude that ) δ t δ α + δα t ) +δ α )+ν) +ν + t ) + )+ν) +ν = + t ) 4+ν) This inequality leads to the estimate 4) In order to prove inequality 42), let us fix a number k, k < t Then for any i, k i t, we have fx i ) fx i+ ) 2 fx i+) x i x i+ 45) 2 Summing up these inequalities for i = k,, t, we get 2 +ν H+H f ν) ] +ν t k + )gt ) +ν 2 +ν H+H f ν) +ν H+H f ν) ] +ν t i=k ] +ν ] +ν +ν fx i+ ) +ν fx i+ ) fx k ) fx t+ ) fx k ) f 8

22 Thus, g t ) +ν 4) ] H+Hf ν) +ν fx 2 k ) f +ν t k+ ] H+Hf ν) +ν D +ν +ν 2 t k ν) k ) ] +ν ] H+Hf ν) +ν D +ν +ν 8+ν) ] +ν 2 t k+)k ) +ν Considering now even and odd numbers for t, it is easy to prove that max k Z { t k + )k ) +ν : k t ) t 2 Thus, we obtain the bound 42) Let us consider now two versions of method 49) with adjustable estimates for the Hölder parameters We start from a version applicable to the general functions Adaptive method II with specific ν, ] General functions) Initialization Choose x E and H, H f ν)] Iteration t 4) a) Find the smallest integer i t such that the condition G /4 x t, T ν,2 i t Ht x t )) is satisfied b) Set x t+ = T ν,2 i tht x t ) and H t+ = 2 it H t Theorem 6 Let the objective function in problem 4) be below bounded by f and H f ν) < + for some ν, ] Then the method 4) finds a point x E with f x) δ in T 4fx ) f ) ] 4+ν)Hf ν) +ν ) +ν +ν)) δ 44) iterations The number of calls of oracle in this process does not exceed ) NF 2T + log H + log f ν) 2 H 45) Indeed, in view of Lemma 6, the parameters i t and H t in method 4) satisfy inequalities ) ) 2 i t H t H f ν), H t + 4 H f ν) 46) 9

23 Therefore, in view of Lemma 6, we have fx t ) fx t+ ) 4 fx t+) x t x t+ 45) 4 +ν 2 i th t+h f ν) ] +ν 46) ] +ν)) +ν 4 4+ν)H f ν) +ν fx t+ ) +ν fx t+ ) Assume that fx t ) δ for all t, t T Summing up the above inequalities for t =,, T, we get ] +ν)) +ν 4 26+ν)H f ν) T δ +ν T fx t ) fx t+ )) t= fx ) fx T ) fx ) f This gives us the upper bound 44) for the number of iterations The number of calls of oracle at tth iteration of method 4) is equal to i t + Therefore, NF = T i t + ) = T + T 2H log t+ 2 H t t= t= 46) ) 2T + log H + log f ν) 2 H = 2T + log 2 H T H Let us justify now a version of method 4) applicable to convex functions It differs 2

24 from 4) only by the stopping criterion at Step a) Adaptive method III with specific ν, ] Convex functions) Initialization Choose x E and H, H f ν)] Find the smallest integer i such that ft ν,2 i H x )) M ν,2 i H x ) and condition G /2 x, T ν,2 i H x )) is satisfied Set x = T ν,2 i H x ), H = 2 i H 47) Iteration t a) Find the smallest integer i t such that the condition G /2 x t, T ν,2 i t Ht x t )) is satisfied b) Set x t+ = T ν,2 i t Ht x t ) and H t+ = 2 it H t Theorem 7 Let function f in problem 4) be convex Then for any t we have fx t ) fx ) and for any t 2 we have g t 2 +ν ν) t ) +ν)hf ν)d +ν +ν)) ] +ν +ν)hf ν)d +ν)), 48) ] ] +ν) 2 ) 8+ν) +ν 2 t 49) Indeed, in view of Lemma 6, the parameters i t and H t in method 47) satisfy inequalities ) ) 2 i t H t H f ν), H t + 2 H f ν) 42) On the other hand, using the same arguments as in the beginning of the proof of Theorem 5, we conclude that in method 47) we have fx t ) fx t+ ) ] +ν)fxt+ ) fx )) +ν fx 2 +ν 2 i t H t+h f ν))d t+ ) fx )) 2

25 Denote δ t = +ν)fx t+) fx )) α = +ν 2 +ν 2 i t H t +H f ν))d The above inequality is then δ t δ t+ δ α t+ with Note that by the initialization procedure of method 47), we have fx ) 27) 27) { min Qx ; x) + 2i H x E +ν)) x x { min fx) + 2i H +H f ν) x E +ν)) x x fx ) + 2i H +H f ν) +ν)) D This means that δ Hence, from the proof of Theorem 5, we get the bound 2 +ν ) δ α Using Lemma we obtain, as in Theorem 5, ] +ν δ t + t ) 4+ν) Taking into account the bounds 42), we obtain 48) The proof of the bound 49) is very similar to the proof of inequality 42) Let us fix a number k, k < t Then for any s, k s t, we have fx s ) fx s+ ) 45) 2 +ν 2 i sh s +H f ν) ] +ν 42) ] +ν)) +ν 2 +ν)h f ν) Summing up these inequalities for s = k,, t, we get Thus, ] ] +ν)) +ν 2 +ν)h f ν) t k + )gt ) +ν +ν)) +ν 2 +ν)h f ν) gt ) +ν)hf ν) +ν 2 48) 2 2 +ν)) +ν fx s+ ) +ν fx s+ ) t s=k +ν fx s+ ) fx k ) fx t+ ) fx k ) f ] +ν fx k ) fx ) t k+ ] +ν)hf ν)d +ν +ν +ν)) t k+ ] +ν)hf ν)d +ν +ν)) +ν 8+ν) ν) k ) ] +ν ] +ν t k+)k ) +ν Choosing k 2 t, we get t k + )k )+ν ) t 2 This lower bound justifies inequality 49) 22

26 5 Universal methods for decreasing the norm of gradient Let us start with the following auxiliary result Note that its conditions are identical to conditions of Lemma 4 Lemma 7 Let x + = T,H x) for some x E and H > If for some δ > and ν, ] we have ] 2 CHf ν) +ν fx + ) δ and H ) ν +ν +ν)) δ, 5) with constant C 6, then f x) fx + ) H 2 x x + 52) If f is convex, then f x) fx + ) H x x + 5) Denote r = x + x Then f x) fx + ) 27) f x), x x f x)x + x), x + x H f ν)r +ν)) 54) 22) = 2 2 f x)x + x), x + x + H 2 r H f ν) +ν)) r Using matrix inequality 42) with ν =, we get f x) fx + ) H 4 r H 7) f ν) +ν)) r H 4 r H C r 2 Hr If f is convex, then from inequality 54) we obtain f x) fx + ) H 2 r H 7) f ν) +ν)) r H 2 r H C r Hr Let us introduce now the following stopping criterion { U κ x, H) fx) ft,h x)) κ ft H /2,H x)) /2, 55) where the parameter κ belongs to the interval, ) Note that this criterion does not depend on Hölder parameter ν, ] Denote ] 2 6Hf ν) +ν ξ ν δ) = ) ν +ν +ν)) δ, ˆξδ) = inf ξ νδ) ν In view of inequalities 5), 52), and 5), the following statement is valid 2

27 Corollary 2 For non-convex functions, if H ξ ν δ) and ft,h x)) δ, the criterion U /2 x, H) is satisfied If function is convex, the same conditions are sufficient for satisfying U / x, H) For U /2, it is enough to combine inequalities 5) and 52) For U / we apply 5) and 5) Same as in Section, the simplest universal method for decreasing the norm of the gradient is based on cubic regularization Universal Method II General functions) Initialization Choose x E and H, ˆξδ) ] Iteration t a) Find the smallest integer i t such that either criterion 56) U /2 x t, 2 i t H t ) is satisfied, or ft,2 i tht x t )) δ b) Set x t+ = T,2 i t Ht x t ) and H t+ = 2 i t H t c) If fx t+ ) δ, then Stop Theorem 8 Let objective function f in problem 4) be below bounded by value f and H f ν) < + for some ν, ] Assume that for all t, t T +, we have fx t ) δ Then the number of such steps of method 56) is bounded as follows T 2 ] 6Hf ν) +ν 2fx ) f ) ) +ν +ν)) δ 57) The number of calls of the oracle in this scheme is bounded as follows: ] 2 6Hf ν) NF 2T + log 2 +ν)) Indeed, in view of Corollary 2, we have +ν δ ) ν +ν ) log 2 H 58) 2 i t H t 2ξ ν δ), t =,, T, H t ξ ν δ), t =,, T + 59) 24

28 Therefore, in view of definition 55), we have ] /2 fx t ) fx t+ ) 2 2 i t H t δ /2 59) /2 2 2ξ νδ)] δ /2 Summing up these inequalities for t =,, T, we get T 2fx ) fx T )) 2ξ ν δ)] /2 ) /2 δ ] 2 ] 6Hf ν) +ν = 2fx ) f ) 2 ) ν /2 +ν +ν)) δ δ = 2 ] 6Hf ν) +ν 2fx ) f ) ) +ν +ν)) δ We can bound the number of calls of oracle in this scheme as follows The number of calls of oracle at tth iteration of method 56) is equal to i t + Therefore, NF = T i t + ) = T + T 2H log t+ 2 H t t= t= 59) ] 2 6Hf ν) 2T + log 2 +ν)) +ν δ ) ν +ν ) /2 = 2T + log 2 H T H ) log 2 H Let us look now at the universal method for finding a point with small norm of the gradient of convex function Universal Method III Convex functions) Initialization Choose x E and H, ˆξδ) ] Iteration t a) Find the smallest integer i t such that either criterion 5) U / x t, 2 i t H t ) is satisfied, or ft,2 i t Ht x t )) δ b) Set x t+ = T,2 i t Ht x t ) and H t+ = 2 i t H t c) If fx t+ ) δ, then Stop 25

29 Theorem 9 Let the objective function in problem 4) be convex and H f ν) < + for some ν, ] Assume that the sequence of points {x t t is generated by method 5) and fx t ) δ for all t, t T + Denote by m the first iteration number such that fx m+ ) f 6ξ ν δ)d Then { m ln/2) ln max fx, log ) f 2, 5) 8ξ ν δ)d and for all k m we have fx k+ ) f 42ξ νδ)d k m) 2 52) At the same time, if T = m + s for some integer s, then g T def 2 = min fx k) ξ ν δ) 2D kt + T m) 5) Moreover, the maximal number of such steps T is bounded from above as follows: Indeed, in view of Corollary 2, we have ) 6Hf ν) +ν T m + 2D +ν))δ 54) 2 i kh k 2ξ ν δ), k =,, T, H k ξ ν δ), k =,, T + 55) Therefore, in view of definition 55), we have fx k ) fx k+ ) 55) ] /2 2 i k fxk+ ) /2 H k 55) 28) 2ξ νδ) 2ξ νδ)d 2ξ νδ)d ] /2 fxk+ ) /2 ] /2 fxk+ ) x k+ x ) /2 ] /2 fxk+ ) f ) /2 56) Denoting now δ k = fx k+) f, we see that this sequence satisfies condition ) with 8ξ νδ)d α = 2 Since m is the first iteration number such that δ m 2, in view of inequality 2) we have { m ln/2) ln max{, log 2 δ ln/2) ln max fx, log ) f 2 8ξ νδ)d 26

30 At the same time, in view of inequality ), for k > m we get the following rate of convergence: fx k+ ) f 8ξ ν δ)d Thus, we get inequality 52) Further, let T = m + s for some s Then 42ξ νδ)d 4s 2 52) fx m+2s+ ) f = fx T + ) f + ] ξ k m < ν δ)d k m) 2 T k=m+2s+ fx k ) fx k+ )) 56) s 2ξ ν δ)] /2 g T ) /2 Therefore, g T 5 2 s ξ /2 ν δ)d ] 2/ = 5 2) / T m ) 2 ξν δ)d 2 < 2D T m) 2 ξν δ) Since g T δ, we conclude that T m + 2D δ ξ νδ) ] ] /2 6Hf ν) +ν = m + ) 2D +ν +ν)) δ Comparing the efficiency bound 54) with the rate of convergence 49) of the adaptive method 47) with known value of the smoothness parameter ν, we can see that the universal method 5) ensures the same dependence on the accuracy δ as method 47) Since the value ν is not employed in the scheme of method 5), the bound 54) can be strengthen as follows: ) 6Hf ν) +ν T m + 2D inf ν +ν))δ 57) 27

31 References ] HS Dollar, NIM Gould, and DP Robinson On solving trust-region and other regularised subproblems in optimization Technical Report RAL-TR-29-, Rutherford Appleton Laboratory 29) 2] C Cartis, N I M Gould, and Ph L Toint Adaptive cubic regularisation methods for unconstrained optimization Mathematical Programming, 272), ) ] C Cartis, N I M Gould, and Ph L Toint On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization SIOPT, 2), ) 4] FE Curtis, DP Robinson, and M Samadi A trust-region algorithm with a worstcase iteration complexity of Oϵ /2 ) for nonconvex optimization Mathematical Programming, DOI: 7/s ) 5] Y Hsia, R Shew, R, and Y Yuan On the p-regularized trust region subproblem arxiv: ) 6] JM Martínez, and M Raydan Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization Optimization Online 25) 7] Yu Nesterov Universal gradient methods for convex optimization problems Mathematical Programming, 52-2), ) 8] Yu Nesterov Accelerating the cubic regularization of Newton s method on convex problems Mathematical Programming, 2), ) 9] Yu Nesterov, B Polyak Cubic regularization of Newton s method and its global performance Mathematical Programming, 8), ) 28

Cubic regularization of Newton s method for convex problems with constraints

Cubic regularization of Newton s method for convex problems with constraints CORE DISCUSSION PAPER 006/39 Cubic regularization of Newton s method for convex problems with constraints Yu. Nesterov March 31, 006 Abstract In this paper we derive efficiency estimates of the regularized

More information

Accelerating the cubic regularization of Newton s method on convex problems

Accelerating the cubic regularization of Newton s method on convex problems Accelerating the cubic regularization of Newton s method on convex problems Yu. Nesterov September 005 Abstract In this paper we propose an accelerated version of the cubic regularization of Newton s method

More information

Complexity bounds for primal-dual methods minimizing the model of objective function

Complexity bounds for primal-dual methods minimizing the model of objective function Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing

More information

Universal Gradient Methods for Convex Optimization Problems

Universal Gradient Methods for Convex Optimization Problems CORE DISCUSSION PAPER 203/26 Universal Gradient Methods for Convex Optimization Problems Yu. Nesterov April 8, 203; revised June 2, 203 Abstract In this paper, we present new methods for black-box convex

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum

More information

Nonsymmetric potential-reduction methods for general cones

Nonsymmetric potential-reduction methods for general cones CORE DISCUSSION PAPER 2006/34 Nonsymmetric potential-reduction methods for general cones Yu. Nesterov March 28, 2006 Abstract In this paper we propose two new nonsymmetric primal-dual potential-reduction

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Math. Program., Ser. B 2013) 140:125 161 DOI 10.1007/s10107-012-0629-5 FULL LENGTH PAPER Gradient methods for minimizing composite functions Yu. Nesterov Received: 10 June 2010 / Accepted: 29 December

More information

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Primal-dual Subgradient Method for Convex Problems with Functional Constraints Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual

More information

An example of slow convergence for Newton s method on a function with globally Lipschitz continuous Hessian

An example of slow convergence for Newton s method on a function with globally Lipschitz continuous Hessian An example of slow convergence for Newton s method on a function with globally Lipschitz continuous Hessian C. Cartis, N. I. M. Gould and Ph. L. Toint 3 May 23 Abstract An example is presented where Newton

More information

Numerical Experience with a Class of Trust-Region Algorithms May 23, for 2016 Unconstrained 1 / 30 S. Smooth Optimization

Numerical Experience with a Class of Trust-Region Algorithms May 23, for 2016 Unconstrained 1 / 30 S. Smooth Optimization Numerical Experience with a Class of Trust-Region Algorithms for Unconstrained Smooth Optimization XI Brazilian Workshop on Continuous Optimization Universidade Federal do Paraná Geovani Nunes Grapiglia

More information

Optimal Newton-type methods for nonconvex smooth optimization problems

Optimal Newton-type methods for nonconvex smooth optimization problems Optimal Newton-type methods for nonconvex smooth optimization problems Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint June 9, 20 Abstract We consider a general class of second-order iterations

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

On the complexity of an Inexact Restoration method for constrained optimization

On the complexity of an Inexact Restoration method for constrained optimization On the complexity of an Inexact Restoration method for constrained optimization L. F. Bueno J. M. Martínez September 18, 2018 Abstract Recent papers indicate that some algorithms for constrained optimization

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente May 3, 200 Abstract In this paper we prove that direct search of directional type shares the worst case complexity bound of steepest descent when sufficient

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

Evaluation complexity for nonlinear constrained optimization using unscaled KKT conditions and high-order models by E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos and Ph. L. Toint Report NAXYS-08-2015

More information

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity Mohammadreza Samadi, Lehigh University joint work with Frank E. Curtis (stand-in presenter), Lehigh University

More information

Cubic regularization of Newton method and its global performance

Cubic regularization of Newton method and its global performance Math. Program., Ser. A 18, 177 5 (6) Digital Object Identifier (DOI) 1.17/s117-6-76-8 Yurii Nesterov B.T. Polyak Cubic regularization of Newton method and its global performance Received: August 31, 5

More information

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

Complexity of gradient descent for multiobjective optimization

Complexity of gradient descent for multiobjective optimization Complexity of gradient descent for multiobjective optimization J. Fliege A. I. F. Vaz L. N. Vicente July 18, 2018 Abstract A number of first-order methods have been proposed for smooth multiobjective optimization

More information

Parameter Optimization in the Nonlinear Stepsize Control Framework for Trust-Region July 12, 2017 Methods 1 / 39

Parameter Optimization in the Nonlinear Stepsize Control Framework for Trust-Region July 12, 2017 Methods 1 / 39 Parameter Optimization in the Nonlinear Stepsize Control Framework for Trust-Region Methods EUROPT 2017 Federal University of Paraná - Curitiba/PR - Brazil Geovani Nunes Grapiglia Federal University of

More information

An introduction to complexity analysis for nonconvex optimization

An introduction to complexity analysis for nonconvex optimization An introduction to complexity analysis for nonconvex optimization Philippe Toint (with Coralia Cartis and Nick Gould) FUNDP University of Namur, Belgium Séminaire Résidentiel Interdisciplinaire, Saint

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente October 25, 2012 Abstract In this paper we prove that the broad class of direct-search methods of directional type based on imposing sufficient decrease

More information

Primal-dual subgradient methods for convex problems

Primal-dual subgradient methods for convex problems Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different

More information

This manuscript is for review purposes only.

This manuscript is for review purposes only. 1 2 3 4 5 6 7 8 9 10 11 12 THE USE OF QUADRATIC REGULARIZATION WITH A CUBIC DESCENT CONDITION FOR UNCONSTRAINED OPTIMIZATION E. G. BIRGIN AND J. M. MARTíNEZ Abstract. Cubic-regularization and trust-region

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical

More information

Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization

Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization J. M. Martínez M. Raydan November 15, 2015 Abstract In a recent paper we introduced a trust-region

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

A trust region algorithm with a worst-case iteration complexity of O(ɛ 3/2 ) for nonconvex optimization

A trust region algorithm with a worst-case iteration complexity of O(ɛ 3/2 ) for nonconvex optimization Math. Program., Ser. A DOI 10.1007/s10107-016-1026-2 FULL LENGTH PAPER A trust region algorithm with a worst-case iteration complexity of O(ɛ 3/2 ) for nonconvex optimization Frank E. Curtis 1 Daniel P.

More information

Spectral gradient projection method for solving nonlinear monotone equations

Spectral gradient projection method for solving nonlinear monotone equations Journal of Computational and Applied Mathematics 196 (2006) 478 484 www.elsevier.com/locate/cam Spectral gradient projection method for solving nonlinear monotone equations Li Zhang, Weijun Zhou Department

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

Algorithms for constrained local optimization

Algorithms for constrained local optimization Algorithms for constrained local optimization Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Algorithms for constrained local optimization p. Feasible direction methods Algorithms for constrained

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Optimisation in Higher Dimensions

Optimisation in Higher Dimensions CHAPTER 6 Optimisation in Higher Dimensions Beyond optimisation in 1D, we will study two directions. First, the equivalent in nth dimension, x R n such that f(x ) f(x) for all x R n. Second, constrained

More information

Iteration and evaluation complexity on the minimization of functions whose computation is intrinsically inexact

Iteration and evaluation complexity on the minimization of functions whose computation is intrinsically inexact Iteration and evaluation complexity on the minimization of functions whose computation is intrinsically inexact E. G. Birgin N. Krejić J. M. Martínez September 5, 017 Abstract In many cases in which one

More information

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL)

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL) Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization Nick Gould (RAL) x IR n f(x) subject to c(x) = Part C course on continuoue optimization CONSTRAINED MINIMIZATION x

More information
