Estimate sequence methods: extensions and approximations

Size: px
Start display at page:

Download "Estimate sequence methods: extensions and approximations"

Transcription

1 Estimate sequence methods: extensions and approximations Michel Baes August 11, 009 Abstract The approach of estimate sequence offers an interesting rereading of a number of accelerating schemes proposed by Nesterov [Nes03], [Nes05], and [Nes06]. It seems to us that this framewor is the most appropriate descriptive framewor to develop an analysis of the sensitivity of the schemes to approximations. We develop in this wor a simple, self-contained, and unified framewor for the study of estimate sequences, with which we can recover some accelerating scheme proposed by Nesterov, notably the acceleration procedure for constrained cubic regularization in convex optimization, and obtain easily generalizations to regularization schemes of any order. We analyze carefully the sensitivity of these algorithms to various types of approximations: partial resolution of subproblems, use of approximate subgradients, or both, and draw some guidelines on the design of further estimate sequence schemes. 1 Introduction The concept of estimate sequences was introduced by Nesterov in 1983 [Nes83] to define the provably fastest gradient-type schemes for convex optimization. This concept, in spite of its conceptual simplicity, has not attracted a lot of attention during the 0 first years of its existence. Some interest for this concept resurrected in 003, when Nesterov wrote his seminal paper on smoothing techniques [Nes05]. Indeed, the optimization method Nesterov uses on a smoothed approximation of the convex non-smooth objective function can be seen as an estimate sequence method. These estimate sequence methods play a crucial role in further papers of Nesterov [Nes06, Nes07]. Auslender and Teboulle [AT06] managed to extend the estimate sequence method, stated in Section. of [Nes03] for squared Euclidean norms as prox-functions, to general Bregman distances at the cost of a supplementary technical assumption on the domain of these Bregman distances. Several other papers propose generalizations of Nesterov s smoothing algorithm, and can be interpreted in the light of the estimate sequence concept or sight generalizations of it. For instance, M. Baes is with the Institute for Operations Research, ETH, Rämistrasse 101, CH-809 Zürich, Switzerland. Part of this wor has been done while the author was at the Department of Electrical Engineering (ESAT), Research Group SCD-SISTA and the Optimization in Engineering Center OPTEC, Katholiee Universiteit Leuven, Kasteelpar Arenberg 10, B-3001 Heverlee, Belgium. Michel.Baes@ifor.math.ethz.ch. 1

2 Lan, Lu, and Monteiro [LLM] propose an accelerating strategy for Nesterov s smoothing, which can be interpreted as a simple restarting procedure of an estimate sequence scheme albeit their algorithm is not an estimate sequence scheme as we define it in the present paper. D Aspremont investigates another interesting aspect of Nesterov s algorithm: its robustness with respect to incorrect data [d A08], more specifically to incorrect computation of the objective s gradient. In this paper, we show how we can benefit from the estimate sequence framewor to carry out an analysis of the effect not only of approximate gradient on general estimate sequence schemes, but also of approximate resolution of subproblems one has to solve at every iteration. This result can support a strategy for reducing the iteration cost by solving only coarsely these subproblems. The purpose of this paper is to provide a simple framewor for the study of estimate sequences schemes, and to demonstrate its power in various ways. First, we generalize the accelerated cubic regularization scheme of Nesterov to m-th regularization schemes (cubic regularization represents the case where m = ), with the fastest global convergence properties seen so far: these schemes require not more than O((1/ɛ) m+1 ) iterations to find an ɛ-approximation of the solution. Also, our results allows us to reinterpret the accelerated cubic regularization schemes of Nesterov, and to improve some hidden constants in his complexity analysis. Second, we show how accurately subgradients and the solution of intermediate problems have to be computed in order to guarantee that no error propagation occurs from possible approximations during the course of iterations. The essence of our approach is crystallized in Lemma.3, which provides the sole condition to chec for proving the convergence of the scheme, determine its speed, and investigating every extension we study in this paper. Interestingly, we can interpret Lan, Lu, and Monteiro s restarting scheme as an application of Lemma.3, albeit their scheme does not fall into the estimate sequence paradigm. The paper is organized as follows. In Section, we recall the concept of estimate sequence, and we establish Lemma.3, which plays a central role in our paper. As in [Nes03], we particularize in the next section the general estimate sequence method to smooth problems. However, we use a slightly more general setting, allowing ourselves non-euclidean norms. We show in Section 4 a simplified presentation of the fast cubic regularization for convex constrained problems developed by Nesterov in [Nes07]. This presentation allows us to extend the idea of cubic regularization to m-th regularization, and to obtain, at least theoretically the fastest blac-box methods obtained so far, provided that the subproblems one must solve at every iteration are simple enough. Interestingly, we can improve some constants in the complexity results when we focus on unconstrained problem, due to the much more tractable optimality condition. Section 5 displays a careful sensitivity analysis of the various algorithms developed in the paper with respect to different ind of approximations. Section 6 shows briefly how Lan, Lu, and Monteiro s restarting scheme can be analyzed easily with the sole Lemma.3. Finally, the Appendix contains some useful technical results. Foundation of estimate sequence methods Consider a general nonlinear optimization problem inf x Q f(x), where Q R n is a closed set, and f is a proper convex function on Q. 
We assume that f attains its infimum f in Q, and we denote

3 by X the set of its minimizers. Later on, we will introduce more assumptions on f and Q, such as convexity, Lipschitz regularity, and differentiability a.o. An estimate sequence (see Chapter in [Nes03]) is a sequence of convex functions (φ ) 0 and a sequence of positive number (λ ) 0 satisfying: lim λ = 0, and φ (x) (1 λ )f(x) + λ φ 0 (x) for all x Q, 1. We also need to guarantee the inequality φ 0 (x ) f for an element x of X. Since we obviously do not have access to any point of the set X before any computation starts, the latter condition has to be relaxed, e.g. to min x Q φ 0 (x) f(y) for a point y Q. The first proposition indicates how estimate sequences can be used for solving an optimization problem, and, if implementable, how fast the resulting procedure would converge. Proposition.1 Suppose that the sequence x 0, x 1, x,... of Q satisfies f(x ) min x Q φ (x). Then f(x ) f λ (φ 0 (x ) f ) for every 1. Proof It suffices to write: f(x ) min x Q φ (x) min x Q f(x) + λ (φ 0 (x) f(x)) f(x ) + λ (φ 0 (x ) f(x )). The next proposition describes a possible way of constructing an estimate sequence. It is a slight extension of Lemma.. in [Nes03]. Proposition. Let φ 0 : Q R be a convex function such that min x Q φ 0 (x) f. Let (α ) 0 (0, 1) be a sequence whose sum diverges. Suppose also that we have a sequence (f ) 0 of functions from Q to R that underestimate f: f (x) f(x) for all x Q and all 0. We define recursively λ 0 := 1, λ +1 := λ (1 α ), and φ +1 (x) := (1 α )φ (x) + α f (x) = λ +1 φ 0 (x) + ) for all 0. Then ((φ ) 0 ; (λ ) 0 is an estimate sequence. Proof Since ln(λ +1 ) = ln(1 α j ) j=0 j=0 α j i=0 λ +1 α i λ i+1 f i (x), (1) for each 0, the sequence (λ ) 0 converges to zero, as the sum of α j s diverges. Let us now chec that φ (x) (1 λ )f(x) + λ φ 0 (x) for 1. For = 1, this condition is immediately verified. Using now a recursive argument, φ +1 (x) = (1 α )φ (x) + α f (x) (1 α )φ (x) + α f(x) (1 α )(1 λ )f(x) + (1 α )λ φ 0 (x) + α f(x) = (1 λ +1 )f(x) + λ +1 φ 0 (x), 3

4 which proves that we have built an estimate sequence. The second equality in (1) is obtained by an elementary recurrence on. All the estimate sequence methods we describe in this text are constructed on the basis of this fundamental proposition. In a nutshell, each of these methods can be defined by the specification of four elements: a function φ 0 that is easy to minimize on Q and that is bounded from below by f(y) for a y Q; a sequence of weights α in ]0, 1[; a strategy for constructing the successive lower estimates f of f. For convex objective functions, affine underestimates constitute the most natural choice as they are cheap to build. For strongly convex functions, we can also thin of quadratic lower estimates (see Section..4 in [Nes03]). A way of constructing, preferably very cheaply, points x that satisfy the inequality prescribed in Proposition.1, namely f(x ) min x Q φ (x). In view of Lemma 8.1, the existence of a positive constant β such that α n /λ +1 β proves that the sequence (λ ) 0 decreases to zero as fast as O(1/( n β)), i.e. the resulting algorithm requires O((1/ɛ) 1/ ) iterations, where ɛ > 0 is the desired accuracy. The following lemma concentrates on the case where the feasible set Q as well as the objective function f are convex, and where the underestimates f of f are affine. It provides an intermediate inequality that we will use in the construction of the sequence (x ) 0 and for exploring various extensions and adaptations of the estimate sequence scheme. We denote the subdifferential of f at x as f(x) (see e.g. [Roc70], Section 3). ) Lemma.3 We are given an estimate sequence ((φ ) 0 ; (λ ) 0 for the convex problem min f(x), x Q constructed according to Proposition. using affine underestimates of f: f (x) := f(y ) + g(y ), x y for some y Q, where g(y ) f(y ). We also assume that φ 0 is continuously differentiable. Suppose that for every 0, we have a function χ : Q Q R + such that χ (x, y) = 0 implies x = y, and such that: φ 0 (x) φ 0 (v ) + φ 0(v ), x v + χ (x, v ) for all x Q, 0, () where v is the minimizer of φ on Q. We denote by (x ) 0 a sequence satisfying f(x ) φ (v ). Then φ +1 (v +1 ) f(y )+ g(y ), (1 α )x +α v y +min x Q α g(y ), x v + λ +1 χ (x, v )} (3) for every 0. 4

5 Proof Observe first that the condition () can be rewritten as φ (x) φ (v ) + φ (v ), x v + λ χ (x, v ) for all x Q, 0. (4) Indeed, in view of Proposition., we have φ (x) = λ φ 0 (x) + 1 λ α i i=0 λ i+1 f i (x) = λ φ 0 (x) + l (x), where l (x) is an affine function. Inequality (4) can be rewritten as: λ φ 0 (x) + l (x) λ φ 0 (v ) + λ φ 0(v ), x v + l (x) + λ χ (x, v ), and results immediately from (). Now, fixing 0 and x Q, we can write successively: φ +1 (x) = (1 α )φ (x) + α f (x) (1 α ) (φ (v ) + φ (v ), x v + λ χ (x, v )) + α (f(y ) + g(y ), x y ) (1 α ) (f(x ) + λ χ (x, v )) + α (f(y ) + g(y ), x y ) (1 α ) (f(y ) + g(y ), x y + λ χ (x, v )) + α (f(y ) + g(y ), x y ) = f(y ) + g(y ), (1 α )x + α v y + α g(y ), x v + λ +1 χ (x, v ). The first inequality comes from (4). The second one uses the fact that v is a minimizer of φ on Q, so that φ (v ), x v 0 for each x Q. The third one comes form g(y ) f(y ). It remains to minimize both sides on Q. The inequality (3) suggests at least two different lines of attac to construct the next approximation x +1 of an optimum x. First, if we can ensure that the sequence (y ) 0 satisfies at every 0 and at every x Q the inequality: g(y ), (1 α )x + α v y + α g(y ), x v + λ +1 χ (x, v )} 0, it suffices to set x +1 := y. Another possibility is to build a sequence (y ) 0 for which the inequality g(y ), (1 α )x + α v y 0 holds for every 0 for instance by letting y := (1 α )x + α v. Then, the inequality of the above lemma reduces to φ +1 (v +1 ) f(y ) + min x Q α g(y ), x v + λ +1 χ (x, v )}. In some situations, constructing a point x +1 for which f(x +1 ) is lower than the above right-hand side can be done very cheaply by an appropriate subgradient-lie step. More details are given in the subsequent sections. Finally, the above lemma suggests a new type of scheme, where one only ensures that the inequality (3) is maintained at every iteration regardless of the fact that it originates from the construction of an estimate sequence. Under some conditions on χ, the convergence speed of the resulting scheme also relies on how fast we can drive the sequence (λ ) 0 to 0. More details are given in Section 6. 5

6 3 Strongly convex estimates for convex constrained optimization In this setting, the function φ 0 we choose is a strongly convex function, not necessarily quadratic. Let us fix a norm of R n. We assume that the objective function f is differentiable and has a Lipschitz continuous gradient with constant L for the norm : x, y Q, f(y) f(x) f (x), y x L y x. Equivalently, denoting by the dual norm of, the inequality f (y) f (x) L y x holds for every x, y Q. Observe that we do not assume the strong convexity of f. The function φ 0 is constructed from a prox-function d for Q. A prox-function d is a nonnegative convex function minimized at a point x 0 relint Q, and for which d(x 0 ) = 0. Also, a prox-function is supposed to be strongly convex: there exists a constant σ > 0 for which every x, y Q and λ [0, 1], we have: λd(x) + (1 λ)d(y) d(λx + (1 λ)y) + σ λ(1 λ) x y. If the function d is differentiable, this condition can be rewritten as (see e.g Theorem IV in [HUL93a]): x, y Q, d (x) d (y), x y σ x y. The prox-function d is a crucial building tool for an estimate sequence, and it should be chosen carefully. Indeed, at each step, we will have to solve one (or two) problem(s) of the form min x Q d(x) + l(x), where l is a linear mapping. Sometimes (in cubic and m-th regularization schemes, see in Section 4, we even need to solve min x Q d(x) + p(x), where p is a polynomial function. Having an easy access to its minimizer is a sine qua non requirement for the subsequent algorithm to wor efficiently. The best instances for this scheme are of course those for which this minimizer can be computed analytically. Interestingly enough, the set of these instances is not reduced to a few trivial ones (see [Nes05, NP06]). These ingredients allow us to define the first function of our estimate sequence: φ 0 (x) := f(x 0 ) + L σ d(x), which is L-strongly convex for the norm. Also min x Q φ 0 (x) = f(x 0 ) f(x ). Moreover, we have for every x Q: φ 0 (x) φ 0 (x 0 ) φ 0(x 0 ), x x 0 L x x 0 f(x) f(x 0 ) f (x 0 ), x x 0. Therefore, φ 0 (x) f(x) f (x 0 ), x x 0. The underestimates f of f are chosen to be linear underestimates: f (x) := f(y ) + f (y ), x y. 6

7 Following the construction scheme of Proposition., we have φ +1 (x) = λ +1 ( f(x 0 ) + L σ d(x) ) + i=0 λ +1 α i λ i+1 (f(y i ) + f (y i ), x y i ) for all 0. As the function φ is strongly convex with constant λ L for the norm it has a unique minimizer. We denote this minimizer by v. At iteration +1, we need to construct a point x +1 R n such that f(x +1 ) φ +1 (v +1 ), given that f(x ) φ (v ). There are many ways to achieve this goal; we consider here two possibilities hinted by Lemma.3. Each of them are defined by a nonnegative function χ for which χ (x, x) = 0 and d(x) d(v ) + d (v ), x v + σ L χ (x, v ) for all x Q, 0. A first possibility is to choose χ (x, y) := L x y /; the above inequality reduces to the σ-strong convexity of d. We can also consider a suitable multiple of the Bregman distance induced by d, that is, χ (x, y) := γ (d(x) d(y) d (y), y x ). The required inequality is ensured as soon as σ/l γ > 0. The potential advantage of this second approach resides in the fact that, in the resulting algorithm, the computation of the next point x +1 requires to solve a problem of the same type than for computing the minimizer of φ, that is, minimizing a d plus a linear function over Q. 3.1 When the lower bound on φ is a norm Let us consider here the case where χ (x, y) := L x y /, where is a norm on R n, not necessarily Euclidean lie in [Nes03]. According to Lemma.3, we can set y := (1 α )x + α v. With this choice, we obtain: φ +1 (v +1 ) f(y ) + min x Q α f (y ), x v + λ } +1L x v (5) for every 0. The minimization problem on the right-hand side is closely related to the standard gradient method. We denote by x Q (y; h) the minimizer of f (y), x y + 1 x y h over Q. If the considered norm is Euclidean, this minimizer is simply the Euclidean projection of a gradient step over Q: x Q (y; h) = arg min x Q x (y hf (y)). 7

8 Observe that: m := min α f (y ), x v + λ } +1L x v x Q = min f (y ), α x + (1 α )x y + λ } +1L x Q α α x + (1 α )x y min f (y ), x y + λ } +1L x Q α x y. because α Q + (1 α )x Q in view of the convexity of Q. Thus: m min f (y ), x y + L } x y, x Q provided that λ +1 /α 1. Hence, we can bound φ +1(v +1 ) from below by: f(y ) + f (y ), x Q (y ; 1/L) y + L x Q(y ; 1/L) y. By Lipschitz continuity of the gradient of f, this quantity is larger than f(x Q (y ; 1/L)). Therefore, setting x +1 := x Q (y ; 1/L) is sufficient to ensure the required decrease of the objective. However, this choice assumes that optimization problems of the form min x Q x z + l(x), where l is a linear function, are easy to solve as well. An alternative is presented in the next subsection, where only optimization problems of the form min x Q d(x) + l(x) should be solved at every iteration. Algorithm 3.1 Assumptions: f has a Lipschitz continuous gradient with constant L for the norm ; the set Q is closed, convex, and has a nonempty interior. Choose x 0 Q, set v 0 := x 0 and λ 0 := 1. Choose a strongly convex function d with strong convexity constant σ > 0 for the norm, minimized in x 0 Q, and that vanishes in x 0 and set φ 0 := f(x 0 ) + Ld(x)/σ. For 0, Find α such that α = (1 α )λ. Set λ +1 := (1 α )λ. Set y := α v + (1 α )x. Set x +1 := x Q (y ; 1/L) = arg min x Q f (y ), x y + L x y }. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End Assuming that α = λ +1 = (1 α )λ, we obtain in view of Lemma 8.1 and Proposition.1 a complexity of ( 1 f(x 0 ) f ɛ + L ) σ d(x ) iterations. There is a simple variant for the computation of the sequences (λ ) 0 and (α ) 0. The requirement λ α /(1 α ) is satisfied for every 0 when λ = 4/( + ) for. With this choice, we obtain α = 1 λ +1 λ = ( + 5)/( + 3) and α /λ +1 = (1 1/( + 6)) [5/36, 1[. 8

9 3. When the lower bound on φ is a Bregman distance In this setting, we define χ (x, z) := γ (d(x) d(z) d (z), x z ), where γ is a positive coefficient that will be determined in the course of our analysis. Observe that, in contrast with [AT06], we do not assume anything on the domain of d, except that it contains Q. For a fixed z Q, the function x χ (x, z) is strongly convex with constant σγ for the norm. Even better, we have for every x, y Q that: χ (x, y) γ σ x y. (6) If the coefficients γ are bounded from above by L/σ, we can apply Lemma.3 because the inequality () is satisfied in view of the Lipschitz continuity of f. Therefore, with: we have: for every 0. Let us denote: and chec that: y := (1 α )x + α v, φ +1 (v +1 ) f(y ) + min x Q α f (y ), x v + λ +1 χ (x, v )} w := arg min x Q α f (y ), x v + λ +1 χ (x, v )}, x +1 := α (w v ) + y = α w + (1 α )x yields a sufficient decrease of the objective. We can write: f(y ) + min x Q α f (y ), x v + λ +1 χ (x, v )} = f(y ) + α f (y ), w v + λ +1 χ (w, v ) f(y ) + α f (y ), w v + λ +1γ σ w v = f(y ) + f (y ), x +1 y + λ +1γ σ α x +1 y. Now, if λ +1 /α 1 and γ L/σ, the right-hand side is clearly larger than f(x +1 ) in view of the Lipschitz continuity of the gradient of f. The corresponding algorithm can be written as follows. Algorithm 3. Assumptions: f has a Lipschitz continuous gradient with constant L for the norm ; the set Q is closed, convex, and has a nonempty interior. Choose x 0 Q, set v 0 := x 0 and λ 0 := 1. Choose a strongly convex function d with strong convexity constant σ > 0 for the norm, minimized in x 0 Q, and that vanishes in x 0 and set φ 0 := f(x 0 ) + Ld(x)/σ. For 0, Find α such that α = (1 α )λ. Set λ +1 := (1 α )λ. Set y := α v + (1 α )x. Set w := arg min x Q α f (y ), x v + λ +1 χ (x, v )}. 9

10 Set x +1 := α w + (1 α )x. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End 4 Cubic regularization and beyond 4.1 Cubic regularization for constrained problems Cubic regularization has been developed by Nesterov and Polya in [NP06], and further extended by Nesterov in [Nes06, Nes07] and by Cartis, Gould, and Toint [CGT07]. We derive in this section a slight modification of Nesterov s algorithm for constrained convex problems and establish its convergence speed using an alternative proof to his, which allows us to further extend the accelerated algorithm to other types of regularization. We consider here a convex objective function f that is twice continuously differentiable and that has a Lipschitz continuous Hessian with respect to a matrix induced norm, that is, a norm of the form a = Ga, a 1/, where G is a positive definite Gram matrix. In other words, we assume that there exists M > 0 such that for every x, y dom f, we can write: f (y) f (x) M y x. The matrix norm used above is the one induced by the norm we have chosen, that is, A := max x =1 Ax. A consequence of the Lipschitz continuity of the Hessian (see Lemma 1 in [NP06] for a proof) reads: f (y) f (x) f (x)(y x) M y x x, y dom f. (7) The optimization problem of interest here consists in minimizing f on a closed convex set Q R n with a nonempty interior. Let us fix a starting point x 0 Q. We initialize the construction of our estimate sequence by: φ 0 (x) := f(x 0 ) + M 6 x x 0 3. As in Proposition., we consider linear underestimates of f: f (x) = f(y ) + f (y ), x y to build the estimate sequence. Let us define χ (x, z) := M x z 3 /1 for every 0. Since the inequality () is satisfied in view of Lemma 8., we can use Lemma.3 to define our goal: at iteration, we need to find a point x +1 Q and suitable coefficients α for which: f(y ) + f (y ), (1 α )x + α v y } + min α f M (y ), x v + λ +1 x Q 1 x v 3 f(x +1 ). (8) 10

11 Recall that v is the minimizer of φ on Q. Our strategy here is to tae x +1 := y, so we are left with the problem of determining a point y for which the sum of the two last terms on the left-hand side is nonnegative. In parallel with what we have done in the previous sections where the objective function had a Lipschitz continuous gradient, we define: x N (x) := arg min f(x) + f (x), y x + 1 f (x)(y x), y x + N6 } y y Q x 3 for every N M. For the subsequent method to have a practical interest at all, the above optimization problem has to be easy to solve. As noticed by Nesterov and Polya in Section 5 of [NP06], unconstrained nonconvex although this paper does not leave the convex realm problems of the above type can be solved efficiently because their strong dual boils down to a convex optimization problem with only one variable. Moreover, the optimal solution of the original problem can be easily reconstructed from the dual optimum. For constrained problems, Nesterov observed in Section 6 of [Nes06] that, as long as convex quadratic functions can be minimized easily on Q, we can guarantee an easy access to x N (x). The optimality condition for x N (x) reads as follows: f (x)+f (x)(x N (x) x), y x N (x) + N x x N (x) G(x N (x) x), y x N (x) 0 y Q. (9) We start our analysis with an easy lemma, an immediate generalization of which will be exploited in the next section as well. Lemma 4.1 Let g R n, λ > 0, x, v Q and z := (1 α)x + αv for an α [0, 1]. We have: min α g, y v + λχ (y, v)} min g, y z + λ } y Q y Q α 3 χ (y, z). Proof By convexity of Q, we have Q = αq + (1 α)q. Therefore: because x belongs to Q. Now, we can write: Q z = (1 α)(q x) + α(q v) α(q v) minα g, y v + λχ (y, v) : y Q} = min g, u + λm u 3 /(1α 3 ) : u α(q v)} min g, u + λm u 3 /(1α 3 ) : u Q z} = min g, y z + λχ (y, z)/α 3 : y Q}. The next lemma plays the crucial role in the validation of the desired inequality. (Compare with item 1 of Theorem 1 in [Nes07]). We write r N (x) for x x N (x). Lemma 4. For every x, y Q, we have f (x N (x)), y x N (x) M + N r N (x) y x + N M r N (x) 3. 11

12 Proof By the optimality condition (9), we have for every x, y Q: 0 f (x) + f (x)(x N (x) x), y x N (x) + Nr N(x) G(xN (x) x), y x N (x). (10) Observe that: G(xN (x) x), y x N (x) r N (x) y x r N (x). Moreover, in view of the Hessian Lipschitz continuity (7), we have: f (x) + f (x)(x N (x) x), y x N (x) f (x) + f (x)(x N (x) x) f (x N (x)) y x N (x) + f (x N (x)), y x N (x) M r N (x) ( y x + r N (x) ) + f (x N (x)), y x N (x). Summing up these two inequalities with appropriate multiplicative coefficients, we get form (10): 0 Nr N(x) = N + M ( rn (x) y x r N (x) ) + M r N(x) ( y x + r N (x) ) + f (x N (x)), y x N (x) r N (x) y x + M N r N (x) 3 + f (x N (x)), y x N (x). We are now ready to design an estimate sequence scheme for the constrained optimization problem we are interested in. Algorithm 4.1 Assumption: f is convex and has a Lipschitz continuous Hessian with constant M for the matrix norm induced by. Choose x 0 Q, set v 0 := x 0 and λ 0 := 1. Set φ 0 (x) := f(x 0 ) + M x x 0 3 /6. For 0, Find α such that 1α 3 = (1 α )λ. Set λ +1 := (1 α )λ. Set z := α v + (1 α )x. Set y := arg min y Q f (z ), y z + 1 f (z )(y z ), y z + 5M 6 x z 3 }. Set x +1 := y. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End Theorem 4.1 The above algorithm taes not more than ( ) 1/3 1 1 (f(x 0 ) f(x ) + M6 ) 1/3 ɛ x 0 x 3 iterations to find a point x for which f(x ) f ɛ. 1

13 Proof Let us show first that the inequality (8) is satisfied, that is, that: } f (y ), (1 α )x + α v y + min α f M (y ), x v + λ +1 x Q 1 x v 3 In view of the algorithm, we have set z := α v + (1 α )x and y := x N (z ). Using Lemma 4.1 and Lemma 4., the second term of the above inequality can be bounded from below as follows: } min α f M (y ), y v + λ +1 y Q 1 y v 3 f (y ), y z + λ } +1 min y Q = f (y ), y z + min y Q f (y ), y z + min y Q α 3 M 1 y z 3 f (y ), y y + λ +1 M + N Therefore, the inequality we need to prove becomes: N M r N (z ) 3 + min y Q α 3 } M 1 y z 3 0. r N (z ) y z + N M r N (z ) 3 + λ +1 M α 3 1 y z 3 M + N r N (z ) y z + λ +1 α 3 } M 1 y z 3 0. Suppressing the constraint of the above minimization problem, we get the following lower bound: min M + N r N (z ) y z + λ } +1 M y R n α 3 1 y z 3 = min M + N r N (z ) t + λ } +1 M (M + N) t 0 α 3 1 t3 = 3 r N(z ) 3 3 α 3. M λ +1 Thus, the inequality is satisfied as soon as: N M (M + N) 3 α 3 0, 3 M λ +1 or 9 M(N M) 8 (M + N) 3 α3. λ +1 The left-hand side can be maximized in N. Its maximizer turns out to be attained for N := 5M, in which case its value is 1/1. Note the constants prescribed here have been integrated in Algorithm 4.1. It suffices now to apply Lemma 8.1 to obtain the complexity of the algorithm. }. 13

14 4. Beyond cubic regularization In principle, the above reasoning can be applied in the study of an optimization scheme for constrained convex optimization with higher regularity. However, the obtained scheme would imply to solve at every iteration a problem of the form min f(x) + f (x)[y x] y Q f (m) M (x)[y x,..., y x] + (m + 1)! y x m+1, which can be highly nontrivial. A discussion of the cases where this problem is reasonably easy e.g. where the above objective function is convex and/or has an easy dual can be the topic of a further paper. In fact, this problem does not need to be solved extremely accurately. We show in this paper that, in the case where D Q is bounded, a solution with accuracy O(ɛ 1.5 ) is amply sufficient to guarantee the reliability of the algorithm see Subsection 5.1 for more details. Nevertheless, let us analyze this scheme, as an illustration of the power of the estimate sequence framewor. Given a norm on R n, we define the norm of a tensor A of rand d as: A := sup sup A[x 1,..., x d 1 ], x d. x 1 =1 x d =1 Let us assume that the m-th derivative of f is Lipschitz continuous: f (m) (y) f (m) (x) M y x for every x, y Q. By integrating several times the above inequality, we can easily deduce that for every j between 0 and m, and for every x, y R n, we have: f (m j) (y) f (m j) (x) f (m j+1) (x)[y x] 1 j! f (m) (x)[y x,..., y x] M (j + 1)! y x j+1. (11) Actually, we only need the inequality for j := m 1 in our reasoning. For constructing the first function of our estimate sequence, we choose a starting point x 0 Q, and set: M φ 0 (x) := f(x 0 ) + (m + 1)! x x 0 m+1. Then, following the construction outlined in Proposition., we define for an appropriate choice of α and y. φ +1 (x) = (1 α )φ (x) + α (f(y ) + f (y ), x y ) Let us restrict our analysis to the case where the norm is a matrix induced norm as in the previous section. Lemma 8. provides us with a constant c m+1 such that the function χ (x, y) := Mc m+1 (m + 1)! y x m

15 can be used in Lemma.3. As for cubic regularization, our strategy for exploiting this lemma consists in trying to find at every iteration a point y that satisfies the inequality: x N (x) := arg min y Q f (y ), (1 α )x + α v y + min y Q α f (y ), y v + λ +1 χ (y, v )} 0. (1) The structure of our construction parallels the one for cubic regularization. Our main tool is the minimizer: f(x) + f (x)[y x] f (m) (x)[y x,..., y x] + where N M. A necessary optimality condition reads, with r N (x) := x x N (x) : f (x) + f (x)[x N (x) x] Nr N(x) m 1 } N y x m+1, (m + 1)! 1 (m 1)! f (m) (x)[x N (x) x,, x N (x) x], y x N (x) G(xN (x) x), y x N (x) 0 y Q. (13) Let us extend the two lemmas of the previous section. We omit the proof of the first one, as it is a trivial extension of the one of Lemma 4.1. Lemma 4.3 Let g R n, λ > 0, x, v Q and z := (1 α)x + αv for an α [0, 1]. We have: min α g, y v + λχ (y, v)} min g, y z + λ } y Q y Q α m+1 χ (y, z). Lemma 4.4 For every x, y Q, we have f (x N (x)), y x N (x) M + N Proof First, we can use (11) to get: r N (x) m y x + N M r N (x) m+1. f 1 (x) + + (m 1)! f (m) (x)[x N (x) x,, x N (x) x] f (x N (x)), y x N (x) f 1 (x) + + (m 1)! f (m) (x)[x N (x) x,, x N (x) x] f (x N (x)) y x N (x) M r N (x) m y x N (x) M r N (x) m( y x + r N (x) ). Using the latter inequality in (13), we get: f (x N (x)), y x N (x) + M r N (x) m ( y x +r N (x))+ Nr N (x) m 1 G(xN (x) x), y x N (x) 0. It remains to use G(xN (x) x), y x N (x) r N (x) y x r N (x) to get the desired inequality. The m-regularization algorithm loos as follows. 15

16 Algorithm 4. Assumptions: f is convex and has a Lipschitz continuous m-th differential with constant M for the norm. Choose x 0 Q, set v 0 := x 0 and λ 0 := 1. Set φ 0 (x) := f(x 0 ) + M x x 0 m+1 /(m + 1)!. For 0, Find α such that (m + )α m+1 = c m+1 (1 α )λ. Set λ +1 := (1 α )λ. Set z := α v + (1 α )x. Set y := arg min y Q f (z ), y z f (m) (z )[ ], y z + (m+1)m (m+1)! y z m+1 }. Set x +1 := y. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End Theorem 4. The above algorithm taes not more than ( ) 1/(m+1) ( ) 1/(m+1) m + 1 f(x 0 ) f(x M ) + c m+1 ɛ (m + 1)! x 0 x m+1 iterations to find a point x for which f(x ) f ɛ. Proof The proof is nothing more than an adaptation of the demonstration of Theorem 4.1. With z := α v + (1 α )x and y := x N (z ), the inequality (1) becomes f (y ), z y + min y Q α f (y ), y v + λ +1 χ (y, v )} 0. Applying successively Lemma 4.3 and Lemma 4.4, we can transform this inequality into: f (y ), z y + min α f (y ), y v + λ +1 χ (y, v )} y Q f (y ), z y + min f (y ), y z + λ } +1 χ (y, z ) min M + N y Q y Q N M r N (z ) m+1 + min t 0 ( = r N (z ) m+1 (N M) m m + 1 This quantity is nonnegative as soon as: α m+1 r N (x) m y x + N M r N (z ) m+1 + λ +1 α m+1 M + N r N (z ) m t + λ +1 α m+1 ( (M + N) m+1 α m+1 )1/m ). Mc m+1 λ +1 c m+1 M(N M) m (N + M) m+1 ( m + 1 m ) m αm+1 λ +1. } χ (y, z ) Mc m+1 (m + 1)! tm+1 } 16

17 Maximizing the left-hand side with respect to N, we get a value of c m+1 /(m + ), attained for N = (m + 1)M. It remains to apply Lemma 8.1. Interestingly, the case m := 1, we get a new algorithm for minimizing a convex function with a Lipschitz continuous gradient. However, we need in this algorithm to evaluate the gradient of the function f in two points at every iteration, instead of just one as in Algorithms 3.1 and 3.. We conclude this section with a short note on solving the equation: (m + )α m+1 = c m+1 (1 α )λ. With γ := c m+1 λ /(m + ) > 0, the equation to solve has the form p(t) = t m+1 + γt γ. As p(0) < 0 < p(1) and p (t) > 0 on t [0, 1], this equation has a unique solution and can be solved in a few steps of Newton s algorithm initialized at Cubic and m-th regularization for unconstrained problems It is possible to improve some constants in the complexity analysis for cubic regularization and m-th regularization when problems are unconstrained. We consider here a function f with a M-Lipschitz continuous m-th differential. From the viewpoint of our complexity analysis, the case m = does not bear anything special. There are essentially two elements in the proof that change with respect to the constrained situation. Firstly, the inequality (1) that we need to chec can be simplified because we can compute the exact value of the minimum. It can be rewritten as: f (y ), (1 α )x + α v y min α f (y ), y v + λ +1 χ (y, v )} y R n = m ( ) 1/m ( α m+1 f (y ) m+1 )1/m. (14) m + 1 λ +1 M Secondly, the form of the optimality condition for x N (x) changes. When Q = R n in (13), we have for all x R n : c m+1 f 1 (x) + + (m 1)! f (m) (x)[x N (x) x,, x N (x) x] + Nr N(x) m 1 G(x N (x) x) = 0. (15) This relation allows us to get a ind of counterpart to Lemma 4.4. Lemma 4.5 For every x R n and N M, we have f (x N (x)) (m + 1)N m ( m 1 m ) m 1 m+1 N f N M (x N (x)), x x N (x) m m+1. Proof We can transform the Lipschitz continuity of f (m) (see (11) with j := m 1) using the optimality 17

18 condition (15): ( ) M 0 r N (x) m f (x N (x)) f 1 (x) (m 1)! f (m) (x)[x N (x) x,..., x N (x) x] ( ) M = r N (x) m f (x N (x)) + Nr N (x) m 1 G(x N (x) x) = M N () r N (x) m f (x N (x)) Nr N (x) m 1 f (x N (x)), x N (x) x. Since N M, we can see that f (x N (x)), x N (x) x is negative. Also: f (x N (x)) M N () r N (x) m Nr N(x) m 1 f (x N (x)), x N (x) x ( ) m 1 (m + 1)N m 1 m+1 N f m m N M (x N (x)), x x N (x) m m+1. The last bound comes from the maximization of its left-hand side with respect to r N (x). Now, if we tae x := (1 α )x +α v in the previous lemma, the inequality resembles striingly to the desired inequality (14), provided that we choose y := x N (x). In light of the previous lemma, the following relation ensures that (14) is satisfied: m ( ) 1 m + 1 c m+1 m ( α m+1 λ +1 We can reformulate this inequality as: α m+1 λ +1 c m+1 ) 1 ( ) 1 ( ) m+1 m m 1 ( m m m M (m + 1)N m 1 ( ) m 1 m + 1 M(N M ) m 1 m 1 N m. Maximizing the right-hand side with respect to N, we get: α m+1 λ +1 c m+1 ( m + 1 m m ) m 1. N M ) m 1 m. N and this optimum is attained for N := mm. Comparing this value with the one obtained in the previous section, we see that the improvement is rather significant: their ratio is as large as: (m + 1) m+1 m m, that is, of order O( m) for large values of m. In particular, for cubic regularization (m = ) our constant equals 3/4, while we obtained only 1/1 in the constrained case. The algorithm now reads as follows: 18

19 Algorithm 4.3 Assumptions: f is convex and has a Lipschitz continuous m-th differential with constant M for the norm = G, 1/ ; Q R n. Choose x 0 R n, set v 0 := x 0 and λ 0 := 1. Set φ 0 (x) := f(x 0 ) + M x x 0 m+1 /. For 0, Find α such that α m+1 = c m+1 Set λ +1 := (1 α )λ. Set z := α v + (1 α )x. ( m+1 m m ) m 1 (1 α )λ. Set y := arg min y Q f (z ), y z f (m) (z )[ ], y z + mm (m+1)! y z m+1 }. Set x +1 := y. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End The above algorithm does not tae more than: m ( m ) (m 1)/ ( 1 m + 1 ɛ c m+1 iterations to find an ɛ-approximate solution. ) 1/(m+1) ( f(x 0 ) f(x ) + ) 1/(m+1) M (m + 1)! x 0 x m+1 5 Estimate sequences and approximations In the course of an algorithm based on estimate sequences, we must compute at every iteration the minimizer of usually two optimization problems. In order to accelerate the scheme, we can consider solving these subproblems only approximately. A natural question arises immediately: how does the accuracy of the resolution of these subproblems relate to the precision of the final answer given by the algorithm? In particular, how do the successive errors accumulate in the course of the algorithm? For some optimization problems, computing the needed differentials can be quite hard to do accurately (e.g. in Stochastic Optimization, see [SN05], or in large-scaled Semidefinite Optimization, see [d A08]). How precisely must we compute these differentials in order to avoid any accumulation of error? How can we combine these approximations with an accelerated computation of subproblems? We answer these questions in this section for Algorithms 3.1, 3.1, and Inexact resolution of intermediate problems In this subsection, we assume that we have access to accurate gradients of the objective function f, but that we do not have the time or the patience of computing v and/or x +1 (in Algorithm 3.1) or y (in the other algorithms). In order to carry out our analysis, we need to formulate a few assumptions on the original optimization problem. First, we assume that the feasible set Q is compact. Fixing a norm on R n, we denote the finite diameter of Q by: D Q := sup y x : x, y Q}. 19

20 Second, we must formulate some regularity assumptions on the function φ 0. In view of the examples studied in the previous sections, we can consider bounds of the form: L 0 y x φ 0(y) φ 0(x), y x σ 0 y x p for every x, y Q, (16) where p 1 is an appropriately chosen number, and L 0, σ 0 0. We easily deduce that for every x and y, the following inequality holds: L 0 y x φ 0 (y) φ 0 (x) φ 0(x), y x σ 0 p y x p. (17) Also, we can write in view of Theorem.1.5 in [Nes03]: L 0 y x φ 0(y) φ 0(x). (18) Before studying the effect of solving subproblems inexactly, let us chec that the above condition is satisfied in two most typical settings. The following lemma deals with the situation we have considered in Section 4. Lemma 5.1 Consider a matrix induced norm = G, 1/ and a number m 1. The inequality (16) is satisfied when M φ 0 (x) = f(x 0 ) + (m + 1)! x x 0 m+1, with p := m + 1, L 0 := MD p Q /(p )!, and σ 0 := Mc p /p!, where c p is given in Lemma 8.. When m = 1, one can tae L 0 = σ 0 = M. Proof Let p := m + 1. In view of Lemma 8., we have: φ 0 (y) φ 0 (x) φ 0(x), y x M p! c p y x p for every x, y Q. Adding this inequality to the one obtained by inverting x and y, we obtained the desired value of σ. For proving the upper bound, let F p (x, y) := φ 0(y) φ 0(x), y x (p 1)!/M, and bound F p (x, y)/ y x from above. Without loss of generality, we can assume that x 0 = 0. First, max x y Q F p (x, y) y x = max y p ( y p + x p ) Gy, x + x p x y Q y Gy, x + x y p ( y p 1 x + y x p 1 )α + x p = max x y Q y y x α + x 1 α 1 Fixing two distinct points x and y in Q, we denote by ψ(α) the above right-hand side. After some trivial rearrangements, the numerator of its derivative is: ( y p + x p ) y x ( y p 1 x + y x p 1 )( y + x ) = y x ( y p x p )( y x ), 0

21 which is nonnegative, thus the maximum of ψ(α) on [ 1, 1] is attained when α = 1. The maximum of F p (x, y)/ y x reduces to: y p ( y p 1 x + y x p 1 ) + x p max x y Q ( y x ) y=tx ( y x )( y p 1 x p 1 ) max x y Q ( y x ) = max x y Q ( y p + y p 3 x + + x p ) (p 1)D p Q. When p =, φ 0(y) φ 0(x), y x = M y x, and L 0 = σ 0 = M wors. The entropy function is a common choice for constructing φ 0 in Algorithm 3.1 or 3.. In this setting, used when the feasible set Q is a simplex, that is, Q := x R n + : n i=1 x i = 1}, we choose the prox-function d as follows: n d(x) := x i ln(x i ) + ln(n). i=1 Its minimizer on Q is the all-1/n vector, and the second inequality in (16) holds with σ := 1 when we use the 1-norm (see Lemma 3 in [Nes05] for a proof that d(x) is 1-strongly convex on Q or that norm; we show below an extension of this result). However, the first inequality does not hold 1, and our result cannot be applied. Nevertheless, Ben-Tal and Nemirovsi [BTN05] suggest a slight modification of d to regularize this function. Let δ > 0 and d δ (x) := n i=1 ( x i + δ ) ( ln x i + δ ) (1 + δ) ln n n ( 1 + δ n ). Lemma 5. Using the 1-norm for, we have: L 0 y x 1 d δ(y) d δ(x), y x σ 0 y x 1 for every x, y Q, (19) where L 0 := n/δ, σ 0 := 1/(1 + δ). Proof We have to show that d δ (x)h, h / h 1 is bounded from above by n/δ and from below by 1/(1+δ). Using Cauchy-Schwartz s Inequality, we have for every x Q: d δ (x)h, h = n i=1 h i x i + δ/n = ( n ) ( h n ) ( i x i + δ n δ x i=1 i + δ/n n i=1 i=1 ( n ( n ) h h i ) i δ x i + δ/n i=1 i=1 h i x i + δ/n ) = h 1 δ d δ (x)h, h. 1 For a simple chec of this assertion, consider y ɛ := (1 (n 1)ɛ, ɛ,..., ɛ) T and x := (1/n,..., 1/n) T, for 1/n > ɛ > 0. As y ɛ x 1 = (n 1)(1/n ɛ) <, and d (y ɛ) d (x), y ɛ x = (n 1)(1/n ɛ) ln(1/ɛ (n 1)) is unbounded when ɛ 0, the upper bound in (16) cannot be guaranteed, whatever L 0 is. 1

22 From the other side, we have: n i=1 h i x i + δ/n n i=1 h i δ/n n δ h 1. The following proposition indicates the effect of constructing an approximate minimizer ˆv +1 to φ +1 and an approximate point ˆx +1 on the fundamental inequality φ (v ) f(x ). At the end of this subsection, we particularize this proposition to the three algorithms under consideration. The notation comes from Lemma.3. Proposition 5.3 Assume that inequality (16) holds, and that the following slight extension of Inequality () in Lemma.3 is satisfied with functions χ (x, y) σ 0 y x p /p: φ 0 (y) φ 0 (x) + φ 0(x), y x + χ (y, x) for all x, y Q. (0) Let ɛ 0, let γ [0, 1] and fix 0. Assume that ˆx, ˆv Q, and min x Q φ (x) f(ˆx ) ɛ. Suppose that the accuracy ˆɛ by which we have computed ˆv, that is a constant verifying ˆɛ φ (ˆv ) φ (v ), satisfies the following bound: ( ) p p 0 ˆɛ min 1, α γ ɛ 1 α p 1 + D Q L 0 pλ p 1, (1) /σ 0 and suppose that the accuracy by which we compute ˆx +1 guarantees: f(y )+ g(y ), (1 α )ˆx +α ˆv y +min x Q α g(y ), x ˆv + λ +1 χ(x, ˆv )} f(ˆx +1 ) α (1 γ)ɛ, where g(y ) f(y ). Then: Moreover, if min φ +1(x) f(ˆx +1 ) ɛ. x Q ɛ λ +1 α γ () ( ) 1 D p Q Lp p 1 0, (3) σ 0 the bound on ˆɛ can be improved to: ( ) p ( ) ( ) p α γ ɛ σ 0 0 ˆɛ 1 α D Q L 0 pλ p 1. (4) Proof Let us fix 0. Observe that the condition (0) implies: φ (y) φ (x) + φ (x), y x + λ χ (y, x) for all x, y Q, 0.

23 First, we bound min x Q φ (ˆv ), x ˆv from below. Obviously, the function φ has a Lipschitz continuous gradient with constant λ L 0. Observe that, in view of (18): L 0 λ ˆv v φ (ˆv ) φ (v ), and ˆɛ φ (ˆv ) φ (v ) σ 0λ p ˆv v p. (5) min x Q φ (ˆv ), x ˆv min x Q φ (v ), x ˆv φ (ˆv ) φ (v ) x ˆv } min x Q φ (v ), x v + φ (v ), v ˆv L 0 λ ˆv v x ˆv } φ (v ), v ˆv L 0 λ ˆv v D Q φ (v ) φ (ˆv ) + λ χ (v, ˆv ) L 0 λ ˆv v D Q ˆɛ + σ 0λ p ˆv v p L 0 λ ˆv v D Q. Now, the function t σ 0 t p /p L 0 D Q t is decreasing in [0, t ], where t := (L 0 D Q /σ 0 ) 1/(p 1). We now from (5) that we can estimate ˆv v by ˆt := p pˆɛ /σ 0 λ. If ˆt t, we can write the following bound: min x Q φ p pˆɛ (ˆv ), x ˆv L 0 λ D Q. σ 0 λ Observe that ˆt t is ensured when (3) and (4) hold. Also, the bound (4) implies: min x Q φ (ˆv ), x ˆv α γɛ. 1 α If we cannot guarantee that ˆt t, we can use the following slightly less favorable estimation, provided that ˆɛ 1: min x Q φ (ˆv ), x ˆv ˆɛ + σ 0λ p ˆv v p L 0 λ ˆv v D Q. p pˆɛ ˆɛ L 0 λ ˆv v D Q ˆɛ L 0 λ D Q. σ 0 λ The bound (1) on ˆɛ ensures that: p ˆɛ p pˆɛ L 0 λ D Q = p ˆɛ σ 0 λ min x Q φ (ˆv ), x ˆv α γɛ. 1 α ( p 1 + L 0 λ p σ 0 λ D Q ). 3

24 We conclude our proof by following essentially the same steps as in the argument of Lemma.3: [ ]} min φ +1(x) = min (1 α )φ (x) + α f(y ) + g(y ), x y x Q x Q [ ] [ ]} min (1 α ) φ (ˆv ) + φ (ˆv ), x ˆv + λ +1 χ (x, ˆv ) + α f(y ) + g(y ), x y x Q [ ]} min (1 α )φ (ˆv ) α γɛ + λ +1 χ (x, ˆv ) + α f(y ) + g(y ), x y x Q ] (1 α )f(ˆx ) (1 α + α γ)ɛ + α [f(y ) + g(y ), ˆv y + min x Q α g(y ), x ˆv + λ +1 χ (x, ˆv )} f(y ) + g(y ), (1 α )ˆx + α ˆv y (1 α + α γ)ɛ + min x Q α g(y ), x ˆv + λ +1 χ (x, ˆv )} f(ˆx +1 ) ɛ. The previous proposition has a clear meaning with respect to the computation of ˆv. The designer of an estimate sequence scheme must tae a particular care in the choice of φ 0 because the computation of v, that is, of a minimizer of φ 0 plus a linear term over Q, must be done quite accurately: for instance, for Algorithms 3.1 and 3., we have α = Θ(1/) Ω( L/ɛ), and we get the lower bound ɛ Ω(γ ɛ 3 /L). It would be desirable that this computation remains relatively cheap, or even that this minimizer can be computed analytically. As far as the criterion () on the accuracy of ˆx +1 is concerned, it is not difficult to relate it with a condition on how precisely the corresponding intermediate optimization problem has to be solved. Roughly speaing, this intermediate problem must be solved within an accuracy of (1 γ)α ɛ or of (1 γ)α ɛ/d Q. For instance, in Algorithms 3.1 and 3., the inequality () is guaranteed as soon as h(ˆx +1 ) min x Q h (x) (1 γ)α ɛ, where α f (y ), x v + λ +1 χ (x, v ). For Algorithm 4., we can replace the condition () by the following one, provided that the feasible set Q has a finite diameter D Q : where h (ˆx +1 ) (1 γ)α ɛ D Q, h (x) := f (z ), x z f (m) (z )[x z,, x z ], x z + Obviously, this condition implies the following approximate optimality criterion: h (ˆx +1 ), y ˆx +1 (1 γ)α ɛ y Q. (m + 1)M x z m+1. (m + 1)! In order to show how the above criterion implies (), one can easily adapt the technical Lemma 4.4 into: x Q f (ˆx +1 ), y ˆx +1 (1 γ)α ɛ M + N 4 ˆr m y z + N M ˆr m+1,

25 where ẑ = (1 α )ˆx +α ˆv, and ˆr := ˆx +1 ẑ. Using this inequality and the same argument as in the proof of Theorem 4., we can immediately show that the desired inequality () holds. 5. Approximate subgradients and higher-order differentials In some circumstances, e.g. in the framewor of stochastic optimization where a prior Monte-Carlo sampling is used to approximate the actual objective function (see [SN05]), we do not have access to an exact subgradient of the objective function f. Specifically, we assume that, for a given accuracy ɛ > 0 and a point x of dom f, we can only determine in a reasonable time a subgradient g that satisfies the two following properties: and y dom f, f(y) f(x) + g, y x ɛ, (6) y dom f, f(y) f(x) + g, y x ɛ y x, (7) where is an appropriate norm. The first inequality is used in Chapter XI of [HUL93b] in the definition of ɛ-subgradients. The second one is defined in Section 1.3 of [Mor05] as analytic ɛ-subgradients. We shall denote the set of the approximate subgradients that satisfy (6) and (7) by ɛ f(x). The interest of mixing these two notions of subgradient lies in the fact that we can use affine underestimates for constructing our estimate sequence, and, at the same time, employ the following lemma on the error of the approximate subgradient over the actual one. In a more careful analysis, we could mae a distinction between the required accuracy in (6) and (7), defining (ɛ 1, ɛ )- subgradients. It not difficult to incorporate this extra feature in our argument. The following lemma shows a useful consequence of the inequality (7). Lemma 5.4 Let f : R n R + } be a closed convex function, let x dom f, ɛ 0, and g R n satisfying (7). Then g(x) g ɛ for a subgradient g(x) f(x), where the norm is dual to the norm used in (7). Proof The subgradient of the function h(y) := y x in y = x is B [0, 1] := s R n : s 1}. Indeed, for every ŝ B [0, 1] and every y R n we have: h(y) = y x = max s, y x ŝ, y x = h(x) + ŝ, y x. s B [0,1] On the other hand, if some ŝ verifies h(y) h(x)+ ŝ, y x for every z R n, then 1 max ŝ, u : u 1} = ŝ. Let g be a vector satisfying (7), or equivalently g (f + ɛh)(x). According to Theorem 3.8 in [Roc70], we have: (f + ɛh)(x) = f(x) + ɛ h(x) = f(x) + ɛb [0, 1]. Therefore, there exist a g(x) f(x) and a ξ B [0, 1] such that g = g(x) + ɛξ, which implies g g(x) ɛ. 5

26 Given an approximate subgradient g ɛ f(y ) for an ɛ > 0, a natural candidate for the underestimate f is: f (x) = f(y ) + g, x y ɛ. (8) The functions of the estimate sequence become: 1 φ (x) = λ φ 0 (x) + i=0 1 λ α i (f(y i ) + g i, x y i ) λ i+1 i=0 λ α i λ i+1 ɛ i. The instrumental inequality in Lemma.3 can be easily extended to approximate subgradients. Its demonstration follows closely the proof of Lemma.3, and we just setch the small variations between the two proofs. ) Lemma 5.5 We are given an estimate sequence ((φ ) 0 ; (λ ) 0 for the convex problem min f(x), x Q constructed according to Proposition. using affine underestimates of f: f (x) := f(y ) + g, x y ɛ for some y Q, ɛ > 0 and g ɛ f(y ). We also assume that φ 0 is continuously differentiable and that we have functions χ : Q Q R + for 0 such that χ (x, y) = 0 implies x = y, and for which: φ 0 (y) φ 0 (x) + φ 0(x), y x + χ (y, x) for all y, x Q. If x and v satisfy φ (v ) f(x ) ɛ for an ɛ > 0, then: min φ +1(x)} f(y ) g, (1 α )x +α v y min α g, x v + λ +1 χ (x, v )} ɛ +(1 α )ɛ x Q x Q for every 0. Moreover, if y := (1 α )x + α v, the right-hand side can be replaced by Proof We have for every x Q: α (1 + (1 α )D Q ) ɛ + (1 α )ɛ. φ +1 (x) (1 α )(f(x ) ɛ) + λ +1 χ (x, v ) + α (f(y ) + g, x y ɛ ) (1 α )(f(y ) + g, x y ɛ ɛ) + λ +1 χ (x, v ) + α (f(y ) + g, x y ɛ ), which is exactly the desired inequality. For the second bound, we can proceed as follows. Here g(y ) is a subgradient of f at y such that g g(y ) ɛ, and y := (1 α )x + α v, lie in Algorithm 3.1 and Algorithm 3.. φ +1 (x) (1 α )(f(x ) ɛ) + λ +1 χ (x, v ) + α (f(y ) + g, x y ɛ ) (1 α )(f(y ) + g(y ), x y ɛ) + λ +1 χ (x, v ) + α (f(y ) + g, x y ɛ ) = f(y ) + α g(y ) g, y v (1 α )ɛ + λ +1 χ (x, v ) + α ( g, x v ɛ ). 6

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Cubic regularization of Newton s method for convex problems with constraints

Cubic regularization of Newton s method for convex problems with constraints CORE DISCUSSION PAPER 006/39 Cubic regularization of Newton s method for convex problems with constraints Yu. Nesterov March 31, 006 Abstract In this paper we derive efficiency estimates of the regularized

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly
