Estimate sequence methods: extensions and approximations

Size: px
Start display at page:

Download "Estimate sequence methods: extensions and approximations"

Transcription

1 Estimate sequence methods: extensions and approximations Michel Baes August 11, 009 Abstract The approach of estimate sequence offers an interesting rereading of a number of accelerating schemes proposed by Nesterov [Nes03], [Nes05], and [Nes06]. It seems to us that this framewor is the most appropriate descriptive framewor to develop an analysis of the sensitivity of the schemes to approximations. We develop in this wor a simple, self-contained, and unified framewor for the study of estimate sequences, with which we can recover some accelerating scheme proposed by Nesterov, notably the acceleration procedure for constrained cubic regularization in convex optimization, and obtain easily generalizations to regularization schemes of any order. We analyze carefully the sensitivity of these algorithms to various types of approximations: partial resolution of subproblems, use of approximate subgradients, or both, and draw some guidelines on the design of further estimate sequence schemes. 1 Introduction The concept of estimate sequences was introduced by Nesterov in 1983 [Nes83] to define the provably fastest gradient-type schemes for convex optimization. This concept, in spite of its conceptual simplicity, has not attracted a lot of attention during the 0 first years of its existence. Some interest for this concept resurrected in 003, when Nesterov wrote his seminal paper on smoothing techniques [Nes05]. Indeed, the optimization method Nesterov uses on a smoothed approximation of the convex non-smooth objective function can be seen as an estimate sequence method. These estimate sequence methods play a crucial role in further papers of Nesterov [Nes06, Nes07]. Auslender and Teboulle [AT06] managed to extend the estimate sequence method, stated in Section. of [Nes03] for squared Euclidean norms as prox-functions, to general Bregman distances at the cost of a supplementary technical assumption on the domain of these Bregman distances. Several other papers propose generalizations of Nesterov s smoothing algorithm, and can be interpreted in the light of the estimate sequence concept or sight generalizations of it. For instance, M. Baes is with the Institute for Operations Research, ETH, Rämistrasse 101, CH-809 Zürich, Switzerland. Part of this wor has been done while the author was at the Department of Electrical Engineering (ESAT), Research Group SCD-SISTA and the Optimization in Engineering Center OPTEC, Katholiee Universiteit Leuven, Kasteelpar Arenberg 10, B-3001 Heverlee, Belgium. Michel.Baes@ifor.math.ethz.ch. 1

2 Lan, Lu, and Monteiro [LLM] propose an accelerating strategy for Nesterov s smoothing, which can be interpreted as a simple restarting procedure of an estimate sequence scheme albeit their algorithm is not an estimate sequence scheme as we define it in the present paper. D Aspremont investigates another interesting aspect of Nesterov s algorithm: its robustness with respect to incorrect data [d A08], more specifically to incorrect computation of the objective s gradient. In this paper, we show how we can benefit from the estimate sequence framewor to carry out an analysis of the effect not only of approximate gradient on general estimate sequence schemes, but also of approximate resolution of subproblems one has to solve at every iteration. This result can support a strategy for reducing the iteration cost by solving only coarsely these subproblems. The purpose of this paper is to provide a simple framewor for the study of estimate sequences schemes, and to demonstrate its power in various ways. First, we generalize the accelerated cubic regularization scheme of Nesterov to m-th regularization schemes (cubic regularization represents the case where m = ), with the fastest global convergence properties seen so far: these schemes require not more than O((1/ɛ) m+1 ) iterations to find an ɛ-approximation of the solution. Also, our results allows us to reinterpret the accelerated cubic regularization schemes of Nesterov, and to improve some hidden constants in his complexity analysis. Second, we show how accurately subgradients and the solution of intermediate problems have to be computed in order to guarantee that no error propagation occurs from possible approximations during the course of iterations. The essence of our approach is crystallized in Lemma.3, which provides the sole condition to chec for proving the convergence of the scheme, determine its speed, and investigating every extension we study in this paper. Interestingly, we can interpret Lan, Lu, and Monteiro s restarting scheme as an application of Lemma.3, albeit their scheme does not fall into the estimate sequence paradigm. The paper is organized as follows. In Section, we recall the concept of estimate sequence, and we establish Lemma.3, which plays a central role in our paper. As in [Nes03], we particularize in the next section the general estimate sequence method to smooth problems. However, we use a slightly more general setting, allowing ourselves non-euclidean norms. We show in Section 4 a simplified presentation of the fast cubic regularization for convex constrained problems developed by Nesterov in [Nes07]. This presentation allows us to extend the idea of cubic regularization to m-th regularization, and to obtain, at least theoretically the fastest blac-box methods obtained so far, provided that the subproblems one must solve at every iteration are simple enough. Interestingly, we can improve some constants in the complexity results when we focus on unconstrained problem, due to the much more tractable optimality condition. Section 5 displays a careful sensitivity analysis of the various algorithms developed in the paper with respect to different ind of approximations. Section 6 shows briefly how Lan, Lu, and Monteiro s restarting scheme can be analyzed easily with the sole Lemma.3. Finally, the Appendix contains some useful technical results. Foundation of estimate sequence methods Consider a general nonlinear optimization problem inf x Q f(x), where Q R n is a closed set, and f is a proper convex function on Q. 
We assume that f attains its infimum f in Q, and we denote

3 by X the set of its minimizers. Later on, we will introduce more assumptions on f and Q, such as convexity, Lipschitz regularity, and differentiability a.o. An estimate sequence (see Chapter in [Nes03]) is a sequence of convex functions (φ ) 0 and a sequence of positive number (λ ) 0 satisfying: lim λ = 0, and φ (x) (1 λ )f(x) + λ φ 0 (x) for all x Q, 1. We also need to guarantee the inequality φ 0 (x ) f for an element x of X. Since we obviously do not have access to any point of the set X before any computation starts, the latter condition has to be relaxed, e.g. to min x Q φ 0 (x) f(y) for a point y Q. The first proposition indicates how estimate sequences can be used for solving an optimization problem, and, if implementable, how fast the resulting procedure would converge. Proposition.1 Suppose that the sequence x 0, x 1, x,... of Q satisfies f(x ) min x Q φ (x). Then f(x ) f λ (φ 0 (x ) f ) for every 1. Proof It suffices to write: f(x ) min x Q φ (x) min x Q f(x) + λ (φ 0 (x) f(x)) f(x ) + λ (φ 0 (x ) f(x )). The next proposition describes a possible way of constructing an estimate sequence. It is a slight extension of Lemma.. in [Nes03]. Proposition. Let φ 0 : Q R be a convex function such that min x Q φ 0 (x) f. Let (α ) 0 (0, 1) be a sequence whose sum diverges. Suppose also that we have a sequence (f ) 0 of functions from Q to R that underestimate f: f (x) f(x) for all x Q and all 0. We define recursively λ 0 := 1, λ +1 := λ (1 α ), and φ +1 (x) := (1 α )φ (x) + α f (x) = λ +1 φ 0 (x) + ) for all 0. Then ((φ ) 0 ; (λ ) 0 is an estimate sequence. Proof Since ln(λ +1 ) = ln(1 α j ) j=0 j=0 α j i=0 λ +1 α i λ i+1 f i (x), (1) for each 0, the sequence (λ ) 0 converges to zero, as the sum of α j s diverges. Let us now chec that φ (x) (1 λ )f(x) + λ φ 0 (x) for 1. For = 1, this condition is immediately verified. Using now a recursive argument, φ +1 (x) = (1 α )φ (x) + α f (x) (1 α )φ (x) + α f(x) (1 α )(1 λ )f(x) + (1 α )λ φ 0 (x) + α f(x) = (1 λ +1 )f(x) + λ +1 φ 0 (x), 3

4 which proves that we have built an estimate sequence. The second equality in (1) is obtained by an elementary recurrence on. All the estimate sequence methods we describe in this text are constructed on the basis of this fundamental proposition. In a nutshell, each of these methods can be defined by the specification of four elements: a function φ 0 that is easy to minimize on Q and that is bounded from below by f(y) for a y Q; a sequence of weights α in ]0, 1[; a strategy for constructing the successive lower estimates f of f. For convex objective functions, affine underestimates constitute the most natural choice as they are cheap to build. For strongly convex functions, we can also thin of quadratic lower estimates (see Section..4 in [Nes03]). A way of constructing, preferably very cheaply, points x that satisfy the inequality prescribed in Proposition.1, namely f(x ) min x Q φ (x). In view of Lemma 8.1, the existence of a positive constant β such that α n /λ +1 β proves that the sequence (λ ) 0 decreases to zero as fast as O(1/( n β)), i.e. the resulting algorithm requires O((1/ɛ) 1/ ) iterations, where ɛ > 0 is the desired accuracy. The following lemma concentrates on the case where the feasible set Q as well as the objective function f are convex, and where the underestimates f of f are affine. It provides an intermediate inequality that we will use in the construction of the sequence (x ) 0 and for exploring various extensions and adaptations of the estimate sequence scheme. We denote the subdifferential of f at x as f(x) (see e.g. [Roc70], Section 3). ) Lemma.3 We are given an estimate sequence ((φ ) 0 ; (λ ) 0 for the convex problem min f(x), x Q constructed according to Proposition. using affine underestimates of f: f (x) := f(y ) + g(y ), x y for some y Q, where g(y ) f(y ). We also assume that φ 0 is continuously differentiable. Suppose that for every 0, we have a function χ : Q Q R + such that χ (x, y) = 0 implies x = y, and such that: φ 0 (x) φ 0 (v ) + φ 0(v ), x v + χ (x, v ) for all x Q, 0, () where v is the minimizer of φ on Q. We denote by (x ) 0 a sequence satisfying f(x ) φ (v ). Then φ +1 (v +1 ) f(y )+ g(y ), (1 α )x +α v y +min x Q α g(y ), x v + λ +1 χ (x, v )} (3) for every 0. 4

5 Proof Observe first that the condition () can be rewritten as φ (x) φ (v ) + φ (v ), x v + λ χ (x, v ) for all x Q, 0. (4) Indeed, in view of Proposition., we have φ (x) = λ φ 0 (x) + 1 λ α i i=0 λ i+1 f i (x) = λ φ 0 (x) + l (x), where l (x) is an affine function. Inequality (4) can be rewritten as: λ φ 0 (x) + l (x) λ φ 0 (v ) + λ φ 0(v ), x v + l (x) + λ χ (x, v ), and results immediately from (). Now, fixing 0 and x Q, we can write successively: φ +1 (x) = (1 α )φ (x) + α f (x) (1 α ) (φ (v ) + φ (v ), x v + λ χ (x, v )) + α (f(y ) + g(y ), x y ) (1 α ) (f(x ) + λ χ (x, v )) + α (f(y ) + g(y ), x y ) (1 α ) (f(y ) + g(y ), x y + λ χ (x, v )) + α (f(y ) + g(y ), x y ) = f(y ) + g(y ), (1 α )x + α v y + α g(y ), x v + λ +1 χ (x, v ). The first inequality comes from (4). The second one uses the fact that v is a minimizer of φ on Q, so that φ (v ), x v 0 for each x Q. The third one comes form g(y ) f(y ). It remains to minimize both sides on Q. The inequality (3) suggests at least two different lines of attac to construct the next approximation x +1 of an optimum x. First, if we can ensure that the sequence (y ) 0 satisfies at every 0 and at every x Q the inequality: g(y ), (1 α )x + α v y + α g(y ), x v + λ +1 χ (x, v )} 0, it suffices to set x +1 := y. Another possibility is to build a sequence (y ) 0 for which the inequality g(y ), (1 α )x + α v y 0 holds for every 0 for instance by letting y := (1 α )x + α v. Then, the inequality of the above lemma reduces to φ +1 (v +1 ) f(y ) + min x Q α g(y ), x v + λ +1 χ (x, v )}. In some situations, constructing a point x +1 for which f(x +1 ) is lower than the above right-hand side can be done very cheaply by an appropriate subgradient-lie step. More details are given in the subsequent sections. Finally, the above lemma suggests a new type of scheme, where one only ensures that the inequality (3) is maintained at every iteration regardless of the fact that it originates from the construction of an estimate sequence. Under some conditions on χ, the convergence speed of the resulting scheme also relies on how fast we can drive the sequence (λ ) 0 to 0. More details are given in Section 6. 5

6 3 Strongly convex estimates for convex constrained optimization In this setting, the function φ 0 we choose is a strongly convex function, not necessarily quadratic. Let us fix a norm of R n. We assume that the objective function f is differentiable and has a Lipschitz continuous gradient with constant L for the norm : x, y Q, f(y) f(x) f (x), y x L y x. Equivalently, denoting by the dual norm of, the inequality f (y) f (x) L y x holds for every x, y Q. Observe that we do not assume the strong convexity of f. The function φ 0 is constructed from a prox-function d for Q. A prox-function d is a nonnegative convex function minimized at a point x 0 relint Q, and for which d(x 0 ) = 0. Also, a prox-function is supposed to be strongly convex: there exists a constant σ > 0 for which every x, y Q and λ [0, 1], we have: λd(x) + (1 λ)d(y) d(λx + (1 λ)y) + σ λ(1 λ) x y. If the function d is differentiable, this condition can be rewritten as (see e.g Theorem IV in [HUL93a]): x, y Q, d (x) d (y), x y σ x y. The prox-function d is a crucial building tool for an estimate sequence, and it should be chosen carefully. Indeed, at each step, we will have to solve one (or two) problem(s) of the form min x Q d(x) + l(x), where l is a linear mapping. Sometimes (in cubic and m-th regularization schemes, see in Section 4, we even need to solve min x Q d(x) + p(x), where p is a polynomial function. Having an easy access to its minimizer is a sine qua non requirement for the subsequent algorithm to wor efficiently. The best instances for this scheme are of course those for which this minimizer can be computed analytically. Interestingly enough, the set of these instances is not reduced to a few trivial ones (see [Nes05, NP06]). These ingredients allow us to define the first function of our estimate sequence: φ 0 (x) := f(x 0 ) + L σ d(x), which is L-strongly convex for the norm. Also min x Q φ 0 (x) = f(x 0 ) f(x ). Moreover, we have for every x Q: φ 0 (x) φ 0 (x 0 ) φ 0(x 0 ), x x 0 L x x 0 f(x) f(x 0 ) f (x 0 ), x x 0. Therefore, φ 0 (x) f(x) f (x 0 ), x x 0. The underestimates f of f are chosen to be linear underestimates: f (x) := f(y ) + f (y ), x y. 6

7 Following the construction scheme of Proposition., we have φ +1 (x) = λ +1 ( f(x 0 ) + L σ d(x) ) + i=0 λ +1 α i λ i+1 (f(y i ) + f (y i ), x y i ) for all 0. As the function φ is strongly convex with constant λ L for the norm it has a unique minimizer. We denote this minimizer by v. At iteration +1, we need to construct a point x +1 R n such that f(x +1 ) φ +1 (v +1 ), given that f(x ) φ (v ). There are many ways to achieve this goal; we consider here two possibilities hinted by Lemma.3. Each of them are defined by a nonnegative function χ for which χ (x, x) = 0 and d(x) d(v ) + d (v ), x v + σ L χ (x, v ) for all x Q, 0. A first possibility is to choose χ (x, y) := L x y /; the above inequality reduces to the σ-strong convexity of d. We can also consider a suitable multiple of the Bregman distance induced by d, that is, χ (x, y) := γ (d(x) d(y) d (y), y x ). The required inequality is ensured as soon as σ/l γ > 0. The potential advantage of this second approach resides in the fact that, in the resulting algorithm, the computation of the next point x +1 requires to solve a problem of the same type than for computing the minimizer of φ, that is, minimizing a d plus a linear function over Q. 3.1 When the lower bound on φ is a norm Let us consider here the case where χ (x, y) := L x y /, where is a norm on R n, not necessarily Euclidean lie in [Nes03]. According to Lemma.3, we can set y := (1 α )x + α v. With this choice, we obtain: φ +1 (v +1 ) f(y ) + min x Q α f (y ), x v + λ } +1L x v (5) for every 0. The minimization problem on the right-hand side is closely related to the standard gradient method. We denote by x Q (y; h) the minimizer of f (y), x y + 1 x y h over Q. If the considered norm is Euclidean, this minimizer is simply the Euclidean projection of a gradient step over Q: x Q (y; h) = arg min x Q x (y hf (y)). 7

8 Observe that: m := min α f (y ), x v + λ } +1L x v x Q = min f (y ), α x + (1 α )x y + λ } +1L x Q α α x + (1 α )x y min f (y ), x y + λ } +1L x Q α x y. because α Q + (1 α )x Q in view of the convexity of Q. Thus: m min f (y ), x y + L } x y, x Q provided that λ +1 /α 1. Hence, we can bound φ +1(v +1 ) from below by: f(y ) + f (y ), x Q (y ; 1/L) y + L x Q(y ; 1/L) y. By Lipschitz continuity of the gradient of f, this quantity is larger than f(x Q (y ; 1/L)). Therefore, setting x +1 := x Q (y ; 1/L) is sufficient to ensure the required decrease of the objective. However, this choice assumes that optimization problems of the form min x Q x z + l(x), where l is a linear function, are easy to solve as well. An alternative is presented in the next subsection, where only optimization problems of the form min x Q d(x) + l(x) should be solved at every iteration. Algorithm 3.1 Assumptions: f has a Lipschitz continuous gradient with constant L for the norm ; the set Q is closed, convex, and has a nonempty interior. Choose x 0 Q, set v 0 := x 0 and λ 0 := 1. Choose a strongly convex function d with strong convexity constant σ > 0 for the norm, minimized in x 0 Q, and that vanishes in x 0 and set φ 0 := f(x 0 ) + Ld(x)/σ. For 0, Find α such that α = (1 α )λ. Set λ +1 := (1 α )λ. Set y := α v + (1 α )x. Set x +1 := x Q (y ; 1/L) = arg min x Q f (y ), x y + L x y }. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End Assuming that α = λ +1 = (1 α )λ, we obtain in view of Lemma 8.1 and Proposition.1 a complexity of ( 1 f(x 0 ) f ɛ + L ) σ d(x ) iterations. There is a simple variant for the computation of the sequences (λ ) 0 and (α ) 0. The requirement λ α /(1 α ) is satisfied for every 0 when λ = 4/( + ) for. With this choice, we obtain α = 1 λ +1 λ = ( + 5)/( + 3) and α /λ +1 = (1 1/( + 6)) [5/36, 1[. 8

9 3. When the lower bound on φ is a Bregman distance In this setting, we define χ (x, z) := γ (d(x) d(z) d (z), x z ), where γ is a positive coefficient that will be determined in the course of our analysis. Observe that, in contrast with [AT06], we do not assume anything on the domain of d, except that it contains Q. For a fixed z Q, the function x χ (x, z) is strongly convex with constant σγ for the norm. Even better, we have for every x, y Q that: χ (x, y) γ σ x y. (6) If the coefficients γ are bounded from above by L/σ, we can apply Lemma.3 because the inequality () is satisfied in view of the Lipschitz continuity of f. Therefore, with: we have: for every 0. Let us denote: and chec that: y := (1 α )x + α v, φ +1 (v +1 ) f(y ) + min x Q α f (y ), x v + λ +1 χ (x, v )} w := arg min x Q α f (y ), x v + λ +1 χ (x, v )}, x +1 := α (w v ) + y = α w + (1 α )x yields a sufficient decrease of the objective. We can write: f(y ) + min x Q α f (y ), x v + λ +1 χ (x, v )} = f(y ) + α f (y ), w v + λ +1 χ (w, v ) f(y ) + α f (y ), w v + λ +1γ σ w v = f(y ) + f (y ), x +1 y + λ +1γ σ α x +1 y. Now, if λ +1 /α 1 and γ L/σ, the right-hand side is clearly larger than f(x +1 ) in view of the Lipschitz continuity of the gradient of f. The corresponding algorithm can be written as follows. Algorithm 3. Assumptions: f has a Lipschitz continuous gradient with constant L for the norm ; the set Q is closed, convex, and has a nonempty interior. Choose x 0 Q, set v 0 := x 0 and λ 0 := 1. Choose a strongly convex function d with strong convexity constant σ > 0 for the norm, minimized in x 0 Q, and that vanishes in x 0 and set φ 0 := f(x 0 ) + Ld(x)/σ. For 0, Find α such that α = (1 α )λ. Set λ +1 := (1 α )λ. Set y := α v + (1 α )x. Set w := arg min x Q α f (y ), x v + λ +1 χ (x, v )}. 9

10 Set x +1 := α w + (1 α )x. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End 4 Cubic regularization and beyond 4.1 Cubic regularization for constrained problems Cubic regularization has been developed by Nesterov and Polya in [NP06], and further extended by Nesterov in [Nes06, Nes07] and by Cartis, Gould, and Toint [CGT07]. We derive in this section a slight modification of Nesterov s algorithm for constrained convex problems and establish its convergence speed using an alternative proof to his, which allows us to further extend the accelerated algorithm to other types of regularization. We consider here a convex objective function f that is twice continuously differentiable and that has a Lipschitz continuous Hessian with respect to a matrix induced norm, that is, a norm of the form a = Ga, a 1/, where G is a positive definite Gram matrix. In other words, we assume that there exists M > 0 such that for every x, y dom f, we can write: f (y) f (x) M y x. The matrix norm used above is the one induced by the norm we have chosen, that is, A := max x =1 Ax. A consequence of the Lipschitz continuity of the Hessian (see Lemma 1 in [NP06] for a proof) reads: f (y) f (x) f (x)(y x) M y x x, y dom f. (7) The optimization problem of interest here consists in minimizing f on a closed convex set Q R n with a nonempty interior. Let us fix a starting point x 0 Q. We initialize the construction of our estimate sequence by: φ 0 (x) := f(x 0 ) + M 6 x x 0 3. As in Proposition., we consider linear underestimates of f: f (x) = f(y ) + f (y ), x y to build the estimate sequence. Let us define χ (x, z) := M x z 3 /1 for every 0. Since the inequality () is satisfied in view of Lemma 8., we can use Lemma.3 to define our goal: at iteration, we need to find a point x +1 Q and suitable coefficients α for which: f(y ) + f (y ), (1 α )x + α v y } + min α f M (y ), x v + λ +1 x Q 1 x v 3 f(x +1 ). (8) 10

11 Recall that v is the minimizer of φ on Q. Our strategy here is to tae x +1 := y, so we are left with the problem of determining a point y for which the sum of the two last terms on the left-hand side is nonnegative. In parallel with what we have done in the previous sections where the objective function had a Lipschitz continuous gradient, we define: x N (x) := arg min f(x) + f (x), y x + 1 f (x)(y x), y x + N6 } y y Q x 3 for every N M. For the subsequent method to have a practical interest at all, the above optimization problem has to be easy to solve. As noticed by Nesterov and Polya in Section 5 of [NP06], unconstrained nonconvex although this paper does not leave the convex realm problems of the above type can be solved efficiently because their strong dual boils down to a convex optimization problem with only one variable. Moreover, the optimal solution of the original problem can be easily reconstructed from the dual optimum. For constrained problems, Nesterov observed in Section 6 of [Nes06] that, as long as convex quadratic functions can be minimized easily on Q, we can guarantee an easy access to x N (x). The optimality condition for x N (x) reads as follows: f (x)+f (x)(x N (x) x), y x N (x) + N x x N (x) G(x N (x) x), y x N (x) 0 y Q. (9) We start our analysis with an easy lemma, an immediate generalization of which will be exploited in the next section as well. Lemma 4.1 Let g R n, λ > 0, x, v Q and z := (1 α)x + αv for an α [0, 1]. We have: min α g, y v + λχ (y, v)} min g, y z + λ } y Q y Q α 3 χ (y, z). Proof By convexity of Q, we have Q = αq + (1 α)q. Therefore: because x belongs to Q. Now, we can write: Q z = (1 α)(q x) + α(q v) α(q v) minα g, y v + λχ (y, v) : y Q} = min g, u + λm u 3 /(1α 3 ) : u α(q v)} min g, u + λm u 3 /(1α 3 ) : u Q z} = min g, y z + λχ (y, z)/α 3 : y Q}. The next lemma plays the crucial role in the validation of the desired inequality. (Compare with item 1 of Theorem 1 in [Nes07]). We write r N (x) for x x N (x). Lemma 4. For every x, y Q, we have f (x N (x)), y x N (x) M + N r N (x) y x + N M r N (x) 3. 11

12 Proof By the optimality condition (9), we have for every x, y Q: 0 f (x) + f (x)(x N (x) x), y x N (x) + Nr N(x) G(xN (x) x), y x N (x). (10) Observe that: G(xN (x) x), y x N (x) r N (x) y x r N (x). Moreover, in view of the Hessian Lipschitz continuity (7), we have: f (x) + f (x)(x N (x) x), y x N (x) f (x) + f (x)(x N (x) x) f (x N (x)) y x N (x) + f (x N (x)), y x N (x) M r N (x) ( y x + r N (x) ) + f (x N (x)), y x N (x). Summing up these two inequalities with appropriate multiplicative coefficients, we get form (10): 0 Nr N(x) = N + M ( rn (x) y x r N (x) ) + M r N(x) ( y x + r N (x) ) + f (x N (x)), y x N (x) r N (x) y x + M N r N (x) 3 + f (x N (x)), y x N (x). We are now ready to design an estimate sequence scheme for the constrained optimization problem we are interested in. Algorithm 4.1 Assumption: f is convex and has a Lipschitz continuous Hessian with constant M for the matrix norm induced by. Choose x 0 Q, set v 0 := x 0 and λ 0 := 1. Set φ 0 (x) := f(x 0 ) + M x x 0 3 /6. For 0, Find α such that 1α 3 = (1 α )λ. Set λ +1 := (1 α )λ. Set z := α v + (1 α )x. Set y := arg min y Q f (z ), y z + 1 f (z )(y z ), y z + 5M 6 x z 3 }. Set x +1 := y. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End Theorem 4.1 The above algorithm taes not more than ( ) 1/3 1 1 (f(x 0 ) f(x ) + M6 ) 1/3 ɛ x 0 x 3 iterations to find a point x for which f(x ) f ɛ. 1

13 Proof Let us show first that the inequality (8) is satisfied, that is, that: } f (y ), (1 α )x + α v y + min α f M (y ), x v + λ +1 x Q 1 x v 3 In view of the algorithm, we have set z := α v + (1 α )x and y := x N (z ). Using Lemma 4.1 and Lemma 4., the second term of the above inequality can be bounded from below as follows: } min α f M (y ), y v + λ +1 y Q 1 y v 3 f (y ), y z + λ } +1 min y Q = f (y ), y z + min y Q f (y ), y z + min y Q α 3 M 1 y z 3 f (y ), y y + λ +1 M + N Therefore, the inequality we need to prove becomes: N M r N (z ) 3 + min y Q α 3 } M 1 y z 3 0. r N (z ) y z + N M r N (z ) 3 + λ +1 M α 3 1 y z 3 M + N r N (z ) y z + λ +1 α 3 } M 1 y z 3 0. Suppressing the constraint of the above minimization problem, we get the following lower bound: min M + N r N (z ) y z + λ } +1 M y R n α 3 1 y z 3 = min M + N r N (z ) t + λ } +1 M (M + N) t 0 α 3 1 t3 = 3 r N(z ) 3 3 α 3. M λ +1 Thus, the inequality is satisfied as soon as: N M (M + N) 3 α 3 0, 3 M λ +1 or 9 M(N M) 8 (M + N) 3 α3. λ +1 The left-hand side can be maximized in N. Its maximizer turns out to be attained for N := 5M, in which case its value is 1/1. Note the constants prescribed here have been integrated in Algorithm 4.1. It suffices now to apply Lemma 8.1 to obtain the complexity of the algorithm. }. 13

14 4. Beyond cubic regularization In principle, the above reasoning can be applied in the study of an optimization scheme for constrained convex optimization with higher regularity. However, the obtained scheme would imply to solve at every iteration a problem of the form min f(x) + f (x)[y x] y Q f (m) M (x)[y x,..., y x] + (m + 1)! y x m+1, which can be highly nontrivial. A discussion of the cases where this problem is reasonably easy e.g. where the above objective function is convex and/or has an easy dual can be the topic of a further paper. In fact, this problem does not need to be solved extremely accurately. We show in this paper that, in the case where D Q is bounded, a solution with accuracy O(ɛ 1.5 ) is amply sufficient to guarantee the reliability of the algorithm see Subsection 5.1 for more details. Nevertheless, let us analyze this scheme, as an illustration of the power of the estimate sequence framewor. Given a norm on R n, we define the norm of a tensor A of rand d as: A := sup sup A[x 1,..., x d 1 ], x d. x 1 =1 x d =1 Let us assume that the m-th derivative of f is Lipschitz continuous: f (m) (y) f (m) (x) M y x for every x, y Q. By integrating several times the above inequality, we can easily deduce that for every j between 0 and m, and for every x, y R n, we have: f (m j) (y) f (m j) (x) f (m j+1) (x)[y x] 1 j! f (m) (x)[y x,..., y x] M (j + 1)! y x j+1. (11) Actually, we only need the inequality for j := m 1 in our reasoning. For constructing the first function of our estimate sequence, we choose a starting point x 0 Q, and set: M φ 0 (x) := f(x 0 ) + (m + 1)! x x 0 m+1. Then, following the construction outlined in Proposition., we define for an appropriate choice of α and y. φ +1 (x) = (1 α )φ (x) + α (f(y ) + f (y ), x y ) Let us restrict our analysis to the case where the norm is a matrix induced norm as in the previous section. Lemma 8. provides us with a constant c m+1 such that the function χ (x, y) := Mc m+1 (m + 1)! y x m

15 can be used in Lemma.3. As for cubic regularization, our strategy for exploiting this lemma consists in trying to find at every iteration a point y that satisfies the inequality: x N (x) := arg min y Q f (y ), (1 α )x + α v y + min y Q α f (y ), y v + λ +1 χ (y, v )} 0. (1) The structure of our construction parallels the one for cubic regularization. Our main tool is the minimizer: f(x) + f (x)[y x] f (m) (x)[y x,..., y x] + where N M. A necessary optimality condition reads, with r N (x) := x x N (x) : f (x) + f (x)[x N (x) x] Nr N(x) m 1 } N y x m+1, (m + 1)! 1 (m 1)! f (m) (x)[x N (x) x,, x N (x) x], y x N (x) G(xN (x) x), y x N (x) 0 y Q. (13) Let us extend the two lemmas of the previous section. We omit the proof of the first one, as it is a trivial extension of the one of Lemma 4.1. Lemma 4.3 Let g R n, λ > 0, x, v Q and z := (1 α)x + αv for an α [0, 1]. We have: min α g, y v + λχ (y, v)} min g, y z + λ } y Q y Q α m+1 χ (y, z). Lemma 4.4 For every x, y Q, we have f (x N (x)), y x N (x) M + N Proof First, we can use (11) to get: r N (x) m y x + N M r N (x) m+1. f 1 (x) + + (m 1)! f (m) (x)[x N (x) x,, x N (x) x] f (x N (x)), y x N (x) f 1 (x) + + (m 1)! f (m) (x)[x N (x) x,, x N (x) x] f (x N (x)) y x N (x) M r N (x) m y x N (x) M r N (x) m( y x + r N (x) ). Using the latter inequality in (13), we get: f (x N (x)), y x N (x) + M r N (x) m ( y x +r N (x))+ Nr N (x) m 1 G(xN (x) x), y x N (x) 0. It remains to use G(xN (x) x), y x N (x) r N (x) y x r N (x) to get the desired inequality. The m-regularization algorithm loos as follows. 15

16 Algorithm 4. Assumptions: f is convex and has a Lipschitz continuous m-th differential with constant M for the norm. Choose x 0 Q, set v 0 := x 0 and λ 0 := 1. Set φ 0 (x) := f(x 0 ) + M x x 0 m+1 /(m + 1)!. For 0, Find α such that (m + )α m+1 = c m+1 (1 α )λ. Set λ +1 := (1 α )λ. Set z := α v + (1 α )x. Set y := arg min y Q f (z ), y z f (m) (z )[ ], y z + (m+1)m (m+1)! y z m+1 }. Set x +1 := y. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End Theorem 4. The above algorithm taes not more than ( ) 1/(m+1) ( ) 1/(m+1) m + 1 f(x 0 ) f(x M ) + c m+1 ɛ (m + 1)! x 0 x m+1 iterations to find a point x for which f(x ) f ɛ. Proof The proof is nothing more than an adaptation of the demonstration of Theorem 4.1. With z := α v + (1 α )x and y := x N (z ), the inequality (1) becomes f (y ), z y + min y Q α f (y ), y v + λ +1 χ (y, v )} 0. Applying successively Lemma 4.3 and Lemma 4.4, we can transform this inequality into: f (y ), z y + min α f (y ), y v + λ +1 χ (y, v )} y Q f (y ), z y + min f (y ), y z + λ } +1 χ (y, z ) min M + N y Q y Q N M r N (z ) m+1 + min t 0 ( = r N (z ) m+1 (N M) m m + 1 This quantity is nonnegative as soon as: α m+1 r N (x) m y x + N M r N (z ) m+1 + λ +1 α m+1 M + N r N (z ) m t + λ +1 α m+1 ( (M + N) m+1 α m+1 )1/m ). Mc m+1 λ +1 c m+1 M(N M) m (N + M) m+1 ( m + 1 m ) m αm+1 λ +1. } χ (y, z ) Mc m+1 (m + 1)! tm+1 } 16

17 Maximizing the left-hand side with respect to N, we get a value of c m+1 /(m + ), attained for N = (m + 1)M. It remains to apply Lemma 8.1. Interestingly, the case m := 1, we get a new algorithm for minimizing a convex function with a Lipschitz continuous gradient. However, we need in this algorithm to evaluate the gradient of the function f in two points at every iteration, instead of just one as in Algorithms 3.1 and 3.. We conclude this section with a short note on solving the equation: (m + )α m+1 = c m+1 (1 α )λ. With γ := c m+1 λ /(m + ) > 0, the equation to solve has the form p(t) = t m+1 + γt γ. As p(0) < 0 < p(1) and p (t) > 0 on t [0, 1], this equation has a unique solution and can be solved in a few steps of Newton s algorithm initialized at Cubic and m-th regularization for unconstrained problems It is possible to improve some constants in the complexity analysis for cubic regularization and m-th regularization when problems are unconstrained. We consider here a function f with a M-Lipschitz continuous m-th differential. From the viewpoint of our complexity analysis, the case m = does not bear anything special. There are essentially two elements in the proof that change with respect to the constrained situation. Firstly, the inequality (1) that we need to chec can be simplified because we can compute the exact value of the minimum. It can be rewritten as: f (y ), (1 α )x + α v y min α f (y ), y v + λ +1 χ (y, v )} y R n = m ( ) 1/m ( α m+1 f (y ) m+1 )1/m. (14) m + 1 λ +1 M Secondly, the form of the optimality condition for x N (x) changes. When Q = R n in (13), we have for all x R n : c m+1 f 1 (x) + + (m 1)! f (m) (x)[x N (x) x,, x N (x) x] + Nr N(x) m 1 G(x N (x) x) = 0. (15) This relation allows us to get a ind of counterpart to Lemma 4.4. Lemma 4.5 For every x R n and N M, we have f (x N (x)) (m + 1)N m ( m 1 m ) m 1 m+1 N f N M (x N (x)), x x N (x) m m+1. Proof We can transform the Lipschitz continuity of f (m) (see (11) with j := m 1) using the optimality 17

18 condition (15): ( ) M 0 r N (x) m f (x N (x)) f 1 (x) (m 1)! f (m) (x)[x N (x) x,..., x N (x) x] ( ) M = r N (x) m f (x N (x)) + Nr N (x) m 1 G(x N (x) x) = M N () r N (x) m f (x N (x)) Nr N (x) m 1 f (x N (x)), x N (x) x. Since N M, we can see that f (x N (x)), x N (x) x is negative. Also: f (x N (x)) M N () r N (x) m Nr N(x) m 1 f (x N (x)), x N (x) x ( ) m 1 (m + 1)N m 1 m+1 N f m m N M (x N (x)), x x N (x) m m+1. The last bound comes from the maximization of its left-hand side with respect to r N (x). Now, if we tae x := (1 α )x +α v in the previous lemma, the inequality resembles striingly to the desired inequality (14), provided that we choose y := x N (x). In light of the previous lemma, the following relation ensures that (14) is satisfied: m ( ) 1 m + 1 c m+1 m ( α m+1 λ +1 We can reformulate this inequality as: α m+1 λ +1 c m+1 ) 1 ( ) 1 ( ) m+1 m m 1 ( m m m M (m + 1)N m 1 ( ) m 1 m + 1 M(N M ) m 1 m 1 N m. Maximizing the right-hand side with respect to N, we get: α m+1 λ +1 c m+1 ( m + 1 m m ) m 1. N M ) m 1 m. N and this optimum is attained for N := mm. Comparing this value with the one obtained in the previous section, we see that the improvement is rather significant: their ratio is as large as: (m + 1) m+1 m m, that is, of order O( m) for large values of m. In particular, for cubic regularization (m = ) our constant equals 3/4, while we obtained only 1/1 in the constrained case. The algorithm now reads as follows: 18

19 Algorithm 4.3 Assumptions: f is convex and has a Lipschitz continuous m-th differential with constant M for the norm = G, 1/ ; Q R n. Choose x 0 R n, set v 0 := x 0 and λ 0 := 1. Set φ 0 (x) := f(x 0 ) + M x x 0 m+1 /. For 0, Find α such that α m+1 = c m+1 Set λ +1 := (1 α )λ. Set z := α v + (1 α )x. ( m+1 m m ) m 1 (1 α )λ. Set y := arg min y Q f (z ), y z f (m) (z )[ ], y z + mm (m+1)! y z m+1 }. Set x +1 := y. Set φ +1 (x) := (1 α )φ (x) + α (f(y ) + f (y ), x y ). Set v +1 := arg min x Q φ +1 (x). End The above algorithm does not tae more than: m ( m ) (m 1)/ ( 1 m + 1 ɛ c m+1 iterations to find an ɛ-approximate solution. ) 1/(m+1) ( f(x 0 ) f(x ) + ) 1/(m+1) M (m + 1)! x 0 x m+1 5 Estimate sequences and approximations In the course of an algorithm based on estimate sequences, we must compute at every iteration the minimizer of usually two optimization problems. In order to accelerate the scheme, we can consider solving these subproblems only approximately. A natural question arises immediately: how does the accuracy of the resolution of these subproblems relate to the precision of the final answer given by the algorithm? In particular, how do the successive errors accumulate in the course of the algorithm? For some optimization problems, computing the needed differentials can be quite hard to do accurately (e.g. in Stochastic Optimization, see [SN05], or in large-scaled Semidefinite Optimization, see [d A08]). How precisely must we compute these differentials in order to avoid any accumulation of error? How can we combine these approximations with an accelerated computation of subproblems? We answer these questions in this section for Algorithms 3.1, 3.1, and Inexact resolution of intermediate problems In this subsection, we assume that we have access to accurate gradients of the objective function f, but that we do not have the time or the patience of computing v and/or x +1 (in Algorithm 3.1) or y (in the other algorithms). In order to carry out our analysis, we need to formulate a few assumptions on the original optimization problem. First, we assume that the feasible set Q is compact. Fixing a norm on R n, we denote the finite diameter of Q by: D Q := sup y x : x, y Q}. 19

20 Second, we must formulate some regularity assumptions on the function φ 0. In view of the examples studied in the previous sections, we can consider bounds of the form: L 0 y x φ 0(y) φ 0(x), y x σ 0 y x p for every x, y Q, (16) where p 1 is an appropriately chosen number, and L 0, σ 0 0. We easily deduce that for every x and y, the following inequality holds: L 0 y x φ 0 (y) φ 0 (x) φ 0(x), y x σ 0 p y x p. (17) Also, we can write in view of Theorem.1.5 in [Nes03]: L 0 y x φ 0(y) φ 0(x). (18) Before studying the effect of solving subproblems inexactly, let us chec that the above condition is satisfied in two most typical settings. The following lemma deals with the situation we have considered in Section 4. Lemma 5.1 Consider a matrix induced norm = G, 1/ and a number m 1. The inequality (16) is satisfied when M φ 0 (x) = f(x 0 ) + (m + 1)! x x 0 m+1, with p := m + 1, L 0 := MD p Q /(p )!, and σ 0 := Mc p /p!, where c p is given in Lemma 8.. When m = 1, one can tae L 0 = σ 0 = M. Proof Let p := m + 1. In view of Lemma 8., we have: φ 0 (y) φ 0 (x) φ 0(x), y x M p! c p y x p for every x, y Q. Adding this inequality to the one obtained by inverting x and y, we obtained the desired value of σ. For proving the upper bound, let F p (x, y) := φ 0(y) φ 0(x), y x (p 1)!/M, and bound F p (x, y)/ y x from above. Without loss of generality, we can assume that x 0 = 0. First, max x y Q F p (x, y) y x = max y p ( y p + x p ) Gy, x + x p x y Q y Gy, x + x y p ( y p 1 x + y x p 1 )α + x p = max x y Q y y x α + x 1 α 1 Fixing two distinct points x and y in Q, we denote by ψ(α) the above right-hand side. After some trivial rearrangements, the numerator of its derivative is: ( y p + x p ) y x ( y p 1 x + y x p 1 )( y + x ) = y x ( y p x p )( y x ), 0

21 which is nonnegative, thus the maximum of ψ(α) on [ 1, 1] is attained when α = 1. The maximum of F p (x, y)/ y x reduces to: y p ( y p 1 x + y x p 1 ) + x p max x y Q ( y x ) y=tx ( y x )( y p 1 x p 1 ) max x y Q ( y x ) = max x y Q ( y p + y p 3 x + + x p ) (p 1)D p Q. When p =, φ 0(y) φ 0(x), y x = M y x, and L 0 = σ 0 = M wors. The entropy function is a common choice for constructing φ 0 in Algorithm 3.1 or 3.. In this setting, used when the feasible set Q is a simplex, that is, Q := x R n + : n i=1 x i = 1}, we choose the prox-function d as follows: n d(x) := x i ln(x i ) + ln(n). i=1 Its minimizer on Q is the all-1/n vector, and the second inequality in (16) holds with σ := 1 when we use the 1-norm (see Lemma 3 in [Nes05] for a proof that d(x) is 1-strongly convex on Q or that norm; we show below an extension of this result). However, the first inequality does not hold 1, and our result cannot be applied. Nevertheless, Ben-Tal and Nemirovsi [BTN05] suggest a slight modification of d to regularize this function. Let δ > 0 and d δ (x) := n i=1 ( x i + δ ) ( ln x i + δ ) (1 + δ) ln n n ( 1 + δ n ). Lemma 5. Using the 1-norm for, we have: L 0 y x 1 d δ(y) d δ(x), y x σ 0 y x 1 for every x, y Q, (19) where L 0 := n/δ, σ 0 := 1/(1 + δ). Proof We have to show that d δ (x)h, h / h 1 is bounded from above by n/δ and from below by 1/(1+δ). Using Cauchy-Schwartz s Inequality, we have for every x Q: d δ (x)h, h = n i=1 h i x i + δ/n = ( n ) ( h n ) ( i x i + δ n δ x i=1 i + δ/n n i=1 i=1 ( n ( n ) h h i ) i δ x i + δ/n i=1 i=1 h i x i + δ/n ) = h 1 δ d δ (x)h, h. 1 For a simple chec of this assertion, consider y ɛ := (1 (n 1)ɛ, ɛ,..., ɛ) T and x := (1/n,..., 1/n) T, for 1/n > ɛ > 0. As y ɛ x 1 = (n 1)(1/n ɛ) <, and d (y ɛ) d (x), y ɛ x = (n 1)(1/n ɛ) ln(1/ɛ (n 1)) is unbounded when ɛ 0, the upper bound in (16) cannot be guaranteed, whatever L 0 is. 1

22 From the other side, we have: n i=1 h i x i + δ/n n i=1 h i δ/n n δ h 1. The following proposition indicates the effect of constructing an approximate minimizer ˆv +1 to φ +1 and an approximate point ˆx +1 on the fundamental inequality φ (v ) f(x ). At the end of this subsection, we particularize this proposition to the three algorithms under consideration. The notation comes from Lemma.3. Proposition 5.3 Assume that inequality (16) holds, and that the following slight extension of Inequality () in Lemma.3 is satisfied with functions χ (x, y) σ 0 y x p /p: φ 0 (y) φ 0 (x) + φ 0(x), y x + χ (y, x) for all x, y Q. (0) Let ɛ 0, let γ [0, 1] and fix 0. Assume that ˆx, ˆv Q, and min x Q φ (x) f(ˆx ) ɛ. Suppose that the accuracy ˆɛ by which we have computed ˆv, that is a constant verifying ˆɛ φ (ˆv ) φ (v ), satisfies the following bound: ( ) p p 0 ˆɛ min 1, α γ ɛ 1 α p 1 + D Q L 0 pλ p 1, (1) /σ 0 and suppose that the accuracy by which we compute ˆx +1 guarantees: f(y )+ g(y ), (1 α )ˆx +α ˆv y +min x Q α g(y ), x ˆv + λ +1 χ(x, ˆv )} f(ˆx +1 ) α (1 γ)ɛ, where g(y ) f(y ). Then: Moreover, if min φ +1(x) f(ˆx +1 ) ɛ. x Q ɛ λ +1 α γ () ( ) 1 D p Q Lp p 1 0, (3) σ 0 the bound on ˆɛ can be improved to: ( ) p ( ) ( ) p α γ ɛ σ 0 0 ˆɛ 1 α D Q L 0 pλ p 1. (4) Proof Let us fix 0. Observe that the condition (0) implies: φ (y) φ (x) + φ (x), y x + λ χ (y, x) for all x, y Q, 0.

23 First, we bound min x Q φ (ˆv ), x ˆv from below. Obviously, the function φ has a Lipschitz continuous gradient with constant λ L 0. Observe that, in view of (18): L 0 λ ˆv v φ (ˆv ) φ (v ), and ˆɛ φ (ˆv ) φ (v ) σ 0λ p ˆv v p. (5) min x Q φ (ˆv ), x ˆv min x Q φ (v ), x ˆv φ (ˆv ) φ (v ) x ˆv } min x Q φ (v ), x v + φ (v ), v ˆv L 0 λ ˆv v x ˆv } φ (v ), v ˆv L 0 λ ˆv v D Q φ (v ) φ (ˆv ) + λ χ (v, ˆv ) L 0 λ ˆv v D Q ˆɛ + σ 0λ p ˆv v p L 0 λ ˆv v D Q. Now, the function t σ 0 t p /p L 0 D Q t is decreasing in [0, t ], where t := (L 0 D Q /σ 0 ) 1/(p 1). We now from (5) that we can estimate ˆv v by ˆt := p pˆɛ /σ 0 λ. If ˆt t, we can write the following bound: min x Q φ p pˆɛ (ˆv ), x ˆv L 0 λ D Q. σ 0 λ Observe that ˆt t is ensured when (3) and (4) hold. Also, the bound (4) implies: min x Q φ (ˆv ), x ˆv α γɛ. 1 α If we cannot guarantee that ˆt t, we can use the following slightly less favorable estimation, provided that ˆɛ 1: min x Q φ (ˆv ), x ˆv ˆɛ + σ 0λ p ˆv v p L 0 λ ˆv v D Q. p pˆɛ ˆɛ L 0 λ ˆv v D Q ˆɛ L 0 λ D Q. σ 0 λ The bound (1) on ˆɛ ensures that: p ˆɛ p pˆɛ L 0 λ D Q = p ˆɛ σ 0 λ min x Q φ (ˆv ), x ˆv α γɛ. 1 α ( p 1 + L 0 λ p σ 0 λ D Q ). 3

24 We conclude our proof by following essentially the same steps as in the argument of Lemma.3: [ ]} min φ +1(x) = min (1 α )φ (x) + α f(y ) + g(y ), x y x Q x Q [ ] [ ]} min (1 α ) φ (ˆv ) + φ (ˆv ), x ˆv + λ +1 χ (x, ˆv ) + α f(y ) + g(y ), x y x Q [ ]} min (1 α )φ (ˆv ) α γɛ + λ +1 χ (x, ˆv ) + α f(y ) + g(y ), x y x Q ] (1 α )f(ˆx ) (1 α + α γ)ɛ + α [f(y ) + g(y ), ˆv y + min x Q α g(y ), x ˆv + λ +1 χ (x, ˆv )} f(y ) + g(y ), (1 α )ˆx + α ˆv y (1 α + α γ)ɛ + min x Q α g(y ), x ˆv + λ +1 χ (x, ˆv )} f(ˆx +1 ) ɛ. The previous proposition has a clear meaning with respect to the computation of ˆv. The designer of an estimate sequence scheme must tae a particular care in the choice of φ 0 because the computation of v, that is, of a minimizer of φ 0 plus a linear term over Q, must be done quite accurately: for instance, for Algorithms 3.1 and 3., we have α = Θ(1/) Ω( L/ɛ), and we get the lower bound ɛ Ω(γ ɛ 3 /L). It would be desirable that this computation remains relatively cheap, or even that this minimizer can be computed analytically. As far as the criterion () on the accuracy of ˆx +1 is concerned, it is not difficult to relate it with a condition on how precisely the corresponding intermediate optimization problem has to be solved. Roughly speaing, this intermediate problem must be solved within an accuracy of (1 γ)α ɛ or of (1 γ)α ɛ/d Q. For instance, in Algorithms 3.1 and 3., the inequality () is guaranteed as soon as h(ˆx +1 ) min x Q h (x) (1 γ)α ɛ, where α f (y ), x v + λ +1 χ (x, v ). For Algorithm 4., we can replace the condition () by the following one, provided that the feasible set Q has a finite diameter D Q : where h (ˆx +1 ) (1 γ)α ɛ D Q, h (x) := f (z ), x z f (m) (z )[x z,, x z ], x z + Obviously, this condition implies the following approximate optimality criterion: h (ˆx +1 ), y ˆx +1 (1 γ)α ɛ y Q. (m + 1)M x z m+1. (m + 1)! In order to show how the above criterion implies (), one can easily adapt the technical Lemma 4.4 into: x Q f (ˆx +1 ), y ˆx +1 (1 γ)α ɛ M + N 4 ˆr m y z + N M ˆr m+1,

25 where ẑ = (1 α )ˆx +α ˆv, and ˆr := ˆx +1 ẑ. Using this inequality and the same argument as in the proof of Theorem 4., we can immediately show that the desired inequality () holds. 5. Approximate subgradients and higher-order differentials In some circumstances, e.g. in the framewor of stochastic optimization where a prior Monte-Carlo sampling is used to approximate the actual objective function (see [SN05]), we do not have access to an exact subgradient of the objective function f. Specifically, we assume that, for a given accuracy ɛ > 0 and a point x of dom f, we can only determine in a reasonable time a subgradient g that satisfies the two following properties: and y dom f, f(y) f(x) + g, y x ɛ, (6) y dom f, f(y) f(x) + g, y x ɛ y x, (7) where is an appropriate norm. The first inequality is used in Chapter XI of [HUL93b] in the definition of ɛ-subgradients. The second one is defined in Section 1.3 of [Mor05] as analytic ɛ-subgradients. We shall denote the set of the approximate subgradients that satisfy (6) and (7) by ɛ f(x). The interest of mixing these two notions of subgradient lies in the fact that we can use affine underestimates for constructing our estimate sequence, and, at the same time, employ the following lemma on the error of the approximate subgradient over the actual one. In a more careful analysis, we could mae a distinction between the required accuracy in (6) and (7), defining (ɛ 1, ɛ )- subgradients. It not difficult to incorporate this extra feature in our argument. The following lemma shows a useful consequence of the inequality (7). Lemma 5.4 Let f : R n R + } be a closed convex function, let x dom f, ɛ 0, and g R n satisfying (7). Then g(x) g ɛ for a subgradient g(x) f(x), where the norm is dual to the norm used in (7). Proof The subgradient of the function h(y) := y x in y = x is B [0, 1] := s R n : s 1}. Indeed, for every ŝ B [0, 1] and every y R n we have: h(y) = y x = max s, y x ŝ, y x = h(x) + ŝ, y x. s B [0,1] On the other hand, if some ŝ verifies h(y) h(x)+ ŝ, y x for every z R n, then 1 max ŝ, u : u 1} = ŝ. Let g be a vector satisfying (7), or equivalently g (f + ɛh)(x). According to Theorem 3.8 in [Roc70], we have: (f + ɛh)(x) = f(x) + ɛ h(x) = f(x) + ɛb [0, 1]. Therefore, there exist a g(x) f(x) and a ξ B [0, 1] such that g = g(x) + ɛξ, which implies g g(x) ɛ. 5

26 Given an approximate subgradient g ɛ f(y ) for an ɛ > 0, a natural candidate for the underestimate f is: f (x) = f(y ) + g, x y ɛ. (8) The functions of the estimate sequence become: 1 φ (x) = λ φ 0 (x) + i=0 1 λ α i (f(y i ) + g i, x y i ) λ i+1 i=0 λ α i λ i+1 ɛ i. The instrumental inequality in Lemma.3 can be easily extended to approximate subgradients. Its demonstration follows closely the proof of Lemma.3, and we just setch the small variations between the two proofs. ) Lemma 5.5 We are given an estimate sequence ((φ ) 0 ; (λ ) 0 for the convex problem min f(x), x Q constructed according to Proposition. using affine underestimates of f: f (x) := f(y ) + g, x y ɛ for some y Q, ɛ > 0 and g ɛ f(y ). We also assume that φ 0 is continuously differentiable and that we have functions χ : Q Q R + for 0 such that χ (x, y) = 0 implies x = y, and for which: φ 0 (y) φ 0 (x) + φ 0(x), y x + χ (y, x) for all y, x Q. If x and v satisfy φ (v ) f(x ) ɛ for an ɛ > 0, then: min φ +1(x)} f(y ) g, (1 α )x +α v y min α g, x v + λ +1 χ (x, v )} ɛ +(1 α )ɛ x Q x Q for every 0. Moreover, if y := (1 α )x + α v, the right-hand side can be replaced by Proof We have for every x Q: α (1 + (1 α )D Q ) ɛ + (1 α )ɛ. φ +1 (x) (1 α )(f(x ) ɛ) + λ +1 χ (x, v ) + α (f(y ) + g, x y ɛ ) (1 α )(f(y ) + g, x y ɛ ɛ) + λ +1 χ (x, v ) + α (f(y ) + g, x y ɛ ), which is exactly the desired inequality. For the second bound, we can proceed as follows. Here g(y ) is a subgradient of f at y such that g g(y ) ɛ, and y := (1 α )x + α v, lie in Algorithm 3.1 and Algorithm 3.. φ +1 (x) (1 α )(f(x ) ɛ) + λ +1 χ (x, v ) + α (f(y ) + g, x y ɛ ) (1 α )(f(y ) + g(y ), x y ɛ) + λ +1 χ (x, v ) + α (f(y ) + g, x y ɛ ) = f(y ) + α g(y ) g, y v (1 α )ɛ + λ +1 χ (x, v ) + α ( g, x v ɛ ). 6

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Cubic regularization of Newton s method for convex problems with constraints

Cubic regularization of Newton s method for convex problems with constraints CORE DISCUSSION PAPER 006/39 Cubic regularization of Newton s method for convex problems with constraints Yu. Nesterov March 31, 006 Abstract In this paper we derive efficiency estimates of the regularized

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly
