Complexity bounds for primal-dual methods minimizing the model of objective function

Yu. Nesterov

July 4, 2016

Abstract

We provide the Frank-Wolfe (Conditional Gradient) method with a convergence analysis allowing us to approach a primal-dual solution of a convex optimization problem with composite objective function. Additional properties of the complementary part of the objective (strong convexity) significantly accelerate the scheme. We also justify a new variant of this method, which can be seen as a trust-region scheme applied to the linear model of the objective function. For this variant, we also prove a rate of convergence for the total variation of the linear model of the composite objective over the feasible set. Our analysis works also for a quadratic model, allowing us to justify the global rate of convergence for a new second-order method as applied to a composite objective function. To the best of our knowledge, this is the first trust-region scheme supported by a worst-case complexity analysis both for the functional gap and for the variation of the local quadratic model over the feasible set.

Keywords: convex optimization, complexity bounds, linear optimization oracle, conditional gradient method, trust-region method.

Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium; e-mail: Yurii.Nesterov@uclouvain.be. The research results presented in this paper have been supported by the grant "Action de recherche concertée ARC 14/19-060" from the Direction de la recherche scientifique, Communauté française de Belgique. Scientific responsibility rests with the author.

1 Introduction

Motivation. In recent years, we can see an increasing interest in the Frank-Wolfe algorithm [3, 2, 11], which is sometimes called the Conditional Gradient Method (CGM) [5, 7, 9, 8]. At each iteration of this scheme, we need to solve an auxiliary problem of minimizing a linear function over a convex feasible set. In some situations, mainly when the feasible set is a simple polytope, the complexity of this subproblem is much lower than that of the standard projection technique (e.g. [12]). The standard complexity results for this method are related to a convex objective function with Lipschitz-continuous gradient. In this situation, CGM converges as $O(1/k)$, where $k$ is the number of iterations. Moreover, it appears that this rate of convergence is optimal for methods with a linear optimization oracle [10]. For nonsmooth functions, CGM may fail to converge (we give a simple example in Section 2). Therefore, it is interesting to study the dependence of the rate of convergence of CGM on the level of smoothness of the objective function. On the other hand, sometimes nonsmoothness of the objective function results from a complementary regularization term. This situation can be treated in the framework of composite minimization [14]. However, the performance of CGM for this structure of the objective function has not been studied yet.(1) Finally, by its spirit, CGM is a primal-dual method. Indeed, it generates lower bounds on the optimal value of the objective function, which converge to the optimal value [4]. Therefore, it would be natural to extract from this method an approximate solution of the dual problem. These questions served as the main motivations for this paper.

Contents. In Section 2, we introduce our main problem of interest, where the objective function has a composite form. Our main assumption is that the problem of minimizing a linear function augmented by a simple convex complementary term is solvable. We assume that the smooth part of the objective has Hölder-continuous gradients. For proving the efficiency estimate for CGM, we apply the technique of estimating sequences (e.g. [12]) in its extended form [18]. As a result, we get significant freedom in the choice of stepsize coefficients. In Section 3 we consider a new variant of CGM, which can be seen as a trust-region method with a linear model of the objective. For this scheme, the trust region is formed by contracting the feasible set towards the current test point. This method ensures an appropriate rate of convergence for the total variation of the linear model, computed at the sequence of test points. The analytical form of our bounds for the primal-dual gap is similar to the bounds obtained in [4]. However, we estimate the difference between the current value of the objective function and the minimal value of the accumulated linear model. In Section 4, we explain how to extract from our convergence results an upper bound on the duality gap for some feasible primal-dual solution. In our technique we use an explicit max-representation of the smooth part of the objective function. In Section 5, we show that additional properties of the complementary part of the objective (strong convexity) significantly accelerate the scheme. Finally, in Section 6, we apply our technique to a new second-order trust-region method, where the quadratic approximation of the objective function is minimized over a trust region formed by a contracted feasible set. To the best of our knowledge, this is the first trust-region scheme [1] supported by a worst-case complexity analysis.

(1) Performance of CGM as applied to an objective function regularized by a norm was studied in [6].

Notation. In what follows, we consider optimization problems over a finite-dimensional linear space $E$ with dual space $E^*$. The value of a linear function $s \in E^*$ at $x \in E$ is denoted by $\langle s, x \rangle$. In $E$, we fix a norm $\|\cdot\|$, which defines the conjugate norm

$$\|s\|_* = \max_x \{\langle s, x\rangle : \|x\| \le 1\}, \quad s \in E^*.$$

For a linear operator $A : E \to E_1^*$, its conjugate operator $A^* : E_1 \to E^*$ is defined by the identity

$$\langle Ax, y\rangle = \langle A^* y, x\rangle, \quad x \in E, \ y \in E_1.$$

We call an operator $B : E \to E^*$ self-conjugate if $B^* = B$. It is positive semidefinite if $\langle Bx, x\rangle \ge 0$ for all $x \in E$. For a linear operator $B : E \to E^*$, we define its operator norm in the usual way:

$$\|B\| = \max_x \{\|Bx\|_* : \|x\| \le 1\}.$$

For a differentiable function $f(x)$ with $\operatorname{dom} f \subseteq E$, we denote by $\nabla f(x) \in E^*$ its gradient and by $\nabla^2 f(x) : E \to E^*$ its Hessian. Note that $(\nabla^2 f(x))^* = \nabla^2 f(x)$.

In the sequel, we often need to estimate partial sums of different series. For that, it is convenient to use the following trivial lemma.

Lemma 1. Let the function $\xi(\tau)$, $\tau \in \mathbb{R}$, be decreasing and convex. Then, for any two integers $a$ and $b$ such that $[a - \tfrac12, b+1] \subset \operatorname{dom} \xi$, we have

$$\int_a^{b+1} \xi(\tau)\,d\tau \ \le\ \sum_{k=a}^b \xi(k) \ \le\ \int_{a-1/2}^{b+1/2} \xi(\tau)\,d\tau. \quad (1.1)$$

For example, for any $t \ge 0$ and integer $p \ge -t$, we have

$$\sum_{k=t}^{2t+p} \frac{1}{k+p+1} \ \overset{(1.1)}{\ge}\ \int_t^{2t+p+1} \frac{d\tau}{\tau+p+1} = \ln(\tau+p+1)\Big|_t^{2t+p+1} = \ln\frac{2t+2p+2}{t+p+1} = \ln 2. \quad (1.2)$$

On the other hand, if $t \ge 1$, then

$$\sum_{k=t}^{2t+1} \frac{1}{(k+2)^2} \ \overset{(1.1)}{\le}\ \int_{t-1/2}^{2t+3/2} \frac{d\tau}{(\tau+2)^2} = \frac{1}{t+3/2} - \frac{1}{2t+7/2} = \frac{4t+8}{(2t+3)(4t+7)} \le \frac{2}{2t+3}. \quad (1.3)$$

In this paper, we will use some special dual functions. Let $Q \subseteq E$ be a bounded closed convex set. For a closed convex function $F(\cdot)$ with $\operatorname{dom} F \supseteq \operatorname{int} Q$, we define its restricted dual function (with respect to a central point $\bar x \in Q$) as follows:

$$F^*_{\bar x, Q}(s) = \max_{x \in Q} \{\langle s, \bar x - x\rangle + F(\bar x) - F(x)\}, \quad s \in E^*. \quad (1.4)$$

Clearly, this function is well defined for all $s \in E^*$. Moreover, it is convex and nonnegative on $E^*$.

In the theory of Nonsmooth Convex Optimization, dual functions of this type are often used for analyzing the behavior of first-order methods. In those applications, the function $F$ in (1.4) is assumed to be a multiple of a strongly convex prox-function (see, for example, [13]). Such an assumption provides a possibility to control the step sizes in the corresponding optimization schemes. In this paper, we are going to drop the assumption of strong convexity. Hence, we need another mechanism for controlling the sizes of our moves. This is the reason why we introduce in construction (1.4) an additional scaling parameter $\tau \in [0,1]$. We call the function

$$F^*_{\tau, \bar x, Q}(s) = \max_{x \in Q} \{\langle s, \bar x - y\rangle + F(\bar x) - F(y) : y = (1-\tau)\bar x + \tau x\}, \quad s \in E^*, \quad (1.5)$$

the scaled restricted dual of the function $F$.

Lemma 2. For any $s \in E^*$ and $\tau \in [0,1]$, we have

$$F^*_{\bar x, Q}(s) \ \ge\ F^*_{\tau, \bar x, Q}(s) \ \ge\ \tau F^*_{\bar x, Q}(s). \quad (1.6)$$

Proof: Since for any $x \in Q$ the point $y = (1-\tau)\bar x + \tau x$ belongs to $Q$, the first inequality is trivial. On the other hand, by convexity of $F$,

$$F^*_{\tau, \bar x, Q}(s) = \max_{x \in Q} \{\langle s, \tau(\bar x - x)\rangle + F(\bar x) - F(y) : y = (1-\tau)\bar x + \tau x\}$$
$$\ge \max_{x \in Q} \{\langle s, \tau(\bar x - x)\rangle + F(\bar x) - (1-\tau)F(\bar x) - \tau F(x)\} = \tau F^*_{\bar x, Q}(s). \qquad \square$$

2 Conditional gradient method

In this paper we consider numerical methods for solving the composite minimization problem

$$\min_x \left\{\bar f(x) \overset{\mathrm{def}}{=} f(x) + \Psi(x)\right\}, \quad (2.1)$$

where $\Psi$ is a simple closed convex function with bounded domain $Q \subseteq E$, and $f$ is a convex function which is subdifferentiable on $Q$. Denote by $x^*$ one of the optimal solutions of (2.1), and let $D \overset{\mathrm{def}}{=} \operatorname{diam}(Q)$.

Our assumption on the simplicity of the function $\Psi$ means that some auxiliary optimization problems related to $\Psi$ are easily solvable. The complexity of these problems will always be discussed for the corresponding optimization schemes. The most important examples of the function $\Psi$ are as follows.

- $\Psi$ is the indicator function of a closed convex set $Q$:

$$\Psi(x) = \operatorname{Ind}_Q(x) \overset{\mathrm{def}}{=} \begin{cases} 0, & x \in Q,\\ +\infty, & \text{otherwise}.\end{cases} \quad (2.2)$$

- $\Psi$ is a self-concordant barrier for a closed convex set $Q$ (see [16, 12]).

- $\Psi$ is a nonsmooth convex function with simple structure (e.g. $\Psi(x) = \|x\|_1$).
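To make this simplicity assumption concrete, here is a minimal Python sketch of the auxiliary oracle $v_\Psi(s) = \arg\min_x\{\langle s, x\rangle + \Psi(x)\}$ (formalized as Assumption 1 below) for two illustrative choices: $\Psi$ the indicator of the standard simplex, and $\Psi(x) = \|x\|_1$ over a coordinate box. These concrete sets are our assumptions for illustration; the paper does not fix them.

```python
import numpy as np

def lmo_simplex(s):
    """Psi = Ind of the standard simplex: <s, x> is minimized at a vertex."""
    v = np.zeros_like(s)
    v[np.argmin(s)] = 1.0
    return v

def lmo_l1_box(s, l, u):
    """Psi(x) = ||x||_1 on the box Q = [l, u]: the problem separates per coordinate,
    and each 1-D piecewise-linear piece is minimized at an endpoint or at 0."""
    v = np.empty_like(s)
    for i in range(len(s)):
        cands = [l[i], u[i]] + ([0.0] if l[i] <= 0.0 <= u[i] else [])
        v[i] = min(cands, key=lambda y: s[i] * y + abs(y))
    return v
```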

We assume that the function $f$ is represented by a black-box oracle. If it is a first-order oracle, we assume that its gradients satisfy the following Hölder condition:

$$\|\nabla f(x) - \nabla f(y)\|_* \le G_\nu \|x - y\|^\nu, \quad x, y \in Q. \quad (2.3)$$

The constant $G_\nu$ is formally defined for any $\nu \in (0,1]$. For some values of $\nu$ it can be $+\infty$. Note that for any $x$ and $y$ in $Q$ we have

$$f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{G_\nu}{1+\nu}\|y - x\|^{1+\nu}. \quad (2.4)$$

If it is a second-order oracle, we assume that its Hessians satisfy the Hölder condition

$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le H_\nu \|x - y\|^\nu, \quad x, y \in Q. \quad (2.5)$$

In this case, for any $x$ and $y$ in $Q$ we have

$$f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \tfrac12\langle\nabla^2 f(x)(y - x), y - x\rangle + \frac{H_\nu \|y - x\|^{2+\nu}}{(1+\nu)(2+\nu)}. \quad (2.6)$$

For the first variant of the Conditional Gradient Method (CGM), our assumption on the simplicity of the function $\Psi$ means exactly the following.

Assumption 1. For any $s \in E^*$, the auxiliary problem

$$\min_{x \in Q}\{\langle s, x\rangle + \Psi(x)\} \quad (2.7)$$

is easily solvable. Denote by $v_\Psi(s) \in Q$ one of its optimal solutions.

In the case (2.2), this assumption implies that we are able to solve the problem $\min_x\{\langle s, x\rangle : x \in Q\}$. Note that the point $v_\Psi(s)$ is characterized by the following variational principle:

$$\langle s, x - v_\Psi(s)\rangle + \Psi(x) \ge \Psi(v_\Psi(s)), \quad x \in Q. \quad (2.8)$$

In order to solve problem (2.1), we apply the following method.

Conditional Gradient Method, Type I

1. Choose an arbitrary point $x_0 \in Q$.
2. For $t \ge 0$ iterate:
   a) Compute $v_t = v_\Psi(\nabla f(x_t))$.
   b) Choose $\tau_t \in (0,1]$ and set $x_{t+1} = (1-\tau_t)x_t + \tau_t v_t$.    (2.9)

It is clear that this method can minimize only functions with continuous gradient.
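As an illustration, the following minimal sketch runs method (2.9) with the stepsize rule $\tau_t = 2/(t+2)$ (the linear-weight rule derived later in this section) on a hypothetical instance: $f(x) = \tfrac12\|Ax - b\|^2$ with $\Psi$ the indicator of the standard simplex, so that $v_\Psi(s)$ is a vertex-picking linear minimization oracle. The instance and all names are ours, not the paper's.

```python
import numpy as np

def lmo_simplex(s):
    """v_Psi(s) for Psi = Ind of the standard simplex: a vertex minimizing <s, x>."""
    v = np.zeros_like(s)
    v[np.argmin(s)] = 1.0
    return v

def cgm_type1(grad_f, lmo, x0, iters=500):
    x = x0.copy()
    for t in range(iters):
        v = lmo(grad_f(x))             # step (a): v_t = v_Psi(grad f(x_t))
        tau = 2.0 / (t + 2.0)          # step (b): tau_t from rule (2.10), a_t = t
        x = (1.0 - tau) * x + tau * v  # convex combination stays feasible
    return x

# Hypothetical toy instance: least squares over the simplex.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
x_hat = cgm_type1(lambda x: A.T @ (A @ x - b), lmo_simplex, np.ones(10) / 10)
```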

Example 1. Let $\Psi(x) = \operatorname{Ind}_Q(x)$ with $Q = \{x \in \mathbb{R}^2 : (x^{(1)})^2 + (x^{(2)})^2 \le 1\}$. Define $f(x) = \max\{x^{(1)}, x^{(2)}\}$. Then clearly $x^* = \left(-\tfrac{1}{\sqrt2}, -\tfrac{1}{\sqrt2}\right)^T$. Let us choose in (2.9) $x_0 \ne x^*$. For the function $f$, we can apply an oracle which returns at any $x \in Q$ a subgradient $\nabla f(x) \in \{(1,0)^T, (0,1)^T\}$. Then, for any feasible $x$, the point $v_\Psi(\nabla f(x))$ is equal either to $y_1 = (-1, 0)^T$ or to $y_2 = (0, -1)^T$. Therefore, all points of the sequence $\{x_t\}_{t\ge0}$ generated by method (2.9) belong to the triangle $\operatorname{Conv}\{x_0, y_1, y_2\}$, which does not contain the optimal point $x^*$.

In order to justify the rate of convergence of method (2.9) for functions with Hölder continuous gradients, we apply the estimating sequences technique [12] in its relaxed form [18]. For that, it is convenient to introduce in (2.9) new control variables. Consider a sequence of nonnegative weights $\{a_t\}_{t\ge0}$. Define

$$A_t = \sum_{k=0}^t a_k, \qquad \tau_t = \frac{a_{t+1}}{A_{t+1}}, \quad t \ge 0. \quad (2.10)$$

From now on, we assume that the parameter $\tau_t$ in method (2.9) is chosen in accordance with rule (2.10). Denote

$$V_0 = \max_{x\in Q}\{\langle\nabla f(x_0), x_0 - x\rangle + \Psi(x_0) - \Psi(x)\}, \qquad B_{\nu,t} = a_0 V_0 + \sum_{k=1}^t \frac{a_k^{1+\nu}}{A_k^\nu} G_\nu D^{1+\nu}, \quad t \ge 0. \quad (2.11)$$

It is clear that

$$V_0 \overset{(2.4)}{\le} \max_{x\in Q}\left\{f(x_0) - f(x) + \frac{G_\nu}{1+\nu}\|x - x_0\|^{1+\nu} + \Psi(x_0) - \Psi(x)\right\} \le \bar f(x_0) - \bar f(x^*) + \frac{G_\nu D^{1+\nu}}{1+\nu} \overset{\mathrm{def}}{=} \Delta(x_0) + \frac{G_\nu D^{1+\nu}}{1+\nu}. \quad (2.12)$$

Theorem 1. Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (2.9). Then, for any $\nu \in (0,1]$, any step $t \ge 0$, and any $x \in Q$ we have

$$A_t(f(x_t) + \Psi(x_t)) \le \sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + B_{\nu,t}. \quad (2.13)$$

Proof:

Indeed, in view of definition (2.11), for $t = 0$ inequality (2.13) is satisfied. Assume that it is valid for some $t \ge 0$. Then

$$\sum_{k=0}^{t+1} a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + B_{\nu,t} \overset{(2.13)}{\ge} A_t(f(x_t) + \Psi(x_t)) + a_{t+1}[f(x_{t+1}) + \langle\nabla f(x_{t+1}), x - x_{t+1}\rangle + \Psi(x)]$$
$$\ge A_{t+1} f(x_{t+1}) + A_t\Psi(x_t) + \langle\nabla f(x_{t+1}), a_{t+1}(x - x_{t+1}) + A_t(x_t - x_{t+1})\rangle + a_{t+1}\Psi(x)$$
$$\overset{(2.9\mathrm{b})}{=} A_{t+1} f(x_{t+1}) + A_t\Psi(x_t) + a_{t+1}[\Psi(x) + \langle\nabla f(x_{t+1}), x - v_t\rangle]$$
$$\overset{(2.9\mathrm{b})}{\ge} A_{t+1}(f(x_{t+1}) + \Psi(x_{t+1})) + a_{t+1}[\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle],$$

where the second line uses convexity of $f$ and the last line uses convexity of $\Psi$ along $x_{t+1} = (1-\tau_t)x_t + \tau_t v_t$. It remains to note that

$$\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle \overset{(2.8)}{\ge} \langle\nabla f(x_{t+1}) - \nabla f(x_t), x - v_t\rangle \overset{(2.3)}{\ge} -\tau_t^\nu G_\nu D^{1+\nu}.$$

Thus, for keeping (2.13) valid for the next iteration, it is enough to choose $B_{\nu,t+1} = B_{\nu,t} + \frac{a_{t+1}^{1+\nu}}{A_{t+1}^\nu} G_\nu D^{1+\nu}$. $\square$

Corollary 1. For any $t \ge 0$ with $A_t > 0$, and any $\nu \in (0,1]$, we have

$$\bar f(x_t) - \bar f(x^*) \le \frac{1}{A_t} B_{\nu,t}. \quad (2.14)$$

Let us now discuss possible variants for choosing the weights $\{a_t\}_{t\ge0}$.

1. Constant weights. Let us choose $a_t \equiv 1$, $t \ge 0$. Then $A_t = t+1$, and for $\nu \in (0,1)$ we have

$$B_{\nu,t} \overset{(2.11)}{=} V_0 + \sum_{k=1}^t \frac{G_\nu D^{1+\nu}}{(1+k)^\nu} \overset{(1.1)}{\le} V_0 + G_\nu D^{1+\nu}\int_{1/2}^{t+1/2}\frac{d\tau}{(1+\tau)^\nu} \overset{(2.12)}{\le} \Delta(x_0) + G_\nu D^{1+\nu}\left[\frac{1}{1+\nu} + \frac{(\frac32+t)^{1-\nu} - (\frac32)^{1-\nu}}{1-\nu}\right].$$

Thus, for $\nu \in (0,1)$, we have $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For the most important case $\nu = 1$, we have $\lim_{\nu\to1}\frac{1}{1-\nu}\left[(\tfrac32+t)^{1-\nu} - (\tfrac32)^{1-\nu}\right] = \ln(1 + \tfrac23 t)$. Therefore,

$$\bar f(x_t) - \bar f(x^*) \le \frac{1}{t+1}\left(\Delta(x_0) + G_1 D^2\left[\tfrac12 + \ln(1 + \tfrac23 t)\right]\right). \quad (2.15)$$

In this situation, in method (2.9) we take $\tau_t \overset{(2.10)}{=} \frac{1}{t+2}$.

2. Linear weights. Let us choose $a_t \equiv t$, $t \ge 0$. Then $A_t = \frac{t(t+1)}{2}$, and for $\nu \in (0,1)$ with $t \ge 1$ we have

$$B_{\nu,t} \overset{(2.11)}{=} \sum_{k=1}^t \frac{2^\nu k^{1+\nu}}{k^\nu(1+k)^\nu} G_\nu D^{1+\nu} \le 2^\nu G_\nu D^{1+\nu}\sum_{k=1}^t k^{1-\nu} \overset{(1.1)}{\le} 2^\nu G_\nu D^{1+\nu}\int_{1/2}^{t+1/2}\tau^{1-\nu}d\tau = \frac{2^\nu}{2-\nu}\left[(t+\tfrac12)^{2-\nu} - (\tfrac12)^{2-\nu}\right]G_\nu D^{1+\nu}.$$

Thus, for $\nu \in (0,1)$, we again have $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For the case $\nu = 1$, we get the following bound:

$$\bar f(x_t) - \bar f(x^*) \le \frac{4}{t+1} G_1 D^2, \quad t \ge 1. \quad (2.16)$$

As we can see, this rate of convergence is better than (2.15). In this case, in method (2.9) we take $\tau_t \overset{(2.10)}{=} \frac{2}{t+2}$, which is the standard recommendation for CGM (2.9).

3. Aggressive weights. Let us choose, for example, $a_t \equiv t^2$, $t \ge 0$. Then $A_t = \frac{t(t+1)(2t+1)}{6}$. Note that for $k \ge 0$ we have $(k+1)^\nu(2k+1)^\nu \ge 2^\nu k^{2\nu}$. Therefore, for $\nu \in (0,1)$ with $t \ge 1$ we obtain

$$B_{\nu,t} \overset{(2.11)}{=} \sum_{k=1}^t \frac{6^\nu k^{2(1+\nu)}}{k^\nu(k+1)^\nu(2k+1)^\nu} G_\nu D^{1+\nu} \le 3^\nu G_\nu D^{1+\nu}\sum_{k=1}^t k^{2-\nu} \overset{(1.1)}{\le} \frac{3^\nu}{3-\nu}\left[(t+\tfrac12)^{3-\nu} - (\tfrac12)^{3-\nu}\right]G_\nu D^{1+\nu}.$$

For $\nu \in (0,1)$, we again get $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For $\nu = 1$, we obtain

$$\bar f(x_t) - \bar f(x^*) \le \frac{9}{2t+1} G_1 D^2, \quad t \ge 1, \quad (2.17)$$

which is slightly worse than (2.16). The rule for choosing the coefficients $\tau_t$ in this situation is $\tau_t \overset{(2.10)}{=} \frac{6(t+1)}{(t+2)(2t+3)}$.

It can easily be checked that a further increase of the growth rate of the coefficients $a_t$ makes the rate of convergence of method (2.9) even worse.

Note that the above rules for choosing the coefficients $\{\tau_t\}_{t\ge0}$ in method (2.9) do not depend on the smoothness parameter $\nu \in (0,1]$. In this sense, method (2.9) is a universal method for solving problem (2.1) (see [15]). Moreover, this method does not depend on the choice of the norm in $E$. Hence, its rate of convergence can be established with respect to the best norm describing the geometry of the feasible set.

3 Trust-region variant of the Frank-Wolfe method

Let us consider a variant of method (2.9) which takes into account the composite form of the objective function in problem (2.1). For $\Psi(x) \equiv \operatorname{Ind}_Q(x)$, these two methods coincide.

Otherwise, they generate different minimization sequences.

Conditional Gradient Method, Type II

1. Choose an arbitrary point $x_0 \in Q$.
2. For $t \ge 0$ iterate: choose a coefficient $\tau_t \in (0,1]$ and compute

$$x_{t+1} = \arg\min_y\{\langle\nabla f(x_t), y\rangle + \Psi(y) : y = (1-\tau_t)x_t + \tau_t x, \ x \in Q\}. \quad (3.1)$$

This method can be seen as a trust-region scheme [1] with a linear model of the objective function. The trust region in method (3.1) is formed by a contraction of the initial feasible set. In Section 6, we will consider a more traditional trust-region method with a quadratic model of the objective.

Note that the point $x_{t+1}$ in method (3.1) is characterized by the following variational principle:

$$x_{t+1} = (1-\tau_t)x_t + \tau_t v_t, \quad v_t \in Q,$$
$$\Psi((1-\tau_t)x_t + \tau_t x) + \tau_t\langle\nabla f(x_t), x - x_t\rangle \ \ge\ \Psi(x_{t+1}) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle, \quad x \in Q. \quad (3.2)$$

Let us choose somehow a sequence of nonnegative weights $\{a_t\}_{t\ge0}$, and define in (3.1) the coefficients $\tau_t$ in accordance with (2.10). Define now the estimating functional sequence $\{\phi_t(x)\}_{t\ge0}$ as follows:

$$\phi_0(x) = a_0\bar f(x), \qquad \phi_{t+1}(x) = \phi_t(x) + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)], \quad t \ge 0. \quad (3.3)$$

Clearly, for all $t \ge 0$ we have

$$\phi_t(x) \le A_t\bar f(x), \quad x \in Q. \quad (3.4)$$

Denote

$$C_{\nu,t} = a_0\Delta(x_0) + \frac{1}{1+\nu}\sum_{k=1}^t \frac{a_k^{1+\nu}}{A_k^\nu} G_\nu D^{1+\nu}, \quad t \ge 0. \quad (3.5)$$

Let us introduce

$$\delta(x) \overset{\mathrm{def}}{=} \max_{y\in Q}\{\langle\nabla f(x), x - y\rangle + \Psi(x) - \Psi(y)\} \overset{(1.4)}{=} \Psi^*_{x,Q}(\nabla f(x)). \quad (3.6)$$

For problem (2.1), this value measures the level of satisfaction of the first-order optimality conditions at the point $x \in Q$. For any $x \in Q$, we have

$$\delta(x) \ge \bar f(x) - \bar f(x^*) \ge 0. \quad (3.7)$$

We call $\delta(x)$ the total variation of the linear model of the composite objective function in problem (2.1) over the feasible set.
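Before turning to the convergence analysis, here is a minimal Python sketch of one step of method (3.1) together with the measure $\delta(x)$ from (3.6), for the illustrative choice $Q = [l, u]$ (a box) and $\Psi(x) = \|x\|_1$, under which the contracted subproblem separates coordinate-wise. The concrete $Q$ and $\Psi$ are our assumptions, not the paper's.

```python
import numpy as np

def min_linear_plus_abs(s, lo, hi):
    """Minimize s*y + |y| over [lo, hi]: a piecewise-linear 1-D problem, so the
    minimum is at an endpoint or at the kink y = 0 (if it lies in the interval)."""
    cands = [lo, hi] + ([0.0] if lo <= 0.0 <= hi else [])
    return min(cands, key=lambda y: s * y + abs(y))

def cgm_type2_step(x, g, l, u, tau):
    """One step of method (3.1): minimize <g, y> + ||y||_1 over the contracted box
    y in (1-tau)*x + tau*[l, u], with g = grad f(x_t)."""
    lo, hi = (1 - tau) * x + tau * l, (1 - tau) * x + tau * u
    return np.array([min_linear_plus_abs(g[i], lo[i], hi[i]) for i in range(len(x))])

def delta(x, g, l, u):
    """Total variation (3.6) of the linear model: taking tau = 1 recovers the full
    feasible set Q, so the same solver yields the maximizer in (3.6)."""
    y = cgm_type2_step(x, g, l, u, tau=1.0)
    return float(g @ (x - y) + np.abs(x).sum() - np.abs(y).sum())
```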

Theorem 2. Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (3.1). Then, for any $\nu \in (0,1]$ and any step $t \ge 0$, we have

$$A_t\bar f(x_t) \le \phi_t(x) + C_{\nu,t}, \quad x \in Q. \quad (3.8)$$

Moreover, for any $t \ge 0$ we have

$$\bar f(x_t) - \bar f(x_{t+1}) \ge \tau_t\delta(x_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu}. \quad (3.9)$$

Proof: Let us prove inequality (3.8). For $t = 0$, we have $C_{\nu,0} = a_0[\bar f(x_0) - \bar f(x^*)]$. Thus, in this case (3.8) follows from (3.4). Assume now that (3.8) is valid for some $t \ge 0$. In view of definition (2.10), the optimality condition (3.2) can be written in the following form:

$$a_{t+1}\langle\nabla f(x_t), x - x_t\rangle \ \ge\ A_{t+1}\left[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle\right], \quad x \in Q.$$

Therefore,

$$\phi_{t+1}(x) + C_{\nu,t} = \phi_t(x) + C_{\nu,t} + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)]$$
$$\overset{(3.2),(3.8)}{\ge} A_t[f(x_t) + \Psi(x_t)] + a_{t+1}[f(x_t) + \Psi(x)] + A_{t+1}\left[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle\right]$$
$$\ge A_{t+1}\left[f(x_t) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1})\right] \overset{(2.4)}{\ge} A_{t+1}\left[\bar f(x_{t+1}) - \frac{G_\nu}{1+\nu}\|x_{t+1} - x_t\|^{1+\nu}\right],$$

where the third inequality uses $A_t\Psi(x_t) + a_{t+1}\Psi(x) \ge A_{t+1}\Psi((1-\tau_t)x_t + \tau_t x)$ (convexity of $\Psi$). It remains to note that $\|x_{t+1} - x_t\| = \tau_t\|x_t - v_t\| \overset{(2.10)}{\le} \frac{a_{t+1}}{A_{t+1}}D$. Thus, we can take

$$C_{\nu,t+1} = C_{\nu,t} + \frac{a_{t+1}^{1+\nu}}{(1+\nu)A_{t+1}^\nu} G_\nu D^{1+\nu}.$$

In order to prove inequality (3.9), let us introduce the values

$$\delta_t(\tau) \overset{\mathrm{def}}{=} \max_{x\in Q}\{\langle\nabla f(x_t), x_t - y\rangle + \Psi(x_t) - \Psi(y) : y = (1-\tau)x_t + \tau x\} \overset{(1.5)}{=} \Psi^*_{\tau,x_t,Q}(\nabla f(x_t)), \quad \tau \in [0,1].$$

Clearly,

$$-\delta_t(\tau_t) = \min_{x\in Q}\{\langle\nabla f(x_t), y - x_t\rangle + \Psi(y) - \Psi(x_t) : y = (1-\tau_t)x_t + \tau_t x\} = \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1}) - \Psi(x_t)$$
$$\overset{(2.4)}{\ge} \bar f(x_{t+1}) - \bar f(x_t) - \frac{G_\nu}{1+\nu}\|x_{t+1} - x_t\|^{1+\nu}.$$

Since $\|x_{t+1} - x_t\| \le \tau_t D$, we conclude that

$$\bar f(x_t) - \bar f(x_{t+1}) \ge \delta_t(\tau_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu} \overset{(1.6)}{\ge} \tau_t\delta(x_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu}. \qquad \square$$

In view of (3.4), inequality (3.8) results in the following rate of convergence:

$$\bar f(x_t) - \bar f(x^*) \le \frac{1}{A_t} C_{\nu,t}, \quad t \ge 0. \quad (3.10)$$

For the linearly growing weights $a_t = t$, $A_t = \frac{t(t+1)}{2}$, $t \ge 0$, we have already seen that

$$C_{\nu,t} = \frac{1}{1+\nu} B_{\nu,t} \le \frac{2^\nu}{(1+\nu)(2-\nu)}\left[(t+\tfrac12)^{2-\nu} - (\tfrac12)^{2-\nu}\right] G_\nu D^{1+\nu}.$$

In the case $\nu = 1$, this results in the following rate of convergence:

$$\bar f(x_t) - \bar f(x^*) \le \frac{2}{t+1} G_1 D^2, \quad t \ge 1. \quad (3.11)$$

Let us justify, for this case, the rate of convergence of the sequence $\{\delta(x_t)\}_{t\ge1}$. We have $\tau_t \overset{(2.10)}{=} \frac{2}{t+2}$. On the other hand, for any $T \ge t \ge 1$,

$$\frac{2G_1D^2}{t+1} \overset{(3.11)}{\ge} \bar f(x_t) - \bar f(x^*) \ge \sum_{k=t}^T[\bar f(x_k) - \bar f(x_{k+1})] \overset{(3.9)}{\ge} \sum_{k=t}^T\left[\tau_k\delta(x_k) - G_1D^2\tau_k^2\right]. \quad (3.12)$$

Denote $\delta^*_T = \min_{0\le t\le T}\delta(x_t)$. Then, choosing $T = 2t+1$, we get

$$2\ln2\cdot\delta^*_T \overset{(1.2)}{\le} \left(\sum_{k=t}^T\frac{2}{k+2}\right)\delta^*_T \le \sum_{k=t}^T\tau_k\delta(x_k) \overset{(3.12)}{\le} \frac{2G_1D^2}{t+1} + 4G_1D^2\sum_{k=t}^T\frac{1}{(k+2)^2} \overset{(1.3)}{\le} G_1D^2\left[\frac{4}{T+1} + \frac{8}{T+2}\right] \le \frac{12\,G_1D^2}{T+1}.$$

Thus, in the case $\nu = 1$, for odd $T$, we get the following bound:

$$\delta^*_T \le \frac{6}{\ln2}\cdot\frac{G_1D^2}{T+1}. \quad (3.13)$$

4 Computing the primal-dual solution

Note that both methods (2.9) and (3.1) admit computable accuracy certificates.(2) For the first method, denote

$$\ell_t = \frac{1}{A_t}\min_{x\in Q}\left\{\sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)]\right\}.$$

(2) It seems that this line of arguments was used for the first time in Section 7.5 of [8].

Clearly,

$$\bar f(x_t) - \bar f(x^*) \le \bar f(x_t) - \ell_t \overset{(2.13)}{\le} \frac{1}{A_t} B_{\nu,t}. \quad (4.1)$$

For the second method, let us choose $a_0 = 0$. Then the estimating functions are linear:

$$\phi_t(x) = \sum_{k=1}^t a_k[f(x_{k-1}) + \langle\nabla f(x_{k-1}), x - x_{k-1}\rangle + \Psi(x)].$$

Therefore, defining $\hat\ell_t = \frac{1}{A_t}\min_{x\in Q}\phi_t(x)$, we also have

$$\bar f(x_t) - \bar f(x^*) \le \bar f(x_t) - \hat\ell_t \overset{(3.8)}{\le} \frac{1}{A_t} C_{\nu,t}, \quad t \ge 1. \quad (4.2)$$

Accuracy certificates (4.1) and (4.2) show that both methods (2.9) and (3.1) are able to recover some information on the optimal dual solution. However, in order to implement this ability, we need to open the Black Box and introduce an explicit model of the function $f(x)$. Let us assume that $f$ is representable in the following form:

$$f(x) = \max_u\{\langle Ax, u\rangle - g(u) : u \in Q_d\}, \quad (4.3)$$

where $A : E \to E_1^*$, $Q_d$ is a closed convex set in a finite-dimensional linear space $E_1$, and the function $g(\cdot)$ is $p$-uniformly convex on $Q_d$:

$$\langle g'(u_1) - g'(u_2), u_1 - u_2\rangle \ge \sigma_g\|u_1 - u_2\|^p, \quad u_1, u_2 \in Q_d,$$

where the convexity degree is $p \ge 2$. It is well known (e.g. [15]) that in this case the gradients of $f$ satisfy the Hölder condition (2.3) with $\nu = \frac{1}{p-1}$ and $G_\nu = \|A\|^{1+\nu}\left(\frac{1}{\sigma_g}\right)^\nu$. In view of Danskin's theorem, $\nabla f(x) = A^* u(x)$, where $u(x) \in Q_d$ is the unique optimal solution of the optimization problem in representation (4.3). Let us write down the dual problem to (2.1):

$$\min_{x\in Q}\bar f(x) \overset{(4.3)}{=} \min_x\left\{\Psi(x) + \max_{u\in Q_d}\{\langle Ax, u\rangle - g(u)\}\right\} \ge \max_{u\in Q_d}\left\{-g(u) + \min_x\{\langle A^*u, x\rangle + \Psi(x)\}\right\}.$$

Thus, defining $\Phi(u) = \min_x\{\langle A^*u, x\rangle + \Psi(x)\}$, we get the following dual problem:

$$\max_{u\in Q_d}\left\{\bar g(u) \overset{\mathrm{def}}{=} -g(u) + \Phi(u)\right\}. \quad (4.4)$$

In this problem, the objective function is nonsmooth and uniformly strongly concave of degree $p$. Clearly, we have

$$\bar f(x) - \bar g(u) \ge 0, \quad x \in Q, \ u \in Q_d. \quad (4.5)$$

Let us show that both methods (2.9) and (3.1) are able to approximate the optimal solution of the dual problem (4.4).
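For illustration, the certificate $\bar u_t$ constructed in the remainder of this section can be sketched as follows for the hypothetical model $f(x) = \tfrac12\|Ax - b\|^2 = \max_u\{\langle Ax, u\rangle - \tfrac12\|u\|^2 - \langle b, u\rangle\}$ (so that $u(x) = Ax - b$ by Danskin's theorem), with $\Psi$ the indicator of the standard simplex (so $\Phi(u) = \min_i (A^T u)_i$). All concrete choices here are ours.

```python
import numpy as np

def duality_gap(A, b, model_points, weights, x_t):
    """Primal-dual gap certificate of Section 4: u_bar = (1/A_t) * sum a_k u(x_k),
    dual value from (4.4), primal value at the current iterate x_t."""
    a = np.asarray(weights, dtype=float)
    us = [A @ x - b for x in model_points]                    # u(x_k) = A x_k - b
    u_bar = sum(ak * uk for ak, uk in zip(a, us)) / a.sum()
    g_bar = -(0.5 * u_bar @ u_bar + b @ u_bar) + np.min(A.T @ u_bar)  # dual (4.4)
    f_bar = 0.5 * np.linalg.norm(A @ x_t - b) ** 2                    # primal value
    return f_bar - g_bar      # nonnegative by (4.5); bounded as in (4.6)/(4.7)
```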

Note that for any $\bar x \in Q$ we have

$$f(\bar x) + \langle\nabla f(\bar x), x - \bar x\rangle \overset{(4.3)}{=} \langle A\bar x, u(\bar x)\rangle - g(u(\bar x)) + \langle A^*u(\bar x), x - \bar x\rangle = \langle Ax, u(\bar x)\rangle - g(u(\bar x)).$$

Therefore, denoting for the first method (2.9) $\bar u_t = \frac{1}{A_t}\sum_{k=0}^t a_k u(x_k)$, we obtain

$$\ell_t = \min_{x\in Q}\left\{\Psi(x) + \frac{1}{A_t}\sum_{k=0}^t a_k[\langle Ax, u(x_k)\rangle - g(u(x_k))]\right\} = \Phi(\bar u_t) - \frac{1}{A_t}\sum_{k=0}^t a_k g(u(x_k)) \le \bar g(\bar u_t),$$

where the last inequality follows from the convexity of $g$. Thus, we get

$$0 \overset{(4.5)}{\le} \bar f(x_t) - \bar g(\bar u_t) \le \bar f(x_t) - \ell_t \overset{(4.1)}{\le} \frac{1}{A_t} B_{\nu,t}, \quad t \ge 0. \quad (4.6)$$

For the second method (3.1), we need to choose $a_0 = 0$ and take $\bar u_t = \frac{1}{A_t}\sum_{k=1}^t a_k u(x_{k-1})$. In this case, by similar reasoning, we get

$$0 \overset{(4.5)}{\le} \bar f(x_t) - \bar g(\bar u_t) \le \bar f(x_t) - \hat\ell_t \overset{(4.2)}{\le} \frac{1}{A_t} C_{\nu,t}, \quad t \ge 1. \quad (4.7)$$

5 Strong convexity of the function Ψ

In this section, we assume that the function $\Psi$ in problem (2.1) is strongly convex (see, for example, Section 2.1.3 in [12]). This means that there exists a positive constant $\sigma_\Psi$ such that

$$\Psi(\tau x + (1-\tau)y) \le \tau\Psi(x) + (1-\tau)\Psi(y) - \frac{\sigma_\Psi}{2}\tau(1-\tau)\|x - y\|^2 \quad (5.1)$$

for all $x, y \in Q$ and $\tau \in [0,1]$. Let us show that in this case CG-methods converge much faster. We demonstrate this for method (2.9).

In view of the strong convexity of $\Psi$, the variational principle (2.8) characterizing the point $v_t$ in method (2.9) can be strengthened:

$$\Psi(x) + \langle\nabla f(x_t), x - v_t\rangle \ge \Psi(v_t) + \frac{\sigma_\Psi}{2}\|x - v_t\|^2, \quad x \in Q. \quad (5.2)$$

Let $V_0$ be defined as in (2.11). Denote

$$\hat B_{\nu,t} = a_0 V_0 + \sum_{k=1}^t \frac{a_k^{1+2\nu}}{2\sigma_\Psi A_k^{2\nu}} G_\nu^2 D^{2\nu}, \quad t \ge 0. \quad (5.3)$$

Theorem 3. Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (2.9), and let the function $\Psi$ be strongly convex. Then, for any $\nu \in (0,1]$, any step $t \ge 0$, and any $x \in Q$ we have

$$A_t(f(x_t) + \Psi(x_t)) \le \sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + \hat B_{\nu,t}. \quad (5.4)$$

Proof: The beginning of the proof is very similar to that of Theorem 1. Assuming that (5.4) is valid for some $t \ge 0$, we get the inequality

$$\sum_{k=0}^{t+1} a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + \hat B_{\nu,t} \ge A_{t+1}(f(x_{t+1}) + \Psi(x_{t+1})) + a_{t+1}[\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle].$$

Further,

$$\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle \overset{(5.2)}{\ge} \langle\nabla f(x_{t+1}) - \nabla f(x_t), x - v_t\rangle + \frac{\sigma_\Psi}{2}\|x - v_t\|^2$$
$$\ge -\frac{1}{2\sigma_\Psi}\|\nabla f(x_{t+1}) - \nabla f(x_t)\|_*^2 \overset{(2.3)}{\ge} -\frac{1}{2\sigma_\Psi}\left(\frac{a_{t+1}}{A_{t+1}}\right)^{2\nu} G_\nu^2 D^{2\nu}.$$

Thus, for keeping (5.4) valid for the next iteration, it is enough to choose

$$\hat B_{\nu,t+1} = \hat B_{\nu,t} + \frac{a_{t+1}^{1+2\nu}}{2\sigma_\Psi A_{t+1}^{2\nu}} G_\nu^2 D^{2\nu}. \qquad \square$$

It can easily be checked that in our situation the linear weights strategy $a_t \equiv t$ is not the best one. Let us choose $a_t = t^2$, $t \ge 0$. Then $A_t = \frac{t(t+1)(2t+1)}{6}$, and we get

$$\hat B_{\nu,t} \overset{(5.3)}{=} \frac{G_\nu^2 D^{2\nu}}{2\sigma_\Psi}\sum_{k=1}^t \frac{6^{2\nu} k^{2(1+2\nu)}}{k^{2\nu}(k+1)^{2\nu}(2k+1)^{2\nu}} \le \frac{3^{2\nu} G_\nu^2 D^{2\nu}}{2\sigma_\Psi}\sum_{k=1}^t k^{2-2\nu} \overset{(1.1)}{\le} \frac{3^{2\nu}}{3-2\nu}\left[(t+\tfrac12)^{3-2\nu} - (\tfrac12)^{3-2\nu}\right]\frac{G_\nu^2 D^{2\nu}}{2\sigma_\Psi}.$$

Thus, for $\nu \in (0,1)$, we get $\frac{1}{A_t}\hat B_{\nu,t} \le O(t^{-2\nu})$. For $\nu = 1$, we obtain

$$\bar f(x_t) - \bar f(x^*) \le \frac{54}{(t+1)(2t+1)}\cdot\frac{G_1^2 D^2}{\sigma_\Psi}, \quad (5.5)$$

which is much better than (2.16).

6 Second-order trust-region method

Let us assume now that in problem (2.1) the function $f$ is twice continuously differentiable. Then we can apply to this problem the following method.

Contracting Trust-Region Method

1. Choose an arbitrary point $x_0 \in Q$.
2. For $t \ge 0$ iterate: define a coefficient $\tau_t \in (0,1]$ and choose

$$x_{t+1} \in \operatorname{Arg}\min_y\left\{\langle\nabla f(x_t), y - x_t\rangle + \tfrac12\langle\nabla^2 f(x_t)(y - x_t), y - x_t\rangle + \Psi(y) : y = (1-\tau_t)x_t + \tau_t x, \ x \in Q\right\}. \quad (6.1)$$

Note that this scheme is well defined even if the Hessian of the function $f$ is only positive semidefinite. Of course, in general, the computational cost of each iteration of this scheme can be big. However, in one important case, when $\Psi(x)$ is the indicator function of a Euclidean ball, the complexity of each iteration is dominated by the complexity of a matrix inversion. Thus, method (6.1) can easily be applied to problems of the form

$$\min_x\{f(x) : \|x - x_0\| \le r\}, \quad (6.2)$$

where the norm is Euclidean.

Let $H_\nu < +\infty$ for some $\nu \in (0,1]$. In this section we also assume that

$$\langle\nabla^2 f(x)h, h\rangle \le L\|h\|^2, \quad x \in Q, \ h \in E. \quad (6.3)$$

Let us choose a sequence of nonnegative weights $\{a_t\}_{t\ge0}$, and define in (6.1) the coefficients $\{\tau_t\}_{t\ge0}$ in accordance with (2.10). Define the estimating functional sequence $\{\phi_t(x)\}_{t\ge0}$ by the recurrent relations (3.3), where the sequence $\{x_t\}_{t\ge0}$ is generated by method (6.1). Finally, denote

$$\hat C_{\nu,t} = a_0\Delta(x_0) + \sum_{k=1}^t \frac{a_k^{2+\nu}}{A_k^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} + \sum_{k=1}^t \frac{a_k^2}{A_k}\,L D^2. \quad (6.4)$$

In our convergence results, we also estimate the level of satisfaction of the second-order optimality conditions for problem (2.1) at the current test points. Let us introduce

$$\theta(x) \overset{\mathrm{def}}{=} \max_{y\in Q}\left\{\langle\nabla f(x), x - y\rangle - \tfrac12\langle\nabla^2 f(x)(y - x), y - x\rangle + \Psi(x) - \Psi(y)\right\}. \quad (6.5)$$

For any $x \in Q$ we have $\theta(x) \ge 0$. We call $\theta(x)$ the total variation of the quadratic model of the composite objective function in problem (2.1) over the feasible set. Denoting $F_x(y) = \tfrac12\langle\nabla^2 f(x)(y - x), y - x\rangle + \Psi(y)$, we get $\theta(x) = (F_x)^*_{x,Q}(\nabla f(x))$ (see definition (1.4)).

Theorem 4. Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (6.1). Then, for any $\nu \in [0,1]$ and any step $t \ge 0$ we have

$$A_t\bar f(x_t) \le \phi_t(x) + \hat C_{\nu,t}, \quad x \in Q. \quad (6.6)$$

Moreover, for any $t \ge 0$ we have

$$\bar f(x_t) - \bar f(x_{t+1}) \ge \tau_t\theta(x_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu}. \quad (6.7)$$

Proof: Let us prove inequality (6.6). For $t = 0$, $\hat C_{\nu,0} = a_0[\bar f(x_0) - \bar f(x^*)]$, and therefore this inequality is valid. Note that the point $x_{t+1}$ is characterized by the following variational principle:

$$x_{t+1} = (1-\tau_t)x_t + \tau_t v_t, \quad v_t \in Q,$$
$$\Psi(y) + \langle\nabla f(x_t) + \nabla^2 f(x_t)(x_{t+1} - x_t), y - x_{t+1}\rangle \ge \Psi(x_{t+1}), \quad y = (1-\tau_t)x_t + \tau_t x, \ x \in Q.$$

Therefore, in view of definition (2.10), for any $x \in Q$ we have

$$a_{t+1}\langle\nabla f(x_t), x - x_t\rangle \ge A_{t+1}\langle\nabla f(x_t) + \nabla^2 f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + a_{t+1}\langle\nabla^2 f(x_t)(x_{t+1} - x_t), x_t - x\rangle$$
$$+\, A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)]$$
$$\overset{(6.3)}{\ge} A_{t+1}\langle\nabla f(x_t) + \nabla^2 f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)] - \frac{a_{t+1}^2}{A_{t+1}} L D^2.$$

Hence,

$$A_t\bar f(x_t) + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)]$$
$$\ge A_t\Psi(x_t) + a_{t+1}\Psi(x) + A_{t+1}\left[f(x_t) + \langle\nabla f(x_t) + \nabla^2 f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)\right] - \frac{a_{t+1}^2}{A_{t+1}} L D^2$$
$$\overset{(2.6)}{\ge} A_{t+1}[f(x_{t+1}) + \Psi(x_{t+1})] - A_{t+1}\frac{H_\nu\|x_{t+1} - x_t\|^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}} L D^2 \ \ge\ A_{t+1}\bar f(x_{t+1}) - \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}} L D^2,$$

where we used the convexity of $\Psi$ (so that $A_t\Psi(x_t) + a_{t+1}\Psi(x) \ge A_{t+1}\Psi((1-\tau_t)x_t + \tau_t x)$), the positive semidefiniteness of the Hessian (so that $\langle\nabla^2 f(x_t)d, d\rangle \ge \tfrac12\langle\nabla^2 f(x_t)d, d\rangle$ for $d = x_{t+1} - x_t$), and $\|x_{t+1} - x_t\| \le \frac{a_{t+1}}{A_{t+1}}D$. Thus, if (6.6) is valid for some $t \ge 0$, then

$$\phi_{t+1}(x) + \hat C_{\nu,t} \ge A_t\bar f(x_t) + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)] \ge A_{t+1}\bar f(x_{t+1}) - \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}} L D^2.$$

Therefore, we can take $\hat C_{\nu,t+1} = \hat C_{\nu,t} + \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} + \frac{a_{t+1}^2}{A_{t+1}} L D^2$.

In order to justify inequality (6.7), let us introduce the values

$$\theta_t(\tau) \overset{\mathrm{def}}{=} \max_{x\in Q}\left\{\langle\nabla f(x_t), x_t - y\rangle - \tfrac12\langle\nabla^2 f(x_t)(y - x_t), y - x_t\rangle + \Psi(x_t) - \Psi(y) : y = (1-\tau)x_t + \tau x\right\} \overset{(1.5)}{=} (F_{x_t})^*_{\tau,x_t,Q}(\nabla f(x_t)), \quad \tau \in [0,1].$$

Clearly,

$$-\theta_t(\tau_t) = \min_{x\in Q}\left\{\langle\nabla f(x_t), y - x_t\rangle + \tfrac12\langle\nabla^2 f(x_t)(y - x_t), y - x_t\rangle + \Psi(y) - \Psi(x_t) : y = (1-\tau_t)x_t + \tau_t x\right\}$$
$$= \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \tfrac12\langle\nabla^2 f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1}) - \Psi(x_t) \overset{(2.6)}{\ge} \bar f(x_{t+1}) - \bar f(x_t) - \frac{H_\nu\|x_{t+1} - x_t\|^{2+\nu}}{(1+\nu)(2+\nu)}.$$

Since $\|x_{t+1} - x_t\| \le \tau_t D$, we conclude that

$$\bar f(x_t) - \bar f(x_{t+1}) \ge \theta_t(\tau_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu} \overset{(1.6)}{\ge} \tau_t\theta(x_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu}. \qquad \square$$

In order to justify inequality (6.7, let us introduce the values θ t (τ def = max x Q { f(x t, x t y f(x t (y x t, y x t (.5 = +Ψ(x t Ψ(y : y = ( τx t + τx} ( F xt τ,x t,q ( f(x t, τ [0, ]. Clearly, θ t (τ t = min { f(x t, y x t x Q f(x t (y x t, y x t +Ψ(y Ψ(x t : y = ( τ t x t + τ t x} = f(x t, x t+ x t f(x t (x t+ x t, x t+ x t + Ψ(x t+ Ψ(x t (.6 f(x t+ f(x t H ν (+ν(+ν x t+ x t +ν. Since x t+ x t τ t D, we conclude that f(x t f(x t+ θ t (τ t H νd +ν (+ν(+ν τ t +ν (.6 τ t θ(x t H νd +ν (+ν(+ν τ +ν t. Thus, inequality (6.6 ensures the following rate of convergence of method (6. f(x t f(x A t Ĉ ν,t. (6.8 A particular expression of the right-hand side of this inequality for different values of ν [0, ] can be obtained exactly in the same way as it was done in Section. Here, we restrict ourselves only by the case ν = and a t = t, t 0. Then A t = t(t+(t+ 6, and t a 3 k A k = t 36k 6 k (k+ (k+ 8t, Thus, we get t a k A k = t f(x t f(x 3k 4 k(k+(k+ 3 t k = 3 4t(t +. 8H D 3 (t+(t+ + 9LD (t+. (6.9 Note that the rate of convergence (6.9 is worse than the convergence rate of cubic regularization of the Newton method [7]. However, to the best of our knowledge, inequality (6.9 gives us the first global rate of convergence of an optimization scheme belonging to the family of trust-region methods []. In view of inequality (6.6, the optimal solution of the dual problem (4.4 can be approximated by method (6. with a 0 = 0 in the same way as it was suggested in Section 4 for Conditional Gradient Methods. 7

Let us estimate now the rate of decrease of the values θ(x t, t 0, in the case ν =. (.0 Note that τ t = a t+ A t+ = 6(t+ (t+(t+3. It is easy to see that these coefficients satisfy the following inequalities: 3 t+3 τ t 6 t+5, t 0. (6.0 Therefore, choosing the total number of steps T = t +, we have T T τk 3 (6.0 τ k 3 t+ k+3 (. 3 ln, (6.0 t+ (.3 7 7 t+5/ (k+5/ 3 (k+5/ t / = 7 [ ] 4 = (T + (T +3 = 7 [ ] (t+ (t+5 7(3T +8(T +4 (T + (T +3 8 (T +(T +. (6. Now we can use the same trick as in the end of Section 3. Denote θ T = min 0tT θ(x t. Then 36H D 3 T (T + 9LD (T (6.9 f(x t f(x T ( f(x k f(x k+ (6.7 T θt τ k H D 3 6 Thus, for even T, we get the following bound: References θ T 3 ln T τk 3 [ 4H D 3 T (T + 3H D 3 4(T +(T + + LD (T [ ] 3 5H D 3 ln T (T + LD (T. (6. 3θ T ln 7H D 3 4(T +(T +. ] (6. [] Conn A.B., Gould N.I.M., Toint Ph.L., Trust Region Methods, SIAM, Philadelphia, 000. [] J. Dunn. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization, 8(5: 473-487, (980. [3] M. Frank, P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3: 49-54 (956. [4] R.M. Freund, P. Grigas. New analysis and results for the FrankWolfe method. Mathematical Programming, DOI 0.007/s007-04-084-6, (04. [5] D. Garber, E. Hazan. A linearly convergent conditional gradient algorithm with application to online and stochastic optimization. arxiv: 30.4666v5, (03. [6] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, DOI 0.007/s007-04-0778-9, (04. 8

[7] M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia (2013).

[8] A. Juditsky, A. Nemirovski. Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles. Mathematical Programming, Ser. A, DOI 10.1007/s10107-015-0876-3.

[9] S. Lacoste-Julien, M. Jaggi, M. Schmidt, P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia (2013).

[10] G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2 (2014).

[11] A. Migdalas. A regularization of the Frank-Wolfe method and unification of certain nonlinear programming methods. Mathematical Programming, 65, 331-345 (1994).

[12] Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.

[13] Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 221-259 (2009).

[14] Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125-161 (2013).

[15] Yu. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, DOI 10.1007/s10107-014-0790-0 (2014).

[16] Yu. Nesterov, A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.

[17] Yu. Nesterov, B. Polyak. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1), 177-205 (2006).

[18] Yu. Nesterov, V. Shikhman. Quasi-monotone subgradient methods for nonsmooth convex minimization. JOTA, DOI 10.1007/s10957-014-0677-5 (2014).