Complexity bounds for primal-dual methods minimizing the model of objective function

Yu. Nesterov

July 4, 2016

Abstract

We provide the Frank-Wolfe (Conditional Gradient) method with a convergence analysis allowing us to approach a primal-dual solution of a convex optimization problem with composite objective function. Additional properties of the complementary part of the objective (strong convexity) significantly accelerate the scheme. We also justify a new variant of this method, which can be seen as a trust-region scheme applied to the linear model of the objective function. For this variant, we also prove a rate of convergence for the total variation of the linear model of the composite objective over the feasible set. Our analysis works also for a quadratic model, allowing us to justify the global rate of convergence for a new second-order method as applied to a composite objective function. To the best of our knowledge, this is the first trust-region scheme supported by a worst-case complexity analysis both for the functional gap and for the variation of the local quadratic model over the feasible set.

Keywords: convex optimization, complexity bounds, linear optimization oracle, conditional gradient method, trust-region method.

Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium; e-mail: Yurii.Nesterov@uclouvain.be. The research results presented in this paper have been supported by the grant "Action de recherche concertée ARC 14/19-060" from the Direction de la recherche scientifique, Communauté française de Belgique. Scientific responsibility rests with the author.

1 Introduction

Motivation. In recent years, we can see an increasing interest in the Frank-Wolfe algorithm [3, 2, 11], which is sometimes called the Conditional Gradient Method (CGM) [5, 7, 9, 8]. At each iteration of this scheme, we need to solve an auxiliary problem of minimizing a linear function over a convex feasible set. In some situations, mainly when the feasible set is a simple polytope, the complexity of this subproblem is much lower than that of the standard projection technique (e.g. [12]). The standard complexity results for this method are related to a convex objective function with Lipschitz-continuous gradient. In this situation, CGM converges as $O(1/k)$, where $k$ is the number of iterations. Moreover, it appears that this rate of convergence is optimal for methods with a linear optimization oracle [10]. For nonsmooth functions, CGM may fail to converge (we give a simple example in Section 2). Therefore, it is interesting to study the dependence of the rate of convergence of CGM on the level of smoothness of the objective function. On the other hand, sometimes nonsmoothness of the objective function results from a complementary regularization term. This situation can be treated in the framework of composite minimization [14]. However, the performance of CGM for this structure of the objective function has not been studied yet.(1) Finally, by its spirit, CGM is a primal-dual method. Indeed, it generates lower bounds on the optimal value of the objective function, which converge to the optimal value [4]. Therefore, it would be natural to extract from this method an approximate solution of the dual problem. These questions served as the main motivations for this paper.

Contents. In Section 2, we introduce our main problem of interest, where the objective function has a composite form. Our main assumption is that the problem of minimizing a linear function augmented by a simple convex complementary term is solvable. We assume that the smooth part of the objective has Hölder-continuous gradients. For proving the efficiency estimate for CGM, we apply the technique of estimating sequences (e.g. [12]) in its extended form [18]. As a result, we get significant freedom in the choice of stepsize coefficients. In Section 3 we consider a new variant of CGM, which can be seen as a trust-region method with a linear model of the objective. For this scheme, the trust region is formed by contracting the feasible set towards the current test point. This method ensures an appropriate rate of convergence for the total variation of the linear model, computed at the sequence of test points. The analytical form of our bounds for the primal-dual gap is similar to the bounds obtained in [4]. However, we estimate the difference between the current value of the objective function and the minimal value of the accumulated linear model. In Section 4, we explain how to extract from our convergence results an upper bound on the duality gap for some feasible primal-dual solution. In our technique we use an explicit max-representation of the smooth part of the objective function. In Section 5, we show that additional properties of the complementary part of the objective (strong convexity) significantly accelerate the scheme. Finally, in Section 6, we apply our technique to a new second-order trust-region method, where the quadratic approximation of the objective function is minimized over a trust region formed by a contracted feasible set. To the best of our knowledge, this is the first trust-region scheme [1] supported by a worst-case complexity analysis.

(1) Performance of CGM as applied to an objective function regularized by a norm was studied in [6].

Notation. In what follows, we consider optimization problems over a finite-dimensional linear space $E$ with dual space $E^*$. The value of a linear function $s \in E^*$ at $x \in E$ is denoted by $\langle s, x \rangle$. In $E$, we fix a norm $\|\cdot\|$, which defines the conjugate norm

$$\|s\|_* = \max_x \{\langle s, x\rangle : \|x\| \le 1\}, \quad s \in E^*.$$

For a linear operator $A : E \to E_1^*$, its conjugate operator $A^* : E_1 \to E^*$ is defined by the identity

$$\langle Ax, y\rangle = \langle A^* y, x\rangle, \quad x \in E, \ y \in E_1.$$

We call an operator $B : E \to E^*$ self-conjugate if $B^* = B$. It is positive semidefinite if $\langle Bx, x\rangle \ge 0$ for all $x \in E$. For a linear operator $B : E \to E^*$, we define its operator norm in the usual way:

$$\|B\| = \max_x \{\|Bx\|_* : \|x\| \le 1\}.$$

For a differentiable function $f(x)$ with $\operatorname{dom} f \subseteq E$, we denote by $\nabla f(x) \in E^*$ its gradient and by $\nabla^2 f(x) : E \to E^*$ its Hessian. Note that $(\nabla^2 f(x))^* = \nabla^2 f(x)$.

In the sequel, we often need to estimate partial sums of different series. For that, it is convenient to use the following trivial lemma.

Lemma 1. Let the function $\xi(\tau)$, $\tau \in \mathbb{R}$, be decreasing and convex. Then, for any two integers $a$ and $b$ such that $[a - \tfrac12, b+1] \subset \operatorname{dom} \xi$, we have

$$\int_a^{b+1} \xi(\tau)\,d\tau \ \le\ \sum_{k=a}^b \xi(k) \ \le\ \int_{a-1/2}^{b+1/2} \xi(\tau)\,d\tau. \quad (1.1)$$

For example, for any $t \ge 0$ and integer $p \ge -t$, we have

$$\sum_{k=t}^{2t+p} \frac{1}{k+p+1} \ \overset{(1.1)}{\ge}\ \int_t^{2t+p+1} \frac{d\tau}{\tau+p+1} = \ln(\tau+p+1)\Big|_t^{2t+p+1} = \ln\frac{2t+2p+2}{t+p+1} = \ln 2. \quad (1.2)$$

On the other hand, if $t \ge 1$, then

$$\sum_{k=t}^{2t+1} \frac{1}{(k+2)^2} \ \overset{(1.1)}{\le}\ \int_{t-1/2}^{2t+3/2} \frac{d\tau}{(\tau+2)^2} = \frac{1}{t+3/2} - \frac{1}{2t+7/2} = \frac{4t+8}{(2t+3)(4t+7)} \le \frac{2}{2t+3}. \quad (1.3)$$

In this paper, we will use some special dual functions. Let $Q \subseteq E$ be a bounded closed convex set. For a closed convex function $F(\cdot)$ with $\operatorname{dom} F \supseteq \operatorname{int} Q$, we define its restricted dual function (with respect to a central point $\bar x \in Q$) as follows:

$$F^*_{\bar x, Q}(s) = \max_{x \in Q} \{\langle s, \bar x - x\rangle + F(\bar x) - F(x)\}, \quad s \in E^*. \quad (1.4)$$

Clearly, this function is well defined for all $s \in E^*$. Moreover, it is convex and nonnegative on $E^*$.

In the theory of Nonsmooth Convex Optimization, dual functions of this type are often used for analyzing the behavior of first-order methods. In those applications, the function $F$ in (1.4) is assumed to be a multiple of a strongly convex prox-function (see, for example, [13]). Such an assumption provides a possibility to control the step sizes in the corresponding optimization schemes. In this paper, we are going to drop the assumption of strong convexity. Hence, we need another mechanism for controlling the sizes of our moves. This is the reason why we introduce in construction (1.4) an additional scaling parameter $\tau \in [0,1]$. We call the function

$$F^*_{\tau, \bar x, Q}(s) = \max_{x \in Q} \{\langle s, \bar x - y\rangle + F(\bar x) - F(y) : y = (1-\tau)\bar x + \tau x\}, \quad s \in E^*, \quad (1.5)$$

the scaled restricted dual of the function $F$.

Lemma 2. For any $s \in E^*$ and $\tau \in [0,1]$, we have

$$F^*_{\bar x, Q}(s) \ \ge\ F^*_{\tau, \bar x, Q}(s) \ \ge\ \tau F^*_{\bar x, Q}(s). \quad (1.6)$$

Proof: Since for any $x \in Q$ the point $y = (1-\tau)\bar x + \tau x$ belongs to $Q$, the first inequality is trivial. On the other hand, by convexity of $F$,

$$F^*_{\tau, \bar x, Q}(s) = \max_{x \in Q} \{\langle s, \tau(\bar x - x)\rangle + F(\bar x) - F(y) : y = (1-\tau)\bar x + \tau x\}$$
$$\ge \max_{x \in Q} \{\langle s, \tau(\bar x - x)\rangle + F(\bar x) - (1-\tau)F(\bar x) - \tau F(x)\} = \tau F^*_{\bar x, Q}(s). \qquad \square$$

2 Conditional gradient method

In this paper we consider numerical methods for solving the composite minimization problem

$$\min_x \left\{\bar f(x) \overset{\mathrm{def}}{=} f(x) + \Psi(x)\right\}, \quad (2.1)$$

where $\Psi$ is a simple closed convex function with bounded domain $Q \subseteq E$, and $f$ is a convex function which is subdifferentiable on $Q$. Denote by $x^*$ one of the optimal solutions of (2.1), and let $D \overset{\mathrm{def}}{=} \operatorname{diam}(Q)$.

Our assumption on the simplicity of the function $\Psi$ means that some auxiliary optimization problems related to $\Psi$ are easily solvable. The complexity of these problems will always be discussed for the corresponding optimization schemes. The most important examples of the function $\Psi$ are as follows.

- $\Psi$ is the indicator function of a closed convex set $Q$:

$$\Psi(x) = \operatorname{Ind}_Q(x) \overset{\mathrm{def}}{=} \begin{cases} 0, & x \in Q,\\ +\infty, & \text{otherwise}.\end{cases} \quad (2.2)$$

- $\Psi$ is a self-concordant barrier for a closed convex set $Q$ (see [16, 12]).

- $\Psi$ is a nonsmooth convex function with simple structure (e.g. $\Psi(x) = \|x\|_1$).
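To make this simplicity assumption concrete, here is a minimal Python sketch of the auxiliary oracle $v_\Psi(s) = \arg\min_x\{\langle s, x\rangle + \Psi(x)\}$ (formalized as Assumption 1 below) for two illustrative choices: $\Psi$ the indicator of the standard simplex, and $\Psi(x) = \|x\|_1$ over a coordinate box. These concrete sets are our assumptions for illustration; the paper does not fix them.

```python
import numpy as np

def lmo_simplex(s):
    """Psi = Ind of the standard simplex: <s, x> is minimized at a vertex."""
    v = np.zeros_like(s)
    v[np.argmin(s)] = 1.0
    return v

def lmo_l1_box(s, l, u):
    """Psi(x) = ||x||_1 on the box Q = [l, u]: the problem separates per coordinate,
    and each 1-D piecewise-linear piece is minimized at an endpoint or at 0."""
    v = np.empty_like(s)
    for i in range(len(s)):
        cands = [l[i], u[i]] + ([0.0] if l[i] <= 0.0 <= u[i] else [])
        v[i] = min(cands, key=lambda y: s[i] * y + abs(y))
    return v
```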

We assume that the function $f$ is represented by a black-box oracle. If it is a first-order oracle, we assume that its gradients satisfy the following Hölder condition:

$$\|\nabla f(x) - \nabla f(y)\|_* \le G_\nu \|x - y\|^\nu, \quad x, y \in Q. \quad (2.3)$$

The constant $G_\nu$ is formally defined for any $\nu \in (0,1]$. For some values of $\nu$ it can be $+\infty$. Note that for any $x$ and $y$ in $Q$ we have

$$f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{G_\nu}{1+\nu}\|y - x\|^{1+\nu}. \quad (2.4)$$

If it is a second-order oracle, we assume that its Hessians satisfy the Hölder condition

$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le H_\nu \|x - y\|^\nu, \quad x, y \in Q. \quad (2.5)$$

In this case, for any $x$ and $y$ in $Q$ we have

$$f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \tfrac12\langle\nabla^2 f(x)(y - x), y - x\rangle + \frac{H_\nu \|y - x\|^{2+\nu}}{(1+\nu)(2+\nu)}. \quad (2.6)$$

For the first variant of the Conditional Gradient Method (CGM), our assumption on the simplicity of the function $\Psi$ means exactly the following.

Assumption 1. For any $s \in E^*$, the auxiliary problem

$$\min_{x \in Q}\{\langle s, x\rangle + \Psi(x)\} \quad (2.7)$$

is easily solvable. Denote by $v_\Psi(s) \in Q$ one of its optimal solutions.

In the case (2.2), this assumption implies that we are able to solve the problem $\min_x\{\langle s, x\rangle : x \in Q\}$. Note that the point $v_\Psi(s)$ is characterized by the following variational principle:

$$\langle s, x - v_\Psi(s)\rangle + \Psi(x) \ge \Psi(v_\Psi(s)), \quad x \in Q. \quad (2.8)$$

In order to solve problem (2.1), we apply the following method.

Conditional Gradient Method, Type I

1. Choose an arbitrary point $x_0 \in Q$.
2. For $t \ge 0$ iterate:
   a) Compute $v_t = v_\Psi(\nabla f(x_t))$.
   b) Choose $\tau_t \in (0,1]$ and set $x_{t+1} = (1-\tau_t)x_t + \tau_t v_t$.    (2.9)

It is clear that this method can minimize only functions with continuous gradient.
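As an illustration, the following minimal sketch runs method (2.9) with the stepsize rule $\tau_t = 2/(t+2)$ (the linear-weight rule derived later in this section) on a hypothetical instance: $f(x) = \tfrac12\|Ax - b\|^2$ with $\Psi$ the indicator of the standard simplex, so that $v_\Psi(s)$ is a vertex-picking linear minimization oracle. The instance and all names are ours, not the paper's.

```python
import numpy as np

def lmo_simplex(s):
    """v_Psi(s) for Psi = Ind of the standard simplex: a vertex minimizing <s, x>."""
    v = np.zeros_like(s)
    v[np.argmin(s)] = 1.0
    return v

def cgm_type1(grad_f, lmo, x0, iters=500):
    x = x0.copy()
    for t in range(iters):
        v = lmo(grad_f(x))             # step (a): v_t = v_Psi(grad f(x_t))
        tau = 2.0 / (t + 2.0)          # step (b): tau_t from rule (2.10), a_t = t
        x = (1.0 - tau) * x + tau * v  # convex combination stays feasible
    return x

# Hypothetical toy instance: least squares over the simplex.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
x_hat = cgm_type1(lambda x: A.T @ (A @ x - b), lmo_simplex, np.ones(10) / 10)
```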

Example 1. Let $\Psi(x) = \operatorname{Ind}_Q(x)$ with $Q = \{x \in \mathbb{R}^2 : (x^{(1)})^2 + (x^{(2)})^2 \le 1\}$. Define $f(x) = \max\{x^{(1)}, x^{(2)}\}$. Then clearly $x^* = \left(-\tfrac{1}{\sqrt2}, -\tfrac{1}{\sqrt2}\right)^T$. Let us choose in (2.9) $x_0 \ne x^*$. For the function $f$, we can apply an oracle which returns at any $x \in Q$ a subgradient $\nabla f(x) \in \{(1,0)^T, (0,1)^T\}$. Then, for any feasible $x$, the point $v_\Psi(\nabla f(x))$ is equal either to $y_1 = (-1, 0)^T$ or to $y_2 = (0, -1)^T$. Therefore, all points of the sequence $\{x_t\}_{t\ge0}$ generated by method (2.9) belong to the triangle $\operatorname{Conv}\{x_0, y_1, y_2\}$, which does not contain the optimal point $x^*$.

In order to justify the rate of convergence of method (2.9) for functions with Hölder continuous gradients, we apply the estimating sequences technique [12] in its relaxed form [18]. For that, it is convenient to introduce in (2.9) new control variables. Consider a sequence of nonnegative weights $\{a_t\}_{t\ge0}$. Define

$$A_t = \sum_{k=0}^t a_k, \qquad \tau_t = \frac{a_{t+1}}{A_{t+1}}, \quad t \ge 0. \quad (2.10)$$

From now on, we assume that the parameter $\tau_t$ in method (2.9) is chosen in accordance with rule (2.10). Denote

$$V_0 = \max_{x\in Q}\{\langle\nabla f(x_0), x_0 - x\rangle + \Psi(x_0) - \Psi(x)\}, \qquad B_{\nu,t} = a_0 V_0 + \sum_{k=1}^t \frac{a_k^{1+\nu}}{A_k^\nu} G_\nu D^{1+\nu}, \quad t \ge 0. \quad (2.11)$$

It is clear that

$$V_0 \overset{(2.4)}{\le} \max_{x\in Q}\left\{f(x_0) - f(x) + \frac{G_\nu}{1+\nu}\|x - x_0\|^{1+\nu} + \Psi(x_0) - \Psi(x)\right\} \le \bar f(x_0) - \bar f(x^*) + \frac{G_\nu D^{1+\nu}}{1+\nu} \overset{\mathrm{def}}{=} \Delta(x_0) + \frac{G_\nu D^{1+\nu}}{1+\nu}. \quad (2.12)$$

Theorem 1. Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (2.9). Then, for any $\nu \in (0,1]$, any step $t \ge 0$, and any $x \in Q$ we have

$$A_t(f(x_t) + \Psi(x_t)) \le \sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + B_{\nu,t}. \quad (2.13)$$

Proof:

Indeed, in view of definition (2.11), for $t = 0$ inequality (2.13) is satisfied. Assume that it is valid for some $t \ge 0$. Then

$$\sum_{k=0}^{t+1} a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + B_{\nu,t} \overset{(2.13)}{\ge} A_t(f(x_t) + \Psi(x_t)) + a_{t+1}[f(x_{t+1}) + \langle\nabla f(x_{t+1}), x - x_{t+1}\rangle + \Psi(x)]$$
$$\ge A_{t+1} f(x_{t+1}) + A_t\Psi(x_t) + \langle\nabla f(x_{t+1}), a_{t+1}(x - x_{t+1}) + A_t(x_t - x_{t+1})\rangle + a_{t+1}\Psi(x)$$
$$\overset{(2.9\mathrm{b})}{=} A_{t+1} f(x_{t+1}) + A_t\Psi(x_t) + a_{t+1}[\Psi(x) + \langle\nabla f(x_{t+1}), x - v_t\rangle]$$
$$\overset{(2.9\mathrm{b})}{\ge} A_{t+1}(f(x_{t+1}) + \Psi(x_{t+1})) + a_{t+1}[\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle],$$

where the second line uses convexity of $f$ and the last line uses convexity of $\Psi$ along $x_{t+1} = (1-\tau_t)x_t + \tau_t v_t$. It remains to note that

$$\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle \overset{(2.8)}{\ge} \langle\nabla f(x_{t+1}) - \nabla f(x_t), x - v_t\rangle \overset{(2.3)}{\ge} -\tau_t^\nu G_\nu D^{1+\nu}.$$

Thus, for keeping (2.13) valid for the next iteration, it is enough to choose $B_{\nu,t+1} = B_{\nu,t} + \frac{a_{t+1}^{1+\nu}}{A_{t+1}^\nu} G_\nu D^{1+\nu}$. $\square$

Corollary 1. For any $t \ge 0$ with $A_t > 0$, and any $\nu \in (0,1]$, we have

$$\bar f(x_t) - \bar f(x^*) \le \frac{1}{A_t} B_{\nu,t}. \quad (2.14)$$

Let us now discuss possible variants for choosing the weights $\{a_t\}_{t\ge0}$.

1. Constant weights. Let us choose $a_t \equiv 1$, $t \ge 0$. Then $A_t = t+1$, and for $\nu \in (0,1)$ we have

$$B_{\nu,t} \overset{(2.11)}{=} V_0 + \sum_{k=1}^t \frac{G_\nu D^{1+\nu}}{(1+k)^\nu} \overset{(1.1)}{\le} V_0 + G_\nu D^{1+\nu}\int_{1/2}^{t+1/2}\frac{d\tau}{(1+\tau)^\nu} \overset{(2.12)}{\le} \Delta(x_0) + G_\nu D^{1+\nu}\left[\frac{1}{1+\nu} + \frac{(\frac32+t)^{1-\nu} - (\frac32)^{1-\nu}}{1-\nu}\right].$$

Thus, for $\nu \in (0,1)$, we have $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For the most important case $\nu = 1$, we have $\lim_{\nu\to1}\frac{1}{1-\nu}\left[(\tfrac32+t)^{1-\nu} - (\tfrac32)^{1-\nu}\right] = \ln(1 + \tfrac23 t)$. Therefore,

$$\bar f(x_t) - \bar f(x^*) \le \frac{1}{t+1}\left(\Delta(x_0) + G_1 D^2\left[\tfrac12 + \ln(1 + \tfrac23 t)\right]\right). \quad (2.15)$$

In this situation, in method (2.9) we take $\tau_t \overset{(2.10)}{=} \frac{1}{t+2}$.

2. Linear weights. Let us choose $a_t \equiv t$, $t \ge 0$. Then $A_t = \frac{t(t+1)}{2}$, and for $\nu \in (0,1)$ with $t \ge 1$ we have

$$B_{\nu,t} \overset{(2.11)}{=} \sum_{k=1}^t \frac{2^\nu k^{1+\nu}}{k^\nu(1+k)^\nu} G_\nu D^{1+\nu} \le 2^\nu G_\nu D^{1+\nu}\sum_{k=1}^t k^{1-\nu} \overset{(1.1)}{\le} 2^\nu G_\nu D^{1+\nu}\int_{1/2}^{t+1/2}\tau^{1-\nu}d\tau = \frac{2^\nu}{2-\nu}\left[(t+\tfrac12)^{2-\nu} - (\tfrac12)^{2-\nu}\right]G_\nu D^{1+\nu}.$$

Thus, for $\nu \in (0,1)$, we again have $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For the case $\nu = 1$, we get the following bound:

$$\bar f(x_t) - \bar f(x^*) \le \frac{4}{t+1} G_1 D^2, \quad t \ge 1. \quad (2.16)$$

As we can see, this rate of convergence is better than (2.15). In this case, in method (2.9) we take $\tau_t \overset{(2.10)}{=} \frac{2}{t+2}$, which is the standard recommendation for CGM (2.9).

3. Aggressive weights. Let us choose, for example, $a_t \equiv t^2$, $t \ge 0$. Then $A_t = \frac{t(t+1)(2t+1)}{6}$. Note that for $k \ge 0$ we have $(k+1)^\nu(2k+1)^\nu \ge 2^\nu k^{2\nu}$. Therefore, for $\nu \in (0,1)$ with $t \ge 1$ we obtain

$$B_{\nu,t} \overset{(2.11)}{=} \sum_{k=1}^t \frac{6^\nu k^{2(1+\nu)}}{k^\nu(k+1)^\nu(2k+1)^\nu} G_\nu D^{1+\nu} \le 3^\nu G_\nu D^{1+\nu}\sum_{k=1}^t k^{2-\nu} \overset{(1.1)}{\le} \frac{3^\nu}{3-\nu}\left[(t+\tfrac12)^{3-\nu} - (\tfrac12)^{3-\nu}\right]G_\nu D^{1+\nu}.$$

For $\nu \in (0,1)$, we again get $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For $\nu = 1$, we obtain

$$\bar f(x_t) - \bar f(x^*) \le \frac{9}{2t+1} G_1 D^2, \quad t \ge 1, \quad (2.17)$$

which is slightly worse than (2.16). The rule for choosing the coefficients $\tau_t$ in this situation is $\tau_t \overset{(2.10)}{=} \frac{6(t+1)}{(t+2)(2t+3)}$.

It can easily be checked that a further increase of the growth rate of the coefficients $a_t$ makes the rate of convergence of method (2.9) even worse.

Note that the above rules for choosing the coefficients $\{\tau_t\}_{t\ge0}$ in method (2.9) do not depend on the smoothness parameter $\nu \in (0,1]$. In this sense, method (2.9) is a universal method for solving problem (2.1) (see [15]). Moreover, this method does not depend on the choice of the norm in $E$. Hence, its rate of convergence can be established with respect to the best norm describing the geometry of the feasible set.

3 Trust-region variant of the Frank-Wolfe method

Let us consider a variant of method (2.9) which takes into account the composite form of the objective function in problem (2.1). For $\Psi(x) \equiv \operatorname{Ind}_Q(x)$, these two methods coincide.

Otherwise, they generate different minimization sequences.

Conditional Gradient Method, Type II

1. Choose an arbitrary point $x_0 \in Q$.
2. For $t \ge 0$ iterate: choose a coefficient $\tau_t \in (0,1]$ and compute

$$x_{t+1} = \arg\min_y\{\langle\nabla f(x_t), y\rangle + \Psi(y) : y = (1-\tau_t)x_t + \tau_t x, \ x \in Q\}. \quad (3.1)$$

This method can be seen as a trust-region scheme [1] with a linear model of the objective function. The trust region in method (3.1) is formed by a contraction of the initial feasible set. In Section 6, we will consider a more traditional trust-region method with a quadratic model of the objective.

Note that the point $x_{t+1}$ in method (3.1) is characterized by the following variational principle:

$$x_{t+1} = (1-\tau_t)x_t + \tau_t v_t, \quad v_t \in Q,$$
$$\Psi((1-\tau_t)x_t + \tau_t x) + \tau_t\langle\nabla f(x_t), x - x_t\rangle \ \ge\ \Psi(x_{t+1}) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle, \quad x \in Q. \quad (3.2)$$

Let us choose somehow a sequence of nonnegative weights $\{a_t\}_{t\ge0}$, and define in (3.1) the coefficients $\tau_t$ in accordance with (2.10). Define now the estimating functional sequence $\{\phi_t(x)\}_{t\ge0}$ as follows:

$$\phi_0(x) = a_0\bar f(x), \qquad \phi_{t+1}(x) = \phi_t(x) + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)], \quad t \ge 0. \quad (3.3)$$

Clearly, for all $t \ge 0$ we have

$$\phi_t(x) \le A_t\bar f(x), \quad x \in Q. \quad (3.4)$$

Denote

$$C_{\nu,t} = a_0\Delta(x_0) + \frac{1}{1+\nu}\sum_{k=1}^t \frac{a_k^{1+\nu}}{A_k^\nu} G_\nu D^{1+\nu}, \quad t \ge 0. \quad (3.5)$$

Let us introduce

$$\delta(x) \overset{\mathrm{def}}{=} \max_{y\in Q}\{\langle\nabla f(x), x - y\rangle + \Psi(x) - \Psi(y)\} \overset{(1.4)}{=} \Psi^*_{x,Q}(\nabla f(x)). \quad (3.6)$$

For problem (2.1), this value measures the level of satisfaction of the first-order optimality conditions at the point $x \in Q$. For any $x \in Q$, we have

$$\delta(x) \ge \bar f(x) - \bar f(x^*) \ge 0. \quad (3.7)$$

We call $\delta(x)$ the total variation of the linear model of the composite objective function in problem (2.1) over the feasible set.
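Before turning to the convergence analysis, here is a minimal Python sketch of one step of method (3.1) together with the measure $\delta(x)$ from (3.6), for the illustrative choice $Q = [l, u]$ (a box) and $\Psi(x) = \|x\|_1$, under which the contracted subproblem separates coordinate-wise. The concrete $Q$ and $\Psi$ are our assumptions, not the paper's.

```python
import numpy as np

def min_linear_plus_abs(s, lo, hi):
    """Minimize s*y + |y| over [lo, hi]: a piecewise-linear 1-D problem, so the
    minimum is at an endpoint or at the kink y = 0 (if it lies in the interval)."""
    cands = [lo, hi] + ([0.0] if lo <= 0.0 <= hi else [])
    return min(cands, key=lambda y: s * y + abs(y))

def cgm_type2_step(x, g, l, u, tau):
    """One step of method (3.1): minimize <g, y> + ||y||_1 over the contracted box
    y in (1-tau)*x + tau*[l, u], with g = grad f(x_t)."""
    lo, hi = (1 - tau) * x + tau * l, (1 - tau) * x + tau * u
    return np.array([min_linear_plus_abs(g[i], lo[i], hi[i]) for i in range(len(x))])

def delta(x, g, l, u):
    """Total variation (3.6) of the linear model: taking tau = 1 recovers the full
    feasible set Q, so the same solver yields the maximizer in (3.6)."""
    y = cgm_type2_step(x, g, l, u, tau=1.0)
    return float(g @ (x - y) + np.abs(x).sum() - np.abs(y).sum())
```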

Theorem 2. Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (3.1). Then, for any $\nu \in (0,1]$ and any step $t \ge 0$, we have

$$A_t\bar f(x_t) \le \phi_t(x) + C_{\nu,t}, \quad x \in Q. \quad (3.8)$$

Moreover, for any $t \ge 0$ we have

$$\bar f(x_t) - \bar f(x_{t+1}) \ge \tau_t\delta(x_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu}. \quad (3.9)$$

Proof: Let us prove inequality (3.8). For $t = 0$, we have $C_{\nu,0} = a_0[\bar f(x_0) - \bar f(x^*)]$. Thus, in this case (3.8) follows from (3.4). Assume now that (3.8) is valid for some $t \ge 0$. In view of definition (2.10), the optimality condition (3.2) can be written in the following form:

$$a_{t+1}\langle\nabla f(x_t), x - x_t\rangle \ \ge\ A_{t+1}\left[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle\right], \quad x \in Q.$$

Therefore,

$$\phi_{t+1}(x) + C_{\nu,t} = \phi_t(x) + C_{\nu,t} + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)]$$
$$\overset{(3.2),(3.8)}{\ge} A_t[f(x_t) + \Psi(x_t)] + a_{t+1}[f(x_t) + \Psi(x)] + A_{t+1}\left[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle\right]$$
$$\ge A_{t+1}\left[f(x_t) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1})\right] \overset{(2.4)}{\ge} A_{t+1}\left[\bar f(x_{t+1}) - \frac{G_\nu}{1+\nu}\|x_{t+1} - x_t\|^{1+\nu}\right],$$

where the third inequality uses $A_t\Psi(x_t) + a_{t+1}\Psi(x) \ge A_{t+1}\Psi((1-\tau_t)x_t + \tau_t x)$ (convexity of $\Psi$). It remains to note that $\|x_{t+1} - x_t\| = \tau_t\|x_t - v_t\| \overset{(2.10)}{\le} \frac{a_{t+1}}{A_{t+1}}D$. Thus, we can take

$$C_{\nu,t+1} = C_{\nu,t} + \frac{a_{t+1}^{1+\nu}}{(1+\nu)A_{t+1}^\nu} G_\nu D^{1+\nu}.$$

In order to prove inequality (3.9), let us introduce the values

$$\delta_t(\tau) \overset{\mathrm{def}}{=} \max_{x\in Q}\{\langle\nabla f(x_t), x_t - y\rangle + \Psi(x_t) - \Psi(y) : y = (1-\tau)x_t + \tau x\} \overset{(1.5)}{=} \Psi^*_{\tau,x_t,Q}(\nabla f(x_t)), \quad \tau \in [0,1].$$

Clearly,

$$-\delta_t(\tau_t) = \min_{x\in Q}\{\langle\nabla f(x_t), y - x_t\rangle + \Psi(y) - \Psi(x_t) : y = (1-\tau_t)x_t + \tau_t x\} = \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1}) - \Psi(x_t)$$
$$\overset{(2.4)}{\ge} \bar f(x_{t+1}) - \bar f(x_t) - \frac{G_\nu}{1+\nu}\|x_{t+1} - x_t\|^{1+\nu}.$$

Since $\|x_{t+1} - x_t\| \le \tau_t D$, we conclude that

$$\bar f(x_t) - \bar f(x_{t+1}) \ge \delta_t(\tau_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu} \overset{(1.6)}{\ge} \tau_t\delta(x_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu}. \qquad \square$$

In view of (3.4), inequality (3.8) results in the following rate of convergence:

$$\bar f(x_t) - \bar f(x^*) \le \frac{1}{A_t} C_{\nu,t}, \quad t \ge 0. \quad (3.10)$$

For the linearly growing weights $a_t = t$, $A_t = \frac{t(t+1)}{2}$, $t \ge 0$, we have already seen that

$$C_{\nu,t} = \frac{1}{1+\nu} B_{\nu,t} \le \frac{2^\nu}{(1+\nu)(2-\nu)}\left[(t+\tfrac12)^{2-\nu} - (\tfrac12)^{2-\nu}\right] G_\nu D^{1+\nu}.$$

In the case $\nu = 1$, this results in the following rate of convergence:

$$\bar f(x_t) - \bar f(x^*) \le \frac{2}{t+1} G_1 D^2, \quad t \ge 1. \quad (3.11)$$

Let us justify, for this case, the rate of convergence of the sequence $\{\delta(x_t)\}_{t\ge1}$. We have $\tau_t \overset{(2.10)}{=} \frac{2}{t+2}$. On the other hand, for any $T \ge t \ge 1$,

$$\frac{2G_1D^2}{t+1} \overset{(3.11)}{\ge} \bar f(x_t) - \bar f(x^*) \ge \sum_{k=t}^T[\bar f(x_k) - \bar f(x_{k+1})] \overset{(3.9)}{\ge} \sum_{k=t}^T\left[\tau_k\delta(x_k) - G_1D^2\tau_k^2\right]. \quad (3.12)$$

Denote $\delta^*_T = \min_{0\le t\le T}\delta(x_t)$. Then, choosing $T = 2t+1$, we get

$$2\ln2\cdot\delta^*_T \overset{(1.2)}{\le} \left(\sum_{k=t}^T\frac{2}{k+2}\right)\delta^*_T \le \sum_{k=t}^T\tau_k\delta(x_k) \overset{(3.12)}{\le} \frac{2G_1D^2}{t+1} + 4G_1D^2\sum_{k=t}^T\frac{1}{(k+2)^2} \overset{(1.3)}{\le} G_1D^2\left[\frac{4}{T+1} + \frac{8}{T+2}\right] \le \frac{12\,G_1D^2}{T+1}.$$

Thus, in the case $\nu = 1$, for odd $T$, we get the following bound:

$$\delta^*_T \le \frac{6}{\ln2}\cdot\frac{G_1D^2}{T+1}. \quad (3.13)$$

4 Computing the primal-dual solution

Note that both methods (2.9) and (3.1) admit computable accuracy certificates.(2) For the first method, denote

$$\ell_t = \frac{1}{A_t}\min_{x\in Q}\left\{\sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)]\right\}.$$

(2) It seems that this line of arguments was used for the first time in Section 7.5 of [8].

Clearly,

$$\bar f(x_t) - \bar f(x^*) \le \bar f(x_t) - \ell_t \overset{(2.13)}{\le} \frac{1}{A_t} B_{\nu,t}. \quad (4.1)$$

For the second method, let us choose $a_0 = 0$. Then the estimating functions are linear:

$$\phi_t(x) = \sum_{k=1}^t a_k[f(x_{k-1}) + \langle\nabla f(x_{k-1}), x - x_{k-1}\rangle + \Psi(x)].$$

Therefore, defining $\hat\ell_t = \frac{1}{A_t}\min_{x\in Q}\phi_t(x)$, we also have

$$\bar f(x_t) - \bar f(x^*) \le \bar f(x_t) - \hat\ell_t \overset{(3.8)}{\le} \frac{1}{A_t} C_{\nu,t}, \quad t \ge 1. \quad (4.2)$$

Accuracy certificates (4.1) and (4.2) show that both methods (2.9) and (3.1) are able to recover some information on the optimal dual solution. However, in order to implement this ability, we need to open the Black Box and introduce an explicit model of the function $f(x)$. Let us assume that $f$ is representable in the following form:

$$f(x) = \max_u\{\langle Ax, u\rangle - g(u) : u \in Q_d\}, \quad (4.3)$$

where $A : E \to E_1^*$, $Q_d$ is a closed convex set in a finite-dimensional linear space $E_1$, and the function $g(\cdot)$ is $p$-uniformly convex on $Q_d$:

$$\langle g'(u_1) - g'(u_2), u_1 - u_2\rangle \ge \sigma_g\|u_1 - u_2\|^p, \quad u_1, u_2 \in Q_d,$$

where the convexity degree is $p \ge 2$. It is well known (e.g. [15]) that in this case the gradients of $f$ satisfy the Hölder condition (2.3) with $\nu = \frac{1}{p-1}$ and $G_\nu = \|A\|^{1+\nu}\left(\frac{1}{\sigma_g}\right)^\nu$. In view of Danskin's theorem, $\nabla f(x) = A^* u(x)$, where $u(x) \in Q_d$ is the unique optimal solution of the optimization problem in representation (4.3). Let us write down the dual problem to (2.1):

$$\min_{x\in Q}\bar f(x) \overset{(4.3)}{=} \min_x\left\{\Psi(x) + \max_{u\in Q_d}\{\langle Ax, u\rangle - g(u)\}\right\} \ge \max_{u\in Q_d}\left\{-g(u) + \min_x\{\langle A^*u, x\rangle + \Psi(x)\}\right\}.$$

Thus, defining $\Phi(u) = \min_x\{\langle A^*u, x\rangle + \Psi(x)\}$, we get the following dual problem:

$$\max_{u\in Q_d}\left\{\bar g(u) \overset{\mathrm{def}}{=} -g(u) + \Phi(u)\right\}. \quad (4.4)$$

In this problem, the objective function is nonsmooth and uniformly strongly concave of degree $p$. Clearly, we have

$$\bar f(x) - \bar g(u) \ge 0, \quad x \in Q, \ u \in Q_d. \quad (4.5)$$

Let us show that both methods (2.9) and (3.1) are able to approximate the optimal solution of the dual problem (4.4).
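For illustration, the certificate $\bar u_t$ constructed in the remainder of this section can be sketched as follows for the hypothetical model $f(x) = \tfrac12\|Ax - b\|^2 = \max_u\{\langle Ax, u\rangle - \tfrac12\|u\|^2 - \langle b, u\rangle\}$ (so that $u(x) = Ax - b$ by Danskin's theorem), with $\Psi$ the indicator of the standard simplex (so $\Phi(u) = \min_i (A^T u)_i$). All concrete choices here are ours.

```python
import numpy as np

def duality_gap(A, b, model_points, weights, x_t):
    """Primal-dual gap certificate of Section 4: u_bar = (1/A_t) * sum a_k u(x_k),
    dual value from (4.4), primal value at the current iterate x_t."""
    a = np.asarray(weights, dtype=float)
    us = [A @ x - b for x in model_points]                    # u(x_k) = A x_k - b
    u_bar = sum(ak * uk for ak, uk in zip(a, us)) / a.sum()
    g_bar = -(0.5 * u_bar @ u_bar + b @ u_bar) + np.min(A.T @ u_bar)  # dual (4.4)
    f_bar = 0.5 * np.linalg.norm(A @ x_t - b) ** 2                    # primal value
    return f_bar - g_bar      # nonnegative by (4.5); bounded as in (4.6)/(4.7)
```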

Note that for any $\bar x \in Q$ we have

$$f(\bar x) + \langle\nabla f(\bar x), x - \bar x\rangle \overset{(4.3)}{=} \langle A\bar x, u(\bar x)\rangle - g(u(\bar x)) + \langle A^*u(\bar x), x - \bar x\rangle = \langle Ax, u(\bar x)\rangle - g(u(\bar x)).$$

Therefore, denoting for the first method (2.9) $\bar u_t = \frac{1}{A_t}\sum_{k=0}^t a_k u(x_k)$, we obtain

$$\ell_t = \min_{x\in Q}\left\{\Psi(x) + \frac{1}{A_t}\sum_{k=0}^t a_k[\langle Ax, u(x_k)\rangle - g(u(x_k))]\right\} = \Phi(\bar u_t) - \frac{1}{A_t}\sum_{k=0}^t a_k g(u(x_k)) \le \bar g(\bar u_t),$$

where the last inequality follows from the convexity of $g$. Thus, we get

$$0 \overset{(4.5)}{\le} \bar f(x_t) - \bar g(\bar u_t) \le \bar f(x_t) - \ell_t \overset{(4.1)}{\le} \frac{1}{A_t} B_{\nu,t}, \quad t \ge 0. \quad (4.6)$$

For the second method (3.1), we need to choose $a_0 = 0$ and take $\bar u_t = \frac{1}{A_t}\sum_{k=1}^t a_k u(x_{k-1})$. In this case, by similar reasoning, we get

$$0 \overset{(4.5)}{\le} \bar f(x_t) - \bar g(\bar u_t) \le \bar f(x_t) - \hat\ell_t \overset{(4.2)}{\le} \frac{1}{A_t} C_{\nu,t}, \quad t \ge 1. \quad (4.7)$$

5 Strong convexity of the function Ψ

In this section, we assume that the function $\Psi$ in problem (2.1) is strongly convex (see, for example, Section 2.1.3 in [12]). This means that there exists a positive constant $\sigma_\Psi$ such that

$$\Psi(\tau x + (1-\tau)y) \le \tau\Psi(x) + (1-\tau)\Psi(y) - \frac{\sigma_\Psi}{2}\tau(1-\tau)\|x - y\|^2 \quad (5.1)$$

for all $x, y \in Q$ and $\tau \in [0,1]$. Let us show that in this case CG-methods converge much faster. We demonstrate this for method (2.9).

In view of the strong convexity of $\Psi$, the variational principle (2.8) characterizing the point $v_t$ in method (2.9) can be strengthened:

$$\Psi(x) + \langle\nabla f(x_t), x - v_t\rangle \ge \Psi(v_t) + \frac{\sigma_\Psi}{2}\|x - v_t\|^2, \quad x \in Q. \quad (5.2)$$

Let $V_0$ be defined as in (2.11). Denote

$$\hat B_{\nu,t} = a_0 V_0 + \sum_{k=1}^t \frac{a_k^{1+2\nu}}{2\sigma_\Psi A_k^{2\nu}} G_\nu^2 D^{2\nu}, \quad t \ge 0. \quad (5.3)$$

Theorem 3. Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (2.9), and let the function $\Psi$ be strongly convex. Then, for any $\nu \in (0,1]$, any step $t \ge 0$, and any $x \in Q$ we have

$$A_t(f(x_t) + \Psi(x_t)) \le \sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + \hat B_{\nu,t}. \quad (5.4)$$

Proof: The beginning of the proof is very similar to that of Theorem 1. Assuming that (5.4) is valid for some $t \ge 0$, we get the inequality

$$\sum_{k=0}^{t+1} a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + \hat B_{\nu,t} \ge A_{t+1}(f(x_{t+1}) + \Psi(x_{t+1})) + a_{t+1}[\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle].$$

Further,

$$\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle \overset{(5.2)}{\ge} \langle\nabla f(x_{t+1}) - \nabla f(x_t), x - v_t\rangle + \frac{\sigma_\Psi}{2}\|x - v_t\|^2$$
$$\ge -\frac{1}{2\sigma_\Psi}\|\nabla f(x_{t+1}) - \nabla f(x_t)\|_*^2 \overset{(2.3)}{\ge} -\frac{1}{2\sigma_\Psi}\left(\frac{a_{t+1}}{A_{t+1}}\right)^{2\nu} G_\nu^2 D^{2\nu}.$$

Thus, for keeping (5.4) valid for the next iteration, it is enough to choose

$$\hat B_{\nu,t+1} = \hat B_{\nu,t} + \frac{a_{t+1}^{1+2\nu}}{2\sigma_\Psi A_{t+1}^{2\nu}} G_\nu^2 D^{2\nu}. \qquad \square$$

It can easily be checked that in our situation the linear weights strategy $a_t \equiv t$ is not the best one. Let us choose $a_t = t^2$, $t \ge 0$. Then $A_t = \frac{t(t+1)(2t+1)}{6}$, and we get

$$\hat B_{\nu,t} \overset{(5.3)}{=} \frac{G_\nu^2 D^{2\nu}}{2\sigma_\Psi}\sum_{k=1}^t \frac{6^{2\nu} k^{2(1+2\nu)}}{k^{2\nu}(k+1)^{2\nu}(2k+1)^{2\nu}} \le \frac{3^{2\nu} G_\nu^2 D^{2\nu}}{2\sigma_\Psi}\sum_{k=1}^t k^{2-2\nu} \overset{(1.1)}{\le} \frac{3^{2\nu}}{3-2\nu}\left[(t+\tfrac12)^{3-2\nu} - (\tfrac12)^{3-2\nu}\right]\frac{G_\nu^2 D^{2\nu}}{2\sigma_\Psi}.$$

Thus, for $\nu \in (0,1)$, we get $\frac{1}{A_t}\hat B_{\nu,t} \le O(t^{-2\nu})$. For $\nu = 1$, we obtain

$$\bar f(x_t) - \bar f(x^*) \le \frac{54}{(t+1)(2t+1)}\cdot\frac{G_1^2 D^2}{\sigma_\Psi}, \quad (5.5)$$

which is much better than (2.16).

6 Second-order trust-region method

Let us assume now that in problem (2.1) the function $f$ is twice continuously differentiable. Then we can apply to this problem the following method.

Contracting Trust-Region Method

1. Choose an arbitrary point $x_0 \in Q$.
2. For $t \ge 0$ iterate: define a coefficient $\tau_t \in (0,1]$ and choose

$$x_{t+1} \in \operatorname{Arg}\min_y\left\{\langle\nabla f(x_t), y - x_t\rangle + \tfrac12\langle\nabla^2 f(x_t)(y - x_t), y - x_t\rangle + \Psi(y) : y = (1-\tau_t)x_t + \tau_t x, \ x \in Q\right\}. \quad (6.1)$$

Note that this scheme is well defined even if the Hessian of the function $f$ is only positive semidefinite. Of course, in general, the computational cost of each iteration of this scheme can be big. However, in one important case, when $\Psi(x)$ is the indicator function of a Euclidean ball, the complexity of each iteration is dominated by the complexity of a matrix inversion. Thus, method (6.1) can easily be applied to problems of the form

$$\min_x\{f(x) : \|x - x_0\| \le r\}, \quad (6.2)$$

where the norm is Euclidean.

Let $H_\nu < +\infty$ for some $\nu \in (0,1]$. In this section we also assume that

$$\langle\nabla^2 f(x)h, h\rangle \le L\|h\|^2, \quad x \in Q, \ h \in E. \quad (6.3)$$

Let us choose a sequence of nonnegative weights $\{a_t\}_{t\ge0}$, and define in (6.1) the coefficients $\{\tau_t\}_{t\ge0}$ in accordance with (2.10). Define the estimating functional sequence $\{\phi_t(x)\}_{t\ge0}$ by the recurrent relations (3.3), where the sequence $\{x_t\}_{t\ge0}$ is generated by method (6.1). Finally, denote

$$\hat C_{\nu,t} = a_0\Delta(x_0) + \sum_{k=1}^t \frac{a_k^{2+\nu}}{A_k^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} + \sum_{k=1}^t \frac{a_k^2}{A_k}\,L D^2. \quad (6.4)$$

In our convergence results, we also estimate the level of satisfaction of the second-order optimality conditions for problem (2.1) at the current test points. Let us introduce

$$\theta(x) \overset{\mathrm{def}}{=} \max_{y\in Q}\left\{\langle\nabla f(x), x - y\rangle - \tfrac12\langle\nabla^2 f(x)(y - x), y - x\rangle + \Psi(x) - \Psi(y)\right\}. \quad (6.5)$$

For any $x \in Q$ we have $\theta(x) \ge 0$. We call $\theta(x)$ the total variation of the quadratic model of the composite objective function in problem (2.1) over the feasible set. Denoting $F_x(y) = \tfrac12\langle\nabla^2 f(x)(y - x), y - x\rangle + \Psi(y)$, we get $\theta(x) = (F_x)^*_{x,Q}(\nabla f(x))$ (see definition (1.4)).

Theorem 4. Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (6.1). Then, for any $\nu \in [0,1]$ and any step $t \ge 0$ we have

$$A_t\bar f(x_t) \le \phi_t(x) + \hat C_{\nu,t}, \quad x \in Q. \quad (6.6)$$

Moreover, for any $t \ge 0$ we have

$$\bar f(x_t) - \bar f(x_{t+1}) \ge \tau_t\theta(x_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu}. \quad (6.7)$$

Proof: Let us prove inequality (6.6). For $t = 0$, $\hat C_{\nu,0} = a_0[\bar f(x_0) - \bar f(x^*)]$, and therefore this inequality is valid. Note that the point $x_{t+1}$ is characterized by the following variational principle:

$$x_{t+1} = (1-\tau_t)x_t + \tau_t v_t, \quad v_t \in Q,$$
$$\Psi(y) + \langle\nabla f(x_t) + \nabla^2 f(x_t)(x_{t+1} - x_t), y - x_{t+1}\rangle \ge \Psi(x_{t+1}), \quad y = (1-\tau_t)x_t + \tau_t x, \ x \in Q.$$

Therefore, in view of definition (2.10), for any $x \in Q$ we have

$$a_{t+1}\langle\nabla f(x_t), x - x_t\rangle \ge A_{t+1}\langle\nabla f(x_t) + \nabla^2 f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + a_{t+1}\langle\nabla^2 f(x_t)(x_{t+1} - x_t), x_t - x\rangle$$
$$+\, A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)]$$
$$\overset{(6.3)}{\ge} A_{t+1}\langle\nabla f(x_t) + \nabla^2 f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)] - \frac{a_{t+1}^2}{A_{t+1}} L D^2.$$

Hence,

$$A_t\bar f(x_t) + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)]$$
$$\ge A_t\Psi(x_t) + a_{t+1}\Psi(x) + A_{t+1}\left[f(x_t) + \langle\nabla f(x_t) + \nabla^2 f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)\right] - \frac{a_{t+1}^2}{A_{t+1}} L D^2$$
$$\overset{(2.6)}{\ge} A_{t+1}[f(x_{t+1}) + \Psi(x_{t+1})] - A_{t+1}\frac{H_\nu\|x_{t+1} - x_t\|^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}} L D^2 \ \ge\ A_{t+1}\bar f(x_{t+1}) - \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}} L D^2,$$

where we used the convexity of $\Psi$ (so that $A_t\Psi(x_t) + a_{t+1}\Psi(x) \ge A_{t+1}\Psi((1-\tau_t)x_t + \tau_t x)$), the positive semidefiniteness of the Hessian (so that $\langle\nabla^2 f(x_t)d, d\rangle \ge \tfrac12\langle\nabla^2 f(x_t)d, d\rangle$ for $d = x_{t+1} - x_t$), and $\|x_{t+1} - x_t\| \le \frac{a_{t+1}}{A_{t+1}}D$. Thus, if (6.6) is valid for some $t \ge 0$, then

$$\phi_{t+1}(x) + \hat C_{\nu,t} \ge A_t\bar f(x_t) + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)] \ge A_{t+1}\bar f(x_{t+1}) - \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}} L D^2.$$

Therefore, we can take $\hat C_{\nu,t+1} = \hat C_{\nu,t} + \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} + \frac{a_{t+1}^2}{A_{t+1}} L D^2$.

In order to justify inequality (6.7), let us introduce the values

$$\theta_t(\tau) \overset{\mathrm{def}}{=} \max_{x\in Q}\left\{\langle\nabla f(x_t), x_t - y\rangle - \tfrac12\langle\nabla^2 f(x_t)(y - x_t), y - x_t\rangle + \Psi(x_t) - \Psi(y) : y = (1-\tau)x_t + \tau x\right\} \overset{(1.5)}{=} (F_{x_t})^*_{\tau,x_t,Q}(\nabla f(x_t)), \quad \tau \in [0,1].$$

Clearly,

$$-\theta_t(\tau_t) = \min_{x\in Q}\left\{\langle\nabla f(x_t), y - x_t\rangle + \tfrac12\langle\nabla^2 f(x_t)(y - x_t), y - x_t\rangle + \Psi(y) - \Psi(x_t) : y = (1-\tau_t)x_t + \tau_t x\right\}$$
$$= \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \tfrac12\langle\nabla^2 f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1}) - \Psi(x_t) \overset{(2.6)}{\ge} \bar f(x_{t+1}) - \bar f(x_t) - \frac{H_\nu\|x_{t+1} - x_t\|^{2+\nu}}{(1+\nu)(2+\nu)}.$$

Since $\|x_{t+1} - x_t\| \le \tau_t D$, we conclude that

$$\bar f(x_t) - \bar f(x_{t+1}) \ge \theta_t(\tau_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu} \overset{(1.6)}{\ge} \tau_t\theta(x_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu}. \qquad \square$$

In order to justify inequality (6.7, let us introduce the values θ t (τ def = max x Q { f(x t, x t y f(x t (y x t, y x t (.5 = +Ψ(x t Ψ(y : y = ( τx t + τx} ( F xt τ,x t,q ( f(x t, τ [0, ]. Clearly, θ t (τ t = min { f(x t, y x t x Q f(x t (y x t, y x t +Ψ(y Ψ(x t : y = ( τ t x t + τ t x} = f(x t, x t+ x t f(x t (x t+ x t, x t+ x t + Ψ(x t+ Ψ(x t (.6 f(x t+ f(x t H ν (+ν(+ν x t+ x t +ν. Since x t+ x t τ t D, we conclude that f(x t f(x t+ θ t (τ t H νd +ν (+ν(+ν τ t +ν (.6 τ t θ(x t H νd +ν (+ν(+ν τ +ν t. Thus, inequality (6.6 ensures the following rate of convergence of method (6. f(x t f(x A t Ĉ ν,t. (6.8 A particular expression of the right-hand side of this inequality for different values of ν [0, ] can be obtained exactly in the same way as it was done in Section. Here, we restrict ourselves only by the case ν = and a t = t, t 0. Then A t = t(t+(t+ 6, and t a 3 k A k = t 36k 6 k (k+ (k+ 8t, Thus, we get t a k A k = t f(x t f(x 3k 4 k(k+(k+ 3 t k = 3 4t(t +. 8H D 3 (t+(t+ + 9LD (t+. (6.9 Note that the rate of convergence (6.9 is worse than the convergence rate of cubic regularization of the Newton method [7]. However, to the best of our knowledge, inequality (6.9 gives us the first global rate of convergence of an optimization scheme belonging to the family of trust-region methods []. In view of inequality (6.6, the optimal solution of the dual problem (4.4 can be approximated by method (6. with a 0 = 0 in the same way as it was suggested in Section 4 for Conditional Gradient Methods. 7

Let us estimate now the rate of decrease of the values θ(x t, t 0, in the case ν =. (.0 Note that τ t = a t+ A t+ = 6(t+ (t+(t+3. It is easy to see that these coefficients satisfy the following inequalities: 3 t+3 τ t 6 t+5, t 0. (6.0 Therefore, choosing the total number of steps T = t +, we have T T τk 3 (6.0 τ k 3 t+ k+3 (. 3 ln, (6.0 t+ (.3 7 7 t+5/ (k+5/ 3 (k+5/ t / = 7 [ ] 4 = (T + (T +3 = 7 [ ] (t+ (t+5 7(3T +8(T +4 (T + (T +3 8 (T +(T +. (6. Now we can use the same trick as in the end of Section 3. Denote θ T = min 0tT θ(x t. Then 36H D 3 T (T + 9LD (T (6.9 f(x t f(x T ( f(x k f(x k+ (6.7 T θt τ k H D 3 6 Thus, for even T, we get the following bound: References θ T 3 ln T τk 3 [ 4H D 3 T (T + 3H D 3 4(T +(T + + LD (T [ ] 3 5H D 3 ln T (T + LD (T. (6. 3θ T ln 7H D 3 4(T +(T +. ] (6. [] Conn A.B., Gould N.I.M., Toint Ph.L., Trust Region Methods, SIAM, Philadelphia, 000. [] J. Dunn. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization, 8(5: 473-487, (980. [3] M. Frank, P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3: 49-54 (956. [4] R.M. Freund, P. Grigas. New analysis and results for the FrankWolfe method. Mathematical Programming, DOI 0.007/s007-04-084-6, (04. [5] D. Garber, E. Hazan. A linearly convergent conditional gradient algorithm with application to online and stochastic optimization. arxiv: 30.4666v5, (03. [6] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, DOI 0.007/s007-04-0778-9, (04. 8

[7] M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia (2013).

[8] A. Juditsky, A. Nemirovski. Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles. Mathematical Programming, Ser. A, DOI 10.1007/s10107-015-0876-3.

[9] S. Lacoste-Julien, M. Jaggi, M. Schmidt, P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia (2013).

[10] G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2 (2014).

[11] A. Migdalas. A regularization of the Frank-Wolfe method and unification of certain nonlinear programming methods. Mathematical Programming, 65, 331-345 (1994).

[12] Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.

[13] Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 221-259 (2009).

[14] Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125-161 (2013).

[15] Yu. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, DOI 10.1007/s10107-014-0790-0 (2014).

[16] Yu. Nesterov, A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.

[17] Yu. Nesterov, B. Polyak. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1), 177-205 (2006).

[18] Yu. Nesterov, V. Shikhman. Quasi-monotone subgradient methods for nonsmooth convex minimization. JOTA, DOI 10.1007/s10957-014-0677-5 (2014).